This cheat sheet provides a comprehensive guide to data wrangling using Dask, covering key operations such as importing Dask objects, basic DataFrame operations, aggregation, merging, time series operations, and handling missing data. It also includes advanced features like custom functions, parallel computing, and data quality validation. Each section includes code snippets for practical implementation.
● Create bag from list: bag = db.from_sequence([1, 2, 3, 4, 5])
● Map function over bag: result = bag.map(lambda x: x * 2)
● Filter bag: filtered = bag.filter(lambda x: x > 2)
● Flatten bag of lists: flattened = bag.flatten()
● Reduce bag: sum_result = bag.sum()
● Group by key: grouped = bag.groupby(lambda x: x % 2)
● Count items: count = bag.count()
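A minimal sketch chaining several of the Bag operations above; the numbers are made up for illustration:

import dask.bag as db

bag = db.from_sequence(range(1, 11), npartitions=2)   # small example bag
doubled = bag.map(lambda x: x * 2)                     # lazy element-wise map
large = doubled.filter(lambda x: x > 10)               # keep values greater than 10
total = large.sum().compute()                          # trigger computation
print(total)                                           # 12 + 14 + 16 + 18 + 20 = 80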
12. Computation and Persistence
● Compute result: result = df.column.sum().compute()
● Persist DataFrame in memory: df = df.persist()
● Visualize task graph: df.visualize(filename='graph.svg')
● Show progress bar during computation: from dask.diagnostics import ProgressBar; with ProgressBar(): result = df.compute()
● Write DataFrame to CSV: df.to_csv('output/*.csv')
● Write DataFrame to Parquet: df.to_parquet('output/data.parquet')
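Putting these bullets together, a minimal sketch, assuming a folder of CSV files at 'data/*.csv' with an 'amount' column (both placeholders):

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.read_csv('data/*.csv')           # lazy DataFrame over many CSV files
df = df.persist()                        # materialize partitions in memory

with ProgressBar():                      # show progress while computing
    total = df['amount'].sum().compute()

df.to_parquet('output/data.parquet')     # write results back out (needs pyarrow or fastparquet)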
13. Parallel and Distributed Computing
● Set number of workers: from dask.distributed import Client; client = Client(n_workers=4)
● Submit function to cluster: future = client.submit(func, *args)
● Map function across cluster: futures = client.map(func, sequence)
● Gather results: results = client.gather(futures)
● Scale cluster: client.cluster.scale(10)  # scale to 10 workers
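A small sketch of the distributed workflow above on a local cluster, assuming the distributed package is installed; square is a hypothetical example function:

from dask.distributed import Client

def square(x):
    return x ** 2

if __name__ == '__main__':                       # guard needed when workers are spawned as processes
    client = Client(n_workers=4)                 # start a local cluster
    futures = client.map(square, range(10))      # run square on the workers
    results = client.gather(futures)             # collect results -> [0, 1, 4, 9, ...]
    client.close()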
14. Dask-ML Operations
● Import Dask-ML: import dask_ml.preprocessing as dmp
● Scale features: scaler = dmp.StandardScaler(); scaled = scaler.fit_transform(df)
● Train-test split: from dask_ml.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y)
● Linear regression: from dask_ml.linear_model import LinearRegression; lr = LinearRegression(); lr.fit(X, y)
● Logistic regression: from dask_ml.linear_model import LogisticRegression; lr = LogisticRegression(); lr.fit(X, y)
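A minimal end-to-end sketch of the Dask-ML calls above, assuming the dask-ml package is installed; the random feature and target arrays are synthetic stand-ins for real data:

import dask.array as da
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression

X = da.random.random((1000, 3), chunks=(100, 3))      # synthetic features
y = da.random.random(1000, chunks=100)                # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y)
lr = LinearRegression()
lr.fit(X_train, y_train)                               # fit on Dask arrays
preds = lr.predict(X_test)                             # lazy predictions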
15. Advanced Dask Features
● Use custom scheduler: from dask.distributed import Client; client = Client('scheduler-address:8786')
● Create delayed function: from dask import delayed; func = delayed(lambda x: x * 2)
● Compute delayed function: result = func(10).compute()
● Create list of delayed tasks: dask_list = [func(i) for i in range(10)]
● Optimize Dask collections (fuses the underlying graphs): import dask; optimized = dask.optimize(*dask_list)
● Use callback hooks during computation: from dask.callbacks import Callback; with Callback(): result = df.compute()  # subclass Callback to add custom hooks
● Profile Dask computation: from dask.diagnostics import ResourceProfiler; with ResourceProfiler() as rprof: result = df.compute()
● Visualize resource usage: rprof.visualize()
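A short sketch tying the delayed and profiling bullets together; the process function is a hypothetical stand-in for real work:

import dask
from dask import delayed
from dask.diagnostics import ResourceProfiler

@delayed
def process(x):
    return x * 2

tasks = [process(i) for i in range(10)]      # build a list of lazy tasks

with ResourceProfiler() as rprof:            # record CPU/memory usage during the run
    results = dask.compute(*tasks)           # (0, 2, 4, ..., 18) computed in parallel

# rprof.visualize()  # optional: render resource usage (requires bokeh)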