
# Data Wrangling with Dask (CheatSheet)

1. Importing and Creating Dask Objects

● Import Dask DataFrame: import dask.dataframe as dd
● Import Dask Array: import dask.array as da
● Import Dask Bag: import dask.bag as db
● Create Dask DataFrame from CSV: df = dd.read_csv('data/*.csv')
● Create Dask DataFrame from Parquet: df = dd.read_parquet('data/*.parquet')
● Create Dask Array: arr = da.random.random((10000, 10000), chunks=(1000, 1000))
● Create Dask Bag from list: bag = db.from_sequence([1, 2, 3, 4, 5])
● Create Dask DataFrame from Pandas: ddf = dd.from_pandas(pdf, npartitions=4)
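
A minimal end-to-end sketch of the creation pattern above, assuming a small in-memory pandas DataFrame with made-up columns; Dask stays lazy until `.compute()` is called:

```python
import pandas as pd
import dask.dataframe as dd

# Small pandas DataFrame wrapped in a lazy Dask DataFrame (2 partitions).
pdf = pd.DataFrame({"category": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf.npartitions)                 # 2
print(ddf["value"].sum().compute())    # 10 -- computation happens here
```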

2. Basic Dask DataFrame Operations

● Show DataFrame info: df.info()
● Get column names: df.columns
● Get data types: df.dtypes
● Select a column: df['column_name']
● Select multiple columns: df[['column1', 'column2']]
● Rename columns: df = df.rename(columns={'old_name': 'new_name'})
● Add new column: df['new_column'] = df['column1'] + df['column2']
● Drop column: df = df.drop('column_name', axis=1)
● Filter rows: df[df['column'] > 5]
● Sort values: df = df.sort_values('column')
● Reset index: df = df.reset_index()
● Set index: df = df.set_index('column')

3. Aggregation and Grouping

● Compute column sum: df['column'].sum().compute()
● Compute column mean: df['column'].mean().compute()
● Compute column median: df['column'].quantile(0.5).compute()
● Group by and aggregate: df.groupby('category').agg({'value': 'sum'}).compute()
● Group by multiple columns: df.groupby(['cat1', 'cat2']).agg({'value': ['sum', 'mean']}).compute()
● Count unique values: df['column'].nunique().compute()
● Value counts: df['column'].value_counts().compute()
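
As an illustration of the groupby pattern, here is a hedged sketch with placeholder column names ('category', 'value'); the aggregation stays lazy until the final `.compute()`:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"category": ["x", "x", "y"], "value": [10, 20, 5]})
df = dd.from_pandas(pdf, npartitions=2)

# Multiple aggregations per group in a single lazy expression.
summary = df.groupby("category").agg({"value": ["sum", "mean"]})
print(summary.compute())
```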

4. Merging and Joining

● Merge DataFrames: merged = dd.merge(df1, df2, on='key')
● Left join: left_join = dd.merge(df1, df2, on='key', how='left')
● Right join: right_join = dd.merge(df1, df2, on='key', how='right')
● Outer join: outer_join = dd.merge(df1, df2, on='key', how='outer')
● Concatenate DataFrames vertically: concat_df = dd.concat([df1, df2])
● Concatenate DataFrames horizontally: concat_df = dd.concat([df1, df2], axis=1)

5. Time Series Operations

● Convert to datetime: df['date'] = dd.to_datetime(df['date'])
● Set datetime index: df = df.set_index('date')
● Resample to monthly frequency: monthly = df.resample('M').mean()
● Rolling window calculation: df['rolling_mean'] = df['value'].rolling(window=7).mean()
● Shift data: df['previous_day'] = df['value'].shift(1)
● Time series difference: df['diff'] = df['value'].diff()
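
A small sketch of the time series workflow above, assuming synthetic daily data; resampling and rolling windows in Dask generally want a sorted datetime index:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "value": range(60),
})
df = dd.from_pandas(pdf, npartitions=3).set_index("date")

monthly = df.resample("M").mean()                          # monthly means (lazy)
df["rolling_mean"] = df["value"].rolling(window=7).mean()  # 7-day rolling mean
print(monthly.compute())
```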

6. String Operations

● Convert to lowercase: df['text'] = df['text'].str.lower()
● Convert to uppercase: df['text'] = df['text'].str.upper()
● Strip whitespace: df['text'] = df['text'].str.strip()
● Replace substring: df['text'] = df['text'].str.replace('old', 'new')
● Extract substring: df['substring'] = df['text'].str[0:5]
● String contains: mask = df['text'].str.contains('pattern')
● String split: df['split'] = df['text'].str.split(',')

7. Missing Data Handling

● Check for missing values: df.isnull().sum().compute()
● Drop rows with missing values: df = df.dropna()
● Fill missing values with a constant: df = df.fillna(0)

● Forward-fill missing values: df = df.ffill()
● Interpolate missing values: df = df.interpolate()
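
A quick sketch of the missing-data helpers, assuming a single float column with NaNs (names are illustrative):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})
df = dd.from_pandas(pdf, npartitions=2)

print(df.isnull().sum().compute())   # missing-value count per column
filled = df.ffill()                  # forward fill (modern spelling of fillna(method='ffill'))
dropped = df.dropna()                # drop rows containing NaN
print(filled.compute())
```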

8. Data Type Conversion

● Convert to integer: df['column'] = df['column'].astype(int)
● Convert to float: df['column'] = df['column'].astype(float)
● Convert to string: df['column'] = df['column'].astype(str)
● Convert to category: df['column'] = df['column'].astype('category')

9. Advanced Operations

● Apply custom function: df['new_col'] = df['col'].apply(lambda x: x * 2)
● Map values: df['mapped'] = df['col'].map({1: 'A', 2: 'B', 3: 'C'})
● One-hot encoding: df = dd.get_dummies(df, columns=['category'])
● Binning: df['binned'] = df['value'].map_partitions(pd.cut, bins=[0, 25, 50, 75, 100])
● Calculate percentiles: df['percentile'] = df['value'].rank(pct=True)
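
Two of the items above usually need a little extra care in Dask: element-wise apply prefers an explicit meta hint, and binning can be done per partition with pandas' cut. A hedged sketch with made-up data:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"col": [1, 2, 3, 4], "value": [5, 55, 80, 20]})
df = dd.from_pandas(pdf, npartitions=2)

# meta tells Dask the output name/dtype so it does not have to guess.
df["new_col"] = df["col"].apply(lambda x: x * 2, meta=("new_col", "int64"))

# pd.cut applied partition-by-partition; the bin edges are illustrative.
df["binned"] = df["value"].map_partitions(pd.cut, bins=[0, 25, 50, 75, 100])
print(df.compute())
```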

10. Dask Array Operations

● Create array from numpy: darr = da.from_array(np.array([1, 2, 3]), chunks=2)
● Array shape: darr.shape
● Array mean: darr.mean().compute()
● Array sum: darr.sum().compute()
● Element-wise operations: result = da.sin(darr)
● Matrix multiplication: result = da.matmul(arr1, arr2)
● Concatenate arrays: combined = da.concatenate([arr1, arr2], axis=0)
● Reshape array: reshaped = darr.reshape((2, 2))
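
For context, a small chunked-array sketch; the shapes and chunk sizes are arbitrary:

```python
import numpy as np
import dask.array as da

# Chunked random matrix; every operation below is lazy until .compute().
arr = da.random.random((4000, 4000), chunks=(1000, 1000))
col_means = arr.mean(axis=0)

small = da.from_array(np.arange(6), chunks=2)
reshaped = small.reshape((2, 3))

print(col_means[:5].compute())
print(reshaped.compute())
```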

11. Dask Bag Operations

● Create bag from list: bag = db.from_sequence([1, 2, 3, 4, 5])
● Map function to bag: result = bag.map(lambda x: x * 2)
● Filter bag: filtered = bag.filter(lambda x: x > 2)
● Flatten bag of lists: flattened = bag.flatten()
● Reduce bag: sum_result = bag.sum()
● Group by key: grouped = bag.groupby(lambda x: x % 2)
● Count items: count = bag.count()
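
A short Bag pipeline sketch chaining filter, map, and a reduction (the data is a throwaway range):

```python
import dask.bag as db

bag = db.from_sequence(range(10), npartitions=2)

evens_squared = bag.filter(lambda x: x % 2 == 0).map(lambda x: x ** 2)
print(evens_squared.compute())   # [0, 4, 16, 36, 64]
print(bag.sum().compute())       # 45
```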

12. Computation and Persistence

● Compute result: result = df.column.sum().compute()
● Persist DataFrame in memory: df = df.persist()
● Visualize task graph: df.visualize(filename='graph.svg')
● Show a progress bar during computation: from dask.diagnostics import ProgressBar; with ProgressBar(): result = df.compute()
● Write DataFrame to CSV: df.to_csv('output/*.csv')
● Write DataFrame to Parquet: df.to_parquet('output/data.parquet')
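
A sketch combining persist, a progress bar, and a final compute; ProgressBar applies to the local (threaded/process) schedulers:

```python
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = dd.from_pandas(pd.DataFrame({"column": range(1_000)}), npartitions=4)

df = df.persist()                     # materialize partitions in memory for reuse
with ProgressBar():                   # progress reporting for local schedulers
    total = df["column"].sum().compute()
print(total)
```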

13. Parallel and Distributed Computing

● Set number of workers: from dask.distributed import Client; client = Client(n_workers=4)
● Submit function to cluster: future = client.submit(func, *args)
● Map function across cluster: futures = client.map(func, sequence)
● Gather results: results = client.gather(futures)
● Scale cluster: client.cluster.scale(10)  # Scale to 10 workers (when the client manages the cluster, e.g. a LocalCluster)
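
A runnable local-cluster sketch of the distributed pattern above; the worker counts and the square function are invented, and the __main__ guard matters when workers are started as subprocesses:

```python
from dask.distributed import Client

def square(x):
    return x ** 2

if __name__ == "__main__":
    client = Client(n_workers=2, threads_per_worker=1)   # local cluster

    futures = client.map(square, range(8))   # one future per input
    results = client.gather(futures)         # block until all finish
    print(results)                           # [0, 1, 4, ..., 49]

    client.close()
```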

14. Dask-ML Operations

● Import Dask-ML: import dask_ml.preprocessing as dmp
● Scale features: scaler = dmp.StandardScaler(); scaled = scaler.fit_transform(df)
● Train-test split: from dask_ml.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y)
● Linear regression: from dask_ml.linear_model import LinearRegression; lr = LinearRegression(); lr.fit(X, y)
● Logistic regression: from dask_ml.linear_model import LogisticRegression; lr = LogisticRegression(); lr.fit(X, y)
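
A rough Dask-ML sketch, assuming dask_ml is installed; the synthetic feature matrix and coefficients are invented for illustration:

```python
import numpy as np
import dask.array as da
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression

# Synthetic chunked data standing in for a real feature matrix.
X = da.random.random((1000, 3), chunks=(250, 3))
y = X.dot(np.array([1.5, -2.0, 0.5])) + 0.1 * da.random.random(1000, chunks=250)

X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.coef_)
```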

15. Advanced Dask Features

● Use custom scheduler: from dask.distributed import Client; client = Client('scheduler-address:8786')
● Create delayed function: from dask import delayed; func = delayed(lambda x: x * 2)
● Compute delayed function: result = func(10).compute()
● Create Dask collection from delayed objects: dask_list = [delayed(func)(i) for i in range(10)]

● Optimize Dask graph (fuse linear task chains): from dask.optimization import fuse; fused_graph, deps = fuse(dict(df.__dask_graph__()))
● Use callback for computation: from dask.callbacks import Callback; with Callback(): result = df.compute()
● Profile Dask computation: from dask.diagnostics import ResourceProfiler; with ResourceProfiler() as rprof: result = df.compute()
● Visualize resource usage: rprof.visualize()
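
A tiny delayed-graph sketch showing how the pieces above fit together (the functions are invented):

```python
from dask import delayed

@delayed
def double(x):
    return x * 2

@delayed
def total(values):
    return sum(values)

# Build a small lazy graph, then trigger it once with .compute().
parts = [double(i) for i in range(5)]
result = total(parts)
print(result.compute())   # 0 + 2 + 4 + 6 + 8 = 20
```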

16. Advanced Data Manipulation

● Pivot table: pivoted = df.pivot_table(values='value', index='category', columns='date')
● Melt DataFrame: melted = df.melt(id_vars=['id'], value_vars=['col1', 'col2'])
● Explode list column: exploded = df.explode('list_column')
● Cumulative sum: df['cumsum'] = df.groupby('category')['value'].cumsum()
● Rolling window with custom function: df['custom_roll'] = df['value'].rolling(window=3).apply(lambda x: x.max() - x.min())

17. Time Series Advanced Operations

● Lag multiple periods: df['lag_3'] = df.groupby('id')['value'].shift(3)
● Forward fill within groups: df['filled'] = df.groupby('category')['value'].ffill()
● Compute year-over-year growth: df['yoy_growth'] = df.groupby('id')['value'].pct_change(freq='Y')
● Resample with custom aggregation: resampled = df.resample('M').agg({'value': 'mean', 'count': 'sum'})
● Time-based rolling operation: df['roll_7d'] = df.rolling('7D')['value'].mean()

18. Window Functions

● Rank within groups: df['rank'] = df.groupby('category')['value'].rank()
● Percent rank: df['percentile'] = df.groupby('category')['value'].rank(pct=True)
● Cumulative distribution: df['cdf'] = df.groupby('category')['value'].rank(pct=True)
● Moving correlation: df['rolling_corr'] = df.groupby('id')[['x', 'y']].rolling(window=10).corr().unstack().iloc[:, 1]
● Expanding window calculations: df['expanding_mean'] = df.groupby('category')['value'].expanding().mean()

19. String and Text Processing

● Extract using regex: df['extracted'] = df['text'].str.extract(r'(\d+)', expand=False)
● Count occurrences: df['count'] = df['text'].str.count('pattern')
● Pad strings: df['padded'] = df['text'].str.pad(10, side='left', fillchar='0')
● Remove accents: df['clean'] = df['text'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
● Concatenate strings across rows: df['concat'] = df.groupby('id')['text'].transform(lambda x: ' '.join(x))

20. Complex Aggregations

● Weighted average: result = df.groupby('category').apply(lambda x: np.average(x['value'], weights=x['weight'])).compute()
● First and last values: result = df.groupby('category').agg({'value': ['first', 'last']})
● Custom aggregation function: result = df.groupby('category').agg({'value': lambda x: x.nlargest(3).mean()})
● Multiple aggregations: result = df.groupby('category').agg({'value': ['mean', 'median', 'std', 'min', 'max']})
● Aggregation with filtering: result = df[df['value'] > 0].groupby('category')['value'].mean()
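
For the custom aggregations, groupby(...).apply in Dask usually wants a meta hint; a sketch of the weighted-average idea, assuming 'value' and 'weight' columns exist:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "value":    [10.0, 20.0, 5.0, 15.0],
    "weight":   [1.0, 3.0, 2.0, 2.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# meta describes the output: a float Series named 'weighted_avg'.
weighted = df.groupby("category").apply(
    lambda g: np.average(g["value"], weights=g["weight"]),
    meta=("weighted_avg", "float64"),
)
print(weighted.compute())
```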

21. Advanced Joining and Merging

● Merge with indicator: merged = dd.merge(df1, df2, on='key', how='outer', indicator=True)
● Merge multiple DataFrames: from functools import reduce; merged = reduce(lambda left, right: dd.merge(left, right, on='key'), [df1, df2, df3])
● Merge with complex conditions: merged = dd.merge(df1, df2, left_on='key1', right_on='key2', suffixes=('_1', '_2'))
● Merge and aggregate: result = dd.merge(df1, df2, on='key').groupby('category').agg({'value': 'sum'})
● Self-join: self_joined = dd.merge(df, df, left_on='parent', right_on='id')

22. Data Quality and Validation

● Check for duplicates: duplicates = df.duplicated().sum().compute()
● Identify outliers (Z-score method): df['is_outlier'] = abs((df['value'] - df['value'].mean()) / df['value'].std()) > 3
● Check column correlation: correlation = df[['col1', 'col2']].corr().compute()
● Validate value ranges: invalid = df[(df['value'] < min_val) | (df['value'] > max_val)]
● Check for inconsistent categories: inconsistent = set(df['category'].unique().compute()) - set(valid_categories)
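
A compact validation sketch tying a few of these checks together; the data, thresholds, and valid_categories set are all made up:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"value": [1, 2, 2, 300], "category": ["a", "b", "b", "zz"]})
df = dd.from_pandas(pdf, npartitions=2)

# One way to count fully duplicated rows (len() triggers computation).
n_duplicates = len(df) - len(df.drop_duplicates())

# Simple z-score outlier flag on a numeric column.
z = (df["value"] - df["value"].mean()) / df["value"].std()
df["is_outlier"] = abs(z) > 3

# Categories present in the data but not in the allowed set.
valid_categories = {"a", "b"}
unexpected = set(df["category"].unique().compute()) - valid_categories

print(n_duplicates, unexpected)
print(df.compute())
```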

23. Advanced Dask Features

● Custom partitioning: df = df.repartition(npartitions=20)
● Repartition by column: df = df.set_index('date', sorted=True)
● Optimize task graph: from dask.optimization import cull; culled_dask, _ = cull(df.dask, list(df.__dask_keys__()))
● Use high-level task graphs: from dask.highlevelgraph import HighLevelGraph; hlg = HighLevelGraph.from_collections('name', df.dask, dependencies={'dep': df2.dask})
● Create custom Dask collection: from dask.base import DaskMethodsMixin; class CustomCollection(DaskMethodsMixin): ...

By: Waleed Mousa
