
Accelerated Data Science Getting Started Cheat Sheet: cuDF

This document is a cheat sheet for getting started with GPU-accelerated DataFrames in Python. It outlines how to create DataFrames from various data sources, extract properties from DataFrames and Series, save DataFrames to disk or convert them to other formats, query and transform DataFrames, and perform string operations on Series. Functions are provided to load and save DataFrames from/to CSV, JSON, and Parquet files and to convert between cuDF and pandas DataFrames. DataFrames can be queried, transformed with custom functions, joined, and aggregated, and strings in Series can be analyzed with regular expressions.


GPU Accelerated DataFrames in Python
Getting Started Cheat Sheet

Try out enterprise solutions for free with NVIDIA LaunchPad.

Get started with immediate access to hands-on labs at nvidia.com/try-data-science


For additional cheat sheets go to: nvidia.com/rapids-kit/
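Before trying the entries below, cuDF needs an NVIDIA GPU and a RAPIDS installation. Because cuDF deliberately mirrors the pandas API, a minimal setup sketch can fall back to pandas on machines without a GPU; the alias `xdf` used here is an illustrative convention, not part of either library.

```python
# Minimal setup sketch: cuDF requires an NVIDIA GPU and a RAPIDS install.
# Since cuDF mirrors the pandas API, fall back to pandas where no GPU is
# available so the same calls can be tried anywhere.
try:
    import cudf as xdf      # GPU-backed DataFrames
except ImportError:
    import pandas as xdf    # CPU fallback with the same API

# Create a DataFrame from a dictionary of columns, as in the CREATE section.
df = xdf.DataFrame({'foo': [1, 2, 3, 4], 'bar': ['a', 'b', 'c', None]})
print(df.head(2))           # top 2 rows, as in the QUERY section
```

The same `df` works with the property and query calls listed below regardless of which backend was imported.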

CREATE
Instantiate DataFrames and Series from files and host memory.

Create a DataFrame.
cudf.DataFrame([1,2,3,4], columns=['foo']) - from a list of elements
cudf.DataFrame({'foo': [1,2,3,4], 'bar': ['a','b','c',None]}) - from a dictionary of columns
cudf.DataFrame([(1,'a'), (2,'b')], columns=['foo','bar']) - from a list of tuples
cudf.from_pandas(pd.DataFrame([1,2,3,4], columns=['ints'])) - Convert a pandas DataFrame (CPU) to a cuDF DataFrame (GPU).
cudf.read_csv('results.csv') - Read the contents of a CSV file.
cudf.read_csv('results.csv', nrows=2, usecols=['foo']) - Read two rows and column foo of a CSV file.
cudf.read_csv('results.csv', skiprows=1, names=['foo','bar']) - Replace column names when reading a CSV file.
cudf.read_json('results.json') - Read the contents of a JSON file.
cudf.read_json('results.json', lines=True, engine='cudf') - Read a lines-formatted JSON file using the GPU.
cudf.read_parquet('results/df_default.parquet') - Read the contents of a Parquet file.
cudf.read_parquet('results/df_default.parquet', columns=['foo']) - Read column foo from a Parquet file.

Create a Series.
cudf.Series([0,1,2,3]) - from a list of elements
df['foo'] - Get column 'foo' from a DataFrame as a cuDF Series.

PROPERTIES
Extract properties from DataFrames and Series.

FOR DATAFRAMES
df.columns - Get a list of column names.
df.dtypes - Get a list of columns with their data types.
Retrieve rows and columns by index label.
df.loc[3] - row with index 3
df.loc[3, 'foo'] - row with index 3 and column 'foo'
df.loc[2:5, ['foo', 'bar']] - rows with labels 2 to 5 and columns 'foo' and 'bar'
df.shape - Get the data shape (number of rows, number of columns).
df.size - Get the total number of elements.
df.values - Get an array with all elements.

FOR SERIES
Retrieve rows by index label.
ser.loc[1] - row with index 1
ser.loc[1:4] - rows with indices 1 to 4
ser.values - Get an array of all elements.

SAVE
Persist data to disk or convert to other memory representations.
df.to_csv('results.csv') - Save a cuDF DataFrame in CSV format with index and header.
df.to_csv('results.csv', index=False, header=False) - Save a cuDF DataFrame in CSV format without index and header.
df.to_dlpack() - Convert a DataFrame to a DLPack tensor for deep learning.
df.to_json('results.json') - Save a cuDF DataFrame in JSON format.
df.to_json('results.json', orient='records', lines=True) - Save a cuDF DataFrame in JSON Lines format.
df.to_pandas() - Convert a cuDF DataFrame (GPU) to a pandas DataFrame (CPU).
df.to_parquet('results.parquet') - Save a cuDF DataFrame in Parquet format.

QUERY
Extract information from data.
df.head() - Retrieve the top 5 rows of a DataFrame.
df.head(2) - Retrieve the top 2 rows of a DataFrame.
df.memory_usage() - Learn how much memory your DataFrame consumes (in bytes).
df.nlargest(3, 'foo') - Retrieve the 3 rows with the largest values in column foo.
df.nsmallest(2, 'foo') - Retrieve the 2 rows with the smallest values in column foo.
df.query('foo == 1') - Get all rows where column foo equals 1.
df.query('foo > 10') - Get all rows where column foo is greater than 10.
df.sample() - Fetch a random row.
df.sample(3) - Fetch 3 random rows.

TRANSFORM
Alter the information and structure of DataFrames.
df.apply_rows(func, incols=['foo'], outcols={'bar': 'float64'}, kwargs={}) - Apply a custom transformation defined in func to column foo and store the result in column bar.
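apply_rows() is cuDF-specific: the kernel function is JIT-compiled for the GPU. The kernel shown on this sheet computes bar = foo + 1, and the same result can be sketched with a plain vectorized expression that runs identically in cuDF and pandas (the `xdf` alias and fallback import are illustrative assumptions, not library API).

```python
# The apply_rows kernel on this sheet writes bar[i] = foo[i] + 1.
# The same elementwise result via a vectorized expression, which works
# in both cuDF (GPU) and pandas (CPU fallback):
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # cuDF mirrors the pandas API

df = xdf.DataFrame({'foo': [1.0, 2.0, 3.0]})
df['bar'] = df['foo'] + 1  # equivalent of the apply_rows kernel
print(df)
```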
def func(foo, bar):
    for i, f in enumerate(foo):
        bar[i] = f + 1
- Kernel definition to use with the apply_rows() function.
cudf.concat([df1, df2]) - Append one DataFrame to another.
df.drop(1) - Remove the row with index 1.
df.drop([1,2]) - Remove the rows with indices 1 and 2.
df.drop('foo', axis=1) - Remove column foo.
df.dropna() - Remove rows with one or more missing values.
df.dropna(subset='foo') - Remove rows with a missing value in column foo.
df.fillna(-1) - Replace any missing value with a default.
df.fillna({'foo': -1}) - Replace missing values in column foo with a default.
df1.join(df2) - Join with a DataFrame on the index.
df1.merge(df2, on='foo', how='inner') - Perform an inner join with a DataFrame on column foo.
df1.merge(df2, left_on='foo', right_on='bar', how='left') - Perform a left outer join with a DataFrame on different keys.
df.rename({'foo': 'bar'}, axis=1) - Rename column foo to bar.
df.rename({1: 101}) - Replace index 1 with the value 101.
df.reset_index() - Replace the index and retain the old one as a column.
df.reset_index(drop=True) - Replace the index and discard the old one.
df.set_index('foo') - Replace the index with the values of column foo.
df.set_index('foo', drop=False) - Replace the index with the values of column foo and retain the column.

STRING
Operate on string columns on the GPU.
ser.str.contains('foo') - Check if a Series of strings contains foo.
ser.str.contains('foo[a-z]+') - Check if a Series of strings contains words starting with foo.
ser.str.extract('(foo)') - Retrieve regex groups matching a pattern in a Series of strings.
ser.str.extract('[a-z]+flow (\d)') - Retrieve IDs of dataflows, workflows, etc., in a Series of strings.
ser.str.findall('([a-z]+flow)') - Retrieve all instances of words like dataflow, workflow, etc.
ser.str.len() - Find the length of each string.
ser.str.lower() - Cast all letters in a string to lowercase.
ser.str.match('[a-z]+flow') - Check if every element matches the pattern.
ser.str.ngrams_tokenize(n=2, separator='_') - Generate all bigrams from a string, separated by underscores.
ser.str.pad(width=10) - Make every string equal length.
ser.str.pad(width=10, side='both', fillchar='$') - Make every string equal length with the word centered and padded with dollar signs.
ser.str.replace('foo', 'bar') - Replace all instances of the word foo with bar.
ser.str.replace('f..', 'bar') - Replace all instances of 3-letter words beginning with f with bar.
ser.str.split() - Split strings on spaces.
ser.str.split(',', n=5) - Split strings on commas and retain only the first 5 splits (a 6th column retains the remainder of the string).
tokens, masks, metadata = ser.str.subword_tokenize('hash.txt') - Tokenize text using a perfectly hashed BERT vocabulary.
ser.str.upper() - Cast all letters in a string to uppercase.
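The TRANSFORM and STRING entries above combine naturally: join two tables on a key, then filter a string column with a regex. A short sketch, runnable on GPU with cuDF or on CPU with pandas (the `xdf` alias, column names, and sample data are illustrative assumptions):

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # cuDF mirrors the pandas API

left = xdf.DataFrame({'foo': [1, 2, 3],
                      'name': ['dataflow', 'workflow', 'other']})
right = xdf.DataFrame({'foo': [1, 2], 'score': [0.5, 0.9]})

joined = left.merge(right, on='foo', how='inner')  # inner join on column foo
flows = left['name'].str.contains('[a-z]+flow')    # regex check per element
print(joined.shape)
print(flows.tolist())
```

The inner join keeps only the two rows whose foo values appear in both tables, and the regex marks the names ending in "flow".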
SUMMARIZE
Learn from data by aggregating and exploring.
df.groupby(by='foo').agg({'bar': 'sum', 'baz': 'count'}) - Aggregate a DataFrame: sum the elements of bar and count the elements of baz, grouped by the values of foo.
df.describe() - Learn basic statistics about a DataFrame.
df.describe(percentiles=[.1,.9]) - Learn basic statistics about a DataFrame and produce only the 1st and 9th deciles.
df.max() - Learn the maximum value in each column.
df.max(axis=1) - Learn the maximum value in each row.
df.mean() - Learn the average value of each column.
df.mean(axis=1) - Learn the average value of each row.
df.min() - Learn the minimum value in each column.
df.min(axis=1) - Learn the minimum value in each row.
df.quantile() - Learn the median of each column.
df.quantile(.25) - Learn the 1st quartile of each column.
df.std() - Learn the standard deviation of each column.
df.std(axis=1) - Learn the standard deviation of each row.
df.sum() - Get the sum of each column.
df.sum(axis=1) - Get the sum of each row.
ser.unique() - Find all unique values in a Series.

CATEGORICAL
Work with categorical columns on the GPU.
ser.cat.add_categories(['foo','bar']) - Extend the list of allowed categorical values.
ser.cat.categories - Retrieve the list of all categories.
ser.cat.remove_categories(['foo']) - Remove the foo category from a categorical column.

DATETIME
Deal with date and time columns on the GPU.
ser.dt.day - Extract the day from a DateTime column.
ser.dt.dayofweek - Extract the day of the week from a DateTime column.
ser.dt.year - Extract the year from a DateTime column.

MATH/STAT
Perform mathematical and statistical operations on columns.
df.corr() - Calculate the coefficient of correlation.
df.exp() - Exponentiate the values in all columns.
df.kurt() - Find the kurtosis of each column.
df.log() - Take the logarithm of the values in all columns.
df.pow(2) - Raise the values in all columns to the power of 2.
df.skew() - Find the skewness of each column.
df.sqrt() - Find the square roots of the values in all columns.
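To close, a sketch of the SUMMARIZE-style groupby aggregation, which behaves identically in cuDF and pandas (the `xdf` alias and sample data are illustrative assumptions):

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # cuDF mirrors the pandas API

df = xdf.DataFrame({'foo': ['a', 'a', 'b'],
                    'bar': [1, 2, 3],
                    'baz': [10, 20, 30]})

# Sum bar and count baz, grouped by the values of foo.
agg = df.groupby(by='foo').agg({'bar': 'sum', 'baz': 'count'})
print(agg)
print(df['bar'].sum(), df['bar'].mean())  # whole-column reductions
```

The result is indexed by the distinct values of foo, one aggregated row per group.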
