Accelerated Data Science Getting Started Cheat Sheet Cudf 2003937 r4
DataFrames in Python
Getting Started Cheat Sheet
CREATE
Instantiate DataFrames from files and host memory.
Create a series.
cudf.Series([0,1,2,3]) - Create a Series from a list of elements.

PROPERTIES
Extract properties from DataFrames and Series.
df['foo'] - Get column foo from the DataFrame as a cuDF Series.
df.dtypes - Get a list of columns with their data types.
Retrieve rows and columns by index label.
df.loc[3] - Get the row with index label 3.
df.loc[3, 'foo'] - Get the value at index label 3 in column foo.
df.loc[2:5, ['foo', 'bar']] - Get rows with labels 2 to 5 (inclusive) and columns foo and bar.

QUERY
Extract information from data.
df.head() - Retrieve the top 5 rows from the DataFrame.
df.nsmallest(2, 'foo') - Retrieve the 2 rows with the smallest values in column foo.
df.query('foo == 1') - Get all rows where column foo equals 1.
df.query('foo > 10') - Get all rows where column foo is greater than 10.
df.sample() - Fetch a random row.
df.sample(3) - Fetch 3 random rows.
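The label-based selection and query entries above can be tried together on a small frame. This is a minimal sketch, not an official example: the sample values and variable names are made up for illustration, and the fallback to pandas when cuDF is not installed is an assumption that relies on the two libraries sharing these calls.

```python
try:
    import cudf as xdf  # GPU DataFrame library covered by this cheat sheet
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in exposing the same calls

# Hypothetical sample data with the default integer index 0..5.
df = xdf.DataFrame(
    {"foo": [5, 1, 12, 1, 30, 7], "bar": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]}
)

foo = df["foo"]                      # one column as a Series
row3 = df.loc[3]                     # the row whose index label is 3
cell = df.loc[3, "foo"]              # a single value
block = df.loc[2:5, ["foo", "bar"]]  # label slices include both endpoints

ones = df.query("foo == 1")          # rows where foo equals 1
big = df.query("foo > 10")           # rows where foo is greater than 10
smallest = df.nsmallest(2, "foo")    # the 2 rows with the smallest foo
picked = df.sample(3, random_state=0)  # 3 random rows, seeded for repeatability

print(block)
print(big)
```

Note that df.loc[2:5] returns four rows here, because label slicing includes the end label, unlike positional slicing.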
ser.str.findall('([a-z]+flow)') - Retrieve all instances of words like dataflow, workflow, etc.
ser.str.len() - Find the length of each string.
ser.str.lower() - Convert all letters in each string to lowercase.
ser.str.match('[a-z]+flow') - Check whether each element matches the pattern.

df.drop('foo', axis=1) - Remove column foo.
df.dropna() - Remove rows with one or more missing values.
df.dropna(subset=['foo']) - Remove rows with a missing value in column foo.
df.fillna(-1) - Replace any missing value with a default.
df.fillna({'foo': -1}) - Replace a missing value in column foo with a default.
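The drop, dropna, and fillna entries above differ mainly in which rows or columns they touch. The sketch below shows the difference on a four-row frame. The data and variable names are hypothetical, and the pandas fallback is an assumption used only so the example also runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library covered by this cheat sheet
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in exposing the same calls

# Hypothetical data: foo is missing in rows 1 and 3, bar is missing in row 2.
df = xdf.DataFrame(
    {"foo": [1.0, None, 3.0, None], "bar": [10.0, 20.0, None, 40.0]}
)

without_foo = df.drop("foo", axis=1)      # column foo removed
complete_rows = df.dropna()               # keeps only rows with no missing value
foo_present = df.dropna(subset=["foo"])   # keeps rows where foo is present
filled_all = df.fillna(-1)                # every missing value becomes -1
filled_foo = df.fillna({"foo": -1})       # only column foo is filled

print(complete_rows)
print(filled_foo)
```

Passing subset as a list works in both libraries, which is why the sketch uses ['foo'] rather than a bare string.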
ser.str.ngrams_tokenize(n=2, separator='_') - Generate all bigrams from each string, with the tokens of each bigram joined by an underscore.
ser.str.pad(width=10) - Pad every string to a width of at least 10 characters.
ser.str.pad(width=10, side='both', fillchar='$') - Pad every string to a width of 10 with the text centered and dollar signs as fill.
ser.str.replace('foo', 'bar') - Replace all instances of foo with bar.

df1.join(df2) - Join with a DataFrame on index.
df1.merge(df2, on='foo', how='inner') - Perform an inner join with a DataFrame on column foo.
df1.merge(df2, left_on='foo', right_on='bar', how='left') - Perform a left outer join with a DataFrame on different keys.
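The join and merge entries above can be compared side by side. This is an illustrative sketch under stated assumptions: the frames left, right, and other are invented for the example, and pandas is used as a fallback when cuDF is unavailable. Row order after a merge is not checked here, since it should not be relied on.

```python
try:
    import cudf as xdf  # GPU DataFrame library covered by this cheat sheet
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in exposing the same calls

# Hypothetical frames: keys 2 and 3 appear on both sides, 1 and 4 on one side only.
left = xdf.DataFrame({"foo": [1, 2, 3], "x": [10, 20, 30]})
right = xdf.DataFrame({"foo": [2, 3, 4], "y": [200, 300, 400]})
other = xdf.DataFrame({"bar": [2, 3, 4], "y": [200, 300, 400]})

# Inner join on a shared column name: only keys present on both sides survive.
inner = left.merge(right, on="foo", how="inner")

# Left outer join on differently named keys: every left row is kept,
# and y is missing where no match exists.
left_outer = left.merge(other, left_on="foo", right_on="bar", how="left")

# Index-based join: move the keys into the index first, then join.
by_index = left.set_index("foo").join(other.set_index("bar"))

print(inner)
print(left_outer)
```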
df.rename({'foo': 'bar'}, axis=1) - Rename column foo to bar.
df.rename({1: 101}) - Replace index 1 with value 101.
df.reset_index() - Replace index and retain the old one as a column.
df.reset_index(drop=True) - Replace index and remove the old one.
df.set_index('foo') - Replace index with the values of column foo.
df.set_index('foo', drop=False) - Replace index with the values of column foo and retain the column.

ser.str.replace('f..', 'bar') - Replace every match of the pattern f.. (an f followed by any two characters) with bar.
ser.str.split() - Split the string on whitespace.
ser.str.split(',', n=5) - Split the string on commas at most 5 times (the 6th part retains the remainder of the string).
tokens, masks, metadata = ser.str.subword_tokenize('hash.txt') - Tokenize text using a perfectly hashed BERT vocabulary.
ser.str.upper() - Convert all letters in each string to uppercase.
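A few of the string entries above can be sketched on a three-element Series. The sample strings and variable names are hypothetical, and pandas is assumed as a stand-in when cuDF is not installed. The regex argument is passed explicitly because the default for str.replace has differed across library versions. expand=True is added to split so the parts come back as columns.

```python
try:
    import cudf as xdf  # GPU DataFrame library covered by this cheat sheet
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in exposing the same calls

ser = xdf.Series(["dataflow", "workflow", "foo,bar"])  # hypothetical sample

upper = ser.str.upper()               # uppercase copy of each string
lengths = ser.str.len()               # number of characters per string
is_flow = ser.str.match("[a-z]+flow") # True where the pattern matches from the start

# Literal replacement: only the exact text "foo" is replaced.
literal = ser.str.replace("foo", "bar", regex=False)

# Regex replacement: "f.." matches an f followed by any two characters,
# so it also rewrites the "flo" inside "dataflow".
pattern = ser.str.replace("f..", "bar", regex=True)

# Center each string in a field of width 10, filled with dollar signs.
padded = ser.str.pad(width=10, side="both", fillchar="$")

# Split on commas, at most 5 times, one column per part.
parts = ser.str.split(",", n=5, expand=True)

print(pattern)
print(parts)
```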
SUMMARIZE
Learn from data by aggregating and exploring.
df.groupby(by='foo').agg({'bar': 'sum', 'baz': 'count'}) - Aggregate DataFrame: sum elements of bar, count elements of baz by values of foo.
df.describe() - Learn basic statistics about DataFrame.
df.describe(percentiles=[.1,.9]) - Learn basic statistics about DataFrame and produce the 1st and 9th deciles.

CATEGORICAL
Work with categorical columns on GPU.
ser.cat.add_categories(['foo','bar']) - Extend the list of allowed categorical values.
ser.cat.categories - Retrieve the list of all categories.
ser.cat.remove_categories(['foo']) - Remove the foo category from a categorical column.
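The groupby and describe entries above can be sketched as follows. The data is made up for illustration, with one missing baz value so that count and sum visibly behave differently, and pandas is assumed as a fallback when cuDF is not available.

```python
try:
    import cudf as xdf  # GPU DataFrame library covered by this cheat sheet
except ImportError:
    import pandas as xdf  # assumption: CPU stand-in exposing the same calls

# Hypothetical data: baz is missing in one of the "b" rows.
df = xdf.DataFrame(
    {
        "foo": ["a", "b", "a", "b", "a"],
        "bar": [1, 2, 3, 4, 5],
        "baz": [10.0, None, 30.0, 40.0, 50.0],
    }
)

# One row per distinct foo value: bar is summed, baz counts non-missing entries.
agg = df.groupby(by="foo").agg({"bar": "sum", "baz": "count"})

# Summary statistics for the numeric columns, with the 10% and 90% percentiles.
stats = df.describe(percentiles=[0.1, 0.9])

print(agg)
print(stats)
```

The count aggregation skips missing values, so group b reports a baz count of 1 even though it has two rows.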