Data Cheat Sheet

Uploaded by

nikhildixit31
Copyright
© © All Rights Reserved
Pandas Reference Sheet

POWERED BY THE SCIENTISTS AT THE DATA INCUBATOR

Loading/exporting a data set

path_to_file: string indicating the path to the file, e.g., 'data/results.csv'
df = pd.read_csv(path_to_file)—read a CSV file
df = pd.read_excel(path_to_file)—read an Excel file
dfs = pd.read_html(path_to_file)—parses HTML to find all tables; returns a list of data frames
df.to_csv(path_to_file)—creates CSV of the data frame

Examining the data

df.head(n)—returns first n rows
df.tail(n)—returns last n rows
df.describe()—returns summary statistics for each numerical column
df['State'].unique()—returns unique values for the column
df.columns—returns column names
df.shape—returns the number of rows and columns
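The loading and examining calls can be tried together in a short round trip. This is a minimal sketch, assuming the example data frame from this sheet; the temporary directory and the file name results.csv are placeholders chosen for illustration, and index_col=0 is an extra argument used here so the row labels survive the round trip.

```python
import os
import tempfile

import pandas as pd

# Example data frame from this sheet (index labels a, b, c).
df = pd.DataFrame(
    {
        "State": ["Texas", "New York", "Washington"],
        "Capital": ["Austin", "Albany", "Olympia"],
        "Population": [28700000, 19540000, 7536000],
    },
    index=["a", "b", "c"],
)

# Write the frame to a temporary CSV file and read it back.
# "results.csv" is a hypothetical file name used only for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_to_file = os.path.join(tmp_dir, "results.csv")
    df.to_csv(path_to_file)
    loaded = pd.read_csv(path_to_file, index_col=0)  # first column holds the labels

first_two = loaded.head(2)             # first 2 rows
last_row = loaded.tail(1)              # last row
summary = loaded.describe()            # statistics for the numerical column only
states = loaded["State"].unique()      # array of distinct values
column_names = list(loaded.columns)    # column labels as a plain list
n_rows, n_cols = loaded.shape          # (rows, columns)
```

Without index_col=0, read_csv would load the saved labels as an ordinary unnamed column and assign a fresh integer index.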

Selecting and filtering

SELECTING COLUMNS
df['State']—selects 'State' column
df[['State', 'Population']]—selects 'State' and 'Population' columns

SELECTING BY LABEL
df.loc['a']—selects row by index label
df.loc['a', 'State']—selects single value of row 'a' and column 'State'

SELECTING BY POSITION
df.iloc[0]—selects row in position 0
df.iloc[0, 0]—selects single value by position at row 0 and column 0

FILTERING
df[df['Population'] > 20000000]—filter out rows not meeting the condition
df.query("Population > 20000000")—filter out rows not meeting the condition

Statistical operations

can be applied to both data frames and series/columns
df['Population'].sum()—sum of all values of a column
df.sum()—sum for each column (pass numeric_only=True if the data frame also has non-numeric columns)
df.mean()—mean
df.std()—standard deviation
df.min()—minimum value
df.max()—maximum value
df.count()—count of values, excludes missing values
df['Population'].apply(func)—apply func to each value of column

Data cleaning and modifications

df['State'].isnull()—returns True/False for rows with missing values
df.dropna(axis=0)—drop rows containing missing values
df.dropna(axis=1)—drop columns containing missing values
df.fillna(0)—fill in missing values, here filled with 0
df.sort_values('Population', ascending=True)—sort rows by a column's values
df.set_index('State')—changes index to a specified column
df.reset_index()—makes the current index a column
df.rename(columns={'Population': 'Pop.'})—renames columns

   State       Capital  Population
a  Texas       Austin   28700000
b  New York    Albany   19540000
c  Washington  Olympia  7536000

Example data frame
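The selecting, filtering, statistics, and cleaning calls can all be tried on the example data frame. The sketch below is illustrative only: the variable names, the lambda passed to apply, and the copy with one missing population value are assumptions introduced for demonstration, not part of the original sheet.

```python
import numpy as np
import pandas as pd

# Example data frame from this sheet.
df = pd.DataFrame(
    {
        "State": ["Texas", "New York", "Washington"],
        "Capital": ["Austin", "Albany", "Olympia"],
        "Population": [28700000, 19540000, 7536000],
    },
    index=["a", "b", "c"],
)

# Selecting
state_col = df["State"]                  # one column, as a Series
two_cols = df[["State", "Population"]]   # several columns, as a data frame
row_a = df.loc["a"]                      # row by index label
value = df.loc["a", "State"]             # single value by labels
first_row = df.iloc[0]                   # row by position
first_cell = df.iloc[0, 0]               # single value by position

# Filtering: two equivalent ways to keep rows above 20 million
big_mask = df[df["Population"] > 20000000]
big_query = df.query("Population > 20000000")

# Statistical operations
total = df["Population"].sum()
numeric_means = df.mean(numeric_only=True)                # skips the text columns
in_millions = df["Population"].apply(lambda p: p / 1e6)   # illustrative func

# Cleaning, on a hypothetical copy with one missing value.
# The column is cast to float first so it can hold NaN.
dirty = df.astype({"Population": float})
dirty.loc["b", "Population"] = np.nan
missing = dirty["Population"].isnull()   # True where a value is missing
no_missing_rows = dirty.dropna(axis=0)   # drops row b
no_missing_cols = dirty.dropna(axis=1)   # drops the Population column
filled = dirty["Population"].fillna(0)   # missing value replaced by 0

# Modifications: each call returns a new data frame and leaves df unchanged
by_pop = df.sort_values("Population", ascending=True)
by_state = df.set_index("State")
flat = df.reset_index()                  # old index becomes a column named "index"
renamed = df.rename(columns={"Population": "Pop."})
```

None of these calls modify df in place, so results need to be assigned to a name to be kept.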
© 2019 Pragmatic Institute, LLC
Grouping and aggregation
grouped = df.groupby(by='col1')—creates a GroupBy object
grouped['col2'].mean()—mean value of 'col2' for each group
grouped.agg({'col2': 'mean', 'col3': ['mean', 'std']})—apply different functions to different columns
grouped.apply(func)—apply func to each group
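A small worked example may make the grouping calls easier to follow. The values below and the helper col2_range are made up for illustration; only the column names col1, col2, and col3 come from this sheet. Aggregations are named by string here, which is one common way to pass them to agg.

```python
import pandas as pd

# Hypothetical data: two groups ("x" and "y") in col1.
df = pd.DataFrame(
    {
        "col1": ["x", "x", "y", "y"],
        "col2": [1.0, 3.0, 10.0, 20.0],
        "col3": [2.0, 4.0, 6.0, 10.0],
    }
)

grouped = df.groupby(by="col1")

# Mean of col2 within each group.
col2_means = grouped["col2"].mean()

# Different aggregations per column; the result has two-level column labels
# such as ("col3", "std").
stats = grouped.agg({"col2": "mean", "col3": ["mean", "std"]})


def col2_range(group):
    """Illustrative func: spread of col2 within one group."""
    return group["col2"].max() - group["col2"].min()


# The value columns are selected explicitly so func receives only those columns.
ranges = grouped[["col2", "col3"]].apply(col2_range)
```

Here col2_means holds 2.0 for group x and 15.0 for group y, and ranges holds 2.0 and 10.0.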

[Figure: four tables, each with columns col1, col2, col3]

Merging data frames


There are several ways to merge two data frames, depending on the value of method. The resulting indices are integers starting with zero.

df1.merge(df2, how=method, on='State')

   State       Capital  Population
a  Texas       Austin   28700000
b  New York    Albany   19540000
c  Washington  Olympia  7536000

Data frame df1

   State       Highest Point
x  Washington  Mount Rainier
y  New York    Mount Marcy
z  Nebraska    Panorama Point

Data frame df2

   State       Capital  Population  Highest Point
0  Texas       Austin   28700000    NaN
1  New York    Albany   19540000    Mount Marcy
2  Washington  Olympia  7536000     Mount Rainier

how='left'

   State       Capital  Population  Highest Point
0  New York    Albany   19540000    Mount Marcy
1  Washington  Olympia  7536000     Mount Rainier

how='inner'

   State       Capital  Population  Highest Point
0  New York    Albany   19540000    Mount Marcy
1  Washington  Olympia  7536000     Mount Rainier
2  Nebraska    NaN      NaN         Panorama Point

how='right'

   State       Capital  Population  Highest Point
0  Texas       Austin   28700000    NaN
1  New York    Albany   19540000    Mount Marcy
2  Washington  Olympia  7536000     Mount Rainier
3  Nebraska    NaN      NaN         Panorama Point

how='outer'
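The four merge methods can be compared by running them on df1 and df2 as defined in this section. This is a minimal sketch; the dictionary names merged and row_counts are introduced here for illustration.

```python
import pandas as pd

# df1 and df2 as shown in this section.
df1 = pd.DataFrame(
    {
        "State": ["Texas", "New York", "Washington"],
        "Capital": ["Austin", "Albany", "Olympia"],
        "Population": [28700000, 19540000, 7536000],
    },
    index=["a", "b", "c"],
)
df2 = pd.DataFrame(
    {
        "State": ["Washington", "New York", "Nebraska"],
        "Highest Point": ["Mount Rainier", "Mount Marcy", "Panorama Point"],
    },
    index=["x", "y", "z"],
)

# Run the same merge once per method and keep each result.
merged = {
    how: df1.merge(df2, how=how, on="State")
    for how in ["left", "inner", "right", "outer"]
}

# Number of rows each method keeps.
row_counts = {how: len(result) for how, result in merged.items()}
```

The row counts come out as 3, 2, 3, and 4 for left, inner, right, and outer. The order of rows in the right and outer results may differ from the tables above depending on the pandas version, so it is safer to sort explicitly when order matters.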

Register or learn more about other courses in our data curriculum by visiting pragmaticinstitute.com/data-science or calling 480.515.1411.
