Python CheatSheet
Boolean (True/False)
Functions
*For unlike columns, i.e. different column names but similar items in rows, write pd.merge(df1, df2, left_on='Column1', right_on='Column2')
Concat
pd.concat([df1, df2], axis=0, join='outer', ignore_index=False)
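A minimal sketch of the merge and concat calls above. The two frames and their values are invented for illustration; only the parameter names come from the sheet.

```python
import pandas as pd

# Hypothetical frames: similar values, but the key columns are named differently.
df1 = pd.DataFrame({"Column1": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"Column2": [2, 3, 4], "score": [20, 30, 40]})

# Match rows where Column1 equals Column2 (an inner join by default).
merged = pd.merge(df1, df2, left_on="Column1", right_on="Column2")

# Stack frames on top of each other; the original row labels are kept.
stacked = pd.concat([df1, df1], axis=0, join="outer", ignore_index=False)

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([df1, df1], ignore_index=True)

print(merged)
print(list(stacked.index), list(renumbered.index))
```

Both key columns survive the merge, so one of them can be dropped afterwards if it is redundant.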
Slicing Concat
Unique
df[col_name].unique()
• Return only the unique values of the column
• In order of appearance, removes repeats
Python Shortcuts
Random Syntax
Type Conversion
Formatting
1) ^ < > : Centre/left/right align
2) s=string, d=decimal, f=float
3) .2f: 2 d.p.
4) ',': Auto splits numbers by thousands
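The format codes above can be tried with f-strings. This is a small sketch with made-up values; the variable names are only for illustration.

```python
value = 1234567.891
name = "pi"

centred = f"{name:^6}"        # '  pi  '  centre in a width of 6
left = f"{name:<6}"           # 'pi    '  left align
right = f"{name:>6}"          # '    pi'  right align
two_dp = f"{value:.2f}"       # '1234567.89'
thousands = f"{value:,.2f}"   # '1,234,567.89'
as_int = f"{42:d}"            # '42'

print(centred, left, right, two_dp, thousands, as_int, sep="|")
```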
CSV
Input
Math Operations
Lists
Functions: len(), min() or max(), sum(), insert(), reverse()
.split(): Separates a string into a list, by a certain delimiter (" ", "-")
.strip(): Removes white space at the beginning and end of a string
.pop(-1): Removes the item at index -1 (the last item)
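A short sketch of the list functions and string methods listed above, using made-up values.

```python
numbers = [3, 1, 2]
numbers.insert(0, 10)     # [10, 3, 1, 2]
numbers.reverse()         # [2, 1, 3, 10]
last = numbers.pop(-1)    # removes and returns the last item, 10

summary = (len(numbers), min(numbers), max(numbers), sum(numbers))  # (3, 1, 3, 6)

parts = "2024-01-31".split("-")   # ['2024', '01', '31']
clean = "  hello  ".strip()       # 'hello'

print(numbers, last, summary, parts, clean)
```

Note that split() and strip() are string methods; split() is the one that produces a list.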
Append Sort Reverse
Plot types: bar, scatter, box, stackplot, hist, pie
Tuple
Errors
Syntax Error: Syntax/variables/indent/brackets
Runtime Error: Cannot run, debug
Logical Error: Able to run, but output wrong
Dictionary
Functions: len(dict), dict.clear() = empties dict
dict.items() = returns key-value pairs: dict_items([('key1', 'value1'), etc])
dict.keys() = returns all keys: dict_keys(['key1', 'key2'])
dict.values() = returns all values: dict_values(['value1', 'value2'])
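A quick sketch of the dictionary functions above. The dictionary contents are hypothetical.

```python
prices = {"apple": 1.5, "pear": 2.0}  # hypothetical data

pairs = list(prices.items())    # [('apple', 1.5), ('pear', 2.0)]
keys = list(prices.keys())      # ['apple', 'pear']
values = list(prices.values())  # [1.5, 2.0]
size = len(prices)              # 2

prices.clear()                  # empties the dict in place
print(pairs, keys, values, size, prices)
```

Wrapping the views in list() is optional; it just makes the results easy to print and index.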
Simple
Iteration
Data Definition Language (DDL)
CREATE SCHEMA AUTHORIZATION/TABLE/INDEX/VIEW
ALTER TABLE (AS); NOT NULL; UNIQUE; PRIMARY KEY
FOREIGN KEY; DEFAULT; CHECK; DROP TABLE/INDEX/VIEW
Changing Title & Sort
SELECT Movie_title AS Movies
FROM Movie
ORDER BY Movie_title;
Changing Title & Listing Runtime > 120
SELECT Movie_title AS Movies
FROM Movie
WHERE Movie_Runtime > 120
ORDER BY Movie_title;
import numpy as np (data computation & data manipulation)
import pandas as pd (Dataframe & Data structure)
import matplotlib.pyplot as plt (Visualisation)
from scipy import <sub_module_name> (cluster, stats, linalg)
pd.DataFrame(data=<df/dict>, index=['<char>', …], columns=['<char>', …], dtype=<char>, copy=<True/False>) → Construct Dataframe
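A minimal sketch of the DataFrame constructor described above. The column names, index labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
data = {"beds": [1, 2, 3], "price": [80, 120, 150]}

# Build from a dict, with explicit row labels, column order and dtype.
df = pd.DataFrame(data=data, index=["a", "b", "c"], columns=["beds", "price"], dtype=float)

# A NumPy array also works as data; columns then need explicit names.
arr_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

print(df)
print(arr_df)
```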
<var1> = pd.read_csv(“<file>”) → Read CSV file
Count How Many Movies in Each Genre
SELECT Movie_Genre, COUNT(*)
FROM Movie
GROUP BY Movie_Genre;
df.head() → get first 5 rows | df.tail() → last 5 rows [pass a number to specify no. of rows]
df.index → gives range or specific index
df.columns or df.keys() → show the headers of the columns
df.shape → size of data (rows x columns)
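The inspection calls above can be tried on a small frame. In this sketch the CSV text is hypothetical and an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "Movie_title,Movie_Runtime\nA,95\nB,130\nC,110\nD,142\nE,88\nF,121\nG,101\n"
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)       # first 2 rows instead of the default 5
last_five = df.tail()        # last 5 rows by default
n_rows, n_cols = df.shape    # (rows, columns)
headers = list(df.columns)   # column names

print(first_two)
print(df.index)
print(n_rows, n_cols, headers)
```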
Select Unique Movie ID from Carey Dopico
SELECT DISTINCT Movie_ID
FROM Rental
INNER JOIN Customer ON Customer.Cus_No = Rental.Cus_No
WHERE Customer.Cus_Lname = 'Carey' AND Customer.Cus_Fname = 'Dopico';
df.idxmax() → returns the index of the max value
df.isnull() → returns True/False for each cell (True where the value is missing)
df.dtypes → data type of each column in a Dataframe
df["column"].unique() → shows the different items in the column
df.drop(number) → Drop the row with that index label
df.append(another_df) → Insert another dataframe into df (removed in pandas 2.0; use pd.concat([df, another_df]))
df.pop(column_name) → remove a column (and return it)
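A small sketch of the calls above on invented data. pd.concat is used in place of df.append, which no longer exists in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({"type": ["flat", "house", "flat"], "price": [100.0, None, 250.0]})

max_label = df["price"].idxmax()   # index label of the largest price (NaN is skipped)
missing = df.isnull()              # same shape as df, True where a value is missing
kinds = df["type"].unique()        # unique values, in order of appearance

without_row0 = df.drop(0)          # new frame without the row labelled 0

extra = pd.DataFrame({"type": ["studio"], "price": [90.0]})
combined = pd.concat([df, extra], ignore_index=True)  # stands in for df.append(extra)

popped = df.pop("price")           # removes the column from df and returns it
print(max_label, list(kinds), list(df.columns))
```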
SQL Constraints
NOT NULL – Column does not accept nulls
UNIQUE – All values in column are unique
DEFAULT – Assign value to an attribute when new row is added to table
CHECK – Validates data when an attribute value is entered
Listing First and Last Name Who Ordered Movie 3
SELECT DISTINCT Cus_Fname, Cus_Lname
FROM Customer
INNER JOIN Rental ON Customer.Cus_No = Rental.Cus_No
WHERE Movie_ID = 3;
df.duplicated() → returns True/False for each row | df.drop_duplicates()
df.drop_duplicates(subset=['col_name']) → drop duplicates based on that column
df.fillna('content to fill')
df.dropna() → drop all rows with NaN | df.replace("111", "Others", inplace=True)
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
axis=0 drops rows (index), axis=1 drops columns
how='any' vs 'all': OR or AND logic (in recent pandas, how and thresh cannot both be passed in one call)
subset=['col_name'] → which columns to look for NAs in
thresh=2 → keep only rows with at least 2 non-NA values | inplace=True → change original instead of returning a new one
df.mean(), .median(), .mode(), .std(), .sum(), .min(), .max(), .var()
df.describe() → auto tabulate all variables and give a numeric summary
df.corr(method='pearson') → correlation
df.info() → show info of each column
df.loc["f", "age"] = 1.5 → Changing Cell Values
Listing First and Last Name Who Never Ordered Movie 3
SELECT DISTINCT Cus_Fname, Cus_Lname
FROM Customer
WHERE NOT (Cus_Fname = 'Baitmore' AND Cus_Lname = 'Aliza');
Average Movie Runtime for Movies with Runtime < 120
SELECT AVG(Movie_Runtime) AS AVERAGE
FROM Movie
WHERE Movie_Runtime < 120;
Data Selection
df.<column> → print the column data only (any column header)
df.<column>.mean() → mean of values in column
df[df.<column> == 3] → rows where the column equals 3 | df[df.<column> == 3].price.mean() → mean price of those rows
df.loc[] → Locating a particular row index or column name
df.loc[0, ['beds', 'type']] → see row 0 in 2 columns | df.loc[[0, 1]] → 1st & 2nd rows
loc includes the last row, e.g. 2:11 includes 11; iloc doesn't
Pandas Additional
print(df) → for a large DataFrame, shows only the first 5 and last 5 rows
df.to_string() → return entire DataFrame as a string
df.head() → Returns headers & specified no. of rows, from top
df.dropna() → by default returns a new DataFrame and does not change the original
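Since df.dropna() returns a new DataFrame by default, its options are easy to compare side by side. The sketch below uses invented data and shows how, thresh, subset and inplace; fillna is included for contrast.

```python
import pandas as pd

# Hypothetical data with gaps, invented for illustration.
df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [10.0, None, None, None],
    "c": ["x", "y", "z", None],
})

any_dropped = df.dropna()             # default how='any': drop rows with at least one NA
all_dropped = df.dropna(how="all")    # drop only rows where every value is NA
two_present = df.dropna(thresh=2)     # keep rows with at least 2 non-NA values
a_present = df.dropna(subset=["a"])   # only column "a" decides which rows go

filled = df[["a", "b"]].fillna(0)     # new frame with NAs replaced by 0
rows_before = len(df)                 # still 4: none of the calls above changed df

df.dropna(inplace=True)               # modifies df itself and returns None
print(rows_before, len(df))
```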
df.plot() → plot (draw) diagrams
pd.Series(mylist) → create a Pandas Series
myseries[0] → return first value of pandas series
pd.Series(mylist, index=["x", "y", "z"]) → Add labels to Pandas Series
df.iloc[row#, column#] → index by row and column position
df.iloc[1:3] → slicing data, see 2nd to 3rd row (end position is excluded)
.groupby(col_name) → groups by value
df[col_name][row_index]
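The difference between label-based loc and position-based iloc is easiest to see on the same slice. A minimal sketch on made-up data, assuming the default 0, 1, 2, ... row labels:

```python
import pandas as pd

# Hypothetical listings table, invented for illustration.
df = pd.DataFrame({
    "beds": [1, 2, 3, 4, 5],
    "type": ["flat", "house", "flat", "studio", "house"],
    "price": [80, 120, 150, 60, 200],
})

by_label = df.loc[1:3]               # label slice: rows 1, 2 AND 3
by_position = df.iloc[1:3]           # position slice: rows 1 and 2 only

cell = df.loc[0, ["beds", "type"]]   # row 0, two named columns (returns a Series)
second_price = df.iloc[1, 2]         # row position 1, column position 2
first_price = df["price"][0]         # column first, then row label

print(len(by_label), len(by_position), second_price, first_price)
```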
df[df.col_name > 5000]
df.col_name[row_index]
Corresponding month with highest muffin sales
df[df.sales_muffin == df.sales_muffin.max()]["months"]
Join Clause
SELECT * FROM MANAGER, EXECUTIVE WHERE MANAGER.MANAGER_CODE = EXECUTIVE.MANAGER_CODE;
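The join query above can be run from Python with the built-in sqlite3 module. In this sketch the table layouts and rows are hypothetical; only the table names and the MANAGER_CODE column come from the sheet. A LEFT JOIN is added to show how non-matching rows are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema and rows, invented for illustration.
cur.execute("CREATE TABLE MANAGER (MANAGER_CODE INTEGER, MANAGER_NAME TEXT)")
cur.execute("CREATE TABLE EXECUTIVE (EXEC_NAME TEXT, MANAGER_CODE INTEGER)")
cur.executemany("INSERT INTO MANAGER VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cy")])
cur.executemany("INSERT INTO EXECUTIVE VALUES (?, ?)", [("Dee", 1), ("Eli", 1), ("Fay", 2)])

# Comma join plus a WHERE condition: only matching rows, like an inner join.
inner_rows = cur.execute(
    "SELECT * FROM MANAGER, EXECUTIVE "
    "WHERE MANAGER.MANAGER_CODE = EXECUTIVE.MANAGER_CODE"
).fetchall()

# Left outer join: manager 3 has no executive but is still kept, padded with NULL.
outer_rows = cur.execute(
    "SELECT MANAGER.MANAGER_NAME, EXECUTIVE.EXEC_NAME FROM MANAGER "
    "LEFT JOIN EXECUTIVE ON MANAGER.MANAGER_CODE = EXECUTIVE.MANAGER_CODE"
).fetchall()
conn.close()

print(len(inner_rows), len(outer_rows))
```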
Inner Join → combines results & looks for values common to both tables
Outer Join → Keeps the non-matching results (Left, Right, Full)
Mean sale of muffin of first 3 mths of sales (2dp)
round(df.loc[:2, "sales_muffin"].mean(), 2)
df.groupby('host_name').number_reviews.sum().sort_values(ascending=False) → number of reviews for host_name
df[(df.CustomerID == 15311) & (df.Quantity >= 24.0)].loc[:, ['CustomerID', 'Quantity']]
df[['area', 'neighbourhood']] = df.neighbour_hood_info.str.split(",", expand=True) [split column]
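The selection and aggregation one-liners in this part of the sheet (groupby totals, a two-condition filter, splitting a column, and the muffin lookups) can be tried on small frames. The column names follow the sheet, but all values are invented, and the filter uses host_name rather than CustomerID so that one frame serves every example.

```python
import pandas as pd

# Hypothetical listings data; values are invented for illustration.
df = pd.DataFrame({
    "host_name": ["Ann", "Bob", "Ann", "Cy"],
    "number_reviews": [10, 5, 7, 3],
    "neighbour_hood_info": [
        "North,Oak Park", "South,Elm Row", "North,Pine Hill", "East,Bay View",
    ],
})

# Total reviews per host, largest first.
reviews_per_host = (
    df.groupby("host_name").number_reviews.sum().sort_values(ascending=False)
)

# Two conditions joined with &, each wrapped in parentheses, then two columns kept.
busy = df[(df.host_name == "Ann") & (df.number_reviews >= 10)].loc[
    :, ["host_name", "number_reviews"]
]

# Split one text column into two new columns.
df[["area", "neighbourhood"]] = df.neighbour_hood_info.str.split(",", expand=True)

# Hypothetical muffin sales, one row per month.
sales = pd.DataFrame({
    "months": ["Jan", "Feb", "Mar", "Apr"],
    "sales_muffin": [120.5, 98.25, 143.1, 110.0],
})
best_month = sales[sales.sales_muffin == sales.sales_muffin.max()]["months"]
first3_mean = round(sales.loc[:2, "sales_muffin"].mean(), 2)  # labels 0, 1 and 2

print(reviews_per_host)
print(busy)
print(list(best_month), first3_mean)
```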