
Data Science

Python cheat sheet


numpy | matplotlib | seaborn | pandas | scipy | sklearn
Numpy:

Numpy data types:

- i : integer
- b : boolean
- u : unsigned integer
- f : float
- c : complex float
- m : timedelta
- M : datetime
- O : object
- S : string
- U : unicode string
- V : fixed chunk of memory for another type (void)

np.array() : to make an array


np.array([[1, 2, 3], [4, 5, 6]]) : 2-D array
np.array([1, 2, 3, 4], ndmin=<n>) : create an array of n dimensions
np.full((3, 3), 4, dtype=int) : to make an array filled with the same value
np.eye(5, 5) : identity matrix
np.zeros((n, m)) and np.ones((n, m)) : arrays of zeros and ones
np.arange() : to make an array in arithmetic progression
np.linspace() : to make an array of evenly spaced values
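A minimal sketch pulling these creation helpers together (values in comments are illustrative):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array, shape (2, 3)
f = np.full((3, 3), 4, dtype=int)      # 3x3 array filled with 4
i = np.eye(5)                          # 5x5 identity matrix
z = np.zeros((2, 3))                   # 2x3 array of zeros
r = np.arange(0, 10, 2)                # [0 2 4 6 8] -- arithmetic progression
l = np.linspace(0, 1, 5)               # [0. 0.25 0.5 0.75 1.] -- 5 evenly spaced values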

arr.ndim : check no. of dimensions


arr.dtype : check data type
arr.astype('i') : change data type

np.mean(arr) : get mean


np.median(arr) : get median
scipy.stats.mode(arr) : get mode (NumPy itself has no mode function)
np.std(arr) : get standard deviation
np.var(arr) : get variance
np.percentile(arr, percentage) : get the value at the given percentile
np.sum(arr) : get sum of all elements
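A quick worked example of the statistics above (scipy supplies the mode, since NumPy has none):

import numpy as np
from scipy import stats

arr = np.array([1, 2, 2, 4, 5])
print(np.mean(arr))              # 2.8
print(np.median(arr))            # 2.0
print(stats.mode(arr))           # most frequent value (2) and its count
print(np.std(arr), np.var(arr))  # standard deviation, variance
print(np.percentile(arr, 50))    # 2.0 -- value at the 50th percentile
print(np.sum(arr))               # 14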

arr.copy() : to make a copy (owns its data; unaffected by changes to the original)


arr.view() : to make a view (reflects changes to the original)
arr.shape : to get the dimensions
arr.reshape() : to reshape the array
np.transpose(arr) : transpose of a matrix
reshape(n, m, -1) : -1 for an unknown dimension (NumPy infers it)
reshape(-1) : to convert any array to 1-D
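A short sketch of reshaping, including the -1 shortcut:

import numpy as np

arr = np.arange(12)           # 12 elements, 1-D
m = arr.reshape(3, 4)         # 3x4 matrix; m.shape == (3, 4), m.ndim == 2
auto = arr.reshape(2, 3, -1)  # -1 lets NumPy infer the last dimension (2 here)
flat = m.reshape(-1)          # back to 1-D
t = np.transpose(m)           # shape (4, 3)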

for x in np.nditer(arr): to iterate over an array in NumPy


for idx, x in np.ndenumerate(arr): to iterate over an array with the index

np.concatenate((arr1, arr2), axis=1) : to join two arrays


np.stack((arr1, arr2), axis=1) : to stack two arrays (also check hstack, vstack and dstack)

np.array_split(arr, n, axis=1) : to divide an array into n arrays (also check hsplit, vsplit and dsplit)
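A small sketch of joining and splitting (shapes in comments are illustrative):

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.concatenate((a, b), axis=1))    # joins along columns -> shape (2, 4)
print(np.stack((a, b), axis=1))          # adds a new axis -> shape (2, 2, 2)
print(np.array_split(np.arange(7), 3))   # uneven splits are allowed: sizes 3, 2, 2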

np.where(arr == n) : to find all indices of n


np.searchsorted(arr, 7, side='right') : to return the index where a value should be placed to keep the array sorted
np.sort(arr) : to sort an array
np.around(arr, dec) : to round the array elements to the specified decimal places (also check np.floor(arr) and np.ceil(arr))

np.random.uniform(a, b, n) : give n uniform values between a and b


np.random.normal(a, b, n) : give n normal samples with mean a and standard deviation b
random.randint(n) : to get a random integer between 0 and n
random.rand(n) : n random floats between 0 and 1
random.randint(n, size=(m)) : to give m integers between 0 and n
random.choice([3, 5, 7, 9], size=(3, 5)) : array of the given size drawn from the choices

random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(100)) : draws with the given probabilities
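A hedged sketch of the random helpers above (assuming `from numpy import random`; outputs vary run to run):

import numpy as np
from numpy import random

print(np.random.uniform(0, 10, 5))    # 5 uniform floats in [0, 10)
print(np.random.normal(0, 1, 5))      # 5 samples, mean 0, std dev 1
print(random.randint(100))            # one int in [0, 100)
print(random.rand(3))                 # 3 floats in [0, 1)
print(random.randint(100, size=(4,))) # 4 ints in [0, 100)
print(random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=10))  # weighted draws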


Matplotlib

import matplotlib.pyplot as plt
plt.plot(x, y, label='linename', marker='', color='', linestyle='', lw=) : plots a graph using arrays x and y
plt.plot(x1, y1, x2, y2) : plots a graph of 2 lines
plt.xlabel('labname') and plt.ylabel('labname') : to put labels on the axes
plt.title('title') : give title
plt.xlim(n, m) and plt.ylim(n, m) : to set lower and upper limits of the graph
plt.axis([x0, xn, y0, yn]) : to set all four limits at once
plt.show() : show graph
plt.grid() : show grid lines in graph (use grid(axis='x'|'y'))
plt.subplot(x, y, z) : plots many graphs (x: no. of rows, y: no. of columns, z: graph no.)
plt.legend(title='title', loc='') : to show legend in the graph
plt.figure(figsize=(n, m)) : to set graph size
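Putting the basics together in one minimal plot (data and styling are illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.figure(figsize=(6, 4))
plt.plot(x, y, label='squares', marker='o', color='blue', linestyle='--', lw=2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('y = x^2')
plt.xlim(0, 5)
plt.ylim(0, 20)
plt.grid(axis='y')
plt.legend(title='series', loc='upper left')
plt.show()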

plt.hist(x, n, color='') : plot histogram with n bins


plt.scatter(x, y, color='', s=, alpha=, cmap=) : plot points at coordinates x, y
plt.bar(x, y, color=, width=, label='') : plot bar graph of x categories against y values (use barh(), with height=, for horizontal)
plt.xticks(x, y, rotation=) : to replace the values of array x on the x-axis with array y
plt.pie(x, labels=arr, startangle=, explode=, shadow=, colors=, autopct='%.2f%%') : to plot a pie chart
plt.boxplot(arr) : to get a boxplot of the data
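A minimal sketch combining subplots with two of the chart types above (data is made up):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(0, 1, 1000)
plt.subplot(1, 2, 1)              # 1 row, 2 columns, first graph
plt.hist(data, 20, color='gray')  # histogram with 20 bins
plt.subplot(1, 2, 2)              # second graph
plt.bar(['a', 'b', 'c'], [3, 7, 5], color='teal', width=0.5)
plt.show()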
Seaborn

import seaborn as sb

sb.distplot(tab['col'], kde=True, hist=True, bins=n) : to get a distribution plot with n bins (newer seaborn replaces this with sb.displot/sb.histplot)


sb.kdeplot(arr, shade=True) : to display a kernel density estimation plot
sb.boxplot(y='colname', data=tab) : to get a boxplot
sb.violinplot(y='colname', data=tab) : to get a violin plot (plot to check symmetry)
sb.jointplot(x='colname1', y='colname2', data=tab, kind='', height=n, color='') : to get a joint plot
sb.lmplot(x='colname1', y='colname2', data=tab, hue='colname3', col='colname4', row='colname5') : to get a scatter plot with a fitted line between 2 columns
sb.countplot(x='colname1', data=tab, hue='colname2') : to get a count bar graph between 2 columns
sb.boxplot(x='colname1', y='colname2', data=tab) : to get a boxplot between 2 columns
sb.heatmap(tab.corr()) : to get a heat map of all correlations in the table
sb.pairplot(tab) : to get a scatter plot of all pairs of columns in the table
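A minimal sketch, using seaborn's 'tips' sample dataset (fetched by load_dataset) as a stand-in for tab:

import seaborn as sb
import matplotlib.pyplot as plt

tab = sb.load_dataset('tips')   # small sample DataFrame
sb.boxplot(x='day', y='total_bill', data=tab)
plt.show()
sb.heatmap(tab.select_dtypes('number').corr())  # correlations of the numeric columns only
plt.show()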
Pandas

pd.Series([values], index=[values]) : to create a series, like np.array() but also like a dictionary (columns= belongs to pd.DataFrame)
arr.index : to display the index
arr.values : to display the values
arr.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) : to count each value
arr.apply(np.function) : to apply a NumPy function to each value of a pandas Series

to use with datetime columns (via the .dt accessor)


arr.dt.year : returns the year of the datetime.
arr.dt.month : returns the month of the datetime.
arr.dt.day : returns the day of the datetime.
arr.dt.quarter : returns the quarter of the datetime.
arr.dt.dayofweek : returns the day of the week.
arr.dt.day_name() : returns the name of the day of the week (weekday_name in old pandas).
arr.dt.hour
arr.dt.second
arr.dt.minute
arr.dt.dayofyear
arr.dt.date
arr.dt.time
arr.dt.freq
arr.dt.weekofyear
pd.to_datetime('date and time') : to parse a date and time into a proper format
pd.date_range(start='date1', end='date2', freq='D') : to generate the dates between two dates
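A short worked example of the .dt accessors (dates chosen arbitrarily):

import pandas as pd

s = pd.to_datetime(pd.Series(['2023-01-15 10:30:00', '2023-06-01 08:00:00']))
print(s.dt.year)        # 2023, 2023
print(s.dt.month)       # 1, 6
print(s.dt.day_name())  # Sunday, Thursday
print(pd.date_range(start='2023-01-01', end='2023-01-05', freq='D'))  # 5 daily dates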

arr.dropna() : to drop empty values


arr.fillna(value) : to fill empty values with the given value
arr.map({'value': 'newvalue'}) : to replace/map values

pd.read_table(r"addr", sep="separator", names=<array of col names>, usecols=[name or index], dtype={"colname": type}) : to read a table at the given address


tab.head(n) and tab.tail(n) : to give the first or last n rows

tab['colname'] : to access the given column (can be used to create a new column, like tab['newcol'] = tab.col1 + ',' + tab.col2)
tab.colname : to access the given column
tab.shape : no. of rows and columns

tab.colname.unique() : unique values in the column


tab.colname.nunique() : number of unique values in the column
tab.columns : all column names

tab.describe() : describe the numeric data in the table (use describe(include='all') to cover all columns)
tab.info() : describe the table
tab.dtypes : data type of each column
tab.col.value_counts(normalize=True, dropna=True) : to count the rows for each value in the column
tab.plot(kind='', x='namexplot', y='nameyplot') : to plot on a graph (use with matplotlib)

tab[colname].mean() : to get the mean { also look for mode(), median(), max() }


tab.corr() : correlation of each column with the others
tab.colname.agg(['functions']) : to run functions aggregately on a column (like mean, median, count)
tab.col.quantile(0.25|0.75) : to get a quantile
tab.groupby('colname') : to group entries by column name
pd.crosstab(tab.col1, tab.col2, margins=True) : to get a cross table between 2 columns
tab1.merge(tab2, on='colname', how='') : to merge two tables on common columns (how= left, right, inner, outer)
pd.concat([tab1, tab2], axis=0) : to concatenate 2 tables
tab.set_index('colname') : to set a column as the index
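A small sketch of merging and concatenating (tables are made up):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 85, 70]})
merged = left.merge(right, on='id', how='inner')  # keeps only ids 2 and 3
stacked = pd.concat([left, left], axis=0)         # rows of both tables, one under the other
indexed = merged.set_index('id')                  # 'id' becomes the row index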

tab.append(dict, ignore_index=True, sort=False) : to append data to the table using a dictionary (removed in pandas 2.0; prefer pd.concat).


tab.loc[len(tab.index)] = list(dict[0].values()) : to add a row to tab
tab.iloc[n] = list(dict[0].values()) : to replace a row using a dictionary
tab.assign(**{'newcol': arr}) : to add a new column using an array
tab.rename(columns={"oldname": "newname"}, inplace=True) : change column names
tab.colname.astype(type) : to change the type of the values

tab.drop('colname', axis=1, inplace=True) : to drop a column


tab.drop_duplicates(keep='first', inplace=True) : to drop duplicate rows
tab.drop(index, axis=0, inplace=True) : to drop a row at the given index
tab.dropna(subset=["col"], how="value") : to drop rows with empty data {values: all, any}
tab['col'].interpolate(method='linear') : to fill missing values with estimated values
tab.duplicated() : to check for duplicate rows
tab.isna() : to check for null values (also check tab.notna(), tab.isnull() and tab.notnull())
tab['colname'].fillna(value, inplace=True) : to replace null values with the given value
tab.replace(to_replace="val1", value="val2") : replace val1 with val2 anywhere in the table
tab['col'].replace({'val': 'valnew'}) : to replace a value in a column
tab.colname.sort_values(ascending=True).head() : to sort values
tab.sort_values('colname', ascending=True).head() : to sort by a column
tab[tab.colname >= 200].colname : conditional selection

tab.loc[index or boolean condition, 'colname'] : to show selected rows of a column


tab.iloc[[index of rows], [index of columns]] : to show selected rows and columns
tab.col.isin([array of values]) : to select rows with specific values
tab['col'].str.method() : to use a string method on a column
pd.get_dummies(tab.col, prefix="value", prefix_sep='_', drop_first=False) : to encode k values as k columns (k-1 with drop_first=True)
tab = pd.get_dummies(tab, columns=['col']) : same as one-hot encoding
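A quick sketch of dummy encoding (column names here are illustrative):

import pandas as pd

tab = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
dummies = pd.get_dummies(tab.color, prefix='color', prefix_sep='_', drop_first=False)
print(dummies.columns.tolist())               # ['color_blue', 'color_green', 'color_red']
tab2 = pd.get_dummies(tab, columns=['color']) # same idea applied to the whole table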
Scipy

from scipy.stats import bernoulli


bernoulli.rvs(p=0.5, size=n) : to generate a Bernoulli sample of size n with success probability p
Sklearn

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(tab, arr, test_size=0.3, random_state=0) : to split the data into training and test sets

classifier.score(x_train, y_train) : to get the accuracy score
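An end-to-end sketch of the split-fit-score flow (iris and LogisticRegression are stand-ins for your own data and model):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data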

from sklearn.feature_selection import VarianceThreshold


sel = VarianceThreshold(threshold=0.01)
sel.fit(tab) : to select features whose variance exceeds the given threshold

from sklearn.linear_model import LinearRegression


lr = LinearRegression()
lr.fit(x, tab.col) : to get a linear regression (x can be multiple columns of the table)
lr.score(x, tab.col) : to get r2 score (x can be multiple columns of the table)
y_predict = lr.predict(x) : to predict values
lr.coef_, lr.intercept_ : m and c in y = mx + c (m is an array for multiple regression)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y_train) : to get logistic regression
y_pred = classifier.predict(x_test)

from sklearn.svm import SVC
classifier = SVC(kernel='rbf' | 'linear' | 'poly', C=1, gamma=0.1) : typical values to search: C in [0.01, 0.1, 1, 10], gamma in [0.01, 0.1, 1]
classifier.fit(x_train, y_train) : to get svm
y_pred = classifier.predict(x_test)

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=n) : classifier with n neighbors
classifier.fit(x_train, y_train) : to get K neighbor Classification
y_pred = classifier.predict(x_test)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='gini' | 'entropy', max_depth=int, min_samples_split=int) : typical values to search: max_depth in [2, 3, 4]
classifier.fit(x_train, y_train) : to get decision tree classification
y_pred = classifier.predict(x_test)
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(criterion='gini' | 'entropy', n_estimators=int, max_depth=int, min_samples_split=int, random_state=43) : typical values to search: max_depth in [3, 4], min_samples_split in [5, 7]
clf_rf.fit(x_train, y_train) : to fit a random forest
y_pred = clf_rf.predict(x_test)

from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(x_train, y_train) : to get XGB classification
y_pred = classifier.predict(x_test)

from sklearn.svm import SVR
clf = SVR()
clf.fit(X_train, y_train) : to run svr
predicted = clf.predict(X_test)

import xgboost as xgb
clf = xgb.XGBRegressor()
clf.fit(X_train, y_train) : to get XGB regression
predicted = clf.predict(X_test)
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf.fit(X_train, y_train) : to get decision tree regression
predicted = clf.predict(X_test)

from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X_train, y_train) : to get random forest regression
predicted = clf.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


mean_absolute_error(y, y_predict) : gives mean absolute error between actual and predicted values
mean_squared_error(y, y_predict) : gives mean squared error between actual and predicted values
np.sqrt(mean_squared_error(y, y_predict)) : gives root mean squared error between actual and predicted values
r2_score(y, y_predict) : gives R2 score between actual and predicted values (same as lr.score(x, tab.col))
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_true, y_predict) : gives the confusion matrix, laid out as
[TN FP]
[FN TP]
cr = classification_report(y_true, y_predict) : gives a report of the classification
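A tiny worked example showing that layout (labels chosen by hand):

from sklearn.metrics import confusion_matrix

y_true    = [0, 0, 1, 1, 1, 0]
y_predict = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(y_true, y_predict))
# [[2 1]   row 0: TN=2, FP=1
#  [1 2]]  row 1: FN=1, TP=2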

from sklearn.metrics import f1_score, accuracy_score
ac = accuracy_score(y_test,clf_rf.predict(x_test)) : to get accuracy score

from sklearn.feature_selection import SelectKBest, chi2
# find best scored 5 features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)
print('Score list:', select_feature.scores_) : to get best scores

from sklearn.feature_selection import RFE
# Create the RFE object and rank each pixel
clf_rf_3 = RandomForestClassifier()      
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train) : to get RFE ranks

from sklearn.feature_selection import RFECV
# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier() 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_]) : to get rfecv

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
tab['col'] = lb.fit_transform(tab['col'])
for i in tab.columns:
    tab[i] = lb.fit_transform(tab[i]) : to convert every column's values to integer form
from sklearn.preprocessing import OneHotEncoder
ob = OneHotEncoder()
encoded = ob.fit_transform(tab[['col']]).toarray() : to expand a column into one indicator column per value (note the 2-D input; pd.get_dummies is often simpler)

from sklearn.manifold import TSNE
tn = TSNE(n_components=2, random_state=0)
xn = tn.fit_transform(tab)

from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(model, parameters, cv=n, scoring='f1_macro')
gs.fit(X_train,Y_train)

from sklearn.cluster import KMeans


import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X) : to get kmean clusters
kmeans.labels_ : labels for n clusters
ypredict = kmeans.predict(X)
from sklearn.decomposition import PCA
pca = PCA(n_components=n)
tabpca = pca.fit_transform(tab) : to project the table onto n principal components

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
tabsc = sc.fit_transform(tab) : to standardize a table into an array
(scaled value = (actual - mean) / standard deviation)
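A one-column demo of that formula (values picked to keep the arithmetic simple):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0]])  # mean 2, population std ~0.816
sc = StandardScaler()
print(sc.fit_transform(x).ravel())   # [-1.2247  0.  1.2247] -- (value - mean) / std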

from sklearn.preprocessing import MinMaxScaler


mns = MinMaxScaler(feature_range=(n, m)) : to scale values between n and m
tabmns = mns.fit_transform(tab)
from sklearn.preprocessing import normalize
tabnorm = normalize(tab, norm= ‘l1’) : to normalize a table into an array
ss = []
k = range(1, 20)
for i in k:
    km = KMeans(n_clusters=i)
    km.fit(x)
    ss.append(km.inertia_)
plt.plot(k, ss)
plt.show() : Elbow method (look for the bend in the curve)

from sklearn.metrics import silhouette_score


silhouette_score(tabpca, KMeans(n_clusters=n).fit_predict(tabpca)) : to calculate silhouette score
(higher the score, better the clusters)

from sklearn.cluster import MeanShift


ms = MeanShift(bandwidth = 2, bin_seeding = True)
ms.fit(x) : to get Mean Shift clusters
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
gscv = GridSearchCV(classifier, param_grid = dict)
gscv.fit(x_train,y_train)
print(gscv.best_params_) : to get the best parameters for a classifier
print(gscv.best_estimator_) : to get the best estimators for a classifier
gscv.predict(x_test)

rmcv = RandomizedSearchCV(classifier, param_distributions=dict)


rmcv.fit(x_train, y_train)
print(rmcv.best_params_) : to get the best parameters for a classifier
print(rmcv.best_estimator_) : to get the best estimator for a classifier
rmcv.predict(x_test)

from sklearn.model_selection import LeaveOneOut, KFold, StratifiedKFold


n = np.array([1, 2, 6, 3, 2, 6, 8, 7, 3, 5, 8, 4, 3, 7, 3, 7])
km = KFold(n_splits=4, shuffle=False)
for train, test in km.split(n):
    print("Train: ", n[train], " Test: ", n[test])
n = np.array([1, 2, 6, 3, 2, 6, 8, 7, 3, 5, 8, 4, 3, 7, 3, 7])
y = np.array([1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0])
sk = StratifiedKFold(n_splits=4, shuffle=False)
for train, test in sk.split(n, y):
    print("Train: ", n[train], " Test: ", n[test])

lt = LeaveOneOut()
for train, test in lt.split(n):
    print("Train: ", n[train], " Test: ", n[test])
