Machine Learning Notes
1. All the import commands :
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
2. All the commands for EDA (exploratory data analysis) :
df.isna() / df.isna().sum()
df.info()
df.describe()
df.dropna(axis=0) #axis=0 drops rows with NaN, axis=1 drops columns with NaN
df.fillna(value) #value = what to put in place of NaN
To calculate mean :-
df['column_name'].mean()
To fill missing values by mean :-
x = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(x) #assign back ; inplace on a single column may not update df in newer pandas
To read a csv file :-
df = pd.read_csv('cars.csv')
df["column_name"].unique()
df["column_name"].value_counts()
To replace a string by nan value :-
df['column_name'] = df['column_name'].replace("string", np.nan)
df['column_name'] = df['column_name'].astype("float")
To create a new df with specific data type :-
# df_cat / df_num = df with categorical / numerical data
df_cat = df.select_dtypes(object)
df_num = df.select_dtypes(['int64','float64'])
Steps to handle missing values :
#step1 - use replace to turn the placeholder string into NaN
df['column_name'] = df['column_name'].replace("string", np.nan)
#step2 - change the datatype to float
df['column_name'] = df['column_name'].astype("float")
#step3 - calculate the mean for the col
x = df['column_name'].mean()
#step4 - use fillna and assign the result back
df['column_name'] = df['column_name'].fillna(x)
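Example :- a minimal runnable sketch of the four steps above. The 'horsepower' column and the "?" placeholder are made up for illustration, they are not from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" marks a missing horsepower reading.
df = pd.DataFrame({"horsepower": ["111", "?", "154", "102"]})

# Step 1: replace the placeholder string with NaN.
df["horsepower"] = df["horsepower"].replace("?", np.nan)
# Step 2: convert the column to float.
df["horsepower"] = df["horsepower"].astype("float")
# Step 3: compute the mean (NaN values are skipped by default).
mean_hp = df["horsepower"].mean()
# Step 4: fill the missing entries with that mean and assign back.
df["horsepower"] = df["horsepower"].fillna(mean_hp)

print(df)
```

The missing row ends up holding the mean of the three known values.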
Label Encoder :
from sklearn.preprocessing import LabelEncoder
for col in df_cat:
    le = LabelEncoder()
    df_cat[col] = le.fit_transform(df_cat[col])
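Example :- a small sketch of the label encoding loop on made-up data. Keeping each fitted encoder in a dict (the name 'encoders' is an assumption, not part of the notes) is one way to decode the numbers back to labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data frame.
df_cat = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "body": ["sedan", "wagon", "sedan"],
})

encoders = {}  # one fitted encoder per column
for col in df_cat.columns:
    le = LabelEncoder()
    df_cat[col] = le.fit_transform(df_cat[col])
    encoders[col] = le  # keep it so the codes can be decoded later

print(df_cat)
# Classes are sorted alphabetically, so code 0 is the first label in this list.
print(encoders["fuel"].classes_)
```

Note that LabelEncoder assigns codes in sorted order of the labels, so the numbers carry no real ranking.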
To drop columns and rows :
df.drop('column_name', axis = 1) #for a single column
df.drop(['column_name','column_name'],axis=1) #multiple
df.drop(index_number) #to drop a Row
To handle outliers :
#Step1-: Make a boxplot with two variables
Eg :- sns.boxplot(data=df,x='price',y='make')
#Step2-: Filter out the outliers
Eg :- df[(df['make']=='dodge') & (df['price']>10000)]
#Step3-: Drop the outliers
Eg :- df.drop(29,inplace=True)
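Example :- instead of reading the outliers off the plot by eye, the same idea can be sketched in code. This assumes the usual box plot rule (points beyond Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are outliers), which is a swapped-in technique and not the manual filter used above. The data is made up.

```python
import pandas as pd

# Hypothetical prices: one dodge is far more expensive than the rest.
df = pd.DataFrame({
    "make": ["dodge"] * 6 + ["audi"] * 3,
    "price": [6000, 6200, 6400, 6600, 6800, 13000, 17000, 18000, 19000],
})

# Quartiles of price computed separately for every make.
grouped = df.groupby("make")["price"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Box plot whisker limits (assumed rule: 1.5 * IQR).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Filter out the outliers, then drop them by index.
outliers = df[(df["price"] < lower) | (df["price"] > upper)]
df_clean = df.drop(outliers.index)

print(outliers)
```

Whether a flagged row is really an error or just a rare but valid value still has to be judged by looking at the data.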
Feature engineering : It creates new, more useful columns / features from
the existing ones, which can also reduce the number of columns in the
data frame. Eg : if a data set has height and width columns, we can
create a new column area = height * width and then remove the height
and width columns .
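Example :- a minimal sketch of the area idea with made-up numbers.

```python
import pandas as pd

# Hypothetical data frame with two related measurements.
df = pd.DataFrame({"height": [2.0, 3.0], "width": [4.0, 5.0]})

# Combine the two columns into one new feature.
df["area"] = df["height"] * df["width"]

# Remove the original columns now that area replaces them.
df = df.drop(["height", "width"], axis=1)

print(df)
```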
Skewness and handling Skewness :
from scipy.stats import skew
To find skewness of a column :
skew(df_num['column_name'])
Using for loop & plotting graph :
for col in df_num:
    print(col)
    print(skew(df_num[col]))
    plt.figure()
    sns.histplot(df_num[col], kde=True) #distplot is deprecated in newer seaborn
    plt.show()
#to find correlation
df_num.corr()
sns.heatmap(df_num.corr(), annot=True)
WE SHOULD NOT REMOVE THE SKEWNESS OF A COLUMN THAT HAS A VERY HIGH
CORRELATION WITH THE TARGET, BECAUSE TRANSFORMING THE COLUMN WILL
ALSO CHANGE ITS CORRELATION WITH THE TARGET.
ALSO NEVER APPLY SQUARE ROOT OR LOG TO A COLUMN WITH NEGATIVE VALUES,
IT WILL GIVE YOU NAN VALUES.
To handle skewness, take either the square root or the log of that
column :
df_num['column_name'] = np.sqrt(df_num['column_name'])
#or: df_num['column_name'] = np.log(df_num['column_name'])
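Example :- a rough sketch comparing skewness before and after the two transforms on synthetic right-skewed data. The 'price' column is made up, and np.log1p (log of 1 + x) is used here as an assumption because it also works when a value is 0.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic, strictly positive, right-skewed data (seeded for repeatability).
rng = np.random.default_rng(0)
df_num = pd.DataFrame({"price": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

before = skew(df_num["price"])
after_sqrt = skew(np.sqrt(df_num["price"]))
after_log = skew(np.log1p(df_num["price"]))  # log(1 + x)

print("before    :", before)
print("after sqrt:", after_sqrt)
print("after log :", after_log)
```

Both transforms pull the long right tail in, so the skewness value moves closer to 0.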
Scaling :-
1. MinMax Scaler
from sklearn.preprocessing import MinMaxScaler
for col in df_new:
    ms = MinMaxScaler()
    df_new[col] = ms.fit_transform(df_new[[col]])
2. Standard Scaler
from sklearn.preprocessing import StandardScaler
for col in df_new:
    sc = StandardScaler()
    df_new[col] = sc.fit_transform(df_new[[col]])
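Example :- a small sketch showing what each scaler does to made-up numeric data. Here the whole data frame is scaled in one call instead of a loop, which is an alternative way of writing it and gives the same per-column result.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric data frame.
df_new = pd.DataFrame({
    "length": [150.0, 160.0, 170.0, 180.0],
    "price": [5000.0, 7000.0, 9000.0, 15000.0],
})

# MinMax scaling: every column is squeezed into the range 0 to 1.
df_minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(df_new), columns=df_new.columns
)

# Standard scaling: every column gets mean 0 and standard deviation 1.
df_std = pd.DataFrame(
    StandardScaler().fit_transform(df_new), columns=df_new.columns
)

print(df_minmax)
print(df_std)
```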
Requirements for working with data in Sklearn :-
Feature and response should be separate objects
Feature and response should be numeric
Feature and response should be numpy arrays
Feature should be 2D (n_samples, n_features) ; response is usually 1D (n_samples)
x = df.iloc[:,:-1].values #Features -> independent Variable
y = df.iloc[:,-1].values # Response-> dependent variable
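Example :- a quick check of the shapes on a made-up data frame where the last column is the response.

```python
import pandas as pd

# Hypothetical data: the last column (price) is the response.
df = pd.DataFrame({
    "length": [150.0, 160.0, 170.0],
    "width": [60.0, 62.0, 65.0],
    "price": [5000.0, 7000.0, 9000.0],
})

x = df.iloc[:, :-1].values  # all columns except the last
y = df.iloc[:, -1].values   # only the last column

print(x.shape)  # (3, 2)
print(y.shape)  # (3,)
```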
Taking care of missing data :-
from sklearn.impute import SimpleImputer
#step1: define the missing value & strategy
si = SimpleImputer(missing_values=np.nan, strategy='mean')
#step2: select the col that has missing values
si.fit(x[:,1:3])
#step3: fill the values using the transform method on the selected
#cols and save them back
x[:,1:3] = si.transform(x[:,1:3])
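Example :- a minimal sketch of the imputer on a made-up feature array (country, age, salary). The values are for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical features: column 0 is text, columns 1 and 2 are numeric.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
], dtype=object)

# Step 1: define the missing value and the strategy.
si = SimpleImputer(missing_values=np.nan, strategy="mean")
# Steps 2 and 3: fit on the numeric columns, fill them, and save them back.
x[:, 1:3] = si.fit_transform(x[:, 1:3])

print(x)
```

Each missing entry is replaced by the mean of the known values in its own column.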
Encoding categorical data ( One Hot Encoder ) : -
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',
OneHotEncoder(), [0])], remainder='passthrough')
#fit the encoder and apply the change in one step
x = np.array(ct.fit_transform(x))
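Example :- a small sketch of one hot encoding the first column of a made-up array. Setting sparse_threshold=0 is an added assumption here so that the result is always a normal dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical features: column 0 is a category, column 1 is numeric.
x = np.array([
    ["France", 44.0],
    ["Spain", 27.0],
    ["France", 30.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
x_encoded = np.array(ct.fit_transform(x))

# One new column per category (sorted), followed by the untouched columns.
print(x_encoded)
```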
Splitting the dataset into the training set and test set :-
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x,y,
test_size=0.2, random_state = 1)
Feature Scaling :-
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
xtrain[:,3:] = sc.fit_transform(xtrain[:,3:])
xtest[:,3:] = sc.transform(xtest[:,3:]) #only transform ; reuse the mean and std learned from xtrain
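Example :- a sketch of the usual practice of fitting the scaler on the training set only and reusing it on the test set, so no information from the test set leaks into training. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and response (seeded for repeatability).
rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.normal(size=100)

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=1
)

sc = StandardScaler()
xtrain_scaled = sc.fit_transform(xtrain)  # learn mean and std from xtrain
xtest_scaled = sc.transform(xtest)        # reuse the xtrain mean and std

print(xtrain_scaled.mean(axis=0))
print(xtest_scaled.mean(axis=0))
```

The test set mean will usually be close to 0 but not exactly 0, which is expected.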
Linear regression model :-
#step 1-: Select a model from sklearn
from sklearn.linear_model import LinearRegression
#step 2 -: Create an object of your model
linreg = LinearRegression()
#step 3 -: Train your model
linreg.fit(xtrain, ytrain)
#step 4: Predict the value
ypred = linreg.predict(xtest)
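Example :- a runnable sketch of the four steps on synthetic data where the true relation is known (y = 3*x1 - 2*x2 + 5 plus small noise). The r2_score check at the end is an extra step added here and is not part of the notes above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relation (seeded for repeatability).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=1
)

linreg = LinearRegression()    # create the model object
linreg.fit(xtrain, ytrain)     # train the model
ypred = linreg.predict(xtest)  # predict on unseen data

print(linreg.coef_, linreg.intercept_)
print(r2_score(ytest, ypred))
```

The learned coefficients should land close to 3 and -2, and the intercept close to 5.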