
Python

Jupyter notebook : interactive console application . It has cells where we can


execute a program .
How to open a jupyter notebook → download the zip folder from GitHub → open it in jupyter
notebook
Comments : ‘#’ & “ ''' ''' ” (triple quotes for multi-line)
# for a big heading in markdown and ## for a smaller heading
Insert cell above → Insert menu, or esc + a
Kernel → the process that executes the code (plays the role of the compiler/interpreter)
If a program is stuck → restart the kernel

10 ** 2 → 100
3 * ‘a’ → ‘aaa’ → the string is repeated 3 times
Inbuilt function → type(1) → to check the datatype of any value → here it
will return int
We dont need to write int a = 10 ⇒ only a = 10 (no type declaration needed).
String can be defined in “ or ‘
print((a*b) + (a/b)) → follows the BODMAS rule .
To learn about an inbuilt function → shift+tab → eg : print
Printing complex sentences :
1)​Method 1
first_name='Krish'
last_name='Naik'
print("My first name is {} and last name is {}".format(first_name, last_name))
My first name is Krish and last name is Naik
The dot operator is used; the values are substituted into the curly braces {} in order.
2)​Method 2
print("My First name is {first} and last name is
{last}".format(last=last_name,first=first_name))
Expl : the order of the arguments does not matter, since we have assigned each value to a name.
O/p : My First name is Krish and last name is Naik
len(‘jay’) → o/p 3 → length of string
List is a grp of different data types
Python data structure :
1)​Boolean
2)​Boolean and logical operators
3)​Lists
4)​Comparison operators
5)​Dictionaries
6)​Tuples and sets
Tab to see all the keywords . eg : str. → opens a drop down
print(my_str.isalnum()) #check if all chars are alphanumeric (letters and/or digits)
print(my_str.isalpha()) #check if all chars in the string are alphabetic
print(my_str.isdigit()) #test if the string contains only digits
print(my_str.istitle()) #test if the string is title-cased (words start with capitals)
print(my_str.isupper()) #test if the string is all upper case
print(my_str.islower()) #test if the string is all lower case
print(my_str.isspace()) #test if the string contains only spaces
print(my_str.endswith('d')) #test if the string ends with a 'd' → case sensitive
print(my_str.startswith('H')) #test if the string starts with 'H'
→ all of these return True or False
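A minimal runnable sketch of the checks above; the sample string my_str = 'Krish123' is an assumed illustrative value, not from the original notes:

my_str = 'Krish123'          # assumed sample value
print(my_str.isalnum())      # True  -> only letters and digits
print(my_str.isalpha())      # False -> contains digits
print(my_str.isdigit())      # False -> contains letters
print(my_str.istitle())      # True  -> one capitalised word
print(my_str.isupper())      # False
print(my_str.islower())      # False
print(my_str.isspace())      # False
print(my_str.endswith('d'))  # False
print(my_str.startswith('K')) # True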

Datatypes :
Lists :
Can store different data types.
A list is a mutable (changeable), ordered sequence of elements. Each element or value is an
item.
Values go between square brackets [ ].
Indexing → 0,1,2 . . . .
Indexing a list of elements : we want to select a slice of
elements from lst = [‘maths’,’chem’,100,’phy’] → lst[:] → selects all the elements. If we
want to select from ‘chem’ to the end → lst[1 : ]
Select from ‘chem’ to 100 → lst[1 : 3] → it selects the elements before index 3
Initialising a list :
1.​type( [ ] )
2.​lst_eg=[ ]
type(lst_eg)
3.​lst=list()
type(lst)
4.​lst=[‘maths’,’chem’,100]
Functions in list :
1.​Append : add an item at the end of a list .
Eg : lst.append(‘phy’)
​ [‘maths’,’chem’,100,’phy’]
2.​To check what an element is : lst[1] → o/p ‘chem’
3.​Append : appending a list adds it as a single element
lst.append([‘john’]) → creates a nested list →
[‘maths’,’chem’,100,’phy’,[‘john’]]
4.​Insert : lst.insert(1, ‘pushkar’) → [‘maths’,’pushkar’,’chem’,100,’phy’]
5.​Extend : adds the elements at the end of the list → lst.extend([8,9]) →
[‘maths’,’chem’,100,8,9]
6.​Sum : adds all the numbers in a list eg : lst2=[1,2,3] → sum(lst2) → 6
7.​Pop : removes the last element → lst2.pop() → lst2=[1,2]
OR lst2.pop(0) → lst2=[2,3]
8.​Count : counts the total occurrences of a given element in the list
9.​Index : returns the index of the first occurrence .
Eg : lst2.index(1) → 0
Syntax : index(value, start, end)
10.​ Multiplication of a list : lst2*2 → [1,2,3,1,2,3] (see the consolidated sketch below)
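A short runnable recap of the list operations above (the variable names and values are illustrative):

lst = ['maths', 'chem', 100]
lst.append('phy')            # add one item at the end
lst.append(['john'])         # appends a nested list as a single element
lst.insert(1, 'pushkar')     # insert at index 1
lst.extend([8, 9])           # add 8 and 9 individually at the end
print(lst)

nums = [1, 2, 3]
print(sum(nums))             # 6
nums.pop()                   # removes the last element -> [1, 2]
print(nums.count(1))         # 1 -> how many times 1 occurs
print(nums.index(2))         # 1 -> index of first occurrence of 2
print(nums * 2)              # [1, 2, 1, 2]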

Sets :
An unordered collection data type that is iterable, mutable, and has no duplicate
elements . Python's set class represents the mathematical notion of a set . It is
based on a data structure known as a hash table.
Does not support indexing, i.e. we cannot access an element like in a list →
eg : set1[0] → error
Does not support subscripting eg: set1[1]
CODE :
1)​set1=set()
2)​set1={1,2,3,3} → o/p printing → {1,2,3} // duplicate elements are kept as 1
element

Inbuilt functions:
1)​Add : set1.add(“jay”) → adds the element (sets are unordered, so there is no fixed “last” position)
We can also do unions and intersections on sets (as in maths).
2)​Difference : set1={“jay”,”krishna”,”balram”}
​ ​ set2={“jay”}
set1.difference(set2)
o/p → {”krishna”,”balram”}
This does not update set1 .
3)​Difference update : set1.difference_update(set2) changes set1 in place → set1 becomes {”krishna”,”balram”} (see the sketch below)
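A small runnable sketch of the set operations above, using the same illustrative values:

set1 = {"jay", "krishna", "balram"}
set2 = {"jay"}
print(set1.difference(set2))    # {'krishna', 'balram'} -> set1 itself is unchanged
set1.difference_update(set2)    # modifies set1 in place
print(set1)                     # {'krishna', 'balram'}
set1.add("jay")                 # add an element back
print(set1.union(set2))         # union of both sets
print(set1.intersection(set2))  # {'jay'}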

Dictionaries:
A collection which is unordered, changeable and indexed by keys. Written with curly braces,
they hold key : value pairs
Declaration :
dic={} ⇒ just use empty curly braces
Eg : dic={“car1”:”audi”,”car2”:”pagani”} ⇒ manual way of creating
Inbuilt function ⇒ dict() → creates an empty dict .
For accessing the elements of the dict, the “index” will NOT be an index
number
→ it will be the key names (eg : car1, car2)
Eg : iterating through the keys → prints all the keys
for x in dic:
    print(x)
Eg : iterating through the values → prints all the values
for x in dic.values():
​    print(x)
o/p : audi  pagani
Eg : iterating through key and value both
for x in dic.items():
​    print(x)

Adding items to a dictionary : dic[‘car3’]=’lambo’ →


{“car1”:”audi”,”car2”:”pagani”, “car3”:”lambo”}

If we write dic[‘car1’] = ‘maruti’ → the existing value is replaced .

Nested dictionaries :
Eg:
car1_model={'Mercedes':1960}
car2_model={'Audi':1970}
car3_model={'Ambassador' : 1980}

car_type={'car1':car1_model, 'car2':car2_model, 'car3':car3_model}

print(car_type)

o/p : {'car1': {'Mercedes': 1960}, 'car2': {'Audi': 1970}, 'car3': {'Ambassador':


1980}}

●​ Accessing the items in the dictionary


print(car_type['car1'])
{'Mercedes': 1960}
●​ 2nd way : print(car_type[‘car1’][‘Mercedes’])
o/p : 1960
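Putting the dictionary operations above into one small runnable sketch (the variable names are illustrative):

dic = {"car1": "audi", "car2": "pagani"}
for key in dic:                 # iterates over the keys
    print(key)
for value in dic.values():      # iterates over the values
    print(value)
for key, value in dic.items():  # iterates over (key, value) pairs
    print(key, value)

dic['car3'] = 'lambo'           # add a new key
dic['car1'] = 'maruti'          # existing key -> value is replaced
print(dic)

car_type = {'car1': {'Mercedes': 1960}, 'car2': {'Audi': 1970}}
print(car_type['car1']['Mercedes'])   # 1960 -> accessing a nested dictionary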

Tuples :
●​ Not mutable , elements cannot be changed
●​ Supports indexing by number
●​ Supports different data types
●​ We use round braces
●​ Eg : tup=(“jay”,”nitai”,”manohar”)
●​ We can only replace the whole tuple (reassign the variable), not individual elements.

Libraries :

NUMPY :
High dimensional array object, plus tools for working with arrays (data structures with
a single, common datatype)
After installing python → command prompt → pip install numpy
Importing numpy : import numpy as np
np → alias for numpy
Reference type : two variables sharing the same memory, so if we change one variable the other
is also updated . eg : arrays, refer page 7.
Value type : e.g. an integer value; if we assign it to something else and then
change one of them, the other is not updated.
Codes :
my_lst=[1,2,3,4,5]
arr=np.array(my_lst)

In : type(arr)
Out : numpy.ndarray
In : arr
Out : array([1,2,3,4,5])

●​ arr.shape → tells us how many rows & columns there are
In : arr.shape
​ ​ Out : (5,) → 5 elements in a 1D array
Creating a 2D arr

In: Multinested array


my_lst1=[1,2,3,4,5]
my_lst2=[2,3,4,5,6]
my_lst3=[9,7,6,8,9]

arr=np.array([my_lst1,my_lst2,my_lst3])
In : arr
Out :array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[9, 7, 6, 8, 9]]) → the 2 closing brackets indicate a 2D arr
In: arr.shape
Out : (3, 5) → (rows, columns)
●​ arr.reshape : converts the array into a new shape
●​ During reshape the number of elements must remain constant
In: arr.reshape(5,3) → returns an array containing the same data with a new
shape
Out: array([[1,2,3],
[4,5,2],
[3,4,5],
[6,9,7],
[6,8,9]])
Indexing in array:
●​ arr[0] ⇒ for a 1D arr
●​ arr[: , :] ⇒ picks up all the elements
array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[9, 7, 6, 8, 9]])
●​ arr[0:2, :] → we want the 0th and the 1st row (the end index 2 is excluded)
and all columns
o/p : array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
●​ arr[0:2, 0:2]
o/p : array([[1,2],
[2,3]])
​ ​ ​ arr[1:, 3:]
​ ​ ​ o/p : array([[5,6],
[8,9]])

●​ arange : returns evenly spaced values within a given interval


○​ Creates a 1D array
Eg :
​ arr=np.arange(0,10)
o/p : array([0,1,2,3,4,5,6,7,8,9])
arr=np.arange(0,10,2) → step of 2
o/p : array([0,2,4,6,8])
Syntax :
np.arange([start,] stop[, step], dtype=None)

●​ Linspace : syntax :
np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
axis=0)
Used in ML
np.linspace(1,10,50) ⇒ from 1 to 10 we want 50 points → creates
EQUALLY spaced points
o/p : array([ 1. ,1.91836735,2.83, → ,9.08,9.81,10. ])
●​ Copy & broadcasting :

Eg :
arr[3: ] = 100 ⇒ from index 3 to the end, replace all elements by 100
arr1 = arr
arr1[3 : ] = 500
print(arr)
o/p : [1,2,3,500,500,500] **Reference type

NOTE : To prevent this updation we have the copy function : syntax :
arr1 = arr.copy() → this creates another memory space to store the values, so
changing arr1 does not change arr (see the sketch below) .
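A minimal sketch of the reference behaviour above and how copy() avoids it (the array values are illustrative):

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
arr1 = arr                # reference: both names point to the same memory
arr1[3:] = 500
print(arr)                # [  1   2   3 500 500 500] -> arr changed too

arr2 = arr.copy()         # separate memory space
arr2[3:] = 1000
print(arr)                # unchanged: [  1   2   3 500 500 500]
print(arr2)               # [   1    2    3 1000 1000 1000]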

Conditions are important for exploratory data analysis


Eg: val = 2
​ arr < val
⇒ it will check which elements are less than 2
o/p : array([True,False,False etc])
If we want to see those elements which are less than 2 → arr[arr < 2]
→ o/p → array([1])

We can also do arithmetic operations on the complete array


Eg : arr * 2 . all elements are multiplied by 2

Ones : creates an array where all the elements are 1


Syntax : np.ones(shape, dtype=None) → shape ⇒ size of the array; by default
dtype is set to float
​ ​ i/p : np.ones(4)
​ o/p : array([1.,1.,1.,1.])
For integers : np.ones(4, dtype=int) → array([1,1,1,1])
If we want a 2D array → np.ones((2,5), dtype=int) → creates 2 rows and 5 cols of
1s

Random distribution :
​ Type 1 :
●​ np.random.rand(3,3) → 3 rows and 3 cols of random elements
●​ The elements will be between 0 and 1 (uniform distribution).
​ Type 2 :
●​ np.random.randn(4,4) → selects values based upon the standard normal
distribution (stats)
​ ​ Type 3 :
●​ np.random.randint(low, high=None, size=None, dtype=int)
np.random.randint(0,100,8) ⇒ between 0 and 100 select 8
numbers
We can also reshape it ⇒ np.random.randint(0,100,8).reshape(2,4)

PANDAS :
●​ Importing pandas : import pandas as pd
​ ​ ​ import numpy as np
●​ Data Frames : combination of columns and rows . A 2D representation,
like how data looks in an excel sheet
○​ Eg : df=pd.DataFrame(np.arange(0,20).reshape(5,4),
index=[‘Row1’,’Row2’,’Row3’,’Row4’,’Row5’],columns=[‘Column1’,’Column2’,’Column3’,’Column4’],dtype=int) ⇒ we are making a
2D array with 20 elements and arranging it into 5 rows and 4 columns
df.head()
​ o/p :

Another way of seeing output : df.to_csv(‘Test1.csv’)


Creates a new file

●​ Accessing the elements : 2 ways 1) .loc → by row label / index


​ ​ ​ ​ ​ 2) .iloc → by integer row and column index
●​ Eg : df.loc[‘Row1’] → picks up all the elements of row 1
​ o/p : Column1​ 0
​ ​ ​ Column2​ 1
​ ​ ​ Column3​ 2
​ ​ ​ Column4​ 3
​ ​ In: type(df.loc[‘Row1’])
​ ​ Out : pandas.core.series.Series
Eg2 : df.iloc[ : , : ] → gives all the columns and rows
​ df.iloc[ 0:3 , 0:2 ]

type(df.iloc[ 0:3 , 0:2 ]) → o/p : pandas.core.frame.DataFrame

Converting dataframes into arrays :


df.iloc[:,1:].values  “we use the above example only, of shape (5, 3)”
o/p : array([[ 1 , 2 , 3],
[ 5 , 6 , 7 ],
[ 9 , 10, 11],
[13 , 14, 15],
[17, 18, 19]])

Checking the null condition :


df.isnull().sum()
o/p :
Column1    0
Column2    0
Column3    0
Column4    0
dtype: int64
“It means columns 1 to 4 have zero null values”
●​ Data series : can be either 1 column or 1 row
●​ df[‘Column1’].value_counts()
o/p : 12 ​ 1
4​ 1
16​ 1
8​ 1
0​ 1
“How many times each number is present”
●​ df[‘Column1’].unique()
o/p : array([0 , 4 , 8 , 12 , 16 ]) → shows all the unique values
●​ df[[‘Column3’,’Column4’]] → displays columns 3 & 4

Read different file formats (csv or excel format): ​


Csv → comma separated values → therefore the separator (sep) is ‘,’
df=pd.read_csv(‘file.csv’)
Main parameters : filepath_or_buffer, sep=’,’, delimiter=None, header=’infer’,
names=None, index_col=None, usecols=None, squeeze=False,
prefix=None, mangle_dupe_cols=True
df.info() → gives info like the number of rows, cols, integer and float columns
df.describe() → provides more data like count, mean. Here ONLY
INTEGER and FLOAT values are taken and NOT object → categorical
features will be skipped because we can't find their mean, min. value etc.
Eg of a csv file :
,Column1, Column2, Column3, Column4
row1,0,1,2,3
row2, 4, 5, 6, 7
row3,8,9,10,11
row4, 12, 13, 14, 15
row5,16,17,18,19

But if the above file has semicolons instead of commas then we will pass sep
as ‘;’
Eg : df=pd.read_csv(‘file.csv’, sep=’;’)
o/p :

●​ df[df[‘y’]>100] : wherever the y value is > 100 → display only those


numbers

Data conversion :
CSV :

Basic:
●​ from io import StringIO,BytesIO
●​ data = ('col1, col2, col3\n'
'x, y, \n'
'a,b,2\n'
'c,d,3')
// “\n for new line ”
// “This data is in the form of a string; we could also put these lines in a csv file
and then load it ”
●​ pd.read_csv(StringIO(data)) // StringIO → lets pandas read the string as if it were a file
●​ // df=pd.read_csv(StringIO(data),usecols=lambda x: x.upper() in
[‘COL1’,’COL3’])

●​ df=pd.read_csv(StringIO(data),usecols=[‘col1’,’col3’])
●​ df

○​ o/p :
●​ // converting the above table back to csv data
●​ df.to_csv(‘test.csv’) → this is saved in the same folder
●​ // if we want to read the values as a type other than the inferred one ⇒
●​ df=pd.read_csv(StringIO(data),dtype=object) → now all datatypes will be
considered as objects → in our case every value will be treated as a string
●​ df

○​ o/p : → 1 2 3 4 will actually be strings

●​ df[‘a’]
○​ o/p : ‘5’ → gives a string value

●​ // If we want to have different dtypes for each column


●​ df=pd.read_csv(StringIO(data),dtype={‘b’: int, ’c’: float, ’a’: ’Int64’})
●​ df
○​ o/p :

●​ data = ('index, a,b, c\n'


'4, apple, bat, 5.7\n'
'8,orange,cow,10')
●​ pd.read_csv(StringIO(data),index_col=0)
→ makes the 0th column the index

○​ o/p :

●​ // eg 2
●​ pd.read_csv(StringIO(data),index_col=1)

○​ o/p :
●​ data = ('index, a,b, c\n'
'4, apple, bat, 5.7\n'
'8,orange,cow,10')
●​ pd.read_csv(StringIO(data))

○​ o/p :
○​ // When index_col is not given, pandas keeps the same order as the data. If
the first column looks like a row number ⇒ it becomes the index
●​ pd.read_csv(StringIO(data),index_col=False)

○​ o/p :
●​ data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'
●​ pd.read_csv(StringIO(data),escapechar=’\\’) → treats \\ as an escape character so the quoted value is parsed correctly
●​ df=pd.read_csv(‘https://download.bis.gov.item’,sep=’\t’)
●​ df.head()

○​

Read Json to CSV :


●​ Json : JavaScript Object Notation file . Consists of key-value pairs, where
keys are strings and values can be strings, numbers, arrays or nested objects (as in the example below)
○​ Json eg : {
"name": "John Doe",
"hobbies": ["reading", "hiking", "gardening"],
"address": {
"city": "New York",
"country": "USA"
}
}
●​ Data = ‘{“employee_name”: “James”, “email”: “[email protected]”,
“job_profile”: [{“title1”: “Team Lead”, “title2”: “Sr. Developer”}]}’
●​ pd.read_json(Data)
●​ // nested info : underline part → it will not exactly get converted into
columns.
The last keyword will be taken as column
\\similar to our example

●​ Opening a dataset file from a URL without downloading it first : wine.data IMP


●​ df = pd.read_csv(‘https://archive.ics.uci.edu/wine/wine.data’, header=None)
●​ df.head()

●​
●​ df.to_csv(‘wine.csv’) → saves the dataframe to a csv file
●​ df.to_json(orient=”index”) → convert object to a json string
○​ Df.to_json()
■​ o/p
'{"employee_name":{"0":"James"},"email":{"0":"[email protected]
m"},"job_profile":{"0":{"title1":"Team Lead","title2":"Sr.
Developer"}}}'
○​ df.to_json(orient=”records”) IMP
■​ '[{"employee_name":"James","email":"[email protected]",
"job_profile":{"title1":"Team Lead", "title2":"Sr. Developer"}}]'
// makes the o/p record by record.

Reading HTML content :

Used for web scraping.


When a website is created it has <table>, <thead> etc., which are retrieved by pandas.
The result is returned as a list of the tables found on the website .
●​ url = ‘https://www.fdic.gov/bank/individual.html’
●​ dfs = pd.read_html(url)
●​ type(dfs) → o/p list
●​ dfs[0] → // takes all the information of the first table on the webpage, irrespective of page
“1”, pg ”2” as shown on the website
●​ url_mcc = ’https://en.wikipedia.org/wiki/Mobile_country_code’
●​ dfs = pd.read_html(url_mcc, match=’country’, header=0)
●​ // match = ’country’ → only tables that contain the keyword ‘country’ are
picked up
Reading excel files :
●​ df_excel=pd.read_excel(‘excelsample.xlsx’) // one of the parameters
is sheet name. We can number the sheets as 0th ,1st etc.
●​ df_excel.head()

PICKLING:
Useful for saving machine learning objects as pickles.
The to_pickle method uses python's pickle module to save data structures to
disk in the pickle format.

df_excel.to_pickle(‘df_excel’)
df=pd.read_pickle(‘df_excel’)
df.head()
o/p : // displays the content of the pickle
NOTE : search the pandas documentation — it shows all the info for pandas

MatplotLib tutorial :
Don't bother memorising matplotlib in detail; seaborn is easier to use .
Plotting library for python and its numerical mathematics extension NumPy. It
provides an object oriented API for embedding plots into applications using
general purpose GUI toolkits like Tkinter, wxPython, Qt or GTK+

Pros of Matplotlib are:


. Generally easy to get started for simple plots
. Support for custom labels and texts
. Great control of every element in a figure
. High-quality output in many formats
. Very customizable in general
●​ import matplotlib.pyplot as plt
●​ %matplotlib inline // only for jupyter notebook
●​ import numpy as np
●​ // simple examples
●​ x= np.arange(0,10) // return evenly spaced values within a given
interval, between 0 and 10
●​ y= np.arange(11,21) // since we are making a 2d graph we need 2
variables.
●​ a= np.arange(40,50)
●​ b= np.arange(50,60)
●​ // plotting using matplotlib
●​ // plt scatter
●​ plt.scatter(x,y,c=’g’) // first value is the x axis value and second value the y axis
value, s = marker size, c = colour, here g ⇒ green.
●​ plt.xlabel(‘X axis’) // x title
●​ plt.ylabel(‘Y axis’) // y title
●​ plt.title(‘graph in 2d’)
●​ plt.savefig(‘test.png’) // saves the image to disk
●​ //Since we have written %matplotlib inline, we dont have to write plt.show()
after the above line. If we are using other coding tools, like spyder, we have to
write plt.show().

●​ In the plot we get a straight line .


●​ y = x*x
●​ // plt plot
●​ plt.plot(x, y, ’r*--’, linestyle=’dashed’, linewidth=2, markersize=12)
●​ // r-- ⇒ dashed red line. r* ⇒ we get star-shaped points

●​ // subplots , within one diagram we can create multiple diagrams


●​ plt.subplot(2,2,1) // 2 rows , 2 columns , 1 st position we want 1 diagram
●​ plt.plot(x,y,'r')
●​ plt.subplot(2,2,2)
●​ plt.plot(x,y,'g')
●​ plt.subplot(2,2,3)
●​ plt.plot(x,y,'b')

●​
●​ x = np.arange(1,11)
●​ y = 3*x + 5
●​ plt.title("Matplotlib demo")
●​ plt.xlabel("x axis caption")
●​ plt.ylabel("y axis caption")
●​ plt.plot(x,y)
●​ plt.show()
●​
●​ # Compute the x and y coordinates for points on a sine curve
●​ x = np.arange(0, 4 * np.pi, 0.1) // 0.1 is the stepsize
●​ y = np.sin(x)
●​ plt.title("sine wave form")
●​ # Plot the points using matplotlib
●​ plt.plot(x, y)
●​ plt.show()

●​
●​ #Subplot()
●​ # Compute the x and y coordinates for points on sine and cosine curves
●​ x = np.arange(0, 5 * np.pi, 0.1)
●​ y_sin = np.sin(x)
●​ y_cos = np.cos(x)
●​ # Set up a subplot grid that has height 2 and width 1,
●​ # and set the first such subplot as active.
●​ plt.subplot(2, 1, 1)
●​ # Make the first plot
●​ plt.plot(x, y_sin)
●​ plt.title('sine')
●​ # Set the second subplot as active, and make the second plot.
●​ plt.subplot(2, 1, 2)
●​ plt.plot(x, y_cos)
●​ plt.title('Cosine')
●​ // show the figure
●​ plt.show()
Bar plot
●​ x = [2,8,10]
●​ y = [11,16,9]
●​ x2 = [3,9,11]
●​ y2 = [6,15,7]
●​ plt.bar(x, y)
●​ plt.bar(x2, y2, color = 'g')
●​ plt.title('Bar graph')
●​ plt.ylabel('Y axis')
●​ plt.xlabel('x axis')
●​ plt.show()

○​
●​ //https://www.youtube.com/watch?v=czQO1_GEEos&list=PPSV

Histograms:
Wrt the numbers (as shown below), what is the density / count on the y axis.
By default in a histogram there are 10 bins .
●​ a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
●​ plt.hist(a)
●​ plt.title("histogram")
●​ plt.show()
○​ for 10 bins

○​
●​ On x axis between 0 and 10. We have 3 values.
●​ For 20bins , we will write input as : plt.hist(a,bins=20) -> all the range will be
divided into 20 bins.

○​

Box plot in matplotlib :


Helps to find out percentiles
●​ data = [np.random.normal(0, std, 100) for std in range(1, 4)] // normal
distribution with mean 0 and standard deviation std, drawing 100 samples for
each std value . data here will be in the form of a list since we are using a for
loop . In our input it means 3 lists of values will get created
●​ # rectangular box plot
●​ plt.boxplot(data, vert=True,patch_artist=True);
●​ // if we make patch_artist = False the colours are gone .
●​ // if vert = False the graph becomes horizontal
●​ data
●​ [array([2.3456564, 0.584646, etc ]), . . . ]
●​ //The bottom black line is the lowest whisker (roughly the minimum) .
●​ //The start of the blue box is the 25th percentile, the middle yellow line is the 50
th percentile (median), and the top horizontal black line is the upper whisker (roughly the maximum).
●​ //The circles at the bottom of graph 3 are outliers, because we are
randomly selecting the values

Pie chart :
●​ # Data to plot
●​ labels = 'Python', 'C++', 'Ruby', 'Java'
●​ sizes = [215, 130, 245, 210] // sizes based on the cumulative total. Thus
each value is shown as a percentage .
●​ colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
●​ explode = (0.1, 0, 0, 0) // explode the 1st slice → how far the 1st slice has to go .
if we write (0.1,0,0.2,0) then the third slice is also exploded / moved away
●​ // Plot
●​ plt.pie(sizes, explode=explode, labels=labels, colors=colors,
●​ autopct='%1.1f%%', shadow=True) // autopct defines the format in which
we want the slice percentage shown . %1.1f → floating point format .
●​ plt.axis('equal')
●​ plt.show()

○​
________________________________________________________________

SEABORN tutorial :
Statistical plotting tools .
Dataset : f1, f2, f3, f4. Based on the classification or regression problem we
will be dividing the dataset into independent and dependent features, eg : f1, f2 &
f3, f4. Suppose f3 and f4 are the o/p features that we need to compute →
dependent features . With 2 features f1 and f2 we can draw a 2D plot, 3
features → 3D diagram, 4 features → 4D diagram.
If we only have f1 → univariate analysis
f1 & f2 → bivariate (WRT supervised ML)
Distribution plots :
These distribution plots help us to analyze how the features in
a dataset are distributed .
●​ Distplot
●​ Jointplot : bivariate
●​ Pairplot : more than 2 features
Practice problems on the IRIS dataset
●​ import seaborn as sns
●​ df = sns.load_dataset(“tips”) // inbuilt load_dataset function and the built-in “tips” dataset
●​ df.head()

// we should be able to create a model wherein we can assume what tip it will be
based on the other features like total bill, day, sex etc . here tip is a dependent
feature . else all are indep. Feature . since tip is dependent on the day, time etc .
df.dtypes
total_bill ​ float64
tip ​ float64
sex ​ category
smoker ​ category
day ​ category
time ​ category
size ​ int64
dtype: object
________________________________________________________________
Correlation with heatmap :
1. Uses colored cells, typically in a monochromatic scale, to show a 2D correlation
matrix (table) between two discrete dimensions or event types .
2. This correlation can only be found if our values are integer or floating point .
3. We can't find it for categorical features because they are object type .
4. Values will be ranging from -1 to +1 (Pearson's correlation coefficient)
The above data is used to make the heatmap
●​ df.corr()

●​ ​

○​ //Question: why are we getting only 3 features (columns) even if we


had 5 feat. Above.
○​ //Ans : here total bill, size, tip are float & integer values
●​ //analysing the above table, 1st row, 2nd column → the relation b/w total bill and tip is
67%, i.e. if the total bill increases then the tip also tends to increase.
●​ // if the value were –ve it would mean a negative correlation: as the total bill
increases, the tip would tend to decrease (and vice versa).
●​ // based on colours
●​ sns.heatmap(df.corr())

○​
________________________________________________________________
Joint plot :
A joint plot allows us to study the relationship between 2 numeric variables . The
central chart displays their correlation . It is usually a scatterplot, a hexbin plot, a
2D histogram or a 2D density plot.

Bivariate analysis
sns.jointplot(x='tip', y='total_bill', data=df, kind='hex') // on the x and y axis we place
the two features . kind : the shape of the plot displayed in between .

●​ //Major concentration → dark spots; at the same point the histograms above and on the rhs
are higher .
●​ //this shows that many people have given a tip of somewhere around 2
dollars. And the majority of the bills are between 10 – 20.
●​ // there are outliers → whose bill was more than 50$ and tip 10$
●​ sns.jointplot(x='tip', y='total_bill', data=df, kind='reg') // reg is regression . It
draws a probability density function, also known as KDE (kernel density
estimation), and it will also draw a regression line .
○​
________________________________________________________________

Pair plot (scatterplot matrix) :
Used when there are more than 2 independent features.
One variable in the same data row is matched against another variable's
value .
It plots all the possible permutations and combinations of the features
●​ sns.pairplot(df)

○​

○​
●​ //for colouring the scatterplot by a category, eg sex:
●​ sns.pairplot(df, hue=’sex’)
○​
________________________________________________________________

Dist plot :
Helps us to check the distribution of a single column / feature
●​ sns.distplot(df[‘tip’])

○​
●​ sns.distplot(df[‘tip’], kde=False, bins=10) → removes the continuous KDE line .
________________________________________________________________

Categorical plots
1) Boxplot
2) Violinplot
3) Countplot
4) Barplot
Count plot:
Shows the count of observations in each category
//using the same tips data set
sns.countplot(x=‘sex’, data=df)
●​ // if we pass y=‘sex’ instead, the graph is horizontal .

Bar plot :
We give both x and y values.
sns.barplot(x=’total_bill’, y=’sex’, data=df)

Box plot :
●​ sns.boxplot(x=‘smoker’, y=‘total_bill’, data=df) // smoker → x axis

○​
●​ sns.boxplot(x=”day”, y=”total_bill”, data=df, palette=’rainbow’)
○​
●​ // without giving rainbow we get blue, yellow, green, red.
●​ sns.boxplot(x=’total_bill’, y=’day’, hue=”smoker”, data=df) // hue = smoker →
classify the points wrt smoker

○​

Violin plot :
●​ //We are able to see the data in terms of the kernel density estimation and the box
plot together .
●​ sns.violinplot(x=’total_bill’, y=’day’, data=df, palette=’rainbow’)

●​
●​ Try to practise with iris = sns.load_dataset(‘iris’)
Read kaggle kernels . Practice problems from MEDIUM
________________________________________________________________

Exploratory data analysis :


60 % of the work is data analysis → feature engineering → clean the data, handle missing data
& feature selection : selecting the appropriate features to solve a problem .
Steps to follow after getting the data :
1) Feature engineering .
2) Handling categorical features .
3) Feature selection : correlations, forward selection, backward elimination
________________________________________________________________

EDA with python and applying Logistic Regression :


Download the titanic data set → tutorial 11
//import libraries
●​ import pandas as pd
●​ import numpy as np
●​ import matplotlib.pyplot as plt
●​ import seaborn as sns
●​ %matplotlib inline
●​ train = pd.read_csv(‘titanic_train.csv’)
●​ train.head()

Sibsp → sibling spouse


Parch → parent child
We have to predict whether the passenger has survived or not
Missing data → We have to first see how many NaN (null) values there are
●​ train.isnull() → tells whether each and every value in a row / column is True
or False
○​
○​ // if it is NaN it gives a True value .
●​ Checking the values by this method is difficult, therefore we use a heatmap .
●​ sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap=’viridis’) //
yticklabels → False means on the y axis we don't include the record labels ⇒ rec 1, 2,
3 etc. On the x axis we have the columns . If we said xticklabels = True → all
indexes
All the null values are shown in yellow colour

○​
Roughly 20 percent of the Age data is missing. The proportion of Age
missing is likely small enough for reasonable replacement with some form
of imputation.
Looking at the Cabin column, it looks like we are just missing too much of
that data to do something useful with at a basic level. We'll probably drop
this later, or
change it to another feature like "Cabin Known: 1 or 0"
Let's continue on by visualising some more of the data! Check out the video
for full explanations over these plots, this code is just to serve as reference.
●​ sns.set_style(‘whitegrid’)
●​ sns.countplot(x=’Survived’, data=train)
○​
●​ sns.set_style(‘whitegrid’)
●​ sns.countplot(x=’Survived’, hue=’Sex’, data=train, palette=’RdBu_r’)
●​ // in the ship women and children → priority → more survived

○​
●​ sns.set_style(‘whitegrid’)
●​ sns.countplot(x=’Survived’, hue=’Pclass’, data=train, palette=’rainbow’)

○​
○​ // Pclass is the passenger class. The rich people survived by
bribing the sailors.
○​ // Pclass 1 is the richer class.
●​ sns.distplot(train[‘Age’].dropna(), kde=False, color=’darkred’, bins=10)
○​
●​ sns.countplot(x=’SibSp’, data=train) // 0 → no siblings or spouse aboard

○​

●​ train[‘Fare’].hist(color=’green’, bins=40, figsize=(8,4))

○​

Data cleaning :
Removing the null values, which are present in the Age and Cabin columns .
We first find out the relation between passenger class and age
●​ plt.figure(figsize=(12,7))
●​ sns.boxplot(x=’Pclass’, y=’Age’, data=train, palette=’winter’)

○​
●​ def impute_age(cols):  // impute_age is a function
    age = cols[0]
    pclass = cols[1]
    if pd.isnull(age):
        if pclass == 1:
​            return 37   # because the average age of a passenger in 1st class
                        # is 37, from the box plot graph
        elif pclass == 2:
​            return 29
        else:
            return 24
    else:
        return age
●​ train[‘Age’] = train[[‘Age’,’Pclass’]].apply(impute_age, axis=1) // Age and
Pclass are passed on to the impute_age function .
●​ // we check the heat map
●​ sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap=’viridis’)
○​
●​ To replace the Cabin values we would need a lot of feature engineering, so it is
convenient to remove the column.
●​ train.drop(‘Cabin’, axis=1, inplace=True) // this completely removes the
column Cabin
●​ train.head()

○​
●​ sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap=’viridis’)

○​
○​ Thus we have successfully handled all the Nan values
●​ Details like passenger id , name , ticket no are not required .
Converting categorical features :
We are going to convert Sex & Embarked into integer features
using pandas get_dummies .
It converts the categorical columns into dummy (0/1) columns .
●​ pd.get_dummies(train[‘Embarked’], drop_first=True).head() // removes
the first column because the other 2 columns can represent the first
column.
The Embarked categories are C, Q, S : 01 will be for S, 10 will
be for Q and 00 for C . Therefore drop C.
○​
●​ //Similarly for Sex we will do the same .
●​ sex = pd.get_dummies(train[‘Sex’], drop_first=True)
●​ embark = pd.get_dummies(train[‘Embarked’], drop_first=True)
●​ train.drop([‘Sex’,’Embarked’,’Name’,’Ticket’], axis=1, inplace=True)
●​ train.head()

○​
●​ train = pd.concat([train, sex, embark], axis=1) // Q & S for embark and
male for sex
●​ train.head()
●​ // survived is a dependent feature , rest all are indep .

________________________________________________________________

Building a Logistic regression model:


We should start by splitting our data into training set and test set

Train test split :


●​ train.drop(‘Survived’, axis=1).head() // here we remove the Survived column
since it is our dependent feature .
●​ train[‘Survived’].head()

○​
●​ from sklearn.model_selection import train_test_split
●​ X_train, X_test, y_train, y_test =
train_test_split(train.drop(‘Survived’, axis=1), train[‘Survived’], test_size=0.30,
​ ​ ​ ​ random_state=101)

Training and predicting :


●​ from sklearn.linear_model import LogisticRegression
●​ logmodel = LogisticRegression()
●​ logmodel.fit(X_train, y_train)
○​ LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
●​ predictions = logmodel.predict(X_test)
●​ from sklearn.metrics import confusion_matrix // confusion matrix → used to check the
accuracy of a classification model or algorithm.
●​ accuracy = confusion_matrix(y_test, predictions)
●​ accuracy
○​ array([[144,19],[56,48]])
●​ from sklearn.metrics import accuracy_score
●​ accuracy = accuracy_score(y_test, predictions)
accuracy
●​ 0.7191011

Functions in python :
●​ //Common code
num = 24
def even_odd(num):
    if num % 2 == 0:
        print(“the number is even”)
    else:
        print(“it is odd”)

Positional and keyword arguments :


●​ def hello(name, age=29):  // here 29 is the default argument
    print(“my name is {} and age is {}”.format(name, age))
●​ hello(‘jay’)
○​ my name is jay and age is 29
●​ NOTE : Here name → positional argument → doesn't have a default value
​ age=29 → keyword argument
●​ def hello(*args, **kwargs):
    print(args)
    print(kwargs)
args → positional arguments
kwargs → keyword arguments
●​ hello(“jay”, ”cat”, age=29, dob=1000)
○​ (‘jay’, ’cat’)
○​ {‘age’: 29, ‘dob’: 1000}
●​ Storing the arguments in a list
●​ lst=[‘jay’,’cat’]
●​ dict_args={‘age’: 29, ‘dob’: 1000}
●​ hello(*lst, **dict_args)
○​ (‘jay’,’cat’)
○​ {‘age’: 29, ‘dob’: 1000}
●​ NOTE : we can return multiple values from a function, unlike c and c++
●​ Eg : lst = [1,2,3,4,5,6,7]
​ def evenoddsum(lst):
​     evensum = 0
      oddsum = 0
      for i in lst:
​         if i % 2 == 0:
​ ​            evensum = evensum + i
​         else:
​ ​            oddsum = oddsum + i
      return evensum, oddsum
●​ evenoddsum(lst)
○​ (12, 16)

Map function :
1.​2 parameters : a function & iterables
2.​Uses a LAZY LOADING technique
●​ def even_or_odd(num):
    if num % 2 == 0:
        return "The number {} is Even".format(num)
    else:
        return "The number {} is Odd".format(num)
●​ even_or_odd(24)
○​ 'The number 24 is Even'
●​ lst=[1,2,3,4,5,6,7,8,9,24,56,78]
●​ map(even_or_odd,lst)
○​ <map at 0x26164655c50> // the memory has not been instantiated yet, just by
calling map
●​ list(map(even_or_odd,lst))
○​ ['The number 1 is Odd',
○​ 'The number 2 is Even',
.......
○​ 'The number 78 is Even'

Lambda function :
1.​Also called an anonymous function
2.​A function with no name
3.​It works faster than a normal function
4.​If the function has a single line of code → convert it to a lambda
Eg : return a + b
5.​Similar to inline in c++
●​ addition = lambda a, b: a + b // the function is stored in the variable “addition”
●​ addition(12,50) // we can take multiple variables (a,b,c . . . )
○​ 62
●​ even1 = lambda a: a % 2 == 0
●​ even1(12)
○​ True

Filter function
●​ def even(num):
    if num % 2 == 0:
​        return True
●​ lst=[1,2,3,4,5,6,7,8,9,0]
●​ list(filter(even, lst))
○​ [2, 4, 6, 8, 0]
●​ list(filter(lambda num: num % 2 == 0, lst)) // best way
○​ [2, 4, 6, 8, 0]
●​ list(map(lambda num: num % 2 == 0, lst)) // map, by contrast, gives True/False for every element
○​ [False, True, . . . . . . . . ]
List comprehension :
1.​ A concise way to create lists
2.​More lines of code → more memory is occupied
3.​It consists of brackets containing an expression followed by a for clause,
then zero or more for or if clauses. The expressions can be anything,
meaning you can put all kinds of objects in lists
●​ def lst_square(lst):
    lst1 = []
​    for i in lst:
        lst1.append(i*i)
​    return lst1
●​ lst_square([1,2,3,4,5,6,7])
○​ [1,4,9,16,25,36,49]
●​ lst=[1,2,3,4,5,6,7]
●​ lst = [i * i for i in lst] // this replaces all the lines of code we have written in
the first bullet point
○​ [1,4,9,16,25,36,49]
//If we want to square only the even nos
●​ lst = [i * i for i in lst if i % 2 == 0]
●​ print(lst)
○​ [4 , 16 , 36]

String formatting :
●​ def welcome(name, age):
​    return “welcome {name1}, your age is {age1}”.format(name1=name,
age1=age)
●​ welcome(‘jay’,55)
○​ ‘welcome jay, your age is 55’

Python list iterables vs iterators :

●​ lst = [1,2,3,4,5,6,7]
●​ for i in lst:
​ ​     print(i)
○​ 1
2
3
.. ..
●​ iter(lst) → *

* iter operator : will convert a list(iterable) into an iterator . what will happen is
that all the values will not be initialised in the memory at once . So we have to
call an inbuilt function called next. Which will initialize it one by one
●​ lst1= iter(lst) → **
●​ next(lst1)
○​ 1 → we execute / run again → o/p 2 ⇒ pick up next element

Another method : for loop

●​ for i in lst1:
    print(i)
○​ 1
2
3
.......

Why do we need iterators ? → if we had a list of millions of elements as each


element is initialised in the list stored in the memory → so if we convert it into an
iterator we won't require that much memory unless we call the element , it will not
get initialised .

After we reach the last element, calling next() again raises StopIteration .
We don't get this in a for loop because it stops automatically at the last element
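A tiny sketch of the StopIteration behaviour described above:

lst = [1, 2, 3]
it = iter(lst)          # convert the iterable into an iterator
print(next(it))         # 1
print(next(it))         # 2
print(next(it))         # 3
try:
    next(it)            # no elements left
except StopIteration:
    print("reached the end of the iterator")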

OOPs in python
●​ class car:
​    pass → we don't have any properties defined yet . .
●​ car1=car()
●​ car1.windows=5
car1.doors=5
print(car1.windows) → o/p 5
The __init__ function → acts as a constructor
●​ class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def owner(self): → method inside a class → here self gives access to the instance's name
​        return “His name is {}”.format(self.name)
●​ dog1 = Dog(“trevor”, 6) → self refers to dog1
●​ dog1.owner() → ‘His name is trevor’
Self parameter → reference to the instance of the class itself , similar to THIS in
c++
Interview Q : Errors: Include both syntax and runtime errors. Syntax errors must
be fixed in the code.
Exceptions: Are runtime errors that can be handled to prevent program crashes.
Exception handling :

TRY EXCEPT ELSE block

●​ try :  // code block where an exception can occur


    a=b
except Exception as ex :  // ex → alias
    print(ex)​  // Exception CLASS
○​ name ‘b’ is not defined
●​ a = b
○​ NameError → type of exception → variable not defined
●​ try :  // code block where an exception can occur
    a=b
except NameError as ex1 :
    print(“the user has not defined the variable”)​ ​
except Exception as ex :  //written last
    print(ex)​ ​ ​ ​
○​ the user has not defined the variable
1.​If instead in the try block →
​ a = 1 & b = ‘g’ → c = a + b → then the NameError except case will not
run, since a NameError is not raised . It will be caught in the last
case and the o/p → TypeError → so we add a new except case →
except TypeError:  // we don't have to put an alias since we are displaying
our own message
​    print(“put same datatype”)
NOTE : the except clauses are checked TOP to BOTTOM .
In our eg : NameError → TypeError → Exception
If an except block is executed then the else block will not be executed
Note ELSE block → executed only if no exception is caught

Finally block :
The code in finally block is written after else block .
The code is executed irrespective error is caught or not .
​ Use this block to close the database . we cant use the else block .

Custom exception handling

●​ class Error(Exception):  // inheriting the Exception class


​ ​    pass
​ class DOB(Error):
​ ​    pass

●​ year = int(input(“enter the DOB”))


​ age = 2021 - year
try:
​    if age <= 30 and age > 20:
​ ​        print(“the age is valid”)
​    else:
​ ​        raise DOB
except DOB:  // catch the exception in the except block
​ ​    print(“age is not valid”)

o/p :

the age is valid (when the entered year gives an age between 21 and 30)

________________________________________________________________

Access specifiers :
OOPs - public,private , protected

●​ class car():
​    def __init__(self, windows, doors, enginetype):
​ ​        self._windows = windows
​ ​        self._doors = doors
​ ​        self._enginetype = enginetype
●​ bmw = car(4, 5, ”petrol”)
NOTE: 1. Java, c# = strongly typed → functionality is restricted
2.​Python → access restrictions can be overridden
3.​ ._ → we add a single underscore → PROTECTED
4.​dir(bmw) → see all the attributes → o/p :
{‘__class__’ // here inbuilt functions are also displayed
‘__dir__’
‘_doors’ ,
‘_enginetype’ ,
‘_windows ’ → written with a single _ → protected (without _ it would be public)
}
Protected members should only be accessed from subclasses via inheritance .
Overriding should only be done from the subclass
Continue from above :
●​ class truck(car):
​    def __init__(self, windows, doors, enginetype, horsepower):
        // horsepower is a new parameter
​ ​        super().__init__(windows, doors, enginetype)
​ ​        self.horsepower = horsepower
// super ⇒ to inherit all the params like windows, doors, enginetype
●​ truck1 = truck(4, 4, ”diesel”, 4000)
●​ dir(truck1)
○​ ‘_doors’ → protected
‘_enginetype’
‘horsepower’ → public
‘_windows ’
5.​Private → __ → double underscore → cannot be accessed or modified
outside of the class . If we want to modify it we can override it
6.​Private parameters in dir → _car__doors
​ ​ ​ ​ ​ _car__enginetype
​ ​ ​ ​ ​ _car__windows
________________________________________________________________

Inheritance :

●​ class car():
​    def __init__(self, windows, doors, enginetype):
​ ​        self.windows = windows
​ ​        self.doors = doors
​ ​        self.enginetype = enginetype
​    def drive(self):
​        print(“can drive”)
●​ car1 = car(4, 5, ”electric”)
●​ class audi(car):
​    def __init__(self, windows, doors, enginetype, luxury):
​ ​        super().__init__(windows, doors, enginetype)
​ ​        self.luxury = luxury → luxury is a boolean
​    def selfdriving(self):
​ ​        print(“audi supports self driving”)
●​ audiQ7 = audi(5, 5, ”electric”, True)
●​ audiQ7.selfdriving()
○​ audi supports self driving

________________________________________________________________

Uni, bi & multivariate analysis

Univariate :

Data : 3 columns containing height weight & o / p(obese,slim, fit )

Due to the overlap we move on to bivariate and multivariate analysis


We can't use logistic regression if there is a lot of overlapping between the points.
In logistic regression we draw a sigmoid and try to divide the points.
Then we will use non-linear classifying algorithms like decision trees, random
forests etc .
If we have more than 3 features we can't draw a 4D diagram

Bivariate :
Using seaborn .
Pearson's correlation helps us to determine, if one feature is changing, how the
other one is being affected. (-1 → +1)
●​ import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
●​ df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-p
ages/data/iris.csv')
●​ df. head()

●​
●​ df.shape
○​ (150,5) → (records , features)

Univariate analysis code :


●​ df_setosa = df.loc[df[‘species’] == ‘setosa’] → we get all the points of setosa //
similarly we will get data for the other 2 types

●​ df_virginica = df.loc[df[‘species’] == ‘virginica’]


df_versicolor = df.loc[df[‘species’] == ‘versicolor’]
●​ plt.plot(df_setosa[‘sepal_length’], np.zeros_like(df_setosa[‘sepal_length’]), ’o’)
1.​df_setosa[‘sepal_length’] → x value; since we are doing univariate analysis we are
taking the Y axis to be zero .
2.​We have put df_setosa inside np.zeros_like since we have to give the same length to
the Y axis
3.​ Similarly for the other 2 samples
4.​‘o’ makes the points bigger (circle markers)
●​ plt.xlabel(‘sepal length’)
●​ plt.show()
without ‘o'

3 of different colours

Extreme left → the species mentioned first in the code

​ Bivariate analysis code :


●​ sns.FacetGrid(df, hue=”species”, size=5).map(plt.scatter, ”sepal_length”, ”sepal_width”).add_legend()
​ hue → the feature we are trying to categorise by
●​ plt.show()

○​

We can see setosa on the top left . It can be easily classified

________________________________________________________________

Z score :

u = μ (mu, the mean)

For a standard normal distribution μ = 0 & σ (sigma) = 1


The first unit on the right of the mean ⇒ 1, 2 . . .
Refer statsai2 notes .

The formula to convert a gaussian (normal) distribution to the S.N.D. is the z score

z = (xi - μ) / σ

Normalization or standardisation is where we apply this formula to each
value .
μ - σ to μ + σ → 68% of all data lies here
μ - 2σ to μ + 2σ → 95%

Sample Q :
μ = 75 , σ = 10 . What is the probability that a student scores > 60 ?
z = (60 - 75) / 10 = -1.5 ; the 3rd region (left of 60) has an area of 0.0668 → 6.68 %
1st region → between z = -1.5 and 0
2nd region → from 0 to the RHS → 50 %
Let region 1 be x
Total ⇒ 100 = x + 50 + 6.68
x ≈ 43.3 %

To find out the probability of scoring > 60 :


43.3 + 50 ≈ 93.3 % of students will get marks > 60
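The same answer can be cross-checked with a small scipy sketch (scipy is assumed to be installed; it is not used elsewhere in these notes):

from scipy.stats import norm

mu, sigma = 75, 10
z = (60 - mu) / sigma        # z = -1.5
print(norm.cdf(z))           # ~0.0668 -> area to the LEFT of 60
print(1 - norm.cdf(z))       # ~0.9332 -> P(score > 60), i.e. about 93%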

________________________________________________________________

Probability density function (PDF) :

1.​Smoothen the histogram


2.​Using data (weight and height) → x axis → weight
​ ​ ​ ​ ​ ​ y axis → count
3.​Draw a bell curve → the count value ⇒ percentage of the distribution

Cumulative density function (CDF) :


1.​We keep adding up the previous points .

2.​On the y axis → 0.9 → 90% of the data is below 130 kg
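A minimal sketch of both ideas with made-up weight data (numpy and matplotlib as imported earlier; the mean, spread and sample size below are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

weights = np.random.normal(80, 15, 1000)   # made-up weight data (kg)

plt.hist(weights, bins=30, density=True)   # a smoothed version of this histogram is the PDF
plt.xlabel('weight')
plt.show()

plt.hist(weights, bins=30, density=True, cumulative=True)  # CDF: fraction of data below each weight
plt.xlabel('weight')
plt.show()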

____________________________________________________________
Linear regression in-depth math intuition

1.​y = mx + c ( m → slope , c → intercept )


2.​The equation is for the best fit line (the distance [error] from the line should be minimal) .
3.​When x = 0, then y = c
4.​We select the cost function J(m) = (1/2m) Σ_{i=1}^{m} (ŷ_i - y_i)²
m (in the sum) = number of points
ŷ → y with a hat on top → y hat
ŷ ⇒ mx + c ⇒ the points that we predict on the best fit line
y ⇒ real points

But we can't keep on applying this formula again and again ; this is
discussed later .

Thus by minimising the squared difference between ŷ and y we are minimizing the errors

Example :
Consider a graph with data points and ŷ = mx + c
We generally consider c = 0 (i.e. the line passes through the origin), otherwise we would have
to draw a 3D diagram
Putting x = 1 in our equation , ŷ = 1
m = 1 (assumption)
Cost function = (1/(2·3)) [(1 - 1)² + (2 - 2)² + (3 - 3)²] = 0

Graph of the cost function :


Y axis = cost function J(m)
X axis = m value
WRT every m value that is initialized, what cost function do we get .
Next case: we consider m = 0.5
For x = 1 ⇒ ŷ = 0.5
x = 2 ⇒ ŷ = 1
Calculated cost function ≈ 0.58
Thus we can plot all the points
When should we stop to select a m value for the regression line or for the best fit
?
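Before answering that, a quick sketch that reproduces the two cost values above (c = 0, data points (1,1), (2,2), (3,3) as in the example):

x = [1, 2, 3]
y = [1, 2, 3]

def cost(m):
    # J(m) = (1 / 2n) * sum((m*xi - yi)^2), with n = number of points
    n = len(x)
    return sum((m * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * n)

print(cost(1.0))   # 0.0   -> perfect fit
print(cost(0.5))   # ~0.58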

Gradient descent
If the initial point is high on the cost curve (e.g. m = 0, J = 2), we have to go “downwards”
to get the minimal value, so we use the convergence theorem .

Convergence theorem
m := m - α × (dJ/dm)

α (alpha) = learning rate


dJ/dm = derivative of the cost function J with respect to m ⇒ slope of the cost curve

The next step is to find out whether the slope is –ve or +ve

1.​If the curve on the RHS of the point goes downwards → (–)ve slope


​ If here the m value → –0.5 and the feasible m value ≈ 1 :
Calculate → (dJ/dm) = (–ve) & α = a small value *
Thus the m value will INCREASE and therefore come nearer to 1

* Why did we select α to be small ?


​ If it were greater, then the point would jump to another point and we may
never reach the minima
2.​When we take a point where the curve on its RHS goes upwards and on its LHS downwards,
(dJ/dm) = (+ve), so m decreases

●​ At global minima slope is ZERO (we reach best fit case )

Based on the number of features gradient diagram maybe 3d or 4d .


Every feature will try to move toward the global minima .
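A minimal gradient-descent sketch for the single-parameter case above (the learning rate, iteration count and the tiny dataset are arbitrary illustrative choices):

x = [1, 2, 3]
y = [1, 2, 3]
m = 0.0            # initial slope
alpha = 0.1        # learning rate

for _ in range(100):
    n = len(x)
    # dJ/dm for J(m) = (1/2n) * sum((m*xi - yi)^2)
    grad = sum((m * xi - yi) * xi for xi, yi in zip(x, y)) / n
    m = m - alpha * grad      # convergence / update rule

print(m)           # converges towards 1.0 (the global minimum)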
________________________________________________________________

Ridge and lasso regression :

Regularization hypertuning techniques .


Insert from one note
________________________________________________________________

Ridge and lasso implementation in py

Dataset - predict the price of the house


●​ from sklearn.datasets import load_boston
●​ import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
●​ df = load_boston()
●​ df
// all of the data is shown
●​ dataset = pd.DataFrame(df.data)
print(dataset.head())

○​
●​ dataset.columns = df.feature_names
●​ dataset.head()

○​
●​ df.target.shape
○​ (506,) → 506 rows
We create a new column called price and add the target variable there
●​ dataset[“price”] = df.target
●​ dataset.head()

○​
●​ x = dataset.iloc[:, :-1] //independent features
●​ y = dataset.iloc[:, -1] //dependent feature

Linear regression code :

cross_val_score → used for cross validation

●​ from sklearn.model_selection import cross_val_score


●​ from sklearn.linear_model import LinearRegression
●​ lin_regressor = LinearRegression()
mse = cross_val_score(lin_regressor, x, y, scoring='neg_mean_squared_error', cv=5)
mean_mse = np.mean(mse)
print(mean_mse)
// pattern : cross_val_score(initialized model object, x, y, scoring = which error metric we
want [the nearer to 0 → the better the model], cv = number of folds)
○​ -37.131 → –ve because sklearn reports the negated mean squared error

Ridge regression :
●​ from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

●​ ridge = Ridge()
●​ parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40,
45, 50, 55, 100]}
●​ ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(x, y)
●​ print(ridge_regressor.best_params_) // best_params_ helps to find out which
λ (alpha) value is suitable .
print(ridge_regressor.best_score_)
○​ {'alpha': 100}
○​ -29.871945115432595

Remember : α > 0 & can be any finite number

1e-15 → 10^-15

Lasso regression :
●​ from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso = Lasso()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45,
50, 55, 100]}
●​ lasso_regressor = GridSearchCV(lasso, parameters,
scoring='neg_mean_squared_error', cv=5)

lasso_regressor.fit(x, y)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
○​ {'alpha': 1}
-35.491283263627095
●​ from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=0)
●​ prediction_lasso = lasso_regressor.predict(x_test)
prediction_ridge = ridge_regressor.predict(x_test)
●​ import seaborn as sns
sns.distplot(y_test - prediction_lasso)

○​
●​ import seaborn as sns
sns.distplot(y_test - prediction_ridge)
○​
●​ Thus ridge and lasso give a similar graph . good prediction

________________________________________________________________

Multiple linear regression :

For multiple features the equation of the line will be :-


y = m1x1 + m2x2 + m3x3 + c .
m1, m2 and m3 are the slopes WRT the independent features → x1, x2, x3
m1 says: if we increase the value of x1 by 1 unit, how much does it affect the
result .
In the below dataset we have to find out profit (dependent feature) based on the
R&D spend, administration and marketing spend
Code :

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1] // all columns except the last are put in x variable
y = dataset.iloc[:, 4]

#Convert the column into categorical columns

states=pd.get_dummies(X['State'],drop_first=True) // create dummy variables


based on the no of categories **
# Drop the state column
X=X.drop('State',axis=1)

# concat the dummy variables


X=pd.concat([X,states],axis=1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)

# Fitting Multiple Linear Regression to the Training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results


y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score ***


score=r2_score(y_test,y_pred)

EXPLANATION:
3 categorical values → california, florida, new york

** → so we can represent new york without


mentioning it explicitly (the 0,0 combination) .
But in our example we will delete california .
We can take any column to drop .
Dropping one dummy column avoids what is called the dummy variable trap
The first column is removed

m1 → r&d , m2 → admin , m3 → marketing


, m4 → florida , m5 → new york

*** r2 score :

SSres → sum of squared residuals


SSmean → sum of squared deviations from the mean

Remember : SSmean > SSres → good model

R² = 1 - SSres / SSmean, so when we divide in the formula we subtract only a small number .


Generally an R² value > 0.8 indicates a good model

When we execute, the score value is 0.93 → closer to 1 → good model

THUS WE USE R² TO MEASURE A GOOD MODEL


Multicollinearity in linear regression :

1.​ The occurrence of high intercorrelations among two or more independent


variables in a multiple regression model
●​ import pandas as pd
●​ import statsmodels.api as sm
df_adv = pd.read_csv('data/Advertising.csv', index_col=0)
X = df_adv[['TV', 'radio', 'newspaper']] // 3 independent features
y = df_adv['sales']
df_adv.head()

○​
// shows the expenditure in various departments; based on them we
have to predict the sales value .
We solve the problem with the help of ordinary least squares (OLS) → used in multiple
linear regression
OLS:
1.​Estimates the coefficients of linear regression equations which
describe the relationship between one or more
independent quantitative variables and a dependent
variable
2.​Why do we choose to minimize the sum of squared errors
instead of the sum of errors directly?

It takes into account the sum of squared errors instead of


the raw errors because the raw errors can be
negative or positive and they could sum up to a nearly null
value.

For example, if your real values are 2, 3, 5, 2, and 4 and


your predicted values are 3, 2, 5, 1, 5, then the total error
would be (3-2)+(2-3)+(5-5)+(1-2)+(5-4) = 1-1+0-1+1 = 0 and
the average error would be 0/5 = 0, which could lead to false
conclusions (see the two-line sketch below).
3.​(Source: Wikipedia)
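The same arithmetic as a quick check (values copied from the example above):

real = [2, 3, 5, 2, 4]
pred = [3, 2, 5, 1, 5]
print(sum(p - r for r, p in zip(real, pred)))          # 0 -> raw errors cancel out
print(sum((p - r) ** 2 for r, p in zip(real, pred)))   # 4 -> squared errors do not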

Thus our equation based on OLS will be : Y = β0 + β1(TV) + β2(radio) +


β3(newspaper) + ε
ε → error in measurement
β0 → intercept
We don't have a column for β0, so we will add a column with all values 1
●​ X = sm.add_constant(X) // adds a constant column = 1
model = sm.OLS(y, X).fit()

By pressing shift + tab ↓

endog value → o/p feature


exog value → i/p feature
●​ model.summary()

○​
○​ // below (columnwise) are β1, β2 & β3. The TV coefficient (≈0.0458)
indicates that a unit increase in TV expenditure is associated with a
0.0458-unit increase in sales . Newspaper → –ve → we don't
have to spend a lot on it → if we decrease that spending then sales may
still increase slightly .
○​ // std err → low → none of the features have a multicollinearity problem .
If there is correlation then std err → a bigger number .
○​ // The p-value for all is low except for newspaper (0.860)

Plotting in terms of correlation :


●​ import matplotlib.pyplot as plt
X.iloc[:, 1:].corr()
○​

●​ There is not much correlation as the values are < 0.5.

●​ df_salary = pd.read_csv(‘data/Salary_data.csv’)
●​ df_salary.head()

○​
○​ // years of experience and age are independent features, salary → dependent feature
●​ x = df_salary[[‘YearsExperience’,’age’]]
y = df_salary[‘salary’]
●​ x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
●​ model.summary()
○​
○​ // if we increase the age by 1 year then how much should our salary
increase .
1.​R² is also good (high) → fits well
2.​But the std err is a HUGE VALUE
3.​If there is a multicollinearity problem then the std error is a big value .
4.​If we add another feature which is correlated with the other features
then the std err will be VERY VERY HIGH.
5.​The p-value is > 0.05 for age, so age and years of experience might have
some kind of correlation . Thus to confirm it we write the below code .
●​ x.iloc[:,1:].corr()

○​
○​ // age and years of exp have 98% correlation . Thus we may drop the
age feature .
________________________________________________________________
ŷ = mx + c
y = actual value in the graph
_________________________

Huge gap b/w training error and test error → OVERFITTING → the training
dataset has given excellent results → low bias
or less error → but it gives a high error on the test dataset (high variance)

Underfitting → HIGH error on both datasets

Note : Ridge and lasso are used to convert high variance into low variance .

A good model should have both low bias & low variance .

Ridge regression :
Note : a steep slope ⇒ overfitting
Steep → a small change in the x axis → a large change in the y axis

λ → 0 to any +ve number

Green → 2nd (less steep) line

Cost of the 1st line → 0 + 1(1.3)² = 1.69 — (1)


⤷ example with λ = 1

For the green line: (y - ŷ)² is a slightly larger value, but the slope is smaller → steepness ↓

Ridge cost = (y - ŷ)² + λ(slope)² — (2)

(2) < (1)

Now the green line is the new best fit line

Note : We PENALIZE features with higher slopes (↑ m) & make the line less steep

y = mx + c → 1 feature
y = m1x1 + m2x2 + c → 2 slopes
penalty term in the formula → λ[m1² + m2²]

When we apply the line with the smaller slope to our test case → the difference (error) will
be smaller as compared to the steeper slope .

Green line → ↑ bias as compared to the white line

↑ Bias → for the training dataset we now accept a small amount of error


As λ ↑ → the line becomes almost flat → steepness ↓ → slope close to 0
________________________________________________________________

Lasso regression :

→ handles overfitting
→ also performs feature selection

Ridge handles only overfitting :


Ridge regression → the slopes shrink towards zero but never become exactly zero
Lasso → slopes can go exactly to zero
→ the features whose slope value is very small get removed (their coefficient becomes 0)

A coefficient can get cancelled to zero because lasso uses the magnitude |m| in the penalty instead of m² .

Bias & variance : (DOP – degree of polynomial)

                 MODEL 1              MODEL 2              MODEL 3
Error            High                 Medium               Low (on training data)
1. Fit           Underfitting         Proper balance       Overfitting
                 (↑ error for both    (generalised model)  (satisfies all points in
                 training data &                           training data, but test
                 test data)                                 data accuracy ↓)
2. Bias /        High bias,           Low bias,            Low bias,
   variance      high variance        low variance         high variance

Classification problem → eg : classify binary data → yes or no

Bias – error on the training data

Variance – error on the test data
1 – error rate high for both training data & test data
– both high
– underfitting condn

2 – overfitting condn (low training error, high test error)

3 – we have to find a model where a generalized model can be created

– low bias & low variance
________________________________________________________________

1)​Decision tree

– by default tends towards an overfitting scenario

– the decision tree is split to its complete depth .
– low bias & high variance
– since the decision tree goes to great depth, we try to stop it at some level → decision tree
pruning

2)​Random forest
– multiple decision trees in parallel
– scenario of each tree – low bias & high variance
– uses BOOTSTRAP aggregation (bagging)
– in bootstrap aggregation we take a dataset & give samples of it to multiple models
– NOT the complete records but partial (sampled) records are given to the
multiple decision trees & we get their o/p
– the o/p is then aggregated (majority vote / average)
– since the decision trees are combined in parallel, the high variance → low variance
Q ) what kind of technique → xg boost has ? High bias , low var or low bias &
high var etc .

________________________________________________________________

R2 & adjusted R2 :

SSres = sum of squared residuals / errors = Σ(yᵢ - ŷᵢ)²

A residual is the gap b/w a data point and the best fit line

ŷᵢ → predicted points

yᵢ → actual / real points
SStotal = total sum of squares about the average (mean) line = Σ(yᵢ - ȳ)²
R² = 1 - SSres / SStotal
R² → b/w 0 & 1 → the nearer to 1 → the better the model
R² can even be lower than 0 if our best fit line is worse than the average line:
if SSres > SStotal → the ratio is > 1 → on subtraction from 1 → answer → – ve

● → as we keep on adding new independent features our R² → increases

→ as SSres ↓ – because a coefficient is assigned to every new feature
→ R² does not penalize the newly added feature

Adjusted R² = 1 - [(1 - R²)(N - 1) / (N - p - 1)]
R² = sample R²
p = no of predictors / independent features
N = total sample size

→ adjusted R² PENALIZES attributes which are not correlated with the output

→ as p ↑ → the denominator (N - p - 1) ↓ → the correction grows → adj R² ↓ unless R² improves enough
→ the added feature should be correlated with the target
→ only if it is correlated does adj R² ↑

1.​Every time you add an independent variable to a model, the R-squared
increases, even if the independent variable is insignificant; it never
declines. Adjusted R-squared, on the other hand, increases only when the
independent variable is significant and affects the dependent variable.
2.​The adjusted R-squared value is always less than or equal to the R-squared
value.
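A short sketch of the two formulas above in code; the sample numbers and the number of predictors are made up for illustration:

import numpy as np

y      = np.array([3.0, 4.5, 6.1, 7.9, 10.2])   # actual values (assumed)
y_pred = np.array([3.2, 4.4, 6.0, 8.1, 10.0])   # predictions (assumed)

ss_res = np.sum((y - y_pred) ** 2)        # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares about the mean
r2 = 1 - ss_res / ss_tot

n, p = len(y), 2                          # sample size and no. of predictors (p assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)                         # adjusted R² <= R²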
________________________________________________________________

Hypothesis testing :-

→ evaluate 2 (can be > 2) mutually exclusive statements on a population

using sample data.

Steps :
1.​Make initial assumption(Ho)
2.​Collect data
3.​Gather evidences to reject or accept the null hypothesis .
CONFUSION MATRIX : the four possible outcomes of the decision vs reality (rejecting or retaining H₀ when it is actually true or false).

Even if the null hypothesis is true, we may not have much evidence, so we say we "fail to reject" it rather than prove it.


________________________________________________________________

T test , chi square test , anova test

Lets say we ask the q , what is the difference in proportion of male and female ?
→ say h1 → yes there is a difference in proportion .
→ the above is a sample data set

** continued below

P value / significance value / α :


●​ Any statistical analysis → we perform 2 tailed test
●​ Eg : in a keyboard . the middle part of a spacebar is most commonly
pressed .
●​ At point x. P = 0.01. Ie if we repeat this experiment 100 times then no of
times we touch that place is 1.
●​ Similarly at y → p =0.8 → 100 try → touch 80 times
● P value → probability of getting a result at least this extreme, assuming the null hypo. is true
● Null hypo → it treats everything as the same or equal
→ eg : coin → Ho → coin is fair
​ ​ ​ H1 → coin is NOT fair
For the NULL hypo to be retained we want the
result to be closer to the mean (in the 95% region)

●​
●​ At 2.5% region → reject null hypo.
●​ Since value is far from mean value

** continued
We make an Ho, H1 & test table. Let's say H1 – there is a diff b/w the
proportion of male and female.
Ho → there is no diff

Considering a test case using one categorical feature, we need to apply a test
which tells us, assuming this
null hypo. is true, how likely the observed data is (and hence whether the alternate hypo. should be accepted).
We take p ≤ 0.05

1.​For one sample feature the test -> One sample prop. Test
P value is selected before the test.
P -> 0.05 has the same graph as mentioned above .
If p <0.05 -> reject null hypo
2.​If we have 2 categorical feature -> test -> chi square test
3.​T test
a.​1st case continuous variable (eg : height )
b.​2nd case 1 numerical var and cat var with only 2 categories (M & F)
4.​2 numerical variable -> test -> correlation (eg: for pearsons value range ->
-1 to +1 if near to 0 -> no correlation )
5.​Anova test
a.​ one numerical var and one categorical var .
b.​ cat var + cat var which has more categories (eg : age : adult ,elderly,
young)
For the features selected we note down:

H0 → no difference

H1 → some difference

test → name of the test applied

Practical implementation in python for test

T test :
A t test is a type of inferential statistics which is used to determine if there is a
significant difference between the means of two groups which may be related in
certain features.

1.​One sample t test


2.​Two sampled t test

1.​One sample
Tells whether the sample and the population are different

t = (x̄ - 𝜇) / (s / √n)

Where,
𝜇 = Proposed constant for the population mean
x̄ = Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation

s_x̄ = s / √n = Estimated standard error of the mean
Eg 1
● ages = [10, 20, 35, . . . . 19, 70, 43]
● import numpy as np
ages_mean = np.mean(ages)
print(ages_mean)
● 30.34375
● // taking a sample
● sample_size = 10
age_sample = np.random.choice(ages, sample_size)
● age_sample
○ array([26, 43, . . . 50])
● from scipy.stats import ttest_1samp
● ttest, p_value = ttest_1samp(age_sample, 30)  // 2 o/p: the t statistic and the p value
// (age_sample, 30) -> we are testing whether our age sample is consistent with the
population mean (i.e. 30); if it is, we say there is no diff.
● print(p_value)
○ 0.7403
● if p_value < 0.05:
    print("reject null hypo")
else:
    print("accept null hypo")
● accept null hypo

Poisson distribution

●​ import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)
school_ages=stats.poisson.rvs(loc=18,mu=35, size=1500)
classA_ages=stats.poisson.rvs(loc=18,mu=30,size=60)
// loc=18 means our ages start from 18,
mu = mean
loc also shifts the distribution's extreme left value (in the bell curve)
●​ classA_ages.mean()
○​ 46.9
●​ _,p_value=stats.ttest_1samp(a=classA_ages, popmean =
school_ages.mean())
●​ school_ages.mean()
○​ 53.303333333333335
● p_value
○ 1.13e-13
●​ We reject the null hypo
Two sample T test :

Defn: the independent samples t test or 2 sample t-test compares the


means of two independent groups in order to determine whether there is
statistical evidence that the associated population means are significantly
different . The independent samples t test is a parametric test. This test is also
known as independent t test .

● np.random.seed(12)
classB_ages = stats.poisson.rvs(loc=18, mu=33, size=60)
classB_ages.mean()
○ 50.63333333333333
● _, p_value = stats.ttest_ind(a=classA_ages, b=classB_ages, equal_var=False)
// a value -> 1st grp
b value -> 2nd grp

●​ if p_value < 0.05: ​ # alpha value is 0.05 or 5%


print(" we are rejecting null hypothesis")
else:
print("we are accepting null hypothesis")
○​ we are rejecting null hypothesis

Paired t test with python :


Check how different samples from the same group are .

●​ weight1=[25,30,28,35, 28, 34, 26, 29, 30, 26, 28,32, 31, 30,45]
weight2=weight1+stats.norm.rvs(scale=5,loc =- 1.25,size=15)
weight2 is the weight after some years (weight1 plus a random change)
●​ print(weight1)
print(weight2)
○​ [25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45]
[30.57926457 . . . 41.32984284]
●​ weight_df=pd.DataFrame({"weight_10":np.array(weight1),
"weight_20":np.array(weight2),
"weight_change":np.array(weight2)-np.array(weight1)})
○​
●​ _,p_value=stats.ttest_rel(a=weight1,b=weight2)
A -> previous wt
B -> recent wt
●​ print(p_value)
○​ 0.5732936534411279
●​ if p_value < 0.05:# alpha value is 0.05 or 5%
print(" we are rejecting null hypothesis")
else:
print("we are accepting null hypothesis")
○​ we are accepting null hypothesis
Correlation
● import seaborn as sns
●​ df=sns.load_dataset('iris')
●​ df.shape
○​ (150, 5)
●​ df.corr()

○​
// from the above data we can see that sepal length and petal length are
highly correlated
If the value were nearer to 0 → not much correlation.
● sns.pairplot(df)
○​

Chi square test :


The test is applied when you have two categorical variables from a single
population. It is used to determine whether there is a significant association
between the two variables.

●​ import scipy.stats as stats

●​ import seaborn as sns


import pandas as pd
import numpy as np
dataset=sns.load_dataset('tips')

●​ dataset.head()

○​
●​ dataset_table=pd.crosstab(dataset['sex'],dataset['smoker'])
print(dataset_table)
○​ Smoker ​ Yes ​ no
sex
Male ​ 60 ​ 97
Female ​ 33 54
●​ dataset_table.values
○​ array([[60, 97],
​ [33, 54]], dtype=int64)

● # Observed Values
Observed_Values = dataset_table.values
print("Observed Values :- \n", Observed_Values)
○​ Observed Values :-
[[60 97]
[33 54]]
● val = stats.chi2_contingency(dataset_table)
​ ^ chi2 contingency function -> shift + tab -> returns the statistic, p value, dof and expected values
●​ val
○​ (0.008763290531773594, 0.925417020494423, 1,
array([[59.84016393, 97.15983607]
[33.15983607, 53.84016393]]))
Underlined values ->we see difference in expected and observed values
● Expected_Values = val[3]
● no_of_rows = len(dataset_table.iloc[0:2, 0])   // dataset_table = crosstab info
no_of_columns = len(dataset_table.iloc[0, 0:2])
ddof = (no_of_rows - 1) * (no_of_columns - 1)   // formula: (rows - 1) x (columns - 1)
print("Degree of Freedom :- ", ddof)
alpha = 0.05   // we want 95% significance between the 2 features
○ Degree of Freedom :- 1

Chi square formula :

● χ² = Σ (O - E)² / E
O -> observed
E -> expected
●​ from scipy.stats import chi2
chi_square=sum([(o-e) ** 2./e for o,e in
zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
●​ print("chi-square statistic :- ",chi_square_statistic)
○​ chi-square statistic :- 0.001934818536627623
●​ critical_value=chi2.ppf(q=1-alpha,df=ddof)
//ppf -> percent point function (inverse of cdf ).
print('critical_value:',critical_value)
●​ critical_value: 3.841458820694124
●​ #p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('p-value:', p_value)
○​ p-value: 0.964915107315732
Significance level: 0.05
Degree of Freedom: 1
p-value: 0.964915107315732
●​ if chi_square_statistic>=critical_value:
print("Reject H0,There is a relationship between 2 categorical variables")
else:
print("Retain H0,There is no relationship between 2 categorical
variables")

if p_value <= alpha:


print("Reject H0,There is a relationship between 2 categorical variables")
else:
print("Retain H0,There is no relationship between 2 categorical
variables")
○​ Retain H0,There is no relationship between 2 categorical variables
○​ Retain H0, There is no relationship between 2 categorical variables

____________________________________________________________

​ Metrics in classification

1.​Confusion mtx
2.​FPR(type 1 error)
3.​FNR(type 2 error)
4.​RECALL (TPR, sensitivity)
5.​PRECISION(+VE PRED VAL)
6.​Accuracy
7.​F beta / F 1 score
8.​Cohen kappa
9.​ROC curve, AUC score
10.​ PR curve

Eg : problem -> classification problem statement


2 ways to solve the classification problem
1)​Class labels
2)​probabilities

1)​Class labels
Eg : in binary classification there will be 2 classes A and B .
​ By default the THRESHOLD VALUE = 0.5 ⇒ say if value is > 0.5 then B
class else A class(<0.5)

2) Probability → this consists of the ROC curve, AUC score & PR curve

1)​Class labels
Balanced dataset → 1000 records ⇒ 600 yes, 400 no / 700 yes & 300 no
​ Yes and no are almost equal, so when we provide our ML algorithm with
the data it will not be biased towards the majority output class.
If we have 800 Y & 200 N → biased o/p

If we have a balanced dataset we use accuracy.


Imbalanced dataset → recall , precision and F beta . (4,5,7)

1.​Confusion mtx
2 X 2 mtx for binary classification where the top values are the actual values
LHS → predicted values

T → true
F → false
P → positive
N → negative

FP → type 1 error ⇒ calculated with the help of FPR


FN → type 2 error ⇒ calculated with the help of FNR

REMEMBER : TP & TN are most accurate results


In any classification problem our aim is to reduce type 1 error & type 2 .

Since we have a balanced problem statement we directly compute the accuracy:

accuracy = most accurate results (TP + TN) divided by the total no. of results.

Imbalanced dataset :
Recall,precision , F beta

Out of the total actually positive values, how many did we correctly
predict as positive? ⇒ recall = TP / (TP + FN) ⇒ also called TPR (true positive rate) / sensitivity

Out of the total predicted positive results, how many are actually positive?
⇒ precision = TP / (TP + FP) ⇒ also called positive prediction value

Uses of recall & precision: Spam detection → here we have to focus on precision

If we have a mail that is not spam but it is predicted to be spam ⇒ FP
We try to reduce FP.

Stock market / cancer prediction → recall value: if the person actually has the disease (+ve) but the model says
-ve (FN) → disaster
(ref: StatQuest)
WHENEVER FP is much more IMP use PRECISION
If FN imp → RECALL
F beta : whenever we need to balance both FP & FN (precision & recall)
Fβ = (1 + β²)(precision x recall) / (β²·precision + recall)
If β = 1 → F1 score.
Similarly, if β = 2 → F2 score.
when β = 1:
Fβ = 2(precision x recall) / (precision + recall) = HARMONIC MEAN
= 2xy / (x + y)

when to take β = 1 → FP & FN are both equally imp.

If FP has more impact than FN, we reduce β towards 0 (generally to 0.5) → more weight on precision.

Similarly, if the FN impact is high → increase β (e.g. β = 2) → more weight on recall (see the sketch below).
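A quick sketch with scikit-learn showing how β shifts the weight between precision and recall; the labels below are made up:

from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # assumed labels
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1  (beta=1)  :", fbeta_score(y_true, y_pred, beta=1))    # FP & FN equally important
print("F0.5(beta=0.5):", fbeta_score(y_true, y_pred, beta=0.5))  # FP costlier → precision weighted
print("F2  (beta=2)  :", fbeta_score(y_true, y_pred, beta=2))    # FN costlier → recall weighted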


Continued later ** machine learning
________________________________________________________________

Logistic regression :
(detailed notes in Microsoft OneNote)
Time complexity is proportional to the input size
●​ Logistic Regression​

used for Binary classification​

Classification :​

1.​Binary
2.​Multiclass
Why we call it regression?​
● If weight is > 75 we say the person is obese.​

If y >= 0.5 then we consider y = 1, i.e. we consider the person to be obese​



By using one straight line we can solve the classification​

problem, so why do we need logistic regression?​

See the above example of 90 kg. Reason: -​

1.​An outlier changes the best fit line.


● But we know that 90 kg is obese → high error rate.​

Thus we should not use linear regression for these kinds of
problems.​

2.​The value of y can come out to be >> 1 or -ve, which makes no sense for a class label.


Part 2
Log. Reg. is applied to a problem statement where the 2 classes
are linearly separable
Linearly separable (can be divided with the help of a straight line)
Assumptions :
For +ve points → y = +1
​ -ve points → y = -1
Since the line passes through the origin, c = 0.
New eqn: y = mx, or in vector form y = wᵀx
Find out the distance of a pt from a plane .

Above the plane the distance is always positive; below the plane it is -ve.
So if we have this kind of scenario,
from (1) we know that y → +ve,
thus the point is getting properly classified.

Thus we are able to classify correctly.

"Max" means the sum of signed distances should not be negative, as seen above.

For the 1st (good) line: if we consider the distance of each correctly classified point from the plane to be 2,
and the total no of points = 8 => 8(2) = 16.
But one -ve pt (outlier) lies on the +ve side at distance 500,
=> total = 16 - 500 = -484.

For the 2nd line (shifted because of that mix of -ve & +ve pts):

the outlier pt now lies just outside, so we take its dist to be -2; the remaining
distances roughly cancel, therefore its total = 0 + 2 = 2 <- MAX VALUE, so this worse line gets picked.
Thus, to prevent such an inaccurate line, we wrap a function
around the formula:
the sigmoid function.

As we saw in our first case, we could not pick the true best fit line. The sigmoid
function σ(z) = 1 / (1 + e⁻ᶻ) takes any value, even a large -ve one like -500, and transforms it to lie b/w 0
and +1.

The sigmoid function therefore removes the effect of outliers.

Multiple models are made when we have more than 2 categories.

One category is kept & the rest are grouped together (one-vs-rest).
So if we have 3 categories -> 3 models M1, M2 & M3.
Here we treat each category differently as +1 & -1 for each model, as we
have a different line for each case.
All models give different probabilities -> e.g. [0.20, 0.25, 0.55], sum of all = 1.
Next we take the model (class) with the highest probability -> that is considered the o/p.
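A minimal sketch of the sigmoid squashing and of scikit-learn handling more than 2 categories; the toy data below is assumed, not from the course:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(-50), sigmoid(0), sigmoid(4))   # a large negative value is squashed towards 0, everything stays in (0, 1)

# 3 categories → internally one model per class; the predicted probabilities sum to 1
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2) + centre for centre in [(0, 0), (4, 4), (8, 0)]])
y = np.repeat([0, 1, 2], 30)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))   # e.g. [[0.97 0.02 0.01]] → class with the max probability is the o/p
print(clf.predict(X[:1]))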
___

Decision tree entropy:


The final points in a decision tree are the LEAF NODES / FINAL CLASS
LABELS.
While constructing the decision tree we use the ID3 algorithm -> first step ->
select the right attribute
-> for splitting the node -> here we use entropy.
Entropy -> measures the purity of the split.
PURE SUB-SPLIT -> we should get either all yes / all no.
The final leaf node has either a yes or a no.
We have to check the purity of each split -> use entropy.
Formula :- H(S) = -p(yes)·log₂ p(yes) - p(no)·log₂ p(no)
If we have a completely impure subset (3 Y & 3 N) then our entropy = 1 bit
Impure subset -> worst split
Best split -> at leaf nodes -> 4 Y & 0 N -> 0 bits
Value of entropy -> b/w 0 & 1

Decision tree information gain : -

Formula : - Gain(S, A) = H(S) - Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)
On applying the above formula to each candidate split we get its information gain.

IMP : the feature/split that gives the highest information gain -> used to split the decision tree
Gini impurity in decision trees : Gini = 1 - Σ pᵢ²
Graph - for a binary split, gini impurity peaks at 0.5 while entropy peaks at 1.0 (both at p = 0.5)

Gini impurity takes less time to execute (no logarithm) -> faster; see the sketch below
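A tiny sketch of the entropy, gini and information-gain formulas above; the split counts (other than the 3Y/3N and 4Y/0N cases from the notes) are assumed for illustration:

import numpy as np

def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]                          # avoid log2(0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.array(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(entropy([3, 3]), gini([3, 3]))   # completely impure split → 1.0 bit, 0.5
print(entropy([4, 0]), gini([4, 0]))   # pure split → 0.0 bits, 0.0

# information gain of a parent node (9 Y, 5 N) split into two children (assumed counts)
parent, left, right = [9, 5], [6, 2], [3, 3]
gain = entropy(parent) - (8/14) * entropy(left) - (6/14) * entropy(right)
print(gain)   # the split with the highest gain is chosen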


Performance matrix part 2 :
ROC & AUC CURVE :
Threshold deciding factor :
K nearest Neighbor

●​ If we have a new pt -> in which category should we classify it into​



Algorithm :​

1.​Choose the K nearest neighbours (eg : K = 5)

2.​We calculate the distance to the nearest neighbours and count each category.
(if 3 neighbours are category 1 & 2 are category 2 => the point is classified as c1)​
Adaboost (Boosting Technique) :

We create base learners sequentially. Some of the records are passed to

the base learner. Once the BL is trained we pass all the records through it and see
how it performs. If some of the records are incorrectly classified, they go to the next
model, which is created sequentially: mainly the incorrect records are passed
to base learner 2, so BL2 is trained mostly on them.
Similarly, if there are errors in BL2 they are passed on to BL3.
This goes on until we reach the number of base learners we specified.

_______

ADABOOST:
In Adaboost the weights are assigned to the records
In our dataset we have features such as f1 , f2 , f3 and o/p .
All these records get sample weight .
Initially all the records are assigned the same weights .
We created our base learner with the help of decision trees in Adaboost .
Here the decision trees are created with the help of only 1 depth . With 2
leaf nodes
These decision trees are called stumps .
Here all the base learners are decision trees .

For each and every feature we will create a stump, e.g. for f1, f2.

To pick the best stump we use entropy or the gini coefficient; we can select either of them.


The stump which has the least entropy . We select that decision tree as our
base learning model .
If our selected stump has correctly classified 4 records & incorrectly classified 1 record,
for this incorrect classification we find the total error.
S2 : Total error = no. of wrong records / total records (here 1/5)
We calculate the performance of the stump to update the weights:
performance of stump = ½ · ln((1 - Total error) / Total error)
Increase the weight of wrongly classified records and decrease the wt of correctly
classified ones:

New sample wt = old wt x e^(+performance)  → for wrongly classified records
New sample wt = old wt x e^(-performance)  → for correctly classified records

In case of the old wts (sample wts), when we sum over all the records we

get 1, but the new wts DON'T add up to 1. So we divide each by a number
(the summation of all the new wts) to normalize them.

S4 : we create a new dataset based on the updated weights, which

will tend to select the wrongly classified records for its training purpose.
We divide the wts into buckets (cumulative ranges).
In each iteration a random value is drawn; we then see which
bucket it falls into.
If it falls into a wrong-record bucket,
that record is populated (picked again). These become our new records and we create a new
decision tree stump.
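A small sketch of the weight-update step described above, using the running example of 5 records with 1 misclassified; the specific record that is wrong is an assumption:

import numpy as np

n = 5
sample_wt = np.full(n, 1 / n)                 # initially every record gets weight 1/5
misclassified = np.array([False, False, True, False, False])   # stump got record 3 wrong (assumed)

total_error = sample_wt[misclassified].sum()                    # 1/5 = 0.2
performance = 0.5 * np.log((1 - total_error) / total_error)     # performance of the stump

new_wt = np.where(misclassified,
                  sample_wt * np.exp(performance),    # increase weight of wrong records
                  sample_wt * np.exp(-performance))   # decrease weight of correct records
new_wt = new_wt / new_wt.sum()                        # normalize so the weights again sum to 1
print(new_wt)   # the wrong record now has a much larger weight → more likely to be resampled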
Math Behind K Means clustering
K means --> unsupervised technique using sklearn

1.​Algorithm
2.​Metrics
Euclidian
Manhattan
3.​Elbow method (selecting the K value )

Euclidean :

Straight-line (hypotenuse) distance : √((x2 - x1)² + (y2 - y1)²)

Manhattan distance (along the sides instead of the hypotenuse) :

Manhattan formula : |(x2 - x1)| + |(y2 - y1)|

"Manhattan, since roads are laid out in a grid format"

K means

It is able to find the similarity in a group of outputs and group them into
clusters .

Centroids -- clusters

Steps in K means clustering -


A) K value -> centroids
B) initialize centroids randomly

C) select the group and find the mean .


D) move the centroid based on the mean position we find .
E) then we repeat from the assignment step with the updated centroids and draw a line that
divides the points
F) this goes on until the groups are fixed and no point changes its
cluster. o/p 2 clusters, K = 2

In the example on rhs . The points are divided and are closer to the
centroids . Thus they are divided into two groups

Elbow method :

For each k = 1, 2, 3 … find the WCSS (Within Cluster Sum of Squares) -->

the summation of squared distances between the centroid and all the points in its cluster.
Plot WCSS vs k; the last value of k with an abrupt decrease (the elbow, here a = 3)
is taken as our K → we can group our data into 3 groups since it is
the optimal value.
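A minimal sketch of the elbow method with scikit-learn; the blob data is generated purely for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # assumed toy data

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()                            # the 'elbow' (last abrupt drop) suggests K = 3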

Lec 71
______
Hierarchical clustering intuition :
-- unsupervised ML

Trick : to find the exact number of clusters from the dendrogram, find the longest vertical line such
that no horizontal line passes through it; cutting there gives the number of clusters.
DBSCAN :

-- unsupervised ML algorithm
-- 1 epsilon
-- Minpts
we try to make clusters -- helps us to find out the most similar points in a
distribution

If point a has at least min_pts (e.g. 4) points within its radius ε --> CORE PT.
2 conditions for a core pt :
1) it has an ε-radius boundary
2) the no. of pts inside that boundary should be >= min_pts (here 4)

Boundary pt : has fewer than min_pts points inside its ε-radius, but has at least 1 core pt inside it.

Noise pt : has an ε radius but no other pts (and no core pt) inside it → outlier. So when the
DBSCAN algo encounters a noise pt it skips that pt.
Yellow : if min_pts = 4 and the pts inside < 4.
Noise pt --> outlier -- not assigned to any cluster.

Advantages of DBSCAN:
– Is great at separating clusters of high density versus clusters of low density within
a given dataset.
– Is great at handling outliers within the dataset.

Disadvantages of DBSCAN:
– Does not work well when dealing with clusters of varying densities. While
DBSCAN is great at separating high density clusters from low density clusters,
it struggles with clusters of similar density.
– Struggles with high dimensionality data. DBSCAN is great at contorting the data
into different dimensions and shapes, but it can only go so far: if given data with
too many dimensions, DBSCAN suffers.
_______

Silhouette (clustering ) :

It is used to verify whether the clustering algorithm we have used works properly
or not.
3 steps:
1. take 1 data pt and calculate aᵢ = its average distance to the other points in its own cluster (c1),
using the Euclidean or Manhattan distance.
2. take another cluster (c2) and calculate bᵢ = the average distance from that point of c1
to the points of c2.
3. If the clustering is done properly then ( ai << bi ).
For the silhouette score the value is between -1 and +1.
If the value is towards -ve then ai >> bi -- bad clustering.
formula: s(i) = (bᵢ - aᵢ) / max(aᵢ, bᵢ)
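A short sketch of checking a clustering with the silhouette score in scikit-learn; the toy blobs are assumed:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # assumed data

for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # closer to +1 → better separated clusters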

Curse of Dimensionality :
-- dimensions (features ) -- attributes

If we select a different number of features for each model:

Eg : if we want to find the price of a house wrt area and size,
we give the same dataset to different models. For M2 we will have the same
features as above plus state, dimensions, no of bedrooms etc.
M2 --> accuracy 2 (acc2), similarly acc1 for M1.
M2 has more information than M1, thus acc1 < acc2 < acc3.
After a threshold value our model will not get any more accurate.

Thus it is not necessary that as we increase the no of features our models

will be
more accurate:
acc5 < acc4.
Reason : m1 -> m2 -> m3 – the models learn from the data, but as the no of
features
increases, the space the model has to cover grows EXPONENTIALLY, the model gets confused, and accuracy
decreases.
Principal component analysis

2d feature → 1d feature

If we have some pts on x-y plane . we can draw a principal component line such
that all the pts can be projected on that line .

Next we create PC2, which is perpendicular to PC1.

If the pts are far away from a component line, then projecting onto it loses information / lots of
variance.

Eg: the mean radius of a cancer cell in a dataset has to go through (xᵢ − μ) / σ,

where μ -> mean of that column and σ -> standard deviation. This converts it into a standard normal distribution with
mean = 0 and SD = 1. This is called the StandardScaler.
So for the cancer dataset, which has many features, we rescale the values into the same
unit.

______________________________________________________________

Machine learning pipelines :

All the lifecycle in a data science project ;


1.​Data analysis
2.​Feature engineering
3.​Feature selection
4.​Model building
5.​Model deployment

● import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
● // Display all the columns of the dataframe
pd.pandas.set_option('display.max_columns', None)
● dataset = pd.read_csv('train.csv')
​ // Print shape of dataset with rows and columns
​ print(dataset.shape)
○ (1460, 81) → rows, columns
●​ //Print the top 5 records
dataset.head()
○​ //various categories which determine the house prices

●​ In data analysis we will analyze to find out the below stuff :


1.​Missing values
2.​ All the numerical variables
3.​Distribution of the numeric variables
4.​Categorical variables
5.​Cardinality of the categorical variables
6.​Outliers
7.​Relationship between independent and dependent features (sale
price)

Missing values :

Here we will check the percentage of nan values present in each feature
1st step make the list of features which has the missing values

●​ features_with_na=[features for features in dataset.columns if


dataset[features].isnull().sum()>1 ]
2nd step print the feature name and the percentage of missing values:

● for feature in features_with_na:
print(feature, np.round(dataset[feature].isnull().mean(), 4), '% missing values')
// whatever value we get → round it up to 4 decimal pts, and we print
the string 'missing values'

○​

Since there are many missing values we need to find the relationship b/w missing
values and sale price.
We can't just drop the missing values since they might have some dependency
on the output.

● for feature in features_with_na:

​ data = dataset.copy()
# make a variable that indicates 1 if the observation was missing, 0 otherwise
● data[feature] = np.where(data[feature].isnull(), 1, 0)

# calculate the median SalePrice where the information is missing or present
● data.groupby(feature)['SalePrice'].median().plot.bar()
plt.title(feature)
plt.show()
○
○​
Records that do not have nan values → 0.
Like this there are other graphs; wherever there are nan values the feature has a higher
median sale price, so it plays a major role.
So we replace nan values with something meaningful → feature eng.
● print("Id of houses {}".format(len(dataset.Id)))
○ Id of houses 1460 // total no of houses

Numerical variables
How many features are numerical variable :
● numerical_features = [feature for feature in dataset.columns if
dataset[feature].dtypes != 'O']  → // if the feature is not an object (string) then it
is a numerical feature
●​ print('Number of numerical variables: ', len(numerical_features))

//visualize the numerical variables


●​ dataset[numerical_features].head()
○​ Number of numerical variables: 38

○​
YrSold → temporal variable → its data gets updated each year.
From the dataset we have 4 such variables. We have to extract information from the
datetime variables, like the no of years or no of days; one example is the difference in years
between the year the house was built and the year it was sold. This analysis is
done in feature engineering.
# List of variables that contain year information
● year_feature = [feature for feature in numerical_features if 'Yr' in feature or
'Year' in feature]  // these features have the keyword Yr or Year in their name
●​ year_feature

○​ ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold' ]


# Let's explore the content of these year variables
●​ for feature in year_feature:
print(feature, dataset[feature].unique())
○​ YearBuilt [2003 1976 2001 1915 2000 1993 2004 1973 1931 1939
1965 2005 1962 2006 similarly for yearRemodadd etc . . . . .]

## Lets analyze the Temporal Datetime Variables


## We will check whether there is a relation between the year the house was sold and the sale price
dataset.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")

○​
●​ year_feature
○​ ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

## Here we will compare the difference between ALL years feature with
Salesprice
● for feature in year_feature:
    if feature != 'YrSold':
        data=dataset.copy()
        ## We will capture the difference between the year variable and the year the house was sold
        data[feature]=data['YrSold']-data[feature]
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()

○​
Observe from the above graph that if the house is very old (~140 yrs) → low price

○​
Similarly for other features
## Numerical variables are usually of 2 type
## 1. Continuous variable and Discrete Variables
●​
● discrete_feature=[feature for feature in numerical_features if
len(dataset[feature].unique())<25 and feature not in year_feature+['Id']]
print("Discrete Variables Count: {}".format(len(discrete_feature)))
○​ Discrete Variables Count: 17
● discrete_feature
○ ['MSSubClass',
'FullBath',
'HalfBath',
. . . . . . . . .]
●​ dataset[discrete_feature].head()

○​
## Lets find the relationship between them and SalePrice
●​ for feature in discrete_feature:
data=dataset.copy()
data.groupby(feature)['SalePrice'].median().plot.bar()
plt.xlabel(feature)
plt.ylabel('SalePrice')
plt.title(feature)
plt.show()

○​
//Similarly we have other graphs

Continuous Variable
●​ continuous_feature=[feature for feature in numerical_features if feature not
in discrete_feature+year_feature+[‘Id’]]
●​ print("Continuous feature Count {}".format(len(continuous_feature)))
○​ Continuous feature Count 16

● ## Lets analyse the continuous values by creating histograms to

understand the distribution
●​ for feature in continuous_feature:
data=dataset.copy()
data[feature].hist(bins=25)
plt.xlabel(feature)
plt.ylabel("Count")
plt.title(feature)
plt.show()

○​
Similarly, the other histograms do not follow a Gaussian distribution, so we convert them into a (log-)normal
distribution.

Advanced house prediction lec 2

We will be using logarithmic transformation

● for feature in continuous_feature:
    data=dataset.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])   // np.log → convert into log normal dist
        data['SalePrice']=np.log(data['SalePrice'])
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.title(feature)
        plt.show()
○​
○ // after the log transform, the scatter plot of the feature vs SalePrice looks more linear

6. Outliers
● for feature in continuous_feature:
    data=dataset.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()

○​
○ // many outliers for all features. Box plots do not work for categorical
features; we only use them for continuous features.
Categorical var
● categorical_features = [feature for feature in dataset.columns if dataset[feature].dtypes
== 'O']
● categorical_features
○ ['MSZoning',
○ 'Street',
○ 'Alley', . . . . . .
● dataset[categorical_features].head()
○​ // display top 5 results
○​ // focus on cardinality values → how many different categories we
have inside categorical feature .
● for feature in categorical_features:
​ print('The feature is {} and number of categories are
{}'.format(feature, len(dataset[feature].unique())))
○​ The feature is MSZoning and number of categories are 5
The feature is Street and number of categories are 2
The feature is Alley and number of categories are 3
The feature is LotShape and number of Categories are 4
........

Find the relationship between the categorical variables and the dependent feature

SalePrice
​ for feature in categorical_features:
​ ​ data=dataset.copy()
​ ​ data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

○​ ..........

____________

Feature engineering - lec 1 :

Steps of feature engineering :


1.​Missing values
2.​Temporal variables
3.​Categorical variables : remove rare labels
4.​Standardize the values of the variables to the same range

●​ import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# to visualize all the columns in the dataframe
pd.pandas.set_option('display.max_columns', None)

●​ dataset=pd.read_csv('train.csv")
dataset.head()

○​

In a kaggle problem statement we have train data and test data. Usually in
kaggle, because we want very good accuracy, people combine the train and test data and then do the
feature eng. Because of that there is data
leakage: some info flows from the train data to the test data and vice versa, which inflates the accuracy.

●​ ## Always remember there way always be a chance of data leakage so we


need to split the data first and then apply feature eng.
● from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(dataset,dataset['SalePrice'],
test_size=0.1,random_state=0)  # seed value assumed; any fixed random_state works
● X_train.shape, X_test.shape
● ((1314, 81), (146, 81))

Missing values

● ## Let us capture all the nan values

## First lets handle the categorical features which are missing
features_nan=[feature for feature in dataset.columns if
dataset[feature].isnull().sum()>1 and dataset[feature].dtypes=='O']

for feature in features_nan:


print("{}: {}% missing values".format( feature,np.round( dataset[ feature
].isnull().mean(),4)))

○​ Alley: 0.9377% missing values


○​ MasVnrType: 0.0055% missing values
○​ BsmtQual: 0.0253% missing values
○​
●​ ##replace missing values with a new label
●​ def replace_cat_feature(dataset, features_nan):
data=dataset.copy()
data[features_nan]=data[features_nan].fillna('Missing') // nan values are
changed to ‘missing’
return data

dataset=replace_cat_feature(dataset,features_nan)
dataset[features_nan].isnull().sum()
○​ Alley: 0
MasVnrType:0
BsmtQual:0

## Now lets check for, numerical variables the contains missing values
● numerical_with_nan=[feature for feature in dataset.columns if
dataset[feature].isnull().sum()>1 and dataset[feature].dtypes!='O']

## We will print the numerical nan variables and percentage of missing values

● for feature in numerical_with_nan:

print("{}: {}% missing value".format(feature, np.round(dataset[feature].isnull().mean(), 4)))

○​ LotFrontage: 0.1774% missing value


MasVnrArea: 0.0055% missing value
GarageYrBlt: 0.0555% missing value
●​ ## Replacing the numerical Missing Values
for feature in numerical_with_nan:
## We will replace by using median since there are outliers
median_value=dataset[feature].median()
## create a new feature to capture nan values
dataset[feature+'nan']=np.where(dataset[feature].isnull(),1,0) →// if
nan value = 1. If no nan value = 0
dataset[feature].fillna(median_value,inplace=True)

●​ dataset[numerical_with_nan].isnull().sum()

○​ LotFrontage 0
MasVnrArea 0
GarageYrBlt 0
●​ dataset.head(50)

○​
// date time variables → ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

## Temporal Variables (Date Time Variables)


for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    dataset[feature]=dataset['YrSold']-dataset[feature]
●​ dataset.head()

○​
●​ dataset[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].head()

○​
Feature eng- part 2 :
●​ dataset.head()
○​
●​ import numpy as np
num_features=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
// convert into log normal dist
for feature in num_features:
dataset[feature]=np.log(dataset[feature])
●​ dataset.head()

○​

Handling rare categorical feature :


Grouping categorical labels that are present in less than 1% of the observations into a 'Rare_var' label
● categorical_features=[feature for feature in dataset.columns if
dataset[feature].dtypes == 'O']
● categorical_features
○ ['MSZoning',
'Street',
'Alley', . . .]
● for feature in categorical_features:
    temp=dataset.groupby(feature)['SalePrice'].count()/len(dataset)
    temp_df=temp[temp>0.01].index
    dataset[feature]=np.where(dataset[feature].isin(temp_df),dataset[feature],'Rare_var')
    // if dataset[feature] is in temp_df then keep dataset[feature], otherwise replace it with 'Rare_var'
So the categories in MSZoning (and the other features) that occur in less than 1% of the records are
replaced by 'Rare_var'

Feature scaling :
We have many features which are measured in different units.
MinMaxScaler : converts the data to lie b/w 0 & 1
StandardScaler : converts the data to mean 0 and standard deviation 1

Code :
●​ feature_scale=[feature for feature in dataset.columns if feature not in ['Id',
'SalePrice']]

●​ from sklearn.preprocessing import MinMaxScaler


scaler=MinMaxScaler()
scaler.fit(dataset[feature_scale])
○​ MinMaxScaler(copy=True, feature_range=(0, 1))
●​ scaler.transform(dataset[feature_scale])
○​ array([[0.23529412, 0.75 , 0.41820812, ... , 0. , 0.
.......
# transform the train and test set, and add on the Id and SalePrice
variables
● data = pd.concat([dataset[['Id', 'SalePrice']].reset_index(drop=True),
pd.DataFrame(scaler.transform(dataset[feature_scale]),
columns=feature_scale)], axis=1)
● data.head()

○​
●​ data.to_csv(‘X_train.csv’,index=False)

Feature selection Advanced house price prediction :


Dataset download link :
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
●​ import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## for feature slection


● from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise al the columns in the dataframe


pd.pandas.set_option('display.max_columns', None)
●​ dataset=pd.read_csv('X_train.csv')
●​ dataset.head()

○​
● We drop SalePrice and Id, because Id is just a continuously increasing index and SalePrice
is the dependent feature.
●​ ## Capture the dependent feature
y_train=dataset[['SalePrice']]

●​ ## drop dependent feature from dataset


X_train=dataset.drop(['Id','SalePrice'],axis=1)

●​ ### Apply Feature Selection


# first, I specify the Lasso Regression model, and I
# select a suitable alpha (equivalent of penalty).
# The bigger the alpha the Less features that will be selected.

# Then I use the selectFromModel object from sklearn, which


# will select the features which coefficients are non-zero
●​
● feature_sel_model = SelectFromModel(Lasso(alpha=0.005,
random_state=0)) # remember to set the seed (random_state); we use the
same value for the test dataset
● feature_sel_model.fit(X_train, y_train)
○​ SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True,
fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False,
random_state=0,
selection='cyclic', tol=0.0001, warm_start=False),
max_features=None, norm_order=1, prefit=False, threshold=None)
●​ feature_sel_model.get_support()
○​ array([ True, True, False, False, False, False, False, False, False,
True → feature is imp, we use it. False → we don't use it, not imp

●​ # let's print the number of total and selected features

# this is how we can make a list of the selected features


selected_feat = X_train.columns[(feature_sel_model.get_support())]

# let's print some stats


● print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
np.sum(feature_sel_model.estimator_.coef_ == 0)))

○​ total features: 82
selected features: 21
features with coefficients shrank to zero: 61
●​ selected_feat
○​ Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual',
'YearRemodAdd', . . . . . .
●​ X_train=X_train[selected_feat]
●​ X_train.head()

○​

_____

ROC and AUC curve : in one note

___

Performance metrics on multiclass classification problems


● from sklearn import metrics
●​ C =”cat”
D=”dog”
F=”Fox”
// the precision for the Cat class is the number of correctly predicted Cat out of all
predicted Cat
the recall for Cat is the number of correctly predicted Cat photos out of the
number of actual Cat
●​ # True values
y_true =[C,C,C,C,C,C, F,F,F,F,F,F,F,F, F, F, D,D,D,D, D, D, D, D, D]
# Predicted values
y_pred=[C,C,C,C,D,F, C,C,C,C,C,C,D,D,F,F, C, C, C, D, D, D, D, D, D]

# Print the confusion matrix


print(metrics.confusion_matrix(y_true, y_pred))

# Print the precision and recall, among other metrics

print(metrics.classification_report(y_true, y_pred, digits=3))

○​ [[4 1 1]
[3 6 0]
[6 2 2]]

K nearest neighbor → one note


___

K nearest neighbor → python

● import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Getting the data :


●​ df = pd.read_csv("Classified Data",index_col=0)
df.head()

○​

##Standardize the Variables


Because the KNN classifier predicts the class of a given test observation by
identifying the observations that are nearest to it, the scale of the variables
matters. Any variables that are on a large scale will have a much larger effect on
the distance between the observations, and hence on the KNN classifier, than
variables that are on a small scale.
●​ from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
●​ scaler.fit(df.drop('TARGET CLASS',axis=1))
○​ StandardScaler(copy=True, with_mean=True, with_std=True)
● scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
● df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1]) →
skipping the last column and taking all the other columns
df_feat.head()

○​

##Train Test Split


●​ from sklearn.model_selection import train_test_split
● X_train, X_test, y_train, y_test
= train_test_split(scaled_features, df['TARGET CLASS'],
test_size=0.30)

##Using KNN
Remember that we are trying to come up with a model to predict whether
someone will TARGET CLASS or not. We'll start with k=1.

●​ from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
○ KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
●​ pred = knn.predict(X_test)
##Predictions and Evaluations
Let's evaluate our KNN model!

● from sklearn.metrics import classification_report, confusion_matrix


●​ print(confusion_matrix(y_test,pred))
[[125 18]
[ 13 144]]
●​ print(classification_report(y_test,pred))

○​

Choosing a K value :

● error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

● plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
○​
## Here we can see that after around K > 23 the error rate just tends to hover
around 0.06-0.05. Let's retrain the model with that and check the classification
report!

# FIRST A QUICK COMPARISON TO OUR ORIGINAL K=1

● knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

○​

# NOW WITH K=23

● knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print('WITH K=23')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
​ ​ ## here error rate has decreased
○​

Working of K nearest

The K value should be odd; with an even K there can be a tie (equal no of pts from each class).

Regression use case :
There is no category; the output is the average (mean) of all the nearest neighbours,
e.g. K = 5.

KNN is also affected by outliers.

__

Ensemble technique :
Combining multiple models
1.​Bagging(bootstrap aggregation)
a.​Random forest
2.​Boosting
a.​ADABOOST
b.​GRADIENT BOOSTING
c.​ Xgboost
1.​Bagging

M1, M2 -> base models

For each model we provide a sample of the records.
For the next model we again resample the records, pick up another
sample and give it to that model → row sampling with replacement.
We use the above technique for all the other models.

The output is 0 / 1; here we use a voting classifier → the majority of the votes given by the
models is considered.

Random forest :
In random forest the models (in bootstrap agg ) are called decision trees

Decision tree to its depth -> low bias(get properly trained - error less) high
variance (prone to give large error with new test data ) → Overfitting

When we combine all the decision trees (individually with high variance) the high
variance is converted to low variance .

Classifier → majority o/p


In regression problem → we take mean / median of the o/p

Hyperparameter → it helps us decide how many DTs we need (see the sketch below)
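A minimal sketch of bagging with a random forest in scikit-learn; n_estimators is the hyperparameter controlling how many decision trees are built, and the toy data is assumed:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # assumed data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 decision trees in parallel
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # majority vote of the trees → lower variance than a single deep tree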


__

Handling imbalanced dataset :


Credit card kaggle
credit card companies are able to recognize fraudulent credit card transactions
so that customers are not charged for items that they did not purchase.
Undersampling : reduce the number of observations from all classes except the
minority class. The minority class is the one with the least number of observations.​
If we have 800 yes and 50 no, we take only 50 yes at random.
Used when we have huge numbers of records (~10^6).
Code :
●​ import numpy as np
import pandas as pd
import sklearn
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]

● data = pd.read_csv('creditcard.csv', sep=',')

data.head()

○​
Class → dependent feature
Remaining → independent feature

0 - normal
1 - fraudulent

●​ data.info()
○​

●​ #Create independent and Dependent Features


columns = data.columns.tolist()
# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting
target = "Class"
# Define a random state
state = np.random.RandomState(42)
X = data[columns]
Y = data[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

○​ (284807, 30)
(284807,)

●​ count_classes = pd.value_counts(data['Class'], sort = True)


count_classes.plot(kind = 'bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency")

○​ Text(0, 0.5, 'Frequency')


○​
// fraud transactions are very less
## Get the Fraud and the normal dataset
●​ fraud = data[data['Class'] == 1]
normal = data[data['Class'] == 0]
print(fraud.shape,normal.shape)
○​ (492, 31) (284315, 31)
***
There are 492 fraud records → so we will take 492 out of
284315 genuine records
●​ from imblearn.under_sampling import NearMiss
​ # Implementing Undersampling for Handling Imbalanced
nm = NearMiss(random_state=42)
X_res,y_res=nm.fit_sample(X,Y)
●​ X_res.shape,y_res.shape
○​ ((984, 30), (984,))

● from collections import Counter

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
○​ Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 492, 1: 492})

Imbalanced dataset - oversampling


increase the amount of minority sample .
Continue :
●​ from imblearn.combine import SMOTETomek
from imblearn.under_sampling import NearMiss

●​ # Implementing Oversampling for Handling Imbalanced


smk = SMOTETomek(random_state=42)
X_res,y_res=smk.fit_sample(X,Y)
X_res.shape,y_res.shape
○​ ((567562, 30), (567562,))
from collections import Counter
print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_res)))
○​ Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 283781, 1: 283781})

●​ ## RandomOverSampler to handle imbalanced data


from imblearn.over_sampling import RandomOverSampler
● os = RandomOverSampler(ratio=1)
## If we have 1 = 500 and 0 = 100 and ratio is 1 → add 400 more records to the 0s.
● X_train_res, y_train_res = os.fit_sample(X, Y)
X_train_res.shape, y_train_res.shape
○​ ((568630, 30), (568630,))
●​ print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'. format(Counter(y_train_res)))

○​ Original dataset shape Counter({0: 284315, 1: 492})


Resampled dataset shape Counter({0: 284315, 1: 284315})

Hyper parameter optimization for Xgboost using randomizedsearchCV


●​ import pandas as pd
## Read the Dataset
df=pd.read_csv('Churn_Modelling.csv')
df.head()
Whether the person has credit card, will he exit the bank or not .
Line 34 → axis =1 → appended column wise

How to install xg boost :


Open anaconda prompt
Enter → pip install xgboost

Randomized search:
Line 40 works over various parameter combinations and tries to find out for which
parameters XGBoost works best.

Line 36 : we take those parameters which are present inside the XGBClassifier.
We give various values, e.g. learning rate [0.05, 0.10, 0.15]; the randomized search
algo will try permutations & combinations of these values.
We shouldn't lower the learning rate beyond 0.05, otherwise it will lead to an
overfitting condn and more training time.
Gamma and colsample_bytree should be less than 1.

Line 41 : verbose → gives messages about the time and status of the job etc.

Line 42 : timer → how much time it takes to run RandomizedSearchCV (a rough sketch of such a cell is given below).
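The notes describe the code only by line numbers; a rough sketch of what such a cell could look like is given here. The parameter grid, n_iter and scoring values are assumptions in the spirit of the notes, not the exact notebook:

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

# hypothetical parameter grid (values assumed)
params = {
    'learning_rate': [0.05, 0.10, 0.15, 0.20],
    'max_depth': [3, 4, 5, 6, 8],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0.0, 0.1, 0.2, 0.3],
    'colsample_bytree': [0.3, 0.5, 0.7],
}

classifier = XGBClassifier()
random_search = RandomizedSearchCV(classifier, param_distributions=params,
                                   n_iter=5, scoring='roc_auc', cv=5, verbose=3)
# random_search.fit(X, y)          # X, y = your prepared features / target
# random_search.best_estimator_    # best parameter combination found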

Data insertion into the database happens after feature engineering.

Cluster model . based on no of cluster we will create the models


Hyper parameter tuning - create indep. Models

Anaconda new environment python → on google → copy in command prompt


Visibility Climate Prediction- You Can Add This In Your Resume

Manhattan distance – houses in blocks


Euclidean distance – flight path – hypotenuse

PCA (principal component analysis) :


– unsupervised ML .
– lower the no of dimensions .
– 2 d (multiple points in graph ) to 1 d (single line )
– first line on which the points are projected → principal component 1 (PC1)
– PC2 is perpendicular to PC1; if the points are far away from the line → more
variance (information) is lost on projection → we try to reduce that variance / information loss

Standard normalization : check whether our attributes follow a Gaussian / normal

distribution.
If not, we pass each value through the formula (xᵢ - μ) / σ, where μ is the mean and σ the standard deviation of that column; it is then
converted to a standard normal distribution (μ = 0 & σ = 1) ⇒ StandardScaler.
We do this because the values may be in different units, so we need to rescale
the values into the same unit.
Line 35 : n_components=2 → we wanted to convert the 30 features into 2 features.
Reducing the dimension is part of data pre-processing.

Auto encoders and decoders are used for dimensionality reduction in deep
learning, but in classical machine learning we use PCA.
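A short sketch of the rescale-then-reduce idea on the breast cancer dataset (the notes mention the cancer dataset and n_components=2; the rest of this cell is an assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_breast_cancer().data                 # 30 features measured in different units
X_scaled = StandardScaler().fit_transform(X)  # (x - mean) / std → mean 0, std 1 per column

pca = PCA(n_components=2)                     # 30 features → 2 principal components
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)                            # (569, 2)
print(pca.explained_variance_ratio_)          # variance captured by PC1 and PC2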

Types of cross validation :


Dataset -- 1000 records -- we perform a train test split -- 70% (randomly chosen) to train
the model and 30% to test its accuracy.
Remember : whenever we do a train test split we set a random state.
Eg : if we put random_state = 0 then acc = 85%; if we change it to 5 then acc
changes to 87%, so we don't know the exact accuracy of our model (it fluctuates), and therefore
we use cross validation.

Types of cross validation .


1.​Leave one out CV (LOOCV) : if we have 1000 records,
in exp 1, out of all records we take one record at a time as our test set and the
remaining records as the train dataset.
Similarly for exp 2, we take only the next single record as test, the rest as train.
Drawback : we have to run 1000 iterations.
Leads to low bias (but is very expensive).
2.​K fold CV : if we have 1000 records,

we select K = 5, which is the no of experiments and decides the test
data size: 1000 / 5 = 200. So the first 200 records are the test data in exp 1; similarly we select
the next 200 for exp 2, and so on.
Then we take the mean of the accuracies.
But the first 200 records may contain a different class mix than the later ones, which
makes it imbalanced, so we use the next method.
3.​Stratified CV : whenever test data is selected we have to make sure that
the no of instances of each class for each exp is taken in a proper way .
Eg : if we have 1000 records . 600 yes & 400 no -- imbalanced dataset .
So we will have good proportion of yes and no
4.​Time series CV : we can't do a random train test split on future stock prices, for example,
since the data is ordered in time (see the worked example and sketch below).

Day 1, Day 2, Day 3, Day 4, Day 5 → convert it to a sliding window:

i/p : d1 d2 d3 d4 d5 → o/p : d6

Similarly, if we want the prediction for day 7 we take day 2 to day 6 as input; day 1 is
removed.
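A small sketch of K-fold / stratified / time-series cross-validation with scikit-learn; the toy data and model are assumed:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, TimeSeriesSplit

X, y = make_classification(n_samples=1000, random_state=0)   # assumed data
model = LogisticRegression(max_iter=1000)

# K fold CV: K = 5 experiments, accuracy is averaged instead of depending on one random split;
# StratifiedKFold keeps the yes/no proportions the same in every fold
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print(scores, scores.mean())

# Time series CV: the training window always comes before the test window, like the sliding-window example above
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(np.arange(20)):
    print("train:", train_idx, "test:", test_idx)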
