Unit 5
Wrangling
Weightage: 20 Marks
Chapters covered from the book
Linear Regression: Mathematical Implementation
The regression line is y = mx + c.
For the data used below (x = [1,2,3,4,5], y = [3,4,2,4,5]) the slope is
m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 0.4
The regression line passes through the point of means, so put x̄ = 3, ȳ = 3.6 and m = 0.4 in
c = ȳ − m·x̄ = 3.6 − (0.4)·3 = 3.6 − 1.2 = 2.4
Regression equation: with m = 0.4 and c = 2.4,
y = 0.4x + 2.4
Using the regression equation we can calculate the value of y from any value of x.
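As a quick check, the slope and intercept formulas above can be evaluated directly with NumPy (a minimal sketch; the variable names are our own):

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 2, 4, 5])
# slope: m = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# intercept: c = ybar - m * xbar
c = y.mean() - m * x.mean()
print(m, c)  # 0.4 2.4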
For the R² score returned by score() below:
y – actual value
yp – predicted value
ȳ (ybar) – mean value
R² = 1 − Σ(y − yp)² / Σ(y − ȳ)²
• Methods:
1. object.fit(N-dimension array X, array Y)
Here the values of the x and y points are passed to perform linear regression.
2. ypre = object.predict(X) : returns predicted values of y based on the fitted linear regression equation y = mx + c, by applying the values of m, x and c.
3. object.score(X, Y) : returns the R² score, which measures how close the predicted points are to the actual points.
Implementation using python
x=[1,2,3,4,5]
y=[3,4,2,4,5]
import pandas as pd
d={'X':x,'Y':y}
df=pd.DataFrame(d)
print(df)
from sklearn.linear_model import LinearRegression
X = df['X'].values.reshape(-1,1)
# reshape(-1,1): the -1 lets NumPy infer the number of rows from the data
Y = df['Y'].values.reshape(-1,1)
h=LinearRegression()
h.fit(X,Y)
print(h.coef_) #value of m
print(h.intercept_)# value of c
ypre=h.predict(X)
print(ypre)
print(h.score(X,Y)) # R² score: how close predictions are to the actual points
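The score can be reproduced from the R² formula above (a minimal sketch continuing the variables already defined):

ss_res = ((Y - ypre) ** 2).sum()      # sum of squared residuals
ss_tot = ((Y - Y.mean()) ** 2).sum()  # total sum of squares around the mean
print(1 - ss_res / ss_tot)            # matches h.score(X, Y)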
Linear Regression of the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2; this example needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)
from sklearn.linear_model import LinearRegression
# Note: the normalize parameter has also been removed from recent scikit-learn.
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X,y)
print (hypothesis.coef_)
print(hypothesis.intercept_)
print(hypothesis.score(X,y))
We print the keys of the boston dataset to understand what it contains. print(boston.keys()) gives:
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
•data: contains the information for various houses
•target: prices of the house
•feature_names: names of the features
•DESCR: describes the dataset
To know more about the features use boston.DESCR. The description of all the features is given below:
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
The price of the house, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables based on which we will predict the value of a house.
We now load the data into a pandas DataFrame using pd.DataFrame and print the first 5 rows using head(), as sketched below.
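A minimal sketch of the step just described, reusing the boston object loaded above:

import pandas as pd
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # add the target prices as a column
print(df.head())            # first 5 rows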
Execute code on multicore instead of single core – Multiprocessing
1. sklearn.datasets.load_digits() : loads the digits dataset from sklearn
2. sklearn.model_selection.cross_val_score(regression-object, X-independent, y-dependent, cv=number, n_jobs=1/-1)
• cross_val_score returns the R² score for each of the folds specified by cv (cross-validation)
• cv = number of folds (groups of data). Example: cv=5
• n_jobs=1 – single processor; n_jobs=-1 – multicore
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
digits = load_digits()
X, y = digits.data, digits.target
print(X.shape)
print(y.shape)
l = LinearRegression()
single_core_learning = cross_val_score(l, X, y, cv=20, n_jobs=1)
#print(single_core_learning)
OUTPUT:
(1797, 64)
(1797,)
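To actually use multiple cores, the same call is repeated with n_jobs=-1; in a Jupyter Notebook the two can be compared with the %timeit magic (a sketch; the timings will vary by machine):

%timeit cross_val_score(l, X, y, cv=20, n_jobs=1)   # single core
%timeit cross_val_score(l, X, y, cv=20, n_jobs=-1)  # all available cores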
Need of hashing/feature hashing
Mod function: if we want to generate indexes only up to a specific number, we take the hash value modulo that number (the mod), so every index falls in the range 0 to mod−1.
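For example, Python's built-in hash() combined with mod keeps every index inside a fixed-size vector (a minimal sketch; hash() output differs between runs unless hash randomization is disabled):

vector_size = 20
for word in ['python', 'data', 'science']:
    # abs() guards against negative hash values; % keeps the index in 0..19
    print(word, abs(hash(word)) % vector_size)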
Hashing Collision Resolution: separate chaining
In separate chaining, all items whose keys hash to the same index are stored together in a list (chain) at that index.
Feature Hashing
OUTPUT:
{'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}
The words are indexed in alphabetical order: data=0, for=1, learning=2, machine=3, python=4, science=5.
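The mapping shown above is consistent with what CountVectorizer produces for the two example strings used later; a minimal sketch (the vocabulary_ attribute holds the word-to-index mapping):

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(['Python for data science', 'Python for machine learning'])
print(cv.vocabulary_)
# {'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}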
Limitations:
1. Requires more storage if the document is large (one index occupied per feature word).
2. Optimally encodes text into a matrix, but once the matrix is generated, new text cannot be inserted.
Bag-of-words one-hot encoding using hashing – user defined
string_1 = 'Python for data science'
string_2 = 'Python for machine learning'
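A user-defined hashing encoder along these lines (a minimal sketch; the function name hashing_trick is our own):

def hashing_trick(text, vector_size=20):
    # one slot per possible index; mod keeps indexes in 0..vector_size-1
    feature_vector = [0] * vector_size
    for word in text.lower().split():
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1  # binary one-hot encoding
    return feature_vector

print(hashing_trick(string_1))
print(hashing_trick(string_2))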
Remark: if the vector size is small, many words overlap to the same location in the list representing the feature vector. To keep overlap minimal, the mod value should be large enough.
HashingVectorizer
HashingVectorizer: a class that rapidly transforms any collection of text into a separate sparse data matrix using the hashing trick.
Package and module: sklearn.feature_extraction.text
bowh = HashingVectorizer(n_features=20, binary=True, norm=None)
• n_features: specifies the mod value, i.e. the maximum number of indexes generated. For example, 20 means indexes 0–19 will be used to store words (features).
• binary=True : binary encoding – 1 if the feature is present, else 0.
• norm=None : no normalization is applied to the output rows.
Sr. No 1
CountVectorizer: Requires more storage if the document is large (one index occupied per feature word).
HashingVectorizer: We can generate a user-defined number of features with the n_features parameter. No storage limitation, but it suffers from collisions.
Sr. No 2
CountVectorizer: Optimally encodes text into a matrix, but once the matrix is generated, new text cannot be inserted.
HashingVectorizer: We can insert new data after encoding.
Sr. No 3
CountVectorizer: Takes more time to execute.
HashingVectorizer: Takes less time compared to CountVectorizer.
HashingVectorizer
from sklearn.feature_extraction.text import *
bowh = HashingVectorizer(n_features=20, binary=True,norm=None)
bowhm = bowh.fit_transform(['Python for data science','Python for machine learning']).toarray()
print(bowhm)
bowhmn=bowh.transform(['new text has inserted']).toarray()
print(bowhmn)
OUTPUT:
[[0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
Time complexity of CountVectorizer vs HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
bowh = HashingVectorizer(n_features=20, binary=True, norm=None)
bowc = CountVectorizer()
texts = ['Python for data science','Python for machine learning']
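The comparison itself can be made in a Jupyter Notebook with the %timeit magic (a sketch; absolute numbers depend on the machine):

%timeit bowh.fit_transform(texts)  # HashingVectorizer
%timeit bowc.fit_transform(texts)  # CountVectorizer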
Memory Profiler
Useful to find out the memory usage (space complexity) of a particular statement of a program.
Package: memory_profiler
Commands to install (type in a Jupyter Notebook):
1. import sys
2. !{sys.executable} -m pip install memory_profiler
After installation completes, type the command below in the Jupyter Notebook:
3. %load_ext memory_profiler
To find out the memory requirement for the hashing matrix, use the magic function %memit as below:
%memit bowhm = bowh.fit_transform(texts).toarray()
1. Kurtosis shows whether the data distribution, especially its peak and tails, is of the right shape.
2. If the kurtosis is above zero, the distribution has a marked peak. If it is below zero, the distribution is too flat instead.
[Figure: distributions with positive, zero, and negative kurtosis]
Python program to find kurtosis
import numpy as np
x = np.concatenate((np.linspace(10,19,1), np.linspace(20,29,2),
                    np.linspace(30,39,9), np.linspace(40,49,2),
                    np.linspace(50,59,1)))
print(x.mean())
from scipy.stats import kurtosis
k = kurtosis(x)
print("kurtosis=", k)
import seaborn as sns
sns.distplot(x)  # distplot is deprecated in recent seaborn; histplot can be used instead
OUTPUT:
33.9
kurtosis= 0.7624441090238117
Positive kurtosis = Leptokurtic
68-95-99.7 Rule
For a normal distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
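A quick empirical check of the rule on simulated normal data (a sketch using NumPy's random generator):

import numpy as np
data = np.random.normal(0, 1, 100000)   # standard normal sample
for k in (1, 2, 3):
    share = np.mean(np.abs(data) <= k)  # fraction within k standard deviations
    print(k, 'std:', round(share * 100, 1), '%')  # ≈ 68, 95, 99.7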
Find out variance based on flower group
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
group0 = df['group'] == 'setosa'
print(group0.head())
group1 = df['group'] == 'versicolor'
group2 = df['group'] == 'virginica'
print('variance of group 0 is', df['petal length (cm)'][group0].var())
print('variance of group 1 is', df['petal length (cm)'][group1].var())
print('variance of group 2 is', df['petal length (cm)'][group2].var())
Hypothesis Testing
• Hypothesis testing evaluates two mutually exclusive statements about a population using a sample of data.
• Example of mutually exclusive statements in a court of law:
• 1. The defendant is innocent (Null Hypothesis – H0)
• 2. The defendant is guilty (Alternate Hypothesis – Ha)
• Steps (see the sketch after this list):
• 1. Make an initial assumption (Null Hypothesis H0)
• 2. Collect data (like fingerprints, DNA samples etc.)
• 3. Gather evidence to reject or not reject the Null Hypothesis
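The slides do not show code for this step; as an illustration, scipy's independent two-sample t-test can compare the petal lengths of two iris groups, with H0 being that the group means are equal (a hedged sketch, not the slides' own example):

from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
a = df['petal length (cm)'][iris.target == 0]  # setosa
b = df['petal length (cm)'][iris.target == 1]  # versicolor
t, p = ttest_ind(a, b)
print('t=', t, 'p=', p)  # a small p-value is evidence to reject H0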
• Function: pandas.plotting.parallel_coordinates(dataframe, label)
• Here label is one of the columns of the dataframe, used as the label.
Observing parallel coordinates of iris dataframe
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = pd.Series([iris.target_names[k] for k in iris.target], dtype='category')
df['group'] = iris.target
pl = parallel_coordinates(df, 'label')
Graphing Distribution using histogram and density plot
Functions for density plot and histogram
• Function for density plot:
• pandas.DataFrame.plot(kind='density')
• Or
• pandas.DataFrame['column-name'].plot(kind='density')
• Function for histogram:
• pandas.DataFrame.plot(kind='hist')
• Or
• pandas.DataFrame['column-name'].plot(kind='hist')
Histogram and density plot of iris dataframe
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['group'] = pd.Series([iris.target_names[k] for k in iris.target], dtype='category')
df.plot(kind='density')
df.plot(kind='hist')
Plotting scatterplots and scatter matrix
Function for scatter plot
• Function for scatter plot:
• pandas.DataFrame.plot(kind='scatter', x='column-name', y='column-name', c=colorlist)
• It plots a scatterplot of the x and y columns; to represent different groups we can use different colors.
Function for scatter plot
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
palette = {0:'red', 1:'yellow', 2:'blue'}
color = [palette[k] for k in df['group']]
print(color)
df.plot(kind='scatter',x='petal length (cm)',y='petal width (cm)',c=color)
Function for scatter matrix
• Useful for plotting scatter plots of all columns of the dataframe vs all columns.
• At the diagonal positions we can plot either a histogram or a density plot.
Function:
pandas.plotting.scatter_matrix(dataframe, figsize=(width, height), color=colorlist, diagonal='hist'/'kde')
PARAMETERS
• figsize (float, float), optional
A tuple (width, height) in inches.
• diagonal {'hist', 'kde'}
Pick between 'kde' and 'hist' for either Kernel Density Estimation or Histogram plot in the diagonal.
Scatter matrix for all columns vs all columns of iris
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
palette = {0:'red', 1:'yellow', 2:'blue'}
colorlist = [palette[k] for k in df['group']]
scatter_matrix(df,figsize=(6,6),color=colorlist,diagonal='hist')
Correlation and Covariance
Covariance
• Covariance and correlation are terms used in statistics to measure relationships between two random variables. Both of these terms measure the linear dependency between a pair of random variables.
Quantify the relationship between size and price
Size <--> Price?
As size increases, price increases; as size decreases, price decreases.
Size of house   Price
1200 sqm        100k$
1500 sqm        200k$
1800 sqm        300k$
Covariance
If x increases and y also increases, the covariance will be positive.
If x increases and y decreases, the covariance will be negative.
• Cov(X,Y) = (1/(n−1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
• For a variable with itself, Cov(X,X) = (1/(n−1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)², which is the variance.
Function:
pandas.DataFrame.cov()
Covariance is useful to quantify the relation between random variables.
It gives the direction: either positive or negative.
Correlation
Correlation normalizes covariance by the two standard deviations, ρ(x,y) = Cov(X,Y) / (σx·σy), so it is bounded:
−1 <= ρ(x,y) <= 1
Function:
pandas.DataFrame.corr()
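A minimal sketch applying both functions to the iris dataframe built earlier (assuming df, with its all-numeric columns, is still in scope):

print(df.cov())   # covariance matrix: the sign gives the direction of each relation
print(df.corr())  # correlation matrix: values bounded between -1 and 1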