Unit 5
Wrangling
Weightage: 20 Marks
Chapters covered from the book
Linear Regression: Mathematical Implementation
The regression line is y = mx + c.
For the data used below (x = [1,2,3,4,5], y = [3,4,2,4,5]) the slope is
m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 0.4
The regression line passes through the point of means, so put x̄ = 3, ȳ = 3.6 and m = 0.4 in
c = ȳ − m·x̄ = 3.6 − (0.4)·3 = 3.6 − 1.2 = 2.4
Regression equation: with m = 0.4 and c = 2.4,
y = 0.4x + 2.4
Using the regression equation we can calculate the value of y from any value of x.
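As a quick check, the slope and intercept formulas above can be evaluated directly with NumPy (a minimal sketch; the variable names are our own):

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 2, 4, 5])
# slope: m = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# intercept: c = ybar - m * xbar
c = y.mean() - m * x.mean()
print(m, c)  # 0.4 2.4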
For the R² score returned by score() below:
y – actual value
yp – predicted value
ȳ (ybar) – mean value
R² = 1 − Σ(y − yp)² / Σ(y − ȳ)²
• Methods:
1. object.fit(N-dimension array X, array Y)
Here the values of the x and y points are passed to perform linear regression.
2. ypre = object.predict(X) : returns predicted values of y based on the fitted linear regression equation y = mx + c, by applying the values of m, x and c.
3. object.score(X, Y) : returns the R² score, which measures how close the predicted points are to the actual points.
Implementation using python
x=[1,2,3,4,5]
y=[3,4,2,4,5]
import pandas as pd
d={'X':x,'Y':y}
df=pd.DataFrame(d)
print(df)
from sklearn.linear_model import LinearRegression
X = df['X'].values.reshape(-1,1)
# reshape(-1,1): the -1 lets NumPy infer the number of rows from the data
Y = df['Y'].values.reshape(-1,1)
h=LinearRegression()
h.fit(X,Y)
print(h.coef_) #value of m
print(h.intercept_)# value of c
ypre=h.predict(X)
print(ypre)
print(h.score(X,Y)) # R² score: how close predictions are to the actual points
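The score can be reproduced from the R² formula above (a minimal sketch continuing the variables already defined):

ss_res = ((Y - ypre) ** 2).sum()      # sum of squared residuals
ss_tot = ((Y - Y.mean()) ** 2).sum()  # total sum of squares around the mean
print(1 - ss_res / ss_tot)            # matches h.score(X, Y)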
Linear Regression of the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2; this example needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)
from sklearn.linear_model import LinearRegression
# Note: the normalize parameter has also been removed from recent scikit-learn.
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X,y)
print (hypothesis.coef_)
print(hypothesis.intercept_)
print(hypothesis.score(X,y))
We print the keys of the boston dataset to understand what it contains. print(boston.keys()) gives:
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
•data: contains the information for various houses
•target: prices of the house
•feature_names: names of the features
•DESCR: describes the dataset
To know more about the features use boston.DESCR. The description of all the features is given below:
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
The price of the house, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables based on which we will predict the value of a house.
We now load the data into a pandas DataFrame using pd.DataFrame and print the first 5 rows using head(), as sketched below.
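A minimal sketch of the step just described, reusing the boston object loaded above:

import pandas as pd
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # add the target prices as a column
print(df.head())            # first 5 rows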
Execute code on multicore instead of single core – Multiprocessing
1. sklearn.datasets.load_digits() : loads the digits dataset from sklearn
2. sklearn.model_selection.cross_val_score(regression-object, X-independent, y-dependent, cv=number, n_jobs=1/-1)
• cross_val_score returns the R² score for each of the folds specified by cv (cross-validation)
• cv = number of folds (groups of data). Example: cv=5
• n_jobs=1 – single processor; n_jobs=-1 – multicore
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
digits = load_digits()
X, y = digits.data, digits.target
print(X.shape)
print(y.shape)
l = LinearRegression()
single_core_learning = cross_val_score(l, X, y, cv=20, n_jobs=1)
#print(single_core_learning)
OUTPUT:
(1797, 64)
(1797,)
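To actually use multiple cores, the same call is repeated with n_jobs=-1; in a Jupyter Notebook the two can be compared with the %timeit magic (a sketch; the timings will vary by machine):

%timeit cross_val_score(l, X, y, cv=20, n_jobs=1)   # single core
%timeit cross_val_score(l, X, y, cv=20, n_jobs=-1)  # all available cores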
Need of hashing/feature hashing
Mod function: if we want to generate indexes only up to a specific number, we take the hash value modulo that number (the mod), so every index falls in the range 0 to mod−1.
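For example, Python's built-in hash() combined with mod keeps every index inside a fixed-size vector (a minimal sketch; hash() output differs between runs unless hash randomization is disabled):

vector_size = 20
for word in ['python', 'data', 'science']:
    # abs() guards against negative hash values; % keeps the index in 0..19
    print(word, abs(hash(word)) % vector_size)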
Hashing Collision Resolution: separate chaining
In separate chaining, all items whose keys hash to the same index are stored together in a list (chain) at that index.
Feature Hashing
OUTPUT:
{'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}
The words are indexed in alphabetical order: data=0, for=1, learning=2, machine=3, python=4, science=5.
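The mapping shown above is consistent with what CountVectorizer produces for the two example strings used later; a minimal sketch (the vocabulary_ attribute holds the word-to-index mapping):

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(['Python for data science', 'Python for machine learning'])
print(cv.vocabulary_)
# {'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}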
Limitations:
1. Requires more storage if the document is large (one index occupied per feature word).
2. Optimally encodes text into a matrix, but once the matrix is generated, new text cannot be inserted.
Bag-of-words one-hot encoding using hashing – user defined
string_1 = 'Python for data science'
string_2 = 'Python for machine learning'
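A user-defined hashing encoder along these lines (a minimal sketch; the function name hashing_trick is our own):

def hashing_trick(text, vector_size=20):
    # one slot per possible index; mod keeps indexes in 0..vector_size-1
    feature_vector = [0] * vector_size
    for word in text.lower().split():
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1  # binary one-hot encoding
    return feature_vector

print(hashing_trick(string_1))
print(hashing_trick(string_2))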
Remark: if the vector size is small, many words overlap to the same location in the list representing the feature vector. To keep overlap minimal, the mod value should be large enough.
HashingVectorizer
HashingVectorizer: a class that rapidly transforms any collection of text into a separate sparse data matrix using the hashing trick.
Package and module: sklearn.feature_extraction.text
bowh = HashingVectorizer(n_features=20, binary=True, norm=None)
• n_features: specifies the mod value, i.e. the maximum number of indexes generated. For example, 20 means indexes 0–19 will be used to store words (features).
• binary=True : binary encoding – 1 if the feature is present, else 0.
• norm=None : no normalization is applied to the output rows.
Sr. No 1
CountVectorizer: Requires more storage if the document is large (one index occupied per feature word).
HashingVectorizer: We can generate a user-defined number of features with the n_features parameter. No storage limitation, but it suffers from collisions.
Sr. No 2
CountVectorizer: Optimally encodes text into a matrix, but once the matrix is generated, new text cannot be inserted.
HashingVectorizer: We can insert new data after encoding.
Sr. No 3
CountVectorizer: Takes more time to execute.
HashingVectorizer: Takes less time compared to CountVectorizer.
HashingVectorizer
from sklearn.feature_extraction.text import *
bowh = HashingVectorizer(n_features=20, binary=True,norm=None)
bowhm = bowh.fit_transform(['Python for data science','Python for machine learning']).toarray()
print(bowhm)
bowhmn=bowh.transform(['new text has inserted']).toarray()
print(bowhmn)
OUTPUT:
[[0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
Time complexity of CountVectorizer vs HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
bowh = HashingVectorizer(n_features=20, binary=True, norm=None)
bowc = CountVectorizer()
texts = ['Python for data science','Python for machine learning']
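The comparison itself can be made in a Jupyter Notebook with the %timeit magic (a sketch; absolute numbers depend on the machine):

%timeit bowh.fit_transform(texts)  # HashingVectorizer
%timeit bowc.fit_transform(texts)  # CountVectorizer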
Memory Profiler
Useful to find out the memory usage (space complexity) of a particular statement of a program.
Package: memory_profiler
Commands to install (type in a Jupyter Notebook):
1. import sys
2. !{sys.executable} -m pip install memory_profiler
After installation completes, type the command below in the Jupyter Notebook:
3. %load_ext memory_profiler
To find out the memory requirement for the hashing matrix, use the magic function %memit as below:
%memit bowhm = bowh.fit_transform(texts).toarray()
1. Kurtosis shows whether the data distribution, especially its peak and tails, is of the right shape.
2. If the kurtosis is above zero, the distribution has a marked peak. If it is below zero, the distribution is too flat instead.
[Figure: distributions with positive, zero, and negative kurtosis]
Python program to find kurtosis
import numpy as np
x = np.concatenate((np.linspace(10,19,1), np.linspace(20,29,2),
                    np.linspace(30,39,9), np.linspace(40,49,2),
                    np.linspace(50,59,1)))
print(x.mean())
from scipy.stats import kurtosis
k = kurtosis(x)
print("kurtosis=", k)
import seaborn as sns
sns.distplot(x)  # distplot is deprecated in recent seaborn; histplot can be used instead
OUTPUT:
33.9
kurtosis= 0.7624441090238117
Positive kurtosis = Leptokurtic
68-95-99.7 Rule
For a normal distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
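A quick empirical check of the rule on simulated normal data (a sketch using NumPy's random generator):

import numpy as np
data = np.random.normal(0, 1, 100000)   # standard normal sample
for k in (1, 2, 3):
    share = np.mean(np.abs(data) <= k)  # fraction within k standard deviations
    print(k, 'std:', round(share * 100, 1), '%')  # ≈ 68, 95, 99.7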
Find out variance based on flower group
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
group0 = df['group'] == 'setosa'
print(group0.head())
group1 = df['group'] == 'versicolor'
group2 = df['group'] == 'virginica'
print('variance of group 0 is', df['petal length (cm)'][group0].var())
print('variance of group 1 is', df['petal length (cm)'][group1].var())
print('variance of group 2 is', df['petal length (cm)'][group2].var())
Hypothesis Testing
• Hypothesis testing evaluates two mutually exclusive statements about a population using a sample of data.
• Example of mutually exclusive statements in a court of law:
• 1. The defendant is innocent (Null Hypothesis – H0)
• 2. The defendant is guilty (Alternate Hypothesis – Ha)
• Steps (see the sketch after this list):
• 1. Make an initial assumption (Null Hypothesis H0)
• 2. Collect data (like fingerprints, DNA samples etc.)
• 3. Gather evidence to reject or not reject the Null Hypothesis
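The slides do not show code for this step; as an illustration, scipy's independent two-sample t-test can compare the petal lengths of two iris groups, with H0 being that the group means are equal (a hedged sketch, not the slides' own example):

from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
a = df['petal length (cm)'][iris.target == 0]  # setosa
b = df['petal length (cm)'][iris.target == 1]  # versicolor
t, p = ttest_ind(a, b)
print('t=', t, 'p=', p)  # a small p-value is evidence to reject H0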
• Function: pandas.plotting.parallel_coordinates(dataframe, label)
• Here label is one of the columns of the dataframe, used as the label.
Observing parallel coordinates of iris dataframe
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = pd.Series([iris.target_names[k] for k in iris.target], dtype='category')
df['group'] = iris.target
pl = parallel_coordinates(df, 'label')
Graphing Distribution using histogram and density plot
Functions for density plot and histogram
• Function for density plot:
• pandas.DataFrame.plot(kind='density')
• Or
• pandas.DataFrame['column-name'].plot(kind='density')
• Function for histogram:
• pandas.DataFrame.plot(kind='hist')
• Or
• pandas.DataFrame['column-name'].plot(kind='hist')
Histogram and density plot of iris dataframe
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['group'] = pd.Series([iris.target_names[k] for k in iris.target], dtype='category')
df.plot(kind='density')
df.plot(kind='hist')
Plotting scatterplots and scatter matrix
Function for scatter plot
• Function for scatter plot:
• pandas.DataFrame.plot(kind='scatter', x='column-name', y='column-name', c=colorlist)
• It plots a scatterplot of the x and y columns; to represent different groups we can use different colors.
Function for scatter plot
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
palette = {0:'red', 1:'yellow', 2:'blue'}
color = [palette[k] for k in df['group']]
print(color)
df.plot(kind='scatter',x='petal length (cm)',y='petal width (cm)',c=color)
Function for scatter matrix
• Useful for plotting scatter plots of all columns of the dataframe vs all columns.
• At the diagonal positions we can plot either a histogram or a density plot.
Function:
pandas.plotting.scatter_matrix(dataframe, figsize=(width, height), color=colorlist, diagonal='hist'/'kde')
PARAMETERS
• figsize (float, float), optional
A tuple (width, height) in inches.
• diagonal {'hist', 'kde'}
Pick between 'kde' and 'hist' for either Kernel Density Estimation or Histogram plot in the diagonal.
Scatter matrix for all columns vs all columns of iris
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
palette = {0:'red', 1:'yellow', 2:'blue'}
colorlist = [palette[k] for k in df['group']]
scatter_matrix(df,figsize=(6,6),color=colorlist,diagonal='hist')
Correlation and Covariance
Covariance
• Covariance and correlation are terms used in statistics to measure relationships between two random variables. Both of these terms measure the linear dependency between a pair of random variables.
Quantify the relationship between size and price
Size <--> Price?
As size increases, price increases; as size decreases, price decreases.
Size of house   Price
1200 sqm        100k$
1500 sqm        200k$
1800 sqm        300k$
Covariance
If x increases and y also increases, the covariance will be positive.
If x increases and y decreases, the covariance will be negative.
• Cov(X,Y) = (1/(n−1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
• For a variable with itself, Cov(X,X) = (1/(n−1)) · Σᵢ₌₁ⁿ (xᵢ − x̄)², which is the variance.
Function:
pandas.DataFrame.cov()
Covariance is useful to quantify the relation between random variables.
It gives the direction: either positive or negative.
Correlation
Correlation normalizes covariance by the two standard deviations, ρ(x,y) = Cov(X,Y) / (σx·σy), so it is bounded:
−1 <= ρ(x,y) <= 1
Function:
pandas.DataFrame.corr()
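A minimal sketch applying both functions to the iris dataframe built earlier (assuming df, with its all-numeric columns, is still in scope):

print(df.cov())   # covariance matrix: the sign gives the direction of each relation
print(df.corr())  # correlation matrix: values bounded between -1 and 1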