Python Unit 5
Data Wrangling
Classes in Scikit-learn
o Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately.
o Scikit-learn is the package for machine learning and data science experimentation favored by
most data scientists.
o It contains a wide range of well-established learning algorithms, error functions, and testing procedures.
o Install : conda install scikit-learn
o There are four class types covering all the basic machine-learning functionalities:
o Classifying
o Regressing
o Grouping by clusters
o Transforming data
o Even though each base class has its own specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by a shared set of methods and attributes, such as fit() and predict(), as sketched below.
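Because every estimator exposes the same core methods, different models can be swapped in with identical code. A minimal sketch (the estimators and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Split an example dataset once, then reuse it for any estimator.
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True))
for ModelClass in (DecisionTreeClassifier, KNeighborsClassifier):
    model = ModelClass()                # create the estimator
    model.fit(X_train, y_train)         # every estimator exposes fit()
    print(model.score(X_test, y_test))  # every classifier exposes score()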
Choosing a correct algorithm in Scikit-learn
[Flowchart: scikit-learn's algorithm cheat-sheet. Start by checking whether you have more than 50 samples (if not, get more data). Then branch on whether you are predicting a category or a quantity and whether you have labeled data, with further branches on sample counts (fewer than 100K or 10K samples) and on whether only a few features are important. The paths lead to estimators such as SVC, the SGD Classifier, linear regressors, the SGD Regressor, clustering algorithms (when the number of categories is known), and Randomized PCA.]
Supervised vs. Unsupervised Learning

Aspect              | Supervised Learning                        | Unsupervised Learning
Real time           | Uses off-line analysis                     | Uses real-time analysis of data
Number of classes   | Number of classes is known                 | Number of classes is not known
Accuracy of results | Accurate and reliable results              | Moderately accurate and reliable results
Output data         | Desired output is given                    | Desired output is not given
Model               | It is not possible to learn larger and more complex models than with unsupervised learning | It is possible to learn larger and more complex models than with supervised learning
Training data       | Training data is used to infer the model   | Training data is not used
Another name        | Also called classification                 | Also called clustering
Test of model       | We can test our model                      | We cannot test our model
Example             | Optical character recognition              | Finding a face in an image
Machine Learning Process for Supervised Learning
[Diagram: the dataset is split into training data and testing data; the training data is used to train the model, and the testing data is used to test it.]
Structure of Scikit-Learn
01 (Import)
from sklearn.family import Model
Example
from sklearn.linear_model import LinearRegression
02 (Create Object)
m = Model(parameters)
Example
lr = LinearRegression()
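03 (Split Data)
The X_train and y_train used in the training step come from splitting the data into training and testing sets; a minimal sketch (assuming features X and labels y):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)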
04 (Train Model)
m.fit(X_train, y_train)  # for supervised learning
OR
m.fit(X_train)  # for unsupervised learning
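05 (Predict)
Once trained, the model makes predictions on new data; these predictions feed the evaluation step that follows. A minimal sketch:
predictions = m.predict(X_test)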
06 (Evaluate Model)
• Evaluate the model and update the parameters if required.
• Evaluation techniques differ for regression, classification, and clustering.
• We will explore all of these techniques in later sessions.
Example (using the LinearRegression object lr from step 02 on a DataFrame df assumed to have 'iphonenumber' and 'price' columns):

import matplotlib.pyplot as plt

lr.fit(X, y)               # train on the data
predicted = lr.predict(X)  # predict on the same data
plt.scatter(df['iphonenumber'], df['price'], c='r')  # actual values in red
plt.scatter(df['iphonenumber'], predicted, c='b')    # predicted values in blue
plt.show()
Support Vector Regression (SVR) (Example)
Assuming 1-D NumPy arrays X and y; the model here is assumed to be an SVR instance:
from sklearn.svm import SVR
model = SVR()                   # create the support vector regressor
model.fit(X.reshape(-1, 1), y)  # reshape X into a 2-D column vector
predictPoly = model.predict(X.reshape(-1, 1))
plt.scatter(X, y, c='r')            # actual values
plt.scatter(X, predictPoly, c='b')  # predicted values
Bag of Words Model (Topic from Unit-03)
Before we can perform analysis on textual data, we must tokenize every word within the
dataset.
The act of tokenizing the words creates a bag of words.
The creation of a bag of words revolves around Natural Language Processing (NLP).
Before we can do NLP, we should perform the processes specified below.
Remove special characters
Such as HTML tags and other special characters
Remove the stop words
Some of the English stop words are a, an, the, are, at, this, is, etc.
Stemming words
Stemming a word means removing its suffixes and prefixes, reducing it to its root form; a short sketch follows the example.
Example: Playing, Plays, Played → Play
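A minimal sketch of stemming using NLTK's PorterStemmer (assumes the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['Playing', 'Plays', 'Played']:
    print(stemmer.stem(word))  # each prints 'play'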
Hashing Trick
Most machine learning algorithms use numeric inputs. If our data contains text instead, we first need to convert that text into numeric values; this can be done using the hashing trick.
For example, suppose our dataset is the first table below; the second table shows the same data after encoding:

Id  Salary  Gender
1   10000   Male
2   15000   Female
3   12000   Male
4   11000   Female
5   20000   Male

Id  Salary  Gender  Gender_Numeric
1   10000   Male    1
2   15000   Female  0
3   12000   Male    1
4   11000   Female  0
5   20000   Male    1
We cannot apply machine learning algorithms to text values like Male/Female; we need numeric values instead. One thing we can do is assign a number to each word and replace the word with that number.
If we assign 1 to Male and 0 to Female, we can use ML algorithms; the same dataset in numeric form is shown above, and a short sketch follows.
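A minimal sketch of this encoding with pandas (the DataFrame mirrors the table above):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Salary': [10000, 15000, 12000, 11000, 20000],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})
df['Gender_Numeric'] = df['Gender'].map({'Male': 1, 'Female': 0})
print(df)

For larger vocabularies, scikit-learn's FeatureHasher and HashingVectorizer implement the hashing trick at scale.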
CountVectorizer
CountVectorizer accepts many parameters that can be very useful when dealing with textual data; some of the important ones are:
max_features
Builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
stop_words
Removes stop words; the default is None, and we can set it to 'english' to remove English stop words.
ngram_range
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be
extracted.
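A minimal sketch combining these parameters on a tiny, illustrative corpus:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat', 'the dog sat on the log']
cv = CountVectorizer(max_features=10, stop_words='english', ngram_range=(1, 2))
bag = cv.fit_transform(corpus)     # sparse document-term matrix
print(cv.get_feature_names_out())  # the learned vocabulary
print(bag.toarray())               # token counts per document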
timeit (Magic Command in Jupyter Notebook)
o We can find the time taken to execute a statement or a cell in a Jupyter Notebook with the
help of timeit magic command.
o This command can be used both as a line and cell magic:
o In line mode, you can time a single-line statement (though multiple statements can be chained using semicolons).
o In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body
of the cell is timed. The cell body has access to any variables created in the setup code.
o Syntax :
o Line : %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
o Cell : %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
         code...
o Here, the -n flag specifies the number of loops and the -r flag the number of repeats.
o Example :
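A hedged example; the statements being timed are illustrative:

%timeit -n 100 -r 5 sum(range(10000))

%%timeit data = list(range(10000))
# the setup on the magic line above is executed but not timed; this body is timed
total = 0
for value in data:
    total += value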
Memory Profiler
This is a Python module for monitoring memory consumption of a process, as well as line-by-line analysis of memory consumption for Python programs.
Installation : pip install memory_profiler
Usage :
First load the profiler by writing %load_ext memory_profiler in the notebook.
Then prefix each statement whose memory you want to monitor with %memit.
Example :
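A hedged example; the list comprehension is illustrative:

%load_ext memory_profiler
%memit squares = [x ** 2 for x in range(1_000_000)]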
Running in Parallel
Most computers today are multicore (two or more processors in a single package), some
with multiple physical CPUs.
One of the most important limitations of Python is that it uses a single core by default. (It
was created in a time when single cores were the norm.)
Data science projects require quite a lot of computations.
Part of the scientific aspect of data science relies on repeated tests and experiments on
different data matrices.
Also the data size may be huge, which will take lots of processing power.
Using more CPU cores accelerates a computation by a factor that almost matches the
number of cores.
For example, having four cores would mean working at best four times faster.
But in practice we will not get exactly four-times-faster processing, because distributing work across cores adds some overhead.
So parallel processing works best with huge datasets or where extreme processing power is required.
Running in Parallel (Cont.)
In the Scikit-learn library we do not need to do any special programming for parallel (multiprocessing) processing; we can simply use the n_jobs parameter.
If we set n_jobs to a number greater than 1, it will use that many processors; if we set it to -1, it will use all processors available on the system.
Let's see an example of how we can achieve parallel processing in the Scikit-learn library.
Using Single Processor
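A minimal sketch (the dataset and classifier are illustrative) that times ten-fold cross-validation on a single processor and then on all available processors:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
%timeit cross_val_score(SVC(), X, y, cv=10, n_jobs=1)   # single processor
%timeit cross_val_score(SVC(), X, y, cv=10, n_jobs=-1)  # all available processors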
We can also use another pandas built-in function, describe(), to obtain insights into the data.
valueCount.py
print(df.describe())
Visualization for EDA
We can use many types of graphs; some of them are listed below.
If we want to show how various data elements contribute towards a whole, we should use a pie chart.
If we want to compare data elements, we should use a bar chart.
If we want to show the distribution of elements, we should use histograms.
If we want to depict groups in elements, we should use boxplots.
If we want to find patterns in data, we should use scatterplots.
If we want to display trends over time, we should use a line chart.
If we want to display geographical data, we should use Basemap.
If we want to display a network, we should use NetworkX.
In addition to all these graphs, we can also use the seaborn library to plot advanced graphs (see the sketch after this list), such as:
Countplot
Heatmap
Distplot
Pairplot
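A minimal sketch of two of these; the DataFrame here is illustrative:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
                   'Salary': [10000, 15000, 12000, 11000, 20000],
                   'Age': [25, 30, 35, 28, 40]})
sns.countplot(x='Gender', data=df)       # one bar per category
plt.show()
sns.heatmap(df.corr(numeric_only=True))  # correlation heatmap of numeric columns
plt.show()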
Covariance and Correlation
o Covariance
o Covariance is a measure used to determine how much two variables change in tandem.
o The unit of covariance is a product of the units of the two variables.
o Covariance is affected by a change in scale; the value of covariance lies between -∞ and +∞.
o Pandas has a built-in DataFrame function named cov() to find covariance.
o Correlation
o The correlation between two variables is a normalized version of the covariance.
o The value of the correlation coefficient is always between -1 and 1.
o Once we’ve normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare
correlations.
covCorr.py
print(df.cov())
print(df.corr())
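A self-contained sketch of the two calls (the DataFrame values are illustrative):

import pandas as pd

df = pd.DataFrame({'hours': [1, 2, 3, 4, 5],
                   'score': [52, 55, 61, 70, 74]})
print(df.cov())   # covariance matrix, in the variables' units
print(df.corr())  # correlation matrix, normalized to [-1, 1]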