
Unit-05

Data Wrangling
Classes in Scikit-learn
o Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately.
o Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists.
o It contains a wide range of well-established learning algorithms, error functions and testing procedures.
o Install : conda install scikit-learn
o There are four class types covering all the basic machine-learning functionalities:
o Classifying
o Regressing
o Grouping by clusters
o Transforming data
o Even though each base class has specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by a common set of methods and attributes.
Choosing a correct algorithm in Scikit-learn
[Flowchart: the Scikit-learn algorithm cheat sheet. Starting from whether you have more than 50 samples, it branches on questions such as "Predicting a category?", "Predicting a quantity?", "Do you have labeled data?" and the number of samples (<100K, <10K), and leads to suggested estimators: for classification, Linear SVC, Naive Bayes (text data), KNeighbors Classifier, SVC, Ensemble Classifiers, SGD Classifier and kernel approximation; for regression, SGD Regressor, Lasso, ElasticNet, Ridge Regressor, SVR (kernel='linear' or 'rbf') and Ensemble Regressors; for clustering, KMeans, MiniBatch KMeans, Spectral Clustering, GMM, Mean Shift and VBGMM; for dimensionality reduction, Randomized PCA, Isomap, Spectral Embedding, LLE and kernel approximation.]
Supervised vs. Unsupervised Learning

Criterion | Supervised Learning | Unsupervised Learning
Input data | Uses known and labeled data as input | Uses unlabeled data as input
Computational complexity | Less computationally complex | More computationally complex
Real time | Uses off-line analysis | Uses real-time analysis of data
Number of classes | The number of classes is known | The number of classes is not known
Accuracy of results | Accurate and reliable results | Moderately accurate and reliable results
Output data | The desired output is given | The desired output is not given
Model | Cannot learn larger and more complex models than unsupervised learning | Can learn larger and more complex models than supervised learning
Training data | Training data is used to infer the model | Training data is not used
Another name | Also called classification | Also called clustering
Test of model | We can test our model | We cannot test our model
Example | Optical character recognition | Finding a face in an image
Machine Learning Process for Supervised Learning

[Flow diagram: Collecting data → Cleaning data → Splitting data into training and testing sets → Train Model on the training data → Test Model on the testing data → Update parameters as needed → Model deployment.]
Structure of Scikit-Learn
01 (import)
from sklearn.family import Model
Example
from sklearn.linear_model import LinearRegression

02 (create object)
m = Model(parameters)
Example
lr = LinearRegression()

03 (split data) (Only For Supervised Learning)


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

04 (Train Model)
m.fit(X_train,y_train) #For Supervised Learning
OR
m.fit(X_train) #For Unsupervised Learning

05 (Predict) (Only For Supervised Learning)


predictions = m.predict(X_test)

06 (Evaluate Model)
• Evaluate the model and update the parameters if required.
• The evaluation technique differs for regression, classification and clustering.
• We will explore all these techniques in later sessions.

Example Model Evaluation for Regression


from sklearn import metrics
print(metrics.mean_squared_error(y_test, predictions))
Linear Regression (Example) (iPhone Prices)
iPhoneLinReg.py
import pandas as pd
df = pd.read_csv('iphones.csv')

from sklearn.linear_model import LinearRegression


lr = LinearRegression()
X = df['iphonenumber'].values.reshape(-1,1) # Scikit-learn expects a 2-D feature array
y = df['price']

lr.fit(X,y)

lr.predict([[14]]) # predict the price of the next iPhone

predicted = lr.predict(X)

import matplotlib.pyplot as plt


%matplotlib inline

plt.scatter(df['iphonenumber'],df['price'],c='r')
plt.scatter(df['iphonenumber'],predicted,c='b')
Support Vector Regression (SVR) (Example)

import pandas as pd
import numpy as np

df = pd.read_csv('WorldBank.csv')

pop15 = df[df['Indicator Name']=='Population ages 15-64 (% of total population)']
X = np.arange(1960,2020)
y = pop15.values[0][4:-1]

from sklearn.svm import SVR


model = SVR()
model.fit(X.reshape(-1,1),y)

predictPoly = model.predict(X.reshape(-1,1))

import matplotlib.pyplot as plt


%matplotlib inline

plt.scatter(X,y,c='r')
plt.scatter(X,predictPoly,c='b')
Bag of Words Model (Topic from Unit-03)
 Before we can perform analysis on textual data, we must tokenize every word within the
dataset.
 The act of tokenizing the words creates a bag of words.
 The creation of a bag of words revolves around Natural Language Processing (NLP).
 Before we can do NLP, we should perform the processes specified below:
 Remove special characters
 Such as HTML tags and punctuation marks
 Remove the stop words
 Some of the English stop words are a, an, the, are, at, this, is, etc.
 Stemming words
 Stemming a word means removing its suffix and prefix.
 Example: Playing, Plays and Played all stem to Play (a code sketch follows).
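A minimal sketch of stemming, assuming the NLTK library and its PorterStemmer (not part of the original slides):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The stemmer lowercases each word and strips its suffix,
# so all three forms map to the stem 'play'
for word in ['Playing', 'Plays', 'Played']:
    print(word, '->', stemmer.stem(word))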
Hashing Trick
 Most machine learning algorithms use numeric inputs. If our data contains text instead, we need to convert that text into numeric values first; this can be done using the hashing trick.
 For example, suppose our dataset is something like the table below:

Id | Salary | Gender
1 | 10000 | Male
2 | 15000 | Female
3 | 12000 | Male
4 | 11000 | Female
5 | 20000 | Male

 We cannot apply machine learning algorithms to text like Male/Female; we need numeric values instead. One thing we can do here is assign a number to each word and replace the word with that number.
 If we assign 1 to Male and 0 to Female, we can use ML algorithms. The same dataset in numeric form:

Id | Salary | Gender | Gender_Numeric
1 | 10000 | Male | 1
2 | 15000 | Female | 0
3 | 12000 | Male | 1
4 | 11000 | Female | 0
5 | 20000 | Male | 1
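A minimal pandas sketch of this encoding; the DataFrame construction and the map() call are illustrative, not from the original slides:

import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'Salary': [10000, 15000, 12000, 11000, 20000],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
})

# Replace each category with its assigned number, as in the table above
df['Gender_Numeric'] = df['Gender'].map({'Male': 1, 'Female': 0})
print(df)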
CountVectorizer
CountVectorizer contains many parameters which can be very useful while dealing with textual data. Some of the important parameters are (a short sketch follows this list):
 max_features
 Builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
 stop_words
 Removes the stop words; the default is None, and we can set it to 'english' to remove English stop words.
 ngram_range
 The lower and upper boundary of the range of n-values for the different word n-grams or char n-grams to be extracted.
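A short sketch of these parameters in use; the toy corpus is illustrative, and get_feature_names_out() assumes Scikit-learn 1.0 or newer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The iPhone price is high',
    'The iPhone camera is great',
]

# Keep only the 5 most frequent terms, drop English stop words,
# and extract both unigrams (1-grams) and bigrams (2-grams)
cv = CountVectorizer(max_features=5, stop_words='english', ngram_range=(1, 2))
bag = cv.fit_transform(corpus)

print(cv.get_feature_names_out())  # the learned vocabulary
print(bag.toarray())               # term counts per document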
timeit (Magic Command in Jupyter Notebook)
o We can find the time taken to execute a statement or a cell in a Jupyter Notebook with the help of the timeit magic command.
o This command can be used both as a line magic and as a cell magic:
o In line mode you can time a single-line statement (though multiple statements can be chained using semicolons).
o In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
o Syntax :
o Line : %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o]
o Cell : %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o]
o Here, the -n flag represents the number of loops and the -r flag represents the number of repeats.
o Example :
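The slide's example is a screenshot; a minimal illustration of both modes (each snippet runs in its own notebook cell, and the statements being timed are arbitrary) might look like:

# Line mode: time a single statement, 100 loops per run, 5 repeats
%timeit -n 100 -r 5 sum(range(1000))

# Cell mode: the first body line is setup (executed but not timed),
# the remaining lines are timed
%%timeit
data = list(range(10000))
sorted(data)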
Memory Profiler
This is a Python module for monitoring the memory consumption of a process, as well as for line-by-line analysis of memory consumption in Python programs.
Installation : pip install memory_profiler
Usage :
 First load the profiler by writing %load_ext memory_profiler in the notebook.
 Then prefix each statement whose memory you want to monitor with %memit.
Example :
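The slide's example is a screenshot; a representative usage (the list comprehension is an arbitrary statement to measure) might be:

# Load the extension once per notebook session
%load_ext memory_profiler

# Report the memory increment and peak memory used by the statement
%memit [i ** 2 for i in range(1_000_000)]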
Running in Parallel
Most computers today are multicore (two or more processors in a single package), some with multiple physical CPUs.
One of the most important limitations of Python is that it uses a single core by default. (It was created in a time when single cores were the norm.)
Data science projects require quite a lot of computations.
Part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices.
Also, the data size may be huge, which demands a lot of processing power.
Using more CPU cores accelerates a computation by a factor that almost matches the number of cores.
 For example, having four cores would mean working at best four times faster.
 In practice, we will not get a full four-fold speedup, because assigning different cores to different processes takes some amount of time.
 So parallel processing works best with huge datasets or where extreme processing power is required.
Running in Parallel (Cont.)
 In the Scikit-learn library we do not need to do any special programming for parallel (multiprocessing) processing; we can simply use the n_jobs parameter.
 If we set a number greater than 1 in n_jobs, it will use that many processors; if we set n_jobs to -1, it will use all the available processors on the system.
 Let's see an example of how we can achieve parallel processing in the Scikit-learn library.
Using a single processor vs. using multiple processors:
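The original slides show timed screenshots; a representative sketch using cross-validation (the generated dataset and the choice of RandomForestClassifier are illustrative, not from the original slides):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Using a single processor
scores = cross_val_score(model, X, y, cv=5, n_jobs=1)

# Using all available processors
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)

Timing each call (for example with %timeit) shows the speedup on a multicore machine.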


Exploratory Data Analysis (EDA)
 Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
 EDA was developed at Bell Labs by John Tukey, a mathematician and statistician who wanted to promote more questions and actions on data based on the data itself.
 In one of his famous writings, Tukey said:
 "The only way humans can do BETTER than computers is to take a chance of doing WORSE than them."
 The statement above explains why, as a data scientist, your role and tools aren't limited to automatic learning algorithms but extend to manual and creative exploratory tasks.
 Computers are unbeatable at optimizing, but humans are strong at discovery by taking unexpected routes and trying unlikely but very effective solutions.
EDA (cont.)
 With EDA we,
 Describe data
 Closely explore data distributions
 Understand the relationships between variables
 Notice unusual or unexpected situations
 Place the data into groups
 Notice unexpected patterns within the group
 Take note of group differences
EDA - Measuring Central Tendency
 We can use many built-in pandas functions to summarize the central tendency and spread of numerical data. Some of these functions are:
 df.mean()
 df.median()
 df.std()
 df.max()
 df.min()
 df.quantile(np.array([0,0.25,0.5,0.75,1]))
 We can assess the normality of the data using the measures below (a short example follows this list):
 Skewness defines the asymmetry of data with respect to the mean.
 It will be negative if the left tail is too long.
 It will be positive if the right tail is too long.
 It will be zero if the left and right tails are symmetric.
 Kurtosis shows whether the data distribution, especially the peak and the tails, is of the right shape.
 It will be positive if the distribution has a marked peak.
 It will be near zero for a normal distribution and negative if the distribution is flat.
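For example, pandas exposes both measures directly; the Salary column here is illustrative:

import pandas as pd

df = pd.DataFrame({'Salary': [10000, 15000, 12000, 11000, 20000]})

# Skewness: asymmetry of the data around the mean
print(df['Salary'].skew())

# Kurtosis (Fisher's definition): near zero for a normal distribution,
# positive for a marked peak, negative for a flat distribution
print(df['Salary'].kurt())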
Obtaining the frequencies
 We can use the pandas built-in function value_counts() to obtain the frequencies of a categorical variable.

valueCount.py
print(df["Sex"].value_counts())

Output:
male      577
female    314
Name: Sex, dtype: int64

 We can also use another pandas built-in function, describe(), to obtain summary statistics of the data.

valueCount.py
print(df.describe())
Visualization for EDA
We can use many types of graphs; some of them are listed below:
 If we want to show how various data elements contribute towards a whole, we should use a pie chart.
 If we want to compare data elements, we should use a bar chart.
 If we want to show the distribution of elements, we should use histograms.
 If we want to depict groups in elements, we should use boxplots.
 If we want to find patterns in data, we should use scatterplots.
 If we want to display trends over time, we should use a line chart.
 If we want to display geographical data, we should use Basemap.
 If we want to display a network, we should use networkx.
In addition to all these graphs, we can also use the seaborn library to plot advanced graphs (a short sketch follows this list), like:
 Countplot
 Heatmap
 Distplot
 Pairplot
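A minimal seaborn sketch; loading the Titanic dataset through seaborn here is illustrative:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('titanic')

# Countplot: frequency of each category
sns.countplot(x='sex', data=df)
plt.show()

# Heatmap: correlations between the numeric columns
sns.heatmap(df.select_dtypes('number').corr(), annot=True)
plt.show()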
Covariance and Correlation
o Covariance
o Covariance is a measure used to determine how much two variables change in tandem.
o The unit of covariance is the product of the units of the two variables.
o Covariance is affected by a change in scale; its value lies between -∞ and +∞.
o Pandas has a built-in DataFrame function named cov() to find the covariance.
o Correlation
o The correlation between two variables is a normalized version of the covariance.
o The value of the correlation coefficient is always between -1 and 1.
o Once we've normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare correlations.

covCorr.py
print(df.cov())
print(df.corr())
