Python Unit 5
Data Wrangling
Classes in Scikit-learn
o Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately.
o Scikit-learn is the package for machine learning and data science experimentation favored by
most data scientists.
o It contains a wide range of well-established learning algorithms, error functions, and testing procedures.
o Install : conda install scikit-learn
o There are four class types covering all the basic machine-learning functionalities:
o Classifying
o Regressing
o Grouping by clusters
o Transforming data
o Even though each base class has its own specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by a shared set of methods and attributes, such as fit() and predict(), as sketched below.
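Because every estimator exposes the same core methods, different models can be swapped in with identical code. A minimal sketch (the estimators and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Split an example dataset once, then reuse it for any estimator.
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True))
for ModelClass in (DecisionTreeClassifier, KNeighborsClassifier):
    model = ModelClass()                # create the estimator
    model.fit(X_train, y_train)         # every estimator exposes fit()
    print(model.score(X_test, y_test))  # every classifier exposes score()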
Choosing a correct algorithm in Scikit-learn
[Flowchart: scikit-learn's algorithm cheat-sheet. Start by checking whether you have more than 50 samples (if not, get more data). Then branch on whether you are predicting a category or a quantity and whether you have labeled data, with further branches on sample counts (fewer than 100K or 10K samples) and on whether only a few features are important. The paths lead to estimators such as SVC, the SGD Classifier, linear regressors, the SGD Regressor, clustering algorithms (when the number of categories is known), and Randomized PCA.]
Supervised vs. Unsupervised Learning

Aspect              | Supervised Learning                        | Unsupervised Learning
Real time           | Uses off-line analysis                     | Uses real-time analysis of data
Number of classes   | Number of classes is known                 | Number of classes is not known
Accuracy of results | Accurate and reliable results              | Moderately accurate and reliable results
Output data         | Desired output is given                    | Desired output is not given
Model               | It is not possible to learn larger and more complex models than with unsupervised learning | It is possible to learn larger and more complex models than with supervised learning
Training data       | Training data is used to infer the model   | Training data is not used
Another name        | Also called classification                 | Also called clustering
Test of model       | We can test our model                      | We cannot test our model
Example             | Optical character recognition              | Finding a face in an image
Machine Learning Process for Supervised Learning
[Diagram: the dataset is split into training data and testing data; the training data is used to train the model, and the testing data is used to test it.]
Structure of Scikit-Learn
01 (Import)
from sklearn.family import Model
Example
from sklearn.linear_model import LinearRegression
02 (Create Object)
m = Model(parameters)
Example
lr = LinearRegression()
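03 (Split Data)
The X_train and y_train used in the training step come from splitting the data into training and testing sets; a minimal sketch (assuming features X and labels y):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)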
04 (Train Model)
m.fit(X_train, y_train)  # for supervised learning
OR
m.fit(X_train)  # for unsupervised learning
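05 (Predict)
Once trained, the model makes predictions on new data; these predictions feed the evaluation step that follows. A minimal sketch:
predictions = m.predict(X_test)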
06 (Evaluate Model)
• Evaluate the model and update the parameters if required.
• Evaluation techniques differ for regression, classification, and clustering.
• We will explore all of these techniques in later sessions.
Example (using the LinearRegression object lr from step 02 on a DataFrame df assumed to have 'iphonenumber' and 'price' columns):

import matplotlib.pyplot as plt

lr.fit(X, y)               # train on the data
predicted = lr.predict(X)  # predict on the same data
plt.scatter(df['iphonenumber'], df['price'], c='r')  # actual values in red
plt.scatter(df['iphonenumber'], predicted, c='b')    # predicted values in blue
plt.show()
Support Vector Regression (SVR) (Example)
Assuming 1-D NumPy arrays X and y; the model here is assumed to be an SVR instance:
from sklearn.svm import SVR
model = SVR()                   # create the support vector regressor
model.fit(X.reshape(-1, 1), y)  # reshape X into a 2-D column vector
predictPoly = model.predict(X.reshape(-1, 1))
plt.scatter(X, y, c='r')            # actual values
plt.scatter(X, predictPoly, c='b')  # predicted values
Bag of Words Model (Topic from Unit-03)
Before we can perform analysis on textual data, we must tokenize every word within the
dataset.
The act of tokenizing the words creates a bag of words.
The creation of a bag of words revolves around Natural Language Processing (NLP).
Before we can do NLP, we should perform the processes specified below.
Remove special characters
Such as HTML tags and other special characters
Remove the stop words
Some of the English stop words are a, an, the, are, at, this, is, etc.
Stemming words
Stemming a word means removing its suffixes and prefixes, reducing it to its root form; a short sketch follows the example.
Example: Playing, Plays, Played → Play
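A minimal sketch of stemming using NLTK's PorterStemmer (assumes the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['Playing', 'Plays', 'Played']:
    print(stemmer.stem(word))  # each prints 'play'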
Hashing Trick
Most machine learning algorithms use numeric inputs. If our data contains text instead, we first need to convert that text into numeric values; this can be done using the hashing trick.
For example, suppose our dataset is the first table below; the second table shows the same data after encoding:

Id  Salary  Gender
1   10000   Male
2   15000   Female
3   12000   Male
4   11000   Female
5   20000   Male

Id  Salary  Gender  Gender_Numeric
1   10000   Male    1
2   15000   Female  0
3   12000   Male    1
4   11000   Female  0
5   20000   Male    1
We cannot apply machine learning algorithms to text values like Male/Female; we need numeric values instead. One thing we can do is assign a number to each word and replace the word with that number.
If we assign 1 to Male and 0 to Female, we can use ML algorithms; the same dataset in numeric form is shown above, and a short sketch follows.
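A minimal sketch of this encoding with pandas (the DataFrame mirrors the table above):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Salary': [10000, 15000, 12000, 11000, 20000],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})
df['Gender_Numeric'] = df['Gender'].map({'Male': 1, 'Female': 0})
print(df)

For larger vocabularies, scikit-learn's FeatureHasher and HashingVectorizer implement the hashing trick at scale.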
CountVectorizer
CountVectorizer accepts many parameters that can be very useful when dealing with textual data; some of the important ones are:
max_features
Builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
stop_words
Removes stop words; the default is None, and we can set it to 'english' to remove English stop words.
ngram_range
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be
extracted.
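A minimal sketch combining these parameters on a tiny, illustrative corpus:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat on the mat', 'the dog sat on the log']
cv = CountVectorizer(max_features=10, stop_words='english', ngram_range=(1, 2))
bag = cv.fit_transform(corpus)     # sparse document-term matrix
print(cv.get_feature_names_out())  # the learned vocabulary
print(bag.toarray())               # token counts per document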
timeit (Magic Command in Jupyter Notebook)
o We can find the time taken to execute a statement or a cell in a Jupyter Notebook with the
help of timeit magic command.
o This command can be used both as a line and cell magic:
o In line mode, you can time a single-line statement (though multiple statements can be chained using semicolons).
o In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body
of the cell is timed. The cell body has access to any variables created in the setup code.
o Syntax :
o Line : %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
o Cell : %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
         code...
o Here, the -n flag specifies the number of loops and the -r flag the number of repeats.
o Example :
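A hedged example; the statements being timed are illustrative:

%timeit -n 100 -r 5 sum(range(10000))

%%timeit data = list(range(10000))
# the setup on the magic line above is executed but not timed; this body is timed
total = 0
for value in data:
    total += value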
Memory Profiler
This is a Python module for monitoring memory consumption of a process, as well as line-by-line analysis of memory consumption for Python programs.
Installation : pip install memory_profiler
Usage :
First load the profiler by writing %load_ext memory_profiler in the notebook.
Then prefix each statement whose memory you want to monitor with %memit.
Example :
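A hedged example; the list comprehension is illustrative:

%load_ext memory_profiler
%memit squares = [x ** 2 for x in range(1_000_000)]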
Running in Parallel
Most computers today are multicore (two or more processors in a single package), some
with multiple physical CPUs.
One of the most important limitations of Python is that it uses a single core by default. (It
was created in a time when single cores were the norm.)
Data science projects require quite a lot of computations.
Part of the scientific aspect of data science relies on repeated tests and experiments on
different data matrices.
Also the data size may be huge, which will take lots of processing power.
Using more CPU cores accelerates a computation by a factor that almost matches the
number of cores.
For example, having four cores would mean working at best four times faster.
But in practice we will not get exactly four-times-faster processing, because distributing work across cores adds some overhead.
So parallel processing works best with huge datasets or where extreme processing power is required.
Running in Parallel (Cont.)
In the Scikit-learn library we do not need to do any special programming for parallel (multiprocessing) processing; we can simply use the n_jobs parameter.
If we set n_jobs to a number greater than 1, it will use that many processors; if we set it to -1, it will use all processors available on the system.
Let's see an example of how we can achieve parallel processing in the Scikit-learn library.
Using Single Processor
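A minimal sketch (the dataset and classifier are illustrative) that times ten-fold cross-validation on a single processor and then on all available processors:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
%timeit cross_val_score(SVC(), X, y, cv=10, n_jobs=1)   # single processor
%timeit cross_val_score(SVC(), X, y, cv=10, n_jobs=-1)  # all available processors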
We can also use another pandas built-in function, describe(), to obtain insights into the data.
valueCount.py
print(df.describe())
Visualization for EDA
We can use many types of graphs; some of them are listed below.
If we want to show how various data elements contribute towards a whole, we should use a pie chart.
If we want to compare data elements, we should use a bar chart.
If we want to show the distribution of elements, we should use histograms.
If we want to depict groups in elements, we should use boxplots.
If we want to find patterns in data, we should use scatterplots.
If we want to display trends over time, we should use a line chart.
If we want to display geographical data, we should use Basemap.
If we want to display a network, we should use NetworkX.
In addition to all these graphs, we can also use the seaborn library to plot advanced graphs (see the sketch after this list), such as:
Countplot
Heatmap
Distplot
Pairplot
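A minimal sketch of two of these; the DataFrame here is illustrative:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
                   'Salary': [10000, 15000, 12000, 11000, 20000],
                   'Age': [25, 30, 35, 28, 40]})
sns.countplot(x='Gender', data=df)       # one bar per category
plt.show()
sns.heatmap(df.corr(numeric_only=True))  # correlation heatmap of numeric columns
plt.show()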
Covariance and Correlation
o Covariance
o Covariance is a measure used to determine how much two variables change in tandem.
o The unit of covariance is a product of the units of the two variables.
o Covariance is affected by a change in scale; the value of covariance lies between -∞ and +∞.
o Pandas has a built-in DataFrame function named cov() to find covariance.
o Correlation
o The correlation between two variables is a normalized version of the covariance.
o The value of the correlation coefficient is always between -1 and 1.
o Once we’ve normalized the metric to the -1 to 1 scale, we can make meaningful statements and compare
correlations.
covCorr.py
print(df.cov())
print(df.corr())
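A self-contained sketch of the two calls (the DataFrame values are illustrative):

import pandas as pd

df = pd.DataFrame({'hours': [1, 2, 3, 4, 5],
                   'score': [52, 55, 61, 70, 74]})
print(df.cov())   # covariance matrix, in the variables' units
print(df.corr())  # correlation matrix, normalized to [-1, 1]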