Machine Learning Presentation
Developers
Dr Prakash Goteti
Technology Learning Services
Agenda
• Establish research goal
• Gather the data (internal data, external data)
• Prepare the data
• Explore the data (data aggregation)
• Build a model (model evaluation)
• Present the findings (presentation, automation and inferences)
• Machine learning toolkit and algorithms: scikit-learn
These use cases show that predictive analytics (PA) can have a significant
impact on an organization's return on investment (ROI).
Training set:
• The training examples, each described by a set of columns/attributes, collectively
constitute the training set.
• The target variable (the class the training example belongs to) is then compared
to the predicted value to understand how accurate the algorithm is.
Training example:
• Each training example consists of features and a target variable (its class).
Knowledge Representation:
• It takes the form of rules, for example a probability distribution.
• These rules are readable by the machine.
Regression: a best-fit line drawn through a set of data points to generalize
them
• Regression is the prediction of a numeric value. For example, consider the problem of classification of items
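As an illustration of the best-fit line described above, here is a minimal sketch of an ordinary least-squares fit in plain Python. The data points and the helper name `fit_line` are hypothetical, chosen only so the result is easy to check by hand.

```python
# Minimal sketch: least-squares fit of a straight line y = a*x + b.
# The data below is hypothetical and lies exactly on y = 2x + 1.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6.0 + intercept  # predict a numeric value for an unseen x
print(slope, intercept, predicted)
```

Once fitted, the line is used to predict a numeric value for inputs that were not in the training data.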
Supervised learning:
• There is a target value given for the data
Un-supervised learning:
• There is no target value given for the data
Copyright © 2017 Tech Mahindra. All rights reserved.
Steps in Machine learning
• Data collection: RSS feed, likes, dislikes, extracting from websites
• Data cleansing: refining the data/columns
• Analyze input data: recognize if any patterns
• Train the algorithm: feed the MLA with clean data
• Test the algorithm: infer the results
Mean:
Measuring Covariance:
– Capture the data sets as n-dimensional vectors: (x1, x2, ..., xn); (y1, y2, ..., yn)
– Convert them into vectors of deviations from their means x0 and y0:
(x1 – x0, x2 – x0, ..., xn – x0) and (y1 – y0, y2 – y0, ..., yn – y0)
– Take the dot product of these two vectors (it is proportional to the cosine of the
angle between them) and divide by the sample size
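The covariance steps above can be sketched in a few lines of plain Python. The sample values are hypothetical. Dividing by n follows the text; many libraries divide by n - 1 instead to get the unbiased sample covariance.

```python
# Minimal sketch of the covariance recipe: deviations from the mean,
# dot product, then division by the sample size n.

def covariance(xs, ys):
    n = len(xs)
    x0 = sum(xs) / n                      # mean of x
    y0 = sum(ys) / n                      # mean of y
    dx = [x - x0 for x in xs]             # deviations of x from its mean
    dy = [y - y0 for y in ys]             # deviations of y from its mean
    dot = sum(a * b for a, b in zip(dx, dy))
    return dot / n

xs = [2.0, 4.0, 6.0, 8.0]   # hypothetical data
ys = [1.0, 3.0, 5.0, 7.0]

cov_xy = covariance(xs, ys)
print(cov_xy)
```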
• Matrices help reduce the dimensionality of a data set through Principal
Component Analysis (PCA).
• Training a classifier or regression algorithm, by minimizing the error between the value
calculated by the nascent model and the actual value from the training data, can be done
using linear algebra techniques.
Steps in solving linear equations:
Consider: −3x − 2y + 4z = 9, 3y − 2z = 5, 4x − 3y + 2z = 7
These can be expressed as AX = B, so X = A^{-1} B, where
\[
A = \begin{bmatrix} -3 & -2 & 4 \\ 0 & 3 & -2 \\ 4 & -3 & 2 \end{bmatrix}, \quad
B = \begin{bmatrix} 9 \\ 5 \\ 7 \end{bmatrix}, \quad
X = \begin{bmatrix} x \\ y \\ z \end{bmatrix}
\]
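The system above can be solved numerically. The following sketch assumes NumPy is installed. It uses `np.linalg.solve`, which is generally preferred over forming the inverse explicitly, and then compares the result with the X = A^{-1} B form from the slide.

```python
import numpy as np

# Coefficient matrix and right-hand side taken from the equations above.
A = np.array([[-3.0, -2.0, 4.0],
              [0.0, 3.0, -2.0],
              [4.0, -3.0, 2.0]])
B = np.array([9.0, 5.0, 7.0])

# Solve A X = B directly.
X = np.linalg.solve(A, B)

# The same result via the explicit inverse, as written on the slide.
X_via_inverse = np.linalg.inv(A) @ B

x, y, z = X
print(x, y, z)
```

Substituting the result back into the three equations is a quick way to check it.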
Working with Data structures - Set
Operation                                   Description
A | B, A.union(B)                           Returns a set which is the union of sets A and B.
A |= B, A.update(B)                         Adds all elements of B to the set A.
A & B, A.intersection(B)                    Returns a set which is the intersection of sets A and B.
A &= B, A.intersection_update(B)            Leaves in the set A only items that belong to the set B.
A - B, A.difference(B)                      Returns the set difference of A and B (the elements included in A, but not included in B).
A -= B, A.difference_update(B)              Removes all elements of B from the set A.
A ^ B, A.symmetric_difference(B)            Returns the symmetric difference of sets A and B (the elements belonging to either A or B, but not to both sets simultaneously).
A ^= B, A.symmetric_difference_update(B)    Writes in A the symmetric difference of sets A and B.
A <= B, A.issubset(B)                       Returns True if A is a subset of B.
A >= B, A.issuperset(B)                     Returns True if B is a subset of A.
A < B                                       Equivalent to A <= B and A != B
A > B                                       Equivalent to A >= B and A != B
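A short sketch of the set operations listed above, using two small hypothetical sets. Only built-in Python set operators and methods are used.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                    # same as A.union(B)
intersection = A & B             # same as A.intersection(B)
difference = A - B               # same as A.difference(B)
symmetric = A ^ B                # same as A.symmetric_difference(B)

# In-place variants modify the left-hand set, so work on a copy.
C = set(A)
C |= B                           # same as C.update(B)

D = set(A)
D &= B                           # same as D.intersection_update(B)

is_subset = {3, 4} <= A          # same as {3, 4}.issubset(A)
is_proper_subset = A < A         # a set is never a proper subset of itself

print(union, intersection, difference, symmetric)
```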
https://www.youtube.com/watch?v=BO6GQkOjR50
Techniques:
– Populate by mean, median, mode
– Multiple imputation techniques (regression, mean, median, ...)
– Prediction algorithm to predict missing value
Text data
– Name formatting
– Upper case /lower case
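The simplest of the techniques above (filling missing values with the mean, median or mode, and normalizing the case of text data) can be sketched with the standard library alone. The column values, the names and the helper `impute` are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 25, None, 40]    # hypothetical column with gaps

by_mean = impute(ages, "mean")
by_median = impute(ages, "median")
by_mode = impute(ages, "mode")

# Text data: collapse stray whitespace and apply consistent name formatting.
names = ["  alice SMITH", "BOB   jones "]
clean_names = [" ".join(n.split()).title() for n in names]

print(by_mean, by_median, by_mode, clean_names)
```

Regression-based or prediction-based imputation follows the same pattern, with the fill value coming from a fitted model instead of a summary statistic.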
Approach 1: pip install numpy scipy matplotlib ipython jupyter pandas sympy
Pandas: the pandas library provides two important data structures, namely Series and DataFrame
Series
– A Series is a cross between array indexing and a dictionary:
Examples:
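The original examples for this slide did not survive, so the following is only an illustrative sketch with hypothetical values, assuming pandas is installed. It shows a Series being accessed both by position, like an array, and by label, like a dictionary.

```python
import pandas as pd

# A Series with explicit labels as its index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

first = s.iloc[0]        # array-style access by position
by_label = s["b"]        # dictionary-style access by label
as_dict = s.to_dict()    # convert back to a plain dictionary

# A Series can also be built directly from a dictionary.
from_dict = pd.Series({"x": 1.5, "y": 2.5})

print(first, by_label, as_dict, list(from_dict.index))
```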
\[
\text{Story ink ratio} = \frac{\text{story ink}}{\text{total ink used to print the graphic}}
\]
– The portion of a graphic's ink devoted to the non-redundant display of the story information
– Two types:
Horizontal (long list of categories)
Vertical (showing negative values, time periods)
Comparing the trends: line charts
– Pie chart:
Best for showing few categories
Parts of a pie chart should add up to a meaningful whole
Creating effective visualization
– Box plots:
Summarise the distribution (median, min_val, max_val) of the data;
identify outliers in the data
– Scatter plots:
Used to establish the relationship between the variables
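To make the box-plot summary concrete, here is a sketch that computes the values a box plot displays and flags outliers. The data is hypothetical. The 1.5 × IQR fence is a common convention assumed here, not something stated in the slides.

```python
from statistics import median, quantiles

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]   # hypothetical measurements

min_val = min(data)
max_val = max(data)
med = median(data)

# Quartiles; statistics.quantiles uses the "exclusive" method by default.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Common convention (an assumption here): points beyond 1.5 * IQR from the
# quartiles are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(min_val, med, max_val, outliers)
```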
Data Visualization (3 - 6):
– Qualitative colours (contrast): they carry no obvious relationship among them
– Selection of colours
Light grey, dark lines: to show simple data
Black and red: correlation
Use legends: indicate what each component represents
Use labels painted directly on the chart instead of on the axes
Make sure the visualization stands by itself
Use the squint test: can this visualization tell a story?
– Implementation aspects
k-Means
DBSCAN
Within k=3 nearest neighbours, we have two good and one bad as per the survey
input data
Conclude that the new tissue paper that passes laboratory tests with
X1=3, X2=7 is included in the good category
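The majority vote described above can be sketched as a small k-nearest-neighbours classifier. The survey table is not part of the surviving text, so the four training points below are hypothetical values chosen only to reproduce the "two good, one bad" vote for the query X1=3, X2=7.

```python
import math
from collections import Counter

# Hypothetical survey data: ((X1, X2), label). Not taken from the slides.
training = [
    ((7, 7), "bad"),
    ((7, 4), "bad"),
    ((3, 4), "good"),
    ((1, 4), "good"),
]

def knn_classify(query, samples, k):
    """Return the majority label among the k samples nearest to query."""
    nearest = sorted(samples, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

label, votes = knn_classify((3, 7), training, k=3)
print(label, dict(votes))
```

Euclidean distance is assumed as the distance measure; other metrics plug into the same structure.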
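Probability of occurrence of B is: P(B)
\[
P(\text{Pos} \mid C) = \frac{80}{100} = 0.8; \quad
P(C) = \frac{100}{10000} = 0.01; \quad
P(\text{Pos}) = \frac{980}{10000} = 0.098
\]
\[
P(C \mid \text{Pos}) = \frac{P(\text{Pos} \mid C)\, P(C)}{P(\text{Pos})} = \frac{0.8 \times 0.01}{0.098} \approx 8.16\%
\]
which is significantly higher than our general assumption: 100/10000 = 1%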
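The screening-test calculation above can be checked with a few lines of Python. All three input probabilities come from the figures on the slide; only the variable names are introduced here.

```python
# Inputs from the slide.
p_pos_given_c = 80 / 100       # P(Pos | C): test is positive given cancer
p_c = 100 / 10000              # P(C): prior probability of cancer
p_pos = 980 / 10000            # P(Pos): overall probability of a positive test

# Bayes' theorem: P(C | Pos) = P(Pos | C) * P(C) / P(Pos)
p_c_given_pos = p_pos_given_c * p_c / p_pos

print(f"P(C | Pos) = {p_c_given_pos:.4f}")
```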
Naïve Bayes (3-3):
Example 2: Spam mail detection. Mails containing the word “gift” are observed
to tend to be spam. Classify a given new mail as spam or ham based on the
probability:
\[
P(\text{Spam} \mid \text{gift}) = \frac{P(\text{gift} \mid \text{Spam})\, P(\text{Spam})}{P(\text{gift})}
\]
Probability of an email being spam, if it contains the word “gift”: P(Spam | gift)
The numerator is the probability of a message being spam and containing the word “gift”:
P(gift | Spam) P(Spam)
The denominator is the overall probability of an email containing the word “gift”, equivalent
to: P(gift | Spam) P(Spam) + P(gift | Ham) P(Ham)
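A minimal sketch of the spam calculation above, with the denominator expanded over spam and ham. The slides give no numbers for this example, so the three probabilities and the 0.5 decision threshold below are hypothetical.

```python
# Hypothetical inputs (not from the slides).
p_spam = 0.2                  # P(Spam)
p_ham = 1.0 - p_spam          # P(Ham)
p_gift_given_spam = 0.5       # P(gift | Spam)
p_gift_given_ham = 0.05       # P(gift | Ham)

# Numerator: probability of a message being spam and containing "gift".
numerator = p_gift_given_spam * p_spam

# Denominator: overall probability of a message containing "gift".
p_gift = p_gift_given_spam * p_spam + p_gift_given_ham * p_ham

p_spam_given_gift = numerator / p_gift

# Assumed decision rule: label as spam when the posterior exceeds 0.5.
label = "spam" if p_spam_given_gift > 0.5 else "ham"
print(round(p_spam_given_gift, 4), label)
```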
Cancer is falsely detected among 900 patients out of 9,900 healthy people
If the result of this screening test on a person is positive, what is the probability
that he actually has cancer?
positive: test shown positive, patient
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]
K Means clustering (2-7):
K Means clustering (4-7):
We obtain two clusters containing: {1,2,3} and {4,5,6,7}.
Their new centroids are:
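The centroid update and the following reassignment step can be sketched as below. The coordinates of the seven points are not in the surviving text, so the values used here are hypothetical. Only the cluster memberships {1,2,3} and {4,5,6,7} come from the slide.

```python
import math

# Hypothetical coordinates for points 1..7 (not taken from the slides).
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}

def centroid(ids):
    """Mean position of the points whose ids are given."""
    xs = [points[i][0] for i in ids]
    ys = [points[i][1] for i in ids]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def assign(centroids):
    """Assign every point to its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for i, p in points.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
        clusters[nearest].append(i)
    return clusters

clusters = [[1, 2, 3], [4, 5, 6, 7]]            # memberships stated above
centroids = [centroid(c) for c in clusters]     # update step
reassigned = assign(centroids)                  # next assignment step

print(centroids, reassigned)
```

k-Means repeats these two steps until the assignments stop changing.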
https://my.techmahindra.com/personal/pl73819/blog/Lists/Posts/Post.aspx?ID=2
Thank you
Disclaimer
Tech Mahindra Limited, herein referred to as TechM, provides a wide array of presentations and reports, with the contributions of various
professionals. These presentations and reports are for information purposes and private circulation only and do not constitute an offer to buy or sell
any services mentioned therein. They do not purport to be a complete description of the market conditions or developments referred to in the
material. While utmost care has been taken in preparing the above, we claim no responsibility for their accuracy. We shall not be liable for any direct
or indirect losses arising from the use thereof and the viewers are requested to use the information contained herein at their own risk. These
presentations and reports should not be reproduced, re-circulated, published in any media, website or otherwise, in any form or manner, in part or as
a whole, without the express consent in writing of TechM or its subsidiaries. Any unauthorized use, disclosure or public dissemination of information
contained herein is prohibited. Individual situations and local practices and standards may vary, so viewers and others utilizing information contained
within a presentation are free to adopt differing standards and approaches as they see fit. You may not repackage or sell the presentation. Products
and names mentioned in materials or presentations are the property of their respective owners and the mention of them does not constitute an
endorsement by TechM. Information contained in a presentation hosted or promoted by TechM is provided “as is” without warranty of any kind, either
expressed or implied, including any warranty of merchantability or fitness for a particular purpose. TechM assumes no liability or responsibility for the
contents of a presentation or the opinions expressed by the presenters. All expressions of opinion are subject to change without notice.