MachineLearning Presentation

MachineLearning

Uploaded by

Ram Prasad

Machine Learning for Developers

Dr Prakash Goteti
Technology Learning Services
Agenda

 Big Picture: Introduction to Data Science

 Where Machine learning fits in?

 What is machine learning

 Machine learning case studies

 Machine learning –Key terminology

 Predictive Analytics and Recommendation Systems

 (Un)Supervised learning algorithms

Copyright © 2017 Tech Mahindra. All rights reserved. 2

 Introduction to Data Science


Big Picture –Data Science
Data Science

Establish Research
Goal

Gather the data

Prepare the data

Explore the data

Build a model

Present the findings


Big Picture –Data Science
Data Science

Establish Research Goal
• Define research goal
• Prepare project charter

Gather the data

Prepare the data

Explore the data

Build a model

Present the findings

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data
• Internal data
• External data

Prepare the data

Explore the data

Build a model

Present the findings

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data

Prepare the data
• Data cleansing
• Data transformation
• Data aggregation

Explore the data

Build a model

Present the findings

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data

Prepare the data

Explore the data
• Graphical techniques
• Visualization techniques
• Non-graphical techniques

Build a model

Present the findings

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data

Prepare the data

Explore the data

Build a model
• Model selection
• Model execution
• Model evaluation

Present the findings

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data

Prepare the data

Explore the data

Build a model

Present the findings
• Presentation
• Automation and inferences

Big Picture –Data Science
Data Science

Establish Research Goal

Gather the data

Prepare the data

Explore the data

Build a model

Present the findings

Python libraries that support these steps:
• Numpy and Pandas: data cleansing
• matplotlib package: data visualization and reporting
• scikit-learn toolkit: machine learning algorithms
• Nltk framework: natural language processing
• NetworkX library: social network analysis

 Introduction to Machine Learning


Machine Learning

Machine learning is an amalgamation of computer science, engineering and statistics.

It is a tool that can be applied to many problems that involve interpreting data and acting on it for the benefit of the business.

Machine learning uses statistics extensively.

Machine learning case studies (1-2)

GE already makes hundreds of millions of dollars by crunching the data it collects from deep-sea oil wells or jet engines to optimize performance, anticipate breakdowns, and streamline maintenance.

Outside North America:

In Europe, more than a dozen banks have replaced older statistical-modeling approaches with machine-learning techniques and, in some cases, experienced 10 percent increases in sales of new products, 20 percent savings in capital expenditures, 20 percent increases in cash collections, and 20 percent declines in churn.
These gains came through new recommendation engines for clients in retailing and in small and medium-sized companies, enabling more accurate forecasts.

Machine learning case studies (2-2)
 A Canadian bank uses predictive analytics to increase campaign response rates by 600%, cut customer acquisition costs in half, and boost campaign ROI by 100%.

 A research group at a leading hospital combined predictive and text analytics to improve its ability to classify and treat pediatric brain tumors.

 An airline increased revenue and customer satisfaction by better estimating the number of passengers who won’t show up for a flight. This reduces the number of overbooked flights that require re-accommodating passengers as well as the number of empty seats.

 These use cases reflect an important fact: predictive analytics (PA) can have a significant impact on return on investment for organizations.

 PA can help companies achieve operational excellence through cost reduction and process improvement, better understand customer behavior, identify unexpected opportunities, and anticipate problems before they happen so that risk mitigation and avoidance steps can be taken effectively.

Key Terminology
Features: individual measurements that, when combined with other features, make up a training example
• Features are the key properties that describe the entities being studied.
• If these entities are represented as a table, each column is a feature or attribute.
• Each row in the table is an instance.
• Features are usually the columns in a training or test set.

Training set:
• The set of training examples (rows), each described by its features or attributes, that is used to train the algorithm.
• The target variable, or class, that each example belongs to is compared to the predicted value to understand how accurate the algorithm is.

Training example:
• Each training example has features and a target variable (its class).

Key Terminology
Data Types
• Numeric data (quantifiable things: discrete or continuous)
• Categorical data (based on categories; the categories can be enumerated)
• Ordinal data (a mixture of the above: categories with a meaningful order, such as star ratings on a product or movie)

Knowledge Representation:
• It is in the form of rules, such as a probability distribution
• These rules are readable by the machine.

Classification: To predict what class an instance of data should fall into.

Regression: A best-fit line drawn through some data points to generalize the data points
• Regression is the prediction of a numeric value. For example, consider the problem of classification of items

Supervised learning:
• There is a target value given for the data

Un-supervised learning:
• There is no target value given for the data
Steps in Machine learning
1. Data collection: RSS feeds, likes, dislikes, extracting from websites
2. Data cleansing: refining the data/columns
3. Analyze input data: recognize any patterns
4. Train the algorithm: feed the machine learning algorithm (MLA) with clean data
5. Test the algorithm: infer the results

 Mathematical and Statistical Foundations


Binning
 Convert numeric data into categorical data (bins)
 Use pre-defined ranges as bins
 Example: a classification algorithm with age as the class variable
 Indicator variables: convert categorical data into Boolean data

Age Range    No of people
20-30        20
31-40        33
41-50        45
51-60        41
>60          37

 Centering and Scaling
– Standardise the range of values
– Better comparison
– Values are “centered” by subtracting the mean from them
– Values are scaled by dividing the centered values by the SD
– ML algorithms give better results with standardized values

– Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
– Variance describes the spread around the mean: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
– SD example: sample (2, 5, 6, 5, 9); mean = 27/5 = 5.4
– Differences from the mean = (-3.4, -0.4, 0.6, -0.4, 3.6)
– Squared differences = (11.56, 0.16, 0.36, 0.16, 12.96)
– Average of squared differences (variance) = 25.2/5 = 5.04
– SD = $\sqrt{5.04} \approx 2.24$
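The binning and centering/scaling steps can be sketched with NumPy as follows. This is a minimal illustration, not the deck's own notebook: the `ages` array is made up, the bin labels mirror the age ranges in the table, and the population SD (dividing by N) is assumed because that is what the worked example uses.

```python
import numpy as np

# Hypothetical ages, used only for illustration.
ages = np.array([23, 35, 47, 58, 64, 31, 52])

# Upper edges of the pre-defined bins: 20-30, 31-40, 41-50, 51-60, >60.
upper_edges = [30, 40, 50, 60]
labels = ["20-30", "31-40", "41-50", "51-60", ">60"]

# With right=True, digitize returns i such that edges[i-1] < x <= edges[i].
bin_index = np.digitize(ages, upper_edges, right=True)
age_bins = [labels[i] for i in bin_index]

# Centering and scaling the sample from the SD example.
sample = np.array([2, 5, 6, 5, 9], dtype=float)
mean = sample.mean()               # 27 / 5 = 5.4
sd = sample.std()                  # population SD: sqrt(25.2 / 5)
z_scores = (sample - mean) / sd    # centered, then scaled

print(age_bins)
print(mean, sd, z_scores)
```

After standardization the values have mean 0 and SD 1, which is what makes features measured on different scales comparable.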
Correlation
 The Pearson correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1.

Covariance and Correlation
 How much two attributes (X, Y) vary together (are correlated) or not

 Measuring Covariance:

– Capture the data sets as n-dimensional vectors: (x1, x2, …, xn); (y1, y2, …, yn)

– Convert them into vectors of deviations from their means: (x1 – x0, x2 – x0, …, xn – x0) and (y1 – y0, y2 – y0, …, yn – y0), where x0 and y0 are the means

– Take the dot product of these two vectors (it is related to the cosine of the angle between them) and divide by the sample size to get the covariance

– Divide the covariance by the SDs of both sets to get the correlation

– -1 means perfect inverse correlation; 0 means no correlation; 1 means perfect correlation

– Correlation does not indicate causation; it helps in deciding what experiments to conduct
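The covariance and correlation steps above can be traced in a few lines of NumPy. The two data vectors are illustrative values, not taken from the slides, and the population form (dividing by the sample size n) is assumed to match the description.

```python
import numpy as np

# Illustrative data; any two equal-length numeric vectors would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from the mean of its own vector.
dx = x - x.mean()
dy = y - y.mean()

# Covariance: dot product of the deviation vectors divided by the sample size.
n = len(x)
covariance = np.dot(dx, dy) / n

# Correlation: covariance divided by the SDs of both sets.
correlation = covariance / (x.std() * y.std())

print(covariance, correlation)
```

The result can be cross-checked against `np.corrcoef(x, y)[0, 1]`, which computes the same Pearson coefficient.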


Solving linear equation
• In machine learning, we deal with training sets and test data, where algorithms are trained on large data sets; matrices are a good representation for such data.

• Matrices help in dimensionality reduction of a data set through Principal Component Analysis (PCA).

• Training a classifier or regression algorithm by minimizing the error between the value calculated by the nascent classifier and the actual value from the training data can be done using linear algebra techniques.
Steps in solving linear equations:
Consider: $-3x - 2y + 4z = 9$, $3y - 2z = 5$, $4x - 3y + 2z = 7$
These can be expressed as $AX = B$, so that $X = A^{-1}B$, where
$$A = \begin{bmatrix} -3 & -2 & 4 \\ 0 & 3 & -2 \\ 4 & -3 & 2 \end{bmatrix}, \quad B = \begin{bmatrix} 9 \\ 5 \\ 7 \end{bmatrix}, \quad X = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$
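As a minimal sketch, the system above can be solved with NumPy. `np.linalg.solve` is used here instead of explicitly forming the inverse of A; that is a choice made for this example (it is generally the more numerically stable route) rather than something the slides prescribe.

```python
import numpy as np

# Coefficient matrix A and right-hand side B of the system AX = B.
A = np.array([[-3.0, -2.0,  4.0],
              [ 0.0,  3.0, -2.0],
              [ 4.0, -3.0,  2.0]])
B = np.array([9.0, 5.0, 7.0])

# Solve for X = (x, y, z) without computing the inverse explicitly.
X = np.linalg.solve(A, B)

# The textbook form X = A^-1 . B gives the same answer.
X_via_inverse = np.linalg.inv(A) @ B

print(X)  # x = 3, y = 7, z = 8 up to floating-point rounding
```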
Working with Data structures -Set
Operator    Method                              Description
A | B       A.union(B)                          Returns a set which is the union of sets A and B.
A |= B      A.update(B)                         Adds all elements of array B to the set A.
A & B       A.intersection(B)                   Returns a set which is the intersection of sets A and B.
A &= B      A.intersection_update(B)            Leaves in the set A only items that belong to the set B.
A - B       A.difference(B)                     Returns the set difference of A and B (the elements included in A, but not included in B).
A -= B      A.difference_update(B)              Removes all elements of B from the set A.
A ^ B       A.symmetric_difference(B)           Returns the symmetric difference of sets A and B (the elements belonging to either A or B, but not to both sets simultaneously).
A ^= B      A.symmetric_difference_update(B)    Writes in A the symmetric difference of sets A and B.
A <= B      A.issubset(B)                       Returns true if A is a subset of B.
A >= B      A.issuperset(B)                     Returns true if B is a subset of A.
A < B                                           Equivalent to A <= B and A != B
A > B                                           Equivalent to A >= B and A != B
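A short sketch of the set operations listed above, using two small illustrative sets (the values are arbitrary):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

union = A | B                 # same as A.union(B)
intersection = A & B          # same as A.intersection(B)
difference = A - B            # same as A.difference(B)
symmetric = A ^ B             # same as A.symmetric_difference(B)

# In-place variants modify the left-hand set, so work on copies here.
updated = set(A)
updated |= B                  # same as updated.update(B)
kept = set(A)
kept &= B                     # same as kept.intersection_update(B)

is_subset = {3, 4} <= B       # same as {3, 4}.issubset(B)
is_proper_superset = A > {1, 2}

print(union, intersection, difference, symmetric)
```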

Statistics
 Mean: sum of the values in the sample divided by the size of the sample:
– (x1 + x2 + x3 + … + xN)/N

 Median: the middle value of the sorted set of values in the sample.
– The median is less susceptible to outliers than the mean
– The median is a better indicator than the mean when the data is skewed or contains outliers

 Mode: the most common value in the data set
– It is an indicator of frequency
– Ex. 0, 1, 3, 4, 0, 3, 6, 0: the mode is 0, which occurs 3 times in the sample
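The three measures can be checked with Python's standard `statistics` module. The sample is the one from the mode example; the extra value 1000 is a made-up outlier added only to show how differently the mean and the median react.

```python
import statistics

sample = [0, 1, 3, 4, 0, 3, 6, 0]

mean = statistics.mean(sample)       # 17 / 8 = 2.125
median = statistics.median(sample)   # middle of 0,0,0,1,3,3,4,6 is (1 + 3) / 2
mode = statistics.mode(sample)       # 0 occurs three times

# Add one hypothetical outlier and recompute.
with_outlier = sample + [1000]
mean_outlier = statistics.mean(with_outlier)      # jumps to 113
median_outlier = statistics.median(with_outlier)  # moves only to 3

print(mean, median, mode)
print(mean_outlier, median_outlier)
```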

Statistics

For normally distributed data:
 68% of the data falls within one SD of the mean
 95% of the data falls within two SD of the mean
 99.7% of the data falls within three SD of the mean

Statistics
 The probability density for a Gaussian distribution is given in terms of the mean value ($\mu$) and the variance ($\sigma^2$) of the population as:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

 The Central Limit Theorem states that

“Given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.

Furthermore, all of the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population divided by each sample's size”.

https://www.youtube.com/watch?v=BO6GQkOjR50
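A small simulation can make the Central Limit Theorem statement concrete. The sketch below assumes a uniform population on [0, 1) (mean 0.5, variance 1/12); the sample size of 30, the 2000 repetitions and the seed are arbitrary choices for this illustration.

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is repeatable

n = 30                  # size of each sample
num_samples = 2000      # number of samples drawn

# Draw many samples from a uniform population and record each sample mean.
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

# Population variance divided by the sample size, as the theorem predicts.
expected_var = (1 / 12) / n

print(mean_of_means)               # close to the population mean 0.5
print(var_of_means, expected_var)  # the two should be close
```

Plotting a histogram of `sample_means` would show the bell shape even though the underlying population is flat.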


 Working with Numpy –’NumpyNotebook1’ examples


 Cleansing the Data


Data Cleansing
 Issues with data quality
• Invalid values
• Formats of the data (dd-mm-yy); spelling issues
• Dependency: referential constraints, one-to-many and unary relations
• Domain constraints, referential integrity constraints
• Duplicate records
• Missing values
• Values in wrong columns
• …..
 Understanding data quality issues (Pandas):
• Outlier analysis
• Exploratory data analysis: charts, visualization tools
 Fixing the data quality issues
• Use a coding language (R, Python..); fix the sources
• Find issues in data processing streams
Data Cleansing –Data imputation
 If a column value is empty, what value do we fill in?

 Fixing null, empty values

 Unlike in an RDBMS, any value in ML is valid

 ML considers nulls as a ‘class of data’

 Techniques:
– Populate by mean, median, mode
– Multiple imputation techniques (regression, mean, median..)
– Prediction algorithm to predict the missing value
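The simplest of these techniques, populating by median or mode, might look like this in pandas. The DataFrame and its column names are hypothetical and exist only for this sketch; regression-based or predictive imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

filled = df.copy()

# Numeric column: fill the gaps with the median of the observed values.
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: fill the gaps with the most frequent value (the mode).
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(filled)
```

The median is used for the numeric column because it is less sensitive to outliers than the mean; swapping in `.mean()` is a one-word change.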

Data Cleansing –Data Standardization
 Numeric data
– Logarithm
– Decimal places
– Floor, ceiling

 Date and time

– Time zone
– Fixing null, empty values

 Text data
– Name formatting
– Upper case /lower case


Python Libraries
Installation:

Approach 1: pip install numpy scipy matplotlib ipython jupyter pandas sympy

Approach 2: Python library bundles are available through environment platforms:

Anaconda: https://www.continuum.io
Canopy: https://www.enthought.com/products/canopy/

Numpy: It stands for 'Numerical Python'.

• Useful for performing operations on arrays (vectors), including multidimensional array objects. It supports several operations on these objects
• Other operations cover areas such as linear algebra, random number generation etc.

Pandas: The Pandas library provides two important data structures, namely Series and DataFrame

Pandas (1- 4):
 A library that provides a way of processing tabular data supported by
two data structures: Series, DataFrame


Pandas (2- 4):
 Creating a Series:
– By passing a list of values
– pd.Series? (displays the help text in IPython/Jupyter)
– animals = ['Lion', 'Tiger', 'Bear', 'Mouse']
– pd.Series(animals)    # Pandas automatically assigns index values
0 Lion
1 Tiger
2 Bear
3 Mouse
dtype: object
 Series from a dictionary
– city_cap = {'India': 'New Delhi', 'US': 'Washington, D.C.'}
– s = pd.Series(city_cap)
India New Delhi
US Washington, D.C.
dtype: object
– To know the type of the keys:
for i in city_cap.keys():
    print(type(i))
 Series from a list of indices and corresponding values
– Pandas overrides the automatic creation of index values using the list of values provided through the index parameter
– s = pd.Series([value_item_list], index=[keys_list])
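The three ways of creating a Series can be run end to end as below. This is a sketch that follows the slide's examples; the capital-city values in the dictionary are illustrative.

```python
import pandas as pd

# 1. From a list: pandas assigns the integer index 0..n-1 automatically.
animals = ["Lion", "Tiger", "Bear", "Mouse"]
s_list = pd.Series(animals)

# 2. From a dictionary: the keys become the index labels.
city_cap = {"India": "New Delhi", "US": "Washington, D.C."}
s_dict = pd.Series(city_cap)

# 3. From values plus an explicit index, overriding the automatic one.
s_indexed = pd.Series(["New Delhi", "Washington, D.C."], index=["India", "US"])

print(s_list)
print(s_dict)
print(s_indexed["India"])   # label-based lookup, like a dictionary
print(s_list[0])            # lookup by the automatically assigned index
```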
Pandas (3- 4):
 Working with DataFrame:
– Pandas is a library that provides a way of processing tabular data, supported by two data structures: Series and DataFrame

 Series
– A Series is a cross between an array (positional indexing) and a dictionary (label lookup).


 Data Visualization in Python


Data Visualization (1 - 6):
 Data visualization: storytelling by means of visual patterns
– Before looking at the data, frame an interesting story
– The story will tell us the specific tools needed for visualization

1. Identify the tool (Excel/Tableau/Python …)
2. Define the story clearly
3. Pick the right visual aid to tell the story
4. Assess the data visualization
a) Are there any distractions from the main story?
b) Does it describe your story?

 Story ink ratio: $\text{story ink ratio} = \frac{\text{story ink}}{\text{total ink used to print the graphic}}$
– The proportion of a graphic's ink devoted to the non-redundant display of the story information

Data Visualization (2 - 6):
Creating effective visualization
– Pick the chart that communicates the story best!
– Bar chart: to make comparisons between categories or across time intervals
– Two types:
 Horizontal (long list of categories)
 Vertical (showing negative values, time periods)
– Line charts: to compare trends

– Pie chart:
 Best for showing few categories
 Parts of the pie chart should add up to a meaningful whole

– Stacked areas (ex. cumulative flow diagram)
 When cumulative proportions matter
 They are poor at showing specific values

– Histograms: to understand the spread in the data

– Box plots:
 Summarise the distribution (median, min_val, max_val) of the data
 Identify outliers in the data

– Scatter plots:
 Used to establish the relationship between the variables

Data Visualization (4 - 6):
 Comparing colours
– Use colour only if it communicates additional information
– Themes:

– Qualitative colours (contrast): the colours carry no obvious relationship among them

– Sequential colours (range of values): the same colour from a faded shade to a dark shade

– Diverging colours (obviously dividing segments): shades run from dark to faded on each side of the dividing point

Data Visualization (5 - 6):
– Good practices

– A colour scheme should

 Add information
 Encode data well
 Accommodate colour blindness
 Print well, in both black and white and colour

– Colour scheme tools

 Color Brewer 2.0: http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
 A ColorBrewer implementation in Python is available through: https://pypi.python.org/pypi/brewer2mpl/1.4

– Selection of colours
 Light grey and dark lines: to show simple data
 Black and red: correlation
 Use legends: indicate what each component represents
 Use labels painted directly on the chart instead of on the axes
 Make sure the visualization stands by itself
 Use the squint test: can this visualization tell a story?

Data Visualization (6 - 6):
– matplotlib library:
– Steps
1. Create the data set and the figure
2. Plot the data
3. Configure axes
4. Add annotations/legends
5. show() or save the file as image/pdf ….

– Implementation aspects

1. import matplotlib.pyplot as plt
2. plt.figure()
3. plt.plot(x_vals, y_vals)
4. plt.plot(x2_vals, y2_vals)
5. plt.xticks([List of values])
6. plt.yticks([List of values])
7. plt.xlim(lower_x, upper_x)
8. plt.ylim(lower_y, upper_y)
9. plt.xlabel('')
10. plt.ylabel('')
11. plt.legend()
12. plt.grid()
13. plt.show()/plt.savefig(…<filename>)
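Put together, the steps above could look like the following script. The data values and labels are made up for illustration, the non-interactive `Agg` backend is selected so the script also runs without a display, and the image is written to an in-memory buffer rather than a named file; all three are choices of this sketch.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Illustrative data sets.
x_vals = [0, 1, 2, 3, 4]
y_vals = [0, 1, 4, 9, 16]
y2_vals = [0, 2, 4, 6, 8]

fig = plt.figure()
plt.plot(x_vals, y_vals, label="x squared")
plt.plot(x_vals, y2_vals, label="2x")

# Configure the axes.
plt.xticks(x_vals)
plt.yticks([0, 4, 8, 12, 16])
plt.xlim(0, 4)
plt.ylim(0, 16)
plt.xlabel("x")
plt.ylabel("y")

# Annotations.
plt.legend()
plt.grid()

# Save as PNG into a buffer; pass a filename instead to write to disk.
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
```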
 Classification of Algorithms

Supervised Learning
 It is the process of creating predictive models using a set of historical data that contains the results you are trying to predict.
 A supervised learning algorithm is one that is given examples that contain the desired target value

 Supervised Learning Approaches: use past results to train a model
 Classification: to identify which group a new record (i.e., customer or event) belongs to, based on its inherent characteristics.
 Regression: uses past values to predict future values; it is used in forecasting and variance analysis

 Predictive Analytics: the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.
 Collaborative filtering: mining user behavior to make product recommendations
Un-Supervised Learning
 Unsupervised learning does not use previously known results to train its models.
 Unsupervised algorithms are not given the desired target answer; they must find something plausible on their own.

 Uses descriptive statistics to identify clusters (ex: market analysis)

 They can identify

 clusters or groups of similar records within a database (i.e., clustering)
 relationships among values in a database (i.e., association)

Tasks
 Supervised learning tasks
 K –Nearest neighbors
 Naïve Bayes
 Support vector machines
 Decision trees

 Un –supervised learning tasks:

 k-Means
 DBSCAN

Why do we have so many algorithms?

Choice of the Algorithm
 Consider your goal
 If you are trying to predict or forecast a target value: supervised learning
 If the target value is discrete {Yes/No, 1/2/3, A/B/C, Red/Yellow…}, use a classification algorithm
 If the target value is continuous [a range of values], use regression {0.00 to 10.00; -99 to +99; -∞ to +∞}

 If you are NOT trying to predict or forecast a target value: unsupervised learning
 Try to fit the data into some discrete groups (clustering)

 Supervised Learning
– Classification

Introduction to classification: kNN Algorithm

To classify a new point inX:

for every point in our data set:
    compute the distance between inX and the current point
sort the distances in increasing order
take the k items with the lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX

Example -kNN
Consider a questionnaire survey and objective testing with two attributes, acid durability and strength, to classify whether a special paper tissue is good or not.
Four training samples:

Suppose the factory produces a tissue with test values X1=3, X2=7.

Without an expensive survey, can we guess what the classification of this new tissue is?
http://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html

1. Let the number of nearest neighbours be K=3. Compute the distance from the query instance to each of the training samples and identify the 3 minima.
2. Gather the category Y of the nearest neighbours.
3. Within K=3, we have 2 good and 1 bad as per the survey input data.
4. Conclude that the new tissue paper that passes the laboratory tests with X1=3, X2=7 is included in the good category.
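The kNN procedure can be sketched in plain Python as below. The table of four training samples did not survive in this copy of the slides, so the values used here are an assumption: they are taken to be those of the cited tutorial and are consistent with the "2 good, 1 bad" outcome stated above. The function name `knn_classify` is likewise just a label for this sketch.

```python
import math
from collections import Counter


def knn_classify(in_x, data, labels, k):
    """Return the majority class among the k training points nearest to in_x."""
    # Distance from the query point to every training point.
    distances = [
        (math.dist(in_x, point), label) for point, label in zip(data, labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # Majority vote.
    return Counter(k_labels).most_common(1)[0][0]


# Assumed training samples: (X1 = acid durability, X2 = strength) and class.
training = [(7, 7), (7, 4), (3, 4), (1, 4)]
classes = ["Bad", "Bad", "Good", "Good"]

prediction = knn_classify((3, 7), training, classes, k=3)
print(prediction)
```

With these values the three nearest neighbours of (3, 7) are (3, 4), (1, 4) and (7, 7), giving two "Good" votes and one "Bad".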

Naïve Bayes:
Naïve: It simplifies the probability computations by assuming that predictive features are mutually independent.

Bayes: It maps the probabilities of observing input features given the classes they belong to, to the probability distribution over classes, based on Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

 Probability of A occurring given that B is true: $P(A \mid B)$

 Probability of occurrence of A: $P(A)$

 Probability of occurrence of B: $P(B)$

 Probability of observing B given that A occurs: $P(B \mid A)$

Test        Cancer   No cancer   Total
Test +ve    80       900         980
Test –ve    20       9000        9020
Total       100      9900        10000

 80 out of 100 cancer patients are correctly diagnosed while the rest are not
 Cancer is falsely detected in 900 out of 9900 healthy people
 If the result of this screening test on a person is positive, what is the probability that the person actually has cancer?
$$P(C \mid Pos) = \frac{P(Pos \mid C)\,P(C)}{P(Pos)}$$
$$P(Pos \mid C) = \frac{80}{100} = 0.8; \quad P(C) = \frac{100}{10000} = 0.01; \quad P(Pos) = \frac{980}{10000} = 0.098$$
$$P(C \mid Pos) = \frac{0.8 \times 0.01}{0.098} \approx 0.0816 = 8.16\%$$
which is significantly higher than our general assumption: 100/10000 = 1%
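The screening-test calculation can be reproduced directly from the counts in the table. The variable names below are chosen for this sketch only.

```python
# Counts from the screening-test table.
true_pos, false_pos = 80, 900     # test positive: with cancer, without cancer
false_neg, true_neg = 20, 9000    # test negative: with cancer, without cancer
total = true_pos + false_pos + false_neg + true_neg   # 10000

p_c = (true_pos + false_neg) / total                  # P(C)     = 100 / 10000
p_pos = (true_pos + false_pos) / total                # P(Pos)   = 980 / 10000
p_pos_given_c = true_pos / (true_pos + false_neg)     # P(Pos|C) = 80 / 100

# Bayes' theorem: P(C|Pos) = P(Pos|C) * P(C) / P(Pos)
p_c_given_pos = p_pos_given_c * p_c / p_pos

print(round(p_c_given_pos * 100, 2))  # percentage, about 8.16
```

The same number falls out of the table directly as 80/980, which is a useful sanity check on the formula.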
Naïve Bayes (3-3):
Example 2: Spam mail detection. We observe a tendency that mails containing the word “gift” are spam. Classify a given new mail as spam or ham based on the probability:
$$P(Spam \mid gift) = \frac{P(gift \mid Spam)\,P(Spam)}{P(gift)}$$

 Probability of an email being spam, if it contains the word “gift”: $P(Spam \mid gift)$

 The numerator is the probability of a message being spam and containing the word “gift”: $P(gift \mid Spam)\,P(Spam)$

 The denominator is the overall probability of an email containing the word “gift”, equivalent to: $P(gift \mid Spam)\,P(Spam) + P(gift \mid Ham)\,P(Ham)$

 Naïve: the presence of different words is assumed to be independent of each other.
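To show how the independence assumption is used, here is a minimal sketch of the spam calculation extended to more than one word. Every probability in it is hypothetical, invented for the illustration (the slides give no numbers), and the second word "free" is also an addition of this example.

```python
# Hypothetical prior probabilities of the two classes.
p_spam = 0.4
p_ham = 0.6

# Hypothetical per-word likelihoods P(word | class).
p_word_given_spam = {"gift": 0.30, "free": 0.25}
p_word_given_ham = {"gift": 0.02, "free": 0.05}


def p_spam_given_words(words):
    """P(Spam | words), multiplying per-word likelihoods (the naive assumption)."""
    spam_score = p_spam
    ham_score = p_ham
    for word in words:
        spam_score *= p_word_given_spam[word]
        ham_score *= p_word_given_ham[word]
    # The denominator is the total probability of seeing these words.
    return spam_score / (spam_score + ham_score)


p_gift = p_spam_given_words(["gift"])
p_gift_free = p_spam_given_words(["gift", "free"])

print(p_gift, p_gift_free)
```

With a single word this reduces to the formula above; with several words the likelihoods are simply multiplied, which is exactly what the "naïve" independence assumption permits.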

Naïve Bayes (3-3):
 Let C be the event of having cancer and Pos the event of a positive test result. The probability that the person has cancer, given that the test result is positive, is $P(C \mid Pos)$.

 With A = C (the patient has cancer) and B = Pos (the test shows positive), Bayes' theorem $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ gives $P(C \mid Pos) = \frac{P(Pos \mid C)\,P(C)}{P(Pos)}$.

 Un –Supervised Learning
– K Means clustering

K Means clustering (1-7):
 It is the process of grouping complex data into clusters
 Examples: demographics, movies
 K stands for the number of clusters, chosen based on the attributes of the data
 “Split the data into k groups”
 Which group a given data point belongs to can be seen on a scatter plot
 Helps with categorization that we don't know a priori!
 Unlike supervised learning, we do not already know the correct groups; we try to converge the data into groups based on the data itself, and the groups are also unknown (latent values)
 By contrast, a supervised learning algorithm is one that is given examples that contain the desired target value
 Ex: interesting clusters of songs based on the attributes of each song

K Means clustering (3-7):
 We randomly choose the following two centroids (k=2) for the two clusters.
 In this case the 2 centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).

K Means clustering (5-7):

 Using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.

 Therefore, the new clusters are: {1,2} and {3,4,5,6,7}

 The next centroids are: m1=(1.25,1.5) and m2=(3.9,5.1)

K Means clustering (6-7):

 The clusters obtained are: {1,2} and {3,4,5,6,7}

 Therefore, there is no change in the clusters.
 Thus, the algorithm comes to a halt here and the final result consists of 2 clusters, {1,2} and {3,4,5,6,7}.
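The iteration described above can be sketched in plain Python. The table of the seven objects is missing from this copy of the slides, so the coordinates below are an assumption: they are chosen to be consistent with the centroids quoted in the slides (m1=(1.0,1.0) and m2=(5.0,7.0) initially, (1.25,1.5) and (3.9,5.1) at the end). The function `k_means` is a hypothetical helper that reassigns all points in one batch per pass, so its intermediate steps may differ from the slides' tables even though it reaches the same final clusters.

```python
import math

# Assumed coordinates of objects 1..7 (see the note above).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]


def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        centroids = new_centroids
    return assignment, centroids


assignment, centroids = k_means(points, [(1.0, 1.0), (5.0, 7.0)])

# Report clusters using the 1-based object numbers from the slides.
clusters = [
    {i + 1 for i, a in enumerate(assignment) if a == c}
    for c in range(len(centroids))
]
print(clusters, centroids)
```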

K Means clustering (7-7):

Join Our community:

https://my.techmahindra.com/personal/pl73819/blog/Lists/Posts/Post.aspx?ID=2
Thank you
[email protected]

Disclaimer
Tech Mahindra Limited, herein referred to as TechM provide a wide array of presentations and reports, with the contributions of various
professionals. These presentations and reports are for information purposes and private circulation only and do not constitute an offer to buy or sell
any services mentioned therein. They do not purport to be a complete description of the market conditions or developments referred to in the
material. While utmost care has been taken in preparing the above, we claim no responsibility for their accuracy. We shall not be liable for any direct
or indirect losses arising from the use thereof and the viewers are requested to use the information contained herein at their own risk. These
presentations and reports should not be reproduced, re-circulated, published in any media, website or otherwise, in any form or manner, in part or as
a whole, without the express consent in writing of TechM or its subsidiaries. Any unauthorized use, disclosure or public dissemination of information
contained herein is prohibited. Individual situations and local practices and standards may vary, so viewers and others utilizing information contained
within a presentation are free to adopt differing standards and approaches as they see fit. You may not repackage or sell the presentation. Products
and names mentioned in materials or presentations are the property of their respective owners and the mention of them does not constitute an
endorsement by TechM. Information contained in a presentation hosted or promoted by TechM is provided “as is” without warranty of any kind, either
expressed or implied, including any warranty of merchantability or fitness for a particular purpose. TechM assumes no liability or responsibility for the
contents of a presentation or the opinions expressed by the presenters. All expressions of opinion are subject to change without notice.

No ratings yet
Social Research Methods: Chapter 15: Quantitative Data Analysis
22 pages
Asness Et Al 2013 - Quality Minus Junk
No ratings yet
Asness Et Al 2013 - Quality Minus Junk
60 pages
Feature Engineering
No ratings yet
Feature Engineering
23 pages
Chart Tamer Introduction
No ratings yet
Chart Tamer Introduction
13 pages
ML Projects For Final Year
No ratings yet
ML Projects For Final Year
7 pages
The Revised Two Factor Study Process Questionnaire: R-SPQ-2F
No ratings yet
The Revised Two Factor Study Process Questionnaire: R-SPQ-2F
20 pages
Machine Learning Internship Projects
No ratings yet
Machine Learning Internship Projects
8 pages
Attitudes and Food Choice Behaviour
No ratings yet
Attitudes and Food Choice Behaviour
8 pages
Machine Learning in Traffic Classification of SDN - Final Project Report
No ratings yet
Machine Learning in Traffic Classification of SDN - Final Project Report
11 pages
Data Science A Beginner S Guide 1668243666
100% (1)
Data Science A Beginner S Guide 1668243666
26 pages
Machine Learning
100% (1)
Machine Learning
21 pages
Introduction To Data Mining: Dr. Dipti Chauhan Assistant Professor SCSIT, SUAS Indore
No ratings yet
Introduction To Data Mining: Dr. Dipti Chauhan Assistant Professor SCSIT, SUAS Indore
16 pages
Research Paper - EVA Indian Banking Sector
No ratings yet
Research Paper - EVA Indian Banking Sector
14 pages
SC222: Tutorial Sheet 4
No ratings yet
SC222: Tutorial Sheet 4
2 pages
Calculation of Monthly Mean Solar Radiation For Horizontal A N D Inclined Surfaces
No ratings yet
Calculation of Monthly Mean Solar Radiation For Horizontal A N D Inclined Surfaces
7 pages
Feature Extraction of Geo-Tagged Twitter Data For Sentiment Analysis
No ratings yet
Feature Extraction of Geo-Tagged Twitter Data For Sentiment Analysis
6 pages
Collecting and Analyzing Diagnostic Information: Prepared By: Ankit Vyas Binal Mehta Babita Agraval Ankit Jaisval
No ratings yet
Collecting and Analyzing Diagnostic Information: Prepared By: Ankit Vyas Binal Mehta Babita Agraval Ankit Jaisval
26 pages
Functional Balance Assessment With Pediatric.17
No ratings yet
Functional Balance Assessment With Pediatric.17
7 pages
Balzer2015 PDF
No ratings yet
Balzer2015 PDF
6 pages
Unit II Requirements Elicitation
No ratings yet
Unit II Requirements Elicitation
23 pages
Lieberson 1991
No ratings yet
Lieberson 1991
14 pages
Deep Learning Based Recommendation Systems
No ratings yet
Deep Learning Based Recommendation Systems
47 pages
Machine Learning 1
No ratings yet
Machine Learning 1
11 pages
Understanding Data Mining
No ratings yet
Understanding Data Mining
21 pages
Fake News Detection Using Machine Learning Models
No ratings yet
Fake News Detection Using Machine Learning Models
5 pages
Feature Selection Techniques in ML With Python-1
No ratings yet
Feature Selection Techniques in ML With Python-1
7 pages
Machine Learning
No ratings yet
Machine Learning
29 pages
Seminar Report Machine Learning
No ratings yet
Seminar Report Machine Learning
20 pages
Predicting The Outcomes of National Football League Games PDF
No ratings yet
Predicting The Outcomes of National Football League Games PDF
14 pages
Assignment 4
100% (1)
Assignment 4
3 pages
Data Mining With Bigdata
No ratings yet
Data Mining With Bigdata
30 pages
Memory Based Reasoning - BIA
100% (1)
Memory Based Reasoning - BIA
19 pages
Implementation Data Mining With K-Means Algorithm For Clustering Distribution Rabies Case Area in Palembang City PDF
No ratings yet
Implementation Data Mining With K-Means Algorithm For Clustering Distribution Rabies Case Area in Palembang City PDF
8 pages
Feature Selection Techniques in Machine Learning
No ratings yet
Feature Selection Techniques in Machine Learning
9 pages
Missing Value Treatment
No ratings yet
Missing Value Treatment
22 pages
Machine Learning Guide Line
No ratings yet
Machine Learning Guide Line
10 pages
Data Mining Overview
No ratings yet
Data Mining Overview
14 pages
Assignment ASTM
No ratings yet
Assignment ASTM
7 pages
Cluster Analysis: Concepts and Techniques - Chapter 7
100% (1)
Cluster Analysis: Concepts and Techniques - Chapter 7
60 pages
Intrusion Detection System in Software Defined Networks Using Machine Learning Approach
No ratings yet
Intrusion Detection System in Software Defined Networks Using Machine Learning Approach
8 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
3 pages
Role of Machine Learning in The Field of Fiber Reinforced Polymer
No ratings yet
Role of Machine Learning in The Field of Fiber Reinforced Polymer
6 pages
Building Recommendation System Using Movielens Data
No ratings yet
Building Recommendation System Using Movielens Data
6 pages
Cheatsheet Midterms 2 - 3
No ratings yet
Cheatsheet Midterms 2 - 3
2 pages
Effective Amazon Machine Learning
From Everand
Effective Amazon Machine Learning
Alexis Perrier
No ratings yet