Machine Learning: What Is Data Science
Machine Learning
SKEM4173
Artificial Intelligence
Predictive Models
23/02/2017
Walmart’s consumer demand projection systems
Twitter’s trending topics
Overview of Machine
Learning
Wage Data
Wage data, which contains income
survey information for males from the
central Atlantic region of the United
States
Left: wage as a function of age. On
average, wage increases with age until
about 60 years of age, at which point it
begins to decline
Center: wage as a function of year.
There is a slow but steady increase of
approximately $10,000 in the average
wage between 2003 and 2009
Right: Boxplots displaying wage as a
function of education, with 1
indicating the lowest level (no high
school diploma) & 5 the highest level
(an advanced graduate degree). On
average, wage increases with the level
of education
[Diagram: Age -> Model -> Wage]
Overview of Machine
Learning
Gene Expression Data
Left: Representation of the NCI60 gene
expression data set in a two-dimensional
space. Each point
corresponds to one of the 64 cell lines.
There appear to be 4 groups of cell
lines, which we have represented
using different colours
Right: Same as left panel except that
we have represented each of the 14
different types of cancer using a
different coloured symbol. Cell lines
corresponding to the same cancer type
tend to be nearby in the 2-dimensional
space
[Diagram labels: Dataset, Cancer]
[Diagram: x -> Model -> y]
Regression: y is numeric
Classification: y is a class
Exercises
• Explain whether each scenario is a classification or regression problem, & indicate
whether we are most interested in inference or prediction.
• We collect a set of data on the top 500 firms in the US. For each firm we record
profit, number of employees, industry & the CEO salary. We are interested in
understanding which factors affect CEO salary.
• We are considering launching a new product & wish to know whether it will be
a success or a failure. We collect data on 20 similar products that were
previously launched. For each product we have recorded whether it was a
success or failure, price charged for the product, marketing budget,
competition price, & ten other variables.
• We are interested in predicting the % change in the US dollar in relation to the
weekly changes in the world stock markets. Hence we collect weekly data for
all of 2012. For each week we record the % change in the dollar, the % change
in the US market, the % change in the British market, & the % change in the
German market.
Exercises
• You will now think of some real-life applications for machine learning.
• Describe three real-life applications in which classification might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.
• Describe three real-life applications in which regression might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.
• Describe three real-life applications in which cluster analysis might be useful.
Precision
The dataset is split into a training set and a test set.
• Training set: data that you feed to the model-building algorithm (regression,
decision tree, etc.) so that the algorithm can set the correct parameters to best
predict the outcome variable
• Test set: data that you feed into the resulting model, to verify that the model’s
predictions are accurate
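The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the 25% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test).

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))                     # stand-in for 20 records
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before the split avoids a test set made up only of the last records collected, and the model-building algorithm should see only `train`.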
                        Predicted
                        Negative   Positive
True        Negative    TN         FP
condition   Positive    FN         TP
• For a classifier, accuracy is defined as the number of items categorized correctly divided by
total number of items – what fraction of the time the classifier is correct
• Accuracy = (TP + TN) / (TP + FP + TN + FN) = (cM[1,1] + cM[2,2]) / sum(cM) = 92%
• The error of around 8% is unacceptably high for a spam filter!
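As a worked sketch of the accuracy calculation, the snippet below computes it from a 2x2 confusion matrix laid out as on the slide (rows are the true condition, columns the prediction). The counts are hypothetical, chosen only so that the result comes out near 92%; the slides do not give the underlying numbers. The slide's 1-based `cM[1,1]` and `cM[2,2]` correspond to `cm[0][0]` and `cm[1][1]` in Python.

```python
def accuracy(cm):
    """Fraction of items on the diagonal of a 2x2 confusion matrix."""
    correct = cm[0][0] + cm[1][1]          # TN + TP
    total = sum(sum(row) for row in cm)    # TN + FP + FN + TP
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return correct / total


# Hypothetical counts (assumption): [[TN, FP], [FN, TP]]
cm = [[264, 14],
      [22, 158]]

acc = accuracy(cm)
print(f"accuracy = {acc:.1%}, error = {1 - acc:.1%}")
```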
Validating Models
• Model evaluation: performance of the model on training data
• Biggest worry: validity of the model – will it show similar
quality on new data in production?
• Model validation: testing of a model on new data (test set)
Validating Models
A common model problem: Overfitting
• An overfit model looks
great on the training
data & performs poorly
on new data
• Memorized the training
data instead of
discovering generalizable
rules or patterns
• Overfit model is bad:
– more complicated
than anything useful
– less accurate in
production
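The toy example below illustrates the memorization effect. It is a sketch under stated assumptions: the data points are invented, the "memorizing" model is a 1-nearest-neighbour lookup, and the simple rule (predict 1 when x >= 5) is hand-picked rather than fitted. Two training labels are deliberately noisy.

```python
def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x (pure memorization)."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def threshold_predict(x, cutoff=5.0):
    """Simple general rule: class 1 at or above the cutoff, else class 0."""
    return 1 if x >= cutoff else 0


def accuracy(predict, data):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    hits = sum(1 for x, y in data if predict(x) == y)
    return hits / len(data)


# Invented data (assumption). The labels at x=3 and x=7 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(1.4, 0), (2.6, 0), (3.2, 0), (4.4, 0),
        (5.8, 1), (6.8, 1), (7.2, 1), (8.6, 1)]


def memorizer(x):
    return nearest_neighbour_predict(train, x)


for name, model in [("memorizer", memorizer), ("threshold", threshold_predict)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

On this data the memorizer scores 100% on the training set but only 50% on the test set, because it reproduces the two noisy labels. The threshold rule scores 75% on the training set and 100% on the test set.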
Decision Trees
• Decision trees predict responses to data
• To predict a response, follow the decisions in the tree from the
root (beginning) node down to a leaf node. The leaf node
contains the response.
Decision Trees
• This tree predicts classifications based on 2 predictors, x1 & x2
• To predict, start at the top node, represented by a triangle (Δ).
The 1st decision is whether x1<0.5. If so, follow the left branch,
& see that the tree classifies the data as type 0.
• If x1>=0.5, then follow the right branch to the lower-right
triangle node. Here the tree asks if x2<0.5. If so, then follow
the left branch to see that the tree classifies the data as type 0.
If not, then follow the right branch to see that the tree
classifies the data as type 1.
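The walk through the tree described above can be written as nested conditions. This is a minimal sketch of that specific two-split tree; the function name `predict` is an assumption made for the example.

```python
def predict(x1, x2):
    """Follow the tree from the root to a leaf and return the class (0 or 1)."""
    if x1 < 0.5:      # root node: left branch is a leaf
        return 0
    if x2 < 0.5:      # lower-right node: left branch is a leaf
        return 0
    return 1          # right branch: x1 >= 0.5 and x2 >= 0.5


for point in [(0.2, 0.9), (0.7, 0.3), (0.7, 0.8)]:
    print(point, "->", predict(*point))
```

Each `if` corresponds to one decision node, and each `return` to a leaf.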
Questions
1. What is machine learning?
2. Explain machine learning techniques and their categories
3. Why do we need to evaluate our model?
4. Why do we need a portion of our data called test data?
5. Why do we need to validate our model?