
Machine Learning: What Is Data Science

The document provides an overview of machine learning, describing how it uses data to build models to make predictions. It discusses the differences between supervised and unsupervised learning, with supervised learning using input and output data to predict outputs, and unsupervised learning using only input data to discover patterns. The document also gives examples of applications of machine learning like product recommendations, advertising evaluation, and medical research.

Uploaded by fauzansaadon

23/02/2017

Machine Learning

SKEM4173
Artificial Intelligence

What is Data Science


Managing the process that can transform hypotheses & data into actionable predictions

Typical predictive analytic goals:
• who will win an election
• what products will sell well together
• which loans will default
• which advertisements will be clicked on

Data scientist is responsible for:
• acquiring data
• managing data
• choosing modelling technique
• writing the code
• verifying results

[Figure: diagram labelled Computer Science, Machine Learning, Statistics]

Predictive Models


What is Data Science

Some famous examples

• Amazon’s product recommendation systems
• Google’s advertisement valuation systems
• LinkedIn’s contact recommendation system
• Twitter’s trending topics
• Walmart’s consumer demand projection systems

Data Science Applications


Overview of Machine Learning


• Refers to a vast set of tools for understanding data
• These tools can be classified as supervised or unsupervised
• Broadly speaking,
  • supervised machine learning involves building a statistical model for predicting, or inferring, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy
  • with unsupervised machine learning, there are inputs but no supervising output; nevertheless we can learn relationships & structure from such data

Overview of Machine Learning

Wage Data
Wage data, which contains income survey information for males from the central Atlantic region of the United States
Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline
Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009
Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) & 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education

[Diagram: Age → Model → Wage]


Overview of Machine Learning

Gene Expression Data
Left: Representation of the NCI60 gene expression data set in a two-dimensional space, $Z_1$ & $Z_2$. Each point corresponds to one of the 64 cell lines. There appear to be 4 groups of cell lines, which we have represented using different colours
Right: Same as left panel except that we have represented each of the 14 different types of cancer using a different coloured symbol. Cell lines corresponding to the same cancer type tend to be nearby in the 2-dimensional space

[Diagram labels: Dataset, Cancer]

What is Machine Learning


• Suppose that we are consultants hired by a client to provide
advice on how to improve sales of a particular product
• The Advertising data set consists of the sales of that product in
200 different markets, along with advertising budgets for the
product in each of those markets for 3 different media: TV,
radio and newspaper


What is Machine Learning?


The Advertising data set. The plot
displays sales, in thousands of units, as
a function of TV, radio & newspaper
budgets, in thousands of dollars, for
200 different markets. In each plot we
show the simple least squares fit of
sales to that variable. In other words,
each blue line represents a simple
model that can be used to predict
sales using TV, radio, & newspaper,
respectively
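
The simple least squares fit described in the caption can be sketched in a few lines. This is only an illustration: the budget and sales figures below are made-up stand-ins, not the Advertising data, and the function name `simple_least_squares` is an assumption of this sketch.

```python
from statistics import mean


def simple_least_squares(x, y):
    """Return (intercept, slope) of the line y ~ intercept + slope * x
    that minimises the sum of squared vertical errors."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values (NOT the real Advertising data):
# TV budget in thousands of dollars, sales in thousands of units.
tv_budget = [20.0, 60.0, 100.0, 150.0, 200.0, 250.0]
sales = [7.5, 10.1, 12.4, 15.8, 17.9, 21.2]

intercept, slope = simple_least_squares(tv_budget, sales)

# Use the fitted line to predict sales for a new TV budget.
predicted_sales = intercept + slope * 120.0
print(f"sales ~ {intercept:.2f} + {slope:.3f} * TV")
print(f"predicted sales at TV = 120: {predicted_sales:.2f}")
```

Running the same function once per medium (TV, radio, newspaper) would give the three separate lines shown in the plot.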

What is Machine Learning


• It is not possible for our client to directly increase sales of the
product. On the other hand, they can control the advertising
expenditure in each of the 3 media
• Therefore, if we determine that there is an association
between advertising & sales, then we can instruct our client to
adjust advertising budgets, thereby indirectly increasing sales
• In other words, our goal is to develop an accurate model that
can be used to predict sales on the basis of the 3 media
budgets
• In this setting, the advertising budgets are input variables
(predictors) while sales is an output variable (response)
• $X_1$ - TV budget, $X_2$ - radio budget, $X_3$ - newspaper budget


What is Machine Learning


• Generally, suppose we observe a quantitative response $Y$ & $p$ different predictors, $X_1, X_2, \dots, X_p$
• We assume that there is some relationship between $Y$ & $X = (X_1, X_2, \dots, X_p)$, i.e.
  $Y = f(X) + \epsilon$
• Here $f$ is some fixed but unknown function of $X_1, \dots, X_p$, & $\epsilon$ is a random error term, which is independent of $X$ & has mean zero
• In this formulation, $f$ represents the systematic information that $X$ provides about $Y$

What is Machine Learning?


The Income data set
Left: The red dots are the observed
values of income (in tens of thousands
of dollars) & years of education for 30
individuals
Right: The blue curve represents the
true underlying relationship between
income & years of education, which is
generally unknown (but is known in
this case because the data were
simulated). The black lines represent
the error associated with each
observation. Note that some errors are
positive (if an observation lies above
the blue curve) & some are negative (if
an observation lies below the curve).
Overall, these errors have
approximately mean zero
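
A small simulation can make the idea of a true curve plus mean-zero errors concrete. The slides do not give the function used to simulate the Income data, so the `f` below, the noise level and the range of education values are all assumptions chosen only for illustration.

```python
import random
from statistics import mean


def f(years_of_education):
    """Hypothetical 'true' relationship (an assumption of this sketch;
    the function behind the Income data is not given in the slides)."""
    return 20.0 + 3.0 * years_of_education


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 30 simulated individuals, matching the size of the Income data set.
education = [rng.uniform(10.0, 22.0) for _ in range(30)]
errors = [rng.gauss(0.0, 5.0) for _ in education]  # error term with mean zero
income = [f(x) + e for x, e in zip(education, errors)]

# Each observation sits above the curve (positive error) or below it
# (negative error); the errors average out to roughly zero.
n_above = sum(e > 0 for e in errors)
mean_error = mean(errors)
print(f"{n_above} observations above the curve, {30 - n_above} below")
print(f"mean error: {mean_error:.2f}")
```

In real problems only `education` and `income` are observed; `f` and `errors` are hidden, which is why `f` has to be estimated.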


What is Machine Learning?


The plot displays income as a function
of years of education & seniority in the
Income data set. The blue surface
represents the true underlying
relationship between income & years
of education & seniority, which is
known since the data are simulated.
The red dots indicate the observed
values of these quantities for 30
individuals

In essence, machine learning refers to a set of approaches for estimating $f$.

Types of Machine Learning Techniques

Most machine learning problems fall into 1 of 2 categories: Supervised or Unsupervised

Supervised
• For each observation of the predictors $x_i$, $i = 1, \dots, n$ there is an associated response $y_i$
• Wish to fit a model that relates the response to the predictors.
• Aim: to accurately predict the response for future observations (prediction) or to better understand the relationship between the response & the predictors (inference)
• Methods: linear regression & logistic regression

Unsupervised
• For every observation $i = 1, \dots, n$, we observe a vector $x_i$ but no associated response $y_i$
• No response variable to predict
• Referred to as unsupervised because we lack a response variable that can supervise our analysis
• Aim: to understand the relationships between the variables or between the observations
• Method: cluster analysis, or clustering. Goal: to ascertain whether the observations fall into relatively distinct groups
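
As one possible illustration of cluster analysis, the sketch below uses k-means (Lloyd's algorithm). The slides do not name a specific clustering algorithm, so this choice, the toy points and the starting centroids are all assumptions. Note that the input contains only the vectors $x_i$; there is no response to supervise the grouping.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def k_means(points, initial_centroids, iterations=10):
    """Lloyd's algorithm: alternately assign each point to its nearest
    centroid, then move each centroid to the mean of its members."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Final assignment against the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Hypothetical observations: only inputs, no response variable.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of groups is supplied by the analyst here; deciding how many groups exist is part of the clustering problem itself.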


Types of Machine Learning Techniques

Machine Learning
• Supervised (reason: prediction or inference)
  • Regression: y is numeric
  • Classification: y is class
• Unsupervised (reason: inference)

[Diagram: x → Model → y]

Exercises
• Explain whether each scenario is a classification or regression problem, & indicate
whether we are most interested in inference or prediction.
• We collect a set of data on the top 500 firms in the US. For each firm we record
profit, number of employees, industry & the CEO salary. We are interested in
understanding which factors affect CEO salary.
• We are considering launching a new product & wish to know whether it will be
a success or a failure. We collect data on 20 similar products that were
previously launched. For each product we have recorded whether it was a
success or failure, price charged for the product, marketing budget,
competition price, & ten other variables.
• We are interested in predicting the % change in the US dollar in relation to the
weekly changes in the world stock markets. Hence we collect weekly data for
all of 2012. For each week we record the % change in the dollar, the % change
in the US market, the % change in the British market, & the % change in the
German market.


Exercises
• You will now think of some real-life applications for machine learning.
• Describe three real-life applications in which classification might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.
• Describe three real-life applications in which regression might be useful.
Describe the response, as well as the predictors. Is the goal of each application
inference or prediction? Explain your answer.
• Describe three real-life applications in which cluster analysis might be useful.

Flow of Creating & Evaluating Models


Model evaluation
• Quantifying the performance of a
model
• Must use a measure of model
performance that’s appropriate to
both the original business goal & the
chosen modelling technique

Example measures
• Predicting who would default on loans (classification): Accuracy, Precision
• Predicting revenue lost to defaulting loans (regression): RMSE
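
For the regression case, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values. A minimal sketch follows; the loss figures are hypothetical and serve only to show the calculation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical revenue lost per defaulted loan (in thousands of dollars).
actual_loss = [12.0, 3.5, 8.0, 20.0]
predicted_loss = [10.0, 4.0, 9.5, 17.0]

print(f"RMSE: {rmse(actual_loss, predicted_loss):.2f}")
```

RMSE is expressed in the same units as the response, which makes it easy to relate back to the business goal.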


Flow of Creating & Evaluating Models


Model validation
• Generation of an assurance that the
model will work in production as it
worked during training
• Biggest cause of model validation
failures – not having enough training
data to represent the variety of what
may later be encountered in
production

Test & Training Splits


• When you’re building a model to make predictions, you need
data to build the model (training set)
• You also need data to test whether the model makes correct
predictions on new data (test or hold-out set)

[Diagram: Dataset split into Training set & Test set]

• Training set: data that you feed to the model-building algorithm (regression, decision tree, etc.) so that the algorithm can set the correct parameters to best predict the outcome variable
• Test set: data that you feed into the resulting model, to verify that the model’s predictions are accurate
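
One common way to produce the two sets is a random split. The sketch below is an assumption about how that might be done; the slides do not prescribe a split ratio or method, and the function name `train_test_split` and the 20% default are illustrative choices.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out `test_fraction` of them.

    Returns (training_set, test_set). The default fraction and seed are
    illustrative, not recommendations from the slides.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 10 row identifiers.
dataset = list(range(10))
training_set, test_set = train_test_split(dataset, test_fraction=0.3)
print(len(training_set), "training rows,", len(test_set), "test rows")
```

Every row lands in exactly one of the two sets, so the test set contains only data the model-building algorithm never saw.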


Evaluating Classification Models


• When building a model, the 1st thing to check is whether the model
even works on the data it was trained on
• Example of classifying email into spam (email we in no way
want) & non-spam (email we want)
• Summary of classifier performance – confusion matrix (table
that summarizes the classifier’s predictions against the actual
known data categories)
Confusion matrix

                            Predicted condition
                            Negative    Positive
True condition   Negative   TN          FP
                 Positive   FN          TP

Evaluating Classification Models


Measures of Classifier Performance: Accuracy, Precision, Recall

• For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items, i.e. the fraction of the time the classifier is correct
• Accuracy = (TP + TN) / (TP + FP + TN + FN) = (cM[1,1] + cM[2,2]) / sum(cM) = 92%
• The error of around 8% is unacceptably high for a spam filter!
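
The three measures can be computed directly from the four cells of the confusion matrix. The sketch below uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), which the slides name but do not spell out. The labels are hypothetical, so the result will not reproduce the 92% figure above.

```python
def confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]]: rows are the true condition, columns
    the predicted condition, with 0 = negative and 1 = positive."""
    cm = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm


def accuracy(cm):
    """Fraction of all items that were classified correctly."""
    return (cm[0][0] + cm[1][1]) / sum(map(sum, cm))


def precision(cm):
    """Of the items predicted positive, the fraction that really are."""
    tp, fp = cm[1][1], cm[0][1]
    return tp / (tp + fp)


def recall(cm):
    """Of the items that really are positive, the fraction that was found."""
    tp, fn = cm[1][1], cm[1][0]
    return tp / (tp + fn)


# Hypothetical labels: 1 = spam, 0 = non-spam.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)
print(accuracy(cm), precision(cm), recall(cm))
```

Python lists are 0-indexed, so `cm[0][0]` and `cm[1][1]` here correspond to `cM[1,1]` and `cM[2,2]` in the formula above.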


Validating Models
• Model evaluation: performance of the model on training data
• Biggest worry: validity of the model – will it show similar
quality on new data in production?
• Model validation: testing of a model on new data (test set)

Validating Models
A common model problem: Overfitting
• An overfit model looks
great on the training
data & performs poorly
on new data
• Memorized the training
data instead of
discovering generalizable
rules or patterns
• Overfit model is bad:
– more complicated
than anything useful
– less accurate in
production
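
The effect can be illustrated with a model that memorizes its training data. The sketch below compares a 1-nearest-neighbour "memorizer" with a simple threshold rule on simulated data. The data-generating rule, the 20% label noise and both models are assumptions of this example, and the threshold rule is taken as known rather than learned.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical data: label is 1 when x > 0.5, then flipped with
    probability `noise` to mimic measurement error."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def memorizer_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def simple_rule_predict(x):
    """A simple, generalizable rule (assumed known here, not learned)."""
    return int(x > 0.5)


def accuracy(predict, data):
    return sum(predict(x) == label for x, label in data) / len(data)


rng = random.Random(1)
train = make_data(200, rng)
test = make_data(200, rng)

memorizer_train_acc = accuracy(lambda x: memorizer_predict(train, x), train)
memorizer_test_acc = accuracy(lambda x: memorizer_predict(train, x), test)
rule_test_acc = accuracy(simple_rule_predict, test)

print(f"memorizer: train {memorizer_train_acc:.2f}, test {memorizer_test_acc:.2f}")
print(f"simple rule: test {rule_test_acc:.2f}")
```

The memorizer scores perfectly on the training data because it has stored every noisy label. On new data it will usually do worse than the simple rule, which is the pattern described above.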


Ensuring Model Quality


• The data used to build a model is not the best data for testing
the model’s performance
• Because this data was seen during model construction, &
model construction is optimizing your performance measure,
you tend to get exaggerated measures of performance on your
training data
• Perform all of your clever work on the training data alone, &
delay measuring your performance with respect to your test
data until as late as possible in your project (testing on
held-out data)

Decision Trees
• Decision trees predict responses to data
• To predict a response, follow the decisions in the tree from the
root (beginning) node down to a leaf node. The leaf node
contains the response.


Decision Trees
• This tree predicts classifications based on 2 predictors, x1 & x2
• To predict, start at the top node, represented by a triangle (Δ).
The 1st decision is whether x1<0.5. If so, follow the left branch,
& see that the tree classifies the data as type 0.
• If x1>=0.5, then follow the right branch to the lower-right
triangle node. Here the tree asks if x2<0.5. If so, then follow
the left branch to see that the tree classifies the data as type 0.
If not, then follow the right branch to see that the tree
classifies the data as type 1.
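
The walk from the root to a leaf can be written as a short loop. The nested-dictionary layout and the key names (`feature`, `threshold`, `left`, `right`) are assumptions of this sketch. Only the two decisions (x1 < 0.5, then x2 < 0.5) and the leaf classes come from the tree described above.

```python
# The tree described above as nested dictionaries.
# A dict is a decision node; a bare integer is a leaf (the predicted class).
tree = {
    "feature": "x1",
    "threshold": 0.5,
    "left": 0,  # x1 < 0.5 -> type 0
    "right": {
        "feature": "x2",
        "threshold": 0.5,
        "left": 0,   # x1 >= 0.5 and x2 < 0.5 -> type 0
        "right": 1,  # x1 >= 0.5 and x2 >= 0.5 -> type 1
    },
}


def predict(node, sample):
    """Follow the decisions from the root node down to a leaf node."""
    while isinstance(node, dict):
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


print(predict(tree, {"x1": 0.2, "x2": 0.9}))  # left branch at the root
print(predict(tree, {"x1": 0.7, "x2": 0.3}))  # right, then left
print(predict(tree, {"x1": 0.7, "x2": 0.8}))  # right, then right
```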

Questions
1. What is machine learning?
2. Explain machine learning techniques and its categories
3. Why do we need to evaluate our model?
4. Why do we need a portion of our data called test data?
5. Why do we need to validate our model?
