Machine Learning

This document provides an overview of machine learning and introduces a machine learning project using R language. It discusses data science and machine learning, introduces the R programming language, and outlines a step-by-step machine learning tutorial using R. Key topics covered include introduction to data science and machine learning, benefits of machine learning, supervised and unsupervised learning algorithms, and tools for data scientists.

Uploaded by

sitvijay
Copyright
© © All Rights Reserved

Machine Learning (ML) Applications and Simple ML Project using R Language

Dr.D.Senthilkumar
University College of Engineering (BIT Campus)
Anna University, Tiruchirappalli

Dr. D. SENTHIL KUMAR – (AP/CSE) UNIVERSITY COLLEGE OF ENGINEERING TRICHY, 6/16/2021
OUTLINE

• Introduction to Data Science & Machine Learning
• Introduction to the R Language
• Machine Learning Project using R (Step-by-Step Tutorial)

Data Science

• Data Science is the study of data.
• Data Science is the art of uncovering the insights and trends hidden in data.
• Data Science is the process of using data to understand different things.
• Data Science helps translate data into a story. Storytelling helps uncover insights, and those insights inform decisions and strategic choices.

Statistics and Machine Learning

Data Science & Machine Learning: Multiple Disciplines

[Figure: diagram placing Data Science at the overlap of Mathematics / Statistics, CS / IT, and Domain / Business Knowledge, with Machine Learning, Software Development, and Research as labeled regions.]

Use cases of Data Science and Machine Learning

Role of a Data Scientist
1. Reframe business challenges as analytics challenges.
This is the skill of diagnosing a problem, identifying its core, and determining which candidate analytical methods can be applied to solve it.
2. Design, implement and deploy statistical models and data mining techniques on data.
This activity is the core of the data scientist's role: applying complex or advanced analytical methods to a variety of business problems using data.
3. Develop insights that lead to actionable recommendations.
Learn how to draw insights out of data and communicate them effectively.

Skills required for a Data Scientist

Tools available to a Data Scientist

Why “Learn” ?
• Machine learning is programming computers to optimize a
performance criterion using example data or past experience.
• There is no need to “learn” to calculate payroll
• Learning is used when:
• Human expertise does not exist (navigating on Mars),
• Humans are unable to explain their expertise (speech recognition)
• Solution changes in time (routing on a computer network)
• Solution needs to be adapted to particular cases (user biometrics)

What is Machine Learning?
• The ability of a machine to improve its own performance through software that employs artificial intelligence techniques to mimic the ways humans seem to learn, such as repetition and experience.
• Optimize a performance criterion using example data or past experience.
• Role of statistics: inference from a sample
• Role of computer science: efficient algorithms to
• Solve the optimization problem
• Represent and evaluate the model for inference

What is Machine Learning?
Tom M. Mitchell:
A computer program is said to learn from Experience E with respect to
some Task T and some Performance measure P, if its performance on T,
as measured by P, improves with experience E.
Learning means Improving with Experience at some Task
Example: Smart Homes
• T: Estimate the desired temperature
• E: Learning from temperature dataset
• P: Accuracy of the desired temperature

What exactly is “Machine Learning”?
• Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms.
• A computer system (expert system) that is imbued with decision-making ability like that of a human expert.
• Two parts:
a. Knowledge base (database)
b. Inference engine (predicts with a certain probability)
• It is a learner (like a small baby) that looks at examples (obstacles), analyses them using stored data, which also includes previous experience, finds its algorithm, and predicts the possible solution with the highest probability.
• It keeps updating itself with every obstacle solved, enhancing its performance every time.
• Its algorithms include different combinations of logic.
Necessity of Human Machine Interfacing
• The number of types of obstacles in the real world is huge, so the system has to move toward a generalized solution using a certain set of rules.
• Its main disadvantage is a high error rate.
• Human interfacing with these systems therefore becomes necessary.

Traditional vs Machine Learning algorithms

Machine Learning - Benefits
• Makes human-computer interaction easier
• Relatively simple to integrate
• Will distinguish your products from others
• Increases customer satisfaction
• Will improve simple-intelligent systems

Supervised & Unsupervised ML Algorithms

• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Supervised & Unsupervised ML Algorithms
1. How to estimate the price of a fruit if freshness (in days) is given?
2. How to identify whether a given fruit is fresh (freshness of not more than 50 days) or not?
3. How to put the fruits in two buckets?
• Supervised ML algorithms
1. Regression: estimate a real value.
2. Classification: estimate a discrete value.
• Unsupervised ML algorithms
1. Clustering: group the given items.

ML Algorithms
• Supervised:
1. Decision trees,
2. K-Nearest-neighbors,
3. Regression and Classification
• Unsupervised:
1. Clustering,
2. Dimensionality reduction,
3. Association
[Figure: example algorithms: Logistic Regression, Linear Regression, K-means clustering, Apriori, PCA, SVM, Decision Tree, ANN]

Machine Learning Life Cycle

Overview of the Workflow of ML

Classification / Prediction —A Two-Step Process

• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
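The two steps can be sketched in base R. This is only an illustration, not code from the slides: the nearest-centroid classifier, the 2/3 versus 1/3 split, the seed, and the names `train`, `test`, `centroids` and `classify` are all assumptions chosen to keep the example short.

```r
# Hold out an independent test set (assumed split: 2/3 train, 1/3 test).
set.seed(7)
data(iris)
train_idx <- sample(nrow(iris), size = round(2 / 3 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Step 1, model construction: one mean vector (centroid) per class label.
centroids <- aggregate(. ~ Species, data = train, FUN = mean)

# Step 2, model usage: assign each sample to the class with the nearest centroid.
classify <- function(newdata) {
  features <- as.matrix(newdata[, 1:4])
  centers  <- as.matrix(centroids[, -1])
  nearest  <- apply(features, 1, function(obs) {
    which.min(colSums((t(centers) - obs)^2))
  })
  centroids$Species[nearest]
}

# Accuracy estimate: share of test samples whose known label matches the prediction.
predicted <- classify(test)
accuracy  <- mean(predicted == test$Species)
print(accuracy)
```

Only `test`, which the model never saw during construction, is used for the accuracy estimate.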
Evaluating the Accuracy of a Classifier or Predictor
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times; accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
• At the i-th iteration, use Di as the test set and the others as the training set
• Leave-one-out: k folds where k = number of tuples, for small data sets
• Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data
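A minimal base-R sketch of k-fold cross-validation follows. The 1-nearest-neighbour learner, the seed, and the names `fold`, `predict_1nn` and `fold_accuracy` are assumptions made for illustration; any classifier could take the learner's place.

```r
set.seed(7)
data(iris)
k <- 10

# Randomly assign every row to one of k mutually exclusive, equal-sized folds.
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in learner (an assumption): 1-nearest-neighbour on the four measurements.
predict_1nn <- function(train, test) {
  sapply(seq_len(nrow(test)), function(i) {
    d <- rowSums(sweep(as.matrix(train[, 1:4]), 2, unlist(test[i, 1:4]))^2)
    as.character(train$Species[which.min(d)])
  })
}

# At iteration i, fold i is the test set and the other folds form the training set.
fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- iris[fold != i, ]
  test  <- iris[fold == i, ]
  mean(predict_1nn(train, test) == test$Species)
})

# The cross-validated accuracy is the average of the k per-fold accuracies.
cv_accuracy <- mean(fold_accuracy)
print(round(fold_accuracy, 2))
print(cv_accuracy)
```

Setting `k <- nrow(iris)` turns the same loop into leave-one-out.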
What Machine Learning can do for you…
• Finding which category an object belongs to… by Classification
Algorithm
• Finding what is strange… by Anomaly Detection Algorithm
• Finding how much and how many… by Regression Algorithm
• Finding how data is arranged… by Clustering Algorithm
• What should I do next… by Reinforcement Learning Algorithm

ML algorithms

ML Algorithms: Regression
How to estimate the price of a fruit if freshness (in days) is given?

Input
Training data (labeled instances (x1, y1), (x2, y2), ..., (xn, yn))
e.g., freshness of fruits and their prices
Objective
Develop a relation (or rule f : x → y) to predict y for a given x
e.g., a new fruit xnew with given freshness (in days)
Output
Real-valued y
e.g., price ynew of the fruit xnew
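The input, objective and output can be illustrated with a least-squares fit in base R. The fruit data below is hypothetical, and the prices were made to lie exactly on a line so the result is easy to check by hand. The column names `freshness_days` and `price` are assumptions.

```r
# Hypothetical training data: freshness in days (x) and price (y).
fruit <- data.frame(
  freshness_days = c(1, 5, 10, 20, 30, 40, 50, 60),
  price          = c(98.5, 92.5, 85, 70, 55, 40, 25, 10)
)

# Learn the rule f: x -> y with ordinary least squares.
model <- lm(price ~ freshness_days, data = fruit)

# Predict the real-valued price y_new for a new fruit x_new.
x_new <- data.frame(freshness_days = 25)
y_new <- predict(model, newdata = x_new)
print(coef(model))
print(y_new)
```

Because the made-up prices follow price = 100 - 1.5 × days exactly, the prediction for a 25-day-old fruit is 62.5.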

ML Algorithms: Classification
How to identify whether a given fruit is fresh (freshness of not more
than 50 days) or not?

Input
Training data (labeled instances (x1, y1), (x2, y2), ..., (xn, yn))
e.g., freshness of fruits and fresh / not-fresh labels
Objective
Develop a relation (or rule f : x → y) to predict y for a given x
e.g., a new fruit xnew with given freshness (in days)
Output
Discrete-valued y
e.g., fresh or not (ynew) for the fruit xnew
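As a sketch of learning a discrete-valued rule from labeled examples, the base-R code below fits a one-threshold classifier (a decision stump). The data, the choice of a stump, and the names `cutoff` and `classify_fruit` are assumptions; the slides do not name a specific classifier here.

```r
# Hypothetical labeled data: freshness in days (x) and a fresh / not fresh label (y).
fruit <- data.frame(
  freshness_days = c(3, 12, 25, 38, 47, 55, 63, 72, 90),
  label = factor(c("fresh", "fresh", "fresh", "fresh", "fresh",
                   "not fresh", "not fresh", "not fresh", "not fresh"))
)

# Candidate cut-offs: midpoints between consecutive observed freshness values.
days       <- sort(unique(fruit$freshness_days))
candidates <- (head(days, -1) + tail(days, -1)) / 2

# Training accuracy of the rule "fresh if freshness_days <= cutoff".
rule_accuracy <- sapply(candidates, function(cutoff) {
  predicted <- ifelse(fruit$freshness_days <= cutoff, "fresh", "not fresh")
  mean(predicted == fruit$label)
})
cutoff <- candidates[which.max(rule_accuracy)]

# Discrete-valued prediction y_new for a new fruit x_new.
classify_fruit <- function(x_new) ifelse(x_new <= cutoff, "fresh", "not fresh")
print(cutoff)
print(classify_fruit(c(30, 70)))
```

With this made-up data the learned cut-off is 51 days, close to the 50-day rule in the question, even though the code is never told that rule.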

ML Algorithms: Clustering
How to put the fruits in two buckets?

Input
Training data (unlabeled instances x1, x2, ..., xn)
Objective
Learn more about the data distribution
Output
Discover the inherent groupings in data.
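Putting fruits into two buckets without labels can be sketched with k-means from R's stats package. The freshness values and the seed are hypothetical, and k-means is one possible clustering method rather than one the slides prescribe.

```r
set.seed(1)
# Hypothetical unlabeled data: freshness in days for ten fruits.
freshness_days <- c(2, 4, 5, 7, 9, 61, 64, 68, 70, 75)

# Ask k-means for two groups ("buckets"); nstart restarts guard against a poor start.
fit <- kmeans(freshness_days, centers = 2, nstart = 10)

# Bucket membership for each fruit and the centre of each bucket.
buckets <- split(freshness_days, fit$cluster)
print(buckets)
print(sort(as.vector(fit$centers)))
```

The cluster numbers 1 and 2 are arbitrary; only the grouping itself carries meaning.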

Applications of Data Science / Machine Learning

Application of ML in IoT

• Sensor technology
• Communication technology
• Machine Learning
• Human-machine interface (UI/UX)

Machine Learning Applications
• Medical diagnosis
• Data mining
• Bioinformatics
• Speech and handwriting recognition
• Product categorization
• Inertial measurement unit (IMU)
• Information retrieval

Machine Learning in Electrical Engineering
• Machine Learning in grid integration and power distribution
• Machine Learning in end-user consumption pattern.
• Machine Learning in power quality analysis.
• Cyber security.
• Machine Learning in load balancing.
• Machine Learning in control and feedback system

Machine Learning in Electronics and Communications Engineering

Become a Machine Learning Expert

1. RECOMMENDATION ENGINES

• Example: Netflix viewing suggestions


• Application area: Media + Entertainment + Shopping

• Need a new series to fill the binge void? Netflix can recommend one.
In fact, it probably already has — just check your homepage. Using
machine learning to curate its enormous collection of TV shows and
movies, Netflix taps the streaming history and habits of its millions of
users to predict what individual viewers will likely enjoy.

2. SORTED, TAGGED & CATEGORIZED PHOTOS

• Example: Reviewer-uploaded photos on Yelp


• Application area: Search + Mobile + Social
• Yelp's crowd-sourced reviews cover everything from restaurants, bars,
doctors' offices, gyms, concert venues and more. Besides giving a star
rating and a written assessment, Yelpers are encouraged to
include pictures of the business they're reviewing or service they're
receiving. Yelp reportedly hosts tens of millions of photos and uses
machine learning to sort them all. When you look up a popular
restaurant on Yelp, images are sorted into groups: menus, food,
inside, outside and so on. That makes it easier for people to find
relevant photos rather than riffling through all of them.
3. SELF-DRIVING CARS

• Example: Waymo cars use ML to understand surroundings


• Application area: Automotive + Transportation
• Waymo is the offshoot of Google's autonomous vehicle project. Its
goal is to create cars that can drive themselves without a human pilot.
In order to do that, Waymo's fleet needs a serious assist from AI.
Waymo's cars use machine learning to see their surroundings, make
sense of them and predict how others behave. With so many
shifting variables on the road, an advanced machine learning system
is crucial to success.

4. GAMIFIED LEARNING & EDUCATION

• Example: Duolingo's language lessons


• Application area: Education
• Duolingo is a free language learning app that's designed to be fun and
addicting. Although using Duolingo feels a little bit like playing a game
on your phone, its effectiveness is based on research. One aspect of
that involves machine learning. Using data collected from
user answers, Duolingo developed a statistical model of how long a
person is likely to remember a certain word before needing a
refresher. Armed with that information, Duolingo knows when to ping
users who might benefit from retaking an old lesson.
5. CALCULATING CUSTOMER LIFETIME VALUE
METRICS
• Example: Asos uses CLTV to drive profit
• Application area: Fashion
• Fashion retailer Asos uses machine learning to determine Customer
Lifetime Value (CLTV). This metric estimates the net profit a business
receives from a specific customer over time. In Asos’ case, CLTV
shows which customers are likely to continue buying products from
Asos. Once this is determined, Asos can prioritize high-CLTV
customers and convince them to spend more the next time around.
Because retailers can end up losing money on low-CLTV customers (with things
like free shipping or ignored marketing promos), this model helps ensure
that Asos is turning a profit.
6. PREDICTING WHEN PATIENTS GET SICK

• Example: KenSci assisting caregivers


• Application area: Healthcare
• How it's using machine learning: KenSci helps caregivers predict
which patients will get sick so they can intervene earlier, saving
money and potentially lives. It does so using machine learning to
analyze databases of patient information, including electronic medical
records, financial data and claims.

7. DETERMINING CREDIT WORTHINESS

• Example: Deserve's model for lending to students


• Application area: Finance
• Traditional credit card companies determine eligibility through an
individual’s FICO score and credit history. But this can be a problem
for those who have no credit history. In light of that, Deserve, which
is geared toward students and new credit card applicants,
calculates credit worthiness using a machine learning algorithm that
takes into account other factors like current financial health and
habits.

8. TARGETED EMAILS

• Example: Optimail
• Application area: Marketing
• Optimail uses artificial intelligence and machine learning to deliver
more effective email marketing campaigns by customizing and
personalizing content, as well as adjusting scheduling, to have the
greatest impact on each recipient.

9. RANKING POSTS ON SOCIAL MEDIA

• Example: Twitter's new timeline


• Application area: Social Media
• Every Twitter user knows there's a ginormous amount of tweets to
sift through. But not all tweets are created equal. Originally, Twitter
displayed the most recent tweets at the top of each user's timeline.
However, this meant possibly missing out on some sweet posts.
So Twitter redesigned its timelines using machine learning to
prioritize tweets that are most relevant to each user. Using that
model, tweets are now ranked with a relevance score (based on what
each user engages with most, popular accounts, etc.), then placed
atop your feed so you're more likely to see them.
10. COMPUTER VISION FARMING

• Example: Blue River Technology's "See & Spray"


• Application area: Agriculture
• Blue River’s "See & Spray" technology uses computer vision and
machine learning to identify plants in farmers’ fields. That's especially
useful for spotting weeds among acres of crops. As its name implies,
the See & Spray rig can also target specific plants and spray them with
herbicide or fertilizer. It's far more efficient than spraying an entire
field and far better for the environment.

11. GETTING YOU THE RIGHT ANSWERS
• Example: Quora’s super-specific answer rankings
• Application Area: Search
• How it's using machine learning: Quora uses machine learning in a
few ways, but the most prominent is to determine which questions
and answers are pertinent to a user’s search query. When ranking
answers to a specific question, the company’s machine learning takes
into account thoroughness, truthfulness, reusability and a variety of
other characteristics in order to always give the “best” response to
any-and-all questions.

12. GIVING BUSINESSES PERSONAL-LEVEL INSIGHTS

• Example: Civis Analytics’ suite of data-intensive products


• Application Area: Analytics + Cloud + Consumer Research
• How it's using machine learning: Civis Analytics’ platforms use
machine learning to give companies deeper insights into their own
data. Organizations like The Bill and Melinda Gates Foundation,
Verizon, Discovery Channel and Robinhood use Civis’ machine
learning platform to monitor industry trends and predict consumer
habits.

13. TRANSPARENCY IN THE CPG INDUSTRY

• Example: Label Insight’s 22,000 individual attributes for each product
• Application Area: Analytics + Retail + Healthcare
• How it's using machine learning: Label Insight uses machine learning
and data science to create more than 22,000 high-order attributes for
retail and consumer packaged goods products. Looking to pick up a
few groceries? The company’s “LabelSync” tool employs machine
learning to give a personalized view of each food product, including
ingredients, suppliers, supply chain history and much more, in order
to give consumers better insights into their purchases.
14. MAKING SALES AND MARKETING MORE
EFFICIENT
• Example: HubSpot
• Application Area: Marketing + Sales + SaaS
• How it's using machine learning: HubSpot develops sales, marketing
and service software that allows businesses to gain insights into their
customers and future opportunities. The company uses machine
learning in a number of ways: it gives content marketers better
insights into what search engines associate their content with, and it
assigns predictive lead scores that indicate to sales teams which
customers are most ready to purchase their products.
15. UPGRADING FASHION SENSES

• Example: Fit Analytics’ consumer-facing and backend machine learning tools
• Application Area: Fashion + E-commerce
• How it's using machine learning: Fit Analytics uses machine learning to
help consumers get the right-sized clothes and to help brands gain helpful
insights about their customers. Have you ever ordered something online
that was way too big or too small? Fit Analytics measures a customer's
body and uses machine learning to make recommendations for the best-fit
styles. On the back-end, the machine learning analyzes data points to give
clothing businesses insights into everything from popular styles to average
customer measurements.
Machine Learning in R: Step By Step Tutorial (start here)

You Can Do Machine Learning in R

• You do not need to understand everything.
• You do not need to know how the algorithms work.
• You do not need to be an R programmer.
• You do not need to be a machine learning expert.
• What about the other steps in a machine learning project?

Data Science / Machine Learning Workflow

PROCESS OF A MACHINE LEARNING PROJECT

• Define Problem.
• Prepare Data.
• Evaluate Algorithms.
• Improve Results.
• Present Results.

INTRODUCTION of R Language

• R is a language and environment for statistical computing and graphics.
• R provides a scripting language with an unusual syntax.
• There are hundreds of packages and thousands of functions to choose from, providing multiple ways to do each task.
• It can feel overwhelming.

The R environment

• R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either on-screen or on
hardcopy, and
• a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and
input and output facilities.

KEY STEPS IN MACHINE LEARNING

• Loading data
• Summarizing your data
• Evaluating algorithms
• And making some predictions.

Simple Machine Learning Project using R
• Download and install R and get the most useful package for machine
learning in R.
• Load a dataset and understand its structure using statistical
summaries and data visualization.
• Create 5 machine learning models, pick the best and build confidence
that the accuracy is reliable.

OVERVIEW

• Installing the R platform.


• Loading the dataset.
• Summarizing the dataset.
• Visualizing the dataset.
• Evaluating some algorithms.
• Making some predictions.

INSTALLING THE R PLATFORM

• This tutorial was written and tested with R version 3.2.3


• Here is what we are going to cover in this step:
1. Download R.
2. Install R.
3. Start R.
4. Install R Packages.

1.1 Download R
• You can download R from The R Project webpage.
• When you click the download link, you will have to choose a mirror. You can then
choose R for your operating system, such as Windows, OS X or Linux.

1.2 Install R
• There are no special requirements.
• If you need help installing, see the R Installation and Administration manual.

1.3 Start R
• You can start R from whatever menu system you use on your operating system.

1.4 Install Packages
• The caret package provides a consistent interface to hundreds of
machine learning algorithms, along with convenience functions for:
• Data visualization,
• Data resampling,
• Model tuning and model comparison.
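A hedged sketch of this step: the install command is left commented out because it needs network access and only has to run once per machine. The variable name `caret_available` is an assumption used for illustration.

```r
# Install caret once per machine (uncomment to run; requires network access):
# install.packages("caret")

# Check whether caret is already installed, and load it if so.
caret_available <- requireNamespace("caret", quietly = TRUE)
if (caret_available) {
  library(caret)
} else {
  message("caret is not installed; run install.packages(\"caret\") first")
}
```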

2. Load The Data

• Load the iris data the easy way.


• Load the iris data from CSV (optional, for purists).
• Separate the data into a training dataset and a validation dataset.
2.1 Load Data The Easy Way
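The iris data ships with R's datasets package, so the easy way might look like the following. The name `dataset` is an assumption, chosen so that later steps can refer to the data generically.

```r
# Attach the iris data that ships with R's datasets package.
data(iris)
# Rename it so later steps can refer to a generic `dataset`.
dataset <- iris
print(dim(dataset))
```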

2.2 Load From CSV
Maybe you want to load the data just as you would in your own machine
learning project, from a CSV file.
1. Download the iris dataset from the UCI Machine Learning
Repository
2. Save the file as iris.csv in your project directory.
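Loading from CSV could be sketched as below. So that the example runs offline, it first writes R's built-in iris data to a temporary header-less CSV as a stand-in for the downloaded file. That stand-in, the temporary path and the column names are assumptions. The real UCI file may differ in details such as how the class labels are spelled.

```r
# Stand-in for the downloaded file (an assumption so the sketch runs offline):
# write R's built-in iris data to a header-less CSV in a temporary directory.
filename <- file.path(tempdir(), "iris.csv")
write.table(iris, filename, sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)

# Load the CSV; there is no header row, so name the columns by hand.
dataset <- read.csv(filename, header = FALSE)
colnames(dataset) <- c("Sepal.Length", "Sepal.Width",
                       "Petal.Length", "Petal.Width", "Species")
# Treat the class column as a factor (read.csv returns strings in R >= 4.0).
dataset$Species <- factor(dataset$Species)
str(dataset)
```

In a real project, `filename` would simply be "iris.csv" in the project directory.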

2.3. Create a Validation Dataset
• Use statistical methods to estimate the accuracy of the models
• Split the loaded dataset into two, 80% of which we will use to train
our models and 20% that we will hold back as a validation dataset.
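One way to make the 80/20 split in base R is sketched below, sampling within each class so that every species keeps the same share. The seed and variable names are assumptions. The caret package offers `createDataPartition` for a similar stratified split.

```r
set.seed(7)
data(iris)
dataset <- iris

# Stratified 80/20 split: sample 80% of the row indices within each class.
rows_by_class <- split(seq_len(nrow(dataset)), dataset$Species)
train_index <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.80 * length(rows)))
}))

# Hold back the remaining 20% for validation; train on the 80%.
validation <- dataset[-train_index, ]
dataset    <- dataset[train_index, ]
print(dim(dataset))
print(dim(validation))
```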

3. Summarize Dataset

• Dimensions of the dataset.


• Types of the attributes.
• Peek at the data itself.
• Levels of the class attribute.
• Breakdown of the instances in each class.
• Statistical summary of all attributes.

3.1 Dimensions of Dataset

• See how many instances (rows) and how many attributes (columns) the
data contains using the dim function.

• You should see 120 instances and 5 attributes:
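A minimal sketch of the call. To stay self-contained, it uses a deterministic stand-in for the 80% training split (every fifth row of iris dropped), which is an assumption rather than the tutorial's own random split.

```r
# Stand-in for the 80% training split (an assumption): drop every fifth row of iris.
dataset <- iris[-seq(5, 150, by = 5), ]
# Number of instances (rows) and attributes (columns).
print(dim(dataset))
```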

3.2 Types of Attributes

• The attributes could be doubles, integers, strings, factors or other types.
• Knowing the types is important because it gives you an idea of how to
better summarize the data you have.

• You should see that all of the inputs are double and that the class value
is a factor:
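The attribute types can be listed as sketched here, again on an assumed stand-in for the training split. `class` reports the inputs as "numeric", while `typeof` shows that they are stored as doubles.

```r
# Stand-in for the 80% training split (an assumption): drop every fifth row of iris.
dataset <- iris[-seq(5, 150, by = 5), ]
# Class of each attribute: four numeric inputs and one factor.
attribute_classes <- sapply(dataset, class)
print(attribute_classes)
# Storage type of the inputs: numeric columns are stored as doubles.
input_types <- sapply(dataset[, 1:4], typeof)
print(input_types)
```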

3.3 Peek at the Data

• It is also always a good idea to actually eyeball your data.

• You should see the first 5 rows of the data:

3.4 Levels of the Class

• Notice how we can refer to an attribute by name as a property of the dataset.
• In the results we can see that the class has 3 different labels:

• This is a multiclass or multinomial classification problem. If there
were two levels, it would be a binary classification problem.
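Listing the class levels might look like this sketch, where `dataset$Species` refers to the class attribute by name. The stand-in training split is an assumption.

```r
# Stand-in for the 80% training split (an assumption): drop every fifth row of iris.
dataset <- iris[-seq(5, 150, by = 5), ]
# Refer to the class attribute by name and list its levels.
class_levels <- levels(dataset$Species)
print(class_levels)
```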

3.5 Class Distribution

• Look at the number of instances (rows) that belong to each class. We
can view this as an absolute count and as a percentage.

• We can see that each class has the same number of instances (40 or 33%
of the dataset)
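The count and percentage per class can be computed as sketched below, on an assumed stand-in for the training split. The names `counts` and `percentage` are illustrative.

```r
# Stand-in for the 80% training split (an assumption): drop every fifth row of iris.
dataset <- iris[-seq(5, 150, by = 5), ]
# Absolute count and percentage of instances per class.
counts     <- table(dataset$Species)
percentage <- prop.table(counts) * 100
print(cbind(freq = counts, percentage = percentage))
```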

3.6 Statistical Summary
• Look at a summary of each attribute.
• This includes the mean, the min and max values, as well as some percentiles (25th, 50th or median, and 75th).

• We can see that all of the numerical values have the same scale (centimeters) and similar ranges, [0,8] centimeters.
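
The statistical summary can be sketched with `summary()`; the extra `range()` call is an addition here to make the [0,8] claim easy to check. The `dataset` variable and the stand-in split are assumptions.

```r
data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]

# Min, quartiles, median, mean and max per numeric column; counts for the factor
print(summary(dataset))

# Minimum and maximum of each input attribute, in centimeters
ranges <- sapply(dataset[, 1:4], range)
print(ranges)
```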

4. Visualize Dataset

• Two types of plots:
• Univariate plots to better understand each attribute.
• Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

• These are plots of each individual variable.
• It helps to be able to refer to just the input attributes and just the output attribute.
• Call the input attributes x and the output attribute (or class) y.

• Given that the input variables are numeric, we can create box and whisker plots of each.

• This gives us a much clearer idea of the distribution of the input attributes.
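
A sketch of the x/y split and the box and whisker plots using base R graphics. The column positions (inputs in columns 1 to 4, class in column 5) follow the iris layout; the `dataset` variable and the stand-in split are assumptions.

```r
data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]

# Inputs are the four measurement columns, output is the Species factor
x <- dataset[, 1:4]
y <- dataset[, 5]

# One boxplot per input attribute, side by side in a single row
old_par <- par(mfrow = c(1, 4))
for (i in 1:4) {
  boxplot(x[, i], main = names(x)[i])
}
par(old_par)  # restore the previous plot layout
```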

• Create a barplot of the Species class variable to get a graphical
representation of the class distribution.

barchart(y)

• The instances are evenly distributed across the three classes.

4.2 Multivariate Plots
• Multivariate plots show the interactions between the variables.
• Create scatterplots of all pairs of attributes and color the points by class.
• Because the scatterplots show that the points for each class are generally separate, we can draw ellipses around them.

• The plot shows the relationships between the input attributes (trends) and between the attributes and the class values (ellipses).
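
A sketch of the scatterplot matrix with class ellipses using caret's `featurePlot()`. It assumes the caret and ellipse packages are installed; `dataset`, `x`, `y` and the stand-in split are rebuilt here as assumptions.

```r
library(caret)  # provides featurePlot(); plot = "ellipse" also needs the ellipse package

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# Pairwise scatterplots colored by class, with an ellipse drawn per class
p_ellipse <- featurePlot(x = x, y = y, plot = "ellipse")
print(p_ellipse)
```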

• Create box and whisker plots of each input variable again, but this time broken
down into separate plots for each class. This can help to tease out obvious
linear separations between the classes.

• This is useful to see that there are clearly different distributions of the
attributes for each class value.
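
The per-class box and whisker plots can be sketched with the same caret function, this time with `plot = "box"`. The variable names and the stand-in split are assumptions.

```r
library(caret)  # provides featurePlot()

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# One panel per input attribute, with one box per class inside each panel
p_box <- featurePlot(x = x, y = y, plot = "box")
print(p_box)
```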

• Probability density plots give nice smooth lines for each distribution.

• Like the boxplots, we can see the difference in distribution of each attribute
by class value. We can also see the Gaussian-like distribution (bell curve)
of each attribute.
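
A sketch of the density plots by class. The `scales` list lets every panel use its own axis range, which is a presentation choice made here rather than something stated above; the variable names and the stand-in split are assumptions.

```r
library(caret)  # provides featurePlot()

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]
x <- dataset[, 1:4]
y <- dataset[, 5]

# Free axis scales so each attribute's density fills its own panel
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
p_density <- featurePlot(x = x, y = y, plot = "density", scales = scales)
print(p_density)
```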

5. Evaluate Some Algorithms
• Create some models of the data and estimate their accuracy on unseen data.
• Set up the test harness to use 10-fold cross validation.
• Build 5 different models to predict species from flower measurements.
• Select the best model.

5.1 Test Harness

• Use 10-fold cross validation to estimate accuracy.
• This will split our dataset into 10 parts, train on 9 and test on 1, and
repeat for all combinations of train-test splits.
• We will also repeat the process 3 times for each algorithm with
different splits of the data into 10 groups, in an effort to get a more
accurate estimate.
• We are using the metric of “Accuracy” to evaluate models. This is the
ratio of the number of correctly predicted instances divided by the
total number of instances in the dataset, multiplied by 100 to give a
percentage (e.g. 95% accurate).
• We will be using the metric variable when we build and evaluate
each model next.
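
A sketch of the test harness with caret's `trainControl()`. The names `control`, `control_repeated` and `metric` are assumptions. Plain 10-fold cross validation is shown first; the `repeatedcv` variant is one way to get the 3 repeats described above. The counts in the accuracy calculation are made-up numbers for illustration only.

```r
library(caret)  # provides trainControl()

# Plain 10-fold cross validation: train on 9 parts, test on the remaining 1
control <- trainControl(method = "cv", number = 10)

# Variant that repeats the 10-fold procedure 3 times with different splits
control_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Metric name passed to train() when the models are built
metric <- "Accuracy"

# Accuracy as a percentage, using illustrative counts (not real results)
correct <- 114
total <- 120
accuracy_pct <- 100 * correct / total
print(accuracy_pct)  # 95
```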

5.2 Build Models

• The plots suggest that some of the classes are partially linearly separable in some
dimensions.
• Let’s evaluate 5 different algorithms:
• Linear Discriminant Analysis (LDA).
• Classification and Regression Trees (CART).
• k-Nearest Neighbors (kNN).
• Support Vector Machines (SVM) with a linear kernel.
• Random Forest (RF).
• This is a good mixture of simple linear (LDA), nonlinear (CART,
kNN) and complex nonlinear methods (SVM, RF).
Let’s build our five models.
Caret does support the configuration and tuning of each model, but we are
not going to cover that in this tutorial.
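
A sketch of building the five models with caret's `train()`. It assumes the packages behind each method are installed (MASS, rpart, kernlab, randomForest). The method name "svmLinear" is chosen here to match the linear kernel named above; "svmRadial" would be the swap for a nonlinear kernel. The `fits` list, the seed and the stand-in split are assumptions.

```r
library(caret)  # provides train() and trainControl()

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]

control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

# caret method names for LDA, CART, kNN, linear-kernel SVM and Random Forest
methods <- c(lda = "lda", cart = "rpart", knn = "knn", svm = "svmLinear", rf = "rf")

# Reset the seed before each fit so every model is evaluated on the same folds
fits <- lapply(methods, function(m) {
  set.seed(7)
  train(Species ~ ., data = dataset, method = m, metric = metric, trControl = control)
})
print(names(fits))
```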

5.3 Select Best Model

• We now have 5 models and accuracy estimations for each.
• We need to compare the models to each other and select the most
accurate one.
• We can do this by first creating a list of the created models and then using the summary
function.

The summary shows the accuracy of each classifier and also other metrics like Kappa.

• Create a plot of the model evaluation results and compare the spread and
the mean accuracy of each model.
• There is a population of accuracy measures for each algorithm because
each algorithm was evaluated 10 times (10-fold cross validation).

• The most accurate model in this case was LDA.
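
The comparison described above (summary table first, then a plot) can be sketched with caret's `resamples()`. To keep the example short only two of the five models are refit; the names `fit.lda`, `fit.cart` and `results`, the seed and the stand-in split are assumptions, and the MASS and rpart packages are assumed to be installed.

```r
library(caret)  # provides train(), resamples() and the dotplot method

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]

control <- trainControl(method = "cv", number = 10)

# Same seed before each fit so both models share the same 10 folds
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)
set.seed(7)
fit.cart <- train(Species ~ ., data = dataset, method = "rpart",
                  metric = "Accuracy", trControl = control)

# Collect the per-fold results of the models into one object
results <- resamples(list(lda = fit.lda, cart = fit.cart))

# Table of Accuracy and Kappa per model (min, quartiles, mean, max)
print(summary(results))

# Dot plot of the mean and spread of each metric per model
print(dotplot(results))
```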

• The results for just the LDA model can be summarized.

• This gives a nice summary of what was used to train the model and
the mean and standard deviation (SD) accuracy achieved,
specifically 97.5% accuracy +/- 4%.
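
A sketch of summarizing a single model. Printing a caret `train` object reports the training setup and resampled accuracy; the `results` data frame holds the mean accuracy and its standard deviation. The name `fit.lda`, the seed and the stand-in split are assumptions, so the numbers will not necessarily match those quoted above.

```r
library(caret)  # provides train() and trainControl(); method "lda" needs MASS

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in for the training split (an assumption): 40 random rows per species
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]

control <- trainControl(method = "cv", number = 10)
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)

# Training setup plus resampled Accuracy and Kappa
print(fit.lda)

# Mean accuracy and its standard deviation across the 10 folds
lda_summary <- fit.lda$results[, c("Accuracy", "AccuracySD")]
print(lda_summary)
```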

6. Make Predictions
• Run the most accurate model, LDA, directly on the validation set and
summarize the results in a confusion matrix.

• The accuracy is 100%. It was a small validation dataset (20%), but this
result is within our expected margin of 97% +/- 4%, suggesting that the model is accurate.
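
A sketch of the prediction step with caret's `predict()` and `confusionMatrix()`. The names `validation`, `fit.lda`, `predictions` and `cm` are assumptions, as are the seed and the stand-in 80/20 split, so the resulting accuracy may differ from the figure quoted above.

```r
library(caret)  # provides train(), predict() for train objects, confusionMatrix()

data(iris)
set.seed(7)  # arbitrary seed, chosen only for reproducibility
# Stand-in split (an assumption): 40 rows per species to train, the other 30 to validate
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 40))
dataset <- iris[idx, ]
validation <- iris[-idx, ]

control <- trainControl(method = "cv", number = 10)
set.seed(7)
fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                 metric = "Accuracy", trControl = control)

# Predict the held-out rows and cross-tabulate predicted against actual classes
predictions <- predict(fit.lda, validation)
cm <- confusionMatrix(predictions, validation$Species)
print(cm)
```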
