Machine Learning Section

This document introduces a Machine Learning course focusing on Python and Spark's MLlib library, outlining the structure of the course, including suggested readings, theory lectures, and consulting projects. It explains the difference between supervised and unsupervised learning, detailing how algorithms learn from labeled and unlabeled data, respectively. The document emphasizes the importance of understanding Spark's MLlib documentation and provides guidance for students with varying backgrounds in mathematics.

Uploaded by

abhimanyu thakur

Machine Learning

Let’s learn something!


Python and Spark

● It is now time to begin the Machine
Learning Sections of the course!
● This section gives a general
introduction to machine learning and to
how Spark’s MLlib library works.
Python and Spark

● Most Machine Learning Sections have:
○ Suggested Reading Assignment
○ Basic Theory Lecture
○ Documentation Walkthrough
○ More realistic custom code example
○ Consulting Project
○ Consulting Project Solutions
Python and Spark

● The Consulting Projects are looser, more
realistic projects for you to attempt with
the skills you just learned.
● A dataset, some background, and a
problem are described, and you are free to
solve it however you want.
Python and Spark

● If you prefer a more guided approach to
problems, that’s totally okay!
● We have the custom code examples
before each Consulting Project.
● Plus, you can treat the Consulting Project
Solutions as an additional “code-along”!
Python and Spark

● Because different students have
different backgrounds in math, we will
keep the mathematics behind the
machine learning algorithms light.
Python and Spark

● If you are interested in reading more
about the math behind the algorithms
we discuss, we will be using
An Introduction to Statistical Learning by
Gareth James et al. as a companion book.
● It’s freely available online.
Companion Book

● Students who want the mathematical
theory should do the suggested reading
assignment that will appear for each
machine learning section.
● Otherwise, feel free to watch the Intro
Theory Lectures for the fundamentals.
Companion Book

● First Suggested Reading Assignment:
○ Read Chapters 1 & 2 to gain a
background understanding before
continuing to the Machine Learning
Lectures.
What is Machine Learning?

● Machine learning is a method of data analysis
that automates analytical model building.
● Using algorithms that iteratively learn from
data, machine learning allows computers to
find hidden insights without being explicitly
programmed where to look.
What is it used for?

● Fraud detection
● Web search results
● Real-time ads on web pages
● Credit scoring and next-best offers
● Prediction of equipment failures
● New pricing models
● Network intrusion detection
● Recommendation engines
● Customer segmentation
● Text sentiment analysis
● Predicting customer churn
● Pattern and image recognition
● Email spam filtering
● Financial modeling
Machine Learning Process

[Diagram: Data Acquisition → Data Cleaning → Model Training & Building → Model Testing → Model Deployment, with Test Data shown as a separate input]
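The stages named on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a one-variable least-squares line, and `fit_line`, `mean_abs_error`, and `run_pipeline` are hypothetical helper names, not part of Spark or the course material.

```python
import random


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def run_pipeline(seed=0):
    rng = random.Random(seed)

    # 1. Data acquisition: synthetic (made-up) rows of (size, price).
    raw = [(size, 50 + 2 * size + rng.gauss(0, 5)) for size in range(20, 120)]
    raw += [(None, 100.0), (None, 200.0)]  # two rows with a missing feature

    # 2. Data cleaning: drop rows with missing values.
    clean = [(x, y) for x, y in raw if x is not None]

    # 3. Hold out test data that the model never sees during training.
    rng.shuffle(clean)
    cut = len(clean) * 7 // 10
    train, test = clean[:cut], clean[cut:]

    # 4. Model training and building.
    model = fit_line([x for x, _ in train], [y for _, y in train])

    # 5. Model testing on the held-out rows.
    error = mean_abs_error(model, [x for x, _ in test], [y for _, y in test])
    return model, error, len(train), len(test)


model, error, n_train, n_test = run_pipeline()
print("slope, intercept:", model)
print("mean absolute error on test data:", error)
```

Deployment is left out of the sketch; the point is only that the test rows are set aside before training, so the error is measured on data the model has not seen.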
Supervised Learning

● Spark’s MLlib is mainly designed for Supervised
and Unsupervised Learning tasks, with most of
its algorithms falling under those two
categories.
● Let’s discuss them in more detail and describe
how they are different!
Supervised Learning

● Supervised learning algorithms are trained
using labeled examples, such as an input
where the desired output is known.
● For example, a piece of equipment could have
data points labeled either “F” (failed) or “R”
(runs).
Supervised Learning

● The learning algorithm receives a set of inputs
along with the corresponding correct outputs,
and the algorithm learns by comparing its
actual output with correct outputs to find
errors.
● It then modifies the model accordingly.
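One simple algorithm that follows this compare-and-correct loop is the perceptron. The sketch below is a minimal illustration, not how MLlib trains its models: the sensor readings are invented, the labels reuse the "F" (failed) and "R" (runs) example from the earlier slide, and `train_perceptron` and `predict` are hypothetical helper names.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=0.1):
    """Learn weights by comparing each prediction with the correct label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # 0 when the prediction was correct
            # Modify the model only in proportion to the error found.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Return "F" (failed) or "R" (runs) for one set of readings."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return "F" if activation > 0 else "R"


# Invented, already-scaled readings: (temperature, vibration).
samples = [(0.2, 0.1), (0.3, 0.2), (0.1, 0.3), (0.25, 0.15),
           (0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (0.85, 0.95)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = "R" (runs), 1 = "F" (failed)

weights, bias = train_perceptron(samples, labels)
print(predict(weights, bias, (0.9, 0.9)))
print(predict(weights, bias, (0.15, 0.2)))
```

The update line is the "modifies the model accordingly" step: when the prediction matches the label the error is zero and nothing changes.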
Supervised Learning

● Through methods like classification, regression,
prediction and gradient boosting, supervised
learning uses patterns to predict the values of
the label on additional unlabeled data.
● Supervised learning is commonly used in
applications where historical data predicts
likely future events.
Supervised Learning

● For example, it can anticipate when credit card
transactions are likely to be fraudulent or which
insurance customer is likely to file a claim.
● Or it can attempt to predict the price of a house
based on different features for houses for which
we have historical price data.
Unsupervised Learning

● Unsupervised learning is used on data
that has no historical labels.
● The system is not told the "right answer." The
algorithm must figure out what is being shown.
● The goal is to explore the data and find some
structure within.
Unsupervised Learning

● For example, it can find the main attributes
that separate customer segments from each
other.
● Popular techniques include self-organizing
maps, nearest-neighbor mapping, k-means
clustering and singular value decomposition.
● One issue is that it can be difficult to evaluate
results of an unsupervised model!
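As a rough illustration of k-means clustering on unlabeled data, here is a small pure-Python sketch. The customer numbers are invented, the centroids are seeded from the first `k` points purely to keep the example deterministic (real implementations usually pick starting points more carefully), and `k_means` and `within_cluster_ss` are hypothetical helper names.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=10):
    # Simplifying assumption: start from the first k points.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [nearest_centroid(point, centroids) for point in points]
    return centroids, assignments


def within_cluster_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignments))


# Invented customers: (purchases per month, average basket size).
customers = [(1.0, 2.0), (8.0, 9.0), (1.5, 1.8),
             (9.0, 8.5), (1.2, 2.2), (8.5, 9.5)]

centroids, assignments = k_means(customers, k=2)
print("segments:", assignments)
print("within-cluster sum of squares:",
      within_cluster_ss(customers, centroids, assignments))
```

No labels appear anywhere in the input. The within-cluster sum of squares is one internal measure of fit, but it cannot say whether the segments are the "right" ones, which is the evaluation difficulty the slide mentions.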
Final Thoughts

● Machine Learning takes time to learn.
● Be patient with yourself and feel free to post to
the Q&A forums.
● No one course can be a reference for all
Machine Learning topics, but I’m always happy
to point you in the right direction!
Machine Learning
with Spark
Python and Spark

● Spark has its own MLlib for Machine
Learning.
● Going forward, MLlib is built around the
Spark 2.0 DataFrame syntax.
Python and Spark

● One of the main “quirks” of using MLlib is
that you need to format your data so that
eventually it just has one or two
columns:
○ Features, Labels (Supervised)
○ Features (Unsupervised)
Python and Spark

● This requires a little more data
processing work than some other
machine learning libraries, but the big
upside is that this exact same syntax
works with distributed data, which is no
small feat for what is going on “under the
hood”!
Python and Spark

● When working with Python and Spark
with MLlib, the documentation examples
always use nicely formatted data.
● However, we’ll have our own custom
examples that have messier, more
realistic data!
Python and Spark

● We will also have consulting projects,
which set you loose on a real-world data
project with a data set and a problem to
solve, without explicitly telling you what
to do!
Python and Spark

● A huge part of learning MLlib is getting
comfortable with the documentation!
● Being able to master the skill of finding
information (not memorization) is the
key to becoming a great Spark and
Python developer!
Python and Spark

● Fortunately, the Spark MLlib
documentation is quite good, and we’ll
constantly teach you how to refer to it
during each Machine Learning
Algorithm Section.
● Let’s jump to it now!
spark.apache.org
