
Introduction to Machine Learning with Scikit-Learn


Plan of Study
- Preface
- What is Machine Learning?
- An Architecture for ML Data Products
- What is Scikit-Learn?
- Data Handling and Loading
- Model Evaluation
- Regressions
- Classification
- Clustering
- Workshop
Preface

Why I didn’t want to teach a Machine Learning course for DDL.
[Figure: A word cloud placing machine learning at the overlap of three fields. Statistics: smoothing, probability, normalization, distributions, Bayes theorem, regression, logits, entropy. Artificial intelligence: planning, computer vision, reinforcement, neural models, anomaly detection, natural language processing. Computer science: big data, data mining, optimization, function approximation, graph algorithms.]

Is Machine Learning a one-semester course?


[Figure: The machine learning practitioner continuum. Practitioners range from hacker to academic along one axis and from statistician to expert programmer along another, working across many domains.]
What is Machine Learning?
Learning by Example
Given a bunch of examples (data), extract a meaningful pattern upon which to act.

Problem Domain                               | Machine Learning Class
Infer a function from labeled data           | Supervised learning
Find structure of data without feedback      | Unsupervised learning
Interact with environment towards goal       | Reinforcement learning


Context to Data Mining & Statistics

[Figure: Statistics and data mining connect users to data; machine learning, grounded in computer science, connects data to the machine or app that acts on it.]
Types of Algorithms by Output
Training data is used to fit a model, which then predicts an output for each incoming input:

Type of Output                               | Algorithm Category
Output is one or more discrete classes       | Classification (supervised)
Output is continuous                         | Regression (supervised)
Output is membership in a similar group      | Clustering (unsupervised)
Output is the distribution of inputs         | Density Estimation
Output is simplified from higher dimensions  | Dimensionality Reduction


Classification

Given labeled input data (with two or more labels), fit a function that can determine, for any input, what the label is.

Regression

Given continuous input data, fit a function that is able to predict the continuous value of an input given other data.

Clustering

Given data, determine a pattern of associated data points or clusters via their similarity to or distance from one another.
Dimensions and Features
In order to do machine learning, you need a data set containing instances (examples) that are composed of features, from which you compose dimensions.

Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data    # X.shape == (n_samples, n_features)
y = digits.target  # y.shape == (n_samples,)
Your Task
Given a data set of N instances, create a model that is fit (built) from the data by extracting features and dimensions. Then use that model to predict outcomes …
1. Data Wrangling (normalization, standardization, imputing)
2. Feature Analysis/Extraction
3. Model Selection/Building
4. Model Evaluation
5. Operationalize Model
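
A minimal sketch of these five steps in scikit-learn; the Iris data, standard scaling, and logistic regression are illustrative choices, not prescriptions from the slides:

# Sketch of the workflow above; any (X, y) pair would work.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Pipeline([
    ("scale", StandardScaler()),    # 1. data wrangling: standardization
    ("clf", LogisticRegression()),  # 3. model selection/building
])
model.fit(X_train, y_train)                                  # build the model
print(classification_report(y_test, model.predict(X_test)))  # 4. evaluate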
A Tour of Machine Learning Algorithms
Models: Instance Methods
Compare instances in the data set with a similarity measure to find the best matches.
- Suffers from the curse of dimensionality.
- Focus on feature representation and similarity metrics between instances

● k-Nearest Neighbors (kNN)
● Self-Organizing Maps (SOM)
● Learning Vector Quantization (LVQ)
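
A hedged sketch of one instance method in scikit-learn; the digits data and k=5 are arbitrary choices:

# k-Nearest Neighbors: predictions come from stored training instances.
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
knn = KNeighborsClassifier(n_neighbors=5)  # similarity: Euclidean distance
knn.fit(digits.data, digits.target)        # fit stores/indexes the instances
print(knn.predict(digits.data[:5]))        # vote among the 5 nearest matches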
Models: Regression
Model the relationship of independent variables X to a dependent variable y by iteratively optimizing the error made in predictions.

● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
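
A small sketch of ordinary least squares; the synthetic data is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.rand(100, 1)                       # independent variable
y = 3.0 * X[:, 0] + rng.randn(100) * 0.1   # dependent variable plus noise

ols = LinearRegression()
ols.fit(X, y)                      # minimizes squared prediction error
print(ols.coef_, ols.intercept_)   # recovered slope and intercept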
Models: Regularization Methods
Extend another method (usually regression) by penalizing complexity (to minimize overfitting).
- simple, popular, powerful
- better at generalization

● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
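
A sketch of the penalized variants; the alpha values are arbitrary, and the synthetic data is constructed so the LASSO has an irrelevant feature to zero out:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, 0.0, -1.0]) + rng.randn(100) * 0.1

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero some out entirely
print(ridge.coef_)
print(lasso.coef_)  # expect the irrelevant middle coefficient near zero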
Models: Decision Trees
Model of decisions based on data attributes. Predictions are made by following forks in a tree structure until a decision is made. Used for classification & regression.

● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
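
A sketch of a CART-style tree in scikit-learn; the Iris data and the depth limit are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3)  # limit forks to control overfit
tree.fit(X, y)
print(tree.predict(X[:5]))  # each prediction follows forks down to a leaf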
Models: Bayesian
Explicitly apply Bayes’ Theorem for classification and regression tasks, usually by fitting a probability function constructed via the chain rule and a naive simplification of Bayes.

● Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
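
A sketch of Naive Bayes in scikit-learn, here the Gaussian variant on the Iris data (an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()
nb.fit(X, y)                    # fits a per-class Gaussian for each feature
print(nb.predict_proba(X[:3]))  # posterior P(class | features) via Bayes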
Models: Kernel Methods
Map input data into higher dimensional vector
space where the problem is easier to model.
Named after the “kernel trick” which computes
the inner product of images of pairs of data.

● Support Vector Machines (SVM)
● Radial Basis Function (RBF)
● Linear Discriminant Analysis (LDA)
Models: Clustering Methods
Organize data into groups whose members share maximum similarity (defined usually by a distance metric). Two main approaches: centroids and hierarchical clustering.

● k-Means
● Affinity Propagation
● OPTICS (Ordering Points to Identify Cluster Structure)
● Agglomerative Clustering
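
A sketch of the centroid approach with k-means; three clusters is a guess made for illustration:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)      # labels ignored: unsupervised
km = KMeans(n_clusters=3, n_init=10)   # Euclidean distance to centroids
labels = km.fit_predict(X)             # group membership per instance
print(labels[:10], km.cluster_centers_.shape)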
Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are
nonlinear function approximators that estimate
functions with a large number of inputs.
- System of interconnected neurons that activate
- Deep learning extends simple networks recursively

● Perceptron
● Back-Propagation
● Hopfield Network
● Restricted Boltzmann Machine (RBM)
● Deep Belief Networks (DBN)
Models: Ensembles
Models composed of multiple weak models that
are trained independently and whose outputs
are combined to make an overall prediction.

● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
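
A sketch of two ensembles from this list; the Iris data and 100 estimators are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100).fit(X, y)       # bagged trees
gbm = GradientBoostingClassifier(n_estimators=100).fit(X, y)  # boosted trees
print(rf.predict(X[:3]), gbm.predict(X[:3]))  # combined weak-model outputs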
Models: Other
The preceding list was not comprehensive; other classes of algorithms and models include:
● Conditional Random Fields (CRF)
● Markovian Models (HMMs)
● Dimensionality Reduction (PCA, PLS)
● Rule Learning (Apriori, Brill)
● More ...
An Architecture for Operationalizing Machine Learning Algorithms
[Figure: Architecture of machine learning operations. Build phase: training data and labels are turned into feature vectors and fed to an estimation algorithm to fit a model. Operational phase: new data is turned into a feature vector and passed to the predictive model to produce a prediction. Feedback on predictions flows back into the build phase.]
[Figure: The learning part of machine learning: initial fixtures seed model building, the model is exposed via a service/API, and feedback collected from the service flows back into model building.]
Deploying Machine Learning as a Web Service
Annotation Service Example
Architecture Demo
https://github.com/DistrictDataLabs/product-classifier
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are called SciKits. Scikit-Learn provides machine learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and NumPy
● Standard Python API interface
● Sits on top of C libraries (LAPACK, LibSVM) and Cython
● Open source: BSD license (shipped with many Linux distributions)
Probably the best general ML framework out there.
Where did it come from?
Started as a Google Summer of Code project in 2007 by David Cournapeau, then used as a thesis project by Matthieu Brucher.

In 2010, INRIA pushed the first public release and sponsors the project, as do Google, Tinyclues, and the Python Software Foundation.
Who uses Scikit-Learn?
Primary Features
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading
A Guide to Scikit-Learn
Scikit-Learn API
Object-oriented interface centered around the
concept of an Estimator:
“An estimator is any object that learns from data; it may
be a classification, regression or clustering algorithm or
a transformer that extracts/filters useful features from
raw data.”

- Scikit-Learn Tutorial
class Estimator(object):

    def fit(self, X, y=None):
        """Fit the estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict the response for ``X``."""
        # compute predictions ``pred``
        return pred

The Scikit-Learn Estimator API

Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape (n_samples, n_features).
- y is a 1D array with shape (n_samples,)
- predict(X) returns the class or value
- predict_proba() returns a 2D array of shape (n_samples, n_classes)
from sklearn import svm

estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)
estimator.predict(X)  # X: new samples with the same n_features

Basic methodology
Wrapping fit and predict
We’ve already discussed a broad workflow; the following is a development workflow:

Raw Data → Load & Transform Data → Feature Extraction → Feature Evaluation → Build Model → Evaluate Model
class Transformer(Estimator):

    def transform(self, X):
        """Transform the input data."""
        # transform ``X`` to ``X_prime``
        return X_prime

import numpy as np
from sklearn import preprocessing
from sklearn.impute import SimpleImputer

Xt = preprocessing.normalize(X)  # Normalizer
Xt = preprocessing.scale(X)      # StandardScaler

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Xt = imputer.fit_transform(X)

Transformers
Cross Validation (classification)
Assess how the model will generalize to an independent data set (i.e. data not in the training set).

1. Divide data into training and test splits
2. Fit the model on training data, predict on test data
3. Determine accuracy, precision, and recall
4. Repeat k times with different splits, then average the scores (e.g. F1)

                  Predicted Class A   Predicted Class B
Actual A          True A              False B             #A
Actual B          False A             True B              #B
                  #P(A)               #P(B)               total


from sklearn import metrics
from sklearn.model_selection import train_test_split

splits = train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model = ClassifierEstimator()  # placeholder for any classifier, e.g. svm.SVC()
model.fit(X_train, y_train)

expected = y_test
predicted = model.predict(X_test)

print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
print(metrics.f1_score(expected, predicted))

Cross Validation in Scikit-Learn


MSE & Coefficient of Determination
In regressions we can determine how well the model fits by computing the mean squared error and the coefficient of determination.

MSE = np.mean((predicted - expected) ** 2)

R² measures “goodness of fit”: 1 is a perfect fit, and lower values fit worse (as computed by scikit-learn, R² can even be negative for models worse than predicting the mean).
from sklearn import metrics
from sklearn.model_selection import train_test_split

splits = train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model = RegressionEstimator()  # placeholder for any regressor
model.fit(X_train, y_train)

expected = y_test
predicted = model.predict(X_test)  # predict from features, not targets

print(metrics.mean_squared_error(expected, predicted))
print(metrics.r2_score(expected, predicted))

K-Fold Cross Validation
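
A hedged sketch of k-fold evaluation using scikit-learn’s cross_val_score; k=5 and the estimator are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one score per held-out fold
print(scores.mean())  # averaged across the k splits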


Other Evaluation
How to evaluate clusters?
Visualization (but only in 2D)
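
One way to do that 2D visualization, sketched here by projecting with PCA first; the data set and cluster count are placeholders:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
X2 = PCA(n_components=2).fit_transform(X)  # reduce to two plottable dimensions

plt.scatter(X2[:, 0], X2[:, 1], c=labels)
plt.title("Clusters projected to 2D via PCA")
plt.show()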
Standardized Data Model Demo
(Wheat Kernel Sizes)
A Tour of Scikit-Learn
Questions, Comments?
