Machine-Learning-Using-Python-Pdf-Free (1) - 23-30

The document provides an overview of machine learning, categorizing algorithms into supervised, unsupervised, reinforcement, and evolutionary learning. It discusses the importance of machine learning in organizations for improving key performance indicators and outlines a framework for developing machine learning models, including problem identification, data collection, preprocessing, model building, and deployment. Additionally, it highlights Python as a preferred programming language for machine learning due to its readability, extensive libraries, and strong community support.

Uploaded by

Yazhini K
Copyright
© © All Rights Reserved

2 Machine Learning using Python

[Figure: diagram with three labels: Artificial intelligence, Machine learning, Deep learning.]

FIGURE 1.1 Relationship between artificial intelligence, machine learning, and deep learning.

Machine learning algorithms are classified into four categories as defined below:

1. Supervised Learning Algorithms: These algorithms require the knowledge of both the
outcome variable (dependent variable) and the features (independent variables or input
variables). The algorithm learns (i.e., estimates the values of the model parameters or
feature weights) by minimizing a loss function, which is usually a function of the difference
between the predicted value and the actual value of the outcome variable. Algorithms such
as linear regression, logistic regression, and discriminant analysis are examples of supervised
learning algorithms. In the case of multiple linear regression, the regression parameters
are estimated by minimizing the sum of squared errors, which is given by
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value of the outcome variable,
$\hat{y}_i$ is the predicted value of the outcome variable, and $n$ is the total number of
records in the data. Here the predicted value is a linear or a non-linear function of the
features (or independent variables) in the data. The prediction is achieved (by estimating
feature weights) with the knowledge of the actual values of the outcome variable, which is
why these are called supervised learning algorithms. That is, the supervision is achieved
using the knowledge of the outcome variable values.
2. Unsupervised Learning Algorithms: These algorithms do not have the knowledge of the
outcome variable in the dataset. The algorithms must find the possible values of the
outcome variable. Algorithms such as clustering and principal component analysis are
examples of unsupervised learning algorithms. Since the values of the outcome variable are
unknown in the training data, supervision using that knowledge is not possible.
3. Reinforcement Learning Algorithms: In many datasets, there could be uncertainty around
both the input and the output variables. For example, consider the case of spell check
in various text editors. If a person types “buutiful” in Microsoft Word, the spell check in
Microsoft Word will immediately identify this as a spelling mistake and give options such
as “beautiful”, “bountiful”, and “dutiful”. Here the prediction is not one single value, but a
set of values. Another definition is: Reinforcement learning algorithms are algorithms that
have to take sequential actions (decisions) to maximize a cumulative reward. Markov chains
and Markov decision processes are examples of techniques used in reinforcement learning.
4. Evolutionary Learning Algorithms: Evolutionary algorithms are algorithms that imitate
natural evolution to solve a problem. Techniques such as genetic algorithms and ant colony
optimization fall under the category of evolutionary learning algorithms.

In this book, we will be discussing several supervised and unsupervised learning algorithms.
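
To make the sum of squared errors described under supervised learning concrete, the following minimal sketch fits a single-feature linear regression in plain Python using the closed-form least-squares estimates. The data values and the function names (`fit_simple_linear_regression`, `sum_squared_errors`) are illustrative assumptions and do not come from the book.

```python
# Minimal sketch: simple linear regression fitted by minimizing the
# sum of squared errors (SSE). Data and names are illustrative only.

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize SSE for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares estimates for a single feature.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(x, y, intercept, slope):
    """SSE = sum over all records of (actual value - predicted value) squared."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: one feature x and an outcome y that is roughly linear in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(x, y)
sse_fitted = sum_squared_errors(x, y, intercept, slope)
# Any other line, e.g. one with a shifted intercept, gives a larger SSE.
sse_shifted = sum_squared_errors(x, y, intercept + 0.5, slope)

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"SSE at fitted parameters: {sse_fitted:.4f}")
print(f"SSE with shifted intercept: {sse_shifted:.4f}")
```

The knowledge of the actual outcome values `y` is what supervises the estimation here: without them, neither the SSE nor the fitted parameters could be computed.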

Chapter 01_Introduction to Machine Learning.indd 2 4/24/2019 6:53:03 PM


Chapter 1 · Introduction to Machine Learning 3

1.2 | WHY MACHINE LEARNING?


Organizations across the world use several performance measures such as return on investment (ROI),
market share, customer retention, sales growth, customer satisfaction, and so on for quantifying,
monitoring, benchmarking, and improving their performance. For effective management, organizations
would like to understand the association between key performance indicators (KPIs) and the factors
that have a significant impact on them. Knowledge of the relationship between KPIs and these factors
provides the decision maker with appropriate actionable items (U D Kumar, 2017). Machine learning
algorithms can be used for identifying the factors that influence the KPIs, which can in turn be used
for decision making and value creation. Organizations such as Amazon, Apple, Capital One, General
Electric, Google, IBM, Facebook, Procter and Gamble, and so on use ML algorithms to create new products
and solutions. ML can create significant value for organizations if used properly. MacKenzie et al. (2013)
reported that Amazon’s recommender systems resulted in a sales increase of 35%.
A typical ML project involves the following steps:

1. Identify the problem or opportunity for value creation.
2. Identify sources of data (primary as well as secondary data sources) and create a data lake
(an integrated data set from different sources).
3. Pre-process the data for issues such as missing and incorrect data. Generate derived variables
(feature engineering) and transform the data if necessary. Prepare the data for ML model building.
4. Divide the dataset into training and validation subsets.
5. Build ML models and identify the best model(s) using model performance on the validation data.
6. Implement the solution or decision, or develop the product.
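
Steps 4 and 5 above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the synthetic data, the 80/20 split, the two candidate models (a mean-only baseline and a one-feature least-squares line), and all function names are hypothetical choices, not the book's method.

```python
import random

# Minimal sketch of steps 4 and 5: split the data, fit candidate models on the
# training subset, and pick the one with the lowest validation error.
# The synthetic data and both candidate models are illustrative assumptions.

def train_validation_split(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (training, validation)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_mean_model(train):
    """Baseline model: always predict the mean outcome seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_linear_model(train):
    """One-feature least-squares line fitted on the training subset."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    s_xx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, data):
    """Average squared difference between actual and predicted outcomes."""
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)


# Synthetic (feature, outcome) records: a noisy linear relationship.
rng = random.Random(0)
records = [(float(x), 3.0 * x + 5.0 + rng.gauss(0, 1)) for x in range(50)]

train, validation = train_validation_split(records)
candidates = {"mean": fit_mean_model(train), "linear": fit_linear_model(train)}
scores = {name: mean_squared_error(m, validation) for name, m in candidates.items()}
best_model_name = min(scores, key=scores.get)

print(scores)
print("Best model on validation data:", best_model_name)
```

The models never see the validation records during fitting, so the validation error is a fairer basis for comparing them than the training error.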

1.3 | FRAMEWORK FOR DEVELOPING MACHINE LEARNING MODELS


The framework for ML algorithm development can be divided into five integrated stages: problem and
opportunity identification, collection of relevant data, data pre-processing, ML model building, and model
deployment. The various activities carried out during these different stages are described in Figure 1.2.
The success of ML projects depends more on how innovatively the organization uses the data than on
the mechanical use of ML tools. Although there are several routine ML projects such as
customer segmentation, clustering, forecasting, and so on, highly successful companies blend
innovation with ML algorithms.
The success of ML projects will depend on the following activities:

1. Feature Extraction: Feature extraction is the process of extracting features from different sources.
For a given problem, it is important to identify the features or independent variables that may be
necessary for building the ML model. Organizations store the data they capture in enterprise
resource planning (ERP) systems, but there is no guarantee that the organization would
have identified all important features while designing the ERP system. It is also possible that the
problem being addressed using the ML algorithm may require data that is not captured by the
organization. For example, consider a company that is interested in predicting the warranty cost
for the vehicles it manufactures. The number of warranty claims may depend on weather
conditions such as rainfall, humidity, and so on, which the company's own systems are unlikely
to record. In many cases, feature extraction itself can be an iterative process.


Problem or Opportunity Identification


• A good ML project starts with the ability of the organization to define the
problem clearly. Domain knowledge is very important at this stage of the project.
• Problem definition or opportunity identification will be a major challenge for
many companies that do not have the capability to ask the right questions.

Feature Extraction − Collection of Relevant Data


• Once the problem is defined clearly, the project team should identify and collect
the relevant data. This is an iterative process since 'relevant data' may not
be known in advance in many analytics projects. The existence of ERP systems
will be very useful at this stage. In addition to the data available within the
organization, the team may have to collect data from external sources. The data
needs to be integrated to create a data lake. Poor data quality is a major impediment
to successful ML model development.

Data Pre-processing
• Anecdotal evidence suggests that data preparation and data processing form a
significant proportion of any analytics project. This would include data cleaning
and data imputation and the creation of additional variables (feature engineering)
such as interaction variables and dummy variables.

Model Building
• ML model building is an iterative process that aims to find the best model. Several
analytical tools and solution procedures will be used to find the best ML model.
• To avoid overfitting, it is important to create several training and validation datasets.

Communication and Deployment of the Data Analysis


• The primary objective of machine learning is to come up with actionable items that
can be deployed.
• The communication of the ML algorithm output to the top management and clients
plays a crucial role. Innovative data visualization techniques may be used in this stage.
• Deployment of the model may involve developing software solutions and products,
such as a recommender engine.

FIGURE 1.2 Framework of ML model development.

2. Feature Engineering: Once the data is made available (after feature extraction), an important
step in machine learning is feature engineering. The model developer should decide how to use
the data that has been captured by deriving new features. For example, if X1 and
X2 are two features captured in the original data, we can derive new features by taking their
ratio (X1/X2) and product (X1X2). There are many other innovative ways of deriving new features,
such as binning a continuous variable, centring the data (deviation from the mean), and so on.
The success of the ML model may depend on feature engineering.
3. Model Building and Feature Selection: During model building, the objective is to identify the
model that is most suitable for the given problem context. The selected model may not always be
the most accurate model, since the most accurate model may take more time to compute and may
require expensive infrastructure. The final model for deployment will be based on multiple criteria
such as accuracy, computing speed, cost of deployment, and so on. As a part of model building, we
will also go through feature selection, which identifies the important features that have a
significant relationship with the outcome variable.


4. Model Deployment: Once the final model is chosen, the organization must decide on the
strategy for model deployment. Model deployment can be in the form of simple business rules,
chatbots, real-time actions, robots, and so on.
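
The feature engineering ideas mentioned in the list above (ratio, product, binning, and centring) can be sketched in plain Python as follows. The records, the bin edges, and the names `x1`, `x2`, and `bin_value` are made-up assumptions for illustration; in practice such transformations are usually done on a Pandas DataFrame.

```python
# Minimal sketch of deriving new features: ratio, product, binning, and
# centring. The records and the bin edges are made-up assumptions.

records = [
    {"x1": 10.0, "x2": 2.0},
    {"x1": 30.0, "x2": 5.0},
    {"x1": 50.0, "x2": 10.0},
]

# Mean of X1 over all records, needed for centring.
mean_x1 = sum(r["x1"] for r in records) / len(records)


def bin_value(value, edges=(20.0, 40.0)):
    """Bin a continuous value into 'low', 'medium', or 'high' (hypothetical edges)."""
    if value < edges[0]:
        return "low"
    if value < edges[1]:
        return "medium"
    return "high"


for r in records:
    r["ratio"] = r["x1"] / r["x2"]        # X1 / X2
    r["product"] = r["x1"] * r["x2"]      # X1 * X2
    r["x1_bin"] = bin_value(r["x1"])      # binned continuous variable
    r["x1_centred"] = r["x1"] - mean_x1   # deviation from the mean

for r in records:
    print(r)
```

Whether any of these derived features helps is an empirical question that is settled during model building and feature selection.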

In the next few sections we will discuss why Python has become one of the most widely adopted
languages for machine learning, what features and libraries are available in Python, and how to get
started with the Python language.

1.4 | WHY PYTHON?


Python is an interpreted, high-level, general-purpose programming language (Downey, 2012). One of
its key design philosophies is code readability. Ease of use and high productivity have made
Python very popular. Based on the number of question views on Stack Overflow (Kauflin, 2017), as
shown in Figure 1.3, Python seems to have gained significantly in attention and popularity compared
to other languages since 2012.

[Figure: line chart titled "Growth of major programming languages", based on Stack Overflow question
views in World Bank high-income countries. Y-axis: % of overall question views each month (0% to 9%);
x-axis: time (2012 to 2018). Series: Python, JavaScript, Java, C#, PHP, C++.]

FIGURE 1.3 Overall question views for Python.

Source: https://www.forbes.com/sites/jeffkauflin/2017/05/12/the-five-most-in-demand-coding-languages/#211777b2b3f5

Python has a rich ecosystem and is excellent for developing prototypes quickly. It has a comprehensive
set of core libraries for data analysis and visualization. Unlike R, Python is not built only for data
analysis; it is a general-purpose language. Python can be used to build web applications and enterprise
applications, and it is easier to integrate with existing systems in an enterprise for data collection
and preparation.


Data science projects require extracting data from various sources, data cleaning, and data imputation,
besides model building, validation, and making predictions. Enterprises typically want to build
end-to-end integrated systems, and Python is a powerful platform for building these systems.
Data analysis is mostly an iterative process, where a lot of exploration needs to be done in an ad hoc
manner. Python, being an interpreted language, provides an interactive interface for accomplishing this.
Python’s strong community continuously evolves its data science libraries and keeps them cutting edge.
It has libraries for linear algebra computations, statistical analysis, machine learning, visualization,
optimization, stochastic models, etc. We will discuss the different libraries in detail in the subsequent
section.
Python has a shallow learning curve and is one of the easiest languages to learn.
The following link provides a list of enterprises using Python for various applications ranging from
web programming to complex scalable applications:
https://www.python.org/about/success/
An article published on forbes.com ranks Python among the top five languages with the highest demand
in the industry (link to the article is provided below):
https://www.forbes.com/sites/jeffkauflin/2017/05/12/the-five-most-in-demand-coding-languages/#6e2dc575b3f5
A search on the www.indeed.com site for the number of job posts that mention languages such as Python,
R, Java, Scala, and Julia together with terms like “data science” or “machine learning” gives the trend
shown in Figure 1.4. Python has become the language with the most demand since 2016, and its share is
growing very rapidly.

[Figure: line chart titled "Job Postings". Y-axis: Percentage of Matching Job Postings (%), 0.000 to
0.040; x-axis: 2014 to 2017. Series: python, R, scala, java, and julia, each combined with
(“data science” or “machine learning”). Legend values for Jun 16, 2017: python 0.0359%, R 0.0286%;
values for scala, java, and julia are not shown.]

FIGURE 1.4 Trend on job postings.

Source: http://makemeanalyst.com/most-popular-languages-for-data-science-and-analytics-2017/


1.5 | PYTHON STACK FOR DATA SCIENCE


The Python community has developed several libraries that cater to specific areas of data science
applications. For example, there are libraries for statistical computations, machine learning,
DataFrame operations, visualization, scientific computation using arrays and matrices, etc.

[Figure: core libraries and their roles, as labelled in the diagram:
NumPy: Efficient storage of arrays and matrices. Backbone of all scientific calculations and algorithms.
SciPy: Library for scientific computing. Linear algebra, statistical computations, optimization algorithms.
StatsModels: Statistics in Python.
matplotlib and seaborn: Plotting and visualization.
Pandas: High-performance, easy-to-use data structures for data manipulation and analysis. Pandas
provides the DataFrame, which is very popular in the area of analytics for data munging, cleaning,
and transformation.
IDE or development environment for data analysis in Python.
scikit-learn: Machine learning library. Collection of ML algorithms: supervised and unsupervised.]

FIGURE 1.5 Python core libraries for data science applications.

Figure 1.5 shows the important Python libraries that are used for developing data science or machine
learning models. Table 1.1 provides details of these libraries and the websites where their
documentation can be found. We will use these throughout the book.

TABLE 1.1 Core Python Libraries for Data Analysis

| Areas of Application | Library | Description | Documentation Website |
|---|---|---|---|
| Statistical Computations | SciPy | SciPy contains modules for optimization and computation. It provides libraries for several statistical distributions and statistical tests. | www.scipy.org |
| Statistical Modelling | StatsModels | StatsModels is a Python module that provides classes and functions for various statistical analyses. | www.statsmodels.org/stable/index.html |
| Mathematical Computations | NumPy | NumPy is the fundamental package for scientific computing involving large arrays and matrices. It provides useful mathematical computation capabilities. | www.numpy.org |
| Data Structure Operations (DataFrames) | Pandas | Pandas provides high-performance, easy-to-use data structures called DataFrames for exploration and analysis. DataFrames are the key data structures that feed into most of the statistical and machine learning models. | pandas.pydata.org |
| Visualization | Matplotlib | It is a 2D plotting library. | matplotlib.org |
| More Elegant Visualization | Seaborn | According to seaborn.pydata.org, Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. | seaborn.pydata.org |
| Machine Learning Algorithms | Scikit-learn (aka sklearn) | Scikit-learn provides a range of supervised and unsupervised learning algorithms. | scikit-learn.org |
| IDE (Integrated Development Environment) | Jupyter Notebook | According to jupyter.org, the Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. | jupyter.org |
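
As a quick way to see which of the libraries in Table 1.1 are present in a given Python environment, the following minimal sketch uses the standard-library function `importlib.util.find_spec`. The dictionary `LIBRARIES` and the function `check_libraries` are illustrative names; the import names on the right-hand side (for example `sklearn` for Scikit-learn and `notebook` for Jupyter Notebook) are the module names these packages are commonly imported under.

```python
import importlib.util

# Import names for the libraries in Table 1.1. Scikit-learn is imported as
# "sklearn" and the Jupyter Notebook package as "notebook".
LIBRARIES = {
    "SciPy": "scipy",
    "StatsModels": "statsmodels",
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Scikit-learn": "sklearn",
    "Jupyter Notebook": "notebook",
}


def check_libraries(libraries):
    """Return {library name: True/False} depending on whether the module is found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in libraries.items()
    }


availability = check_libraries(LIBRARIES)
for name, installed in availability.items():
    print(f"{name:18s} {'installed' if installed else 'missing'}")
```

The check only looks up each module without importing it, so it runs quickly and does not fail when a library is missing.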

1.6 | GETTING STARTED WITH ANACONDA PLATFORM


We recommend using the Anaconda platform for data science. The Anaconda distribution simplifies the
installation process by including almost everything we need for working on data science tasks. It
contains the core Python language, as well as all the essential libraries including NumPy, Pandas, SciPy,
Matplotlib, sklearn, and Jupyter notebook. It has distributions for all operating system (OS)
environments (e.g., Windows, macOS, and Linux). We recommend using a Python 3.5+ environment for
Anaconda. All the code in the book is written using Anaconda 5.0 for Python 3.5+.
Follow the steps below for installation:

Step 1: Go to Anaconda Site


Go to https://www.anaconda.com/distribution/ in your browser.

Step 2: Download Anaconda Installer for your Environment


Select your OS environment and choose the Python 3.7 version to download the installation files, as shown
in Figure 1.6.

[Figure: screenshot of the Anaconda download page with tabs for Windows, macOS, and Linux. Under
"Anaconda 2018.12 for macOS Installer", two download options are shown: Python 3.7 version
(64-Bit Graphical Installer, 652.7 MB; 64-Bit Command Line Installer, 557 MB) and Python 2.7 version
(64-Bit Graphical Installer, 640.7 MB; 64-Bit Command Line Installer, 547 MB).]

FIGURE 1.6 Anaconda distribution site for downloading the installer.

Source: www.anaconda.com


Step 3: Install Anaconda


Double click on the downloaded file and follow the on-screen installation instructions, leaving the
options at their default settings. The installation takes a while to complete.

Step 4: Start Jupyter Notebook


Open the command terminal window for your OS environment and type the following command, as
shown in Figure 1.7.
jupyter notebook --ip=*

FIGURE 1.7 Screenshot of starting the jupyter notebook.

This should start the Jupyter notebook and open a browser window in your default browser software as
shown in Figure 1.8.

FIGURE 1.8 Screenshot of the file system explorer of Jupyter notebook open in the browser.

The reader can also open the notebook in a browser window using the URL highlighted below. The URL also
contains the password token, as shown in Figure 1.9.
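
Once a notebook is open, a first cell such as the following can confirm that the running interpreter meets the Python 3.5+ recommendation made earlier in this section. This is a minimal sketch; the names `MINIMUM_VERSION` and `meets_minimum` are illustrative and not part of Anaconda or Jupyter.

```python
import sys

# The text recommends Python 3.5 or later; the threshold below reflects that.
MINIMUM_VERSION = (3, 5)


def meets_minimum(version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python executable:", sys.executable)
print("Python version:", sys.version.split()[0])
print("Meets the 3.5+ recommendation:", meets_minimum(sys.version_info))
```

The printed executable path also shows whether the notebook is running the Python that Anaconda installed or a different interpreter on the machine.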
