FIGURE 1.1 Relationship between artificial intelligence, machine learning, and deep learning.
Machine learning algorithms are classified into four categories as defined below:
1. Supervised Learning Algorithms: These algorithms require knowledge of both the outcome variable (dependent variable) and the features (independent or input variables). The algorithm learns (i.e., estimates the values of the model parameters or feature weights) by minimizing a loss function, which is usually a function of the difference between the predicted and actual values of the outcome variable. Algorithms such as linear regression, logistic regression, and discriminant analysis are examples of supervised learning algorithms. In the case of multiple linear regression, the regression parameters
are estimated by minimizing the sum of squared errors, given by $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value of the outcome variable, $\hat{y}_i$ is the predicted value of the outcome variable, and $n$ is the total number of records in the data. Here the predicted value is a linear
or a non-linear function of the features (or independent variables) in the data. The prediction is achieved (by estimating feature weights) using the known actual values of the outcome variable, which is why these are called supervised learning algorithms: the supervision comes from the knowledge of the outcome variable values. A minimal sketch illustrating this idea with linear regression is shown after this list.
2. Unsupervised Learning Algorithms: These algorithms do not have knowledge of the outcome variable in the dataset; they must find the possible values of the outcome variable on their own. Algorithms such as clustering and principal component analysis are examples of unsupervised learning algorithms. Since the values of the outcome variable are unknown in the training data, supervision using that knowledge is not possible.
3. Reinforcement Learning Algorithms: In many datasets, there could be uncertainty around both the input and the output variables. For example, consider the case of spell check in various text editors. If a person types “buutiful” in Microsoft Word, the spell check in Microsoft Word will immediately identify this as a spelling mistake and give options such as “beautiful”, “bountiful”, and “dutiful”. Here the prediction is not one single value, but a set of values. Another definition is: reinforcement learning algorithms are algorithms that have to take sequential actions (decisions) to maximize a cumulative reward. Techniques such as Markov chains and Markov decision processes are examples of reinforcement learning algorithms.
4. Evolutionary Learning Algorithms: Evolutionary algorithms imitate natural evolution to solve a problem. Techniques such as genetic algorithms and ant colony optimization fall under the category of evolutionary learning algorithms.
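To make the notion of supervision concrete, here is a minimal sketch that fits a simple linear regression and computes the sum of squared errors. The toy data values are invented purely for illustration, and the use of scikit-learn's LinearRegression is just one convenient way of estimating the feature weights.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (invented for illustration): one feature X and the known outcome y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

model = LinearRegression()
model.fit(X, y)                      # supervision: the actual outcome values are used

y_pred = model.predict(X)
sse = np.sum((y - y_pred) ** 2)      # sum of squared errors, sum_i (y_i - y_hat_i)^2
print("estimated weight:", model.coef_[0])
print("intercept:", model.intercept_)
print("sum of squared errors:", sse)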
In this book, we will be discussing several supervised and unsupervised learning algorithms.
1. Feature Extraction: Feature extraction is the process of identifying and gathering features from different data sources. For a given problem, it is important to identify the features or independent variables that may be necessary for building the ML algorithm. Organizations store the data they capture in enterprise resource planning (ERP) systems, but there is no guarantee that all important features were identified when the ERP system was designed. It is also possible that the problem being addressed using the ML algorithm requires data that is not captured by the organization. For example, consider a company that is interested in predicting the warranty cost for the vehicles it manufactures. The number of warranty claims may depend on weather conditions such as rainfall, humidity, and so on, which the company may not capture in its own systems. In many cases, feature extraction itself is an iterative process.
Data Pre-processing
• Anecdotal evidence suggests that data preparation and data processing form a significant proportion of any analytics project. This includes data cleaning, data imputation, and the creation of additional variables (feature engineering), such as interaction variables and dummy variables (an illustrative pandas sketch of these steps appears after the Model Building box below).
Model Building
• ML model building is an iterative process that aims to find the best model. Several
analytical tools and solution procedures will be used to find the best ML model.
• To avoid overfitting, it is important to create several training and validation datasets.
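The following is a minimal sketch of the pre-processing steps mentioned above, using pandas. The DataFrame, its column names, and the chosen imputation strategy (median for numeric columns, mode for the categorical one) are illustrative assumptions, not prescriptions from the text.

import numpy as np
import pandas as pd

# Hypothetical raw data with missing values in numeric and categorical columns
df = pd.DataFrame({
    "age":   [25, np.nan, 47, 35],
    "city":  ["Bangalore", "Mumbai", np.nan, "Delhi"],
    "spend": [1200, 800, 1500, np.nan],
})

# Data imputation: fill numeric gaps with the median, categorical gaps with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["spend"] = df["spend"].fillna(df["spend"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Additional variables: dummy variables for the categorical column and an interaction variable
df = pd.get_dummies(df, columns=["city"])
df["age_x_spend"] = df["age"] * df["spend"]
print(df)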
2. Feature Engineering: Once the data is made available (after feature extraction), an important step in machine learning is feature engineering, in which the model developer decides how to derive new features from the data that has been captured. For example, if X1 and X2 are two features captured in the original data, we can derive new features by taking their ratio (X1/X2) and their product (X1X2). There are many other ways of deriving new features, such as binning a continuous variable, centring the data (taking deviations from the mean), and so on; a short pandas sketch of such derived features appears after this list. The success of the ML model may depend on feature engineering.
3. Model Building and Feature Selection: During model building, the objective is to identify the model that is most suitable for the given problem context. The selected model may not always be the most accurate one, as a more accurate model may take more time to compute and may require expensive infrastructure. The final model for deployment will be chosen based on multiple criteria such as accuracy, computing speed, cost of deployment, and so on. As a part of model building, we also go through feature selection, which identifies the important features that have a significant relationship with the outcome variable.
4. Model Deployment: Once the final model is chosen, the organization must decide its strategy for model deployment. Model deployment can be in the form of simple business rules, chatbots, real-time actions, robots, and so on.
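The sketch below illustrates the feature engineering ideas from item 2: deriving ratio, product, centred, and binned features with pandas. The columns X1 and X2 and their values are hypothetical and exist only for this example.

import pandas as pd

# Hypothetical original features X1 and X2
df = pd.DataFrame({"X1": [10.0, 20.0, 30.0, 40.0],
                   "X2": [2.0, 4.0, 5.0, 8.0]})

df["X1_by_X2"] = df["X1"] / df["X2"]             # ratio feature
df["X1_times_X2"] = df["X1"] * df["X2"]          # product (interaction) feature
df["X1_centred"] = df["X1"] - df["X1"].mean()    # centring: deviation from the mean
df["X1_bin"] = pd.cut(df["X1"], bins=3,          # binning the continuous variable
                      labels=["low", "medium", "high"])
print(df)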
In the next few sections we will discuss why Python has become one of the most widely adopted languages for machine learning, what features and libraries are available in Python, and how to get started with the Python language.
[Figure: Percentage of overall question views each month for Python, JavaScript, Java, C#, PHP, and C++, 2012–2018.]
Python has an amazing ecosystem and is excellent for developing prototypes quickly. It has a comprehensive set of core libraries for data analysis and visualization. Unlike R, Python is not built only for data analysis; it is a general-purpose language that can be used to build web applications and enterprise applications, and it is easier to integrate with existing enterprise systems for data collection and preparation.
Data science projects need extraction of data from various sources, data cleaning, and data imputation, besides model building, validation, and making predictions. Enterprises typically want to build end-to-end integrated systems, and Python is a powerful platform for building such systems.
Data analysis is mostly an iterative process in which a lot of exploration needs to be done in an ad hoc manner. Python, being an interpreted language, provides an interactive interface for accomplishing this. Python's strong community continuously evolves its data science libraries and keeps them cutting edge.
It has libraries for linear algebra computations, statistical analysis, machine learning, visualization, optimization, stochastic models, and more. We will discuss the different libraries in detail in the subsequent section.
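As a quick, illustrative way to verify that a few of these core libraries are installed in an environment, a snippet like the following can be run. The particular libraries shown are only a representative subset chosen for this sketch, not the full list discussed later.

import numpy as np          # linear algebra and numerical computation
import pandas as pd         # data handling and analysis
import matplotlib           # visualization
import sklearn              # machine learning (scikit-learn)

for name, module in [("numpy", np), ("pandas", pd),
                     ("matplotlib", matplotlib), ("scikit-learn", sklearn)]:
    print(name, module.__version__)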
Python has a shallow learning curve and is one of the easiest languages in which to come up to speed.
The following link provides a list of enterprises using Python for various applications ranging from
web programming to complex scalable applications:
https://www.python.org/about/success/
An article published on forbes.com lists Python among the top five languages with the highest demand in the industry (link to the article is provided below):
https://www.forbes.com/sites/jeffkauflin/2017/05/12/the-five-most-in-demand-coding-languages/#6e2dc575b3f5
A search on www.indeed.com for the number of job postings that mention languages such as Python, R, Java, Scala, and Julia together with terms like “data science” or “machine learning” gives the trend shown in Figure 1.4. It is clear that Python has been the language in highest demand since 2016, and its demand is growing rapidly.
[Figure 1.4: Trend of job postings on www.indeed.com, 2014–2017, for languages such as Python, R, Java, Scala, and Julia combined with “data science” or “machine learning”.]
Figure 1.5 shows the important Python libraries that are used for developing data science or machine
learning models. Table 1.1 also provides details of these libraries and the websites hosting their documentation. We will use these libraries throughout the book.
This should start the Jupyter notebook and open a browser window in your default browser software as
shown in Figure 1.8.
FIGURE 1.8 Screenshot of the file system explorer of Jupyter notebook open in the browser.
The reader can also open a browser window using the URL highlighted below; the URL contains the password token, as shown in Figure 1.9.