Machine Learning with Python
Supervised Learning
Ligdi González
2019
INDEX
Chapter 1
INTRODUCTION
MACHINE LEARNING WITH PYTHON
Chapter 2
PREPARE THE PYTHON ENVIRONMENT
PYTHON
LIBRARIES FOR DATA SCIENCE
LIBRARIES FOR DATA VISUALIZATION
LIBRARIES FOR MACHINE LEARNING
INSTALL THE PYTHON ENVIRONMENT
WORKING WITH THE SPYDER IDE
WORKING WITH JUPYTER NOTEBOOK IDE
Chapter 3
INTENSIVE COURSES IN PYTHON AND SCIPY
INTENSIVE PYTHON COURSE
INTENSIVE NUMPY COURSE
INTENSIVE PANDAS COURSE
INTENSIVE MATPLOTLIB COURSE
Chapter 4
LOAD THE DATA
CONSIDERATIONS OF CSV FILES
DATASET: PIMA INDIANS DIABETES
LOAD CSV FILES USING PYTHON
LOAD OTHER FILE FORMATS USING PYTHON
Chapter 5
UNDERSTANDING THE DATA
CHECK THE DATA
DIMENSIONS OF THE DATA
DATA TYPES
STATISTICAL DESCRIPTION
CLASS DISTRIBUTION (CLASSIFICATION PROBLEMS)
CORRELATION BETWEEN CHARACTERISTICS
ADDITIONAL CONSIDERATIONS
Chapter 6
VISUALIZING THE DATA
GRAPHING WITH A SINGLE VARIABLE
GRAPHING WITH MULTIPLE VARIABLES
Chapter 7
DATA PROCESSING
DATA SEPARATION
STANDARDIZATION OF DATA
DATA NORMALIZATION
COLUMN REMOVAL
Chapter 8
SELECTING FEATURES
IMPORTANCE OF FEATURES
FILTER METHODS
WRAPPER METHODS
EMBEDDED METHODS
Chapter 9
CLASSIFICATION ALGORITHMS
LOGISTIC REGRESSION
K - NEAREST NEIGHBORS
SUPPORT VECTOR MACHINES
NAIVE BAYES
DECISION TREES CLASSIFICATION
RANDOM FORESTS CLASSIFICATION
Chapter 10
PERFORMANCE METRICS FOR CLASSIFICATION ALGORITHMS
CONFUSION MATRIX
CLASSIFICATION REPORT
AREA UNDER THE CURVE
Chapter 11
REGRESSION ALGORITHMS
LINEAR REGRESSION
POLYNOMIAL REGRESSION
SUPPORT VECTOR REGRESSION
REGRESSION DECISION TREES
RANDOM FOREST REGRESSION
Chapter 12
PERFORMANCE METRICS FOR REGRESSION ALGORITHMS
ROOT MEAN SQUARE ERROR (RMSE)
MEAN ABSOLUTE ERROR (MAE)
Chapter 13
MACHINE LEARNING PROJECT - CLASSIFICATION
DEFINITION OF THE PROBLEM
IMPORT THE LIBRARIES
LOAD THE DATASET
UNDERSTANDING THE DATA
VISUALIZING THE DATA
DATA SEPARATION
SELECTING FEATURES
DATA PROCESSING
SEPARATION OF THE DATA
APPLICATION OF CLASSIFICATION ALGORITHMS
Chapter 14
MACHINE LEARNING PROJECT - REGRESSION
DEFINITION OF THE PROBLEM
IMPORT THE LIBRARIES
LOAD THE DATASET
UNDERSTANDING THE DATA
VISUALIZING THE DATA
DATA SEPARATION
SEPARATION OF THE DATA
APPLICATION OF REGRESSION ALGORITHMS
Chapter 15
CONTINUE LEARNING MACHINE LEARNING
BUILD TEMPLATES FOR PROJECTS
DATA SETS FOR PRACTICE
Chapter 16
GET MORE INFORMATION
GENERAL ADVICE
HELP WITH PYTHON
HELP WITH SCIPY AND NUMPY
HELP WITH MATPLOTLIB
HELP WITH PANDAS
HELP WITH SCIKIT-LEARN
Chapter 1
INTRODUCTION
This book is a guide to learning applied Machine Learning with Python. It uncovers, step by step, the process you can use to get started in Machine Learning with the Python ecosystem.
MACHINE LEARNING WITH PYTHON
This book focuses on Machine Learning projects with the Python programming language. It is written to explain each of the steps needed to develop a Machine Learning project with classification and regression algorithms. It is organized as follows:
Learn how the subtasks of a Machine Learning project map to Python and the best way to work through each task.
Finally, all of the knowledge explained is brought together to develop a classification problem.
In the first chapters, you will learn how to complete the specific subtasks of a Machine Learning project using Python. Once you learn how to complete a task using the platform, you can obtain a reliable result that can be used over and over again, project after project.
A Machine Learning project can be divided into 4 tasks:
Define the problem: investigate and characterize the problem to better understand the objectives of the project.
Analyze the data: use descriptive statistics and visualization to better understand the data that is available.
Prepare the data: use data transformations to better expose the structure of the prediction problem to the Machine Learning algorithms.
Evaluate the algorithms: design tests to evaluate a standard set of algorithms on the data and select the best ones to investigate further.
In the last part of the book, everything learned is brought together in the development of two projects: one with classification algorithms and the other with regression algorithms.
Each of the lessons is designed to be read from beginning to end, in order, and shows exactly how to complete each task in a Machine Learning project. Of course, you can return to specific chapters later to refresh your knowledge. The chapters are structured to demonstrate the libraries and functions and to show specific techniques for a Machine Learning task.
Each chapter is designed to be completed in less than 30 minutes, depending on your skill level and enthusiasm. It is possible to work through the whole book in one weekend. It also works, if you prefer, to dive into specific chapters and use the book as a reference.
Chapter 2
PREPARE THE PYTHON ENVIRONMENT
Python is growing and may become the dominant platform for Machine Learning. The main reason to adopt this programming language is that it is a general-purpose language that can be used both for research and development and in production.
In this chapter, you will learn everything related to the Python environment for Machine Learning:
1. Python and its use for Machine Learning.
2. The basic libraries for Machine Learning, such as SciPy, NumPy and Pandas.
3. The library for data visualization, matplotlib.
4. The library for implementing Machine Learning algorithms, scikit-learn.
5. Set up the Python environment using Anaconda.
6. Set up the Spyder IDE.
7. Set up the Jupyter Notebook IDE.
PYTHON
Python leads among Machine Learning development languages due to its simplicity and ease of learning. Python is used by more and more data scientists and developers for building and analyzing models. In addition, it is a success among beginners who are new to Machine Learning.
Unlike other programming languages used for Machine Learning, like R or MATLAB, data processing and scientific mathematical expressions are not built into the language itself; instead, libraries like SciPy, NumPy and Pandas offer equivalent functionality in arguably more accessible syntax.
Specialized Machine Learning libraries like scikit-learn, Theano and TensorFlow provide the ability to train a variety of Machine Learning models, potentially using distributed computing infrastructure.
LIBRARIES FOR DATA SCIENCE
These are the basic libraries that transform Python from a language of
general-purpose programming in a powerful and robust tool
for data analysis and visualization, these are the foundations on which
they are based on the most specialized tools.
SciPy
SciPy is a software library for engineering and science. It includes functions for linear algebra, optimization, integration and statistics, and provides efficient numerical routines such as numerical integration and optimization through its specific submodules. The main functionality of this library is built on NumPy and its arrays, adding a collection of high-level algorithms and commands to manipulate and visualize data.
NumPy
NumPy stands for Numerical Python and is a fundamental library for scientific computing in Python, since it provides vectorization of mathematical operations on the array type, improving performance and, consequently, speeding up execution.
It is focused on managing and treating data as arrays; its purpose is to make it easy to perform the complex matrix operations required by neural networks and advanced statistics.
In summary, NumPy provides objects for multidimensional arrays and matrices, as well as routines that allow developers to perform advanced mathematical and statistical functions on those arrays with as little code as possible.
NumPy is a data management library that is usually paired with TensorFlow, SciPy, matplotlib and many other Python libraries oriented towards Machine Learning and data science.
Pandas
Pandas is a Python library designed to work with "labeled" and "relational" data in a simple and intuitive way; it is designed for quick and easy data manipulation, aggregation and visualization.
Pandas adds data structures and tools that are designed for practical data analysis in finance, statistics and engineering. It works well with incomplete, unordered and unlabeled data, that is, the kind of data you are likely to find in the real world, and provides tools to shape, merge, reshape and split data sets. With this library you can easily add and remove columns from a DataFrame, convert data structures into objects and handle missing data (NaN).
LIBRARIES FOR DATA VISUALIZATION
The best and most sophisticated analysis makes no sense if you can't
communicate it to other people. These libraries allow you to easily create
more visually appealing and sophisticated graphs, tables, and maps, without
import what type of analysis you are trying to do.
Matplotlib
Matplotlib is the standard Python library for creating 2D diagrams and graphs. It is quite low level, which means it requires more commands to generate nice graphs and figures than some more advanced libraries; however, the other side of that is flexibility: with enough commands you can make almost any type of graph you want with matplotlib.
LIBRARIES FOR MACHINE LEARNING
Machine Learning sits at the intersection of Artificial Intelligence and statistical analysis. By training computers with real-world datasets, algorithms can be created that make more precise and sophisticated predictions, whether that means obtaining better driving directions or building computers that can identify landmarks simply by looking at images.
Scikit-learn
The scikit-learn library is definitely one of the most popular for Machine Learning. It has a large number of features for data mining and data analysis, which makes it an excellent option both for researchers and for developers. It is built on top of the popular NumPy, SciPy and matplotlib libraries, so it will feel familiar when you use it.
Scikit-learn exposes a concise and consistent interface to the common Machine Learning algorithms, which makes it easy to bring into production systems. The library combines quality code and good documentation, ease of use and high performance, and it is the de facto industry standard for Machine Learning with Python.
INSTALL THE PYTHON ENVIRONMENT
There are multiple ways to install the Python environment, the IDEs and the libraries used for Machine Learning, but the easiest way, especially if you are just learning and do not know how to work comfortably in the command terminal, is to use the Anaconda environment.
You just have to go to the official page, anaconda.com, and press the 'Downloads' button. There, select the operating system that the computer uses. Then select the version of Python to use; it is recommended to select version 3.6, since most Machine Learning libraries are developed for this version, while the other one is becoming more obsolete every day.
Once it has been downloaded, proceed to install it on the computer; you must have a little patience because this whole process sometimes takes time.
At the end of all this, Python will be installed on the computer, along with a large part of the libraries used in Machine Learning and even several IDEs that can be used for project development.
WORKING WITH THE SPYDER IDE
Spyder can be found in the Anaconda browser, 'Anaconda-Navigator'. To open Spyder, you just need to press 'Launch' on its application box.
Once you start Spyder, you should see an editor window open on the left side and a Python console window at the bottom right. The panel located in the upper right holds a variable explorer, a file explorer and a help browser. Like most IDEs, you can change which panels are visible and their layout within the window.
You can start working with Spyder immediately in the console window. By default, Spyder provides an IPython console that can be used to interact directly with the Python engine. It works essentially the same way as the command line; the big difference is that Spyder can inspect the contents of the Python engine and can do things such as display variables and their contents within the variable explorer.
WORKING WITH JUPYTER NOTEBOOK IDE
You can find Jupyter Notebook in the Anaconda browser, Anaconda-Navigator, just like Spyder.
Once you choose to open Jupyter, a tab opens in the computer's default web browser. From here you can navigate through the computer's folders to decide where the project document will be saved. A new Notebook must be created in Python 3; a new tab opens at this point, and this is where you must start developing the project.
For the purposes of the book, all lines of code and their results are produced using Jupyter Notebook as the IDE, because its presentation of the data is ideal for ease of understanding, but you can select any IDE of your preference.
Chapter 3
INTENSIVE COURSES IN PYTHON AND SCIPY
To develop Machine Learning algorithms in Python, it is not necessary to be an expert Python developer; just having the basics of programming in any language is enough to understand Python very easily. You only need to know some properties of the language and transfer what you already know to Python.
In this chapter you will learn:
1. Basic syntax of Python.
2. Basic knowledge of programming with the NumPy, Pandas and Matplotlib libraries.
3. The foundations for building Machine Learning tasks in Python.
If you already know something about Python, this chapter will be a knowledge refresher.
INTENSIVE PYTHON COURSE
Python is a powerful computational tool that can be used to solve complicated tasks in the areas of finance, economics, data science and Machine Learning.
The technical advantages that Python offers compared to other programming languages are the following:
It is free and constantly updated.
It can be used in multiple domains.
It does not require much time to process calculations, and it has an intuitive syntax that allows complex lines of code to be programmed.
With all these features, Python can be implemented in countless applications.
Therefore, Python's popularity rests on two pillars: the first is that it is easy to learn, since it has a clear and intuitive syntax, and the second is that it is very powerful, since it can execute a wide variety of complex code.
INTENSIVE NUMPY COURSE
NumPy is a Python package that stands for 'Numerical Python'. It is the main library for scientific computing and provides an array data structure that has some benefits over regular Python lists.
The NumPy array is a powerful N-dimensional array object shaped as rows and columns, in which the elements are stored in their respective memory locations. In the figure this can be seen more clearly: it is a two-dimensional array because it has rows and columns. As can be seen, it has four rows and three columns; if it only had one row, it would have been a one-dimensional array.
Following the explanation above, the first figure shows a one-dimensional or 1D array. The second figure shows a two-dimensional or 2D array, where the rows are indicated as axis 0, while the columns are axis 1.
The number of axes increases with the number of dimensions; 3D arrays have an additional axis 2. These axes only make sense for arrays that have at least two dimensions, since they are not meaningful for one-dimensional arrays.
Among the differences between NumPy arrays and the lists that Python offers for handling data, NumPy takes less memory than Python lists, is quite fast in terms of execution, and does not require much programming effort to run its routines. For these reasons, NumPy is much easier and more convenient to use when developing Machine Learning algorithms.
Knowing all this, two arrays are created, one one-dimensional and the other two-dimensional, using NumPy of course.
With the first instruction, the Python program is told that from now on np will be the reference for everything related to NumPy. Arrays are declared within NumPy in a very simple way.
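A minimal sketch of what this might look like (the variable names a and b are illustrative):

import numpy as np

a = np.array([1, 2, 3])               # one-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])  # two-dimensional array with 2 rows and 3 columns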
NumPy contains several instructions to obtain more information about the arrays created with this library; some of them are the following:
The data type of the elements stored in an array can be found with the 'dtype' attribute, which returns the data type along with its size.
In the same way, the size and shape of the array can be found with the 'size' and 'shape' attributes, respectively.
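A possible sketch, assuming the arrays a and b created above:

print(a.dtype)   # data type of the elements, for example int64
print(b.size)    # total number of elements, here 6
print(b.shape)   # (rows, columns), here (2, 3)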
Another operation that can be performed with NumPy is changing the size and shape of arrays. Changing the shape of an array means changing the number of rows and columns, which gives a new view of the object.
In the image, 3 columns and 2 rows have been converted into 2 columns and 3 rows using the "reshape" instruction.
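A possible sketch of the reshape operation (the values are illustrative):

c = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows and 3 columns
d = c.reshape(3, 2)                   # now 3 rows and 2 columns, same elements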
Another thing that can be done with NumPy is selecting a single element of an array; for this you only need to specify the row and column number where it is located, and the value stored in that location is returned.
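A possible sketch, assuming the two-dimensional array b created above:

print(b[1, 2])   # element in row index 1 and column index 2, here 6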
Arithmetic operations can also be done very easily with the NumPy library. You can find the minimum and maximum values and the sum of the array with 'min', 'max' and 'sum', respectively.
More operations can be performed with NumPy arrays, such as the addition, subtraction, multiplication and division of two arrays.
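A possible sketch of these operations (the arrays are illustrative):

print(b.min(), b.max(), b.sum())   # minimum, maximum and sum of the array
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x + y, x - y, x * y, x / y)  # element-wise arithmetic between two arrays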
This covers a large part of the functions used to manipulate data with NumPy. There are still many more instructions, which you will keep learning as you program; remember that the official website of this library offers very good information about all the functions it provides, so do not hesitate to visit it.
INTENSIVE PANDAS COURSE
Pandas is an open-source Python library that provides high-performance data analysis and manipulation tools through its powerful data structures. The name Pandas is derived from the term 'Panel Data' and it is the data analysis library of Python.
Using this library, the five typical steps in data processing and analysis can be carried out, regardless of the origin of the data: load, prepare, manipulate, model and analyze.
A DataFrame is the fundamental structure of Pandas; it is a two-dimensional labeled data structure with columns of potentially different types. Pandas DataFrames consist of three main components: the data, the index and the columns.
Additionally, with the Pandas DataFrame structure you can specify the index and column names. The index distinguishes the rows, while the column names distinguish the columns. These components are very useful when the data needs to be manipulated.
The main features of the Pandas library are:
A fast and efficient DataFrame object with default and custom indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns can be deleted from or inserted into a data structure.
Group-by functionality for aggregation and transformations.
High-performance merging and joining of data.
Time series functionality.
Like any other Python library, it must be imported into the program; in this case the alias pd is usually used, just as the NumPy library is loaded as np.
Creating DataFrames is the first step in any Machine Learning project with Python, so to create a data frame from a NumPy array, the array must be passed to the DataFrame() function in the data argument.
If you look at the result of this code, the selected elements of the NumPy array are used to build the DataFrame: first the values that appear in the lists starting with Row1 and Row2 are selected, then the index or row labels Row1 and Row2, and finally the column names Col1 and Col2.
The way this DataFrame is created will be the same for all structures.
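A minimal sketch of this construction (the values are illustrative):

import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(data=data, index=['Row1', 'Row2'], columns=['Col1', 'Col2'])
print(df)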
Once the DataFrame is created, it can be explored with all the instructions that Pandas offers. The first thing to do is to find the shape of the data, using the shape attribute. With this instruction you can learn the dimensions of the DataFrame, that is, its width and height.
On the other hand, the len() function can be used in combination with the index attribute to find the height of the DataFrame.
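A possible sketch, assuming the DataFrame df created above:

print(df.shape)       # (rows, columns) of the DataFrame
print(len(df.index))  # height, that is, the number of rows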
For more information, you can review the following information:
http://bit.ly/2NFsYCa
INTENSIVE COURSE ON MATPLOTLIB
Matplotlib is a plotting library used for 2D graphs in the Python programming language. It is very flexible and has many built-in defaults that help a lot in daily work. It also does not take much to get started: you just have to make the necessary imports, prepare some data, and then you can start plotting with the help of the plot() instruction.
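A minimal sketch of this workflow (the data is illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)   # prepare some data
plt.plot(x, np.sin(x))        # plot it on the current axes
plt.show()                    # display the figure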
When graphing, the following considerations should be taken into account:
The figure is the window or general page where everything is drawn; it is the top-level component of everything considered in the next points. Multiple independent figures can be created. A figure can have other elements, such as a suptitle (a centered figure title), a legend, a color bar, among others.
The axes are added to the figure. The axes are the area on which the data is plotted with functions such as plot() and scatter(), and they can have associated labels. Figures can contain multiple axes.
Each axes object has an x-axis and a y-axis, and each of them contains a numbering. There are also the axis labels, the title and the legend, which must be taken into account when you want to customize the axes, bearing in mind that the axis scales and grid lines can also be useful.
The spines are the lines that connect the axis tick marks and designate the boundaries of the data area; in other words, they are the simple square you can see when the axes have been initialized, as in the following image:
As can be observed, the right and top spines are configured as invisible.
Chapter 4
LOAD THE DATA
The first step when starting a Machine Learning project is to load the data. There are many ways to do this, but the first thing to look at is the format in which the data is found. It is generally found in CSV files, so here it will be explained how to load a CSV file in Python, although it will also be shown how to do it in case you have another format.
In this chapter you will learn:
1. Load CSV files using Python.
2. Load other file formats using Python.
CONSIDERATIONS OF CSV FILES
There are several considerations that must be taken into account when loading data from a CSV file; some of them are explained below.
File header
Sometimes the data has a header; this helps with the automatic assignment of names to each data column. If this header is not available, the attributes can be named manually. Either way, you must explicitly specify whether or not the file to be used has a header.
Delimiter
The standard delimiter used to separate the values in the fields is the comma (,). The file to be used could use a different delimiter, like a blank space or semicolon (;); in these cases it must be specified explicitly.
DATASET: PIMA INDIANS DIABETES
The Pima Indians dataset will be used to demonstrate data loading in this chapter. It will also be used in many of the upcoming chapters.
The objective of this dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. In particular, all the patients here are women at least 21 years old of Pima Indian heritage.
The data is available for free on the Kaggle page under the name 'Pima Indians Diabetes Database'. To download the file you must be registered on Kaggle.
For more information, you can review the following information:
http://bit.ly/2wqG0g6
LOAD CSV FILES USING PYTHON
To load the CSV data, the Pandas library and the "pandas.read_csv()" function will be used. This function is very flexible and is the quickest and easiest way to load data for Machine Learning. The function returns a Pandas DataFrame that can be used immediately.
In the following example, it is assumed that the file diabetes.csv is saved in the current working directory.
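A minimal sketch of how this load might look, assuming the DataFrame is stored in a variable called data:

import pandas as pd

data = pd.read_csv('diabetes.csv')   # file in the current working directory
# the same function also accepts a web address in place of the file name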
You can also load the data directly from a web address; for this you just need to include the web address within the parentheses in place of the file name.
LOAD OTHER FILE FORMATS USING PYTHON
In case the Machine Learning files are not in CSV format, the Pandas library offers the option to load data in different formats. Below is a list of formats and the functions that should be used:
Format      Function
HTML        read_html
MS Excel    read_excel
JSON        read_json
SQL         read_sql
Chapter 5
UNDERSTANDING THE DATA
Once the data is loaded, it must be understood in order to obtain better results. This chapter will explain several ways in which Python can be used to better understand Machine Learning data.
In this chapter you will learn:
1. Look at the raw data.
2. Check the dimensions of the dataset.
3. Review the data types of the features.
4. Check the distribution of the classes in the dataset (for classification problems).
5. Analyze the data using descriptive statistics.
6. Understand the relationships in the data using correlations.
7. Review the skew of the distribution of each feature.
To implement each of these items, the dataset used in the previous chapter, corresponding to diabetes in the Pima Indians, will be used.
CHECK THE DATA
Looking at the raw data can reveal information that cannot be obtained in any other way. It can also suggest ideas on how to better process and handle the data for Machine Learning tasks.
To visualize the first 5 rows of the data, the head() function of the Pandas DataFrame is used.
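A possible sketch, assuming the dataset was loaded into a DataFrame called data:

print(data.head())   # first 5 rows of the dataset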
Result once the previous code is executed:
DIMENSIONS OF THE DATA
To work in Machine Learning, you must have good control over the amount of data available, knowing the number of rows and columns.
If there are too many rows, some algorithms may take too long to train. Instead, if there are very few, there may not be enough data to train the algorithms.
If there are too many features, some algorithms can be distracted or suffer poor performance due to the dimensionality.
To understand the shape and size of the dataset, the Pandas shape attribute is used.
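A possible sketch, using the same DataFrame data:

print(data.shape)   # (number of rows, number of columns)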
Result after running the above code:
The result of this code shows the number of rows and then the columns. Therefore, the dataset contains 768 rows and 9 columns.
DATA TYPES
Knowing the data type of each feature is important. This information can give an idea of whether the original data needs to be converted into other formats so that it is easier to use with Machine Learning algorithms.
To find out the data types, the dtypes attribute of Pandas can be used; this way the data type of each feature in the dataset can be known.
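A possible sketch, using the same DataFrame data:

print(data.dtypes)   # data type of each column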
Result once the above code has been executed:
The result is a list with each of the columns of the dataset, specifying the data type being handled.
STATISTICAL DESCRIPTION
The statistical description refers to the information that can be obtained about the statistical properties of each feature. To implement it, you only need to use the describe() function from Pandas, and the properties it returns are the following:
Count
Mean
Standard deviation
Minimum value
25%
50%
75%
Maximum value
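A possible sketch, using the same DataFrame data:

print(data.describe())   # count, mean, std, min, 25%, 50%, 75% and max per column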
Result once the above code is executed:
As can be observed, this returns a lot of information about the dataset, but the important thing here is to verify whether all the data counts match; that way you can tell whether or not there is any missing data.
CLASS DISTRIBUTION (CLASSIFICATION PROBLEMS)
When working on a classification problem, as is the case here, it is necessary to know how balanced the class values are. Highly imbalanced problems (many more observations for one class than for another) are common and may require special treatment during data processing. To find out this information, the Pandas groupby function is used together with the class column of the dataset being worked on.
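A possible sketch, assuming the class column is called 'Outcome', as in the Kaggle version of the dataset:

print(data.groupby('Outcome').size())   # number of observations in each class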
Result after executing the above code:
Class 0 corresponds to people without diabetes, while class 1 corresponds to the women who have the disease.
CORRELATION BETWEEN CHARACTERISTICS
Correlation refers to the relationship between two variables and whether or not they change together. To calculate it, the method must be indicated; the most common is the Pearson correlation coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a complete negative or positive correlation, respectively, while a value of 0 shows no correlation at all. To calculate a correlation matrix, the Pandas corr() function is used.
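A possible sketch, using the same DataFrame data:

print(data.corr(method='pearson'))   # Pearson correlation matrix between the features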
Result once the above code is executed:
Some Machine Learning algorithms, such as linear and logistic regression, can suffer from poor performance if there are highly correlated attributes in the dataset.
ADDITIONAL CONSIDERATIONS
The points discussed here are just some of the advice to consider when reviewing the data; in addition to these, the following should also be taken into account:
Review the numbers. Generating the statistical description is not enough. Take time to read and understand the numbers you are seeing very well.
When looking at the numbers, it is necessary to understand how and why you are seeing these specific numbers and how they relate to the problem domain in general. The important thing here is to ask questions and understand very well what information the dataset presents.
It is advisable to write down all the observations or ideas that arise while analyzing the data. At times this information will be very important when trying to come up with new things to try.
Chapter 6
VISUALIZING THE DATA
As explained in the previous chapter, the data must be understood in order to achieve good results when implementing Machine Learning algorithms. Another way to learn from the data is by visualizing it.
In this chapter you will learn:
1. Graphing with a single variable
2. Graphing with multiple variables
GRAPHING WITH A SINGLE VARIABLE
Graphs with a single variable are useful for understanding each feature of the dataset independently. The following will be explained here:
Histograms
Box plots
Histograms
A quick way to get an idea of the distribution of each attribute is to look at the histograms. These present the distribution and relationships of a single variable within a set of features. Histograms group the data into bins, and from their shape one can quickly get an idea of whether an attribute is Gaussian, skewed or even has an exponential distribution. They can also help to spot possible outliers.
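A possible sketch, using the same DataFrame data:

import matplotlib.pyplot as plt

data.hist()   # one histogram per numeric column
plt.show()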
Result once the previous code is executed:
Box plots
Another way to check the distribution of each attribute is to use box plots. These summarize the distribution of each attribute, drawing a line for the median (the middle value) and a box around the 25th and 75th percentiles.
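A possible sketch, using the same DataFrame data:

data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False)
plt.show()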
Result after executing the previous code:
GRAPHING WITH MULTIPLE VARIABLES
This section provides an example of a chart that shows the interactions between multiple variables in the dataset:
Correlation matrix
The correlation gives an indication of how related the changes between two variables are. A correlation is positive when two variables change in the same direction, and negative when they change in opposite directions, one going up while the other goes down. The correlation matrix contains the correlation between each pair of attributes; in this way, it can be verified which variables have a high correlation with each other.
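A possible sketch of this chart, using the same DataFrame data:

import matplotlib.pyplot as plt

correlations = data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)   # color-coded correlation matrix
fig.colorbar(cax)
plt.show()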
Result once the previous code is executed:
If the matrix is analyzed, it is symmetric, that is, the bottom left of the matrix is the same as the top right. This is useful because we can see two different views of the same data in one graph.
For more information, you can review the following information:
http://bit.ly/2XgTTOg
Chapter 7
DATA PROCESSING
Data processing is a fundamental step in any Machine Learning problem, because the algorithms make assumptions about the data, so the data must be presented to them correctly.
In this chapter you will learn how to prepare data for Machine Learning in Python using scikit-learn:
1. Standardize the data
2. Normalize the data
3. Column removal
DATA SEPARATION
In this chapter, several methods for processing data for Machine Learning will be explained. The dataset that will be used is the Pima Indians diabetes dataset.
Before starting the data processing, it is advisable to separate the dataset into the input and output variables or, as they are also known, the independent and dependent variables.
Looking at the dataset that has been worked with, the output or dependent variable would be the 'Outcome' column, because it reflects whether or not the person has diabetes.
Having defined this, we proceed to separate the data, creating the variable 'X' for the input or independent data and the variable 'y' for the column corresponding to 'Outcome'.
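A minimal sketch of this separation, assuming the dataset is in a DataFrame called data:

X = data.drop('Outcome', axis=1)   # independent (input) variables
y = data['Outcome']                # dependent (output) variable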
Result after executing the previous code:
Result once the previous code has been executed:
Data processing will only be done on the set corresponding to the variable 'X', the input or independent data. The data in the variable 'y' will not undergo any procedure.
STANDARDIZATION OF THE DATA
Data standardization is a common requirement for many Machine Learning estimators; this is because the algorithms may behave badly if the individual features do not look more or less like standard normally distributed data.
To carry out this procedure, the StandardScaler class from the scikit-learn library is used and applied, in this case, to all the data in the dataset 'X'.
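A possible sketch, assuming the input data is in the variable X defined above:

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # each column rescaled to mean 0 and standard deviation 1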
Result once the above code has been executed:
Note: when running this code a warning will probably appear; the message indicates that the integer data has been converted to float in order to apply the StandardScaler instruction.
As can be seen when running the code, it generates a NumPy array with the standardized data. If you want to convert the data into a Pandas DataFrame, the following additional lines of code must be executed.
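A possible sketch of that conversion:

import pandas as pd

X_scaled = pd.DataFrame(X_scaled, columns=X.columns)   # back to a labeled DataFrame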
Result once the above code is executed:
DATA NORMALIZATION
Normalizing the data refers to rescaling each observation (row) so that it has a length of 1. This processing method can be useful for sparse datasets (many zeros) with attributes of varying scales.
To implement this method, the Normalizer class from the scikit-learn library is used.
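A possible sketch, again applied to the variable X:

from sklearn.preprocessing import Normalizer

X_normalized = Normalizer().fit_transform(X)   # each row rescaled to unit length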
Result after executing the above code:
Just like in the previous case, executing the code generates a NumPy array with the normalized data; if you want to convert the data into a Pandas DataFrame, you must run the same additional lines of code as before.
Result once the previous code is executed:
COLUMN REMOVAL
Another way to process the data is by removing columns with data that is not necessary for the analysis. The function used for this is drop, together with the name of the column to be deleted.
To carry out the respective test, the 'BloodPressure' column will be removed from the dataset being worked on.
Result once the previous code is executed:
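A possible sketch of the instruction described above:

data = data.drop('BloodPressure', axis=1)   # remove the BloodPressure column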
Chapter 8
SELECTING FEATURES
The data features used to train Machine Learning models have a great influence on the performance that can be achieved. Irrelevant or partially relevant features can negatively affect the performance of the model. This chapter describes automatic feature selection techniques that can be used to prepare the data.
In this chapter you will learn:
1. Filter methods
2. Wrapper methods
3. Embedded methods
IMPORTANCE OF FEATURES
Datasets can sometimes be small, while others are extremely large, especially when they have a large number of features, which makes them very difficult to process.
When you have this type of high-dimensional dataset and use all of the features to build Machine Learning models, it can cause the following:
The additional features act as noise, so the Machine Learning model can have extremely low performance.
The model takes longer to train.
Unnecessary resources are allocated to these features.
For all this, feature selection must be implemented in Machine Learning projects.
FILTER METHODS
The following image better describes filter-based feature selection methods:
Filter methods are generally used as a data processing step; the feature selection is independent of any Machine Learning algorithm.
Features are ranked according to statistical scores that tend to determine their correlation with the outcome variable. Correlation is a very contextual term and varies from one work to another.
The following table can be used to define the correlation coefficients for different types of data, in this case continuous and categorical.
Pearson correlation: used as a measure to quantify the linear dependence between two continuous variables X and Y; its value varies from -1 to +1.
LDA: linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes, or levels, of a categorical variable.
ANOVA: stands for analysis of variance and is similar to LDA, except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
Chi-square: a statistical test applied to groups of categorical features to evaluate the probability of correlation or association between them using their frequency distribution.
Filter methods do not eliminate multicollinearity, so you must also deal with it before training models on your data.
Putting into practice what was explained above, a Chi-square statistical test for non-negative features will be used to select 5 of the best features of the dataset we have been working with, related to diabetes in the Pima Indians.
The scikit-learn library provides the SelectKBest class, which can be used with a range of different statistical tests to select a specific number of features; in this case the test is Chi-square.
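A possible sketch, assuming X and y were separated as in the previous chapter:

from sklearn.feature_selection import SelectKBest, chi2

test = SelectKBest(score_func=chi2, k=5)   # keep the 5 features with the highest Chi-square score
fit = test.fit(X, y)
print(fit.scores_)                         # Chi-square score of every feature
X_selected = fit.transform(X)              # dataset reduced to the 5 chosen features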
Result once the previous code is executed:
The result obtained in this part of the program is the score of each of the features once the Chi-square method is applied; based on this, the most important features that can affect the Machine Learning analysis can now be identified.
Result once the previous code has been executed:
In summary, the results obtained show the 5 chosen features, those with the highest scores, among them 'Pregnancies' (number of pregnancies) and 'BMI' (body mass index). These scores help determine the best features for training the model.
For more information, you can review the following information:
http://bit.ly/2C0BojO
WRAPPER METHODS
As with the filter methods, a graph is shown to better explain this method:
As can be seen, a wrapper method requires a Machine Learning algorithm and uses its performance as the evaluation criterion. This method searches for the set of features that is best suited to the algorithm and aims to improve its performance.
In other words, a subset of features is used to train a model; based on the inferences drawn from that model, it is decided whether to add or remove features from the subset. The problem essentially reduces to a search problem. These methods are usually computationally very expensive.
Some common examples of wrapper methods are the following:
Forward selection: an iterative method that starts with no features in the model. In each iteration, the feature that best improves the model keeps being added, until adding a new variable no longer improves the performance of the model.
Backward elimination: it starts with all the features, and the least significant feature is removed in each iteration, which improves the performance of the model. This is repeated until no improvement is observed when removing features.
Recursive feature elimination (RFE): an optimization algorithm that seeks to find the best-performing subset of features. It repeatedly creates models and sets aside the best or worst performing feature in each iteration. It builds the next model with the remaining features until all the features are exhausted, and then ranks the features according to the order of their elimination.
Applying this theory to the dataset that has been worked with so far, the recursive feature elimination method will be applied together with the logistic regression algorithm.
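A possible sketch, assuming the same X and y as before:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
fit = rfe.fit(X, y)
print(fit.support_)   # True for the selected features
print(fit.ranking_)   # 1 for the selected features, larger numbers were eliminated earlier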
Result once the above code is executed:
For this method, the 4 main features were chosen, and the result included 'Pregnancies', 'BMI' (body mass index) and 'DiabetesPedigreeFunction' (diabetes pedigree function).
As can be seen, these features are marked as True in the selected feature mask and as 1 in the feature ranking array.
EMBEDDED METHODS
Embedded methods combine the qualities of filter and wrapper methods. They are implemented through algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and RIDGE regression, which have built-in penalty functions to reduce overfitting.
Chapter 9
CLASSIFICATION ALGORITHMS
When developing a Machine Learning project, you cannot know in advance which algorithms are the most suitable for the problem; you must try several methods and focus on those that prove to be the most promising. In this chapter, it will be explained how to implement Machine Learning algorithms that can be used to solve classification problems in Python with scikit-learn.
In this chapter, you will learn how to implement the following algorithms:
1. Logistic regression
2. K nearest neighbors
3. Support vector machines
4. Naive Bayes
5. Decision tree classification
6. Random forest classification
For this analysis, the Pima Indians diabetes dataset will be used. It is also assumed that the theoretical part of each Machine Learning algorithm and how to use it is known, so the parameterization of each algorithm will not be explained.
LOGISTIC REGRESSION
Logistic Regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature; dichotomous means that there are only two possible classes. For example, it can be used for cancer detection problems or to calculate the probability of an event occurring.
Logistic Regression is one of the simplest and most commonly used Machine Learning algorithms for binary classification. It is easy to implement and can be used as a baseline for any binary classification problem. Logistic Regression describes and estimates the relationship between a binary dependent variable and the independent variables.
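A minimal sketch of one way to do this, assuming X and y were separated as in Chapter 7; the train/test split shown here is one possible choice, not necessarily the exact code used in the book:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)   # define the algorithm
model.fit(X_train, y_train)                 # train it with the training data
y_pred = model.predict(X_test)              # make a prediction on the test data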
Result once the above code is executed:
To implement this algorithm, you only need to define it, carry out its training with the training data and make a prediction.
K - NEAREST NEIGHBORS
K nearest neighbors is a non-parametric learning algorithm, which means that it makes no assumptions about the underlying data distribution. In other words, the structure of the model is determined from the dataset. This is very useful in practice, where most real-world datasets do not follow theoretical mathematical assumptions.
The algorithm keeps all the training data and uses it in the testing phase. This makes training faster and the testing phase slower and more expensive. Expensive here refers to time and memory: in the worst case, K nearest neighbors needs more time to scan all the data points, and scanning all the data points requires more memory to store the training data.
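A possible sketch, reusing the train/test split from the logistic regression example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)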
Result after executing the previous code:
As can be seen here, you simply need to define the algorithm, train the model with the training data and make a prediction.
SUPPORT VECTOR MACHINES
Support vector machines look for the line that best separates two classes. The data points that are closest to the line that best separates the classes are called support vectors and influence the location of the line. Of particular importance is the use of different kernel functions through the kernel parameter.
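A possible sketch, reusing the same train/test split:

from sklearn.svm import SVC

model = SVC(kernel='rbf')   # the kernel parameter selects the kernel function
model.fit(X_train, y_train)
y_pred = model.predict(X_test)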
Result once the previous code has been executed:
For this algorithm, likewise, the model must be defined, trained and used to make a prediction.
NAIVE BAYES
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest Machine Learning algorithms. It is a fast, accurate and reliable algorithm, and it maintains high accuracy and speed even on large datasets.
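A possible sketch, reusing the same train/test split:

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)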
Result once the previous code is executed:
Just like the previous cases, for this algorithm you define the algorithm to implement, then train it using the training data and finally make a prediction, either with the test data or with new data.
DECISION TREES CLASSIFICATION
A decision tree is a type of supervised learning algorithm that is mainly used in classification problems, although it works for both categorical and continuous input and output variables.
In this technique, the data is split into two or more homogeneous sets based on the most significant differentiator among the input variables. The decision tree identifies the most significant variable and the value of it that yields the most homogeneous sets of the population. All the input variables and all possible split points are evaluated, and the one with the best result is chosen.
Tree-based learning algorithms are considered one of the best and most used supervised learning methods. Tree-based methods produce predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well.
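A possible sketch, reusing the same train/test split:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)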
Result once the above code is executed:
To implement this algorithm, you only need to define it, carry out its training with the training data and make a prediction.
RANDOM FORESTS CLASSIFICATION
Random forest classification is a versatile Machine Learning method. It handles dimensionality reduction, missing values, outliers and other essential data exploration steps, and it does a pretty good job. It is a type of ensemble learning method, in which a group of weak models is combined to form a powerful model.
In a random forest, several trees are grown instead of a single tree. To classify a new object based on its attributes, each tree gives a classification, and the tree is said to "vote" for that class. The forest chooses the classification with the most votes among all the trees in the forest.
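A possible sketch, reusing the same train/test split:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)   # number of trees in the forest
model.fit(X_train, y_train)
y_pred = model.predict(X_test)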
Result once the above code has been executed:
This algorithm is no different from the previous ones: here the algorithm is defined, the model is trained and the respective prediction is made.
All the algorithms can be configured by defining certain parameters to improve the performance of the model. All these parameters are described in the scikit-learn library documentation.
Chapter 10
PERFORMANCE METRICS FOR CLASSIFICATION ALGORITHMS
For Machine Learning classification problems, there is a large number of metrics that can be used to evaluate the predictions for these problems. This section will explain how to implement several of these metrics.
In this chapter you will learn:
1. Confusion matrix
2. Classification report
3. Area under the curve
CONFUSION MATRIX
The confusion matrix is one of the most intuitive and straightforward metrics used to assess the correctness and accuracy of a model. It is used for classification problems where the output can be two or more types of classes.
The confusion matrix is a table with two dimensions. The rows of the matrix indicate the observed or actual class and the columns indicate the predicted classes.
It should be clarified that the confusion matrix itself is not a performance measure as such, but almost all performance metrics are based on it and on the numbers it contains.
Below is an example of the calculation of a confusion matrix for the Pima Indians diabetes dataset using the Logistic Regression algorithm.
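A possible sketch, assuming y_test and the predictions y_pred from the logistic regression example above:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class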
Result once the previous code is executed:
The confusion matrix shows us that most of the predictions fall on the main diagonal of the matrix, these being the correct predictions.
CLASSIFICATION REPORT
The scikit-learn library provides a very convenient report when working on classification problems, which gives a quick idea of the accuracy of a model using a series of measures. The classification_report() function shows the precision, recall, F1 score and support for each class.
The following example shows the implementation of this function on a problem.
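A possible sketch, using the same y_test and y_pred:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))   # precision, recall, F1 score and support per class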
Result once the previous code has been executed:
AREA UNDER THE CURVE
For a classification problem, performance can also be measured with the AUC-ROC curve. This is one of the most important evaluation metrics for checking the performance of any classification model. ROC stands for receiver operating characteristic and AUC for area under the curve.
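A possible sketch, using the same y_test and y_pred:

from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_pred))   # 1.0 is a perfect classifier, 0.5 is no better than chance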
Result once the previous code has been executed:
The value obtained is relatively close to 1 and greater than 0.5, which suggests that most of the predictions made have been correct.
Chapter 11
REGRESSION ALGORITHMS
Spot-checking is a way to discover which algorithms work well on a Machine Learning problem. It is difficult to know in advance which algorithm is the most suitable to use, so several methods must be checked, focusing on those that prove to be the most promising. This chapter will explain how to implement regression algorithms using Python and scikit-learn for regression problems.
In this chapter you will learn how to implement the following algorithms:
1. Linear regression
2. Polynomial regression
3. Support vector regression
4. Regression decision trees
5. Random forest regression
For this analysis, the advertising dataset will be used, which you can find on the Kaggle page as "Advertising Data". Likewise, it is assumed that the theoretical part of each Machine Learning algorithm and how to use it is known, so neither the basics nor the parameterization of each algorithm will be explained.
LINEAR REGRESSION
Linear regression is a parametric technique used to predict continuous, dependent variables given a set of independent variables. It is parametric because it makes certain assumptions about the dataset. If the dataset follows those assumptions, the regression gives incredible results; otherwise, it struggles to provide convincing accuracy.
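A minimal sketch of one way to do this; the column names TV, Radio, Newspaper and Sales are assumptions about the Kaggle "Advertising Data" file and may differ in your copy:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv('Advertising.csv')
X = data[['TV', 'Radio', 'Newspaper']]   # independent variables (assumed column names)
y = data['Sales']                        # dependent variable (assumed column name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)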
Result once the previous code is executed:
As can be seen here, it is simply necessary to define the algorithm, train the model with the training data and make a prediction.
For more information, you can check the following information:
http://bit.ly/2RpwDGK
POLYNOMIAL REGRESSION
Polynomial regression is a special case of linear regression; it extends the linear model by adding additional predictors, obtained by raising each of the original predictors to a power. The standard way to extend linear regression to a non-linear relationship between the dependent and independent variables has been to replace the linear model with a polynomial function.
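A possible sketch, reusing the train/test split from the linear regression example:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)          # define the degree of the polynomial
X_train_poly = poly.fit_transform(X_train)   # transform the original predictors
X_test_poly = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)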
Result after running the above code:
For this algorithm, an additional step must be taken before defining the model: first, the degree of the polynomial must be defined and the data corresponding to X must be transformed. Having done all this, the model can now be defined, trained and finally used to make a prediction.
SUPPORT VECTOR REGRESSION
Support vector regression uses the same principles as support vector classification, with only a few minor differences. First of all, since the output is a real number, it becomes very difficult to predict from the available information, which has infinite possibilities; however, the main idea is always the same: minimize the error by finding the hyperplane that maximizes the margin, bearing in mind that part of the error is tolerated.
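A possible sketch, reusing the same train/test split and scaling the inputs first, as the text below notes:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SVR(kernel='rbf')
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)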
Result after executing the above code:
For this algorithm, likewise, the model must be defined and trained and a prediction made. For this dataset, the data is scaled beforehand, since the model was not entirely effective with the original data.
REGRESSION DECISION TREES
Decision trees are a supervised learning technique that predicts response values by learning decision rules derived from the features.
Decision trees work by splitting the feature space into several simple rectangular regions, divided by axis-parallel splits. To obtain a prediction for a particular observation, the mean or mode of the responses of the training observations within the partition to which the new observation belongs is used.
Result after running the above code:
Just like the previous cases, for this algorithm the algorithm is defined as
implement, then it is trained using the data from
training and finally a prediction is made, either with the data
for testing or with new data.
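A minimal sketch, assuming the same train/test splits as before:

from sklearn.tree import DecisionTreeRegressor

# Define the algorithm, train the model and make a prediction
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)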
For more information, you can review the following information:
RANDOM FOREST REGRESSION
Random forests is a supervised learning algorithm that, as its name suggests, creates a forest of trees in a somewhat random way. To put it simply: the random forest builds multiple decision trees and combines them to obtain a more accurate and stable prediction. In general, the more trees in the forest, the more robust it is.
In this algorithm, additional randomness is added to the model while the trees are grown: instead of looking for the most important feature when splitting a node, it looks for the best feature among a random subset of features. This results in a wide diversity that generally leads to a better model.
Result once the previous code is executed:
This algorithm is no different from the previous ones; the algorithm is defined, the model is trained and the respective prediction is made.
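A minimal sketch, assuming the same splits; the number of trees shown is only an illustrative value:

from sklearn.ensemble import RandomForestRegressor

# Define the algorithm with 100 trees, train the model and make a prediction
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)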
For more information, you can review the following information:
Remember that this algorithm, like the previous ones, can be configured by defining certain parameters to improve the performance of the model. All these parameters are documented in the scikit-learn library.
Chapter 12
PERFORMANCE METRICS FOR REGRESSION ALGORITHMS
In this chapter, the most common metrics for evaluating predictions on Machine Learning regression problems will be reviewed.
In this chapter you will learn:
1. Root Mean Squared Error (RMSE)
2. Mean Absolute Error (MAE)
3. R² (coefficient of determination)
ROOT MEAN SQUARE ERROR (RMSE)
The most commonly used metric for regression tasks is the
mean squared error and represents the square root of the distance
mean squared error between the actual value and the predicted value.
Indicate the absolute fit of the model to the data, how close the points are.
of observed data from the predicted values of the model. The error
Mean square error or RMSE is an absolute measure of fit.
Result once the previous code is executed:
The best value for this parameter is 0.0.
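A minimal sketch of computing the RMSE with scikit-learn, assuming y_test and the model's predictions y_pred are available:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the mean squared error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(rmse)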
For more information, you can review the following information:
MEAN ABSOLUTE ERROR (MAE)
The mean absolute error is the average of the absolute differences between the predicted values and the observed values. It is a linear score, which means that all individual differences are weighted equally in the average.
Result after executing the above code:
For this metric, a value of 0.0 indicates that there is no error, that is, the predictions are perfect.
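A minimal sketch, under the same assumptions as before:

from sklearn.metrics import mean_absolute_error

# Average of the absolute differences between predictions and observations
mae = mean_absolute_error(y_test, y_pred)
print(mae)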
For more information, you can review the following information:
http://bit.ly/2TzzKNY
R² (COEFFICIENT OF DETERMINATION)
The R² metric indicates the goodness or suitability of the model. It is often used for descriptive purposes and shows how well the selected independent variables explain the variability in the dependent variable.
Result after executing the previous code:
The best possible score is 1.0, and it can be negative, because the model can be arbitrarily worse. A constant model that always predicts the expected value of y, without taking the input features into account, would receive a score of 0.0.
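A minimal sketch, under the same assumptions:

from sklearn.metrics import r2_score

# 1.0 is a perfect fit; a constant model scores 0.0
r2 = r2_score(y_test, y_pred)
print(r2)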
For more information, you can review the following information:
Chapter 13
MACHINE LEARNING PROJECT - CLASSIFICATION
In this chapter, a classification project will be developed using Python, including each step of the Machine Learning process applied to the problem.
In this chapter, you will learn:
1. How to work through a classification problem
2. How to use data transformations to improve the
model performance
3. How to implement Machine Learning algorithms and
compare their results
DEFINITION OF THE PROBLEM
For the project, the same dataset that has been used while working through the chapters will be utilized, but this time the project will be developed completely. The dataset corresponds to the Pima Indians.
This dataset describes the medical records of the Pima Indians and whether each patient developed diabetes within five years. As such, it is a classification problem.
The dataset can be found on the Kaggle page as
Pima Indians Diabetes Database
IMPORT THE LIBRARIES
The first step in any Machine Learning project is to import the libraries that will be used. It is normal not to know all the necessary libraries at first and to add them little by little, but at the very least one should start with the basics that are usually needed to implement any Machine Learning project.
Libraries can be imported as they are needed while programming, but what I recommend is to include all of these lines of code at the beginning of the program, to easily know which modules are being used within the program and to keep the programming organized.
For now, the Pandas, NumPy and matplotlib libraries will be imported.
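A minimal sketch of these imports, using the aliases that are conventional for these libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt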
LOAD THE DATASET
Once the libraries have been imported, they can be used inside the program; for this, the dataset is loaded.
The dataset must be downloaded directly from the Kaggle page and saved on the computer in a specific folder, where the dataset file will sit next to the Python program file being developed; this way the programming will be easier.
Remember that to download files from Kaggle you must be registered on the page; this is completely free and gives you access to a large number of datasets with which you can practice later.
The exact name of the file containing the dataset must be specified. It is recommended to save the data in the same folder as the program file; this way order is maintained.
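A minimal sketch of this step; the file name 'diabetes.csv' is only an assumption and must match the name of the file actually downloaded from Kaggle:

# Load the dataset from the same folder as the program
data = pd.read_csv('diabetes.csv')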
UNDERSTANDING THE DATA
The next step in the development of a Machine Learning project is understanding the data at hand. For this step, several functions available in the Pandas library are used.
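A minimal sketch of these checks; the name of the class column, 'Outcome', is the one used in the Kaggle version of the dataset and is an assumption here:

print(data.head())                       # first rows of the data
print(data.shape)                        # dimensions of the dataset
print(data.dtypes)                       # data type of each column
print(data.describe())                   # statistical description
print(data.groupby('Outcome').size())    # class distribution
print(data.corr())                       # correlation between features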
Result after executing the previous code:
The dataset consists of 9 columns, of which 8 are the independent variables and the last column corresponds to the dependent variable. Additionally, all the data is numerical, a mix of integers and floats.
Additionally, the data is balanced, so no additional steps are necessary.
VISUALIZING THE DATA
Once the data has been understood numerically, it will now be analyzed visually using the matplotlib library. This library was imported at the beginning of the program, so it is not necessary to do it again.
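A minimal sketch of some single-variable plots that could be used here:

# Histograms of each feature
data.hist()
plt.show()

# Box plots of each feature
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False)
plt.show()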
Result after executing the previous code:
As we can see in the obtained graphs, the results are
very similar to those obtained in the previous point.
DATA SEPARATION
Knowing the dataset being worked on within the project, the data is separated into two sets: a first set with the independent variables, X, and a second set with the dependent variable, y.
First, the data corresponding to the independent variables is separated, which would be all columns except the last one.
Result after executing the above code:
Next, the variable y, or dependent variable, is defined, which would be the last column of the dataset.
Result after executing the previous code:
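Putting both steps together, a minimal sketch (assuming the dataset was loaded into the DataFrame called data) could be:

# Independent variables: all columns except the last one
X = data.iloc[:, :-1].values
# Dependent variable: the last column
y = data.iloc[:, -1].values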
SELECTING FEATURES
As observed, the variable X has 8 columns; in this case, a procedure will be applied to select the 5 features that have the greatest influence on the dependent variable. For this example, the filter method is used.
To implement the respective functions, the corresponding libraries must be imported. It is recommended to place these lines of code at the beginning of the program, along with the other libraries being used. This makes the program much cleaner and easier for other people to understand.
Once this is done, the respective code can be implemented.
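A minimal sketch of a filter method with scikit-learn; the chi-squared score used here is only one common choice for this kind of selection:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 5 features most related to the dependent variable
selector = SelectKBest(score_func=chi2, k=5)
fit = selector.fit(X, y)
print(fit.scores_)

# Reduce X to only the selected columns
X = fit.transform(X)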
Result after executing the previous code:
The 5 features that have the greatest impact on the dependent variable are the following:
Column 0 – pregnancies
Column 1 - glucose
Column 4 – insulin
Column 5 - bmi
Column 7 – age
Therefore, the variable X is now converted so that it contains only these 5 columns.
Result once the previous code has been executed:
DATA PROCESSING
Having defined the values of X that will be used in the algorithms, the respective processing of the data can now be carried out.
In this case, the data will only be standardized, since the features are on different scales and this can cause errors in the analysis.
Before applying the function, the respective library must be imported. Remember to place this line of code at the beginning of the program.
Now the data standardization is carried out; it is important to mention that this procedure is performed only on the data corresponding to X.
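A minimal sketch using scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Standardization is applied to X only
scaler = StandardScaler()
X = scaler.fit_transform(X)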
Result after executing the previous code:
DATA SEPARATION
We reach the last step before implementing the Machine Learning algorithms: separating the data into training and test sets. For this, the train_test_split function from the scikit-learn library is used; remember that it must be imported before using it.
Only 25% of the dataset will be used as test data, so the test size is set to 0.25.
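A minimal sketch of this split:

from sklearn.model_selection import train_test_split

# 25% of the data is reserved for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)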
APPLICATION OF CLASSIFICATION ALGORITHMS
The algorithms explained previously will be implemented, and the results will be evaluated with respect to the error of each one of them.
Logistic Regression
The first algorithm to evaluate is the most basic of all: Logistic Regression.
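A minimal sketch, assuming the training and test splits from the previous step:

from sklearn.linear_model import LogisticRegression

# Define the algorithm, train the model and make a prediction
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)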
Result once the previous code is executed:
The previous data is a comparison of the original test data with the values obtained from the model. Next, the respective analysis is performed with the metrics corresponding to classification algorithms.
The first metric to evaluate will be the confusion matrix; for this, the library is imported first and the function applied afterwards.
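A minimal sketch of this evaluation:

from sklearn.metrics import confusion_matrix

# Compare the real test labels against the model's predictions
print(confusion_matrix(y_test, y_pred))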
Result once the previous code has been executed:
The results obtained here are quite satisfactory, as the model was able to obtain good results.
Now the results are evaluated again, but this time using the classification report available in the scikit-learn library.
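A minimal sketch:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))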
Result after running the above code:
This report shows that the precision, recall and F1 score of the model hover around 0.8, which is a very good number.
K-Nearest Neighbors
The next algorithm to be evaluated will be K-Nearest Neighbors; the same procedure as before will be carried out.
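A minimal sketch; the number of neighbors shown is only an illustrative value:

from sklearn.neighbors import KNeighborsClassifier

# Define the algorithm, train the model and make a prediction
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)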
Result after executing the previous code:
After completing this procedure, the model is evaluated. For this, the two metrics used with the previous algorithm are applied; first, the confusion matrix.
Result after running the above code:
And subsequently, the classification report is obtained.
Result once the previous code is executed:
As can be seen, the results obtained here are a bit
better than those obtained with the previous algorithm.
Support Vector Machines
Now the support vector machine algorithm will be evaluated; the procedure is very similar to the other algorithms evaluated previously.
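A minimal sketch; the RBF kernel is only one common choice:

from sklearn.svm import SVC

# Define the algorithm, train the model and make a prediction
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)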
Result once the above code is executed:
As in the previous cases, the obtained model is evaluated by means of the confusion matrix and the classification report.
Result after executing the above code:
Result after executing the previous code:
Naive Bayes
The next algorithm to be evaluated will be naive Bayes or, as it is known in Spanish, the naive Bayesian classifier.
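A minimal sketch using the Gaussian variant, which is a common choice for numerical features:

from sklearn.naive_bayes import GaussianNB

# Define the algorithm, train the model and make a prediction
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)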
Result once the previous code is executed:
Once the model is defined, its error is checked; for this purpose, the confusion matrix function and the classification report are used.
Result once the previous code has been executed:
Result once the above code is executed:
Classification Decision Trees
To evaluate the problem by implementing the classification decision tree algorithm, the following procedure is carried out.
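A minimal sketch:

from sklearn.tree import DecisionTreeClassifier

# Define the algorithm, train the model and make a prediction
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)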
Result once the previous code is executed:
The error is evaluated by implementing the confusion matrix and obtaining the
classification report.
Result after executing the previous code:
Result once the above code is executed:
Random Forest Classification
Finally, the last algorithm explained, random forest classification, is evaluated.
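A minimal sketch; the number of trees is only an illustrative value:

from sklearn.ensemble import RandomForestClassifier

# Define the algorithm, train the model and make a prediction
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)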
Result once the previous code is executed:
The functions to evaluate the algorithm are implemented below; just as before, the confusion matrix and the classification report are used.
Result after executing the previous code:
Result once the previous code is executed:
As can be seen, the evaluated algorithms obtained very similar results to each other.
In these cases, the algorithm that is faster and easier to implement is selected, which would be Logistic Regression.
Not all cases are like this; sometimes, when working with a classification problem, the results obtained after implementing several algorithms differ from each other, so in that case the algorithm that obtains the best results is selected.
Chapter 14
MACHINE LEARNING PROJECT - REGRESSION
In this chapter, a regression project will be developed using Python.
Each step of the Machine Learning process applied to the problem is included.
In this chapter you will learn:
1. How to work through a regression problem
2. How to use data transformations to improve the
model performance
3. How to implement Machine Learning algorithms and
compare their results
DEFINITION OF THE PROBLEM
For the project, the same dataset that was used to explain the regression algorithms will be utilized, but this time the project will be developed completely. The dataset corresponds to the advertising data.
The dataset can be found on the Kaggle page as Advertising Data.
IMPORT THE LIBRARIES
The first step in any Machine Learning project is to import the libraries that will be used. It is normal not to know all the necessary libraries at first and to add them little by little, but at the very least one should start with the basics that are usually needed to implement any Machine Learning project.
Libraries can be imported as they are needed while programming, but what I recommend is to place all of these lines of code at the beginning of the program, so that it is easy to know which modules are being used within the program and to keep the programming organized.
For now, the Pandas and matplotlib libraries will be imported.
LOAD THE DATASET
Once the libraries have been imported, they can be used inside the program; for this, the dataset is loaded.
The dataset must be downloaded directly from the Kaggle page and saved on the computer in a specific folder, where the dataset file will sit next to the Python program file being developed; this way the programming will be easier.
Remember that to download files from Kaggle you must be registered on the page; this is completely free and gives you access to a large number of datasets with which you can practice later.
The exact name of the file containing the dataset must be specified. It is recommended to save the data in the same folder as the program file; this way order is maintained.
UNDERSTANDING THE DATA
The next step in the development of a Machine Learning project is understanding the available data. For this step, several functions available in the Pandas library are used.
Result after executing the previous code:
The dataset contains 5 columns, of which 4 are the independent variables and the last column corresponds to the dependent variable.
Looking at the data in detail, it can be observed that the first column is a row index, so it can be discarded when the data separation is made.
It can be observed that all the programming code used up to here is exactly the same as in the classification problem developed in the previous chapter.
VISUALIZING THE DATA
Once the data has been understood numerically, it will now be analyzed visually using the matplotlib library. This library was imported at the beginning of the program, so it is not necessary to do it again.
Result after executing the previous code:
As we can see in the obtained graphs, the results are
very similar to those obtained in the previous point.
It can be observed that all the programming code used up to here is exactly the same as in the classification problem developed in the previous chapter.
DATA SEPARATION
Knowing the dataset being worked on within the project, the data is separated into two sets: a first set with the independent variables, X, and a second set with the dependent variable, y.
We start by separating the data corresponding to the independent variables, which would be all the columns except the first and the last.
The first column is removed because it contains the row numbering, information that does not influence the Machine Learning analysis.
Result after executing the above code:
Next, the variable y, or dependent variable, is defined, which would be the last column of the dataset.
Result after executing the previous code:
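Putting both steps together, a minimal sketch (assuming the dataset was loaded into a DataFrame called data) could be:

# Independent variables: drop the first column (row index) and the last column
X = data.iloc[:, 1:-1].values
# Dependent variable: the last column
y = data.iloc[:, -1].values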
For this problem, feature selection is no longer necessary, since the dataset has very few features, so this step will be skipped.
Similarly, the data is already suitable for use within the Machine Learning algorithms, so it is not necessary to carry out any processing on it.
DATA SEPARATION
We have reached the final step before implementing the Machine Learning algorithms: separating the data into training and test sets. For this, the train_test_split function from the scikit-learn library is used; remember that it must be imported before using it.
Only 25% of the dataset will be used as test data, so the test size is set to 0.25.
APPLICATION OF REGRESSION ALGORITHMS
The algorithms explained earlier will be implemented, and the results will be evaluated with respect to the error of each of them.
Linear Regression
The first algorithm to evaluate is the most basic of all: Linear Regression.
Result after executing the above code:
The previous data is a comparison of the original test data with the values obtained by the model. Next, the respective analysis is performed with the metrics corresponding to regression algorithms.
The first metric to evaluate will be the mean squared error; for this, the library is imported first and the function applied afterwards.
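A minimal sketch of this evaluation, together with the R² metric that is used right afterwards:

from sklearn.metrics import mean_squared_error, r2_score

# Error of the model on the test data: for MSE lower is better,
# while for R² a value closer to 1 is better
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))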
Result once the above code is executed:
The results obtained here are quite satisfactory; remember that a value closer to 0 indicates that the model is good.
Now the results are evaluated again, but this time using the R² metric, importing the function from scikit-learn first and then implementing it.
Result after executing the previous code:
For this metric, unlike the previous one, the closer the result is to 1, the better the model, so the result obtained indicates that the developed model is good.
Polynomial Regression
The next algorithm to be evaluated will be polynomial regression; here the same procedure that was performed earlier will be carried out.
Result once the above code is executed:
After completing this procedure, the model is evaluated. For this, the two metrics used with the previous algorithm are applied; first, the mean squared error.
Result after executing the above code:
And subsequently, the R² metric is obtained.
Result once the previous code has been executed:
As can be seen, the results obtained here are very similar.
to the previous model.
Support Vector Regression
Now the support vector regression algorithm is being evaluated,
the procedure is very similar to the other evaluated algorithms
previously.
Result once the previous code is executed:
Just like in the previous cases, the obtained model is evaluated by
mean of the mean squared error and the metric.
Result once the previous code is executed:
Result once the above code is executed:
If we observe the results of the evaluation metrics, they are not for
nothing good, so it can be inferred that this is not the best algorithm
for this dataset.
Decision Trees Regression
The following algorithm to be evaluated will be regression decision trees.
Result once the previous code is executed:
Once the model is defined, its error is verified, for which the mean squared error function and the R² metric are used.
Result after executing the previous code:
Result after executing the above code:
Random Forest Regression
To evaluate the problem by implementing the random forest regression algorithm, the following procedure is carried out.
Result after executing the previous code:
The error is evaluated by implementing the mean squared error and the R² metric.
Result once the above code is executed:
Result after executing the above code:
As can be seen, the evaluated algorithms obtained very similar results to each other, except for support vector regression. This does not mean that this algorithm is bad, but simply that it is not the most suitable for this dataset.
Any algorithm that achieved a good result can be chosen; in case a fast and easy algorithm is required, Linear Regression or Polynomial Regression can be chosen.
Chapter 15
CONTINUE LEARNING MACHINE LEARNING
In this chapter, you will find areas where you can practice the newly acquired Machine Learning skills with Python.
In this chapter you will learn:
How to find datasets to work on new projects
BUILD TEMPLATES FOR PROJECTS
Throughout the book, it has been explained how to develop Machine Learning projects using Python. What is explained here is the foundation that can be used to start new Machine Learning projects. This is just a start, and you can improve as you develop bigger and more complicated projects.
As you apply what is explained here and put into practice the Machine Learning skills acquired using the Python platform, you will develop experience and skill with new and different techniques in Python, making it easier to develop projects in this area.
DATA SETS FOR PRACTICE
The important thing for improving the skills learned from this book is to keep practicing. On the web, you can find several datasets that you can use to practice what you have learned and, at the same time, to build a project portfolio to showcase in your resume.
The first place where you can find datasets to use in Machine Learning projects is the UCI Machine Learning repository (University of California, Irvine). The datasets in this repository are standardized, relatively clean, well understood and excellent for use as practice datasets.
With this repository, it is possible to build and further develop skills in the area of Machine Learning. It is also useful to start creating a portfolio that can be shown to future employers, demonstrating that you are capable of delivering results on Machine Learning projects using Python.
The link to this repository is the following:
http://archive.ics.uci.edu/ml/index.php
Another page that has very good datasets for practice is Kaggle. Here, you can download data and practice the skills you have learned, but you can also participate in competitions. In a competition, the organizer provides you with a training dataset, a test dataset on which you must make predictions, a performance measure and a deadline. The competitors then work to create the most accurate model possible. Winners often receive cash prizes.
These competitions often last weeks or months and can be a lot of fun. They also offer a great opportunity to test your skills with Machine Learning tools on datasets that often require a lot of cleaning and preparation. A good place to start would be the beginner competitions, as they are usually less challenging and have a lot of help in the form of tutorials to get started.
The main website to find the datasets is the following:
https://www.kaggle.com/datasets
No matter which page is used to obtain the datasets, the steps to follow to improve Machine Learning skills are the following:
1. Browse the list of free datasets in the repository and download some that seem interesting.
2. Use the libraries and functions learned in this book to work through the dataset and develop a model.
3. Write down the workflow and the conclusions obtained so that they can be consulted later and, if possible, share this information on some website; a recommended place is GitHub.
Working on small projects is a good way to practice the fundamentals. Sometimes some problems will become too easy, so one should seek new challenges and leave the comfort zone in order to keep increasing Machine Learning skills.
Chapter 16
GET MORE INFORMATION
This book is just the beginning of your journey to learn Machine Learning with Python. As new projects are developed, you may need help. This chapter indicates some of the best sources of help for Python and Machine Learning that you can find.
In this chapter you will learn:
1. Useful websites to clear up doubts.
GENERAL ADVICE
In general, on the LigdiGonzalez page you will find much more information about Machine Learning, from theoretical information to practical exercises, so that you can improve your skills in this topic. All the information contained there is in Spanish.
For more information, you can review the following information:
Similarly, the official documentation for Python and SciPy is excellent. Both the user guides and the API documentation are an excellent help to clarify doubts; there you will find the most complete description of the deeper configuration you can explore. The published information is in English.
Another very useful resource is question-and-answer sites, such as StackOverflow. You can search for error messages and problems you are having and find code examples and ideas that can help in your projects. This page is available in both Spanish and English, although the largest amount of information is found in the latter.
HELP WITH PYTHON
Python is a multi-purpose programming language. The more you learn about it, the better you will be able to use it. If you are relatively new to the Python platform, here are some valuable resources to go a step deeper:
Official Python 3 Documentation:
https://docs.python.org/3/
HELP WITH SCIPY AND NUMPY
It is a good idea to familiarize yourself with the broader SciPy ecosystem, so it is advisable to review the SciPy lecture notes and the NumPy documentation, especially when there are problems with these libraries.
SciPy Notes:
http://scipy-lectures.org/
Official NumPy documentation:
https://numpy.org/doc/
HELP WITH MATPLOTLIB
Graphically displaying data is very important in Machine Learning, so you can review the official matplotlib documentation, where many examples can be found, with their respective code, that can be quite useful to adapt in personal projects.
Official matplotlib documentation:
https://matplotlib.org/
HELP WITH PANDAS
Pandas has a wealth of documentation. The examples presented in the official documentation are very useful, as they will give you ideas about different ways to slice and transform the data.
Official Pandas Documentation:
http://pandas.pydata.org/pandas-docs/stable/
HELP WITH SCIKIT-LEARN
The documentation published on the scikit-learn website is of great help when developing Machine Learning projects; reviewing the configuration of each function can help improve projects and get better results.
Official scikit-learn documentation:
https://scikit-learn.org/stable/