ML-Unit 1

The document provides an overview of machine learning, including its definition, types (supervised, unsupervised, semi-supervised, and reinforcement learning), and applications in various fields such as image recognition, speech recognition, and self-driving cars. It discusses the importance of data quality, the challenges faced in machine learning, and the prerequisites for understanding the concepts. Additionally, it highlights the growing need for machine learning in solving complex problems and enhancing decision-making processes.

UNIT-I

Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities - Types of data -
Exploring structure of data - Data quality and remediation - Data Pre-processing

Machine Learning

Machine learning is a growing technology that enables computers to learn automatically from past data. Machine learning uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

What is Machine Learning

Machine Learning (ML) is the field of computer science with the help of which computer systems can make sense of data in much the same way as human beings do.

In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or requiring human intervention.

In the real world, we are surrounded by humans who can learn from their experiences with their learning capability, and we have computers or machines which work on our instructions. But can a machine also learn from experiences or past data like a human does? This is where Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve performance from experiences,
and predict things without being explicitly programmed.

With the help of sample historical data, which is known as training data, machine learning algorithms build
a mathematical model that helps in making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for creating predictive models. Machine
learning constructs or uses the algorithms that learn from historical data. A machine has the ability to learn
if it can improve its performance by gaining more data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps to build a better model which predicts the output more accurately.

Suppose we have a complex problem in which we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic as per the data and predicts the output. Machine learning has changed our way of thinking about such problems. The block diagram below explains the working of a Machine Learning algorithm.

Figure: Working of a Machine Learning algorithm

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.


o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing tasks that are too complex for a person to carry out directly. As humans, we have limitations: we cannot manually process huge amounts of data, so we need computer systems, and this is where machine learning makes things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct the models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.

The importance of Machine Learning:

 Rapid increment in the production of data


 Solving complex problems, which are difficult for a human
 Decision making in various sector including finance
 Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning

At a broad level, machine learning can be classified into four types:

Supervised learning

Unsupervised learning

Semi-supervised learning

Reinforcement learning

Introduction to Machine Learning

1) Supervised Learning

Supervised learning is commonly used in real world applications, such as face and speech recognition,
products or movie recommendations, and sales forecasting. Supervised learning can be further classified into
two types - Regression and Classification.

Regression trains on and predicts a continuous-valued response, for example predicting real estate prices.

Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment, male
and female persons, benign and malignant tumors, secure and unsecure loans etc.

In supervised learning, learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is called labeled data.
The learned rule is then used to label new data with unknown outputs.

Supervised learning involves building a machine learning model that is based on labeled samples. For
example, if we build a system to estimate the price of a plot of land or a house based on various features, such
as size, location, and so on, we first need to create a database and label it. We need to teach the algorithm what
features correspond to what prices. Based on this data, the algorithm will learn how to calculate the price of
real estate using the values of the input features.

Supervised learning deals with learning a function from available training data. Here, a learning algorithm
analyzes the training data and produces a derived function that can be used for mapping new examples. There
are many supervised learning algorithms such as Logistic Regression, Neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers.

Common examples of supervised learning include classifying e-mails into spam and not-spam categories,
labeling webpages based on their content, and voice recognition.
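
As a concrete illustration of this labeled-data workflow, here is a minimal, hedged sketch using scikit-learn (an assumed, commonly used library; the Iris dataset and logistic regression are chosen purely for illustration, not as the method described in this text):

```python
# Hedged sketch: supervised classification with scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: X holds input features, y holds the known class labels.
X, y = load_iris(return_X_y=True)

# Hold out part of the labeled data to check the learned rule on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # learn a mapping from inputs to labels
model.fit(X_train, y_train)

predictions = model.predict(X_test)         # label new, unseen data
print("Accuracy:", accuracy_score(y_test, predictions))
```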

2) Unsupervised Learning

Unsupervised learning is used to detect anomalies and outliers, such as fraud or defective equipment, or to group customers with similar behaviors for a sales campaign. It is the opposite of supervised learning: there is no labeled data here.

When learning data contains only some indications without any description or labels, it is up to the coder or
to the algorithm to find the structure of the underlying data, to discover hidden patterns, or to determine how
to describe the data. This kind of learning data is called unlabeled data.

Suppose that we have a number of data points, and we want to classify them into several groups. We may not
exactly know what the criteria of classification would be. So, an unsupervised learning algorithm tries to
classify the given dataset into a certain number of groups in an optimum way.

Unsupervised learning algorithms are extremely powerful tools for analyzing data and for identifying patterns and trends. They are most commonly used for clustering similar input into logical groups. Unsupervised learning algorithms include k-means, hierarchical clustering, DBSCAN, and so on.
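
A minimal, hedged sketch of clustering unlabeled data with k-means is shown below (scikit-learn and NumPy are assumed to be available; the two-blob dataset is synthetic and only illustrative):

```python
# Hedged sketch: unsupervised grouping of unlabeled points with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two synthetic blobs of 2-D points, no class labels supplied.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# Ask the algorithm to discover 2 groups purely from the structure of the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster centres:\n", kmeans.cluster_centers_)
print("First 10 assignments:", labels[:10])
```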

3) Semi-supervised Learning

If some learning samples are labeled but some others are not, then it is semi-supervised learning. It makes use of a small amount of labeled data together with a large amount of unlabeled data during training. Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset while it is more practical to label a small subset. For example, it often requires skilled experts to label certain remote sensing images, and lots of field experiments to locate oil at a particular location, while acquiring unlabeled data is relatively easy.

4) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and hence it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of Reinforcement learning.
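
To make the reward/penalty idea concrete, here is a toy, hedged sketch of tabular Q-learning on a made-up five-cell corridor (pure Python/NumPy; the environment, rewards, and hyperparameters are invented for illustration only):

```python
# Hedged sketch: an agent learns from reward (+1 at the goal) and penalty (-0.01 per step).
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # learned value of each (state, action) pair
alpha, gamma, epsilon = 0.5, 0.9, 0.1 # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Explore occasionally; otherwise exploit the best known action.
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else -0.01
        # Update the value estimate using the feedback just received.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.round(2))   # moving right should end up valued higher in every state
```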

Purpose of Machine Learning

Machine learning can be seen as a branch of AI or Artificial Intelligence, since the ability to change experience into expertise or to detect patterns in complex data is a mark of human or animal intelligence.

As a field of science, machine learning shares common concepts with other disciplines such as statistics,
information theory, game theory, and optimization.

As a subfield of information technology, its objective is to program machines so that they will learn.

However, note that the purpose of machine learning is not to build an automated duplicate of intelligent behavior, but to use the power of computers to complement and supplement human intelligence. For example, machine learning programs can scan and process huge databases, detecting patterns that are beyond the scope of human perception.

Machine Learning at present:

Machine learning has now made great advances in research, and it is present everywhere around us, such as self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes supervised, unsupervised, and reinforcement learning with clustering, classification, decision tree, SVM algorithms, etc.

Modern machine learning models can be used for making various predictions, including weather prediction,
disease prediction, stock market analysis, etc.

Prerequisites

Before learning machine learning, you must have basic knowledge of the following so that you can easily understand the concepts of machine learning:

 Fundamental knowledge of probability and linear algebra.


 The ability to code in any computer language, especially in Python language.
 Knowledge of Calculus, especially derivatives of single variable and multivariate functions.

Challenges in Machines Learning

While Machine Learning is rapidly evolving and making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason is that ML has not yet been able to overcome a number of challenges. The challenges that ML currently faces are:

Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-
quality data leads to the problems related to data preprocessing and feature extraction.

Time-Consuming task − Another challenge faced by ML models is the consumption of time especially for
data acquisition, feature extraction and retrieval.

Lack of specialist persons − As ML technology is still in its infancy stage, the availability of expert resources is limited.

No clear objective for formulating business problems − Having no clear objective and well-defined goal
for business problems is another key challenge for ML because this technology is not that mature yet.

Issue of overfitting & underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality − Another challenge ML model faces is too many features of data points. This can
be a real hindrance.

Difficulty in deployment − Complexity of the ML model makes it quite difficult to be deployed in real life.

Applications of Machines Learning


Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. It is used to solve many real-world complex problems which cannot be solved with a traditional approach. Following are some real-world applications of ML −

Applications of Machine learning

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects,
persons, places, digital images, etc. The popular use case of image recognition and face detection is, Automatic
friend tagging suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.

2. Speech Recognition
When using Google, we get an option of "Search by voice"; this comes under speech recognition, and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to
text", or "Computer speech recognition." At present, machine learning algorithms are widely used by various
applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition
technology to follow the voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct path with the shortest
route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two things:

Real-time location of the vehicle from the Google Maps app and sensors
Average time taken on past days at the same time of day

Everyone who is using Google Maps is helping this app to become better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests products as per customer interest.

Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars, using machine learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox marked with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier
are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them over the server on a cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways that a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a Feed Forward Neural Network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction, there is a specific pattern which changes for a fraudulent transaction; hence, the system detects it and makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us here as well by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

The technology behind the automatic translation is a sequence-to-sequence learning algorithm, which is used
with image recognition and translates the text from one language to another language.

Machine Learning Tools

Machine learning is one of the most revolutionary technologies, and it is making lives simpler. It is a subfield of Artificial Intelligence that analyses data, builds models, and makes predictions. Due to its popularity and great applications, every tech enthusiast wants to learn and build new machine learning apps. However, to build ML models, it is important to master machine learning tools. Mastering machine learning tools will enable you to play with the data, train your models, discover new methods, and create algorithms.

There are different tools, software, and platforms available for machine learning, and new software and tools are evolving day by day. Although there are many machine learning tools available, choosing the best tool for your model is a challenging task. If you choose the right tool for your model, you can make it faster and more efficient. In this topic, we will discuss some popular and commonly used machine learning tools and their features.

Figure: Machine Learning Tools

1. TensorFlow
TensorFlow is one of the most popular open-source libraries used to train and build both machine learning and deep learning models. It also provides a JS library (TensorFlow.js) and was developed by the Google Brain Team. It is very popular among machine learning enthusiasts, who use it for building different ML applications. It offers a powerful library, tools, and resources for numerical computation, specifically for large-scale machine learning and deep learning projects. It enables data scientists/ML developers to build and deploy machine learning applications efficiently. For training and building ML models, TensorFlow provides the high-level Keras API, which lets users easily start with TensorFlow and machine learning.

Features:
Below are some top features:

 TensorFlow enables us to build and train our ML models easily.


 It also enables you to run the existing models using the TensorFlow.js
 It provides multiple abstraction levels that allow the user to select the correct resource as per the
requirement.
 It helps in building a neural network.
 Provides support of distributed computing.
 While building a model, for more need of flexibility, it provides eager execution that enables
immediate iteration and intuitive debugging.
 This is open-source software and highly flexible.
 It also enables the developers to perform numerical computations using data flow graphs.
 Run-on GPUs and CPUs, and also on various mobile computing platforms.
 It provides a functionality of auto diff (Automatically computing gradients is called automatic
differentiation or auto diff).
 It enables to easily deploy and training the model in the cloud.
 It can be used in two ways, i.e., by installing through NPM or by script tags.
 It is free to use.
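A minimal, hedged sketch of the workflow described above, using TensorFlow's high-level Keras API (TensorFlow is assumed to be installed; the random data and tiny network are placeholders, not a recommended architecture):

```python
# Hedged sketch: build, train, and evaluate a small model with tf.keras.
import numpy as np
import tensorflow as tf

# Toy data: 100 samples with 4 features and a binary label.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```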
2. PyTorch

PyTorch is an open-source machine learning framework based on the Torch library. This framework is free and open-source and was developed by FAIR (Facebook's AI Research lab). It is one of the popular ML frameworks and can be used for various applications, including computer vision and natural language processing. PyTorch has Python and C++ interfaces; however, the Python interface is more interactive. Different deep learning software is built on top of PyTorch, such as PyTorch Lightning, Hugging Face's Transformers, Tesla Autopilot, etc.

It specifies a Tensor class containing an n-dimensional array that can perform tensor computations along with
GPU support.
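
A minimal, hedged sketch of this Tensor class and automatic differentiation (the Autograd module mentioned in the features below), assuming PyTorch is installed; the numbers are made up for illustration:

```python
# Hedged sketch: tensors plus autograd computing gradients automatically.
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)   # parameters we want gradients for
x = torch.tensor([3.0, 4.0])                        # fixed input data

loss = ((w * x).sum() - 10.0) ** 2   # a simple scalar "loss"
loss.backward()                      # autograd computes d(loss)/dw

print(loss.item())   # 1.0
print(w.grad)        # tensor([6., 8.])
```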

Features:
Below are some top features:

• It enables developers to create neural networks using the Autograd module.
• It is well suited for deep learning research, with good speed and flexibility.
• It can also be used on cloud platforms.
• It includes tutorial courses, various tools, and libraries.
• It also provides a dynamic computational graph, which makes this library more popular.
• It allows changing the network behaviour dynamically, without any lag.
• It is easy to use due to its hybrid front-end.
• It is freely available.

3. Google Cloud ML Engine

While training a classifier with a huge amount of data, a computer system might not perform well. However, various machine learning or deep learning projects require millions or billions of training examples, or the algorithm being used takes a long time to execute. In such a case, one should go for the Google Cloud ML Engine. It is a hosted platform where ML developers and data scientists build and run high-quality machine learning models. It provides a managed service that allows developers to easily create ML models with any type of data and of any size.

Features:
Below are the top features:

• Provides machine learning model training, building, deep learning, and predictive modelling.
• The two services, prediction and training, can be used independently or together.
• It can be used by enterprises, e.g., for identifying clouds in a satellite image or responding faster to customer emails.
• It can be widely used to train a complex model.

4. Amazon Machine Learning (AML)

Amazon provides a great number of machine learning tools, and one of them is Amazon Machine Learning or
AML. Amazon Machine Learning (AML) is a cloud-based and robust machine learning software application,
which is widely used for building machine learning models and making predictions. Moreover, it integrates
data from multiple sources, including Redshift, Amazon S3, or RDS.

Features
Below are some top features:

 AML offers visualization tools and wizards.


 Enables the users to identify the patterns, build mathematical models, and make predictions.
 It provides support for three types of models, which are multi-class classification, binary classification,
and regression.
 It permits users to import the model into or export the model out from Amazon Machine Learning.
 It also provides core concepts of machine learning, including ML models, Data sources, Evaluations,
Real-time predictions and Batch predictions.
 It enables the user to retrieve predictions with the help of batch APIs for bulk requests or real-time
APIs for individual requests.

5. Accord.Net

Accord.Net is a .Net-based machine learning framework used for scientific computing. It is combined with audio and image processing libraries written in C#. This framework provides different libraries for various applications in ML, such as pattern recognition, linear algebra, and statistical data processing. Popular packages of the Accord.Net framework are Accord.Statistics, Accord.Math, and Accord.MachineLearning.

Features
Below are some top features:

• It contains 38+ kernel functions.
• Consists of more than 40 non-parametric and parametric estimations of statistical distributions.
• Used for creating production-grade computer audition, computer vision, signal processing, and statistics apps.
• Contains more than 35 hypothesis tests, including one-way and two-way ANOVA tests, non-parametric tests such as the Kolmogorov-Smirnov test, and many more.

6. Apache Mahout
Apache Mahout is an open-source project of Apache Software Foundation, which is used for developing
machine learning applications mainly focused on Linear Algebra. It is a distributed linear algebra framework
and mathematically expressive Scala DSL, which enable the developers to promptly implement their own
algorithms. It also provides Java/Scala libraries to perform Mathematical operations mainly based on linear
algebra and statistics.

Features:
Below are some top features:

 It enables developers to implement machine learning techniques, including recommendation,


clustering, and classification.
 It is an efficient framework for implementing scalable algorithms.
 It consists of matrix and vector libraries.
 It provides support for multiple distributed backends(including Apache Spark)

 It runs on top of Apache Hadoop using the MapReduce paradigm.

7. Shogun

Shogun is a free and open-source machine learning software library, which was created by Gunnar Raetsch and Soeren Sonnenburg in 1999. This software library is written in C++ and supports interfaces for different languages such as Python, R, Scala, C#, Ruby, etc., using SWIG (Simplified Wrapper and Interface Generator). The main focus of Shogun is on kernel-based algorithms such as Support Vector Machines (SVM) and K-Means Clustering for regression and classification problems. It also provides a complete implementation of Hidden Markov Models.

Features:
Below are some top features:
 It provides support for the use of pre-calculated kernels.
 It also offers to use a combined kernel using Multiple kernel Learning Functionality.
 This was initially designed for processing a huge dataset that consists of up to 10 million samples.
 It also enables users to work on interfaces on different programming languages such as Lua, Python,
Java, C#, Octave, Ruby, MATLAB, and R.

8. Oryx2

It is a realization of the lambda architecture and is built on Apache Kafka and Apache Spark. It is widely used for real-time large-scale machine learning projects. It is a framework for building applications, and it also includes packaged end-to-end applications for collaborative filtering, regression, classification, and clustering. It is written in Java and builds on Apache Spark, Hadoop, Tomcat, Kafka, etc. The latest version of Oryx2 is Oryx 2.8.0.

Features:
Below are some top features:

• It has three tiers: a specialization on top providing ML abstractions, a generic lambda architecture tier, and an end-to-end implementation of standard ML algorithms.
• The original project was Oryx 1; after some upgrades, Oryx 2 was launched.
• It is well suited for large-scale real-time machine learning projects.
• It contains three layers arranged side by side, named the speed layer, batch layer, and serving layer.
• It also has a data transport layer that transfers data between the different layers and receives input from external sources.

9. Apache Spark MLlib

Apache Spark MLlib is a scalable machine learning library that runs on Apache Mesos, Hadoop, Kubernetes,
standalone, or in the cloud. Moreover, it can access data from different data sources. It is an open-source
cluster-computing framework that offers an interface for complete clusters along with data parallelism and
fault tolerance.
For optimized numerical processing of data, MLlib provides linear algebra packages such as Breeze and netlib-
Java. It uses a query optimizer and physical execution engine for achieving high performance with both batch
and streaming data.
Features
Below are some top features:

• MLlib contains various algorithms, including classification, regression, clustering, recommendations, association rules, etc.
• It runs on different platforms such as Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.
• It contains high-quality algorithms that provide great results and performance.
• It is easy to use, as it provides interfaces in Java, Python, Scala, R, and SQL.

10. Google ML kit for Mobile

For mobile app developers, Google brings ML Kit, which packages machine learning expertise and technology to create more robust, optimized, and personalized apps. This toolkit can be used for face detection, text recognition, landmark detection, image labelling, and barcode scanning applications. One can also use it for working offline.

Features:
Below are some top features:

• The ML Kit is optimized for mobile.
• It includes the advantages of different machine learning technologies.
• It provides easy-to-use APIs that enable powerful use cases in your mobile apps.
• It includes Vision and Natural Language APIs to detect faces, text, and objects, identify different languages, and provide reply suggestions.

Preparing to Model: Introduction

Step 1: Collect Data

Given the problem you want to solve, you will have to investigate and obtain data that you will use to feed
your machine. The quality and quantity of information you get are very important since it will directly impact
how well or badly your model will work. You may have the information in an existing database or you must
create it from scratch. If it is a small project, you can create a spreadsheet that will later be easily exported as
a CSV file. It is also common to use the web scraping technique to automatically collect information from
various sources such as APIs.

Step 2: Prepare the data


This is a good time to visualize your data and check whether there are correlations between the different characteristics you obtained. You will need to make a selection of characteristics (features), since the ones you choose will directly impact the execution times and the results. You can also reduce dimensions by applying PCA if necessary.

Additionally, you must balance the amount of data you have for each result (class) so that it is significant, as otherwise the learning may be biased towards one type of response and your model will fail when it tries to generalize.

You must also separate the data into two groups: one for training and the other for model evaluation, which can be divided approximately in a ratio of 80/20, but this can vary depending on the case and the volume of data you have.

At this stage, you can also pre-process your data by normalizing, eliminating duplicates, and making error
corrections.
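
A minimal, hedged sketch of the 80/20 split and class balancing described in this step, assuming scikit-learn is available (the feature matrix and labels here are placeholders):

```python
# Hedged sketch: split data into training and evaluation sets (roughly 80/20).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)        # 100 illustrative samples, 2 features
y = np.random.randint(0, 2, size=100)     # illustrative binary labels

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y,
    test_size=0.2,      # 20% reserved for evaluation
    random_state=42,
    stratify=y,         # keep the class balance similar in both groups
)
print(len(X_train), len(X_eval))   # 80 20
```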

Step 3: Choose the model

There are several models that you can choose according to your objective: you might use algorithms for classification, prediction, linear regression, clustering (e.g., k-means), k-Nearest Neighbors, Deep Learning (e.g., neural networks), Bayesian methods, etc.

There are various models to be used depending on the data you are going to process, such as images, sound, text, and numerical values. In the following table, we list some models and the applications for which you can use them in your projects:

Model                           Applications

Linear Regression               Price prediction
Fully connected networks        Classification
Convolutional Neural Networks   Image processing
Recurrent Neural Networks       Voice recognition
Random Forest                   Fraud detection
Reinforcement Learning          Learning by trial and error
Generative Models               Image creation
K-means                         Segmentation
K-Nearest Neighbors             Recommendation systems
Bayesian Classifiers            Spam and noise filtering
Logistic Regression             Classification

Step 4: Train your model

You will need to train your model on the training dataset and run it repeatedly so that you see an incremental improvement in the prediction rate. Remember to initialize the weights of your model randomly (the weights are the values that multiply or affect the relationships between the inputs and outputs); they will be automatically adjusted by the selected algorithm as you train.

Step 5: Evaluation

You will have to check the trained machine against your evaluation dataset, which contains inputs that the model has not seen, and verify the precision of your already trained model. If the accuracy is less than or equal to 50%, the model will not be useful, since it would be like tossing a coin to make decisions. If you reach 90% or more, you can have good confidence in the results the model gives you.

Step 6: Parameter Tuning

If during the evaluation you did not obtain good predictions and your precision is not the minimum desired, it is possible that you have overfitting or underfitting problems, and you must return to the training step and try a new configuration of parameters for your model. You can increase the number of times you iterate over your training data, termed epochs. Another important parameter is the "learning rate", which is usually a value that multiplies the gradient to gradually bring it closer to the global (or local) minimum in order to minimize the cost function.

The step size matters: increasing a value by 0.1 units is not the same as increasing it by 0.001, and the choice can significantly affect the results and the model execution time. You can also indicate the maximum error allowed for your model. Training your machine can take from a few minutes to hours, and even days. These parameters are often called hyperparameters. This "tuning" is still more of an art than a science and will improve as you experiment. There are usually many parameters to adjust, and when combined, the number of possible configurations grows quickly. Each algorithm has its own parameters to adjust. To name a few more, in Artificial Neural Networks (ANNs) you must define in the architecture the number of hidden layers, gradually testing with more or fewer of them, and how many neurons each layer should have. This takes great effort and patience to give good results.
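
As a small, hedged illustration of two of the hyperparameters mentioned above (the number of epochs and the learning rate), here is plain gradient descent on the made-up cost function J(w) = (w - 3)^2; the values are chosen only to show how the step size changes the behaviour:

```python
# Hedged sketch: how learning rate and epochs drive gradient descent.
def train(learning_rate, epochs, w=0.0):
    for _ in range(epochs):
        gradient = 2 * (w - 3)           # dJ/dw for J(w) = (w - 3)^2
        w -= learning_rate * gradient    # step towards the minimum at w = 3
    return w

print(train(learning_rate=0.1,   epochs=50))   # converges close to 3
print(train(learning_rate=0.001, epochs=50))   # too small: barely moves
print(train(learning_rate=1.1,   epochs=50))   # too large: diverges
```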

Step 7: Prediction or Inference

You are now ready to use your Machine Learning model to infer results in real-life scenarios.

Machine learning Life cycle

Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

Gathering Data

Data preparation

Data Wrangling

Analyse Data

Train the model

Test the model

Deployment

Machine learning Life cycle

The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.

In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data relevant to the problem.

In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output. The more data there is, the more accurate the prediction will be.

This step includes the below tasks:

 Identify various data sources


 Collect data
 Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our
data into a suitable place and prepare it to use in our machine learning training.

In this step, first, we put all data together, and then randomize the ordering of data.

This step can be further divided into two processes:

Data exploration:

It is used to understand the nature of data that we have to work with. We need to understand the characteristics,
format, and quality of data.

A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends, and
outliers.

Data pre-processing:

Now the next step is preprocessing of data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.

The data we have collected is not always useful to us, as some of it may not be relevant. In real-world applications, collected data may have various issues, including:

 Missing Values
 Duplicate data
 Invalid data
 Noise

So, we use various filtering techniques to clean the data.

It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

 Selection of analytical techniques


 Building models
 Review the result

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc. We then build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model

The next step is to train the model. In this step, we train our model to improve its performance, for a better outcome on the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, then we test the model. In this step, we
check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement of project or
problem.

7. Deployment

The last step of machine learning life cycle is deployment, where we deploy the model in the real-world
system.

If the above-prepared model is producing an accurate result as per our requirement with acceptable speed,
then we deploy the model in the real system. But before deploying the project, we will check whether it is
improving its performance using available data or not. The deployment phase is similar to making the final
report for a project.

Types of data

DATA: Data can be any unprocessed fact, value, text, sound, or picture that has not been interpreted and analyzed. Data is the most important part of Data Analytics, Machine Learning, and Artificial Intelligence. Without data, we can't train any model, and all modern research and automation would be in vain. Big enterprises spend lots of money just to gather as much data as possible.

INFORMATION: Data that has been interpreted and manipulated and now has some meaningful inference for the users.

KNOWLEDGE: A combination of inferred information, experiences, learning, and insights. It results in awareness or concept building for an individual or organization.

How do we split data in Machine Learning?

Training Data: The part of the data we use to train our model. This is the data that your model actually sees (both input and output) and learns from.

Validation Data: The part of the data that is used for frequent evaluation of the model as it is fit on the training dataset, and for tuning the hyperparameters involved (parameters set before the model begins learning). This data plays its part while the model is actually training.

Testing Data: Once our model is completely trained, testing data provides an unbiased evaluation. When we feed in the inputs of the testing data, our model predicts values (without seeing the actual output). After prediction, we evaluate the model by comparing its predictions with the actual outputs present in the testing data. This is how we evaluate how much our model has learned from the experiences fed in as training data.
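
A short, hedged sketch of producing these three splits with two calls to scikit-learn's train_test_split (scikit-learn is an assumed library here; the data is a placeholder):

```python
# Hedged sketch: carve a dataset into training, validation, and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First hold out 20% as the unbiased test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 30 10 10
```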

Consider an example:

There’s a Shopping Mart Owner who conducted a survey for which he has a long list of questions and answers
that he had asked from the customers, this list of questions and answers is DATA. Now every time when he
wants to infer anything and can’t just go through each and every question of thousands of customers to find
something relevant as it would be time-consuming and not helpful. In order to reduce this overhead and time
wastage and to make work easier, data is manipulated through software, calculations, graphs, etc. as per own
convenience, this inference from manipulated data is Information. So, Data is a must for Information. Now
Knowledge has its role in differentiating between two individuals having the same information. Knowledge
is actually not technical content but is linked to the human thought process.

Different Forms of Data

Numerical data: such as house price, temperature, etc.

Categorical data: such as Yes/No, True/False, Blue/Green, etc.

Ordinal data: similar to categorical data but can be measured on the basis of comparison.

Numeric Data: If a feature represents a characteristic measured in numbers, it is called a numeric feature. Numerical data is any data where the data points are exact numbers. Statisticians might also call numerical data quantitative data. This data has meaning as a measurement, such as house prices, or as a count, such as the number of residential properties in Los Angeles or how many houses were sold in the past year.

Numerical data can be characterized by continuous or discrete data. Continuous data can assume any value
within a range whereas discrete data has distinct values.

Figure: Numerical Data


For example, the number of students taking a Python class would be a discrete data set. You can only have discrete whole-number values like 10, 25, or 33. A class cannot have 12.75 students enrolled; a student either joins a class or doesn't. On the other hand, continuous data are numbers that can fall anywhere within a range. For example, a student could have an average score of 88.25, which falls between 0 and 100.

The takeaway here is that numerical data is not ordered in time; it is just numbers that we have collected.

Categorical Data: A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.

Categorical data represents characteristics, such as a hockey player's position, team, or hometown. Categorical data can take numerical values. For example, we might use 1 for the colour red and 2 for blue. But these numbers don't have a mathematical meaning; that is, we can't add them together or take the average.

In the context of supervised classification, categorical data would be the class label. This could also be something like whether a person is a man or a woman, or whether a property is residential or commercial.

Figure: Categorical Data
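
One common way to give categorical values a numeric form without implying any ordering is one-hot (dummy) encoding; a hedged sketch using pandas (an assumed library; the colours are placeholders) is shown below:

```python
# Hedged sketch: one-hot encoding a categorical "colour" feature with pandas.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green"]})
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
# Each colour becomes its own 0/1 column (colour_blue, colour_green, colour_red),
# so no category is treated as numerically "bigger" than another.
```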

Ordinal Data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy".

There is also something called ordinal data, which in some sense is a mix of numerical and categorical data.
In ordinal data, the data still falls into categories, but those categories are ordered or ranked in some particular
way. An example would be class difficulty, such as beginner, intermediate, and advanced. Those three types
of classes would be a way that we could label the classes, and they have a natural order in increasing difficulty.

Another example is when we take quantitative data and split it into groups, so that we have bins or categories of other types of data.

Figure: Ordinal Data


For plotting purposes, ordinal data is treated much in the same way as categorical data. But groups are usually
ordered from lowest to highest so that we can preserve this ordering.

Exploring structure of data

Data Structure for Machine Learning

Machine Learning is one of the hottest technologies used by data scientists and ML experts to deploy real-time projects. However, machine learning skills alone are not sufficient for solving real-world problems and designing a better product; you also need good exposure to data structures.

The data structure used for machine learning is quite similar to other software development fields where it is
often used. Machine Learning is a subset of artificial intelligence that includes various complex algorithms
to solve mathematical problems to a great extent. Data structure helps to build and understand these complex
problems. Understanding the data structure also helps you to build ML models and algorithms in a much more
efficient way than other ML professionals. In this topic, "Data Structure for Machine Learning", we will
discuss various concepts of data structure used in Machine Learning, along with the relationship between data
structure and ML. So, let's start with a quick overview of Data structure and Machine Learning.

What is Data Structure?

The data structure is defined as the basic building block of computer programming that helps us to
organize, manage and store data for efficient search and retrieval.

In other words, the data structure is the collection of data type 'values' which are stored and organized in such
a way that it allows for efficient access and modification.

Types of Data Structure

The data structure is the ordered sequence of data, and it tells the compiler how a programmer is using the
data such as Integer, String, Boolean, etc.

There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:

The linear data structure is a special type of data structure that helps to organize and manage data in a specific
order where the elements are attached adjacently.

There are mainly 4 types of linear data structure as follows:

Array:

An array is one of the most basic and common data structures used in Machine Learning. It is also used in
linear algebra to solve complex mathematical problems. You will use arrays constantly in machine learning,
whether it's:

o To convert the column of a data frame into a list format in pre-processing analysis
o To order the frequency of words present in datasets.
o Using a list of tokenized words to begin clustering topics.
o In word embedding, by creating multi-dimensional matrices.

An array contains index numbers to represent an element starting from 0. The lowest index is arr[0] and
corresponds to the first element.

Let's take an example of a Python array used in machine learning. Although the Python array is quite different from arrays in other programming languages, the Python list is more popular, as it offers flexibility in data types and length. If you are using Python for ML algorithms, it's a good idea to start your journey with arrays.

Python list methods:

Method       Description

append()     Adds an element at the end of the list.
clear()      Removes all elements from the list.
copy()       Returns a copy of the list.
count()      Returns the number of elements with the specified value.
extend()     Adds the elements of another list to the end of the current list.
index()      Returns the index of the first element with the specified value.
insert()     Adds an element at a specified position (index).
pop()        Removes and returns the element at the specified position (or the last element).
remove()     Removes the first element with the specified value.
reverse()    Reverses the order of the list.
sort()       Sorts the list in place.
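
A quick, hedged demonstration of the list methods from the table above (plain Python; the values are arbitrary):

```python
scores = [88, 92, 75]
scores.append(60)          # [88, 92, 75, 60]
scores.insert(1, 99)       # [88, 99, 92, 75, 60]
scores.remove(75)          # [88, 99, 92, 60]
last = scores.pop()        # last == 60, list is now [88, 99, 92]
scores.sort()              # [88, 92, 99]
scores.reverse()           # [99, 92, 88]
print(scores, scores.count(92), scores.index(88))   # [99, 92, 88] 1 2
```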

Stacks:

Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO (First In, Last Out). It is used for binary classification in deep learning. Although stacks are easy to learn and implement in ML models, a good grasp of them also helps in many other areas of computer science, such as parsing grammars.

Stacks enable the undo and redo buttons on your computer: they function like a stack of actions. There is no sense in adding an action at the bottom of the stack; we can only check the most recent one that has been added. Addition and removal occur at the top of the stack.
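
A tiny, hedged sketch of this LIFO behaviour using a plain Python list, in the spirit of the undo example above (the action names are made up):

```python
# Hedged sketch: a stack of editing actions; push with append(), pop from the top.
undo_stack = []
undo_stack.append("type 'hello'")   # push
undo_stack.append("bold text")      # push
undo_stack.append("insert image")   # push

print(undo_stack.pop())   # 'insert image' -- the most recent action is undone first
print(undo_stack.pop())   # 'bold text'
print(undo_stack)         # ["type 'hello'"]
```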

Linked List:

A linked list is a type of collection with several separately allocated nodes. In other words, it is a collection of data elements, each consisting of a value and a pointer that points to the next node in the list.

In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing a value is slow and often requires scanning the list. So, a linked list is very useful where the shifting of elements required by a dynamic array would otherwise be needed. Although insertion of an element can be done at the head, middle, or tail position, it is relatively costly. However, linked lists are easy to splice together and split apart. Also, the list can be converted to a fixed-length array for fast access.

Queue:

A Queue is defined as the "FIFO" (first in, first out). It is useful to predict a queuing scenario in real-time
programs, such as people waiting in line to withdraw cash in the bank. Hence, the queue is significant in a
program where multiple lists of codes need to be processed.

The queue data structure can be used to record the split time of a car in F1 racing.
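
A small, hedged sketch of this FIFO behaviour using collections.deque, which supports efficient insertion at one end and removal at the other (the example queue is invented):

```python
# Hedged sketch: a first-in, first-out queue of waiting customers.
from collections import deque

queue = deque()
queue.append("customer 1")    # join the back of the line
queue.append("customer 2")
queue.append("customer 3")

print(queue.popleft())        # 'customer 1' -- first in, first out
print(queue.popleft())        # 'customer 2'
print(list(queue))            # ['customer 3']
```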

2. Non-linear Data Structures

As the name suggests, in non-linear data structures, elements are not arranged in any sequence. All the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked with one or more elements.

1) Trees

Binary Tree:

The concept of a binary tree is very similar to a linked list; the only difference is in the nodes and their pointers. In a linked list, each node contains a data value with a pointer that points to the next node in the list, whereas in a binary tree, each node has two pointers to subsequent nodes instead of just one.

In a binary search tree the nodes are kept sorted, so insertion and deletion operations can be done with O(log N) time complexity. Similar to the linked list, a binary tree can also be converted to an array on the basis of tree sorting.

In a binary search tree there are child and parent nodes, where the value of the left child node is always less than the value of the parent node, while the value of the right child node is always greater than the parent node. Hence, in such a tree structure, data sorting happens automatically, which makes insertion and deletion efficient.

2) Graphs

A graph data structure is also very useful in machine learning, for example for link prediction. A graph consists of nodes connected by ordered or unordered pairs (edges), which may be directed or undirected. Hence, you should have good exposure to the graph data structure for machine learning and deep learning.

3) Maps

Maps are a popular data structure in the programming world, mostly useful for fast data lookup and for reducing the running time of algorithms. A map stores data in the form of (key, value) pairs, where the key must be unique; however, the value can be duplicated. Each key corresponds to, or maps to, a value; hence it is named a map.

In different programming languages, core libraries have built-in maps or, rather, HashMaps with different
names for each implementation.

o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.

Python Dictionaries are very useful in machine learning and data science as various functions and algorithms
return the dictionary as an output. Dictionaries are also much used for implementing sparse matrices, which
is very common in Machine Learning.
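
A hedged sketch of that sparse-matrix idea: a plain Python dictionary keyed by (row, column) stores only the non-zero entries (the values here are arbitrary):

```python
# Hedged sketch: dictionary-of-keys representation of a sparse matrix.
sparse = {
    (0, 3): 2.5,
    (4, 1): -1.0,
    (7, 7): 0.75,
}

def get(matrix, row, col):
    """Return the stored value, or 0.0 for any entry that was never set."""
    return matrix.get((row, col), 0.0)

print(get(sparse, 0, 3))   # 2.5
print(get(sparse, 2, 2))   # 0.0 -- an implicit zero that uses no memory
```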

4) Heap data structure:

A heap is a hierarchically ordered data structure. The heap data structure is very similar to a tree, but it uses vertical ordering instead of horizontal ordering.

Ordering in a heap is applied along the hierarchy but not across it: in a max-heap the value of the parent node is always greater than that of its child nodes, whether on the left or the right side (in a min-heap it is always smaller).

Here, the insertion and deletion operations are performed on the basis of promotion: a new element is first
inserted at the next available position at the bottom of the heap; it is then compared with its parent and promoted
until it reaches its correct position. Most heap data structures can be stored in an array, with the relationships
between the elements implied by their positions.
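Python's built-in heapq module, for instance, maintains a min-heap inside a plain list, which matches the array-based storage described above; a short sketch:

import heapq

values = [9, 4, 7, 1, 8]
heapq.heapify(values)              # rearrange the list to satisfy the heap property
heapq.heappush(values, 3)          # insert at the end, then promote towards the root
smallest = heapq.heappop(values)   # remove the root, i.e. the minimum element
print(smallest, values)            # 1 [3, 4, 7, 9, 8]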

Dynamic array data structure:

This is one of the most important types of data structure, used in linear algebra to represent and manipulate
1-D, 2-D, 3-D and even 4-D arrays for matrix arithmetic. Working with it requires good exposure to Python
libraries such as NumPy, which is widely used when programming for deep learning.
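A quick NumPy sketch of the kind of array arithmetic referred to here (the values are arbitrary):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])         # a 2-D array (matrix)
b = np.array([0.5, -1.0])          # a 1-D array (vector)

print(A @ b)                       # matrix-vector product -> [-1.5 -2.5]
print(A.T)                         # transpose
print(A.reshape(4))                # the same data viewed as a 1-D array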

How is Data Structure used in Machine Learning?

For a machine learning professional, apart from machine learning skills, mastery of data structures and
algorithms is required.

When we use machine learning to solve a problem, we need to evaluate model performance, i.e., determine which
model is fastest and requires the smallest amount of space and resources while remaining accurate. Moreover,
when a model is built using algorithms, comparing and contrasting two algorithms to determine the best one for
the job is crucial. For such cases, skills in data structures become important for ML professionals.

Data quality and remediation

Data quality (DQ) is the degree to which a given dataset meets a user's needs. Data quality is an important
criterion for ensuring that data-driven decisions are made as accurately as possible.

High-quality data is of sufficient quantity -- and has sufficient detail -- to meet its intended uses. It is
consistent with other sources, presented in appropriate ways and has a high degree of completeness. Other
key data quality components include:

 Accuracy -- The extent to which data represents real-world events accurately.


 Credibility -- The extent to which data is considered trustworthy and true.
 Timeliness -- The extent to which data meets the user's current needs.
 Consistency -- The extent to which the same data occurrences have the same value in different
datasets.
 Integrity -- The extent to which all data references have been joined accurately.

Machine learning algorithms trained on accurate, clean, and well-labelled data can identify the patterns hidden
in the data and produce models that provide predictions with high accuracy. It is for this reason that it is very
important to understand the input, detect and address any issues affecting its quality, before feeding the input
to the machine learning algorithm.

Data quality evaluation

There are many aspects of data quality and various dimensions that one can consider when evaluating the data
at hand. Some of the most common aspects examined in the data quality assessment process are the following:

Number of missing values. Most real-world datasets contain missing values, i.e., feature entries with no data
value stored. As many machine learning algorithms do not support missing values, detecting them and handling
them properly can have a significant impact.

Existence of duplicate values. Duplicate values can take various formats, such as multiple entries of the same
data point, multiple instances of an entire column, and repetition of the same value in an I.D. variable. While
duplicate instances might be valid in some datasets, they often arise because of errors in the data extraction
and integration processes. Hence, it is important to detect any duplicate values and decide if they correspond
to invalid values (true duplicates) or form a valid part of the dataset.

Existence of outliers/anomalies. Outliers are data points that differ substantially from the rest of data, and
they may arise due to the diversity of the dataset or because of errors/mistakes. As machine learning algorithms
are sensitive to the range and distribution of attribute values, identifying the outliers and their nature is
important for assessing the quality of the dataset.

Existence of invalid/bad formatted values. Datasets often contain inconsistent values, such as variables with
different units across the data points and variables with incorrect data type. For example, it is often the case
that some special numerical variables, such as percentages and fractions, are mistakenly stored as strings, and
one should detect and transform such cases so that the machine learning algorithm can work with the actual
numbers.
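As a hedged sketch of how such checks might be performed (assuming the demo dataset used later in this unit, with numeric Age and Salary columns), pandas can quantify most of these aspects in a few lines:

import pandas as pd

df = pd.read_csv('Dataset.csv')

print(df.isnull().sum())           # missing values per column
print(df.duplicated().sum())       # number of fully duplicated rows
print(df.dtypes)                   # spot numeric columns stored as strings

# A simple outlier check for a numeric column using the IQR rule
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Salary'] < q1 - 1.5 * iqr) | (df['Salary'] > q3 + 1.5 * iqr)]
print(len(outliers), 'potential outliers in Salary')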

Improving data quality

After exploring the data to assess its quality and gain an in-depth understanding of the dataset, it is important
to resolve any detected issues before proceeding to the next stages of the machine learning pipeline. Below,
we give some of the most common ways for addressing such issues.

Handling missing values. There are different ways of dealing with missing data based on their number and
their data type:

Removing the missing data. If the number of data points containing missing values is small and the size of
the dataset is large enough, such data points may be removed. Also, if a variable contains a very large number
of missing values, the variable itself may be removed.

Imputation. If the number of missing values is not small enough to be removed and not large enough to be a
substantial proportion of the variable entries, you can replace the missing values in a numerical variable with
the mean/median of the non-missing entries and the missing values in a categorical variable with the mode,
which is the most frequent entry of the variable.

Dealing with duplicate values. True duplicates, i.e., instances of the same data point, are usually removed.
In this way, the increase of the sample weight on these points is eliminated, and the risk of any artificial
inflation in the performance metrics is reduced.

Dealing with outliers. As with missing values, common methods of handling detected outliers include removing
them or imputing new values. However, depending on the context of the dataset and the number of outliers,
keeping the outliers unchanged might be the most suitable course of action. For example, keeping the outliers
would be suggested when they are not very few in number, as they might be necessary to understand the dataset
correctly.

Converting badly formatted values. All malformed values are converted and stored with the correct data type.
For example, numerical variables that are stored as strings are converted to the corresponding numbers, and
strings that represent dates are stored as date objects. It is also important to ensure that all entries in a
variable are expressed in the same unit, as otherwise comparisons between the entries will not reflect the true
relationships.
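A brief pandas sketch of these remediation steps (again assuming the demo dataset with Country, Age and Salary columns; the details are illustrative, not prescriptive):

import pandas as pd

df = pd.read_csv('Dataset.csv')

# Impute missing numeric values with the mean and missing categories with the mode
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Country'] = df['Country'].fillna(df['Country'].mode()[0])

# Remove true duplicates (repeated instances of the same data point)
df = df.drop_duplicates()

# Convert badly formatted values, e.g. a numeric column stored as strings
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')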

What is data remediation?

Data remediation is the process of cleansing, organizing and migrating data so that it’s properly protected and
best serves its intended purpose. There is a misconception that data remediation simply means deleting
business data that is no longer needed. It’s important to remember that the key word “remediation” derives
from the word “remedy,” which is to correct a mistake. Since the core initiative is to correct data, the data
remediation process typically involves replacing, modifying, cleansing or deleting any “dirty” data.

Data remediation terminology

Data Migration – The process of moving data between two or more systems, data formats or servers.

Data Discovery – A manual or automated process of searching for patterns in data sets to identify structured
and unstructured data in an organization’s systems.

ROT – An acronym that stands for redundant, obsolete and trivial data. According to the Association for
Intelligent Information Management, ROT data accounts for nearly 80 percent of the unstructured data that is
beyond its recommended retention period and no longer useful to an organization.

Dark Data – Any information that businesses collect, process and store, but do not use for other purposes.
Some examples include customer call records, raw survey data or email correspondences. Often, the storing
and securing of this type of data incurs more expense and sometimes even greater risk than it does value.

Dirty Data – Data that damages the integrity of the organization’s complete dataset. This can include data
that is unnecessarily duplicated, outdated, incomplete or inaccurate.

Data Overload – This is when an organization has acquired too much data, including low-quality or dark
data. Data overload makes the tasks of identifying, classifying and remediating data laborious.

Data Cleansing – Transforming data in its native state to a predefined standardized format.

Data Governance – Management of the availability, usability, integrity and security of the data stored within
an organization.

Data Pre-processing

Data pre-processing is a process of preparing the raw data and making it suitable for a machine learning model.
It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, we do not always come across clean and well-formatted data. Before
doing any operation with data, it is necessary to clean it and put it in a formatted way; for this, we use the
data preprocessing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format which cannot
be directly used for machine learning models. Data preprocessing is the required task of cleaning the data and
making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

It involves the following steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning model works
entirely on data. The data collected for a particular problem and arranged in a proper format is known as
the dataset.

Datasets come in different formats for different purposes. For example, the dataset needed to create a machine
learning model for a business purpose will be different from the dataset required for a liver-patient problem, so
each dataset differs from another. To use the dataset in our code, we usually put it into a CSV file. However,
sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as
spreadsheets. It is useful for huge datasets, and such datasets can easily be used in programs.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from
https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets
online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php,
etc.

We can also create our own dataset by gathering data using various APIs with Python and putting that data into
a .csv file.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined Python libraries.
These libraries are used to perform some specific jobs. There are three specific libraries that we will use for
data preprocessing, which are:

Numpy: The Numpy Python library is used to include any type of mathematical operation in the code. It is the
fundamental package for scientific computation in Python and supports large, multidimensional arrays and
matrices. In Python, we can import it as:

1. import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this library, we
need to import its sub-library pyplot. This library is used to plot any type of chart in Python for the code. It
will be imported as below:

1. import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for
importing and managing datasets. It is an open-source data manipulation and analysis library. It will be
imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning project. But before
importing a dataset, we need to set the current directory as a working directory. To set a working directory in
Spyder IDE, we need to follow the below steps:

1. Save your Python file in the directory which contains dataset.


2. Go to File explorer option in Spyder IDE, and select the required directory.
3. Click on F5 button or run option to execute the file.

After these steps, the Python file is saved in the same folder as the required dataset, and that folder is set
as the working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a
csv file and perform various operations on it. Using this function, we can read a csv file locally as well as
through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable in which we store our dataset, and inside the function we have passed
the name of our dataset file. Once we execute the above line of code, the dataset is successfully imported into
our code. We can also check the imported dataset by opening the Variable Explorer section and then
double-clicking on data_set.

In the Variable Explorer view, indexing starts from 0, which is the default indexing in Python. We can also
change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables) from the
dependent variable in the dataset. In our dataset, there are three independent variables -- Country, Age, and
Salary -- and one dependent variable, Purchased.

Extracting independent variable:

To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to extract the
required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the columns.
Here we have used :-1, because we don't want to take the last column as it contains the dependent variable. So
by doing this, we will get the matrix of features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, only the three independent variables are included.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains some
missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to
handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values: we simply delete the
specific row or column which contains null values. However, this approach is not very efficient, and removing
data may lead to loss of information, which will not give an accurate output.

By calculating the mean: In this way, we will calculate the mean of that column or row which contains any
missing value and will put it on the place of missing value. This strategy is useful for the features which have
numeric data such as age, salary, year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for
building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library
(see the note after the output for the equivalent class in recent scikit-learn versions). Below is the code for it:

1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],


['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the mean of the remaining values
in the corresponding column.
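Note that the Imputer class used above belongs to older scikit-learn releases and has been removed in recent versions (0.22 onwards). A hedged equivalent using the current SimpleImputer class from sklearn.impute (assuming x is the feature matrix extracted earlier) would be:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])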

5) Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country
and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical variable in our dataset
may create trouble while building the model. So it is necessary to encode these categorical variables into
numbers.

For Country variable:

Firstly, we will encode the Country variable into numerical values. To do this, we will use the LabelEncoder()
class from the preprocessing library.

1. #Categorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library, which has successfully
encoded the categories into digits.

But in our case, the Country variable has three categories, and as we can see in the above output, these
categories are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there
is some ordering or correlation between the categories, which would produce a wrong output. To remove this
issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables which take only the values 0 or 1. A value of 1 indicates the presence of that
category in a particular column, while the remaining dummy columns are 0. With dummy encoding, we will have a
number of columns equal to the number of categories.

In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values. For Dummy
Encoding, we will use OneHotEncoder class of preprocessing library.

1. #for Country Variable


2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. label_encoder_x= LabelEncoder()
4. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
5. #Encoding for dummy variables
6. onehot_encoder= OneHotEncoder(categorical_features= [0])
7. x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,


6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, the Country variable has been encoded into 0 and 1 values and split into
three columns.

It can be seen more clearly in the Variable Explorer section by clicking on the x variable.
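Similarly, the categorical_features argument of OneHotEncoder shown above was removed in later scikit-learn releases. A sketch of the current approach uses ColumnTransformer to one-hot encode only the first column (assuming x as extracted earlier):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

column_transformer = ColumnTransformer(
    [('country_encoder', OneHotEncoder(), [0])],   # encode column 0 (Country) only
    remainder='passthrough')                       # keep Age and Salary unchanged
x = column_transformer.fit_transform(x)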

For Purchased Variable:

1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)

For the second categorical variable, we only use the labelencoder_y object of the LabelEncoder class. Here we
are not using the OneHotEncoder class, because the Purchased variable has only two categories, yes or no, which
are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen in the Variable Explorer section.

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and test set. This is one of the
crucial steps of data preprocessing as by doing this, we can enhance the performance of our machine learning
model.

Suppose we have trained our machine learning model on one dataset and then test it on a completely different
dataset. In that case, it will be difficult for the model to understand the correlations between the features
and the target.

If we train our model very well and its training accuracy is very high, but we then provide it with a new dataset,
its performance will decrease. So we always try to build a machine learning model which performs well with the
training set and also with the test dataset. Here, we can define these datasets as:

Training Set: A subset of the dataset used to train the machine learning model; for this subset we already know
the output.

Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts
the output for unseen data.

For splitting the dataset, we will use the below lines of code:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line imports the function used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, the first two of which are the arrays of
data; test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which sets the dividing
ratio between the training and testing sets.
o The last parameter, random_state, sets a seed for the random generator so that you always get the same result;
the most commonly used value for it is 42.

Output:

By executing the above code, we will get 4 different variables, which can be seen under the variable explorer
section.

As we can see in the Variable Explorer, the x and y variables have been divided into 4 different variables with
corresponding values.

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique that standardizes
the independent variables of the dataset to a specific range. In feature scaling, we put our variables on the
same scale so that no single variable dominates the others.

Consider the Age and Salary columns of our dataset: their values are not on the same scale. Many machine learning
models are based on Euclidean distance, and if we do not scale the variables, the feature with the larger values
will cause issues in our machine learning model.

Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as:

sqrt((x2 - x1)^2 + (y2 - y1)^2)
If we compute the distance between any two data points using Age and Salary, the Salary values will dominate the
Age values and produce a misleading result. To remove this issue, we need to perform feature scaling for machine
learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization

Here, we will use the standardization method for our dataset.
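For reference (these formulas are standard and are not shown in the notes themselves):

Standardization: x' = (x - mean(x)) / standard_deviation(x)

Normalization (min-max scaling): x' = (x - min(x)) / (max(x) - min(x))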

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:

1. from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features. And then we
will fit and transform the training dataset.

1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)

For the test dataset, we directly apply the transform() function instead of fit_transform(), because the scaler
has already been fitted on the training set.

1. x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test, which can be inspected
in the Variable Explorer section. As we can see, all the variables are now scaled to a comparable range centred
around zero.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more understandable.

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5. #importing datasets
6. data_set= pd.read_csv('Dataset.csv')
7. #Extracting Independent Variable
8. x= data_set.iloc[:, :-1].values
9. #Extracting Dependent variable
10. y= data_set.iloc[:, 3].values
11. #handling missing data(Replacing missing data with the mean value)
12. from sklearn.preprocessing import Imputer
13. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
14. #Fitting imputer object to the independent variables x.
15. imputer= imputer.fit(x[:, 1:3])
16. #Replacing missing data with the calculated mean value
17. x[:, 1:3]= imputer.transform(x[:, 1:3])
18. #for Country Variable

19. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
20. label_encoder_x= LabelEncoder()
21. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
22. #Encoding for dummy variables
23. onehot_encoder= OneHotEncoder(categorical_features= [0])
24. x= onehot_encoder.fit_transform(x).toarray()
25. #encoding for purchased variable
26. labelencoder_y= LabelEncoder()
27. y= labelencoder_y.fit_transform(y)
28. # Splitting the dataset into training and test set.
29. from sklearn.model_selection import train_test_split
30. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
31. #Feature Scaling of datasets
32. from sklearn.preprocessing import StandardScaler
33. st_x= StandardScaler()
34. x_train= st_x.fit_transform(x_train)
35. x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. However, some of these steps or
lines of code are not necessary for every machine learning model, so they can be excluded when they are not
needed, making the code reusable across models.
