Machine Learning

Uploaded by

BRYAND CHE
Copyright
© All Rights Reserved
Machine Learning

The Machine Learning Tutorial covers both the fundamentals and more complex ideas of
machine learning. Students and professionals in the workforce can benefit from our
machine learning tutorial.

Machine learning is a rapidly developing field of technology that allows computers to
learn automatically from past data. It employs a variety of algorithms to build
mathematical models and make predictions from historical data or information. It is
currently used for a variety of tasks, including speech recognition, email filtering,
auto-tagging on Facebook, recommender systems, and image recognition.

You will learn about the many different methods of machine learning, including
reinforcement learning, supervised learning, and unsupervised learning, in this machine
learning tutorial. Regression and classification models, clustering techniques, hidden
Markov models, and various sequential models will all be covered.

What is Machine Learning


In the real world, we are surrounded by humans, who can learn from their experiences,
and by computers or machines, which work on our instructions. But can a machine also
learn from experience or past data as a human does? This is where machine learning
comes in.
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that focuses primarily on the
creation of algorithms that enable a computer to learn independently from data and
past experience. Arthur Samuel first used the term "machine learning" in 1959. It
can be summarized as follows:

Machine learning enables a machine to learn automatically from data, improve its
performance with experience, and make predictions, without being explicitly programmed.

Machine learning algorithms build a mathematical model from sample historical data,
known as training data, that helps make predictions or decisions without being
explicitly programmed. To develop predictive models, machine learning brings together
statistics and computer science. Machine learning either constructs or uses algorithms
that learn from historical data. Performance generally improves as more relevant data
is provided.

A machine can be said to learn if its performance improves as it gains more data.
How does Machine Learning work
A machine learning system learns from past data, builds prediction models, and
predicts the output whenever it receives new data. The accuracy of the predicted
output depends on the amount of data: more data generally helps build a better model
that predicts the output more accurately.
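
As a minimal sketch of this learn-then-predict loop, the Python example below fits a
straight line to a few past observations and then predicts the output for a new input.
The data (hours studied and exam scores) and the function names fit_line and predict
are made up for illustration; a real project would normally use a library model.

```python
# Minimal sketch: learn a straight line from past data, then predict new data.
# The numbers below are invented for illustration (hours studied -> exam score).

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept


past_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
past_scores = [52.0, 54.0, 56.0, 58.0, 60.0]  # generated from 50 + 2 * hours

model = fit_line(past_hours, past_scores)     # "training" on past data
print(predict(model, 6.0))                    # prediction for unseen input
```

Nothing about the exam problem is written into the code itself: the slope and
intercept come entirely from the data.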

Suppose we have a complex problem in which we need to make predictions. Instead of
writing code for it, we feed the data to generic algorithms, which build the logic
from the data and predict the output. Machine learning has changed the way we approach
such problems. The operation of a machine learning algorithm is depicted in the
following block diagram:

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts
of data.

Need for Machine Learning


The demand for machine learning is steadily rising because it can perform tasks that
are too complex for a person to implement directly. Humans cannot manually process
vast amounts of data, so we need computer systems, and this is where machine learning
comes in to simplify our lives.

We can train machine learning algorithms by providing them with a large amount of data
and letting them explore the data, build models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the amount
of data and can be measured with a cost function. Using machine learning can save both
time and money.
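
To make the idea of a cost function concrete, here is a small sketch of one common
choice, the mean squared error. The tutorial does not say which cost function it has
in mind, so this choice and the sample numbers are assumptions for illustration.

```python
# Sketch of one common cost function: mean squared error (MSE).
# Lower cost means the predictions are closer to the actual values.

def mean_squared_error(actual, predicted):
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0]
good_guess = [3.0, 5.0, 6.0]   # hypothetical predictions of a better model
bad_guess = [0.0, 9.0, 7.0]    # hypothetical predictions of a worse model

good_cost = mean_squared_error(actual, good_guess)
bad_cost = mean_squared_error(actual, bad_guess)
print(good_cost, bad_cost)
```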

The importance of machine learning is easy to see from its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions on Facebook, and so on. Top companies such as Netflix
and Amazon have built machine learning models that use vast amounts of data to
analyze user interest and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system
for training, and the system then predicts the output based on the training data.

The system builds a model from the labeled data to understand the dataset and learn
about each data point. After training and processing are done, we test the model with
sample data to see whether it can accurately predict the output.

The objective of supervised learning is to map input data to output data. Supervised
learning is based on supervision, in the same way that a student learns under the
supervision of a teacher. Spam filtering is an example of supervised learning.

Supervised learning can be further grouped into two categories of algorithms:

o Classification: a supervised machine learning method in which the model tries to
predict the correct label of a given input. The output is always discrete.
o Regression: a supervised machine learning technique that is used to predict
continuous values.
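
The difference between the two categories can be sketched with one simple method,
k-nearest neighbours, used here only as an illustration (the tutorial does not
prescribe it). The same labeled-data idea gives a discrete label for classification
and a continuous value for regression. The datasets and function names are
hypothetical.

```python
from collections import Counter

# Sketch: k-nearest neighbours on one numeric feature.
# Classification returns a discrete label; regression returns a continuous value.

def nearest(train, x, k):
    """Return the k training pairs whose input is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]


def knn_classify(train, x, k=3):
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]   # majority label


def knn_regress(train, x, k=3):
    values = [value for _, value in nearest(train, x, k)]
    return sum(values) / len(values)              # average of neighbours


# Hypothetical labeled data: suspicious words in an email -> label
emails = [(0, "ham"), (1, "ham"), (2, "ham"), (7, "spam"), (8, "spam"), (9, "spam")]
# Hypothetical labeled data: flat size in square metres -> price in thousands
flats = [(30, 90.0), (40, 120.0), (50, 150.0), (60, 180.0)]

print(knn_classify(emails, 6))       # a discrete class
print(knn_regress(flats, 45, k=2))   # a continuous value
```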
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The machine is trained with a set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision. The
goal of unsupervised learning is to restructure the input data into new features or
groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from a huge amount of data. It can be further classified into two
categories of algorithms:

o Clustering: the act of grouping unlabeled examples so that similar examples fall
into the same group.
o Association: an unsupervised learning technique that checks for the dependency of
one data item on another and maps them accordingly so that the relationship can be
used profitably.
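
As a sketch of clustering, the example below runs a one-dimensional version of the
k-means algorithm on unlabeled numbers. k-means is one possible clustering method,
chosen here as an assumption; the points and the initial centre guesses are
invented for illustration.

```python
# Sketch: k-means clustering (Lloyd's algorithm) on one-dimensional data.
# No labels are given; the algorithm groups points that lie close together.

def kmeans_1d(points, centers, iterations=10):
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each centre to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])  # hypothetical start
print(centers)
print(clusters)
```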

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent
gets a reward for each right action and a penalty for each wrong action. The agent
learns automatically from this feedback and improves its performance. In reinforcement
learning, the agent interacts with the environment and explores it. The goal of the
agent is to collect the most reward points, and in doing so it improves its
performance.

A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.
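
The reward-and-penalty loop can be sketched with tabular Q-learning, one well-known
reinforcement learning algorithm (picked here as an illustration, not named in the
tutorial). The environment is a hypothetical five-cell corridor: the agent gets a
reward of +1 for reaching the last cell and a small penalty of -0.1 for every other
move. All names and parameter values are assumptions.

```python
import random

# Sketch: tabular Q-learning in a hypothetical 5-cell corridor.
N_STATES = 5          # cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # action index 0 = step left, index 1 = step right


def step(state, action_index):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reward for reaching the goal
    return next_state, -0.1, False       # penalty for every other move


def greedy(q_row):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)       # random non-goal start cell
        for _ in range(100):                      # cap the episode length
            explore = rng.random() < epsilon      # explore now and then
            action = rng.randrange(2) if explore else greedy(q[state])
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
policy = [greedy(row) for row in q_table[:-1]]    # best action per non-goal cell
print(policy)
```

After training, the learned policy should choose "step right" (index 1) in every
cell, because that is the shortest way to the reward.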

Note: We will learn about the above types of machine learning in detail in later chapters.

History of Machine Learning


Some 40-50 years ago, machine learning was science fiction, but today it is part of
our daily life. Machine learning makes our day-to-day life easier, from self-driving
cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning
is old and has a long history. Some milestones in the history of machine learning are
given below:
The early history of Machine Learning (Pre-1940):

o 1834: In 1834, Charles Babbage, the father of the computer, conceived a device
that could be programmed with punch cards. Although the machine was never built,
all modern computers rely on its logical structure.
o 1936: In 1936, Alan Turing proposed a theory of how a machine can determine and
execute a set of instructions.

The era of stored program computers:

o 1940s: "ENIAC", the first electronic general-purpose computer, was built; it was
completed in 1945. After that, stored-program computers such as EDSAC in 1949 and
EDVAC in 1951 were built.
o 1943: In 1943, a human neural network was modeled with an electrical circuit. In
1950, scientists started putting this idea to work and analyzed how human neurons
might work.

Computing machinery and intelligence:


o 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can
machines think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped
an IBM computer play checkers. The program performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML researchers, and it is
known as the AI winter.
o During this period, machine translation failed to deliver and people lost interest
in AI, which led to reduced government funding for research.

Machine Learning from theory to reality

o 1959: In 1959, the first neural network was applied to a real-world problem to
remove echoes over phone lines using an adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural
network NETtalk, which was able to teach itself how to correctly pronounce 20,000
words in one week.
o 1997: IBM's Deep Blue computer won a chess match against world champion Garry
Kasparov, becoming the first computer to defeat a reigning world chess champion in a
match.

Machine Learning in the 21st century


2006:

o Geoffrey Hinton and his group introduced the idea of deep learning using deep
belief networks.
o The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable
computing resources that made it easier to create and implement machine learning
models.

2007:

o The Netflix Prize competition challenged participants to improve the accuracy of
Netflix's recommendation algorithm.
o Reinforcement learning made significant progress when a team of researchers used it
to train a computer to play backgammon at a top level.

2008:

o Google released the Google Prediction API, a cloud-based service that allowed
developers to integrate machine learning into their applications.
o Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained
attention for their ability to model complex data distributions.

2009:

o Deep learning gained ground as researchers showed its effectiveness in various
tasks, including speech recognition and image classification.
o The term "Big Data" gained popularity, highlighting the challenges and
opportunities associated with handling huge datasets.

2010:

o The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced,
driving advances in computer vision and prompting the development of deep
convolutional neural networks (CNNs).

2011:

o On Jeopardy!, IBM's Watson defeated human champions, demonstrating the potential of
question-answering systems and natural language processing.

2012:

o AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, significantly
improving image classification accuracy and establishing deep learning as a dominant
approach in computer vision.
o Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train
a neural network to recognize cats from unlabeled YouTube videos.

2013:

o Ian Goodfellow introduced generative adversarial networks (GANs), which made it
possible to create realistic synthetic data.
o Google later acquired the startup DeepMind Technologies, which focused on deep
learning and artificial intelligence.

2014:

o Facebook introduced the DeepFace system, which achieved near-human accuracy in
facial recognition.
o AlphaGo, a program created by DeepMind at Google, defeated a world champion
Go player and demonstrated the potential of reinforcement learning in challenging
games.

2015:

o Microsoft released the Cognitive Toolkit (previously known as CNTK), an open-source
deep learning library.
o The performance of sequence-to-sequence models in tasks like machine
translation was enhanced by the introduction of the idea of attention mechanisms.

2016:

o The goal of explainable AI, which focuses on making machine learning models
easier to understand, received some attention.
o Google's DeepMind created AlphaGo Zero, which achieved superhuman Go play without
human data, using only reinforcement learning.

2017:

o Transfer learning gained prominence, allowing pretrained models to be used for
different tasks with limited data.
o Better synthesis and generation of complex data were made possible by the
introduction of generative models like variational autoencoders (VAEs) and
Wasserstein GANs.

These are only some of the notable advances and milestones in machine learning during
this period. The field has continued to advance rapidly beyond 2017, with new
breakthroughs, techniques, and applications emerging.

Machine Learning at present:


The field of machine learning has made significant strides in recent years, and its
applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and
recommender systems. It includes supervised, unsupervised, and reinforcement learning,
with techniques such as clustering, classification, decision trees, and SVM
algorithms.

Modern machine learning models can be used to make various predictions, including
weather prediction, disease prediction, stock market analysis, and so on.

Prerequisites
Before learning machine learning, you should have basic knowledge of the following so
that you can easily understand the concepts of machine learning:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in a programming language, preferably Python.
o Knowledge of calculus, especially derivatives of single-variable and multivariate
functions.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is growing rapidly day
by day. We use machine learning in our daily life, often without knowing it, through
services such as Google Maps, Google Assistant, and Alexa. Below are some of the most
popular real-world applications of machine learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is
used to identify objects, people, places, and so on in digital images. A popular use
case of image recognition and face detection is automatic friend tagging suggestion:

Facebook provides a feature that suggests friends to tag automatically. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging suggestion
with their names. The technology behind this is machine learning's face detection and
recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition
When using Google, we get the option to "Search by voice". This comes under speech
recognition, a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it
is also known as "speech to text" or "computer speech recognition." At present,
machine learning algorithms are widely used in speech recognition applications.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
correct path with the shortest route and predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested in two ways:

o Real-time location of vehicles from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps helps to make the app better. It takes information from
the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as
Amazon and Netflix to recommend products to the user. Whenever we search for a product
on Amazon, we start getting advertisements for the same product while surfing the
internet in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and
suggests products accordingly.

Similarly, when we use Netflix, we find recommendations for series, movies, etc., and
this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in
which machine learning plays a significant role. Tesla, a popular car manufacturer, is
working on self-driving cars. It uses an unsupervised learning method to train its car
models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is automatically filtered as important, normal, or
spam. Important mail arrives in our inbox marked with the important symbol, and spam
goes to our spam box; the technology behind this is machine learning. Below are some
spam filters used by Gmail:

o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Machine learning algorithms such as the Multi-Layer Perceptron, decision tree, and
Naïve Bayes classifier are used for email spam filtering and malware detection.
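
As a rough sketch of how a Naïve Bayes classifier can separate spam from normal mail,
the example below counts words per class and scores a new message with add-one
smoothing. It is a toy illustration on four invented messages, not a description of
how Gmail's filters work; the function names are hypothetical.

```python
import math
from collections import Counter

# Sketch: a tiny multinomial Naive Bayes spam filter with add-one smoothing.

def train_naive_bayes(messages):
    """messages: list of (text, label) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(model, text):
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log prior + sum of log likelihoods of the known words
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


training = [                      # invented training messages
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project report for the meeting", "ham"),
]
model = train_naive_bayes(training)
print(classify(model, "free money prize"))
print(classify(model, "monday meeting report"))
```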

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google Assistant, Alexa, Cortana,
and Siri. As the name suggests, they help us find information using voice
instructions. These assistants can help us in various ways just through voice
instructions, such as playing music, calling someone, opening an email, or scheduling
an appointment.

Machine learning algorithms are an important part of these virtual assistants.

These assistants record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning makes our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, fraud can take place in
various ways, such as fake accounts, fake IDs, or money stolen in the middle of a
transaction. To detect this, a feed-forward neural network checks whether a
transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these
values become the input for the next round. Each genuine transaction follows a
specific pattern, which changes for a fraudulent transaction; the network detects this
change and so makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is
always a risk of share prices going up and down, so a long short-term memory (LSTM)
neural network is used to predict stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for diagnosing diseases. With it, medical
technology is growing very fast and is able to build 3D models that can predict the
exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and do not know the language, it is not a problem at
all, because machine learning helps us by converting the text into a language we know.
Google's GNMT (Google Neural Machine Translation) provides this feature. It is a
neural machine translation system that translates text into our familiar language,
and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning
algorithm, which is used together with image recognition to translate text from one
language to another.
Machine learning Life cycle
Machine learning has given computer systems the ability to learn automatically without
being explicitly programmed. But how does a machine learning system work? It can be
described using the machine learning life cycle, a cyclic process for building an
efficient machine learning project. The main purpose of the life cycle is to find a
solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and its
purpose. Therefore, before starting the life cycle, we need to understand the problem,
because a good result depends on a good understanding of the problem.

In the complete life cycle, to solve a problem we create a machine learning system
called a "model", and this model is created by "training" it. But to train a model we
need data; hence, the life cycle starts with collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, databases, the internet, or mobile devices. It is
one of the most important steps of the life cycle. The quantity and quality of the
collected data determine the quality of the output: the more good-quality data we
have, the more accurate the prediction is likely to be.

This step includes the following tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset.
It will be used in the further steps.
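
The three tasks above can be sketched as follows. The two sources (a CSV export and a
JSON response, both given here as small inline strings) and the field names are
hypothetical; in practice they would be real files, databases, or web services.

```python
import csv
import io
import json

# Sketch: collect records from two hypothetical sources and integrate them
# into one dataset with a common structure.
csv_source = "id,age,city\n1,34,Paris\n2,29,Rome\n"
json_source = '[{"id": 3, "age": 41, "city": "Oslo"}]'


def normalise(record):
    """Bring a raw record into one common shape with consistent types."""
    return {"id": int(record["id"]), "age": int(record["age"]), "city": record["city"]}


def load_csv(text):
    return [normalise(row) for row in csv.DictReader(io.StringIO(text))]


def load_json(text):
    return [normalise(row) for row in json.loads(text)]


dataset = load_csv(csv_source) + load_json(json_source)
for row in dataset:
    print(row)
```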

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation
is the step where we put our data into a suitable place and prepare it for use in
machine learning training.

In this step, we first put all the data together and then randomize its ordering.

This step can be further divided into two processes:


o Data exploration:

It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome. Here, we find
correlations, general trends, and outliers.

o Data pre-processing:

The next step is to preprocess the data for analysis.
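
A minimal sketch of data preparation and exploration: the rows are shuffled with a
fixed seed, then a correlation and a simple outlier check are computed. The sample
rows, the two-standard-deviation rule, and the function names are assumptions chosen
for illustration.

```python
import random
import statistics

# Sketch: randomize the ordering, then explore correlation and outliers.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def outliers(values, threshold=2.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * spread]


# Hypothetical rows: (size, price); the last price looks suspicious.
rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0), (6, 500.0)]
random.Random(42).shuffle(rows)      # randomize the ordering, reproducibly

sizes = [size for size, _ in rows]
prices = [price for _, price in rows]
print(pearson(sizes, prices))        # strength of the linear relationship
print(outliers(prices))              # candidates to inspect before training
```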

3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it into a usable
format. It involves cleaning the data, selecting the variables to use, and
transforming the data into a proper format to make it more suitable for analysis in
the next step. It is one of the most important steps of the complete process. Cleaning
the data is required to address quality issues.

Not all of the data we have collected is necessarily useful. In real-world
applications, collected data may have various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is essential to detect and remove the above issues because they can negatively
affect the quality of the outcome.
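
The kind of filtering described above can be sketched in a few lines. The records, the
field names, and the assumed valid age range of 0 to 120 are hypothetical; real
projects often use a library such as pandas for this step.

```python
# Sketch: remove missing, invalid, and duplicate records from raw data.

def clean(records):
    seen = set()
    cleaned = []
    for record in records:
        name = record.get("name")
        age = record.get("age")
        if age is None or not name:      # missing values
            continue
        if not 0 <= age <= 120:          # invalid data (assumed valid range)
            continue
        key = (name, age)
        if key in seen:                  # duplicate data
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"name": "Ana", "age": 31},
    {"name": "Ana", "age": 31},     # duplicate
    {"name": "Ben", "age": None},   # missing value
    {"name": "Cy", "age": -4},      # invalid value
    {"name": "Dee", "age": 58},
]
cleaned = clean(raw)
print(cleaned)
```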

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model that analyzes the data using
various analytical techniques, and to review the outcome. It starts with determining
the type of problem, where we select a machine learning technique such as
classification, regression, cluster analysis, or association; we then build the model
using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the
model.

5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and get a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training
is required so that the model can learn the various patterns, rules, and features.
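
As a sketch of what "training" means in practice, the example below fits a line to
data by gradient descent, one widely used training procedure (chosen here as an
assumption). The error recorded in each round shows the model improving as training
proceeds. The data and parameter values are invented.

```python
# Sketch: training y = w * x + b by gradient descent on the mean squared error.

def train(xs, ys, learning_rate=0.05, epochs=1000):
    w, b = 0.0, 0.0                      # start with an untrained model
    n = len(xs)
    history = []                         # cost recorded before each update
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w      # move against the gradient
        b -= learning_rate * grad_b
    return w, b, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # generated from y = 2x + 1

w, b, history = train(xs, ys)
print(round(w, 3), round(b, 3))
print(history[0], history[-1])           # cost at the start and at the end
```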

6. Test Model
Once our machine learning model has been trained on a given dataset, we test it. In
this step, we check the accuracy of our model by giving it a test dataset.

Testing the model determines its percentage accuracy against the requirements of the
project or problem.
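
Percentage accuracy on a test dataset can be computed as in the sketch below. The
"model" is a stand-in: a hypothetical, already-trained rule with a single threshold,
and the test examples are invented.

```python
# Sketch: measure percentage accuracy of a trained model on held-out test data.

def accuracy(model, test_set):
    correct = sum(1 for features, label in test_set if model(features) == label)
    return 100.0 * correct / len(test_set)


def threshold_model(hours_of_sun):
    """Hypothetical trained model: one learned threshold."""
    return "dry" if hours_of_sun >= 5 else "rain"


# Test examples the model has not seen during training: (hours of sun, label)
test_set = [(8, "dry"), (2, "rain"), (6, "dry"), (4, "dry"), (1, "rain")]

score = accuracy(threshold_model, test_set)
print(score)
```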

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the
model in a real-world system.

If the model prepared above produces accurate results that meet our requirements at an
acceptable speed, we deploy it in the real system. But before deploying the project,
we check whether it keeps improving its performance with the available data. The
deployment phase is similar to making the final report for a project.

Installing Anaconda and Python


To learn machine learning, we will use the Python programming language in this
tutorial. To use Python for machine learning, we need to install it on our computer
along with a compatible IDE (Integrated Development Environment).

In this topic, we will learn to install Python and an IDE with the help of the
Anaconda distribution.

The Anaconda distribution is a free and open-source platform for the Python and R
programming languages. It can be easily installed on any OS such as Windows, Linux,
and macOS. It provides more than 1500 Python/R data science packages that are suitable
for developing machine learning and deep learning models.

The Anaconda distribution installs Python together with tools such as Jupyter
Notebook, Spyder, and Anaconda Prompt. Hence it is a very convenient packaged solution
that you can easily download and install on your computer. It automatically installs
Python along with some basic IDEs and libraries.

The steps below show how to download and install Anaconda and an IDE:

Step-1: Download Anaconda Python:

o To download Anaconda, first open your favorite browser and search for "Download
Anaconda Python", then click on the first link, as shown in the image below.
Alternatively, you can download it directly from this link:
https://www.anaconda.com/distribution/#download-section.
o After clicking on the first link, you will reach the download page of Anaconda, as
shown in the image below:

o Since Anaconda is available for Windows, Linux, and macOS, you can download it for
your OS type by clicking on one of the available options shown in the image below. It
offers Python 2.7 and Python 3.7 versions; since 3.7 is the later version, we will
download the Python 3.7 version. After you click on the download option, the download
will start on your computer.

Note: In this topic, we are downloading Anaconda for Windows; you can choose the
version for your OS.

Step-2: Install Anaconda Python (Python 3.7 version):


o Once the download is complete, go to Downloads and double-click on the ".exe" file
(Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will open a setup window for
the Anaconda installation, as shown in the image below; then click on Next.
o It will open a License Agreement window; click on the "I Agree" option and move on.
o In the next window, you will get two options for installation, as shown in the
image below. Select the first option (Just me) and click on Next.

o Now you will get a window for the install location. You can leave it as the default
or change it by browsing to a location, and then click on Next. Consider the image
below:
o Now select the second option, and click on Install.

o Once the installation is complete, click on Next.
o The installation is now complete. Tick the checkbox if you want to learn more about
Anaconda and Anaconda Cloud. Click on Finish to end the process.

Note: Here, we will use the Spyder IDE to run Python programs.

Step-3: Open Anaconda Navigator

o After successfully installing Anaconda, use Anaconda Navigator to launch a Python
IDE such as Spyder or Jupyter Notebook.
o To open Anaconda Navigator, press the Windows key, search for Anaconda Navigator,
and click on it. Consider the image below:

o After opening the Navigator, launch the Spyder IDE by clicking on the Launch button
below Spyder. It will install the Spyder IDE on your system.

Run your Python program in the Spyder IDE:

o Open the Spyder IDE; it will look like the image below:

o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output in the console pane at the bottom right.
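
A first program to try could look like the sketch below. The file name
first_program.py and the function greet are only suggestions; any small script will
do to confirm that the installation works.

```python
# first_program.py - a small script to confirm that Python runs in Spyder.
import sys


def greet(name):
    return f"Hello, {name}! Python is ready for machine learning."


print(greet("world"))
print("Running Python", sys.version.split()[0])   # shows the installed version
```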

Step-4: Close the Spyder IDE.

Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are closely related parts of computer
science. They are among the most trending technologies used for creating intelligent
systems.

Although these two technologies are related, and people sometimes use the terms as
synonyms, they are still different in many respects.

On a broad level, we can differentiate AI and ML as follows:

AI is a broader concept of creating intelligent machines that can simulate human
thinking capability and behavior, whereas machine learning is an application or subset
of AI that allows machines to learn from data without being programmed explicitly.

Below are some of the main differences between AI and machine learning, along with an
overview of each.

Artificial Intelligence
Artificial intelligence is a field of computer science that builds computer systems which
can mimic human intelligence. The term is composed of two words, "Artificial" and
"intelligence", and means "a human-made thinking power." Hence we can define it as,

Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.

An artificial intelligence system does not need to be pre-programmed for every situation;
instead, it uses algorithms which can work with their own intelligence. It involves machine
learning algorithms such as reinforcement learning and deep learning neural
networks. AI is used in many places, such as Siri, Google's AlphaGo, and chess-playing
programs.

Based on capabilities, AI can be classified into three types:

o Weak AI
o General AI
o Strong AI

The systems in use today are weak AI, and general AI is still being worked toward. The
future of AI is said to be Strong AI, which is expected to be more intelligent than humans.

Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,

Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or take decisions
using historical data without being explicitly programmed. It uses large amounts of
structured and semi-structured data so that a machine learning model can generate
accurate results or give predictions based on that data.
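
As a small illustration of learning from historical data, the sketch below fits a straight line to a few past observations using ordinary least squares and then predicts a new value. The data points and the helper names fit_line and predict are made up for this example; a real project would normally rely on a library such as scikit-learn.

```python
# Toy historical data (hypothetical): years of experience -> salary in thousands.
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the learned line to a new input."""
    return slope * x + intercept


slope, intercept = fit_line(experience, salary)
print(predict(6.0, slope, intercept))  # prediction for an unseen input
```

Nothing in the program states the rule "salary = 5 * experience + 30"; the slope and intercept are recovered from the data, which is the sense in which the model is not explicitly programmed.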

Machine learning works on algorithms which learn on their own using historical data. It
works only for specific domains: if we create a machine learning model to detect pictures
of dogs, it will only give useful results for dog images, and if we provide new data such
as a cat image, its output will not be reliable. Machine learning is used in various
places, such as online recommender systems, Google search algorithms, email spam
filters, and Facebook's automatic friend-tagging suggestions.

It can be divided into three types:

o Supervised learning
o Reinforcement learning
o Unsupervised learning

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

Artificial Intelligence | Machine learning

Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without programming explicitly.

The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.

In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.

Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.

AI has a very wide range of scope. | Machine learning has a limited scope.

AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

AI system is concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns.

The main applications of AI are Siri, customer support using chatbots, Expert System, Online game playing, intelligent humanoid robot, etc. | The main applications of machine learning are Online recommender system, Google search algorithms, Facebook auto friend tagging suggestions, etc.

On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types that are Supervised learning, Unsupervised learning, and Reinforcement learning.

It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced with new data.

AI completely deals with Structured, semi-structured, and unstructured data. | Machine learning deals with Structured and semi-structured data.

How to get datasets for Machine Learning


The field of ML depends heavily on datasets for training models and making accurate
predictions. Datasets play a vital part in the success of AI/ML projects and are
fundamental to becoming a skilled data scientist. In this article, we will
explore the various types of datasets used in machine learning and give a detailed guide
on where to find them.

What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset can
contain anything from a simple array to a database table. The table below shows an
example of a dataset:

Country    Age    Salary    Purchased
India      38     48000     No
France     43     45000     Yes
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
India      35     58000     Yes

A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to one record of the
dataset. The most widely supported file type for a tabular dataset is the
comma-separated values (CSV) file. To store tree-like data, a JSON file is usually more
convenient.
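
To make the difference concrete, here is a minimal sketch using only the Python standard library. It reads three rows of the example table from CSV text and then round-trips a nested record through JSON. The nested record is a made-up example, not part of the tutorial's dataset.

```python
import csv
import io
import json

# Three rows of the example table written as CSV text.
# The empty field in the last row is the missing Salary value.
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,40,,Yes
"""

# Each row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["Age"]) for row in rows]  # one column = one variable
print(rows[0]["Country"], ages)

# Tree-like (nested) data does not fit a flat table well, but JSON handles it.
record = {"country": "India", "customers": [{"age": 38, "purchases": ["book", "pen"]}]}
json_text = json.dumps(record)
restored = json.loads(json_text)
print(restored["customers"][0]["purchases"])
```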

Types of data in datasets

o Numerical data: Such as house price, temperature, etc.
o Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: Similar to categorical data, but the categories have a meaningful
order, so they can be compared with one another.
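
The three kinds of data support different operations, which the following sketch tries to show. The house records and the condition_rank mapping are hypothetical and chosen only for illustration.

```python
# Hypothetical records with one value of each kind.
houses = [
    {"price": 250000.0, "has_garden": "Yes", "condition": "good"},
    {"price": 180000.0, "has_garden": "No", "condition": "poor"},
    {"price": 320000.0, "has_garden": "Yes", "condition": "excellent"},
]

# Numerical data supports arithmetic.
average_price = sum(h["price"] for h in houses) / len(houses)

# Categorical data supports only equality checks and counting.
with_garden = sum(1 for h in houses if h["has_garden"] == "Yes")

# Ordinal data has an order, made explicit here with a rank for each level.
condition_rank = {"poor": 0, "good": 1, "excellent": 2}
best = max(houses, key=lambda h: condition_rank[h["condition"]])

print(average_price, with_garden, best["condition"])
```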

Note: Real-world datasets are often huge, which makes them difficult to manage and
process at first. Therefore, to practice machine learning algorithms, we can use
any dummy dataset.

Types of datasets
Machine learning spans different domains, each requiring specific types of datasets.
A few common types of datasets used in machine learning include:

Image Datasets:
Image datasets contain a collection of images and are commonly used in computer
vision tasks such as image classification, object detection, and image segmentation.

Examples:

o ImageNet
o CIFAR-10
o MNIST

Text Datasets:
Text datasets consist of textual information, like articles, books, or social media
posts. These datasets are used in NLP tasks like sentiment analysis, text
classification, and machine translation.

Examples:

o Project Gutenberg dataset
o IMDb movie reviews dataset

Time Series Datasets:


Time series datasets consist of data points collected over time. They are
generally used in forecasting, anomaly detection, and trend analysis.

Examples:

o Stock market data
o Weather data
o Sensor readings

Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets.
They contain rows representing instances or samples and columns representing features or
attributes. Tabular datasets are used for tasks like regression and classification. The
dataset given earlier in the article is an example of a tabular dataset.

Need of Dataset
o Well-prepared and pre-processed datasets are important for machine learning
projects.
o They provide the foundation for training accurate and reliable models. However,
working with large datasets can present challenges in terms of management and
processing.
o To address these challenges, efficient data management techniques and
processing algorithms are required.

Data Pre-processing:
Data pre-processing is a fundamental stage in preparing datasets for machine learning. It
involves transforming raw data into a format suitable for model training. Common
pre-processing techniques include data cleaning to remove inconsistencies or
errors, normalization to scale data within a specific range, feature scaling to
ensure features have comparable ranges, and handling missing values
through imputation or removal.
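
As one concrete instance of scaling data within a specific range, the sketch below applies min-max normalization, which maps the smallest value to 0 and the largest to 1. The function name min_max_scale is an illustrative choice, and the ages are taken from the example table shown earlier.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [38, 43, 30, 48, 40, 35]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```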

During the development of an ML project, developers rely heavily on the
datasets. In building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset

Note: These datasets can be large, so a fast internet connection helps when
downloading them.

Training Dataset and Test Dataset:


In machine learning, datasets are typically divided into two parts: the training
dataset and the test dataset. The training dataset is used to train the machine
learning model, while the test dataset is used to evaluate the model's performance. This
division assesses the model's ability to generalize to unseen data. It is essential
to ensure that the datasets are representative of the problem domain and appropriately
split to avoid bias or overfitting.

Popular sources for Machine Learning datasets


Below is a list of dataset sources that are freely available to the public:

1. Kaggle Datasets
Kaggle is one of the best sources of datasets for data scientists and machine learning
practitioners. It allows users to find, download, and publish datasets in an easy way. It
also provides the opportunity to work with other machine learning engineers and solve
difficult data science tasks.

Kaggle provides high-quality datasets in different formats that we can easily find and
download.

The link for the Kaggle datasets is https://www.kaggle.com/datasets.

2. UCI Machine Learning Repository


The UCI Machine Learning Repository is an important resource that has been widely used
by researchers and practitioners since 1987. It contains a large collection of
datasets sorted by machine learning tasks such as regression, classification, and
clustering. Notable datasets in the repository include the Iris dataset, the Car
Evaluation dataset, and the Poker Hand dataset.

The link for the UCI machine learning repository
is https://archive.ics.uci.edu/ml/index.php.

3. Datasets via AWS


We can search, download, access, and share the datasets that are publicly available via
AWS resources. These datasets can be accessed through AWS resources but are provided and
maintained by different government organizations, researchers, businesses, or individuals.

Anyone can analyze and build various services using shared data via AWS resources. The
shared datasets on the cloud help users spend more time on data analysis rather than on
data acquisition.

This source provides various types of datasets with examples and ways to use each
dataset. It also provides a search box with which we can search for the required dataset.
Anyone can add a dataset or example to the Registry of Open Data on AWS.

The link for the resource is https://registry.opendata.aws/.

4. Google's Dataset Search Engine


Google's Dataset Search helps researchers find and access relevant datasets from
different sources across the web. It indexes datasets from areas like social sciences,
science, and environmental science. Researchers can use keywords to find datasets, filter
results based on specific criteria, and access the datasets directly from the
source.

The link for the Google dataset search engine
is https://toolbox.google.com/datasetsearch.

5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing,
computer vision, and domain-specific sciences. It gives access to diverse and
curated datasets that can be valuable for machine learning projects.

The link to download or use the datasets from this resource is https://msropendata.com/.

6. Awesome Public Dataset Collection


The Awesome Public Datasets collection provides high-quality datasets arranged in a
well-organized list according to topics such as Agriculture, Biology,
Climate, Complex networks, etc. Most of the datasets are available for free, but some are
not, so it is better to check the license before downloading a dataset.

The link to download datasets from the Awesome Public Datasets collection
is https://github.com/awesomedata/awesome-public-datasets.

7. Government Datasets
There are different sources of government-related data. Various countries publish data
collected from their different departments for public use.

The goal of providing these datasets is to increase the transparency of government work
among the people and to allow the data to be used in innovative ways. Below are some
sources of government datasets:

o Indian Government dataset
o US Government Dataset
o Northern Ireland Public Sector Datasets
o European Union Open Data Portal

8. Computer Vision Datasets


VisualData provides a number of great datasets that are specific to computer
vision tasks such as image classification, video classification, image segmentation, etc.
Therefore, if you want to build a project on deep learning or image processing, you
can refer to this source.

The link for downloading datasets from this source is https://www.visualdata.io/.

9. Scikit-learn dataset
Scikit-learn, a well-known machine learning library in Python, provides a few built-in
datasets for practice and experimentation. These datasets are accessible through the
scikit-learn API and can be used for learning different machine learning
algorithms. Scikit-learn offers both toy datasets, which are small and simplified, and
real-world datasets with greater complexity. Examples of scikit-learn datasets
include the Iris dataset, the Boston Housing dataset (removed from the library in
version 1.2), and the Wine dataset.

The link to download datasets from this source
is https://scikit-learn.org/stable/datasets/index.html.

Data Ethics and Privacy:


Data ethics and privacy are critical considerations in machine learning projects. It is
essential to ensure that data is collected and used ethically, respecting privacy
rights and complying with relevant laws and regulations. Data professionals should take
measures to protect data privacy, obtain proper consent, and handle sensitive data
responsibly. Resources such as ethical guidelines and privacy frameworks can provide
guidance on maintaining ethical practices in data collection and use.

Conclusion:
In conclusion, datasets form the foundation of successful machine learning projects.
Understanding the various kinds of datasets, the importance of data pre-processing, and
the role of training and testing datasets are key steps towards building effective models.
By using well-known sources such as Kaggle, UCI Machine Learning Repository,
AWS, Google's Dataset Search, Microsoft Datasets, and government datasets, data
scientists and practitioners can access a wide variety of datasets for their machine
learning projects. It is essential to consider data ethics and privacy throughout the
entire data lifecycle to ensure responsible and ethical use of data. With the right
datasets and ethical practices, machine learning models can achieve accurate predictions
and drive valuable insights.

Data Preprocessing in Machine learning


Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine learning
model.

When creating a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with data, it is necessary to clean it and
put it in a consistent format. For this, we use data preprocessing.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing is
the set of tasks required for cleaning the data and making it suitable for a machine
learning model, which also increases the accuracy and efficiency of the model.

It involves the following steps:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The data collected for a particular problem in
a proper format is known as the dataset.

Datasets may come in different formats for different purposes. For example, the dataset
for a machine learning model built for a business purpose will be different from the
dataset required for a model about liver patients. So each dataset is different from
another dataset. To use the dataset in our code, we usually put it into a CSV file.
However, sometimes, we may also need to use an HTML or xlsx file.

What is a CSV File?


CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and programs can
read these datasets easily.

Here we will use a demo dataset for data preprocessing. For practice, it can be
downloaded from https://www.superdatascience.com/pages/machine-learning.
For real-world problems, we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data using various APIs with Python and
putting that data into a .csv file.
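
A minimal sketch of that last step is shown below, using only the standard library. The records stand in for data gathered from an API and are made up; the file is written to a temporary directory so the example does not touch any project files.

```python
import csv
import os
import tempfile

# Records as they might come back from an API call (made-up values).
records = [
    {"Country": "India", "Age": 38, "Salary": 68000, "Purchased": "No"},
    {"Country": "France", "Age": 43, "Salary": 45000, "Purchased": "Yes"},
]

# Write the records to Dataset.csv inside a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "Dataset.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Country", "Age", "Salary", "Purchased"])
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm what was written.
with open(csv_path, newline="") as f:
    print(f.read())
```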

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation
in the code. It is the fundamental package for scientific calculation in Python. It also
supports large, multidimensional arrays and matrices. In Python, we can import
it as:

    import numpy as np

Here we have used np, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library. With
this library, we need to import the sub-library pyplot. It is used to plot any
type of chart in Python. It will be imported as below:

    import matplotlib.pyplot as plt

Here we have used plt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:

    import pandas as pd

Here, we have used pd as a short name for this library. Consider the below image:

3) Importing the Datasets


Now we need to import the dataset which we have collected for our machine learning
project. But before importing a dataset, we need to set the directory that contains it as
the working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press the F5 key or click the Run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required
dataset.

Here, in the below image, we can see the Python file along with the required dataset. Now,
the current folder is set as the working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library,
which reads a CSV file into a DataFrame. Using this function, we can read a CSV file
locally as well as through a URL.

We can use the read_csv function as below:

    data_set = pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the
function, we have passed the name of our dataset file. Once we execute the above line of
code, it will import the dataset into our code. We can also check the imported dataset by
clicking on the variable explorer section and then double-clicking on data_set. Consider
the below image:

As in the above image, indexing starts from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) from the dependent variable in the dataset. In our dataset, there are three
independent variables, Country, Age, and Salary, and one dependent
variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[] indexer of the Pandas
library. It is used to extract the required rows and columns from the dataset by position.

    x = data_set.iloc[:, :-1].values

In the above code, the part before the comma (:) takes all the rows, and the part after
the comma (:-1) takes all the columns except the last one. We leave out the last column
because it contains the dependent variable. So by doing this, we will get the matrix of
features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract the dependent variable, again, we will use the Pandas iloc[] indexer.

    y = data_set.iloc[:, 3].values

Here we have taken all the rows with the last column only. It will give the array of
the dependent variable.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
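
The slicing used for x and y can be pictured with plain Python lists. The sketch below is only an analogy for the positional indexing of iloc and does not use pandas; the three rows are taken from the output above.

```python
# A few rows of the dataset as plain Python lists (last column is Purchased).
rows = [
    ["India", 38.0, 68000.0, "No"],
    ["France", 43.0, 45000.0, "Yes"],
    ["Germany", 30.0, 54000.0, "No"],
]

# Analogue of data_set.iloc[:, :-1]: every row, all columns but the last.
x = [row[:-1] for row in rows]

# Analogue of data_set.iloc[:, 3]: every row, only the column at index 3.
y = [row[3] for row in rows]

print(x)
print(y)
```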

Note: If you are using Python language for machine learning, then extraction is mandatory,
but for R language it is not required.

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the dataset. If our
dataset contains missing data, it may create a huge problem for our machine
learning model. Hence it is necessary to handle the missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values.
Here, we just delete the specific row or column which contains null values. But
this way is not very efficient, and removing data may lead to a loss of information, which
can make the output less accurate.

By calculating the mean: In this way, we calculate the mean of the column or row
which contains a missing value and put it in the place of the missing value. This strategy
is useful for features which have numeric data such as age, salary, year, etc. Here, we
will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains
various modules for building machine learning models. Here we will use the SimpleImputer
class of the sklearn.impute module. (The Imputer class of sklearn.preprocessing, used for
this purpose in older code, was removed in scikit-learn 0.22.) Below is the code for it:

    # handling missing data (replacing missing data with the mean value)
    import numpy as np
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    # Fitting imputer object to the Age and Salary columns of x
    imputer = imputer.fit(x[:, 1:3])
    # Replacing missing data with the calculated mean value
    x[:, 1:3] = imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, each missing value has been replaced with the mean
of the remaining values in its column.
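
To see where the filled-in numbers come from, the following sketch recomputes them in plain Python. It illustrates the mean strategy only and is not scikit-learn's implementation; the helper name impute_mean is an assumption made for this example.

```python
import math

# Age and Salary columns from the example; nan marks a missing value.
ages = [38.0, 43.0, 30.0, 48.0, 40.0, 35.0, math.nan, 49.0, 50.0, 37.0]
salaries = [68000.0, 45000.0, 54000.0, 65000.0, math.nan,
            58000.0, 53000.0, 79000.0, 88000.0, 77000.0]


def impute_mean(column):
    """Replace every nan with the mean of the non-missing values."""
    present = [v for v in column if not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in column]


filled_ages = impute_mean(ages)
filled_salaries = impute_mean(salaries)
print(filled_ages[6])      # the filled-in age (370 / 9)
print(filled_salaries[4])  # the filled-in salary (587000 / 9)
```

The nine known ages sum to 370 and the nine known salaries sum to 587000, which gives the values of about 41.11 and 65222.22 seen in the output above.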

5) Encoding Categorical data:


Categorical data is data which takes values from a fixed set of categories. In our
dataset, there are two categorical variables, Country and Purchased.

A machine learning model works entirely on mathematics and numbers, so a categorical
variable in our dataset may create trouble while building the
model. It is therefore necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country names into numeric labels. To do this, we will
use the LabelEncoder() class from the preprocessing library.

    # Categorical data
    # for Country variable
    from sklearn.preprocessing import LabelEncoder
    label_encoder_x = LabelEncoder()
    x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This
class has successfully encoded the country names into digits.

But in our case, there are three country categories, and as we can see in the above
output, they are encoded into 0, 1, and 2. From these values, the machine learning model
may assume that there is an order or numeric relationship between the countries, which
will produce the wrong output. So to remove this issue, we will use dummy encoding.
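
The mapping behind these codes can be sketched in a few lines of plain Python. Like LabelEncoder, the sketch assigns codes in sorted order of the category names; the variable names are illustrative.

```python
countries = ["India", "France", "Germany", "France", "Germany",
             "India", "Germany", "France", "India", "France"]

# Sorted unique categories get the codes 0, 1, 2, ...
categories = sorted(set(countries))
code_of = {name: code for code, name in enumerate(categories)}
encoded = [code_of[name] for name in countries]

print(code_of)
print(encoded)
```

France, Germany, and India receive 0, 1, and 2 only because of alphabetical order, which is why the codes should not be read as a ranking.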

Dummy Variables:
Dummy variables are variables which have the value 0 or 1. The value 1 marks the
presence of a category in its own column, and the remaining columns are 0. With dummy
encoding, we will have a number of columns equal to the number of categories.

In our dataset, we have 3 categories, so it will produce three columns having 0 and 1
values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing
library, applied to the first column through ColumnTransformer. (The categorical_features
argument of OneHotEncoder, used in older code, was removed in scikit-learn 0.22.)

    # for Country variable
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    label_encoder_x = LabelEncoder()
    x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
    # Encoding for dummy variables (column 0 only; Age and Salary pass through)
    column_transformer = ColumnTransformer(
        [('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
    x = column_transformer.fit_transform(x).astype(float)

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,


6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, the country variable is encoded into 0 and 1 values
and split across three columns, one per country.
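
The idea of dummy encoding can also be sketched without any library. The helper one_hot below is a hypothetical name used only for this illustration; the column order follows the sorted category names, as in the output above.

```python
countries = ["India", "France", "Germany", "France"]
categories = sorted(set(countries))  # ['France', 'Germany', 'India']


def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(c, categories) for c in countries]
for country, row in zip(countries, encoded):
    print(country, row)
```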

It can be seen more clearly in the variable explorer section, by clicking on the x option as:

For Purchased Variable:

    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)

For the second categorical variable, we only use a labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the
Purchased variable has only two categories, No and Yes, which are automatically
encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen as:


6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a
test set. This is one of the crucial steps of data preprocessing, because it lets us
measure how well the model performs on data it has not seen during training.

Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. The model will then have difficulty, because the
correlations it learned from the training data may not hold for the new data.

If we train our model very well and its training accuracy is very high, its performance
may still decrease when we provide a new dataset to it. So we always try to make a
machine learning model which performs well with the training set and also with the test
dataset. Here, we can define these datasets as:

Training Set: A subset of the dataset used to train the machine learning model, for which
we already know the output.

Test set: A subset of the dataset used to test the machine learning model; the model
predicts the output for the test set.

For splitting the dataset, we will use the below lines of code:

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Explanation:

o In the above code, the first line imports train_test_split, which splits arrays of
the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
    o x_train: features for the training data
    o x_test: features for the testing data
    o y_train: dependent variable for the training data
    o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four arguments. The first two
are the arrays of data, and test_size specifies the size of the test set. The
test_size may be .5, .3, or .2, which gives the fraction of the data that goes into
the test set.
o The last parameter, random_state, sets a seed for the random generator so
that you always get the same split. Commonly used values are 0 (as here) and 42.
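
The roles of test_size and the seed can be illustrated with a small stand-alone sketch. The function split_train_test is a hypothetical helper written for this example; it follows the same idea as train_test_split but does not reproduce scikit-learn's exact shuffling, so the rows it selects will differ.

```python
import random


def split_train_test(x, y, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(x) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


x = [[i, i * 10] for i in range(10)]  # ten made-up feature rows
y = [i % 2 for i in range(10)]        # ten made-up labels

x_train, x_test, y_train, y_test = split_train_test(x, y, test_size=0.2, seed=0)
print(len(x_train), len(x_test))
```

Calling the function again with the same seed returns the same split, while a different seed generally gives a different one.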

Output:
By executing the above code, we will get 4 different variables, which can be seen under
the variable explorer section.

As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values.

7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to bring the independent variables of the dataset to a comparable range. In
feature scaling, we put our variables on the same scale so that no
variable dominates the others.

Consider the below dataset:


As we can see, the age and salary column values are not on the same scale. Many machine
learning models, such as k-nearest neighbors, are based on Euclidean distance, and if we
do not scale the variables, it will cause issues in such models.

Euclidean distance between two points A = (x_1, y_1) and B = (x_2, y_2) is given as:

$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

If we compute the distance between any two rows using age and salary, the salary values
will dominate the age values, and it will produce an incorrect result. So to remove this
issue, we need to perform feature scaling for machine learning.
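
A quick calculation shows how strongly salary dominates. The sketch below takes two rows of the example as (age, salary) points and measures how much of the squared Euclidean distance comes from the age difference.

```python
import math

# Two rows from the example as (age, salary) points.
a = (48.0, 65000.0)
b = (30.0, 54000.0)

# Euclidean distance: sqrt((48 - 30)^2 + (65000 - 54000)^2).
distance = math.hypot(a[0] - b[0], a[1] - b[1])

age_part = (a[0] - b[0]) ** 2     # squared age difference
salary_part = (a[1] - b[1]) ** 2  # squared salary difference
age_share = age_part / (age_part + salary_part)

print(distance)
print(age_share)  # fraction of the squared distance that is due to age
```

An age gap of 18 years contributes only a few millionths of the squared distance, so without scaling the model would effectively ignore age.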

There are two ways to perform feature scaling in machine learning:

o Standardization
o Normalization

Here, we will use the standardization method for our dataset.
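
For reference, the two methods are commonly defined as follows, where x is a feature value, \mu and \sigma are the mean and standard deviation of that feature, and x_{\min} and x_{\max} are its smallest and largest values. These are the usual textbook definitions, given here as an assumption about what the two names refer to.

```latex
x'_{\text{standardized}} = \frac{x - \mu}{\sigma}
\qquad
x'_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Standardization produces values with zero mean and unit standard deviation, while normalization maps values into the range 0 to 1.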

For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing
library as:

    from sklearn.preprocessing import StandardScaler

Now, we will create an object of the StandardScaler class for the independent variables
or features. Then we will fit and transform the training dataset.

    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)

For the test dataset, we will directly apply the transform() function instead
of fit_transform(), because the scaler has already been fitted on the training set and
the test data must be scaled with the same mean and standard deviation.

    x_test = st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and
x_test.

In the scaled output, each column of x_train has zero mean and unit variance, so all the
variables are on a comparable scale.

Note: Here, we have not scaled the dependent variable because it has only two values, 0
and 1. If the dependent variable took a wide range of values, we might also need to scale
it.
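
The fit-on-train, transform-on-test pattern can be sketched on a tiny made-up column as follows. The arrays x_train_demo and x_test_demo are hypothetical and only serve to show that the scaler reuses the training mean and that standardized values can fall outside -1 to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, invented for illustration.
x_train_demo = np.array([[20.0], [30.0], [40.0], [50.0], [100.0]])
x_test_demo = np.array([[35.0], [120.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
x_train_scaled = scaler.fit_transform(x_train_demo)

# transform reuses those training statistics; nothing is learned from the test data.
x_test_scaled = scaler.transform(x_test_demo)

print("training mean used for scaling:", scaler.mean_)
print(x_train_scaled.ravel())
print(x_test_scaled.ravel())
```

The largest training value and the second test value both end up above 1, which illustrates that standardization centres and rescales the data without clipping it to a fixed interval.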

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more
understandable.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set= pd.read_csv('Dataset.csv')

# extracting the independent variables
x= data_set.iloc[:, :-1].values

# extracting the dependent variable
y= data_set.iloc[:, 3].values

# handling missing data (replacing missing values with the column mean)
# SimpleImputer is used here because the older Imputer class was removed from scikit-learn
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')

# fitting the imputer object to the numeric independent variables in x
imputer= imputer.fit(x[:, 1:3])

# replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

# encoding the Country variable as dummy variables
# ColumnTransformer is used because OneHotEncoder no longer accepts categorical_features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
onehot_encoder= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= onehot_encoder.fit_transform(x).astype(float)

# encoding the Purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

# feature scaling (fit on the training set only, then reuse on the test set)
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there
are some steps or lines of code which are not necessary for all machine learning models.
So we can exclude them from our code to make it reusable for all models.

Supervised Machine Learning


Supervised learning is a type of machine learning in which machines are trained using
well "labelled" training data, and on the basis of that data, machines predict the output.
Labelled data means that the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as
a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on
test data (a held-out subset of the labelled data that was not used for training), and then
it predicts the output.

The working of supervised learning can be easily understood by the below example and
diagram:

Suppose we have a dataset of different types of shapes which includes squares, triangles,
and other polygons such as hexagons. Now the first step is that we need to train the model
for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled
as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the labelled dataset into a training dataset, a test dataset, and a validation
dataset.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets
to tune the control parameters; these are subsets held out from the training data.
o Evaluate the accuracy of the model by providing the test set. If the model predicts
the correct output for most of the test examples, our model is accurate.
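
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the original text: it uses scikit-learn's built-in Iris dataset, a decision tree as the algorithm, a 60/20/20 split, and the tree depth as the control parameter tuned on the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Collect labelled data (assumption: the built-in Iris dataset stands in for our data).
X, y = load_iris(return_X_y=True)

# Split into training (60%), validation (20%), and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Use the validation set to choose a control parameter (here, the tree depth).
best_depth, best_val_accuracy = None, -1.0
for depth in (1, 2, 3, 4, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_accuracy = accuracy_score(y_val, model.predict(X_val))
    if val_accuracy > best_val_accuracy:
        best_depth, best_val_accuracy = depth, val_accuracy

# Train the chosen model and evaluate it once on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print("chosen depth:", best_depth)
print("test accuracy:", test_accuracy)
```

Keeping the test set out of both training and parameter tuning is what makes the final accuracy a fair estimate of how the model behaves on new data.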

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when the output variable is continuous and there is a
relationship between the input variable and the output variable. They are used for the
prediction of continuous variables, such as in weather forecasting, market trends, etc.
Below are some popular regression algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification
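Classification algorithms are used when the output variable is categorical, which means
it takes one of a fixed set of classes, such as Yes-No, Male-Female, True-False, etc.
There can also be more than two classes. A typical example is spam filtering. Below are
some popular classification algorithms which come under supervised learning: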

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Note: We will discuss these algorithms in detail in later chapters.
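
As a small illustration of classification, the sketch below trains a decision tree on a made-up one-feature dataset with three classes and predicts the class of new values. The measurements and the labels "small", "medium", and "large" are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical measurements with a categorical label for each one.
sizes = np.array([[1.0], [2.0], [3.0], [11.0], [12.0], [13.0], [21.0], [22.0], [23.0]])
labels = np.array(["small", "small", "small",
                   "medium", "medium", "medium",
                   "large", "large", "large"])

# Train the classifier on the labelled examples.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(sizes, labels)

# Predict a class (not a number) for new, unseen inputs.
predictions = classifier.predict(np.array([[2.5], [12.5], [30.0]]))
print(predictions)
```

Unlike regression, the output here is always one of the known class labels, and the example shows that a classifier is not limited to two classes.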

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis
of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models can struggle with very complex tasks.
o Supervised learning may not predict the correct output if the test data differs
substantially from the training dataset.
o Training can require a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning


In the previous topic, we learned about supervised machine learning, in which models are
trained using labeled data. But there may be many cases in which we do not have labeled
data and need to find the hidden patterns in the given dataset. So, to solve such types of
cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a labeled training dataset. Instead, the models themselves
find the hidden patterns and insights in the given data. It can be compared to the learning
which takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using
an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of the dataset,
group that data according to similarities, and represent that dataset in a compressed
format.
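
One way to picture "representing the dataset in a compressed format" is dimensionality reduction. The sketch below is an illustration under assumed conditions: it generates synthetic three-column data in which all columns depend on one hidden value, and uses scikit-learn's PCA to compress each row to a single number.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): three columns that all
# depend on one hidden value t, plus a small amount of noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
data = np.column_stack([
    t,
    2.0 * t + rng.normal(scale=0.05, size=200),
    -t + rng.normal(scale=0.05, size=200),
])

# Compress each 3-value row into a single value without using any labels.
pca = PCA(n_components=1)
compressed = pca.fit_transform(data)

print("original shape:", data.shape)
print("compressed shape:", compressed.shape)
print("variance kept:", pca.explained_variance_ratio_[0])
```

Because the three columns are strongly related, one component keeps nearly all of the variation, so the dataset can be stored in a third of the space with little loss.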

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never told which
images show cats and which show dogs, which means it does not have any prior idea about
the features of the dataset. The task of the unsupervised learning algorithm is to identify
the image features on its own. The unsupervised learning algorithm will perform this task
by clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?


Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
it especially important.
o In the real world, we do not always have input data with the corresponding output, so
to solve such cases, we need unsupervised learning.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and the
corresponding outputs are not given. Now, this unlabeled input data is fed to the
machine learning model in order to train it. Firstly, it will interpret the raw data to find the
hidden patterns in the data and then will apply suitable algorithms such as k-means
clustering, hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
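
A minimal sketch of this process, assuming k-means clustering from scikit-learn and six made-up two-dimensional points, might look as follows. No labels are given to the algorithm; it only receives the points and the number of groups to form.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled points that form two visibly separate groups.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],
])

# Ask k-means to divide the points into two groups based on similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print("cluster of each point:", cluster_ids)
print("cluster centers:", kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is that nearby points receive the same cluster number and distant points receive different ones.
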

Types of Unsupervised Learning Algorithm:

Unsupervised learning algorithms can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them according to the presence or absence
of those commonalities.
o Association: Association rule learning is an unsupervised learning method which is
used for finding relationships between variables in a large database. It determines
the sets of items that occur together in the dataset. Association rules make
marketing strategy more effective. For example, people who buy item X (suppose
bread) also tend to purchase item Y (butter or jam). A typical example of
association rule learning is Market Basket Analysis.

Note: We will learn these algorithms in later chapters.
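
To give a feel for how an association rule such as "bread implies butter" is measured, the sketch below computes two commonly used quantities, support and confidence, in plain Python. The transactions and the helper functions support() and confidence() are hypothetical and written only for this illustration.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


bread_butter_support = support({"bread", "butter"}, transactions)
bread_to_butter_confidence = confidence({"bread"}, {"butter"}, transactions)

print("support(bread, butter):", bread_butter_support)
print("confidence(bread -> butter):", bread_to_butter_confidence)
```

Here bread and butter appear together in 3 of the 5 baskets, and 3 of the 4 baskets containing bread also contain butter, so the rule has a support of about 0.6 and a confidence of about 0.75.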

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors), which is more commonly used as a supervised algorithm
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning as it
does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input
data is not labeled, and algorithms do not know the exact output in advance.

Difference between Supervised and Unsupervised Learning

Supervised and unsupervised learning are two techniques of machine learning. They are
used in different scenarios and with different datasets. Below, both learning methods are
explained, followed by a table of their differences.

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are trained
using labeled data. In supervised learning, models need to find the mapping
function to map the input variable (X) to the output variable (Y).

Supervised learning needs supervision to train the model, which is similar to how a
student learns things in the presence of a teacher. Supervised learning can be used
for two types of problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly. So
to identify the fruit in supervised learning, we will give the input data as well as the
output for it, which means we will train the model with the shape, size, color, and
taste of each fruit together with its name. Once the training is completed, we will test
the model by giving it a new set of fruits. The model will identify the fruit and predict
the output using a suitable algorithm.

Unsupervised Machine Learning:

Unsupervised learning is another machine learning method in which patterns are
inferred from unlabeled input data. The goal of unsupervised learning is to find
the structure and patterns in the input data. Unsupervised learning does not
need any supervision. Instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of problems: Clustering and
Association.

Example: To understand unsupervised learning, we will use the fruit example given
above. Unlike supervised learning, here we will not provide any supervision to
the model. We will just provide the input dataset to the model and allow the model
to find the patterns in the data. With the help of a suitable algorithm, the model
will train itself and divide the fruits into different groups according to the most
similar features between them.

The main differences between Supervised and Unsupervised learning are given
below:

| Supervised Learning | Unsupervised Learning |
|---|---|
| Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data. |
| Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback. |
| Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data. |
| In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model. |
| The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset. |
| Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model. |
| Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems. |
| Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data. |
| Supervised learning model generally produces a more accurate result. | Unsupervised learning model may give a less accurate result as compared to supervised learning. |
| Supervised learning is not close to true Artificial Intelligence, as in this we first train the model for each data, and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience. |
| It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering and the Apriori algorithm. |

Note: Supervised and unsupervised learning are both machine learning methods, and the
choice between them depends on factors related to the structure and volume of your
dataset and the use cases of the problem.
