Machine Learning
The Machine Learning tutorial covers both the fundamentals and the more complex ideas of machine learning. Both students and working professionals can benefit from this machine learning tutorial.
You will learn about the many different methods of machine learning, including
reinforcement learning, supervised learning, and unsupervised learning, in this machine
learning tutorial. Regression and classification models, clustering techniques, hidden
Markov models, and various sequential models will all be covered.
Machine learning algorithms create a mathematical model that, without being explicitly programmed, aids in making predictions or decisions with the assistance of sample historical data, known as training data. For the purpose of developing predictive models, machine learning brings together statistics and computer science. Algorithms that learn from historical data are either constructed or utilized in machine learning. Performance generally rises with the quantity of information we provide.
A machine can learn if it can gain more data to improve its performance.
How does Machine Learning work?
A machine learning system builds prediction models, learns from previous data, and predicts the output for new data whenever it receives it. The more data it has, the better the model it can build and the more accurate the predicted output will be.
Let's say we have a complex problem in which we need to make predictions. Instead of
writing code, we just need to feed the data to generic algorithms, which build the logic
based on the data and predict the output. Our perspective on the issue has changed as a
result of machine learning. The Machine Learning algorithm's operation is depicted in the
following block diagram:
We can train machine learning algorithms by providing them with a large amount of data and letting them automatically explore the data, build models, and predict the required output. The performance of a machine learning algorithm depends on the amount of data, and it can be determined by the cost function. Machine learning can save us both time and money.
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system
for training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns
about each one. After the training and processing are done, we test the model with
sample data to see if it can accurately predict the output.
The mapping of the input data to the output data is the objective of supervised learning.
Supervised learning relies on supervision; it is comparable to a student learning under the guidance of a teacher. Spam filtering is an example of supervised learning.
o Classification: This is a supervised machine learning method where the model tries to predict the correct label for a given input; the output is always discrete.
o Regression: This is a supervised machine learning technique used to predict continuous values (both are illustrated in the sketch after this list).
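Both ideas can be sketched in a few lines with scikit-learn. This example is not part of the original tutorial; the toy data and model choices are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])     # one input feature
y_class = np.array([0, 0, 0, 1, 1, 1])                       # discrete labels -> classification
y_reg = np.array([1.9, 4.1, 6.2, 7.8, 10.1, 12.0])           # continuous target -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

print(clf.predict([[2.5]]))  # predicts a discrete class label
print(reg.predict([[2.5]]))  # predicts a continuous value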
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms: clustering and association.
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
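To make the reward-and-penalty idea concrete, here is a tiny tabular Q-learning sketch. It is not from the original text; the 5-state corridor environment is made up for illustration.

import numpy as np

# Hypothetical toy environment: 5 states in a row; reaching state 4 ends the
# episode with reward +1, every other step gives 0. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != 4:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0  # reward only for reaching the goal
        # Q-learning update rule.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy: states 0-3 should prefer "right" (1)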
Note: We will learn about the above types of machine learning in detail in later chapters.
History of Machine Learning
o 1834: Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. The machine was never built, but all modern computers rely on its logical structure.
o 1936: Alan Turing published a theory of how a machine can determine and execute a set of instructions.
o 1940s: "ENIAC", the first electronic general-purpose computer, was invented. After that, stored-program computers such as EDSAC in 1949 and EDVAC in 1951 were invented.
o 1943: A human neural network was modeled with an electrical circuit. In 1950, scientists started applying the idea and analyzing how human neurons might work.
o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play checkers. It performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
o 1974-1980: This period was a tough time for AI and ML researchers and is known as the AI winter. During this period, machine translation failed, people lost interest in AI, and government funding for research was reduced.
o 1959: The first neural network was applied to a real-world problem: removing echoes over phone lines using an adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural
network NETtalk, which was able to teach itself how to correctly pronounce 20,000
words in one week.
o 1997: The IBM's Deep blue intelligent computer won the chess game against the
chess expert Garry Kasparov, and it became the first computer which had beaten a
human chess expert.
o 2006: Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
o 2006: Amazon launched the Elastic Compute Cloud (EC2) to provide scalable computing resources, which made it easier to create and deploy machine learning models.
o 2017: The goal of explainable AI, which focuses on making machine learning models easier to understand, received attention. Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-playing ability without human data, using only reinforcement learning.
Present-day machine learning models can be used for making various predictions, including weather prediction, disease prediction, stock market analysis, and so on.
Prerequisites
Before learning machine learning, some basic background knowledge is assumed so that you can easily understand the concepts of machine learning.
Applications of Machine Learning
Machine learning is used in many everyday products and services; some of its most popular real-world applications are described below.
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is the automatic friend tagging suggestion:
It is based on Facebook's project named "DeepFace", which is responsible for face recognition and person identification in pictures.
2. Speech Recognition
When we use Google's "Search by voice" option, we are using speech recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text; it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start seeing advertisements for the same product while browsing the internet in the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products matching that interest.
Similarly, when we use Netflix, we get recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, in which machine learning plays a significant role. Tesla, the most popular car manufacturer, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Email clients automatically filter incoming mail with the help of machine learning. Some of the spam filters used are:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
7. Virtual Personal Assistant:
Virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri help us find information using voice instructions. These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can occur in various ways, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for a fraudulent transaction; the network detects this change and makes our online transactions more secure.
Machine Learning Life Cycle
The machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the different data sources and obtain the data related to the problem, since data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data determine the efficiency of the output: the more data we have, the more accurate the prediction will be.
By performing the above task, we get a coherent set of data, also called a dataset, which will be used in the further steps.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering. This step can be further divided into two processes:
o Data exploration: It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. Here, we look for correlations, general trends, and outliers.
o Data pre-processing: Now the next step is preprocessing the data so that it is ready for analysis.
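As a rough illustration of these sub-steps (not part of the original tutorial; the file name Dataset.csv is assumed, and the numeric_only argument requires pandas 1.5 or newer), pandas can shuffle and summarize a dataset in a few lines:

import pandas as pd

# Hypothetical file name; any tabular CSV would do.
data = pd.read_csv('Dataset.csv')

# Put the data together and randomize its ordering.
data = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Data exploration: characteristics, format, and quality.
print(data.head())                   # first few rows and their format
print(data.describe())               # general trends of the numeric columns
print(data.corr(numeric_only=True))  # correlations between numeric columns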
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.
The data we have collected is not necessarily always of use to us, as some of it may not be useful. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
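A minimal sketch of how such issues can be detected and removed with pandas (illustrative only; the file name is hypothetical, and the 'Purchased' column comes from the example dataset shown later in this tutorial):

import pandas as pd

data = pd.read_csv('Dataset.csv')

print(data.isnull().sum())                # count missing values per column
data = data.drop_duplicates()             # remove duplicate rows
data = data.dropna(subset=['Purchased'])  # drop rows whose label is missing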
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
o Selection of analytical techniques
o Building models
o Review the result
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and then to review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc.; then we build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the
model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the
model. In this step, we check for the accuracy of our model by providing a test dataset to
it.
Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
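A minimal sketch of the train-and-test steps with scikit-learn (the dataset and classifier choice here are arbitrary illustrations, not prescribed by this tutorial):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # train the model
accuracy = accuracy_score(y_test, model.predict(X_test))               # test the model
print(f"Test accuracy: {accuracy:.2%}")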
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in
the real-world system.
If the above-prepared model produces accurate results as per our requirements with acceptable speed, then we deploy the model in the real system. But before deploying the project, we check whether the model keeps improving its performance using the available data.
The deployment phase is similar to making the final report for a project.
In this topic, we will learn to install Python and an IDE with the help of Anaconda
distribution.
Anaconda distribution provides installation of Python along with various IDEs such as Jupyter Notebook, Spyder, Anaconda Prompt, etc. Hence it is a very convenient packaged solution which you can easily download and install on your computer. It will automatically install Python and some basic IDEs and libraries with it.
Below some steps are given to show the downloading and installing process of Anaconda
and IDE:
o To download Anaconda in your system, firstly, open your favorite browser and type
Download Anaconda Python, and then click on the first link as given in the below
image. Alternatively, you can directly download it by clicking on this
link, https://www.anaconda.com/distribution/#download-section.
o After clicking on the first link, you will reach the download page of Anaconda, as shown in the below image:
o Since Anaconda is available for Windows, Linux, and Mac OS, you can download it as per your OS type by clicking on the available options shown in the below image. It provides both Python 2.7 and Python 3.7 versions; since the latest version is 3.7, we will download the Python 3.7 version. After clicking on the download option, it will start downloading on your computer.
Note: In this topic, we are downloading Anaconda for Windows; you can choose it as per your OS.
o Now you will get a window for the installation location; here you can leave it as default or change it by browsing to a location, and then click on Next. Consider the below image:
o Now select the second option, and click on install.
o Write your first program, and save it using the .py extension.
o Run the program using the triangle Run button.
o You can check the program's output on the console pane at the bottom right side.
Difference between Artificial Intelligence and Machine Learning
Artificial intelligence and machine learning are two related technologies, and people sometimes use them as synonyms for each other, but they remain distinct terms in various cases.
AI is the broader concept of creating intelligent machines that can simulate human thinking capability and behavior, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly.
Below are some main differences between AI and machine learning, along with an overview of each.
Artificial Intelligence
Artificial intelligence is a field of computer science which makes computer systems that can mimic human intelligence. It is composed of two words, "artificial" and "intelligence", which together mean "a human-made thinking power." Hence we can define it as:
Artificial intelligence is a technology with which we can create intelligent systems that can simulate human intelligence.
An artificial intelligence system does not require pre-programming; instead, it uses algorithms that can work with their own intelligence. It involves machine learning algorithms such as reinforcement learning and deep learning neural networks. AI is being used in multiple places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Based on capabilities, AI can be divided into three types:
o Weak AI
o General AI
o Strong AI
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.
Machine learning enables a computer system to make predictions or take some decisions
using historical data without being explicitly programmed. Machine learning uses a
massive amounts of structured and semi-structured data so that a machine learning model can generate accurate results or give predictions based on that data.
Machine learning works on algorithms that learn on their own using historical data. It works only for specific domains: if we build a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data, such as a cat image, it will become unresponsive. Machine learning is being used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook's auto friend tagging suggestion, etc.
Machine learning can be divided into mainly three types:
o Supervised learning
o Reinforcement learning
o Unsupervised learning
Artificial Intelligence | Machine Learning
In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.
Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.
AI has a very wide range of scope. | Machine learning has a limited scope.
AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
The main applications of AI are Siri, customer support using chatbots, Expert Systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types, which are Supervised learning, Unsupervised learning, and Reinforcement learning.
AI deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.
What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain anything from a series or an array to a database table. The table below shows an example of a dataset:
Country Age Salary Purchased
India 38 48000 No
Germany 30 54000 No
France 48 65000 No
Germany 40 Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a record of the dataset. The most common file type for a tabular dataset is the "Comma-Separated Values" file, or CSV. But to store "tree-like" data, we can use a JSON file more efficiently.
Note: A real-world dataset is of huge size, which is difficult to manage and process
at the initial level. Therefore, to practice machine learning algorithms, we can use
any dummy dataset.
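As a small illustration (the file names here are hypothetical), pandas can load both formats:

import pandas as pd

tabular = pd.read_csv('Dataset.csv')      # tabular, comma-separated data
tree_like = pd.read_json('records.json')  # tree-like, nested data
print(tabular.head())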
Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. Some common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are typically used in computer vision tasks such as image classification, object detection, and image segmentation.
Examples :
o ImageNet
o CIFAR-10
o MNIST
Text Datasets:
Text datasets consist of textual data, such as articles, books, or social media posts. These datasets are used in NLP techniques like sentiment analysis, text classification, and machine translation.
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like regression and classification. The dataset given earlier in the article is an example of a tabular dataset.
Need of Dataset
o Properly prepared and pre-processed datasets are essential for machine learning projects.
o They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in terms of management and processing.
o To address these challenges, efficient data management techniques and processing algorithms are required.
Data Pre-processing:
Data pre-processing is a fundamental stage in preparing datasets for machine learning. It involves transforming raw data into a format suitable for model training. Common pre-processing techniques include data cleaning to remove inconsistencies or errors, normalization to scale data within a specific range, feature scaling to ensure features have comparable ranges, and handling missing values through imputation or removal.
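A compact sketch of imputation and scaling with scikit-learn (illustrative only; the toy array is invented):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with a missing value in the second feature.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 600.0]])

X = SimpleImputer(strategy='mean').fit_transform(X)  # impute missing values with the mean
X = MinMaxScaler().fit_transform(X)                  # scale each feature into [0, 1]
print(X)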
During the development of the ML project, the developers completely rely on the
datasets. In building ML applications, datasets are divided into two parts:
o Training dataset: the part of the data used to train the model.
o Test dataset: the part of the data used to evaluate the trained model.
Note: The datasets are of large size, so to download these datasets, you must
have fast internet on your computer.
Popular sources for Machine Learning datasets
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve
difficult Data Science related tasks.
Kaggle provides a high-quality dataset in different formats that we can easily find and
download.
3. Datasets via AWS
Anyone can analyze and build various services using shared data via AWS resources. The shared datasets on the cloud help users spend more time on data analysis rather than on data acquisition.
This source provides various types of datasets with examples and ways to use them. It also provides a search box with which we can search for the required dataset. Anyone can add any dataset or example to the Registry of Open Data on AWS.
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository, a collection of free datasets in various areas such as natural language processing, computer vision, and domain-specific sciences. It gives access to diverse and curated datasets that can be valuable for machine learning projects.
The link to download or use the dataset from this resource is https://msropendata.com/.
6. Awesome Public Dataset Collection
The link to download datasets from the Awesome Public Dataset collection is https://github.com/awesomedata/awesome-public-datasets.
7. Government Datasets
There are different sources to get government-related data. Various countries publish
government data for public use collected by them from different departments.
8. Computer Vision Datasets
Visual Data provides datasets for computer vision tasks; the link for downloading datasets from this source is https://www.visualdata.io/.
9. Scikit-learn dataset
Scikit-learn, a well-known machine learning library in Python, provides several built-in datasets for practice and experimentation. These datasets are accessible through the scikit-learn API and can be used for learning various machine learning algorithms. Scikit-learn offers both toy datasets, which are small and simplified, and real-world datasets with greater complexity. Examples of scikit-learn datasets include the Iris dataset, the Boston Housing dataset, and the Wine dataset.
The link to download datasets from this source is https://scikit-
learn.org/stable/datasets/index.html.
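For instance, one of scikit-learn's built-in toy datasets can be loaded in a couple of lines (a minimal sketch):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target_names)  # the three iris species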
Conclusion:
In conclusion, datasets form the foundation of successful machine learning projects. Understanding the various kinds of datasets, the significance of data pre-processing, and the roles of training and testing datasets are key steps towards building powerful models. By utilizing well-known sources such as Kaggle, the UCI Machine Learning Repository, AWS, Google's Dataset Search, Microsoft Datasets, and government datasets, data scientists and practitioners can access a wide variety of datasets for their machine learning projects. It is essential to consider data ethics and privacy throughout the whole data lifecycle to guarantee responsible and ethical use of data. With the right datasets and ethical practices, machine learning models can achieve accurate predictions and drive meaningful insights.
Data Preprocessing in Machine Learning
When creating a machine learning project, it is not always the case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. For this, we use the data preprocessing task.
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from "https://www.superdatascience.com/pages/machine-learning". For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc. We can also create our own dataset by gathering data using various APIs with Python and putting that data into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries, each of which is used to perform a specific job. There are three specific libraries that we will use for data preprocessing:
Numpy: The Numpy Python library is used to include any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python and supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used throughout the program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with it, we need to import the sub-library pyplot. This library is used to plot any type of chart in Python. It will be imported as below:

import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library. Consider the below image:
Note: We can set any directory as a working directory, but it must contain the required
dataset.
Here, in the below image, we can see the Python file along with the required dataset. Now the current folder is set as the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, it will successfully import the dataset into our code. We can also check the imported dataset by clicking on the variable explorer section and then double-clicking on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon (:) takes all the rows, and the second colon (:) takes all the columns. We have used :-1 because we don't want the last column, as it contains the dependent variable. By doing this, we will get the matrix of features.
By executing the above code, we will get output as:
As we can see in the above output, there are only three variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of
dependent variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory,
but for R language it is not required.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: This is a common way to deal with null values. Here we just delete the specific row or column which contains null values. But this approach is not very efficient, and removing data may lead to a loss of information, which will not give an accurate output.
By calculating the mean: Here we calculate the mean of the column or row which contains the missing value and put it in place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. We will use this approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.
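Note that the Imputer class was removed in scikit-learn 0.22; on a current installation, the equivalent code (a sketch, assuming the same x as above) uses SimpleImputer from sklearn.impute:

import numpy as np
from sklearn.impute import SimpleImputer

# Modern replacement for the deprecated Imputer class.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])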
Since machine learning models work entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
Firstly, we will convert the country variable into categorical data. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the variables into digits.
But in our case, the country variable has three values, and as we can see in the above output, they are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some ordering or correlation between these countries, which would produce the wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables that take the values 0 or 1. A value of 1 indicates the presence of that category in a particular column, while the rest of the dummy columns are 0. With dummy encoding, we have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns with 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library:

from sklearn.preprocessing import OneHotEncoder
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
Output:
As we can see in the above output, all the variables are encoded into numbers 0 and 1
and divided into three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
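As with Imputer, the categorical_features argument of OneHotEncoder was removed in later scikit-learn releases; a sketch of the modern equivalent (assuming the same x as above) uses ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country) and pass the remaining columns through unchanged.
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)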
Suppose we have trained our machine learning model with one dataset and then test it with a completely different dataset. The model will then have difficulty understanding the correlations between the variables.
If we train our model very well so that its training accuracy is very high, but then provide a new dataset to it, its performance will decrease. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.
Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
o In the above code, the first line splits the arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variables for the training data
o y_test: dependent variables for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the dividing ratio between the training and testing sets.
o The last parameter, random_state, sets a seed for the random generator so that you always get the same result; the most used value for it is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under
the variable explorer section.
As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no single variable dominates the others.
There are two main techniques for feature scaling:
o Standardization
o Normalization
Here, we will use the standardization method for our dataset.
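For reference, these are the standard definitions (not spelled out in the original): standardization rescales a feature to zero mean and unit variance, while normalization (min-max scaling) maps it into the range [0, 1]:

$$x'_{std} = \frac{x - \mu}{\sigma}, \qquad x'_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the feature, and $x_{\min}$, $x_{\max}$ are its minimum and maximum values.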
Now, we will create the object of the StandardScaler class for the independent variables or features, and then fit and transform the training dataset:

from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

For the test dataset, we directly apply transform() instead of fit_transform(), because the scaler is already fitted on the training set:

x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test
as:
x_train:
x_test:
As we can see in the above output, all the variables are scaled to values between -1 and 1.
Note: Here, we have not scaled the dependent variable because it takes only the two values 0 and 1. But if a variable has a wider range of values, then we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#Encoding the Country variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#Encoding for the purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But there
are some steps or lines of code which are not necessary for all machine learning models.
So we can exclude them from our code to make it reusable for all models.
Supervised Machine Learning
In supervised learning, the training data provided to the machines works as a supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is the process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled
as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and
the output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which
come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, with distinct classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is an example of classification. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Unsupervised Machine Learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning all the more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.
Here, we take unlabeled input data, which means it is not categorized and corresponding outputs are not given either. This unlabeled input data is fed to the machine learning model in order to train it. First, it interprets the raw data to find hidden patterns, and then it applies suitable algorithms such as k-means clustering, decision trees, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
Below is a list of some popular unsupervised learning algorithms (a short clustering sketch follows the list):
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
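As a small illustration of clustering (a sketch, not from the original tutorial; the two-blob data is synthetic):

import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of 2-D points (no labels are provided).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment discovered for each point
print(kmeans.cluster_centers_)  # the two learned centroids

Each point is assigned to one of the two discovered groups without any output labels being provided, which is exactly the clustering setting described above.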
o Supervised learning needs supervision to train the model, similar to how a student learns things in the presence of a teacher. Supervised learning can be used for two types of problems: classification and regression.
o Example: Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the image in supervised learning, we give the input data as well as the output for it, which means we train the model on the shape, size, color, and taste of each fruit. Once the training is completed, we test the model by giving it a new set of fruits. The model identifies the fruit and predicts the output using a suitable algorithm.
o Unsupervised Machine Learning:
o Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns from the data on its own.
o Unsupervised learning can be used for two types of
problems: Clustering and Association.
o Example: To understand unsupervised learning, we will use the example given above. Unlike supervised learning, here we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find patterns from the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.
o The main differences between Supervised and Unsupervised learning are given
below:
Supervised Learning | Unsupervised Learning
A supervised learning model takes direct feedback to check whether it is predicting the correct output or not. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result as compared to supervised learning.
o Note: Supervised and unsupervised learning are both machine learning methods; choosing between them depends on factors related to the structure and volume of your dataset and the use case of the problem.