What is Machine Learning?
Arthur Samuel, a pioneer in the fields of artificial intelligence and computer gaming, coined the term "Machine Learning" in 1959. He defined machine learning as the "field of study that gives computers the capability to learn without being explicitly programmed".
Machine learning is regarded as a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn on its own from data and past experience.
Machine learning enables a machine to learn automatically from data, improve its performance with experience, and make predictions without being explicitly programmed. With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions. Machine learning brings computer science and statistics together to create predictive models: it constructs or uses algorithms that learn from historical data, and the more information we provide, the better the performance tends to be.
A machine has the ability to learn if it can improve its performance by gaining
more data.
How does Machine Learning work?
A machine learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps to build a better model, which predicts the output more accurately.
Suppose we have a complex problem for which we need to make predictions. Instead of writing code for it by hand, we just feed the data to generic algorithms; the machine builds its logic from the data and predicts the output. Machine learning has changed the way we think about such problems.
Fig: Block diagram of how a Machine Learning algorithm works
Features of Machine Learning:
o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge amounts of data.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis it predicts the output.
The system builds a model from the labeled data to understand the dataset and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much as a student learns under the supervision of a teacher. A typical example of supervised learning is spam filtering; a minimal code sketch follows the list below.
Supervised learning can be grouped further in two categories of algorithms:
o Classification
o Regression
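For illustration, a minimal supervised classification sketch with scikit-learn is shown below; the iris dataset and logistic regression model are assumptions chosen only to demonstrate the train-on-labels, predict-on-new-data pattern.
# A minimal supervised learning sketch: train a classifier on labeled data
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)                 # features and known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=200)          # a simple classification algorithm
model.fit(X_train, y_train)                       # learn from the labeled training data
print(model.score(X_test, y_test))                # accuracy on unseen test data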
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result; the machine tries to find useful insights from large amounts of data. It can be further classified into two categories of algorithms, illustrated by the sketch after this list:
o Clustering
o Association
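As a minimal clustering sketch (the synthetic points and the choice of two clusters are assumptions for illustration), k-means groups unlabeled data purely by similarity:
# A minimal unsupervised learning sketch: cluster unlabeled points with k-means
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])          # no labels are provided
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)                              # group assignments found by the algorithm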
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
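As a toy illustration of this reward-and-penalty feedback loop (the five-state corridor environment, learning rate, and other constants below are assumptions, not part of the original text), tabular Q-learning can be sketched as follows:
# A minimal reinforcement learning sketch: tabular Q-learning on a toy corridor
# where the agent is rewarded only for reaching the rightmost state.
import numpy as np
n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # table of expected future rewards
alpha, gamma, epsilon = 0.5, 0.9, 0.1
for episode in range(200):
    state = 0
    while state != n_states - 1:
        # explore when uncertain or by chance, otherwise exploit the best known action
        if np.random.rand() < epsilon or Q[state].max() == Q[state].min():
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward for the right action
        # update the value estimate from the feedback (reward) received
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
print(Q)   # after training, moving right typically has the higher value in every state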
Data Types in Machine Learning
Data types are a way of classification that specifies which type of value a variable can store and what mathematical, relational, or logical operations can be applied to the variable without causing an error.
Different types of data
Data is broadly classified into:
1. Quantitative
2. Qualitative
1. Quantitative Data Type:
This type of data consists of numerical values: anything that is measured in numbers.
E.g., profit, quantity sold, height, weight, temperature, etc.
This is again of two types:
A.) Discrete Data Type:
Numeric data that takes only discrete values or whole numbers. Values of this type have no proper meaning when expressed in decimal format; they can be counted.
E.g., number of cars you own, number of marbles in a container, students in a class, etc.
Fig: Discrete data types
B.) Continuous Data Type:
Numerical measures that can take any value within a certain range. Values of this type have real meaning when expressed in decimal format; they cannot be counted, only measured, and the number of possible values is infinite.
E.g., height, weight, time, area, distance, measurement of rainfall, etc.
Fig: Continuous data types
2. Qualitative Data Type:
These are data types that cannot be expressed in numbers. They describe categories or groups and are hence known as categorical data types.
This can be divided into:
A. Structured Data:
This type of data consists of numbers or words arranged in a tabular format. It can take numerical values, but mathematical operations cannot meaningfully be performed on those values.
E.g., Sunny=1, Cloudy=2, Windy=3, or binary data such as 0 or 1, good or bad, etc.
Fig: Structured data
B. Unstructured Data:
This type of data does not have a proper format and is therefore known as unstructured data. It comprises textual data, sounds, images, videos, etc.
Fig: Unstructured data
Besides this, there are other types, referred to as data type preliminaries or data measures:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
These are also referred to as the different scales of measurement.
I. Nominal Data Type:
This is used to express names or labels that have no order and are not measurable.
E.g., male or female (gender), race, country, etc.
Fig: Gender (female, male), an example of the nominal data type
II. Ordinal Data Type:
This is also a categorical data type, like nominal data, but it has a natural ordering associated with it.
E.g., Likert rating scales, shirt sizes, ranks, grades, etc.
Fig: Rating (good, average, poor), an example of the ordinal data type
III. Interval Data Type:
This is numeric data with a proper order and equal spacing between values, but its zero point does not mean a true absence of the quantity; zero still carries some value. Differences between values are meaningful, yet there is no absolute zero, so this is sometimes called a local scale.
E.g., temperature measured in degrees Celsius, time, SAT score, credit score, pH, etc.
Fig: Temperature, an example of the interval data type
IV. Ratio Data Type:
This quantitative data type is the same as the interval data type but has an absolute zero: zero means a complete absence of the quantity, and the scale starts from zero, so this is sometimes called a global scale.
E.g., temperature in Kelvin, height, weight, etc.
Fig: Weight, an example of the ratio data type
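To make these distinctions concrete, here is a small, hedged pandas sketch; the example values and column names are assumptions used only for illustration, not data from the original text.
# Representing nominal, ordinal, interval, and ratio data in pandas
import pandas as pd
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],        # nominal: labels with no order
    "shirt_size": ["S", "L", "M"],                 # ordinal: natural order S < M < L
    "temperature_c": [21.5, 30.1, 25.0],           # interval: zero is not a true absence
    "weight_kg": [70.2, 55.0, 80.4],               # ratio: a true zero exists
})
# ordinal data can carry its ordering explicitly
df["shirt_size"] = pd.Categorical(df["shirt_size"], categories=["S", "M", "L"], ordered=True)
print(df.dtypes)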
Applications of Machine Learning
Machine learning is one of today's biggest technology buzzwords, and it is growing rapidly day by day. We use machine learning in our daily lives without even knowing it, for example in Google Maps, Google Assistant, Alexa, etc. Below are some of the most prominent real-world applications of machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is the automatic friend-tagging suggestion:
Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with their names, and the technology behind this is machine learning's face detection and recognition algorithms.
It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in the picture.
2. Speech Recognition
When using Google, we get the option to "Search by voice"; this comes under speech recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two sources of information:
Real-time location of vehicles from the Google Maps app and sensors
Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes information from users and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix to recommend products to users. Whenever we search for a product on Amazon, we start seeing advertisements for the same product while browsing the internet in the same browser, and this is because of machine learning.
Google understands user interest using various machine learning algorithms and suggests products according to customer interest.
Similarly, when we use Netflix, we see recommendations for series, movies, and other entertainment, and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, where machine learning plays a significant role. Tesla, one of the most popular car manufacturers, is working on self-driving cars and uses machine learning methods to train its models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We receive important mail in our inbox marked with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree,
and Naïve Bayes classifier are used for email spam filtering and malware detection.
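To illustrate the idea, here is a minimal, hedged spam-filter sketch using a Naïve Bayes classifier from scikit-learn; the tiny message list and labels are invented for illustration and are not real training data or Gmail's actual system.
# A minimal Naive Bayes spam-filter sketch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free money claim now", "lunch with the project team"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)     # turn text into word-count features
classifier = MultinomialNB().fit(X, labels)
print(classifier.predict(vectorizer.transform(["claim your free prize"])))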
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just through voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
Machine learning algorithms are an important part of these virtual assistants. They record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for fraudulent transactions, so the system can detect them and keep our online transactions secure.
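As a toy illustration of the feed-forward idea (and only that: the two synthetic features, the labels, and the model size below are assumptions, and real fraud-detection systems are far more complex), a small scikit-learn sketch might look like this:
# A toy feed-forward neural network for flagging suspicious transactions
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
# columns: transaction amount, number of transactions in the last hour
X = np.array([[20, 1], [35, 2], [25, 1], [5000, 40], [4200, 35], [4800, 50]])
y = np.array([0, 0, 0, 1, 1, 1])           # 0 = genuine, 1 = fraudulent
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0))
model.fit(X, y)
# new transactions: the first should typically be flagged genuine, the second fraudulent
print(model.predict([[30, 1], [4500, 45]]))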
9. Stock Market trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of shares going up and down, so machine learning's long short-term memory (LSTM) neural networks are used for predicting stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With it, medical technology is advancing rapidly and is able to build 3D models that can predict the exact position of lesions in the brain.
This helps in finding brain tumors and other brain-related diseases more easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and do not know the language, it is not a problem at all, as machine learning helps us by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into a familiar language, and this is known as automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used together with image recognition to translate text from one language to another.
Problems not to be solved using machine learning
1. Reasoning Power
One area that ML has not yet mastered is reasoning power, a distinctly human trait. The algorithms available today are mainly oriented towards specific use cases and are narrow in their applicability. They cannot reason about why a particular outcome occurs the way it does or 'introspect' on their own outputs.
For instance, if an image recognition algorithm identifies apples and oranges in a given scene, it cannot say whether the apple (or orange) has gone bad, or why that fruit is an apple or an orange. Mathematically, the learning process can be explained by us, but from an algorithmic perspective, this innate property cannot be articulated by the algorithms, or even by us.
2. Contextual Limitation
If we consider the area of natural language processing (NLP), text and speech are the means by which NLP algorithms understand language. They may learn letters, words, sentences, or even syntax, but where they fall short is the context of the language: algorithms do not understand the context in which language is used. A classic example of this is the "Chinese room" argument given by philosopher John Searle, which says that computer programs or algorithms grasp ideas merely through 'symbols' rather than the context in which they are given.
So, ML does not have an overall idea of the situation. It is limited to rote, symbolic interpretations rather than reasoning about what is actually going on.
3. Scalability
Although we see ML implementations being deployed on a significant scale, everything depends on the data and its scalability. Data is growing at an enormous rate and in many forms, which largely affects the scalability of an ML project. Algorithms cannot do much about this unless they are constantly updated to handle new data. This is where ML regularly requires human intervention as systems scale, and the problem remains mostly unsolved.
In addition, growing data has to be handled the right way when it is shared on an ML platform, which again requires examination using knowledge and intuition that current ML apparently lacks.
4. Regulatory Restriction For Data In ML
ML usually needs considerable (in fact, massive) amounts of data in stages such as training, cross-validation, etc. Sometimes this data includes private as well as general information, and this is where it gets complicated. Most tech companies keep their data private, and this data is the kind that is actually useful for ML applications. But there is a risk of the data being misused, especially in critical areas such as medical research and health insurance.
Even when data is anonymised, it can still be vulnerable. This is why regulatory rules are imposed heavily when it comes to using private data.
5. Internal Working Of Deep Learning
This sub-field of ML is largely responsible for today's AI growth. What was once just a theory has turned out to be the most powerful aspect of ML. Deep learning (DL) now powers applications such as voice recognition, image recognition, and so on through artificial neural networks.
But the internal workings of DL are still not well understood. Advanced DL algorithms still baffle researchers in terms of how and why they work so effectively. The millions of units that form the neural networks in DL add abstraction at every layer, which is very hard to interpret. This is why deep learning is dubbed a 'black box': its internal workings are largely unknown.
Data Pre-processing in Machine learning
Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and formatted data. Before doing any operation with data, it is necessary to clean it and put it in a formatted way. For this, we use the data pre-processing task.
Why do we need Data Pre-processing?
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly for machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
It involves the steps below:
Getting the dataset
Importing libraries
Importing datasets
Finding Missing Data
Encoding Categorical Data
Splitting dataset into training and test set
Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The data collected for a particular problem in a proper format is known as the dataset.
Datasets come in different formats for different purposes; for example, a dataset for a business problem will be different from a dataset for liver-patient diagnosis, so each dataset is different from another. To use the dataset in our code, we usually put it into a CSV file, although sometimes we may also need to use an HTML or xlsx file.
What is a CSV File?
CSV stands for "comma-separated values"; it is a file format that allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and these datasets can be used directly in programs.
We can also create our own dataset by gathering data with various APIs in Python and putting that data into a .csv file.
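For reference, the Dataset.csv used throughout this section presumably looks something like the following; this is a reconstruction inferred from the outputs shown later in this section, so the exact file contents are an assumption:
Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
The two empty fields correspond to the missing Salary and Age values handled in the missing-data step below.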
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific jobs.
There are three specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy library is used for including any kind of mathematical operation in the code. It is the fundamental package for scientific computing in Python, and it supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm as a short alias for Numpy, and it will be used throughout the program.
Matplotlib: The second library is matplotlib, a Python 2D plotting library, from which we also need to import the sub-library pyplot. This library is used to plot any type of chart in Python. It is imported as below:
import matplotlib.pyplot as mtp
Here we have used mtp as a short alias for this library.
Pandas: The last library is the Pandas library, one of the most popular Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library, imported as:
import pandas as pd
3) Importing the Datasets
Now we need to import the datasets that we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as the working directory. To set a working directory in the Spyder IDE, we follow the steps below:
Save your Python file in the directory which contains the dataset.
Go to the File explorer option in the Spyder IDE and select the required directory.
Click the F5 button or the run option to execute the file.
Note: We can set any directory as the working directory, but it must contain the required dataset.
Once the Python file is saved alongside the required dataset, the current folder is set as the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
Extracting dependent and independent variables:
In machine learning, it is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset. In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased.
Extracting independent variable:
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for all the columns. Here we have used :-1 because we don't want to include the last column, as it contains the dependent variable. By doing this, we get the matrix of features.
By executing the above code, we will get output as:
[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 nan]
['India' 35.0 58000.0]
['Germany' nan 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]
As we can see in the above output, there are only the three independent variables (Country, Age, and Salary).
Extracting dependent variable:
To extract dependent variables, again, we will use Pandas .iloc[] method.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. This gives the array of the dependent variable.
By executing the above code, we will get output as:
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is
mandatory, but for R language it is not required.
4) Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets. If our
dataset contains some missing data, then it may create a huge problem for our machine
learning model. Hence it is necessary to handle missing values present in the dataset.
Ways to handle missing data:
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values; we simply delete the specific row or column which contains null values. But this way is not very efficient, and removing data may lead to a loss of information that reduces the accuracy of the output.
By calculating the mean: In this way, we calculate the mean of the column or row which contains the missing value and put it in place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)
As we can see in the above output, the missing values have been replaced with the mean of the remaining values in the respective columns.
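Note that the Imputer class shown above comes from older scikit-learn releases and was removed in later versions; an equivalent snippet with the newer SimpleImputer class, assuming a current scikit-learn installation, might look like this:
# Equivalent imputation with SimpleImputer (newer scikit-learn, where Imputer no longer exists)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])   # columns 1 and 2 hold Age and Salary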
5) Encoding Categorical data:
Categorical data is data which has categories; in our dataset, there are two categorical variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a categorical variable in our dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the categories into digits.
But in our case, the Country variable has three categories, and as we can see in the above output, they are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some ordering or correlation between the categories, which would produce the wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables which take the values 0 or 1. A value of 1 indicates the presence of that category in a particular column, while the remaining dummy columns are 0. With dummy encoding, we have a number of columns equal to the number of categories. In our dataset, we have 3 categories, so it will produce three columns of 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library.
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
Output:
array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])
As we can see in the above output, the Country variable has been encoded into three dummy columns of 0s and 1s.
This can be seen more clearly in the Variable explorer section by clicking on the x option.
For Purchased Variable:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, we only use a LabelEncoder object of the LabelEncoder class. We are not using the OneHotEncoder class here because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
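As with the Imputer class, the categorical_features argument of OneHotEncoder used above has been removed from recent scikit-learn releases; the usual modern equivalent is a ColumnTransformer. A short sketch, assuming a current scikit-learn installation and that column 0 of x is the Country column:
# Dummy-encoding the Country column with ColumnTransformer + OneHotEncoder (newer scikit-learn)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)    # Country becomes three 0/1 columns; Age and Salary pass through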
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and
test set. This is one of the crucial steps of data preprocessing as by doing this, we can
enhance the performance of our machine learning model.
Suppose we train our machine learning model on one dataset and then test it with a completely different dataset. The model will then have difficulty, because the relationships it learned from the training data may not carry over to the new data.
If we train our model very well and its training accuracy is very high, but we then give it a new dataset, its performance will decrease. So we always try to make a machine learning model which performs well on the training set and also on the test dataset.
Training Set: A subset of the dataset used to train the machine learning model; for this subset we already know the output.
Test set: A subset of the dataset used to test the machine learning model; the model predicts the output for this subset.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
In the second line, we have used four variables for the output, which are:
x_train: features for the training data
x_test: features for the testing data
y_train: dependent variable for the training data
y_test: dependent variable for the testing data
In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. test_size may be .5, .3, or .2, which sets the ratio of the split between the training and testing sets.
The last parameter, random_state, sets a seed for the random generator so that you always get the same result; a commonly used value for it is 42.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset so that they lie within a specific range. In feature scaling, we put our variables on the same scale so that no single variable dominates the others.
As we can see, the Age and Salary columns are not on the same scale. Many machine learning models rely on Euclidean distance, and if we do not scale the variables, this causes problems for the model.
If we compute a distance from Age and Salary values, the Salary values will dominate the Age values and produce a misleading result. To remove this issue, we need to perform feature scaling.
There are two ways to perform feature scaling in machine learning (compared in the short sketch after this list):
Standardization
Normalization
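These two approaches correspond to scikit-learn's StandardScaler and MinMaxScaler classes; here is a small comparison sketch (the toy salary column below is an assumption for illustration):
# Standardization vs. normalization on a toy column of salaries
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
salaries = np.array([[45000.0], [58000.0], [79000.0], [88000.0]])
print(StandardScaler().fit_transform(salaries))   # standardization: mean 0, standard deviation 1
print(MinMaxScaler().fit_transform(salaries))     # normalization: rescaled into the range [0, 1]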
Here, we will use the standardization method for our dataset.
For feature scaling, we will import StandardScaler class of sklearn.preprocessing library
as:
from sklearn.preprocessing import StandardScaler
Now, we will create the object of StandardScaler class for independent variables or
features. And then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test
as:
x_train: (scaled training features)
x_test: (scaled test features)
As we can see in the resulting arrays, all the variables are now on a comparable scale, centred around 0.
Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1. But if a dependent variable has a wider range of values, it will also need to be scaled.
Combining all the steps:
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
#Extracting Independent Variable
x= data_set.iloc[:, :-1].values
#Extracting Dependent variable
y= data_set.iloc[:, 3].values
#handling missing data(Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But there
are some steps or lines of code which are not necessary for all machine learning models.
So we can exclude them from our code to make it reusable for all models.
What is data exploration in machine learning (ML)?
Data exploration is a vital process in data science. Analysts investigate a dataset to
illuminate specific patterns or characteristics to help companies or organizations
understand insights and implement new policies.
While data exploration doesn’t necessarily reveal every minute detail, it helps form a
broader picture of specific trends or areas to study. Using manual methods and
automated tools, users explore data to determine which model or algorithm is best for
subsequent steps in data analysis.
Manual data exploration techniques can help users identify specific areas of interest,
which is workable yet falls short of deeper investigation. This is where machine learning
can take your data analysis to the next level.
Machine learning algorithms and automated exploration software can quickly identify relationships between data variables and dataset structures, determine whether outliers exist, and surface values that highlight patterns or points of interest.
Both data exploration and machine learning can identify notable patterns and help draw
conclusions from datasets. But machine learning allows users to extract information in
large databases quickly and with little room for error.
With more data available than ever before, many companies are faced with an
abundance of data but not enough resources to analyze and process it. This is where
machine learning comes in.
What are the advantages of data exploration in machine learning?
Using machine learning for exploratory data analysis helps data scientists monitor their
data sources and explore data for large analyses. While manual data exploration can be
useful for homing in on specific datasets of interest, machine learning offers a much
wider lens, offering actionable insights that can transform your company’s
understanding of patterns and trends.
Machine learning software can also make your data far easier to digest. By taking data
points and exporting them to data visualization displays such as bar charts or scatter
plots, companies can extract meaningful information at a glance without spending time
interpreting and questioning results.
When you begin to explore your data with automated data exploration tools, you can
come away with in-depth insights that lead to better decisions. Today’s machine
learning solutions include open-source tools with regression capabilities and
visualization methods using programming languages such as Python for data
preparation.
Data exploration through machine learning
Data exploration has two primary goals: To highlight traits of single variables, and reveal
patterns and relationships between variables.
When using machine learning for data exploration, data scientists start by identifying metrics or variables, running a univariate and a bivariate analysis, and conducting a missing-values treatment.
Another key step includes identifying outliers, and finally, variable transformation and
variable creation. Let’s review these steps in more detail:
Identifying variables
To get started, data scientists will identify the factors that change or could potentially
change. Then, scientists will identify the data type and category of the variables.
Univariate and bivariate analysis
Each variable is then explored individually with box plots or histograms to determine
whether it is categorical or continuous, a process known as the univariate analysis. This
process can also highlight missing data and outlier values. Next, a bivariate analysis will
help determine the relationship between variables.
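As a small, hedged sketch of these two steps in Python (the file name Dataset.csv and its Age and Salary columns are carried over from the preprocessing section above and are assumptions here):
# Univariate and bivariate exploration with pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Dataset.csv')
print(df['Salary'].describe())        # univariate summary of one continuous variable
df['Salary'].hist()                   # univariate distribution
df.plot.scatter(x='Age', y='Salary')  # bivariate relationship between two variables
plt.show()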
Missing values
It’s not uncommon for datasets to have missing values or missing data. Identifying gaps
in information improves the overall accuracy of your data analysis.
Identifying outliers
Another common element in datasets is the presence of outliers. Outliers in data refer to
observations that are divergent from a generalized pattern in a data sample. Outliers can
skew data considerably, and should be highlighted and addressed before extracting
insights.
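One common, simple way to flag such observations is the interquartile-range (IQR) rule; the sketch below is illustrative only, and the column name Salary is again an assumption:
# Flagging outliers as values outside 1.5 * IQR from the quartiles
import pandas as pd
df = pd.read_csv('Dataset.csv')
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Salary'] < q1 - 1.5 * iqr) | (df['Salary'] > q3 + 1.5 * iqr)]
print(outliers)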
Variable transformation and creation
Occasionally it can be helpful to transform or create new variables. Transforming can
help scale variables for better visualization, while variable creation can highlight new
relationships between variables.
Businesses and organizations can use data exploration to help gain actionable insights
from large datasets. You can accelerate data exploration with machine learning, making
it a far quicker and more seamless process for your organization.
What is data remediation?
Data remediation is the process of cleansing, organizing and migrating data so that it’s
properly protected and best serves its intended purpose. There is a misconception that
data remediation simply means deleting business data that is no longer needed.
Data Migration – The process of moving data between two or more systems, data
formats or servers.
Data Discovery – A manual or automated process of searching for patterns in data sets
to identify structured and unstructured data in an organization’s systems.
ROT – An acronym that stands for redundant, obsolete and trivial data. According to the
Association for Intelligent Information Management, ROT data accounts for nearly 80
percent of the unstructured data that is beyond its recommended retention period and
no longer useful to an organization.
Dark Data – Any information that businesses collect, process and store, but do not use
for other purposes. Some examples include customer call records, raw survey data or
email correspondences. Often, the storing and securing of this type of data incurs more
expense and sometimes even greater risk than it does value.
Dirty Data – Data that damages the integrity of the organization’s complete dataset.
This can include data that is unnecessarily duplicated, outdated, incomplete or
inaccurate.
Data Overload – This is when an organization has acquired too much data, including
low-quality or dark data. Data overload makes the tasks of identifying, classifying and
remediating data laborious.
Data Cleansing – Transforming data in its native state to a predefined standardized
format.
Data Governance – Management of the availability, usability, integrity and security of
the data stored within an organization.
Stages of data remediation
Data remediation is an involved process. After all, it’s more than simply purging your
organization’s systems of dirty data. It requires knowledgeable assessment on how to
most effectively resolve unclean data.
Assessment
Before you take any action on your company’s data, you need to have a complete
understanding of the data you possess. How valuable is this data to the company? Is this
data sensitive? Does this data actually require specialized storage, or is it trivial
information? Identifying the quantity and type of data you’re dealing with, even if it’s
just a ballpark estimate to start, will help your team get a general sense of how much
time and resources need to be dedicated for successful data remediation.
Organizing and segmentation
Not all data is created equally, which means that not all pieces of data require the same
level of protection or storage features. For instance, it isn’t cost-efficient for a company
to store all data, ranging from information that is publicly facing to sensitive data, all in
the same high-security vault. This is why organizing and creating segments based on the
information’s purpose is critical during the data remediation process.