Machine learning notes_Unit-1

Machine learning is a subset of artificial intelligence that enables computers to learn from data and experiences without explicit programming. It has numerous applications, including image and speech recognition, traffic prediction, product recommendations, and medical diagnosis. Common challenges in machine learning include inadequate training data, poor data quality, and the complexity of the machine learning process.

Machine learning Introduction

In the real world, humans learn from their experiences, while computers and machines simply follow
our instructions. But can a machine also learn from experience or past data the way a human does?
This is where Machine Learning comes in.

A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences. Arthur
Samuel first used the term "machine learning" in 1959. It could be summarized as follows:

Without being explicitly programmed, machine learning enables a machine to automatically learn from
data, improve performance from experiences, and predict things.

In other words, a machine is said to learn if it can improve its performance as it gains more data.

Applications of Machine learning


Machine learning is one of today's most talked-about technologies, and it is growing rapidly. We use
machine learning in daily life, often without realizing it, through services such as Google Maps, Google
Assistant, Alexa, etc. Below are some of the most prominent real-world applications of Machine Learning:

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. A popular use case of image recognition and face
detection is automatic friend tagging suggestions:

Facebook offers an auto friend tagging suggestion feature. Whenever we upload a photo with
our Facebook friends, we automatically get tagging suggestions with names, and the technology
behind this is machine learning's face detection and recognition algorithms.

It is based on Facebook's project named "DeepFace," which is responsible for face recognition and
person identification in pictures.

2. Speech Recognition

When using Google, we get a "Search by voice" option; this is an instance of speech recognition,
a popular application of machine learning.

Speech recognition is the process of converting spoken instructions into text, and it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms are
widely used in speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with
the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested,
in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors

o Average time taken on past days at the same time of day


Everyone who uses Google Maps is helping to make the app better. It takes information from the
user and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by e-commerce and entertainment companies such
as Amazon and Netflix for product recommendations. Whenever we search for a
product on Amazon, we start seeing advertisements for the same product while browsing
the web in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.

Similarly, when we use Netflix, we get recommendations for series, movies, etc.,
and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, where machine learning
plays a significant role. Tesla, one of the most popular car manufacturers, is working on self-driving
cars and uses machine learning methods to train its models to detect people and objects while
driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam. Important
mail arrives in our inbox marked with the important symbol, while spam emails go to the spam
folder, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter

o Header filter

o General blacklists filter

o Rules-based filters

o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
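
As an illustration of the last point, here is a minimal, hypothetical sketch of a Naïve Bayes spam filter built with scikit-learn; the tiny training messages are made up purely for demonstration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training set: 1 = spam, 0 = not spam
emails = ["Win a free prize now", "Lowest price for cheap loans",
          "Meeting agenda for Monday", "Please review the attached report"]
labels = [1, 1, 0, 0]

# Convert the raw text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a Naive Bayes classifier on the labelled examples
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen message
print(clf.predict(vectorizer.transform(["Claim your free prize"])))  # expected: [1], i.e. spam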

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name
suggests, they help us find information using voice instructions. These assistants can help
us in various ways through voice commands alone, such as playing music, calling someone, opening an email,
scheduling an appointment, etc.

Machine learning algorithms are an important part of these virtual assistants.

These assistants record our voice instructions, send them to a server in the cloud, decode them using
ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning makes our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can occur in various ways, such as through fake
accounts, fake IDs, or money being stolen in the middle of a transaction. To detect this, a
feed-forward neural network can check whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Genuine transactions follow a specific pattern, which changes for a
fraudulent transaction; the model detects this change and makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of
ups and downs in share prices, so long short-term memory (LSTM) neural networks are
used to predict stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology is
advancing rapidly and can build 3D models that predict the exact position of lesions in the brain.

This helps in finding brain tumors and other brain-related diseases more easily.

11. Automatic Language Translation:

Nowadays, visiting a new place without knowing the language is not a problem, because machine
learning helps us by converting text into a language we know. Google's
GNMT (Google Neural Machine Translation) provides this feature: a neural machine translation system
that translates text into our familiar language, known as automatic translation.

The technology behind automatic translation is the sequence-to-sequence learning algorithm, which,
combined with image recognition (for text in images), translates text from one language to another.

What is Machine Learning?


Machine Learning is defined as the study of computer algorithms that allow computer programs to
improve automatically through past experience and training data.

It is a branch of Artificial Intelligence and computer science that helps build a model based on training
data and make predictions and decisions without being constantly programmed. Machine Learning is
used in various applications such as email filtering, speech recognition, computer vision, self-driven
cars, Amazon product recommendation, etc.


Common issues in Machine Learning


1. Inadequate Training Data

A major issue when using machine learning algorithms is the lack of both quality and quantity of data.
Although data plays a vital role in machine learning, many data scientists report that inadequate,
noisy, and unclean data severely hamper machine learning algorithms. For example, a simple task may
require thousands of samples, while an advanced task such as speech or image recognition may need
millions of examples. Further, data quality is important for the algorithms to work well, yet poor data
quality is common in machine learning applications. Data quality can be affected by factors such as:

o Noisy data - It leads to inaccurate predictions that affect decisions as well as accuracy in
classification tasks.

o Incorrect data - It leads to faulty results from machine learning models and may therefore reduce
the accuracy of the outcomes.

o Generalizing output data - Sometimes generalizing output data becomes difficult, which results in
comparatively poor future decisions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it must be of good
quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to less accuracy in
classification and low-quality results. Hence, data quality can also be considered as a major common
problem while processing machine learning algorithms.

3. Non-representative training data

To make sure our model generalizes well, we have to ensure that the sample training
data is representative of the new cases to which we need to generalize. The training data should cover
the cases that have already occurred as well as those that keep occurring.

Further, if we use non-representative training data, the model produces less accurate
predictions. A machine learning model is said to be ideal if it predicts well for general cases and
provides accurate decisions. If there is too little training data, there will be sampling noise, and the
resulting non-representative training set makes the model biased towards one class or group and
inaccurate in its predictions.

Hence, we should use representative data in training to protect against bias and make
accurate predictions without drift.

4. Getting bad recommendations

A machine learning model operates in a specific context, and when that context shifts it can produce
bad recommendations due to concept drift. For example, at one point a customer is
looking for gadgets, but the customer's requirements change over time; the machine learning
model keeps showing the same recommendations even though the customer's expectations have
changed. This phenomenon is called data drift. It generally occurs when new data is introduced or
the interpretation of the data changes. We can mitigate it by regularly monitoring the data and
updating the model to match current expectations.

5. Lack of skilled resources

Although Machine Learning and Artificial Intelligence are continuously growing in the market,
these industries are still relatively young compared to others. The shortage of skilled
manpower is also an issue. We need people with in-depth knowledge of mathematics,
science, and technology to develop and manage machine learning systems.

6. Irrelevant features
Although machine learning models are intended to give the best possible outcome, if we feed garbage
data as input, then the result will also be garbage. Hence, we should use relevant features in our
training sample. A machine learning model is considered good if its training data has a good set of
features with few or no irrelevant ones.

7. Slow implementations and results

This issue is also commonly seen in machine learning projects. Machine learning models
can produce accurate results, but doing so is often time-consuming. Slow programs,
excessive requirements, and overloaded data take more time than expected to deliver accurate
results. This calls for continuous maintenance and monitoring of the model.

8. Process Complexity of Machine Learning

The machine learning process is complex, which is another major issue faced by machine
learning engineers and data scientists. Machine Learning and Artificial Intelligence are relatively
new technologies, still largely experimental and continuously changing over time.
Much of the work is trial and error, so the probability of mistakes is higher than
expected. Further, the process involves analyzing the data, removing data bias, training models, applying
complex mathematical calculations, etc., making the procedure complicated and quite tedious.

9. Underfitting of Training Data

Underfitting occurs when the model is unable to establish an accurate relationship between the input
and output variables. It is like trying to fit into undersized jeans: the model is too simple to
capture the precise relationship.

10. Overfitting of Training Data

Overfitting occurs when a machine learning model fits the training data too closely, which negatively
affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is
one of the significant issues faced by machine learning professionals. It often happens when the
algorithm is trained on noisy or biased data. Let's understand this with the
help of an example. Consider a model trained to differentiate between a cat, a rabbit, a dog, and
a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. There is then a
considerable probability that the model will identify a cat as a rabbit. In this example, we had a vast
amount of data, but it was biased, so the predictions were negatively affected.

Types of Machine Learning


Based on the methods and way of learning, machine learning is divided into mainly four types, which
are:

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning
1. Supervised Machine Learning

As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train machines using a "labelled" dataset, and based
on that training, the machine predicts the output. Labelled data means the inputs are
already mapped to the correct outputs. More precisely, we first train
the machine with inputs and their corresponding outputs, and then we ask the machine to predict
the output for the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of
cats and dog images. So, first, we will provide the training to the machine to understand the
images, such as the shape & size of the tail of cat and dog, Shape of eyes, colour, height (dogs
are taller, cats are smaller), etc. After completion of training, we input the picture of a cat and
ask the machine to identify the object and predict the output. Now, the machine is well
trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears,
tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the process of how
the machine identifies the objects in Supervised Learning.

The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment,
Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are given
below:

o Classification

o Regression

a) Classification

Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. Classification
algorithms predict the categories present in the dataset. Some real-world examples of
classification are spam detection, email filtering, etc.
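
As a minimal illustration (assuming scikit-learn is available), the sketch below trains one of the classifiers listed next, a decision tree, on a tiny made-up dataset with a categorical output:

from sklearn.tree import DecisionTreeClassifier

# Made-up feature rows: [height_cm, tail_length_cm]
X = [[25, 30], [24, 28], [60, 20], [65, 22]]
# Categorical label for each row
y = ["cat", "cat", "dog", "dog"]

# Fit a decision tree classifier on the labelled data
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict the category of a new, unseen animal
print(clf.predict([[26, 29]]))  # expected: ['cat']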
Some popular classification algorithms are given below:

o Random Forest Algorithm

o Decision Tree Algorithm

o Logistic Regression Algorithm

o Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems, in which the output variable is
continuous and there is a relationship (often assumed to be linear) between the input and output
variables. They are used to predict continuous output values, such as market trends, weather, etc.
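
For illustration (again assuming scikit-learn), here is a minimal simple linear regression sketch on made-up data:

from sklearn.linear_model import LinearRegression

# Made-up data: years of experience -> salary (in thousands)
X = [[1], [2], [3], [4], [5]]
y = [30, 35, 41, 46, 50]

# Fit a simple linear regression model
reg = LinearRegression()
reg.fit(X, y)

# Predict a continuous output for a new input
print(reg.predict([[6]]))  # roughly 55-56, following the learned linear trend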

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm

o Multivariate Regression Algorithm

o Decision Tree Algorithm

o Lasso Regression

Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.

o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.

o It may predict the wrong output if the test data is different from the training data.

o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning

Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.

o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by
using medical images and historical data labelled with disease conditions. With such a
process, the machine can identify diseases for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying fraud
transactions, fraud customers, etc. It is done by using historic data to identify the patterns that
can lead to possible fraud.

o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam folder.

o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The
algorithm is trained with voice data, and various identifications can be done using the same,
such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning

Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of an unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to
find the hidden patterns in the input dataset.

Let's take an example to understand it more precisely: suppose there is a basket of fruit
images, and we feed it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.

So the machine will discover its own patterns and differences, such as differences in colour
and shape, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

o Clustering

o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is
a way to group the objects into a cluster such that the objects with the most similarities remain
in one group and have fewer or no similarities with the objects of other groups. An example
of the clustering algorithm is grouping the customers by their purchasing behaviour.
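
As a minimal sketch (assuming scikit-learn and some made-up purchase data), clustering customers by purchasing behaviour might look like this:

from sklearn.cluster import KMeans

# Made-up customer data: [annual spend, number of orders]
customers = [[200, 2], [220, 3], [1500, 25], [1600, 30], [800, 10]]

# Group the customers into 2 clusters based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                    # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # the centre of each discovered group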

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm

o Mean-shift algorithm

o DBSCAN Algorithm
o Principal Component Analysis

o Independent Component Analysis

2) Association

Association rule learning is an unsupervised learning technique which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is to
find the dependency of one data item on another and to map those variables
accordingly so that maximum value (for example, profit) can be generated. This algorithm is mainly
applied in market basket analysis, web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
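
A minimal market-basket sketch, assuming the third-party mlxtend library is installed (its apriori and association_rules helpers are used here on made-up transactions):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: True means the item was in the basket
transactions = pd.DataFrame({
    "bread":  [True, True, False, True],
    "butter": [True, True, False, False],
    "milk":   [False, True, True, True],
})

# Find itemsets that appear in at least 50% of the transactions
frequent = apriori(transactions, min_support=0.5, use_colnames=True)

# Derive rules such as {bread} -> {butter} above a confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])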

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for complicated tasks compared to the supervised ones because
these algorithms work on the unlabeled dataset.

o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled
and the algorithms are not trained with the exact output beforehand.

o Working with unsupervised learning is more difficult, as it uses an unlabelled dataset
that is not mapped to any output.

Applications of Unsupervised Learning

o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright in
document network analysis of text data for scholarly articles.

o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-commerce
websites.

o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.

o Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract
particular information from a database, for example, extracting the information of each user
located at a particular location.

3. Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between
supervised and unsupervised machine learning. It represents the intermediate ground
between supervised learning (with labelled training data) and unsupervised learning (with no labelled
training data) and uses a combination of labelled and unlabelled datasets during
the training period.

Although semi-supervised learning operates on data that contains a few labels, the data mostly
consists of unlabelled examples, since labels are costly to obtain and, for corporate purposes, only a
few may be available. It differs from both supervised and unsupervised learning, which are defined by
the presence or absence of labels.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised
and unsupervised learning algorithms. The main aim of semi-supervised
learning is to make effective use of all the available data, rather than only the labelled data as in
supervised learning. Initially, similar data is clustered using an unsupervised learning
algorithm, which then helps label the unlabelled data. This is done because
labelled data is comparatively more expensive to acquire than unlabelled data.

We can imagine these algorithms with an example. Supervised learning is where a student is
under the supervision of an instructor at home and at college. If that student studies the same
concept on their own without any help from the instructor, it comes under unsupervised
learning. Under semi-supervised learning, the student revises on their own after studying the
concept under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o It is simple and easy to understand the algorithm.

o It is highly efficient.

o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o Iteration results may not be stable.

o We cannot apply these algorithms to network-level data.

o Accuracy is low.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a
software component) automatically explores its surroundings by trial and error: taking actions,
learning from experience, and improving its performance. The agent is rewarded for each
good action and punished for each bad action; hence the goal of a reinforcement learning
agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.

The reinforcement learning process is similar to how a human learns; for example, a child learns
various things from experience in day-to-day life. An example of reinforcement learning is
playing a game, where the game is the environment, the agent's moves at each step define states,
and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments
and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an
MDP, the agent constantly interacts with the environment and performs actions; after each
action, the environment responds and generates a new state.
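
A minimal sketch of this interaction loop, using tabular Q-learning on a made-up 5-state corridor environment (the reward and transition rules here are invented purely for illustration):

import random

# Toy corridor: states 0..4, actions 0 = left, 1 = right, reward only at the last state
N_STATES, ACTIONS = 5, [0, 1]
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: value of each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2       # learning rate, discount factor, exploration rate

def step(state, action):
    """The environment responds to an action with a new state and a reward."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        action = random.choice(ACTIONS) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate towards reward + discounted future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # after training, moving right has the higher value in each state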

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour will occur again by adding something. It strengthens the
agent's behaviour and impacts it positively.

o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to
positive RL. It increases the tendency that the specific behaviour will occur again by
avoiding a negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games:
RL algorithms are very popular in gaming applications and are used to achieve super-human
performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo Zero.

o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to
use RL in computer systems to automatically learn to schedule resources among waiting jobs in
order to minimize average job slowdown.

o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI and
Machine learning technology.

o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help of
Reinforcement Learning by Salesforce company.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems which are difficult to be solved by general
techniques.

o The learning model of RL is similar to the learning of human beings; hence most accurate
results can be found.

o Helps in achieving long term results.


Disadvantage

o RL algorithms are not preferred for simple problems.

o RL algorithms require huge data and computations.

o Too much reinforcement learning can lead to an overload of states which can weaken the
results.

What is a Dataset?
A Dataset is a set of data grouped into a collection with which developers can work to meet
their goals. In a dataset, the rows represent the number of data points and the columns
represent the features of the Dataset. They are mostly used in fields like machine learning,
business, and government to gain insights, make informed decisions, or train algorithms.
Datasets may vary in size and complexity and they mostly require cleaning and preprocessing
to ensure data quality and suitability for analysis or modeling.

Let us see an example below, the Iris dataset:

[Iris dataset sample: rows of flower measurements with the columns Sepal Length, Sepal Width, Petal Length, Petal Width, and Species]

This is the Iris dataset. Since this is a dataset with which we build models, there are input
features and output features. Here:

The input features are Sepal Length, Sepal Width, Petal Length, and Petal Width.

Species is the output feature.

Datasets can be stored in multiple formats. The most common ones are CSV, Excel, JSON, and
zip files for large datasets such as image datasets.
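
As a small sketch (assuming scikit-learn is available with pandas installed, since scikit-learn ships a copy of the Iris data and as_frame=True returns a DataFrame), loading the dataset might look like this:

from sklearn.datasets import load_iris

# Load the Iris data bundled with scikit-learn as a pandas DataFrame
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)    # (150, 5): 150 data points, 4 input features plus the target column
print(df.head())   # sepal length/width, petal length/width, and the encoded species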
Why are datasets used?

Datasets are used to train and test AI models, analyze trends, and gain insights from data. They
provide the raw material for computers to learn patterns and make predictions.

Types of Datasets

There are various types of datasets available out there. They are:

Numerical Dataset: They include numerical data points that can be solved with equations.
These include temperature, humidity, marks and so on.

Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.

Web Dataset: These include datasets created by calling APIs using HTTP requests and
populating them with values for data analysis. These are mostly stored in JSON (JavaScript
Object Notation) formats.

Time series Dataset: These include data collected over a period of time, for example, changes in
geographical terrain over time.

Image Dataset: It includes a dataset consisting of images. This is mostly used to differentiate
the types of diseases, heart conditions and so on.

Ordered Dataset: These datasets contain data that are ordered in ranks, for example, customer
reviews, movie ratings and so on.

Partitioned Dataset: These datasets have data points segregated into different members or
different partitions.

File-Based Datasets: These datasets are stored in files, in Excel as .csv, or .xlsx files.

Bivariate Dataset: In this dataset, 2 classes or features are directly correlated to each other.
For example, height and weight in a dataset are directly related to each other.

Multivariate Dataset: In these types of datasets, as the name suggests, two or more features are
correlated with each other. For example, attendance and assignment grades are both correlated
with a student's overall grade.

Properties of Dataset

Center of data: This refers to the “middle” value of the data, often measured by mean, median,
or mode. It helps understand where most of the data points are concentrated.

Skewness of data: This indicates how symmetrical the data distribution is. A perfectly
symmetrical distribution (like a normal distribution) has a skewness of 0. Positive skewness
means the bulk of the data is clustered towards the left with a longer tail to the right, while
negative skewness means it is clustered towards the right with a longer tail to the left.
Spread among data members: This describes how much the data points vary from the center.
Common measures include standard deviation or variance, which quantify how far individual
points deviate from the average.

Presence of outliers: These are data points that fall significantly outside the overall pattern.
Identifying outliers can be important as they might influence analysis results and require
further investigation.

Correlation among the data: This refers to the strength and direction of relationships between
different variables in the dataset. A positive correlation indicates values in one variable tend
to increase as the other does, while a negative correlation suggests they move in opposite
directions. No correlation means there’s no linear relationship between the variables.

Type of probability distribution that the data follows: Understanding the distribution (e.g.,
normal, uniform, binomial) helps us predict how likely it is to find certain values within the
data and choose appropriate statistical methods for analysis.
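
A quick sketch (assuming pandas) of computing these properties for two numeric columns of a made-up dataset:

import pandas as pd

# Made-up numeric data with one obvious outlier in the first column
values = pd.Series([4.9, 5.1, 5.0, 5.2, 4.8, 5.1, 9.8])
other  = pd.Series([1.4, 1.5, 1.4, 1.6, 1.3, 1.5, 2.9])

print(values.mean(), values.median())   # centre of the data
print(values.skew())                    # skewness (positive here: long right tail)
print(values.std())                     # spread around the centre
print(values.corr(other))               # correlation between the two columns

# Simple outlier check: points more than 2 standard deviations from the mean
print(values[(values - values.mean()).abs() > 2 * values.std()])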

Features of a Dataset

The features of a dataset refer to the columns available in the dataset. They are the most critical
aspect of the dataset, because it is on the basis of the features of each available data point that
models can be deployed to predict the output for any new data point added to the dataset.

Standard features cannot be listed for all datasets, since each dataset's purpose and data differ
from those of other datasets. Some possible features of a dataset are:

Numerical Features: These may include numerical values such as height, weight, and so on.
These may be continuous over an interval, or discrete variables.

Categorical Features: These include multiple classes/ categories, such as gender, colour, and
so on.

Metadata: Includes a general description of a dataset. Generally in very large datasets, having
an idea/ description of the dataset when it’s transferred to a new developer will save a lot of
time and improve efficiency.

Size of the Data: It refers to the number of entries and features it contains in the file containing
the Dataset.

Formatting of Data: Datasets available online come in several formats, such as JSON (JavaScript
Object Notation), CSV (Comma Separated Values), XML (eXtensible Markup Language), DataFrame,
and Excel files (xlsx or xlsm). Particularly large datasets, especially image datasets for disease
detection, are often downloaded from the internet as zip files that need to be extracted into their
individual components.

Target Variable: It is the feature whose values are predicted from the other features using
machine learning techniques.
Data Entries: These refer to the individual values of data present in the Dataset. They play a
huge role in data analysis.

Data preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step when creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted
data. And before doing any operation with data, it is necessary to clean it and put it into a
formatted form, which is why we use data preprocessing.
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be directly used by machine learning models. Data preprocessing is the task of
cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.
It involves below steps:
o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine
learning model completely works on data. The data collected for a particular problem
in a proper format is known as the dataset.
The dataset may be in different formats for different purposes; for example, if we want to create
a machine learning model for a business purpose, the dataset will be different from the
dataset required for a liver-patient problem. So each dataset is different from every other dataset. To
use the dataset in our code, we usually put it into a CSV file. However, sometimes we
may also need to use an HTML or xlsx file.

What is a CSV File?
CSV stands for "Comma-Separated Values"; it is a file format which allows us
to save tabular data, such as spreadsheets. It is useful for huge datasets, and we can use
these datasets in programs.
We can also create our own dataset by gathering data using various APIs with Python and putting
that data into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform specific jobs.
There are three specific libraries that we will use for data preprocessing:
Numpy: The Numpy library is used to include any type of mathematical
operation in the code. It is the fundamental package for scientific computation in Python.
It also supports large, multidimensional arrays and matrices. In Python, we
can import it as:

import numpy as nm

Here we have used nm as a short name for Numpy, and it will be used throughout the program.
Matplotlib: The second library is matplotlib, a Python 2D plotting library,
from which we need to import the sub-library pyplot. This library is used to plot
any type of chart in Python. It is imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.
Pandas: The last library is Pandas, one of the most famous Python
libraries, used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It is conventionally imported as:

import pandas as pd

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning project.
But before importing a dataset, we need to set the current directory as the working directory. To
set a working directory in the Spyder IDE, we follow the steps below:
1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in the Spyder IDE, and select the required directory.
3. Press the F5 key or click the Run option to execute the file.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the Pandas library, which is used to
read a CSV file and perform various operations on it. Using this function, we can read a CSV file
locally as well as through a URL.

We can use the read_csv function as below:

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we have
passed the name of our dataset file. Once we execute the above line of code, the dataset is
successfully imported into our code.
Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) from the dependent variable in the dataset. In our dataset, there are three independent
variables, Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[] method of the Pandas library. It is used to
extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for
all the columns. Here we have used :-1, because we don't want to take the last column as it
contains the dependent variable. So by doing this, we will get the matrix of features.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only.

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values. Here, we
just delete the specific row or column which contains null values. But this approach is not very
efficient, and removing data may lead to a loss of information that keeps us from getting
accurate output.

By calculating the mean: In this way, we will calculate the mean of that column or row
which contains any missing value and will put it on the place of missing value. This strategy
is useful for the features which have numeric data such as age, salary, year, etc. Here, we will
use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains
various utilities for building machine learning models. Older tutorials use the Imputer class
of the sklearn.preprocessing library; in recent scikit-learn versions this has been replaced by the
SimpleImputer class of the sklearn.impute module.
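
A minimal sketch of mean imputation with the current SimpleImputer API (assuming, as in the dataset described above, that the numeric Age and Salary columns sit at positions 1 and 2 of the feature matrix x):

import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values (NaN) with the mean of the corresponding column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on the numeric columns (Age, Salary) and transform them in place
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])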

5) Encoding Categorical data:

Categorical data is data which has categories; in our dataset, there are two
categorical variables, Country and Purchased.

Since a machine learning model works entirely with mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model. So it is necessary to encode
these categorical variables into numbers.

For the Country variable:

Firstly, we will convert the Country values into numeric form using label encoding. In our case there
are three country values, and they are encoded as 0, 1, and 2. Because of these values, the machine
learning model may assume that there is some ordering or correlation between the countries, which
would produce wrong output. To remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables that take the value 0 or 1. A value of 1 indicates the presence of
that category in a particular column, and the rest of the dummy columns are 0. With dummy
encoding, we will have a number of columns equal to the number of categories.

In our dataset, we have 3 categories, so it will produce three columns containing 0 and 1 values.
For dummy encoding, we will use the OneHotEncoder class of the preprocessing library, as sketched below.
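
A sketch with the current scikit-learn API (older tutorials pass a categorical_features argument to OneHotEncoder; recent versions use ColumnTransformer instead, assuming Country is column 0 of x):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough'
)
x = ct.fit_transform(x)   # Country becomes three 0/1 dummy columns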
For Purchased Variable:

from sklearn.preprocessing import LabelEncoder

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we only use the labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased
variable has only two categories, yes or no, which are automatically encoded into 0 and 1.

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.

Suppose we have trained our machine learning model on one dataset and then test it
with a completely different dataset. The model will then have difficulty capturing
the correlations between the variables.
If we train our model very well and its training accuracy is very high, but then we provide a
new dataset to it, its performance will decrease. So we always try to build a machine
learning model which performs well with the training set and also with the test dataset. Here,
we can define these datasets as:

Training Set: A subset of dataset to train the machine learning model, and we already know
the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.
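
A minimal sketch using scikit-learn's train_test_split (the 80/20 split and the random_state value here are illustrative choices, not fixed requirements):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; the rest is used for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)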

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset within a specific range. In feature scaling, we
put our variables on the same scale so that no variable dominates another.
For example, if we compare values of age and salary directly (say, by computing distances), the
salary values will dominate the age values and produce an incorrect result. So to remove this issue,
we need to perform feature scaling.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization
Here, we will use the standardization method for our dataset.
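
A minimal standardization sketch with scikit-learn's StandardScaler (fit only on the training set, then apply the same scaling to the test set):

from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)   # learn mean and std from the training set
x_test = st_x.transform(x_test)         # reuse the training statistics on the test set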

Bias and Variance in Machine Learning

• Bias: Bias refers to the error due to overly simplistic assumptions in the learning
algorithm. These assumptions make the model easier to comprehend and learn, but they
might not capture the underlying complexities of the data. It is the error due to the
model’s inability to represent the true relationship between input and output accurately.
When a model performs poorly on both the training and testing data, it has high bias
because the model is too simple, indicating underfitting.

• Variance: Variance, on the other hand, is the error due to the model’s sensitivity to
fluctuations in the training data. It’s the variability of the model’s predictions for
different instances of training data. High variance occurs when a model learns the
training data’s noise and random fluctuations rather than the underlying pattern. As a
result, the model performs well on the training data but poorly on the testing data,
indicating overfitting.

Underfitting in Machine Learning

A statistical model or a machine learning algorithm is said to underfit when the model is too
simple to capture the complexities of the data. It represents the inability of the model
to learn the training data effectively, resulting in poor performance on both the training and
testing data. In simple terms, an underfit model is inaccurate, especially when
applied to new, unseen examples. It mainly happens when we use a very simple model
with overly simplified assumptions. To address underfitting, we need to use more complex
models, enhanced feature representations, and less regularization.

Note: The underfitting model has High bias and low variance.

Reasons for Underfitting

The model is too simple, so it may not be capable of representing the complexities in the
data.

The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.

The size of the training dataset is not large enough.

Excessive regularization is used to prevent overfitting, which constrains the model
from capturing the data well.

Features are not scaled.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features, performing feature engineering.

3. Remove noise from the data.

4. Increase the number of epochs or increase the duration of training to get better results.

Overfitting in Machine Learning

A statistical model is said to be overfitted when it does not make accurate
predictions on testing data. When a model is trained with too much detail, it starts
learning from the noise and inaccurate entries in our data set, and testing on new data
then shows high variance. The model fails to categorize the data correctly because of too many
details and noise. Overfitting is often caused by non-parametric and non-linear methods, because
these types of machine learning algorithms have more freedom in building the model from the
dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear
algorithm if we have linear data, or to use parameters such as the maximal depth if we
are using decision trees.

In a nutshell, overfitting is a problem where the evaluation of a machine learning
algorithm on training data differs from its evaluation on unseen data.

Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The size of the training data.

Techniques to Reduce Overfitting

1. Improving the quality of training data reduces overfitting by focusing on meaningful
patterns and mitigating the risk of fitting noise or irrelevant features.

2. Increasing the amount of training data can improve the model’s ability to generalize to unseen
data and reduce the likelihood of overfitting.

3. Reduce model complexity.


4. Early stopping during the training phase (keep an eye on the loss during training; as soon as
the loss begins to increase, stop training).

5. Ridge Regularization and Lasso Regularization.

6. Use dropout for neural networks to tackle overfitting.

What is Bias?

Bias is simply the inability of the model to capture the true relationship, because of which there is
some difference or error between the model’s predicted value and the actual value.
These differences between actual or expected values and the predicted values are
known as bias error or error due to bias. Bias is a systematic error that occurs
due to wrong assumptions in the machine learning process.

Bias(Ŷ) = E[Ŷ] − Y

where E[Ŷ] is the expected value of the estimator Ŷ. Bias measures how well the model fits the data.

• Low Bias: Low bias value means fewer assumptions are taken to build the target
function. In this case, the model will closely match the training dataset.

• High Bias: High bias value means more assumptions are taken to build the target
function. In this case, the model will not match the training dataset closely.

The high-bias model will not be able to capture the dataset trend. It is considered as the
underfitting model which has a high error rate. It is due to a very simplified algorithm.

For example, a linear regression model may have a high bias if the data has a non-linear
relationship.
Ways to reduce high bias in Machine Learning:

• Use a more complex model: One of the main reasons for high bias is an overly
simplified model that cannot capture the complexity of the data. In such cases,
we can make our model more complex, for example by increasing the number of hidden layers in
a deep neural network, or by using a more complex model such as polynomial
regression for non-linear datasets, CNNs for image processing, and RNNs for sequence
learning.

• Increase the number of features: Adding more features to the training dataset will
increase the complexity of the model and improve its ability to capture the underlying
patterns in the data.

• Reduce Regularization of the model: Regularization techniques such as L1 or L2
regularization can help prevent overfitting and improve the generalization ability of
the model. If the model has high bias, reducing the strength of regularization or
removing it altogether can help improve its performance.

• Increase the size of the training data: Increasing the size of the training data can help
to reduce bias by providing the model with more examples to learn from the dataset.

What is Variance?

Variance is the measure of spread in data from its mean position. In machine learning,
variance is the amount by which the performance of a predictive model changes when
it is trained on different subsets of the training data. More specifically, variance measures
how sensitive the model is to a different subset of the training dataset, i.e. how much it
adjusts to a new subset of the training data.

Let Ŷ be the model's predicted values of the target variable. Then the variance of a model can
be measured as the expected value of the square of the difference between the predicted
values and the expected value of the predicted values:

Variance(Ŷ) = E[(Ŷ − E[Ŷ])²]

where E[Ŷ] is the expected value of the predicted values, averaged over all the training data.

Variance errors are either low or high-variance errors.

• Low variance: Low variance means that the model is less sensitive to changes in the
training data and can produce consistent estimates of the target function with different
subsets of data from the same distribution. This is the case of underfitting when the
model fails to generalize on both training and test data.

• High variance: High variance means that the model is very sensitive to changes in the
training data and can show significant changes in the estimate of the target function
when trained on different subsets of data from the same distribution. This is the case of
overfitting, when the model performs well on the training data but poorly on new,
unseen test data: it fits the training data so closely that it fails on new data.

Ways to Reduce Variance in Machine Learning:

• Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify if a model is overfitting or underfitting and can be
used to tune hyperparameters to reduce variance.

• Feature selection: Choosing only the relevant features will decrease the model’s
complexity and can reduce the variance error.

• Regularization: We can use L1 or L2 regularization to reduce variance in machine
learning models.

• Ensemble methods: These combine multiple models to improve generalization
performance. Bagging, boosting, and stacking are common ensemble methods that can
help reduce variance and improve generalization performance.

• Simplifying the model: Reducing the complexity of the model, such as decreasing the
number of parameters or layers in a neural network, can also help reduce variance and
improve generalization performance.

• Early stopping: Early stopping is a technique used to prevent overfitting by stopping
the training of a deep learning model when its performance on the validation set stops
improving.

If the algorithm is too simple (a hypothesis with a linear equation), it may have high
bias and low variance and thus be error-prone. If the algorithm is too complex
(a hypothesis with a high-degree equation), it may have high variance and low bias.
In the latter case, it will not perform well on new entries. There is something
between both of these conditions, known as the Bias-Variance Trade-off.
This trade-off in complexity is why there is a trade-off between bias and variance: an
algorithm can’t be more complex and less complex at the same time.

We try to optimize the total error of the model by using the Bias-Variance Trade-off.
The best fit is given by the hypothesis at the trade-off point, where the combined error from
bias and variance is lowest on the error-versus-complexity curve.
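
A minimal sketch of this trade-off (assuming scikit-learn and numpy): fit polynomials of increasing degree to noisy data and compare training and test error. A low degree underfits (high bias), while a very high degree tends to overfit (high variance).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy samples from a smooth underlying function
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # degree 1: both errors high (underfit); degree 15: low train error but typically higher test error (overfit)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")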
