unit-1 Lecture Notes

UNIT-I

Introduction to Machine Learning & Preparing to Model


1.1 Evolution of Machine Learning:
 As of today, machine learning is a mature technology area finding application in
almost every sphere of life.
 Before going further, it is worth looking closely at where it all started.

1.2 WHAT IS HUMAN LEARNING?


 Human learning is typically referred to as the process of gaining information
through observation.
 In our daily life, we need to carry out multiple activities. It may be a task as
simple as walking down the street or doing the homework.
 To do a task properly, we need prior information on one or more things related to
the task. Also, as we keep learning more, or in other words acquiring more
information, our efficiency in doing the tasks keeps improving.
 For example, with more knowledge, we are able to do homework with fewer
mistakes.

1.3 TYPES OF HUMAN LEARNING


Human learning happens in one of three ways:
1. Learning under expert guidance: somebody who is an expert in the subject
directly teaches us.
2. Learning guided by knowledge gained from experts: we build our own notions
indirectly, based on what we have learnt from an expert in the past.
3. Learning by self, or self-learning: we figure it out ourselves, maybe after multiple
attempts, some being unsuccessful.
1.3.1 Learning under expert guidance:
 A child learns from parents at home and from teachers at school, who already have
knowledge in these areas.
 In all phases of human life, there is an element of guided learning. This
learning is imparted by someone simply because he/she has already gathered the
knowledge through past experience in that field.
1.3.2 Learning guided by knowledge gained from experts:
 Learning also happens with knowledge which was imparted by a teacher or
mentor at some point of time, in some other form or context.
 For example, a baby can group together all objects of the same colour even if his
parents have not specifically taught him to do so. He is able to do so because at
some point of time or other his parents told him which colour is blue, which
is red, which is green, etc.
 In all these situations, there is no direct learning. Rather, past information
shared in some different context is used as learning to make decisions.
1.3.3 Learning by self:
In many situations, humans are left to learn on their own.
 A classic example is a baby learning to walk through obstacles. He bumps into
obstacles and falls down multiple times till he learns that whenever there is an
obstacle, he needs to cross over it. He faces the same challenge while learning to
ride a cycle as a kid or drive a car as an adult. Not all things are taught by others.
 A lot of things have to be learnt from mistakes made in the past. We tend to
form a checklist of things that we should do and things that we should not do,
based on our experiences.
1.4 WHAT IS MACHINE LEARNING?
 The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as: Machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things without
being explicitly programmed.
 Tom M. Mitchell has defined machine learning as ‘A computer program is said to
learn from experience E with respect to some class of tasks T and performance measure
P, if its performance at tasks in T, as measured by P, improves with experience E.’
1.4.1 How do machines learn?
The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future decision making
2. Abstraction: The input data is represented in a broader way through the underlying
algorithm
3. Generalization: The abstracted representation is generalized to form a framework for
making decisions.

FIG. 1.2 Process of machine learning

1.4.1.1 Abstraction:
 Representing raw input data in a structured format is the typical task for a learning
algorithm.
 The work of assigning a meaning to data occurs during the abstraction process.
 During the process of knowledge representation, the computer summarizes raw inputs in
a model
 The model may be in any one of the following forms
(a) Computational blocks like if/else rules
(b) Mathematical equations
(c) Specific data structures like trees or graphs.
(d) Logical groupings of similar observations
 This process of fitting the model based on the input data is known as training. Also, the
input data based on which the model is being finalized is known as training data.

1.4.1.2 Generalization:

The other key part is generalization.


• In this phase, the model is applied to a set of unknown data, usually termed test data, in
order to take decisions.
• At this stage, two problems may arise:
1. The trained model is aligned too closely with the training data and hence may not capture
the actual trend.
2. The test data may possess certain characteristics that differ from those of the training data.
1.4.2 Well-posed learning problem:
 A new problem can be defined using a machine learning framework.
 The framework involves answering three questions:
1. What is the problem? Describe the problem informally and formally, and
consider a list of assumptions and similar problems.
2. Why does the problem need to be solved? List the motivation for solving the
problem, the benefits that the solution will provide, and how the solution will be
used.
3. How would I solve the problem? Describe how the problem would be solved
manually, to flush out domain knowledge.
1.5 TYPES OF MACHINE LEARNING:
Machine learning can be classified into three broad categories:
1. Supervised learning: Also called predictive learning. A machine predicts the class of
unknown objects based on prior class related information of similar objects.
2. Unsupervised learning: Also called descriptive learning. A machine finds patterns in
unknown objects by grouping similar objects together.
3. Reinforcement learning: A machine learns to act on its own to achieve the given goals.

FIG: Types of machine learning

1.5.1 Supervised learning:


 In the supervised learning technique, we train the machines using the "labelled" dataset, and
based on the training, the machine predicts the output. Here, the labelled data specifies that some
of the inputs are already mapped to the output.
FIG: Supervised learning
 Supervised machine learning can be classified into two types of problems, which are given
below:
 Classification
 Regression
Classification:
 Classification is a type of supervised learning where a target feature is predicted for test
data based on the information given by training data. The target categorical feature is
known as the class.
 Classification algorithms are used to solve problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc.
 Some real-world examples of classification are spam detection, email
filtering, etc.
Some popular classification algorithms are given below:
 Random Forest Algorithm
 Decision Tree Algorithm
 Logistic Regression Algorithm
 Support Vector Machine Algorithm

FIG: Classification
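To make this concrete, here is a minimal classification sketch in Python using scikit-learn (covered in Section 1.8.1). The built-in iris dataset and the choice of a decision tree are illustrative assumptions, not part of the notes:

```python
# A minimal classification sketch: train on labelled data, predict the
# class of unseen objects, then measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # labelled data: features X, class y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier()           # one of the algorithms listed above
model.fit(X_train, y_train)                # training on labelled data
predictions = model.predict(X_test)        # predict the class of unseen objects
print("Accuracy:", accuracy_score(y_test, predictions))
```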
Regression:
 Regression algorithms are used to solve regression problems in which there is a
relationship (often assumed to be linear) between input and output variables. They are used
to predict continuous output variables, such as market trends, weather conditions, etc.
 It is a type of supervised learning that learns from labelled data sets to predict a
continuous output for new data.
 Some popular regression algorithms are given below:
1. Linear Regression Algorithm
2. Logistic Regression (despite its name, used mainly for classification)
3. Multivariate Regression Algorithm
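As a minimal sketch, the following fits a linear regression with scikit-learn; the tiny house-area/price data set is an illustrative assumption:

```python
# A minimal linear regression sketch: learn from labelled data, then
# predict a continuous output for a new input.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[600], [800], [1000], [1200], [1500]])   # input: area in sq. ft
y = np.array([30, 38, 48, 55, 70])                     # continuous output: price

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[1100]])))   # predicted price for 1100 sq. ft
```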
1.5.2 Unsupervised learning:
 Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.
 In unsupervised learning, the objective is to take a dataset as input and try to find similar
groupings or patterns within the data elements or records.
 Unsupervised learning is also termed descriptive modelling, and the process of unsupervised
learning is referred to as pattern discovery or knowledge discovery.

FIG: Unsupervised learning


Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
 Unsupervised learning is helpful for finding useful insights from the data.
 Unsupervised learning works on unlabeled and uncategorized data.
 In the real world, we do not always have input data with a corresponding output, so to
solve such cases we need unsupervised learning.

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with
the most similarities remain in one group and have few or no similarities with the objects of
another group. Cluster analysis finds the commonalities between the data objects and categorizes
them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the set of items that
occur together in the dataset. Association rules make marketing strategies more effective: for
example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam).
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
 K-means clustering
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Apriori algorithm
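The following is a minimal k-means sketch with scikit-learn; the synthetic two-group data is an illustrative assumption. Note that no labels are supplied anywhere: the algorithm discovers the grouping by itself:

```python
# A minimal k-means clustering sketch on synthetic, unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two loose groups of points; no labels are supplied anywhere
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])        # discovered grouping for the first 10 points
print(kmeans.cluster_centers_)    # centres of the two discovered clusters
```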
1.5.3 Reinforcement learning:
 Reinforcement learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and seeing the results of those actions.
 For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
 In reinforcement learning, the agent learns automatically using this feedback, without any
labelled data.

FIG: Reinforcement learning
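As a minimal sketch of the feedback loop described above, the following tabular Q-learning example uses a toy one-dimensional corridor environment (an illustrative assumption, not from the notes); the agent receives positive feedback only on reaching the goal state:

```python
import random

# Toy 1-D corridor environment (illustrative assumption): states 0..4,
# the agent starts at state 0 and gets a reward of +1 only at state 4.
N_STATES = 5
ACTIONS = [+1, -1]                        # move right / move left
alpha, gamma, epsilon = 0.1, 0.9, 0.2     # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0   # feedback from the environment
        # Q-learning update: positive feedback propagates back through the states
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

# After training, the learned policy is to move right in every state
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})
```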


1.6 PROBLEMS NOT TO BE SOLVED USING MACHINE LEARNING:
 Machine learning should not be applied to tasks in which humans are very effective or
frequent human intervention is needed.
 For example, air traffic control is a very complex task needing intense human
involvement.
 At the same time, for very simple tasks which can be implemented using traditional
programming paradigms, there is no sense in using machine learning.
 For example, simple rule-driven or formula-based applications like price calculator
engine, dispute tracking application, etc. do not need machine learning techniques.
 For situations where training data is not sufficient, machine learning cannot be used
effectively.

1.7 APPLICATIONS OF MACHINE LEARNING: Machine learning is a versatile technology
with a wide range of applications across various domains.

1. Healthcare:
o Diagnostics: ML algorithms can analyze medical images (like X-rays, MRIs) to
detect conditions such as tumors, fractures, and infections.
o Personalized Medicine: Tailoring treatment plans based on individual patient
data and genetic information.
2. Finance:
o Fraud Detection: Identifying unusual patterns in transactions that may indicate
fraudulent activity.
o Algorithmic Trading: Developing trading strategies that can analyze market
conditions and execute trades at optimal times.
3. Retail:
o Recommendation Systems: Providing personalized product recommendations to
customers based on their browsing and purchase history.
o Inventory Management: Predicting demand for products to optimize stock
levels and reduce waste.
o Customer Sentiment Analysis: Analyzing customer reviews and feedback to
improve products and services.
4. Transportation:
o Autonomous Vehicles: Developing self-driving cars that can navigate and make
decisions in real-time.
o Route Optimization: Finding the most efficient routes for logistics and delivery
services.
o Traffic Prediction: Predicting traffic patterns and suggesting optimal routes to
reduce congestion.
5. Manufacturing:
o Predictive Maintenance: Monitoring equipment to predict failures and schedule
maintenance before breakdowns occur.
o Quality Control: Inspecting products for defects using computer vision and other
sensors.
6. Entertainment:
o Content Recommendation: Suggesting movies, music, and other content based
on user preferences.
o Content Creation: Using generative models to create music, art, and even entire
movie scripts.
o Audience Analysis: Analyzing audience behavior to improve content delivery
and marketing strategies.

1.8 STATE-OF-THE-ART LANGUAGES/TOOLS IN MACHINE LEARNING


 The algorithms related to different machine learning tasks are well known and can be
implemented using any language/platform. A few of the most widely used languages and tools
are covered below.
1.8.1 Python
 Python is one of the most popular open-source programming languages, widely adopted
by the machine learning community.
 Python has very strong libraries for advanced mathematical functionality (NumPy),
algorithms and mathematical tools (SciPy), and numerical plotting (matplotlib).
 There is also a machine learning library named scikit-learn, which has various
classification, regression, and clustering algorithms embedded in it.
1.8.2 R
 R is a language for statistical computing and data analysis. It is an open source
language.
 R is a very simple programming language with a huge set of libraries available for
different stages of machine learning.
1.8.3 Matlab
 MATLAB (matrix laboratory) is licensed commercial software with robust support
for a wide range of numerical computing. MATLAB is developed by MathWorks, a
company founded in 1984.
 MATLAB also provides extensive support of statistical functions and has a huge
number of machine learning algorithms in-built.
1.8.4 SAS
 SAS (earlier known as ‘Statistical Analysis System’) is another licensed commercial
software package which provides strong support for machine learning functionality. Developed
in C by SAS Institute, SAS had its first release in the year 1976.
 SAS is a software suite comprising different components. It helps in specialized
functions related to data mining and statistical analysis.

1.9 ISSUES IN MACHINE LEARNING


1.9.1 Privacy:
 The biggest fear and issue arising out of machine learning is related to privacy and the
breach of it. The primary focus of learning is on analyzing data, both past and current,
and coming up with insight from the data.
 For example, if a learning algorithm performs preference-based customer
segmentation and the output of the analysis is used for sending targeted marketing
campaigns without consent, it may hurt people's sentiments and actually do more harm
than good.
1.9.2: Inadequate Training Data

 Data plays a vital role in machine learning algorithms, and many data
scientists report that inadequate, noisy, and unclean data severely hamper
the performance of machine learning algorithms.

1.9.3: Overfitting and Underfitting:

 Overfitting is one of the most common issues. When a machine learning model is
trained on a huge amount of data, it starts capturing the noise and inaccuracies present in
the training data set. This negatively affects the performance of the model.
 Underfitting is just the opposite of overfitting. When a machine learning model is
trained with too little data, it captures the underlying trend incompletely, gives inaccurate
predictions, and destroys the accuracy of the machine learning model.

1.9.4: Data Bias:

 These errors exist when certain elements of the dataset are heavily weighted or given more
importance than others. Biased data leads to inaccurate results, skewed outcomes, and
other analytical errors.

2. Preparing to Model

2.1 Machine Learning Activities:


 The first step in machine learning activity starts with data
 In case of supervised learning, it is the labelled training data set followed by test data
which is not labelled.
 In case of unsupervised learning, there is no question of labelled data but the task is to
find patterns in the input data.
 Multiple pre-processing activities may need to be done on the input data before we can
go ahead with core machine learning activities.

FIG. 2.1 Detailed process of machine learning


1. Following are the typical preparation activities done once the input data comes into the
machine learning system:

 Understand the type of data in the given input data set.


 Explore the data to understand the nature and quality.
 Explore the relationships amongst the data elements.
 Find potential issues in data.
 Do the necessary remediation, e.g. impute missing data values, etc., if needed.
 Apply pre-processing steps, as necessary

2. Once the data is prepared for modelling, the following activities are carried out as part of
learning:
 The input data is first divided into two parts – the training data and the test data (called
the holdout). This step is applicable for supervised learning only.
 Consider different models or learning algorithms for selection.
 Train the model based on the training data (for supervised learning), or directly apply the
chosen model (for unsupervised learning).

3. The performance of the model is evaluated.


4. Based on options available, specific actions can be taken to improve the performance of
the model, if possible.
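A minimal sketch of this flow in Python with scikit-learn, assuming the built-in wine dataset and a 75/25 holdout split (both illustrative choices):

```python
# A minimal sketch of the flow in FIG. 2.1: pre-process, split into training
# and test (holdout) data, train, then evaluate.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)      # the holdout split (step 2)

model = make_pipeline(StandardScaler(),        # a pre-processing step (step 1)
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)                    # training (step 2)
print("Holdout accuracy:", model.score(X_test, y_test))  # evaluation (step 3)
```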
2.2 Basic Types of Data in Machine Learning
 A data set is a collection of related information or records.
• Each row of a dataset is called a record.
• Each data set also has multiple attributes, each of which gives information on a specific
characteristic.
• Attributes can also be termed features, variables, dimensions, or fields.
• Example: a data set on students in which each record consists of information about a
specific student.

 Data can broadly be divided into following two types:


1. Qualitative data
2. Quantitative data

1. Qualitative data:
• Qualitative data provides information about the quality of an object or information which
cannot be measured.
• For example, if we consider the quality of performance of students in terms of 'Good',
'Average', and 'Poor', it falls under the category of qualitative data.
• Also, the name or roll number of students is information that cannot be measured using
any scale of measurement.
• Qualitative data is also called categorical data.
• Qualitative data - two types:
(a) Nominal data
(b) Ordinal data

(a) Nominal data is one which has no numeric value, but a named value.
• It is used for assigning named values to attributes.
• Nominal values cannot be quantified.
• Examples of nominal data are
1. Blood group: A+, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
• Mathematical operations such as addition, subtraction, and multiplication, and statistical
functions such as mean and variance, cannot be performed on nominal data. Basic counting
is, however, possible, so the mode can be identified.

(b) Ordinal data is naturally ordered.
• This means ordinal data also assigns named values to attributes, but unlike nominal data,
they can be arranged in a sequence of increasing or decreasing value.
• Hence comparison is possible here. Examples:
1. Customer satisfaction: 'Very Happy', 'Happy', 'Unhappy', etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: 'Very Hard', 'Hard', 'Soft', etc.
• Like nominal data, basic counting is possible for ordinal data; hence the mode and
median can be identified. But the mean cannot be calculated.
2. Quantitative data
• Quantitative data relates to information about the quantity of an object (hence it can be
measured).
• Quantitative data is also termed numeric data.
• There are two types of quantitative data:
(a) Interval data
(b) Ratio data
Interval data:
 Interval data is numeric data for which not only the order is known, but the exact difference
between values is also known.
 An ideal example of interval data is Celsius temperature. The difference between each value
remains the same in Celsius temperature.
 For example, the difference between 12°C and 18°C is measurable and is 6°C, as is the
difference between 15.5°C and 21.5°C.
 Other examples include date, time, etc.
 For interval data, mathematical operations such as addition and subtraction are possible. For
that reason, for interval data, the central tendency can be measured by mean, median, or mode.
Standard deviation can also be calculated.
Ratio data:
 Ratio data represents numeric data for which exact value can be measured. Absolute zero is
available for ratio data.
 These variables can be added, subtracted, multiplied, or divided.
 The central tendency can be measured by mean, median, or mode and methods of dispersion
such as standard deviation. Examples of ratio data include height, weight, age, salary, etc
2.3 EXPLORING STRUCTURE OF DATA
 Data exploration refers to the initial step in data analysis in which data analysts use data
visualization and statistical techniques to describe dataset characterizations, such as size,
quantity, and accuracy, in order to better understand the nature of the data.
2.3.1 Exploring numerical data
 The two most effective mathematical plots to explore numerical data are the box plot and
the histogram.
Understanding central tendency:
 To understand the nature of numeric variables, we can apply the measures of central
tendency of data, i.e. mean, mode and median
 In statistics, measures of central tendency help us understand the central point of a set of
data.
 Mean is a sum of all data values divided by the count of data elements.
 The mean, being calculated from the cumulative sum of data values, is impacted if too many
data elements have values close to the maximum or minimum. It is
especially sensitive to outliers, i.e. values which are unusually high or low compared
to the other values.
Understanding data spread:
 In some data sets, the data values are concentrated closely near the mean and in other
data sets, the data values are more widely spread out from the mean. So, we will take a
granular view of the data spread in the form of
1. Dispersion of data
2. Position of the different data values

1. Measuring data dispersion:


 Consider the data values of two attributes
1. Attribute 1 values: 44, 46, 48, 45, and 47
2. Attribute 2 values: 34, 46, 59, 39, and 52
 Both sets of values have a mean of 46.
 However, the first set of values (attribute 1) is more concentrated or clustered around
the mean/median value, whereas the second set of values (attribute 2) is quite spread out or
dispersed.
 To measure the extent of dispersion of a data, or to find out how much the different values of a
data are spread out, the variance of the data is measured. The variance of a data is measured
using the formula given below:

$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

where $x_i$ are the values of the variable or attribute whose variance is to be measured, $\bar{x}$ is
their mean, and $n$ is the number of observations.
 The standard deviation of a data set is measured as follows:

$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
 Larger value of variance or standard deviation indicates more dispersion in the data and vice
versa.
 In the above example, let’s calculate the variance of attribute 1 and that of attribute 2.
 For attribute 1,
$\sigma^2 = \frac{(44-46)^2 + (46-46)^2 + (48-46)^2 + (45-46)^2 + (47-46)^2}{5} = \frac{10}{5} = 2$
For attribute 2,
$\sigma^2 = \frac{(34-46)^2 + (46-46)^2 + (59-46)^2 + (39-46)^2 + (52-46)^2}{5} = \frac{398}{5} = 79.6$
 So it is clear from the measure that attribute 1 values are quite concentrated around the mean
while attribute 2 values are extremely spread out.
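The calculation above can be verified with a few lines of plain Python (dividing by n, to match the population-variance formula given earlier):

```python
# Verifying the dispersion calculation for the two attributes above.
attr1 = [44, 46, 48, 45, 47]
attr2 = [34, 46, 59, 39, 52]

def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

print(variance(attr1), variance(attr1) ** 0.5)   # 2.0, ~1.41
print(variance(attr2), variance(attr2) ** 0.5)   # 79.6, ~8.92
```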
2. Measuring data value position:
 When the data values of an attribute are arranged in an increasing order, we have seen
earlier that median gives the central data value, which divides the entire data set into two
halves.
 Similarly, if the first half of the data is divided into two halves so that each half consists of
one quarter of the data set, then the median of the first half is known as the first quartile, or Q1.
 In the same way, if the second half of the data is divided into two halves, then the median
of the second half is known as the third quartile, or Q3.
 The overall median is also known as the second quartile or Q2. So, any data set has five key
values: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
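For example, NumPy can compute this five-number summary directly; note that np.percentile's default interpolation is just one of several quartile conventions:

```python
# Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
import numpy as np

data = np.array([34, 46, 59, 39, 52, 44, 46, 48, 45, 47])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(data.min(), q1, q2, q3, data.max())
```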

2.3.2 Plotting and exploring numerical data


 There are two most effective mathematical plots to explore numerical data.
1. box plot
2. histogram

2.3.2.1 Box Plots: It is a type of chart that depicts a group of numerical data through their
quartiles. It is a simple way to visualize the shape of our data.
 The components of a box plot give a five-number summary of a set of data, which is:
 Minimum – It is the minimum value in the dataset excluding the outliers.
 First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
 Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half
above.
 Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
 Maximum – It is the maximum value in the dataset excluding the outliers.

 The area inside the box (50% of the data) is known as the Inter Quartile Range. The IQR
is calculated as – IQR = Q3-Q1
 Outliers are the data points below the lower limit and above the upper limit.
 The lower and upper limits are calculated as: Lower Limit = Q1 - 1.5*IQR, Upper Limit =
Q3 + 1.5*IQR.
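A minimal box plot sketch with matplotlib; the data, with one deliberately extreme value, is an illustrative assumption:

```python
# A minimal box plot: the box spans Q1..Q3 (the IQR), the line inside it
# is the median, and values beyond the 1.5*IQR limits show as points.
import matplotlib.pyplot as plt

data = [34, 46, 59, 39, 52, 44, 46, 48, 45, 47, 90]  # 90 is deliberately extreme
plt.boxplot(data)
plt.ylabel("Attribute value")
plt.show()
```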
2.3.2.2 HISTOGRAM:
 A histogram is another plot which helps in effective visualization of numeric attributes. It
helps in understanding the distribution of numeric data over a series of intervals, also
termed ‘bins’.
 The important differences between a histogram and a box plot are:
 The focus of histogram is to plot ranges of data values (acting as ‘bins’), the number of
data elements in each range will depend on the data distribution. Based on that, the size
of each bar corresponding to the different ranges will vary.
 The focus of box plot is to divide the data elements in a data set into four equal portions,
such that each portion contains an equal number of data elements.

FIG. General Histogram shapes
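A minimal histogram sketch with matplotlib; the 1,000 normally distributed values and the choice of 20 bins are illustrative assumptions:

```python
# A minimal histogram: bar heights follow the data distribution per bin.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)
plt.hist(values, bins=20)
plt.xlabel("Attribute value")
plt.ylabel("Count")
plt.show()
```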

2.3.3 Exploring categorical data: Categorical data is used to group information with similar
characteristics. Since the mean and median cannot be applied to categorical variables, the mode
is the only applicable measure of central tendency.
• An attribute may have one or more modes. A frequency distribution of an attribute having a
1. single mode is called 'unimodal',
2. two modes is called 'bimodal', and
3. multiple modes is called 'multimodal'.
 Let's take a sample data set:

S.No Owner ID Car Name


1 501 Maruti
2 502 Toyota
3 503 Maruti
4 504 Honda
5 505 MG
6 506 Honda
7 507 Maruti
8 508 Toyota
9 509 Maruti
10 510 KIA
First, we need to find how many unique names there are for the attribute 'Car Name'. We
get the following:
Maruti
Toyota
Honda
MG
KIA
We may also want a little more detail, in the form of a table consisting of the categories of
the attribute and the count of data elements falling into each category:
Name Count
Maruti 4
Toyota 2
Honda 2
MG 1
KIA 1
 In the same way, we may also be interested in the proportion (or percentage) of the count of
data elements belonging to a category.
 For example, for the attribute 'Car Name', the proportion of data elements belonging to the
category Maruti is 4 ÷ 10 = 0.4, i.e. 40%; Toyota is 20%, Honda is 20%, MG is 10%, and KIA
is 10%.
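The counts and proportions above can be reproduced with pandas:

```python
# Reproducing the category counts and proportions for 'Car Name'.
import pandas as pd

cars = pd.Series(["Maruti", "Toyota", "Maruti", "Honda", "MG",
                  "Honda", "Maruti", "Toyota", "Maruti", "KIA"])
print(cars.value_counts())                  # counts per category
print(cars.value_counts(normalize=True))    # proportions (Maruti = 0.4, ...)
```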
2.3.4 Exploring relationship between variables:
 Analyzing and visualizing variables one at a time is not enough. To make various conclusions
and analyses when performing exploratory data analysis, we need to understand how the
variables in a dataset interact with respect to each other.
 There are numerous ways to analyze this relationship visually; one of the most common
methods is the scatter plot.
2.3.4.1 Scatter plot:
 A scatterplot is one of the most common visual forms when it comes to comprehending the
relationship between variables at a glance.
 In the simplest form, this is nothing but a plot of Variable A against Variable B: either one
being plotted on the x-axis and the remaining one on the y-axis
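A minimal scatter plot sketch with matplotlib; the height/weight values are an illustrative assumption:

```python
# A minimal scatter plot of Variable A against Variable B.
import matplotlib.pyplot as plt

height = [150, 155, 160, 165, 170, 175, 180]   # Variable A on the x-axis
weight = [50, 54, 59, 64, 68, 74, 80]          # Variable B on the y-axis
plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()
```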

2.3.4.2 Two-way cross-tabulations:


 A two-way cross-tabulation (also called a cross-tab or contingency table) is used to
understand the relationship between two categorical attributes in a concise way.
 It has a matrix format that presents a summarized view of the bivariate frequency
distribution.
 A cross-tab, very much like a scatter plot, helps us understand how much the data values
of one attribute change with changes in the data values of another attribute.
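A minimal cross-tab sketch with pandas; the small gender/car-preference data set is an illustrative assumption:

```python
# A two-way cross-tab: a matrix view of the bivariate frequency distribution.
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Preference": ["SUV", "Hatchback", "SUV", "SUV", "Sedan", "Hatchback"],
})
print(pd.crosstab(df["Gender"], df["Preference"]))
```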
2.4 DATA QUALITY AND REMEDIATION
2.4.1 Data quality
 Success of machine learning depends largely on the quality of data.
 We come across at least two types of problems:
1. Certain data elements are without a value, i.e. have missing values.
2. Data elements having values surprisingly different from those of the other elements, which
we term outliers.
 There are multiple factors which lead to these data quality issues. Following are some of them:
(a) Incorrect sample set selection: The data may not reflect normal or regular quality due to
incorrect selection of sample set.
 For example, suppose we select a sample set of sales transactions from a festive period
and try to use that data to predict future sales. In this case, the prediction will be far
from the actual scenario, simply because the sample set was selected at the wrong
time. Problems may also arise due to incorrect sample size. For example, a sample of small
size may not capture all the aspects or information needed for proper learning of the
model.
(b) Errors in data collection, resulting in outliers and missing values: In many cases, a person
or group of persons is responsible for collecting the data to be used in a learning activity. In
this manual process, there is the possibility of wrongly recording data values. This may
result in data elements whose values are abnormally high or low compared to the other elements.
2.4.2 Data remediation: Let’s see how to handle outliers and missing values.
2.4.2.1 Handling outliers:
 Outliers are data elements with abnormally high or low values which may impact prediction
accuracy, especially in regression models.
 Once the outliers are identified and the decision has been taken to modify those values, you
may consider one of the following approaches:
• Remove outliers: If the number of records which are outliers is not many, a simple
approach may be to remove them.
• Imputation: replace the outlier value with the mean, median, or mode. The value of the most
similar data element may also be used for imputation.
• Capping: For values that lie outside the 1.5 * IQR limits, we can cap them by replacing
those observations with the nearest limit value.
• If there is a significant number of outliers, they should be treated separately in the statistical
model. In that case, the data should be treated as two different groups; a model should be
built for each group and the outputs then combined.
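A sketch of the capping approach with pandas, assuming an illustrative series with one abnormal value:

```python
# IQR-based capping: values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR are
# replaced with the nearest limit.
import pandas as pd

s = pd.Series([44, 46, 48, 45, 47, 120])      # 120 is an abnormal value
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s.clip(lower, upper))                   # 120 is capped at the upper limit
```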
2.4.2.2 Handling missing values:
• There are multiple strategies to handle missing value of data elements
1. Eliminate records having a missing value of data elements
2. Imputing missing values
3. Estimate missing values
1.Eliminate records having a missing value of data elements:
• In case the proportion of data elements having missing values is within a tolerable limit,
a simple but effective approach is to remove the records having such data elements.
• This will not be possible if the proportion of records having data elements with missing
values is really high.
• It will also reduce the power of the model, because of the reduction in training data size.
2.Imputing missing values
Imputation is a method to assign a value to the data elements having missing values.
• The mean, median, or mode is the most frequently assigned value.
• For quantitative attributes, all missing values are imputed with the mean, median, or
mode of the remaining values under the same attribute.
• For qualitative attributes, all missing values are imputed by the mode of all remaining
values of the same attribute.
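A sketch of mean and mode imputation with pandas; the NaN entries are illustrative:

```python
# Mean imputation for a quantitative attribute, mode imputation for a
# qualitative attribute.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "marks": [72, np.nan, 65, 80, np.nan],     # quantitative attribute
    "grade": ["B", "A", np.nan, "A", "B"],     # qualitative attribute
})
df["marks"] = df["marks"].fillna(df["marks"].mean())     # impute with the mean
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])  # impute with the mode
print(df)
```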
3. Estimate missing values:
 If there are data points similar to the ones with missing attribute values, then the
attribute values from those similar data points can be used in place of the missing
values.
 For finding similar data points or observations, a distance function can be used.
 For example, let’s assume that the weight of a Russian student having age 12 years and
height 5 ft. is missing. Then the weight of any other Russian student having age close to
12 years and height close to 5 ft. can be assigned
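One common way to implement this idea (a sketch, not necessarily the notes' intended method) is scikit-learn's KNNImputer, which uses a distance function to find the most similar observations; the student data is an illustrative assumption:

```python
# Estimating a missing value from the most similar data points.
import numpy as np
from sklearn.impute import KNNImputer

# columns: age (years), height (ft), weight (kg); one weight is missing
students = np.array([
    [12.0, 5.0, 41.0],
    [12.5, 5.1, 43.0],
    [12.2, 4.9, np.nan],   # weight to be estimated from similar students
    [15.0, 5.8, 60.0],
])
# the two nearest students (by distance) supply the imputed weight
print(KNNImputer(n_neighbors=2).fit_transform(students))
```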
