
Machine Learning 2025

MODULE 1

CHAPTER 1 - INTRODUCTION

1.1 NEED FOR MACHINE LEARNING

Business organizations handle massive amounts of data in their daily activities.


Earlier, the full potential of this data remained untapped due to two main reasons:

1. Scattered Data Sources: Data was distributed across different archival systems, making it difficult for organizations to integrate and utilize it effectively.
2. Lack of Awareness of Tools: Organizations were unaware of software tools
capable of extracting valuable information from the data.

This situation has changed with the adoption of machine learning, which allows
businesses to utilize this data effectively.

Reasons for the Popularity of Machine Learning

Machine learning has gained immense popularity among business organizations for
three key reasons:

1. High Volume of Available Data:
   o Companies like Facebook, Twitter, and YouTube generate massive amounts of data daily.
   o Data volume is estimated to double approximately every year.
2. Reduced Cost of Storage:
   o The cost of storage and hardware has dropped significantly.
   o It is now easier to capture, process, store, distribute, and transmit digital information.
3. Availability of Complex Algorithms:
   o The advent of deep learning has provided numerous algorithms for machine learning.

Due to these factors, machine learning has become a dominant technology trend in
business organizations.


KNOWLEDGE PYRAMID

The knowledge pyramid comprises the following terms: data, information, knowledge, intelligence, and wisdom.

1. Data:
   o All facts are considered data.
   o Data can be numbers or text processed by computers.
   o Organizations accumulate data from various sources like flat files, databases, and data warehouses in diverse storage formats.
2. Information:
   o Processed data is called information.
   o It includes patterns, associations, or relationships among data.
   o Example: Sales data analyzed to determine the fastest-selling product.
3. Knowledge:
   o Condensed information is termed knowledge.
   o Example: Historical patterns and future trends derived from sales data.


o Knowledge is useful only when applied effectively.


4. Intelligence:
   o Knowledge in actionable form is called intelligence.
   o Intelligence refers to applying knowledge for decision-making or actions.
   o Computer systems have excelled in achieving intelligence.
5. Wisdom:
   o Wisdom represents the maturity of mind and is exhibited solely by humans.

The Need for Machine Learning

The goal of machine learning is to process archival data to help organizations:

• Make better decisions.


• Design new products.
• Improve business processes.
• Develop effective decision-support systems.

Machine learning bridges the gap between data and wisdom, enabling businesses to
thrive in a data-driven environment.

1.2 MACHINE LEARNING EXPLAINED


Definition of Machine Learning

• Machine Learning (ML) is a subfield of Artificial Intelligence (AI).

• Arthur Samuel, a pioneer in AI, defined ML as:


o "Machine learning is the field of study that gives computers the ability
to learn without being explicitly programmed."
• This means that ML systems can improve their performance and make
decisions based on experience rather than following pre-defined rules.


Conventional Programming vs. Machine Learning

• Conventional Programming:
  o A problem is analyzed, and a solution is designed using algorithms or flowcharts.
  o The solution is then implemented as a program using a programming language.
  o Suitable for structured problems but struggles with complex, unpredictable tasks like puzzles, gaming strategies, or image recognition.
• Early AI Approach – Expert Systems:
  o AI initially relied on expert systems, which converted human expert knowledge into a set of rules for a program.
  o Example: MYCIN, an expert system designed for medical diagnosis.
  o These systems used logical rules derived from human expertise to solve problems.

o However, expert systems had limitations:
  ▪ Required extensive human expertise for rule creation.
  ▪ Could not adapt or improve on their own.
  ▪ Lacked real intelligence, as they could not handle new, unseen situations effectively.

Shift to Machine Learning

• Due to the limitations of expert systems, AI shifted towards data-driven learning instead of rule-based programming.

• Machine Learning Approach:

o Instead of manually creating rules, ML systems learn from data.


o A learning algorithm analyzes data and extracts patterns.
o The extracted patterns are used to build models that can make predictions on new data.
o This allows systems to improve their accuracy and adaptability over time.

How Machine Learning Works

• Humans vs. Machines in Learning:

o Humans learn based on experience and make decisions accordingly.
o Machines learn from data, forming models that help them make decisions.
o The process can be understood as:
  ▪ Humans: Experience → Decision-making.
  ▪ Machines: Data → Learning Program → Model → Predictions.
• The goal of ML is to develop models that can generalize and correctly predict
unknown data without requiring explicit instructions.

Key Takeaways

• ML allows computers to learn from data rather than relying on predefined rules.
• Expert systems were an early attempt at AI but were limited by their
dependence on human-defined rules.
• ML overcomes these limitations by using data-driven learning to create
adaptable, intelligent models.
• The learning process in ML mimics human learning, where past experiences
(data) help in making future decisions.


The models can then be used to predict new inputs. Thus, the aim of machine
learning is to learn a model or set of rules from the given dataset automatically so
that it can predict the unknown data correctly.
As humans take decisions based on an experience, computers make models based on
extracted patterns in the input data and then use these data-filled models for
prediction and to take decisions. For computers, the learnt model is equivalent to
human experience.
The quality of data determines the quality of experience and, therefore, the quality of
the learning system. In statistical learning, the relationship between the input x and
output y is modeled as a function in the form y = f(x). Here, f is the learning function
that maps the input to output y. Learning of function f is the crucial aspect of forming
a model in statistical learning. In machine learning, this is simply called mapping of
input to output.
The learning program summarizes the raw data in a model. Formally stated, a model
is an explicit description of patterns within the data in the form of:

1. Mathematical equation

2. Relational diagrams like trees/graphs.

3. Logical if/else rules, or

4. Groupings called clusters


In summary, a model can be a formula, procedure, or representation that can generate decisions from data. The difference between a pattern and a model is that the former is local and applicable only to certain attributes, while the latter is global and fits the entire dataset. For example, a model can be helpful to examine whether a given email is spam or not. The point is that the model is generated automatically from the given data.
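To make the idea of a model concrete, the following is a minimal sketch (not taken from the text) contrasting two of the forms listed above: a logical if/else rule and a mathematical equation. The thresholds and coefficients used here are illustrative placeholders only.

```python
# Hedged sketch: two forms a "model" can take.
# The spam thresholds are made up; the coefficients 0.66 and 0.54 reuse the
# regression line quoted later in this chapter.

def rule_model(num_links: int, has_attachment: bool) -> str:
    """Model as logical if/else rules, e.g. a toy spam check."""
    if num_links > 5 and has_attachment:
        return "spam"
    return "not spam"

def equation_model(x: float, w: float = 0.66, b: float = 0.54) -> float:
    """Model as a mathematical equation y = w*x + b."""
    return w * x + b

print(rule_model(num_links=8, has_attachment=True))  # spam
print(equation_model(8))                             # 5.82
```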
Another pioneer of AI, Tom Mitchell’s definition of machine learning states that, “A
computer program is said to learn from experience E, with respect to task T and some
performance measure P, if its performance on T measured by P improves with
experience E.” The important components of this definition are experience E, task T,
and performance measure P.


For example, the task T could be detecting an object in an image. The machine can gain knowledge of the object using a training dataset of thousands of images. This is called experience E.
So, the focus is to use this experience E for this task of object detection T. The ability
of the system to detect the object is measured by performance measures like
precision and recall. Based on the performance measures, course correction can be
done to improve the performance of the system.
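To make the performance measure P concrete, the following is a minimal sketch computing precision and recall for a toy detection task. The label lists are made up for illustration and are not from the text.

```python
# Hedged illustration: computing precision and recall (the performance measure P)
# for a made-up object-detection outcome.

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = object present, 0 = absent
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # of the detections made, how many were correct
recall = tp / (tp + fn)      # of the actual objects, how many were detected

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```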
Models of computer systems are equivalent to human experience. Experience is based on data. Humans gain experience by various means. They gain knowledge by rote learning. They observe others and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many things by trial and error.
Once the knowledge is gained, when a new problem is encountered, humans search
for similar past situations and then formulate the heuristics and use that for
prediction. But, in systems, experience is gathered by these steps:

1 Collection of data
2 Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to generate concepts. This is equivalent to humans' idea of objects; for example, we have some idea of what an elephant looks like.
3 Generalization converts the abstraction into an actionable form of intelligence. It can be viewed as an ordering of all possible concepts. So, generalization involves ranking of concepts, inferencing from them, and formation of heuristics, an actionable aspect of intelligence.
4 Heuristics are educated guesses for all tasks. For example, if one runs on encountering danger, it is the result of human experience, or heuristic formation. In machines, it happens the same way.
5 Heuristics normally work! But occasionally, they may fail too. It is not the fault of the heuristic, as it is just a 'rule of thumb'. Course correction is done by taking evaluation measures. Evaluation checks the thoroughness of the models and does course correction, if necessary, to generate better formulations.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS


Machine learning primarily uses the concepts of Artificial Intelligence, Data Science, and Statistics. It is the result of combining ideas from these diverse fields.

1.3.1 Machine Learning and Artificial Intelligence

• Machine learning is an important branch of AI, which is a much broader subject.

• The aim of AI is to develop intelligent agents.
  o An agent can be a robot, human, or any autonomous system.


• Initially, AI aimed to develop intelligent systems like human beings,
focusing on logic and logical inferences.

• AI has gone through ups and downs, with setbacks referred to as AI winters.
• The resurgence of AI happened due to data-driven systems, focusing on
finding relations and regularities in data.
• Machine learning is a sub-branch of AI that focuses on extracting patterns for prediction.
• It includes various approaches like learning from examples and
reinforcement learning.
• Machine learning models can analyze unknown instances and generate
results.

Deep Learning and Machine Learning


• Deep Learning (DL) is a sub-branch of ML that uses neural network technology.
• Neural networks are inspired by human neuron models.
• Many neurons form a network connected with activation functions,
triggering other neurons to perform tasks.


1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics

Machine Learning and Data Science

• Data science is an umbrella term that includes machine learning as a branch.
• Data science focuses on gathering, analyzing, and extracting knowledge from data.

Big Data and Machine Learning

• Big Data is characterized by:

  1. Volume – Large-scale data generated by platforms like Facebook, Twitter, and YouTube.

  2. Variety – Data exists in multiple formats, including images, videos, and text.

  3. Velocity – High-speed data generation and processing.


• Machine learning algorithms use big data for tasks such as language
translation and image recognition.
• Big data has influenced the growth of deep learning, which constructs
models using neural networks.


Machine Learning and Data Mining


• Data mining originates from business applications and involves extracting
hidden patterns from data.

• Difference between ML and Data Mining:

  o Data Mining focuses on discovering hidden patterns.

  o Machine Learning uses these patterns for prediction and decision-making.

Machine Learning and Data Analytics

• Data Analytics extracts useful knowledge from raw data.


• Predictive Data Analytics is used for making predictions and shares many
algorithms with ML.

Machine Learning and Pattern Recognition


• Pattern Recognition is an engineering field that uses machine learning
algorithms.

• It focuses on extracting features for pattern analysis and classification.

• Pattern recognition can be considered a specific application of ML.


1.3.3 Machine Learning and Statistics


• Statistics is a branch of mathematics with a strong theoretical foundation in
statistical learning.

• Both statistics and ML learn from data, but their approaches differ:

o Statistics:
  ▪ Identifies regularities in data (patterns).
  ▪ Starts with a hypothesis and conducts experiments to validate it.
  ▪ Uses mathematical models with many assumptions.
  ▪ Requires a strong understanding of statistical procedures.
o Machine Learning:
  ▪ Operates with fewer assumptions and focuses on extracting patterns from data.
  ▪ Requires less statistical knowledge but involves using various tools to automate the learning process.


  ▪ Seen as a modern evolution of traditional statistics due to its focus on automation and prediction.

1.4 TYPES OF MACHINE LEARNING

Labelled and Unlabelled Data


• Data is a raw fact. Normally, data is represented in the form of a table. Data also
can be referred to as a data point, sample, or an example.
• Each row of the table represents a data point. Features are attributes or
characteristics of an object. Normally, the columns of the table are attributes. Out
of all attributes, one attribute is important and is called a label. Label is the
feature that we aim to predict. Thus, there are two types of data – labelled and
unlabelled.
• Labelled Data – To illustrate labelled data, let us take one example dataset called the Iris flower dataset or Fisher's Iris dataset. The dataset has 50 samples of each of the three Iris species, with four attributes: the length and width of the sepals and petals. The target variable is called class. There are three classes – Iris setosa, Iris virginica, and Iris versicolor.


Table 1.1: Iris Flower Dataset

• A dataset need not always be numbers. It can be images or video frames. Deep neural networks can handle images with labels. In the following Figure 1.6, a deep neural network takes images of dogs and cats with labels for classification.
• In unlabeled data, there are no labels in the dataset.

1.4.1 Supervised Learning


• Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning. A supervisor provides labelled data with which the model is constructed; the constructed model is then evaluated on test data.
• In supervised learning algorithms, learning takes place in two stages. In layman
terms, during the first stage, the teacher communicates the information to the
student that the student is supposed to master. The student receives the
information and understands it. During this stage, the teacher has no knowledge
of whether the information is grasped by the student.
• This leads to the second stage of learning. The teacher then asks the student a set
of questions to find out how much information has been grasped by the student.
Based on these questions, the student is tested, and the teacher informs the student
about his assessment. This kind of learning is typically called supervised learning.

Supervised learning has two methods:

1. Classification

2. Regression

1.Classification
• Classification is a supervised learning method. The input attributes of the
classification algorithms are called independent variables.
• The target attribute is called label or dependent variable. The relationship between
the input and target variable is represented in the form of a structure which is
called a classification model. So, the focus of classification is to predict the ‘label’
that is in a discrete form (a value from the set of finite values).
• An example is shown in Figure 1.7 where a classification algorithm takes a set of
labelled data images such as dogs and cats to construct a model that can later be
used to classify an unknown test image data.



• In classification, learning takes place in two stages. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training-set samples are processed, the model is generated. In the second stage, the constructed model is tested with a test or unknown sample and a label is assigned. This is the classification process.
• This is illustrated in the above Figure 1.7. Initially, the classification learning
algorithm learns with the collection of labelled data and constructs the model.
Then, a test case is selected, and the model assigns a label.
• Similarly, in the case of the Iris dataset, if the test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the classification model generates the label for it. This is called classification. One example of classification is image recognition, which includes classification of diseases like cancer, classification of plants, etc. (a sketch is given after the algorithm list below).
• The classification models can be categorized based on the implementation
technology like decision trees, probabilistic methods, distance measures, and soft
computing methods.
• Classification models can also be classified as generative models and
discriminative models. Generative models deal with the process of data
generation and its distribution. Probabilistic models are examples of generative
models. Discriminative models do not care about the generation of data. Instead,
they simply concentrate on classifying the given data.
• Some of the key algorithms of classification are:

1. Decision Tree

2. Random Forest
3. Support Vector Machines

4. Naïve Bayes
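As a concrete illustration, the following is a minimal sketch, assuming scikit-learn is available, that trains a decision tree (algorithm 1 in the list above) on the Iris dataset and predicts the label of the test sample (6.3, 2.9, 5.6, 1.8) mentioned earlier.

```python
# Minimal sketch, assuming scikit-learn is installed: train a decision tree on
# the labelled Iris dataset (training stage) and assign a label to an unknown
# test sample (testing stage).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=0)
model.fit(iris.data, iris.target)            # learn from labelled data

test_sample = [[6.3, 2.9, 5.6, 1.8]]         # sepal/petal length and width
predicted = model.predict(test_sample)[0]    # model assigns a label
print(iris.target_names[predicted])          # e.g. 'virginica'
```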

2.Regression Models
• Regression models, unlike classification algorithms, predict continuous variables like price. In other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset that represents the week as input x and product sales as y.
• The regression model takes input x and generates a model in the form of a fitted line y = f(x). Here, x is the independent variable, which may be one or more attributes, and y is the dependent variable. In Figure 1.8, linear regression takes the training set and tries to fit it with a line: product sales = 0.66 × Week + 0.54. Here, 0.66 and 0.54 are the regression coefficients learnt from the data.
• The advantage of this model is that a prediction for product sales (y) can be made for an unknown week (x). For example, the prediction for the unknown eighth week can be made by substituting x = 8 in the regression formula to get y (a sketch is given after this list).
• Both regression and classification models are supervised algorithms. Both have a
supervisor and the concepts of training and testing are applicable to both. What is
the difference between classification and regression models? The main difference
is that regression models predict continuous variables such as product price, while
classification concentrates on assigning labels such as class.
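The following is a minimal sketch of the above prediction step, assuming NumPy is available. The weekly sales values are made up so that they roughly follow the fitted line quoted above; they are not taken from the text.

```python
# Hedged sketch: fit a line y = w*x + b by least squares and predict week 8.
# The sales values below are synthetic, generated around 0.66 * week + 0.54.
import numpy as np

weeks = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
sales = 0.66 * weeks + 0.54 + np.random.default_rng(0).normal(0, 0.05, weeks.size)

w, b = np.polyfit(weeks, sales, deg=1)        # slope and intercept
print(f"fitted line: sales = {w:.2f} * week + {b:.2f}")

# Predict product sales for the unknown eighth week.
print("week 8 prediction:", w * 8 + b)        # about 0.66 * 8 + 0.54 = 5.82
```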
1.4.2 Unsupervised Learning
• The second kind of learning is by self-instruction. As the name suggests, there are no supervisor or teacher components. In the absence of a supervisor or teacher, self-instruction is the most common kind of learning process. This process of self-instruction is based on the concept of trial and error.
• Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes the examples and recognizes patterns based on the principle of grouping. Grouping is done so that similar objects fall into the same group.
• Cluster analysis and dimensionality reduction algorithms are examples of unsupervised algorithms.

Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into
disjoint clusters or groups. Cluster analysis clusters objects based on their attributes.
All the data objects of the partitions are similar in some aspect and vary from the data
objects in the other partitions significantly. Some of the examples of clustering
processes are — segmentation of a region of interest in an image, detection of
abnormal growth in a medical image, and determining clusters of signatures in a gene
database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set of dog and cat images and groups them into two clusters – dogs and cats. It can be observed that the samples belonging to a cluster are similar, and samples differ radically across clusters.

Some of the key clustering algorithms are:

1. k-means algorithm

2. Hierarchical algorithms
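The following is a minimal sketch, assuming scikit-learn is available, of the k-means algorithm listed above applied to a few made-up 2-D points. No labels are supplied; the algorithm discovers the grouping itself.

```python
# Hedged sketch: group made-up 2-D points into two clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster
```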

Dimensionality Reduction
Dimensionality reduction algorithms are examples of unsupervised algorithms. They take higher-dimensional data as input and output the data in a lower dimension by taking advantage of the variance of the data. The task is to reduce the dataset to a few features without losing generality.
1.4.3 Semi-supervised Learning
There are circumstances where the dataset has a huge collection of unlabelled data and some labelled data. Labelling is a costly process and difficult for humans to perform. Semi-supervised algorithms use the unlabelled data by assigning a pseudo-label to it. Then, the labelled and pseudo-labelled datasets can be combined.
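The following is a minimal sketch of the pseudo-labelling idea described above, assuming scikit-learn is available; the split into labelled and unlabelled parts is purely illustrative.

```python
# Hedged sketch of pseudo-labelling: a model trained on the small labelled part
# assigns pseudo-labels to the unlabelled part, and the two sets are combined.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
labelled_X, labelled_y = X[::5], y[::5]        # pretend only every 5th sample is labelled
unlabelled_X = np.delete(X, np.s_[::5], axis=0)

base = LogisticRegression(max_iter=1000).fit(labelled_X, labelled_y)
pseudo_y = base.predict(unlabelled_X)          # pseudo-labels for the unlabelled data

combined_X = np.vstack([labelled_X, unlabelled_X])
combined_y = np.concatenate([labelled_y, pseudo_y])
final_model = LogisticRegression(max_iter=1000).fit(combined_X, combined_y)
print("trained on", combined_X.shape[0], "samples (labelled + pseudo-labelled)")
```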

1.4.4 Reinforcement Learning
Reinforcement learning mimics human beings. Just as human beings use their ears and eyes to perceive the world and take actions, reinforcement learning allows an agent to interact with the environment to get rewards. The agent can be a human, animal, robot, or any independent program. The rewards enable the agent to gain experience. The agent aims to maximize the reward.
The reward can be positive or negative (Punishment). When the rewards are more,
the behavior gets reinforced and learning becomes possible.

Consider the following example of a Grid game as shown in Figure 1.10

In this grid game, the gray tile indicates danger, black is a block, and the tile with diagonal lines is the goal. The aim is to start, say from the bottom-left tile, and use the actions left, right, up, and down to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the
environment to get experience. In the above case, the agent tries to create a model by
simulating many paths and finding rewarding paths. This experience helps in
constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled dataset. Many sequential decisions need to be taken to reach the final decision. Therefore, reinforcement algorithms are reward-based, goal-oriented algorithms.
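The following is a minimal sketch of the reward-based interaction described above, on a tiny 1-D corridor rather than the exact grid of Figure 1.10 (which is not reproduced here). A random agent simulates many episodes and keeps the most rewarding path it finds, a crude stand-in for learning from experience.

```python
# Hedged sketch: agent-environment loop on a made-up 5-tile corridor.
# The goal tile gives +10, the danger tile gives -5; tile layout is illustrative.
import random

GOAL, DANGER = 4, 2
random.seed(0)

def run_episode():
    position, path, reward = 0, [0], 0
    for _ in range(10):                        # at most 10 moves per episode
        position = max(0, min(4, position + random.choice([-1, 1])))
        path.append(position)
        if position == DANGER:
            reward -= 5                        # punishment
        if position == GOAL:
            return reward + 10, path           # reached the goal
    return reward, path

best_reward, best_path = max(run_episode() for _ in range(200))
print("best reward:", best_reward, "via path:", best_path)
```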

1.5 CHALLENGES OF MACHINE LEARNING


Problems that can be Dealt with Machine Learning
Computers are better than humans at performing tasks like computation. For example, while calculating the square root of a large number, an average human may struggle, but a computer can display the result in seconds. Computers can play games like chess and Go, and can even beat professional players of those games.
However, humans are better than computers in many aspects like recognition. But,
deep learning systems challenge human beings in this aspect as well. Machines can
recognize human faces in a second. Still, there are tasks where humans are better as
machine learning systems still require quality data for model construction. The
quality of a learning system depends on the quality of data. This is a challenge. Some
of the challenges are listed below:

1.Problems – Machine learning can deal with the ‘well-posed’ problems where
specifications are complete and available. Computers cannot solve ‘ill-posed’
problems.

Consider one simple example (shown in Table 1.3):


2.Huge data – This is a primary requirement of machine learning. Availability of quality data is a challenge. Quality data means the data should be large and should not have problems such as missing or incorrect values.
3.High computation power – With the availability of Big Data, the computational
resource requirement has also increased. Systems with Graphics Processing Unit
(GPU) or even Tensor Processing Unit (TPU) are required to execute machine
learning algorithms. Also, machine learning tasks have become complex and hence
time complexity has increased, and that can be solved only with high computing
power.
4.Complexity of the algorithms – The selection of algorithms, describing the algorithms, application of algorithms to solve a machine learning task, and comparison of algorithms have become necessary skills for machine learning engineers and data scientists now. Algorithms have become a big topic of discussion, and it is a challenge for machine learning professionals to design, select, and evaluate optimal algorithms.
5.Bias/Variance – Bias is the error due to overly simple assumptions in the model, while variance is the error due to the model's sensitivity to the particular training data. This leads to a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails on test data, and in general lacks generalization, is said to be overfitting. The reverse problem is called underfitting, where the model fails even on the training data and therefore cannot generalize. Overfitting and underfitting are great challenges for machine learning algorithms.
1.6 MACHINE LEARNING PROCESS
The emerging process model for the data mining solutions for business organizations
is CRISP-DM. Since machine learning is like data mining, except for the aim, this
process can be used for machine learning. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This process involves six steps. The steps are listed below in Figure 1.11.

1.Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm
is enough for giving the solution. This step also involves the formulation of the
problem statement for the data mining process.
2.Understanding the data – It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the
selected hypothesis.
3.Preparation of data – This step involves producing the final dataset by cleaning
the raw data and preparation of data for the data mining process. The missing values
may cause problems during both training and testing phases. Missing data forces classifiers to produce inaccurate results. This is a perennial problem for classification models. Hence, suitable strategies should be adopted to handle the missing data.
4.Modelling – This step plays a role in the application of data mining algorithm for
the data to obtain a model or pattern.
5.Evaluate – This step involves the evaluation of the data mining results using
statistical analysis and visualization methods. The performance of the classifier is
determined by evaluating the accuracy of the classifier. The process of classification
is a fuzzy issue. For example, classification of emails requires extensive domain
knowledge and requires domain experts. Hence, performance of the classifier is very
crucial.
6.Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.

1.7 MACHINE LEARNING APPLICATIONS


Machine Learning technologies are used widely now in different domains. Machine
learning applications are everywhere! One encounters many machine learning
applications in the day-to-day life.

Some applications are listed below:


1.Sentiment analysis – This is an application of natural language processing (NLP)
where the words of documents are converted to sentiments like happy, sad, and angry
which are captured by emoticons effectively. For movie reviews or product reviews,
five stars or one star are automatically attached using sentiment analysis programs.
2.Recommendation systems – These are systems that make personalized purchases
possible. For example, Amazon recommends related books or books bought by people who have the same taste as you, and Netflix suggests shows or related movies of your taste. Recommendation systems are based on machine learning.
3.Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and
Google Assistant are all examples of voice assistants. They take speech commands
and perform tasks. These chatbots are the result of machine learning technologies.
4.Technologies like Google Maps and those used by Uber are examples of machine learning that help locate and navigate the shortest paths to reduce travel time.
The machine learning applications are enormous. The following Table 1.4
summarizes some of the machine learning applications.
CHAPTER 2 – UNDERSTANDING DATA

2.1 WHAT IS DATA?


• All facts are data. In computer systems, bits encode facts present in numbers, text,
images, audio, and video.
• Data can be directly human interpretable (such as numbers or texts) or diffused
data such as images or video that can be interpreted only by a computer.
• Data is available in different data sources like flat files, databases, or data
warehouses. It can either be an operational data or a non-operational data.
• Operational data is the data encountered in normal business procedures and processes. For example, daily sales data is operational data. Non-operational data, on the other hand, is the kind of data that is used for decision making.
• Data by itself is meaningless. It has to be processed to generate any information. A string of bytes is meaningless. Only when a label is attached, such as 'height of students of a class', does the data become meaningful.
• Processed data is called information, which includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information like which product sold the most in the last quarter of the year.

Elements of Big Data


• Data whose volume is less and can be stored and processed by a small-scale computer is called 'small data'. These data are collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, is data whose volume is much larger than 'small data' and is characterized as follows:
1.Volume – Since there is a reduction in the cost of storing devices, there has been a
tremendous growth of data. Small traditional data is measured in terms of gigabytes
(GB) and terabytes (TB), but Big Data is measured in terms of petabytes (PB) and
exabytes (EB). One exabyte is 1 million terabytes.

2.Velocity – The fast arrival speed of data and its increase in data volume is noted as
velocity. The availability of IoT devices and Internet power ensures that the data is
arriving at a faster rate. Velocity helps to understand the relative growth of big data
and its accessibility by users, systems and applications.

3.Variety – The variety of Big Data includes:

• Form – There are many forms of data. Data types range from text, graph,
audio, video, to maps. There can be composite data too, where one media can
have many other sources of data, for example, a video can have an audio song.
• Function – These are data from various sources like human conversations,
transaction records, and old archive data.
• Source of data – This is the third aspect of variety. There are many sources of
data. Broadly, the data source can be classified as open/public data, social
media data and multimodal data.

4.Veracity of data – Veracity of data deals with aspects like conformity to the facts,
truthfulness, believability, and confidence in data. There may be many sources of
error such as technical errors, typographical errors, and human errors. So, veracity is
one of the most important aspects of data.

5.Validity – Validity is the accuracy of the data for taking decisions or for any other
goals that are needed by the given problem.

6.Value – Value is the characteristic of big data that indicates the value of the
information that is extracted from the data and its influence on the decisions that are
taken based on it. Thus, these 6 Vs are helpful to characterize the big data. The data
quality of the numeric attributes is determined by factors like precision, bias, and
accuracy.

• Precision is defined as the closeness of repeated measurements. Often, standard deviation is used to measure precision.

• Bias is a systematic result due to erroneous assumptions of the algorithms or procedures.

• Accuracy is the degree of measurement error; it refers to the closeness of measurements to the true value of the quantity. Normally, the significant digits used to store and manipulate a value indicate the accuracy of the measurement.

2.1.1 Types of Data


In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-structured data.

1.Structured Data
In structured data, data is stored in an organized manner such as a database where it
is available in the form of a table. The data can also be retrieved in an organized
manner using tools like SQL. The structured data frequently encountered in machine
learning are listed below:

• Record Data – A dataset is a collection of measurements taken from a process. We have a collection of objects in a dataset, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Rows in the matrix represent an object and can be called entities, cases, or records. The columns of the dataset are called attributes, features, or fields. Label is the term that is used to describe the individual observations.
• Data Matrix It is a variation of the record type because it consists of numeric
attributes. The standard matrix operations can be applied on these data. The data
is thought of as points or vectors in the multidimensional space where every
attribute is a dimension describing the object.
• Graph Data – It involves the relationships among objects. For example, a web page can refer to another web page. This can be modeled as a graph. The nodes are web pages and the hyperlinks are edges that connect the nodes.
• Ordered Data Ordered data objects involve attributes that have an implicit order
among them.

The examples of ordered data are:

a) Temporal data – It is data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
b) Sequence data – It is like sequential data but does not have time stamps. This
data involves the sequence of words or letters. For example, DNA data is a
sequence of four characters – A T G C.
c) Spatial data – It has attributes such as positions or areas. For example, maps are
spatial data where the points are related by location.

2. Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual
documents, programs, and blog data. It is estimated that 80% of the data are
unstructured data.
3.Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include
data like XML/JSON data, RSS feeds, and hierarchical data.

2.1.2 Data Storage and Representation


Once the dataset is assembled, it must be stored in a structure that is suitable for data
analysis. The goal of data storage management is to make data available for analysis.
There are different approaches to organize and manage data in storage files and
systems from flat file to data warehouses. Some of them are listed below:

FLAT FILES – These are the simplest and most commonly available data source. They are also the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes of data in flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes large.

Some of the popular spreadsheet formats are listed below:

1 CSV files – CSV stands for comma-separated value files where the values are
separated by commas. These are used by spreadsheet and database applications.
The first row may have attributes and the rest of the rows represent the data.
2 TSV files – TSV stands for tab-separated values files, where the values are separated by tabs. Both CSV and TSV files are generic in nature and can be shared. There are many tools like Google Sheets and Microsoft Excel to process these files.
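The following is a minimal sketch, assuming pandas is installed and a hypothetical file named marks.csv exists whose first row holds the attribute names, as described above.

```python
# Hedged sketch: read a hypothetical CSV/TSV flat file with pandas.
import pandas as pd

df = pd.read_csv("marks.csv")              # CSV: values separated by commas
# df = pd.read_csv("marks.tsv", sep="\t")  # TSV: values separated by tabs
print(df.head())                           # first few data rows
print(df.columns.tolist())                 # attribute (column) names from row 1
```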

DATABASE SYSTEM
1 World Wide Web (WWW) – It provides a diverse, worldwide online information source.
2 XML (eXtensible Markup Language) It is both human and machine
interpretable data format.
3 Data Stream It is dynamic data, which flows in and out of the observing
environment.
4 JSON (JavaScript Object Notation) It is another useful data interchange
format that is often used for many machine learning algorithms.

2.2 BIG DATA ANALYTICS


• The primary aim of data analysis is to assist business organizations in taking decisions. For example, a business organization may want to know which is the fastest selling product, in order to plan its marketing activities.
• Data analysis is an activity that takes the data and generates useful information
and insights for assisting the organizations.
• Data analysis and data analytics are terms that are used interchangeably to refer
to the same concept. However, there is a subtle difference. Data analytics is a
general term and data analysis is a part of it.
• Data analytics refers to the process of data collection, preprocessing and analysis. It deals with the complete cycle of data management. Data analysis is just analysis and is a part of data analytics. It takes historical data and does the analysis. Data analytics, instead, concentrates more on the future and helps in prediction.
• There are four types of data analytics:
1. Descriptive analytics

2. Diagnostic analytics

3. Predictive analytics
4. Prescriptive analytics

• Descriptive Analytics: It is about describing the main features of the data. After
data collection is done, descriptive analytics deals with the collected data and
quantifies it.

• Diagnostic Analytics: This is the inference part. It deals with the question 'Why?'. It is also known as causal analysis, as it aims to find out the cause and effect of events.

• Predictive Analytics: It deals with the future. It deals with the question 'What will happen in the future, given this data?'. This involves the application of algorithms to identify patterns in order to predict the future.

• Prescriptive Analytics: Prescriptive analytics goes beyond prediction and helps in decision making by giving a set of actions. It helps organizations to plan better for the future and to mitigate the risks that are involved.

2.3 BIG DATA ANALYSIS FRAMEWORK


For performing data analytics, many frameworks have been proposed. All proposed analytics frameworks have some common factors. A big data framework is a layered architecture. Such an architecture has many advantages, such as genericity. A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer

1 Data Connection Layer: It has data ingestion mechanisms and data connectors. Data ingestion means taking raw data and importing it into appropriate data structures. It performs the tasks of the ETL process; ETL means extract, transform, and load operations.
2 Data Management Layer: It performs preprocessing of data. The purpose of
this layer is to allow parallel execution of queries, and read, write and data
management tasks. There may be many schemes that can be implemented by
this layer such as data-in-place, where the data is not moved at all, or
constructing data repositories such as data warehouses and pull data on-
demand mechanisms.
3 Data Analytics Layer: It has many functionalities, such as statistical tests and machine learning algorithms for understanding data and constructing machine learning models. This layer also implements many model validation mechanisms.
4 Presentation Layer: It has mechanisms such as dashboards, and applications
that display the results of analytical engines and machine learning algorithms.

Thus, the Big Data processing cycle involves data management that consists of the
following steps.
1. Data collection
2. Data preprocessing
3. Application of machine learning algorithms
4. Interpretation of results and visualization of the machine learning algorithm's output
This is an iterative process and is carried out on a permanent basis to ensure that data is suitable for data mining.

2.3.1 Data Collection


The first task of gathering datasets is the collection of data. It is often estimated that most of the time is spent on the collection of good-quality data. Good-quality data yields better results. It is often difficult to characterize 'good data'. 'Good data' is data that has the following properties:
1 Timeliness – The data should be relevant and not stale or obsolete data.
2 Relevancy –The data should be relevant and ready for the machine learning or
data mining algorithms. All the necessary information should be available and
there should be no bias in the data.
3 Knowledge about the data – The data should be understandable and interpretable, and should be self-sufficient for the required application as desired by the domain knowledge engineer.
Broadly, the data source can be classified as open/public data, social media data and
multimodal data.
1 Open or public data source – It is a data source that does not have any
stringent copyright rules or restrictions. Its data can be primarily used for many
purposes.
Government census data are good examples of open data:
• Digital libraries that have huge amount of text data as well as document
images
• Scientific domains with a huge collection of experimental data like
genomic data and biological data.
• Healthcare systems that use extensive databases like patient databases,
health insurance data, doctors’ information, and bioinformatics information
2 Social media –It is the data that is generated by various social media platforms
like Twitter, Facebook, YouTube, and Instagram. An enormous amount of data
is generated by these platforms.
3 Multimodal data –It includes data that involves many modes such as text,
video, audio and mixed types. Some of them are listed below:
• Image archives contain larger image databases along with numeric and text
data
• The World Wide Web (WWW) has huge amount of data that is distributed on
the Internet. These data are heterogeneous in nature.

2.3.2 Data Preprocessing


In the real world, the available data is 'dirty'. By the word 'dirty', it means the data may be incomplete, inaccurate, noisy, or inconsistent, and may contain missing values, duplicates, and outliers.
•Data preprocessing improves the quality of the data mining techniques. The raw
data must be pre- processed to give accurate results. The process of detection and
removal of errors in data is called data cleaning.
•Data wrangling means making the data processable for machine learning
algorithms. Some of the data errors include human errors such as typographical
errors or incorrect measurement and structural errors like improper data formats.

•Data errors can also arise from omission and duplication of attributes. Noise is a random component and involves distortion of a value or introduction of spurious objects. The term noise is often used when the data has a spatial or temporal component. Certain deterministic distortions in the form of a streak are known as artifacts.
• It can be observed that data like Salary = ’ ’ is incomplete data. The DoB of
patients, John, Andre, and Raju, is the missing data. The age of David is
recorded as ‘5’ but his DoB indicates it is 10/10/1980. This is called
inconsistent data.
• Inconsistent data occurs due to problems in conversions, inconsistent formats,
and difference in units. Salary for John is -1500. It cannot be less than ‘0’. It is
an instance of noisy data. Outliers are data that exhibit the characteristics that
are different from other data and have very unusual values. The age of Raju
cannot be 136. It might be a typographical error. It is often required to
distinguish between noise and outlier data.
• Outliers may be legitimate data and sometimes are of interest to the data
mining algorithms. These errors often come during data collection stage. These
must be removed so that machine learning algorithms yield better results as the
quality of results is determined by the quality of input data. This removal
process is called data cleaning.

Missing Data Analysis


The primary data cleaning process is missing data analysis. Data cleaning routines
attempt to fill up the missing values, smoothen the noise while identifying the
outliers and correct the inconsistencies of the data. This enables data mining to
avoid overfitting of the models.

The procedures that are given below can solve the problem of missing data:
1 Ignore the tuple – A tuple with missing data, especially the class label, is
ignored. This method is not effective when the percentage of the missing
values increases.
2 Fill in the values manually – Here, the domain expert can analyse the data tables and fill in the values manually. But this is time-consuming and may not be feasible for larger datasets.
3 A global constant can be used to fill in the missing attributes. The missing values may be filled with a label such as 'Unknown' or 'Infinity'. But some data mining results may give spurious conclusions by analysing these labels.
4 The attribute value may be filled by the attribute mean. Say, the average income can replace a missing income value.
5 Use the attribute mean for all samples belonging to the same class. Here, the
average value replaces the missing values of all tuples that fall in this group.
6 Use the most possible value to fill in the missing value. The most probable
value can be obtained from other methods like classification and decision tree
prediction.
Some of these methods introduce bias in the data. The filled value may not be
correct and could be just an estimated value. Hence, the difference between the
estimated and the original value is called an error or bias.
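The following is a minimal sketch, assuming pandas is available, of two of the strategies listed above (filling with the attribute mean, and with the class-wise mean) applied to a small made-up table; it is not the patient table referenced in the text.

```python
# Hedged sketch: fill missing values with the attribute mean and with the
# mean of samples belonging to the same class, on a made-up table.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [1500, np.nan, 2400, np.nan, 2600],
})

# Strategy 4: fill missing values with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of samples belonging to the same class.
by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())    # [1500.0, 2166.67, 2400.0, 2166.67, 2600.0]
print(by_class.tolist())   # [1500.0, 1500.0, 2400.0, 2500.0, 2600.0]
```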

Removal of Noisy or Outlier Data


Noise is a random error or variance in a measured value. It can be removed by using binning, which is a method where the given data values are sorted and distributed into equal-frequency bins. The bins are also called buckets. The binning method then uses the neighbouring values to smooth the noisy data.
Some of the techniques commonly used are 'smoothing by bin means', where the mean of the bin replaces the values of the bin, 'smoothing by bin medians', where the bin median replaces the bin values, and 'smoothing by bin boundaries', where each bin value is replaced by the closest bin boundary. The maximum and minimum values of a bin are called bin boundaries. Binning methods may also be used as a discretization technique.
Example 2.1 illustrates this principle.
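Since Example 2.1 itself is not reproduced here, the following is a minimal sketch of equal-frequency binning with smoothing by bin means on a small made-up list.

```python
# Hedged sketch: equal-frequency binning followed by smoothing by bin means.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

bin_size = 4                                              # equal-frequency bins
bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]

smoothed = []
for b in bins:
    mean = sum(b) / len(b)                                # smoothing by bin means:
    smoothed.extend([round(mean, 1)] * len(b))            # every value becomes the bin mean

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)   # [9.0, 9.0, 9.0, 9.0, 22.8, 22.8, 22.8, 22.8, 29.2, 29.2, 29.2, 29.2]
```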

Data Integration and Data Transformations


Data integration involves routines that merge data from multiple sources into a single
data source. So, this may lead to redundant data.

The main goal of data integration is to detect and remove redundancies that arise
from integration. Data transformation routines perform operations like
normalization to improve the performance of the data mining algorithms. It is
necessary to transform data so that it can be processed.
This can be considered as a preliminary stage of data conditioning. Normalization is
one such technique. In normalization, the attribute values are scaled to fit in a range
(say 0-1) to improve the performance of the data mining algorithm. Often, in neural
networks, these techniques are used. Some of the normalization procedures used
are:
1. Min-Max
2. z-Score

Min-Max Procedure – It is a normalization technique where each value v of a variable V is shifted by the minimum value, divided by the range, and then mapped to a new range, say 0–1. Often, neural networks require this kind of normalization. The formula to implement this normalization is given as:

    v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min          (2.1)

Here, max - min is the range, min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.

Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max procedure
and map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The new min and new
max are 0 and 1, respectively. The mapping can be done using Eq. (2.1) as:
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range
{0, 0.33, 0.66, 1}. Thus, the Min-Max normalization range is between 0 and 1.
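The following is a minimal sketch of the Min-Max procedure of Eq. (2.1) applied to the marks of Example 2.2.

```python
# Hedged sketch of Min-Max normalization (Eq. 2.1) for Example 2.2.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

marks = [88, 90, 92, 94]
print([round(v, 2) for v in min_max(marks)])   # [0.0, 0.33, 0.67, 1.0]  (the text rounds the third value to 0.66)
```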

Z-Score Normalization – This procedure works by taking the difference between the field value and the mean value, and scaling this difference by the standard deviation of the attribute:

    v' = (v - m) / s          (2.2)

Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}; convert the marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10, respectively. So, the z-scores of these marks are calculated using Eq. (2.2). Hence, the z-scores of the marks 10, 20, 30 are -1, 0 and 1, respectively.
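The following is a minimal sketch of z-score normalization (Eq. 2.2) applied to Example 2.3, using the sample standard deviation (N - 1 in the denominator).

```python
# Hedged sketch of z-score normalization (Eq. 2.2) for Example 2.3.
import statistics

marks = [10, 20, 30]
m, s = statistics.mean(marks), statistics.stdev(marks)   # 20 and 10
print([(v - m) / s for v in marks])                      # [-1.0, 0.0, 1.0]
```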

Data Reduction
Data reduction reduces data size but produces the same results. There are different
ways in which data reduction can be carried out such as data aggregation, feature
selection, and dimensionality reduction

2.4 DESCRIPTIVE STATISTICS
• Descriptive statistics is a branch of statistics that does dataset summarization. It is used to summarize and describe data.
• Descriptive statistics are just descriptive and do not go beyond that.
• In other words, descriptive statistics do not bother too much about machine learning algorithms and their functioning.

Dataset and Data Types


A dataset can be assumed to be a collection of data objects. The data objects may be
records, points, vectors, patterns, events, cases, samples or observations. These
records contain many attributes. An attribute can be defined as the property or
characteristics of an object. For example, consider the following database shown in
sample Table 2.2

Every attribute should be associated with a value. This process is called measurement. The type of attribute determines the data types, often referred to as measurement scale types. The data types are shown in Figure 2.1.
Broadly, data can be classified into two types:

1. Categorical or qualitative data

2. Numerical or quantitative data

CATEGORICAL OR QUALITATIVE DATA – The categorical data can be divided into two types. They are nominal type and ordinal type.

1 Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are
symbols and cannot be processed like a number. For example, the average of a
patient ID does not make any statistical sense. Nominal data type provides only
information but has no ordering among data. Only operations like (=, ≠) are
meaningful for these data. For example, the patient ID can be checked for
equality and nothing else.
2 Ordinal Data –It provides enough information and has natural order. For
example, Fever = {Low, Medium, High} is an ordinal data. Certainly, low is
less than medium and medium is less than high, irrespective of the value. Any
transformation can be applied to these data to get a new value.
Numeric or Quantitative Data – It can be divided into two categories. They are interval type and ratio type.

➢ Interval Data – Interval data is numeric data for which the differences between values are meaningful. For example, there is a meaningful difference between 30 degrees and 40 degrees. The only permissible operations are + and -.
➢ Ratio Data – For ratio data, both differences and ratios are meaningful. The difference between ratio and interval data is the position of zero in the scale. For example, take the Centigrade-Fahrenheit conversion: the zeroes of the two scales do not match. Hence, temperatures on these scales are interval data.

Another way of classifying the data is to classify it as:

1.Discrete value data

2.Continuous data

➢ Discrete Data This kind of data is recorded as integers. For example, the
responses of the survey can be discrete data. Employee identification number such as
10001 is discrete data.

➢ Continuous Data It can be fitted into a range and includes decimal point. For
example, age is a continuous data. Though age appears to be discrete data, one may
be 12.5 years old and it makes sense. Patient height and weight are all continuous
data.
A third way of classifying the data is based on the number of variables used in the dataset. Based on that, the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure 2.2.

2.5 UNIVARIATE DATA ANALYSIS AND VISUALIZATION

• Univariate analysis is the simplest form of statistical analysis.
• As the name indicates, the dataset has only one variable.
• A variable can also be called a category.
• Univariate analysis does not deal with causes or relationships.
• The aim of univariate analysis is to describe the data and find patterns.
• Univariate data description involves finding the frequency distribution, central tendency measures, dispersion or variation, and shape of the data.
Data Visualization
Data visualization helps to understand data. It helps to present information and data
to customers. Some of the graphs that are used in univariate data analysis are bar
charts, histograms, frequency polygons and pie charts.
The advantages of the graphs are presentation of data, summarization of data,
description of data, exploration of data, and to make comparisons of data.

Let us consider some forms of graphs.

1 BAR CHART
• A Bar chart (or Bar graph) is used to display the frequency distribution for
variables.
• Bar charts are used to illustrate discrete data.
• The charts can also help to explain the counts of nominal data.
• It also helps in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4,
5} is shown below in Figure 2.3.

2 PIE CHART

These are equally helpful in illustrating the univariate data.


The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown below in Figure 2.4.

3 HISTOGRAM
It plays an important role in data mining for showing frequency distributions. The
histogram for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50,
51-75, 76-100 is given below in Figure 2.5. One can visually inspect from Figure 2.5
that the number of students in the range 76-100 is 2.
Histogram conveys useful information like nature of data and its mode. Mode
indicates the peak of dataset. In other words, histograms can be used as charts to
show frequency, skewness present in the data, and shape.

DOT PLOTS
These are similar to bar charts. They are less clustered as compared to bar charts, as
they illustrate the bars only with single points. The dot plot of English marks for five
students with ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure
2.6. The advantage is that by visual inspection one can find out who got more marks.

CENTRAL TENDENCY
One cannot remember all the data. Therefore, a condensation or summary of the data is necessary. This makes data analysis easy and simple. One such summary is called central tendency. Thus, central tendency can explain the characteristics of data and that further helps in comparison. Mass data have a tendency to concentrate at certain values, normally in the central location. This is called the measure of central tendency (or average). Popular measures are mean, median, and mode.

Mean – Arithmetic average (or mean)


It is a measure of central tendency that represents the 'center' of the dataset.
Mathematically, the average of all the values in the sample (population) is denoted
as x̄ (x-bar). Let x1, x2, …, xN be a set of 'N' values or observations; then the
arithmetic mean is given as:

x̄ = (x1 + x2 + … + xN) / N = (1/N) × Σ xi

Weighted mean
Unlike the arithmetic mean, which gives equal weightage to all items, the weighted
mean gives different importance to each item, since item importance varies. Hence, a
different weight wi can be assigned to each item xi, and the weighted mean is
computed by adding the products of the weights (proportions) and the item values (or
group means) and dividing by the sum of the weights:

Weighted mean = (w1 x1 + w2 x2 + … + wN xN) / (w1 + w2 + … + wN)

In the case of a frequency distribution, the mid-values of the ranges are taken for
the computation. The weighted mean is mostly used when the sample sizes are unequal.

Geometric mean
Let x1, x2, …, xN be a set of 'N' values or observations. The geometric mean is the
Nth root of the product of the N items. The formula for computing the geometric mean
is given as follows:

Geometric mean = (x1 × x2 × … × xN)^(1/N)

A problem with the mean is its extreme sensitivity to noise: even small changes in
the input can affect the mean drastically. Hence, the extreme values (for example,
the top 2%) are often chopped off before the mean is calculated; this trimmed mean is
commonly used for larger datasets.
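The different means discussed above can be computed as in the following sketch
(assuming NumPy and SciPy are available; the weights are hypothetical values chosen
only to illustrate the weighted mean):

import numpy as np
from scipy import stats

marks = [45, 60, 60, 80, 85]

arithmetic = np.mean(marks)                    # (45 + 60 + 60 + 80 + 85) / 5 = 66.0

weights = [1, 2, 2, 3, 2]                      # hypothetical importance of each item
weighted = np.average(marks, weights=weights)  # sum(w * x) / sum(w)

geometric = stats.gmean(marks)                 # Nth root of the product of N items

trimmed = stats.trim_mean(marks, 0.2)          # chops 20% of values from each end

print(arithmetic, weighted, geometric, trimmed)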

Median
The middle value in the distribution is called the median. If the total number of
items in the distribution is odd, then the median is the middle value; if it is even,
the median is the average of the two middle values. For grouped data, the median
class is the class in which the (N/2)th item is present. In the continuous case, the
median is given by the formula:

Median = L1 + ((N/2 − cf) / f) × i

Here, L1 is the lower limit of the median class, i is the class interval of the
median class, f is the frequency of the median class, and cf is the cumulative
frequency of all classes preceding the median class.

Mode
Mode is the value that occurs most frequently in the dataset. In other words, the
value that has the highest frequency is called the mode.
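Median and mode can be computed directly with the Python standard library, as in this
small sketch:

import statistics

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]

print(statistics.median(marks))   # 70.0 -> average of the two middle values (70 and 70)
print(statistics.mode(marks))     # 70   -> the value with the highest frequency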

DISPERSION
The spread of a set of data around the central tendency (mean, median or mode) is
called dispersion. Dispersion is represented in various ways, such as range,
variance, standard deviation, and standard error. These are second-order measures.
The most common measures of dispersion are listed below:
Range The range is the difference between the maximum and minimum values in the given
list of data.
Standard Deviation The mean does not convey much more than a middle point. For
example, the datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The
difference between these two sets is the spread of the data. Standard deviation is
the average distance from the mean of the dataset to each point. The formula for the
standard deviation is given by (Eq. 2.8):

s = sqrt( (1/N) × Σ (xi − m)^2 )

Here, N is the size of the population, xi is an observation or value from the
population, and m is the population mean. Often, N − 1 is used instead of N in the
denominator of Eq. (2.8) to obtain the sample standard deviation.
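A short NumPy sketch of these dispersion measures for the two example datasets (the
ddof argument switches between dividing by N and by N − 1):

import numpy as np

data1 = [10, 20, 30]
data2 = [10, 50, 0]

print(np.mean(data1), np.mean(data2))   # both means are 20.0

print(max(data1) - min(data1))          # range of data1
print(np.std(data1, ddof=0))            # population standard deviation (divide by N)
print(np.std(data1, ddof=1))            # sample standard deviation (divide by N - 1)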

QUARTILES AND INTER QUARTILE RANGE


It is sometimes convenient to subdivide the dataset using percentile coordinates. The
kth percentile is the value Xi such that k% of the data lies at or below Xi. For
example, the median is the 50th percentile and can be denoted as Q0.50. The 25th
percentile is called the first quartile (Q1) and the 75th percentile is called the
third quartile (Q3). Another measure that is useful for measuring dispersion is the
Inter Quartile Range (IQR), which is the difference between Q3 and Q1:

IQR = Q3 − Q1

Outliers are normally the values that fall at least 1.5 × IQR above the third
quartile or at least 1.5 × IQR below the first quartile.

Five-point Summary and Box Plots The median, the quartiles Q1 and Q3, and the
minimum and maximum, written in the order < Minimum, Q1, Median, Q3, Maximum >, is
known as the five-point summary. A box plot is a graphical display of this five-point
summary, as sketched below.
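A brief sketch (assuming NumPy and matplotlib) that computes the quartiles, IQR,
outlier fences and five-point summary, and then draws the corresponding box plot:

import numpy as np
import matplotlib.pyplot as plt

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]

q1, median, q3 = np.percentile(marks, [25, 50, 75])
iqr = q3 - q1

# 1.5 x IQR rule for flagging outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in marks if x < lower_fence or x > upper_fence]

five_point = (min(marks), q1, median, q3, max(marks))
print(five_point, iqr, outliers)

plt.boxplot(marks)     # the box plot is a visual form of the five-point summary
plt.show()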
Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak
location of the dataset.

Skewness
The measures of the direction and degree of symmetry are called measures of third
order. Ideally, the skewness should be zero, as in an ideal normal distribution. More
often, a given dataset does not have perfect symmetry.

• Skewness in Data: The dataset may have very high or very low values, leading to
skewness.
• Right Skew (Positive Skew): More high values; tail is longer on the right, hump
on the left. Mean > Median.
• Left Skew (Negative Skew): More low values; tail is longer on the left, hump on
the right. Mean < Median.
• Symmetric Distribution: Equal distribution of data, skewness = 0.
• Impact of Skewness: Leads to more outliers, affecting mean, median, and
performance of data mining algorithms.
Generally, for a negatively skewed distribution, the median is more than the mean.
The relationship between the skew and the relative sizes of the mean and median can
be summarized by a convenient numerical skew index known as the Pearson 2 skewness
coefficient:

Pearson 2 skewness coefficient = 3 × (mean − median) / s

Also, the following moment-based measure is more commonly used to measure skewness.
Let x1, x2, …, xN be a set of 'N' values or observations; then the skewness is given
as:

Skewness = Σ (xi − m)^3 / (N × s^3)

Here, m is the population mean and s is the population standard deviation of the
univariate data. Sometimes, for bias correction, N − 1 is used instead of N.

Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, it
indicates higher kurtosis, and vice versa. Kurtosis is measured using the formula
given below (Eq. 2.14):

Kurtosis = Σ (xi − x̄)^4 / (N × s^4)

For bias correction, N − 1 is sometimes used instead of N in Eq. (2.14). Here, x̄ and
s are the mean and standard deviation of the univariate data, respectively. Some
other useful measures for finding the shape of the univariate dataset are the mean
absolute deviation (MAD) and the coefficient of variation (CV).
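As a sketch of both shape measures (assuming NumPy and SciPy; bias=True matches the
population formulas above, while bias=False would apply the bias correction mentioned
in the text):

import numpy as np
from scipy import stats

x = np.array([22, 22, 40, 40, 70, 70, 70, 85, 90, 90])

# Moment-based skewness: average of the cubed standardized deviations
skew_manual = np.mean(((x - x.mean()) / x.std()) ** 3)

# Pearson 2 skewness coefficient: 3 * (mean - median) / standard deviation
pearson2 = 3 * (x.mean() - np.median(x)) / x.std()

print(stats.skew(x, bias=True), skew_manual, pearson2)
print(stats.kurtosis(x, fisher=False, bias=True))   # plain kurtosis; fisher=True would report kurtosis - 3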

Mean Absolute Deviation (MAD)


MAD is another dispersion measure and is robust to outliers. Normally, an outlier
point is detected by computing its deviation from the median and dividing that by the
MAD. Here, however, the absolute deviation between the data and the mean is taken.
Thus, the mean absolute deviation is given as:

MAD = (1/N) × Σ |xi − x̄|

Coefficient of Variation (CV)


The coefficient of variation is used to compare datasets with different units. CV is
the ratio of the standard deviation to the mean, and %CV is the coefficient of
variation expressed as a percentage:

CV = s / x̄,  %CV = (s / x̄) × 100
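Both measures can be computed in a few lines of NumPy, as in this sketch:

import numpy as np

x = np.array([13, 11, 2, 3, 4, 8, 9])

# Mean absolute deviation: average absolute distance of each value from the mean
mad = np.mean(np.abs(x - x.mean()))

# Coefficient of variation: standard deviation relative to the mean
cv = x.std() / x.mean()
percent_cv = 100 * cv

print(mad, cv, percent_cv)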

Special Univariate Plots


An ideal way to check the shape of the dataset is a stem and leaf plot. A stem and
leaf plot is a display that helps us to know the shape and distribution of the data.
In this method, each value is split into a 'stem' and a 'leaf'. The last digit is
usually the leaf, and the digits to the left of the leaf form the stem. For example,
the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9. The stem and leaf plot
for the English subject marks {45, 60, 60, 80, 85} is given in the figure.

It can be seen from Figure 2.9 that the first column is the stem and the second
column is the leaf. For the given English marks, the two students with 60 marks are
shown in the stem and leaf plot as stem 6 with two leaves of 0.
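A stem and leaf plot for these marks can also be produced with a few lines of plain
Python (an illustrative sketch, splitting each mark into its tens digit as the stem
and its last digit as the leaf):

from collections import defaultdict

marks = [45, 60, 60, 80, 85]

stems = defaultdict(list)
for m in sorted(marks):
    stems[m // 10].append(m % 10)     # tens digit = stem, last digit = leaf

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(stem, "|", leaves)

# Output:
# 4 | 5
# 6 | 0 0
# 8 | 0 5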
As discussed earlier, the ideal shape of a dataset is a bell-shaped curve, which
corresponds to normality. Most statistical tests are designed only for normally
distributed data. A Q-Q plot can be used to assess the shape of a dataset. The Q-Q
plot is a 2D scatter plot of the quantiles of a univariate dataset against the
quantiles of a theoretical normal distribution, or of two datasets - the quantiles of
the first dataset plotted against the quantiles of the second. The normal Q-Q plot
for the marks x = [13 11 2 3 4 8 9] is given below in the figure.
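Such a normal Q-Q plot can be generated as in the following sketch (assuming SciPy
and matplotlib; scipy.stats.probplot plots the sample quantiles of x against the
quantiles of a standard normal distribution):

import matplotlib.pyplot as plt
from scipy import stats

x = [13, 11, 2, 3, 4, 8, 9]

stats.probplot(x, dist="norm", plot=plt)   # sample quantiles vs. normal quantiles
plt.title("Normal Q-Q plot of marks")
plt.show()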
