
Machine Learning BCS602

Module 1
Introduction to Machine Learning
What is Machine Learning?
In the real world, we are surrounded by humans who can learn from their experiences, and by computers or machines which simply work on our instructions. But can a machine also learn from experiences or past data the way a human does? This is where machine learning comes in.
➢ Introduction to Machine Learning
A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences. Arthur
Samuel first used the term "machine learning" in 1959. It can be summarized as follows:
• Without being explicitly programmed, machine learning enables a machine to automatically learn
from data, improve performance from experiences, and predict things.
• Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical data,
or training data.
• For the purpose of developing predictive models, machine learning brings together statistics and
computer science.
• Algorithms that learn from historical data are either constructed or utilized in machine learning. The
performance will rise in proportion to the quantity of information we provide.
• A machine can learn if it can gain more data to improve its performance.

How does Machine Learning work?


• A machine learning system learns from previous data, builds prediction models, and predicts the output for new data whenever it receives it. The more data that is available, the better the model that can be built, and the more accurate the predicted output.
• Let's say we have a complex problem in which we need to make predictions. Instead of writing code,
we just need to feed the data to generic algorithms, which build the logic based on the data and
predict the output.
• Our perspective on the issue has changed as a result of machine learning.

The Machine Learning algorithm's operation is depicted in the following block diagram:

➢ Features of Machine Learning:


• Machine learning uses data to detect various patterns in a given dataset.


• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is quite similar to data mining, as it also deals with huge amounts of data.
Following are some key points which show the importance of Machine Learning:
• Rapid increase in the production of data
• Solving complex problems which are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data.
➢ Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
• Supervised learning
• Unsupervised learning
• Reinforcement learning

➢ NEED FOR MACHINE LEARNING


Business organizations use huge amounts of data for their daily activities. They have now started to
use the latest technology, machine learning, to manage the data. Machine learning has become so
popular because of three reasons:
• High volume of available data to manage: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that the volume of data approximately doubles every year.
• Second reason is that the cost of storage has reduced. The hardware cost has also dropped.
Therefore, it is easier now to capture, process, store, distribute, and transmit the digital
information.


• The third reason for the popularity of machine learning is the availability of complex algorithms. Especially with the advent of deep learning, many powerful algorithms are now available for machine learning.

➢ MACHINE LEARNING EXPLAINED


• Machine learning is an important sub-branch of Artificial Intelligence (AI).
• A frequently quoted definition of machine learning was by Arthur Samuel, one of the pioneers of
Artificial Intelligence. He stated that “Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
• The key to this definition is that the system should learn by itself without explicit programming.
It is widely known that to perform a computation, one needs to write programs that teach the
computers how to do that computation.
• In conventional programming, after understanding the problem, a detailed design of the program
such as a flowchart or an algorithm needs to be created and converted into programs using a
suitable programming language.
• This approach could be difficult for many real-world problems such as puzzles, games, and
complex image recognition applications.
• Initially, artificial intelligence aimed to understand these problems and develop general-purpose rules manually.
• Then, these rules are formulated into logic and implemented in a program to create intelligent
systems.
• This idea of developing intelligent systems by using logic and reasoning by converting an expert’s
knowledge into a set of rules and programs is called an expert system.
• The above approach was impractical in many domains as programs still depended on human
expertise and hence did not truly exhibit intelligence.


• Then, the momentum shifted to machine learning in the form of data driven systems. The focus
of AI is to develop intelligent systems by using data-driven approach, where data is used as an
input to develop intelligent models.
• The models can then be used to make predictions on new inputs. Thus, the aim of machine learning is to learn
a model or set of rules from the given dataset automatically so that it can predict the unknown
data correctly.
• Just as humans take decisions based on experience, computers build models based on patterns extracted from the input data and then use these models for prediction and decision making. For computers, the learnt model is the equivalent of human experience.

• The quality of data determines the quality of experience and, therefore, the quality of the learning
system. In statistical learning, the relationship between the input x and output y is modeled as a
function in the form y = f(x). Here, f is the function, learnt from data, that maps the input x to the output y.
Learning of function f is the crucial aspect of forming a model in statistical learning. In machine
learning, this is simply called mapping of input to output.
• The learning program summarizes the raw data in a model. Formally stated, a model is an
explicit description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs.
3. Logical if/else rules, or
4. Groupings called clusters
• In summary, a model can be a formula, procedure or representation that can generate data
decisions. The difference between pattern and model is that the former is local and applicable only
to certain attributes but the latter is global and fits the entire dataset. For example, a model can be
helpful to examine whether a given email is spam or not. The point is that the model is generated
automatically from the given data.
• Another pioneer of AI, Tom Mitchell’s definition of machine learning states that, “A computer
program is said to learn from experience E, with respect to task T and some performance measure


P, if its performance on T, as measured by P, improves with experience E.” The important components
of this definition are experience E, task T, and performance measure P.
• For example, the task T could be detecting an object in an image. The machine can gain knowledge of the object using a training dataset of thousands of images. This is called experience E.
• So, the focus is to use this experience E for this task of object detection T. The ability of the system
to detect the object is measured by performance measures like precision and recall. Based on the
performance measures, course correction can be done to improve the performance of the system.
• Models of computer systems are equivalent to human experience. Experience is based on data.
Humans gain experience by various means. They gain knowledge by rote learning. They observe
others and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many
things by trial and error.
• Once the knowledge is gained, when a new problem is encountered, humans search for similar
past situations and then formulate the heuristics and use that for prediction. But, in systems,
experience is gathered by these steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to
generate concepts. This is equivalent to humans’ idea of objects; for example, we have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can be
viewed as ordering of all possible concepts. So, generalization involves ranking of concepts,
inferencing from them and formation of heuristics, an actionable aspect of intelligence.
4. Heuristics are educated guesses for all tasks. For example, if one runs on encountering a danger, it is the result of human experience or heuristic formation. In machines, it happens the same way.
5. Heuristics normally work! But, occasionally, they may fail too. That is not the fault of heuristics, as they are just a ‘rule of thumb’. Course correction is done by taking evaluation measures. Evaluation checks the thoroughness of the models and does course correction, if necessary, to generate better formulations.
➢ MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics primarily.
It is the result of combined ideas from diverse fields.
Machine Learning and Artificial Intelligence


Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is to
develop intelligent agents. An agent can be a robot, a human, or any autonomous system. Initially, the
idea of AI was ambitious, that is, to develop intelligent systems like human beings. The focus was on
logic and logical inferences. It had seen many ups and downs. These down periods were called AI
winters.
The resurgence in AI happened due to development of data driven systems. The aim is to find
relations and regularities present in the data. Machine learning is the subbranch of AI, whose aim is
to extract the patterns for prediction. It is a broad field that includes learning from examples and other
areas like reinforcement learning.
The relationship of AI and machine learning is shown in Figure 1.3. The model can take an unknown
instance and generate results.

Figure 1.3: Relationship of AI with Machine Learning

• Deep learning is a subbranch of machine learning.


• In deep learning, the models are constructed using neural network technology.
• Neural networks are based on the human neuron models.
• Many neurons form a network connected with the activation functions that trigger further neurons
to perform tasks.
➢ Machine Learning, Data Science, Data Mining, and Data Analytics
• Data science is an 'Umbrella' term that encompasses many fields. Machine learning starts with
data. Therefore, data science and machine learning are interlinked. Machine learning is a branch
of data science. Data science deals with gathering of data for analysis. It is a broad field that
includes:
• Big Data – Data science is concerned with the collection of data. Big data is a field of data science that deals with data having the following characteristics:
1. Volume: Huge amount of data is generated by big companies like Facebook, Twitter, YouTube.
2. Variety: Data is available in variety of forms like images, videos, and in different formats.


3. Velocity: It refers to the speed at which the data is generated and processed.
• Big data is used by many machine learning algorithms for applications such as language
translation and image recognition.
• Big data influences the growth of subjects like Deep learning. Deep learning is a branch of
machine learning that deals with constructing models using neural networks.
• Data Mining: Data mining's original genesis is in business. Just as mining the earth yields precious resources, it is often believed that unearthing the data produces hidden information that would otherwise have eluded the attention of the management. There is no real difference between these fields, except that data mining aims to extract the hidden patterns that are present in the data, whereas machine learning aims to use them for prediction.
• Data Analytics: Another branch of data science is data analytics. It aims to extract useful
knowledge from crude data. There are different types of analytics. Predictive data analytics is
used for making predictions. Machine learning is closely related to this branch of analytics and
shares almost all algorithms.
• Pattern Recognition: It is an engineering field. It uses machine learning algorithms to extract
the features for pattern analysis and pattern classification. One can view pattern recognition as a
specific application of machine learning.

Figure 1.4: Relationship of Machine Learning with Other Major Fields


➢ Machine Learning and Statistics


Statistics is a branch of mathematics that has a solid theoretical foundation regarding statistical learning. Like machine learning (ML), it can learn from data. The difference between statistics and ML lies in the approach: statistics initially sets a hypothesis and performs experiments to verify and validate it in order to find relationships among data, whereas ML looks for regularities (patterns) in the data directly.
➢ TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of interaction of the
program with its environment. It can be compared with the interaction between a teacher and a student.
Four types of machine learning are:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning

Labelled and Unlabelled Data


• Data is a raw fact. Normally, data is represented in the form of a table. Data also can be referred
to as a data point, sample, or an example.
• Each row of the table represents a data point. Features are attributes or characteristics of an
object. Normally, the columns of the table are attributes. Out of all attributes, one attribute is
important and is called a label. Label is the feature that we aim to predict. Thus, there are two
types of data – labelled and unlabelled.
• Labelled Data To illustrate labelled data, let us take one example dataset called the Iris flower dataset or Fisher’s Iris dataset. The dataset has 150 samples of Iris flowers (50 from each class), with four attributes – the length and width of sepals and petals. The target variable is called class. There are three classes – Iris setosa, Iris virginica, and Iris versicolor.

Table 1.1: Iris Flower Dataset
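Since Table 1.1 refers to Fisher's Iris dataset, the following minimal sketch (assuming scikit-learn is installed; it ships a copy of the dataset) shows how the labelled Iris data looks in code: four feature columns per row plus a class label.

```python
from sklearn.datasets import load_iris

# Load Fisher's Iris dataset: 150 samples, 4 features, 3 classes
iris = load_iris()
X, y = iris.data, iris.target          # feature matrix and class labels

print(iris.feature_names)              # sepal/petal length and width
print(iris.target_names)               # setosa, versicolor, virginica
print(X[0], iris.target_names[y[0]])   # first sample together with its label
```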


• A dataset need not always be numbers. It can be images or video frames. Deep neural networks can handle images with labels. In the following Figure 1.6, the deep neural network takes
images of dogs and cats with labels for classification.

Figure 1.6: Labelled Data

Unlabelled Data
• In unlabelled data, there are no labels in the dataset.
➢ Supervised Learning
• Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning. A supervisor provides labelled data with which the model is constructed; the model is then evaluated on test data.
• In supervised learning algorithms, learning takes place in two stages. In layman terms, during the
first stage, the teacher communicates the information to the student that the student is supposed to
master. The student receives the information and understands it. During this stage, the teacher has
no knowledge of whether the information is grasped by the student.
• This leads to the second stage of learning. The teacher then asks the student a set of questions to
find out how much information has been grasped by the student. Based on these questions, the student
is tested, and the teacher informs the student about his assessment. This kind of learning is typically
called supervised learning.


• Supervised learning has two methods:


1. Classification
2. Regression
1. Classification
• Classification is a supervised learning method. The input attributes of the classification
algorithms are called independent variables.
• The target attribute is called label or dependent variable. The relationship between the input
and target variable is represented in the form of a structure which is called a classification
model. So, the focus of classification is to predict the ‘label’, which is in a discrete form (a value from a finite set of values).
• An example is shown in Figure 1.7 where a classification algorithm takes a set of labelled
data images such as dogs and cats to construct a model that can later be used to classify an
unknown test image data.


• In classification, learning takes place in two stages. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training-set samples are processed, the model is generated. In the second stage, the constructed model is tested with a test or unknown sample, which is assigned a label. This is the classification process.
• This is illustrated in the above Figure 1.7. Initially, the classification learning algorithm learns
with the collection of labelled data and constructs the model. Then, a test case is selected,
and the model assigns a label.
• Similarly, in the case of the Iris dataset, if the test instance is given as (6.3, 2.9, 5.6, 1.8, ?), the classification model will generate the label for it. This is called classification. One example of classification is image recognition, which includes classification of diseases like cancer, classification of plants, etc.
• The classification models can be categorized based on the implementation technology like
decision trees, probabilistic methods, distance measures, and soft computing methods.


Classification models can also be classified as generative models and discriminative models.
Generative models deal with the process of data generation and its distribution. Probabilistic
models are examples of generative models. Discriminative models do not care about the
generation of data. Instead, they simply concentrate on classifying the given data.
• Some of the key algorithms of classification are:
1. Decision Tree
2. Random Forest
3. Support Vector Machines
4. Naïve Bayes
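As a hedged illustration of the two-stage classification process described above (training on labelled data, then assigning a label to an unknown sample), the sketch below trains a decision tree, one of the key algorithms listed, on the Iris dataset and predicts the label for the test instance (6.3, 2.9, 5.6, 1.8) mentioned earlier. It assumes scikit-learn is available; the exact predicted class name is only indicative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Stage 1 (training): the algorithm learns a model from the labelled dataset
model = DecisionTreeClassifier(random_state=0)
model.fit(iris.data, iris.target)

# Stage 2 (testing): the constructed model assigns a label to an unknown sample
test_sample = [[6.3, 2.9, 5.6, 1.8]]
predicted = model.predict(test_sample)
print(iris.target_names[predicted[0]])   # e.g. 'virginica'
```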
2. Regression Models
• Regression models, unlike classification algorithms, predict continuous variables like price.
In other words, the prediction is a number. A fitted regression model is shown in Figure 1.8 for a dataset that represents weeks (input x) and product sales (y).
• The regression model takes input x and generates a model in the form of a fitted line of the
form y=f(x). Here, x is the independent variable that may be one or more attributes and y is
the dependent variable. In Figure 1.8, linear regression takes the training set and tries to fit it
with a line – product sales = 0.66 × Week + 0.54. Here, 0.66 and 0.54 are the regression coefficients that are learnt from the data.
• The advantage of this model is that prediction for product sales (y) can be made for unknown week
data (x). For example, the prediction for unknown eighth week can be made by substituting x as 8 in
that regression formula to get y.
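The following sketch fits a line of the form y = f(x) to a small, hypothetical weekly-sales dataset and then predicts sales for the unknown eighth week, in the spirit of Figure 1.8; the numbers are illustrative and are not taken from the figure.

```python
import numpy as np

# Hypothetical training data: week number (x) and product sales (y)
weeks = np.array([1, 2, 3, 4, 5, 6, 7])
sales = np.array([1.2, 1.9, 2.5, 3.2, 3.8, 4.5, 5.2])

# Fit a straight line y = slope * x + intercept (ordinary least squares)
slope, intercept = np.polyfit(weeks, sales, 1)
print(f"fitted model: sales = {slope:.2f} * week + {intercept:.2f}")

# Prediction for the unknown eighth week (substitute x = 8)
print("predicted sales for week 8:", slope * 8 + intercept)
```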
• Both regression and classification models are supervised algorithms. Both have a supervisor
and the concepts of training and testing are applicable to both. What, then, is the difference between classification and regression models? The main difference is that regression models predict
continuous variables such as product price, while classification concentrates on assigning
labels such as class.


➢ Unsupervised Learning
• The second kind of learning is by self-instruction. As the name suggests, there is no supervisor or teacher component. In the absence of a supervisor or teacher, self-instruction is the most
common kind of learning process. This process of self-instruction is based on the concept of trial
and error.
• Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes
the examples and recognizes patterns based on the principle of grouping. Grouping is done in such a way that similar objects form the same group.
• Cluster analysis and Dimensional reduction algorithms are examples of unsupervised
algorithms.
Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint
clusters or groups. Cluster analysis clusters objects based on their attributes. All the data objects of
the partitions are similar in some aspect and vary from the data objects in the other partitions
significantly. Some of the examples of clustering processes are — segmentation of a region of
interest in an image, detection of abnormal growth in a medical image, and determining clusters
of signatures in a gene database.
An example of clustering scheme is shown in Figure 1.9 where the clustering algorithm takes a
set of dog and cat images and groups them into two clusters – dogs and cats. It can be observed that the samples belonging to a cluster are similar, while samples differ radically across clusters.


Some of the key clustering algorithms are:


1. k-means algorithm
2. Hierarchical algorithms
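A minimal k-means sketch on synthetic two-dimensional points (assuming scikit-learn is installed; the data is made up) shows how a clustering algorithm groups unlabelled data into disjoint clusters without any supervisor:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: two loose groups of 2-D points (synthetic, for illustration)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# k-means partitions the points into 2 clusters based on their attributes
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # one centre per cluster
```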
Dimensionality Reduction
Dimensionality reduction algorithms are also examples of unsupervised algorithms. They take higher-dimensional data as input and output the data in a lower dimension by taking advantage of the variance of the data. The task is to reduce the dataset to a few features without losing generality.
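As a hedged example, PCA (one common dimensionality reduction technique; scikit-learn is assumed) projects the four-dimensional Iris data down to two dimensions while retaining most of its variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # 150 samples x 4 features

# Reduce from 4 dimensions to 2, keeping the directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```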

➢ Semi-supervised Learning
There are circumstances where the dataset has a huge collection of unlabelled data and some labelled data. Labelling is a costly process and is difficult for humans to perform. Semi-supervised algorithms use the unlabelled data by assigning a pseudo-label to it. Then, the labelled and pseudo-labelled datasets can be combined.
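A minimal pseudo-labelling sketch (an illustration of the idea only, not a full semi-supervised algorithm; scikit-learn is assumed, and the choice of classifier is arbitrary) trains on a small labelled subset, assigns pseudo-labels to the unlabelled part, and retrains on the combined data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Pretend only every 10th sample is labelled; the rest are treated as unlabelled
labelled = np.arange(len(y)) % 10 == 0
clf = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# Assign pseudo-labels to the unlabelled data
pseudo = clf.predict(X[~labelled])

# Combine real labels and pseudo-labels and retrain on the full dataset
X_all = np.vstack([X[labelled], X[~labelled]])
y_all = np.concatenate([y[labelled], pseudo])
clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```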
➢ Reinforcement Learning
Reinforcement learning mimics human beings. Like human beings use ears and eyes to perceive the
world and take actions, reinforcement learning allows the agent to interact with the environment to
get rewards. The agent can be human, animal, robot, or any independent program. The rewards
enable the agent to gain experience. The agent aims to maximize the reward.


The reward can be positive or negative (Punishment). When the rewards are more, the behavior gets
reinforced and learning becomes possible.
Consider the following example of a Grid game as shown in Figure 1.10

In this grid game, the gray tile indicates danger, black is a block, and the tile with diagonal lines is the goal. The aim is to start, say from the bottom-left tile, and use the actions left, right, up, and down to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment to get
experience. In the above case, the agent tries to create a model by simulating many paths and finding
rewarding paths. This experience helps in constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled dataset.
Many sequential decisions need to be taken to reach the final decision. Therefore, reinforcement
algorithms are reward-based, goal-oriented algorithms.
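The sketch below is a tiny tabular Q-learning loop for a hypothetical 1 x 5 grid (start at cell 0, goal at cell 4), written only to illustrate the reward-driven, trial-and-error learning described above; the grid, rewards and parameters are all made up for the example and are not taken from Figure 1.10.

```python
import random

N_STATES, GOAL = 5, 4           # cells 0..4, goal at cell 4
ACTIONS = [-1, +1]              # move left or right
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: one row per state, one column per action
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore occasionally, otherwise exploit the best known action
        a = random.randrange(2) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else -0.01   # positive reward only at the goal
        # Q-learning update: learn from the received reward and the best future value
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

# After learning, the greedy policy moves right towards the goal
print(["left" if q[0] > q[1] else "right" for q in Q[:GOAL]])
```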

➢ CHALLENGES OF MACHINE LEARNING


Problems that can be Dealt with Machine Learning
Computers are better than humans in performing tasks like computation. For example, while
calculating the square root of a large number, an average human may struggle, but a computer can display the result in seconds. Computers can play games like chess and Go, and can even beat professional players of
that game.
However, humans are better than computers in many aspects like recognition. But, deep learning
systems challenge human beings in this aspect as well. Machines can recognize human faces in a
second. Still, there are tasks where humans are better as machine learning systems still require quality
data for model construction. The quality of a learning system depends on the quality of data. This is a
challenge. Some of the challenges are listed below:


1. Problems – Machine learning can deal with the ‘well-posed’ problems where specifications
are complete and available. Computers cannot solve ‘ill-posed’ problems.
Consider one simple example (shown in Table 1.3):

2. Huge data – This is a primary requirement of machine learning. Availability of quality data is a challenge. Quality data means data that is large and does not have problems such as missing or incorrect values.
3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even Tensor
Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine
learning tasks have become complex and hence time complexity has increased, and that can be
solved only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms,
application of algorithms to solve a machine learning task, and comparison of algorithms have become necessary for machine learning professionals or data scientists now. Algorithms have become a big
topic of discussion and it is a challenge for machine learning professionals to design, select, and
evaluate optimal algorithms.
5. Bias/Variance – Bias is the error due to overly simple assumptions in the model, while variance is the error due to the model's sensitivity to the particular training data used. Together they lead to a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails on test data lacks generalization; this is called overfitting. The reverse problem is called underfitting, where the model is too simple to fit even the training data well. Overfitting and underfitting are great challenges for machine learning algorithms.
➢ MACHINE LEARNING PROCESS
The emerging process model for the data mining solutions for business organizations is CRISP-DM.
Since machine learning is like data mining, except for the aim, this process can be used for machine


learning. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This process involves six steps. The steps are listed below in Figure 1.11.

1. Understanding the business – This step involves understanding the objectives and requirements
of the business organization. Generally, a single data mining algorithm is enough for giving the
solution. This step also involves the formulation of the problem statement for the data mining
process.
2. Understanding the data – It involves the steps like data collection, study of the characteristics of
the data, formulation of hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data and
preparation of data for the data mining process. The missing values may cause problems during
both the training and testing phases. Missing data can force classifiers to produce inaccurate results. This
is a perennial problem for the classification models. Hence, suitable strategies should be adopted
to handle the missing data.
4. Modelling – This step plays a role in the application of data mining algorithm for the data to obtain
a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical analysis
and visualization methods. The performance of the classifier is determined by evaluating the
accuracy of the classifier. The process of classification is a fuzzy issue. For example, classification


of emails requires extensive domain knowledge and requires domain experts. Hence, performance
of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to improve
the existing process or for a new situation.
➢ MACHINE LEARNING APPLICATIONS
Machine Learning technologies are used widely now in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in the day-to-day
life.
Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the
words of documents are converted to sentiments like happy, sad, and angry which are captured
by emoticons effectively. For movie reviews or product reviews, five stars or one star are
automatically attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible. For
example, Amazon recommends related books or books bought by people who have the same taste as you, and Netflix suggests shows or related movies matching your taste. The
recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine learning, which help to locate and navigate the shortest paths to reduce travel time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of
the machine learning applications.


➢ WHAT IS DATA?
• All facts are data. In computer systems, bits encode facts present in numbers, text, images, audio,
and video.
• Data can be directly human interpretable (such as numbers or texts) or diffused data such as
images or video that can be interpreted only by a computer.
• Data is available in different data sources like flat files, databases, or data warehouses. It can
either be an operational data or a non-operational data.
• Operational data is data that is encountered in normal business procedures and processes. For example, daily sales data is operational data. Non-operational data, on the other hand, is the kind of data that is used for decision making.
• Data by itself is meaningless. It has to be processed to generate any information. A string of bytes
is meaningless. Only when a label is attached like height of students of a class, the data becomes
meaningful.


• Processed data is called information, and it includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product sold the most in the last quarter of the year.

➢ Elements of Big Data


• Data whose volume is small and which can be stored and processed by a small-scale computer is called ‘small data’. These data are collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, is data whose volume is much larger than that of ‘small data’ and is characterized as follows:
1. Volume – Since there is a reduction in the cost of storing devices, there has been a tremendous
growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB),
but Big Data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million
terabytes.
2. Velocity – The speed at which data arrives and the rate at which its volume grows are referred to as velocity. The availability of IoT devices and Internet power ensures that the data is arriving at a faster rate.
Velocity helps to understand the relative growth of big data and its accessibility by users, systems
and applications.
3. Variety – The variety of Big Data includes:
▪ Form – There are many forms of data. Data types range from text, graph, audio, video, to maps.
There can be composite data too, where one media can have many other sources of data, for
example, a video can have an audio song.
▪ Function – These are data from various sources like human conversations, transaction records,
and old archive data.
▪ Source of data – This is the third aspect of variety. There are many sources of data. Broadly, the
data source can be classified as open/public data, social media data and multimodal data.
4. Veracity of data – Veracity of data deals with aspects like conformity to the facts, truthfulness,
believability, and confidence in data. There may be many sources of error such as technical errors,
typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it. Thus,
these 6 Vs are helpful to characterize the big data. The data quality of the numeric attributes
is determined by factors like precision, bias, and accuracy.


• Precision is defined as the closeness of repeated measurements. Often, standard deviation is used
to measure the precision.
• Bias is a systematic result due to erroneous assumptions of the algorithms or procedures.
Accuracy refers to the closeness of measurements to the true value of the quantity. Normally, the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.
➢ Types of Data
In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi structured
data.
1. Structured Data: In structured data, data is stored in an organized manner such as a database
where it is available in the form of a table. The data can also be retrieved in an organized manner
using tools like SQL. The structured data frequently encountered in machine learning are listed
below:
▪ Record Data A dataset is a collection of measurements taken from a process. We have a
collection of objects in a dataset and each object has a set of measurements. The measurements
can be arranged in the form of a matrix. Rows in the matrix represent an object and can be
called as entities, cases, or records. The columns of the dataset are called attributes, features,
or fields. Label is the term that is used to describe the individual observations.
▪ Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or
vectors in the multidimensional space where every attribute is a dimension describing the
object.
▪ Graph Data It involves the relationships among objects. For example, a web page can refer
to another web page. This can be modeled as a graph. The nodes are web pages and a hyperlink is an edge that connects the nodes.
▪ Ordered Data Ordered data objects involve attributes that have an implicit order among them.
The examples of ordered data are:
a) Temporal data – It is the data whose attributes are associated with time. For example,
the customer purchasing patterns during festival time is sequential data. Time series data
is a special type of sequence data where the data is a series of measurements over time.
b) Sequence data – It is like sequential data but does not have time stamps. This data
involves the sequence of words or letters. For example, DNA data is a sequence of four
characters – A T G C.


c) Spatial data – It has attributes such as positions or areas. For example, maps are spatial
data where the points are related by location.
2. Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.
3. Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.

➢ Data Storage and Representation


Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis. The
goal of data storage management is to make data available for analysis. There are different approaches
to organize and manage data in storage files and systems from flat file to data warehouses. Some of
them are listed below:

Flat Files These are the simplest and most commonly available data source. It is also the cheapest
way of organizing the data. These flat files are the files where data is stored in plain ASCII or EBCDIC
format. Minor changes to the data in flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes large.
Some of the popular spreadsheet formats are listed below:
CSV files – CSV stands for comma-separated value files where the values are separated by commas.
These are used by spreadsheet and database applications. The first row may have attributes and the
rest of the rows represent the data.
TSV files – TSV stands for Tab separated values files where values are separated by Tab. Both CSV
and TSV files are generic in nature and can be shared. There are many tools like Google Sheets and
Microsoft Excel to process these files.
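A small sketch (assuming pandas is installed; the file name 'marks.csv' is hypothetical) reads a CSV file whose first row holds the attribute names, as described above:

```python
import pandas as pd

# 'marks.csv' is a hypothetical comma-separated file whose first row
# contains the attribute names and the remaining rows contain the data
df = pd.read_csv("marks.csv")

# The same call handles tab-separated (TSV) files by changing the separator:
# df = pd.read_csv("marks.tsv", sep="\t")

print(df.head())      # first few records
print(df.columns)     # attribute (column) names
```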
Database System
World Wide Web (WWW) It provides a diverse, worldwide online information source.
XML (eXtensible Markup Language) It is both human and machine interpretable data format.
Data Stream It is dynamic data, which flows in and out of the observing environment.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for
many machine learning algorithms.

➢ BIG DATA ANALYTICS AND TYPES OF ANALYTICS


• The primary aim of data analysis is to assist business organizations in taking decisions. For example, a business organization may want to know which is its fastest selling product, in order to plan its marketing activities.
• Data analysis is an activity that takes the data and generates useful information and insights for
assisting the organizations.
• Data analysis and data analytics are terms that are used interchangeably to refer to the same
concept. However, there is a subtle difference. Data analytics is a general term and data analysis
is a part of it.
• Data analytics refers to the process of data collection, preprocessing and analysis. It deals with
the complete cycle of data management. Data analysis is just analysis and is a part of data
analytics. It takes historical data and does the analysis. Data analytics, instead, concentrates more
on future and helps in prediction.
• There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
• Descriptive Analytics: It is about describing the main features of the data. After data collection
is done, descriptive analytics deals with the collected data and quantifies it.
• Diagnostic Analytics: This is the inference part; it deals with the question ‘Why?’. It is also known as causal analysis, as it aims to find out the cause and effect of the events.
• Predictive Analytics: It deals with the future and with the question ‘What will happen in the future, given this data?’. This involves the application of algorithms to identify the patterns to
predict the future.
• Prescriptive Analytics: Prescriptive analytics goes beyond prediction and helps in decision
making by giving a set of actions. It helps the organizations to plan better for the future and to
mitigate the risks that are involved.
➢ BIG DATA ANALYSIS FRAMEWORK
For performing data analytics, many frameworks are proposed. All proposed analytics frameworks
have some common factors. A big data framework is a layered architecture. Such an architecture has many advantages, such as generality. A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer


3. Data analytics layer


4. Presentation layer
• Data Connection Layer: It has data ingestion mechanisms and data connectors. Data ingestion
means taking raw data and importing it into appropriate data structures. It performs the tasks of the ETL (extract, transform and load) process.
• Data Management Layer: It performs preprocessing of data. The purpose of this layer is to allow
parallel execution of queries, and read, write and data management tasks. There may be many
schemes that can be implemented by this layer such as data-in-place, where the data is not moved
at all, or constructing data repositories such as data warehouses and pull data on-demand
mechanisms.
• Data Analytics Layer: It has many functionalities such as statistical tests, machine learning algorithms to understand the data, and construction of machine learning models. This layer implements
many model validation mechanisms too.
• Presentation Layer: It has mechanisms such as dashboards, and applications that display the
results of analytical engines and machine learning algorithms.
• Thus, the Big Data processing cycle involves data management that consists of the following
steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning algorithm
This is an iterative process and is carried out on a permanent basis to ensure that data is suitable for data
mining.
➢ Data Collection
The first task in gathering datasets is the collection of data. It is often estimated that most of the time is spent on the collection of good quality data. Good quality data yields a better result. It is often difficult to characterize ‘good data’. ‘Good data’ is data that has the following properties:
Timeliness – The data should be current and not stale or obsolete.
Relevancy – The data should be relevant and ready for the machine learning or data mining
algorithms. All the necessary information should be available and there should be no bias in the data.
Knowledge about the data – The data should be understandable and interpretable, and should be self-
sufficient for the required application as desired by the domain knowledge engineer.
Broadly, the data source can be classified as open/public data, social media data and multimodal data.


Open or public data source – It is a data source that does not have any stringent copyright rules or
restrictions. Its data can be primarily used for many purposes. Government census data are good examples
of open data:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data and biological
data.
• Healthcare systems that use extensive databases like patient databases, health insurance data,
doctors’ information, and bioinformatics information
Social media – It is the data that is generated by various social media platforms like Twitter, Facebook,
YouTube, and Instagram. An enormous amount of data is generated by these platforms.
Multimodal data – It includes data that involves many modes such as text, video, audio and mixed types.
Some of them are listed below:
• Image archives contain larger image databases along with numeric and text data
• The World Wide Web (WWW) has huge amount of data that is distributed on the Internet. These
data are heterogeneous in nature.
➢ Data Preprocessing
In the real world, the available data is ’dirty’. By the word ’dirty’, it means incomplete data, noisy or inaccurate data, inconsistent data, missing values, duplicate data, and outlier data.

• Data preprocessing improves the quality of the data mining techniques. The raw data must be pre-
processed to give accurate results. The process of detection and removal of errors in data is called
data cleaning.
• Data wrangling means making the data processable for machine learning algorithms. Some of the
data errors include human errors such as typographical errors or incorrect measurement and
structural errors like improper data formats.
• Data errors can also arise from omission and duplication of attributes. Noise is a random component and involves distortion of a value or introduction of spurious objects. The term noise is often used when the data has a spatial or temporal component. Certain deterministic distortions in the form of a streak are known as artifacts.


• It can be observed that data like Salary = ’ ’ is incomplete data. The DoB of patients, John, Andre,
and Raju, is the missing data. The age of David is recorded as ‘5’ but his DoB indicates it is
10/10/1980. This is called inconsistent data.
• Inconsistent data occurs due to problems in conversions, inconsistent formats, and difference in
units. Salary for John is -1500. It cannot be less than ‘0’. It is an instance of noisy data. Outliers are
data that exhibit the characteristics that are different from other data and have very unusual values.
The age of Raju cannot be 136. It might be a typographical error. It is often required to distinguish
between noise and outlier data.
• Outliers may be legitimate data and sometimes are of interest to the data mining algorithms. These
errors often come during data collection stage. These must be removed so that machine learning
algorithms yield better results as the quality of results is determined by the quality of input data.
This removal process is called data cleaning.
➢ Missing Data Analysis
The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill up
the missing values, smoothen the noise while identifying the outliers and correct the inconsistencies
of the data. This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing data:
Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This method is not
effective when the percentage of the missing values increases.
Fill in the values manually – Here, the domain expert can analyse the data tables and carry out the
analysis and fill in the values manually. But, this is time consuming and may not be feasible for larger
sets.
A global constant can be used to fill in the missing attributes. The missing values may be ’Unknown’ or ’Infinity’. But, some data mining algorithms may give spurious results by analysing these labels.
The attribute value may be filled in with the attribute mean. Say, the average income can replace a missing value.


Use the attribute mean for all samples belonging to the same class. Here, the average value replaces
the missing values of all tuples that fall in this group.
Use the most possible value to fill in the missing value. The most probable value can be obtained
from other methods like classification and decision tree prediction.
Some of these methods introduce bias in the data. The filled value may not be correct and could be just
an estimated value. Hence, the difference between the estimated and the original value is called an error
or bias.
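The following pandas sketch (illustrative only; the column names and values are hypothetical) shows two of the strategies above in code: filling a missing numeric attribute with the overall attribute mean, and with the mean of samples belonging to the same class:

```python
import numpy as np
import pandas as pd

# Hypothetical records with a missing Income value
df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B"],
    "Income": [1000.0, np.nan, 2000.0, 2200.0],
})

# Strategy 1: replace the missing value with the overall attribute mean
overall = df["Income"].fillna(df["Income"].mean())

# Strategy 2: replace it with the mean of samples of the same class
by_class = df.groupby("Class")["Income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())
print(by_class.tolist())
```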
➢ Removal of Noisy or Outlier Data
Noise is a random error or variance in a measured value. It can be removed by using binning, which
is a method where the given data values are sorted and distributed into equal frequency bins. The bins
are also called as buckets. The binning method then uses the neighbor values to smooth the noisy
data.
Some of the techniques commonly used are ‘smoothing by bin means’ where the mean of the bin replaces the values in the bin, ‘smoothing by bin medians’ where the bin median replaces the bin values, and
‘smoothing by bin boundaries’ where the bin value is replaced by the closest bin boundary. The
maximum and minimum values are called bin boundaries. Binning methods may be used as a
discretization technique.
Example 2.1 illustrates this principle
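Since the worked Example 2.1 is not reproduced here, the sketch below applies equal-frequency binning followed by smoothing by bin means to a small illustrative list of values (the data is made up):

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # illustrative data
n_bins = 3

# Equal-frequency binning: sort the values and split them into bins of equal size
sorted_vals = np.sort(values)
bins = np.array_split(sorted_vals, n_bins)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
for b in bins:
    smoothed = np.full(len(b), b.mean())
    print(list(b), "->", list(smoothed))
```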


➢ Data Integration and Data Transformations


Data integration involves routines that merge data from multiple sources into a single data source. So,
this may lead to redundant data.
The main goal of data integration is to detect and remove redundancies that arise from integration.
Data transformation routines perform operations like normalization to improve the performance of the
data mining algorithms. It is necessary to transform data so that it can be processed.
This can be considered as a preliminary stage of data conditioning. Normalization is one such
technique. In normalization, the attribute values are scaled to fit in a range (say 0-1) to improve the
performance of the data mining algorithm. Often, in neural networks, these techniques are used. Some
of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure It is a normalization technique where each variable V is normalized
by its difference with the minimum value divided by the range to a new range, say 0–1.
Often, neural networks require this kind of normalization. The formula to implement this
normalization is given as:

v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min        (2.1)

Here, (max - min) is the range; min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.


Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max procedure and map the
marks to a new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The new min and new max are
0 and 1, respectively. The mapping can be done using Eq. (2.1) as:

So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.66, 1}.
Thus, the Min-Max normalization range is between 0 and 1.
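A short check of Example 2.2 in code, using plain NumPy and the same target range of 0 to 1:

```python
import numpy as np

V = np.array([88, 90, 92, 94], dtype=float)
new_min, new_max = 0.0, 1.0

# Min-Max normalization, Eq. (2.1)
V_norm = (V - V.min()) / (V.max() - V.min()) * (new_max - new_min) + new_min
print(V_norm)   # approximately [0.  0.33  0.67  1.]
```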
Z-Score Normalization: This procedure works by taking the difference between the field value and
mean value, and by scaling this difference by the standard deviation of the attribute:

v' = (v - m) / s        (2.2)

Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the marks to z-score.
Solution: The mean and Sample Standard deviation (s) values of the list V are 20 and 10,
respectively. So, the z-scores of these marks are calculated using Eq. (2.2) as:

Hence, the z-score of the marks 10, 20, 30 are -1, 0 and 1, respectively.
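The corresponding check for Example 2.3, using the sample standard deviation (ddof=1) as in the solution:

```python
import numpy as np

V = np.array([10, 20, 30], dtype=float)
m, s = V.mean(), V.std(ddof=1)   # mean = 20, sample standard deviation = 10

# z-score normalization, Eq. (2.2)
z = (V - m) / s
print(z)   # [-1.  0.  1.]
```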
Data Reduction
Data reduction reduces data size but produces the same results. There are different ways in which data
reduction can be carried out such as data aggregation, feature selection, and dimensionality reduction.


➢ DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that does dataset summarization. It is used to summarize
and describe data. Descriptive statistics are just descriptive and do not go beyond that. In other words,
descriptive statistics do not bother too much about machine learning algorithms and its functioning.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data objects may be records, points,
vectors, patterns, events, cases, samples or observations. These records contain many attributes. An
attribute can be defined as the property or characteristics of an object. For example, consider the
following database shown in sample Table 2.2.

Every attribute should be associated with a value. This process is called measurement. The type of
attribute determines the data types, often referred to as measurement scale types. The data types are
shown in Figure 2.1.

Broadly, data can be classified into two types:


1. Categorical or qualitative data
2. Numerical or quantitative data
Categorical or Qualitative Data The categorical data can be divided into two types. They are nominal
type and ordinal type.
➢ Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and cannot
be processed like a number. For example, the average of a patient ID does not make any statistical
sense. Nominal data type provides only information but has no ordering among data. Only


operations like (=, ≠) are meaningful for these data. For example, the patient ID can be checked
for equality and nothing else.
➢ Ordinal Data – It provides enough information and has natural order. For example, Fever =
{Low, Medium, High} is an ordinal data. Certainly, low is less than medium and medium is less
than high, irrespective of the value. Any order-preserving transformation can be applied to these data to get a new value.
Numeric or Quantitative Data It can be divided into two categories. They are interval type and ratio type.
➢ Interval Data – Interval data is a numeric data for which the differences between values are
meaningful. For example, there is a difference between 30 degrees and 40 degrees. The only permissible operations are + and -.
➢ Ratio Data – For ratio data, both differences and ratio are meaningful. The difference between
the ratio and interval data is the position of zero in the scale. For example, take the Centigrade-
Fahrenheit conversion. The zeroes of both scales do not match. Hence, these are interval data.
Another way of classifying the data is to classify it as:
1. Discrete value data
2. Continuous data
➢ Discrete Data This kind of data is recorded as integers. For example, the responses of the survey
can be discrete data. Employee identification number such as 10001 is discrete data.
➢ Continuous Data It can be fitted into a range and includes decimal point. For example, age is a
continuous data. Though age appears to be discrete data, one may be 12.5 years old and it makes
sense. Patient height and weight are all continuous data.
Third way of classifying the data is based on the number of variables used in the dataset. Based on that,
the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure
2.2.

➢ UNIVARIATE DATA ANALYSIS AND VISUALIZATION


Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset has
only one variable. A variable can also be called a category. Univariate analysis does not deal with causes or relationships. The aim of univariate analysis is to describe data and find patterns.

Dr BA, AG, NGY Dept of ISE, RNSIT


Machine Learning BCS602

Univariate data description involves finding the frequency distributions, central tendency measures,
dispersion or variation, and shape of the data.
Data Visualization
To understand data, graphical visualization is a must. Data visualization helps in understanding the data and in presenting information and data to customers. Some of the graphs that are used in univariate data analysis are bar charts, histograms, frequency polygons and pie charts.
The advantages of graphs are the presentation, summarization, description and exploration of data, and making comparisons of data easier. Let us consider some forms of graphs.
Bar Chart A Bar chart (or Bar graph) is used to display the frequency distribution for variables. Bar
charts are used to illustrate discrete data. The charts can also help to explain the counts of nominal data.
It also helps in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown below
in Figure 2.3.
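As an illustration (not part of the original figure), a minimal matplotlib sketch that could reproduce such a bar chart for the marks above might look like this:

```python
import matplotlib.pyplot as plt

# Marks of five students, as given in the text
student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_ids, marks)            # one bar per student
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()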

Pie Chart These are equally helpful in illustrating the univariate data. The percentage frequency
distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is below in Figure 2.4.

It can be observed that the number of students with 22 marks is 2 and the total number of students is 10. So, 2/10 × 100 = 20% of the pie (out of 100%) is allotted for the mark 22 in Figure 2.4.
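A hedged sketch of how such a percentage pie chart could be produced with matplotlib, assuming the same list of marks given above:

```python
from collections import Counter
import matplotlib.pyplot as plt

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]
freq = Counter(marks)                      # mark -> count

plt.pie(freq.values(),
        labels=[f"{m} marks" for m in freq.keys()],
        autopct="%1.0f%%")                 # e.g. 22 marks -> 2/10 = 20%
plt.title("Percentage frequency distribution of marks")
plt.show()
```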
Histogram It plays an important role in data mining for showing frequency distributions. The
histogram for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50, 51-75, 76-100 is

given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the
range 76-100 is 2.

A histogram conveys useful information such as the nature of the data and its mode; the mode indicates the peak of the dataset. In other words, histograms can be used as charts to show the frequency, the skewness present in the data, and the shape.
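A minimal sketch of the histogram described above, assuming matplotlib and bin edges that approximate the stated ranges:

```python
import matplotlib.pyplot as plt

marks = [45, 60, 60, 80, 85]
# Bin edges approximating the ranges 0-25, 26-50, 51-75, 76-100
bins = [0, 25, 50, 75, 100]

plt.hist(marks, bins=bins, edgecolor="black")
plt.xlabel("Marks range")
plt.ylabel("Number of students")
plt.title("Histogram of students' marks")
plt.show()
```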
Dot Plots These are similar to bar charts but less cluttered, as they illustrate the values with single points rather than bars. The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage is that by visual inspection one can find out who got more marks.

Central Tendency
One cannot remember all the data; therefore, a condensation or summary of the data is necessary. This makes data analysis easy and simple. One such summary is called central tendency. Thus, central tendency can explain the characteristics of data and that further helps in comparison. Mass data have a tendency to concentrate at certain values, normally in the central location. This is called a measure of central tendency (or average). Popular measures are mean, median and mode.

Mean – Arithmetic average (or mean) is a measure of central tendency that represents the ‘center’ of the dataset. Mathematically, the average of all the values in the sample (population) is denoted as x̄. Let x1, x2, … , xN be a set of ‘N’ values or observations; then the arithmetic mean is given as:

x̄ = (x1 + x2 + … + xN) / N
Weighted mean – Unlike the arithmetic mean, which gives equal weightage to all items, the weighted mean gives different importance to different items, as item importance varies. Hence, different weightages can be given to items: if item xi has weight wi, the weighted mean is (w1x1 + w2x2 + … + wNxN) / (w1 + w2 + … + wN). In the case of a frequency distribution, the mid-values of the ranges are taken for computation (a small computational sketch is given below). In the weighted mean, the mean is computed by adding the products of the proportions and the group means. It is mostly used when the sample sizes are unequal.
Geometric mean – Let x1, x2, … , xN be a set of ‘N’ values or observations. The geometric mean is the Nth root of the product of the N items. The formula for computing the geometric mean is given as follows:

GM = (x1 × x2 × … × xN)^(1/N)
A problem with the mean is its extreme sensitivity to outliers: even a few extreme values in the input can affect the mean drastically. Hence, for larger datasets, a trimmed mean is often used, where, for example, the top 2% of values is chopped off before the mean is calculated.
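The different means discussed above can be computed with NumPy and SciPy; the sketch below is illustrative only, and the weights shown are hypothetical:

```python
import numpy as np
from scipy import stats

values  = [45, 60, 60, 80, 85]       # marks from the earlier examples
weights = [1, 2, 2, 3, 2]            # hypothetical weights (importance of each item)

arithmetic_mean = np.mean(values)
weighted_mean   = np.average(values, weights=weights)   # sum(w*x) / sum(w)
geometric_mean  = stats.gmean(values)                    # N-th root of the product of the items
trimmed_mean    = stats.trim_mean(values, 0.02)          # trims 2% from each tail (too few points here for any to be removed)

print(arithmetic_mean, weighted_mean, geometric_mean, trimmed_mean)
```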
Median – The middle value in the distribution is called the median. If the total number of items in the distribution is odd, then the middle value is the median; if it is even, the median is the average of the two middle values. For grouped data, the median class is the class in which the (N/2)th item is present. In the continuous case, the median is given by the formula:

Median = L1 + ((N/2 − cf) / f) × i

Here, i is the class interval of the median class, L1 is the lower limit of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
Mode – Mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called the mode.
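A short sketch computing the three measures with Python's standard statistics module, using the marks from the earlier figures:

```python
import statistics

marks = [45, 60, 60, 80, 85]

print(statistics.mean(marks))    # arithmetic mean -> 66
print(statistics.median(marks))  # middle value of the sorted data -> 60
print(statistics.mode(marks))    # most frequent value -> 60
```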
➢ Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called dispersion. Dispersion is represented in various ways such as range, variance, standard deviation, and standard error. These are second order measures. The most common measures of dispersion are listed below:
• Range Range is the difference between the maximum and minimum values of the given list of data.
• Standard Deviation The mean does not convey much more than a middle point. For example, the datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20; the difference between these two sets is the spread of the data. Standard deviation measures how far, on average, the data points lie from the mean of the dataset (it is the square root of the average squared deviation). The formula for the standard deviation is given by:

s = √( Σ (xi − m)² / N )

Here, N is the size of the population, xi is an observation or value from the population and m is the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8) to obtain the sample standard deviation.
• Quartiles and Inter Quartile Range
It is sometimes convenient to subdivide the dataset using coordinates called percentiles. The kth percentile is the value Xi such that k% of the data lies at or below Xi. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3). Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR), which is the difference between Q3 and Q1: IQR = Q3 − Q1.

Outliers are normally values that fall at least 1.5 × IQR above the third quartile or below the first quartile, as illustrated in the sketch below.
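A brief NumPy sketch of these dispersion measures; the dataset here is hypothetical and chosen to contain one extreme value:

```python
import numpy as np

data = [10, 15, 20, 25, 30, 35, 40, 45, 120]     # hypothetical data with one extreme value

data_range     = max(data) - min(data)            # range
population_std = np.std(data)                     # divides by N
sample_std     = np.std(data, ddof=1)             # divides by N - 1

q1, q3 = np.percentile(data, [25, 75])            # first and third quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(data_range, round(sample_std, 2), q1, q3, iqr, outliers)   # outliers -> [120]
```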

Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum and maximum
written in the order < Minimum, Q1, Median, Q3, Maximum > is known as five-point summary.
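A small sketch of the five-point summary and the corresponding box plot, reusing the hypothetical data from the previous sketch so the example stays self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [10, 15, 20, 25, 30, 35, 40, 45, 120]      # same hypothetical data as above

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print("Five-point summary:", minimum, q1, median, q3, maximum)

plt.boxplot(data)        # box spans Q1..Q3, the line inside is the median,
                         # points beyond the whiskers are plotted as outliers
plt.title("Box plot of the data")
plt.show()
```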

Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of the
dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally, skewness
should be zero as in ideal normal distribution. More often, the given dataset may not have perfect
symmetry.

The dataset may have either very high values or extremely low values. If the dataset has a few far higher values, then it is said to be skewed to the right. On the other hand, if the dataset has a few far lower values, then it is said to be skewed towards the left. If the tail is longer on the right-hand side and the hump is on the left-hand side, it is called positive skew; if the tail is longer on the left-hand side, it is called negative skew.
The given dataset may not have an equal distribution of data. The implication of this is that if the data is skewed, then there is a greater chance of outliers in the dataset. This affects the mean and the median, and hence may affect the performance of the data mining algorithm. Perfect symmetry means the skewness is zero. In positive skew, the mean is greater than the median; generally, for a negatively skewed distribution, the median is greater than the mean. The relationship between the skew and the relative sizes of the mean and median can be summarized by a convenient numerical skew index known as the Pearson 2 skewness coefficient:

Pearson 2 skewness coefficient = 3 × (mean − median) / standard deviation

Also, the following measure is more commonly used to measure skewness. Let X1, X2, …, XN be a set of ‘N’ values or observations; then the skewness is given as:

skewness = Σ (Xi − m)³ / (N × s³)

Here, m is the population mean and s is the population standard deviation of the univariate data. Sometimes, for bias correction, N − 1 is used instead of N.
Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, then it indicates higher kurtosis and vice versa. Kurtosis is measured using the formula given below:

kurtosis = Σ (xi − x̄)⁴ / ((N − 1) × s⁴)
It can be observed that N − 1 is used instead of N in Eq. (2.14) for bias correction. Here, x̄ and s are the mean and standard deviation of the univariate data, respectively. Some of the other useful measures for finding the shape of the univariate dataset are the mean absolute deviation (MAD) and the coefficient of variation (CV).
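A hedged sketch of how both the Pearson skewness coefficient and the moment-based skewness and kurtosis could be computed with SciPy (the exact bias conventions may differ slightly from Eqs. 2.13–2.14):

```python
import numpy as np
from scipy import stats

data = [13, 11, 2, 3, 4, 8, 9]          # marks also used later for the Q-Q plot

mean   = np.mean(data)
median = np.median(data)
std    = np.std(data, ddof=1)

pearson_skew = 3 * (mean - median) / std          # Pearson 2 skewness coefficient
moment_skew  = stats.skew(data)                   # third-moment skewness
kurt         = stats.kurtosis(data, fisher=False) # fourth-moment kurtosis (3 for a normal distribution)

print(round(pearson_skew, 3), round(moment_skew, 3), round(kurt, 3))
```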
Mean Absolute Deviation (MAD)
MAD is another dispersion measure and is robust to outliers. (In robust outlier detection, a point's deviation from the median is divided by the median absolute deviation.) Here, the absolute deviation between the data and the mean is taken, so the mean absolute deviation is given as:

MAD = Σ |xi − x̄| / N
Coefficient of Variation (CV)
The coefficient of variation is used to compare datasets with different units. CV is the ratio of the standard deviation to the mean, CV = s / x̄, and %CV = (s / x̄) × 100 is the coefficient of variation expressed as a percentage.
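A minimal NumPy sketch computing MAD and %CV for the marks used in the earlier examples:

```python
import numpy as np

data = np.array([45, 60, 60, 80, 85])     # marks from the earlier examples

mean = data.mean()
mad  = np.mean(np.abs(data - mean))       # mean absolute deviation from the mean
cv   = np.std(data, ddof=1) / mean        # coefficient of variation
print(round(mad, 2), round(100 * cv, 2))  # MAD and %CV
```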
Special Univariate Plots
A simple way to check the shape of the dataset is a stem and leaf plot. A stem and leaf plot is a display that helps us to know the shape and distribution of the data. In this method, each value is split into a ‘stem’ and a ‘leaf’. The last digit is usually the leaf and the digits to the left of the leaf mostly form the stem. For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9. The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85} is given in the figure.

It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf. For the given English marks, the two students with 60 marks are shown in the stem and leaf plot as stem 6 with two leaves of 0.
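Since standard plotting libraries do not produce stem and leaf plots directly, a tiny hand-rolled sketch such as the following can be used (assuming two-digit values, so the last digit is the leaf):

```python
from collections import defaultdict

marks = [45, 60, 60, 80, 85]

stems = defaultdict(list)
for value in sorted(marks):
    stem, leaf = divmod(value, 10)        # last digit is the leaf, the rest is the stem
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
# Output:
# 4 | 5
# 6 | 0 0
# 8 | 0 5
```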
As discussed earlier, the ideal shape of a dataset is a bell-shaped curve, which corresponds to normality. Most statistical tests are designed only for normally distributed data. A Q-Q plot can be used to assess the shape of the dataset. The Q-Q plot is a 2D scatter plot of univariate data against theoretical normal distribution quantiles, or of two datasets against each other - the quantiles of the first dataset against those of the second. The normal Q-Q plot for the marks x = [13 11 2 3 4 8 9] is given below in Figure 2.10.
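A minimal sketch of such a normal Q-Q plot using scipy.stats.probplot, with the marks listed above:

```python
import matplotlib.pyplot as plt
from scipy import stats

x = [13, 11, 2, 3, 4, 8, 9]                # marks from the text

stats.probplot(x, dist="norm", plot=plt)   # ordered data vs. theoretical normal quantiles
plt.title("Normal Q-Q plot of the marks")
plt.show()
```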
