
Machine Learning (BCS602)

Module 1

Introduction to Machine Learning

Need for Machine Learning

Business organizations use huge amounts of data in their daily activities. Earlier, the full potential of this data was not utilized, for two reasons:

1. Data was scattered across different systems, and organizations were not able to integrate these sources fully.
2. There was a lack of awareness about software tools that could help unearth useful information from data.

Business organizations have now started to use the latest technology, machine learning, for this purpose.

Machine learning has become popular for three reasons:

1. High volume of available data to manage: Big companies such as Facebook, Twitter and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that the data approximately doubles every year.
2. The cost of storage has reduced, and hardware costs have also dropped. It is therefore easier now to capture, process, store, distribute and transmit digital information.
3. The availability of complex algorithms. With the advent of deep learning, many algorithms are available for machine learning.

With the popularity and ready adoption of machine learning by business organizations, it has become a dominant technology trend. A knowledge pyramid is shown in Figure 1.1.


Figure 1.1: The Knowledge Pyramid

Data

• All facts are data.
• Data can be numbers or text that can be processed by a computer.
• Today, organizations are accumulating vast and growing amounts of data, with data sources such as flat files, databases, or data warehouses in different storage formats.

Information

• Processed data is called information.
• Information includes patterns, associations, or relationships among data.
• For example, sales data can be analyzed to extract information such as which is the fastest-selling product.


Knowledge

• Condensed information is called knowledge.
• For example, the historical patterns and future trends obtained from the sales data can be called knowledge.
• Unless knowledge is extracted, data is of no use.
• Knowledge is not useful unless it is put into action.

Intelligence

• Intelligence is the applied knowledge for actions.
• An actionable form of knowledge is called intelligence.

Wisdom

• The ultimate objective of the knowledge pyramid is wisdom, which represents the maturity of mind that is, so far, exhibited only by humans.

Objective of Machine Learning

The objective of machine learning is to process archival data so that organizations can take better decisions, design new products, improve business processes, and develop effective decision support systems.

Machine Learning Explained

• Machine learning is an important sub-branch of Artificial Intelligence (AI).
• Arthur Samuel stated that "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed".
• As per Arthur Samuel's definition, a system should learn by itself without explicit programming.
• It is widely known that to perform a computation, one needs to write programs.
• In conventional programming, after understanding the problem, a detailed design of the program, such as a flowchart or an algorithm, is created and converted into a program using a suitable programming language.


• This approach is difficult for many real-world problems such as puzzles, games, and complex image recognition applications.
• Initially, artificial intelligence aimed to understand these problems and develop general-purpose rules manually.
• These rules were then formulated into logic and implemented in a program to create intelligent systems.
• This idea of developing intelligent systems by using logic and reasoning, by converting an expert's knowledge into a set of rules and programs, is called an expert system.
• The expert system approach was impractical in many domains, as programs still depended on human expertise and hence did not truly exhibit intelligence.
• The focus of AI is now to develop intelligent systems using a data-driven approach, where data is used as input to develop intelligent models.
• The models can then be used to make predictions on new inputs.
• Thus, the aim of machine learning is to learn a model or set of rules from the given dataset automatically, so that it can predict unknown data correctly.
• As humans take decisions based on experience, computers make models based on patterns extracted from the input data and then use these data-filled models for prediction and decision making.
• For computers, the learnt model is equivalent to human experience.

Figure 1.2: (a) A learning system for humans (b) A learning system for Machine Learning


• Often, the quality of data determines the quality of experience and, therefore, the quality of the learning system.
• In statistical learning, the relationship between the input x and output y is modeled as a function of the form y = f(x).
• f is the learning function that maps the input x to the output y.
• The learning program summarizes the raw data in a model.

A model is an explicit description of patterns within the data, in the form of:

1. Mathematical equations
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters

• A model can be a formula, procedure or representation that can generate decisions from data.
• The difference between a pattern and a model is that the former is local and applicable only to certain attributes, while the latter is global and fits the entire dataset.

Tom Mitchell’s Definition of Machine Learning

"A computer program is said to learn from experience E, with respect to task T and some performance measure P, if its performance on T, measured by P, improves with experience E."

• The important components of this definition are experience E, task T, and performance measure P.
• For example, the task T could be detecting an object in an image.
• The machine can gain knowledge of the object using a training dataset of thousands of images. This is called experience E.
• So, the focus is to use this experience E for the object detection task T.
• The ability of the system to detect the object is measured by performance measures like precision and recall (a small sketch follows the steps listed below).
• Models of computer systems are equivalent to human experience.


• Once knowledge is gained, when a new problem is encountered, humans search for similar past situations, formulate heuristics, and use them for prediction.

But, in systems, experience is gathered by these steps:

1. Collection of data.
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to generate concepts. This is equivalent to humans' idea of objects; for example, we have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can be viewed as an ordering of all possible concepts. So, generalization involves ranking of concepts, inferencing from them, and formation of heuristics, an actionable aspect of intelligence.
4. Heuristics normally work! But, occasionally, they may fail too. That is not the fault of the heuristic, as it is just a 'rule of thumb'. Course correction is done by taking evaluation measures. Evaluation checks the thoroughness of the models and allows course correction, if necessary, to generate better formulations.
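To make the E, T, P terminology concrete, here is a minimal Python sketch of computing precision and recall as the performance measure P for a binary detection task. The tiny label lists are invented purely for illustration.

def precision_recall(y_true, y_pred):
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 1]   # ground truth: object present (1) or absent (0)
y_pred = [1, 1, 1, 0, 0, 1]   # labels predicted by some learnt model
p, r = precision_recall(y_true, y_pred)
print(p, r)                   # 0.75 0.75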

MACHINE LEARNING IN RELATION TO OTHER FIELDS

• Machine learning primarily uses concepts from Artificial Intelligence, Data Science, and Statistics.
• It is the result of combined ideas from diverse fields.

Machine Learning and Artificial Intelligence


• Machine learning is an important branch of AI, which is a much broader subject.
• The aim of AI is to develop intelligent agents.
• An agent can be a robot, a human, or any autonomous system.
• Initially, the idea of AI was ambitious, that is, to develop intelligent systems like human beings.
• The focus was on logic and logical inferences.
• AI has seen many ups and downs; the down periods are called AI winters.


• The resurgence of AI happened due to the development of data-driven systems.
• The aim is to find the relations and regularities present in data.
• Machine learning is the sub-branch of AI whose aim is to extract these patterns for prediction.

Figure 1.3: Relationship of AI with machine learning

• Deep learning is a sub-branch of machine learning.
• In deep learning, the models are constructed using neural network technology.
• Neural networks are based on models of the human neuron.
• Many neurons form a network, connected through activation functions that trigger further neurons to perform tasks.

Machine Learning, Data Science, Data Mining, and Data Analytics

• Data science is an 'umbrella' term that encompasses many fields.
• Machine learning starts with data; therefore, data science and machine learning are interlinked.
• Machine learning is a branch of data science.
• Data science deals with the gathering of data for analysis.

Data science is a broad field that includes:

Big Data: Data science is concerned with the collection of data. Big data is a field of data science that deals with data having the following characteristics:

1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter, and YouTube.
2. Variety: Data is available in a variety of forms like images and videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.

Data Mining

• Data mining's original genesis is in business.
• Just as mining the earth yields precious resources, it is often believed that unearthing the data produces hidden information that would otherwise have eluded the attention of management.
• There is little difference between data mining and machine learning, except that data mining aims to extract the hidden patterns that are present in the data, whereas machine learning aims to use those patterns for prediction.

Data Analytics

• Another branch of data science is data analytics.
• Data analytics aims to extract useful knowledge from crude data.
• There are different types of analytics; predictive data analytics is used for making predictions.
• Machine learning is closely related to this branch of analytics and shares almost all its algorithms.

Pattern Recognition

• It is an engineering field.


• It uses machine learning algorithms to extract the features for pattern analysis and pattern classification.
• One can view pattern recognition as a specific application of machine learning.

Figure 1.4: Relationship of Machine Learning with Other Major Fields

Machine Learning and Statistics

• Statistics is a branch of mathematics that has a solid theoretical foundation for statistical learning.
• Like machine learning (ML), it can learn from data.
• The difference between statistics and ML is that statistical methods look for regularity in data, called patterns.
• Statistics sets a hypothesis and performs experiments to verify and validate the hypothesis in order to find relationships among data.
• Statistics requires knowledge of the statistical procedures and the guidance of a good statistician.
• It is mathematics intensive; models are often complicated equations and involve many assumptions.
• Statistical methods are developed in relation to the data being analyzed.


• Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge.
• But it often requires interaction with various tools to automate the process of learning.

TYPES OF MACHINE LEARNING

There are four types of machine learning, as shown in Figure 1.5.

Figure 1.5: Types of Machine Learning

Labelled and Unlabelled Data

• Data is a raw fact.
• Normally, data is represented in the form of a table.
• A row of data can also be referred to as a data point, sample, or example.
• Each row of the table represents a data point.
• Features are attributes or characteristics of an object.
• Normally, the columns of the table are attributes.
• Out of all attributes, one attribute is important and is called the label.
• The label is the feature that we aim to predict.
• Thus, there are two types of data – labelled and unlabelled.


Labelled data

• Labelled data is data that has some predefined tags such as a name, type, or number.
• To illustrate labelled data, let us take the well-known Iris flower dataset.
• The dataset has 50 samples of each of three Iris classes – Iris setosa, Iris virginica, and Iris versicolor (150 samples in total) – with four attributes: the length and width of sepals and petals. The target variable is called class.

Partial data of the Iris dataset is shown in Table 1.1.

Table 1.1: Iris Flower Dataset

• A dataset need not always be numbers; it can be images or video frames.
• Deep neural networks can handle images with labels.
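As a quick illustration, the Iris dataset can be loaded from scikit-learn's bundled copy; this is a minimal sketch assuming scikit-learn is installed.

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)             # sepal/petal length and width (cm)
print(iris.target_names)              # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)                # (150, 4): 150 samples, 4 attributes
print(iris.data[0], iris.target[0])   # first labelled sample and its class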

Unlabelled data

• Any data that does not have labels specifying its characteristics, identity, classification, or properties is considered unlabelled data.
• In unlabelled data, there are no labels in the dataset.


Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset

In Figure 1.6, the deep neural network takes images of dogs and cats with labels for classification.

1. Supervised Learning
• Supervised algorithms use a labelled dataset.
• There is a supervisor or teacher component in supervised learning.
• A supervisor provides labelled data so that a model can be constructed and then tested.
• In supervised learning algorithms, learning takes place in two stages.


1. During the first stage, the teacher communicates the information that the student is supposed to master, and the student receives and understands it. During this stage, the teacher has no knowledge of whether the information has been grasped by the student.
2. During the second stage, the teacher asks the student a set of questions to find out how much of the information has been grasped. The student is tested on these questions, and the teacher informs the student of the assessment. This kind of learning is typically called supervised learning.

Supervised learning has two methods:

1. Classification
2. Regression

1. Classification

• Classification is a supervised learning method.
• The input attributes of a classification algorithm are called independent variables.
• The target attribute is called the label or dependent variable.
• The relationship between the input and target variables is represented in the form of a structure called a classification model.
• So, the focus of classification is to predict the label, which is in a discrete form (a value from a finite set of values).
• In classification, learning takes place in two stages:

1. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training samples are processed, the model is generated.
2. In the second stage, the constructed model is tested with a test or unknown sample, which is assigned a label. This is the classification process.

Some of the key algorithms of classification are:

• Decision Tree
• Random Forest

• Support Vector Machines
• Naïve Bayes
• Artificial Neural Networks and Deep Learning networks like CNN

Figure 1.7: An Example Classification System

In Figure 1.7, the classification algorithm takes a set of labelled images, such as dogs and cats, to construct a model that can later be used to classify an unknown test image.
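The two stages of classification can be illustrated with a minimal scikit-learn sketch (assuming scikit-learn is installed), using a decision tree on the Iris data described earlier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stage 1 (training): the learning algorithm takes a labelled dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2 (testing): the constructed model assigns labels to unknown samples.
print(model.predict(X_test[:5]))    # predicted class labels
print(model.score(X_test, y_test))  # accuracy on the test samples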

2. Regression Models
• Regression models, unlike classification algorithms, predict continuous variables like price; in other words, the output is a number.
• A fitted regression model is shown in Figure 1.8 for a dataset that represents weeks (the input x) and product sales (y).
• The regression model takes the input x and generates a model in the form of a fitted line, y = f(x).
• Here, x is the independent variable, which may be one or more attributes, and y is the dependent variable.
• In Figure 1.8, linear regression takes the training set and tries to fit it with a line: product sales = 0.66 × week + 0.54.
• Here, 0.66 and 0.54 are the regression coefficients that are learnt from the data.


• The advantage of this model is that a prediction for product sales (y) can be made for unknown week data (x).
• For example, the prediction for the unknown eighth week can be made by substituting x = 8 in the regression formula to get y.
• One of the most important regression algorithms is linear regression.

Figure 1.8: A Regression Model of the Form y=ax+b
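A minimal sketch of fitting such a line with NumPy follows; the weekly sales figures are invented for illustration, so the learnt coefficients will only be close to, not exactly, those quoted above.

import numpy as np

# Illustrative data: weeks (input x) and product sales (output y).
weeks = np.array([1, 2, 3, 4, 5, 6, 7])
sales = np.array([1.2, 1.8, 2.6, 3.2, 3.8, 4.5, 5.1])

# Fit a line y = a*x + b by least squares; a and b are the regression
# coefficients learnt from the data.
a, b = np.polyfit(weeks, sales, deg=1)
print(f"sales = {a:.2f} * week + {b:.2f}")

# Predict sales for the unknown eighth week by substituting x = 8.
print(round(a * 8 + b, 2))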

The differences between regression and classification algorithms are summarized below.

1. Output: In Regression, the output variable must be of a continuous nature or a real value. In Classification, the output variable must be a discrete value.
2. Task: The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
3. Data: Regression algorithms are used with continuous data. Classification algorithms are used with discrete data.
4. Goal: In Regression, we try to find the best-fit line, which can predict the output more accurately. In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
5. Problems: Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction. Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
6. Sub-types: Regression algorithms can be further divided into Linear and Non-linear Regression. Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.

Unsupervised Learning

• The second kind of learning is by self-instruction.
• In unsupervised learning, there is no supervisor or teacher component.
• In the absence of a supervisor or teacher, self-instruction is the most common kind of learning process.
• This process of self-instruction is based on the concept of trial and error.
• Here, the program is supplied with objects, but no labels are defined.
• The algorithm itself observes the examples and recognizes patterns based on the principles of grouping.
• Grouping is done so that similar objects form the same group.
• Cluster analysis and dimensionality reduction algorithms are examples of unsupervised algorithms.

i. Cluster Analysis
• Cluster analysis aims to group objects into disjoint clusters or groups.
• Cluster analysis clusters objects based on their attributes.
• All the data objects of a partition are similar in some aspect and vary significantly from the data objects in the other partitions.
• An example clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set of dog and cat images and groups them into two clusters: dogs and cats.


• It can be observed that the samples belonging to a cluster are similar, while samples differ radically across clusters.

Figure 1.9: An example clustering scheme

Some of the key clustering algorithms are:

• k-means algorithm
• Hierarchical algorithms
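A minimal k-means sketch with scikit-learn (assuming it is installed); the 2-D points are invented so that they form two obvious groups:

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# k-means partitions the objects into k = 2 disjoint clusters; no labels are given.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # the two cluster centres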

ii. Dimensionality Reduction

• Dimensionality reduction algorithms are examples of unsupervised algorithms.
• They take higher-dimensional data as input and output the data in a lower dimension, by taking advantage of the variance of the data.
• Dimensionality reduction is the task of reducing a dataset to fewer features without losing generality.
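For example, Principal Component Analysis (PCA) can project the 4-dimensional Iris data onto its two directions of largest variance; a minimal sketch assuming scikit-learn is installed:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # labels are ignored: unsupervised
X_2d = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_2d.shape)             # (150, 4) -> (150, 2)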

Semi-supervised Learning
• There are circumstances where the dataset has a huge collection of unlabelled data and some labelled data.
• Labelling is a costly process and difficult for humans to perform.
• Semi-supervised algorithms use the unlabelled data by assigning pseudo-labels.
• Then, the labelled and pseudo-labelled datasets can be combined.
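A minimal pseudo-labelling sketch (assuming scikit-learn is installed): train on a small labelled subset, assign pseudo-labels to the rest, and retrain on the combined set. The choice of 20 labelled samples and logistic regression is arbitrary, for illustration only.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Pretend only 20 samples are labelled; the rest are unlabelled.
labelled = rng.choice(len(X), size=20, replace=False)
unlabelled = np.setdiff1d(np.arange(len(X)), labelled)

# 1. Train on the small labelled set.
model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# 2. Assign pseudo-labels to the unlabelled samples.
pseudo = model.predict(X[unlabelled])

# 3. Combine labelled and pseudo-labelled data and retrain.
X_all = np.vstack([X[labelled], X[unlabelled]])
y_all = np.concatenate([y[labelled], pseudo])
model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
print(model.score(X, y))   # accuracy of the retrained model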


Table 1.2: Difference between Supervised and Unsupervised Learning

1. Supervised learning: There is a supervisor component. Unsupervised learning: There is no supervisor component.
2. Supervised learning: Uses labelled data. Unsupervised learning: Uses unlabelled data.
3. Supervised learning: Assigns categories or labels. Unsupervised learning: Performs a grouping process such that similar objects end up in one cluster.

Reinforcement Learning
• Reinforcement learning mimics human beings.
• Just as human beings use their ears and eyes to perceive the world and take actions, reinforcement learning allows an agent to interact with the environment to get rewards.
• The agent can be a human, animal, robot, or any independent program.
• The rewards enable the agent to gain experience.
• The agent aims to maximize the reward.
• The reward can be positive or negative (punishment).
• When the rewards are high, the behavior gets reinforced and learning becomes possible.

Consider the following example of a Grid game as shown in Figure 1.10.

Figure 1.10: A Grid Game


• In this grid game, the gray tile indicates danger, black is a block, and the tile with diagonal lines is the goal.
• The aim is to start, say, from the bottom-left grid cell and use the actions left, right, up and down to reach the goal state.
• For this sort of problem, there is no training data.
• The agent interacts with the environment to get experience.
• In the above case, the agent tries to create a model by simulating many paths and finding rewarding paths.
• This experience helps in constructing a model.
• Reinforcement algorithms are reward-based, goal-oriented algorithms.
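A minimal reward-based sketch in the spirit of this grid game, using tabular Q-learning on a 1-D corridor of five tiles; the layout (goal on the right, danger on the left) is an invented stand-in for Figure 1.10, not the exact grid.

import random

N_STATES, GOAL, DANGER = 5, 4, 0
ACTIONS = [-1, +1]                      # move left or move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(500):
    s = 2                               # start in the middle tile
    while s not in (GOAL, DANGER):
        # Explore occasionally; otherwise act greedily on current Q-values.
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 10 if s2 == GOAL else (-10 if s2 == DANGER else -1)
        # Q-learning update: move Q(s, a) towards reward + discounted future value.
        best_next = 0 if s2 in (GOAL, DANGER) else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learnt greedy policy should always move right (+1), towards the goal.
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(1, N_STATES - 1)])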

CHALLENGES OF MACHINE LEARNING


Problems that can be Dealt with by Machine Learning

• Computers are better than humans at performing tasks like computation.
• For example, while calculating the square root of a large number an average human may blink, but a computer can display the result in seconds.
• Computers can play games like chess and GO, and can even beat professional players of those games.
• However, humans are better than computers in many aspects, such as recognition.
• But deep learning systems challenge human beings in this aspect as well; machines can recognize human faces in a second.
• Still, there are tasks where humans are better, as machine learning systems still require quality data for model construction.
• The quality of a learning system depends on the quality of data. This is a challenge.

Some of the challenges are listed below:

1. Problems – Machine learning can deal with 'well-posed' problems, where the specifications are complete and available. Computers cannot solve 'ill-posed' problems.

Consider the simple example shown in Table 1.3.


Table 1.3: An Example

Input (x1, x2)    Output (y)
1, 1              1
2, 1              2
3, 1              3
4, 1              4
5, 1              5

There are three functions that fit the data:

1. y = x1 × x2
2. y = x1 ÷ x2
3. y = x1 ^ x2 (x1 raised to the power x2)

• This means that the problem is ill-posed: multiple models explain the same data equally well.
• To resolve this, one needs more examples to check the model.
• Puzzles and games that do not have sufficient specification may become ill-posed problems, and scientific computation has many ill-posed problems.
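The ill-posedness is easy to verify in code: all three candidate functions reproduce Table 1.3 exactly, so the data alone cannot distinguish between them.

# Input pairs (x1, x2) and outputs y from Table 1.3.
data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = {
    "y = x1 * x2":  lambda x1, x2: x1 * x2,
    "y = x1 / x2":  lambda x1, x2: x1 / x2,
    "y = x1 ** x2": lambda x1, x2: x1 ** x2,
}

for name, f in candidates.items():
    fits = all(f(x1, x2) == y for (x1, x2), y in data)
    print(name, "fits all five examples:", fits)   # True for every candidate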
2. Huge data – This is a primary requirement of machine learning. Availability of quality data is a challenge. Quality data means data that is large and free of problems such as missing or incorrect values.
3. High computation power – With the availability of Big Data, the computational resource requirement has also increased. Systems with a Graphics Processing Unit (GPU) or even a Tensor Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine learning tasks have become complex, so time complexity has increased, which can be addressed only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms, applying algorithms to solve machine learning tasks, and comparing algorithms have become necessary for machine learning and data scientists. Algorithms have become a big topic of discussion, and it is a challenge for machine learning professionals to design, select, and evaluate optimal algorithms.


5. Bias/Variance – Variance is the error of the model, and it leads to a problem called the bias/variance tradeoff. A model that fits the training data well but fails on test data lacks generalization; this is called overfitting. The reverse problem is called underfitting, where the model fails even on the training data and therefore also generalizes poorly.

MACHINE LEARNING PROCESS

The emerging process model for data mining solutions for business organizations is CRISP-DM (Cross-Industry Standard Process for Data Mining). This process involves six steps, shown in Figure 1.11 and listed below.

Figure 1.11: A Machine Learning/Data Mining Process

1. Understanding the business – This step involves understanding the objectives and requirements of the business organization. Generally, a single data mining algorithm is enough to give the solution. This step also involves the formulation of the problem statement for the data mining process.


2. Understanding the data – This involves steps like data collection, study of the characteristics of the data, formulation of a hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data and preparing it for the data mining process. Missing values may cause problems during both the training and testing phases; missing data forces classifiers to produce inaccurate results.
4. Modelling – This step involves applying the data mining algorithm to the data to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical analysis and visualization methods. The performance of a classifier is determined by evaluating its accuracy. The process of classification is a fuzzy issue.
6. Deployment – This step involves the deployment of results of the data mining algorithm
to improve the existing process or for a new situation.

MACHINE LEARNING APPLICATIONS


Machine learning technologies are now used widely in different domains, and one encounters many machine learning applications in day-to-day life. Some applications are listed below:

1. Sentiment analysis – This is an application of natural language processing (NLP) where the words of documents are mapped to sentiments like happy, sad, and angry, capturing the expressed emotions effectively. For movie reviews or product reviews, ratings of five stars or one star are attached automatically using sentiment analysis programs.

2. Recommendation systems – These are systems that make personalized purchases possible. For example, Amazon recommends related books or books bought by people who have the same taste as you, and Netflix suggests shows or related movies of your taste. These recommendation systems are based on machine learning.

3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.


4. Technologies like Google Maps and those used by Uber are examples of machine learning systems that locate and navigate the shortest paths to reduce travel time.

Table 1.4: Applications Survey Table

1. Business – Predicting the bankruptcy of a business firm
2. Banking – Prediction of bank loan defaulters and detecting credit card frauds
3. Image Processing – Image search engines, object identification, image classification, and generating synthetic images
4. Audio/Voice – Chatbots like Alexa and Microsoft Cortana; developing chatbots for customer support, speech to text, and text to voice
5. Telecommunication – Trend analysis, identification of bogus calls, fraudulent calls and their callers, and churn analysis
6. Marketing – Retail sales analysis, market basket analysis, product performance analysis, market segmentation analysis, and study of travel patterns of customers for marketing tours
7. Games – Game programs for Chess, GO, and Atari video games
8. Natural Language Translation – Google Translate, text summarization, and sentiment analysis
9. Web Analysis and Services – Identification of access patterns, detection of e-mail spams and viruses, personalized web services, search engines like Google, detection of promotion of user websites, and finding loyalty of users after web page layout modification
10. Medicine – Prediction of diseases given disease symptoms (such as cancer or diabetes), prediction of the effectiveness of treatment using patient history, and chatbots to interact with patients (IBM Watson uses machine learning technologies)
11. Multimedia and Security – Face recognition/identification, biometric projects like identification of a person from a large image or video database, and applications involving multimedia retrieval
12. Scientific Domain – Discovery of new galaxies, identification of groups of houses based on house type/geographical location, identification of earthquake epicenters, and identification of similar land use


Understanding Data
WHAT IS DATA?

• All facts are data.
• In computer systems, bits encode facts present in numbers, text, images, audio, and video.
• Data such as numbers or text is directly human interpretable, while diffused data such as images or video can be interpreted only by a computer.
• Today, business organizations are accumulating vast and growing amounts of data, of the order of gigabytes, terabytes, and exabytes.
• A bit is either 0 or 1. A byte is 8 bits.
• A kilobyte (KB) is 1024 bytes.
• A megabyte (MB) is approximately 1000 KB.
• A gigabyte (GB) is approximately 1,000,000 KB.
• 1000 gigabytes is one terabyte (TB).
• 1,000,000 terabytes is one exabyte (EB).
• Data is available in different data sources like flat files, databases, or data warehouses.
• Data can be either operational or non-operational.
• Operational data is the data encountered in normal business procedures and processes; for example, daily sales data is operational data.
• Non-operational data is the kind of data that is used for decision making.
• Data by itself is meaningless; it has to be processed to generate any information. A string of bytes is meaningless; only when a label is attached does the data become meaningful.
• Processed data is called information, which includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product sold the most in the last quarter of the year.


Elements of Big Data

• Data whose volume is small enough to be stored and processed by a small-scale computer is called 'small data'.
• These data are collected from several sources, and integrated and processed by a small-scale computer.
• Big data is data whose volume is much larger than that of 'small data' and is characterized as follows:

1. Volume – Since the cost of storage devices has reduced, there has been tremendous growth of data. Small, traditional data is measured in gigabytes (GB) and terabytes (TB), but Big Data is measured in petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.

2. Velocity – The speed at which data arrives and the rate at which its volume increases are noted as velocity. The availability of IoT devices and Internet connectivity ensures that data arrives at a faster rate. Velocity helps to understand the relative growth of big data and its accessibility by users, systems and applications.

3. Variety – The variety of Big Data includes:

• Form – There are many forms of data. Data types range from text, graph, audio and video to maps. There can also be composite data, where one medium contains other sources of data; for example, a video can contain an audio song.
• Function – These are data from various sources like human conversations, transaction records, and old archival data.
• Source of data – This is the third aspect of variety. There are many sources of data. Broadly, data sources can be classified as open/public data, social media data and multimodal data.

Some of the other Vs that are often quoted in the literature as characteristics of Big Data are:

4. Veracity of data – Veracity deals with aspects like conformity to the facts, truthfulness, believability, and confidence in data. There may be many sources of error, such as technical errors, typographical errors, and human errors.

5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.

6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it.

• The data quality of numeric attributes is determined by factors like precision, bias, and accuracy.
• Precision is defined as the closeness of repeated measurements; often, the standard deviation is used to measure precision.
• Bias is a systematic error resulting from erroneous assumptions of the algorithms or procedures.
• Accuracy refers to the closeness of measurements to the true value of the quantity.

TYPES OF DATA

In Big Data, there are three kinds of data: structured data, unstructured data, and semi-structured data.

1. Structured Data
• In structured data, data is stored in an organized manner, such as in a database where it is available in the form of a table.
• The data can also be retrieved in an organized manner using tools like SQL.

The structured data frequently encountered in machine learning are:

1. Record Data
2. Data Matrix
3. Graph Data
4. Ordered Data
1. Record Data
• A dataset is a collection of measurements taken from a process.


• We have a collection of objects in a dataset, and each object has a set of measurements.
• The measurements can be arranged in the form of a matrix.
• Each row in the matrix represents an object and can be called an entity, case, or record.
• The columns of the dataset are called attributes, features, or fields.
• The table is filled with observed data.
2. Data Matrix
• It is a variation of the record type because it consists only of numeric attributes.
• Standard matrix operations can be applied to these data.
• The data is thought of as points or vectors in a multidimensional space, where every attribute is a dimension describing the object.
3. Graph Data
• It involves the relationships among objects.
• For example, a web page can refer to another web page; this can be modeled as a graph, where the nodes are web pages and each hyperlink is an edge connecting the nodes.
4. Ordered Data
• Ordered data objects involve attributes that have an implicit order among them. Examples of ordered data are:
i. Temporal data – This is data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
ii. Sequence data – It is like sequential data but does not have time stamps. This data involves sequences of words or letters. For example, DNA data is a sequence of four characters – A, T, G, C.
iii. Spatial data – It has attributes such as positions or areas. For example, maps are spatial data, where the points are related by location.

2. Unstructured Data

• Unstructured data includes video, image, and audio.
• It also includes textual documents, programs, and blog data.
• It is estimated that 80% of data is unstructured.

3. Semi-Structured Data

• Semi-structured data is partially structured and partially unstructured.
• It includes data like XML/JSON data, RSS feeds, and hierarchical data.

DATA STORAGE AND REPRESENTATION

• Once a dataset is assembled, it must be stored in a structure that is suitable for data analysis.
• The goal of data storage management is to make data available for analysis.
• There are different approaches to organizing and managing data in storage, from flat files to data warehouses.

Some of them are listed below:

1. Flat Files – These are the simplest and most commonly available data source, and the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes to the data in flat files affect the results of data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes large.

Some of the popular spreadsheet formats are listed below:

• CSV files – CSV stands for comma-separated values, where the values are separated by commas. These are used by spreadsheet and database applications. The first row may hold the attribute names and the remaining rows hold the data.
• TSV files – TSV stands for tab-separated values, where the values are separated by tabs.
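A minimal sketch of reading both formats with pandas (assuming pandas is installed; the file names are hypothetical):

import pandas as pd

# pandas treats the first row as the attribute names by default.
df_csv = pd.read_csv("sales.csv")             # comma-separated values
df_tsv = pd.read_csv("sales.tsv", sep="\t")   # tab-separated values
print(df_csv.head())                          # first few data rows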
2. Database System – A database system normally consists of database files and a database management system (DBMS). Database files contain the original data and metadata. The DBMS aims to manage data and improve operator performance by including various tools such as a database administrator, query processing, and a transaction manager. A relational database consists of sets of tables. The tables have rows and columns: the columns represent the attributes and the rows represent tuples. A tuple corresponds to either an object or a relationship between objects. A user can access and manipulate the data in the database using SQL.


Different types of databases are listed below:

i. Transactional database is a collection of transactional records. Each record is a transaction. A


transaction may have a time stamp, identifier and a set of items, which may have links to other
tables. Normally, transaction databases are created for performing associational analysis that
indicates the correlation among the items.

ii. Time-series databases store time-related information, like log files, where data is associated with a time stamp. This data represents sequences of values or events obtained over a period (for example, hourly, weekly or yearly) or a repeated time span. Observing the sales of a product continuously yields time-series data.

iii. Spatial databases contain spatial information in a raster or vector format. Raster formats are
either bitmaps or pixel maps. For example, images can be stored as a raster data. On the other
hand, the vector format can be used to store maps as maps use basic geometric primitives like
points, lines, polygons and so forth.

3. World Wide Web (WWW) It provides a diverse, worldwide online information source. The
objective of data mining algorithms is to mine interesting patterns of information present in
WWW.

4. XML (eXtensible Markup Language) It is both human and machine interpretable data
format that can be used to represent data that needs to be shared across the platforms.

5. Data Stream It is dynamic data, which flows in and out of the observing environment.
Typical characteristics of data stream are huge volume of data, dynamic, fixed order movement,
and real-time constraints.

6. RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.

7. JSON (JavaScript Object Notation) It is another useful data interchange format that is often
used for many machine learning algorithms.

BIG DATA ANALYTICS AND TYPES OF ANALYTICS

• The primary aim of data analysis is to assist business organizations in taking decisions.


• For example, a business organization may want to know which is its fastest-selling product in order to plan marketing activities.
• Data analysis is an activity that takes the data and generates useful information and insights for assisting the organization.
• Although the terms data analysis and data analytics are often used interchangeably, data analytics is the more general term and data analysis is a part of it.
• Data analytics refers to the process of data collection, preprocessing and analysis; it deals with the complete cycle of data management.
• Data analysis is just analysis: it takes historical data and analyzes it.
• Data analytics, instead, concentrates more on the future and helps in prediction.

There are four types of data analytics:

1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

1. Descriptive analytics

• It is about describing the main features of the data.
• After data collection is done, descriptive analytics deals with the collected data and quantifies it.
• It is often stated that analytics is essentially statistics.
• There are two aspects of statistics – description and inference.
• Descriptive analytics focuses only on the description part of the data, not the inference part.

2. Diagnostic Analytics

• It deals with the question 'Why?'.
• This is also known as causal analysis, as it aims to find the cause and effect of events.


• For example, if a product is not selling, diagnostic analytics aims to find the reason.
• There may be multiple reasons, and the associated effects are analyzed as part of it.

3. Predictive Analytics

• It deals with the future.
• It deals with the question 'What will happen in the future, given this data?'
• This involves the application of algorithms to identify patterns in order to predict the future.

4. Prescriptive Analytics

• It is about finding the best course of action for business organizations.
• Prescriptive analytics goes beyond prediction and helps in decision making by giving a set of actions.
• It helps organizations to plan better for the future and to mitigate the risks involved.

BIG DATA ANALYSIS FRAMEWORK

• For performing data analytics, many frameworks have been proposed.
• All proposed analytics frameworks have some common factors.
• A big data framework is a layered architecture.
• Such an architecture has many advantages, such as generality.
• A 4-layer architecture has the following layers:

1. Data connection layer


2. Data management layer
3. Data analytics layer
4. Presentation layer

1. Data Connection Layer

• It has data ingestion mechanisms and data connectors.
• Data ingestion means taking raw data and importing it into appropriate data structures.
• It performs the tasks of the ETL process.


• ETL stands for the extract, transform and load operations.

2. Data Management Layer

• It performs preprocessing of data.
• The purpose of this layer is to allow parallel execution of queries, and read, write and data management tasks.
• Many schemes can be implemented by this layer, such as data-in-place, where the data is not moved at all, or constructing data repositories such as data warehouses with pull-data-on-demand mechanisms.

3. Data Analytics Layer

• It has many functionalities, such as statistical tests, machine learning algorithms for understanding data, and the construction of machine learning models.
• This layer implements many model validation mechanisms too.

4. Presentation Layer

• It has mechanisms such as dashboards and applications that display the results of analytical engines and machine learning algorithms.

The Big Data processing cycle involves data management, which consists of the following steps:

1. Data collection
2. Data preprocessing
3. Application of machine learning algorithms
4. Interpretation and visualization of the results of the machine learning algorithms

DATA COLLECTION

• The first task in gathering a dataset is the collection of data.
• It is often estimated that most of the time is spent on the collection of good quality data.
• Good quality data yields better results.

'Good data' has the following properties:


1. Timeliness – The data should be current, not stale or obsolete.
2. Relevancy – The data should be relevant and ready for the machine learning or data mining algorithms. All the necessary information should be available, and there should be no bias in the data.
3. Knowledge about the data – The data should be understandable, interpretable, and self-sufficient for the required application, as desired by the domain knowledge engineer.

Broadly, the data source can be classified as open/public data, social media data and multimodal
data.

1. Open or public data sources – These are data sources without stringent copyright rules or restrictions, whose data can be used for many purposes. Government census data is a good example of open data. Other examples include:

• Digital libraries that have huge amounts of text data as well as document images.
• Scientific domains with huge collections of experimental data, like genomic data and biological data.
• Healthcare systems that use extensive databases, like patient databases, health insurance data, doctors' information, and bioinformatics information.

2. Social media – This is the data generated by various social media platforms like Twitter, Facebook, YouTube, and Instagram. An enormous amount of data is generated by these platforms.

3. Multimodal data – This includes data that involves many modes, such as text, video, audio and mixed types. Some examples are:

• Image archives that contain large image databases along with numeric and text data.
• The World Wide Web (WWW), which has a huge amount of data distributed over the Internet.

DATA PREPROCESSING

In the real world, the available data is 'dirty'. By 'dirty', we mean:


• Incomplete data
• Inaccurate data
• Outlier data
• Data with missing values
• Data with inconsistent values
• Duplicate data

• Data preprocessing improves the quality of the data for data mining techniques.
• The raw data must be preprocessed to give accurate results.
• The process of detection and removal of errors in data is called data cleaning.
• Data wrangling means making the data processable for machine learning algorithms.
• Some data errors include human errors, such as typographical errors or incorrect measurements, and structural errors, like improper data formats.
• Data errors can also arise from omission and duplication of attributes.
• Noise is a random component and involves distortion of a value or the introduction of spurious objects.

Consider, for example, the patient data in Table 1.5. The 'bad' or 'dirty' data can be observed in this table.

Table 1.5: Illustration of ‘Bad’ Data

• It can be observed that data like Salary = ' ' is incomplete data.
• The DoB of patients John, Andre, and Raju is missing data.
• The age of David is recorded as '5', but his DoB indicates 10/10/1980. This is called inconsistent data.


• Inconsistent data occurs due to problems in conversions, inconsistent formats, and differences in units.
• The salary for John is -1500. Salary cannot be less than 0; this is an instance of noisy data.
• Outliers are data that exhibit characteristics different from other data and have very unusual values. The age of Raju cannot be 136; it might be a typographical error. It is often necessary to distinguish between noise and outlier data.

MISSING DATA ANALYSIS

• The primary data cleaning process is missing data analysis.
• Data cleaning routines attempt to fill in the missing values, smooth the noise while identifying outliers, and correct inconsistencies in the data.
• This enables data mining to avoid overfitting of the models.

The procedures given below can solve the problem of missing data (a short sketch of options 3-5 follows the list):

1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This method is not effective when the percentage of missing values increases.
2. Fill in the values manually – The domain expert can analyse the data tables and fill in the values manually. But this is time consuming and may not be feasible for larger datasets.
3. A global constant can be used to fill in the missing attributes. The missing values may be filled with 'Unknown' or 'Infinity'. But some data mining algorithms may give spurious results when analysing these labels.
4. The attribute value may be filled in with the attribute mean. Say, the average income can replace a missing income value.
5. Use the attribute mean for all samples belonging to the same class. Here, the average value replaces the missing values of all tuples that fall in that group.
6. Use the most probable value to fill in the missing value. The most probable value can be obtained from other methods, like classification and decision tree prediction.
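A minimal sketch of options 3-5 using pandas (assuming it is installed); the toy table and its values are invented:

import pandas as pd

# A toy table with missing salaries (None becomes NaN).
df = pd.DataFrame({"Class":  ["A", "A", "B", "B"],
                   "Salary": [1500.0, None, 2000.0, None]})

# Option 3: fill with a global constant.
print(df["Salary"].fillna(0))

# Option 4: fill with the overall attribute mean.
print(df["Salary"].fillna(df["Salary"].mean()))

# Option 5: fill with the mean of samples belonging to the same class.
print(df.groupby("Class")["Salary"].transform(lambda s: s.fillna(s.mean())))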

REMOVAL OF NOISY OR OUTLIER DATA

• Noise is a random error or variance in a measured value.


• Noise can be removed by binning, a method in which the given data values are sorted and distributed into equal-frequency bins.
• The bins are also called buckets.
• The binning method then uses the neighbouring values to smooth the noisy data.
• Some of the techniques commonly used are:
1. Smoothing by bin means
• In this technique, the mean of the bin replaces the values in the bin.
2. Smoothing by bin medians
• In this technique, the bin median replaces the bin values.
3. Smoothing by bin boundaries
• In this technique, each bin value is replaced by the closest bin boundary. The maximum and minimum values of the bin are called the bin boundaries.

Example:

Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}. Apply various binning
techniques and show the result.

Solution: By the equal-frequency bin method, the data should be distributed across bins. Assuming bins of size 3, the data is distributed across the bins as shown below:

Bin 1 : 12 , 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 34

By the smoothing-by-bin-means method, the bin values are replaced by the bin means. This results in:

Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 31, 31, 31

Using the smoothing-by-bin-boundaries method, the bin values become:

Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 34, 34


As per this method, the minimum and maximum values of each bin are determined; they serve as the bin boundaries and do not change. The rest of the values are transformed to the nearest boundary value. For example, in Bin 1 the middle value 14 is compared with the boundary values 12 and 19 and changed to the closest value, that is, 12. This process is repeated for all bins.
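The worked example can be reproduced with a short sketch; note that a tied middle value (equally far from both boundaries, as in Bins 2 and 3) can be assigned to either boundary, so a tie rule must be chosen.

S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]   # equal-frequency bins of size 3

# Smoothing by bin means: every value is replaced by its bin's mean.
print([[round(sum(b) / len(b))] * len(b) for b in bins])
# -> [[15, 15, 15], [24, 24, 24], [31, 31, 31]]

# Smoothing by bin boundaries: every value moves to the nearer of the
# bin's minimum and maximum (ties go to the minimum here).
for b in bins:
    lo, hi = min(b), max(b)
    print([lo if v - lo <= hi - v else hi for v in b])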

DATA INTEGRATION AND DATA TRANSFORMATIONS


• Data integration involves routines that merge data from multiple sources into a single data source; this may lead to redundant data.
• The main goal of data integration is to detect and remove redundancies that arise from the integration.
• Data transformation routines perform operations like normalization to improve the performance of the data mining algorithms.
• It is necessary to transform data so that it can be processed; this can be considered a preliminary stage of data conditioning.
• Normalization is one such technique: the attribute values are scaled to fit in a range (say, 0-1) to improve the performance of the data mining algorithm.
• These techniques are often used with neural networks.

Some of the normalization procedures used are:

1. Min-Max
2. z-Score

1. Min-Max Procedure

• It is a normalization technique in which each value v of a variable V is normalized by taking its difference from the minimum value, dividing by the range, and mapping the result to a new range, say 0-1.
• Often, neural networks require this kind of normalization.
• The formula to implement this normalization is given as:

v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min


Here, max - min is the range; min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.

Example:

Consider the set: V = {88, 90, 92, 94}. Apply Min-Max procedure and map the marks to a new
range 0–1.

Solution: The minimum of the list V is 88 and maximum is 94. The new min and new max are
0 and 1, respectively.

Mapping can be done using the above equation:

For marks 88: v' = (88 - 88) / (94 - 88) = 0
For marks 90: v' = (90 - 88) / (94 - 88) = 2/6 ≈ 0.33
For marks 92: v' = (92 - 88) / (94 - 88) = 4/6 ≈ 0.66


For marks 94: v' = (94 - 88) / (94 - 88) = 1

So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33,
0.66, 1}. Thus, the Min-Max normalization range is between 0 and 1.
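A minimal sketch of the Min-Max procedure in plain Python; rounding gives 0.67 where the worked example truncates to 0.66:

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

V = [88, 90, 92, 94]
print([round(v, 2) for v in min_max(V)])   # [0.0, 0.33, 0.67, 1.0]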

2. z-Score Normalization

This procedure works by taking the difference between the attribute value and the mean, and scaling this difference by the standard deviation of the attribute:

v' = (v - μ) / σ

Here, σ is the standard deviation of the list V and μ is the mean of the list V.

Example :

Consider the mark list V = {10, 20, 30}, convert the marks to z-score.

Solution: The mean and sample standard deviation (σ) of the list V are 20 and 10, respectively. So the z-scores of these marks are calculated using the above equation:

z(10) = (10 - 20) / 10 = -1
z(20) = (20 - 20) / 10 = 0
z(30) = (30 - 20) / 10 = 1


Hence, the z-score of the marks 10, 20, 30 are -1, 0 and 1, respectively.
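A minimal z-score sketch using the Python standard library; statistics.stdev computes the sample standard deviation used in the example:

from statistics import mean, stdev

def z_scores(values):
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

print(z_scores([10, 20, 30]))   # [-1.0, 0.0, 1.0]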

DATA REDUCTION

Data reduction reduces the size of the data while producing nearly the same analytical results.

There are different ways in which data reduction can be carried out such as

1. Data aggregation
2. Feature selection
3. Dimensionality reduction

DESCRIPTIVE STATISTICS

• Descriptive statistics is a branch of statistics that performs dataset summarization.
• It is used to summarize and describe data.
• Descriptive statistics are just descriptive and do not go beyond that.
• Data visualization is a branch of study useful for investigating the given data.
• Plots are mainly used to explain and present data to customers.
• Descriptive analysis and data visualization techniques help in understanding the nature of the data, which in turn helps to determine the kinds of machine learning or data mining tasks that can be applied to it.
• This step is known as Exploratory Data Analysis (EDA).


Dataset and Data Types

 A dataset can be assumed to be a collection of data objects.


 The data objects may be records, points, vectors, patterns, events, cases, samples or
observations.
 These records contain many attributes.
 An attribute can be defined as the property or characteristics of an object.

Consider the following database shown in Table 1.6.

Table 1.6: Sample Patient Data

Every attribute should be associated with a value. This process is called measurement. The
type of attribute determines the data types, often referred to as measurement scale types.
Broadly, data can be classified into two types:

1. Categorical or qualitative data


2. Numerical or quantitative data

1. Categorical or Qualitative data

The categorical data can be divided into two types. They are nominal type and ordinal type.

i. Nominal Data

Nominal data are symbols and cannot be processed like a number. For example, the average
of a patient ID does not make any statistical sense. Nominal data type provides only information
but has no ordering among data. Only operations like (=, ≠) are meaningful for these data. In
Table 1.6, Patient ID is nominal data.

ii. Ordinal Data

Department of CSE Page 41


Machine Learning (BCS602)

It provides enough information and has a natural order. For example, Fever = {Low, Medium,
High} is ordinal data. Certainly, low is less than medium and medium is less than high,
irrespective of the value. Any order-preserving transformation can be applied to these data to
get a new value.

2. Numerical or Quantitative Data

It can be divided into two categories. They are interval type and ratio type.

i. Interval Data

Interval data is numeric data for which the differences between values are meaningful. For
example, the difference between 30 degrees and 40 degrees is meaningful. The only
permissible operations are + and –.

ii. Ratio Data

For ratio data, both differences and ratios are meaningful. The difference between ratio
and interval data is the position of zero in the scale. For example, take the Centigrade–
Fahrenheit conversion: the zeroes of the two scales do not match, hence these are interval data.

Another way of classifying the data is to classify it as:

1. Discrete value data


2. Continuous data

Discrete Data This kind of data is recorded as integers. For example, the responses of the
survey can be discrete data. Employee identification number such as 10001 is discrete data.

Continuous Data It can take any value within a range and may include a decimal point. For
example, age is continuous data. Though age appears to be discrete data, one may be 12.5 years
old and it makes sense. Patient height and weight are also continuous data.

Third way of classifying the data is based on the number of variables used in the dataset.
Based on that, the data can be classified as univariate data, bivariate data, and multivariate
data.


Figure 1.12: Types of Data

Figure 1.13: Types of Data Based on Variables

 In case of univariate data, the dataset has only one variable.


 A variable is also called a category.
 Bivariate data indicates that the number of variables used is two and multivariate data
uses three or more variables.

UNIVARIATE DATA ANALYSIS AND VISUALIZATION


 Univariate analysis is the simplest form of statistical analysis.
 The dataset has only one variable.
 A variable can be called a category.
 Univariate does not deal with cause or relationships.
 The aim of univariate analysis is to describe data and find patterns.
 Univariate data description involves finding the frequency distributions, central tendency
measures, dispersion or variation, and shape of the data.


Data Visualization
 To understand data, graph visualization is a must.
 Data visualization helps to understand data; it also helps to present information and data
to customers.
 Some of the graphs that are used in univariate data analysis are bar charts, histograms,
frequency polygons and pie charts.
 The advantages of graphs are presentation of data, summarization of data, description of
data, exploration of data, and comparison of data.

Bar Chart

 A Bar chart (or Bar graph) is used to display the frequency distribution for variables.
 Bar charts are used to illustrate discrete data.
 The charts can also help to explain the counts of nominal data.
 It also helps in comparing the frequency of different groups.
 The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is
shown below in Figure 1.14.

Figure 1.14: Bar Chart
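A short matplotlib sketch that reproduces this kind of bar chart (matplotlib is an assumption here; any plotting library would do):

import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_ids, marks)          # one bar per student
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar Chart of Student Marks")
plt.show()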


Pie Chart

 Pie charts are equally helpful in illustrating the univariate data.


 The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85,
90, 90} is shown below in Figure 1.15.

Figure 1.15: Pie Chart

It can be observed that the number of students with 22 marks is 2. The total number of students
is 10. So, 2/10 × 100 = 20% space in a pie of 100% is allotted for marks 22 in Figure 1.15.
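A matplotlib sketch of the same pie chart (Counter tallies the frequency of each distinct mark; autopct prints the percentage share):

import matplotlib.pyplot as plt
from collections import Counter

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]
freq = Counter(marks)                # e.g. mark 22 occurs twice -> 2/10 = 20%

plt.pie(list(freq.values()), labels=[str(m) for m in freq], autopct="%1.0f%%")
plt.title("Percentage Frequency of Marks")
plt.show()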

Histogram
 It plays an important role in data mining for showing frequency distributions.
 The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50,
51-75 and 76-100 is given below in Figure 1.16.
 Histogram conveys useful information like nature of data and its mode.
 Histograms can be used as charts to show frequency, skewness present in the data, and
shape.


Figure 1.16: Sample Histogram of English Marks
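A matplotlib sketch of this histogram; the bin edges below encode the group ranges 0-25, 26-50, 51-75 and 76-100:

import matplotlib.pyplot as plt

marks = [45, 60, 60, 80, 85]
edges = [0, 25, 50, 75, 100]         # group ranges from the example

plt.hist(marks, bins=edges, edgecolor="black")
plt.xlabel("Marks range")
plt.ylabel("Frequency")
plt.title("Histogram of English Marks")
plt.show()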

Dot Plots

 These are similar to bar charts.


 They are less cluttered than bar charts, as they represent each bar with a single point.
 The dot plot of English marks for five students with ID as {1, 2, 3, 4, 5} and marks {45,
60, 60, 80, 85} is given in Figure 1.17.
 The advantage is that by visual inspection one can find out who got more marks.

Figure 1.17: Dot plots


Central Tendency
 A condensation or summary of the data is necessary.
 This makes the data analysis easy and simple.
 One such summary is called central tendency.
 Thus, central tendency can explain the characteristics of data and that further helps in
comparison.
 Mass data have a tendency to concentrate at certain values, normally in the central
location. This is called the measure of central tendency (or average).
 This represents the first order of measures.
 Popular measures are mean, median and mode.

1. Mean
 Arithmetic average (or mean) is a measure of central tendency that represents the 'centre'
of the dataset.
 It can be found by adding all the data and dividing the sum by the number of observations.
 Mathematically, the average of all the values in the sample is denoted as x̄ (for a
population it is denoted as μ).
 Let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean is
given as:

x̄ = (x1 + x2 + … + xN) / N

For example, the mean of the three numbers 10, 20 and 30 is (10 + 20 + 30)/3 = 60/3 = 20.

Weighted mean

 Unlike the arithmetic mean, which weights all items equally, the weighted mean gives
different importance to different items, as item importance varies.
 Hence, different weightage can be given to items.
 In case of a frequency distribution, the mid values of the ranges are taken for computation.
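A minimal sketch of the arithmetic mean and the weighted mean (the weights chosen here are illustrative):

# Arithmetic mean: every item counts equally.
def mean(xs):
    return sum(xs) / len(xs)

# Weighted mean: each item x_i contributes in proportion to its weight w_i.
def weighted_mean(xs, ws):
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

print(mean([10, 20, 30]))                       # 20.0, as in the text
print(weighted_mean([10, 20, 30], [1, 2, 3]))   # 23.33..., items weighted unequally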


Geometric Mean

 Let x1, x2, …, xN be a set of 'N' values or observations.
 Geometric mean is the Nth root of the product of the N items.
 The formula for computing geometric mean is given as follows:

GM = (x1 × x2 × … × xN)^(1/N)

 Here, N is the number of items and xi are the values.

For example, if the values are 6 and 8, the geometric mean is √(6 × 8) = √48 ≈ 6.93.

 For larger N, computing the product directly is difficult.
 Hence, it is usually calculated through logarithms:

GM = antilog((1/N) × Σ log xi)
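A sketch of the logarithmic computation (math.log/math.exp implement the antilog form above):

import math

# exp of the mean of logs equals the Nth root of the product,
# but avoids overflow when the list is long.
def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(geometric_mean([6, 8]))   # ~6.93, i.e. the square root of 48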

2. Median

 The middle value in the distribution is called median.


 If the total number of items in the distribution is odd, then the middle value is called
median.
 If the numbers are even, then the average value of two items in the centre is the median.
 Median is the value that divides the data into two equal halves, with half of the values
being lower than the median and half higher than the median.
 A median class is that class where the (N/2)th item is present.

In the continuous case, the median is given by the formula:

Median = L1 + ((N/2 – cf) / f) × i

The median class is that class where the (N/2)th item is present. Here, i is the class interval of
the median class, L1 is the lower limit of the median class, f is the frequency of the median
class, and cf is the cumulative frequency of all classes preceding the median class.
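A direct translation of this formula into Python; the distribution values in the call are hypothetical, chosen only to exercise the formula:

# Grouped (continuous) median: L1 + ((N/2 - cf) / f) * i, symbols as defined above.
def grouped_median(L1, N, cf, f, i):
    return L1 + ((N / 2 - cf) / f) * i

# Hypothetical data: median class 20-30 (L1 = 20, width i = 10),
# N = 50 items in total, cf = 15 items before the class, f = 12 items inside it.
print(grouped_median(L1=20, N=50, cf=15, f=12, i=10))   # 28.33...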

3. Mode

 Mode is the value that occurs more frequently in the dataset.


 In other words, the value that has the highest frequency is called mode.
 Mode is only for discrete data and is not applicable for continuous data as there are no
repeated values in continuous data.
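A small sketch of mode computation; it returns all values tied for the highest frequency:

from collections import Counter

def mode(xs):
    counts = Counter(xs)                 # frequency of each distinct value
    top = max(counts.values())           # the highest frequency
    return [v for v, c in counts.items() if c == top]

print(mode([45, 60, 60, 80, 85]))        # [60]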

Dispersion

 The spread of a set of data around the central tendency (mean, median or mode) is
called dispersion.
 Dispersion is represented by various ways such as range, variance, standard deviation,
and standard error.
 These are second order measures.
 The most common measures of the dispersion data are listed below:
1. Range
2. Standard Deviation
3. Quartiles and Inter Quartile Range
4. Five-point Summary and Box Plot
1. Range

Range is the difference between the maximum and minimum values in the given list of
data.

2. Standard Deviation
 The mean does not convey much more than a middle point.
 For example, the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20.
 The difference between these two sets is the spread of data.
 Standard deviation is the average distance from the mean of the dataset to each point.
 The formula for sample standard deviation is given by:

s = √( Σ (xi – x̄)² / (N – 1) )


Here, N is the number of values, xi is an observation or value from the data and x̄ is the mean.
(For the population standard deviation σ, the division is by N and the population mean μ is used.)
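The standard library computes both variants; a sketch contrasting the two datasets mentioned above:

import statistics

data_a = [10, 20, 30]
data_b = [10, 50, 0]

print(statistics.mean(data_a), statistics.mean(data_b))   # 20 and 20 -- same centre
print(statistics.stdev(data_a))    # 10.0  (sample standard deviation, divides by N - 1)
print(statistics.stdev(data_b))    # ~26.46 -- a much larger spread around the same mean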

3.Quartiles and Inter Quartile Range


 It is sometimes convenient to subdivide the dataset using coordinates.
 Percentiles describe data in terms of coordinates: a percentile states what percentage of
the total values lies at or below a given coordinate.
 The kth percentile is the value Xi such that k% of the data lies at or below Xi.
 Median is the 50th percentile and can be denoted as Q0.50.
 The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the
third quartile (Q3).
 Another measure that is useful to measure dispersion is the Inter Quartile Range (IQR).
 The IQR is the difference between Q3 and Q1:

IQR = Q3 – Q1

 Outliers are normally the values falling at least 1.5 × IQR above the third quartile or
below the first quartile.
 The interquartile range is defined by Q0.75 – Q0.25.

Example:

For patient‟s age list {12,14,19,22,24,26,28,31,34}, find the IQR.

Solution:

The median is in the fifth position; in this case, 24 is the median. The first quartile is the
median of the scores below the median, i.e., {12, 14, 19, 22}. In this case, the first quartile is
the average of the second and third values, that is Q0.25 = (14 + 19)/2 = 16.5. Similarly, the
third quartile is the median of the values above the median, that is {26, 28, 31, 34}. So, Q0.75
is the average of the seventh and eighth scores. In this case, it is (28 + 31)/2 = 59/2 = 29.5.

Hence, the IQR is calculated as Q0.75 – Q0.25 = 29.5 – 16.5 = 13

The half of IQR is called the semi-quartile range. The Semi Inter Quartile Range (SIQR) is given as:

SIQR = (1/2) × IQR = (1/2) × 13 = 6.5
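The whole example can be checked with the standard library; statistics.quantiles' default 'exclusive' method reproduces the quartiles of the worked example here (quartile conventions differ between libraries, so other tools may give slightly different values):

import statistics

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
q1, q2, q3 = statistics.quantiles(ages, n=4)   # quartile cut points

print(q1, q2, q3)               # 16.5 24.0 29.5
print("IQR :", q3 - q1)         # 13.0
print("SIQR:", (q3 - q1) / 2)   # 6.5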

4.Five-point Summary and Box Plot

 The median, quartiles Q1 and Q3, and minimum and maximum written in the order
< Minimum, Q1, Median, Q3, Maximum > is known as five-point summary.
 Box plots are suitable for continuous variables and a nominal variable.
 Box plots can be used to illustrate data distributions and summary of data.
 It is a popular way of plotting the five-point summary.
 The box contains the bulk of the data.
 These data lie between the first and third quartiles.
 The line inside the box indicates the location of the data – mostly the median.

Example:

Find the 5-point summary of the list {13,11,2,3,4,8,9}

Solution:

The minimum is 2 and the maximum is 13. Q1, Q2 and Q3 are 3, 8 and 11, respectively. Hence
the 5-point summary is {2, 3, 8, 11, 13}, that is {minimum, Q1, median, Q3, maximum}.

Box plots are useful for describing the 5-point summary. The box plot for this set is given in
Figure 1.18.


Figure 1.18 : Box Plot for English Marks
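A sketch that computes the five-point summary and draws the corresponding box plot (statistics.quantiles reproduces the quartiles of the worked example; matplotlib is assumed for the plot):

import statistics
import matplotlib.pyplot as plt

data = [13, 11, 2, 3, 4, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4)
summary = [min(data), q1, q2, q3, max(data)]
print(summary)                 # [2, 3.0, 8.0, 11.0, 13] -- the five-point summary

plt.boxplot(data, vert=False)  # box spans Q1..Q3, the inner line marks the median
plt.title("Box Plot")
plt.show()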

Shape

Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of
the dataset.

Skewness

 The measures of direction and degree of symmetry are called measures of third order.
 Ideally, skewness should be zero as in an ideal normal distribution.
 More often, the given dataset may not have perfect symmetry (consider the following
Figure 1.19).

Figure 1.19: (a) Positive Skewed and (b) Negative Skewed Data


 The dataset may also either have very high values or extremely low values.
 If the dataset has far higher values, then it is skewed to the right.
 If the dataset has far more lower values, then it is said to be skewed towards the left.
 If the tail is longer on the right-hand side and the hump is on the left-hand side, it is
called positive skew. Otherwise, it is called negative skew.
 A perfect symmetry means skewness is zero.
 In positive skew, the mean is greater than the median.
 Generally, for a negatively skewed distribution, the median is more than the mean.
 The relationship between skew and the relative size of the mean and median can be
summarized by a convenient numerical skew index known as Pearson's second skewness
coefficient:

Pearson's coefficient = 3 × (mean – median) / standard deviation

Also, the following moment-based measure is more commonly used to measure skewness.

Let x1, x2, …, xN be a set of 'N' values or observations; then the skewness can be given as:

skewness = (1/N) × Σ ((xi – μ) / σ)³

Here, μ is the population mean and σ is the population standard deviation of the univariate data.

Kurtosis
 Kurtosis indicates the peakedness of the data.
 If the data has a high peak, then it indicates higher kurtosis and vice versa.
 Kurtosis is the measure of whether the data is heavy-tailed or light-tailed relative to the
normal distribution.


 Let x1, x2, …, xN be a set of 'N' values or observations.
 Kurtosis is measured using the formula given below:

kurtosis = (1/N) × Σ ((xi – x̄) / σ)⁴

 Here, x̄ and σ are the mean and standard deviation of the univariate data, respectively.
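A sketch implementing both moment formulas above (the skewness from the previous subsection and the kurtosis just defined); the helper names are illustrative:

import math

def _pop_mean_sigma(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)   # population sigma
    return mu, sigma

def skewness(xs):
    mu, sigma = _pop_mean_sigma(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / len(xs)   # third moment

def kurtosis(xs):
    mu, sigma = _pop_mean_sigma(xs)
    return sum(((x - mu) / sigma) ** 4 for x in xs) / len(xs)   # fourth moment; ~3 for normal data

data = [2, 3, 4, 8, 9, 11, 13]
print(skewness(data), kurtosis(data))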
 Some of the other useful measures for finding the shape of the univariate dataset are the
mean absolute deviation (MAD) and the coefficient of variation (CV).

Mean Absolute Deviation (MAD)

 MAD is another dispersion measure and is robust to outliers.
 Normally, an outlier point is detected by computing its deviation from the median and
dividing by the MAD.
 Here, the absolute deviation between the data and the mean is taken.
 The absolute deviation is given as: |x – μ|
 The sum of the absolute deviations is given as: Σ |x – μ|
 The mean absolute deviation is given as: MAD = (Σ |x – μ|) / N

Coefficient of Variation(CV)

 Coefficient of variation is used to compare datasets with different units.
 CV is the ratio of the standard deviation to the mean, and %CV is the percentage form
of the coefficient of variation:

%CV = (σ / x̄) × 100
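A short sketch of both measures (the MAD variant here is the mean-based one defined above, not the median-based MAD some libraries use):

import statistics

# Mean absolute deviation: average absolute distance of the data from the mean.
def mad(xs):
    mu = statistics.mean(xs)
    return sum(abs(x - mu) for x in xs) / len(xs)

# %CV: standard deviation as a percentage of the mean; unit-free, so datasets
# measured in different units can be compared.
def cv_percent(xs):
    return statistics.pstdev(xs) / statistics.mean(xs) * 100

data = [10, 20, 30]
print(mad(data))          # 6.66...
print(cv_percent(data))   # ~40.8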

Special Univariate Plots

 The ideal way to check the shape of the dataset is a stem and leaf plot.
 A stem and leaf plot is a display that helps us to know the shape and the distribution of
the data. In this method, each value is split into a 'stem' and a 'leaf'.
 The last digit is usually the leaf, and the digits to the left of the leaf form the stem.
 For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 1.20.


 The stem and leaf plot for the English subject marks, say {45, 60, 60, 80, 85}, is given in
Figure 1.20.

Figure 1.20: Stem and Leaf Plot for English Marks

 It can be seen from Figure 1.20 that the first column is the stem and the second column
is the leaf.
 For the given English marks, the two students with 60 marks are shown in the stem and
leaf plot as stem 6 with two leaves of 0.
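A tiny sketch that prints such a stem and leaf plot (splitting on the last digit, as described above):

from collections import defaultdict

def stem_and_leaf(values):
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)      # stem = leading digits, leaf = last digit
    for stem in sorted(plot):
        print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))

stem_and_leaf([45, 60, 60, 80, 85])
# 4 | 5
# 6 | 0 0
# 8 | 0 5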

Q-Q plot

 A Q-Q plot can be used to assess the shape of the dataset.


 The Q-Q plot is a 2D scatter plot of univariate data against theoretical normal
distribution data, or of two datasets – the quantiles of the first dataset against the quantiles
of the second.
 The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given below in Figure 1.21.


Figure 1.21 : Normal Q-Q Plot

 Ideally, the points fall along the 45-degree reference line if the data follows a normal
distribution.
 If the deviation is larger, then there is greater evidence that the dataset follows some
distribution other than the normal distribution.
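A sketch that draws this normal Q-Q plot with SciPy (scipy and matplotlib are assumptions; probplot plots sample quantiles against theoretical normal quantiles and adds a reference line):

import matplotlib.pyplot as plt
from scipy import stats

x = [13, 11, 2, 3, 4, 8, 9]
stats.probplot(x, dist="norm", plot=plt)   # sample quantiles vs. normal quantiles
plt.title("Normal Q-Q Plot")
plt.show()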

