Module 1
Introduction to Machine Learning
What is Machine Learning?
In the real world, we are surrounded by humans who can learn everything from their experiences with
their learning capability, and we have computers or machines which work on our instructions. But can a
machine also learn from experiences or past data like a human does? So here comes the role of Machine
Learning.
➢ Introduction to Machine Learning
A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences. Arthur
Samuel first used the term "machine learning" in 1959. It could be summarized as follows:
• Without being explicitly programmed, machine learning enables a machine to automatically learn
from data, improve performance from experiences, and predict things.
• Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical data,
or training data.
• For the purpose of developing predictive models, machine learning brings together statistics and
computer science.
• Machine learning constructs or uses algorithms that learn from historical data. Generally, the more data we provide, the better the performance becomes.
• A machine can learn if it can gain more data to improve its performance.
The Machine Learning algorithm's operation is depicted in the following block diagram:
• A third reason for the popularity of machine learning is the availability of complex algorithms now. Especially with the advent of deep learning, many algorithms are available for machine learning.
• Then, the momentum shifted to machine learning in the form of data-driven systems. The focus of AI is to develop intelligent systems by using a data-driven approach, where data is used as an input to develop intelligent models.
• The models can then be used to predict new inputs. Thus, the aim of machine learning is to learn
a model or set of rules from the given dataset automatically so that it can predict the unknown
data correctly.
• As humans take decisions based on experience, computers build models based on patterns extracted from the input data and then use these models for prediction and decision making. For computers, the learnt model is equivalent to human experience.
• The quality of data determines the quality of experience and, therefore, the quality of the learning
system. In statistical learning, the relationship between the input x and output y is modeled as a
function in the form y = f(x). Here, f is the learning function that maps the input to output y.
Learning of function f is the crucial aspect of forming a model in statistical learning. In machine
learning, this is simply called mapping of input to output.
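To make the idea of learning a mapping y = f(x) concrete, the following is a minimal sketch (not from the text) that fits a simple function to made-up sample data with NumPy; the data values and variable names are illustrative assumptions.

```python
# A minimal sketch: learning y = f(x) from sample data using a least-squares fit.
import numpy as np

# Hypothetical training data: input x and observed output y
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Learn a degree-1 mapping f(x) = a*x + b from the data
a, b = np.polyfit(x, y, deg=1)

# Use the learnt model to predict the output for an unseen input
x_new = 6.0
print(f"f({x_new}) = {a * x_new + b:.2f}")   # close to 12
```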
• The learning program summarizes the raw data in a model. Formally stated, a model is an
explicit description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs.
3. Logical if/else rules, or
4. Groupings called clusters
• In summary, a model can be a formula, procedure or representation that can generate data
decisions. The difference between pattern and model is that the former is local and applicable only
to certain attributes but the latter is global and fits the entire dataset. For example, a model can be
helpful to examine whether a given email is spam or not. The point is that the model is generated
automatically from the given data.
• Another pioneer of AI, Tom Mitchell’s definition of machine learning states that, “A computer
program is said to learn from experience E, with respect to task T and some performance measure
P, if its performance on T measured by P improves with experience E.” The important components
of this definition are experience E, task T, and performance measure P.
• For example, the task T could be detecting an object in an image. The machine can gain the
knowledge of object using training dataset of thousands of images. This is called experience E.
• So, the focus is to use this experience E for this task of object detection T. The ability of the system
to detect the object is measured by performance measures like precision and recall. Based on the
performance measures, course correction can be done to improve the performance of the system.
• Models of computer systems are equivalent to human experience. Experience is based on data.
Humans gain experience by various means. They gain knowledge by rote learning. They observe
others and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many
things by trial and error.
• Once the knowledge is gained, when a new problem is encountered, humans search for similar
past situations and then formulate the heuristics and use that for prediction. But, in systems,
experience is gathered by these steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to
generate concepts. This is equivalent to humans’ idea of objects; for example, we have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can be
viewed as ordering of all possible concepts. So, generalization involves ranking of concepts,
inferencing from them and formation of heuristics, an actionable aspect of intelligence.
4. Heuristics are educated guesses for tasks. For example, when a person runs away on encountering danger, it is the result of human experience, that is, of the heuristics the person has formed. In machines, it happens the same way.
5. Heuristics normally work! But occasionally, they may fail too. It is not the fault of heuristics, as a heuristic is just a ‘rule of thumb’. Course correction is done by taking evaluation measures. Evaluation checks the soundness of the models and carries out course correction, if necessary, to generate better formulations.
➢ MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics primarily.
It is the resultant of combined ideas of diverse fields.
Machine Learning and Artificial Intelligence
Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is to
develop intelligent agents. An agent can be a robot, humans, or any autonomous systems. Initially, the
idea of AI was ambitious, that is, to develop intelligent systems like human beings. The focus was on
logic and logical inferences. It had seen many ups and downs. These down periods were called AI
winters.
The resurgence in AI happened due to the development of data-driven systems. The aim is to find relations and regularities present in the data. Machine learning is the sub-branch of AI whose aim is to extract patterns for prediction. It is a broad field that includes learning from examples and other areas like reinforcement learning.
The relationship of AI and machine learning is shown in Figure 1.3. The model can take an unknown
instance and generate results.
3. Velocity: It refers to the speed at which data is generated and processed; it is one of the characteristics of Big Data.
• Big data is used by many machine learning algorithms for applications such as language
translation and image recognition.
• Big data influences the growth of subjects like Deep learning. Deep learning is a branch of
machine learning that deals with constructing models using neural networks.
• Data Mining: Data mining's original genesis is in business. Just as mining the earth yields precious resources, it is believed that mining data unearths hidden information that would otherwise have eluded the attention of management. There is little difference between these fields, except that data mining aims to extract the hidden patterns present in the data, whereas machine learning aims to use those patterns for prediction.
• Data Analytics: Another branch of data science is data analytics. It aims to extract useful
knowledge from crude data. There are different types of analytics. Predictive data analytics is
used for making predictions. Machine learning is closely related to this branch of analytics and
shares almost all algorithms.
• Pattern Recognition: It is an engineering field. It uses machine learning algorithms to extract
the features for pattern analysis and pattern classification. One can view pattern recognition as a
specific application of machine learning.
Statistics is a branch of mathematics that has a solid theoretical foundation regarding statistical learning. Like machine learning (ML), it can learn from data. But the difference between statistics and ML is in approach: statistical methods look for regularity in data, called patterns, by initially setting a hypothesis and performing experiments to verify and validate the hypothesis in order to find relationships among the data.
➢ TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of interaction of the
program with its environment. It can be compared with the interaction between a teacher and a student.
Four types of machine learning are supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
• A dataset need not always be numbers. It can be images or video frames. Deep neural networks can handle images with labels. In the following Figure 1.6, the deep neural network takes images of dogs and cats with labels for classification.
• In labelled data, every example in the dataset has an associated label or target value; in unlabelled data, there are no labels in the dataset.
➢ Supervised Learning
• Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning. A supervisor provides labelled data so that a model can be constructed and then tested on test data.
• In supervised learning algorithms, learning takes place in two stages. In layman terms, during the
first stage, the teacher communicates the information to the student that the student is supposed to
master. The student receives the information and understands it. During this stage, the teacher has
no knowledge of whether the information is grasped by the student.
• This leads to the second stage of learning. The teacher then asks the student a set of questions to
find out how much information has been grasped by the student. Based on these questions, the student
is tested, and the teacher informs the student about his assessment. This kind of learning is typically
called supervised learning.
1. Classification Models
• In classification, learning takes place in two stages. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training samples are processed, the model is generated. In the second stage, the constructed model takes a test or unknown sample and assigns a label to it. This is the classification process.
• This is illustrated in the above Figure 1.7. Initially, the classification learning algorithm learns
with the collection of labelled data and constructs the model. Then, a test case is selected,
and the model assigns a label.
• Similarly, in the case of the Iris dataset, if a test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the classifier will generate the label for it. This is called classification. One example of classification is image recognition, which includes classification of diseases like cancer, classification of plants, etc.
• The classification models can be categorized based on the implementation technology like
decision trees, probabilistic methods, distance measures, and soft computing methods.
Classification models can also be classified as generative models and discriminative models.
Generative models deal with the process of data generation and its distribution. Probabilistic
models are examples of generative models. Discriminative models do not care about the
generation of data. Instead, they simply concentrate on classifying the given data.
• Some of the key algorithms of classification are:
1. Decision Tree
2. Random Forest
3. Support Vector Machines
4. Naïve Bayes
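As an illustration of the two-stage classification process described above, the following sketch trains one of the listed algorithms, a decision tree, on the Iris dataset and labels the test sample (6.3, 2.9, 5.6, 1.8) mentioned earlier. It assumes scikit-learn is available and is a minimal example, not the textbook's implementation.

```python
# Minimal classification sketch (assumes scikit-learn):
# stage 1 - train on labelled data, stage 2 - label an unknown sample.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Training stage: learn a model from the labelled dataset
model = DecisionTreeClassifier().fit(iris.data, iris.target)

# Testing stage: assign a label to the unknown sample (6.3, 2.9, 5.6, 1.8, ?)
test_sample = [[6.3, 2.9, 5.6, 1.8]]
predicted = model.predict(test_sample)[0]
print(iris.target_names[predicted])   # e.g. 'virginica'
```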
2. Regression Models
• Regression models, unlike classification algorithms, predict continuous variables like price. In other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset that represents week (input x) and product sales (y).
• The regression model takes input x and generates a model in the form of a fitted line of the
form y=f(x). Here, x is the independent variable that may be one or more attributes and y is
the dependent variable. In Figure 1.8, linear regression takes the training set and tries to fit it
with a line – product sales = 0.66 Week + 0.54. Here, 0.66 and 0.54 are all regression
coefficients that are learnt from data.
• The advantage of this model is that prediction for product sales (y) can be made for unknown week
data (x). For example, the prediction for unknown eighth week can be made by substituting x as 8 in
that regression formula to get y.
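The fitting and the week-8 prediction can be sketched in a few lines of code. The weekly sales values below are made-up assumptions, so the learnt coefficients will differ somewhat from the 0.66 and 0.54 quoted above.

```python
# Minimal regression sketch (assumed data): fit product sales = a*week + b
# and predict sales for an unseen week, here week 8.
import numpy as np

weeks = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)     # input x
sales = np.array([1.1, 1.9, 2.5, 3.2, 3.9, 4.4, 5.2])    # output y

a, b = np.polyfit(weeks, sales, deg=1)   # regression coefficients learnt from data
print(f"model: sales = {a:.2f} * week + {b:.2f}")

# Prediction for the unknown eighth week: substitute x = 8
print(f"predicted sales for week 8: {a * 8 + b:.2f}")
```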
• Both regression and classification models are supervised algorithms. Both have a supervisor, and the concepts of training and testing are applicable to both. What then is the difference between classification and regression models? The main difference is that regression models predict continuous variables, such as product price, while classification assigns discrete labels, such as a class.
➢ Unsupervised Learning
• The second kind of learning is by self-instruction. As the name suggests, there is no supervisor or teacher component. In the absence of a supervisor or teacher, self-instruction is the most common kind of learning process. This process of self-instruction is based on the concept of trial and error.
• Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes the examples and recognizes patterns based on the principles of grouping. Grouping is done so that similar objects fall into the same group.
• Cluster analysis and Dimensional reduction algorithms are examples of unsupervised
algorithms.
Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint clusters or groups. Cluster analysis clusters objects based on their attributes. All the data objects of a partition are similar in some aspect and vary significantly from the data objects in the other partitions. Some examples of clustering processes are segmentation of a region of
interest in an image, detection of abnormal growth in a medical image, and determining clusters
of signatures in a gene database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set of dog and cat images and groups them into two clusters: dogs and cats. It can be observed that the samples belonging to a cluster are similar, while samples differ radically across clusters.
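As a minimal illustration of grouping without labels, the sketch below clusters a few made-up 2-D points with k-means (assuming scikit-learn); it is not the textbook's dog-and-cat image example.

```python
# Minimal unsupervised sketch (assumes scikit-learn): group unlabelled
# 2-D points into two clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: two loose groups of points (illustrative values)
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two learnt group centres
```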
➢ Semi-supervised Learning
There are circumstances where the dataset has a huge collection of unlabelled data and some labelled data. Labelling is a costly process and difficult for humans to perform. Semi-supervised algorithms use the unlabelled data by assigning a pseudo-label to each unlabelled sample. Then, the labelled and pseudo-labelled datasets can be combined.
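The pseudo-labelling idea can be sketched as a simple self-training loop under assumed data; this is not a specific library's semi-supervised API.

```python
# Minimal pseudo-labelling sketch (assumes scikit-learn): train on the small
# labelled set, assign pseudo-labels to the unlabelled set, then retrain on both.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small labelled set and a larger unlabelled set (illustrative values)
X_labelled = np.array([[0.0], [1.0], [9.0], [10.0]])
y_labelled = np.array([0, 0, 1, 1])
X_unlabelled = np.array([[0.5], [1.5], [8.5], [9.5]])

model = LogisticRegression().fit(X_labelled, y_labelled)

# Assign pseudo-labels to the unlabelled data
pseudo_labels = model.predict(X_unlabelled)

# Combine labelled and pseudo-labelled data and retrain
X_all = np.vstack([X_labelled, X_unlabelled])
y_all = np.concatenate([y_labelled, pseudo_labels])
model = LogisticRegression().fit(X_all, y_all)
print(model.predict([[5.0]]))
```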
➢ Reinforcement Learning
Reinforcement learning mimics human beings. Like human beings use ears and eyes to perceive the
world and take actions, reinforcement learning allows the agent to interact with the environment to
get rewards. The agent can be human, animal, robot, or any independent program. The rewards
enable the agent to gain experience. The agent aims to maximize the reward.
The reward can be positive or negative (Punishment). When the rewards are more, the behavior gets
reinforced and learning becomes possible.
Consider the following example of a Grid game as shown in Figure 1.10
In this grid game, the gray tile indicates danger, black is a block, and the tile with diagonal lines is the goal. The aim is to start, say from the bottom-left tile, and use the actions left, right, top and bottom to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment to get
experience. In the above case, the agent tries to create a model by simulating many paths and finding
rewarding paths. This experience helps in constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled dataset. Many sequential decisions need to be taken to reach the final decision. Therefore, reinforcement algorithms are reward-based, goal-oriented algorithms.
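A minimal reward-based sketch of the idea follows, under many simplifying assumptions: a tiny 1-D corridor instead of the grid in Figure 1.10, hand-set rewards, and tabular Q-learning.

```python
# Minimal tabular Q-learning sketch (illustrative, not the Figure 1.10 game):
# states 0..4 lie on a line, the goal is state 4, actions are left/right.
import random

n_states, actions = 5, [-1, +1]          # -1 = left, +1 = right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0                                 # start at the left end
    while s != n_states - 1:              # until the goal is reached
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) towards reward + discounted best future value
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learnt policy should prefer moving right (+1) from every state
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])
```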
➢ CHALLENGES OF MACHINE LEARNING
1. Problems – Machine learning can deal with ‘well-posed’ problems where specifications are complete and available. Computers cannot solve ‘ill-posed’ problems.
Consider one simple example (shown in Table 1.3):
2. Huge data – This is a primary requirement of machine learning. Availability of a quality data
is a challenge. A quality data means it should be large and should not have data problems such
as missing data or incorrect data.
3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even Tensor
Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine
learning tasks have become complex and hence time complexity has increased, and that can be
solved only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms,
application of algorithms to solve a machine learning task, and comparison of algorithms have become necessary for machine learning professionals and data scientists now. Algorithms have become a big
topic of discussion and it is a challenge for machine learning professionals to design, select, and
evaluate optimal algorithms.
5. Bias/Variance – Bias is the error due to overly simple assumptions of the model, while variance is the error due to the model's sensitivity to the particular training data. This leads to a problem called the bias/variance trade-off. A model that fits the training data well but fails on test data, that is, one that lacks generalization, is said to be overfitting. The reverse problem is called underfitting, where the model fails even on the training data and therefore also generalizes poorly. Overfitting and underfitting are great challenges for machine learning algorithms.
➢ MACHINE LEARNING PROCESS
The emerging process model for the data mining solutions for business organizations is CRISP-DM.
Since machine learning is like data mining, except for the aim, this process can be used for machine
learning. CRISP-DM stands for Cross Industry Standard Process – Data Mining. This process
involves
six steps. The steps are listed below in Figure 1.11.
1. Understanding the business – This step involves understanding the objectives and requirements
of the business organization. Generally, a single data mining algorithm is enough for giving the
solution. This step also involves the formulation of the problem statement for the data mining
process.
2. Understanding the data – It involves the steps like data collection, study of the characteristics of
the data, formulation of hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data and
preparation of data for the data mining process. The missing values may cause problems during
both training and testing phases. Missing data forces classifiers to produce inaccurate results. This
is a perennial problem for the classification models. Hence, suitable strategies should be adopted
to handle the missing data.
4. Modelling – This step plays a role in the application of data mining algorithm for the data to obtain
a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical analysis
and visualization methods. The performance of the classifier is determined by evaluating the
accuracy of the classifier. The process of classification is a fuzzy issue. For example, classification
of emails requires extensive domain knowledge and requires domain experts. Hence, performance
of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to improve
the existing process or for a new situation.
➢ MACHINE LEARNING APPLICATIONS
Machine Learning technologies are used widely now in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in the day-to-day
life.
Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the text of documents is mapped to sentiments like happy, sad, and angry, which are often captured effectively by emoticons. For movie reviews or product reviews, ratings such as five stars or one star are attached automatically using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchase suggestions possible. For example, Amazon recommends related books or books bought by people with tastes similar to yours, and Netflix suggests shows or movies of your taste. These recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine learning, which help to locate and navigate the shortest paths to reduce travel time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of
the machine learning applications.
➢ WHAT IS DATA?
• All facts are data. In computer systems, bits encode facts present in numbers, text, images, audio,
and video.
• Data can be directly human interpretable (such as numbers or texts) or diffused data such as
images or video that can be interpreted only by a computer.
• Data is available in different data sources like flat files, databases, or data warehouses. It can
either be an operational data or a non-operational data.
• Operational data is the data that is encountered in normal business procedures and processes; for example, daily sales data is operational data. Non-operational data, on the other hand, is the kind of data that is used for decision making.
• Data by itself is meaningless. It has to be processed to generate any information. A string of bytes
is meaningless. Only when a label is attached, such as ‘height of students of a class’, does the data become meaningful.
• Processed data is called information, and it includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product sold the most in the last quarter of the year.
• Precision is defined as the closeness of repeated measurements. Often, standard deviation is used
to measure the precision.
• Bias is a systematic error that results from erroneous assumptions of the algorithms or procedures. Accuracy refers to the closeness of measurements to the true value of the quantity. Normally, the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.
➢ Types of Data
In Big Data, there are three kinds of data: structured data, unstructured data, and semi-structured data.
1. Structured Data: In structured data, data is stored in an organized manner such as a database
where it is available in the form of a table. The data can also be retrieved in an organized manner
using tools like SQL. The structured data frequently encountered in machine learning are listed
below:
▪ Record Data A dataset is a collection of measurements taken from a process. We have a
collection of objects in a dataset and each object has a set of measurements. The measurements
can be arranged in the form of a matrix. Rows in the matrix represent an object and can be
called as entities, cases, or records. The columns of the dataset are called attributes, features,
or fields. Label is the term that is used to describe the individual observations.
▪ Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or
vectors in the multidimensional space where every attribute is a dimension describing the
object.
▪ Graph Data It involves the relationships among objects. For example, a web page can refer to another web page. This can be modeled as a graph. The nodes are web pages and a hyperlink is an edge that connects the nodes.
▪ Ordered Data Ordered data objects involve attributes that have an implicit order among them.
The examples of ordered data are:
a) Temporal data – It is data whose attributes are associated with time. For example, a customer's purchasing pattern during festival time is temporal data. Time series data is a special type of sequence data where the data is a series of measurements over time.
b) Sequence data – It is like sequential data but does not have time stamps. This data
involves the sequence of words or letters. For example, DNA data is a sequence of four
characters – A T G C.
c) Spatial data – It has attributes such as positions or areas. For example, maps are spatial
data where the points are related by location.
2. Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.
3. Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.
Flat Files These are the simplest and most commonly available data source. It is also the cheapest
way of organizing the data. These flat files are the files where data is stored in plain ASCII or EBCDIC
format. Minor changes in the data of flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
CSV files – CSV stands for comma-separated value files where the values are separated by commas.
These are used by spreadsheet and database applications. The first row may have attributes and the
rest of the rows represent the data.
TSV files – TSV stands for Tab separated values files where values are separated by Tab. Both CSV
and TSV files are generic in nature and can be shared. There are many tools like Google Sheets and
Microsoft Excel to process these files.
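For example, CSV and TSV files can be written and read with common tools such as pandas; the file names and mark values below are illustrative assumptions.

```python
# Minimal sketch (assumes pandas): writing and reading small CSV and TSV files.
import pandas as pd

marks = pd.DataFrame({"StudentID": [1, 2, 3, 4, 5],
                      "Marks": [45, 60, 60, 80, 85]})

marks.to_csv("marks.csv", index=False)            # comma-separated values
marks.to_csv("marks.tsv", sep="\t", index=False)  # tab-separated values

csv_data = pd.read_csv("marks.csv")               # first row gives attribute names
tsv_data = pd.read_csv("marks.tsv", sep="\t")
print(csv_data.head())
```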
Database System
World Wide Web (WWW) It provides a diverse, worldwide online information source.
XML (eXtensible Markup Language) It is both human and machine interpretable data format.
Data Stream It is dynamic data, which flows in and out of the observing environment.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for
many machine learning algorithms.
• The primary aim of data analysis is to assist business organizations in taking decisions. For example, a business organization may want to know which is its fastest selling product, in order to plan its marketing activities accordingly.
• Data analysis is an activity that takes the data and generates useful information and insights for
assisting the organizations.
• Data analysis and data analytics are terms that are used interchangeably to refer to the same
concept. However, there is a subtle difference. Data analytics is a general term and data analysis
is a part of it.
• Data analytics refers to the process of data collection, preprocessing and analysis. It deals with
the complete cycle of data management. Data analysis is just analysis and is a part of data
analytics. It takes historical data and does the analysis. Data analytics, instead, concentrates more
on future and helps in prediction.
• There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
• Descriptive Analytics: It is about describing the main features of the data. After data collection
is done, descriptive analytics deals with the collected data and quantifies it.
• Diagnostic Analytics: It deals with the inference part and answers the question 'Why?'. This is also known as causal analysis, as it aims to find the cause and effect of events.
• Predictive Analytics: It deals with the future. It answers the question 'What will happen in the future, given this data?'. This involves the application of algorithms to identify patterns in order to predict the future.
• Prescriptive Analytics: Prescriptive analytics goes beyond prediction and helps in decision
making by giving a set of actions. It helps the organizations to plan better for the future and to
mitigate the risks that are involved.
➢ BIG DATA ANALYSIS FRAMEWORK
For performing data analytics, many frameworks are proposed. All proposed analytics frameworks
have some common factors. A Big Data framework is a layered architecture; such an architecture has many advantages, such as generality. A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
Open or public data source – It is a data source that does not have any stringent copyright rules or restrictions. Its data can be freely used for many purposes. Government census data are good examples of open data. Other examples of such data sources include:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data and biological
data.
• Healthcare systems that use extensive databases like patient databases, health insurance data,
doctors’ information, and bioinformatics information
Social media – It is the data that is generated by various social media platforms like Twitter, Facebook,
YouTube, and Instagram. An enormous amount of data is generated by these platforms.
Multimodal data – It includes data that involves many modes such as text, video, audio and mixed types.
Some of them are listed below:
• Image archives contain larger image databases along with numeric and text data
• The World Wide Web (WWW) has huge amount of data that is distributed on the Internet. These
data are heterogeneous in nature.
➢ Data Preprocessing
In the real world, the available data is ’dirty’. By the word ’dirty’, it is meant that the data may be incomplete, inconsistent, or noisy, and may contain missing values, duplicates, or outliers.
• Data preprocessing improves the quality of the data mining techniques. The raw data must be pre-
processed to give accurate results. The process of detection and removal of errors in data is called
data cleaning.
• Data wrangling means making the data processable for machine learning algorithms. Some of the
data errors include human errors such as typographical errors or incorrect measurement and
structural errors like improper data formats.
• Data errors can also arise from omission and duplication of attributes. Noise is a random component
and involves distortion of a value or introduction of spurious objects. Often, the term noise is used if the data has a spatial or temporal component. Certain deterministic distortions in the form of a streak are
known as artifacts.
• It can be observed that data like Salary = ’ ’ is incomplete data. The DoB of patients, John, Andre,
and Raju, is the missing data. The age of David is recorded as ‘5’ but his DoB indicates it is
10/10/1980. This is called inconsistent data.
• Inconsistent data occurs due to problems in conversions, inconsistent formats, and difference in
units. Salary for John is -1500. It cannot be less than ‘0’. It is an instance of noisy data. Outliers are
data that exhibit the characteristics that are different from other data and have very unusual values.
The age of Raju cannot be 136. It might be a typographical error. It is often required to distinguish
between noise and outlier data.
• Outliers may be legitimate data and sometimes are of interest to the data mining algorithms. These
errors often come during data collection stage. These must be removed so that machine learning
algorithms yield better results as the quality of results is determined by the quality of input data.
This removal process is called data cleaning.
➢ Missing Data Analysis
The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill up
the missing values, smoothen the noise while identifying the outliers and correct the inconsistencies
of the data. This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing data:
Ignore the tuple – A tuple with missing data, especially when the class label is missing, is ignored. This method is not effective when the percentage of missing values increases.
Fill in the values manually – Here, a domain expert can analyse the data tables and fill in the missing values manually. But this is time consuming and may not be feasible for larger datasets.
A global constant can be used to fill in the missing attributes. The missing values may be filled with a constant such as ’Unknown’ or ’Infinity’. But some data mining results may give spurious results when analysing these labels.
The attribute value may also be filled in with the attribute mean; say, the average income can replace a missing income value.
Use the attribute mean for all samples belonging to the same class. Here, the average value replaces
the missing values of all tuples that fall in this group.
Use the most probable value to fill in the missing value. The most probable value can be obtained from other methods like classification and decision tree prediction.
Some of these methods introduce bias in the data. The filled value may not be correct and could be just
an estimated value. Hence, the difference between the estimated and the original value is called an error
or bias.
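A few of the strategies above (global constant, attribute mean, class-wise mean) can be sketched with pandas; the small table below is a made-up illustration, not a table from the text.

```python
# Minimal sketch (assumes pandas): three ways of filling missing values.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B"],
    "Salary": [1000.0, np.nan, 2000.0, np.nan],
})

# 1. Fill with a global constant
df["Salary_const"] = df["Salary"].fillna(-1)

# 2. Fill with the attribute mean
df["Salary_mean"] = df["Salary"].fillna(df["Salary"].mean())

# 3. Fill with the mean of samples belonging to the same class
df["Salary_class_mean"] = df.groupby("Class")["Salary"].transform(
    lambda s: s.fillna(s.mean()))

print(df)
```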
➢ Removal of Noisy or Outlier Data
Noise is a random error or variance in a measured value. It can be removed by using binning, which
is a method where the given data values are sorted and distributed into equal frequency bins. The bins
are also called as buckets. The binning method then uses the neighbor values to smooth the noisy
data.
Some of the techniques commonly used are ‘smoothing by bin means’, where the mean of the bin replaces the values in the bin, ‘smoothing by bin medians’, where the bin median replaces the bin values, and ‘smoothing by bin boundaries’, where each bin value is replaced by the closest bin boundary. The maximum and minimum values of a bin are called bin boundaries. Binning methods may be used as a
discretization technique.
Example 2.1 illustrates this principle.
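A minimal code sketch of equal-frequency binning with smoothing by bin means is shown below; the data values are assumed and this is not a transcription of Example 2.1.

```python
# Minimal binning sketch (illustrative data): sort the values, split them into
# equal-frequency bins, then smooth by replacing each value with its bin mean.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # assumed values
data.sort()

bin_size = 3
smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```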
Min-Max Normalization: This procedure maps a given value v to a new range using Eq. (2.1):
v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min      (2.1)
Here, max − min is the range, min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max procedure and map the
marks to a new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The new min and new max are
0 and 1, respectively. The mapping can be done using Eq. (2.1) as:
So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.66, 1}.
Thus, the Min-Max normalization range is between 0 and 1.
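The mapping in Example 2.2 can be checked with a few lines of code; this is just a direct transcription of Eq. (2.1).

```python
# Min-Max normalization of V = {88, 90, 92, 94} to the range [0, 1]
V = [88, 90, 92, 94]
v_min, v_max = min(V), max(V)
new_min, new_max = 0.0, 1.0

normalized = [((v - v_min) / (v_max - v_min)) * (new_max - new_min) + new_min
              for v in V]
print([round(x, 2) for x in normalized])   # [0.0, 0.33, 0.67, 1.0] (0.67 ~ the 0.66 quoted above)
```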
Z-Score Normalization: This procedure works by taking the difference between the field value and the mean value, and scaling this difference by the standard deviation of the attribute, as given in Eq. (2.2):
v' = (v − m) / s      (2.2)
Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the marks to z-score.
Solution: The mean and Sample Standard deviation (s) values of the list V are 20 and 10,
respectively. So, the z-scores of these marks are calculated using Eq. (2.2) as:
Hence, the z-score of the marks 10, 20, 30 are -1, 0 and 1, respectively.
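Example 2.3 can likewise be verified in code; note that the sample standard deviation (with N − 1 in the denominator) is used, as in the solution above.

```python
# Z-score normalization of V = {10, 20, 30} using the sample standard deviation
import statistics

V = [10, 20, 30]
m = statistics.mean(V)     # 20
s = statistics.stdev(V)    # sample standard deviation = 10

z_scores = [(v - m) / s for v in V]
print(z_scores)            # [-1.0, 0.0, 1.0]
```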
Data Reduction
Data reduction reduces the size of the data while producing the same analytical results. There are different ways in which data
reduction can be carried out such as data aggregation, feature selection, and dimensionality reduction.
➢ DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that does dataset summarization. It is used to summarize
and describe data. Descriptive statistics are just descriptive and do not go beyond that. In other words,
descriptive statistics do not bother too much about machine learning algorithms and its functioning.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data objects may be records, points,
vectors, patterns, events, cases, samples or observations. These records contain many attributes. An
attribute can be defined as the property or characteristics of an object. For example, consider the
following database shown in sample Table 2.2.
Every attribute should be associated with a value. This process is called measurement. The type of
attribute determines the data types, often referred to as measurement scale types. The data types are
shown in Figure 2.1.
Categorical or Qualitative Data It can be divided into two categories. They are nominal type and ordinal type.
➢ Nominal Data – These are symbols or names of things, for example, a patient ID. Only operations like (=, ≠) are meaningful for these data. For example, the patient ID can be checked for equality and nothing else.
➢ Ordinal Data – It provides enough information and has natural order. For example, Fever =
{Low, Medium, High} is an ordinal data. Certainly, low is less than medium and medium is less
than high, irrespective of the value. Any order-preserving transformation can be applied to these data to get a new value.
Numeric or Quantitative Data It can be divided into two categories. They are interval type and ratio type.
➢ Interval Data – Interval data is a numeric data for which the differences between values are
meaningful. For example, there is a difference between 30 degrees and 40 degrees. The only permissible operations are + and −.
➢ Ratio Data – For ratio data, both differences and ratio are meaningful. The difference between
the ratio and interval data is the position of zero in the scale. For example, take the Centigrade-
Fahrenheit conversion. The zeroes of both scales do not match. Hence, these are interval data.
Another way of classifying the data is to classify it as:
1. Discrete value data
2. Continuous data
➢ Discrete Data This kind of data is recorded as integers. For example, the responses of the survey
can be discrete data. Employee identification number such as 10001 is discrete data.
➢ Continuous Data It can take any value within a range and includes decimal points. For example, age is continuous data. Though age appears to be discrete, one may be 12.5 years old and it makes sense. Patient height and weight are also continuous data.
A third way of classifying the data is based on the number of variables used in the dataset. Based on that,
the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure
2.2.
Univariate data description involves finding the frequency distributions, central tendency measures,
dispersion or variation, and shape of the data.
Data Visualization
To understand data, graph visualization is a must. Data visualization helps to understand data and to
present information and data to customers. Some of the graphs that are used in univariate data analysis
are bar charts, histograms, frequency polygons and pie charts.
The advantages of the graphs are presentation of data, summarization of data, description of data,
exploration of data, and to make comparisons of data. Let us consider some forms of graphs.
Bar Chart A Bar chart (or Bar graph) is used to display the frequency distribution for variables. Bar
charts are used to illustrate discrete data. The charts can also help to explain the counts of nominal data.
It also helps in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown below
in Figure 2.3.
Pie Chart These are equally helpful in illustrating the univariate data. The percentage frequency
distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is below in Figure 2.4.
It can be observed that the number of students with 22 marks is 2. The total number of students is 10. So, 2/10 × 100 = 20% of the space in a pie of 100% is allotted for marks 22 in Figure 2.4.
Histogram It plays an important role in data mining for showing frequency distributions. The
histogram for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50, 51-75, 76-100 is
given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the
range 76-100 is 2.
Histogram conveys useful information like nature of data and its mode. Mode indicates the peak of
dataset. In other words, histograms can be used as charts to show frequency, skewness present in the
data, and shape.
Dot Plots These are similar to bar charts. They are less clustered as compared to bar charts,
as they illustrate the bars only with single points. The dot plot of English marks for five students with
ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage
is that by visual inspection one can find out who got more marks.
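The bar chart and histogram described above can be drawn with matplotlib; the marks are the same illustrative values used in the figures.

```python
# Minimal sketch (assumes matplotlib): a bar chart and a histogram of the marks.
import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(student_ids, marks)                 # bar chart: one bar per student
ax1.set_xlabel("Student ID"); ax1.set_ylabel("Marks")

ax2.hist(marks, bins=[0, 25, 50, 75, 100])  # histogram over the ranges 0-25, ..., 76-100
ax2.set_xlabel("Marks range"); ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```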
Central Tendency
One cannot remember all the data. Therefore, a condensation or summary of the data is necessary. This makes data analysis easy and simple. One such summary is called central tendency. Thus, central tendency can explain the characteristics of data and that further helps in comparison. Mass data have a tendency to concentrate at certain values, normally in the central location. This is called the measure of central tendency (or average). Popular measures are mean, median and mode.
Mean – The arithmetic average (or mean) is a measure of central tendency that represents the ‘center’ of the dataset. Mathematically, the average of all the values in the sample (population) is denoted as x̄. Let x1, x2, … , xN be a set of ‘N’ values or observations; then the arithmetic mean is given as:
x̄ = (x1 + x2 + … + xN) / N
Weighted mean – Unlike arithmetic mean that gives the weightage of all items equally, weighted mean
gives different importance to all items as the item importance varies.
Hence, different weightage can be given to items. In case of frequency distribution, mid values of the
range are taken for computation. This is illustrated in the following computation. In weighted mean, the
mean is computed by adding the product of proportion and group mean. It is mostly used when the
sample sizes are unequal.
Geometric mean – Let x1, x2, … , xN be a set of ‘N’ values or observations. The geometric mean is the Nth root of the product of the N items. The formula for computing the geometric mean is given as follows:
GM = (x1 × x2 × … × xN)^(1/N)
The problem with the mean is its extreme sensitivity to noise. Even small changes in the input affect the mean drastically. Hence, often the top 2% is chopped off and then the mean is calculated for the larger dataset.
Median – The middle value in the distribution is called the median. If the total number of items in the distribution is odd, then the middle value is the median; if it is even, the median is the average of the two middle values. A median class is the class in which the (N/2)th item is present. In the continuous case, the median is given by the formula:
Median = L1 + ((N/2 − cf) / f) × i
Here, L1 is the lower limit of the median class, i is the class interval of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
Mode – The mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called the mode.
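These measures can be computed directly with Python's statistics module; the mark list and weights below are assumed examples.

```python
# Central tendency of an assumed mark list: mean, median and mode.
import statistics

marks = [45, 60, 60, 80, 85]

print(statistics.mean(marks))      # arithmetic mean = 66
print(statistics.median(marks))    # middle value   = 60
print(statistics.mode(marks))      # most frequent  = 60

# Weighted mean: items weighted by their importance (assumed weights)
weights = [1, 2, 2, 1, 1]
weighted_mean = sum(m * w for m, w in zip(marks, weights)) / sum(weights)
print(weighted_mean)

# Geometric mean: Nth root of the product of the N items
print(statistics.geometric_mean(marks))
```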
➢ Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called dispersion. Dispersion is represented in various ways such as range, variance, standard deviation, and standard error. These are second order measures. The most common measures of dispersion are listed below:
• Range Range is the difference between the maximum and minimum of values of the given
list of data.
• Standard Deviation The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference
between these two sets is the spread of the data. Standard deviation measures the average distance from the mean of the dataset to each point. The formula for the standard deviation is given by:
s = sqrt( (1/N) × Σ (xi − m)² )      (2.8)
Here, N is the size of the population, xi is an observation or value from the population, and m is the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8) to obtain the sample standard deviation.
• Quartiles and Inter Quartile Range
It is sometimes convenient to subdivide the dataset using percentiles. The kth percentile is the value Xi such that k% of the data lies at or below Xi. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3). Another measure that is useful to measure dispersion is the Inter Quartile Range (IQR). The IQR is the difference between Q3 and Q1.
Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the first quartile.
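A short sketch of these dispersion measures, including the 1.5 × IQR outlier rule, using NumPy on assumed data:

```python
# Dispersion measures on an assumed dataset: range, standard deviation,
# quartiles, IQR and the 1.5 * IQR outlier rule.
import numpy as np

data = np.array([10, 12, 13, 14, 15, 16, 18, 95])   # 95 is an obvious outlier

print(data.max() - data.min())        # range
print(data.std(ddof=1))               # sample standard deviation (N - 1)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # outliers, here [95]
```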
Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum and maximum
written in the order < Minimum, Q1, Median, Q3, Maximum > is known as five-point summary.
Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of the
dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally, skewness
should be zero as in ideal normal distribution. More often, the given dataset may not have perfect
symmetry.
The dataset may have either very high values or extremely low values. If the dataset has far more high values, then it is said to be skewed to the right; if it has far more low values, then it is said to be skewed to the left. If the tail is longer on the right-hand side and the hump is on the left-hand side, it is called positive skew; otherwise, it is called negative skew.
If the data is skewed, then there is a greater chance of outliers in the dataset. This affects the mean and the median, and hence may affect the performance of the data mining algorithm. A perfect symmetry means the skewness is zero. In the case of negative skew, the median is greater than the mean; in positive skew, the mean is greater than the median.
Generally, for a negatively skewed distribution, the median is more than the mean. The relationship between skew and the relative size of the mean and median can be summarized by a convenient numerical skew index known as the Pearson 2 skewness coefficient:
Pearson 2 skewness coefficient = 3 × (mean − median) / standard deviation
Also, the following measure is more commonly used to measure skewness. Let X1, X2, …, XN be a set of ‘N’ values or observations; then the skewness is given as:
Skewness = (1/N) × Σ ((Xi − m) / s)³
Here, m is the population mean and s is the population standard deviation of the univariate data.
Sometimes, for bias correction instead of N, N - 1 is used.
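Both skewness measures above can be computed in a few lines of code; the data values are assumed and were chosen to have a long right tail.

```python
# Skewness of an assumed dataset: Pearson 2 coefficient and the moment measure.
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 30]      # long tail on the right (positive skew)
N = len(data)
m = statistics.mean(data)
s = statistics.pstdev(data)           # population standard deviation
median = statistics.median(data)

pearson2 = 3 * (m - median) / s
moment_skew = sum(((x - m) / s) ** 3 for x in data) / N

print(round(pearson2, 2), round(moment_skew, 2))   # both positive for this data
```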
Kurtosis
Kurtosis also indicates the peakedness of the data. If the data has a high peak, then it indicates higher kurtosis and
vice versa. Kurtosis is measured using the formula given below:
It can be observed that N - 1 is used instead of N in the numerator of Eq. (2.14) for bias correction. Here,
x and s are the mean and standard deviation of the univariate data, respectively. Some of the other useful
measures for finding the shape of the univariate dataset are mean absolute deviation (MAD) and
coefficient of variation (CV).
Mean Absolute Deviation (MAD)
MAD is another dispersion measure and is robust to outliers. Normally, the outlier point is detected by
computing the deviation from the median and dividing it by MAD. Here, the absolute deviation between the data and the mean is taken. Thus, the mean absolute deviation is given as:
MAD = (1/N) × Σ |xi − m|
Stem and Leaf Plot It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf. For the given English marks, the two students with 60 marks are shown in the stem and leaf plot as stem 6 with two leaves of 0.
As discussed earlier, the ideal shape of the dataset is a bell-shaped curve. This corresponds to normality. Most statistical tests are designed only for normally distributed data. A Q-Q plot can be used to assess the shape of the dataset. The Q-Q plot is a 2D scatter plot of a univariate dataset against theoretical normal distribution data, or of two datasets, plotting the quantiles of the first dataset against the quantiles of the second. The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given below in Figure 2.1.
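A normal Q-Q plot for the same marks can be produced with scipy and matplotlib (assuming both are installed):

```python
# Minimal sketch (assumes scipy and matplotlib): normal Q-Q plot of the marks.
import matplotlib.pyplot as plt
from scipy import stats

marks = [13, 11, 2, 3, 4, 8, 9]

# probplot compares the sample quantiles against theoretical normal quantiles
stats.probplot(marks, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of marks")
plt.show()
```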