
C BYREGOWDA INSTITUTE OF TECHNOLOGY

Computer Science and Engineering

MODULE 2.2

Introduction to Machine Learning,


Understanding Data

Artificial Intelligence and Machine Learning (21CS54)

Sem: 5th CSE

2023-2024

Chapter 1

INTRODUCTION TO MACHINE LEARNING


"Machine Learning (ML) is a promising and flourishing field. It can enable the top management of
an organization to extract knowledge from the data stored in the various archives of the business
organization to facilitate decision making."

 Such decisions can be useful for organizations to design new products, improve business
processes, and develop decision support systems.

1.1 Need For Machine Learning


Business organizations use huge amounts of data for their daily activities. Earlier, the full
potential of this data was not utilized for two reasons. First, the data was scattered across
different archive systems, and organizations were not able to integrate these sources fully.
Second, there was a lack of awareness about software tools that could help unearth the useful
information from data. Not anymore! Business organizations have now started to use the latest
technology, machine learning, for this purpose.

Machine learning has become so popular because of three reasons:

1. High volume of available data to manage: Big companies such as Facebook, Twitter, and
YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated
that this data approximately doubles every year.
2. The cost of storage has reduced and hardware costs have also dropped. Therefore, it is now
easier to capture, process, store, distribute, and transmit digital information.
3. The availability of complex algorithms: especially with the advent of deep learning, many
algorithms are now available for machine learning.


With the popularity and ready adoption of machine learning by business organizations, it has
become a dominant technology trend now. Before starting the machine learning journey, let us
establish these terms - data, information, knowledge, intelligence, and wisdom. A knowledge
pyramid is shown in Figure 1.1.

What is data?

All facts are data. Data can be numbers or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data with data sources such as flat
files, databases, or data warehouses in different storage formats.

Processed data is called information. This includes patterns, associations, or relationships among
data. For example, sales data can be analyzed to extract information such as which is the
fastest-selling product. Condensed information is called knowledge. For example, the historical
patterns and future trends obtained from the above sales data can be called knowledge. Unless
knowledge is extracted, data is of no use. Similarly, knowledge is not useful unless it is put into
action. Intelligence is applied knowledge for actions; an actionable form of knowledge is called
intelligence. Computer systems have been successful up to this stage. The ultimate objective of
the knowledge pyramid is wisdom, which represents a maturity of mind that is, so far, exhibited
only by humans. Here comes the need for machine learning. The objective of machine learning is
to process this archival data so that organizations can take better decisions to design new
products, improve business processes, and develop effective decision support systems.

1.2 Machine Learning Explained


Machine learning is an important sub-branch of Artificial Intelligence (AI). A frequently quoted
definition of machine learning was by Arthur Samuel, one of the pioneers of Artificial
Intelligence.

He stated that "Machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed." The key to this definition is that the system should learn
by itself without explicit programming.

How is it possible? It is widely known that to perform a computation, one needs to write programs
that teach the computers how to do that computation. This approach could be difficult for many
real-world problems such as puzzles, games, and complex image recognition applications.

Initially, artificial intelligence aimed to understand these problems and develop general-purpose
rules manually. These rules were then formulated into logic and implemented in a program to
create intelligent systems. This idea of developing intelligent systems by using logic and
reasoning, by converting an expert's knowledge into a set of rules and programs, is called an
expert system. An expert system like MYCIN was designed for medical diagnosis after converting
the expert knowledge of many doctors into a system.

The focus of AI today is to develop intelligent systems by using a data-driven approach, where
data is used as input to develop intelligent models. The models can then be used to make
predictions on new inputs. Thus, the aim of machine learning is to learn a model or set of rules
from the given dataset automatically so that it can predict unknown data correctly. Just as humans
take decisions based on experience, computers build models based on patterns extracted from the
input data and then


use these data-filled models for prediction and to take decisions. For computers, the learnt model
is equivalent to human experience. This is shown in Figure 1.2.

Often, the quality of data determines the quality of experience and, therefore, the quality of the
learning system. In statistical learning, the relationship between the input x and output y is
modeled as a function in the form y = f(x).

Here, f is the learning function that maps the input x to output y. Learning of function f is the
crucial aspect of forming a model in statistical learning.

In machine learning, this is simply called mapping of input to output.

The learning program summarizes the raw data in a model. Formally stated, a model is an explicit
description of patterns within the data in the form of:

1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters

Another pioneer of AI, Tom Mitchell, defines machine learning as follows: "A computer program
is said to learn from experience E with respect to some task T and some performance measure P,
if its performance on T, as measured by P, improves with experience E." The important
components of this definition are experience E, task T, and performance measure P.

1.3 Machine Learning in Relation to other Fields


Machine learning primarily uses concepts from Artificial Intelligence, Data Science, and
Statistics. It is the result of the combined ideas of these diverse fields.

1.3.1 Machine Learning and Artificial Intelligence


Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is
to develop intelligent agents. An agent can be a robot, a human, or any autonomous system.
Initially, the idea of AI was ambitious, that is, to develop intelligent systems like human beings.
The focus was on logic and logical inferences. The field has seen many ups and downs; these down
periods were called AI winters. The resurgence in AI happened due to the development of
data-driven systems, whose aim is to find relations and regularities present in the data. Machine
learning is the sub-branch of AI whose aim is to extract patterns for prediction. It is a broad field that


includes learning from examples and other areas like reinforcement learning. The relationship of
AI and machine learning is shown in Figure 1.3. The model can take an unknown instance and
generate results.

Deep learning is a sub-branch of machine learning. In deep learning, the models are constructed
using neural network technology. Neural networks are based on models of the human neuron.
Many neurons form a network, connected through activation functions that trigger further neurons
to perform tasks.

1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data science is an 'umbrella' term that encompasses many fields. Machine learning starts with
data; therefore, data science and machine learning are interlinked. Machine learning is a branch
of data science. Data science deals with the gathering of data for analysis.

Big Data: Data science is concerned with the collection of data. Big data is a field of data science
that deals with data having the following characteristics:

1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter, and
YouTube.
2. Variety: Data is available in a variety of forms, such as images and videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.

Big data is used by many machine learning algorithms for applications such as language
translation and image recognition. Big data influences the growth of subjects like Deep learning.
Deep learning is a branch of machine learning that deals with constructing models using neural
networks.

Data Mining: Data mining has its original genesis in business. Just as mining the earth yields
precious resources, it is often believed that unearthing hidden information from data produces
insights that would otherwise have eluded the attention of the management.

 Nowadays, many consider data mining and machine learning to be the same. There is little
difference between these fields, except that data mining aims to extract the hidden patterns
that are present in the data, whereas machine learning aims to use them for prediction.


Data Analytics: Another branch of data science is data analytics. It aims to extract useful
knowledge from raw data. There are different types of analytics. Predictive data analytics is
used for making predictions. Machine learning is closely related to this branch of analytics and
shares almost all of its algorithms.

Pattern Recognition: It is an engineering field. It uses machine learning algorithms to extract the
features for pattern analysis and pattern classification. One can view pattern recognition as a
specific application of machine learning.

These relations are summarized in Figure 1.4

1.3.3 Machine Learning and Statistics


 Statistics is a branch of mathematics that has a solid theoretical foundation regarding
statistical learning. Like machine learning (ML), it can learn from data.
 But statistics and ML differ in their approach: statistical methods look for regularity in
data, called patterns, starting from a hypothesis.
 Initially, statistics sets a hypothesis and performs experiments to verify and validate the
hypothesis in order to find relationships among data.
 Statistical methods are developed in relation to the data being analysed. Machine learning,
comparatively, makes fewer assumptions and requires less statistical knowledge.

1.4 Types of Machine Learning


What does the word 'learn' mean? Learning, like adaptation, occurs as the result of the interaction
of the program with its environment.

 It can be compared with the interaction between a teacher and a student.

There are four types of machine learning as shown in Figure 1.5.


Labelled and Unlabelled Data:

Data is a raw fact. Normally, data is represented in the form of a table. Data also can be referred to
as a data point, sample, or an example. Each row of the table represents a data point. Features are
attributes or characteristics of an object. Normally, the columns of the table are attributes. Out of
all attributes, one attribute is important and is called a label. Label is the feature that we aim to
predict. Thus, there are two types of data – labelled and unlabelled.

In the following Figure 1.6, the deep neural network takes images of dogs and cats with labels for
classification. In unlabelled data, there are no labels in the dataset.


Labelled Data: To illustrate labelled data, let us take one example dataset called the Iris flower
dataset, or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers (50 from each class),
with four attributes: the length and width of sepals and petals. The target variable is called the
class. There are three classes - Iris setosa, Iris virginica, and Iris versicolor, as shown in Table 1.1.

A dataset need not always consist of numbers; it can also be images or video frames. Deep neural
networks can handle images with labels.

1.4.1 Supervised Learning


Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher
component in supervised learning. The supervisor provides labelled data with which the model is
constructed, and the model is then evaluated on test data. In supervised learning algorithms,
learning takes place in two stages.

Supervised learning has two methods:

1. Classification
2. Regression

Classification:

Classification is a supervised learning method.

 The input attributes of the classification algorithms are called independent variables. The
target attribute is called label or dependent variable.
 The relationship between the input and target variable is represented in the form of a
structure which is called a classification model.
 An example is shown in Figure 1.7, where a classification algorithm takes a set of labelled
images, such as dogs and cats, to construct a model that can later be used to classify an
unknown test image.


In classification, learning takes place in two stages.

 During the first stage, called the training stage, the learning algorithm takes a labelled dataset
and starts learning. After the training-set samples are processed, the model is generated.
 In the second stage, the constructed model is tested with a test or unknown sample and a
label is assigned to it. This is the classification process.

Some of the key algorithms of classification are listed below; a brief sketch using one of them follows.

a. Decision Tree
b. Random Forest
c. Support Vector Machines
d. Naïve Bayes
e. Artificial Neural Network and Deep Learning networks like CNN.
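
To make the two stages concrete, here is a minimal sketch (not part of the original notes) that uses one of the algorithms above, a decision tree, on the Iris dataset; it assumes the scikit-learn library is available.

```python
# A minimal sketch (assumes scikit-learn is installed) of the two-stage
# classification process: train a decision tree on labelled Iris data
# (training stage), then assign a label to an unseen sample (testing stage).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                        # labelled dataset: features and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # training stage: learn the model
print("Test accuracy:", model.score(X_test, y_test))     # testing stage: evaluate on unseen samples
print("Predicted class:", model.predict(X_test[:1]))     # assign a label to one unknown sample
```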

Regression Models:

Regression models, unlike classification algorithms, predict continuous variables such as price; in
other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset
that represents weeks as input x and product sales as output y.

The regression model takes input x and generates a model in the form of a fitted line of the form

y = f(x).

Here, x is the independent variable that may be one or more attributes and y is the dependent
variable.

In Figure 1.8, linear regression takes the training set and tries to fit it with a line: product sales =
0.66 × Week + 0.54. Here, 0.66 and 0.54 are the regression coefficients learnt from the data.
The advantage of this model is that a prediction for product sales (y) can be made for an unknown
week (x). For example, the prediction for the unknown eighth week can be made by substituting
x = 8 in the regression formula to obtain y.
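
As a small illustration (not part of the original notes), the fitted line above can be learnt and used for prediction in a few lines of Python; NumPy is assumed, and the weekly values are hypothetical points generated from the stated line.

```python
# A minimal regression sketch (assumes NumPy): fit a straight line to weekly
# sales and predict sales for an unseen week, mirroring the model
# product_sales = 0.66 * week + 0.54 described above.
import numpy as np

weeks = np.array([1, 2, 3, 4, 5, 6, 7])      # independent variable x (hypothetical weeks)
sales = 0.66 * weeks + 0.54                  # dependent variable y (illustrative values on the stated line)

slope, intercept = np.polyfit(weeks, sales, deg=1)   # learn the regression coefficients from data

unknown_week = 8
predicted_sales = slope * unknown_week + intercept   # substitute x = 8 into y = f(x)
print(f"Predicted sales for week {unknown_week}: {predicted_sales:.2f}")
```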


 Both regression and classification models are supervised algorithms.


 Both have a supervisor and the concepts of training and testing are applicable to both.
 The main difference is that regression models predict continuous variables such as product
price, while classification concentrates on assigning labels such as class.

1.4.2 Unsupervised Learning


The second kind of learning is by self-instruction. As the name suggests, there are no supervisor
or teacher components.

 In the absence of a supervisor or teacher, self-instruction is the most common kind of
learning process.
 This process of self-instruction is based on the concept of trial and error. Here, the
program is supplied with objects, but no labels are defined.
 The algorithm itself observes the examples and recognizes patterns based on the principle
of grouping. Grouping is done in such a way that similar objects fall into the same group.
 Cluster analysis and dimensionality reduction algorithms are examples of unsupervised
algorithms.

Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint
clusters or groups. Cluster analysis clusters objects based on their attributes.

 All the data objects of a partition are similar in some aspect and vary significantly from
the data objects in the other partitions.
 Some of the examples of clustering processes are — segmentation of a region of interest in
an image, detection of abnormal growth in a medical image, and determining clusters of
signatures in a gene database.

An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a
set of dog and cat images and groups them into two clusters - dogs and cats. It can be observed
that the samples belonging to a cluster are similar, while samples differ radically across clusters.


Some of the key clustering algorithms are listed below; a brief k-means sketch follows.

 k-means algorithm
 Hierarchical algorithms
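
The following is a minimal k-means sketch (an illustration added here, assuming scikit-learn is available): unlabelled 2-D points are grouped into two clusters, analogous to separating dog and cat images without labels.

```python
# A minimal k-means sketch (assumes scikit-learn and NumPy): group unlabelled
# 2-D points into two clusters based purely on similarity.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one group of similar objects
                   [5.0, 5.1], [5.2, 4.8], [4.9, 5.3]])  # a second, clearly different group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)             # cluster assignment for each point
print("Cluster centres:", kmeans.cluster_centers_)   # centre of each discovered group
```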

Dimensionality Reduction
 Dimensionality reduction algorithms are examples of unsupervised algorithms.
 They take higher-dimensional data as input and output the data in a lower dimension by
taking advantage of the variance of the data.
 The task is to reduce the dataset to fewer features without losing generality; a brief
sketch is given below.
 The differences between supervised and unsupervised learning are listed in Table 1.2.
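
A minimal dimensionality-reduction sketch (an added illustration, assuming scikit-learn) projects the four-dimensional Iris data onto two principal components while retaining most of the variance.

```python
# A minimal PCA sketch (assumes scikit-learn): reduce 4-dimensional Iris data
# to 2 dimensions by exploiting the variance of the data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)     # 150 samples x 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # 150 samples x 2 features

print("Reduced shape:", X_reduced.shape)
print("Variance explained by the 2 components:", pca.explained_variance_ratio_)
```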

1.4.3 Semi-supervised Learning


 There are circumstances where the dataset has a huge collection of unlabelled data and
only some labelled data.
 Labelling is a costly process and difficult for humans to perform.
 Semi-supervised algorithms use the unlabelled data by assigning a pseudo-label to it. Then,
the labelled and pseudo-labelled datasets can be combined.

1.4.4 Reinforcement Learning


 Reinforcement learning mimics human beings. Just as human beings use their ears and eyes
to perceive the world and take actions, reinforcement learning allows an agent to interact
with the environment to get rewards.


 The agent can be human, animal, robot, or any independent program. The rewards enable
the agent to gain experience.
 The agent aims to maximize the reward. The reward can be positive or negative
(Punishment). Consider the following example of a Grid game as shown in Figure 1.10.

In this grid game, the gray tile indicates danger, the black tile is a block, and the tile with diagonal
lines is the goal. The aim is to start from, say, the bottom-left tile and reach the goal state using
the actions left, right, up, and down. For this sort of problem, there is no labelled data.

The agent interacts with the environment to gain experience. In the above case, the agent tries to
create a model by simulating many paths and finding rewarding paths. This experience helps in
constructing a model.
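
As an added, highly simplified illustration of learning from rewards (this is a generic Q-learning sketch, not the exact grid of Figure 1.10), an agent on a one-dimensional strip of five tiles learns by trial and error to move right towards the goal.

```python
# A simplified Q-learning sketch: the agent receives a positive reward at the
# goal tile and a small penalty elsewhere, and learns an action-value table Q.
import random

n_states, goal = 5, 4
actions = [-1, +1]                       # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount factor, exploration rate

for episode in range(200):
    state = 0                            # start at the left-most tile
    while state != goal:
        # epsilon-greedy action selection: explore sometimes, otherwise exploit Q
        a = random.randrange(2) if random.random() < epsilon else Q[state].index(max(Q[state]))
        nxt = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if nxt == goal else -0.01
        # Q-learning update rule
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

print("Learned policy:", ["right" if q[1] >= q[0] else "left" for q in Q])
```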

1.5 Challenges of Machine Learning


What are the challenges of machine learning? Let us discuss them now.

 Computers are better than humans in performing tasks like computation.


 However, humans are better than computers in many aspects like recognition. But, deep
learning systems challenge human beings in this aspect as well.
 The quality of a learning system depends on the quality of data. This is a challenge.

Some of the challenges are listed below:

a. Problems: Machine learning can deal with 'well-posed' problems, where the specifications
are complete and available. Computers cannot solve 'ill-posed' problems.

For example, suppose a small test dataset with inputs x1, x2 and output y is given. Can a
model for this test data be multiplication, that is, y = x1 × x2? It may well fit the data. But it
is equally true that y may be x1 ÷ x2, or x1 raised to the power x2. So, there are several
functions that fit the same data, and the problem is therefore ill-posed.

b. Huge data: This is a primary requirement of machine learning. Availability of quality data
is a challenge. Quality data means the data should be large and free of problems such as
missing or incorrect values.

c. High computation power: With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even
Tensor Processing Unit (TPU) are required to execute machine learning algorithms.


 Also, machine learning tasks have become complex and hence time complexity has
increased, and that can be solved only with high computing power.

d. Complexity of the algorithms: The selection of algorithms, describing them, applying them
to solve machine learning tasks, and comparing them have become necessary skills for
machine learning practitioners and data scientists.
Algorithms have become a big topic of discussion, and it is a challenge for machine
learning professionals to design, select, and evaluate optimal algorithms.

e. Bias/Variance: Bias is the error due to erroneous assumptions made by the model, while
variance is the error due to the model's sensitivity to the particular training data. Together
these lead to a problem called the bias/variance tradeoff.

 A model that fits the training data well but fails on test data, and hence lacks
generalization, is said to be overfitted.
 The reverse problem is called underfitting, where the model is too simple and fails
even on the training data, and therefore cannot generalize well.
 Overfitting and underfitting are great challenges for machine learning algorithms.

1.6 Machine Learning Process


The emerging process model for data mining solutions for business organizations is CRISP-DM.
Since machine learning is similar to data mining, except for the aim, this process can also be used
for machine learning.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The process involves six
steps, shown in Figure 1.11 and listed below.


a. Understanding the business: This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough for giving the solution. This step also involves the formulation of the problem
statement for the data mining process.

b. Understanding the data: It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the
selected hypothesis.

c. Preparation of data: This step involves producing the final dataset by cleaning the raw
data and preparation of data for the data mining process. The missing values may cause
problems during both training and testing phases. Missing data forces classifiers to
produce inaccurate results. This is a perennial problem for the classification models.
Hence, suitable strategies should be adopted to handle the missing data.

d. Modelling: This step plays a role in the application of data mining algorithm for the data
to obtain a model or pattern.

e. Evaluate: This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy issue.
For example, classification of emails requires extensive domain knowledge and requires
domain experts. Hence, performance of the classifier is very crucial.

f. Deployment: This step involves the deployment of results of the data mining algorithm
to improve the existing process or for a new situation.

1.7 Machine Learning Applications


Machine Learning technologies are used widely now in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in the day-to-
day life.

Some applications are listed below:

a. Sentiment analysis: This is an application of natural language processing (NLP) where


the words of documents are converted to sentiments like happy, sad, and angry which are
captured by emoticons effectively. For movie reviews or product reviews, five stars or one
star are automatically attached using sentiment analysis programs.

b. Recommendation systems: These are systems that make personalized recommendations
possible. For example, Amazon recommends related books or books bought by people with
tastes similar to yours, and Netflix suggests shows or movies matching your taste. These
recommendation systems are based on machine learning.


c. Voice assistants: Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform
tasks. These chatbots are the result of machine learning technologies.

d. Technologies like Google Maps and those used by Uber are also examples of machine
learning; they locate and navigate the shortest paths to reduce travel time.

The following Table 1.4 summarizes some of the machine learning applications.


Chapter-2

UNDERSTANDING DATA
Machine learning algorithms involve large datasets. Hence, it is necessary to understand the data
and datasets before applying machine learning algorithms. This chapter aims to introduce the
concepts necessary to understand data better.

2.1 What Is Data?

All facts are data.

 In computer systems, bits encode facts present in numbers, text, images, audio, and video.

 Data can be directly human interpretable (such as numbers or texts) or diffused data such
as images or video that can be interpreted only by a computer.

 Today, business organizations are accumulating vast and growing amounts of data, of the
order of gigabytes, terabytes, and exabytes.

 Data is available in different data sources like flat files, databases, or data warehouses. It
can be either operational or non-operational data. Operational data is the data that is
encountered in normal business procedures and processes.

 Processed data is called information that includes patterns, associations, or relationships


among data.

Elements of Big Data

Big data, on the other hand, is data whose volume is much larger than that of 'small data' and is
characterized as follows:

a. Volume – Since there is a reduction in the cost of storing devices, there has been a
tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB)
and terabytes (TB), but Big Data is measured in terms of petabytes (PB) and exabytes (EB).
One exabyte is 1 million terabytes.

b. Velocity – The speed at which data arrives and its rate of growth in volume are noted as
velocity. The availability of IoT devices and Internet connectivity ensures that data arrives
at a faster rate. Velocity helps to understand the relative growth of big data and its
accessibility by users, systems, and applications.

c. Variety – The variety of Big Data includes:


 Form – There are many forms of data. Data types range from text, graph, audio,
video, to maps. There can be composite data too, where one media can have many
other sources of data, for example, a video can have an audio song.
 Function – These are data from various sources like human conversations,
transaction records, and old archive data.
 Source of data – This is the third aspect of variety. There are many sources of data.
Broadly, the data source can be classified as open/public data, social media data and
multimodal data. These are discussed in Section 2.3.1 of this chapter.

d. Veracity of data – Veracity of data deals with aspects like conformity to the facts,
truthfulness, believability, and confidence in data. There may be many sources of error
such as technical errors, typographical errors, and human errors. So, veracity is one of the
most important aspects of data.
e. Validity – Validity is the accuracy of the data for taking decisions or for any other goals
that are needed by the given problem.
f. Value – Value is the characteristic of big data that indicates the value of the information
that is extracted from the data and its influence on the decisions that are taken based on it.

2.1.1 Types of Data

In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.

Structured Data

In structured data, data is stored in an organized manner, such as a database where it is available
in the form of a table. The data can also be retrieved in an organized manner using tools like SQL.

The structured data frequently encountered in machine learning are listed below:

Record Data: A dataset is a collection of measurements taken from a process. We have a
collection of objects in a dataset, and each object has a set of measurements. The measurements
can be arranged in the form of a matrix. Rows in the matrix represent objects and can be called
entities, cases, or records. The columns of the dataset are called attributes, features, or fields.
The table is filled with observed data. It is also useful to note the general jargon associated with
a dataset: label is the term used to describe an individual observation.

Data Matrix: It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or
vectors in the multidimensional space where every attribute is a dimension describing the object.

Graph Data: It involves the relationships among objects. For example, a web page can refer to
another web page. This can be modeled as a graph: the nodes are web pages and the hyperlinks
are the edges that connect the nodes.

Ordered Data: Ordered data objects involve attributes that have an implicit order among them.


Unstructured Data

Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.

Semi-Structured Data

Semi-structured data are partially structured and partially unstructured. These include data
like XML/JSON data, RSS feeds, and hierarchical data.

2.1.2 Data Storage and Representation

Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis.
The goal of data storage management is to make data available for analysis. There are different
approaches to organize and manage data in storage files and systems from flat file to data
warehouses. Some of them are listed below:

Flat Files: These are the simplest and most commonly available data source. It is also the
cheapest way of organizing the data.

Some of the popular spreadsheet formats are listed below:

 CSV files – CSV stands for comma-separated value files where the values are separated
by commas. These are used by spreadsheet and database applications. The first row may
have attributes and the rest of the rows represent the data.
 TSV files – TSV stands for tab-separated values files, where the values are separated by tabs.
Both CSV and TSV files are generic in nature and can be shared.
 There are many tools, such as Google Sheets and Microsoft Excel, to process these files; a
small loading sketch follows.
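
A minimal loading sketch (an added example, assuming the pandas library; the file name is hypothetical) shows how such flat files are typically read:

```python
# A minimal sketch (assumes pandas) of loading a flat file: the first row of
# the hypothetical file "sales.csv" holds the attribute names and the
# remaining rows hold the data records.
import pandas as pd

df = pd.read_csv("sales.csv")      # for a TSV file, use: pd.read_csv("sales.tsv", sep="\t")
print(df.head())                   # the first few records
print(df.columns.tolist())         # the attribute (column) names
```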

Database System: It normally consists of database files and a database management system
(DBMS). Database files contain original data and metadata.

 A relational database consists of sets of tables.


 The tables have rows and columns. The columns represent the attributes and rows
represent tuples.
 A tuple corresponds to either an object or a relationship between objects.

Different types of databases are listed below:

1. A transactional database is a collection of transactional records. Each record is a


transaction.
2. Time-series database stores time related information like log files where data is
associated with a time stamp.
3. Spatial databases contain spatial information in a raster or vector format. Raster formats
are either bitmaps or pixel maps. For example, images can be stored as a raster data.


World Wide Web (WWW): It provides a diverse, worldwide online information source. The
objective of data mining algorithms is to mine interesting patterns of information present in
WWW.

XML (eXtensible Markup Language): It is a data format that is both human- and machine-interpretable
and can be used to represent data that needs to be shared across platforms.

Data Stream: It is dynamic data, which flows in and out of the observing environment. Typical
characteristics of data stream are huge volume of data, dynamic, fixed order movement, and real-
time constraints.

RSS (Really Simple Syndication): It is a format for sharing instant feeds across services.

JSON (JavaScript Object Notation): It is another useful data interchange format that is often
used for many machine learning algorithms.

2.2 Big Data Analytics and Types of Analytics

The primary aim of data analysis is to assist business organizations in taking decisions. For
example, a business organization may want to know which is its fastest-selling product in order
to plan its marketing activities.

Data analysis and data analytics are terms that are used interchangeably to refer to the same
concept. However, there is a subtle difference.

Data analytics is a general term and data analysis is a part of it. Data analytics refers to the process
of data collection, pre-processing and analysis. It deals with the complete cycle of data
management.

Data analysis is just the analysis and is a part of data analytics. It takes historical data and performs
analysis on it. Data analytics, in contrast, concentrates more on the future and helps in prediction.

There are four types of data analytics:

1. Descriptive analytics.
2. Diagnostic analytics.
3. Predictive analytics.
4. Prescriptive analytics.

Descriptive Analytics: It is about describing the main features of the data. After data
collection is done, descriptive analytics deals with the collected data and quantifies it.

 Descriptive analytics only focuses on the description part of the data and not the inference
part.


Diagnostic Analytics: It deals with the question 'Why?'. This is also known as causal
analysis, as it aims to find out the cause and effect of events.

 For example, if a product is not selling, diagnostic analytics aims to find out the reason.

Predictive Analytics: It deals with the future and the question 'What will happen in the future,
given this data?' This involves the application of algorithms to identify patterns in order to
predict the future. The entire course of machine learning is mostly about predictive analytics.

Prescriptive Analytics: It is about finding the best course of action for business
organizations.

 Prescriptive analytics goes beyond prediction and helps in decision making by giving a set
of actions.
 It helps the organizations to plan better for the future and to mitigate the risks that are
involved.

2.3 Big Data Analysis Framework


For performing data analytics, many frameworks have been proposed. All proposed analytics
frameworks have some common factors. A big data framework is a layered architecture; such an
architecture has many advantages, such as genericness. A 4-layer architecture has the following
layers:
a. Data connection layer
b. Data management layer
c. Data analytics layer
d. Presentation layer

Data Connection Layer: It has data ingestion mechanisms and data connectors. Data
ingestion means taking raw data and importing it into appropriate data structures. It performs the
tasks of ETL process. By ETL, it means extract, transform and load operations.

Data Management Layer: It performs pre-processing of data. The purpose of this layer is to
allow parallel execution of queries, and read, write and data management tasks. There may be
many schemes that can be implemented by this layer such as data-in-place, where the data is not
moved at all, or constructing data repositories such as data warehouses and pull data on-demand
mechanisms.

Data Analytics Layer: It provides functionality such as statistical tests and machine learning
algorithms for understanding the data and constructing machine learning models.

Presentation Layer: It has mechanisms such as dashboards, and applications that display the
results of analytical engines and machine learning algorithms. Thus, the Big Data processing
cycle involves data management that consists of the following steps.


1. Data collection
2. Data pre-processing
3. Application of machine learning algorithms
4. Interpretation and visualization of the results of the machine learning algorithms

This is an iterative process and is carried out on a continual basis to ensure that the data is suitable
for data mining.

2.3.1 Data Collection


The first task of gathering datasets is the collection of data. It is often estimated that most of the
time is spent on the collection of good-quality data, and good-quality data yields a better result.
It is often difficult to characterize 'good data'. Good data has the following properties:

 Timeliness – The data should be relevant and not stale or obsolete data.
 Relevancy – The data should be relevant and ready for the machine learning or data
mining algorithms. All the necessary information should be available and there should
be no bias in the data.
 Knowledge about the data – The data should be understandable and interpretable, and
should be self-sufficient for the required application as desired by the domain
knowledge engineer.

Broadly, the data source can be classified as open/public data, social media data and multimodal
data.

1. Open or public data source – It is a data source that does not have any stringent
copyright rules or restrictions. Its data can be primarily used for many purposes.
Government census data are good examples of open data:

 Digital libraries that have huge amount of text data as well as document images
 Scientific domains with a huge collection of experimental data like genomic data
and biological data.
 Healthcare systems that use extensive databases like patient databases, health
insurance data, doctors‘ information, and bioinformatics information.

2. Social media – It is the data that is generated by various social media platforms like
Twitter, Facebook, YouTube, and Instagram. An enormous amount of data is generated
by these platforms.

3. Multimodal data – It includes data that involves many modes such as text, video, audio
and mixed types.


2.3.2 Data Pre-processing

In the real world, the available data is 'dirty'. By the word 'dirty', it means:
 Incomplete data
 Inaccurate data
 Outlier data
 Data with missing values
 Data with inconsistent values
 Duplicate data

Data pre-processing improves the quality of the data mining techniques. The raw data must be
pre-processed to give accurate results.

 The process of detection and removal of errors in data is called data cleaning.
 Data wrangling means making the data processable for machine learning algorithms.

Some of the data errors include human errors, such as typographical errors or incorrect
measurements, and structural errors like improper data formats. Data errors can also arise from
omission and duplication of attributes. Noise is a random component and involves distortion of a
value or introduction of spurious objects. The term noise is often used when the data has a spatial
or temporal component.

Consider, for example, the following patient Table 2.1. 'Bad' or 'dirty' data can be observed
in this table.

It can be observed that data like Salary = ' ' is incomplete data. The DoB of the patients John,
Andre, and Raju is missing data. The age of David is recorded as '5', but his DoB indicates
10/10/1980; this is called inconsistent data.

 These errors often come during data collection stage.


 These must be removed so that machine learning algorithms yield better results as the
quality of results is determined by the quality of input data.
 This removal process is called data cleaning.


Missing Data Analysis:

The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill
in the missing values, smooth the noise while identifying the outliers, and correct inconsistencies
in the data. This enables data mining to avoid overfitting of the models.

The procedures that are given below can solve the problem of missing data:

1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This
method is not effective when the percentage of the missing values increases.

2. Fill in the values manually – Here, the domain expert can analyse the data tables and carry
out the analysis and fill in the values manually. But, this is time consuming and may not be
feasible for larger sets.

3. A global constant can be used to fill in the missing attributes. The missing values may
be filled with 'Unknown' or 'Infinity'. But some data mining results may be spurious because
such labels get analysed as if they were genuine values.

4. The attribute value may be filled by the attribute mean. Say, the average income can
replace a missing income value.

5. Use the attribute mean for all samples belonging to the same class. Here, the average value
replaces the missing values of all tuples that fall in this group.

6. Use the most probable value to fill in the missing value. The most probable value can be
obtained from other methods like classification and decision tree prediction.

Some of these methods introduce bias in the data. The filled value may not be correct and could
be just an estimated value. Hence, the difference between the estimated and the original value is
called an error or bias.
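
As an added illustration (assuming pandas; the column names and values are hypothetical), two of the strategies above, ignoring tuples and filling with the attribute mean, look like this:

```python
# A minimal sketch (assumes pandas and NumPy) of handling missing values:
# drop tuples that contain missing data, or fill a missing attribute with
# the attribute mean.
import numpy as np
import pandas as pd

patients = pd.DataFrame({"Name": ["John", "Andre", "David"],
                         "Salary": [30000, np.nan, 45000]})    # one missing salary

dropped = patients.dropna()                                       # strategy 1: ignore the tuple
filled = patients.fillna({"Salary": patients["Salary"].mean()})   # strategy 4: fill with the mean
print(dropped)
print(filled)
```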

Removal of Noisy or Outlier Data


 Noise is a random error or variance in a measured value.
 It can be removed by using binning, which is a method where the given data values are
sorted and distributed into equal frequency bins.
 The bins are also called buckets. The binning method then uses the neighbouring values to
smooth the noisy data.
 The maximum and minimum values of a bin are called bin boundaries. Binning methods may
also be used as a discretization technique.

Example 2.1 illustrates this principle.

Example 2.1: Consider the following sorted set of values: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}.
Apply the various binning techniques and show the result.


First, the values are partitioned into three equal-frequency bins:
Bin 1: 12, 14, 19
Bin 2: 22, 24, 26
Bin 3: 28, 31, 32

With the smoothing-by-bin-means method, every value in a bin is replaced by the bin mean:
Bin 1: 15, 15, 15
Bin 2: 24, 24, 24
Bin 3: 30.33, 30.33, 30.33

Using the smoothing-by-bin-boundaries method, every value is replaced by the nearer bin boundary:
Bin 1: 12, 12, 19
Bin 2: 22, 22, 26
Bin 3: 28, 32, 32
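
A short sketch (an added example) that reproduces the three results above:

```python
# Equal-frequency binning of Example 2.1, followed by smoothing by bin means
# and smoothing by bin boundaries.
S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]   # three equal-frequency bins of size 3

def smooth_by_means(b):
    mean = round(sum(b) / len(b), 2)
    return [mean] * len(b)

def smooth_by_boundaries(b):
    lo, hi = b[0], b[-1]                           # the bin boundaries
    return [lo if abs(v - lo) <= abs(v - hi) else hi for v in b]

print([smooth_by_means(b) for b in bins])       # [[15.0, ...], [24.0, ...], [30.33, ...]]
print([smooth_by_boundaries(b) for b in bins])  # [[12, 12, 19], [22, 22, 26], [28, 32, 32]]
```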

Data Integration and Data Transformations


 Data integration involves routines that merge data from multiple sources into a single data
source.
 The main goal of data integration is to detect and remove redundancies that arise from
integration.
 Data transformation routines perform operations like normalization to improve the
performance of the data mining algorithms.
 It is necessary to transform data so that it can be processed. Normalization is one such
technique.
 In normalization, the attribute values are scaled to fit in a range (say 0–1) to improve the
performance of the data mining algorithm. These techniques are often used in neural
networks. Two common normalization procedures are:
1. Min-Max
2. Z-Score

Min-Max Procedure

It is a normalization technique where each variable V is normalized by its difference from the
minimum value, divided by the range, and rescaled to a new range, say 0–1. The formula to
implement this normalization is:

v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min

Here, max - min is the range; min and max are the minimum and maximum of the given data;
new_max and new_min are the maximum and minimum of the target range, say 1 and 0.
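
A minimal sketch (an added example in plain Python) of min-max normalization into the range 0–1:

```python
# Rescale a list of values into the target range [new_min, new_max] using the
# min-max formula given above.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max([10, 20, 30, 40, 50]))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```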

Data Reduction

Data reduction reduces the data size while aiming to produce the same analytical results. There
are different ways in which data reduction can be carried out, such as data aggregation, feature
selection, and dimensionality reduction.


2.4 Descriptive Statistics

 Descriptive statistics is a branch of statistics that does dataset summarization.


 It is used to summarize and describe data.
 Descriptive statistics are just descriptive and do not go beyond that.
 Data visualization is a branch of study that is useful for investigating the given data.

Descriptive analytics and data visualization techniques help to understand the nature of the data,
which further helps to determine the kinds of machine learning or data mining tasks that can be
applied to the data.

Dataset and Data Types

A dataset can be assumed to be a collection of data objects. The data objects may be records,
points, vectors, patterns, events, cases, samples or observations. These records contain many
attributes. An attribute can be defined as the property or characteristics of an object. For
example, consider the following database shown in sample Table 2.2.

Every attribute should be associated with a value. This process is called measurement. The type
of attribute determines the data types, often referred to as measurement scale types. The data
types are shown in Figure 2.1.

Broadly, data can be classified into two types:

 Categorical or qualitative data.


 Numerical or quantitative data.


Categorical or Qualitative Data:

The categorical data can be divided into two types. They are nominal type and ordinal type.

 Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and
cannot be processed like a number. For example, the average of a patient ID does not make
any statistical sense. Nominal data type provides only information but has no ordering
among data.

 Ordinal Data – It provides enough information and has a natural order. For example,
Fever = {Low, Medium, High} is ordinal data. Certainly, low is less than medium and
medium is less than high, irrespective of the underlying values. Any order-preserving
transformation can be applied to these data to get new values.

Numeric or Quantitative Data: It can be divided into two categories: interval type and
ratio type.

 Interval Data – Interval data is numeric data for which the differences between values
are meaningful. For example, there is a meaningful difference between 30 degrees and 40
degrees. The only permissible operations are + and -.

 Ratio Data – For ratio data, both differences and ratios are meaningful. The difference
between ratio and interval data is the position of zero in the scale. For example, consider the
Celsius-Fahrenheit conversion: the zeroes of the two scales do not match, so temperatures
on these scales are interval data.

Discrete Data: This kind of data is recorded as integers. For example, the responses of the survey
can be discrete data. Employee identification number such as 10001 is discrete data.

Continuous Data: It can be fitted into a range and includes decimal point. For example, age is a
continuous data. Though age appears to be discrete data, one may be 12.5 years old and it makes
sense. Patient height and weight are all continuous data.

A third way of classifying data is based on the number of variables used in the dataset. Based on
that, data can be classified as univariate data, bivariate data, and multivariate data. This is
shown in Figure 2.2.


2.5 Univariate Data Analysis and Visualization

Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset
has only one variable. A variable can also be called a category. Univariate analysis does not deal
with causes or relationships.

2.5.1 Data Visualization

To understand data, graph visualization is a must. Data visualization helps to understand data and
to present information and data to customers. Some of the graphs that are used in univariate
data analysis are bar charts, histograms, frequency polygons, and pie charts.

Bar Chart: A Bar chart (or Bar graph) is used to display the frequency distribution for
variables. Bar charts are used to illustrate discrete data. The charts can also help to explain the
counts of nominal data. It also helps in comparing the frequency of different groups.

The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown
below in Figure 2.3.

Pie Chart: These are equally helpful in illustrating the univariate data. The percentage
frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, and 90} is below in
Figure 2.4.

Histogram: It plays an important role in data mining for showing frequency distributions. The
histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75, and
76-100 is given below in Figure 2.5.

Histogram conveys useful information like nature of data and its mode. Mode indicates the peak
of dataset. In other words, histograms can be used as charts to show frequency, skewness present
in the data, and shape.
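
As an added illustration (assuming Matplotlib is available), the bar chart and histogram of these marks can be drawn as follows:

```python
# A minimal sketch (assumes Matplotlib): a bar chart of marks per student and
# a histogram of the same marks over the ranges 0-25, 26-50, 51-75, 76-100.
import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(student_ids, marks)                      # bar chart of discrete values
ax1.set_xlabel("Student ID")
ax1.set_ylabel("Marks")
ax2.hist(marks, bins=[0, 25, 50, 75, 100])       # histogram showing the frequency distribution
ax2.set_xlabel("Marks range")
ax2.set_ylabel("Frequency")
plt.show()
```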


Dot Plots: These are similar to bar charts. They are less clustered as compared to bar charts, as
they illustrate the bars only with single points. The dot plot of English marks for five students
with ID as {1, 2, 3, 4, and 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage
is that by visual inspection one can find out who got more marks.

2.5.2 Central Tendency

One cannot remember all the data. Therefore, a condensation or summary of the data is necessary.
This makes the data analysis easy and simple. One such summary is called central tendency. Thus,
central tendency can explain the characteristics of data and that further helps in comparison. It is
called measure of central tendency (or averages).

Popular measures are mean, median and mode.


a. Mean: The arithmetic average (or mean) is a measure of central tendency that represents the
'center' of the dataset. It is the commonest measure used in daily conversation, such as
average income or average traffic. Mathematically, the average of all the values in the
sample (population) is denoted as x. Let x1, x2, ..., xN be a set of N values or observations;
then the arithmetic mean is given as:

x = (x1 + x2 + ... + xN) / N

Weighted mean: Unlike arithmetic mean that gives the weightage of all items
equally, weighted mean gives different importance to all items as the item importance
varies. Hence, different weightage can be given to items.

In weighted mean, the mean is computed by adding the product of proportion and group mean.

Geometric mean: Let x1, x2, ..., xN be a set of N values or observations. The geometric mean is
the Nth root of the product of the N items. The formula for computing the geometric mean is:

geometric mean = (x1 × x2 × ... × xN)^(1/N)

b. Median: The middle value in the distribution is called the median. If the total number of
items in the distribution is odd, the middle value is the median. If the number of items is
even, the average of the two middle items is the median. It can be observed that the median
is the value at which the xi are divided into two equal halves, with half of the values lower
than the median and half higher. A median class is the class in which the (N/2)th item is
present.

In the continuous (grouped) case, the median is given by the formula:

Median = L1 + ((N/2 - cf) / f) × i

Here, i is the class interval of the median class, L1 is the lower limit of the median class, f is the
frequency of the median class, and cf is the cumulative frequency of all classes preceding the
median class.

c. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the
value that has the highest frequency is called the mode. Mode is only for discrete data and is
not applicable to continuous data, as there are usually no repeated values in continuous data.
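
A minimal sketch (an added example using Python's standard statistics module) computes the three measures for a small marks dataset:

```python
# Mean, median, and mode of a small univariate dataset.
import statistics

marks = [45, 60, 60, 80, 85]
print("Mean:", statistics.mean(marks))       # (45 + 60 + 60 + 80 + 85) / 5 = 66
print("Median:", statistics.median(marks))   # middle value of the sorted data = 60
print("Mode:", statistics.mode(marks))       # most frequent value = 60
```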

2.5.3 Dispersion

The spread of a set of data around the central tendency (mean, median, or mode) is called
dispersion. Dispersion is represented in various ways, such as range, variance, standard deviation,
and standard error. These are second-order measures. The most common measures of dispersion
are listed below:

Range: Range is the difference between the maximum and minimum of values of the given list of
data.
Standard Deviation: The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference
between these two sets is the spread of data.

Standard deviation is the average distance from the mean of the dataset to each point. The
formula for the standard deviation is:

s = sqrt( (1/N) × Σ (xi - m)^2 )

Here, N is the size of the population, xi is an observation or value from the population, and m is
the population mean. Often, N - 1 is used instead of N in the denominator; for real-world samples,
division by N - 1 gives an answer closer to the actual value.

Quartiles and Inter Quartile Range

It is sometimes convenient to subdivide the dataset using percentiles. The kth percentile is the
value Xi such that k% of the data lies at or below Xi.

For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile
is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR). The
IQR is the difference between Q3 and Q1:

IQR = Q3 - Q1

Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the
first quartile. The interquartile range can also be written as Q0.75 - Q0.25.
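
As an added illustration (assuming NumPy; the data values are hypothetical), the quartiles, IQR, and the 1.5 × IQR outlier rule can be computed as follows:

```python
# Quartiles, inter-quartile range, and outlier detection using the 1.5 x IQR rule.
import numpy as np

data = np.array([10, 12, 14, 15, 16, 18, 20, 95])    # 95 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```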


Five-point Summary and Box Plots

The median, quartiles Q1 and Q3, and minimum and maximum written in the order < Minimum,
Q1, Median, Q3, and Maximum > is known as five-point summary.

Box plots are suitable for continuous variables and for a nominal variable. Box plots can be used
to illustrate data distributions and summaries of data. They are a popular way of plotting the
five-point summary.

2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of
the dataset.

Skewness: The measures of direction and degree of symmetry are called measures of third order.
Ideally, skewness should be zero as in ideal normal distribution.

 The dataset may also either have very high values or extremely low values.
 If the dataset has far higher values, then it is said to be skewed to the right.


 On the other hand, if the dataset has far more low values, then it is said to be skewed
towards the left. If the tail is longer on the right-hand side and the hump is on the left-hand
side, it is called positive skew.
 Otherwise, it is called negative skew.
 The given dataset may also have an equal (symmetric) distribution of data.

Kurtosis

Kurtosis indicates the peakedness of the data. If the data has a high peak, it indicates higher
kurtosis and vice versa. Kurtosis is the measure of whether the data is heavy-tailed or light-tailed
relative to the normal distribution. It can be observed that the normal distribution has a bell-shaped
curve with no long tails. Low kurtosis tends to indicate light tails, implying that there is no
outlier data.

Let x1, x2, ..., xN be a set of N values or observations. Then, kurtosis is measured using the
formula given below:

kurtosis = Σ (xi - x)^4 / ((N - 1) × s^4)

N - 1 is used instead of N for bias correction. Here, x and s are the mean and standard deviation
of the univariate data, respectively.
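
A minimal sketch (an added example, assuming SciPy; the sample values are hypothetical) computes skewness and kurtosis for a small dataset; values near zero suggest a roughly symmetric, normal-like shape.

```python
# Skewness and (excess) kurtosis of a small univariate sample.
from scipy.stats import kurtosis, skew

data = [2, 4, 4, 4, 5, 5, 7, 9]
print("Skewness:", skew(data))
print("Kurtosis (excess):", kurtosis(data))   # 0 for a perfectly normal distribution
```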

Mean Absolute Deviation (MAD)

MAD is another dispersion measure and is robust to outliers. Normally, an outlier point is
detected by computing its deviation from the median and dividing by MAD. Here, the absolute
deviation between each data value and the mean is taken, so the mean absolute deviation is given as:

MAD = (1/N) × Σ |xi - x|

Coefficient of Variation (CV)

The coefficient of variation is used to compare datasets with different units. CV is the ratio of the standard deviation to the mean, and %CV expresses this ratio as a percentage:

CV = (s / x̄) × 100%
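A short sketch computing the mean absolute deviation and the coefficient of variation with NumPy, using an arbitrary set of measurements:

```python
import numpy as np

data = np.array([10.0, 12.0, 9.0, 11.0, 13.0, 10.5])   # hypothetical measurements

mean = data.mean()
std = data.std(ddof=1)                      # sample standard deviation (N - 1 in the denominator)

mad = np.mean(np.abs(data - mean))          # mean absolute deviation about the mean
cv_percent = 100.0 * std / mean             # coefficient of variation as a percentage

print("MAD =", mad, " %CV =", cv_percent)
```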


2.6 Bivariate Data and Multivariate Data

Bivariate data involves two variables, and bivariate analysis deals with the relationships (and possible causes) between them. The aim is to find relationships among the data. Consider Table 2.3, which contains data on the temperature in a shop and the sales of sweaters.

 The aim of bivariate analysis is to find relationships among variables. The relationships
can then be used in comparisons, finding causes, and in further explorations.
 Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or
without nominal variables, to illustrate the trends, and also to show differences.
 The scatter plot (refer Figure 2.11) indicates the strength, shape, direction and presence of outliers.

Line graphs are similar to scatter plots. The Line Chart for sales data is shown in Figure 2.12.

2.6.1 Bivariate Statistics

Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented by capital letters. It is written as covariance(X, Y) or COV(X, Y) and measures how the two dimensions vary together. For specific values xi and yi, the covariance is:

COV(X, Y) = (1/N) × Σ (xi − E(X)) × (yi − E(Y))


Here, xi and yi are the data values from X and Y, E(X) and E(Y) are their mean values, and N is the number of data points. Also, COV(X, Y) is the same as COV(Y, X).

Correlation:

The Pearson correlation coefficient is the most common test for determining the association between two phenomena. It measures the strength and direction of a linear relationship between the x and y variables, and is defined as r(X, Y) = COV(X, Y) / (σX × σY), which always lies between −1 and +1.

The correlation indicates the relationship between the dimensions through its sign; the sign is often more informative than the actual value.
 If the value is positive, it indicates that the dimensions increase together.
 If the value is negative, it indicates that while one-dimension increases, the other
dimension decreases.
 If the value is zero, then it indicates that both the dimensions are independent of each
other.
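A sketch of computing covariance and the Pearson correlation with NumPy; the temperature and sales values below are hypothetical stand-ins for Table 2.3.

```python
import numpy as np

temperature = np.array([25, 22, 18, 15, 10, 5])    # hypothetical shop temperatures
sweater_sales = np.array([3, 5, 9, 12, 18, 25])    # hypothetical units sold

cov_matrix = np.cov(temperature, sweater_sales)    # 2 x 2 covariance matrix
corr_matrix = np.corrcoef(temperature, sweater_sales)

print("COV(X, Y) =", cov_matrix[0, 1])
print("Pearson r =", corr_matrix[0, 1])            # close to -1: strong negative relationship
```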


2.7 Multivariate Statistics


In machine learning, almost all datasets are multivariate. Multivariate data involves more than two variables, and often thousands of measurements need to be collected for one or more subjects.
 Multivariate data is like bivariate data but may have more than two dependent variables.
 Some of the multivariate analyses are regression analysis, principal component analysis, and path analysis.
 The mean of multivariate data is a mean vector; for the three attributes of the example dataset, the mean vector is (2, 7.5, 1.33).

Heatmap :

A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it: the darker colours indicate very large values and the lighter colours indicate smaller values.
 The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be spotted easily.
 For example, in vehicle traffic data, heavy-traffic regions can be differentiated from low-traffic regions through a heatmap.

In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis shows weights and the Y-axis shows patient counts. The dark-coloured regions highlight patients' weights vs. patient counts for each health status.


Pairplot :

Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists
of several pair-wise scatter plots of variables of the multivariate data. All the results are presented
in a matrix format.
 By visual examination of the chart, one can easily find relationships among the variables
such as correlation between the variables.
 A random matrix of three columns is chosen and the relationships of the columns are
plotted as a pairplot (or scatter matrix) as shown below in Figure 2.14.
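As a rough sketch (assuming the pandas and seaborn libraries are available), a heatmap and a pairplot of a random three-column matrix can be produced as follows; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])   # random 3-column data

sns.heatmap(df.corr(), annot=True)   # heatmap of the correlation matrix
plt.show()

sns.pairplot(df)                     # pair-wise scatter plots (scatter matrix)
plt.show()
```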

2.8 Essential Mathematics for Multivariate Data

Machine learning involves many mathematical concepts from the domain of Linear algebra,
Statistics, Probability and Information theory.
 'Linear Algebra' is a branch of mathematics that is central for many scientific applications
and other mathematical subjects.
 Linear algebra deals with linear equations, vectors, matrices, vector spaces and
transformations.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data

A linear system of equations is a group of equations with unknown variables.


 If there is a unique solution, then the system is called consistent independent.
 If there are infinitely many solutions, then the system is called consistent dependent.
 If there are no solutions, i.e., the equations are contradictory, then the system is called inconsistent.

For solving a large system of equations, Gaussian elimination can be used. The method reduces the augmented matrix of the system to row-echelon form using row operations and then obtains the solution by back substitution.

To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping two rows.
2. Multiplying or dividing a row by a non-zero constant.
3. Replacing a row by adding or subtracting a multiple of another row to it.
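In practice, a linear system Ax = b is rarely reduced by hand; a short sketch using NumPy (the coefficients below are an arbitrary example) looks like this:

```python
import numpy as np

# Hypothetical system:
#   2x + 1y + 1z = 5
#   1x + 3y + 2z = 10
#   1x + 0y + 0z = 1
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 0.0]])
b = np.array([5.0, 10.0, 1.0])

x = np.linalg.solve(A, b)   # internally uses an LU (Gaussian-elimination-style) factorization
print("Solution:", x)       # unique solution => consistent independent system
```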


2.8.2 Matrix Decompositions

It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations
can be performed. These methods are also known as matrix factorization methods.

 The most popular matrix decomposition is called eigen decomposition.
 It is a way of reducing a matrix into its eigen values and eigen vectors.
 Then, a real symmetric matrix A can be decomposed as:

A = Q Λ QT (2.23)

where Q is the matrix whose columns are the eigen vectors of A, Λ is the diagonal matrix of eigen values, and QT is the transpose of matrix Q.
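A small sketch verifying the eigen decomposition with NumPy for an arbitrary symmetric matrix:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])           # arbitrary symmetric matrix

eigvals, Q = np.linalg.eigh(A)       # eigh is intended for symmetric/Hermitian matrices
Lam = np.diag(eigvals)               # diagonal matrix of eigen values

A_reconstructed = Q @ Lam @ Q.T      # A = Q Λ Q^T
print(np.allclose(A, A_reconstructed))   # True
```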


LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices:
A = LU

 Here, L is a lower triangular matrix and U is an upper triangular matrix.
 The decomposition can be done using the Gaussian elimination method discussed in the previous section.
 First, an identity matrix is augmented to the given matrix.
 Then, the row operations of Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U.

Example 2.9 illustrates the application of Gaussian elimination to get LU.
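A sketch of LU decomposition with SciPy (the matrix is an arbitrary example); note that SciPy also returns a permutation matrix P, because rows may be swapped for numerical stability.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])

P, L, U = lu(A)                     # A = P @ L @ U
print("L =\n", L)
print("U =\n", U)
print(np.allclose(A, P @ L @ U))    # True
```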


2.8.3 Machine Learning and Importance of Probability and Statistics

Machine learning is closely linked with statistics and probability. Like linear algebra, statistics is at the heart of machine learning. The importance of statistics needs to be stressed, as without statistics the analysis of data is difficult.
 Probability is especially important for machine learning. Any data can be assumed to be
generated by a probability distribution.
 Machine learning datasets contain many variables, which may be generated by multiple distributions.

What is the objective of an experiment?

 It is about forming a hypothesis or constructing a model.


 Machine learning has many models. So, the theory of hypothesis, construction of models,
evaluating the models using hypothesis testing, significance of the model and evaluation of
the model are all linking factors of machine learning and probability theory.
 Similarly, the construction of datasets using sampling theory is also required.

Probability Distributions

A probability distribution of a variable, say X, summarizes the probabilities associated with X's events. A distribution is a parameterized mathematical function.
 In other words, a distribution is a function that describes the relationship between the observations in a sample space.
 A set of data is said to follow a distribution if it obeys the mathematical function that characterizes that distribution.
 The function can then be used to calculate the probability of individual observations.

Probability distributions are of two types:

1. Discrete probability distribution.


2. Continuous probability distribution.

Continuous Probability Distributions

“The relationships between the events for a continuous random variable and their probabilities
are called a continuous probability distribution.”

Normal, Rectangular, and Exponential distributions fall under this category.

1. Normal Distribution – The normal distribution is a continuous probability distribution, also known as the Gaussian distribution or bell-shaped curve distribution.
     In a normal distribution, data tends to be around a central value with no bias to the left or right.
     The heights of students, the blood pressure of a population, and the marks scored in a class can be approximated using a normal distribution.


2. Rectangular Distribution – This is also known as the uniform distribution. It has equal probability for all values in the range [a, b]. The uniform distribution is given as follows:

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise

3. Exponential Distribution – This is a continuous probability distribution used to describe the time between events in a Poisson process; its density is f(x) = λ e^(−λx) for x ≥ 0.
     The exponential distribution is a special case of the Gamma distribution with the shape parameter fixed at 1.
     This distribution is helpful in modelling the time until an event occurs.
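A sketch, using scipy.stats, of evaluating these three continuous distributions; the parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm, uniform, expon

x = np.linspace(0.0, 5.0, 6)

# Normal distribution with mean 2 and standard deviation 1
print(norm.pdf(x, loc=2.0, scale=1.0))

# Uniform (rectangular) distribution on [a, b] = [0, 4]; scale = b - a
print(uniform.pdf(x, loc=0.0, scale=4.0))

# Exponential distribution with rate lambda = 0.5, i.e. scale = 1 / lambda = 2
print(expon.pdf(x, scale=2.0))
```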

Discrete Distribution

Binomial, Poisson, and Bernoulli distributions fall under this category.

1. Binomial Distribution – The binomial distribution is another distribution that is often encountered in machine learning.
     Each trial has only two outcomes: success or failure; such a trial is called a Bernoulli trial.
     The objective of this distribution is to find the probability of getting k successes out of n trials.

The probability of getting k successes out of n trials is given as:

P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)

where p is the probability of success in a single trial and C(n, k) is the number of ways of choosing k successes out of n trials.


2. Poisson Distribution – This is another important and useful distribution. Given an interval of time, it is used to model the probability of a given number of events k occurring in that interval, where the mean rate of events is λ:

P(X = k) = (λ^k × e^(−λ)) / k!

     Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office.

3. Bernoulli Distribution – This distribution models an experiment whose outcome is binary. The outcome is positive (1) with probability p and negative (0) with probability 1 − p.

The PMF of this distribution is given as:

P(X = x) = p^x × (1 − p)^(1 − x), for x ∈ {0, 1}
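A sketch, using scipy.stats, of the three discrete distributions above; the parameters are chosen arbitrarily.

```python
from scipy.stats import binom, poisson, bernoulli

# Binomial: probability of k = 3 successes in n = 10 trials with success probability p = 0.5
print(binom.pmf(3, n=10, p=0.5))

# Poisson: probability of k = 2 events when the mean rate is lambda = 4
print(poisson.pmf(2, mu=4))

# Bernoulli: probabilities of the two outcomes when p = 0.3
print(bernoulli.pmf([0, 1], p=0.3))
```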


Density Estimation

Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known.
 Density estimation is the problem of estimating the density function from the observed data.
 The estimated density function, denoted p(x), can be evaluated directly for any unknown data point, say xt, as p(xt).
 If p(xt) is less than a threshold ε, then xt is categorized as an outlier or anomaly; otherwise, it is treated as normal data.

There are two types of density estimation methods:

1. Parametric Density Estimation: It assumes that the data comes from a known probability distribution whose density can be estimated as p(x | Θ), where Θ is the set of parameters. The maximum likelihood method is a parametric estimation method.

a. Maximum Likelihood Estimation: For a sample of observations, one can estimate the probability distribution. This is called density estimation.


 Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be
used for density estimation.
 This involves formulating a function called the likelihood function, which is the conditional probability of observing the given samples under the distribution function and its parameters.
 For example, if the observations are X = {x1, x2, …, xn}, then density estimation is
the problem of choosing a PDF with suitable parameters to describe the data.

If one assumes that the regression problem can be framed as predicting output y given input x, then MLE amounts to choosing the parameters that maximize the likelihood of p(y | x) over the observed samples.

Here, h is the linear regression model whose parameters are being estimated.
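As a minimal sketch of parametric density estimation by maximum likelihood, assume the data come from a single Gaussian; the MLE of its parameters is then simply the sample mean and the (biased) sample standard deviation. The data values are made up.

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.7])   # hypothetical observations

mu_hat = x.mean()                  # MLE of the Gaussian mean
sigma_hat = x.std(ddof=0)          # MLE of the standard deviation (divides by N)

p = norm(loc=mu_hat, scale=sigma_hat)
x_t = 7.5
print("p(x_t) =", p.pdf(x_t))      # a very small density suggests x_t may be an anomaly
```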

b. Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm: In machine learning, clustering is one of the important tasks.

 The MLE framework is quite useful for designing model-based methods for clustering data.

 A model is a statistical method, and the data is assumed to be generated by a distribution model with its parameter set θ. Since a Gaussian is normally assumed for the data, this mixture model is categorized as a Gaussian Mixture Model (GMM).


The EM algorithm is one algorithm that is commonly used for estimating the MLE in the presence
of latent or missing variables.

The EM algorithm is effective for estimating the PDF in the presence of latent variables. Generally, there can be many unspecified distributions with different sets of parameters. The EM algorithm has two stages:

 Expectation (E) Stage – In this stage, the expected PDF and its parameters are estimated
for each latent variable.
 Maximization (M) stage – In this, the parameters are optimized using the MLE function.
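A rough sketch (assuming scikit-learn is installed) of fitting a two-component GMM with the EM algorithm; the one-dimensional data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic 1-D data drawn from two Gaussians (two latent clusters)
x = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(6.0, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)   # fitted internally via EM
gmm.fit(x)

print("Estimated means:", gmm.means_.ravel())
print("Cluster of x = 5.5:", gmm.predict([[5.5]])[0])
```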

2. Non-parametric Density Estimation: A non-parametric estimation can be generative or discriminative. The Parzen window is a generative estimation method that estimates p(x | Θ) as a conditional density, while discriminative methods directly compute the posterior probability p(Θ | x). The Parzen window and the k-Nearest Neighbour (KNN) rule are examples of non-parametric density estimation.

KNN Estimation: KNN estimation is another non-parametric density estimation method. Here, the parameter k is fixed in advance and, for a query point, its k nearest neighbours are found. The density estimate at the query point is then proportional to k divided by the volume of the neighbourhood enclosing those k neighbours, i.e., p(x) ≈ k / (N × V).
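A small sketch of the KNN density idea in one dimension, under the p(x) ≈ k / (N × V) formulation described above; this is an illustrative implementation, not taken from the notes.

```python
import numpy as np

def knn_density(x_query, data, k=5):
    """1-D k-nearest-neighbour density estimate: p(x) ~ k / (N * V)."""
    distances = np.sort(np.abs(data - x_query))
    radius = distances[k - 1]             # distance to the k-th nearest neighbour
    volume = 2.0 * radius                 # length of the enclosing interval in 1-D
    return k / (len(data) * volume)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 500)          # synthetic sample
print(knn_density(0.0, data), knn_density(4.0, data))   # dense centre vs. sparse tail
```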


2.9 Overview of Hypothesis

Data collection alone is not enough. Data must be interpreted to reach a conclusion, and the conclusion should be a structured outcome. This assumed outcome is called a hypothesis.
 Statistical methods are used to confirm or reject the hypothesis.
 The default assumption of a statistical test is called the null hypothesis, also written as hypothesis zero (H0). The violation of this hypothesis is called the alternate hypothesis (H1) or hypothesis one.
There are two types of hypothesis tests, parametric and non-parametric.
 Parametric tests are based on parameters such as the mean and standard deviation.
 Non-parametric tests depend on characteristics such as the independence of events or the data following a certain distribution.
The general steps of a statistical test are to:
1. Define the null and alternate hypotheses.
2. Describe the hypothesis using parameters.
3. Identify the statistical test and the test statistic.
4. Decide the criterion, i.e., the significance level α.
5. Compute the p-value (probability value).
6. Take the final decision of accepting or rejecting the null hypothesis by comparing the p-value with α.

2.9.1 Comparing Learning Methods

Some of the statistical tests used for comparing learning methods are given below:

 Z-test:

The Z-test assumes a normal distribution of data whose population variance is known, and the sample size is assumed to be large. The focus is on testing the population mean. The z-statistic is given by:

z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean, μ is the population mean under the null hypothesis, σ is the population standard deviation and n is the sample size.


t-test and Paired t-test:

The t-test is a hypothesis test that checks whether the difference between two samples' means is real or due to chance. The data should be continuous and randomly selected, there are only a small number of samples, and the variance between the groups matters. The t-statistic follows a t-distribution under the null hypothesis and is used when the number of samples is less than 30; the distribution follows the t-distribution rather than the Gaussian distribution. The test indicates whether two groups are different or not.

Let there be one group {8, 6, 7, 6, 3} and another group {7, 8, 8, 8, 2}. The number of elements in each group (n) is 5. For the first group, the mean and variance are 6 and 3.5, respectively. For the second group, the mean and variance are 6.6 and 6.8, respectively. For a two-sample test, the degrees of freedom is n1 + n2 − 2 = 8. Applying the t-statistic formula to the above data gives a value of −0.42, with a corresponding one-tailed p-value of about 0.343. Since this is far larger than the usual significance level, the test is insignificant: the two groups are not significantly different.
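A sketch of checking this example with SciPy using an independent two-sample t-test; note that SciPy reports the two-tailed p-value.

```python
from scipy.stats import ttest_ind

group1 = [8, 6, 7, 6, 3]
group2 = [7, 8, 8, 8, 2]

t_stat, p_value = ttest_ind(group1, group2)   # pooled-variance two-sample t-test
print("t =", round(t_stat, 2), "p =", round(p_value, 3))   # t ≈ -0.42, p ≈ 0.69 (two-tailed)
```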

Chi-Square Test:

The Chi-Square test is a non-parametric test. The goodness-of-fit test statistic follows a Chi-Square distribution under the null hypothesis and measures the statistical significance of the difference between the observed frequencies and the expected frequencies, where each observation is assumed to be independent of the others. This comparison is used to calculate the value of the Chi-Square statistic as:

χ² = Σ (O − E)² / E


Here, E is the expected frequency, O is the observed frequency, and the degrees of freedom is C − 1, where C is the number of categories. In data preprocessing, the Chi-Square test can also detect correlated (redundant) attributes and thus helps to remove redundancy in the data.
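A sketch of a goodness-of-fit test with SciPy; the observed and expected frequencies are made-up counts.

```python
from scipy.stats import chisquare

observed = [18, 22, 20, 40]        # hypothetical observed counts in 4 categories
expected = [25, 25, 25, 25]        # expected counts under the null hypothesis

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print("chi-square =", chi2, "p =", p_value)   # a small p-value => reject the null hypothesis
```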


2.10 Feature Engineering and Dimensionality Reduction Techniques


Features are attributes. Feature engineering is about determining the subset of features that forms an important part of the input and improves the performance of the model, be it a classification model or any other model in machine learning.

 Feature engineering deals with two problems – Feature Transformation and Feature
Selection.
 Feature transformation is extraction of features and creating new features that may be
helpful in increasing performance.
 For example, the height and weight may give a new attribute called Body Mass Index
(BMI).
 Feature subset selection is another important aspect of feature engineering that focuses on
selection of features to reduce the time but not at the cost of reliability.

The subset selection reduces the dataset size by removing irrelevant features and constructs a minimum set of attributes for machine learning. If the dataset has n attributes, the time complexity is extremely high, as n dimensions need to be processed for the given dataset; for n attributes, there are 2^n possible subsets.

The features can be removed based on two aspects:

1. Feature relevancy – Some features contribute more to classification than other features. For example, a mole on the face can help in face detection more than common features like the nose. In simple words, the features should be relevant.

2. Feature redundancy – Some features are redundant. For example, when a database table has a field called Date of Birth, the Age field is redundant, as age can be computed easily from the date of birth. Removing the Age column reduces the dimension by one.

The procedure is:

a. Generate all possible subsets.
b. Evaluate the subsets and the model performance.
c. Evaluate the results for optimal feature selection.

Some important algorithms that fall under this category are given below.

2.10.1 Stepwise Forward Selection


This procedure starts with an empty set of attributes. Every time, an attribute is tested for
statistical significance for best quality and is added to the reduced set. This process is continued
till a good reduced set of attributes is obtained.

2.10.2 Stepwise Backward Elimination


This procedure starts with a complete set of attributes. At every stage, the procedure removes the
worst attribute from the set, leading to the reduced set.
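A rough sketch (assuming scikit-learn is available) of stepwise forward selection and backward elimination using SequentialFeatureSelector; the dataset and estimator are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add the best attribute at each step
forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)

# Backward elimination: start from all attributes and remove the worst one at each step
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print("Forward-selected feature indices:", forward.get_support(indices=True))
print("Backward-retained feature indices:", backward.get_support(indices=True))
```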


2.10.3 Principal Component Analysis

The idea of the principal component analysis (PCA) or KL transform is to transform a given set
of measurements to a new set of features so that the features exhibit high information packing
properties. This leads to a reduced and compact set of features. Basically, this elimination is
made possible because of the information redundancies. This compact representation is of a
reduced dimension.

The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures most of the variance of the data. The variance is captured by fewer components, which give nearly the same result as using all the original attributes.

The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant
attributes.


The PCA algorithm is as follows:

1. The target dataset x is obtained.


2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is
X – m. The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The
eigen values are arranged in a descending order. The feature vector is formed with these
eigen vectors in its columns.
Feature vector = {eigen vector1, eigen vector2, …, eigen vectorn}

6. Obtain the transpose of feature vector. Let it be A.


7. PCA transform is y = A × (x – m), where x is the input dataset, m is the mean, and A is the
transpose of the feature vector.

 The new data is a dimensionally reduced matrix that represents the original data.
 Therefore, PCA is effective in removing the attributes that do not contribute to the variance.
 If the original data is required, it can be recovered with no loss of information provided all the eigen vectors are retained; with fewer components, the reconstruction is approximate.
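A minimal NumPy sketch following the seven steps above; the small data matrix is a made-up example in which rows are samples and columns are attributes.

```python
import numpy as np

# Hypothetical dataset: 5 samples, 3 attributes
x = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 0.4]])

m = x.mean(axis=0)                         # step 2: mean vector
x_adj = x - m                              # zero-mean dataset
C = np.cov(x_adj, rowvar=False)            # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # step 4: eigen values / eigen vectors (C is symmetric)

order = np.argsort(eigvals)[::-1]          # step 5: sort eigen values in descending order
feature_vector = eigvecs[:, order[:2]]     # keep the top-2 principal components

A = feature_vector.T                       # step 6: transpose of the feature vector
y = A @ x_adj.T                            # step 7: y = A x (x - m); columns are transformed samples
print(y.T)                                 # dimensionally reduced data (5 samples, 2 features)
```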


2.10.4 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique: unlike PCA, it uses the class labels and projects the data onto directions that maximize the separation between the classes.

2.10.5 Singular Value Decomposition

Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be a given matrix; then A can be decomposed as:

A = USVT (2.66)

Here, A is the given matrix of dimension m × n, U is a matrix with orthonormal columns of dimension m × n, S is the diagonal matrix of singular values of dimension n × n, and V is an n × n orthogonal matrix.

The procedure for finding the decomposition matrices is as follows:

1. For the given matrix A, find AAT.
2. Find the eigen values of AAT.
3. Sort the eigen values in descending order and pack the corresponding eigen vectors as the columns of a matrix U.
4. Arrange the square roots of the eigen values (the singular values) along the diagonal to form the diagonal matrix S.
5. Find the eigen values and eigen vectors of ATA and pack the eigen vectors as the columns of a matrix called V.


Thus, A = USVT, where U and V are orthogonal matrices whose columns are the left and right singular vectors, respectively. SVD is useful in compression, as one can decide to retain only a few components instead of the original matrix A, i.e., A is approximated by the sum of si × ui × viT over the k largest singular values si.


The main advantages of SVD are:

 Compression: a matrix, say an image, can be decomposed and only certain components selectively retained by setting all other singular values to zero. This reduces the storage needed for the image while retaining most of its quality.
 SVD is useful in data reduction too.
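A sketch of SVD-based compression with NumPy; the matrix is random, standing in for an image.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))                        # stand-in for an image matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                         # retain only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print("Relative reconstruction error with k =", k, ":", error)
```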

******************************************

"To succeed in your mission, you must have single-minded devotion to your goal."

Dr. APJ Abdul Kalam
