
MODULE-1

Introduction
NEED FOR MACHINE LEARNING
• BUSINESS ORGANIZATIONS HAVE VAST AMOUNTS OF DATA
• THEY NEED TO ANALYZE THIS DATA TO TAKE DECISIONS
Machine learning has become so popular because of three reasons:
1. High volume of available data to manage: Big companies such as Facebook,
Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate.
It is estimated that the data approximately doubles every year.
2. The second reason is that the cost of storage has reduced. The hardware cost
has also dropped. Therefore, it is easier now to capture, process,
store, distribute, and transmit digital information.
3. The third reason for the popularity of machine learning is the availability of
complex algorithms. Especially with the advent of deep learning,
many algorithms are now available for machine learning.
NEED FOR MACHINE LEARNING

Before starting the machine learning journey, let us establish these terms - data, information, knowledge,

intelligence, and wisdom. A knowledge pyramid is shown in Figure 1.1.


What is data?

All facts are data. Data can be numbers or text that can be
processed by a computer. Today, organizations are accumulating
vast and growing amounts of data with data sources such as flat
files, databases, or data warehouses in different storage formats.

Processed data is called information.

This includes patterns, associations, or relationships among data.

For example, sales data can be analyzed to extract information such as which product is the fastest selling.
Condensed information is called knowledge.
For example, the historical patterns and future trends obtained in the
above sales data can be called knowledge. Unless knowledge is
extracted, data is of no use. Similarly, knowledge is not useful unless it
is put into action.
Intelligence is the applied knowledge for actions.
An actionable form of knowledge is called intelligence. Computer
systems have been successful till this stage.
The ultimate objective of the knowledge pyramid is wisdom, which represents the
maturity of mind that is, so far, exhibited only by humans.
NEED FOR MACHINE LEARNING

The objective of machine learning is to process this archival data so that organizations can take better
decisions, design new products, improve business processes, and develop effective
decision support systems.
MACHINE LEARNING EXPLAINED
 Machine learning is an important sub-branch of Artificial Intelligence (AI).
 A frequently quoted definition of machine learning was by Arthur

Samuel, one of the pioneers of Artificial Intelligence.

 He stated that "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."

 The key to this definition is that the system should learn by itself.
MACHINE LEARNING EXPLAINED

 In conventional programming, after understanding the problem, a


detailed design of the program such as a flowchart or an algorithm needs
to be created and converted into programs using a suitable programming
language.
 This approach could be difficult for many real-world problems such
as puzzles, games, and complex image recognition applications.
 Initially, artificial intelligence aims to understand these problems and develop
general purpose rules manually.
 Then, these rules are formulated into logic and implemented in a
program to create intelligent systems.
MACHINE LEARNING EXPLAINED
 This idea of developing intelligent systems by using logic and reasoning
by converting an expert’s knowledge into a set of rules and programs is called an
expert system.
 The focus of AI is to develop intelligent systems by using a data-driven
approach, where data is used as input to develop intelligent models.
 The models can then be used to make predictions for new inputs.
 Thus, the aim of machine learning is to learn a model or set of rules from
the given dataset automatically so that it can predict the unknown data
correctly.
MACHINE LEARNING EXPLAINED
 As humans take decisions based on experience, computers make models based
on patterns extracted from the input data and then use these models for prediction and
to take decisions.
 For computers, the learnt model is equivalent to human experience. This is shown in Figure 1.2.
MACHINE LEARNING EXPLAINED

 Often, the quality of data determines the quality of experience and,


therefore, the quality of the learning system.
 In statistical learning, the relationship between the input x and output y is
modeled as a function in the form y = f(x).
 Here, f is the learning function that maps the input x to output y.
 Learning of function f is the crucial aspect of forming a model in statistical
learning.
 In machine learning, this is simply called mapping of input to output.
MACHINE LEARNING EXPLAINED

 The learning program summarizes the raw data in a model.

 Formally stated, a model is an explicit description of patterns within the data in

the form of:

1. Mathematical equation

2. Relational diagrams like trees/graphs

3. Logical if/else rules, or

4. Groupings called clusters


MACHINE LEARNING EXPLAINED

Another pioneer of AI, Tom Mitchell’s definition of machine learning states that, “A
computer program is said to learn from experience E, with respect to task T and some
performance measure P, if its performance on T measured by P improves with experience
E.”
The important components of this definition are experience E, task T, and performance
measure P.
MACHINE LEARNING EXPLAINED
 For example, the task T could be detecting an object in an image.
 The machine can gain knowledge of the object using a training dataset of
thousands of images. This is called experience E.
 So, the focus is to use this experience E for this task of object
detection T.
 The ability of the system to detect the object is measured by
performance measures like precision and recall.
 Based on the performance measures, course correction can be done to improve
the performance of the system.
 Models of computer systems are equivalent to human experience.
 Experience is based on data.
 Humans gain experience by various means.
MACHINE LEARNING EXPLAINED

 Once the knowledge is gained, when a new problem is
encountered, humans search for similar past situations and then formulate
heuristics and use them for prediction.
 In systems, experience is gathered by these steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data.
Abstraction is used to generate concepts. This is equivalent to humans' idea
of objects; for example, we have some idea of what an elephant looks like.
MACHINE LEARNING EXPLAINED

3. Generalization converts the abstraction into an actionable form of intelligence.

• It can be viewed as ordering of all possible concepts.
• It involves ranking of concepts,
• inferencing from them, and
• formation of heuristics, an actionable aspect of intelligence.
• Heuristics are educated guesses for all tasks.
For example, if one runs on encountering danger, it is the result of human
experience or heuristic formation. In machines, it happens in the same way.
4. Evaluation – Course correction is done by taking evaluation measures.
Evaluation checks the thoroughness of the models and does course correction, if
necessary, to generate better formulations.
MACHINE LEARNING IN RELATION TO OTHER FIELDS

 Machine learning uses the concepts of Artificial Intelligence, Data Science,


and Statistics primarily.

 It is the resultant of combined ideas of diverse fields.


MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine Learning and Artificial Intelligence
 Machine learning is an important branch of AI, which is a much broader subject.
 AI aims to develop intelligent agents.
 An agent can be a robot, a human, or any autonomous system.
 The resurgence in AI happened due to the development of data-driven systems.
 The aim is to find relations and regularities present in the data.
 Machine learning is the subbranch of AI, whose aim is to extract the patterns for
prediction.
 It is a broad field that includes learning from examples and other areas like
reinforcement learning.
 The relationship of AI and machine learning is shown in Figure 1.3.
 The model can take an unknown instance and generate results
MACHINE LEARNING IN RELATION TO OTHER FIELDS
 Deep learning is a subbranch of machine learning.
 In deep learning, the models are constructed using neural network technology.
 Neural networks are based on models of human neurons.
 Many neurons form a network connected with the activation
functions that trigger further neurons to perform tasks.
MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine Learning, Data Science, Data Mining, and Data Analytics

 Data science is an ‘Umbrella’ term that encompasses many fields.


 Machine learning starts with data.
 Therefore, data science and machine learning are interlinked.
 Machine learning is a branch of data science.
 Data science deals with gathering of data for analysis.
 It is a broad field that includes:
Big Data: Data science is concerned with the collection of data.
Big data is a field of data science that deals with data’s following
characteristics:
1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter,
and YouTube.
2. Variety: Data is available in a variety of forms like images, videos, and in different
formats.
3. Velocity: It refers to the speed at which the data is generated and processed.
MACHINE LEARNING IN RELATION TO OTHER FIELDS

 Big data is used by many machine learning algorithms for applications such as
language translation and image recognition.

 Big data influences the growth of subjects like Deep learning.


Deep learning is a branch of machine learning that deals with constructing models
using neural networks.
Data Mining
 Data mining's original genesis is in business.
 Mining or unearthing the data produces hidden information that would otherwise
have eluded the attention of the management.
 Nowadays, many consider that data mining and machine learning are the same. There is no
difference between these fields except that data mining aims to extract the hidden
patterns that are present in the data, whereas machine learning aims to use the data for
prediction.
MACHINE LEARNING IN RELATION TO OTHER FIELDS
Data Analytics Another branch of data science is data analytics.
 It aims to extract useful knowledge from crude data. There are different types of
analytics.
 Predictive data analytics is used for making predictions. Machine learning is closely
related to this branch of analytics and shares almost all algorithms.
Pattern Recognition It is an engineering field.
 It uses machine learning algorithms to extract the features for pattern analysis
and pattern classification.
 One can view pattern recognition as a specific application of machine learning.
These relations are summarized in Figure 1.4.
MACHINE LEARNING IN RELATION TO OTHER FIELDS

Machine Learning and Statistics


 Statistics is a branch of mathematics that has a solid theoretical
foundation regarding statistical learning.
 Like machine learning (ML), it can learn from data.
 But the difference between statistics and ML is that statistical methods look for
regularity in data called patterns.
 Initially, statistics sets a hypothesis and performs experiments to verify and
validate the hypothesis in order to find relationships among data.
TYPES OF MACHINE LEARNING
 Data is a raw fact.
 Normally, data is represented in the form of a table.
 Data also can be referred to as a data point, sample, or an example.
 Each row of the table represents a data point.
 Features are attributes or characteristics of an object.
 Normally, the columns of the table are attributes.
 Out of all attributes, one attribute is important and is called a label.
 Label is the feature that we aim to predict.
 Thus, there are two types of data – labelled and unlabelled.
TYPES OF MACHINE LEARNING

Labelled Data
 To illustrate labelled data, let us take one example dataset called
Iris flower dataset or Fisher’s Iris dataset.
 The dataset has 50 samples of each of the three Iris species (150 samples in total), with four attributes: the length and width of sepals and petals.
 The target variable is called class.
 There are three classes – Iris setosa, Iris versicolor, and Iris virginica.
 The partial data of the Iris dataset is shown in Table 1.1.
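As an illustrative sketch (assuming scikit-learn and pandas are installed), the labelled Iris data can be loaded and inspected as follows; the column and class names come from scikit-learn's bundled copy of Fisher's dataset:

# Minimal sketch: inspecting the labelled Iris dataset (scikit-learn, pandas assumed installed)
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()                                             # 150 samples, 4 attributes, 3 classes
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = [iris.target_names[t] for t in iris.target]      # the label column

print(df.head())                 # first few labelled data points (rows)
print(df["class"].unique())      # setosa, versicolor, virginica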
TYPES OF MACHINE LEARNING

A dataset need not be always numbers. It can be images or video frames. Deep
neural networks can handle images with labels. In the following Figure 1.6, the
deep neural network takes images of dogs and cats with labels for classification.

Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset

In unlabelled data, there are no labels in the dataset.


TYPES OF MACHINE LEARNING

Supervised Learning
 Supervised algorithms use labelled dataset.
 As the name suggests, there is a supervisor or teacher component
in supervised learning.

 A supervisor provides labelled data with which the model is constructed and later tested on test data.
 In supervised learning algorithms, learning takes place in two stages.
TYPES OF MACHINE LEARNING
Supervised Learning
 In layman's terms, during the first stage, the teacher communicates the
information to the student that the student is supposed to master.
 The student receives the information and understands it.
 During this stage, the teacher has no knowledge of whether the information is grasped by the student.
 This leads to the second stage of learning.
 The teacher then asks the student a set of questions to find out how much information has been
grasped by the student.
 Based on these questions the student is tested, and the teacher informs the student about his
assessment.
 This kind of learning is typically called supervised learning.
 Supervised learning has two methods:
1. Classification 2. Regression
TYPES OF MACHINE LEARNING
Supervised Learning-Classification

 Classification is a supervised learning method.


 The input attributes of the classification algorithms are called independent variables.
 The target attribute is called the label or dependent variable.
 The relationship between the input and target variables is represented in the form of a structure called a classification model.
 So, the focus of classification is to predict the 'label', which is in a discrete form (a value
from a finite set of values).
TYPES OF MACHINE LEARNING
Supervised Learning-Classification
 An example is shown in Figure 1.7 where a classification
algorithm takes a set of labelled data images such as dogs and cats to
construct a model that can later be used to classify an unknown test
image data.
TYPES OF MACHINE LEARNING
Supervised Learning-Classification
 In classification, learning takes place in two stages.
 During the first stage, called training stage, the learning algorithm takes a
labelled dataset and starts learning.
 After the training-set samples are processed, the model is generated.
 In the second stage, the constructed model is tested with a test or unknown sample,
which is then assigned a label.
 This is the classification process.
This is illustrated in the above Figure 1.7.
Initially, the classification learning algorithm learns with the collection of labelled
data and constructs the model. Then, a test case is selected, and the model assigns a
label.
TYPES OF MACHINE LEARNING

Supervised Learning-Classification

 In the case of the Iris dataset, if a test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the
classification model will generate the label for it.


 This is called classification.
 One of the examples of classification is – Image recognition, which

includes classification of diseases like cancer, classification of plants, etc.


TYPES OF MACHINE LEARNING
Supervised Learning-Classification

The classification models can be categorized based on the


implementation technology like decision trees, probabilistic
methods, distance measures, and soft computing methods.
 Classification models can also be classified as generative models and
discriminative models.
 Generative models deal with the process of data generation and
its distribution.
 Probabilistic models are examples of generative models.
 Discriminative models do not care about the generation of data. Instead,
they simply concentrate on classifying the given data.
TYPES OF MACHINE LEARNING
Supervised Learning-Classification

Some of the key algorithms of classification are:


 Decision Tree
 Random Forest
 Support Vector Machines
 Naïve Bayes
 Artificial Neural Network and Deep Learning networks like
CNN
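A minimal sketch of the two-stage classification process, assuming scikit-learn is available; the decision tree and the train/test split are illustrative choices, and the final line classifies the unknown Iris sample (6.3, 2.9, 5.6, 1.8, ?) mentioned above:

# Sketch: two-stage classification (training, then testing) with a decision tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # stage 1: training
print("test accuracy:", model.score(X_test, y_test))      # stage 2: testing

# Label for the unknown Iris sample (6.3, 2.9, 5.6, 1.8, ?) from the text (class index 0/1/2)
print("predicted class index:", model.predict([[6.3, 2.9, 5.6, 1.8]])[0])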
TYPES OF MACHINE LEARNING
Supervised Learning-Regression Models
 Regression models, unlike classification algorithms, predict continuous
variables like price. In other words, it is a number.
 A fitted regression model is shown in Figure 1.8 for a dataset that represents weeks (input x) and product sales (y).
The regression model takes input x and generates a model in the form of a
fitted line of the form y = f(x).
Here, x is the independent variable that may be one or more attributes
and y is the dependent variable.
In Figure 1.8, linear regression takes the training set and tries to fit it with a
line – product sales = 0.66 × Week + 0.54.
 Here, 0.66 and 0.54 are the regression coefficients that are learnt from the data.
 The advantage of this model is that a prediction for product sales (y) can be
made for an unknown week (x).
 For example, the prediction for the unknown eighth week can be made by
substituting x = 8 in the regression formula to get y.
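A minimal sketch of fitting such a regression model with scikit-learn; the weekly sales values below are illustrative (the dataset behind Figure 1.8 is not given) and only demonstrate learning y = f(x) and predicting the eighth week:

# Sketch: fitting a line y = f(x) to (week, sales) data and predicting week 8.
# The training values here are illustrative, not the dataset behind Figure 1.8.
import numpy as np
from sklearn.linear_model import LinearRegression

weeks = np.array([[1], [2], [3], [4], [5], [6], [7]])      # independent variable x
sales = np.array([1.2, 1.9, 2.5, 3.2, 3.8, 4.5, 5.2])      # dependent variable y

reg = LinearRegression().fit(weeks, sales)
print("learnt coefficients:", reg.coef_[0], reg.intercept_)
print("predicted sales for week 8:", reg.predict([[8]])[0])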
TYPES OF MACHINE LEARNING
Supervised Learning - Classification vs Regression

 Both regression and classification models are supervised algorithms.


 Both have a supervisor and the concepts of training and testing are applicable
to both.
 What is the difference between classification and regression models?
 The main difference is that regression models predict continuous variables such as
product price, while classification concentrates on assigning labels such as class.
TYPES OF MACHINE LEARNING

Unsupervised Learning
The second kind of learning is by self-instruction. As the name suggests, there is no
supervisor or teacher component. In the absence of a supervisor or
teacher, self-instruction is the most common kind of learning process.
 This process of self-instruction is based on the concept of trial and error.
 Here, the program is supplied with objects, but no labels are defined.
 The algorithm itself observes the examples and recognizes patterns based on
the principles of grouping.
 Grouping is done in ways that similar objects form the same group.
 Cluster analysis and Dimensional reduction algorithms are examples of
unsupervised algorithms.
TYPES OF MACHINE LEARNING
Unsupervised Learning-Cluster Analysis

 Cluster analysis is an example of unsupervised learning.

 It aims to group objects into disjoint clusters or groups.

 Cluster analysis clusters objects based on their attributes.

 All the data objects of the partitions are similar in some

aspect and vary from the data objects in the other partitions

significantly.
• Some of the key clustering algorithms are:
o k-means algorithm
o Hierarchical algorithms
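A minimal k-means sketch, assuming scikit-learn is installed; the six 2-D points are illustrative:

# Sketch: grouping unlabelled points into clusters with k-means
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],      # one natural group
                   [8, 8], [9, 10], [8, 9]])    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)        # similar objects fall in the same group
print("cluster centres:", kmeans.cluster_centers_)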
Dimensionality Reduction

• Dimensionality reduction algorithms are examples of unsupervised algorithms.
• They take higher-dimensional data as input and output the data in a lower dimension by taking advantage of the variance of the data.
• It is the task of reducing a dataset to fewer features without losing generality (a sketch follows below).
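A minimal dimensionality reduction sketch using PCA on the 4-attribute Iris data, assuming scikit-learn is installed:

# Sketch: reducing the 4-dimensional Iris data to 2 dimensions with PCA
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)                # 150 x 4 matrix
pca = PCA(n_components=2)                        # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                           # (150, 2)
print("variance retained:", pca.explained_variance_ratio_.sum())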

Differences between Supervised and Unsupervised Learning
3. Semi-supervised Learning
• There are circumstances where the dataset has a huge collection of unlabelled
data and some labelled data.
• Labelling is a costly process and difficult to perform by the humans.
• Semi-supervised algorithms use unlabelled data by assigning a pseudo-label.
Then, the labelled and pseudo-labelled datasets can be combined, as sketched below.
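A minimal pseudo-labelling sketch using scikit-learn's SelfTrainingClassifier (available from version 0.24); marking most labels as -1 simulates a largely unlabelled dataset:

# Sketch: pseudo-labelling with SelfTrainingClassifier (scikit-learn >= 0.24 assumed)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
y_partial = y.copy()
rng = np.random.RandomState(0)
y_partial[rng.rand(len(y)) < 0.7] = -1           # -1 marks the unlabelled majority

model = SelfTrainingClassifier(DecisionTreeClassifier()).fit(X, y_partial)
print("accuracy on all true labels:", model.score(X, y))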
4. Reinforcement Learning
• Reinforcement learning mimics human beings. Like human beings use ears and eyes to
perceive the world and take actions, reinforcement learning allows the agent to interact with
the environment to get rewards.
• The agent can be human, animal, robot, or any independent program.
• The rewards enable the agent to gain experience. The agent aims to maximize
the reward.
• The reward can be positive or negative (Punishment). When the rewards are more, the
behavior gets reinforced and learning becomes possible.
CHALLENGES OF MACHINE LEARNING
1. Problems – Machine learning can deal with the ‘well-posed’ problems where
specifications are complete and available. Computers cannot solve ‘ill-posed’ problems.

2. Huge data – This is a primary requirement of machine learning. Availability of quality
data is a challenge. Quality data means the data should be large and should not have
problems such as missing data or incorrect data.

3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even
Tensor Processing Unit (TPU) are required to execute machine learning algorithms.

4. Complexity of the algorithms – The selection of algorithms, describing the


algorithms, application of algorithms to solve machine learning task, and comparison of
algorithms have become necessary for machine learning or data scientists now.

5. Bias/Variance – Variance is the error of the model. This leads to a problem called the bias/
variance tradeoff. A model that fits the training data correctly but fails for test
data, and in general lacks generalization, is said to be overfitting. The reverse problem,
where the model fails to fit even the training data, is called underfitting.
MACHINE LEARNING PROCESS

1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm
is enough for giving the solution.
2. Understanding the data – It involves steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns
to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by
cleaning raw data and the preparation of data for the data mining process.
4. Modelling – This step plays a role in the application of the data mining algorithm to the
data to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined
by evaluating the accuracy of the classifier.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.
Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the
words of documents are converted to sentiments like happy, sad, and angry which are captured by
emoticons effectively. For movie reviews or product reviews, five stars or one star are automatically
attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases
possible. For example, Amazon recommends related books or books bought by people
who have the same taste as you, and Netflix suggests shows or related movies of your taste.
These recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and
Google Assistant are all examples of voice assistants. They take speech commands and perform
tasks. These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine
learning which offer to locate and navigate shortest paths to reduce time.
What is Data ?
• All facts are data. In computer systems, bits encode facts present
in numbers, text, images, audio, and video.
• Data is available in different data sources like flat files, databases, or data
warehouses.
• It can be either operational data or non-operational data.

• Data by itself is meaningless. It has to be processed to generate any


information. A string of bytes is meaningless. Only when a label is
attached like height of students of a class, the data becomes meaningful.
• Processed data is called information that includes patterns, associations, or
relationships among data.
Elements of Big Data
• Small Data

• Big Data

• Big Data can be characterized as follows :

1. Volume – Since there is a reduction in the cost of storage devices, there has
been a tremendous growth of data. Small traditional data is measured in terms
of gigabytes (GB) and terabytes (TB), but Big Data is measured in terms of
petabytes (PB) and exabytes (EB).

2. Velocity – The fast arrival speed of data and its increase in data volume is noted as
velocity. The availability of IoT devices and Internet power ensures that the data is
arriving at a faster rate.
3. Variety – The variety of Big Data includes:

a. Form – There are many forms of data. Data types range from
text, graph, audio, video, to maps.

b. Function – These are data from various sources like human


conversations, transaction records, and old archive data.

c. Source of data – This is the third aspect of variety. There are many sources
of data. Broadly, the data source can be classified as open/public data, social
media data and multimodal data.
4. Veracity of data – Veracity of data deals with aspects like

conformity to the facts, truthfulness, believability, and confidence in data. There may be
many sources of error such as technical errors, typographical errors, and human errors.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals
that are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the
information that is extracted from the data and its influence on the decisions that are
taken based on it.
Types of Data
1. Structured Data
Data is stored in an organized manner such as a database where it is available in the form of a table. The
data can also be retrieved in an organized manner using tools like SQL.
• The structured data frequently encountered in machine learning are listed below:
o Record Data A dataset is a collection of measurements taken from a process. We have a
collection of objects in a dataset and each object has a set of measurements. The
measurements can be arranged in the form of a matrix. Rows in the matrix represent an object
and can be called as entities, cases, or records. The columns of the dataset are called
attributes, features, or fields.
o Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data.
o Graph Data It involves the relationships among objects. For example, a web
page can refer to another web page. This can be modeled as a graph.
o Ordered Data Ordered data objects involve attributes that have an
implicit order among them. The examples of ordered data are:
 Temporal data – It is the data whose attributes are associated with time. For example, the customer purchasing patterns during
festival time is sequential data. Time series data is a special type of sequence data where the data is a series of measurements over
time.
 Sequence data – It is like sequential data but does not have time stamps. This data involves the sequence of words or letters. For
example, DNA data is a sequence of four characters – A T G C.
 Spatial data – It has attributes such as positions or areas. For example, maps are spatial data
where the points are related by location.
2. Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual
documents, programs, and blog data. It is estimated that 80% of the data are
unstructured data.

3. Semi-Structured Data

Semi-structured data are partially structured and partially unstructured. These


include data like XML/JSON data, RSS feeds, and hierarchical data.
Data Storage and Representation
Flat Files
• These are the simplest and most commonly available data source. It is also the
cheapest way of organizing the data.
• These flat files are the files where data is stored in plain ASCII or EBCDIC
format.
• Minor changes of data in flat files affect the results of the data mining
algorithms.
• Hence, flat file is suitable only for storing small dataset and not desirable if the
dataset becomes larger.
• Formats : CSV Files, TSV Files
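A minimal sketch of working with flat files in pandas; the file name is illustrative and the table is written out first so the example is self-contained:

# Sketch: a flat file (CSV) as the simplest data source; file name is illustrative
import pandas as pd

marks = pd.DataFrame({"student_id": [1, 2, 3, 4, 5],
                      "marks": [45, 60, 60, 80, 85]})
marks.to_csv("students_marks.csv", index=False)     # store the table as a plain-text flat file

df = pd.read_csv("students_marks.csv")              # read it back; sep="\t" would handle a TSV file
print(df)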
Data Storage and Representation

• Database System It normally consists of database files and a


database management system (DBMS).
– A transactional database is a collection of transactional records.

– Time-series database stores time related information like log files


where data is associated with a time stamp.
– Spatial databases contain spatial information in a raster or vector format.
Raster formats are either bitmaps or pixel maps. For example, images
can be stored as a raster data. On the other hand, the vector format can
be used to store maps as maps use basic geometric primitives like points,
lines, polygons and so forth
Data Storage and Representation
• World Wide Web (WWW) It provides a diverse, worldwide online
information source. The objective of data mining algorithms is to mine
interesting patterns of information present in WWW.
• XML (eXtensible Markup Language) It is both human and machine
interpretable data format that can be used to represent data that needs to be
shared across the platforms.
• Data Stream It is dynamic data, which flows in and out of the observing
environment. Typical characteristics of data stream are huge volume of data,
dynamic, fixed order movement, and real-time constraints.
• RSS (Really Simple Syndication) It is a format for sharing instant feeds
across services.
• JSON (JavaScript Object Notation) It is another useful data interchange format
that is often used for many machine learning algorithms.
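A minimal sketch of JSON as a data interchange format using Python's standard json module; the record shown is illustrative:

# Sketch: parsing and serializing a JSON record (standard-library json module)
import json

record = '{"student_id": 1, "marks": 45, "grade": "C"}'   # illustrative JSON record
data = json.loads(record)               # parse JSON text into a Python dictionary
print(data["marks"])

print(json.dumps(data, indent=2))       # serialize it back for sharing across platforms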
BIG DATA ANALYTICS AND TYPES OF ANALYTICS
• Data analysis is an activity that takes the data and generates useful
information and insights for assisting the organizations.
• There are four types of data analytics:

1. Descriptive analytics is about describing the main features of the data. After
data collection is done, descriptive analytics deals with the collected data and quantifies
it. It is often stated that analytics is essentially statistics.

2. Diagnostic Analytics deals with the question – ‘Why?’. This is also known as causal
analysis, as it aims to find out the cause and effect of the events. For example, if a
product is not selling, diagnostic analytics aims to find out the reason.
3. Predictive Analytics deals with the future. It
deals with the question – ‘What will happen in future given
this data?’. This involves the application of algorithms to identify the
patterns to predict the future.
4. Prescriptive Analytics is about finding the best course of
action for the business organizations. Prescriptive analytics goes
beyond prediction and helps in decision making by giving a set of
actions. It helps the organizations to plan better for the future and to
mitigate the risks that are involved.
BIG DATA ANALYSIS FRAMEWORK

• Big data framework is a layered


architecture.

• A 4-layer architecture has the following


layers:

1. Data connection layer

2. Data management layer

3. Data analytics layer

4. Presentation layer
BIG DATA ANALYSIS FRAMEWORK

• Data Connection Layer: It has data ingestion mechanisms and data


connectors. Data ingestion means taking raw data and importing it into
appropriate data structures. It performs the tasks of ETL process.
• Data Management Layer: It performs preprocessing of data. The
purpose of this layer is to allow parallel execution of queries, and read, write
and data management tasks.
• Data Analytics Layer: It has many functionalities such as statistical tests,
machine learning algorithms for understanding the data, and construction of
machine learning models. This layer implements many model
validation mechanisms too. The processing is done using Cloud Computing, Grid
Computing and H-Computing:
BIG DATA ANALYSIS FRAMEWORK

 Cloud computing is an emerging technology which is basically a business


service model or simply called as pay-per-usage model. The term ‘Cloud’
refers to the Internet that provides sharing of processing power,
applications, storage and services. It offers different kinds of services
such as IaaS, PaaS, and SaaS.
 Grid Computing Grid Computing is a parallel and distributed computing
framework consisting of a network of computers offering a super
computing service as a single virtual supercomputer.
 H-Computing (High Performance Computing or HPC) It enables to perform
complex tasks at high speed. It aggregates computing power in such a way
that provides much higher performance to solve complex problems in
science, engineering, research or business.
 Presentation Layer: It has mechanisms such as dashboards and applications
that display the results of the analytical processing.
BIG DATA ANALYSIS FRAMEWORK
• The Big Data Processing Cycle involves data management that consists of the following steps:

1. Data Collection
2. Data Preprocessing
3. Application of Machine Learning Algorithms
4. Interpretation of results and visualization of the machine learning algorithm's output
BIG DATA ANALYSIS FRAMEWORK DATA COLLECTION
• The first task in gathering datasets is the collection of data.
• It is often estimated that most of the time is spent on the collection of good-quality data.
• ‘Good data’ is one that has the following properties:

1. Timeliness – The data should be relevant and not stale or obsolete


data.
2. Relevancy – The data should be relevant and ready for the
machine learning or data mining algorithms. All the necessary
information should be available and there should be no bias in
the data.
3. Knowledge about the data – The data should be understandable
and interpretable, and should be self-sufficient for the task at hand.
BIG DATA ANALYSIS FRAMEWORK

DATA COLLECTION
• Broadly, the data source can be classified as open/public data, social
media data and multimodal data.
1. Open or public data source – It is a data source that does not have
any stringent copyright rules or restrictions. Its data can be primarily used
for many purposes.
2. Social media – It is the data that is generated by various social media
platforms like Twitter, Facebook, YouTube, and Instagram. An
enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such as text,
video, audio and mixed types.
BIG DATA ANALYSIS FRAMEWORK

DATA PREPROCESSING

• In the real world, the available data may be: incomplete data, inaccurate data, outlier data, data with missing values, data with inconsistent values, or duplicate data.

• Data preprocessing improves the quality of the data for the data mining techniques. The raw data must be preprocessed to give accurate results.
• The process of detection and removal of errors in data is called
data cleaning.
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING

Illustration of ‘Bad’ Data


• It can be observed that data like Salary = ' ' is incomplete data.
• The DoB of the patients John, Andre, and Raju is missing data.
• The age of David is an example of inconsistent data. Inconsistent data occurs due to
problems in conversions, inconsistent formats, and differences in units.
• The salary for John is negative; a salary cannot be less than 0. It is an instance of noisy data.
• Outliers are data that exhibit characteristics that are different from other data
and have very unusual values. It is often required to distinguish between noise and
outlier data. The age of Raju is an example of an outlier.
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
Missing Data Analysis:

• The primary data cleaning process is missing data


analysis.

• Data cleaning routines attempt to fill up the missing


values, smoothen the noise while identifying the outliers
and correct the inconsistencies of the data
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
• The procedures that are given below can solve the problem of
missing data:
1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored.
This method is not effective when the percentage of the missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the data tables
and carry out the analysis and fill in the values manually. But, this is time
consuming and may not be feasible for larger sets.
3. A global constant can be used to fill in the missing attributes. The missing
values may be replaced with a constant such as 'Unknown' or 'Infinity'. But, some data
mining algorithms may give spurious results by analysing these labels.
4. The attribute value may be filled in by the attribute mean value. Say, the
average income can replace a missing value (see the sketch after this list).
5. Use the attribute mean for all samples belonging to the same class. Here,
the average value replaces the missing values of all tuples that fall in this group.
6. Use the most probable value to fill in the missing value. The most probable
value can be obtained from other methods like classification and decision tree
prediction.
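A minimal sketch of these missing-value procedures with pandas; the income values and class labels are illustrative:

# Sketch: filling missing values with pandas (column names are illustrative)
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000, np.nan, 45000, np.nan, 52000],
                   "class":  ["A", "A", "B", "B", "B"]})

df["income_global"] = df["income"].fillna(0)                        # 3. global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())        # 4. attribute mean
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))                                   # 5. mean within the same class
print(df)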
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
Removal of Noisy or Outlier Data:
• Noise is a random error or variance in a measured value. It can be removed by using binning,
which is a method where the given data values are sorted and distributed into equal
frequency bins.
• The bins are also called as buckets.
• The binning method then uses the neighbor values to smooth the noisy data.
• Some of the techniques commonly used are 'smoothing by bin means', where the mean
of the bin replaces the values in the bin; 'smoothing by bin medians', where the bin median
replaces the bin values; and 'smoothing by bin boundaries', where each bin value is
replaced by the closest bin boundary. The maximum and minimum values of a bin are called its bin
boundaries.
• Binning methods may be used as a discretization technique.
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply various binning techniques and show the result.

Solution: By the equal-frequency bin method, the data should be distributed across the bins. Assuming bins of size 3, the above data is distributed across the bins as shown:

Bin 1: 12, 14, 19
Bin 2: 22, 24, 26
Bin 3: 28, 31, 32
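A small sketch that reproduces this binning and applies smoothing by bin means and by bin boundaries (Python with numpy assumed):

# Sketch: equal-frequency binning and smoothing for Example 2.1 (bin size 3)
import numpy as np

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]        # [[12,14,19], [22,24,26], [28,31,32]]

by_means = [[round(np.mean(b), 2)] * len(b) for b in bins]          # smoothing by bin means
by_boundaries = [[min(b) if (v - min(b)) <= (max(b) - v) else max(b) for v in b]
                 for b in bins]                                     # smoothing by bin boundaries
print("bins:", bins)
print("smoothing by means:", by_means)
print("smoothing by boundaries:", by_boundaries)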


BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
Data Integration and Data Transformations:
• Data integration involves routines that merge data from multiple sources into a single
data source. This may lead to redundant data. The main goal of data integration is to
detect and remove redundancies that arise from the integration.
• Data transformation routines perform operations like normalization to improve the
performance of the data mining algorithms. In normalization, the attribute values are
scaled to fit in a range (say 0-1) to improve the performance of the data mining
algorithm. Often, these techniques are used in neural networks.
• Some of the normalization procedures used are:
BIG DATA ANALYSIS FRAMEWORK DATA PREPROCESSING

Min-Max Procedure: It is a normalization technique where each variable v is normalized
by its difference from the minimum value divided by the range, and mapped to a new
range, say 0–1. Often, neural networks require this kind of normalization. The formula to
implement this normalization is given as:

v' = ((v - min) / (max - min)) × (new_max - new_min) + new_min

Here, max - min is the range; min and max are the minimum and maximum of the given
data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING
Example 2.2: Consider the set V = {88, 90, 92, 94}. Apply the Min-Max procedure and
map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and the maximum is 94. The new min and new
max are 0 and 1, respectively. The mapping can be done using the Min-Max formula as:

(88 − 88)/6 = 0, (90 − 88)/6 ≈ 0.33, (92 − 88)/6 ≈ 0.67, (94 − 88)/6 = 1

So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range
{0, 0.33, 0.67, 1}. Thus, the Min-Max normalization range is between 0 and 1.
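A small sketch of the Min-Max procedure applied to Example 2.2:

# Sketch: Min-Max normalization of Example 2.2 to the new range 0-1
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

V = [88, 90, 92, 94]
print([round(v, 2) for v in min_max(V)])    # [0.0, 0.33, 0.67, 1.0]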
z-Score Normalization: This procedure works by taking the difference between the
field value and the mean value, and by scaling this difference by the standard
deviation of the attribute:

v' = (v − μ) / σ

• Here, σ is the standard deviation of the list V and μ is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}; convert the marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10,
respectively. So the z-scores of these marks are calculated using v' = (v − μ)/s as:

(10 − 20)/10 = −1, (20 − 20)/10 = 0, (30 − 20)/10 = 1

Hence, the z-scores of the marks {10, 20, 30} are {−1, 0, 1}.
What is the use of z-scores?

z-scores are used for outlier detection. If the z-score of a data value is either less than -3 or greater than
+3, then it is possibly an outlier.
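A small sketch of the z-score computation for Example 2.3, together with the |z| > 3 outlier check:

# Sketch: z-score normalization of Example 2.3 and a simple |z| > 3 outlier check
import statistics

def z_scores(values):
    mu = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1 denominator)
    return [(v - mu) / s for v in values]

V = [10, 20, 30]
z = z_scores(V)
print(z)                                             # [-1.0, 0.0, 1.0]
print([v for v, zi in zip(V, z) if abs(zi) > 3])     # possible outliers (none here)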
BIG DATA ANALYSIS FRAMEWORK
DATA PREPROCESSING

Data Reduction:
• Data reduction reduces data size but produces the same
results. There are different ways in which data reduction can be
carried out such as data aggregation, feature selection, and
dimensionality reduction.
DESCRIPTIVE STATISTICS
• Descriptive statistics is a branch of statistics that does dataset
summarization. It is used to summarize and describe data.
Descriptive statistics are just descriptive and do not go beyond that.
• Data visualization is a branch of study that is useful for
investigating the given data.
• Descriptive analytics and data visualization techniques help to
understand the nature of the data, which further helps to
determine the kinds of machine learning or data mining tasks that can
be applied to the data. This step is often known as Exploratory Data
Analysis (EDA).
Dataset and Data Types
• A dataset can be assumed to be a collection of data objects.
• The data objects may be records, points, vectors, patterns,
events, cases,
samples or observations. These records contain many attributes.
• An attribute can be defined as the property or characteristic of an object.

• Every attribute should be associated with a value. This


process is called
measurement.
• The type of attribute determines the data
types, often referred to as measurement scale types
Types of Data
• Broadly, data can be classified into two types:
1.Categorical or qualitative data
2.Numerical or quantitative data
• Categorical or Qualitative Data The categorical data can be
divided into two types. They are nominal type and ordinal type.
 Nominal Data : Nominal data are symbols and cannot be processed
like a number.
– Nominal data type provides only information but has no ordering
among data. Only operations like (=, ≠) are meaningful for these data. For
example, the patient ID can be checked for equality and nothing else.
 Ordinal Data – It provides enough information and has natural
order. For example, Fever = {Low, Medium, High} is an ordinal data
• Numeric or Quantitative Data: It can be divided into two
categories. They are interval type and ratio type.
– Interval Data – Interval data is numeric data for which the
differences between values are meaningful.

For example, there is a meaningful difference between 30 degrees and 40 degrees. The only
permissible operations are + and -.
– Ratio Data – For ratio data, both differences and ratio are
meaningful. The difference between the ratio and interval data is the
position of zero in the scale.

For example, take the Centigrade-Fahrenheit temperature scales. The zeroes of the two
scales do not match, so temperatures on these scales are interval data rather than ratio data.
Another way of classifying the data is as:
• Discrete Data: This kind of data is recorded as integers.
E.g., the response to a survey can be discrete data.
• Continuous Data: It can be fitted into a range and includes decimal points.
E.g., age is continuous data.
UNIVARIATE DATA ANALYSIS AND VISUALIZATION

• Univariate analysis is the simplest form of statistical analysis.

• As the name indicates, the dataset has only one variable.

• A variable can be called as a category.

• Univariate does not deal with cause or relationships.

• The aim of univariate analysis is to describe data and find


patterns.
1. Data Visualization:

• To understand data, graph visualization is a must.

• Data visualization helps to understand data.

• It helps to present information and data to customers.

• Some of the graphs that are used in univariate data analysis are bar charts,
histograms, frequency polygons and pie charts.
– Bar Chart: A bar chart is used to display the frequency distribution of variables.
– Bar charts are used to illustrate discrete data. The charts can also help to
explain the counts of nominal data.
– The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1,
2, 3, 4, 5} is shown below.
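A minimal sketch of drawing this bar chart with matplotlib (assumed installed):

# Sketch: bar chart of the students' marks with matplotlib
import matplotlib.pyplot as plt

student_id = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_id, marks)
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()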
UNIVARIATE DATA ANALYSIS AND VISUALIZATION


Pie Chart These are equally helpful in illustrating the univariate data.
The percentage frequency distribution of students' marks {22, 22, 40, 40, 70,
70, 70, 85, 90, 90} is below in Figure 2.18.
It can be observed that the number of students with 22 marks is 2. The
total number of students is 10. So, 2/10 × 100 = 20% of the pie is allotted
for marks 22 in Figure 2.18.
Histogram: A histogram shows frequency distributions. The histogram for
students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75, 76-100
is given below in Figure 2.19. One can visually inspect from Figure 2.19 that the
number of students in the range 76-100 is 2.
Figure 2.19: Sample Histogram of English Marks
Dot Plots: Dot plots are similar to bar charts. The dot plot of English marks for five
students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.20.
The advantage is that by visual inspection one can find out who got more marks.

Figure 2.20: Dot Plot


2. Central Tendency
• A condensation or summary of the data is necessary. This makes the data
analysis easy and simple.
• One such summary is called central tendency.
• Thus, central tendency can explain the characteristics of data and that
further helps in comparison.
• Mass data have tendency to concentrate at certain values, normally in the
central location.
• It is called measure of central tendency (or averages).
• This represents the first order of measures.
• Popular measures are mean, median and mode.
• Mean – Arithmetic average (or mean) is a measure of central tendency that
represents the ‘center’ of the dataset.

• Weighted mean – Unlike arithmetic mean that gives the weightage of all items
equally, weighted mean gives different importance to all items as the item
importance varies. Hence, different weightage can be given to items.
• Geometric mean – Let x1, x2, …, xN be a set of N values or observations. The geometric
mean is the Nth root of the product of the N items. The formula for computing the
geometric mean is given as follows:

GM = (x1 × x2 × … × xN)^(1/N)
• Median – The middle value in the distribution is called
median.
If the total number of items in the distribution is odd, then the
middle value is called median. If the numbers are even, then the
average value of two items in the centre is the median.
• Mode – Mode is the value that occurs most frequently in the dataset.
• In other words, the value that has the highest frequency is called the mode.
• Mode is only for discrete data and is not applicable for continuous data, as there are
usually no repeated values in continuous data. A sketch computing the mean, median
and mode follows below.
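A small sketch computing the mean, median and mode of the marks used in the earlier charts, with Python's statistics module:

# Sketch: measures of central tendency with the statistics module
import statistics

marks = [45, 60, 60, 80, 85]                  # the students' marks used in the charts above
print("mean:", statistics.mean(marks))        # 66
print("median:", statistics.median(marks))    # 60
print("mode:", statistics.mode(marks))        # 60 (most frequent value)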
3. Dispersion

• The spread of a set of data around the central tendency (mean, median or mode) is called dispersion.
• It shows how the data are spread and how different they are from one another.
• Dispersion is represented in various ways such as range, variance, standard deviation, and standard error.
Range
• Range is the difference between the maximum and minimum of values of the
given list of data.
Standard Deviation
• The mean does not convey much more than a middle point. For example, the
following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The
difference between these two sets is the spread of data.
The standard deviation is computed as σ = sqrt((1/N) Σ (xi − μ)²). Here, N is the size of the
population, xi is an observation or value from the population, and μ is the population mean.
Often, N − 1 is used instead of N in the denominator (giving the sample standard deviation).
Quartiles and Inter Quartile Range
• Quartiles are values that divide a series into four equal parts.
• The interquartile range (IQR) is the difference between the first and
third quartiles.
• It is a measure of variability used with the median. It is sometimes convenient to
subdivide the dataset using coordinates. Percentiles are about data that are less than
the coordinates by some percentage of the total value.
• The kth percentile is the value Xi such that k% of the data lies at or below Xi. For
example, the median is the 50th percentile and can be denoted as Q0.50. The 25th
percentile is called the first quartile (Q1) and the 75th percentile is called the third
quartile (Q3).
In an odd-numbered data set, the median is the number in the middle
of the list. The median itself is excluded from both halves: one half
contains all values below the median, and the other contains all the
values above it.

Q1 is the median of the first half and Q3 is the median of the second half. When each of
these halves has an odd-numbered size, there is only one value in the middle of each half.
Example 2.4: For the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; in this case, 24 is the median. The first
quartile is the median of the scores below the median, i.e., {12, 14, 19, 22}. Hence, it is
the median of the list below 24; in this case, it is the average of the second and third
values, that is, Q0.25 = 16.5. Similarly, the third quartile is the median of the values
above the median, that is, {26, 28, 31, 34}. So, Q0.75 is the average of the seventh and
eighth scores; in this case, it is (28 + 31)/2 = 59/2 = 29.5.

Hence, the IQR is 29.5 − 16.5 = 13.

Half of the IQR is called the semi-quartile range. The Semi Inter Quartile Range
(SIQR) is given as SIQR = IQR/2 = 13/2 = 6.5.
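A small sketch computing Q1, Q3, IQR and SIQR for Example 2.4 using the median-of-halves method described above:

# Sketch: IQR and SIQR for Example 2.4 using the median-of-halves method from the text
import statistics

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
ages.sort()
mid = len(ages) // 2
q1 = statistics.median(ages[:mid])        # median of the lower half -> 16.5
q3 = statistics.median(ages[mid + 1:])    # median of the upper half -> 29.5
iqr = q3 - q1
print("Q1:", q1, "Q3:", q3, "IQR:", iqr, "SIQR:", iqr / 2)   # IQR 13, SIQR 6.5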
• The median, quartiles Q1 and Q3, and minimum and maximum
written in the order < Minimum, Q1, Median, Q3, Maximum
>is known as five-point summary.

• Box plots are suitable for continuous variables and a nominal variable.
• Box plots can be used to illustrate data distributions and summary of data. It is the popular way for
plotting five number summaries. A Box plot is also known as a Box and whisker plot.
Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.

Solution: The minimum is 2 and the maximum is 13. Q1, Q2 and Q3 are 3, 8 and 11,
respectively. Hence, the 5-point summary is {2, 3, 8, 11, 13}, that is,
{minimum, Q1, median, Q3, maximum}.
Box plots are useful for describing the 5-point summary. The Box plot for this set is
given in Figure 2.21.
Figure 2.21: Box Plot for English Marks
4. Shape
• Skewness and Kurtosis (called moments) indicate the
symmetry/asymmetry and peak location of the dataset.

Skewness
• The measures of direction and degree of symmetry are called measures of third
order.
• Ideally, skewness should be zero as in ideal normal distribution. More
often, the given dataset may not have perfect symmetry (consider the following
Figure 2.22).
• The relationship between skew and the relative sizes of the mean and median
can be summarized by a convenient numerical skew index known as
Pearson's second skewness coefficient.

Figure 2.22: (a) Positively Skewed and (b) Negatively Skewed Data
Kurtosis
• Kurtosis is used to find the presence of outliers in our data. It gives us
the total degree of outliers present.
• Kurtosis is the measure of whether the data is heavy-tailed or light-tailed
relative to a normal distribution.
• It can be observed that the normal distribution has a bell-shaped curve with no
long tails. Low kurtosis indicates light tails; the implication is that there is no outlier data.
• Let x1, x2, …, xN be a set of 'N' values or observations with mean μ and standard
deviation σ. Then, kurtosis is measured as the average of ((xi − μ)/σ)^4 over the N
values; a value close to 3 corresponds to a normal distribution.
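A small sketch computing skewness and kurtosis with scipy.stats (assumed installed); fisher=False gives the Pearson definition, for which normal data is close to 3:

# Sketch: skewness and kurtosis with scipy.stats
from scipy.stats import skew, kurtosis

data = [12, 14, 19, 22, 24, 26, 28, 31, 34]
print("skewness:", skew(data))                      # ~0 for a symmetric distribution
print("kurtosis:", kurtosis(data, fisher=False))    # Pearson definition; normal data gives ~3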
BIVARIATE DATA AND MULTIVARIATE DATA
• Bivariate Data involves two variables.
• Bivariate data deals with causes of
relationships.
• The aim is to find relationships among data.
• Consider the following Table 2.6, with data of the temperature in a
shop and sales
of sweaters.
• Figures 2.23 and 2.24 show a scatter plot and a line chart for the data in Table 2.6.

Table 2.6: Temperature in a Shop and Sales Data
Figure 2.23: Scatter Plot
Figure 2.24: Line Chart
BIVARIATE DATA AND MULTIVARIATE DATA
1.Bivariate Statistics: Covariance and Correlation are examples of bivariate
statistics.
Covariance
• Covariance is an indicator of the extent to which 2 random variables are
dependent on each other.
• Covariance implies whether the two variables are directly or inversely
proportional.
• A higher number denotes higher dependency.
• Correlation is a statistical measure that indicates how
strongly two variables are related.
• The value of covariance lies in the range of -∞ and +∞.
BIVARIATE DATA AND MULTIVARIATE DATA
Covariance
• It is a measure of joint probability of random variables, say X and Y.
• Generally, random variables are represented in capital letters
• It is defined as covariance(X, Y) or COV(X, Y) and is used to measure the variance between
two dimensions.
• The formula for finding the covariance for specific x and y is:

COV(X, Y) = (1/N) Σ (xi − E(X)) (yi − E(Y))

• Here, xi and yi are data values from X and Y, E(X) and E(Y) are the mean values of xi
and yi, and N is the number of given data points.
• Also, COV(X, Y) is the same as COV(Y, X).
BIVARIATE DATA AND MULTIVARIATE DATA

Example 2.6: Find the covariance of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: Mean(X) = E(X) = 15/5 = 3, Mean(Y) = E(Y) = 55/5 = 11.
The covariance is computed using COV(X, Y) as:

COV(X, Y) = (1/5) [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)]
          = (1/5) (20 + 7 + 0 + 5 + 28) = 60/5 = 12

The covariance between X and Y is 12. It can be normalized to a value between -1 and +1.
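A small sketch verifying the covariance of Example 2.6 with numpy (population form, divisor N):

# Sketch: covariance of Example 2.6, COV(X, Y) = E[(X - E(X))(Y - E(Y))]
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
print(cov_xy)                                       # 12.0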
BIVARIATE DATA AND MULTIVARIATE DATA

Correlation

• Correlation is a statistical concept determining the strength of the relationship between two numerical variables.
• While deducing the relation between variables, we conclude how the change in one variable impacts a difference in another.
• The correlation indicates the relationship between dimensions using its sign.
• The sign is more important than the actual value.


BIVARIATE DATA AND MULTIVARIATE DATA
Correlation

1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, then it indicates that both dimensions are independent of each other.
4. If the dimensions are highly correlated, then it is better to remove one dimension, as it is a redundant dimension.
5. If the given attributes are X = (x1, x2, …, xN) and Y = (y1, y2, …, yN),
then the Pearson correlation coefficient, denoted as r, is given as:

r = COV(X, Y) / (σX σY)
BIVARIATE DATA AND MULTIVARIATE DATA

Example 2.7: Find the correlation coefficient of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution:
The mean values of X and Y are 15/5 = 3 and 55/5 = 11.
The standard deviations of X and Y are 1.41 and 8.6486, respectively.
Therefore, the correlation coefficient is given as the ratio of the covariance (12, from
Example 2.6) to the product of the standard deviations of X and Y, as per the above equation:

r = 12 / (1.41 × 8.6486) ≈ 0.98

This indicates a strong positive correlation between X and Y.
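A small sketch verifying the correlation coefficient of Example 2.7 with numpy (population standard deviations):

# Sketch: Pearson correlation coefficient for Example 2.7
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 9, 16, 25])
r = np.mean((X - X.mean()) * (Y - Y.mean())) / (X.std() * Y.std())
print(round(r, 4))                     # ~0.98, a strong positive correlation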
