Module-1 ML
Introduction
NEED FOR MACHINE LEARNING
• Business organizations accumulate enormous amounts of data.
• This data needs to be analyzed for taking decisions.
Machine learning has become so popular because of three reasons:
1. High volume of available data to manage: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that the data approximately doubles every year.
2. The cost of storage has reduced, and hardware cost has also dropped. Therefore, it is now easier to capture, process, store, distribute, and transmit digital information.
3. The availability of complex algorithms. Especially with the advent of deep learning, many powerful algorithms are now available for machine learning.
Before starting the machine learning journey, let us establish some terms: data, information, and knowledge.
All facts are data. Data can be numbers or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data from data sources such as flat files, databases, or data warehouses, in different storage formats.
The objective of machine learning is to process this archival data so that organizations can take better decisions, design new products, improve business processes, and develop effective decision support systems.
MACHINE LEARNING EXPLAINED
Machine learning is an important sub-branch of Artificial Intelligence (AI).
A frequently quoted definition of machine learning was given by Arthur Samuel. He stated that "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."
The key to this definition is that the system should learn by itself.
Another pioneer of AI, Tom Mitchell, defined machine learning as follows: "A computer program is said to learn from experience E, with respect to task T and some performance measure P, if its performance on T, measured by P, improves with experience E."
The important components of this definition are experience E, task T, and performance measure P.
For example, the task T could be detecting an object in an image. The machine can gain knowledge of the object using a training dataset of thousands of images; this is called experience E. So, the focus is to use this experience E for the task T of object detection.
The ability of the system to detect the object is measured by performance measures like precision and recall. Based on these performance measures, course corrections can be made to improve the performance of the system.
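To make precision and recall concrete, here is a minimal sketch (illustrative, not from the source; the labels and names are hypothetical) that computes both measures from true and predicted binary labels:

# Minimal sketch: precision and recall for a binary detection task.
# 1 = object present, 0 = object absent (hypothetical labels).
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct detections / all detections
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # correct detections / all objects
    return precision, recall

y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1]
print(precision_recall(y_true, y_pred))   # (0.75, 0.75)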
Humans gain experience by various means. For computer systems, experience is based on data, and the models built from that data are the equivalent of human experience.
Big data is used by many machine learning algorithms for applications such as
language translation and image recognition.
Labelled Data
To illustrate labelled data, let us take an example dataset called the Iris flower dataset, or Fisher's Iris dataset.
The dataset has 150 samples of Iris (50 per class), each with four attributes: the length and width of sepals and petals.
The target variable is called class. There are three classes: Iris setosa, Iris versicolor, and Iris virginica.
Partial data of the Iris dataset is shown in Table 1.1.
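As a hedged illustration (scikit-learn is an assumption, not mentioned in the source), the Iris dataset can be loaded and its attributes and class labels inspected as follows:

# Illustrative sketch: loading Fisher's Iris dataset with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)             # sepal/petal length and width, in cm
print(iris.target_names)              # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)                # (150, 4): 150 labelled samples, 4 attributes
print(iris.data[0], iris.target[0])   # first sample and its class label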
TYPES OF MACHINE LEARNING
A dataset need not always consist of numbers; it can be images or video frames. Deep neural networks can handle images with labels. In Figure 1.6, the deep neural network takes images of dogs and cats with labels for classification.
Supervised Learning
Supervised algorithms use a labelled dataset.
As the name suggests, there is a supervisor or teacher component
in supervised learning.
Supervised Learning-Classification
In the case of the Iris dataset, if a test sample is given as (6.3, 2.9, 5.6, 1.8, ?), the classification algorithm predicts its class label using the labelled training data.
Unsupervised Learning
The second kind of learning is by self-instruction. As the name suggests, there is no supervisor or teacher component. In the absence of a supervisor or teacher, self-instruction is the most common kind of learning process.
This process of self-instruction is based on the concept of trial and error.
Here, the program is supplied with objects, but no labels are defined.
The algorithm itself observes the examples and recognizes patterns based on
the principles of grouping.
Grouping is done in such a way that similar objects form the same group. Cluster analysis and dimensionality reduction algorithms are examples of unsupervised algorithms.
Unsupervised Learning-Cluster Analysis
• Cluster analysis partitions data objects so that objects within a partition are similar in some aspect and vary from the data objects in the other partitions significantly.
• Some of the key clustering algorithms are:
o k-means algorithm
o Hierarchical algorithms
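A minimal sketch of cluster analysis (assuming scikit-learn; k = 3 is an assumption chosen to match the three Iris species, although an unsupervised algorithm is never told this):

# Illustrative sketch: k-means clustering on unlabelled Iris measurements.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                          # 150 x 4 measurements, labels ignored
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                        # cluster assignment per sample
print(km.cluster_centers_)                    # one centroid per cluster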
Dimensionality Reduction
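Dimensionality reduction maps data with many attributes to fewer attributes while preserving as much structure as possible. A minimal sketch (assuming scikit-learn and PCA, which the source does not prescribe):

# Illustrative sketch: reducing the 4 Iris attributes to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                          # 150 samples with 4 attributes
pca = PCA(n_components=2)                     # keep two principal components
X2 = pca.fit_transform(X)
print(X2.shape)                               # (150, 2)
print(pca.explained_variance_ratio_)          # variance retained per component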
CHALLENGES OF MACHINE LEARNING
3. High computation power – With the availability of Big Data, the computational resource requirement has also increased. Systems with a Graphics Processing Unit (GPU) or even a Tensor Processing Unit (TPU) are required to execute machine learning algorithms.
5. Bias/Variance – Bias is the error from overly simple assumptions in the model, while variance is the error from the model's sensitivity to the particular training data. This leads to a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails on test data, and in general lacks generalization, is said to be overfitting. The reverse problem is called underfitting.
MACHINE LEARNING PROCESS
1. Understanding the business – This step involves understanding the objectives and requirements of the business organization. Generally, a single data mining algorithm is enough to provide the solution.
2. Understanding the data – It involves steps like data collection, study of the characteristics of the data, formulation of hypothesis, and matching of patterns to the selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by
cleaning raw data and the preparation of data for the data mining process.
4. Modelling – This step involves the application of a data mining algorithm to the data to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined
by evaluating the accuracy of the classifier.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.
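A hedged end-to-end sketch of steps 3-5 above (assuming scikit-learn and a decision tree classifier; the source does not prescribe a particular algorithm or dataset):

# Illustrative sketch: prepare data, fit a model, and evaluate it.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Step 3: preparation of data - split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Step 4: modelling - fit a classifier to the training data.
model = DecisionTreeClassifier().fit(X_train, y_train)
# Step 5: evaluation - measure accuracy on unseen test data.
print(accuracy_score(y_test, model.predict(X_test)))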
Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the words of documents are converted to sentiments like happy, sad, and angry, which are captured effectively by emoticons. For movie reviews or product reviews, a five-star or one-star rating is automatically attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible. For example, Amazon recommends related books or books bought by people with tastes similar to yours, and Netflix suggests shows or movies matching your taste. These recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and
Google Assistant are all examples of voice assistants. They take speech commands and perform
tasks. These chatbots are the result of machine learning technologies.
4. Navigation – Technologies like Google Maps and those used by Uber are examples of machine learning that locate and navigate the shortest paths to reduce travel time.
What is Data?
• All facts are data. In computer systems, bits encode facts present
in numbers, text, images, audio, and video.
• Data is available in different data sources like flat files, databases, or data
warehouses.
• It can be either operational or non-operational data.
• Big Data is characterized by the following six Vs:
1. Volume – Since there is a reduction in the cost of storing devices, there has
been a tremendous growth of data. Small traditional data is measured in terms
of gigabytes (GB) and terabytes (TB), but Big Data is measured in terms of
petabytes (PB) and exabytes (EB).
2. Velocity – The fast arrival speed of data and its rapid increase in volume are denoted by velocity. The availability of IoT devices and Internet connectivity ensures that data arrives at a faster rate.
3. Variety – The variety of Big Data includes:
a. Form – There are many forms of data. Data types range from text, graph, audio, and video to maps.
c. Source of data – This is the third aspect of variety. There are many sources
of data. Broadly, the data source can be classified as open/public data, social
media data and multimodal data.
4. Veracity of data – Veracity deals with aspects like conformity to the facts, truthfulness, believability, and confidence in data. There may be many sources of error such as technical errors, typographical errors, and human errors.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals
that are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the
information that is extracted from the data and its influence on the decisions that are
taken based on it.
Types of Data
1. Structured Data
Data is stored in an organized manner such as a database where it is available in the form of a table. The
data can also be retrieved in an organized manner using tools like SQL.
• The structured data frequently encountered in machine learning are listed below:
o Record Data – A dataset is a collection of measurements taken from a process. We have a collection of objects in a dataset, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Rows in the matrix represent objects and can be called entities, cases, or records. The columns of the dataset are called attributes, features, or fields.
o Data Matrix – It is a variation of the record type because it consists of numeric attributes. Standard matrix operations can be applied to these data.
o Graph Data It involves the relationships among objects. For example, a web
page can refer to another web page. This can be modeled as a graph.
o Ordered Data – Ordered data objects involve attributes that have an implicit order among them. Examples of ordered data are:
Temporal data – Data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
Sequence data – It is like sequential data but does not have time stamps. It involves a sequence of words or letters. For example, DNA data is a sequence of four characters: A, T, G, C.
Spatial data – It has attributes such as positions or areas. For example, maps are spatial data where the points are related by location.
2. Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual
documents, programs, and blog data. It is estimated that 80% of the data are
unstructured data.
3. Semi-Structured Data
Semi-structured data is partially organized data; XML, JSON, and RSS feeds are typical examples.
TYPES OF DATA ANALYTICS
1. Descriptive analytics is about describing the main features of the data. After data collection is done, descriptive analytics deals with the collected data and quantifies it. It is often stated that analytics is essentially statistics.
2. Diagnostic Analytics deals with the question – ‘Why?’. This is also known as causal
analysis, as it aims to find out the cause and effect of the events. For example, if a
product is not selling, diagnostic analytics aims to find out the reason.
3. Predictive Analytics deals with the future. It deals with the question – 'What will happen in the future given this data?'. This involves the application of algorithms to identify patterns to predict the future.
4. Prescriptive Analytics is about finding the best course of action for business organizations. Prescriptive analytics goes beyond prediction and helps in decision making by giving a set of actions. It helps organizations plan better for the future and mitigate the risks involved.
BIG DATA ANALYSIS FRAMEWORK
4. Presentation layer
1. Data Collection
2. Data Preprocessing
3. Application of Machine Learning Algorithms
DATA COLLECTION
• Broadly, the data source can be classified as open/public data, social
media data and multimodal data.
1. Open or public data source – It is a data source that does not have any stringent copyright rules or restrictions. Its data can be used freely for many purposes.
2. Social media – It is the data that is generated by various social media
platforms like Twitter, Facebook, YouTube, and Instagram. An
enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such as text,
video, audio and mixed types.
DATA PREPROCESSING
Consider a sample patient dataset with attributes such as DoB, age, and salary:
• The DoB of patients John, Andre, and Raju is missing data.
• The age of David is inconsistent data. Inconsistent data occurs due to problems in conversions, inconsistent formats, and differences in units.
• The salary of John is negative; salary cannot be less than 0. It is an instance of noisy data.
• Outliers are data that exhibit characteristics different from other data and have very unusual values. It is often required to distinguish between noise and outlier data. The age of Raju is an example of an outlier.
Missing Data Analysis: Missing values can be detected and handled during preprocessing. Noisy data can be smoothed by binning, where the sorted values are distributed across bins; assuming bins of size 3, the values in each bin can then be smoothed using the bin mean or the bin boundaries.
Min-Max Normalization: This procedure rescales a value v of an attribute to a new target range:

v' = \frac{v - \min}{\max - \min} \times (new\_max - new\_min) + new\_min

Here, max − min is the range; min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
Example 2.2: Consider the set V = {88, 90, 92, 94}. Apply the Min-Max procedure and map the marks to the new range 0-1.
Solution: The minimum of the list V is 88 and the maximum is 94. The new min and new max are 0 and 1, respectively. Applying the formula above, the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}. Thus, the Min-Max normalization range is between 0 and 1.
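A minimal sketch (illustrative, not from the source) that reproduces Example 2.2 in code:

# Illustrative sketch: Min-Max normalization to a target range, default [0, 1].
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print([round(v, 2) for v in min_max([88, 90, 92, 94])])   # [0.0, 0.33, 0.67, 1.0]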
z-Score Normalization: This procedure works by taking the difference between the field value and the mean value, and scaling this difference by the standard deviation of the attribute.
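In symbols (standard form, consistent with the description above), for an attribute with mean \mu and standard deviation \sigma, a value v is transformed as:

v' = \frac{v - \mu}{\sigma}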
Data Reduction:
• Data reduction reduces the data size but produces essentially the same analytical results. There are different ways in which data reduction can be carried out, such as data aggregation, feature selection, and dimensionality reduction.
DESCRIPTIVE STATISTICS
• Descriptive statistics is a branch of statistics that does dataset
summarization. It is used to summarize and describe data.
Descriptive statistics are just descriptive and do not go beyond that.
• Data visualization is a branch of study that is useful for
investigating the given data.
• Descriptive analytics and data visualization techniques help to
understand the nature of the data, which further helps to
determine the kinds of machine learning or data mining tasks that can
be applied to the data. This step is often known as Exploratory Data
Analysis (EDA).
Dataset and Data Types
• A dataset can be assumed to be a collection of data objects.
• The data objects may be records, points, vectors, patterns, events, cases, samples, or observations. These records contain many attributes.
• An attribute can be defined as a property or characteristic of an object. For example, consider the following sample database.
– Interval Data – For interval data, differences between values are meaningful. For example, there is a difference between 30 degrees and 40 degrees. The only permissible operations are + and −.
– Ratio Data – For ratio data, both differences and ratios are meaningful. The difference between ratio and interval data is the position of zero in the scale.
• Some of the graphs that are used in univariate data analysis are bar charts,
histograms, frequency polygons and pie charts.
– Bar Chart – A bar chart is used to display the frequency distribution of variables. Bar charts are used to illustrate discrete data. The charts can also help to explain the counts of nominal data. The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown.
UNIVARIATE DATA ANALYSIS AND VISUALIZATION
Pie Chart – Pie charts are equally helpful in illustrating univariate data. The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in Figure 2.18. It can be observed that the number of students with 22 marks is 2. The total number of students is 10. So, 2/10 × 100 = 20% of the pie of 100% is allotted to marks 22 in Figure 2.18.
Histogram – A histogram shows frequency distributions. The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75, and 76-100 is given in Figure 2.19. One can visually inspect from Figure 2.19 that the number of students in the range 76-100 is 2.
Figure 2.19: Sample Histogram of English Marks
Dot Plots – Dot plots are similar to bar charts. The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.20. The advantage is that by visual inspection one can find out who got more marks.
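As a hedged illustration (matplotlib is an assumption, not part of the source), the charts above can be produced as follows; the variable names are hypothetical:

# Illustrative sketch: bar chart, pie chart, and histogram for the marks above.
import matplotlib.pyplot as plt

ids, marks = [1, 2, 3, 4, 5], [45, 60, 60, 80, 85]
pie_marks = [22, 40, 70, 85, 90]       # distinct marks from the pie chart example
pie_counts = [2, 2, 3, 1, 2]           # their frequencies (out of 10 students)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(ids, marks)                                        # bar chart per student
axes[1].pie(pie_counts, labels=pie_marks, autopct='%1.0f%%')   # percentage pie
axes[2].hist(marks, bins=[0, 25, 50, 75, 100])                 # histogram by range
plt.show()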
• Weighted mean – Unlike the arithmetic mean, which weights all items equally, the weighted mean gives different importance to different items, as item importance varies. Hence, different weightages can be given to items: weighted mean = Σ wᵢxᵢ / Σ wᵢ.
• Geometric mean – Let x1, x2, …, xN be a set of N values or observations. The geometric mean is the Nth root of their product: GM = (x1 × x2 × … × xN)^{1/N}.
• A percentile is the value below which a given percentage of observations falls. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
In an odd-numbered data set, the median is the number in the middle
of the list. The median itself is excluded from both halves: one half
contains all values below the median, and the other contains all the
values above it.
The Inter Quartile Range (IQR) is Q3 − Q1. Half of the IQR is called the semi-quartile range; the Semi Inter Quartile Range (SIQR) is given as: SIQR = (Q3 − Q1) / 2.
• The median, the quartiles Q1 and Q3, and the minimum and maximum, written in the order <Minimum, Q1, Median, Q3, Maximum>, are known as the five-point summary.
• Box plots are suitable for continuous variables and a nominal variable.
• Box plots can be used to illustrate data distributions and summaries of data. They are a popular way of plotting five-point summaries. A box plot is also known as a box-and-whisker plot.
Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
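Solution: Sorting the list gives {2, 3, 4, 8, 9, 11, 13}. The median is the middle value, 8. Excluding the median, the lower half {2, 3, 4} gives Q1 = 3 and the upper half {9, 11, 13} gives Q3 = 11. The five-point summary is therefore <2, 3, 8, 11, 13>.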
Skewness
• The measures of direction and degree of symmetry are called measures of third
order.
• Ideally, skewness should be zero, as in an ideal normal distribution. More often, the given dataset may not have perfect symmetry (consider Figure 2.22).
• The relationship between skew and the relative size of the mean and median
can be summarized by a convenient numerical skew index known as
Pearson 2 skewness coefficient.
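In standard form (consistent with the definition above), with sample mean \bar{x}, median, and standard deviation s:

sk_2 = \frac{3(\bar{x} - \text{median})}{s}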
• The covariance between two attributes X and Y is given as:

COV(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \big(x_i - E(X)\big)\big(y_i - E(Y)\big)

• Here, x_i and y_i are data values from X and Y, E(X) and E(Y) are the mean values of x_i and y_i, and N is the number of given data points.
• Also, COV(X, Y) is the same as COV(Y, X).
BIVARIATE DATA AND MULTIVARIATE DATA
Correlation
• The correlation coefficient indicates the strength of the linear relationship between two attributes and takes the same sign as the covariance.
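In standard form, the Pearson correlation coefficient normalizes the covariance by the standard deviations of X and Y:

r(X, Y) = \frac{COV(X, Y)}{\sigma_X \, \sigma_Y}

Its value always lies between −1 and +1.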