
UNIT 1

MACHINE LEARNING BASICS

Introduction to Machine Learning (ML) – Essential concepts of ML – Types of learning – Machine learning methods based on Time – Dimensionality – Linearity and Nonlinearity – Early trends in Machine Learning – Data Understanding, Representation, and Visualization.

Artificial Intelligence (AI)

Alan Turing defined Artificial Intelligence as follows: “If there is a machine behind a
curtain and a human is interacting with it (by whatever means, e.g. audio or via typing etc.)
and if the human feels like he/she is interacting with another human, then the machine is
artificially intelligent.”
This definition does not directly aim at the notion of intelligence, but rather focuses on human-like
behavior. This objective is even broader in scope than mere intelligence. From this
perspective, AI does not mean building an extraordinarily intelligent machine that can solve
any problem in no time, but rather building a machine that is capable of human-like
behavior.
However, just building machines that mimic humans does not sound very interesting. From the
modern perspective, AI means machines that are capable of performing one or more of these
tasks: understanding human language, performing mechanical tasks involving complex
maneuvering, solving complex computer-based problems possibly involving large amounts of data in
a very short time and responding with answers in a human-like manner, etc.
The supercomputer HAL, depicted in the movie 2001: A Space Odyssey, closely represents the
modern view of AI. It is a machine capable of processing large amounts of
data coming from various sources, generating insights and summaries at extremely fast
speed, and conveying these results to humans through human-like interaction, e.g.,
voice conversation.
There are two aspects to AI as viewed from the human-like behavior standpoint.

1. The machine is intelligent and is capable of communicating with humans, but
does not have any locomotive aspects. HAL is an example of such AI.
2. The other aspect involves physical interaction with human-like
locomotion capabilities, which refers to the field of robotics. We are only going to
deal with the first kind of AI.

Machine Learning (ML)

The term ML was coined in 1959 by Arthur Samuel in the context of a machine learning to play the
game of checkers. The term refers to a computer program that can learn to produce a behavior that is
not explicitly programmed by the author of the program. Rather, it is capable of showing
behavior that the author may be completely unaware of. This behavior is learned based on
three factors: (1) data that is consumed by the program, (2) a metric that quantifies the
error or some form of distance between the current behavior and the ideal behavior, and (3) a
feedback mechanism that uses the quantified error to guide the program to produce better
behavior in subsequent events. As can be seen, the second and third factors quickly make
the concept abstract and emphasize its deep mathematical roots. The methods in machine
learning theory are essential in building artificially intelligent systems.

Definition

Machine learning enables a machine to automatically learn from data, improve its performance
with experience, and make predictions without being explicitly programmed.

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with huge amounts of data.

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning

2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled
data to the machine learning system in order to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the datasets and learn about each
sample. Once the training and processing are done, we test the model by providing sample
data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much like a student learning things under the
supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression
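As an illustration (not part of the original syllabus text), a minimal supervised classification sketch is shown below. It assumes scikit-learn is available and uses its bundled Iris data; the choice of logistic regression and the 80/20 split are arbitrary assumptions for demonstration only.

```python
# Minimal supervised learning sketch: train on labeled data, then predict.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # features and labels (the supervision)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # hold out some data for testing

model = LogisticRegression(max_iter=200)   # a simple classification algorithm
model.fit(X_train, y_train)                # learn from labeled examples
print("Test accuracy:", model.score(X_test, y_test))
```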

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with a set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or groups of objects
with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful
insights from the huge amount of data. It can be further classified into two categories of
algorithms:

o Clustering
o Association
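A minimal unsupervised clustering sketch follows, assuming scikit-learn and NumPy are available; the two synthetic "blobs" of points are made up purely for illustration.

```python
# Minimal unsupervised learning sketch: group unlabeled points into clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs of 2-D points with no labels attached.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)   # cluster centers discovered without supervision
```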

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method in which a learning agent gets a
reward for each right action and a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of the agent is to collect the
maximum reward points, and in doing so, it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.
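A minimal reward-and-penalty sketch is shown below: tabular Q-learning on a made-up five-state corridor (this toy environment and all parameter values are assumptions for illustration, not taken from the text).

```python
# Minimal reinforcement learning sketch: tabular Q-learning on a tiny corridor.
# States 0..4; the agent starts at 0 and gets a reward of +1 on reaching state 4.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != 4:
        # Explore occasionally, otherwise exploit the best known action.
        action = rng.integers(2) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0      # reward only at the goal
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # after training, the "right" action should dominate in every state
```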

Applications of Machine learning


1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image recognition
and face detection is automatic friend tagging suggestion:

Facebook provides a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with their names.

2. Speech Recognition

While using Google, we get an option to "Search by voice"; this comes under speech recognition,
a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known
as "speech to text" or "computer speech recognition." At present, machine learning
algorithms are widely used in various speech recognition applications. Google
Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, with the help of two sources:

o Real-time location of vehicles from the Google Maps app and sensors

o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information
from users and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for
a product on Amazon, we start getting advertisements for the same product while surfing
the internet on the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests
products according to that interest.

Similarly, when we use Netflix, we find recommendations for series, movies, etc., and this
is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, a well-known car manufacturer,
is working on self-driving cars and uses machine learning methods to train its car models
to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam.
We always receive important mail in our inbox, marked with the important symbol, and spam emails
in our spam box, and the technology behind this is machine learning. Below are some spam
filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
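A minimal spam-filtering sketch using one of the algorithms named above (Naïve Bayes) is given below. It assumes scikit-learn is available; the tiny message list and labels are made up for illustration.

```python
# Minimal spam-filtering sketch with a Naive Bayes classifier over bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "lowest price guaranteed, click here",
            "meeting rescheduled to monday", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)        # convert text to word-count features
classifier = MultinomialNB().fit(X, labels)   # learn word patterns per class

test = vectorizer.transform(["free prize, click here"])
print(classifier.predict(test))               # expected to predict 'spam'
```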

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us find information using voice instructions. These
assistants can help us in various ways just through voice instructions, such as playing music,
calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part. They record
our voice instructions, send them to a server on the cloud, decode them using ML algorithms,
and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen
in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking
whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. Each genuine transaction follows a specific pattern,
which changes for a fraudulent transaction; the network therefore detects the fraud and makes
our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a
risk of shares going up and down, so long short-term memory (LSTM) neural networks are
used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical
technology is growing very fast and is able to build 3D models that can predict the exact position
of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and are not aware of the language, it is not a problem
at all, as machine learning helps us here too by converting text into languages we
know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is
a neural machine translation system that translates text into our familiar language, and this is called
automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which, used together with image recognition, can also translate text appearing in images from one
language to another.

Machine learning Life cycle

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify the data sources and obtain the data needed for the problem.
In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, databases, the internet, or mobile devices. It is one of the most important
steps of the life cycle. The quantity and quality of the collected data will determine the
efficiency of the output. The more data there is, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

2. Data preparation

After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it for use in machine learning
training.

In this step, we first put all the data together and then randomize the ordering of the data.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of the data.
A better understanding of the data leads to an effective outcome. Here, we find
correlations, general trends, and outliers.

o Data pre-processing:
The next step is pre-processing of the data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variables to use, and transforming the data into a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address quality
issues.

The data we have collected is not necessarily always of use to us, as some of it may not
be useful. In real-world applications, collected data may have various issues, including:
o Missing values
o Duplicate data
o Invalid data
o Noise
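A minimal data-wrangling sketch is shown below, assuming pandas and NumPy are available; the small table, the invalid "-1" age, and the imputation choices are made-up assumptions for illustration.

```python
# Minimal data-wrangling sketch: handle duplicates, invalid entries, and missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["India", "France", "France", "Germany"],
    "Age": [38.0, 43.0, 43.0, -1.0],     # -1 stands in for an invalid age
    "Salary": [48000, 45000, 45000, np.nan],
})

df = df.drop_duplicates()                                 # remove duplicate rows
df.loc[df["Age"] < 0, "Age"] = np.nan                     # mark invalid values as missing
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())   # impute missing salary
df["Age"] = df["Age"].fillna(df["Age"].median())          # impute missing age
print(df)
```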

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome. It starts with determining the type of problem,
where we select machine learning techniques such as classification, regression, cluster analysis,
association, etc.; we then build the model using the prepared data and evaluate it.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model
is required so that it can understand the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, we test the model.
In this step, we check the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirements of the
project or problem.
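A minimal sketch of the train-and-test steps is given below, assuming scikit-learn is available; the decision tree classifier, the Iris data, and the 70/30 split are arbitrary assumptions for illustration.

```python
# Minimal train/test sketch: train on one split of the data, then measure
# percentage accuracy on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # train the model
predictions = model.predict(X_test)                                   # test the model
print(f"Accuracy: {accuracy_score(y_test, predictions):.0%}")
```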

7. Deployment

The last step of the machine learning life cycle is deployment, where we deploy the model in a
real-world system.
Key differences between Artificial Intelligence (AI) and Machine learning (ML):

o Artificial intelligence is a technology that enables a machine to simulate human behavior, whereas machine learning is a subset of AI that allows a machine to automatically learn from past data without being explicitly programmed.
o The goal of AI is to make a smart computer system like humans to solve complex problems, whereas the goal of ML is to allow machines to learn from data so that they can give accurate output.
o In AI, we make intelligent systems to perform any task like a human; in ML, we teach machines with data to perform a particular task and give an accurate result.
o Machine learning and deep learning are the two main subsets of AI; deep learning is in turn a main subset of machine learning.
o AI has a very wide scope; machine learning has a limited scope.
o AI works to create intelligent systems that can perform various complex tasks; machine learning works to create machines that can perform only those specific tasks for which they are trained.
o An AI system is concerned with maximizing the chances of success; machine learning is mainly concerned with accuracy and patterns.
o The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.; the main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
o On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI; machine learning can be divided into three types: supervised learning, unsupervised learning, and reinforcement learning.
o AI includes learning, reasoning, and self-correction; machine learning includes learning and self-correction when introduced to new data.
o AI deals with structured, semi-structured, and unstructured data; machine learning deals with structured and semi-structured data.

Machine Learning Method based on Time

Time series data is a set of measurements taken at a constant interval of time; here time acts as the
independent variable, and the objective (the characteristic whose changes are being studied) is the
dependent variable.
For example, one can measure

 Consumption of energy per hour
 Sales on a daily basis
 Company's profits per quarter
 Annual changes in the population of a country.

Such data is of three types:

 Time series data: A set of observations of the values taken by a variable at different times.
 Cross-sectional data: Data values of one or more variables, gathered at the same point in time.
 Pooled data: A combination of time series data and cross-sectional data.

What is Time Series Analysis?

"Time series analysis is a statistical technique dealing in time series data, or trend analysis."

A time series contains sequential data points mapped at successive time intervals. Time series
analysis incorporates methods that attempt to make sense of a time series, either by understanding
the underlying nature of the data points or by making predictions.

 Forecasting using time-series analysis comprises the use of a suitable model to
forecast future conclusions on the basis of known past outcomes.
 An objective of time series analysis is to explore and understand patterns in changes over time,
where these patterns signify the components of a time series, including trends, cycles, and
irregular movements.
 When such components are present in a time series, the model must account for these
patterns in order to generate accurate forecasts, such as future sales, GDP, and global temperatures.

Consider the example of a restaurant in which a prediction is made about the number of customers
that will appear in the restaurant during a specified time period, based on previous appearances
of customers over time.

We can use time series for multiple investigations to predict future behavior such as circadian
rhythms, seasonal behaviors, trends, and changes, and to interrogate questions like predicted values,
what is leading and what is lagging, connections and associations, control, repetitions, and
hidden patterns, etc.

Broadly, the standard time-series models are Autoregressive (AR), Integrated (I), and Moving
Average (MA) models; other models are combinations of these, such as the Autoregressive
Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models.

These models reflect the fact that measurements taken close together in time tend to be more
closely related than measurements taken far apart.
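A minimal ARIMA forecasting sketch is given below. It assumes the statsmodels and pandas packages are available; the synthetic monthly series and the (1, 1, 1) order are made-up assumptions for illustration.

```python
# Minimal ARIMA sketch: fit an AR + I + MA model to a toy monthly series and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A toy series: an upward trend plus noise, indexed by month.
rng = np.random.default_rng(0)
values = np.arange(48) * 2.0 + rng.normal(0, 3, 48)
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=48, freq="MS"))

model = ARIMA(series, order=(1, 1, 1))   # AR order, differencing order, MA order
fitted = model.fit()
print(fitted.forecast(steps=6))          # forecast the next six months
```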

Examples of Time-Series Analysis

Consider an example from the financial domain, where the main objective is to recognize trends,
seasonal behavior, and correlation through time series analysis techniques and to produce filters
based on the forecasts. This includes:

1. Predicting expected utilities: For successful trading, it is necessary
to have accurate and reliable future predictions, such as asset prices, variations in usage, and
products in demand, in statistical form through market research and time-series datasets.
2. Simulating series: After obtaining statistical output data for a financial time series, it can be
used to create simulations of future events. This helps us determine the number of
trades, expected trading costs and returns, required financial and technical investment, and
various risks in trading, etc.
3. Inferring relationships: Recognizing the relationship between a time series and
other quantities gives us trading signals to improve the existing way of trading. For
example, knowing the spread of a foreign exchange pair and how it varies,
estimated trades can be inferred for a certain period to forecast the
spread and reduce transaction costs.

ML Methods For Time-Series Forecasting

1. In the Univariate Time-series Forecasting method, forecasting problems contain only
two variables: one is time and the other is the field we are looking to forecast.
 For example, if you want to predict the mean temperature of a city for the coming week,
one variable is time (the week) and the other is the temperature.
 Another example is measuring a person's heart rate per minute using past
observations of heart rate only. Here one variable is time (the minute) and the other is the heart rate.

2. On the other hand, in the Multivariate Time-series Forecasting method, forecasting
problems contain multiple variables: one variable is kept as time, and there are multiple
other parameters.

Consider the same example of predicting the temperature of a city for the coming week. The only
difference is that the temperature forecast will now consider impacting factors such as
 Rainfall and duration of rain,
 Humidity,
 Wind speed,
 Precipitation,
 Atmospheric pressure, etc.
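The sketch below contrasts the univariate and multivariate framing of such a forecasting problem, assuming pandas is available; the weather columns and values are made-up illustrative data.

```python
# Sketch contrasting univariate and multivariate framing of a forecasting problem.
import pandas as pd

weather = pd.DataFrame({
    "temperature": [31.0, 30.5, 32.1, 33.0, 31.8],
    "humidity":    [60, 62, 58, 55, 59],
    "wind_speed":  [10, 12, 9, 8, 11],
}, index=pd.date_range("2024-06-01", periods=5, freq="D"))

# Univariate: only the past of the target series is used as input.
univariate_X = weather[["temperature"]].shift(1).dropna()

# Multivariate: other measured variables are used alongside the target's past.
multivariate_X = weather[["temperature", "humidity", "wind_speed"]].shift(1).dropna()
print(univariate_X.shape, multivariate_X.shape)
```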

Dataset

A dataset is a collection of data in which the data is arranged in some order. A dataset can contain
anything from a simple array to a database table. The table below shows an example of a
dataset:

Country    Age    Salary     Purchased
India      38     48000      No
France     43     45000      Yes
Germany    30     54000      No
France     48     65000      No
Germany    40     (missing)  Yes
India      35     58000      Yes
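The same example dataset can be written as a pandas DataFrame, as in the sketch below (pandas and NumPy assumed available); note the missing Salary value in the Germany/40 row.

```python
# The example dataset above as a pandas DataFrame; NaN marks the missing Salary value.
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "Country":   ["India", "France", "Germany", "France", "Germany", "India"],
    "Age":       [38, 43, 30, 48, 40, 35],
    "Salary":    [48000, 45000, 54000, 65000, np.nan, 58000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes"],
})
print(dataset)
```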
Sources for Machine Learning datasets

1. Kaggle Datasets

2. UCI Machine Learning Repository

3. Datasets via AWS

4. Google's Dataset Search Engine

5. Microsoft Datasets

6. Computer Vision Datasets

1.1 Dimensionality

From the physical standpoint, dimensions are space dimensions: length, width, and height.
However, it is very common to have tens, if not hundreds or more, dimensions when we deal
with data for machine learning.
The fundamental property of dimensions is orthogonality.

The space dimensions are defined such that each dimension is perpendicular, or
orthogonal, to the other two. This property of orthogonality is essential to have a unique
representation of all the points in this 3-dimensional space. If the dimensions are not
orthogonal to each other, then one can have multiple representations for the same points in the
space, and all the mathematical calculations based on it will fail.
For example, suppose we set up the three coordinates as length, width, and height with some
arbitrary origin. The precise location of the origin only changes the values of the coordinates, but does
not affect the uniqueness property, and hence any choice of origin is fine as long as it remains
unchanged throughout the calculation.
Example of Iris data: The input has 4 features: the lengths and widths of sepals and petals. As all
these 4 features are independent of each other, they can be considered
orthogonal. When solving a problem with the Iris data, a 4-dimensional input space is actually
being dealt with.

Curse of Dimensionality

Adding an arbitrarily large number of dimensions is fine from a mathematical standpoint, but
there is one problem: with an increase in dimensions, the density of the data diminishes
exponentially. For example, suppose there are 1000 data points in the training data, the data has
3 unique features, and the value of each feature lies within 1-10. Then all these 1000 data points lie in
a cube of size 10 × 10 × 10. The density is 1000/1000, or 1 sample per unit cube. If we
had 5 unique features instead of 3, the density of the data quickly drops to 0.01 samples
per unit 5-dimensional cube.
The density of the data is important: the higher the density of the data, the better the likelihood
of finding a good model, and the higher the confidence in the accuracy of the model. If the
density is very low, there would be very low confidence in a model trained on that data.
Although high dimensions are acceptable mathematically, dimensionality is an important
consideration in order to develop a good ML model with high confidence.
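The short sketch below reproduces the density argument from the paragraph above (1000 points, feature values within 1-10, so 10 raised to the number of dimensions unit cells):

```python
# Density of 1000 points falls exponentially as the number of dimensions grows.
n_points = 1000
for d in (1, 2, 3, 5, 10):
    cells = 10 ** d                      # number of unit hypercubes in the feature space
    print(f"{d} dimensions: {n_points / cells:g} samples per unit cell")
```

For 3 dimensions this prints 1 sample per unit cell, and for 5 dimensions 0.01, matching the example in the text.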

Linearity and Nonlinearity

The concept of linearity and nonlinearity is applicable to both the data and the model that is built on
top of it.
Data is called linear if the relationship between the input and output is linear: as the value
of the input increases, the value of the output also increases, and vice versa. A pure
inverse relation is also called linear; it follows the same rule with a reversal of sign for
either the input or the output.
Various possible linear relationships between input and output are shown in Figure 1.1.

All models that use linear equations to model the relationship between input and output
are called linear models. However, sometimes, by preconditioning the input or output, a
nonlinear relationship in the data can be converted into a linear relationship, and then the
linear model can be applied to it.

Figure 1.1. Examples of linear relationships between input and output

For example, suppose the input and output are related by the exponential relationship y = 5e^x. The data
is clearly nonlinear. Instead of building the model on the original data, we can build a model after
applying a log operation. This operation transforms the original nonlinear relationship into a
linear one, since ln y = ln 5 + x, which is linear in x. We then build a linear model to predict ln y
instead of y, and the prediction can be converted back to y by taking the exponent. There can also be
cases where a problem can be broken down into multiple parts and a linear model applied to each
part, ultimately solving a nonlinear problem. Figures 1.2 and 1.3 show examples of converted linear and
piecewise linear relationships, respectively, while in some cases the relationship is purely
nonlinear and needs a proper nonlinear model to map it. Figure 1.4 shows examples of pure
nonlinear relationships.
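A minimal sketch of this log-transform trick is given below, assuming scikit-learn and NumPy are available; the noise-free data generated from y = 5e^x is an illustrative assumption.

```python
# Fit a linear model after a log transform: for y = 5*exp(x), ln(y) = ln(5) + x is linear in x.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 3, 50).reshape(-1, 1)
y = 5.0 * np.exp(x).ravel()

model = LinearRegression().fit(x, np.log(y))    # fit on the transformed target ln(y)
print(model.coef_[0], model.intercept_)         # ~1.0 slope and ~ln(5) = 1.609 intercept
y_pred = np.exp(model.predict(x))               # convert predictions back by taking the exponent
print(np.allclose(y_pred, y))                   # True for this noise-free data
```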
Figure 1.2 Example of a nonlinear relationship between input and output being converted into a
linear relationship by applying a logarithm

Figure 1.3 Examples of piecewise linear relationships between input and output
Figure 1.4 Examples of pure nonlinear relationships between input and output

Linear models are the simplest to understand, build, and interpret. All the models in the
theory of machine learning can handle linear data.
Examples of purely linear models are linear regression, support vector machines without
nonlinear kernels, etc.
Nonlinear models inherently use some nonlinear function to approximate the nonlinear
characteristics of the data.
Examples of nonlinear models include neural networks, decision trees, probabilistic models
based on nonlinear distributions, etc.
When analyzing data for building an artificially intelligent system, determining the type of
model to use is a critical starting step, and knowledge of the linearity of the relationship is a
crucial component of this analysis.

Early Trends in Machine Learning


Expert Systems
In the early days (until the 1980s), the field of machine intelligence or machine learning was
limited to what were called Expert Systems or Knowledge-Based Systems. Dr. Edward
Feigenbaum, one of the leading experts in the field of expert systems, defined an expert
system as "an intelligent computer program that uses knowledge and inference procedures to
solve problems that are difficult enough to require significant human expertise for their
solution."
Such systems were capable of replacing experts in certain areas. These machines were
programmed to perform complex heuristic tasks based on elaborate logical operations.

Disadvantages:

In spite of being able to replace human experts in specific areas, these
systems were not "intelligent" in the true sense, if we compare them with human intelligence.
The reason is that the systems were "hard-coded" to solve only a specific type of problem, and
if there was a need to solve a simpler but completely different problem, these systems would
quickly become completely useless.
Advantages:

These systems were quite popular and successful, specifically in areas where repeated but
highly accurate performance was needed, e.g., diagnosis, inspection, monitoring, and control.

Data Understanding, Representation, and Visualization


With the recent explosion of small devices connected to the internet, the amount of data being
generated has increased exponentially. This data can be quite useful for generating a
variety of insights if handled properly; otherwise, it can only burden the systems handling it and
slow everything down. The science that deals with handling, organizing, and then
interpreting data is called data science.

Understanding the Data

The first step in building an AI application is to understand the data. The data in raw form can
come from different sources and in different formats. Some data can be missing, some data can
be malformed, etc. The first task is to get familiar with the data and clean it up as
necessary.
The step of understanding the data can be broken down into three parts:

1. Understanding entities

2. Understanding attributes

3. Understanding data types

In order to understand these concepts, let us consider a data set called the Iris data. The Iris data is
one of the most widely used data sets in the field of ML because of its simplicity and its ability to
illustrate many different aspects of ML. Specifically, the Iris data poses a problem of multi-class
classification of three different types of flowers: Setosa, Versicolor, and Virginica. The data
set is ideal for learning basic ML application development as it does not contain any missing values, and
all the data is numerical. There are 4 features per sample, and there are 50 samples for each
class, totaling 150 samples. A sample taken from the data is shown in Table 1.1.

Understanding Entities
Entities represent groups of data separated based on conceptual themes and/or data
acquisition methods. An entity typically represents a table in a database, or a flat file, e.g., a
comma separated values (csv) file or a tab separated values (tsv) file. Sometimes it is more
efficient to represent entities using a more structured format like svmlight.
Each entity can contain multiple attributes. The raw data for each application can contain
multiple such entities; a sample entity for the Iris data is shown in Table 1.1.
In the case of the Iris data, we have only one such entity, in the form of the dimensions of the sepals
and petals of the flowers. If, while solving this classification problem, we find that the data about sepals
and petals alone is not sufficient, then more information needs to be added in the form of additional
entities.
For example, more information about the flowers in the form of their colors, smells, or the
longevity of the plants that produce them can be added to improve the classification
performance.

Sepal-length   Sepal-width   Petal-length   Petal-width   Class label
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
5.0 3.0 1.6 0.2 Iris-setosa
5.0 3.4 1.6 0.4 Iris-setosa
5.2 3.5 1.5 0.2 Iris-setosa
5.2 3.4 1.4 0.2 Iris-setosa
7.0 3.2 4.7 1.4 Iris-versicolor
6.4 3.2 4.5 1.5 Iris-versicolor
6.9 3.1 4.9 1.5 Iris-versicolor
5.5 2.3 4.0 1.3 Iris-versicolor
6.5 2.8 4.6 1.5 Iris-versicolor
6.7 3.1 4.7 1.5 Iris-versicolor
6.3 2.3 4.4 1.3 Iris-versicolor
5.6 3.0 4.1 1.3 Iris-versicolor
5.5 2.5 4.0 1.3 Iris-versicolor
5.5 2.6 4.4 1.2 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica
5.8 2.7 5.1 1.9 Iris-virginica
7.1 3.0 5.9 2.1 Iris-virginica
6.3 2.9 5.6 1.8 Iris-virginica
6.5 3.0 5.8 2.2 Iris-virginica
6.7 3.0 5.2 2.3 Iris-virginica
6.3 2.5 5.0 1.9 Iris-virginica
6.5 3.0 5.2 2.0 Iris-virginica
6.2 3.4 5.4 2.3 Iris-virginica
5.9 3.0 5.1 1.8 Iris-virginica

Table 1.1 Sample from the Iris data set containing 3 classes and 4 attributes
Understanding Attributes

Each attribute can be thought of as a column in the file or table. In the case of the Iris data, the attributes of the
single given entity are sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm.
If we add additional entities like color, smell, etc., each of those entities would have its own attributes.
All the columns here are features, and there is no ID column. As there is only one entity, an ID column is
optional, since we can assign an arbitrary unique ID to each row. If there are multiple entities, there is a
need to have an ID column for each entity, along with the relationships between IDs of different entities.
These IDs can then be used to join the entities to form the feature space.

Understanding Data Types

Attributes in each entity can be of various types from the storage and processing perspective,
e.g., string, integer valued, datetime, binary ("true"/"false", or "1"/"0"), etc. Sometimes the attributes can
originate from completely different domains, like an image or a sound file. Each type needs to be
handled separately when generating a feature vector that will be consumed by the ML algorithm.
One can also come across sparse data, in which case some attributes will have missing values. This
missing data is typically replaced with special characters, which should not be confused with any of the
real values. In order to process data with missing values, either fill the missing values with some default
values, or use an algorithm that can work with missing data.
In the case of the Iris data, all the attributes are real-valued numerical, and there is no missing data. However, if
we add an additional entity like color, it would have enumerative string features like green, orange,
etc.
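A minimal sketch of this first "understanding the data" pass on the Iris entity is shown below, assuming scikit-learn and pandas are available:

```python
# Inspect attribute types, missing values, and class balance in the Iris entity.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                      # 150 rows, 4 numeric attributes + target column
print(df.dtypes)                     # all four features are real-valued numerical
print(df.isna().sum())               # no missing values in this data set
print(df["target"].value_counts())   # 50 samples per class
```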

Understanding Data visualization

Data visualization is a graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends,
outliers, and patterns in data.

Benefits of good data visualization

The uses of data visualization are as follows.

 It is a powerful way to explore data and present the results.
 Its primary use is in the pre-processing portion of the data mining process.
 It supports the data cleaning process by finding incorrect and missing values.
 For variable derivation and selection, it is a means to determine which variables to include in and
which to discard from the analysis.
 It also plays a role in combining categories as part of the data reduction process.
Methods to Visualize Data

 Column Chart: It is also called a vertical bar chart where each category is represented by a
rectangle. The height of the rectangle is proportional to the values that are plotted.
 Bar Graph: It has rectangular bars in which the lengths are proportional to the values which are
represented.
 Stacked Bar Graph: It is a bar style graph that has various components stacked together so that
apart from the bar, the components can also be compared to each other.
 Stacked Column Chart: It is similar to a stacked bar graph; however, the bars are vertical, so the components are stacked vertically.
 Area Chart: It combines the line chart and bar chart to show how the numeric values of one or more
groups change over time, with the area under each line filled in.
 Dual Axis Chart: It combines a column chart and a line chart and then compares the two variables.
 Line Graph: The data points are connected through a straight line; therefore, creating a
representation of the changing trend.
 Mekko Chart: It can be called a two-dimensional stacked chart with varying column widths.
 Pie Chart: It is a chart where various components of a data set are presented in the form of a pie
which represents their proportion in the entire data set.
 Waterfall Chart: With the help of this chart, the increasing effect of sequentially introduced
positive or negative values can be understood.
 Bubble Chart: It is a multi-variable graph that is a hybrid of Scatter Plot and a Proportional Area
Chart.
 Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to denote values
for two different numeric variables.
 Bullet Graph: It is a variation of a bar graph. A bullet graph is often used as a replacement for dashboard
gauges and meters.
 Funnel Chart: This chart shows the flow of users through the stages of a business or sales process.
 Heat Map: It is a data visualization technique that shows the magnitude of values as color in two
dimensions.
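A minimal sketch of two of the chart types listed above (a column chart and a scatter plot) is given below, assuming matplotlib is available; the numbers are made-up illustrative values.

```python
# Draw a column (vertical bar) chart and a scatter plot with matplotlib.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 36]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, values)                    # column chart: height encodes the value
ax1.set_title("Column chart")

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]
ax2.scatter(x, y)                              # scatter plot of two numeric variables
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```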
