
Silver Oak College of Engineering and Technology

Unit 2 :
Preparing to Model



Outline
 Machine Learning activities,
 Types of data in Machine Learning,
 Structures of data,
 Data quality and remediation,
 Data Pre-Processing: Dimensionality reduction, Feature
subset selection



Framework For Developing Machine Learning Models

1. Problem or Opportunity Identification
2. Feature Extraction
3. Data Preprocessing
4. Model Building
5. Communication and Deployment of Data Analysis


Machine Learning activities



Types of data in Machine Learning
 Most data can be categorized into two basic types from a Machine Learning perspective:
1. Qualitative/categorical data
2. Quantitative/numerical data



Types of data in Machine Learning
 Qualitative/Categorical Data
 Qualitative or categorical data describes the object under consideration using a finite set of discrete classes.
 This type of data can't easily be counted or measured using numbers, and it is therefore divided into categories.
 Ex: the gender of a person (male, female, or others).
 There are two subcategories under this:
 Nominal data
 Ordinal data



Types of data in Machine Learning
 Qualitative/Categorical Data
 Nominal data
 These are sets of values that don't possess a natural ordering.
 In the nominal data type there is no comparison among the categories.
 Ex: the color of a smartphone can be considered a nominal data type, as we can't compare one color with another.
 Ex: the gender of a person is another example, where we can't rank male, female, or others.



Types of data in Machine Learning
 Qualitative/Categorical Data
 Ordinal data
 These types of values have a natural ordering while maintaining their class of values.
 These categories help us decide which encoding strategy can be applied to which type of data.
 Ex: clothing sizes, where small < medium < large; unlike the nominal data type, a comparison among the categories exists.
 Data encoding for qualitative data is important because machine learning models are mathematical in nature and can't handle these values directly; they need to be converted to numerical types.
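As an illustration of these encoding strategies, here is a minimal scikit-learn sketch (the toy dataframe and column names are hypothetical): ordinal values get an order-preserving integer code, while nominal values are one-hot encoded.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Hypothetical toy data: one nominal and one ordinal feature
df = pd.DataFrame({
    "color": ["red", "blue", "green"],     # nominal: no natural order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

# Ordinal feature: encode while preserving the natural order
size_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = size_enc.fit_transform(df[["size"]]).ravel()
print(df["size_code"].tolist())  # [0.0, 2.0, 1.0]

# Nominal feature: one-hot encode, since no comparison exists
color_enc = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
print(color_enc.fit_transform(df[["color"]]))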



Types of data in Machine Learning
 Quantitative/Numeric Data
 This data type tries to quantify things, and it does so by considering numerical values that make it countable in nature.
 Discrete
 Continuous



Types of data in Machine Learning
 Quantitative/Numeric Data
 Discrete
 Numerical values that are integers or whole numbers are placed under this category. The number of speakers in a phone, cameras, cores in the processor, and the number of SIMs supported are all examples of the discrete data type.



Types of data in Machine Learning
 Quantitative/Numeric Data
 Continuous
 Fractional numbers are considered continuous values. These can take the form of the operating frequency of a processor, the Android version of a phone, Wi-Fi frequency, the temperature of the cores, and so on.





Structures of data
 The term structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database management system (RDBMS).
 It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it is within an RDBMS structure.
 It depends on the creation of a data model defining what types of data to include and how to store and process it.



Data quality and remediation
 Data quality is an assessment or a perception of data's fitness to fulfill its purpose. Simply put, data is said to be of high quality if it satisfies the requirements of its intended purpose.
 There are many aspects to data quality, including consistency, integrity, accuracy, and completeness.
 Achieving the data quality required for machine learning includes checking for consistency, accuracy, compatibility, completeness, timeliness, and duplicate or corrupted records.
 At the scale required for a typical ML project, adequately cleansing training or production data manually is a near impossibility.



Importance of Data quality
 Data quality matters for machine learning. Unsupervised machine learning can be a savior when data of the desired quality is not available to meet the requirements of the business.
 It is capable of delivering precise business insights by evaluating data for AI-based programs.
 Improved data quality leads to better decision-making across an organization.
 The more high-quality data you have, the more confidence you can have in your decisions.
 Data quality is of critical importance, especially in the era of automated decisions, ML, and continuous process optimization.



Importance of Data quality
 Confusion, limited trust, poor decisions
 Data quality issues explain limited trust in data among corporate users, wasted resources, and even poor decisions.
 Failures due to low data quality
 Users need to trust the data; if they don't, they will gradually abandon the system, impacting its major KPIs and success criteria.



Data quality issues
 Data quality issues can take many forms, for example:
 particular properties in a specific object have invalid or missing values
 a value arriving in an unexpected or corrupted format
 duplicate instances
 inconsistent references or units of measure
 incomplete cases
 broken URLs
 corrupted binary data
 missing packages of data
 gaps in the feeds
 incorrectly mapped properties
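Many of these issues can be surfaced programmatically. Below is a minimal sketch, assuming a hypothetical pandas DataFrame with typical defects, of a few basic checks:

import pandas as pd

# Hypothetical raw dataset with typical quality issues
df = pd.DataFrame({
    "age": [25, None, 31, 25, 200],  # a missing value and an implausible value
    "email": ["a@x.com", "b@x", "c@x.com", "a@x.com", "d@x.com"],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # fully duplicated rows
print((~df["email"].str.match(r"^\S+@\S+\.\S+$")).sum())  # unexpected formats
print(((df["age"] < 0) | (df["age"] > 120)).sum())        # out-of-range values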



Data quality
 Data quality issues are typically the result of:
 poor software implementations: bugs or improper
handling of particular cases
 system-level issues: failures in certain processes
 changes in data formats, impacting the source and/or
target data stores



Data remediation
 Data remediation is the process of cleansing, organizing, and migrating data so that it is properly protected and best serves its intended purpose. Since the core initiative is to correct data, the data remediation process typically involves replacing, modifying, cleansing, or deleting any "dirty" data.
 It can be performed manually, with cleansing tools, as a batch process (script), through data migration, or with a combination of these methods.
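As an illustrative sketch of the batch-process route (a hypothetical pandas script, not a full remediation pipeline), replacing, modifying, cleansing, and deleting dirty records might look like this:

import pandas as pd

# Hypothetical dirty dataset
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "Alice", None],
    "salary": ["50,000", "62000", "50,000", "48000"],
})

# Modify: normalize whitespace and casing
df["name"] = df["name"].str.strip().str.title()

# Replace: coerce salary strings into proper numbers
df["salary"] = pd.to_numeric(df["salary"].str.replace(",", ""), errors="coerce")

# Delete: drop duplicates and rows missing critical fields
df = df.drop_duplicates().dropna(subset=["name"])
print(df)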



Data remediation
 Need for data remediation: consider the following factors that drive the need for data remediation:
 Moving to a new system or environment
 Eliminating personally identifiable information (a.k.a.
PII)
 Dealing with mergers and acquisitions activity
 Addressing human errors
 Remedying errors in reports
 Other business drivers



Data remediation terminology
 Data Migration – The process of moving data between two or more systems, data formats, or servers.
 Data Discovery – A manual or automated process of searching for patterns in data sets to identify structured and unstructured data in an organization's systems.
 ROT – An acronym that stands for redundant, obsolete, and trivial data. According to the Association for Intelligent Information Management, nearly 80 percent of unstructured data is ROT: beyond its recommended retention period and no longer useful to the organization.
 Dark Data – Any information that businesses collect, process, and store but do not use for other purposes. Some examples include customer call records, raw survey data, or email correspondence. Often, storing and securing this type of data incurs more expense, and sometimes even greater risk, than it delivers value.
 Dirty Data – Data that damages the integrity of the organization's complete dataset. This can include data that is unnecessarily duplicated, outdated, incomplete, or inaccurate.
 Data Overload – When an organization has acquired too much data, including low-quality or dark data. Data overload makes the tasks of identifying, classifying, and remediating data laborious.
 Data Cleansing – Transforming data from its native state into a predefined, standardized format.
 Data Governance – Management of the availability, usability, integrity, and security of the data stored within an organization.
Stages of data remediation
 Data remediation is an involved process. After all, it is more than simply purging your organization's systems of dirty data.
 It requires knowledgeable assessment of how to most effectively resolve unclean data.
 Assessment:
 You need to have a complete understanding of the data you possess.
 Organizing and segmentation:
 Not all data is created equal, which means that not all pieces of data require the same level of protection or storage features.
 A key step when creating segments is determining which historical data is essential to business operations and needs to be stored in an archive system, versus data that can be safely deleted.



Stages of data remediation
 Indexation and classification:
 These steps build on the data segments you have created and help you determine action steps.
 Organizations will focus on segments containing non-ROT data and classify the level of sensitivity of this remaining data.
 Migrating:
 If an organization's end goal is to consolidate its data into a new, cleansed storage environment, then migration is an essential step in the data remediation process.
 Data cleansing:
 The final task for your organization's data may not always involve migration.
 There may be other actions better suited to the data, depending on which segmentation group it falls under and its classification.
 A few vital actions that a team may proceed with include shredding, redacting, quarantining, ACL removal, and script execution to clean up data.
Benefits of data remediation
 Reduced data storage costs
 Protection for unstructured sensitive data
 Reduced sensitive data footprint
 Adherence to compliance laws and regulations
 Increased staff productivity
 Minimized cyberattack risks
 Improved overall data security



Dimensionality reduction
 The number of input variables or features in a dataset is referred to as its dimensionality.
 Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
 More input features often make a predictive modeling task more challenging to model, a problem more generally referred to as the curse of dimensionality.
 High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.



Why dimensionality reduction needed?
 Some features (dimensions) bear little or no useful information (e.g., the color of hair for a car selection)
 We can drop some features
 We have to estimate which features can be dropped from the data
 Several features can be combined together without loss, or even with gain, of information (e.g., the income of all family members for a loan application)
 Some features can be combined together
 We have to estimate which features to combine from the data



Feature selection vs extraction

 Feature selection: choosing k < d important features, ignoring the remaining d − k
 Subset selection algorithms
 Feature extraction: projecting the original d dimensions x_i, i = 1, ..., d, onto new k < d dimensions z_j, j = 1, ..., k
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Factor Analysis (FA)



Principal Components Analysis (PCA)
 Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of data.
 It increases interpretability while, at the same time, minimizing information loss.
 It helps to find the most significant features in a dataset and makes the data easy to plot in 2D and 3D.
 PCA helps in finding a sequence of linear combinations of variables.



Principal Components
 The principal components are straight lines that capture most of the variance in the data.
 They have a direction and a magnitude. Principal components are orthogonal (perpendicular) projections of the data onto a lower-dimensional space.



Application of PCA
• PCA is used to visualize multidimensional data.
• It is used to reduce the number of dimensions in healthcare data.
• PCA can help resize an image.
• It can be used in finance to analyze stock data and forecast returns.
• PCA helps to find patterns in the high-dimensional datasets.



Uses of PCA
• To reduce the number of dimensions in the dataset
• To find patterns in the high-dimensional dataset
• To visualize data of high dimensionality
• To ignore noise
• To improve classification
• To get a compact description
• To capture as much of the original variance in the data as possible



Objective of PCA
• Find an orthonormal basis for the data.
• Sort dimensions in the order of importance.
• Discard the low significance dimensions.
• Focus on uncorrelated and Gaussian components.



Working of PCA
1. Normalize the data
Standardize the data before performing PCA. This ensures that each feature has mean = 0 and variance = 1.

2. Build the covariance matrix
Construct a square matrix to express the covariance between each pair of features in the multidimensional dataset.
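A minimal NumPy sketch of these two steps (using a small, made-up matrix X) might look like this:

import numpy as np

# Toy data: four samples, two features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])

# Step 1: standardize each feature to mean 0 and variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)
print(cov)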



Working of PCA
3. Find the eigenvectors and eigenvalues
Calculate the eigenvectors (unit vectors) and eigenvalues of the covariance matrix. An eigenvalue is the scalar by which the covariance matrix scales its eigenvector (Cv = λv).

4. Sort the eigenvectors from highest to lowest eigenvalue and select the number of principal components.
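Continuing the NumPy sketch above, steps 3 and 4 can be carried out with the symmetric eigensolver:

# Step 3: eigendecomposition of the (symmetric) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort components by descending eigenvalue and keep the top k
order = np.argsort(eigvals)[::-1]
k = 1
W = eigvecs[:, order[:k]]  # projection matrix (d x k)
X_reduced = X_std @ W      # data projected onto the top k principal components
print(X_reduced)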



PCA in python
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: six points in two dimensions
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Fit PCA, keeping both components
pca = PCA(n_components=2)
pca.fit(X)

# Fraction of the variance explained by each component
print(pca.explained_variance_ratio_)

# Singular values corresponding to each component
print(pca.singular_values_)





LDA : Linear Discriminant Analysis
 LDA is most commonly used as a dimensionality reduction technique.
 It is very similar to PCA.
 Whereas PCA finds the component axes that maximize the variance of the entire data, LDA finds the axes that maximize the separation between multiple classes.
 LDA projects a feature space of size n onto a smaller subspace k (where k ≤ n − 1) while maintaining the class-discriminatory information.
 It helps avoid overfitting.
 It reduces computational cost.
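A minimal scikit-learn sketch of LDA as a supervised dimensionality reducer (on the built-in iris data, projecting 4 features onto 2 class-discriminative axes):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unlike PCA, LDA uses the class labels y to find discriminative axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # separation captured per axis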



ICA : Independent Component Analysis
 ICA is a method for finding underlying factors or components from multivariate statistical data.
 It looks for components that are both statistically independent and non-Gaussian.
 Statistically independent and non-Gaussian components can be separated using a blind source separation method. Nonlinear decorrelation and maximum non-Gaussianity are the methods used in ICA.
 Methods for estimating ICA:
1. Nonlinear decorrelation
2. Maximum non-Gaussianity



ICA : Independent Component Analysis
 Methods for estimating ICA
1. Nonlinear decorrelation:
 The nonlinear decorrelation method involves finding the matrix W such that for any i ≠ j, the components y_i and y_j are completely uncorrelated.
 The transformed components g(y_i) and h(y_j) are also uncorrelated, where g and h are suitable nonlinear functions.
 In this method of estimating ICA, if the nonlinearities are properly chosen, the method does find the independent components.
 The main problem in this method is how to choose the nonlinearities g and h. One approach to selecting the nonlinear functions is to use the maximum likelihood method from information theory.



ICA : Independent Component Analysis
 Methods for estimating ICA
2. Maximum non-Gaussianity:
 The maximum non-Gaussianity approach to estimating an independent component involves finding the local maxima of the non-Gaussianity of a linear combination y = Σ b_i x_i, under the constraint that the variance of y is constant.
 Each local maximum corresponds to one independent component. In practice, kurtosis is used to measure non-Gaussianity. Kurtosis is a higher-order cumulant, which generalizes variance using higher-order polynomials.
 Cumulants are used for ICA as they have important algebraic and statistical properties.
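For intuition, here is a minimal sketch using scikit-learn's FastICA (which estimates components by maximizing non-Gaussianity) to blindly separate two synthetic mixed signals; the sources and mixing matrix are made up:

import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic independent, non-Gaussian sources
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # a sine wave and a square wave

# Mix the sources with a made-up mixing matrix, plus a little noise
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = S @ A.T + 0.02 * rng.standard_normal((2000, 2))

# Recover the independent components blindly from the mixtures
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
print(S_est.shape)  # (2000, 2): recovered sources, up to order and scale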



High Dimensional data
 High-dimensional refers to a high number of variables, attributes, or features in certain data sets, more so in domains such as DNA analysis, geographic information systems (GIS), etc.
 A model built on an extremely high number of features may be very difficult to understand.
 So, start with feature selection.
 Benefits:
1. A faster and more cost-effective (less need for computational resources) learning model
2. A better understanding of the underlying model that generates the data
3. Improved efficacy of the learning model



Feature subset selection
 Feature selection is the most critical pre-processing activity in any machine learning process.
 It intends to select the subset of attributes or features that makes the most meaningful contribution to a machine learning activity. To understand it, consider a small example:
 Predict the weight of students based on past information about similar students, captured in a 'Student Weight' data set.
 The data set has four features: Roll Number, Age, Height, and Weight.
 Roll Number has no effect on the weight of the students, so we eliminate this feature.
 The new, reduced data set will have only three features.
 This subset of the data set is expected to give better results than the full set.
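A minimal pandas sketch of this example (the column names and values are made up):

import pandas as pd

students = pd.DataFrame({
    "roll_number": [101, 102, 103],
    "age": [14, 15, 14],
    "height_cm": [155, 160, 152],
    "weight_kg": [48, 55, 45],
})

# Roll Number carries no information about weight, so drop it
reduced = students.drop(columns=["roll_number"])
print(reduced.columns.tolist())  # ['age', 'height_cm', 'weight_kg']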



Feature Subset Selection
 The goal of feature subset selection is to find the optimal feature subset.
 Feature subset selection methods can be classified into three broad categories:
 Filter methods
 Wrapper methods
 Embedded methods
 Requirements:
 A measure for assessing the goodness of a feature subset (scoring function)
 A strategy to search the space of possible feature subsets
 Finding a minimal optimal feature set for an arbitrary target concept is hard; it needs good heuristics.



Filter Methods
 In this method, subsets of variables are selected as a pre-processing step, independently of the classifier that will be used.
 It is worth noting that variable-ranking feature selection is a filter method.
 Key features of filter methods (see the sketch below):
 Filter methods are usually fast.
 Filter methods provide a generic selection of features, not tuned to a given learner (universal).
 Filter methods are also often criticized because the feature set is not optimized for the classifier used.
 Filter methods are sometimes used as a pre-processing step for other methods.
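As a sketch of a filter method, scikit-learn's SelectKBest scores each feature with a statistic computed independently of any classifier (here an ANOVA F-test on the built-in iris data):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score features with the ANOVA F-test and keep the best two
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.scores_)  # per-feature scores, independent of any learner
print(X_reduced.shape)   # (150, 2)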
Wrapper Methods
 In wrapper methods, the learner is considered a black box. The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using those subsets.
 Results vary for different learners.
 One needs to define how to search the space of all possible variable subsets, and how to assess the prediction performance of a learner (see the sketch below).
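One common wrapper-style approach is recursive feature elimination (RFE), which repeatedly retrains the learner and discards the weakest features; a minimal scikit-learn sketch:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The learner is a black box whose fitted weights score candidate subsets
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 = selected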



Embedded Methods
 Embedded methods are specific to a given learning machine.
 They perform variable selection (implicitly) in the process of training.
 E.g., the WINNOW algorithm (a linear unit with multiplicative updates).
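A familiar modern example of an embedded method, added here for illustration (it is not on the slide), is L1-regularized regression, where training itself drives the coefficients of irrelevant features to exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
# Only the first two features actually influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)

# The L1 penalty performs selection while the model is being fit
model = Lasso(alpha=0.1)
model.fit(X, y)
print(np.round(model.coef_, 2))  # near-zero weights mark discarded features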



