Script
For BI systems and mathematical models, we can only obtain high accuracy and effective results with a
well-cleaned, reliable set of data.
However, the raw data collected from primary sources usually contain several anomalies that need to
be identified and corrected. And that’s exactly what we do in data preparation.
As a first step of preparing our data, we need to validate it, which means that we need to identify anomalies
and implement corrective actions when they appear.
We usually encounter 2 types of problems in raw data.
- Incomplete Data
- Data affected by noise (all the attributes are there, but their values are noisy)
Obviously, incomplete data is a set of data with missing attribute values, as we can see here, and to fix that,
there are several solutions:
- Elimination
- Inspection
- Identification
- Substitution
Elimination:
As a first solution, we can discard all the records with missing attributes, but this method may
cause the loss of large amounts of data when the percentage of missing values is high or when the distribution of
missing values varies in an irregular way across the attributes.
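Here is a minimal sketch of this elimination approach, assuming a small, made-up pandas DataFrame with some missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with some missing attribute values
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [30000, 42000, np.nan, 51000],
    "region": ["A", "B", "B", np.nan],
})

# Elimination: discard every record that has at least one missing attribute
df_complete = df.dropna()
print(df_complete)  # only the fully observed records remain
```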
Inspection:
Alternatively, we can opt for inspecting each missing value individually, but this approach is time-consuming
and really difficult for large amounts of data.
Identification:
As a third option, we can replace all the missing values with a conventionally chosen value to identify
those values, making it unnecessary to remove entire records from our dataset.
For example, for a continuous attribute that only takes positive values, it is possible to assign the value -1 to all
missing data. For a categorical attribute, we can do the same with a default token.
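As a quick illustration, a hedged sketch of this identification approach on a made-up DataFrame (the column names and the -1 / "MISSING" markers are just conventions chosen here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "spending": [12.5, np.nan, 30.0],          # continuous, positive-valued attribute
    "segment": ["retail", None, "corporate"],  # categorical attribute
})

# Identification: mark missing values with a conventional value instead of
# removing the whole record
df["spending"] = df["spending"].fillna(-1)       # -1 cannot occur among real values
df["segment"] = df["segment"].fillna("MISSING")  # default token for categories
print(df)
```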
Substitution:
As a last option, missing values of an attribute may be replaced with the mean of the attribute calculated over
the remaining observations; this technique can only be used with numerical attributes. It is also possible to
replace missing values with the mean calculated over the records that have the same target class.
Finally, the maximum likelihood value, usually estimated using regression models, can be used to
replace missing values.
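A small sketch of the first two substitution strategies, again on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, np.nan, 45000, np.nan, 52000, 48000],
    "churn":  ["yes", "yes", "no", "no", "no", "yes"],   # target class
})

# Substitution with the mean of the attribute over the remaining observations
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Substitution with the mean computed over records of the same target class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("churn")["income"].transform("mean")
)
print(df)
```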
What is noise in physics? It’s a random perturbation in sound waves, right? So just like in physics, noise
in data is a random perturbation within the values of a numerical attribute, usually resulting in noticeable
anomalies.
First of all, we need to identify those outliers, so that we can either correct, regularize or eliminate them.
1- The easiest way to identify those unusual values is the statistical concept of dispersion. The mean and
the variance of the sample are calculated, and if the attribute follows a distribution not far from normal, the
values falling outside a chosen interval centered around the mean are identified as outliers.
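A minimal sketch of this dispersion-based idea, assuming a roughly normal attribute and a ±2 standard deviation interval (both the data and the threshold are arbitrary choices made here):

```python
import numpy as np

# Hypothetical sample of a numerical attribute with one suspicious value
x = np.array([10.2, 9.8, 10.5, 9.9, 35.0, 10.1, 10.3])

mu, sigma = x.mean(), x.std()

# Values falling outside the chosen interval centered around the mean
# are identified as outliers
z = (x - mu) / sigma
outliers = x[np.abs(z) > 2]
print(outliers)  # the value 35.0 stands out from the rest of the sample
```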
2- An alternative way is based on the distance between observations. Once we identify the clusters, we can
assume that records that do not belong to any of the clusters are outliers.
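A possible sketch of this distance-based idea using DBSCAN from scikit-learn, which labels points that fall in no cluster with -1 (the data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D observations: two dense groups plus one isolated point
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
    [12.0, 0.5],   # far away from both groups
])

# Records that are not placed in any cluster receive the label -1
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(X[labels == -1])  # the isolated point is reported as an outlier
```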
3- Unlike the other methods, which identify and correct each single anomaly, there are also techniques that
automatically correct anomalous data. For example, simple or multiple regression models predict the
value of an attribute a_j. Once the regression model is developed and the corresponding confidence
interval calculated, it is possible to substitute the value computed along the prediction curve for the
values of the attribute a_j that fall outside the interval.
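A rough sketch of this regression-based correction, using a plain least-squares fit and a simple band of two residual standard deviations in place of a formal confidence interval (the data, band width and variable names are all assumptions made here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: attribute a_j is roughly linear in another attribute x,
# with two anomalous values injected on purpose
x = np.arange(30, dtype=float).reshape(-1, 1)
a_j = 2.0 * x.ravel() + rng.normal(0, 1, 30)
a_j[[5, 20]] += 25

model = LinearRegression().fit(x, a_j)
pred = model.predict(x)
band = 2 * (a_j - pred).std()

# Values of a_j falling outside the band around the prediction curve are
# replaced by the value computed along the curve itself
outside = np.abs(a_j - pred) > band
a_j[outside] = pred[outside]
```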
After this step, now that we have a complete set of data with no missing values or anomalies, we aim to
improve the accuracy of our learning models. To do so, there are multiple types of transformations
that we can apply to our data.
Slide 11
The most used transformation is standardization. And most popular standardization techniques include
decimal scaling, min-max scaling and z-score standardization.
Decimal scaling is based on the transformation below, where h is the scaling intensity that is, in general,
fixed at the smallest value such that the transformed values fall in the range [-1, 1].
Min-max scaling is achieved through the following transformation,
where x_min,j is the minimum value of x_ij over all records i, for a fixed attribute j, and x_max,j is the corresponding maximum value.
Z-index or z-score standardization has the formula below,
where μ_j is the sample mean and σ_j is the sample standard deviation of a given attribute a_j; this transformation
usually gives transformed values in the range [-3, 3].
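To make the three transformations concrete, here is a small numpy sketch on a made-up attribute (h is chosen as the smallest power of ten that brings every value inside [-1, 1]):

```python
import numpy as np

x = np.array([120.0, 250.0, 75.0, 310.0, 190.0])   # one numerical attribute

# Decimal scaling: divide by 10^h so that the transformed values lie in [-1, 1]
h = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / 10**h

# Min-max scaling: subtract the minimum and divide by the range of the attribute
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the sample mean and divide by the sample standard deviation;
# the transformed values usually fall within [-3, 3]
x_zscore = (x - x.mean()) / x.std()
```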
Slide 12
When dealing with small amounts of data, the transformations described earlier are sufficient to prepare
input data for a data mining analysis. However, with large datasets it is preferable to reduce the size of the data in
order to make learning algorithms more efficient without decreasing the quality of the results.
There are 3 main criteria to determine if a data reduction technique should be used:
- Efficiency
- Accuracy
- Simplicity
Efficiency
A dataset smaller than the original one means shorter computation time. Therefore, a reduction in processing
time allows the analyses to be carried out more quickly.
Accuracy
Accuracy is a critical success factor in most models. As a consequence, data reduction should not
compromise the accuracy of the model.
Simplicity
Some data mining applications are concerned more with interpretation than with prediction. In those cases, it is more
important that the models can be easily translated into simple rules that experts in
the application domain can understand. Some decision makers may accept a slight decrease in accuracy as a trade-off for simpler
rules.
Slide 14
Feature selection, also called feature reduction, is the elimination of a subset of attributes judged irrelevant for the
purpose of the data mining activities. The choice of the combination of predictive variables is one of the most
critical aspects in a learning process.
Feature reduction means fewer columns, which implies quicker execution time. The models generated
after the elimination of irrelevant attributes are often more accurate and easier to understand.
There are 3 main categories of feature selection models:
- Filter methods
- Wrapper methods
- Embedded methods
Filter methods select the relevant attributes before moving to the learning phase, and are therefore
independent of the algorithm being used.
The simplest filter method to apply for supervised learning is the selection of each single attribute based
on its level of correlation with the target. As a result, we only keep the attributes that are highly correlated with the target.
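A minimal sketch of such a correlation-based filter, on synthetic data where only one of three candidate attributes really drives the target (the names and the 0.5 threshold are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic dataset: three candidate attributes and a numerical target
df = pd.DataFrame({
    "a1": rng.normal(size=100),
    "a2": rng.normal(size=100),
    "a3": rng.normal(size=100),
})
df["target"] = 3 * df["a1"] + 0.1 * rng.normal(size=100)

# Filter method: keep only the attributes whose correlation with the target
# exceeds a chosen threshold, before any learning algorithm is involved
corr = df.drop(columns="target").corrwith(df["target"]).abs()
selected = corr[corr > 0.5].index.tolist()
print(selected)  # only "a1" should be strongly correlated with the target
```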
In the wrapper methods the selection of predictive variables is based not only on the level of relevance
of each single variable, but also on the learning algorithm utilized, which makes these methods burdensome from a
computational standpoint, since they take into account every possible combination of variables, and that means huge
computation time.
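As an illustration of why wrapper methods are so expensive, here is a brute-force sketch that cross-validates a logistic regression on every possible subset of the Iris attributes (the choice of model and dataset is an assumption made for the example):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

# Wrapper method: every candidate subset is evaluated by actually training
# and validating the learning algorithm on it
best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, subset], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)  # already 15 subsets for only 4 attributes
```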
For the embedded methods, the attribute selection process lies inside the learning algorithm, so that the selection
of the optimal subset of attributes is made during the phase of model generation. Decision trees are the perfect example
of embedded methods, because at each node they use a function that estimates the predictive value of each attribute. In this
way, the relevant attributes are automatically selected, and they determine the rule for splitting the records into the corresponding
nodes.
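A short sketch of this embedded behaviour with a scikit-learn decision tree (the dataset is just a convenient example, not part of the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Embedded method: at each node the tree picks the most informative attribute,
# so the selection happens while the model itself is being generated
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_)  # attributes never used for a split get weight 0
```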
Principal component analysis is the best-known technique of attribute reduction by means of
projection. Generally speaking, the purpose of this method is to obtain a new, smaller set of attributes,
obtained as linear combinations of the original ones, while losing as little information as possible.
Before applying this method, it is a must to standardize the data, so that all the attributes are on a comparable
scale. In addition to that, this transformation makes the mean of each attribute equal to 0.
Now, we need to find the principal components that are going to form our new set of attributes.
To do that, we choose as the first principal component the linear combination of the attributes with the highest variance, and
we iterate this operation, each new component being uncorrelated with the previous ones, until we have found all the principal components.
Let’s give an example!
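Rather than working the algebra out by hand, here is a hedged scikit-learn sketch on synthetic data (the dataset, its size and the number of components kept are all assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic dataset: 5 correlated attributes built from 2 hidden factors
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))

# Standardize first, so that every attribute has mean 0 and comparable scale
X_std = StandardScaler().fit_transform(X)

# Keep the principal components that capture most of the variance
pca = PCA(n_components=2).fit(X_std)
X_reduced = pca.transform(X_std)

print(pca.explained_variance_ratio_)  # share of variance kept by each component
print(X_reduced.shape)                # 5 original attributes reduced to 2
```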
Finally, we’re going to talk about data discretization. The purpose of this kind of data reduction is to decrease the
number of distinct values assumed by one or more attributes. Data discretization is the primary reduction
method of this kind: it reduces continuous attributes to categorical ones characterized by a limited number of distinct values.
For instance, the weekly spending of a mobile phone customer is a continuous numerical value, which
might be discretized into, say, five classes: low, [0, 10) euros; medium low, [10, 20) euros; medium, [20,
30) euros; medium high, [30, 40) euros; and high, over 40 euros.
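This five-class subdivision can be reproduced directly; a sketch with some made-up spending values:

```python
import pandas as pd

# Hypothetical weekly spending values, in euros
spending = pd.Series([4.2, 13.5, 22.0, 37.8, 55.0, 28.3])

# Discretize the continuous attribute into the five classes described above
classes = pd.cut(
    spending,
    bins=[0, 10, 20, 30, 40, float("inf")],
    labels=["low", "medium low", "medium", "medium high", "high"],
    right=False,  # intervals are closed on the left, e.g. [10, 20)
)
print(classes.tolist())
```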
The discretization process has brought about a reduction in the number of distinct values assumed by
each attribute. The models that can be generated on the reduced dataset are likely to be more intuitive and less
arbitrary.
For example, this:
if spending is in the medium low range, and if a customer resides in region A, then the probability of
churning is higher than 0.85.
Is easier to read than this:
if spending is in the [12.21, 14.79] euro range, and if a customer resides in province B, then the
probability of churning is higher than 0.85
Among the most popular discretization techniques are subjective subdivision, subdivision into classes
and hierarchical discretization.
Subjective subdivision is the most popular and intuitive method. Classes are defined based on the
experience and judgment of experts in the application domain.
Subdivision into categorical classes may be achieved in an automated way using the techniques
described below. In particular, the subdivision can be based on classes of equal size or equal width.
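A small pandas sketch of the difference between the two automated subdivisions, on made-up skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=20, size=1000))  # hypothetical spending data

# Subdivision into classes of equal width: every interval has the same length
equal_width = pd.cut(x, bins=5)

# Subdivision into classes of equal size: every class holds roughly the same
# number of observations (quantile-based)
equal_size = pd.qcut(x, q=5)

print(equal_width.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
```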
The third type of discretization is hierarchical discretization, which is based on hierarchical
relationships between concepts and may be applied to categorical attributes, as for the hierarchical
relationship between provinces and regions.
In general, given a hierarchical relationship of the one-to-many kind, it is possible to replace each value
of an attribute with the corresponding value found at a higher level in the hierarchy of concepts.
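For example, a tiny sketch of this replacement, using an invented province-to-region mapping:

```python
import pandas as pd

# Hypothetical one-to-many hierarchy: each province belongs to exactly one region
province_to_region = {
    "Milan": "Lombardy", "Bergamo": "Lombardy",
    "Turin": "Piedmont", "Asti": "Piedmont",
}

provinces = pd.Series(["Milan", "Turin", "Bergamo", "Asti", "Milan"])

# Replace each value with the corresponding higher-level concept
regions = provinces.map(province_to_region)
print(regions.tolist())  # fewer distinct values than the original attribute
```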
Slide 20
Here are some examples.