Data Preprocessing

Uploaded by

Merenissa Balato

Data Preprocessing In Depth | Towards Data Science

Understanding Data Preprocessing

Harshita Singh
Published in Towards Data Science · 6 min read · May 13, 2020

Photo by Franki Chamaki on Unsplash

Data preprocessing is an important task. It is a data mining technique that transforms raw data into a more understandable, useful and efficient format.

Data has a better idea. This idea will be clearer and more understandable after ...

Real world data is generally:
- Incomplete: Certain attributes or values, or both, are missing, or only aggregate data is available.
- Noisy: Data contains errors or outliers.
- Inconsistent: Data contains differences in codes, names, etc.

Tasks in data preprocessing
1. Data Cleaning: Also known as scrubbing. This task involves filling in missing values, smoothing or removing noisy data and outliers, and resolving inconsistencies.
2. Data Integration: This task involves integrating data from multiple sources such as databases (relational and non-relational), data cubes, files, etc. The data sources can be homogeneous or heterogeneous. The data obtained from the sources can be structured, unstructured or semi-structured in format.
3. Data Transformation: This involves normalisation and aggregation of data according to the needs of the data set.
4. Data Reduction: During this step data is reduced. The number of records or the number of attributes or dimensions can be reduced. Reduction is performed keeping in mind that the reduced data should produce the same, or nearly the same, results as the original data.
5. Data Discretization: It is considered a part of data reduction.
The numerical attributes are replaced with nominal ones.

Data Cleaning

The data cleaning process detects and removes the errors and inconsistencies present in the data and improves its quality. Data quality problems occur due to misspellings during data entry, missing values or any other invalid data. Basically, “dirty” data is transformed into clean data. “Dirty” data does not produce accurate or useful results: garbage in, garbage out. It is therefore very important to handle this data, and professionals spend a lot of their time on this step.

Reasons for “dirty” or “unclean” data
1. Dummy values
2. Absence of data
3. Violation of business rules
4. Data integration problems
5. Contradicting data
6. Inappropriate use of address line
7. Reused primary keys
8. Non-unique identifiers

What to do to clean data?
1. Handle Missing Values
2. Handle Noise and Outliers
3. Remove Unwanted Data

Handle Missing Values

Missing values cannot be overlooked in a data set. They must be handled, not least because many models do not accept missing values. There are several techniques for handling missing data, and choosing the right one is of utmost importance. The choice depends on the problem domain and the goal of the data mining process. The different ways to handle missing data are:
1. Ignore the data row: This method is suggested for records where most of the data is missing, rendering the record meaningless. It is usually avoided where only a few attribute values are missing. If every row with a missing value is ignored, i.e. removed, too much data may be lost, resulting in poor performance.
2. Fill the missing values manually: This is a very time-consuming method and hence infeasible for almost all scenarios.
3. Use a global constant to fill in for missing values: A global constant like “NA” or 0 can be used to fill all the missing data. This method is used when missing values are difficult to predict.
4. Use attribute mean or median: The mean or median of the attribute is used to fill the missing value.
5. Use forward fill or backward fill: Either the previous value or the next value is used to fill the missing value. The mean of the previous and succeeding values may also be used.
6. Use a data-mining algorithm to predict the most probable value.

Handle Noise and Outliers

Noise in data may be introduced by faults in data collection, errors during data entry, data transmission errors, etc. Unknown encodings (Example: Marital Status - Q), out-of-range values (Example: Age - -10), inconsistent data (Example: DoB - 4th Oct 1999, Age - 50) and inconsistent formats (Example: DoJ - 13th Jan 2000, DoL - 10/10/2016) are different types of noise and outliers.

Noise can be handled using binning. In this technique, sorted data is placed into bins or buckets. Bins can be created by equal-width (distance) or equal-depth (frequency) partitioning. Smoothing can then be applied to these bins, by bin mean, bin median or bin boundaries.

Outliers can be smoothed by binning the data and then smoothing it. They can be detected using visual analysis or boxplots. Clustering can be used to identify groups of outlier data. The detected outliers may be smoothed or removed.

Remove Unwanted Data

Unwanted data is duplicate or irrelevant data. Scraping data from different sources and then integrating it may lead to duplicate data if not done carefully. This redundant data should be removed, as it is of no use and will only increase the amount of data and the time to train the model.
Also, due to redundant records, the model may not provide accurate results, as the duplicate data interferes with the analysis process by giving more importance to the repeated values.

Data Integration

In this step, a coherent data source is prepared. This is done by collecting and integrating data from multiple sources like databases, legacy systems, flat files, data cubes, etc.

Data is like garbage. You'd better know what you are going to do with it before you collect it. - Mark Twain

Issues in Data Integration
1. Schema Integration: Metadata (i.e. the schema) from different sources may not be compatible. This leads to the entity identification problem. Example: Consider two data sources R and S. Customer id is represented as cust_id in R and as c_id in S. They mean the same thing but have different names, which leads to integration problems. Detecting and resolving such mismatches is very important for a coherent data source.
2. Data value conflicts: The values, metrics or representations of the same real-world entity may differ across data sources. This leads to different representations of the same data, different scales, etc. Example: Weight is represented in kilograms in data source R and in grams in source S. To resolve this, data representations should be made consistent and conversions performed accordingly.
3. Redundant data: Duplicate attributes or tuples may occur as a result of integrating data from various sources. This may also lead to inconsistencies. These redundancies or inconsistencies may be reduced by careful integration of data from multiple sources.
This will help in improving the mining speed and quality. Also, correlation analysis can be performed to detect redundant data.

Data Reduction

If the data is very large, data reduction is performed. Sometimes it is also performed to find the most suitable subset of attributes from a large number of attributes; this is known as dimensionality reduction. Data reduction also involves reducing the number of attribute values and/or the number of tuples. Various data reduction techniques are:
1. Data cube aggregation: In this technique the data is reduced by applying OLAP operations like slice, dice or rollup. It uses the smallest level necessary to solve the problem.
2. Dimensionality reduction: The data attributes or dimensions are reduced. Not all attributes are required for data mining. The most suitable subset of attributes is selected using techniques like forward selection, backward elimination, decision tree induction, or a combination of forward selection and backward elimination.
3. Data compression: In this technique, large volumes of data are compressed, i.e. the number of bits used to store the data is reduced. This can be done using lossy or lossless compression. In lossy compression, the quality of data is compromised for more compression. In lossless compression, the quality of data is not sacrificed to achieve a higher compression level.
4. Numerosity reduction: This technique reduces the volume of data by choosing smaller forms of data representation. Numerosity reduction can be done using histograms, clustering or sampling of data. It is necessary because processing the entire data set is expensive and time consuming.
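The numerosity reduction techniques mentioned above (histograms and sampling) can be sketched in plain Python. This is only an illustrative sketch: the function names, the sample values and the fixed seed are assumptions made for the example, and the histogram assumes the values are not all identical.

```python
import random
from collections import Counter


def equal_width_histogram(values, n_buckets):
    """Summarise values as (bucket_start, bucket_end, count) triples.

    Assumes max(values) > min(values); otherwise the bucket width is zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_buckets
    counts = Counter()
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - low) / width), n_buckets - 1)
        counts[idx] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(n_buckets)]


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; the seed makes the draw repeatable."""
    return random.Random(seed).sample(rows, n)


values = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical attribute values
histogram = equal_width_histogram(values, 3)
sample = simple_random_sample(values, 4)
```

Here nine values are replaced by three bucket summaries, or by a four-value sample. Which form is acceptable depends on how much detail the later mining step needs.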
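Looking back at Handle Missing Values, the constant, mean or median, and forward or backward fill strategies can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names and the ages list are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean, median


def fill_constant(values, constant=0):
    """Replace every missing entry with one global constant."""
    return [constant if v is None else v for v in values]


def fill_with(values, statistic=mean):
    """Replace missing entries with a statistic (mean or median) of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Use the next observed value; trailing gaps stay None."""
    return forward_fill(values[::-1])[::-1]


ages = [25, None, 31, None, 40]  # hypothetical attribute with two gaps
by_constant = fill_constant(ages)
by_mean = fill_with(ages, mean)
by_median = fill_with(ages, median)
by_ffill = forward_fill(ages)
by_bfill = backward_fill(ages)
```

Each strategy gives a different result for the same gaps, which is why the choice has to follow the problem domain rather than convenience.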
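The binning approach described under Handle Noise and Outliers can also be sketched in a few lines. The example below assumes equal-depth (frequency) partitioning with a depth of three and uses a hypothetical list of prices. The function names are illustrative, and sending ties to the lower boundary is a choice made for this sketch.

```python
from statistics import mean


def equal_depth_bins(values, depth):
    """Sort the values and split them into bins of `depth` items each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_mean(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical data
bins = equal_depth_bins(prices, 3)
by_mean = smooth_by_mean(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by mean pulls every value in a bin to a single number, while smoothing by boundaries keeps the spread of each bin but removes the values in between.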
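The three integration issues listed under Issues in Data Integration can be illustrated with the R and S example. In the sketch below, cust_id and c_id come from the example in the text, while the weight_kg and weight_g field names, the sample records and the rule of keeping the first record per customer are assumptions made for illustration.

```python
def normalize_s_record(record):
    """Map a record from source S onto the schema used by source R."""
    return {
        "cust_id": record["c_id"],                # schema integration: c_id -> cust_id
        "weight_kg": record["weight_g"] / 1000,   # value conflict: grams -> kilograms
    }


def integrate(r_records, s_records):
    """Combine both sources, keeping the first record seen per customer."""
    merged = {}
    for rec in r_records + [normalize_s_record(s) for s in s_records]:
        merged.setdefault(rec["cust_id"], rec)    # redundant tuples are dropped
    return list(merged.values())


# Hypothetical sample data for the two sources.
source_r = [{"cust_id": 1, "weight_kg": 70.5}, {"cust_id": 2, "weight_kg": 82.0}]
source_s = [{"c_id": 2, "weight_g": 82000}, {"c_id": 3, "weight_g": 64250}]

combined = integrate(source_r, source_s)
```

Customer 2 appears in both sources, so only one copy survives. A real integration would also need a rule for what to do when the two copies disagree.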