
Ch 2: Data Processing

Outline
• Data processing
• Data quality problems
• Data preprocessing
• Data cleaning
• Data integration
• Data reduction

What is data processing?

• In STEM, the term "data processing" on its own is considered too broad; it is typically used for the initial stage, followed by analysis to handle the overall data.
• Data processing is the collection and manipulation of items of data to produce meaningful information.
• The processing can be carried out in any manner detectable by an observer.
• It is the conversion of data into a usable and desired form. The conversion is carried out using a predefined sequence of operations, either:
  ■ Manual data processing, or
  ■ Automatic (electronic) data processing.

• Data processing may involve various processes (a small pandas sketch of a few of these operations follows the list):
  ■ Validation: ensuring that supplied data is correct and relevant.
  ■ Sorting: arranging items in some sequence and/or into different sets.
  ■ Summarization: reducing detailed data to its main points.
  ■ Aggregation: combining multiple pieces of data.
  ■ Analysis: the collection, organization, analysis, interpretation, and presentation of data.
  ■ Reporting: listing detail or a summary of the processed data.
  ■ Classification: separating data into various categories.
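A minimal sketch of a few of these operations using pandas; the DataFrame, column names, and values are invented for illustration and are not from the slides.

```python
# Toy illustration of validation, sorting, aggregation, summarization, and reporting.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "amount": [120.0, 80.5, None, 200.0, 95.0],
})

# Validation: keep only records with a valid (non-missing, positive) amount
valid = sales[sales["amount"].notna() & (sales["amount"] > 0)]

# Sorting: arrange the records by amount, largest first
ordered = valid.sort_values("amount", ascending=False)

# Aggregation + summarization: combine records per region and reduce them to totals
summary = ordered.groupby("region")["amount"].agg(["count", "sum", "mean"])

# Reporting: print a summary of the processed data
print(summary)
```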

Data Quality Problems

• Incomplete data
• Data duplication
• Inconsistent formats
• Accessibility
• System upgrades
• Data purging and storage
• Poor organization
1. Incomplete Data
• Why is data incomplete?
  ■ Data has not been entered into the system correctly,
  ■ Certain files may have been corrupted,
  ■ Some data has several missing variables.

• For example, if an address does not include a zip code at all, the remaining information can be of little value, since its geographical aspect would be hard to determine.
2. Data Duplication
• Multiple copies of the same records take a toll on computation and storage.
• Duplicates can produce skewed or incorrect insights when they go undetected.
• One of the key causes is human error, such as simply entering the same data multiple times by accident.
• Sometimes the problem is an algorithm that has gone wrong.
3. Inconsistent Formats
• Different organizations organize their data in different ways. Examples include:
  ■ Name (first name, last name),
  ■ Date of birth (US/UK style),
  ■ Phone number (with or without country code).
• If the data is stored inconsistently, the systems used to analyze or store the information may not interpret it correctly; how basic information is stored should be pre-determined.
• Inconsistent data may take data scientists a considerable amount of time simply to unravel the many versions of the data that were saved.
4. Accessibility
• Much of the data and information scientists use to create, evaluate, theorise and predict results or end products often gets lost.
• Data trickles down to business analysts in big organizations from departments, sub-divisions, and branches, and finally to the teams working on the data; the next user may or may not have complete access to it.
• Sharing data and making information available to everyone in an efficient manner is the cornerstone of sharing corporate data.

5. System Upgrades
• Every time the data management system gets an upgrade or the hardware is updated, there is a chance of information getting lost or corrupted.
• Making several backups of the data and upgrading systems only through authenticated sources is always advisable.
6. Data Purging and Storage
• In an organization, there is a chance that locally saved information could be deleted, either by mistake or deliberately.
  ■ Saving the data in a safe manner, and sharing it with the community, is crucial.

7. Poor Organization
• If we are not able to search through the data easily, it becomes significantly more difficult to make use of.
  ■ Across different organizational methods and procedures, there are dozens of ways that data can be represented.
Examples of Data Quality Problems
• Noisy data, due to:
  ■ Faulty data collection instruments, entry errors, transmission problems, technology limitations, and inconsistent naming conventions.
• Duplication: a data set may include data objects that are duplicates, or almost duplicates, of one another.
  ■ This is a major issue when merging data from heterogeneous sources, e.g. a person with multiple email addresses.
• Impossible data combinations (e.g. Gender: Male, Pregnant: Yes).
• Data in multiple units and languages.

Quality Data
• Data have quality if they satisfy the requirements of the intended use and solve the data quality problems above. Quality dimensions include:
  ■ Accuracy,
  ■ Completeness,
  ■ Consistency,
  ■ Timeliness,
  ■ Believability, and
  ■ Interpretability.
Data Preprocessing
• The theory and practice of manipulating/automating electronic data so that it can be used for a specific application.
• Preprocessing may have a different scope depending on the application and domain.
• Writing trivial one-off string manipulation programs is not economical; performing these tasks requires robust text processing.
  ■ A widely used approach: regular expressions (RegEx) for NLP (a small sketch follows below).
• Preprocessing ML data involves both data engineering and feature engineering:
  ■ Data engineering is the process of converting raw data into prepared data.
  ■ Feature engineering then tunes the prepared data to create the features expected by the ML model.
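A minimal sketch of regex-based text preprocessing; the sample string and patterns are assumptions for illustration only.

```python
# Regex-based text cleanup sketch; the raw text below is invented.
import re

raw = "Contact: John.Doe@example.com, phone +1 (555) 010-1234!!  Visit http://example.com"

text = raw.lower()                              # normalize case
text = re.sub(r"https?://\S+", " ", text)       # strip URLs
text = re.sub(r"\S+@\S+", " ", text)            # strip email addresses
text = re.sub(r"[^a-z0-9\s]", " ", text)        # drop punctuation and symbols
text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(text)   # e.g. "contact phone 1 555 010 1234 visit"
```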

Data Engineering (Raw Data)

• Refers to the data in its source form, without any prior preparation for ML.
• The data might be in its raw form (a flat file) or in a transformed form (in a database).
• Transformed data might have been converted from its original raw form to be used for analytics, but not in the context of our ML task.
• In addition, data sent from other systems that eventually call ML models for predictions is considered to be data in its raw form.
Data Engineering (Prepared Data)
• Refers to the dataset in a form ready for your machine learning task.
• Data sources have been parsed, joined, and put into tabular form after aggregating and summarizing at the right granularity.
• Each row in the dataset represents a unique record, and each column represents summary information for ML.
• In the case of supervised learning tasks, the target feature is present.
• Irrelevant columns have been dropped, and invalid records have been filtered out.

Feature Engineering
• Refers to the dataset with the tuned features expected by the model.
• Involves performing certain ML-specific operations on the columns of the prepared dataset, and creating new features for your model during training and prediction (the preprocessing operations below).
• Examples: scaling numerical columns to a value between 0 and 1, clipping values, and one-hot encoding categorical features.
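A rough sketch of those three operations with pandas and NumPy; the column names, values, and clip bounds are assumptions for this example.

```python
# Sketch of clipping, [0, 1] scaling, and one-hot encoding on invented data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22, 35, 58, 120, 41],        # 120 looks like an entry error
    "city": ["Addis Ababa", "Adama", "Addis Ababa", "Hawassa", "Adama"],
})

# Clipping: cap the numeric column to a plausible range
df["age"] = np.clip(df["age"], 0, 100)

# Scaling: map the numeric column to [0, 1] (min-max scaling)
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# One-hot encoding: expand the categorical column into indicator features
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df)
```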
Preprocessing Operations
• Each operation aims to help ML build better predictive models.
• Some common operations for structured data:

1. Data cleansing
  ■ Removing or correcting records with corrupted or invalid values from raw data, as well as removing records that are missing a large number of columns.
2. Instance selection and partitioning
  ■ Selecting data points from the input dataset to create training, evaluation (validation), and test sets, using random sampling, minority-class oversampling, and stratified partitioning.
3. Feature tuning
  ■ Improving the quality of a feature for ML, which includes scaling and normalizing numeric values, imputing missing values, clipping outliers, and adjusting values with skewed distributions.
4. Representation transformation
  ■ Converting a numeric feature to a categorical feature, and vice versa.
5. Feature extraction
  ■ Reducing the number of features by creating lower-dimensional, more powerful data representations using PCA, embedding extraction, and hashing (see the sketch after this list).
6. Feature selection
  ■ Selecting a subset of the input features for training the model, and ignoring the irrelevant or redundant ones, using filter or wrapper methods; this can also involve simply dropping features that are missing a large number of values.
7. Feature construction
  ■ Creating new features using typical techniques such as polynomial expansion or feature crossing.
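A minimal sketch of feature extraction with PCA (operation 5 above); the synthetic data and the choice of two components are assumptions for illustration.

```python
# PCA-based feature extraction on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 records, 10 original features

pca = PCA(n_components=2)             # keep a lower-dimensional representation
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                         # (100, 2)
print(pca.explained_variance_ratio_.round(3))  # variance captured by each component
```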

• When working with unstructured data such as images, audio, or text documents, deep learning has largely replaced domain-knowledge-based feature engineering by folding it into the model architecture.
• A convolutional layer acts as an automatic feature preprocessor, although constructing the right model architecture still requires some empirical knowledge of the data. In addition, some amount of preprocessing is still needed, such as:
  ■ Text documents: stemming and lemmatization, TF-IDF calculation, n-gram extraction, and embedding lookup.
  ■ Images: clipping, resizing, cropping, Gaussian blur, and Canny filters.
  ■ Transfer learning, in which all but the last layers of a fully trained model are treated as a feature engineering step. This applies to all types of data, including text and images.
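A small sketch of TF-IDF calculation with n-gram extraction using scikit-learn; the two-document corpus is invented for illustration.

```python
# TF-IDF with unigrams and bigrams on a tiny invented corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data processing produces meaningful information",
    "data cleaning removes incorrect and duplicated data",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)

print(X.shape)                                  # (2 documents, n n-gram features)
print(vectorizer.get_feature_names_out()[:5])   # a few of the extracted n-grams
```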
Data Cleaning
• The process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
• Data cleaning cleans the data by:
  ■ Filtering unwanted outliers and smoothing noisy data
  ■ Removing duplicate and irrelevant observations
  ■ Fixing structural errors such as typos or inconsistent capitalization
  ■ Filling in missing values
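A minimal data-cleaning sketch with pandas; the column names, values, and the age validity range are assumptions for this example.

```python
# Remove duplicates, fix capitalization, and filter an outlier on invented data.
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "Alice", "BOB", "carol"],
    "age": [25, 25, 400, 31],        # 400 is an obvious outlier / entry error
})

# Fix structural errors such as inconsistent capitalization
df["name"] = df["name"].str.strip().str.title()

# Remove duplicate observations
df = df.drop_duplicates()

# Filter unwanted outliers with a simple validity rule
df = df[df["age"].between(0, 120)]

print(df)
```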

Handling Missing Values

• Ignore the tuple whenever the class label is missing.
• Estimate the missing values:
  ■ Filling in missing values manually is time consuming and not feasible for a large data set.
  ■ Use a global constant.
  ■ Use the attribute or group mean.
  ■ Use the most probable value (popular).

• Choosing the right technique depends on the problem domain.
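A sketch of the simpler strategies above using pandas and scikit-learn; the column and values are invented, and the "most frequent" strategy is used here only as a simple stand-in for the "most probable value" approach.

```python
# Filling missing values with a constant, the mean, and the most frequent value.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [45.0, np.nan, 52.0, 61.0, np.nan]})

# Option 1: use a global constant
filled_constant = df["income"].fillna(0)

# Option 2: use the attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# Option 3: use the most frequent value via scikit-learn
imputer = SimpleImputer(strategy="most_frequent")
filled_mode = imputer.fit_transform(df[["income"]])

print(filled_constant.tolist(), filled_mean.tolist(), filled_mode.ravel().tolist())
```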
Data Integration
• Blending data from multiple sources into a coherent data store.
• Issues to be considered during integration:
  ■ Some redundancies can be detected by correlation analysis when the same entity appears in multiple sources.

Data Integration Approaches

• Data consolidation
  ■ Brings data together from several separate systems to reduce the number of data storage locations.
• Data propagation
  ■ The use of applications to copy data from one location to another, either synchronously or asynchronously.
• Data virtualization
  ■ Uses an interface to provide a near real-time, unified view of data from disparate sources with different data models.
• Data federation
  ■ Uses a virtual database and creates a common data model for heterogeneous data from different systems; it is a type of data virtualization.
Data Transformation
• ML data may not be in the right format or may require transformations to make it more useful. Data transformation activities and techniques include:
  ■ Categorical encoding
    ▪ Label encoding converts categorical variables to a numerical representation, something that is machine-readable.
  ■ Dealing with skewed data
    ▪ For regression algorithms such as linear regression or ANNs, better results are often obtained with a more symmetric distribution; you can use roots (square root, cube root), logarithms (base e or base 10), reciprocals (positive or negative), or the Box-Cox transformation.
  ■ Bias mitigation
    ▪ If bias is detected in the data, it can be mitigated by replacing the current values, or labels, with those that will result in a fairer model, using techniques such as reweighing, optimized preprocessing, learning fair representations, and disparate impact remover.
  ■ Scaling
    ▪ Scaling is a method of transforming data into a particular range. This is important when using regression algorithms and algorithms based on Euclidean distances (e.g. KNN or K-Means), as they are sensitive to variation in magnitude and range across features.
    ▪ The goal of scaling is to change the values of each numerical feature in the data set to a common scale, e.g. with min-max scaling or z-score standardization.
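A sketch of a few of these transformations (label encoding, a log transform for skewed data, and min-max scaling / z-score standardization); the column names and values are invented for illustration.

```python
# Label encoding, log transform, and scaling on invented data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "income": [30_000, 45_000, 1_200_000, 52_000],   # right-skewed
})

# Categorical (label) encoding: map categories to integer codes
df["color_code"] = df["color"].astype("category").cat.codes

# Dealing with skewed data: a log transform makes the distribution more symmetric
df["income_log"] = np.log10(df["income"])

# Scaling: min-max scaling to [0, 1] and z-score standardization
x = df["income_log"]
df["income_minmax"] = (x - x.min()) / (x.max() - x.min())
df["income_zscore"] = (x - x.mean()) / x.std()

print(df)
```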
Data Reduction
• Most machine learning techniques may not be effective for high-dimensional data. A database or data warehouse may store terabytes of data.
  ■ Data analysis on such huge amounts of data may take a very long time.
• Data reduction techniques can be applied to obtain a representation of the actual data that is reduced in volume but still contains the critical information.

Data Reduction Techniques

• Data cube aggregation
  ■ Aggregation operations are applied to the data in the construction of a data cube.
• Dimensionality reduction
  ■ Redundant attributes are detected and removed, which reduces the data set size.
• Data compression
  ■ The data is encoded in a reduced or compressed representation of the original.
• Numerosity reduction
  ■ The data is replaced or estimated by smaller alternative representations.
• Discretization and concept hierarchy generation
  ■ Data values are replaced by ranges or higher conceptual levels (see the sketch below).
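A minimal sketch of discretization and a simple concept hierarchy using pandas; the ages, bin edges, and labels are assumptions for illustration.

```python
# Replace raw values with ranges, then with higher-level concepts.
import pandas as pd

ages = pd.Series([5, 17, 24, 39, 61, 80])

# Discretization: replace raw ages with ranges
ranges = pd.cut(ages, bins=[0, 18, 40, 65, 120])

# Concept hierarchy: replace ranges with higher conceptual levels
concepts = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                  labels=["child", "young adult", "adult", "senior"])

print(pd.DataFrame({"age": ages, "range": ranges, "concept": concepts}))
```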
