What Is Data Processing?
Outline
Data Processing
Data quality problems
Data preprocessing
Data cleaning
Data integration
Data reduction
1. Missing data
If an address does not include a zip code at all, the remaining information may
be of little value, since its geographical aspect would be hard to determine.
2. Data duplication
Multiple copies of the same record take a toll on computation and storage,
and can produce skewed or incorrect insights when they go undetected.
One key cause is human error:
simply entering the same data multiple times by accident.
Sometimes the problem is an algorithm that has gone wrong.
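A minimal sketch of detecting and removing such duplicates, in plain Python; the field names ("name", "email") are illustrative assumptions, not a real schema:

```python
# Remove exact duplicate records, keeping the first occurrence of each.
def deduplicate(records, keys):
    """Identify each record by a tuple of the given key fields."""
    seen = set()
    unique = []
    for rec in records:
        fingerprint = tuple(rec[k] for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique

rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # accidental re-entry
    {"name": "Bo",  "email": "bo@example.com"},
]
clean = deduplicate(rows, keys=("name", "email"))  # -> 2 unique records
```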
1/20/22
3. Inconsistent Formats
Different organizations format their data in different ways.
These include:
■ Name (first name, last name),
■ Date of birth (US/UK style),
■ Phone number with or without country code.
If the data is stored inconsistently, the systems used to analyze or store
the information may not interpret it correctly. The format for storing basic
information should be pre-determined.
Inconsistent data may take data scientists a considerable amount of time
simply to unravel the many versions of data saved.
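A sketch of normalizing the formats listed above to one canonical form; the accepted input patterns and the default country code are assumptions for illustration:

```python
import re
from datetime import datetime

def normalize_date(text):
    """Try US (MM/DD/YYYY), then UK (DD-MM-YYYY), then ISO; return ISO 8601."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text}")

def normalize_phone(text, default_country="+1"):
    """Strip punctuation; prepend a country code when none is present."""
    digits = re.sub(r"[^\d+]", "", text)
    return digits if digits.startswith("+") else default_country + digits

iso = normalize_date("01/20/2022")         # -> "2022-01-20"
phone = normalize_phone("(555) 123-4567")  # -> "+15551234567"
```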
4. Accessibility
Much of the data and information scientists use to create, evaluate, theorise, and predict results
or end products often gets lost along the way.
In big organizations, data trickles down to business analysts from departments,
sub-divisions, branches, and finally the teams who work on the data,
and each user along the way may or may not have complete access to this information.
Sharing data and making information available to everyone in an efficient manner is the
cornerstone of managing corporate data.
5. System Upgrades
Every time the data management system gets an upgrade or the hardware is updated, there is a
chance of information getting lost or corrupted.
Making several back-ups of data and upgrading the systems only through authenticated sources
is always advisable.
6. Data purging and storage
In an organization, there is a chance that locally saved
information could be deleted, either by mistake or deliberately.
■ Saving the data safely, and sharing it with the
community, is crucial.
7. Poor organization
If we are not able to search through the data easily, it
becomes significantly more difficult to make use of.
■ Through different organizational methods and procedures, there
are dozens of ways that data can be represented.
Examples of data quality problems
Noisy data due to
faulty data collection instruments, entry errors, transmission problems,
technology limitations, and inconsistent naming conventions.
Duplication: a data set may include data objects that are duplicates of one another.
Quality Data
Data have quality if they satisfy the requirements of the intended use and solve the
data quality problems above. Quality dimensions include:
Accuracy,
Completeness,
Consistency,
Timeliness,
Believability, and
Interpretability.
Data Preprocessing
Is the theory and practice of manipulating (and automating the manipulation of) electronic data
so that it can be used for a specific application.
Preprocessing may have a different scope depending on the application and domain.
Writing trivial string-manipulation programs is not economical; performing these tasks
requires robust text processing.
Most widely used approach: RegEx for NLP
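As a sketch of the RegEx approach, here is a minimal regex tokenizer, a common first text-preprocessing step; the token pattern is an illustrative choice, not a standard:

```python
import re

# Match words (optionally with an internal apostrophe, e.g. "isn't") or numbers.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+")

def tokenize(text):
    """Lowercase the text and split it into word and number tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

tokens = tokenize("Preprocessing isn't trivial: 2 steps remain!")
# -> ["preprocessing", "isn't", "trivial", "2", "steps", "remain"]
```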
Preprocessing ML data involves both data engineering and feature engineering.
Data engineering is the process of converting raw data into prepared data.
Feature engineering then tunes the prepared data to create the features expected by the ML model.
Feature Engineering
This refers to producing the dataset with the tuned features expected by the model.
Preprocessing operations include performing certain ML-specific operations on the
columns of the prepared dataset, and creating new features for your model during
training and prediction.
Examples: scaling numerical columns to a value between 0 and 1, clipping values, and one-hot-
encoding categorical features.
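The two example operations can be sketched in plain Python, min-max scaling and one-hot encoding; the sample values are illustrative:

```python
def min_max_scale(values):
    """Scale a numeric column linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def one_hot(values):
    """Encode a categorical column as indicator vectors (sorted categories)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

scaled = min_max_scale([10, 20, 40])        # -> [0.0, 0.333..., 1.0]
encoded = one_hot(["red", "blue", "red"])   # categories: ["blue", "red"]
```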
Preprocessing Operations
Each operation aims to help build better predictive ML models.
Some of the operations for structured data:
1. Data cleansing
■ Removing or correcting records with corrupted or invalid values from raw data, as well as
removing records that are missing a large number of columns.
2. Instance selection and partitioning
■ Selecting data points from the input dataset to create training, evaluation (validation),
and test sets using random sampling, minority-class oversampling, and stratified
partitioning.
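A minimal sketch of stratified partitioning in plain Python; grouping by label before splitting keeps class proportions similar in both sets (the label key and fractions are illustrative):

```python
import random

def stratified_split(records, label_key, test_fraction=0.25, seed=42):
    """Split records into train/test sets, preserving label proportions."""
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    rng = random.Random(seed)  # fixed seed for a reproducible partition
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

data = [{"x": i, "y": i % 2} for i in range(20)]  # two balanced classes
train, test = stratified_split(data, "y")          # 16 train / 4 test
```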
3. Feature tuning
■ Improving the quality of a feature for ML through transformation, which includes scaling
and normalizing numeric values, imputing missing values, clipping outliers, and adjusting
values with skewed distributions.
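Two of the tuning steps above, mean imputation of missing values and clipping outliers, sketched in plain Python (`None` stands in for a missing value; the bounds are illustrative):

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def clip(values, lo, hi):
    """Clamp every value into the range [lo, hi] to tame outliers."""
    return [min(max(v, lo), hi) for v in values]

filled = mean_impute([1.0, None, 3.0])    # -> [1.0, 2.0, 3.0]
bounded = clip([-5, 0, 50], lo=0, hi=10)  # -> [0, 0, 10]
```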
4. Representation transformation
■ Converting a numeric feature to a categorical feature and vice versa.
5. Feature extraction
■ Reducing the number of features by creating lower-dimensional, more powerful data
representations using PCA, embedding extraction, and hashing.
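Of the three techniques named, the hashing trick is the simplest to sketch in plain Python: it maps an unbounded set of token features into a fixed, lower-dimensional count vector (the dimension 8 is an arbitrary illustrative choice):

```python
import zlib

def hash_features(tokens, n_dims=8):
    """Bucket each token into one of n_dims slots via a deterministic hash."""
    vec = [0] * n_dims
    for tok in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(tok.encode()) % n_dims] += 1
    return vec

vec = hash_features(["cat", "dog", "cat"], n_dims=8)
```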
6. Feature selection
■ Selecting a subset of the input features for training the model, and ignoring irrelevant or
redundant ones, using filter or wrapper methods; one simple filter is dropping features that
are missing a large number of values.
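That simple filter method can be sketched as follows: drop any feature whose fraction of missing values (`None` here) exceeds a threshold (the threshold and field names are illustrative):

```python
def select_features(rows, feature_names, max_missing=0.5):
    """Keep only features whose missing-value fraction is <= max_missing."""
    kept = []
    for name in feature_names:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing / len(rows) <= max_missing:
            kept.append(name)
    return kept

rows = [{"a": 1, "b": None}, {"a": 2, "b": None}, {"a": None, "b": 3}]
kept = select_features(rows, ["a", "b"])  # "b" is 2/3 missing -> dropped
```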
7. Feature construction
■ Creating new features using typical techniques such as polynomial
expansion or feature crossing.
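A minimal sketch of degree-2 polynomial expansion in plain Python: the new features are all pairwise products (feature crosses) plus the squares of the originals:

```python
from itertools import combinations

def polynomial_expand(features):
    """Degree-2 expansion: originals, pairwise crosses, then squares."""
    expanded = list(features)
    expanded += [a * b for a, b in combinations(features, 2)]  # crosses
    expanded += [a * a for a in features]                      # squares
    return expanded

out = polynomial_expand([2, 3])  # -> [2, 3, 6, 4, 9]
```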
When working with unstructured data such as images, audio, or text documents, deep
learning has largely replaced domain-knowledge-based feature engineering by folding it into
the model architecture.
A convolutional layer, for example, acts as an automatic feature preprocessor, although
constructing the right model architecture requires some empirical knowledge of the data.
In addition, some amount of preprocessing is still needed, such as:
■ Text documents: stemming and lemmatization, TF-IDF calculation, n-gram
extraction, and embedding lookup.
■ Images: clipping, resizing, cropping, Gaussian blur, and Canny filters.
■ Transfer learning, in which you treat all but the last layers of a fully trained model as a
feature engineering step. This applies to all types of data, including text and images.
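Two of the text-preprocessing steps named above, n-gram extraction and TF-IDF weighting, sketched minimally in plain Python (the tiny corpus is illustrative, and this `tf_idf` assumes the term appears in at least one document):

```python
import math

def ngrams(tokens, n=2):
    """Return all contiguous n-token windows (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` times inverse document frequency in `corpus`."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

docs = [["data", "cleaning"], ["data", "reduction"], ["feature", "tuning"]]
bigrams = ngrams(["data", "cleaning", "matters"])
score = tf_idf("cleaning", docs[0], docs)  # rare term -> nonzero weight
```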
Data Cleaning
Is the process of preparing data for analysis by removing or modifying data that is incorrect,
incomplete, irrelevant, duplicated, or improperly formatted.
Data cleaning cleans the data by:
■ Filtering unwanted outliers and smoothing noisy data
■ Removing duplicate and irrelevant observations
■ Fixing structural errors such as typos, inconsistent capitalization, and inconsistent
sources
■ Filling in missing values
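The cleaning steps above can be tied together in one small sketch; the field names, the capitalization fix, and the valid-age range are illustrative assumptions:

```python
def clean(records):
    """Drop incomplete rows, filter outliers, fix casing, remove duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        name = rec.get("name", "").strip().title()  # fix structural errors
        age = rec.get("age")
        if not name or age is None:                 # drop incomplete rows
            continue
        if not (0 <= age <= 120):                   # filter outliers
            continue
        key = (name, age)
        if key in seen:                             # remove duplicates
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned

raw = [
    {"name": "  ada  ", "age": 36},
    {"name": "Ada", "age": 36},   # duplicate after normalization
    {"name": "Bo", "age": 999},   # outlier
    {"name": "", "age": 20},      # incomplete
]
tidy = clean(raw)  # -> [{"name": "Ada", "age": 36}]
```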