PMA Unit-2 PDF
1. Individual Predictors
Scaling and normalization adjust the scale of features to ensure that no single feature
dominates the model due to its scale.
Transformations can stabilize variance, make data more normal, or reduce skewness.
Log Transformation
o Purpose: To reduce right skewness and stabilize variance.
o Formula: X_log = log(X + ε)
o Use Case: Useful for data that span several orders of magnitude or are highly
skewed.
o Example: If X is 1000, log-transforming with a small constant ε of 1 yields
X_log = log10(1000 + 1) ≈ 3.00 (using a base-10 logarithm).
Square Root Transformation
o Purpose: To reduce right skewness.
o Formula: X_sqrt = √X
o Use Case: Commonly used for count data or skewed data with moderate
skewness.
o Example: For X = 16, the square root transformation yields X_sqrt = √16 = 4.
Box-Cox Transformation
o Purpose: To stabilize variance and make data more normally distributed. It is
a family of power transformations.
o Formula: X_BC = (X^λ − 1) / λ for λ ≠ 0; X_BC = log(X) for λ = 0
o Use Case: When data is positively skewed and transformation needs to adjust
variance.
o Example: Determining the optimal λ using maximum likelihood
estimation to transform data effectively (a short code sketch of these transformations follows below).
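A minimal Python sketch of the three transformations above, assuming a hypothetical DataFrame with a positive, right-skewed column `x`; scipy's boxcox estimates λ by maximum likelihood when no λ is supplied:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed, strictly positive feature
df = pd.DataFrame({"x": [1, 10, 100, 1000, 10000]})

# Log transformation: add a small constant (epsilon = 1 here) to avoid log(0)
df["x_log"] = np.log10(df["x"] + 1)

# Square root transformation
df["x_sqrt"] = np.sqrt(df["x"])

# Box-Cox transformation: scipy estimates the optimal lambda by maximum likelihood
df["x_boxcox"], fitted_lambda = stats.boxcox(df["x"])

print(df)
print("Estimated Box-Cox lambda:", fitted_lambda)
```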
One-Hot Encoding
o Purpose: Converts categorical variables into binary vectors.
o Method: Each category is represented as a binary vector with one '1' and rest
'0's.
o Example: For a feature with categories [Red, Blue, Green], one-hot encoding
results in [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
Label Encoding
o Purpose: Converts categories into integer values.
o Method: Assigns a unique integer to each category.
o Example: Categories [Red, Blue, Green] are encoded as [0, 1, 2].
Frequency Encoding
o Purpose: Replaces categories with their frequency in the dataset.
o Method: Encodes categories based on their occurrence count.
o Example: If 'Red' appears 50 times, 'Blue' 30 times, and 'Green' 20 times,
these will be encoded as [50, 30, 20].
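A minimal pandas sketch of the three encoding methods above; the `color` column and its category counts are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red", "Red", "Blue"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to a unique integer
labels = {"Red": 0, "Blue": 1, "Green": 2}
df["color_label"] = df["color"].map(labels)

# Frequency encoding: replace each category with its occurrence count
counts = df["color"].value_counts()
df["color_freq"] = df["color"].map(counts)

print(pd.concat([df, one_hot], axis=1))
```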
2. Multiple Predictors
Purpose: To model interactions between features that might jointly affect the target
variable.
Method: Create new features by multiplying pairs or groups of existing features.
Example: For features X1 and X2, the interaction term would be X_int = X1 × X2.
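A brief sketch of building an interaction term, both directly and with scikit-learn's PolynomialFeatures; the feature names x1 and x2 are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Direct pairwise interaction term
df["x_int"] = df["x1"] * df["x2"]

# PolynomialFeatures with interaction_only=True generates all pairwise products
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df[["x1", "x2"]])
print(poly.get_feature_names_out())  # ['x1', 'x2', 'x1 x2']
print(interactions)
```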
Feature selection techniques reduce the dimensionality and improve model performance.
Filter Methods
o Purpose: Evaluate features based on statistical measures.
o Method: Use techniques like correlation coefficients or Chi-square tests to
select important features.
o Example: Selecting features with a high correlation to the target variable.
Wrapper Methods
o Purpose: Evaluate subsets of features based on model performance.
o Method: Use methods like Recursive Feature Elimination (RFE) to iteratively
select features.
o Example: RFE evaluates feature subsets by training a model and removing the
least significant features.
Embedded Methods
o Purpose: Perform feature selection during model training.
o Method: Use algorithms like LASSO, which includes feature selection as part
of the regularization process.
o Example: LASSO regression applies L1 regularization, which can shrink
some feature coefficients to zero (a short code sketch of these methods follows below).
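A hedged scikit-learn sketch of the three feature-selection families on a synthetic regression dataset; the k values, alpha, and thresholds are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter method: keep the k features with the strongest univariate association
filter_sel = SelectKBest(score_func=f_regression, k=4).fit(X, y)
print("Filter keeps:", filter_sel.get_support())

# Wrapper method: RFE repeatedly fits a model and drops the weakest feature
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=4).fit(X, y)
print("RFE keeps:", wrapper_sel.support_)

# Embedded method: LASSO's L1 penalty shrinks some coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("LASSO keeps:", lasso.coef_ != 0)
```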
Mean/Median/Mode Imputation
o Purpose: Replace missing values with central tendency measures.
o Method: Fill missing values with mean (for numerical), median, or mode (for
categorical).
o Example: Imputing missing age values with the mean age of other
observations.
K-Nearest Neighbors (KNN) Imputation
o Purpose: Impute missing values based on the nearest neighbors' values.
o Method: Use KNN to find similar observations and fill in missing values.
o Example: Imputing missing values of a feature by averaging the values from
the nearest K neighbors.
Multiple Imputation
o Purpose: Address uncertainty in missing data by creating multiple imputed
datasets.
o Method: Impute missing values multiple times, analyze each dataset, and
combine results.
o Example: Generating multiple datasets with different imputations and
averaging the results.
Listwise Deletion
o Purpose: Remove rows with any missing values.
o Method: Exclude rows where any feature is missing.
o Example: Removing all records with missing values in any feature column.
Pairwise Deletion
o Purpose: Use available data without removing entire rows.
o Method: Analyze data based on pairs of variables that are both present.
o Example: Calculating correlation coefficients only using cases where both
variables are present (a short sketch of these imputation and deletion options follows below).
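A short sketch of the imputation and deletion options with pandas and scikit-learn; the toy age/income DataFrame is hypothetical, and full multiple imputation is only pointed to in a comment rather than implemented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan, 50],
    "income": [30_000, 42_000, np.nan, 55_000, 61_000, 48_000],
})

# Mean imputation ("median" and "most_frequent" are alternative strategies)
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill a missing value from the k nearest complete neighbours
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Listwise deletion: drop any row with a missing value
listwise = df.dropna()

# Pairwise deletion: pandas' corr() uses all pairs where both values are present
pairwise_corr = df.corr()

# Multiple imputation is usually done with dedicated tooling,
# e.g. scikit-learn's IterativeImputer or statsmodels' MICE.
```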
Domain Knowledge
o Purpose: Add features based on expert knowledge or domain-specific
insights.
o Method: Create features that are known to have a significant impact based on
industry knowledge.
o Example: Adding a feature representing user activity level based on domain
expertise in an e-commerce setting.
Feature Engineering
o Purpose: Create new features from existing ones.
o Method: Generate new features by combining, transforming, or aggregating
existing features.
o Example: Creating interaction terms or ratios from existing features.
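As an illustration of the e-commerce example above, a hypothetical activity-level feature derived from raw order data; the column names and thresholds are invented for the sketch:

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "amount":  [20.0, 35.0, 15.0, 120.0, 10.0, 5.0],
})

# Aggregate raw orders into per-user features
users = orders.groupby("user_id").agg(
    order_count=("amount", "size"),
    total_spend=("amount", "sum"),
)

# Domain-driven feature: label activity level from order counts
users["activity_level"] = pd.cut(
    users["order_count"], bins=[0, 1, 2, float("inf")],
    labels=["low", "medium", "high"]
)

# Engineered ratio feature: average spend per order
users["avg_order_value"] = users["total_spend"] / users["order_count"]
print(users)
```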
5. Binning Predictors
5.1. Discretization
Equal-Width Binning
o Purpose: Divide feature range into intervals of equal width.
o Method: Define bin edges and assign each value to a bin.
o Example: Binning age into intervals [0-20], [21-40], [41-60], etc.
Equal-Frequency Binning
o Purpose: Ensure each bin contains approximately the same number of
observations.
o Method: Sort values and divide them into bins with equal frequencies.
o Example: Creating bins such that each bin contains 20% of the data.
Custom Binning
o Purpose: Define bins based on domain knowledge or specific criteria.
o Method: Set bin edges based on insights or requirements.
o Example: Custom bins based on specific thresholds relevant to a business
context (a short binning sketch follows below).
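A short pandas sketch of the three binning approaches on a hypothetical age column; the bin counts and edges are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 18, 23, 31, 37, 44, 52, 61, 70, 83], name="age")

# Equal-width binning: intervals of equal width across the range
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds roughly the same share of observations
equal_freq = pd.qcut(ages, q=5)  # 5 bins of ~20% of the data each

# Custom binning: edges chosen from domain knowledge
custom = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                labels=["0-20", "21-40", "41-60", "61+"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "custom": custom}))
```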
Aggregations
o Purpose: Create features based on aggregated values.
o Method: Compute summary statistics like sum, mean, or median.
o Example: Creating a feature for total spending by aggregating individual
purchase amounts.
Date/Time Features
o Purpose: Extract meaningful features from datetime data.
o Method: Derive features like day of the week, month, or hour.
o Example: Extracting 'day of the week' from a timestamp to capture weekly
patterns.
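A sketch of aggregation and datetime feature extraction with pandas; the purchase records are hypothetical:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "amount":   [10.0, 25.0, 5.0, 40.0, 12.0],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 14:10", "2024-01-05 20:45",
        "2024-01-07 11:00", "2024-01-08 08:15",
    ]),
})

# Aggregation: total and mean spending per customer
spend = purchases.groupby("customer")["amount"].agg(["sum", "mean"])

# Date/time features: derive day of week, month, and hour from the timestamp
purchases["day_of_week"] = purchases["timestamp"].dt.dayofweek
purchases["month"] = purchases["timestamp"].dt.month
purchases["hour"] = purchases["timestamp"].dt.hour

print(spend)
print(purchases)
```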
7. Model Tuning
7.1. Hyperparameter Optimization
Grid Search
o Purpose: Systematically search through specified hyperparameter values.
o Method: Evaluate all possible combinations of parameters.
o Example: Testing various values for learning rate and regularization strength
in a regression model.
Random Search
o Purpose: Sample a subset of hyperparameter combinations randomly.
o Method: Randomly select combinations to find the best performing set.
o Example: Randomly choosing values for hyperparameters like tree depth and
number of estimators in a Random Forest model.
Bayesian Optimization
o Purpose: Optimize hyperparameters using probabilistic models.
o Method: Use Bayesian methods to model the performance and find the
optimal parameters.
o Example: Using Gaussian processes to iteratively sample and evaluate
hyperparameters.
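A minimal scikit-learn sketch of grid and random search on a random forest; the parameter ranges are illustrative. Bayesian optimization typically relies on external libraries (e.g. Optuna or scikit-optimize) and is not shown here:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: exhaustively evaluate every combination in the grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [50, 100]},
    cv=5,
).fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: sample a fixed number of random combinations
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 10),
                         "n_estimators": randint(50, 200)},
    n_iter=10, cv=5, random_state=0,
).fit(X, y)
print("Random search best:", rand.best_params_)
```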
7.2. Cross-Validation
K-Fold Cross-Validation
o Purpose: Evaluate model performance by splitting data into k subsets.
o Method: Train on k−1 folds and validate on the remaining fold. Repeat
for each fold.
o Example: 10-fold cross-validation splits data into 10 parts, using each as a
validation set once.
Leave-One-Out Cross-Validation (LOOCV)
o Purpose: A special case of k-fold where k equals the number of
observations.
o Method: Use one observation as the validation set and the rest as training set.
o Example: For a dataset with 100 observations, LOOCV will train on 99 and
validate on 1 for 100 iterations.
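A brief sketch of 10-fold cross-validation and LOOCV with scikit-learn on a synthetic dataset; the model and scoring choices are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# 10-fold CV: train on 9 folds, validate on the remaining one, 10 times
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
)
print("10-fold mean R^2:", kfold_scores.mean())

# LOOCV: with 100 observations this fits the model 100 times
loo_scores = cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
print("LOOCV mean squared error:", -loo_scores.mean())
```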
8. Data Splitting
Separating data into distinct subsets for model development and evaluation.
Training Set
o Purpose: Used to train the model.
o Method: The largest portion of the data.
o Example: Typically 60-80% of the data.
Validation Set
o Purpose: Used to tune model hyperparameters and select the best model.
o Method: A portion of data not seen by the model during training.
o Example: Typically 10-20% of the data.
Test Set
o Purpose: Used to evaluate the final model’s performance.
o Method: The data held back from training and validation.
o Example: Typically 10-20% of the data.
Bootstrapping
Purpose: Generate multiple samples from the original data to estimate variability.
Method: Create several bootstrap samples by sampling with replacement.
Example: Estimating the confidence interval of a statistic such as the mean or variance.
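A sketch of a 60/20/20 train/validation/test split and of bootstrap resampling; the proportions simply mirror the typical values quoted above, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_regression(n_samples=500, n_features=5, random_state=0)

# First carve off 20% as the test set, then 25% of the remainder as validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly 300 / 100 / 100

# Bootstrap: resample with replacement and estimate the variability of the mean
boot_means = [resample(y, replace=True, random_state=i).mean() for i in range(1000)]
ci = np.percentile(boot_means, [2.5, 97.5])
print("95% bootstrap CI for the mean of y:", ci)
```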
What were some of the predictors used for the predictive analytics model?
Machine learning is a tool used in predictive analysis. The most common predictive models
include decision trees, regressions (linear and logistic), and neural networks, which draw on
the emerging field of deep learning methods and technologies.
In predictive modeling, the goal is to find the right set of predictor variables that can
accurately predict an outcome. When using multiple predictors, some important concepts
include:
Interaction: The effect of a predictor can depend on the level of another predictor.
Backwards stepwise regression: When there are many predictors, it's not possible to fit all
possible models. This strategy starts with a model that includes all potential predictors, then
removes one predictor at a time until the model improves no further.
Bias: Adding more predictors can reduce bias by capturing more information about the
dependent variable. However, it can also increase the standard error if the predictors aren't
associated with the dependent variable or if they're correlated with each other; the latter
problem is called multicollinearity.
Multiple regression
When scores are available for multiple predictors and a criterion, multiple regression can be
used to create a single equation that predicts the criterion's performance.
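A minimal sketch of fitting a single multiple-regression equation from several predictors; the data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Three predictors and one criterion variable
X, y = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)

model = LinearRegression().fit(X, y)
# The fitted equation: y_hat = intercept + b1*x1 + b2*x2 + b3*x3
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```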
Clustering models
These models can group data like customer behavior, market trends, and image
pixels. Some examples of clustering algorithms include K-means clustering,
hierarchical clustering, and density-based clustering.
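A short K-means sketch as one example of the clustering algorithms listed above; the blob data is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels for the first 10 points:", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```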
Impute with Averages or Midpoints: Fill missing values with the mean, median, or mode.
However, be mindful of the potential bias introduced by this method.
Use Advanced Techniques like K-Nearest Neighbors (KNN): Estimate missing values by
finding similar data points using KNN. This method can preserve data integrity.
Binning of numeric predictors allows you to group cases into bins of equal volume or
width. For example, your cases are customers that you want to group according to their age in
bins of equal width. You can create bins for customers aged 20-29, 30-39, 40-49, and so on.
Computing
Predictive modeling is a statistical analysis of data done by computers and software with
input from operators. It is used to generate possible future scenarios for the entities from
which the data is collected. It can be used in any industry, enterprise, or endeavor in which data is
collected.
Model tuning is an iterative process that involves experimentation and fine-tuning. Some
approaches to optimizing hyperparameters include:
Grid search
Random search
Bayesian optimization: This more advanced technique uses Bayesian inference to build a
probabilistic model of the objective function and select the most promising hyperparameters
to evaluate.
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to train the model and the other to evaluate or test it. Data
splitting is an important aspect of data science, particularly for creating models based on data.
Cross-validation
Often used to estimate the test error associated with a statistical learning method.
Bootstrap sampling
A more general and simpler method that's often used to provide a measure of accuracy for a
given parameter or method. In bootstrap sampling, a sampling distribution is generated by
repeatedly taking random samples from a known sample, with replacement. The more
samples that are taken, the more accurate the results will be, but it will also take longer.
Leave-one-out cross-validation (LOOCV)
Involves splitting the observations into two parts, with one observation used for the
validation set and the remaining observations used to fit the model; this is repeated so that
each observation serves as the validation set once.
Other resampling techniques include jackknife sampling, stratified sampling, random
sampling, upsampling, and downsampling.
What is data transformation?
Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth
of an organization.
Data transformation is used when data needs to be converted to match that of the destination
system. This can occur at two places in the data pipeline. First, organizations with on-site
data storage use an extract, transform, load (ETL) process, with the data transformation taking
place during the middle ‘transform’ step.
Organizations today mostly use cloud-based data warehouses because they can scale their
computing and storage resources in seconds. Cloud based organizations, with this huge
scalability available, can skip the ETL process. Instead, they use a process called extract,
load, and transform (ELT), which converts the data as the raw data is uploaded.
The process of data transformation can be handled manually, automated or a combination of
both.
Data transformation works on the simple objective of extracting data from a source,
converting it into a usable format and then delivering the converted data to the destination
system. The extraction phase involves data being pulled into a central repository from
different sources or locations; it is therefore usually in its raw, original form, which is not
usable. To ensure the usability of the extracted data it must be transformed into the desired
format by taking it through a number of steps. In certain cases, the data also needs to be
cleaned before the transformation takes place. This step resolves the issues of missing values
and inconsistencies that exist in the dataset. The data transformation process is carried out in
five stages.
1. Discovery
The first step is to identify and understand the data in its original source format with the help
of data profiling tools, finding all the sources and data types that need to be transformed.
This step helps in understanding how the data needs to be transformed to fit into the desired
format.
2. Mapping
The transformation is planned during the data mapping phase. This includes determining the
current structure and the transformation that is required, then mapping the data to
understand, at a basic level, how individual fields will be modified, joined, or aggregated.
3. Code generation
The code, which is required to run the transformation process, is created in this step using a
data transformation platform or tool.
4. Execution
The data is finally converted into the selected format with the help of the code. The data is
extracted from the source(s), which can vary from structured to streaming, telemetry to log
files. Next, transformations are carried out on data, such as aggregation, format conversion or
merging, as planned in the mapping stage. The transformed data is then sent to the destination
system which could be a dataset or a data warehouse.
5. Review
The transformed data is evaluated to ensure the conversion has had the desired results in
terms of the format of the data.
It must also be noted that not all data will need transformation; at times it can be used as is.
There are several data transformation techniques that are used to clean data and structure it
before it is stored in a data warehouse or analyzed for business intelligence. Not all of these
techniques work with all types of data, and sometimes more than one technique may be
applied. Nine of the most common techniques are:
1. Revising
Revising ensures the data supports its intended use by organizing it in the required and
correct way. It does this in a range of ways.
Dataset normalization revises data by eliminating redundancies in the data set. The data model
becomes more precise and legible while also occupying less space. This process, however, does
involve a lot of critical thinking, investigation and reverse engineering.
Data cleansing ensures the data can be formatted correctly and consistently.
Format conversion changes the data types to ensure compatibility.
Key structuring converts values with built-in meanings to generic identifiers to be used as unique
keys.
Deduplication identifies and removes duplicates.
Data validation validates records and removes the ones that are incomplete.
Repeated and unused columns can be removed to improve overall performance and legibility of the
data set.
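A brief pandas sketch of a few of the revising steps above (format conversion, deduplication, validation of incomplete records, and dropping unused columns); the toy customer records are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003"],
    "signup_date": ["2024-01-03", "2024-02-14", "2024-02-14", None],
    "unused_note": ["", "", "", ""],
})

# Format conversion: change data types for compatibility
raw["signup_date"] = pd.to_datetime(raw["signup_date"])

# Deduplication: identify and remove duplicate records
deduped = raw.drop_duplicates(subset=["customer_id", "signup_date"])

# Data validation: remove records that are incomplete
validated = deduped.dropna(subset=["signup_date"])

# Remove repeated or unused columns to improve legibility
cleaned = validated.drop(columns=["unused_note"])
print(cleaned)
```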
2. Manipulation
This involves creation of new values from existing ones or changing current data through
computation. Manipulation is also used to convert unstructured data into structured data that
can be used by machine learning algorithms.
Derivation, which is cross column calculations
Summarization that aggregates values
Pivoting which involves converting column values into rows and vice versa
Sorting, ordering and indexing of data to enhance search performance
Scaling, normalization and standardization that helps in comparing dissimilar numbers by putting
them on a consistent scale
Vectorization which helps convert non-numerical data into number arrays that are often used for
machine learning applications
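A pandas sketch of a few of the manipulation techniques above (derivation, summarization, pivoting, sorting, and min-max scaling); the sales table is hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 80.0, 95.0],
    "cost":    [60.0, 70.0, 50.0, 55.0],
})

# Derivation: a cross-column calculation
sales["profit"] = sales["revenue"] - sales["cost"]

# Summarization: aggregate values per region
summary = sales.groupby("region")["profit"].sum()

# Pivoting: turn quarter values into columns
pivoted = sales.pivot(index="region", columns="quarter", values="revenue")

# Sorting to aid search and lookups
sorted_sales = sales.sort_values("profit", ascending=False)

# Scaling: put revenue on a 0-1 scale for comparison
sales["revenue_scaled"] = (
    (sales["revenue"] - sales["revenue"].min())
    / (sales["revenue"].max() - sales["revenue"].min())
)
print(summary, pivoted, sorted_sales, sep="\n\n")
```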
3. Separating
This involves dividing up the data values into their parts for granular analysis. Splitting
involves dividing up a single column with several values into separate columns, one for each
of those values. This allows for filtering on the basis of certain values.
4. Combining/ integrating
Records from across tables and sources are combined to acquire a more holistic view of
activities and functions of an organization. It couples data from multiple tables and datasets
and combines their records into a single view.
5. Data smoothing
This process removes meaningless, noisy, or distorted data from the data set. By removing
outliers, trends can be identified more easily.
6. Data aggregation
This technique gathers raw data from multiple sources and turns it into a summary form
which can be used for analysis. For example, raw records can be summarized into statistics
such as averages and sums.
7. Discretization
With the help of this technique, continuous data is divided into labeled intervals, making it
more efficient to store and easier to analyze. Decision tree algorithms are often used by this
process to transform large datasets into categorical data.
8. Generalization
Low level data attributes are transformed into high level attributes by using the concept of
hierarchies and creating layers of successive summary data. This helps in creating clear data
snapshots.
9. Attribute construction
In this technique, a new set of attributes is created from an existing set to facilitate the mining
process.
Transformation makes disparate sets of data compatible with each other, which makes it easier to
aggregate data for a thorough analysis
Migration of data is easier since the source format can be transformed into the target format
Data transformation helps in consolidating data, structured and unstructured
The process of transformation also allows for enrichment which enhances the quality of data
The ultimate goal is consistent, accessible data that provides organizations with accurate
analytic insights and predictions.
Benefits of data transformation
Data holds the potential to directly affect an organization’s efficiencies and its bottom line. It
plays a crucial role in understanding customer behavior, internal processes, and industry
trends. While every organization has the ability to collect an immense amount of data, the
challenge is to ensure that this is usable. Data transformation processes empower
organizations to reap the benefits offered by the data.
Data utilization
If the data being collected isn’t in an appropriate format, it often ends up not being utilized at
all. With the help of data transformation tools, organizations can finally realize the true
potential of the data they have amassed since the transformation process standardizes the data
and improves its usability and accessibility.
Data consistency
Data is continuously being collected from a range of sources which increases the
inconsistencies in metadata. This makes organization and understanding data a huge
challenge. Data transformation helps make it simpler to understand and organize data sets.
Transformed data also supports:
Analytics which use metrics from one or many sources to gain deeper insights about the functions
and operations of any organization. Transformation of data is required when the metric combines
data from multiple sources.
Machine learning which helps businesses with their profit and revenue projections, supports their
decision making with predictive modeling, and automation of several business processes.
Regulatory compliance which involves sensitive data that is vulnerable to malicious attacks.
Resource intensive
The process of transformation is a resource intensive one. A huge computational burden is
created when transformations are performed in an on-premises data warehouse, which
consequently slows down other operations. However, this isn’t an issue when a cloud-based
data warehouse is used since the platform is able to scale up easily.
Data transformation also needs expertise from data scientists, which can be expensive and
divert attention from other tasks.