PMA Unit-2 PDF
1. Individual Predictors
Scaling and normalization adjust the scale of features to ensure that no single feature
dominates the model due to its scale.
Transformations can stabilize variance, make data more normal, or reduce skewness.
Log Transformation
o Purpose: To reduce right skewness and stabilize variance.
o Formula: X_log = log(X + ε)
o Use Case: Useful for data that span several orders of magnitude or are highly
skewed.
o Example: If X is 1000, log-transforming with a small constant ε of 1 yields
X_log = log10(1000 + 1) ≈ 3.00 (using a base-10 logarithm).
Square Root Transformation
o Purpose: To reduce right skewness.
o Formula: X_sqrt = √X
o Use Case: Commonly used for count data or skewed data with moderate
skewness.
o Example: For X = 16, the square root transformation yields X_sqrt = √16 = 4.
Box-Cox Transformation
o Purpose: To stabilize variance and make data more normally distributed. It is
a family of power transformations.
o Formula: X_BC = (X^λ − 1) / λ for λ ≠ 0; X_BC = log(X) for λ = 0
o Use Case: When data is positively skewed and transformation needs to adjust
variance.
o Example: Determining the optimal λ using maximum likelihood
estimation to transform data effectively (a short code sketch of these transformations follows below).
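A minimal Python sketch of the three transformations above, assuming a hypothetical DataFrame with a positive, right-skewed column `x`; scipy's boxcox estimates λ by maximum likelihood when no λ is supplied:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed, strictly positive feature
df = pd.DataFrame({"x": [1, 10, 100, 1000, 10000]})

# Log transformation: add a small constant (epsilon = 1 here) to avoid log(0)
df["x_log"] = np.log10(df["x"] + 1)

# Square root transformation
df["x_sqrt"] = np.sqrt(df["x"])

# Box-Cox transformation: scipy estimates the optimal lambda by maximum likelihood
df["x_boxcox"], fitted_lambda = stats.boxcox(df["x"])

print(df)
print("Estimated Box-Cox lambda:", fitted_lambda)
```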
One-Hot Encoding
o Purpose: Converts categorical variables into binary vectors.
o Method: Each category is represented as a binary vector with one '1' and rest
'0's.
o Example: For a feature with categories [Red, Blue, Green], one-hot encoding
results in [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
Label Encoding
o Purpose: Converts categories into integer values.
o Method: Assigns a unique integer to each category.
o Example: Categories [Red, Blue, Green] are encoded as [0, 1, 2].
Frequency Encoding
o Purpose: Replaces categories with their frequency in the dataset.
o Method: Encodes categories based on their occurrence count.
o Example: If 'Red' appears 50 times, 'Blue' 30 times, and 'Green' 20 times,
these will be encoded as [50, 30, 20].
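A minimal pandas sketch of the three encoding methods above; the `color` column and its category counts are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red", "Red", "Blue"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to a unique integer
labels = {"Red": 0, "Blue": 1, "Green": 2}
df["color_label"] = df["color"].map(labels)

# Frequency encoding: replace each category with its occurrence count
counts = df["color"].value_counts()
df["color_freq"] = df["color"].map(counts)

print(pd.concat([df, one_hot], axis=1))
```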
2. Multiple Predictors
Purpose: To model interactions between features that might jointly affect the target
variable.
Method: Create new features by multiplying pairs or groups of existing features.
Example: For features X1 and X2, the interaction term would be X_int = X1 × X2.
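A brief sketch of building an interaction term, both directly and with scikit-learn's PolynomialFeatures; the feature names x1 and x2 are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Direct pairwise interaction term
df["x_int"] = df["x1"] * df["x2"]

# PolynomialFeatures with interaction_only=True generates all pairwise products
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df[["x1", "x2"]])
print(poly.get_feature_names_out())  # ['x1', 'x2', 'x1 x2']
print(interactions)
```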
Feature selection techniques reduce the dimensionality and improve model performance.
Filter Methods
o Purpose: Evaluate features based on statistical measures.
o Method: Use techniques like correlation coefficients or Chi-square tests to
select important features.
o Example: Selecting features with a high correlation to the target variable.
Wrapper Methods
o Purpose: Evaluate subsets of features based on model performance.
o Method: Use methods like Recursive Feature Elimination (RFE) to iteratively
select features.
o Example: RFE evaluates feature subsets by training a model and removing the
least significant features.
Embedded Methods
o Purpose: Perform feature selection during model training.
o Method: Use algorithms like LASSO, which includes feature selection as part
of the regularization process.
o Example: LASSO regression applies L1 regularization, which can shrink
some feature coefficients to zero (a short code sketch of these methods follows below).
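A hedged scikit-learn sketch of the three feature-selection families on a synthetic regression dataset; the k values, alpha, and thresholds are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter method: keep the k features with the strongest univariate association
filter_sel = SelectKBest(score_func=f_regression, k=4).fit(X, y)
print("Filter keeps:", filter_sel.get_support())

# Wrapper method: RFE repeatedly fits a model and drops the weakest feature
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=4).fit(X, y)
print("RFE keeps:", wrapper_sel.support_)

# Embedded method: LASSO's L1 penalty shrinks some coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("LASSO keeps:", lasso.coef_ != 0)
```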
Mean/Median/Mode Imputation
o Purpose: Replace missing values with central tendency measures.
o Method: Fill missing values with mean (for numerical), median, or mode (for
categorical).
o Example: Imputing missing age values with the mean age of other
observations.
K-Nearest Neighbors (KNN) Imputation
o Purpose: Impute missing values based on the nearest neighbors' values.
o Method: Use KNN to find similar observations and fill in missing values.
o Example: Imputing missing values of a feature by averaging the values from
the nearest K neighbors.
Multiple Imputation
o Purpose: Address uncertainty in missing data by creating multiple imputed
datasets.
o Method: Impute missing values multiple times, analyze each dataset, and
combine results.
o Example: Generating multiple datasets with different imputations and
averaging the results.
Listwise Deletion
o Purpose: Remove rows with any missing values.
o Method: Exclude rows where any feature is missing.
o Example: Removing all records with missing values in any feature column.
Pairwise Deletion
o Purpose: Use available data without removing entire rows.
o Method: Analyze data based on pairs of variables that are both present.
o Example: Calculating correlation coefficients only using cases where both
variables are present (a short sketch of these imputation and deletion options follows below).
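A short sketch of the imputation and deletion options with pandas and scikit-learn; the toy age/income DataFrame is hypothetical, and full multiple imputation is only pointed to in a comment rather than implemented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan, 50],
    "income": [30_000, 42_000, np.nan, 55_000, 61_000, 48_000],
})

# Mean imputation ("median" and "most_frequent" are alternative strategies)
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill a missing value from the k nearest complete neighbours
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

# Listwise deletion: drop any row with a missing value
listwise = df.dropna()

# Pairwise deletion: pandas' corr() uses all pairs where both values are present
pairwise_corr = df.corr()

# Multiple imputation is usually done with dedicated tooling,
# e.g. scikit-learn's IterativeImputer or statsmodels' MICE.
```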
Domain Knowledge
o Purpose: Add features based on expert knowledge or domain-specific
insights.
o Method: Create features that are known to have a significant impact based on
industry knowledge.
o Example: Adding a feature representing user activity level based on domain
expertise in an e-commerce setting.
Feature Engineering
o Purpose: Create new features from existing ones.
o Method: Generate new features by combining, transforming, or aggregating
existing features.
o Example: Creating interaction terms or ratios from existing features.
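As an illustration of the e-commerce example above, a hypothetical activity-level feature derived from raw order data; the column names and thresholds are invented for the sketch:

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "amount":  [20.0, 35.0, 15.0, 120.0, 10.0, 5.0],
})

# Aggregate raw orders into per-user features
users = orders.groupby("user_id").agg(
    order_count=("amount", "size"),
    total_spend=("amount", "sum"),
)

# Domain-driven feature: label activity level from order counts
users["activity_level"] = pd.cut(
    users["order_count"], bins=[0, 1, 2, float("inf")],
    labels=["low", "medium", "high"]
)

# Engineered ratio feature: average spend per order
users["avg_order_value"] = users["total_spend"] / users["order_count"]
print(users)
```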
5. Binning Predictors
5.1. Discretization
Equal-Width Binning
o Purpose: Divide feature range into intervals of equal width.
o Method: Define bin edges and assign each value to a bin.
o Example: Binning age into intervals [0-20], [21-40], [41-60], etc.
Equal-Frequency Binning
o Purpose: Ensure each bin contains approximately the same number of
observations.
o Method: Sort values and divide them into bins with equal frequencies.
o Example: Creating bins such that each bin contains 20% of the data.
Custom Binning
o Purpose: Define bins based on domain knowledge or specific criteria.
o Method: Set bin edges based on insights or requirements.
o Example: Custom bins based on specific thresholds relevant to a business
context (a short binning sketch follows below).
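A short pandas sketch of the three binning approaches on a hypothetical age column; the bin counts and edges are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 18, 23, 31, 37, 44, 52, 61, 70, 83], name="age")

# Equal-width binning: intervals of equal width across the range
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds roughly the same share of observations
equal_freq = pd.qcut(ages, q=5)  # 5 bins of ~20% of the data each

# Custom binning: edges chosen from domain knowledge
custom = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                labels=["0-20", "21-40", "41-60", "61+"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "custom": custom}))
```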
Aggregations
o Purpose: Create features based on aggregated values.
o Method: Compute summary statistics like sum, mean, or median.
o Example: Creating a feature for total spending by aggregating individual
purchase amounts.
Date/Time Features
o Purpose: Extract meaningful features from datetime data.
o Method: Derive features like day of the week, month, or hour.
o Example: Extracting 'day of the week' from a timestamp to capture weekly
patterns.
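A sketch of aggregation and datetime feature extraction with pandas; the purchase records are hypothetical:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "amount":   [10.0, 25.0, 5.0, 40.0, 12.0],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 14:10", "2024-01-05 20:45",
        "2024-01-07 11:00", "2024-01-08 08:15",
    ]),
})

# Aggregation: total and mean spending per customer
spend = purchases.groupby("customer")["amount"].agg(["sum", "mean"])

# Date/time features: derive day of week, month, and hour from the timestamp
purchases["day_of_week"] = purchases["timestamp"].dt.dayofweek
purchases["month"] = purchases["timestamp"].dt.month
purchases["hour"] = purchases["timestamp"].dt.hour

print(spend)
print(purchases)
```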
7. Model Tuning
7.1. Hyperparameter Optimization
Grid Search
o Purpose: Systematically search through specified hyperparameter values.
o Method: Evaluate all possible combinations of parameters.
o Example: Testing various values for learning rate and regularization strength
in a regression model.
Random Search
o Purpose: Sample a subset of hyperparameter combinations randomly.
o Method: Randomly select combinations to find the best performing set.
o Example: Randomly choosing values for hyperparameters like tree depth and
number of estimators in a Random Forest model.
Bayesian Optimization
o Purpose: Optimize hyperparameters using probabilistic models.
o Method: Use Bayesian methods to model the performance and find the
optimal parameters.
o Example: Using Gaussian processes to iteratively sample and evaluate
hyperparameters.
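A minimal scikit-learn sketch of grid and random search on a random forest; the parameter ranges are illustrative. Bayesian optimization typically relies on external libraries (e.g. Optuna or scikit-optimize) and is not shown here:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: exhaustively evaluate every combination in the grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [50, 100]},
    cv=5,
).fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: sample a fixed number of random combinations
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 10),
                         "n_estimators": randint(50, 200)},
    n_iter=10, cv=5, random_state=0,
).fit(X, y)
print("Random search best:", rand.best_params_)
```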
7.2. Cross-Validation
K-Fold Cross-Validation
o Purpose: Evaluate model performance by splitting data into k subsets.
o Method: Train on k−1 folds and validate on the remaining fold. Repeat
for each fold.
o Example: 10-fold cross-validation splits data into 10 parts, using each as a
validation set once.
Leave-One-Out Cross-Validation (LOOCV)
o Purpose: A special case of k-fold where k equals the number of
observations.
o Method: Use one observation as the validation set and the rest as training set.
o Example: For a dataset with 100 observations, LOOCV will train on 99 and
validate on 1 for 100 iterations.
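A brief sketch of 10-fold cross-validation and LOOCV with scikit-learn on a synthetic dataset; the model and scoring choices are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# 10-fold CV: train on 9 folds, validate on the remaining one, 10 times
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
)
print("10-fold mean R^2:", kfold_scores.mean())

# LOOCV: with 100 observations this fits the model 100 times
loo_scores = cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
print("LOOCV mean squared error:", -loo_scores.mean())
```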
8. Data Splitting
Separating data into distinct subsets for model development and evaluation.
Training Set
o Purpose: Used to train the model.
o Method: The largest portion of the data.
o Example: Typically 60-80% of the data.
Validation Set
o Purpose: Used to tune model hyperparameters and select the best model.
o Method: A portion of data not seen by the model during training.
o Example: Typically 10-20% of the data.
Test Set
o Purpose: Used to evaluate the final model’s performance.
o Method: The data held back from training and validation.
o Example: Typically 10-20% of the data.
Bootstrapping
Purpose: Generate multiple samples from the original data to estimate variability.
Method: Create several bootstrap samples by sampling with replacement.
Example: Estimating the confidence interval of a statistic such as the mean or variance.
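A sketch of a 60/20/20 train/validation/test split and of bootstrap resampling; the proportions simply mirror the typical values quoted above, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_regression(n_samples=500, n_features=5, random_state=0)

# First carve off 20% as the test set, then 25% of the remainder as validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly 300 / 100 / 100

# Bootstrap: resample with replacement and estimate the variability of the mean
boot_means = [resample(y, replace=True, random_state=i).mean() for i in range(1000)]
ci = np.percentile(boot_means, [2.5, 97.5])
print("95% bootstrap CI for the mean of y:", ci)
```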
What were some of the predictors used for the predictive analytics model?
Machine learning is a tool used in predictive analysis. The most common predictive models
include decision trees, regressions (linear and logistic), and neural networks, which draw on
the emerging field of deep learning methods and technologies.
In predictive modeling, the goal is to find the right set of predictor variables that can
accurately predict an outcome. When using multiple predictors, some important concepts
include:
Interaction: The effect of a predictor can depend on the level of another predictor.
Backwards stepwise regression: When there are many predictors, it's not possible to fit all
possible models. This strategy starts with a model that includes all potential predictors, then
removes one predictor at a time until the model improves no further.
Bias: Adding more predictors can reduce bias by capturing more information about the
dependent variable. However, it can also increase the standard error if the predictors aren't
associated with the dependent variable or if they're correlated with each other; the latter
problem is called multicollinearity.
Multiple regression
When scores are available for multiple predictors and a criterion, multiple regression can be
used to create a single equation that predicts the criterion's performance.
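A minimal sketch of fitting a single multiple-regression equation from several predictors; the data is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Three predictors and one criterion variable
X, y = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)

model = LinearRegression().fit(X, y)
# The fitted equation: y_hat = intercept + b1*x1 + b2*x2 + b3*x3
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```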
Clustering models
These models can group data like customer behavior, market trends, and image
pixels. Some examples of clustering algorithms include K-means clustering,
hierarchical clustering, and density-based clustering.
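A short K-means sketch as one example of the clustering algorithms listed above; the blob data is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels for the first 10 points:", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```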
Impute with Averages or Midpoints: Fill missing values with the mean, median, or mode.
However, be mindful of the potential bias introduced by this method.
Use Advanced Techniques like K-Nearest Neighbors (KNN): Estimate missing values by
finding similar data points using KNN. This method can preserve data integrity.
Binning of numeric predictors allows you to group cases into bins of equal volume or
width. For example, your cases are customers that you want to group according to their age in
bins of equal width. You can create bins for customers aged 20-29, 30-39, 40-49, and so on.
Computing
Predictive modeling is a statistical analysis of data done by computers and software with
input from operators. It is used to generate possible future scenarios for the entities from
which the data is collected. It can be used in any industry, enterprise, or endeavor in which data is
collected.
Model tuning is an iterative process that involves experimentation and fine-tuning. Some
approaches to optimizing hyperparameters include:
Grid search
Random search
Bayesian optimization: This more advanced technique uses Bayesian inference to build a
probabilistic model of the objective function and select the most promising hyperparameters
to evaluate.
Data splitting is when data is divided into two or more subsets. Typically, with a two-part
split, one part is used to train the model and the other to evaluate or test it. Data
splitting is an important aspect of data science, particularly for creating models based on data.
Cross-validation
Often used to estimate the test error associated with a statistical learning method.
Bootstrap sampling
A more general and simpler method that's often used to provide a measure of accuracy for a
given parameter or method. In bootstrap sampling, a sampling distribution is generated by
repeatedly taking random samples from a known sample, with replacement. The more
samples that are taken, the more accurate the results will be, but it will also take longer.
Leave-one-out cross-validation (LOOCV)
Involves splitting the observations into two parts, with one observation used for the
validation set and the remaining observations used to fit the model; this is repeated so that
each observation serves as the validation set once.
Other resampling techniques include jackknife sampling, stratified sampling, random
sampling, upsampling, and downsampling.
What is data transformation?
Data transformation is the process of converting, cleansing, and structuring data into a usable
format that can be analyzed to support decision making processes, and to propel the growth
of an organization.
Data transformation is used when data needs to be converted to match that of the destination
system. This can occur at two places in the data pipeline. First, organizations with on-site
data storage use an extract, transform, load (ETL) process, with the data transformation taking
place during the middle ‘transform’ step.
Organizations today mostly use cloud-based data warehouses because they can scale their
computing and storage resources in seconds. Cloud based organizations, with this huge
scalability available, can skip the ETL process. Instead, they use a process called extract,
load, and transform (ELT), which converts the data as the raw data is uploaded.
The process of data transformation can be handled manually, automated or a combination of
both.
Data transformation works on the simple objective of extracting data from a source,
converting it into a usable format and then delivering the converted data to the destination
system. The extraction phase involves data being pulled into a central repository from
different sources or locations; it is therefore usually in its raw, original form, which is not
usable. To ensure the usability of the extracted data it must be transformed into the desired
format by taking it through a number of steps. In certain cases, the data also needs to be
cleaned before the transformation takes place. This step resolves the issues of missing values
and inconsistencies that exist in the dataset. The data transformation process is carried out in
five stages.
1. Discovery
The first step is to identify and understand the data in its original source format with the help
of data profiling tools, finding all the sources and data types that need to be transformed.
This step helps in understanding how the data needs to be transformed to fit into the desired
format.
2. Mapping
The transformation is planned during the data mapping phase. This includes determining the
current structure and the transformation that is required, then mapping the data to
understand, at a basic level, how individual fields will be modified, joined, or aggregated.
3. Code generation
The code, which is required to run the transformation process, is created in this step using a
data transformation platform or tool.
4. Execution
The data is finally converted into the selected format with the help of the code. The data is
extracted from the source(s), which can vary from structured to streaming, telemetry to log
files. Next, transformations are carried out on data, such as aggregation, format conversion or
merging, as planned in the mapping stage. The transformed data is then sent to the destination
system which could be a dataset or a data warehouse.
5. Review
The transformed data is evaluated to ensure the conversion has had the desired results in
terms of the format of the data.
It must also be noted that not all data will need transformation; at times it can be used as is.
There are several data transformation techniques that are used to clean data and structure it
before it is stored in a data warehouse or analyzed for business intelligence. Not all of these
techniques work with all types of data, and sometimes more than one technique may be
applied. Nine of the most common techniques are:
1. Revising
Revising ensures the data supports its intended use by organizing it in the required and
correct way. It does this in a range of ways.
Dataset normalization revises data by eliminating redundancies in the data set. The data model
becomes more precise and legible while also occupying less space. This process, however, does
involve a lot of critical thinking, investigation and reverse engineering.
Data cleansing ensures the data can be formatted correctly and consistently.
Format conversion changes the data types to ensure compatibility.
Key structuring converts values with built-in meanings to generic identifiers to be used as unique
keys.
Deduplication identifies and removes duplicates.
Data validation validates records and removes the ones that are incomplete.
Repeated and unused columns can be removed to improve overall performance and legibility of the
data set.
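A brief pandas sketch of a few of the revising steps above (format conversion, deduplication, validation of incomplete records, and dropping unused columns); the toy customer records are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003"],
    "signup_date": ["2024-01-03", "2024-02-14", "2024-02-14", None],
    "unused_note": ["", "", "", ""],
})

# Format conversion: change data types for compatibility
raw["signup_date"] = pd.to_datetime(raw["signup_date"])

# Deduplication: identify and remove duplicate records
deduped = raw.drop_duplicates(subset=["customer_id", "signup_date"])

# Data validation: remove records that are incomplete
validated = deduped.dropna(subset=["signup_date"])

# Remove repeated or unused columns to improve legibility
cleaned = validated.drop(columns=["unused_note"])
print(cleaned)
```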
2. Manipulation
This involves creation of new values from existing ones or changing current data through
computation. Manipulation is also used to convert unstructured data into structured data that
can be used by machine learning algorithms.
Derivation, which is cross column calculations
Summarization that aggregates values
Pivoting which involves converting column values into rows and vice versa
Sorting, ordering and indexing of data to enhance search performance
Scaling, normalization and standardization that helps in comparing dissimilar numbers by putting
them on a consistent scale
Vectorization which helps convert non-numerical data into number arrays that are often used for
machine learning applications
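A pandas sketch of a few of the manipulation techniques above (derivation, summarization, pivoting, sorting, and min-max scaling); the sales table is hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 80.0, 95.0],
    "cost":    [60.0, 70.0, 50.0, 55.0],
})

# Derivation: a cross-column calculation
sales["profit"] = sales["revenue"] - sales["cost"]

# Summarization: aggregate values per region
summary = sales.groupby("region")["profit"].sum()

# Pivoting: turn quarter values into columns
pivoted = sales.pivot(index="region", columns="quarter", values="revenue")

# Sorting to aid search and lookups
sorted_sales = sales.sort_values("profit", ascending=False)

# Scaling: put revenue on a 0-1 scale for comparison
sales["revenue_scaled"] = (
    (sales["revenue"] - sales["revenue"].min())
    / (sales["revenue"].max() - sales["revenue"].min())
)
print(summary, pivoted, sorted_sales, sep="\n\n")
```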
3. Separating
This involves dividing up the data values into their parts for granular analysis. Splitting
involves dividing up a single column with several values into separate columns, one for each
of those values. This allows for filtering on the basis of certain values.
4. Combining/ integrating
Records from across tables and sources are combined to acquire a more holistic view of
activities and functions of an organization. It couples data from multiple tables and datasets
and combines their records into a single view.
5. Data smoothing
This process removes meaningless, noisy, or distorted data from the data set. By removing
outliers, trends can be identified more easily.
6. Data aggregation
This technique gathers raw data from multiple sources and turns it into a summary form
which can be used for analysis. For example, raw records can be summarized into statistics
such as averages and sums.
7. Discretization
With the help of this technique, continuous data is divided into labeled intervals, making it
more efficient to store and easier to analyze. Decision tree algorithms are often used by this
process to transform large datasets into categorical data.
8. Generalization
Low level data attributes are transformed into high level attributes by using the concept of
hierarchies and creating layers of successive summary data. This helps in creating clear data
snapshots.
9. Attribute construction
In this technique, a new set of attributes is created from an existing set to facilitate the mining
process.
Transformation makes disparate sets of data compatible with each other, which makes it easier to
aggregate data for a thorough analysis
Migration of data is easier since the source format can be transformed into the target format
Data transformation helps in consolidating data, structured and unstructured
The process of transformation also allows for enrichment which enhances the quality of data
The ultimate goal is consistent, accessible data that provides organizations with accurate
analytic insights and predictions.
Benefits of data transformation
Data holds the potential to directly affect an organization’s efficiencies and its bottom line. It
plays a crucial role in understanding customer behavior, internal processes, and industry
trends. While every organization has the ability to collect an immense amount of data, the
challenge is to ensure that this is usable. Data transformation processes empower
organizations to reap the benefits offered by the data.
Data utilization
If the data being collected isn’t in an appropriate format, it often ends up not being utilized at
all. With the help of data transformation tools, organizations can finally realize the true
potential of the data they have amassed since the transformation process standardizes the data
and improves its usability and accessibility.
Data consistency
Data is continuously being collected from a range of sources which increases the
inconsistencies in metadata. This makes organization and understanding data a huge
challenge. Data transformation helps make it simpler to understand and organize data sets.
Transformed data also supports:
Analytics which use metrics from one or many sources to gain deeper insights about the functions
and operations of any organization. Transformation of data is required when the metric combines
data from multiple sources.
Machine learning which helps businesses with their profit and revenue projections, supports their
decision making with predictive modeling, and automation of several business processes.
Regulatory compliance which involves sensitive data that is vulnerable to malicious attacks.
Resource intensive
The process of transformation is a resource intensive one. A huge computational burden is
created when transformations are performed in an on-premises data warehouse, which
consequently slows down other operations. However, this isn’t an issue when a cloud-based
data warehouse is used since the platform is able to scale up easily.
Data transformation also needs expertise from data scientists, which can be expensive and
divert attention from other tasks.