Unit 4: Data Handling and Model Evaluation
4.1 Data Aggregation
Data aggregators summarize data from multiple sources. They provide capabilities for multiple
aggregate measurements, such as sum, average, and count. Typical examples include:
• Voter turnout by state or county. Individual voter records are not presented, just the vote
totals by candidate for the specific region.
• Average age of customer by product. Each individual customer is not identified, but for
each product, the average age of the customer is saved.
• Number of customers by country. Instead of examining each customer, a count of the
customers in each country is presented.
For example, raw data can be aggregated over a given time period to provide statistics such as
average, minimum, maximum, sum, and count. After the data is aggregated and written to a view
or report, you can analyze the aggregated data to gain insights about particular resources or
resource groups. There are two types of data aggregation:
Time aggregation
All data points for a single resource over a specified time period.
Spatial aggregation
All data points for a group of resources over a specified time period.
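As a rough illustration of both kinds of aggregation, the following pandas sketch groups a small, made-up metrics table (the column names and values are invented for this example) by hour for a single resource and for all resources together:

import pandas as pd

# A small, made-up metrics table: timestamp, resource, value
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:30", "2024-01-01 01:15",
        "2024-01-01 00:10", "2024-01-01 01:45",
    ]),
    "resource": ["server-a", "server-a", "server-a", "server-b", "server-b"],
    "cpu_pct": [40, 55, 62, 30, 48],
})

# Time aggregation: all data points for a single resource over hourly periods
time_agg = (
    df[df["resource"] == "server-a"]
    .set_index("timestamp")["cpu_pct"]
    .resample("1h")
    .agg(["mean", "min", "max", "sum", "count"])
)

# Spatial aggregation: all data points for a group of resources per hour
spatial_agg = (
    df.set_index("timestamp")
    .groupby(pd.Grouper(freq="1h"))["cpu_pct"]
    .agg(["mean", "min", "max", "sum", "count"])
)

print(time_agg)
print(spatial_agg)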
The goal of data transformation is to prepare the data for data mining so that it can be used to
extract useful insights and knowledge. Data transformation typically involves several steps,
including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0
and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing
or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure
that the data is in a format that is suitable for analysis and modeling, and that it is free of
errors and inconsistencies. Data transformation can also help to improve the performance
of data mining algorithms, by reducing the dimensionality of the data and by scaling the
data to a common range of values.
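As a hedged sketch of a few of the steps listed above (the customer table, column names, and values are invented purely for illustration), pandas can perform cleaning, normalization, discretization, and aggregation in a handful of lines:

import pandas as pd

# Hypothetical customer table used only to illustrate the steps above
raw = pd.DataFrame({
    "country": ["IN", "US", "IN", "US", "DE"],
    "age": [23, 45, None, 38, 51],
    "spend": [120.0, 300.0, 80.0, 150.0, 210.0],
})

# 1. Data cleaning: drop rows with missing values
clean = raw.dropna()

# 3. Data normalization: scale 'spend' to the range [0, 1]
clean = clean.assign(
    spend_norm=(clean["spend"] - clean["spend"].min())
    / (clean["spend"].max() - clean["spend"].min())
)

# 5. Data discretization: bin 'age' into categories
clean = clean.assign(age_group=pd.cut(clean["age"], bins=[0, 30, 50, 100],
                                      labels=["young", "middle", "senior"]))

# 6. Data aggregation: average spend per country
agg = clean.groupby("country", as_index=False)["spend"].mean()
print(clean)
print(agg)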
ADVANTAGES AND DISADVANTAGES:
1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and
modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to
remove sensitive information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the
performance of data mining algorithms by reducing the dimensionality of the data and
scaling the data to a common range of values.
The main disadvantage is overfitting: data transformation can lead to overfitting, a common problem
in machine learning where a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new, unseen data.
1. Pre-merging process
Data Profiling: Before merging, it is crucial to profile the data, analyzing the different parts of the
data sources. This step helps an organization understand the outcomes of merging and prevent any
potential errors that may occur. Data profiling consists of two important steps: examining the
structure of each source (its attributes and their types) and examining its content (values,
completeness, and duplicates).
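A minimal profiling sketch with pandas (the table and column names below are invented for illustration) could look like this:

import pandas as pd

# Profile a hypothetical source before merging (names/columns are placeholders)
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "name": ["Ana", "Bo", "Bo ", None],
    "country": ["DE", "US", "US", "IN"],
})

print(customers.dtypes)                                   # structure: column types
print(customers.describe(include="all"))                  # content: basic statistics
print(customers.isna().sum())                             # missing values per column
print(customers.duplicated(subset="customer_id").sum())   # duplicate keys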
Standardize and Transform Data: Data sources may contain incomplete and invalid values.
These datasets cannot be merged before they are standardized. In addition to errors, data attributes
from different sources may contain the same information, such as customer names. However, the
format of these data values can be entirely different. Due to the lexical and structural differences
in datasets, data loss and errors may occur. In order to standardize data, certain factors need to be
managed.
• Invalid characters that cannot be printed, void values, and trailing spaces should be
replaced with valid values. An example is not allowing more than one space when data is
entered or reducing all multiple spaces to one when transforming data.
• To standardize long fields of data, records should be parsed into smaller parts in different
source files. This helps to ensure that data accuracy is maintained even after data sources are merged.
• Constraints for integration should be defined. For example, the maximum or minimum
number of characters in a certain field should be defined, or a hyphenated surname should
contain no spaces.
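As a rough sketch of these standardization steps (the field names, values, and the six-character postcode constraint below are all hypothetical), pandas string methods cover the common cases:

import pandas as pd

# Hypothetical raw fields showing the issues described above
df = pd.DataFrame({
    "full_name": ["  Mary   Ann  Smith ", "N/A", "Jean - Luc  Picard"],
    "postcode": ["560012", "5600", ""],
})

# Replace void values and empty strings with proper nulls
df = df.replace({"N/A": pd.NA, "": pd.NA})

# Trim leading/trailing spaces and reduce multiple spaces to one
df["full_name"] = df["full_name"].str.strip().str.replace(r"\s+", " ", regex=True)

# Parse a long field into smaller parts (first name / rest of name)
parts = df["full_name"].str.split(" ", n=1, expand=True)
df["first_name"], df["last_name"] = parts[0], parts[1]

# Enforce a simple integration constraint: postcodes must have exactly 6 characters
df["postcode_valid"] = df["postcode"].str.len().eq(6)

print(df)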
Data Filtering: A part or subset of the original data sources can be merged instead of an entire
data source. Such horizontal slicing of the data is done when data in a constrained period of time
needs to be merged, or when only a subgroup of rows meets the conditional criteria. Vertical
slicing (selecting only some columns) is done when a data source contains attributes that do not
hold any valuable information.
Data Uniqueness: Many times, the information from a single entity may be stored across a
number of sources. Merging data becomes more complex if the datasets contain duplicates.
Therefore, before beginning any merging process, it is important to run data matching algorithms.
These help to apply conditional rules to identify and delete duplicates and result in the uniqueness
of records across all sources.
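A minimal sketch of such a matching rule (the email-based key and the sample records are assumptions made for this example) is to normalize a key column and then drop duplicates on it:

import pandas as pd

# Hypothetical customer records with near-duplicate entries
customers = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM ", "b@y.com"],
    "name": ["Ana Silva", "Ana Silva", "Bo Chen"],
})

# A simple matching rule: normalize the key before checking uniqueness
customers["email_key"] = customers["email"].str.strip().str.lower()

# Keep the first record for each matched key, dropping duplicates
unique_customers = customers.drop_duplicates(subset="email_key", keep="first")
print(unique_customers)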
2. Merging process
According to the specific use case, rows, columns, or both can be appended. This can be
quite simple if the datasets do not contain many null values and are reasonably complete, but there
can be problems if there are vacant spaces in the datasets that need to be looked up and filled.
Often, data merging techniques are used to bring the data together. It is also possible to perform a
conditional merge initially, and then finish the merge by appending columns and rows.
Appending Columns: This process is done when more dimensions need to be added to an existing
record. In such scenarios, all columns from different sources must be made unique. Every single
record should be uniquely identifiable across all sets of data, making it easier for records with the
same identifier to be merged. If the merging column does not contain any data, then null values
should be specified for all records from that dataset. However, if many datasets contain the same
or highly similar dimension information, then these dimensions can be merged together in one
field.
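As a hedged sketch of appending columns (the tables, the customer_id key, and the values are hypothetical), a left join on a unique identifier behaves as described, with null values for records missing from the second source:

import pandas as pd

# Two hypothetical sources sharing a unique identifier
orders = pd.DataFrame({"customer_id": [1, 2, 3], "order_total": [120, 80, 45]})
profiles = pd.DataFrame({"customer_id": [1, 3], "segment": ["gold", "silver"]})

# Append the 'segment' dimension to every order record; customers missing
# from 'profiles' receive null values, as described above
enriched = orders.merge(profiles, on="customer_id", how="left")
print(enriched)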
Conditional Merging: Conditional merging is used when there are incomplete datasets that need
to be united. In this type of merge, values from one dataset need to be looked up and the other
datasets need to be filled in accordingly. The source dataset from which values are looked up
should contain all unique records. However, the dataset to which data is being appended or the
target dataset does not need to have unique values.
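A minimal sketch of a conditional merge (the lookup table, column names, and values below are assumptions for illustration): values are looked up from a source with unique records and used to fill gaps in the target, which may contain repeated keys:

import pandas as pd

# Target dataset with gaps, and a lookup source with unique records
target = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                       "city": ["Pune", None, None, "Delhi"]})
lookup = pd.DataFrame({"customer_id": [1, 2, 3],
                       "city": ["Pune", "Mumbai", "Delhi"]})

# Look up the missing city for each record and fill it in conditionally
filled = target.merge(lookup, on="customer_id", how="left",
                      suffixes=("", "_lookup"))
filled["city"] = filled["city"].fillna(filled["city_lookup"])
filled = filled.drop(columns="city_lookup")
print(filled)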
3. Post-merging process
Once the merge process is finished, it is important to do a final profile audit of the merged source,
as is done at the start of the merge process. This will help find errors that may have occurred
during the merge. Any inaccurate, incomplete, or invalid values can also be spotted.
The lexical and structural differences across datasets can make it difficult to merge without error.
• Structural heterogeneity occurs when datasets do not contain the same type or number
of attributes (columns).
• Lexical heterogeneity happens when fields from different datasets are structurally the same,
but represent the same information in objectively different ways (for example, "NY" versus
"New York").
• Another major issue is scalability. Data merges are usually planned and actioned not based
on the ability to scale up but by the number and types of sources. Over time, systems that
integrate more data sources with a range of structures and mechanisms for storage will be
required. To overcome this, an organization must design a system that is scalable in size,
structures, and mechanisms. Instead of hardcoding the integration to be a set process, the
data integration system needs to be reusable, with a scalable architecture.
• There is also the challenge of data duplication. There are different ways in which data
duplication can happen in a dataset. To start with, there may be multiple records of the
same entity. Further, there may be many attributes storing exactly the same information
about an individual entity. These duplicate attributes or records can be found in the same
source or across different sources.
Types of merging
There is a range of merge options, depending on the organization’s legacy datasets and software
options.
melt()
This method flattens/melts tabular data such that the specified columns and their
respective values are transformed into key-value pairs. ‘keys’ are the column names of the
dataset before transformation and ‘values’ are the values in the respective columns. After
transformation, ‘keys’ are stored in a column named ‘variable’ and ‘values’ are stored in another
column named ‘value’, by default.
The ‘Particulars’ column stays as it is and the associated yearly data appears as key-value pairs in
the adjacent columns. In this case, the ‘Particulars’ column is called an ‘id’ column and should be
passed to the “id_vars” argument of the melt() method.
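As a small, hedged sketch (the 'Particulars' table and the yearly values below are made up to match the description above):

import pandas as pd

# A hypothetical wide table with a 'Particulars' id column and yearly columns
df = pd.DataFrame({
    "Particulars": ["Revenue", "Expenses"],
    "2021": [100, 60],
    "2022": [120, 70],
})

# melt(): keep 'Particulars' as the id column; the yearly columns become
# key-value pairs stored in the default 'variable' and 'value' columns
df_melt = df.melt(id_vars="Particulars")
print(df_melt)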
pivot()
This method does the reverse of what melt() did: it transforms the key-value pairs back into columns.
Let’s look at an example that converts ‘df_melt’ (shown below) back to the original format.
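A minimal sketch of the reverse transformation (assuming 'df_melt' is the long-format table produced by melt() in the example above):

import pandas as pd

# 'df_melt' in long format, as produced by melt() in the sketch above
df_melt = pd.DataFrame({
    "Particulars": ["Revenue", "Expenses", "Revenue", "Expenses"],
    "variable": ["2021", "2021", "2022", "2022"],
    "value": [100, 60, 120, 70],
})

# pivot(): the reverse of melt(); key-value pairs become columns again
df_original = df_melt.pivot(index="Particulars", columns="variable",
                            values="value").reset_index()
df_original.columns.name = None  # drop the leftover axis name
print(df_original)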
Data Enrichment
Building on existing data allows for better business decisions and better customer
relationships.
Experian can help you fill in missing or incomplete information and enhance your records with
more than 900 available data attributes, including financial data, buyer propensity, automotive
data, and much more.
The more you know about your customers, the better you can serve them. Enrich your data to
advance your messaging, provide offers tailored to their interests, and build loyal clientele.
• Marketing – adding data from any number of sources to better refine marketing campaigns and
processes for better targeting and offers.
• Lending – using third-party databases to develop a more complete profile of customers to help
with credit scores or underwriting.
• Insurance – adding as much data as possible (both internal and third-party) to enrich data for
customer categorization, segmentation, and targeting.
• Retail – adding data to better profile and recognize customer needs for recommendations,
upselling, and cross-selling.
Appending Data
By appending data, you bring multiple data sources together to create a more holistic data set than
any one data source. This helps you generate more accurate analytics or explore more variables
to use as features to improve machine learning models. Appended data can be both internal and
external data.
Segmentation
Data segmentation allows you to divide or organize a dataset according to specific field values in
the data. Segmentation is very commonly done using demographic, geographic, technology, or
behavioral values. This is often used in marketing use cases for targeting.
Derived Attributes
Derived attributes are fields added to a dataset that are calculated from other fields. The most
commonly cited derived attribute is Age, calculated by subtracting the birthdate from the current date.
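A small sketch of that derivation (the birthdates are made up, and the year-length division is a deliberate approximation):

import pandas as pd

# Hypothetical customer birthdates; Age is derived from the current date
customers = pd.DataFrame({"birthdate": pd.to_datetime(["1990-05-14", "1984-11-02"])})
today = pd.Timestamp.today()
customers["age"] = (today - customers["birthdate"]).dt.days // 365  # approximate years
print(customers)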
Data Imputation
Some consider data imputation part of data cleansing, as it is the process of replacing values for
missing or inconsistent data within fields. A prime example is estimating the value of a missing
field based on other values.
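For example, a minimal imputation sketch (the order values are invented; the median is just one possible estimator):

import pandas as pd

# Hypothetical order values with a missing entry
orders = pd.DataFrame({"order_total": [120.0, None, 80.0, 150.0]})

# Estimate the missing value from the other values (here, the column median)
orders["order_total"] = orders["order_total"].fillna(orders["order_total"].median())
print(orders)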
Entity Extraction
When one is using more complex unstructured or semi-structured data, multiple data values may
be encoded within one field. To make the data useful, the values need to be extracted from one
field, then exploded out into one or more new columns in the data.
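As a tiny sketch of entity extraction (the address field and its comma-separated format are an assumption for this example):

import pandas as pd

# A single field encoding several values, split out into new columns
contacts = pd.DataFrame({"address": ["Pune, Maharashtra, IN", "Austin, Texas, US"]})
contacts[["city", "state", "country"]] = contacts["address"].str.split(", ", expand=True)
print(contacts)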
Conclusion
Data enrichment is an often overlooked yet highly critical part of producing analytics-ready
datasets. This is often because when designers decide what data to capture in applications, they
are not privy to downstream analytics data requirements. In addition, analytics data needs will
always change over time.
Therefore, it is critical to have a highly evolved, easy-to-use data transformation tool that allows
any team member to transform and enrich data to their specific needs. This allows the analytics
teams to be more responsive to the business, produce highly accurate analytics, and drive greater
adoption of analytics.
There are several different normalization techniques that can be used in data mining,
including:
1. Min-Max normalization: This technique scales the values of a feature to a range between
0 and 1. This is done by subtracting the minimum value of the feature from each value,
and then dividing by the range of the feature.
2. Z-score normalization: This technique scales the values of a feature to have a mean of 0
and a standard deviation of 1. This is done by subtracting the mean of the feature from
each value, and then dividing by the standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing the values of a
feature by a power of 10.
4. Logarithmic transformation: This technique applies a logarithmic transformation to the
values of a feature. This can be useful for data with a wide range of values, as it can help
to reduce the impact of outliers.
It’s important to note that normalization should be applied only to the input features, not the target
variable, and that different normalization techniques may work better for different types of data
and models.
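A compact NumPy sketch of the four techniques listed above (the feature values are made up; the decimal-scaling power is derived from the number of digits of the maximum value):

import numpy as np

x = np.array([10.0, 200.0, 30.0, 400.0, 5000.0])   # made-up feature values

min_max = (x - x.min()) / (x.max() - x.min())       # 1. Min-Max: range [0, 1]
z_score = (x - x.mean()) / x.std()                  # 2. Z-score: mean 0, std 1
decimal = x / 10 ** len(str(int(x.max())))          # 3. Decimal scaling
log_tf = np.log10(x)                                # 4. Logarithmic transformation

print(min_max, z_score, decimal, log_tf, sep="\n")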
Need of Normalization
Normalization is generally required when we are dealing with attributes on different scales;
otherwise, it may dilute the effectiveness of an equally important attribute (on a lower scale)
simply because other attributes have values on a larger scale. In simple words, when multiple
attributes exist but have values on different scales, this may lead to poor data models when
performing data mining operations, so the attributes are normalized to bring them all onto the
same scale.
Decimal Scaling
The decimal scaling formula is:
V’ = V / 10^j
Here:
• V’ is the new value after applying the decimal scaling
• V is the respective value of the attribute
• j is an integer that defines the movement of the decimal point
So, how is j defined? It is equal to the number of digits present in the maximum value in the data
table. Here is an example:
Example 1: Suppose a company wants to compare the salaries of new joiners. Here are the data
values:

Employee Name    Salary
Sudhakar         10,000
Valentina        25,000
Mirsha           8,000
Srilakshmi       15,000

Now, look for the maximum value in the data. In this case, it is 25,000. Count the number of digits
in this value: here it is 5. So j is equal to 5, i.e. 10^5 = 100,000. This means that V (the value of the
attribute) needs to be divided by 100,000 here.
After applying the decimal scaling formula, here are the new values:

Name          Salary    Salary after Decimal Scaling
Sudhakar      10,000    0.1
Valentina     25,000    0.25
Mirsha        8,000     0.08
Srilakshmi    15,000    0.15

Thus, decimal scaling can tone down big numbers into easy-to-understand smaller decimal values.
Also, data attributed to different units becomes easier to read and understand once it is converted
into smaller decimal values.
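A short sketch reproducing this worked example in code (the table values come from the example above; the underscore digit separators are just Python syntax):

import pandas as pd

salaries = pd.DataFrame({"name": ["Sudhakar", "Valentina", "Mirsha", "Srilakshmi"],
                         "salary": [10_000, 25_000, 8_000, 15_000]})

j = len(str(salaries["salary"].max()))                      # digits in the maximum value -> 5
salaries["salary_scaled"] = salaries["salary"] / 10 ** j    # divide by 100,000
print(salaries)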
Z-score Normalization
The following formula is used in the case of z-score normalization on every single value of the
dataset.
New value = (x – μ) / σ
Here:
x: Original value
μ: Mean of data
σ: Standard deviation of data
Example: Suppose a dataset has values whose mean is 21.2 and whose standard deviation is 29.8.
Solution:
• Compute the mean and standard deviation of the input data. Here, the mean of the dataset is
21.2 and the standard deviation is 29.8.
• To perform z-score normalization on any value x of the dataset, apply the formula:
New value = (x – μ) / σ
Min-Max Normalization
The min-max normalization formula is:
v’ = ((v – Min(A)) / (Max(A) – Min(A))) × (new_max(A) – new_min(A)) + new_min(A)
Where A is the attribute data; Min(A) and Max(A) are the minimum and maximum absolute values
of A, respectively; v’ is the new value of each entry in the data; v is the old value of each entry in
the data; and new_max(A) and new_min(A) are the maximum and minimum values of the required
range (i.e. the boundary values of the range), respectively.
Example 1: Suppose a company wants to decide on a promotion based on the years of work
experience of its employees. So, it needs to analyze a database that looks like this:

Employee Name    Years of Experience
Sudhakar         8
Valentina        20
Mirsha           10
Srilakshmi       15

• The minimum value is 8
• The maximum value is 20
As this formula scales the data between 0 and 1,
• The new min is 0
• The new max is 1
After applying the min-max normalization formula, the following are the v’ values for the
attribute:

Employee Name    Years of Experience    v’ (scaled value)
Sudhakar         8                      0
Valentina        20                     1
Mirsha           10                     0.167
Srilakshmi       15                     0.583
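The same calculation as a short code sketch (the data is taken from the worked example above; column names are arbitrary):

import pandas as pd

exp = pd.DataFrame({"name": ["Sudhakar", "Valentina", "Mirsha", "Srilakshmi"],
                    "years": [8, 20, 10, 15]})

lo, hi = exp["years"].min(), exp["years"].max()          # 8 and 20
exp["years_scaled"] = (exp["years"] - lo) / (hi - lo)    # new range [0, 1]
print(exp)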
Min-Max Normalization vs. Z-score Normalization (Standardization):

Min-Max Normalization:
• The minimum and maximum values of the feature are used for scaling.
• Applicable when the features are of different sizes.
• The values are scaled to the range [0, 1] or [-1, 1].
• Easily affected by outliers.
• A transformer named MinMaxScaler is available in Scikit-Learn to perform this task.
• This method transforms n-dimensional data into an n-dimensional unit hypercube.
• Best if the distribution is unknown.
• Also known as Scaling Normalization.

Z-score Normalization:
• The mean and standard deviation are used for scaling.
• Useful when you want to maintain a zero mean and unit standard deviation.
• No fixed range is present.
• Not that much affected by outliers.
• A transformer named StandardScaler is available in Scikit-Learn to perform this task.
• This method translates the data to the mean vector of the original data and then either
squeezes or expands it.
• Best when the distribution is Normal (Gaussian).
• Also known as Standardization.
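A minimal sketch using the two Scikit-Learn transformers named above (the single made-up feature reuses the years-of-experience values from the earlier example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[8.0], [20.0], [10.0], [15.0]])    # a single made-up feature

print(MinMaxScaler().fit_transform(X).ravel())    # values scaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit std deviation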
Cross-Validation
Common cross-validation techniques include:
• Hold-out cross-validation
• Leave-p-out cross-validation
• k-fold cross-validation
k-fold cross-validation
In k-fold cross-validation, one fold is used for validation and the other k-1 folds are used for
training the model. To use every fold as a validation set and the other left-out folds as a training
set, this technique is repeated k times until each fold has been used once.
With 5 folds, there are 5 iterations. In each iteration, one fold is the test/validation set and the
other k-1 folds (4 folds) form the training set. To get the final accuracy, take the average of the
validation accuracy of the k models. This validation technique is not considered suitable for
imbalanced datasets, as some folds may not preserve the proper ratio of each class's data and the
model will not get trained properly.
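The original code and output for this section are not reproduced here; as a rough substitute, the following minimal k-fold sketch uses scikit-learn (the iris dataset and LogisticRegression are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds -> 5 iterations; each fold is used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(scores, scores.mean())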
The next step is to fit the model on the training set in that iteration and calculate the accuracy of
the fitted model on the test set. Repeat these iterations many times (100, 400, 500, or even more)
and take the average of all the test errors to conclude how well your model performs. For example,
in a 100-iteration run, the model is fitted and evaluated 100 times on different random train/test
splits and the resulting test errors are averaged.
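A minimal sketch of such repeated random splits, assuming scikit-learn's ShuffleSplit as the splitting mechanism (the dataset, model, and split size are illustrative choices, not the original code):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 100 repeated random train/test splits; the mean score summarizes performance
ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
scores = cross_val_score(model, X, y, cv=ss)
print(scores.mean())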
In the case of time series datasets, cross-validation is not that trivial. You can’t choose data
instances randomly and assign them to the test set or the train set. Hence, a time-based split
(forward chaining) is used to perform cross-validation on time series data, with time as the
important factor.
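As a hedged sketch of a time-ordered split (the toy data is made up; TimeSeriesSplit is one standard scikit-learn implementation of this idea):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

# Training windows only ever extend forward in time; no random shuffling
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)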
4.6.1 Accuracy
Accuracy simply measures how often the classifier predicts correctly. We can define
accuracy as the ratio of the number of correct predictions to the total number of predictions.
4.6.2 Confusion Matrix
Confusion Matrix is a performance measurement for the machine learning classification
problems where the output can be two or more classes. It is a table with combinations of
predicted and actual values.
• True Positive: We predicted positive and it’s true. For example, we predicted that a
woman is pregnant and she actually is.
• True Negative: We predicted negative and it’s true. For example, we predicted that a
man is not pregnant and he actually is not.
• False Positive (Type 1 Error): We predicted positive and it’s false. For example, we
predicted that a man is pregnant but he actually is not.
• False Negative (Type 2 Error): We predicted negative and it’s false. For example, we
predicted that a woman is not pregnant but she actually is.
4.6.3 Precision
Precision explains how many of the predicted positive cases actually turned out to be positive.
It is useful in cases where False Positives are a higher concern than False Negatives. The
precision for a label is defined as the number of true positives divided by the number of
predicted positives.
4.6.4 Recall
Recall for a label is defined as the number of true positives divided by the total number of
actual positives.
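As a small illustration of these metrics (the labels below are made up), scikit-learn provides ready-made functions:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))     # correct predictions / all predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)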
4.6.5 AUC-ROC
The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates the
‘signal’ from the ‘noise’.
In a ROC curve, the X-axis shows the False Positive Rate (FPR) and the Y-axis shows the True
Positive Rate (TPR). A higher X-axis value means a higher number of False Positives (FP) relative
to True Negatives (TN), while a higher Y-axis value indicates a higher number of TP relative to FN.
So, the choice of the threshold depends on the ability to balance between FP and FN.
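A minimal sketch of computing the ROC curve and the area under it (the labels and predicted probabilities are invented for illustration):

from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points on the ROC curve
print(roc_auc_score(y_true, y_prob))               # area under that curve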
4.6.6 RMSE (Root Mean Squared Error)
RMSE is the square root of the average of the squared differences between the predicted and actual
values: RMSE = √( Σ (predicted – actual)² / n ). Its main properties are:
1. The ‘square root’ allows this metric to clearly reflect large deviations.
2. The ‘squared’ nature of this metric helps deliver more robust results by preventing positive
and negative error values from canceling each other out. In other words, this metric aptly
reflects the plausible magnitude of the error term.
3. It avoids the use of absolute error values, the use of which is undesirable in many
mathematical calculations.
4. When we have more samples, reconstructing the error distribution using RMSE is
considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure you have removed outliers
from your data set before using this metric.
6. Compared to mean absolute error, RMSE gives higher weight to, and thus punishes, large
errors.
In the case of a classification problem, if the model has an accuracy of 0.8, we can gauge how
good our model is against a random model, which has an accuracy of 0.5; the random model
can be treated as a benchmark. But when we talk about the RMSE metric, we do not have such a
benchmark to compare against.
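A short sketch of the RMSE computation (the actual and predicted values are made up):

import numpy as np

# Made-up actual and predicted values for a regression problem
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 12.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # square, average, then root
print(rmse)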
Contingency Tables
Example: Suppose there are two variables, sex (male or female) and handedness (right- or
left-handed). Further suppose that 100 individuals are randomly sampled from a very large
population as part of a study of sex differences in handedness. A contingency table can be created
to display the numbers of individuals who are male right-handed and left-handed, and female
right-handed and left-handed.
The numbers of the males, females, and right- and left-handed individuals are called marginal
totals. The grand total (the total number of individuals represented in the contingency table) is the
number in the bottom right corner.
The table allows users to see at a glance that the proportion of men who are right-handed is about
the same as the proportion of women who are right-handed although the proportions are not
identical.
• Row Percentage: Take a cell value and divide by the cell’s row total.
• Column Percentage: Take a cell value and divide by the cell’s column total.
For example, the row percentage of females who prefer chocolate is simply the number of
observations in the Female/Chocolate cell divided by the row total for women: 37 / 66 = 56%.
The column percentage for the same cell is the frequency of the Female/Chocolate cell divided by
the column total for chocolate: 37 / 58 = 63.8%.
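A hedged sketch of building such a table with pandas: only the Female/Chocolate count (37), the female row total (66), and the chocolate column total (58) come from the text above; the remaining counts below are assumed so that the records sum to 100.

import pandas as pd

# Made-up survey records: sex and preferred flavor
df = pd.DataFrame({
    "sex": ["Female"] * 66 + ["Male"] * 34,
    "flavor": ["Chocolate"] * 37 + ["Vanilla"] * 29
              + ["Chocolate"] * 21 + ["Vanilla"] * 13,
})

table = pd.crosstab(df["sex"], df["flavor"], margins=True)            # counts + totals
row_pct = pd.crosstab(df["sex"], df["flavor"], normalize="index")     # row percentages
col_pct = pd.crosstab(df["sex"], df["flavor"], normalize="columns")   # column percentages
print(table, row_pct, col_pct, sep="\n\n")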
Analysis or Plotting of the Contingency Table
Precision-Recall (PR) Curve
• It is important to note that Precision is also called the Positive Predictive Value (PPV).
• Recall is also called Sensitivity, Hit Rate, or True Positive Rate (TPR).
Plotting recall values on the x-axis and corresponding precision values on the y-axis generates a
PR curve that illustrates a negative slope function. It represents the trade-off between precision
(reducing FPs) and recall (reducing FNs) for a given model. Considering the inverse relationship
between precision and recall, the curve is generally non-linear, implying that increasing one metric
decreases the other, but the decrease might not be proportional.
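A minimal sketch of obtaining the points of a PR curve (the labels and scores are made up; plotting itself is omitted):

from sklearn.metrics import precision_recall_curve

# Made-up labels and positive-class scores
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))   # points that trace out the PR curve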
A/B Testing
A/B tests are useful for understanding user engagement and satisfaction with online features,
such as a new feature or product.
Pros: Through A/B testing, it is easy to get a clear idea of what users prefer, since it directly
tests one thing against the other. It is based on real user behavior, so the data can be very helpful,
especially when determining what works better between two options. In addition, it can provide
answers to very specific design questions. One example of this is Google's A/B testing with
hyperlink colors: in order to optimize revenue, they tested dozens of different hyperlink hues
to see which color the users tend to click on more.
Cons: There are, however, a couple of cons to A/B testing. As mentioned above, A/B testing is
good for specific design questions, but this can also be a downside, since it is mostly only useful
for specific design problems with very measurable outcomes. It can also be a very costly and
time-consuming process. Depending on the size of the company and/or team, there could be a lot
of meetings and discussions about what exactly to test and what the impact of the A/B test is. If
there is no significant impact, it could end up as a waste of time and resources.
*End of Chapter 4*