
Unit 4

Data Handling and Model Evaluation


4.1 Data Aggregation
Data aggregation is the process of collecting data to present it in summary form. This information
is then used to conduct statistical analysis and can also help company executives make more
informed decisions about marketing strategies, price settings, and structuring operations, among
other things.

What does data aggregation do?

Data aggregators summarize data from multiple sources. They provide capabilities for multiple
aggregate measurements, such as sum, average and counting.
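As an illustration (not part of the original text), the following pandas sketch aggregates hypothetical raw order records into summary measures such as sum, average, and count:

import pandas as pd

# Hypothetical raw records (one row per order)
orders = pd.DataFrame({
    "country": ["IN", "IN", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "customer_age": [25, 34, 41, 29, 38],
    "amount": [120, 80, 200, 150, 90],
})

# Aggregate measurements per country: sum, average, and count
summary = orders.groupby("country").agg(
    total_sales=("amount", "sum"),
    avg_customer_age=("customer_age", "mean"),
    num_orders=("amount", "count"),
)
print(summary)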

Examples of aggregate data include the following:

• Voter turnout by state or county. Individual voter records are not presented, just the vote
totals by candidate for the specific region.
• Average age of customer by product. Each individual customer is not identified, but for
each product, the average age of the customer is saved.
• Number of customers by country. Instead of examining each customer, a count of the
customers in each country is presented.

Types of data aggregation


There are a few types of data aggregation, with time, spatial, manual, and automated aggregation being the most common.

For example, raw data can be aggregated over a given time period to provide statistics such as
average, minimum, maximum, sum, and count. After the data is aggregated and written to a view
or report, you can analyze the aggregated data to gain insights about particular resources or
resource groups. There are two types of data aggregation:

Time aggregation
All data points for a single resource over a specified time period.

Spatial aggregation

All data points for a group of resources over a specified time period.



4.2 Data Transformation
Data transformation is the process of converting data from one format, such as a database file,
XML document or Excel spreadsheet, into another. Transformations typically involve converting
a raw data source into a cleansed, validated and ready-to-use format.

The goal of data transformation is to prepare the data for data mining so that it can be used to
extract useful insights and knowledge. Data transformation typically involves several steps,
including:

1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0
and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing
or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure that the data is in a format that is suitable for analysis and modeling, and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values.

ADVANTAGES AND DISADVANTAGES:

Advantages of Data Transformation in Data Mining:

1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and
modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to
remove sensitive information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the
performance of data mining algorithms by reducing the dimensionality of the data and
scaling the data to a common range of values.

Disadvantages of Data Transformation in Data Mining:

1. Time-consuming: Data transformation can be a time-consuming process, especially when dealing with large datasets.



2. Complexity: Data transformation can be a complex process, requiring specialized skills
and knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not properly
understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.

6. Overfitting: Data transformation can lead to overfitting, which is a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new, unseen data.

4.2.1 Dataset Merging


Data merging is the process of combining two or more similar records into a single one. Merging is done to add variables to a dataset, append or add cases or observations to a dataset, or remove duplicates and other incorrect information.

Done correctly, this process makes it easier and faster to analyze data stored in multiple locations, worksheets, or data tables. Merging data into a single point is necessary in certain situations, especially when an organization needs to add new cases, variables, or data based on lookup values. However, data merging needs to be performed with caution; otherwise, it can lead to duplication, inaccuracy, or inconsistency issues.

Stages of data merging


The process of merging can be categorized into three stages: pre-merging, merging, and post-merging.

1. Pre-merging process
Data Profiling: Before merging, it is crucial to profile the data, analyzing the different parts of
data sources. This step helps an organization understand the outcomes of merging and prevent any
potential errors that may occur. Data profiling consists of two important steps:



• Analyzing the list of attributes that each data source possesses. This step helps an
organization understand how the merged data will scale, what attributes are intended to be
merged, and what may need to be supplemented.
• Analysis of the data values in each part of a source to assess the completeness, uniqueness, and distribution of attributes. In short, a data profile validates the attributes against a predefined pattern and helps to identify invalid values.

Standardize and Transform Data: Data sources may contain incomplete and invalid values.
These datasets cannot be merged before they are standardized. In addition to errors, data attributes
from different sources may contain the same information, such as customer names. However, the
format of these data values can be entirely different. Due to the lexical and structural differences
in datasets, data loss and errors may occur. In order to standardize data, certain factors need to be
managed.

• Invalid characters that cannot be printed, void values, and trailing spaces should be
replaced with valid values. An example is not allowing more than one space when data is
entered or reducing all multiple spaces to one when transforming data.
• To standardize long fields of data, records should be parsed into smaller parts in different
source files. This helps to ensure data accuracy remains even after data sources are merged.
• Constraints for integration should be defined. For example, the maximum or minimum
number of characters in a certain field should be defined, or a hyphenated surname should
contain no spaces.

Data Filtering: A part or subset of the original data sources can be merged instead of an entire
data source. Such horizontal slicing of the data is done when data in a constrained period of time
needs to be merged, or when only a subgroup of rows meets the conditional criteria. Vertical
slicing can only be done when a data source contains attributes that do not have any valuable
information.

Data Uniqueness: Many times, the information from a single entity may be stored across a
number of sources. Merging data becomes more complex if the datasets contain duplicates.
Therefore, before beginning any merging process, it is important to run data matching algorithms.
These help to apply conditional rules to identify and delete duplicates and result in the uniqueness
of records across all sources.

2. Merging: Integration and aggregation of data


The process of merging can either be an integration or an aggregation. Once all the previous steps
have been completed, the data is ready for merging. There are a number of ways this process can
be achieved.

According to specific use cases, appending rows, columns or both can be done. This can be
quite simple if the datasets do not contain many null values and are reasonably complete. But there
could be problems if there are vacant spaces in the datasets that need to be looked up and filled.
Often, data merging techniques are used to bring the data together. It is also possible to perform a
conditional merge initially, and then finish the merge by appending columns and rows.



Append Rows: Appending rows is done when records sourced from different datasets need to be
combined in one place. The data sources to be joined need to have an identical structure. The data
types, pattern validations, and integrity constraints of corresponding columns also need to be the
same to avoid invalid formatting problems. Data matching should be performed with or before
merging if the data of an entity is from different sources.

Appending Columns: This process is done when more dimensions need to be added to an existing
record. In such scenarios, all columns from different sources must be made unique. Every single
record should be uniquely identifiable across all sets of data, making it easier for records with the
same identifier to be merged. If the merging column does not contain any data, then null values
should be specified for all records from that dataset. However, if many datasets contain the same
or highly similar dimension information, then these dimensions can be merged together in one
field.

Conditional Merging: Conditional merging is used when there are incomplete datasets that need
to be united. In this type of merge, values from one dataset need to be looked up and the other
datasets need to be filled in accordingly. The source dataset from which values are looked up
should contain all unique records. However, the dataset to which data is being appended or the
target dataset does not need to have unique values.
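The appending and conditional-merge operations described above can be sketched in pandas. The frames and column names below are hypothetical, and a real merge would follow the profiling and standardization steps first:

import pandas as pd

north = pd.DataFrame({"cust_id": [1, 2], "sales": [100, 250]})
south = pd.DataFrame({"cust_id": [3, 4], "sales": [None, 300]})
segments = pd.DataFrame({"cust_id": [1, 2, 3, 4], "segment": ["A", "B", "A", "B"]})

# Append rows: identically structured tables stacked into one
all_rows = pd.concat([north, south], ignore_index=True)

# Append columns: add new dimensions keyed on a unique identifier
with_cols = all_rows.merge(segments, on="cust_id", how="left")

# Conditional merge: fill missing values by looking them up in a reference table
lookup = pd.DataFrame({"cust_id": [3], "sales": [95]})
filled = with_cols.set_index("cust_id").combine_first(lookup.set_index("cust_id")).reset_index()
print(filled)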

3. Post-merging process
Once the merge process is finished, it is important to do a final profile audit of the merged source,
as is done at the start of the merge process. This will help find errors that may have occurred
during the merge. Any inaccurate, incomplete, or invalid values can also be spotted.

Challenges during the data merging process


There are several challenges an organization may face during the data merge process.

The lexical and structural differences across datasets can make it difficult to merge without error.

• Structural heterogeneity occurs when datasets do not contain the same type or number
of attributes (columns).
• Lexical heterogeneity happens when fields from different datasets are structurally the same but represent the same information in different ways (for example, different formats or labels for the same values).
• Another major issue is scalability. Data merges are usually planned and actioned not based
on the ability to scale up but by the number and types of sources. Over time, systems that
integrate more data sources with a range of structures and mechanisms for storage will be
required. To overcome this, an organization must design a system that is scalable in size,
structures, and mechanisms. Instead of hardcoding the integration to be a set process, the
data integration system needs to be reusable, with a scalable architecture.
• There is also the challenge of data duplication. There are different ways in which data
duplication can happen in the dataset. To start with, there may be multiple records of the
same entity. Further, there may be many attributes storing exactly the same information
about an individual entity. These duplicate attributes or records can be found in the same



dataset or across multiple datasets. The solution to this problem is using data matching
algorithms and conditional rules.
• Lengthy merging processes are another common issue. Many times, data integration processes take far longer than anticipated. This can be prevented through more realistic planning, including data merging experts in the planning, and avoiding last-minute changes and amendments. Scope creep is to be avoided.

Types of merging
There are a range of merge options, depending on the organization’s legacy datasets, and software
options.

One to one merge


• One-to-one is the most basic and simplest type of merge. Both the master dataset and the dataset being merged should have a common key variable to enable a smooth merge. Variables in both datasets will be merged together into the same file; observations that do not match in both datasets will have missing values for the merged variables.

Many to one merge


• Many-to-one merges may contain duplicate entries for one of the two key columns.
• Each unique identifier in the master dataset corresponds to one row, while unique identifiers in the other (merging) dataset may correspond to multiple rows.

Many to many merge


• Many-to-many joins are complex compared to the other merges. These are done when the key column contains duplicates in both the left and right arrays. Each matching pair of rows is combined, so the way key values map between the datasets should be understood before merging (see the sketch below).
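A minimal pandas sketch of these three merge types, using hypothetical frames; the validate argument makes pandas check the expected key relationship:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ben", "Chitra"]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["A", "B", "A"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 3], "amount": [10, 20, 15, 30]})
returns = pd.DataFrame({"cust_id": [1, 1, 2], "return_id": ["r1", "r2", "r3"]})

# One-to-one: the key is unique on both sides
one_to_one = customers.merge(profiles, on="cust_id", validate="1:1")

# Many-to-one: duplicates allowed on the left (orders), unique on the right (customers)
many_to_one = orders.merge(customers, on="cust_id", validate="m:1")

# Many-to-many: duplicates on both sides produce every matching combination
many_to_many = orders.merge(returns, on="cust_id", how="inner")
print(len(many_to_many))  # customer 1: 2 orders x 2 returns = 4 rows, plus 1 for customer 2 -> 5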

New case merging


• The merging of new cases is done by appending data in different ways. It can be achieved
by adding more rows of data in every column. This is possible when the two files’ variables
are the same. For example, if the variable is numeric in one file, it needs to be numeric in
the other. It cannot be numeric in one and string variable in the other. In case of an
automated merge, the tool matches the data on the name of the variable. This makes it
important to use the same names in the two files. If one file has a variable that does not
have a match in another, the missing data or blank values will be inserted accordingly.

Merging new variables


• When merging new variables, while the identifiers for each case in both files are required
to be the same, the variable names need to be different. This process is also called
augmenting the data. In structured query language, it is done using the join function. When



merging data column by column or adding more columns of data to each row, the user
adds new variables for each current case in the data file.
• When new variables are merged where all of the variables are not present, the missing
cases should be replaced with blank values, such as in merging new cases. If there are new
files with new variables and new cases, the merge depends on which software is being used
for the merge. Sometimes it cannot handle merging cases and variables simultaneously. In
such scenarios, first augment or merge in only the new variables. Then, the new cases can
be appended to all variables.

Merging data using lookup values


• Merging works best when there are complete, whole data sets to be combined. However,
when data must be augmented with information from other datasets or sources, there are
certain factors to be considered. Organizations must survey data in one file with
corresponding values of the other. The lookup code is used as an identifier, adding values
as new variables in the data file. The data is paired up for each case by using the lookup
code. Then the data from the original file is augmented with the merging data for the
matching lookup code.
• This look-up code should be unique in the file with the additional data, but the same value
can be present many times in the file that needs to be augmented.

4.2.2 Dataset Reshaping


While analyzing data, we may need to reshape tabular data. Pandas provides two methods, melt() and pivot(), that help reshape the data into a desired format.

melt()

This method flattens/melts tabular data such that the specified columns and their
respective values are transformed into key-value pairs. ‘keys’ are the column names of the
dataset before transformation and ‘values’ are the values in the respective columns. After
transformation, ‘keys’ are stored in a column named ‘variable’ and ‘values’ are stored in another
column named ‘value’, by default.

Consider a dataset with a 'Particulars' column and one column of values per year. After melting, the 'Particulars' column stays as it is and the associated yearly data appears as key-value pairs in the adjacent columns. In this case, the 'Particulars' column is called an 'id' column and should be passed to the 'id_vars' argument of the melt() method.



df_melt = df.melt(id_vars='Particulars', value_name="Amount")

We may also specify a list of columns to be considered as 'id' columns, and we can rename the 'variable' and 'value' columns using the 'var_name' and 'value_name' arguments, as shown in the runnable sketch below.
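A minimal runnable sketch (the original dataset is not reproduced in this document, so the yearly figures below are made up):

import pandas as pd

df = pd.DataFrame({
    "Particulars": ["Revenue", "Expenses"],
    "2020": [100, 60],
    "2021": [120, 75],
})

# 'Particulars' is the id column; the year columns become key-value pairs
df_melt = df.melt(id_vars="Particulars", var_name="Year", value_name="Amount")
print(df_melt)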
pivot()

This method does the reverse of what melt() does: it transforms the key-value pairs back into columns. The sketch below converts 'df_melt' back to the original format.
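Continuing the hypothetical example above, a sketch of the reverse transformation:

import pandas as pd

df_melt = pd.DataFrame({
    "Particulars": ["Revenue", "Expenses", "Revenue", "Expenses"],
    "Year": ["2020", "2020", "2021", "2021"],
    "Amount": [100, 60, 120, 75],
})

# pivot() turns the key-value pairs back into one column per year
df_wide = df_melt.pivot(index="Particulars", columns="Year", values="Amount").reset_index()
df_wide.columns.name = None  # drop the leftover axis name
print(df_wide)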



In short: melt() and pivot() are the two methods that come in handy when it comes to transforming tabular data into a desired format. melt() converts columns to key-value pairs, while pivot() converts key-value pairs to columns.

4.3 Data Enrichment


Data enrichment is the process of improving the accuracy and reliability of your raw customer
data. Teams enrich data by adding new and supplemental information and verifying the
information against third-party sources. Data enriching (also called data appending) ensures
your data accurately and thoroughly represents your audience.

Building on the existing data allows for better business decisions and better customer
relationships.

Commercial data providers such as Experian can help fill in missing or incomplete information and enhance records with more than 900 available data attributes, including financial data, buyer propensity, automotive data, and much more.

The more you know about your customers, the better you can serve them. Enrich your data to
advance your messaging, provide offers tailored to their interests, and build loyal clientele.



Data enrichment provides:

• A deeper understanding of customer needs and assets


• An improved customer experience
• Additional verified data attributes

Use Case Examples


The most common use case example for data enrichment is adding demographic data that comes from other internal systems or external (third-party) sources to customer data. Specific examples include:

Marketing – adding data from any number of sources to better refine marketing campaigns and
processes for better targeting and offers.

Lending – using third-party databases to help lenders develop a more complete profile of their customers for credit scoring or underwriting.

Insurance – add as much data as possible (both internal and third-party) to enrich data for
customer categorization, segmentation, and targeting.

Retail – adding data to better profile and recognize customer needs for recommendations,
upselling, and cross-selling.

Data Enrichment Approaches


There are multiple ways data can be enriched, including appending data, segmentation, deriving
attributes, imputation, entity extraction, and categorization.

Appending Data

By appending data, you bring multiple data sources together to create a more holistic data set than
any one data source. This helps you generate more accurate analytics or explore more variables
to use as features to improve machine learning models. Appended data can be both internal and
external data.

Segmentation

Data segmentation allows you to divide or organize a dataset according to specific field values in
the data. Very common segmentation is done using demographic, geographic, technology, or
behavior values. This is often used in marketing use cases for targeting.

Derived Attributes

Derived attributes are fields added to a dataset that are calculated from other fields. The most commonly used derived attribute is Age, calculated by subtracting the birthdate from the current



date. Other derived attributes include date/time conversions (hour, day, month, quarter), time
periods, time between, counts, and classifications (time bands, age bands, etc.).
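A small pandas sketch of deriving such attributes; the table and the fixed reference date are hypothetical:

import pandas as pd

customers = pd.DataFrame({
    "name": ["Asha", "Ben"],
    "birthdate": pd.to_datetime(["1990-05-12", "1984-11-03"]),
    "signup": pd.to_datetime(["2023-01-15", "2022-07-01"]),
})
today = pd.Timestamp("2024-01-01")  # fixed date so the example is reproducible

# Derived attributes calculated from existing fields
customers["age_years"] = (today - customers["birthdate"]).dt.days // 365
customers["signup_month"] = customers["signup"].dt.month
customers["signup_quarter"] = customers["signup"].dt.quarter
customers["tenure_days"] = (today - customers["signup"]).dt.days
customers["age_band"] = pd.cut(customers["age_years"], bins=[0, 30, 45, 120],
                               labels=["<30", "30-45", "45+"])
print(customers)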

Data Imputation

Some consider data imputation part of data cleansing, as it is the process of replacing values for
missing or inconsistent data within fields. A prime example is estimating the value of a missing
field based on other values.
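A minimal sketch of simple imputation with pandas (hypothetical columns); more sophisticated approaches estimate the missing field from other fields, for example with a regression model:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 41, 29], "income": [40000, 52000, None, 38000]})

# Replace missing values with a simple statistic computed from the same column
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
print(df)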

Entity Extraction

When one is using more complex unstructured or semi-structured data, multiple data values may
be encoded within one field. To make the data useful, the values need to be extracted from one
field, then exploded out into one or more new columns in the data.

Conclusion
Data enrichment is an often overlooked yet highly critical part of producing analytics-ready
datasets. This is often because when designers decide what data to capture in applications, they
are not privy to downstream analytics data requirements. In addition, analytics data needs will
always change over time.

Therefore, it is critical to have a highly evolved, easy-to-use data transformation tool that allows
any team member to transform and enrich data to their specific needs. This allows the analytics
teams to be more responsive to the business, produce highly accurate analytics, and drive greater
adoption of analytics.

4.4 Data Normalization


Data normalization is a technique used in data mining to transform the values of a dataset into a
common scale. This is important because many machine learning algorithms are sensitive to the
scale of the input features and can produce better results when the data is normalized.

There are several different normalization techniques that can be used in data mining,
including:

1. Min-Max normalization: This technique scales the values of a feature to a range between
0 and 1. This is done by subtracting the minimum value of the feature from each value,
and then dividing by the range of the feature.
2. Z-score normalization: This technique scales the values of a feature to have a mean of 0
and a standard deviation of 1. This is done by subtracting the mean of the feature from
each value, and then dividing by the standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing the values of a
feature by a power of 10.
4. Logarithmic transformation: This technique applies a logarithmic transformation to the
values of a feature. This can be useful for data with a wide range of values, as it can help
to reduce the impact of outliers.



5. Root transformation: This technique applies a square root transformation to the values of
a feature. This can be useful for data with a wide range of values, as it can help to reduce
the impact of outliers.

It's important to note that normalization should be applied only to the input features, not the target variable, and that different normalization techniques may work better for different types of data and models.

Need of Normalization
Normalization is generally required when we are dealing with attributes on different scales; otherwise, an important attribute measured on a smaller scale may be diluted by an equally important attribute whose values are on a larger scale. In simple words, when multiple attributes are present but have values on different scales, this may lead to poor data models while performing data mining operations, so the attributes are normalized to bring them all onto the same scale.

4.4.1 Decimal Scaling Normalization


Decimal scaling is another technique for normalization in data mining. It works by moving the decimal point of the values of the attribute. How far the decimal point is moved depends on the maximum absolute value among all values of the attribute.

Decimal Scaling Formula

V' = V / 10^j

Here:
• V' is the new value after applying the decimal scaling
• V is the respective value of the attribute
• j is the integer that defines the movement of the decimal point; it is equal to the number of digits in the maximum (absolute) value in the data table

Here is an example:

Example 1: Suppose a company wants to compare the salaries of the new joiners. Here are
the data values:
Employee Name    Salary
Sudhakar         10,000
Valentina        25,000
Mirsha            8,000
Srilakshmi       15,000



Now, look for the maximum value in the data. In this case, it is 25,000. Now count the number of digits in this value. In this case, it is 5, so 'j' is equal to 5 and the divisor is 10^5 = 100,000. This means V (the value of the attribute) needs to be divided by 100,000 here.

After applying the decimal scaling formula, here are the new values:

Name          Salary    Salary after Decimal Scaling
Sudhakar      10,000    0.1
Valentina     25,000    0.25
Mirsha         8,000    0.08
Srilakshmi    15,000    0.15

Thus, decimal scaling can tone down big numbers into easy-to-understand smaller decimal values. Also, data attributed to different units becomes easy to read and understand once it is converted into smaller decimal values.
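The worked example above can be reproduced with a few lines of NumPy (an illustrative sketch, not from the original text):

import numpy as np

salaries = np.array([10000, 25000, 8000, 15000])

# j = number of digits in the maximum absolute value (25,000 -> 5 digits)
j = len(str(int(np.max(np.abs(salaries)))))
scaled = salaries / 10 ** j
print(scaled)  # [0.1  0.25 0.08 0.15]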

4.4.2 Z-Score Normalization


The Z-score indicates how far a data point is from the mean. Technically, it measures how many standard deviations a value lies below or above the mean; most values fall within -3 to +3 standard deviations. Z-score normalization in data mining is useful for analyses in which a value needs to be compared with respect to a mean (average) value, such as results from tests or surveys. Thus, Z-score normalization is also popularly known as Standardization.

The following formula is used in the case of z-score normalization on every single value of the
dataset.
New value = (x – μ) / σ
Here:
x: Original value
μ: Mean of data
σ: Standard deviation of data
Example:

Consider the following dataset: 3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, and 134.

Solution:

• Compute the mean and standard deviation of the input data. The mean of this dataset is 21.2 and the standard deviation is 29.8.
• To perform z-score normalization on the first value of the dataset, the formula gives:

New value = (x – μ) / σ

New value = (3 – 21.2) / 29.8

∴ New value = -0.61



By performing z-score normalization on each value of the dataset, we get the following table.
Data Z score normalized value
3 -0.61
5 -0.54
5 -0.54
8 -0.44
9 -0.41
12 -0.31
12 -0.31
13 -0.28
15 -0.21
16 -0.17
17 -0.14
19 -0.07
22 0.03
24 0.09
25 0.13
134 3.79
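The same calculation can be sketched in NumPy; scikit-learn's StandardScaler applies the identical transformation column by column:

import numpy as np

data = np.array([3, 5, 5, 8, 9, 12, 12, 13, 15, 16, 17, 19, 22, 24, 25, 134])

mu = data.mean()    # ~21.2
sigma = data.std()  # population standard deviation, ~29.8
z = (data - mu) / sigma
print(np.round(z, 2))  # first value: -0.61, last value: 3.79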

4.4.3 Min-Max Normalization


In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are fetched and each value is replaced according to the following formula:

v' = ((v - Min(A)) / (Max(A) - Min(A))) * (new_max(A) - new_min(A)) + new_min(A)

where A is the attribute, Min(A) and Max(A) are the minimum and maximum values of A respectively, v is the old value of each entry in the data, v' is the new value, and new_min(A), new_max(A) are the boundary values of the required range.

Example 1: Suppose a company wants to decide on a promotion based on the years of work experience of its employees. So, it needs to analyze a database that looks like this:

Employee Name    Years of Experience
Sudhakar         8
Valentina        20
Mirsha           10
Srilakshmi       15

• The minimum value is 8
• The maximum value is 20

As this formula scales the data between 0 and 1:
• The new min is 0
• The new max is 1

After applying the min-max normalization formula, the following are the V’ values for the
attributes:



For 8 years of experience: v’= 0
For 10 years of experience: v’ = 0.16
For 15 years of experience: v’ = 0.58
For 20 years of experience: v’ = 1
So, the min-max normalization can reduce big numbers to much smaller values. This makes it
extremely easy to read the difference between the ranging numbers.
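A sketch of the same example, computed both with the formula and with scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

years = np.array([[8], [20], [10], [15]])  # one feature, one row per employee

# Manual formula for the [0, 1] range: (v - min) / (max - min)
manual = (years - years.min()) / (years.max() - years.min())

# Equivalent transformer
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(years)
print(manual.ravel())  # approx. [0, 1, 0.17, 0.58]
print(scaled.ravel())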

Difference between Min-Max Normalization and Z-Score Normalization:

Min-Max normalization:
• For scaling, the minimum and maximum values of the feature are used.
• Applicable when the features are of different sizes.
• The values are scaled to the range [0, 1] or [-1, 1].
• Gets easily affected by outliers.
• A transformer named MinMaxScaler is available in Scikit-Learn to perform the task.
• This method transforms n-dimensional data into an n-dimensional unit hypercube.
• Best if the distribution is unknown.
• Also known as Scaling Normalization.

Z-Score normalization:
• For scaling, the mean and the standard deviation are used.
• Useful when you want to maintain a zero mean and unit standard deviation.
• No fixed range is present.
• Not as affected by outliers.
• A transformer named StandardScaler is available in Scikit-Learn to perform the task.
• This method translates the data to the mean vector of the original data and then either squeezes or expands it.
• Best when the distribution is Normal (Gaussian).
• Also known as Standardization.

4.5 Cross Validation of Data


Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown data (or first-seen data) against which the model is tested (called the validation dataset or testing set). The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).



Types of Cross Validation
• K-fold cross-validation

• Hold-out cross-validation

• Stratified k-fold cross-validation

• Leave-p-out cross-validation

• Leave-one-out cross-validation (similar to the previous one)

• Monte Carlo (shuffle-split)

• Time series (rolling cross-validation)

4.5.1. K-fold cross-validation


In this technique, the whole dataset is partitioned in k parts of equal size and each partition is
called a fold. It’s known as k-fold since there are k parts where k can be any integer - 3,4,5, etc.

One fold is used for validation and other K-1 folds are used for training the model. To use every
fold as a validation set and other left-outs as a training set, this technique is repeated k times until
each fold is used once.

With 5 folds, there are 5 iterations. In each iteration, one fold is the test/validation set and the other k-1 folds (4 folds) form the training set. To get the final accuracy, take the average of the accuracies obtained on the k validation folds. This validation technique is not considered suitable for imbalanced datasets, as the folds may not preserve the proper ratio of each class's data and the model may therefore not get trained properly.



Here’s an example of how to perform k-fold cross-validation using Python.
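The original code listing is not reproduced in this document; a minimal scikit-learn sketch (assuming the Iris dataset and logistic regression purely for illustration) might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds -> 5 iterations; each fold serves once as the validation set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))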


4.5.2. Holdout cross-validation


Also called a train-test split, holdout cross-validation has the entire dataset partitioned
randomly into a training set and a validation set. A rule of thumb to partition data is that nearly
70% of the whole dataset will be used as a training set and the remaining 30% will be used as a
validation set. Since the dataset is split into only two sets, the model is built just one time on the
training set and executed faster.
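A minimal sketch of a 70/30 hold-out split (the dataset and model are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% of the data for training, the remaining 30% for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))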


4.5.3. Stratified k-fold cross-validation


As seen above, k-fold validation can’t be used for imbalanced datasets because data is split
into k-folds with a uniform probability distribution. Not so with stratified k-fold, which is an
enhanced version of the k-fold cross-validation technique. Although it too splits the dataset into k
equal folds, each fold has the same ratio of instances of target variables that are in the complete
dataset. This enables it to work perfectly for imbalanced datasets, but not for time-series data.



For example, if the original dataset contains far fewer females than males, the target variable distribution is imbalanced. In the stratified k-fold cross-validation technique, this ratio of instances of the target variable is maintained in all the folds.
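A sketch using a synthetic imbalanced dataset (an assumption for illustration); StratifiedKFold preserves the class ratio in every fold:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.round(3))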


4.5.4. Leave-p-out cross-validation


An exhaustive cross-validation technique in which, if a dataset has n samples, p samples are used as the validation set and the remaining n-p samples are used as the training set. The process is repeated for every possible way of dividing the dataset into a validation set of p samples and a training set of n-p samples, and continues until all combinations of samples have been used as a validation set. The technique produces good results but has a high computation time and is often deemed computationally unfeasible. It is also not considered ideal for an imbalanced dataset: if the training set contains all samples of one class, the model will not be able to generalize properly and will become biased toward one of the classes.
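A small sketch showing the splits themselves (a tiny array is used because the number of splits grows combinatorially):

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(5).reshape(-1, 1)  # 5 samples -> C(5, 2) = 10 splits for p = 2
lpo = LeavePOut(p=2)

for train_idx, val_idx in lpo.split(X):
    print("train:", train_idx, "validate:", val_idx)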

4.5.5. Leave-one-out cross-validation


In this technique, only 1 sample point is used as a validation set and the remaining n-1
samples are used in the training set. Think of it as a more specific case of the leave-p-out cross-
validation technique with P=1.

To understand this better, consider this example:


There are 1000 instances in your dataset. In each iteration, 1 instance will be used for the
validation set and the remaining 999 instances will be used as the training set. The process
repeats itself until every instance from the dataset is used as a validation sample.

The leave-one-out cross-validation method is computationally expensive to perform and shouldn't be used with very large datasets. The good news is that the technique is very simple and requires no configuration to specify. It also provides a reliable and unbiased estimate of your model performance.
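A minimal sketch (Iris and logistic regression are assumptions for illustration; LeaveOneOut needs no parameters):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One sample is held out per iteration -> 150 fits for the 150-sample Iris data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", round(scores.mean(), 3))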


4.5.6. Monte Carlo cross-validation


Also known as shuffle-split cross-validation and repeated random subsampling cross-validation, the Monte Carlo technique involves repeatedly splitting the whole dataset into training data and test data. The split can be 70-30%, 60-40%, or any percentage you prefer; the only condition is that each iteration uses a different random split, so different data points fall into the training and test sets each time.

The next step is to fit the model on the train data set in that iteration and calculate the accuracy of
the fitted model on the test dataset. Repeat these iterations many times - 100,400,500 or even
higher - and take the average of all the test errors to conclude how well your model performs.

For a run of 100 iterations, the model is trained and evaluated 100 times, each time on a different random split.



In each iteration, the training set and test set contain a different random selection of data points, and the average of the test errors is taken to estimate how well the model performs.
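A sketch using scikit-learn's ShuffleSplit, which implements this repeated random subsampling (the dataset and model are assumptions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 100 iterations; each draws a fresh random 70/30 split of the data
ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print("Mean accuracy over 100 random splits:", round(scores.mean(), 3))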

4.5.7. Time series (rolling cross-validation / forward chaining method)
Time series is the type of data collected at different points in time. This kind of data allows one to
understand what factors influence certain variables from period to period. Some examples of time
series data are weather records, economic indicators, etc.

In the case of time series datasets, cross-validation is not that trivial. You can't choose data instances randomly and assign them to the test set or the train set. Hence, this technique is used to perform cross-validation on time series data with time as the important factor.



Since the order of data is very important for time series-related problems, the dataset is split into
training and validation sets according to time. Therefore, it’s also called the forward chaining
method or rolling cross-validation.
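A sketch of the split indices produced by scikit-learn's TimeSeriesSplit on a small ordered sequence (the data here is just a placeholder):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # observations ordered in time

# The training window grows forward in time; validation always comes after it
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)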



4.6. Metrics to Evaluate Classification Model
There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and recall are also widely used metrics for classification problems.

4.6.1 Accuracy
Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.

4.6.2 Confusion Matrix
Confusion Matrix is a performance measurement for the machine learning classification
problems where the output can be two or more classes. It is a table with combinations of
predicted and actual values.

• True Positive: We predicted positive and it's true. For example, we predicted that a woman is pregnant and she actually is.
• True Negative: We predicted negative and it's true. For example, we predicted that a man is not pregnant and he actually is not.
• False Positive (Type 1 Error): We predicted positive and it's false. For example, we predicted that a man is pregnant but he actually is not.
• False Negative (Type 2 Error): We predicted negative and it's false. For example, we predicted that a woman is not pregnant but she actually is.

4.6.3 Precision
It explains how many of the correctly predicted cases actually turned out to be positive.
Precision is useful in the cases where False Positive is a higher concern than False Negatives. The



importance of Precision is in music or video recommendation systems, e-commerce websites, etc.
where wrong results could lead to customer churn and this could be harmful to the business.

Precision for a label is defined as the number of true positives divided by the number
of predicted positives.

4.6.4 Recall (Sensitivity)


It explains how many of the actual positive cases we were able to predict correctly with our model.
Recall is a useful metric in cases where False Negative is of higher concern than False Positive.
It is important in medical cases where it doesn’t matter whether we raise a false alarm but the
actual positive cases should not go undetected!

Recall for a label is defined as the number of true positives divided by the total number of
actual positives.
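A short sketch (with made-up labels) showing how the metrics from sections 4.6.1-4.6.4 are computed with scikit-learn:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 1 = positive class
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))               # rows: actual, columns: predicted
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)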

4.6.5 AUC-ROC
The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates the
‘signal’ from the ‘noise’.

The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes; geometrically, it is the area enclosed between the ROC curve and the X-axis.

The greater the AUC, the better the performance of the model at distinguishing between the positive and negative classes across different threshold points. When AUC is equal to 1, the classifier is able to perfectly distinguish between all Positive and Negative class points. When AUC is equal to 0, the classifier would be predicting all Negatives as Positives, and vice versa. When AUC is 0.5, the classifier is not able to distinguish between the Positive and Negative classes.



Working of AUC

In a ROC curve, the X-axis shows the False Positive Rate (FPR) and the Y-axis shows the True Positive Rate (TPR). The higher the X value, the higher the number of False Positives (FP) relative to True Negatives (TN); the higher the Y value, the higher the number of True Positives (TP) relative to False Negatives (FN). So the choice of the threshold depends on the ability to balance between FP and FN.
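A sketch of computing the ROC curve and AUC with scikit-learn (the breast-cancer dataset and the model are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points of the ROC curve
print("AUC:", round(roc_auc_score(y_te, scores), 3))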

4.6.7 Root Mean Squared Error (RMSE)


RMSE is the most popular evaluation metric used in regression problems. It follows an assumption
that errors are unbiased and follow a normal distribution. Here are the key points to consider on
RMSE:

1. The 'square root' allows this metric to reflect large deviations in the errors.
2. The 'squared' nature of this metric helps to deliver more robust results by preventing positive and negative error values from canceling each other out. In other words, this metric aptly displays the plausible magnitude of the error term.
3. It avoids the use of absolute error values, which is highly undesirable in mathematical
calculations.
4. When we have more samples, reconstructing the error distribution using RMSE is
considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers
from your data set prior to using this metric.
6. As compared to mean absolute error, RMSE gives higher weightage and punishes large
errors.

The RMSE metric is given by:

RMSE = sqrt( (1/N) * Σ (predicted_i − actual_i)^2 )

where N is the total number of observations.
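A quick sketch with made-up numbers, computing RMSE both directly from the formula and via scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.0, 8.0, 12.0])

rmse_manual = np.sqrt(np.mean((predicted - actual) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
print(rmse_manual, rmse_sklearn)  # identical values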

4.6.8. R-Squared/Adjusted R-Squared


We learned that when the RMSE decreases, the model’s performance will improve. But these
values alone are not intuitive.

In the case of a classification problem, if the model has an accuracy of 0.8, we can gauge how good our model is against a random model, which has an accuracy of 0.5. So the random model can be treated as a benchmark. But when we talk about the RMSE metric, we do not have a benchmark to compare against. R-squared provides such a benchmark: it compares the model's squared error with that of a baseline model that always predicts the mean, R-squared = 1 − (SS_res / SS_tot), so values closer to 1 indicate a better fit. Adjusted R-squared additionally penalizes the inclusion of predictors that do not improve the model.



4.6.9. Contingency table
A contingency table (also known as a cross tabulation or crosstab) is a type of table in a
matrix format that displays the multivariate frequency distribution of the variables. They are
heavily used in survey research, business intelligence, engineering, and scientific research. They
provide a basic picture of the interrelation between two variables and can help find interactions
between them.

Example: Suppose there are two variables, sex (male or female) and handedness (right- or
left-handed). Further suppose that 100 individuals are randomly sampled from a very large
population as part of a study of sex differences in handedness. A contingency table can be created
to display the numbers of individuals who are male right-handed and left-handed, female right-
handed and left-handed.

The numbers of the males, females, and right- and left-handed individuals are called marginal
totals. The grand total (the total number of individuals represented in the contingency table) is the
number in the bottom right corner.

The table allows users to see at a glance that the proportion of men who are right-handed is about
the same as the proportion of women who are right-handed although the proportions are not
identical.

Finding Relationships in Contingency Table


Consider a contingency table in which the two categorical variables are gender and ice cream flavor preference. This is a two-way table (2 x 3) where each cell represents the number of times males and females prefer a particular ice cream flavor.

How do we go about identifying a relationship between gender and flavor preference?

If there is a relationship between ice cream preference and gender, we'd expect the conditional distribution of flavors in the two gender rows to differ. From the contingency table, females are more likely to prefer chocolate (37 vs. 21), while males prefer



vanilla (32 vs. 12). Both genders have an equal preference for strawberry. Overall, the two-way
table suggests that males and females have different ice cream preferences.

Here’s how to calculate row and column percentages in a two-way table.

o Row Percentage: Take a cell value and divide by the cell’s row total.
o Column Percentage: Take a cell value and divide by the cell’s column total.

For example, the row percentage of females who prefer chocolate is simply the number of
observations in the Female/Chocolate cell divided by the row total for women: 37 / 66 = 56%.

The column percentage for the same cell is the frequency of the Female/Chocolate cell divided by
the column total for chocolate: 37 / 58 = 63.8%.
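These calculations can be reproduced with pandas. The chocolate and vanilla counts come from the text; the strawberry counts (17 females, 18 males) are assumptions chosen to be consistent with the row and column totals the text cites:

import pandas as pd

# Counts per Gender x Flavor cell (strawberry counts are assumed, see note above)
counts = pd.DataFrame(
    {"Chocolate": [37, 21], "Vanilla": [12, 32], "Strawberry": [17, 18]},
    index=["Female", "Male"],
)

row_pct = counts.div(counts.sum(axis=1), axis=0) * 100  # each row sums to 100%
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100  # each column sums to 100%

print(row_pct.round(1))  # e.g. Female/Chocolate = 37 / 66 = 56.1%
print(col_pct.round(1))  # e.g. Female/Chocolate = 37 / 58 = 63.8%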

Interpreting Percentages in a Contingency Table


A contingency table can also display both row and column percentages alongside the raw counts. The row percentages sum to 100% across each row, while the column percentages sum to 100% down each column.

Analysis or Plotting the Contingency Table

56% of females prefer chocolate versus only 29.6% of males. Conversely, 45% of males prefer vanilla, while only 18.2% of females prefer it. These results reconfirm our previous findings using the raw counts.



4.6.10 PR Curve
A PR curve is simply a graph with Precision values on the y-axis and Recall values on the x-axis.
In other words, the PR curve contains TP/(TP+FP) on the y-axis and TP/(TP+FN) on the x-axis.

• It is important to note that Precision is also called the Positive Predictive Value (PPV).
• Recall is also called Sensitivity, Hit Rate or True Positive Rate (TPR).

Plotting recall values on the x-axis and corresponding precision values on the y-axis generates a
PR curve that illustrates a negative slope function. It represents the trade-off between precision
(reducing FPs) and recall (reducing FNs) for a given model. Considering the inverse relationship
between precision and recall, the curve is generally non-linear, implying that increasing one metric
decreases the other, but the decrease might not be proportional.
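A sketch of computing the points of a PR curve with scikit-learn (dataset and model are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Precision (y-axis) and recall (x-axis) at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_te, probs)
print(precision[:5], recall[:5])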



4.6.11 A/B testing
A/B testing (also known as bucket testing, split-run testing, or split testing) is a user
experience research methodology. A/B tests consist of a randomized experiment that usually
involves two variants (A and B), although the concept can be also extended to multiple variants
of the same variable. It includes application of statistical hypothesis testing or "two-sample
hypothesis testing" as used in the field of statistics. A/B testing is a way to compare multiple
versions of a single variable, for example by testing a subject's response to variant A against
variant B, and determining which of the variants is more effective.

A/B tests are useful for understanding user engagement and satisfaction of online features like a
new feature or product.
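A sketch of the underlying two-sample hypothesis test, using made-up conversion counts for variants A and B and a chi-square test of independence:

from scipy.stats import chi2_contingency

# Hypothetical results: [converted, did not convert] for each variant
observed = [[120, 880],   # variant A: 12.0% conversion
            [150, 850]]   # variant B: 15.0% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print("p-value:", round(p_value, 4))  # a small p-value suggests the difference is not due to chance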

Pros: Through A/B testing, it is easy to get a clear idea of what users prefer, since it is directly
testing one thing over the other. It is based on real user behavior so the data can be very helpful
especially when determining what works better between two options. In addition, it can also
provide answers to very specific design questions. One example of this is Google's A/B testing
with hyperlink colors. In order to optimize revenue, they tested dozens of different hyperlink hues
to see which color the users tend to click more on.

Cons: There are, however, a couple of cons to A/B testing. As mentioned above, A/B testing is good for specific design questions, but that can also be a downside, since it is mostly only useful for specific design problems with very measurable outcomes. It can also be a costly and time-consuming process. Depending on the size of the company and/or team, there could be a lot of meetings and discussions about what exactly to test and what the impact of the A/B test is. If there is not a significant impact, it could end up as a waste of time and resources.

*End of Chapter 4*
