Data Mining unit-1 complete
Data Mining
Data mining is the process of extracting knowledge or insights from large amounts
of data using various statistical and computational techniques. The data can be
structured, semi-structured or unstructured, and can be stored in various forms
such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and relationships in
the data that can be used to make informed decisions or predictions. This involves
exploring the data using various techniques such as clustering, classification,
regression analysis, association rule mining, and anomaly detection.
Data Mining Functionalities
Data Characterization − It is a summarization of the general characteristics of an object class of data. The data corresponding to the user-specified class is generally collected by a database query, and the results can be presented in multiple forms.
Data Discrimination − It is a comparison of the general features of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects fetched through database queries.
Association Analysis − It analyses the set of items that generally occur together in a transactional dataset. Two parameters are used for determining the association rules −
Support − the percentage of transactions in the dataset that contain a particular itemset.
Confidence − the conditional probability that a transaction containing itemset X also contains itemset Y.
Classification − Classification is the process of assigning data objects to one of several predefined classes. It involves building a model based on a training dataset, where the classes are already known, and then using this model to classify new, unseen data points.
Steps in Classification:
1. Data Collection: Collect labeled data (input features and corresponding class labels).
Example: A dataset containing patient information and whether they have a disease
(yes or no).
2. Data Preprocessing: Clean the data, handle missing values, and normalize it if
necessary.
3. Model Training: Train a classification algorithm (e.g., Decision Tree, Support Vector
Machine, or Neural Networks) on the dataset. The algorithm learns patterns in the data
to distinguish between classes.
Common classification techniques include Decision Tree Induction, Bayesian Classification, Support Vector Machines, and Neural Networks.
Prediction − Prediction is used to forecast the value of a variable based on known data. In prediction, the algorithm learns a model from historical data and then uses that model to predict future or unknown outcomes.
Example (prediction): Predicting house prices based on features like size, location, number of rooms, etc.
Example (classification): Predicting whether an email is "spam" or "not spam" based on the content of the email.
Example of Prediction:
Objective: Predict the price of a house based on features like size, number of rooms,
and location.
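To make the classification steps above concrete, the following is a minimal illustrative sketch (not part of the original notes). It assumes Python with scikit-learn installed; the patient features, values, and labels are invented for illustration.

# A small sketch of the classification workflow using a decision tree.
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: labeled training data (features: age, blood pressure; label: disease yes=1 / no=0)
X_train = [[25, 120], [40, 140], [60, 160], [35, 130], [70, 170], [30, 118]]
y_train = [0, 0, 0, 0, 1, 1] if False else [0, 0, 1, 0, 1, 0]  # hypothetical labels

# Step 3: train a classification model on the labeled data
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Use the trained model to classify a new, unseen patient
print(model.predict([[65, 172]]))   # [1] -> predicted to have the disease

The same idea applies to prediction, except that the target is a numeric value (see the linear regression sketch later in these notes).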
Clustering − It is similar to classification, but the classes are not predefined. The data objects are grouped on the basis of similarity, so that objects within a cluster are highly similar to each other and dissimilar to objects in other clusters.
Outlier Analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are the data objects whose behaviour differs from the general behaviour of the other data objects. The analysis of this type of data can be useful in applications such as fraud detection.
Evolution Analysis − It describes and models trends for objects whose behaviour changes over time.
Data Processing
Data Processing: Data processing is an essential step in the data mining workflow. It
involves the steps of preparing and transforming raw data into a format suitable for
analysis. Data processing ensures that the data is clean, consistent, and usable for
modeling, allowing algorithms to extract meaningful patterns. Data processing
typically includes several stages such as data collection, cleaning, transformation,
reduction and integration. Each stage addresses specific issues with the data to ensure
the mining process produces accurate and valuable results.
Stages of Data Processing
1. Data Collection
The collection of raw data is the first step of the data processing cycle. The raw
data collected has a huge impact on the output produced. Hence, raw data should
be gathered from defined and accurate sources so that the subsequent findings are
valid and usable. Raw data can include monetary figures, website cookies,
profit/loss statements of a company, user behavior, etc.
2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw
data to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations, or missing data and transformed into a suitable form
for further analysis and processing. This ensures that only the highest quality data
is fed into the processing unit.
3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner,
or any other input source.
4. Data Processing
In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the desired
output. This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases, connected devices,
etc.) and the intended use of the output.
5. Data Output
The data is finally transmitted and displayed to the user in a readable form like graphs, tables, vector files, audio, video, documents, etc. This output can be stored and further processed in the next data processing cycle.
6. Data Storage
The last step of the data processing cycle is storage, where data and metadata are
stored for further use. This allows quick access and retrieval of information
whenever needed. Proper data storage is also necessary for compliance with data protection legislation such as the GDPR.
Methods of Data Processing
1. Manual Data Processing
Data is processed manually in this method. The entire procedure of data collection, filtering, sorting, calculation and other logical operations is carried out with human intervention, without using any electronic device or automation software. It is a low-cost methodology and needs very few tools, but it produces more errors and requires high labour costs and a lot of time.
2. Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These can include simple devices such as calculators, typewriters, printing presses, etc. Simple data processing operations can be achieved with this method. It has far fewer errors than manual data processing, but the growth in data volumes has made this method more complex and difficult.
3. Electronic Data Processing
Data is processed with modern technologies using data processing software and programs. The software gives a set of instructions to process the data and yield output. This method is the most expensive, but it provides the fastest processing speeds with the highest reliability and accuracy of output.
Forms of Data Pre-processing
Data pre-processing refers to the set of techniques used to prepare and clean raw data
for analysis or modeling. It is a crucial step in data mining and machine learning
processes, as it ensures that the data is consistent, accurate, and ready for further
exploration. The different forms of data pre-processing typically address issues such as
missing values, noisy data, inconsistent data, and irrelevant features.
1. Data Cleaning
2. Data Transformation
3. Data Integration
4. Data Reduction
5. Data Discretization: Discretization converts continuous attributes into a finite number of intervals or categories. There are two broad approaches:
1. Supervised Discretization
Uses class labels to guide the discretization process.
Example: Decision tree algorithms that split continuous variables into discrete ranges
based on information gain.
2. Unsupervised Discretization
Does not use class labels; relies only on the distribution of the data.
Example: Equal-width or equal-frequency binning of a continuous variable (see the Binning section below).
6. Data Sampling: Data sampling is the process of selecting a subset of data from a
larger dataset to analyze, model, or draw conclusions. It is widely used in statistics,
data science, and machine learning when working with massive datasets or when
analyzing the entire dataset is impractical.
Benefits of sampling:
Feasibility: Enables analysis when it’s impossible to work with the entire population.
Accuracy: Provides insights and predictions with manageable data size if sampling is
done correctly.
1. Probability Sampling
Every data point has a known, non-zero chance of being selected, which ensures randomness and reduces bias.
· Simple Random Sampling: Each data point in the dataset has an equal chance of being selected.
· Stratified Sampling:
Divides the population into distinct groups (strata) and samples proportionally from
each.
Example: Dividing customers into age groups and sampling from each group in proportion to its size.
2. Non-Probability Sampling
Selection is not based on random chance, so some data points are more likely to be chosen than others.
· Convenience Sampling: Selecting the data points that are easiest to access or collect (for example, only the customers who responded to a recent survey).
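The short sketch below illustrates these sampling approaches. It is not from the original notes; it assumes pandas is installed, and the data and column names are hypothetical.

import pandas as pd

# Hypothetical customer data with an age-group column used as the stratum
df = pd.DataFrame({
    "customer_id": range(1, 101),
    "age_group": ["18-25", "26-40", "41-60", "60+"] * 25,
})

# Simple random sampling: every row has an equal chance of selection
random_sample = df.sample(n=20, random_state=42)

# Stratified sampling: sample 20% from each age group
stratified_sample = df.groupby("age_group").sample(frac=0.2, random_state=42)

# Convenience sampling (for contrast): just take the first 20 available rows
convenience_sample = df.head(20)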
Data Cleaning
Data cleaning, also called data cleansing, is the process of identifying and correcting
(or removing) inaccurate, incomplete, or irrelevant data in a dataset. It ensures that the
data is accurate, consistent, and ready for analysis.
Steps in Data Cleaning
1) Data Inspection
Objective: Understand the dataset's structure and identify potential issues such as missing values, duplicates, or inconsistencies.
Actions:
Examine Summary Statistics: Look at mean, median, mode, min, max, and standard
deviation for numerical data.
Visual Inspection: Plot histograms, box plots, and scatter plots to spot outliers or
unusual patterns.
Check Data Types: Ensure that each column's data type (integer, string, date, etc.) is
appropriate.
3)Removing Duplicates :
Objective: Eliminate duplicate records to ensure that analysis is based on unique data
points.
Actions:
Identify Duplicates: Use tools or scripts to find and highlight duplicate rows based on
key columns (e.g., IDs, emails).
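A minimal pandas sketch of this step is shown below (illustrative only; the column names are hypothetical).

import pandas as pd

# Hypothetical records where the same customer was entered twice
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Identify duplicates based on key columns
print(df.duplicated(subset=["customer_id", "email"]))

# Drop duplicates, keeping the first occurrence
df_unique = df.drop_duplicates(subset=["customer_id", "email"], keep="first")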
6)Data Transformation
7)Data Integration
8)Data Reduction
9) Data Validation
Objective: Ensure that the cleaned data is accurate, consistent, and ready for analysis.
Actions:
Check Data Integrity: Verify that all data transformations have been applied correctly.
Visualize Data: Plot data distributions to ensure the cleaning process hasn’t distorted
the data.
Revalidate Key Metrics: Ensure that important business metrics and relationships are
still intact after cleaning.
10) Documentation
Objective: Document the steps taken during data cleaning to ensure transparency and reproducibility.
Actions:
Log Cleaning Steps: Keep a record of all changes made, including imputation methods,
transformations, and removed data points.
Review with Stakeholders: Ensure that the cleaned data meets the needs of the
analysis or business requirements.
These steps will help ensure that your data is accurate, consistent, and ready for
analysis or modeling. Proper data cleaning is essential for achieving reliable and valid
results.
Missing Values
Handling missing values is an important step in data cleaning, as missing data can
significantly impact the quality and reliability of analysis or machine learning models.
There are several strategies to handle missing values, and the choice of method
depends on the amount of missing data, the nature of the dataset, and the importance
of the missing information.
Remove Rows with Missing Values: If only a few rows have missing values, you can
simply drop those rows.
When to use: If the missing data is not critical and the number of affected rows is
small.
Example: If 1% of customer records have missing email addresses, and the email is not
essential for the analysis, you might remove these rows.
Remove Columns with Missing Values: If a column has a large portion of missing data
(e.g., more than 50% missing), it may be best to remove the entire column.
When to use: If the missing column does not contribute much to the analysis or model
and removing it won't lose valuable information.
Imputation refers to replacing missing values with substituted values based on other
data points. This can help preserve the integrity of the dataset without losing
information.
Mean, Median, or Mode Imputation: Fill missing numerical values with the mean,
median, or mode of that variable.
When to use: When the missing data is small and does not follow a pattern.
Example: If a column of ages has missing values, you can replace the missing values
with the average age of all the other customers.
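The sketch below shows both deletion and simple imputation with pandas (illustrative only; the data is made up).

import pandas as pd

# Hypothetical dataset with missing ages and emails
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "email": ["a@x.com", "b@x.com", None, "d@x.com", "e@x.com"],
})

# Option 1: remove rows that have any missing value
df_dropped = df.dropna()

# Option 2: remove a column that is mostly empty
# df = df.drop(columns=["mostly_empty_column"])

# Option 3: impute missing ages with the mean (median or mode work the same way)
df["age"] = df["age"].fillna(df["age"].mean())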
Noisy Data
Noisy data refers to data that has random errors, fluctuations, or outliers that make it
unreliable or inconsistent. Noise in data can distort the analysis or the predictive
model's performance, so it's crucial to clean or reduce noise during the data
pre-processing stage.
Noisy data contains random variations that do not represent the actual patterns or trends in the dataset. This noise can arise from various sources, such as data-entry mistakes, faulty sensors, or measurement errors. For example:
In a dataset tracking the height of children, one child's height might be recorded as 200 cm when it should be closer to 100 cm.
In financial data, a sudden fluctuation in stock prices due to a typo could appear as noise.
Types of Noisy Data:
1. Random Noise: This occurs due to random errors, which are unpredictable and do not
follow a pattern.
2.Outliers: Extreme values that deviate significantly from the rest of the data, often
representing noise.
Example: A person’s age being recorded as 120 years old when the average age is
around 30.
3.Duplicate Records: The same data point being entered multiple times in a dataset,
often leading to noise and confusion.
Handling noisy data involves identifying and reducing the noise in the dataset through
various techniques. The goal is to ensure the data used for analysis is as accurate and
relevant as possible.
The first step in dealing with noisy data is detecting the noise. This can be done
through:
1.Statistical Methods: Using measures like mean, median, variance, and standard
deviation to identify values that deviate significantly from the expected range.
2.Visual Inspection: Creating visualizations (like histograms, scatter plots, or box plots)
to spot outliers and irregular data points.
Outliers are data points that are significantly different from the rest of the data and
often represent noise. There are several methods to handle outliers:
Winsorization: Replace extreme values with the nearest valid data points within an
acceptable range.
Z-Score Method: Data points with z-scores greater than a threshold (e.g., 3 or -3) are
considered outliers and can be removed or adjusted.
In cases where missing or noisy data is identified, imputing the values based on other
available information can be an effective approach. For instance:
Mean/Median Imputation: Replacing missing or noisy values with the mean or median
of the column.
Predictive Imputation: Using models (like k-nearest neighbors) to predict the correct
value based on similar records.
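The following is an illustrative sketch of the z-score method and a quantile-based form of winsorization (clipping). It is not from the original notes; it assumes numpy and pandas, and the data is synthetic.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = pd.Series(np.append(rng.normal(30, 5, 200).round(), 120))  # 120 is an injected noise value

# Z-score method: flag points more than 3 standard deviations from the mean
z = (ages - ages.mean()) / ages.std()
outliers = ages[z.abs() > 3]          # the value 120 is flagged here

# Winsorization (quantile clipping): pull extreme values into an acceptable range
low, high = ages.quantile(0.01), ages.quantile(0.99)
ages_winsorized = ages.clip(lower=low, upper=high)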
Binning
Data binning, also called discrete binning or bucketing, is a data pre-processing
technique used to reduce the effects of minor observation errors. It is a form of
quantization. The original data values are divided into small intervals known as
bins, and then they are replaced by a general value calculated for that bin. This has a smoothing effect on the input data and may also reduce the chances of overfitting in the case of small datasets.
Statistical data binning is a way to group numbers of more or less continuous values
into a smaller number of "bins". It can also be used in multivariate statistics, binning in
several dimensions simultaneously. For example, if you have data about a group of
people, you might want to arrange their ages into a smaller number of age intervals,
such as grouping every five years together.
Binning can dramatically improve resource utilization and model build response time
without significant loss in model quality. Binning can improve model quality by
strengthening the relationship between attributes.
Supervised binning is a form of intelligent binning in which important characteristics
of the data are used to determine the bin boundaries. In supervised binning, the bin
boundaries are identified by a single-predictor decision tree that considers the joint
distribution with the target. Supervised binning can be used for both numerical and
categorical attributes.
Binning Process:
1. Sort the Data: Arrange the data values in increasing or decreasing order.
2. Define the Bins: Specify the number of bins or the boundaries for each bin. The bins
can have equal width or unequal width depending on the method.
3. Assign Data to Bins: Place the data points into their corresponding bins based on the
bin boundaries.
4. Label the Bins: Each bin is typically labeled with a representative value (e.g., the
mean or median of the bin) or simply by the bin interval.
Types of Binning
1. Equal Width Binning
In this method, the range of the data is divided into intervals of equal width. Each bin
will have the same range of values.
Example: If you have a data range from 0 to 100, and you choose 5 bins, each bin will
have an interval of 20 (i.e., 0-20, 21-40, 41-60, 61-80, and 81-100).
Advantages:
Simple to compute and easy to interpret.
Disadvantages:
Not ideal for skewed data, as some bins may contain too many data points, while
others may have too few.
Equal Width Binning: each bin has the same width w = (max − min) / (number of bins), so the bin boundaries are min + w, min + 2w, …, min + (n − 1)w.
Example with 3 equal-width bins, w = (215 − 5) / 3 = 70:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
2. Equal Frequency Binning
This method divides the data into bins so that each bin contains approximately the same number of data points. It is also known as quantile binning or equal-depth binning.
Example: For 100 data points and 5 bins, each bin will contain 20 data points.
Advantages:
Ensures balanced bins, which can help improve the performance of algorithms.
Disadvantages:
Unequal bin widths may result, which could make interpretation harder.
Example with 3 equal-frequency bins (4 values per bin):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
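The sketch below reproduces the two examples above with pandas (illustrative only; it assumes pandas is available).

import pandas as pd

data = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width binning: 3 bins of width (215 - 5) / 3 = 70
equal_width = pd.cut(data, bins=3)
print(data.groupby(equal_width, observed=True).apply(list))

# Equal-frequency (quantile) binning: 3 bins with 4 values each
equal_freq = pd.qcut(data, q=3)
print(data.groupby(equal_freq, observed=True).apply(list))

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = data.groupby(equal_width, observed=True).transform("mean")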
Applications of Binning:
Data Smoothing: Binning can smooth out the impact of noisy data by grouping similar
values together, helping to reduce the effect of outliers.
Clustering
The process of making a group of abstract objects into classes of similar objects is
known as clustering.
Points to Remember:
● A cluster of data objects can be treated as one group.
● While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
● The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
● It helps marketers find distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.
The following points explain why clustering is important in data mining (requirements of clustering):
● Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
● Ability to deal with different kinds of attributes − Algorithms should be able to work with types of data such as categorical, numerical, and binary data.
● Discovery of clusters with arbitrary shape − The algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures alone.
● Interpretability − The clustering results should be comprehensible, usable, and interpretable.
Clustering Methods:
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
1. Model-Based Method
Model-based clustering assumes that the data is generated by a mixture of underlying probability distributions (for example, a Gaussian mixture). The goal is to estimate the parameters of these distributions and to assign each data point to the distribution (cluster) that most likely generated it.
2. Hierarchical Method
Hierarchical clustering builds a hierarchy of clusters, either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters into smaller ones (divisive).
● Use Case: Useful for datasets where a nested grouping is beneficial, such as building a taxonomy or organising documents into topic hierarchies.
3. Constraint-Based Method
Constraint-based clustering incorporates user-specified or application-specific constraints, such as must-link (forcing two points to be in the same cluster) or cannot-link (forcing two points to be in different clusters).
● Use Case: When domain knowledge can be translated into constraints, such as requiring certain customers to be grouped together or kept apart.
4. Grid-Based Method
Grid-based clustering quantizes the data space into a finite number of cells that
form a grid structure. Clustering is then performed on the grid cells rather than the
actual data points, which can significantly speed up the clustering process,
5. Partitioning Method
Partitioning methods divide the dataset into a set of k clusters, where each cluster contains at least one object and each object belongs to exactly one cluster. The best-known example is k-means.
● Use Case: Customer segmentation and image compression.
6. Density-Based Method
Density-based clustering forms clusters as dense regions of data points, separated by areas of low density. It can identify clusters of arbitrary shape and is robust to noise. A well-known example is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
● Use Case: Effective for spatial data, such as identifying geographic regions with a high concentration of events, while filtering out noise.
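The sketch below contrasts a partitioning method (k-means) with a density-based method (DBSCAN). It is illustrative only; it assumes scikit-learn and numpy, and the points are made up.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Hypothetical 2-D points forming two dense groups plus one isolated noisy point
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [8, 8], [8.2, 7.9], [7.9, 8.1],
              [20, 0]])

# Partitioning method: k-means with k = 2 (every point is forced into a cluster)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based method: DBSCAN marks low-density points as noise (label -1)
dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(kmeans_labels)   # e.g. [0 0 0 1 1 1 1] - the isolated point joins a cluster
print(dbscan_labels)   # e.g. [0 0 0 1 1 1 -1] - the isolated point is labelled as noise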
Regression
Regression refers to a data mining technique that is used to predict numeric values in a given data set. For example, regression might be used to predict the cost of a product or service, or other continuous variables. It is also used in various industries for business and marketing analysis, trend analysis, and financial forecasting.
What is regression?
Regression is one of the most significant tools to analyse data and can be used for financial forecasting and time-series modelling. Regression involves fitting a straight line or a curve to numerous data points in such a way that the distance between the data points and the line (or curve) is the lowest. The most popular types of regression are linear and logistic regression; many other types can be used depending on how they perform on an individual data set.
Regression can predict a dependent variable expressed as a function of independent variables, provided the trend holds for the period of interest. Regression provides a good way to predict variables, but there are certain restrictions and assumptions, such as the independence of the variables and the normal distribution of the variables. For example, if two variables A and B have a bivariate joint distribution, they might be independent, but they can also be correlated, and this needs to be checked. Before applying regression analysis, the data needs to be studied carefully and certain preliminary tests performed to ensure that regression is applicable.
Types of Regression
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The equation is
Y = a + b*X + e
where a is the intercept, b is the slope of the line, and e is the error term.
In linear regression, the best-fit line is found using the least-squares method, which minimizes the total sum of the squares of the deviations from each data point to the line of regression. Here, the positive and negative deviations do not cancel out, as all the deviations are squared.
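The sketch below fits a least-squares linear regression of the form Y = a + b1*X1 + b2*X2 for the house-price example mentioned earlier. It is illustrative only; it assumes scikit-learn, and the house data is invented.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical house data: [size in sq. ft, number of rooms]
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 5]])
y = np.array([150000, 200000, 260000, 310000, 380000])   # prices

model = LinearRegression().fit(X, y)        # least-squares fit
print(model.intercept_, model.coef_)        # estimated intercept a and slopes b
print(model.predict([[1800, 3]]))           # predicted price for a new house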
Polynomial Regression
If the power of the independent variable is more than 1 in the regression equation, it is termed a polynomial regression, for example
Y = a + b * X^2
In this type of regression, the best-fit line is not a straight line as in linear regression; it is a curve fitted through the data points.
Fitting ever higher-degree polynomials can lead to overfitting when you try to minimize the error by making the curve more complex. Therefore, always try to fit a curve that generalizes well to the problem.
Logistic Regression
When the dependent variable is binary in nature, i.e., 0 and 1, true or false, success or
failure, the logistic regression technique comes into existence. Here, the target value (Y)
ranges from 0 to 1, and it is primarily used for classification-based problems. Unlike linear
regression, it does not need any independent and dependent variables to have a linear
relationship.
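A minimal sketch of logistic regression for a binary target is shown below (illustrative only; it assumes scikit-learn, and the study-hours data is invented).

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary data: hours of study vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))          # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))    # class probabilities, each between 0 and 1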
Ridge Regression
Ridge regression refers to a process that is used to analyse regression data that suffer from multicollinearity. When multicollinearity occurs, the least-squares estimates are unbiased but have high variance, so they can be quite different from the real values. By adding a degree of bias to the estimated regression coefficients, ridge regression reduces these errors.
Lasso Regression
The term LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso
regression is a linear type of regression that utilizes shrinkage. In Lasso regression, all the
data points are shrunk towards a central point, also known as the mean. The lasso procedure is best suited for simple, sparse models with fewer parameters than other regression models. This type of regression is also well suited for models that suffer from multicollinearity.
Application of Regression
Regression is a very popular technique, and it has wide applications in businesses and
industries. The regression procedure involves the predictor variable and response variable.
Common application areas include:
○ Business and marketing analysis
○ Trend analysis and financial forecasting
○ Environmental modeling
Regression vs. Classification
In regression, the nature of the predicted value is continuous (a numeric attribute), e.g., predicting the price of a house. In classification, the nature of the predicted value is categorical (a class label), e.g., predicting whether an email is spam or not.
Inconsistent Data
Inconsistent data occurs when the same variable or attribute is represented in different
ways across different records, or when multiple records that should contain the same
value contain conflicting information. Common examples of inconsistent data include:
· Different formats used for the same data (e.g., date format inconsistencies,
inconsistent text capitalization).
· Conflicting entries for the same attribute (e.g., two records stating different ages for
the same person).
Types of inconsistencies:
1. Format Inconsistencies: The same data recorded using different formats (e.g., the same date written as 01/02/2023 in one record and 2023-02-01 in another).
2. Unit Inconsistencies: The same measurement recorded in different units.
Example: Heights may be recorded in feet in some records and in meters in others.
3. Value Inconsistencies: When the same data point is recorded differently in various
records.
Example: A person’s age might be recorded as “25” in one record and “twenty-five” in
another.
4. Contradictory Data: When data points contradict each other for the same entity.
Example: One record says a person is “married,” while another record says they are
“single.”
A. Identifying Inconsistencies
Manual Inspection: Visually inspect the dataset to spot any obvious inconsistencies in
format, spelling, or data entry.
B. Standardizing Formats
Standardize Text Data: Ensure consistency in text values, such as names, addresses,
and categorical variables. This can involve:
Capitalization Consistency: Convert all text data to a uniform case (e.g., all uppercase
or lowercase).
Abbreviations: Expand abbreviations to full forms (e.g., "NY" to "New York") or vice
versa, depending on the preferred standard.
Spelling Correction: Use spell checkers or predefined lists to correct misspelled entries.
Example: "NY", "new york", and "New York" are all standardized to "New York".
C. Resolving Conflicting Records
Data Verification: Cross-check records to verify which one is correct. This may require
consulting external sources or subject-matter experts.
Consolidation: If two conflicting records are found, decide which value is correct, or
combine information from both (e.g., merging the correct portions from each record).
Example:
Conflict: One record says "25 years old," while another says "26 years old."
Solution: Use the most reliable or recent data, or apply a rule to reconcile the
discrepancy, such as selecting the most frequent value.
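The sketch below illustrates standardizing text values and resolving a conflict with a simple "most frequent value" rule. It is not from the original notes; it assumes pandas, and the records and mappings are hypothetical.

import pandas as pd

# Hypothetical records with inconsistent city names and conflicting ages for one person
df = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "city": ["NY", "new york", "LA", "New York"],
    "age": [25, 26, 40, 31],
})

# Standardize capitalization and expand abbreviations via a mapping table
city_map = {"ny": "New York", "new york": "New York", "la": "Los Angeles"}
df["city"] = df["city"].str.lower().map(city_map).fillna(df["city"].str.title())

# Resolve conflicting ages per person with a simple rule: keep the most frequent value
# (mode().iloc[0] falls back to the smallest value when there is a tie, as here)
resolved_age = df.groupby("person_id")["age"].agg(lambda s: s.mode().iloc[0])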
Data Integration
Data Integration refers to the process of combining data from multiple sources into a
unified view to provide a consolidated and consistent dataset for analysis and
decision-making. It is essential in data mining, business intelligence, and other
data-driven applications, as it allows organizations to gather insights from various
sources of data in a comprehensive manner.
Importance of Data Integration:
Holistic View: It allows organizations to combine data from various systems (e.g.,
customer databases, sales records, external data) to form a complete picture of the
information.
Improved Decision Making: By integrating data from multiple sources, businesses can
derive more accurate insights, leading to better decision-making.
Enhanced Efficiency: Automation of data integration processes saves time and reduces
the likelihood of errors associated with manual data processing.
Steps in the Data Integration Process:
1. Data Extraction:
The first step in the integration process is extracting data from different sources. These
sources could include databases, spreadsheets, APIs, flat files (CSV, XML), web
scraping, or even cloud-based systems.
Example: Extracting customer data from a CRM system, sales data from an ERP
system, and financial data from external market feeds.
2.Data Transformation:
This step involves transforming the extracted data into a consistent format suitable for
integration. It includes:
Data Standardization: Ensuring that different datasets follow the same format (e.g.,
dates, units of measurement).
3. Data Matching:
In this step, the data from different sources is matched and linked based on a common
identifier or key. For example, matching customer names or IDs from different systems
to ensure data corresponds to the same entity.
Example: Linking customer records from a sales database with customer feedback data
based on a unique customer ID.
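A minimal sketch of this matching step with pandas is shown below (illustrative only; the source tables and key are hypothetical).

import pandas as pd

# Hypothetical extracts from two source systems
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Asha", "Ben", "Carla"]})
sales = pd.DataFrame({"customer_id": [2, 3, 4],
                      "total_sales": [250.0, 90.0, 400.0]})

# Data matching: link records from both sources on the common customer_id key
integrated = pd.merge(crm, sales, on="customer_id", how="outer")
print(integrated)   # one unified row per customer, with NaN where a source had no record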
4.Data Loading:
The final step is loading the integrated data into a target storage system, such as a
data warehouse, database, or cloud-based platform. This process involves populating
the target database with the newly integrated dataset.
Example: After transforming the data, loading it into a central data warehouse to
create a unified view for reporting and analytics.
Approaches to Data Integration:
ETL (Extract, Transform, Load): Data is extracted from the sources, transformed into a consistent format, and then the transformed data is loaded into a target data warehouse or database.
ELT (Extract, Load, Transform): The data is first extracted and loaded into the target system, and then the transformation is done within the target system itself.
Example: Data from various sources (sales, marketing) is loaded into a data lake, and
then complex transformations are performed directly in the lake to prepare the data for
analysis.
Challenges of Data Integration:
1. Data Heterogeneity:
Data from different sources may vary in format, structure, and quality. For example,
data might come from different database systems, file formats (CSV, JSON, XML), or
even different business domains.
Example: A company may have customer data in a relational database and financial
data in a spreadsheet format.
2. Data Redundancy:
The same data may be stored in more than one source, so integration can produce duplicate or overlapping records.
Example: A customer might have multiple records in different systems, causing the
integrated data to have duplicated information.
Data Transformation
Data transformation refers to the process of converting data from one format,
structure, or scale to another. This transformation can take many forms, depending on
the needs of the analysis or model. It can involve scaling values, converting categorical
data to numerical values, encoding information, and other techniques that help the
data align with the intended analysis process.
Improve Accuracy: Proper transformation can improve the accuracy and efficiency of
statistical models or machine learning algorithms by reducing noise and simplifying
patterns in the data.
A. Normalization and Standardization
These techniques scale the data values into a specific range or distribution, making it
easier to work with, especially when working with machine learning algorithms that
are sensitive to the scale of data.
Normalization: This process rescales the data into a range, usually between 0 and 1, using x' = (x − min) / (max − min), to ensure all features have the same weight and magnitude.
When to Use:
Normalization is useful when the data has different units or when using algorithms
like k-means clustering or neural networks.
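The sketch below applies min-max normalization and z-score standardization with scikit-learn (illustrative only; the columns and values are made up).

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"age": [20, 35, 50, 65], "income": [20000, 48000, 75000, 120000]})

# Normalization (min-max): rescales each column to the [0, 1] range
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization (z-score): rescales each column to mean 0 and standard deviation 1
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)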
B. Encoding Categorical Data
Many machine learning models and algorithms require data in a numerical format.
Categorical data, which involves values representing categories or classes (e.g.,
gender, color), must be converted into numerical data before it can be processed.
One-Hot Encoding: This technique creates a new binary column for each possible
category and assigns a 1 or 0 based on whether the data belongs to that category.
Example: If “Color” has categories “Red”, “Green”, and “Blue”, the encoding would
create three columns: “Color_Red”, “Color_Green”, and “Color_Blue”. A record with
“Green” would be represented as (0, 1, 0).
Label Encoding: Each category is assigned a unique integer. For example, “Red” = 0,
“Green” = 1, “Blue” = 2.
When to Use:
One-Hot Encoding is useful when there is no inherent ordering in the categories, and
you want to avoid giving arbitrary numerical weight to categorical variables.
Label Encoding is more efficient when there is a natural ordinal relationship between
the categories (e.g., “Low”, “Medium”, “High”).
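The sketch below shows both encodings for the Color/Size example (illustrative only; it assumes pandas).

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"],
                   "Size": ["Low", "Medium", "High", "Medium"]})

# One-hot encoding: one binary column per category of "Color"
one_hot = pd.get_dummies(df, columns=["Color"])
print(one_hot.columns.tolist())   # ..., Color_Blue, Color_Green, Color_Red

# Label (ordinal) encoding: map each ordered category to an integer explicitly
df["Size_encoded"] = df["Size"].map({"Low": 0, "Medium": 1, "High": 2})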
C. Aggregation (covered in more detail under Data Reduction below)
Data Reduction
Data reduction refers to the process of reducing the volume of data while maintaining
its integrity and essential information. It is particularly useful in handling large
datasets, as it reduces computational complexity, storage requirements, and enhances
the efficiency of analysis or machine learning algorithms. Data reduction helps in
making the data more manageable without compromising its value.
Reduction Types:
1. Dimensionality Reduction
2.Data Compression
3.Aggregation
1. Dimensionality Reduction
Dimensionality reduction decreases the number of features (attributes) in the dataset while preserving as much information as possible.
PCA (Principal Component Analysis): PCA is a statistical technique that transforms the original features into a set of new
features, called principal components, that are uncorrelated and ordered in terms of
variance. The goal is to reduce the number of dimensions by selecting the top principal
components that explain the most variance in the data.
LDA (Linear Discriminant Analysis) is another dimensionality reduction technique. Example: LDA is often used in classification tasks where the goal is to reduce dimensionality while maintaining class separability.
When to Use:
Dimensionality reduction is useful when the dataset has many, possibly correlated, features and you want to reduce computation, remove redundancy, and avoid the curse of dimensionality.
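The sketch below reduces a 4-feature dataset to its top 2 principal components with PCA (illustrative only; it assumes scikit-learn and numpy, and the data is synthetic).

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset of 100 samples with 4 features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 2))])

# Keep the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component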
2.Data Compression
Data compression reduces the size of the dataset by encoding the information in a
more compact form without losing the important details. This is particularly important
for storage and transmission efficiency. Data compression is generally used when the
dataset contains redundancies that can be exploited to minimize the storage space
needed.
1.Lossless Compression:
In lossless compression, the original data can be fully restored after compression. No
information is lost, and the data retains its original quality.
Example: Algorithms like Huffman coding and Run-Length Encoding (RLE) are used in
lossless compression.
2.Lossy Compression:
In lossy compression, some less important information is discarded to achieve a much smaller size, so the original data cannot be perfectly restored.
Example: JPEG for images and MP3 for audio files are popular lossy compression
techniques.
When to Use:
Data compression is helpful when the dataset contains a lot of repetitive or redundant
information, such as in image, audio, and text data. It can also be used to reduce the
size of large files for efficient storage or transmission.
3.Aggregation
Aggregation involves combining multiple data points into a single representative value,
typically through summarizing or averaging. This technique is useful when you need to
simplify data and focus on key statistics rather than dealing with every individual data
point.
Types of Aggregation:
1. Summary Statistics:
You can aggregate data by taking the sum, average, or count of certain features. For
example, in sales data, you might aggregate the total sales per store, the average sales
per month, or the count of products sold per day.
2.Grouping by Categories:
Aggregation can also involve grouping data by certain categories (e.g., grouping
transactions by customer, region, or product type) and then calculating aggregate
measures like the mean or sum within each group.
Example: Aggregating data by regions to get the total sales in each region or
aggregating data by product categories to find the average price.
3.Statistical Aggregation:
In this type of aggregation, you can apply statistical measures like median, standard
deviation, or mode to aggregate data in a way that summarizes the overall distribution
and spread of data.
When to Use:
Aggregation is often used when dealing with large datasets or time-series data where
summarizing or generalizing the data is required to make it more digestible and
meaningful.
It is particularly useful for data visualization, reporting, and when analyzing data at
different levels of granularity (e.g., monthly vs. yearly sales data).
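The sketch below shows grouping and statistical aggregation with pandas for the sales example discussed above (illustrative only; the sales records are invented).

import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount": [100, 150, 80, 120, 60],
})

# Grouping by categories: total sales per region
totals = sales.groupby("region")["amount"].sum()

# Statistical aggregation: several summary measures per product at once
stats = sales.groupby("product")["amount"].agg(["sum", "mean", "median", "std"])

print(totals)
print(stats)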