
Data Mining Unit-1

Data Mining
Data mining is the process of extracting knowledge or insights from large amounts
of data using various statistical and computational techniques. The data can be
structured, semi-structured or unstructured, and can be stored in various forms
such as databases, data warehouses, and data lakes.

The primary goal of data mining is to discover hidden patterns and relationships in
the data that can be used to make informed decisions or predictions. This involves
exploring the data using various techniques such as clustering, classification,
regression analysis, association rule mining, and anomaly detection.

Data Mining Functionality

Data characterization − It is a summarization of the general characteristics of an object class of data. The data corresponding to the user-specified class is generally collected by a database query. The output of data characterization can be presented in multiple forms.

Data discrimination − It is a comparison of the general characteristics of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are fetched through database queries.

Association Analysis − It analyses the set of items that generally occur together in a transactional dataset. Two parameters are used for determining association rules:

Support identifies how frequently the item set appears in the database.

Confidence is the conditional probability that an item occurs in a transaction when another item occurs.
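As a rough illustration, support and confidence can be computed directly from a list of transactions. The sketch below is a hedged example, not tied to any mining library; the item names and the rule {bread} → {butter} are made up.

```python
# Hedged sketch: support and confidence of the rule {bread} -> {butter}
# over a small, made-up list of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_both / n               # fraction of transactions containing both items
confidence = count_both / count_bread  # P(butter | bread)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```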

Classification − Classification is a supervised learning technique where the goal is to assign a data item to one of a predefined set of classes or categories. It involves building a model based on a training dataset, where the classes are already known, and then using this model to classify new, unseen data points.

Steps in Classification:

1. Data Collection: Collect labeled data (input features and corresponding class labels).

Example: A dataset containing patient information and whether they have a disease
(yes or no).

2. Data Preprocessing: Clean the data, handle missing values, and normalize it if
necessary.

3. Model Training: Train a classification algorithm (e.g., Decision Tree, Support Vector
Machine, or Neural Networks) on the dataset. The algorithm learns patterns in the data
to distinguish between classes.

4. Model Testing: Evaluate the model's accuracy by testing it on a separate dataset.

5. Prediction: Use the model to classify new data points.

Techniques for Classification:


· Decision Tree

· Bayesian Classification

· K-Nearest Neighbors (KNN)
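As a minimal sketch of the steps above, assuming scikit-learn is available, the following trains a Decision Tree on a labeled dataset, tests it on a held-out split, and classifies a new point. The dataset and feature values are illustrative only.

```python
# Hedged sketch of the classification workflow using scikit-learn (assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # 1. labeled data (features + class labels)
X_train, X_test, y_train, y_test = train_test_split(  # split for training and testing
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)       # 3. model training
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                        # 4. model testing
print("accuracy:", accuracy_score(y_test, y_pred))

print("predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))  # 5. prediction on a new point
```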

Prediction − Prediction is a key functionality of data mining, where the goal is to forecast the value of a variable based on known data. In prediction, the algorithm learns a model from historical data and then uses that model to predict future or unknown outcomes.

Types of Prediction Tasks :

1. Regression Prediction (Continuous Values):

Objective: Predict a continuous value.

Example: Predicting house prices based on features like size, location, number of
rooms, etc.

2. Classification Prediction (Categorical Values):

Objective: Predict a class or label.

Example: Predicting whether an email is "spam" or "not spam" based on the content of
the email.

Example of Prediction:

Example : Predicting House Prices (Regression)

Objective: Predict the price of a house based on features like size, number of rooms,
and location.

1.Dataset: A dataset containing historical data about houses.


2.Model Training: Using a Linear Regression model, we can train the model on the data
to learn the relationship between house features (size, number of rooms, location) and
the price.
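A minimal sketch of this training step, assuming scikit-learn and a small made-up table of houses (size in square feet, number of rooms, and price):

```python
# Hedged sketch: fitting a Linear Regression model on made-up house data.
from sklearn.linear_model import LinearRegression

# Features: [size in sq. ft., number of rooms]; target: price (illustrative values)
X = [[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]]
y = [200000, 290000, 340000, 450000, 560000]

model = LinearRegression()
model.fit(X, y)

# Predict the price of a new 2000 sq. ft., 4-room house
print("predicted price:", model.predict([[2000, 4]])[0])
```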

Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

Outlier analysis − Outliers are data elements that cannot be grouped in a given class or cluster. They are data objects whose behaviour deviates from the general behaviour of the other data objects. The analysis of this type of data can be essential to mine the knowledge.

Evolution analysis − It describes the trends for objects whose behaviour changes over time.

Data Processing

Data processing is an essential step in the data mining workflow. It involves preparing and transforming raw data into a format suitable for analysis. Data processing ensures that the data is clean, consistent, and usable for modeling, allowing algorithms to extract meaningful patterns. It typically includes several stages such as data collection, cleaning, transformation, reduction and integration. Each stage addresses specific issues with the data to ensure the mining process produces accurate and valuable results.
Stages of Data Processing

Data processing consists of the following six stages.

1. Data Collection

The collection of raw data is the first step of the data processing cycle. The raw
data collected has a huge impact on the output produced. Hence, raw data should
be gathered from defined and accurate sources so that the subsequent findings are
valid and usable. Raw data can include monetary figures, website cookies,
profit/loss statements of a company, user behavior, etc.

2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw
data to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations, or missing data and transformed into a suitable form
for further analysis and processing. This ensures that only the highest quality data
is fed into the processing unit.

3. Data Input

In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner,
or any other input source.

4. Data Processing

In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate the desired
output. This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases, connected devices,
etc.) and the intended use of the output.

5. Data Interpretation or Output

The data is finally transmitted and displayed to the user in a readable form like
graphs, tables, vector files, audio, video, documents, etc. This output can be stored
and further processed in the next data processing cycle.

6. Data Storage

The last step of the data processing cycle is storage, where data and metadata are
stored for further use. This allows quick access and retrieval of information
whenever needed. Effective data storage is also necessary for compliance with data protection legislation such as the GDPR.

Methods of Data Processing


There are three main data processing methods, such as:

1. Manual Data Processing

Data is processed manually in this method. The entire procedure of data collection, filtering, sorting, calculation and other logical operations is carried out with human intervention, without using any electronic device or automation software. It is a low-cost method and requires few tools. However, it is error-prone, labor-intensive, and time-consuming.

2. Mechanical Data Processing

Data is processed mechanically through the use of devices and machines. These can
include simple devices such as calculators, typewriters, printing presses, etc. Simple data
processing operations can be achieved with this method. It has much fewer errors than
manual data processing, but the increase in data has made this method more complex and
difficult.

3. Electronic Data Processing

Data is processed with modern technologies using data processing software and
programs. The software gives a set of instructions to process the data and yield output.
This method is the most expensive but provides the fastest processing speeds with the
highest reliability and accuracy of output.
Forms of Data Pre-processing
Data pre-processing refers to the set of techniques used to prepare and clean raw data for analysis or modeling. It is a crucial step in data mining and machine learning, as it ensures that the data is consistent, accurate, and ready for further exploration. The different forms of data pre-processing typically address issues such as missing values, noisy data, inconsistent data, and irrelevant features.

1. Data Cleaning

2. Data Transformation

3. Data Integration

4. Data Reduction

5. Data Discretization: Data discretization is the process of transforming continuous data into discrete intervals or categories. This technique is commonly used in data mining, machine learning, and statistical analysis to simplify data representation and improve interpretability.

Importance of Data Discretization

1. Improves Model Performance: Many machine learning algorithms perform better with discrete data.

2. Reduces Complexity: Simplifies data by reducing the number of possible values.

3. Enhances Interpretability: Easier for humans to understand and analyze.

4. Facilitates Rule-Based Models: Enables the creation of rules and patterns in data mining.

Types of Data Discretization

1. Supervised Discretization
Uses class labels to guide the discretization process.

Aims to maximize the separation between different classes.

Example: Decision tree algorithms that split continuous variables into discrete ranges
based on information gain.

2. Unsupervised Discretization

Does not use class labels; relies only on the distribution of the data.

Example: Equal-width or equal-frequency binning.
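As a minimal sketch of unsupervised discretization, assuming pandas is available: pd.cut performs equal-width binning and pd.qcut performs equal-frequency (quantile) binning on a made-up list of ages.

```python
# Hedged sketch: equal-width vs. equal-frequency discretization with pandas (assumed available).
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 35, 41, 48, 55, 62, 70])

equal_width = pd.cut(ages, bins=3)   # 3 intervals of equal width
equal_freq = pd.qcut(ages, q=3)      # 3 intervals with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```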

6.Data Sampling : Data sampling is the process of selecting a subset of data from a
larger dataset to analyze, model, or draw conclusions. It is widely used in statistics,
data science, and machine learning when working with massive datasets or when
analyzing the entire dataset is impractical.

Importance of Data Sampling

Efficiency: Reduces computational time and resources.

Cost-Effective: Saves effort and expenses when collecting or processing large datasets.

Feasibility: Enables analysis when it’s impossible to work with the entire population.

Accuracy: Provides insights and predictions with manageable data size if sampling is done correctly.

Types of Data Sampling

1. Probability Sampling:

Each data point in the dataset has a known, non-zero chance of being selected. It ensures randomness and reduces bias.

· Simple Random Sampling:

Each element has an equal probability of selection.

Example: Randomly picking 100 students from a list of 1,000.

· Stratified Sampling:

Divides the population into distinct groups (strata) and samples proportionally from each.

Example: Dividing customers by age group and sampling from each group.

2. Non-Probability Sampling

Sampling is based on non-random selection, often relying on judgment or convenience.

· Convenience Sampling:

Selects data that is easiest to access.

Example: Surveying people at a mall entrance.

· Judgmental (Purposive) Sampling:

Selection based on expertise or criteria.

Example: Choosing experienced employees for feedback.
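As a hedged sketch with pandas (assumed available), the following draws a simple random sample and a stratified sample from a made-up customer table; the column names are illustrative.

```python
# Hedged sketch: simple random sampling and stratified sampling with pandas.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "age_group": ["18-25", "18-25", "26-40", "26-40", "26-40",
                  "26-40", "41-60", "41-60", "41-60", "60+"],
})

# Simple random sampling: every row has an equal chance of selection.
random_sample = customers.sample(n=4, random_state=1)

# Stratified sampling: draw 50% from each age group (proportional allocation).
stratified_sample = (customers.groupby("age_group", group_keys=False)
                              .apply(lambda g: g.sample(frac=0.5, random_state=1)))

print(random_sample)
print(stratified_sample)
```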

Data Cleaning
Data cleaning, also called data cleansing, is the process of identifying and correcting
(or removing) inaccurate, incomplete, or irrelevant data in a dataset. It ensures that the
data is accurate, consistent, and ready for analysis.

Importance of Data Cleaning -


· Improves Data Quality: Ensures accuracy and reliability for better decision-making.
· Enhances Model Performance: Clean data leads to better outcomes in machine
learning models.

· Reduces Errors: Prevents misleading conclusions and wrong interpretations.

Steps in Data Cleaning -


1) Data Inspection and Understanding:

Objective: Understand the dataset's structure and identify potential issues such as
missing values, duplicates, or inconsistencies.

Actions:

Examine Summary Statistics: Look at mean, median, mode, min, max, and standard
deviation for numerical data.

Visual Inspection: Plot histograms, box plots, and scatter plots to spot outliers or
unusual patterns.

Check Data Types: Ensure that each column's data type (integer, string, date, etc.) is
appropriate.

2)Handling Missing Values

3)Removing Duplicates :

Objective: Eliminate duplicate records to ensure that analysis is based on unique data
points.

Actions:

Identify Duplicates: Use tools or scripts to find and highlight duplicate rows based on
key columns (e.g., IDs, emails).

Remove or Aggregate Duplicates: Remove duplicates entirely or aggregate them (e.g., sum sales values if the same customer made multiple purchases).
4)Correcting Inconsistent Data

5)Dealing with Noisy Data

6)Data Transformation

7)Data Integration

8)Data Reduction

9)Validation and Verification:

Objective: Ensure that the cleaned data is accurate, consistent, and ready for analysis.

Actions:

Check Data Integrity: Verify that all data transformations have been applied correctly.

Visualize Data: Plot data distributions to ensure the cleaning process hasn’t distorted
the data.

Revalidate Key Metrics: Ensure that important business metrics and relationships are
still intact after cleaning.

10)Final Documentation and Review:

Objective: Document the steps taken during data cleaning to ensure transparency and
reproducibility.

Actions:

Log Cleaning Steps: Keep a record of all changes made, including imputation methods,
transformations, and removed data points.

Review with Stakeholders: Ensure that the cleaned data meets the needs of the
analysis or business requirements.
These steps will help ensure that your data is accurate, consistent, and ready for
analysis or modeling. Proper data cleaning is essential for achieving reliable and valid
results.
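A minimal sketch of a few of these steps with pandas (assumed available): inspecting summary statistics and data types, then removing duplicate rows keyed on a hypothetical customer_id column and re-validating.

```python
# Hedged sketch of basic data-cleaning steps with pandas; column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25, 31, 31, 47, 52],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com", None],
})

# 1) Inspection: summary statistics and data types
print(df.describe(include="all"))
print(df.dtypes)

# 3) Removing duplicates based on the key column
df = df.drop_duplicates(subset="customer_id", keep="first")

# 9) Validation: confirm the key is now unique
assert df["customer_id"].is_unique
print(df)
```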

Missing Values
Handling missing values is an important step in data cleaning, as missing data can
significantly impact the quality and reliability of analysis or machine learning models.

Types of Missing Data:

· Missing Completely at Random (MCAR): Data is missing for reasons unrelated to other variables.

· Missing at Random (MAR): The probability of a value being missing is related to observed data but not to the missing value itself.

· Missing Not at Random (MNAR): The missingness is related to the unobserved value itself (e.g., people with high incomes may be less likely to report their income).

Approaches to Handle Missing Values

There are several strategies to handle missing values, and the choice of method
depends on the amount of missing data, the nature of the dataset, and the importance
of the missing information.

A. Remove Missing Data

Remove Rows with Missing Values: If only a few rows have missing values, you can
simply drop those rows.
When to use: If the missing data is not critical and the number of affected rows is
small.

Example: If 1% of customer records have missing email addresses, and the email is not
essential for the analysis, you might remove these rows.

Remove Columns with Missing Values: If a column has a large portion of missing data
(e.g., more than 50% missing), it may be best to remove the entire column.

When to use: If the missing column does not contribute much to the analysis or model
and removing it won't lose valuable information.

B. Imputation (Filling Missing Values)

Imputation refers to replacing missing values with substituted values based on other
data points. This can help preserve the integrity of the dataset without losing
information.

Mean, Median, or Mode Imputation: Fill missing numerical values with the mean,
median, or mode of that variable.

When to use: When the missing data is small and does not follow a pattern.

Example: If a column of ages has missing values, you can replace the missing values
with the average age of all the other customers.
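A hedged sketch of the two approaches above with pandas (assumed available): dropping rows with missing values, and mean imputation on a made-up age column.

```python
# Hedged sketch: removing vs. imputing missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 47, np.nan, 52],
                   "email": ["a@x.com", "b@x.com", None, "c@x.com", "d@x.com", "e@x.com"]})

# A. Remove rows that have any missing value
dropped = df.dropna()

# B. Mean imputation: fill missing ages with the mean of the observed ages
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

print(dropped)
print(imputed)
```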

Noisy Data
Noisy data refers to data that has random errors, fluctuations, or outliers that make it
unreliable or inconsistent. Noise in data can distort the analysis or the predictive
model's performance, so it's crucial to clean or reduce noise during the data
pre-processing stage.
Noisy data contains random variations that do not represent the actual patterns or
trends in the dataset. This noise can arise from various sources, such as:

Measurement Errors: Errors during data collection or recording.

Data Entry Mistakes: Typing or transcription errors when inputting data.

Environmental Factors: For example, fluctuating conditions that affect sensor readings.

Random Variations: Unexpected fluctuations or outliers that don't reflect real-world values.

Examples of Noisy Data:

In a dataset tracking the height of children, one child's height might be recorded as 200
cm when it should be closer to 100 cm.

In financial data, a sudden fluctuation in stock prices due to a typo could appear as
noise.

Types of Noisy Data

1.Random Noise: This occurs due to random errors, which are unpredictable and do not
follow a pattern.

Example: A sensor recording a random fluctuation in temperature readings.

2.Outliers: Extreme values that deviate significantly from the rest of the data, often
representing noise.
Example: A person’s age being recorded as 120 years old when the average age is
around 30.

3.Duplicate Records: The same data point being entered multiple times in a dataset,
often leading to noise and confusion.

Example: The same transaction being recorded twice in a sales database.

Steps to Handle Noisy Data

Handling noisy data involves identifying and reducing the noise in the dataset through
various techniques. The goal is to ensure the data used for analysis is as accurate and
relevant as possible.

A. Detecting Noisy Data

The first step in dealing with noisy data is detecting the noise. This can be done
through:

1.Statistical Methods: Using measures like mean, median, variance, and standard
deviation to identify values that deviate significantly from the expected range.

2.Visual Inspection: Creating visualizations (like histograms, scatter plots, or box plots)
to spot outliers and irregular data points.

B. Removing or Correcting Outliers

Outliers are data points that are significantly different from the rest of the data and
often represent noise. There are several methods to handle outliers:

Truncation: Remove data points that exceed a certain threshold.

Winsorization: Replace extreme values with the nearest valid data points within an
acceptable range.
Z-Score Method: Data points with z-scores greater than a threshold (e.g., 3 or -3) are
considered outliers and can be removed or adjusted.

C. Imputation of Missing or Noisy Values

In cases where missing or noisy data is identified, imputing the values based on other
available information can be an effective approach. For instance:

Mean/Median Imputation: Replacing missing or noisy values with the mean or median
of the column.

Predictive Imputation: Using models (like k-nearest neighbors) to predict the correct
value based on similar records.
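A hedged sketch of the Z-score method with NumPy (assumed available), flagging values whose z-score exceeds a chosen threshold in a made-up column of heights.

```python
# Hedged sketch: detecting outliers with the Z-score method using NumPy.
import numpy as np

heights = np.array([98, 102, 100, 105, 97, 101, 200])  # 200 looks like a recording error

z_scores = (heights - heights.mean()) / heights.std()

threshold = 2.0  # the cut-off is a judgment call (commonly 2 or 3)
outliers = heights[np.abs(z_scores) > threshold]
cleaned = heights[np.abs(z_scores) <= threshold]

print("outliers:", outliers)
print("cleaned data:", cleaned)
```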

Binning
Data binning, also called discrete binning or bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. It is a form of quantization. The original data values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin. This has a smoothing effect on the input data and may also reduce the chance of overfitting in the case of small datasets.

Statistical data binning is a way to group numbers of more or less continuous values
into a smaller number of "bins". It can also be used in multivariate statistics, binning in
several dimensions simultaneously. For example, if you have data about a group of
people, you might want to arrange their ages into a smaller number of age intervals,
such as grouping every five years together.

Binning can dramatically improve resource utilization and model build response time
without significant loss in model quality. Binning can improve model quality by
strengthening the relationship between attributes.
Supervised binning is a form of intelligent binning in which important characteristics
of the data are used to determine the bin boundaries. In supervised binning, the bin
boundaries are identified by a single-predictor decision tree that considers the joint
distribution with the target. Supervised binning can be used for both numerical and
categorical attributes.

Binning Process:

1. Sort the Data: Arrange the data values in increasing or decreasing order.

2. Define the Bins: Specify the number of bins or the boundaries for each bin. The bins
can have equal width or unequal width depending on the method.

3. Assign Data to Bins: Place the data points into their corresponding bins based on the
bin boundaries.

4. Label the Bins: Each bin is typically labeled with a representative value (e.g., the
mean or median of the bin) or simply by the bin interval.

Types of Binning

1. Equal Width Binning:

In this method, the range of the data is divided into intervals of equal width. Each bin
will have the same range of values.

Example: If you have a data range from 0 to 100, and you choose 5 bins, each bin will
have an interval of 20 (i.e., 0-20, 21-40, 41-60, 61-80, and 81-100).

Advantages:

Simple to implement and interpret.

Useful when the data is uniformly distributed.

Disadvantages:

Not ideal for skewed data, as some bins may contain too many data points, while others may have too few.

In equal width binning, the bin width is w = (max − min) / (number of bins), and the bin boundaries are [min, min + w), [min + w, min + 2w), ..., [min + (n − 1)w, max].

For example, equal-width binning with 3 bins (w = (215 − 5) / 3 = 70):

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13, 15, 35, 50, 55, 72]

[92]

[204, 215]

2. Equal Frequency Binning (Quantile Binning):

This method divides the data into bins so that each bin contains approximately the
same number of data points. It is also known as quantile binning.

Example: For 100 data points and 5 bins, each bin will contain 20 data points.

Advantages:

Ensures balanced bins, which can help improve the performance of algorithms.

Suitable for datasets with skewed distributions.

Disadvantages:
Unequal bin widths may result, which could make interpretation harder.

Extreme values may be grouped into a single bin, distorting analysis.

Equal Frequency Binning: Bins have an equal frequency.

For example, equal-frequency binning with 3 bins (4 values each):

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13]

[15, 35, 50, 55]

[72, 92, 204, 215]
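A hedged sketch reproducing these two binnings with pandas (assumed available) on the same input list:

```python
# Hedged sketch: equal-width and equal-frequency binning with pandas on the example data.
import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

equal_width = pd.cut(values, bins=3)  # 3 bins of width (215 - 5) / 3 = 70
equal_freq = pd.qcut(values, q=3)     # 3 bins with 4 values each

print(values.groupby(equal_width).apply(list))
print(values.groupby(equal_freq).apply(list))
```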

Applications of Binning:

Data Smoothing: Binning can smooth out the impact of noisy data by grouping similar values together, helping to reduce the effect of outliers.

Data Transformation for Algorithms: Many machine learning algorithms, such as decision trees, may work better when continuous data is transformed into categorical data through binning.

Clustering
The process of making a group of abstract objects into classes of similar objects is
known as clustering.

Points to Remember:

● A group of similar data objects is treated as one cluster.

● In cluster analysis, the first step is to partition the set of data into groups on the basis of data similarity; the groups are then assigned their respective labels.

● The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate different groups.

Applications of cluster analysis :


● It is widely used in many applications such as image processing, data analysis, and pattern recognition.

● It helps marketers find distinct groups in their customer base and characterize those groups by their purchasing patterns.

● It can be used in the field of biology for deriving animal and plant taxonomies and identifying genes with the same capabilities.

● It also helps in information discovery by classifying documents on the web.

Requirements of clustering in data mining:

The following are some points why clustering is important in data mining.

● Scalability – We require highly scalable clustering algorithms to work with large databases.

● Ability to deal with different kinds of attributes – Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.

● Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bound to distance measures.

● Interpretability – The results should be comprehensive, usable, and interpretable.

● High dimensionality – The algorithm should be able to handle high-dimensional data, not only low-dimensional data.

Clustering Methods:

It can be classified based on the following categories.

1. Model-Based Method

2. Hierarchical Method

3. Constraint-Based Method

4. Grid-Based Method

5. Partitioning Method

6. Density-Based Method

1. Model-Based Method

Model-based clustering assumes that the data is generated by a mixture of underlying probability distributions, typically Gaussian distributions. The goal is to estimate the parameters of these distributions and to assign each data point to a cluster based on the estimated model.

● Example: Gaussian Mixture Models (GMMs)

● Use Case: Situations where data naturally fits a probabilistic model, such as identifying different types of galaxy clusters in astronomy.

2. Hierarchical Method

Hierarchical clustering builds a hierarchy of clusters either by merging smaller clusters into larger ones (agglomerative) or by splitting larger clusters into smaller ones (divisive). It results in a tree-like structure called a dendrogram, which can be cut at different levels to obtain different cluster formations.

● Example: Agglomerative Hierarchical Clustering

● Use Case: Useful for datasets where a nested grouping is beneficial, such as a taxonomy of species or hierarchical document clustering.

3. Constraint-Based Method

Constraint-based clustering incorporates user-defined constraints into the clustering process. These constraints can be of various types, such as must-link (forcing two points to be in the same cluster) or cannot-link (forcing two points to be in different clusters).

● Example: Semi-Supervised Clustering

● Use Case: When domain knowledge can be translated into constraints, such as grouping patients with similar medical histories while ensuring that certain treatments are not grouped together.

4. Grid-Based Method
Grid-based clustering quantizes the data space into a finite number of cells that form a grid structure. Clustering is then performed on the grid cells rather than the actual data points, which can significantly speed up the clustering process, especially for large datasets.

● Example: STING (Statistical Information Grid)

● Use Case: Efficient for large spatial datasets, such as geographical information systems (GIS) or image segmentation.

5. Partitioning Method

Partitioning methods divide the dataset into a set of k clusters, where each cluster is represented by a centroid. The most common algorithm is k-means, which iteratively updates the cluster centroids until convergence.

● Example: k-means Clustering

● Use Case: Widely used in market segmentation, document clustering, and image compression.
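A minimal sketch of the partitioning approach, assuming scikit-learn: it clusters a few made-up 2-D points with k-means and prints the cluster labels and centroids.

```python
# Hedged sketch: k-means clustering with scikit-learn on made-up 2-D points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],    # one dense region
                   [8, 8], [9, 9], [8, 10]])  # another dense region

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("cluster labels:", labels)
print("centroids:", kmeans.cluster_centers_)
```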

6. Density-Based Method

Density-based clustering forms clusters based on areas of high density of data points, separated by areas of low density. It can identify clusters of arbitrary shape and is robust to noise.

● Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

● Use Case: Effective for spatial data, such as identifying geographic regions with high earthquake activity or finding clusters in large, noisy databases.

Regression

Regression refers to a data mining technique that is used to predict numeric values in a given data set. For example, regression might be used to predict product or service costs, or other variables. It is also used in various industries for analyzing business and marketing behavior, trend analysis, and financial forecasting. In this section, we will understand the concept of regression and the types of regression, with examples.

What is regression?

Regression refers to a type of supervised machine learning technique that is used to predict any continuous-valued attribute. Regression helps a business organization analyze the relationships between the target variable and the predictor variables. It is a significant tool for analyzing data used in financial forecasting and time series modeling. Regression involves fitting a straight line or a curve to numerous data points in such a way that the distance between the data points and the curve is minimized. The most popular types of regression are linear and logistic regression; many other types of regression can be used depending on how they perform on an individual data set.

Regression can predict dependent variables expressed in terms of independent variables when the trend is available over a finite period. Regression provides a good way to predict variables, but there are certain restrictions and assumptions, such as the independence of the variables and inherent normal distributions of the variables. For example, two variables A and B with a bivariate joint distribution might appear independent but actually be correlated; in that case, the marginal distributions of A and B need to be derived and used. Before applying regression analysis, the data needs to be studied carefully and certain preliminary tests performed to ensure that regression is applicable. Non-parametric tests are available for such cases.

Types of Regression

Regression is divided into five different types

1. Linear Regression

2. Logistic Regression

3. Lasso Regression

4. Ridge Regression

5. Polynomial Regression

Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The equation of linear regression is

Y = a + b*X + e

where:

a represents the intercept,

b represents the slope of the regression line,

e represents the error,

X and Y represent the predictor and target variables, respectively.

If X is made up of more than one variable, it is termed multiple linear regression.

In linear regression, the best-fit line is obtained using the least squares method, which minimizes the total sum of the squared deviations from each data point to the regression line. Because all deviations are squared, the positive and negative deviations do not cancel out.
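A hedged sketch, assuming NumPy, that fits Y = a + b*X by least squares on made-up points and reports the intercept a and slope b from the equation above.

```python
# Hedged sketch: least-squares fit of Y = a + b*X with NumPy on made-up data.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2.1, 4.3, 5.9, 8.2, 10.1])

b, a = np.polyfit(X, Y, deg=1)  # polyfit returns coefficients from the highest degree down

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print("prediction at X = 6:", a + b * 6)
```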

Polynomial Regression

If the power of the independent variable is more than 1 in the regression equation, it is termed a polynomial equation. With the help of the example given below, we will understand the concept of polynomial regression.

Y = a + b*X^2

In polynomial regression, the best-fit line is not a straight line as in linear regression; instead, it is a curve fitted to all the data points.

Polynomial regression can lead to overfitting if you try to minimize errors by making the curve more and more complex. Therefore, always try to fit a curve that generalizes well to the problem.
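A hedged sketch of polynomial regression, assuming scikit-learn: PolynomialFeatures expands X into its powers, and a linear model is then fitted to the expanded features; the data values are made up.

```python
# Hedged sketch: degree-2 polynomial regression with scikit-learn on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])  # roughly quadratic in X

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("prediction at X = 6:", model.predict([[6]])[0])
```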

Logistic Regression

When the dependent variable is binary in nature (i.e., 0 or 1, true or false, success or failure), the logistic regression technique is used. Here, the predicted value (Y) ranges from 0 to 1, and it is primarily used for classification-based problems. Unlike linear regression, it does not require the independent and dependent variables to have a linear relationship.

Ridge Regression

Ridge regression refers to a process that is used to analyze regression data that suffer from multicollinearity. Multicollinearity is the existence of a linear correlation between two independent variables.

Ridge regression is used when the least squares estimates are unbiased but have high variance, so they can be far from the true values. By adding a degree of bias to the estimated regression values, ridge regression reduces these errors.

Lasso Regression

The term LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is a linear type of regression that utilizes shrinkage: the estimates are shrunk towards a central point, such as the mean. The lasso procedure is best suited for simple, sparse models with fewer parameters than other regressions, and it is well suited for models that suffer from multicollinearity.

Application of Regression

Regression is a very popular technique with wide applications in businesses and industries. The regression procedure involves a predictor variable and a response variable. The major applications of regression are given below.

○ Environmental modeling

○ Analyzing business and marketing behavior

○ Financial prediction and forecasting

○ Analyzing new trends and patterns

Difference between Regression and Classification in data mining

1. Regression refers to a type of supervised machine learning technique that is used to predict a continuous-valued attribute, whereas classification refers to the process of assigning predefined class labels to instances based on their attributes.

2. In regression, the nature of the predicted data is ordered; in classification, the nature of the predicted data is unordered.

3. Regression can be further divided into linear and non-linear regression, whereas classification is divided into two categories: binary classifiers and multi-class classifiers.

4. In the regression process, the results are typically evaluated using the root mean square error; in the classification process, the results are evaluated by measuring the efficiency (accuracy).

5. Examples of regression methods are regression trees and linear regression; an example of a classification method is the decision tree.


Inconsistent Data
Inconsistent data refers to situations where different records or fields in a dataset
provide conflicting or contradictory information. These inconsistencies can result from
various factors, such as human errors, data entry mistakes, or differences in formatting.
Identifying and correcting inconsistent data is an essential part of data cleaning, as it
ensures that the data is accurate, reliable, and usable for analysis or modeling.

Inconsistent data occurs when the same variable or attribute is represented in different
ways across different records, or when multiple records that should contain the same
value contain conflicting information. Common examples of inconsistent data include:

· Different formats used for the same data (e.g., date format inconsistencies,
inconsistent text capitalization).

· Conflicting entries for the same attribute (e.g., two records stating different ages for
the same person).

· Mismatched units of measurement (e.g., heights recorded in centimeters in one record and in inches in another).

Types of Inconsistent Data

1. Format Inconsistencies: When the same type of data is represented differently in different places.

Example: Dates may be written as “DD/MM/YYYY” in one record and “YYYY-MM-DD” in another.
2. Unit Inconsistencies: Different units of measurement used for the same attribute.

Example: Heights may be recorded in feet in some records and in meters in others.

3. Value Inconsistencies: When the same data point is recorded differently in various
records.

Example: A person’s age might be recorded as “25” in one record and “twenty-five” in
another.

4. Contradictory Data: When data points contradict each other for the same entity.

Example: One record says a person is “married,” while another record says they are
“single.”

Steps to Correct Inconsistent Data

The process of correcting inconsistent data involves identifying, standardizing, and reconciling conflicts. Here are the key steps involved:

A. Identifying Inconsistent Data

Manual Inspection: Visually inspect the dataset to spot any obvious inconsistencies in format, spelling, or data entry.

Automated Detection: Use algorithms and scripts to detect anomalies and inconsistencies, such as different date formats, inconsistent spellings, or numeric inconsistencies.

B. Standardizing Formats

Standardize Text Data: Ensure consistency in text values, such as names, addresses,
and categorical variables. This can involve:

Capitalization Consistency: Convert all text data to a uniform case (e.g., all uppercase
or lowercase).
Abbreviations: Expand abbreviations to full forms (e.g., "NY" to "New York") or vice
versa, depending on the preferred standard.

Spelling Correction: Use spell checkers or predefined lists to correct misspelled entries.

Example:

Incorrect: "new york", "NY", "New York"

Correct: "New York" (uniform standard)

C. Resolving Conflicting Data

Data Verification: Cross-check records to verify which one is correct. This may require
consulting external sources or subject-matter experts.

Consolidation: If two conflicting records are found, decide which value is correct, or
combine information from both (e.g., merging the correct portions from each record).

Example:

Conflict: One record says "25 years old," while another says "26 years old."

Solution: Use the most reliable or recent data, or apply a rule to reconcile the
discrepancy, such as selecting the most frequent value.
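A hedged sketch with pandas (assumed available) that standardizes the city example above by trimming whitespace, normalizing case, and expanding the abbreviation "NY".

```python
# Hedged sketch: standardizing inconsistent text values with pandas.
import pandas as pd

cities = pd.Series(["new york", "NY", "New York", " new York "])

standardized = (cities.str.strip()                  # remove stray whitespace
                      .str.title()                  # uniform capitalization
                      .replace({"Ny": "New York"})) # expand the abbreviation

print(standardized.unique())  # -> ['New York']
```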

Data Integration
Data Integration refers to the process of combining data from multiple sources into a
unified view to provide a consolidated and consistent dataset for analysis and
decision-making. It is essential in data mining, business intelligence, and other
data-driven applications, as it allows organizations to gather insights from various
sources of data in a comprehensive manner.

Importance of Data Integration:


Consistency: Data integration ensures that data from different sources is aligned and
reconciled, providing a consistent dataset for analysis.

Holistic View: It allows organizations to combine data from various systems (e.g.,
customer databases, sales records, external data) to form a complete picture of the
information.

Improved Decision Making: By integrating data from multiple sources, businesses can
derive more accurate insights, leading to better decision-making.

Enhanced Efficiency: Automation of data integration processes saves time and reduces
the likelihood of errors associated with manual data processing.

Steps Involved in Data Integration:

1.Data Extraction:

The first step in the integration process is extracting data from different sources. These
sources could include databases, spreadsheets, APIs, flat files (CSV, XML), web
scraping, or even cloud-based systems.

Example: Extracting customer data from a CRM system, sales data from an ERP
system, and financial data from external market feeds.

2.Data Transformation:

This step involves transforming the extracted data into a consistent format suitable for
integration. It includes:

Data Cleaning: Removing errors, inconsistencies, and missing values.

Data Standardization: Ensuring that different datasets follow the same format (e.g.,
dates, units of measurement).

Data Aggregation: Summarizing data at a higher level, such as calculating averages, totals, or counts.

Data Normalization: Scaling values to a common range, which is particularly important when integrating numerical data from different sources.
Example: Converting a “Date of Birth” column in one dataset from “MM-DD-YYYY” to
“YYYY-MM-DD” to standardize the format.

3. Data Matching and Linking:

In this step, the data from different sources is matched and linked based on a common
identifier or key. For example, matching customer names or IDs from different systems
to ensure data corresponds to the same entity.

Example: Linking customer records from a sales database with customer feedback data
based on a unique customer ID.

4.Data Loading:

The final step is loading the integrated data into a target storage system, such as a
data warehouse, database, or cloud-based platform. This process involves populating
the target database with the newly integrated dataset.

Example: After transforming the data, loading it into a central data warehouse to
create a unified view for reporting and analytics.

Techniques Used in Data Integration:

1.ETL (Extract, Transform, Load):

ETL is a common process used in data integration, especially in data warehousing:

Extract: Data is extracted from source systems.

Transform: The extracted data is cleaned, transformed, and enriched.

Load: The transformed data is loaded into a target data warehouse or database.

Example: Extracting transaction data from an online store, transforming it into a common format, and loading it into a data warehouse for analysis.
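A hedged sketch of these ETL steps with pandas and SQLite (both assumed available); the file name "transactions.csv", its columns, and the table name are hypothetical.

```python
# Hedged sketch of an ETL flow: extract from a CSV, transform, load into SQLite.
# The file "transactions.csv" and its columns are hypothetical.
import sqlite3
import pandas as pd

# Extract
df = pd.read_csv("transactions.csv")

# Transform: standardize the date format and aggregate daily totals
df["order_date"] = pd.to_datetime(df["order_date"]).dt.strftime("%Y-%m-%d")
daily_totals = df.groupby("order_date", as_index=False)["amount"].sum()

# Load the result into a target database table
with sqlite3.connect("warehouse.db") as conn:
    daily_totals.to_sql("daily_sales", conn, if_exists="replace", index=False)
```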
2.ELT (Extract, Load, Transform):

In ELT, the data is first extracted and loaded into the target system, and then
transformation is done within the target system itself.

Example: Data from various sources (sales, marketing) is loaded into a data lake, and
then complex transformations are performed directly in the lake to prepare the data for
analysis.

Challenges in Data Integration:

1.Data Heterogeneity:

Data from different sources may vary in format, structure, and quality. For example,
data might come from different database systems, file formats (CSV, JSON, XML), or
even different business domains.

Example: A company may have customer data in a relational database and financial
data in a spreadsheet format.

2.Data Redundancy:

When integrating data, duplicate records may arise, leading to redundant or inconsistent information across systems.

Example: A customer might have multiple records in different systems, causing the
integrated data to have duplicated information.

Data Transformation
Data transformation refers to the process of converting data from one format,
structure, or scale to another. This transformation can take many forms, depending on
the needs of the analysis or model. It can involve scaling values, converting categorical
data to numerical values, encoding information, and other techniques that help the
data align with the intended analysis process.

Purpose of Data Transformation

The primary reasons for transforming data include:

Enhance Data Quality: Raw data often needs to be refined, standardized, or normalized to ensure consistency.

Suitability for Analysis: Some algorithms or models require data to be in a specific format or structure (e.g., numerical values for machine learning models).

Improve Accuracy: Proper transformation can improve the accuracy and efficiency of statistical models or machine learning algorithms by reducing noise and simplifying patterns in the data.

Compatibility: Certain types of data, such as categorical variables, may need to be converted into a format that can be processed by algorithms that only accept numeric input.

Types of Data Transformation


There are several types of transformations commonly applied to data. The following
are some of the most frequently used techniques:

A. Normalization and Standardization

These techniques scale the data values into a specific range or distribution, making it
easier to work with, especially when working with machine learning algorithms that
are sensitive to the scale of data.

Normalization: This process rescales the data into a range, usually between 0 and 1, so that all features have comparable weight and magnitude.

Standardization: This process rescales the data so that it has a mean of 0 and a standard deviation of 1.

When to Use:

Normalization is useful when the data has different units or when using algorithms like k-means clustering or neural networks.

Standardization is beneficial when data follows a normal distribution, particularly for algorithms like linear regression or support vector machines.
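A hedged sketch with scikit-learn (assumed available) showing both rescalings on a made-up column of incomes.

```python
# Hedged sketch: normalization (min-max) vs. standardization (z-score) with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

incomes = np.array([[20000], [35000], [50000], [80000], [120000]])

normalized = MinMaxScaler().fit_transform(incomes)      # values rescaled to [0, 1]
standardized = StandardScaler().fit_transform(incomes)  # mean 0, standard deviation 1

print(normalized.ravel())
print(standardized.ravel())
```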

B. Encoding Categorical Data

Many machine learning models and algorithms require data in a numerical format.
Categorical data, which involves values representing categories or classes (e.g.,
gender, color), must be converted into numerical data before it can be processed.

One-Hot Encoding: This technique creates a new binary column for each possible
category and assigns a 1 or 0 based on whether the data belongs to that category.

Example: If “Color” has categories “Red”, “Green”, and “Blue”, the encoding would
create three columns: “Color_Red”, “Color_Green”, and “Color_Blue”. A record with
“Green” would be represented as (0, 1, 0).

Label Encoding: Each category is assigned a unique integer. For example, “Red” = 0,
“Green” = 1, “Blue” = 2.

When to Use:

One-Hot Encoding is useful when there is no inherent ordering in the categories, and
you want to avoid giving arbitrary numerical weight to categorical variables.

Label Encoding is more efficient when there is a natural ordinal relationship between
the categories (e.g., “Low”, “Medium”, “High”).
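A hedged sketch with pandas (assumed available) showing one-hot encoding of the Color example and label encoding of an ordinal size column.

```python
# Hedged sketch: one-hot encoding vs. label encoding with pandas.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"],
                   "Size": ["Low", "High", "Medium", "Low"]})

# One-hot encoding: one binary column per colour (no implied ordering)
one_hot = pd.get_dummies(df["Color"], prefix="Color")

# Label encoding: map an ordinal category to integers that respect its order
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["Size_encoded"] = df["Size"].map(size_order)

print(one_hot)
print(df)
```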
C. Aggregation

D. Binning: Binning is a data transformation technique used to group a set of continuous data values into discrete intervals or bins. The goal of binning is to simplify the data by converting numerical data into categories, thereby reducing noise and improving the efficiency of data analysis, especially for machine learning algorithms. This process can also help improve data visualization and make patterns in the data more apparent.

Data Reduction
Data reduction refers to the process of reducing the volume of data while maintaining
its integrity and essential information. It is particularly useful in handling large
datasets, as it reduces computational complexity, storage requirements, and enhances
the efficiency of analysis or machine learning algorithms. Data reduction helps in
making the data more manageable without compromising its value.

Reduction Types:

1. Dimensionality Reduction

2.Data Compression

3.Aggregation

1. Dimensionality Reduction

Dimensionality reduction involves reducing the number of input variables (features) in a dataset while retaining most of the important information. High-dimensional datasets often contain redundant features, making it difficult to analyze and interpret the data effectively. Dimensionality reduction helps by projecting the data into a lower-dimensional space, preserving essential patterns and relationships.

Key Techniques for Dimensionality Reduction:

1.Principal Component Analysis (PCA):

PCA is a statistical technique that transforms the original features into a set of new
features, called principal components, that are uncorrelated and ordered in terms of
variance. The goal is to reduce the number of dimensions by selecting the top principal
components that explain the most variance in the data.

Example: In a dataset with 10 features, PCA may reduce it to 2 or 3 principal components, capturing the maximum variance and reducing computational complexity.
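A hedged sketch with scikit-learn (assumed available) that reduces a made-up 10-feature dataset to 2 principal components and reports how much variance they capture.

```python
# Hedged sketch: PCA reducing 10 features to 2 principal components with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # made-up dataset: 100 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)  # (100, 2)
print("variance explained:", pca.explained_variance_ratio_)
```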

2.Linear Discriminant Analysis (LDA):

LDA is a supervised dimensionality reduction technique that focuses on maximizing the separability of classes. It finds the linear combinations of features that best separate two or more classes in the dataset.

Example: LDA is often used in classification tasks where the goal is to reduce dimensionality while maintaining class separability.

3.t-Distributed Stochastic Neighbor Embedding (t-SNE):

t-SNE is a non-linear technique for dimensionality reduction that is particularly useful for visualizing high-dimensional data in 2D or 3D spaces. It aims to preserve the local structure of the data while reducing dimensions.

Example: It is commonly used for visualizing clusters in high-dimensional datasets.

When to Use:

Dimensionality reduction is most useful when working with high-dimensional data, such as text data, image data, or genomic data, where the number of features is very large compared to the number of observations.

2.Data Compression

Data compression reduces the size of the dataset by encoding the information in a
more compact form without losing the important details. This is particularly important
for storage and transmission efficiency. Data compression is generally used when the
dataset contains redundancies that can be exploited to minimize the storage space
needed.

Types of Data Compression:

1.Lossless Compression:

In lossless compression, the original data can be fully restored after compression. No
information is lost, and the data retains its original quality.

Example: Algorithms like Huffman coding and Run-Length Encoding (RLE) are used in
lossless compression.

2.Lossy Compression:

In lossy compression, some information is discarded to achieve higher compression rates. It is suitable when a slight loss of data quality is acceptable, such as in images, videos, or audio files.

Example: JPEG for images and MP3 for audio files are popular lossy compression
techniques.
When to Use:

Data compression is helpful when the dataset contains a lot of repetitive or redundant
information, such as in image, audio, and text data. It can also be used to reduce the
size of large files for efficient storage or transmission.

3.Aggregation

Aggregation involves combining multiple data points into a single representative value,
typically through summarizing or averaging. This technique is useful when you need to
simplify data and focus on key statistics rather than dealing with every individual data
point.

Types of Aggregation:

1.Sum, Average, Count:

You can aggregate data by taking the sum, average, or count of certain features. For
example, in sales data, you might aggregate the total sales per store, the average sales
per month, or the count of products sold per day.

2.Grouping by Categories:

Aggregation can also involve grouping data by certain categories (e.g., grouping
transactions by customer, region, or product type) and then calculating aggregate
measures like the mean or sum within each group.

Example: Aggregating data by regions to get the total sales in each region or
aggregating data by product categories to find the average price.

3.Statistical Aggregation:

In this type of aggregation, you can apply statistical measures like median, standard
deviation, or mode to aggregate data in a way that summarizes the overall distribution
and spread of data.
When to Use:

Aggregation is often used when dealing with large datasets or time-series data where
summarizing or generalizing the data is required to make it more digestible and
meaningful.

It is particularly useful for data visualization, reporting, and when analyzing data at
different levels of granularity (e.g., monthly vs. yearly sales data).
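A hedged sketch with pandas (assumed available) that aggregates made-up sales data by region, computing the total and average sales per region.

```python
# Hedged sketch: aggregating sales data by region with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "amount": [100, 150, 200, 50, 300],
})

summary = sales.groupby("region")["amount"].agg(total="sum", average="mean")
print(summary)
```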
