Unit-2 New
Contents
• Evaluation of classification methods – Confusion matrix, Student's t-test and ROC curve
• Exploratory Data Analysis – Basic tools (plots, graphs and summary statistics) of EDA
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data cleaning
• It is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set.
• It involves identifying data errors and then changing, updating or removing data to correct them.
• Data cleansing improves data quality and helps provide more accurate, consistent and reliable
information for decision-making in an organization
Types of data cleaning
• Missing Values − Missing values are filled with appropriate values. Common approaches include the following:
o Ignore the tuple, usually when it is missing values for several attributes.
o Fill in the missing value manually.
o Use a global constant (e.g., "Unknown") to fill in the missing value.
o Use the attribute mean or median to fill in the missing value.
o Use the most probable value, for example one predicted by regression or a decision tree.
• Inconsistent data − Inconsistencies can arise in recorded transactions, during data entry, or from integrating data from multiple databases. Some redundancies can be detected by correlation analysis, and accurate, careful integration of the data from the various sources can reduce or avoid redundancy.
Types of data cleaning
• Noisy data − Noise is a random error or variance in a measured variable. The following smoothing methods can be used to handle noise:
• Binning − Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing (see the sketch after this list).
• Regression − Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
• Clustering − Clustering helps in identifying outliers. Similar values are organized into clusters, and values that fall outside the clusters are treated as outliers.
• Combined computer and human inspection − Outliers can also be identified through a combination of computer and human inspection: patterns flagged by the computer as having a high "surprise" value are output to a list, and a human then verifies whether each flagged pattern is genuinely useful or merely garbage.
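As an illustration of the binning idea, here is a minimal R sketch (not taken from the slides) that partitions a small sorted price vector into three equal-depth bins and smooths each value to its bin mean:

# Equal-depth (equal-frequency) binning with smoothing by bin means
price  <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))   # sorted data values
bins   <- split(price, rep(1:3, each = 3))            # 3 bins of 3 values each
smooth <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smooth   # 9 9 9 22 22 22 29 29 29: each value replaced by its bin mean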
Data Integration
Data Integration
• Data integration is the process of combining data from multiple sources and consolidating it into a
unified view.
• The goal of data integration in data mining is to provide a complete and accurate representation of
the data for further analysis. It involves extracting data from various sources, transforming it into a
common format, and loading it into a target system. Data integration can be challenging, especially
when dealing with large volumes of data, complex data structures, and different data formats.
• The sources to be integrated may include multiple databases, data cubes, or flat files. Data fusion merges data from these diverse sources to produce meaningful results; the consolidated result must be free of inconsistencies, contradictions, redundancies, and disparities.
Data Integration
• A data integration strategy is typically described using a triple (G, S, M), where G denotes the global schema, S denotes the schemas of the heterogeneous data sources, and M represents the mappings between queries over the source schemas and the global schema.
Let's consider a data integration scenario that aims to combine employee data from two different HR databases, database A and database B.
• The global schema (G) would define the unified view of employee data, including attributes like EmployeeID, Name, Department, and Salary.
• In the schemas of the heterogeneous sources, database A (S1) might have attributes like EmpID, FullName, Dept, and Pay, while database B's schema (S2) might have attributes like ID, EmployeeName, DepartmentName, and Compensation.
• The mappings (M) would then define how the attributes in S1 and S2 map to the attributes in G, allowing employee data from both systems to be integrated into the global schema.
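As a small illustration, here is a hedged R sketch (the rows are hypothetical; the column names follow the schemas above) of applying the mappings M to produce the unified view under G:

# Hypothetical source tables following schemas S1 (database A) and S2 (database B)
dbA <- data.frame(EmpID = 1:2, FullName = c("Ann", "Bob"),
                  Dept = c("HR", "IT"), Pay = c(50000, 60000))
dbB <- data.frame(ID = 3, EmployeeName = "Cara",
                  DepartmentName = "Sales", Compensation = 55000)

# Mappings M: rename each source's attributes to the global schema G
g_from_A <- setNames(dbA, c("EmployeeID", "Name", "Department", "Salary"))
g_from_B <- setNames(dbB, c("EmployeeID", "Name", "Department", "Salary"))

# Unified view of employee data under the global schema G
global_view <- rbind(g_from_A, g_from_B)
global_view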
Why is Data Integration Important?
• Provides a Unified View of Data - Data integration in data mining enables combining data from different
sources into a unified view. This allows for better decision-making by providing a complete and accurate data
representation.
• Increases Data Accuracy - Integrating data from multiple sources helps to identify and eliminate
inconsistencies, redundancies, and errors in the data. This improves data accuracy and reliability, making it
easier to draw accurate conclusions.
• Improves Efficiency - Data integration in data mining automates combining data from multiple sources,
reducing the time and effort required to access and analyze the data. This improves efficiency and reduces the
costs associated with data management.
• Facilitates Data Analysis - Integrating data from multiple sources provides a broader perspective of the
data. This enables more sophisticated and accurate data analysis, leading to more informed and effective
decision-making.
Approaches for Data Integration
Tight Coupling
• This approach involves the creation of a centralized database that integrates data from different sources. The
data is loaded into the centralized database using Extract, Transform, and Load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically stored in the central
database, and any updates or changes made to the data sources are immediately reflected in the central
database.
• Tight coupling is suitable for situations where real-time access to the data is required, and data
consistency is critical. However, this approach can be costly and complex, especially when dealing with large
volumes of data.
Approaches for Data Integration
Loose Coupling
• This approach involves the integration of data from different sources without physically storing it in a
centralized database.
• This approach provides an interface that takes a query from the user, transforms it into a format that the source databases can understand, and sends the query directly to the source databases to obtain the result.
• In this approach, data is accessed from the source systems as needed and combined in real-time to provide a
unified view. This approach uses middleware, such as application programming interfaces (APIs) and web
services, to connect the source systems and access the data.
• Loose coupling is suitable for situations where real-time access to the data is not critical, and the data sources
are highly distributed. This approach is more cost-effective and flexible than tight coupling but can be more
complex to set up and maintain.
Issues in Data Integration
• Data Quality: Data from different sources may have varying levels of accuracy, completeness, and
consistency, which can lead to data quality issues in the integrated data.
• Data Semantics: Data semantics refers to the meaning and interpretation of data. Integrating data from
different sources can be challenging because the same data element may have different meanings across
sources. Different sources may use different terms or definitions for the same data, making it difficult to
combine and understand the data.
• Data Heterogeneity: Data heterogeneity refers to the differences in data formats, structures, and storage
mechanisms across different data sources. Data integration can be challenging when dealing with
heterogeneous data sources, as it requires data transformation and mapping to make the data compatible with
the target data model.
• Data Privacy and Security: Data integration can increase the risk of data privacy and security breaches.
Integrating data from multiple sources can expose sensitive information and increase the risk of unauthorized
access or disclosure.
Issues in Data Integration
• Scalability: Scalability refers to the ability of the data integration solution to handle increasing volumes of
data and accommodate changes in data sources. Data integration solutions must be scalable to meet the
organization's evolving needs and ensure that the integrated data remains accurate and consistent.
• Integration with existing systems: Integrating new data sources with existing systems can be a complex
task, requiring significant effort and resources.
• Data Governance: Managing and maintaining the integration of data from multiple sources can be difficult,
especially when it comes to ensuring data accuracy, consistency, and timeliness.
• Performance: Integrating data from multiple sources can also affect the performance of the system.
• Complexity: The complexity of integrating data from multiple sources can be high, requiring specialized skills
and knowledge.
Major Issues to consider
Some major issues to consider during data integration
1. Entity Identification Problem
When data is unified from heterogeneous sources, how can we match up the real-world entities across the data? For example, suppose we have customer data from two different data sources: an entity from one data source has the attribute customer_id, while the entity from the other data source has customer_number. How would the data analyst or the system know that these two attributes refer to the same entity?
Here, schema integration can be achieved using the metadata of each attribute.
The metadata of an attribute includes its name, its meaning in the particular scenario, its data type, the range of values it can accept, and the rules it follows for null, blank, or zero values. Analyzing this metadata information will help prevent errors in schema integration.
Major Issues to consider
Structural integration can be achieved by ensuring that the functional dependencies and referential constraints of an attribute in the source system match those of the same attribute in the target system.
For example, suppose that in one system a discount is applied to an entire order, while in another system the discount is applied to every single item in the order. This difference must be caught before the data from these two sources is integrated into the target system.
Major Issues to consider
2. Redundancy and Correlation Analysis
Redundancy is one of the big issues during data integration. Redundant data is data that is unimportant or no longer needed. Redundancy can also arise from attributes that can be derived from another attribute in the data set.
For example, if one data set has the customer's age and the other data set has the customer's date of birth, then age is a redundant attribute because it can be derived from the date of birth.
Inconsistencies in attribute naming can also raise the level of redundancy. Redundancy can be discovered using correlation analysis: the attributes are analyzed to detect their interdependency, thereby detecting the correlation between them.
Major Issues to consider
3. Tuple Duplication
Along with redundancies, data integration also has to deal with duplicate tuples. Duplicate tuples may appear in the resulting data if a denormalized table has been used as a source for data integration.
Data Integration Techniques
There are various data integration techniques in data mining. Some of them are as follows:
1. Manual Integration
• This method avoids using automation during data integration. The data analyst collects, cleans, and integrates
the data to produce meaningful information.
• This strategy is suitable for a small organization with a limited data set, but it becomes tedious for large, complex, and recurring integrations.
• It is a time-consuming operation because the entire process must be done manually.
2. Middleware Integration
• Middleware software is employed to collect the data from different sources, normalize it, and store it in the resultant data set. This technique is adopted when an enterprise wants to integrate data from legacy systems into modern systems.
• Middleware software acts as an interpreter between legacy systems and modern systems; think of an adapter that connects two systems with different interfaces.
Data Integration Techniques
3. Application-Based Integration
• This technique makes use of software applications to extract, transform, and load the data from heterogeneous sources.
• It also makes the data from disparate sources compatible with each other in order to ease the transfer of data from one system to another.
• This technique saves time and effort, but it is a little complicated, as designing such an application requires technical knowledge.
4. Uniform Access Integration
• This technique integrates data from even more disparate sources, but the location of the data is not changed; the data stays in its original location.
• Only a unified view representing the integrated data is created. No separate storage is required for the integrated data, as only the integrated view is presented to the end user.
Data Integration Techniques
5. Data Warehousing
• This technique is loosely related to the uniform access integration technique, with the difference that the unified view is materialized in physical storage. This allows the data analyst to handle more complex queries.
• Though this is a promising technique, it incurs higher storage costs, since the view or copy of the unified data needs separate storage, and it also has higher maintenance costs.
Integration tools
1. On-Premise Data Integration Tools
• On-premise data integration tools are installed and run on the organization's infrastructure.
• These tools offer complete control over the data integration process and are typically used by larger
organizations requiring high customization and security levels.
• Some popular on-premise data integration tools include IBM InfoSphere DataStage, Talend, and
Microsoft SQL Server Integration Services (SSIS).
Integration tools
2. Open-Source Data Integration Tools
• Open-source data integration tools are free and often community-driven solutions that allow users
to modify the source code and add new features.
• These tools are typically less expensive than proprietary tools and can be customized to fit the
organization's specific needs.
• Some popular open-source data integration tools include Apache NiFi, Apache Kafka, and Pentaho.
Integration tools
3. Cloud-Based Data Integration Tools
• Cloud-based data integration tools are hosted in the cloud and accessed through a web browser. These tools
offer scalability, flexibility, and easy access to data from different sources.
• They are ideal for organizations requiring quick implementation and not wanting to invest in on-premise
hardware or software.
• Some popular cloud-based data integration tools include Amazon Web Services (AWS) Glue, Microsoft Azure
Data Factory, and Google Cloud Data Fusion.
• However, privacy and security are major concerns when using cloud-based data integration tools. Storing and
transferring sensitive data to the cloud can expose it to potential risks, such as unauthorized access, data
breaches, or data leakage. Adequate security measures, including encryption, access controls, and secure
authentication, must be implemented to protect data during storage and transmission.
Data Reduction
Data Reduction
• Data reduction is a process to reduce the size of a dataset while still preserving the most important
information.
• Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data.
• Data reduction is a way to attain a compressed version or representation of the data with less volume. This condensed data maintains the integrity of the original data and yields analysis results similar to those obtained from the actual data.
Data Reduction Techniques
• Dimensionality reduction
o Wavelet Transform
o Principal Component Analysis
o Feature Selection or Attribute Subset Selection
• Numerosity reduction
o Parametric reduction (Regression and Log-linear)
o Non-parametric reduction
▪ Histogram
▪ Clustering
▪ Sampling: (i) simple random sample without replacement, (ii) simple random sample with replacement, (iii) cluster sample, (iv) stratified sample (see the sketch after this list)
• Data Cube Aggregation
• Discretization & Concept Hierarchy Operation
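A minimal R sketch (sample sizes are arbitrary) of the sampling-based numerosity reduction listed above, using the built-in mtcars data frame:

data("mtcars")
set.seed(1)
srswor <- mtcars[sample(nrow(mtcars), 10, replace = FALSE), ]  # simple random sample without replacement
srswr  <- mtcars[sample(nrow(mtcars), 10, replace = TRUE),  ]  # simple random sample with replacement
strat  <- do.call(rbind, lapply(split(mtcars, mtcars$cyl),     # stratified sample: 2 rows per cylinder group
                                function(s) s[sample(nrow(s), 2), ]))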
Discretization & Concept Hierarchy Operation
Data discretization reduces a large set of continuous data values to a smaller, fixed set of intervals. This data reduction is achieved by dividing the attribute's range into intervals with little loss of information. Class (interval) labels can then be used to replace the original data values.
The process of discretization can be carried out recursively on an attribute, splitting its values in a hierarchical or multiresolution manner. During the recursive process the data is sorted at every step, so this technique is faster when there are fewer distinct values to sort. This partitioning of the data values is also known as a concept hierarchy.
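A minimal R sketch (not from the slides) of equal-width discretization with interval/class labels, using cut() on the mtcars data:

data("mtcars")
mpg_level <- cut(mtcars$mpg, breaks = 3,                 # split the mpg range into 3 equal-width intervals
                 labels = c("low", "medium", "high"))    # class labels replace the original values
table(mpg_level)                                         # how many values fall into each interval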
Discretization
Top-down Discretization
In general, the top-down approach starts at the top and works down. Top-down discretization starts by choosing one or a few breakpoints for the entire range of the attribute and then repeats the splitting on the resulting intervals until the desired number of intervals is reached.
Bottom-up Discretization
Bottom-up discretization works in the reverse direction: it starts at the bottom and moves up. All the continuous data values are initially considered as potential breakpoints; some of these points are then discarded by merging them with neighbouring values, thereby forming the intervals.
Concept Hierarchy Operation
The methodologies for concept hierarchy operation are:
• Histogram Analysis
• Binning
• Cluster Analysis
• Entropy-Based Discretization
• Data Segmentation by natural partitioning
• Interval merging by χ² analysis
Histogram Analysis
• Divide data into buckets and store the average (sum) for each bucket
• Partitioning rules: equal-width (e.g., buckets with an interval of 10,000) or equal-frequency (equal-depth)
[Figure: example histogram with equal-width buckets of 10,000]
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
χ² = Σ (Observed − Expected)² / Expected
(summed over all cells of the contingency table of the two features, attributes or variables)
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
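A hedged R sketch of the χ² test on a small, made-up 2×2 contingency table of two nominal attributes (the counts are purely illustrative):

# Hypothetical observed counts for two nominal attributes A and B
tbl <- matrix(c(250, 50, 200, 1000), nrow = 2,
              dimnames = list(A = c("yes", "no"), B = c("yes", "no")))
chisq.test(tbl, correct = FALSE)   # a large X-squared value suggests the attributes are related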
Example of a multiple linear regression model, in which one attribute (Airflow) is predicted from several others:
Airflow = 1.5 × Speed + 1 × No_of_blades + 0.7 × size_of_blades
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

  rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB) = (Σ ai·bi − n·Ā·B̄) / (n σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship); rA,B < 0: A and B are negatively correlated.
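For example, a quick R check of the Pearson correlation between two numeric attributes of the built-in mtcars dataset (illustrative only):

data("mtcars")
cor(mtcars$mpg, mtcars$wt, method = "pearson")   # about -0.87: heavier cars tend to have lower mpg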
Visually Evaluating Correlation
[Figure: scatter plots of feature pairs (e.g., f1 = Height, f2 = Age) showing correlation values ranging from –1 to 1]
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (ai − Ā)(bi − B̄) / n = E(A·B) − Ā·B̄

  Correlation coefficient: rA,B = Cov(A, B) / (σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely
to be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not independent. Only
under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A, B) = E(A·B) − Ā·B̄ = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Since Cov(A, B) > 0, the prices of the two stocks tend to rise together.
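The same calculation can be checked in R (note that R's built-in cov() uses the sample formula with an n − 1 denominator, so it returns 5 rather than the population value 4):

A <- c(2, 3, 5, 4, 6)
B <- c(5, 8, 10, 11, 14)
mean(A * B) - mean(A) * mean(B)   # population covariance: 42.4 - 38.4 = 4
cov(A, B)                         # sample covariance (n - 1 denominator): 5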
Exploratory Data Analysis: Basic Plots
• Histograms
• Barplots
• Boxplots
• Scatterplots
Histograms:
When visualizing a single numerical variable, a histogram is a go-to tool. It can be created in R using the hist() function:
data("mtcars")
hist(mtcars$mpg)
Histograms:
hist(mtcars$mpg,
     xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12,
     col = "lightseagreen",
     border = "darkorange")
To learn more about the arguments that hist() accepts, see:
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html
Barplots:
A barplot can provide a visual summary of a categorical variable, or of a numeric variable with a finite number of values, such as a ranking from 1 to 10. To draw a barplot, we will use the cyl variable, which is the number of cylinders in the mtcars dataset.
barplot(table(mtcars$cyl))
Barplots:
barplot(table(mtcars$cyl),
        xlab = "Number of cylinders",
        ylab = "Frequency",
        main = "mtcars dataset",
        col = "lightseagreen",
        border = "darkorange")
To learn more about the arguments that barplot() accepts, see:
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
Boxplots:
We can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. Let's draw a boxplot for the wt (weight) column in mtcars.
boxplot(mtcars$wt)
Boxplots:
To visualize the relationship between a numerical and a categorical variable, we can use a boxplot. Here mpg is a numerical variable and the number of cylinders (cyl) is categorical, as in the sketch below.
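A minimal sketch of such a grouped boxplot (the axis labels are illustrative):

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles/gallon",
        main = "MPG by cylinder count (mtcars)")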
Boxplots:
You can make the boxplot more attractive by setting some of its parameters.
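For instance, a possible customization (a sketch; the colours simply mirror the earlier histogram and barplot examples):

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles/gallon",
        main = "MPG by cylinder count (mtcars)",
        col = "lightseagreen",
        border = "darkorange")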
Scatterplots:
To visualize the relationship between two numeric variables, we can use a scatterplot. This can be done with the plot() function.
plot(mpg~disp, data=mtcars)
Scatterplots:
[Figure: scatter plot of mpg vs. disp produced by the plot() call above]
Data Science Lifecycle
• The scientific method has been in use for centuries and still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of its most valuable ideas relates to forming hypotheses and finding ways to test them.
https://en.wikipedia.org/wiki/Scientific_method
• CRISP-DM provides useful input on ways to frame analytics problems and is a popular approach for data mining.
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
• Tom Davenport’s DELTA framework : The DELTA framework offers an approach for data analytics projects,
including the context of the organization’s skills, datasets, and leadership engagement.
(Analytics at Work: Smarter Decisions, Better Results, 2010, Harvard Business Review Press )
• Doug Hubbard’s Applied Information Economics (AIE) approach : AIE provides a framework for measuring
intangibles and provides guidance on developing decision models, calibrating expert estimates, and deriving the
expected value of information.
(How to Measure Anything: Finding the Value of Intangibles in Business, 2010, Hoboken, NJ: John Wiley &
Sons)
• “MAD Skills” by Cohen et al. offers input for several of the techniques mentioned in Phases 2–4 that focus on model
planning, execution, and key findings.
(MAD Skills: New Analysis Practices for Big Data, Watertown, MA 2009)
Phase 1: Discovery
In this phase, the data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project. In addition, the
team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation
This phase includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
• In this phase, the team needs to create a robust environment in which it can explore the data that is
separate from a production environment. Usually, this is done by preparing an analytics sandbox.
• To get the data into the sandbox, the team needs to perform ETL: extracting data from its sources, transforming it, and loading it into the sandbox. Once the data is in the sandbox, the team needs to learn about the data and become familiar with it.
• The team also must decide how to condition and transform data to get it into a format to facilitate
subsequent analysis.
• The team may perform data visualizations to help team members understand the data, including its
trends, outliers, and relationships among data variables.
Phase 3: Model Planning
• The data science team identifies candidate models to apply to the data for clustering, classifying, or
finding relationships in the data depending on the goal of the project.
• Given the kind of data and resources that are available, evaluate whether similar, existing
approaches will work or if the team will need to create something new.
Phase 4: Model Building
• The data science team needs to develop datasets for training, testing, and production purposes.
These datasets enable the data scientist to develop the analytical model and train it (“training data”),
while holding aside some of the data (“hold-out data” or “test data”) for testing the model.
• It is critical to ensure that the training and test datasets are sufficiently robust for the model and
analytical techniques
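A minimal R sketch (the 70/30 split ratio is an assumption, not taken from the text) of holding aside test data from a dataset:

data("mtcars")
set.seed(42)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]    # training data used to fit the model
test  <- mtcars[-idx, ]   # hold-out data kept aside to evaluate the model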
Phase 4: Model Building
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes?
• Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?
• Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back to the
model planning phase and revise the modeling approach.
Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of the modeling to the criteria
established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members
and stakeholders, considering caveats, assumptions, and any limitations of the results
• As a result of this phase, the team will have documented the key findings and major insights derived
from the analysis.
• The deliverable of this phase will be the most visible portion of the process to the outside
stakeholders and sponsors, so take care to clearly articulate the results, methodology, and business
value of the findings
Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of
users.
• This approach enables the team to learn about the performance and related constraints of the model
in a production environment on a small scale and make adjustments before a full deployment.
• While scoping the effort involved in conducting a pilot project, consider running the model in a
production environment for a discrete set of products or a single line of business, which tests the
model in a live setting.
• Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring
of model accuracy and, if accuracy degrades, finding ways to retrain the model.