Unit-2 New
Contents
• Evaluation of classification methods – Confusion matrix, Student's t-test and ROC curve
• Exploratory Data Analysis – Basic tools (plots, graphs and summary statistics) of EDA
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data cleaning
• It is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set.
• It involves identifying data errors and then changing, updating or removing data to correct them.
• Data cleansing improves data quality and helps provide more accurate, consistent and reliable
information for decision-making in an organization
Types of data cleaning
• Missing Values − Missing values are filled with appropriate values. Common approaches include the following:
o Ignore the tuple, usually when it is missing values for several attributes.
o Fill in the missing value manually.
o Use a global constant (e.g., "Unknown") to fill in the missing value.
o Use the attribute mean or median to fill in the missing value.
o Use the most probable value, for example one predicted by regression or a decision tree.
• Inconsistent data − Inconsistencies can arise in recorded transactions, during data entry, or from integrating data from multiple databases. Some redundancies can be detected by correlation analysis, and accurate, careful integration of the data from the various sources can reduce or avoid redundancy.
Types of data cleaning
• Noisy data − Noise is a random error or variance in a measured variable. The following smoothing methods can be used to handle noise:
• Binning − Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing (see the sketch after this list).
• Regression − Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
• Clustering − Clustering helps in identifying outliers. Similar values are organized into clusters, and values that fall outside the clusters are treated as outliers.
• Combined computer and human inspection − Outliers can also be identified through a combination of computer and human inspection: patterns flagged by the computer as having a high "surprise" value are output to a list, and a human then verifies whether each flagged pattern is genuinely useful or merely garbage.
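As an illustration of the binning idea, here is a minimal R sketch (not taken from the slides) that partitions a small sorted price vector into three equal-depth bins and smooths each value to its bin mean:

# Equal-depth (equal-frequency) binning with smoothing by bin means
price  <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))   # sorted data values
bins   <- split(price, rep(1:3, each = 3))            # 3 bins of 3 values each
smooth <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
smooth   # 9 9 9 22 22 22 29 29 29: each value replaced by its bin mean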
Data Integration
Data Integration
• Data integration is the process of combining data from multiple sources and consolidating it into a
unified view.
• The goal of data integration in data mining is to provide a complete and accurate representation of
the data for further analysis. It involves extracting data from various sources, transforming it into a
common format, and loading it into a target system. Data integration can be challenging, especially
when dealing with large volumes of data, complex data structures, and different data formats.
• The sources to be integrated may include multiple databases, data cubes, or flat files. Data fusion merges data from these diverse sources to produce meaningful results; the consolidated result must be free of inconsistencies, contradictions, redundancies, and disparities.
Data Integration
• A data integration strategy is typically described using a triple (G, S, M), where G denotes the global schema, S denotes the schemas of the heterogeneous data sources, and M represents the mappings between queries over the source schemas and the global schema.
Let's consider a data integration scenario that aims to combine employee data from two different HR databases, database A and database B.
• The global schema (G) would define the unified view of employee data, including attributes like EmployeeID, Name, Department, and Salary.
• In the schemas of the heterogeneous sources, database A (S1) might have attributes like EmpID, FullName, Dept, and Pay, while database B's schema (S2) might have attributes like ID, EmployeeName, DepartmentName, and Compensation.
• The mappings (M) would then define how the attributes in S1 and S2 map to the attributes in G, allowing employee data from both systems to be integrated into the global schema.
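As a small illustration, here is a hedged R sketch (the rows are hypothetical; the column names follow the schemas above) of applying the mappings M to produce the unified view under G:

# Hypothetical source tables following schemas S1 (database A) and S2 (database B)
dbA <- data.frame(EmpID = 1:2, FullName = c("Ann", "Bob"),
                  Dept = c("HR", "IT"), Pay = c(50000, 60000))
dbB <- data.frame(ID = 3, EmployeeName = "Cara",
                  DepartmentName = "Sales", Compensation = 55000)

# Mappings M: rename each source's attributes to the global schema G
g_from_A <- setNames(dbA, c("EmployeeID", "Name", "Department", "Salary"))
g_from_B <- setNames(dbB, c("EmployeeID", "Name", "Department", "Salary"))

# Unified view of employee data under the global schema G
global_view <- rbind(g_from_A, g_from_B)
global_view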
Why is Data Integration Important?
• Provides a Unified View of Data - Data integration in data mining enables combining data from different
sources into a unified view. This allows for better decision-making by providing a complete and accurate data
representation.
• Increases Data Accuracy - Integrating data from multiple sources helps to identify and eliminate
inconsistencies, redundancies, and errors in the data. This improves data accuracy and reliability, making it
easier to draw accurate conclusions.
• Improves Efficiency - Data integration in data mining automates combining data from multiple sources,
reducing the time and effort required to access and analyze the data. This improves efficiency and reduces the
costs associated with data management.
• Facilitates Data Analysis - Integrating data from multiple sources provides a broader perspective of the
data. This enables more sophisticated and accurate data analysis, leading to more informed and effective
decision-making.
Approaches for Data Integration
Tight Coupling
• This approach involves the creation of a centralized database that integrates data from different sources. The
data is loaded into the centralized database using Extract, Transform, and Load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically stored in the central
database, and any updates or changes made to the data sources are immediately reflected in the central
database.
• Tight coupling is suitable for situations where real-time access to the data is required, and data
consistency is critical. However, this approach can be costly and complex, especially when dealing with large
volumes of data.
Approaches for Data Integration
Loose Coupling
• This approach involves the integration of data from different sources without physically storing it in a
centralized database.
• This approach provides an interface that takes a query from the user, transforms it into a format that the source databases can understand, and sends the query directly to the source databases to obtain the result.
• In this approach, data is accessed from the source systems as needed and combined in real-time to provide a
unified view. This approach uses middleware, such as application programming interfaces (APIs) and web
services, to connect the source systems and access the data.
• Loose coupling is suitable for situations where real-time access to the data is not critical, and the data sources
are highly distributed. This approach is more cost-effective and flexible than tight coupling but can be more
complex to set up and maintain.
Issues in Data Integration
• Data Quality: Data from different sources may have varying levels of accuracy, completeness, and
consistency, which can lead to data quality issues in the integrated data.
• Data Semantics: Data semantics refers to the meaning and interpretation of data. Integrating data from
different sources can be challenging because the same data element may have different meanings across
sources. Different sources may use different terms or definitions for the same data, making it difficult to
combine and understand the data.
• Data Heterogeneity: Data heterogeneity refers to the differences in data formats, structures, and storage
mechanisms across different data sources. Data integration can be challenging when dealing with
heterogeneous data sources, as it requires data transformation and mapping to make the data compatible with
the target data model.
• Data Privacy and Security: Data integration can increase the risk of data privacy and security breaches.
Integrating data from multiple sources can expose sensitive information and increase the risk of unauthorized
access or disclosure.
Issues in Data Integration
• Scalability: Scalability refers to the ability of the data integration solution to handle increasing volumes of
data and accommodate changes in data sources. Data integration solutions must be scalable to meet the
organization's evolving needs and ensure that the integrated data remains accurate and consistent.
• Integration with existing systems: Integrating new data sources with existing systems can be a complex
task, requiring significant effort and resources.
• Data Governance: Managing and maintaining the integration of data from multiple sources can be difficult,
especially when it comes to ensuring data accuracy, consistency, and timeliness.
• Performance: Integrating data from multiple sources can also affect the performance of the system.
• Complexity: The complexity of integrating data from multiple sources can be high, requiring specialized skills
and knowledge.
Major Issues to consider
Some major issues to consider during data integration
1. Entity Identification Problem
When data is unified from heterogeneous sources, how can we match up the real-world entities across the data? For example, suppose we have customer data from two different data sources: an entity from one data source has the attribute customer_id, while the entity from the other data source has customer_number. How would the data analyst or the system know that these two attributes refer to the same entity?
Here, schema integration can be achieved using the metadata of each attribute.
The metadata of an attribute includes its name, its meaning in the particular scenario, its data type, the range of values it can accept, and the rules it follows for null, blank, or zero values. Analyzing this metadata information will help prevent errors in schema integration.
Major Issues to consider
Structural integration can be achieved by ensuring that the functional dependencies and referential constraints of an attribute in the source system match those of the same attribute in the target system.
For example, suppose that in one system a discount is applied to an entire order, while in another system the discount is applied to every single item in the order. This difference must be caught before the data from these two sources is integrated into the target system.
Major Issues to consider
2. Redundancy and Correlation Analysis
Redundancy is one of the big issues during data integration. Redundant data is data that is unimportant or no longer needed. Redundancy can also arise from attributes that can be derived from another attribute in the data set.
For example, if one data set has the customer's age and the other data set has the customer's date of birth, then age is a redundant attribute because it can be derived from the date of birth.
Inconsistencies in attribute naming can also raise the level of redundancy. Redundancy can be discovered using correlation analysis: the attributes are analyzed to detect their interdependency, thereby detecting the correlation between them.
Major Issues to consider
3. Tuple Duplication
Along with redundancies, data integration also has to deal with duplicate tuples. Duplicate tuples may appear in the resulting data if a denormalized table has been used as a source for data integration.
Data Integration Techniques
There are various data integration techniques in data mining. Some of them are as follows:
1. Manual Integration
• This method avoids using automation during data integration. The data analyst collects, cleans, and integrates
the data to produce meaningful information.
• This strategy is suitable for a small organization with a limited data set, but it becomes tedious for large, complex, and recurring integrations.
• It is a time-consuming operation because the entire process must be done manually.
2. Middleware Integration
• Middleware software is employed to collect the data from different sources, normalize it, and store it in the resultant data set. This technique is adopted when an enterprise wants to integrate data from legacy systems into modern systems.
• Middleware software acts as an interpreter between legacy systems and modern systems; think of an adapter that connects two systems with different interfaces.
Data Integration Techniques
3. Application-Based Integration
• This technique makes use of software applications to extract, transform, and load the data from heterogeneous sources.
• It also makes the data from disparate sources compatible with each other in order to ease the transfer of data from one system to another.
• This technique saves time and effort, but it is a little complicated, as designing such an application requires technical knowledge.
4. Uniform Access Integration
• This technique integrates data from even more disparate sources, but the location of the data is not changed; the data stays in its original location.
• Only a unified view representing the integrated data is created. No separate storage is required for the integrated data, as only the integrated view is presented to the end user.
Data Integration Techniques
5. Data Warehousing
• This technique is loosely related to the uniform access integration technique, with the difference that the unified view is materialized in physical storage. This allows the data analyst to handle more complex queries.
• Though this is a promising technique, it incurs higher storage costs, since the view or copy of the unified data needs separate storage, and it also has higher maintenance costs.
Integration tools
1. On-Premise Data Integration Tools
• On-premise data integration tools are installed and run on the organization's infrastructure.
• These tools offer complete control over the data integration process and are typically used by larger
organizations requiring high customization and security levels.
• Some popular on-premise data integration tools include IBM InfoSphere DataStage, Talend, and
Microsoft SQL Server Integration Services (SSIS).
Integration tools
2. Open-Source Data Integration Tools
• Open-source data integration tools are free and often community-driven solutions that allow users
to modify the source code and add new features.
• These tools are typically less expensive than proprietary tools and can be customized to fit the
organization's specific needs.
• Some popular open-source data integration tools include Apache NiFi, Apache Kafka, and Pentaho.
Integration tools
3. Cloud-Based Data Integration Tools
• Cloud-based data integration tools are hosted in the cloud and accessed through a web browser. These tools
offer scalability, flexibility, and easy access to data from different sources.
• They are ideal for organizations requiring quick implementation and not wanting to invest in on-premise
hardware or software.
• Some popular cloud-based data integration tools include Amazon Web Services (AWS) Glue, Microsoft Azure
Data Factory, and Google Cloud Data Fusion.
• However, privacy and security are major concerns when using cloud-based data integration tools. Storing and
transferring sensitive data to the cloud can expose it to potential risks, such as unauthorized access, data
breaches, or data leakage. Adequate security measures, including encryption, access controls, and secure
authentication, must be implemented to protect data during storage and transmission.
Data Reduction
Data Reduction
• Data reduction is a process to reduce the size of a dataset while still preserving the most important
information.
• Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data.
• Data reduction is a way to attain a compressed version or representation of the data with less volume. This condensed data maintains the integrity of the original data and yields analysis results similar to those obtained from the actual data.
Data Reduction Techniques
• Dimensionality reduction
o Wavelet Transform
o Principal Component Analysis
o Feature Selection or Attribute Subset Selection
• Numerosity reduction
o Parametric reduction (Regression and Log-linear)
o Non-parametric reduction
▪ Histogram
▪ Clustering
▪ Sampling: (i) simple random sample without replacement, (ii) simple random sample with replacement, (iii) cluster sample, (iv) stratified sample (see the sketch after this list)
• Data Cube Aggregation
• Discretization & Concept Hierarchy Operation
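A minimal R sketch (sample sizes are arbitrary) of the sampling-based numerosity reduction listed above, using the built-in mtcars data frame:

data("mtcars")
set.seed(1)
srswor <- mtcars[sample(nrow(mtcars), 10, replace = FALSE), ]  # simple random sample without replacement
srswr  <- mtcars[sample(nrow(mtcars), 10, replace = TRUE),  ]  # simple random sample with replacement
strat  <- do.call(rbind, lapply(split(mtcars, mtcars$cyl),     # stratified sample: 2 rows per cylinder group
                                function(s) s[sample(nrow(s), 2), ]))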
Discretization & Concept Hierarchy Operation
Data discretization reduces a large set of continuous data values to a smaller, fixed set of intervals. This data reduction is achieved by dividing the attribute's range into intervals with little loss of information. Class (interval) labels can then be used to replace the original data values.
The process of discretization can be carried out recursively on an attribute, splitting its values in a hierarchical or multiresolution manner. During the recursive process the data is sorted at every step, so this technique is faster when there are fewer distinct values to sort. This partitioning of the data values is also known as a concept hierarchy.
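A minimal R sketch (not from the slides) of equal-width discretization with interval/class labels, using cut() on the mtcars data:

data("mtcars")
mpg_level <- cut(mtcars$mpg, breaks = 3,                 # split the mpg range into 3 equal-width intervals
                 labels = c("low", "medium", "high"))    # class labels replace the original values
table(mpg_level)                                         # how many values fall into each interval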
Discretization
Top-down Discretization
In general, the top-down approach starts at the top and works down. Top-down discretization starts by choosing one or a few breakpoints for the entire range of the attribute and then repeats the splitting on the resulting intervals until the desired number of intervals is reached.
Bottom-up Discretization
Bottom-up discretization works in the reverse direction: it starts at the bottom and moves up. All the continuous data values are initially considered as potential breakpoints; some of these points are then discarded by merging them with neighbouring values, thereby forming the intervals.
Concept Hierarchy Operation
The methodologies for concept hierarchy operation are:
• Histogram Analysis
• Binning
• Cluster Analysis
• Entropy-Based Discretization
• Data Segmentation by natural partitioning
• Interval merging by χ² analysis
Histogram Analysis
• Divide data into buckets and store the average (sum) for each bucket
• Partitioning rules: equal-width (e.g., buckets with an interval of 10,000) or equal-frequency (equal-depth)
[Figure: example histogram with equal-width buckets of 10,000]
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
χ² = Σ (Observed − Expected)² / Expected
(summed over all cells of the contingency table of the two features, attributes or variables)
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count is very
different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
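A hedged R sketch of the χ² test on a small, made-up 2×2 contingency table of two nominal attributes (the counts are purely illustrative):

# Hypothetical observed counts for two nominal attributes A and B
tbl <- matrix(c(250, 50, 200, 1000), nrow = 2,
              dimnames = list(A = c("yes", "no"), B = c("yes", "no")))
chisq.test(tbl, correct = FALSE)   # a large X-squared value suggests the attributes are related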
Example of a multiple linear regression model, in which one attribute (Airflow) is predicted from several others:
Airflow = 1.5 × Speed + 1 × No_of_blades + 0.7 × size_of_blades
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

  rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB) = (Σ ai·bi − n·Ā·B̄) / (n σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship); rA,B < 0: A and B are negatively correlated.
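For example, a quick R check of the Pearson correlation between two numeric attributes of the built-in mtcars dataset (illustrative only):

data("mtcars")
cor(mtcars$mpg, mtcars$wt, method = "pearson")   # about -0.87: heavier cars tend to have lower mpg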
Visually Evaluating Correlation
[Figure: scatter plots of feature pairs (e.g., f1 = Height, f2 = Age) showing correlation values ranging from –1 to 1]
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (ai − Ā)(bi − B̄) / n = E(A·B) − Ā·B̄

  Correlation coefficient: rA,B = Cov(A, B) / (σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely
to be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not independent. Only
under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A, B) = E(A·B) − Ā·B̄ = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Since Cov(A, B) > 0, the prices of the two stocks tend to rise together.
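The same calculation can be checked in R (note that R's built-in cov() uses the sample formula with an n − 1 denominator, so it returns 5 rather than the population value 4):

A <- c(2, 3, 5, 4, 6)
B <- c(5, 8, 10, 11, 14)
mean(A * B) - mean(A) * mean(B)   # population covariance: 42.4 - 38.4 = 4
cov(A, B)                         # sample covariance (n - 1 denominator): 5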
Exploratory Data Analysis: Basic Plots
• Histograms
• Barplots
• Boxplots
• Scatterplots
Histograms:
When visualizing a single numerical variable, a histogram is a go-to tool. It can be created in R using the hist() function:
data("mtcars")
hist(mtcars$mpg)
Histograms:
hist(mtcars$mpg,
     xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12,
     col = "lightseagreen",
     border = "darkorange")
To learn more about the arguments that hist() accepts, see:
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html
Barplots:
A barplot can provide a visual summary of a categorical variable, or of a numeric variable with a finite number of values, such as a ranking from 1 to 10. To draw a barplot, we will use the cyl variable, which is the number of cylinders in the mtcars dataset.
barplot(table(mtcars$cyl))
Barplots:
barplot(table(mtcars$cyl),
        xlab = "Number of cylinders",
        ylab = "Frequency",
        main = "mtcars dataset",
        col = "lightseagreen",
        border = "darkorange")
To learn more about the arguments that barplot() accepts, see:
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
Boxplots:
We can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. Let's draw a boxplot for the wt (weight) column in mtcars.
boxplot(mtcars$wt)
Boxplots:
To visualize the relationship between a numerical and a categorical variable, we can use a boxplot. Here mpg is a numerical variable and the number of cylinders (cyl) is categorical, as in the sketch below.
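A minimal sketch of such a grouped boxplot (the axis labels are illustrative):

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles/gallon",
        main = "MPG by cylinder count (mtcars)")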
Boxplots:
You can make the boxplot more attractive by setting some of its parameters.
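For instance, a possible customization (a sketch; the colours simply mirror the earlier histogram and barplot examples):

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles/gallon",
        main = "MPG by cylinder count (mtcars)",
        col = "lightseagreen",
        border = "darkorange")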
Scatterplots:
To visualize the relationship between two numeric variables, we can use a scatterplot. This can be done with the plot() function.
plot(mpg~disp, data=mtcars)
Scatterplots:
[Figure: scatter plot of mpg vs. disp produced by the plot() call above]
Data Science Lifecycle
• The scientific method has been in use for centuries and still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of its most valuable ideas relates to forming hypotheses and finding ways to test them.
https://en.wikipedia.org/wiki/Scientific_method
• CRISP-DM provides useful input on ways to frame analytics problems and is a popular approach for data mining.
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
• Tom Davenport’s DELTA framework : The DELTA framework offers an approach for data analytics projects,
including the context of the organization’s skills, datasets, and leadership engagement.
(Analytics at Work: Smarter Decisions, Better Results, 2010, Harvard Business Review Press )
• Doug Hubbard’s Applied Information Economics (AIE) approach : AIE provides a framework for measuring
intangibles and provides guidance on developing decision models, calibrating expert estimates, and deriving the
expected value of information.
(How to Measure Anything: Finding the Value of Intangibles in Business, 2010, Hoboken, NJ: John Wiley &
Sons)
• “MAD Skills” by Cohen et al. offers input for several of the techniques mentioned in Phases 2–4 that focus on model
planning, execution, and key findings.
(MAD Skills: New Analysis Practices for Big Data, Watertown, MA 2009)
Phase 1: Discovery
In this phase, the data science team must learn and investigate the problem, develop context and
understanding, and learn about the data sources needed and available for the project. In addition, the
team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation
This phase includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
• In this phase, the team needs to create a robust environment in which it can explore the data that is
separate from a production environment. Usually, this is done by preparing an analytics sandbox.
• To get the data into the sandbox, the team needs to perform ETL: extracting data from its sources, transforming it, and loading it into the sandbox. Once the data is in the sandbox, the team needs to learn about the data and become familiar with it.
• The team also must decide how to condition and transform data to get it into a format to facilitate
subsequent analysis.
• The team may perform data visualizations to help team members understand the data, including its
trends, outliers, and relationships among data variables.
Phase 3: Model Planning
• The data science team identifies candidate models to apply to the data for clustering, classifying, or
finding relationships in the data depending on the goal of the project.
• Given the kind of data and resources that are available, evaluate whether similar, existing
approaches will work or if the team will need to create something new.
Phase 4: Model Building
• The data science team needs to develop datasets for training, testing, and production purposes.
These datasets enable the data scientist to develop the analytical model and train it (“training data”),
while holding aside some of the data (“hold-out data” or “test data”) for testing the model.
• It is critical to ensure that the training and test datasets are sufficiently robust for the model and
analytical techniques
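A minimal R sketch (the 70/30 split ratio is an assumption, not taken from the text) of holding aside test data from a dataset:

data("mtcars")
set.seed(42)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]    # training data used to fit the model
test  <- mtcars[-idx, ]   # hold-out data kept aside to evaluate the model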
Phase 4: Model Building
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes?
• Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?
• Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back to the
model planning phase and revise the modeling approach.
Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of the modeling to the criteria
established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members
and stakeholders, considering caveats, assumptions, and any limitations of the results
• As a result of this phase, the team will have documented the key findings and major insights derived
from the analysis.
• The deliverable of this phase will be the most visible portion of the process to the outside
stakeholders and sponsors, so take care to clearly articulate the results, methodology, and business
value of the findings
Phase 6: Operationalize
• The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of
users.
• This approach enables the team to learn about the performance and related constraints of the model
in a production environment on a small scale and make adjustments before a full deployment.
• While scoping the effort involved in conducting a pilot project, consider running the model in a
production environment for a discrete set of products or a single line of business, which tests the
model in a live setting.
• Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring
of model accuracy and, if accuracy degrades, finding ways to retrain the model.