DM-M1-PPT v1.11
Data Mining Syllabus
Module – 1 (Introduction to Data Mining and Data Warehousing)
Module – 3 (Advanced Classification and Cluster Analysis)
Module – 1 (Introduction to Data Mining and Data Warehousing)
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses
1.2 Multidimensional data model - Warehouse schema, OLAP Operations
1.3 Data Warehouse Architecture
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses
◼ 301 List out the three major features of data warehouse. (3)
◼ 401 List and explain any two applications of data warehouse. (3)
Evolution of Database Technology – Not in Syllabus
What is a Data Warehouse?
◼ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.
◼ These properties are the major features of a data warehouse.
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
◼ A physically separate store of data transformed from the
operational environment
◼ Operational update of data does not occur in the data
warehouse environment
◼ Does not require transaction processing, recovery, and
concurrency control mechanisms
◼ Requires only two operations in data accessing:
◼ initial loading of data and access of data
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (1)
Feature           | OLAP                                                    | OLTP
Data Volume       | Large volumes (TB) of historical data (data warehouse). | Smaller volumes (GB) of current, operational data.
Operations        | Complex queries involving aggregations and joins.       | Simple and short transactions like CRUD operations.
Response Time     | Optimized for query speed, not immediate responses.     | Requires quick responses for real-time transactions.
Database Design   | Denormalized schema (star or snowflake schema).         | Normalized schema to minimize redundancy (ER-based).
Concurrency       | Low concurrency due to analytical workloads.            | High concurrency to handle multiple user requests.
Data Type         | Read-intensive, historical and summary data.            | Write-intensive, operational and detailed data.
Example Use Cases | Business intelligence, trend analysis, forecasting.     | Banking systems, order processing, inventory systems.
Tools             | Data warehouses (e.g., Snowflake, BigQuery).            | Transactional databases (e.g., MySQL, PostgreSQL).
Data Updates      | Periodic updates (ETL processes).                       | Continuous, real-time updates.
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (2)
◼ 201 Illustrate the multi-dimensional data model with a neat
figure. (3)
2 marks; Figure – 1 mark
◼ 211 b) List and illustrate the schemas used for the physical
representation of the multidimensional data with examples. (8)
3 schema - a star schema, a snowflake schema, or a fact constellation
schema – 3 marks
Each with explanation – 1 mark each, figures and example – 2 marks
◼ 312 a) Explain the differences between star schema and
snowflake schema in a data warehouse (6)
◼ 402 Describe the similarities and the differences of star schema
and snowflake schema – 3 marks
1.2(a) Multidimensional data model - Warehouse schema
From Tables to Data Cubes
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key (foreign keys); measures: units_sold, dollars_sold, avg_sales
Dimension tables:
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city, state_or_province, country
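A minimal DDL sketch of this star schema (a sketch, not the textbook's code; the _dim table names and column types are assumptions, and the branch and location dimensions follow the same pattern):

CREATE TABLE time_dim (
  time_key INT PRIMARY KEY,
  day INT,
  day_of_the_week VARCHAR(10),
  month INT,
  quarter INT,
  year INT
);

CREATE TABLE item_dim (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50),
  brand VARCHAR(30),
  type VARCHAR(30),
  supplier_type VARCHAR(30)
);

-- The fact table references every dimension table and carries the measures
CREATE TABLE sales_fact (
  time_key INT REFERENCES time_dim (time_key),
  item_key INT REFERENCES item_dim (item_key),
  branch_key INT,    -- would reference branch_dim
  location_key INT,  -- would reference location_dim
  units_sold INT,
  dollars_sold DECIMAL(12, 2),
  avg_sales DECIMAL(12, 2)
);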
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (normalized):
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city_key → city: city_key, city, state_or_province, country
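The only structural change from the star schema is that some dimension tables are normalized into smaller tables. A sketch of the item/supplier split (table names and types are assumptions):

CREATE TABLE supplier_dim (
  supplier_key INT PRIMARY KEY,
  supplier_type VARCHAR(30)
);

-- supplier_type moves out of item; item now holds only a foreign key
CREATE TABLE item_dim (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50),
  brand VARCHAR(30),
  type VARCHAR(30),
  supplier_key INT REFERENCES supplier_dim (supplier_key)
);

This reduces redundancy in the dimension tables at the cost of extra joins at query time.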
Example of Fact Constellation
Multiple fact tables share dimension tables:
◼ Sales Fact Table: time_key, item_key, branch_key, ... (as in the star schema)
◼ Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...
◼ Shared dimensions include time (time_key, day, day_of_the_week, month, quarter, year) and item (item_key, item_name, brand, type, supplier_type)
Data Cube Aggregation
[Figure: sales data summed over the dimensions Product, Quarter, and Country, producing the lattice of cuboids below.]
◼ 0-D (apex) cuboid: all
◼ 1-D cuboids: Product; Quarter; Country
◼ 2-D cuboids: (Product, Quarter); (Product, Country); (Quarter, Country)
◼ 3-D (base) cuboid: (Product, Quarter, Country)
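All of these cuboids can be produced in one statement with standard SQL grouping; a minimal sketch, assuming a hypothetical sales_fact table with product, quarter, and country columns (GROUP BY CUBE is SQL:1999 and is supported by PostgreSQL, Oracle, and SQL Server):

SELECT product, quarter, country, SUM(dollars_sold) AS total_sales
FROM sales_fact
GROUP BY CUBE (product, quarter, country);
-- NULLs mark aggregated-away dimensions: the all-NULL row is the 0-D apex
-- cuboid, and rows with only product non-NULL form a 1-D cuboid.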
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids for the dimensions time, item, location, and supplier, from the 0-D (apex) cuboid "all" through the 1-D and 2-D cuboids to the 3-D cuboids (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier), and the 4-D base cuboid.]
1.2(b) Multidimensional data model - OLAP Operations
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (3)
Typical OLAP Operations
1. Roll-Up (Drill-Up)
◼ Summarizes data by climbing up a hierarchy or by reducing
dimensions. It allows users to view data at a higher level of
abstraction.
◼ Example: In a sales data cube, roll up from `city` to
`region` or `country`. This aggregates the data at a
broader level.
2. Drill-Down (Roll-Down)
◼ The reverse of roll-up. It moves from higher-level summarized
data to lower-level detailed data or introduces new dimensions
for detailed analysis.
◼ Example: In a sales data cube, drill down from `region` to `city` to see the granular data.
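In SQL terms, roll-up is a GROUP BY at a coarser level of the dimension hierarchy; a hedged sketch, assuming hypothetical sales_fact and location_dim tables joined on location_key:

-- Roll-up: aggregate sales from the city level up to the country level
SELECT l.country, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.country;
-- Drill-down is the reverse: GROUP BY l.country, l.city restores city detail.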
3. Slice and Dice
◼ Slice: Selects a single value from one dimension to create a
sub-cube.
◼ Example: Filter sales data for `Year = 2023`.
◼ Dice: Filters data based on multiple dimension values,
resulting in a sub-cube with specific data.
◼ Example: Filter sales data for `Year = 2023` and `Product = Electronics`.
4. Pivot (Rotate)
◼ Reorients the data cube for better visualization, typically
converting 3D data into 2D views (e.g., switching rows and
columns in a table).
◼ Example: Rotate a data cube to view `products` on rows and `regions` on columns.
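Slice and dice correspond to WHERE filters on dimension attributes; a minimal sketch over the same hypothetical star schema (time_dim and item_dim are assumed names):

-- Slice: fix a single dimension value (Year = 2023)
SELECT f.*
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
WHERE t.year = 2023;

-- Dice: fix values on several dimensions at once
SELECT f.*
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN item_dim i ON f.item_key = i.item_key
WHERE t.year = 2023 AND i.type = 'Electronics';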
5. Drill Across
◼ Retrieves data involving more than one fact table.
◼ Combine `sales` data with `inventory` data to analyze
stock levels and sales performance together.
6. Drill Through
◼ Navigates through the bottom level of the cube to access
back-end relational tables, often using SQL queries.
◼ From aggregated sales data, drill through to view the
raw transaction records in a store.
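Drill-across amounts to joining two fact tables on their shared dimension keys; a hedged sketch, assuming hypothetical sales_fact and inventory_fact tables that share time_key and item_key:

SELECT s.time_key, s.item_key,
       SUM(s.units_sold) AS units_sold,
       SUM(i.units_in_stock) AS units_in_stock
FROM sales_fact s
JOIN inventory_fact i
  ON s.time_key = i.time_key
 AND s.item_key = i.item_key
GROUP BY s.time_key, s.item_key;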
Fig. 3.10 Typical OLAP Operations
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (4)
3.1.2 (b)
Given: a data warehouse consisting of the dimensions time, doctor, and patient, and the measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
i. Draw a schema diagram for the above data warehouse using one of the schemas [star, snowflake, fact constellation].
3.1.2 (b)
ii. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
Given
The base cuboid [day, doctor, patient],
list the total fee collected by each doctor in 2004
1. *Slice* the data for the year 2004:
2. *Roll up* the data along the “patient” dimension:
3. *Aggregation*: on the “charge” measure, for each doctor. This will
give the total fee collected by each doctor in 2004. (PTO)
1. *Slice* the data for the year 2004:
This operation will select only the data for the year 2004 from the “time”
dimension. After this operation, the “time” dimension will be reduced to
just the year-level, and the cuboid becomes [doctor, patient,
year=2004]
2. *Roll up* the data along the “patient” dimension:
The “patient” dimension can be aggregated to summarize the fee by
doctor. This will result in the cuboid “[doctor, year=2004]” with the total
charges for each doctor.
3. *Aggregation*:
After the *roll-up* operation, you will perform an *aggregation*
(summing) on the “charge” measure, for each doctor. This will give the
total fee collected by each doctor in 2004.
Final Result: List of doctors with their corresponding total charge
collected in 2004.
Not In Syllabus
To obtain the same list, write an SQL query assuming the data are
stored in a relational database with the schema fee (day, month, year,
doctor, hospital, patient, count, charge)
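One possible answer (a sketch against the given fee schema; exact syntax may vary by DBMS):

SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2004
GROUP BY doctor;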
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (5)
◼ 412 b) Suppose that a data warehouse for a university consists of
the following four dimensions: student, course, semester, and
instructor, and two measures: count and avg_grade. (8)
(i) Draw a snowflake schema diagram for the data warehouse.
(ii) Starting with the base cuboid, what specific OLAP operations should one perform in order to list the average grade of CS courses for each University student?
i) a snowflake schema diagram for the data warehouse – 4 marks
[marks can be given for the correct steps to the solution]
(ii) specific OLAP operations should one perform in order to list the
average grade of CS courses for each University student – 4 marks
4.1.2 (b)
(i) Draw a snowflake schema diagram for the data warehouse.
◼ Base Cuboid [student, course, semester, instructor]:
1. Selection: Slice Operation
Apply a slice operation to filter for CS courses in the "course"
dimension (course.department = "CS"). This reduces the dataset to
include only data related to Computer Science courses.
2. Projection: Drill-Down or Roll-up
Perform a drill-down operation on the "student" dimension to move
from a higher-level aggregation (e.g., university-level or department-
level data) to individual student-level data. This ensures that data is
listed for individual students.
3. Aggregation:
Use the *grade* measure to calculate the average grade obtained by
each student across all CS courses he/she has registered for. This is
achieved by aggregating the data across the other dimensions
(“semester” and “instructor”).
Final Output:
The result will be a two-dimensional listing with students as rows and their corresponding average grades for CS courses.
Not In Syllabus
Step 1: Roll-Up on Course from Course_ID to Department
◼ Operation Explanation:
◼ Aggregate the data from individual courses (e.g., CS101,
CS102) to the department level (e.g., CS). This simplifies
the dataset to focus on departments rather than specific
courses.
◼ Schema After Roll-Up:
Not In Syllabus
Step 2: Dice on Course and Student with Department = "CS" and University = "BigUniversity"
◼ Operation Explanation:
◼ Apply a filter to include only the relevant data:
◼ Department = "CS" ensures only Computer Science courses
are included.
◼ University = "BigUniversity" ensures only students from Big
University are included.
◼ Schema After Dice:
Not In Syllabus
Step 3: Drill-Down on Student from University to
Student_Name
◼ Operation Explanation:
◼ Navigate from the higher University level (e.g.,
BigUniversity) to the individual Student_Name level (e.g.,
Student_A, Student_C). This ensures granularity at the
student level for analysis.
◼ Schema After Drill-Down
Not In Syllabus
Step 4: Aggregation on Grade
◼ Operation Explanation:
◼ Group by Student_Name and calculate the average of the
Grade measure for all CS courses taken by each student.
◼ Schema After Aggregation:
Student Avg_Grade
Student_A 87.5
Student_C 92.0
Not in Syllabus
Corresponding SQL query
SELECT student_id, AVG(grade)
FROM data_warehouse_table
WHERE course = 'CS'
GROUP BY student_id
This query effectively implements the *slice*, *drill-down*, and
*aggregation* operations specified in the OLAP process.
1.3 Data Warehouse Architecture
Why a Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
◼ Different functions and different data:
◼ missing data: Decision support requires historical data which
operational DBs do not typically maintain
◼ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
◼ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
See Section 4.1.4 of “Data Mining: Concepts and Techniques”
1. The bottom tier is a warehouse database server that is generally a
relational database system. Back-end tools and utilities are used to feed
data into the bottom tier from operational databases or other external
sources (e.g., customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from different sources
into a unified format), as well as load and refresh functions to update
the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying
DBMS and allows client programs to generate SQL code to be executed
at a server. This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either a ROLAP model or a MOLAP model.
3. The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and data mining tools (e.g., identifying
patterns and trends, prediction etc.)
Data Warehouse: A Multi-Tiered Architecture
[Figure: operational DBs and other sources feed the data warehouse and data marts through Extract/Transform/Load/Refresh, coordinated by a monitor & integrator with a metadata repository; an OLAP server forms the middle tier; the top tier serves queries, reports, analysis, and data mining.]
Three Data Warehouse Models
◼ Enterprise warehouse
◼ collects all of the information about subjects spanning the entire organization
◼ Data Mart
◼ a subset of corporate-wide data that is of value to a specific group of users (e.g., a marketing data mart)
◼ Virtual warehouse
◼ a set of views over operational databases; only some of the possible summary views may be materialized
Extraction, Transformation, and Loading (ETL)
◼ Data extraction
◼ get data from multiple, heterogeneous, and external sources
◼ Data cleaning
◼ detect errors in the data and rectify them when possible
◼ Data transformation
◼ convert data from legacy or host format to warehouse format
◼ Load
◼ sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
◼ Refresh
◼ periodic updates from the data sources to the warehouse
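A minimal sketch of the load step in SQL, assuming hypothetical staging_sales and sales_fact tables; the transformation shown (summarizing staged rows per time and item) is illustrative only:

-- Load: consolidate cleaned, transformed staging rows into the warehouse
INSERT INTO sales_fact (time_key, item_key, units_sold, dollars_sold)
SELECT s.time_key, s.item_key, SUM(s.quantity), SUM(s.amount)
FROM staging_sales s
GROUP BY s.time_key, s.item_key;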
Metadata Repository
◼ Metadata is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
◼ schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
◼ Operational meta-data
◼ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
◼ warehouse schema, view and derived data definitions
◼ Business data
◼ business terms and definitions, ownership of data, charging policies
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
◼ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
◼ ROLAP is suited for environments with large volumes of
transactional data and the need for scalability.
◼ Multidimensional OLAP (MOLAP)
◼ Sparse array-based multidimensional storage engine
◼ Fast indexing to pre-computed summarized data
◼ MOLAP excels at providing fast performance for complex
analytical queries by leveraging pre-aggregated multidimensional
cubes.
OLAP Server Architectures
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
◼ Flexibility, e.g., low-level: relational, high-level: array
◼ HOLAP combines the best of both worlds, offering flexibility and
scalability by using relational storage for detailed data and
multidimensional cubes for fast aggregation.
◼ Specialized SQL servers (e.g., Redbricks)
◼ Specialized support for SQL queries over star/snowflake schemas.
Implements Query optimization for complex, multidimensional
analytical queries using techniques such as:
◼ Query rewriting: transforming SQL queries to improve performance.
◼ Indexing: Creating specialized indexes to optimize the retrieval
of multidimensional data from star and snowflake schemas.
◼ Materialized Views: Pre-computing and storing query results to
speed up response times for frequently requested data.
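For example, a materialized view that pre-computes a frequently requested aggregate (a sketch; the syntax is PostgreSQL/Oracle-style, and the table names are the assumed star schema from earlier):

CREATE MATERIALIZED VIEW sales_by_state AS
SELECT l.state_or_province, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.state_or_province;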
1.4 Data Warehousing to Data Mining
1.5 Data Mining Concepts and Applications
1.6 Knowledge Discovery in Database Vs Data mining
Why Data Mining?
What Is Data Mining?
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: databases → data cleaning and data integration → selection of task-relevant data → data mining → pattern evaluation.]
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the database)
4. Data transformation (where data are transformed and consolidated
into forms appropriate for mining by performing summary or aggregation
operations)
5. Data mining (an essential process where intelligent methods are
applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (visualization and knowledge representation
techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data
are prepared for mining. The data mining step may interact with the user
or a knowledge base. The interesting patterns are presented to the user
and may be stored as new knowledge in the knowledge base
Example: A Web Mining Framework
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Data Mining in Business Intelligence
[Figure: a pyramid of increasing potential to support business decisions, from data exploration (statistical summary, querying, and reporting) through data mining up to decision making by the end user.]
◼ https://www.javatpoint.com/data-mining-architecture
Architecture of typical data mining system
Data Source:
◼ The sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents.
ETL Process
Architecture of typical data mining system
Knowledge Base:
◼ The knowledge base is used to guide the search and to evaluate the interestingness of the resulting patterns.
◼ The knowledge base may even contain user views and data from user experiences.
1.8 Data Mining Functionalities
Data Mining Function: (1) Generalization
◼ Data entries can be associated with classes or concepts, such as "computers" or "printers" for items and "big spenders" or "budget spenders" for customers.
◼ Descriptions of these classes or concepts, known as
class/concept descriptions, can be derived using data
characterization and data discrimination.
◼ Data Characterization involves summarizing the general
characteristics or features of a target class (e.g., software
products with a 10% sales increase).
◼ Data Discrimination focuses on comparing the general features
of a target class against one or more contrasting classes (e.g.,
comparing software products with sales increases versus
decreases).
Data Mining Function: (2) Association
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Milk → Bread [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering, and
other applications?
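Support and confidence for a rule such as Milk → Bread can be computed directly; a hedged sketch, assuming a hypothetical transactions(tid, item) table (scalar-subquery syntax as in PostgreSQL):

-- support = P(Milk and Bread together); confidence = P(Bread | Milk)
WITH milk AS (
  SELECT DISTINCT tid FROM transactions WHERE item = 'Milk'
), milk_and_bread AS (
  SELECT DISTINCT t.tid
  FROM transactions t
  JOIN milk m ON t.tid = m.tid
  WHERE t.item = 'Bread'
)
SELECT
  100.0 * (SELECT COUNT(*) FROM milk_and_bread)
        / (SELECT COUNT(DISTINCT tid) FROM transactions) AS support_pct,
  100.0 * (SELECT COUNT(*) FROM milk_and_bread)
        / (SELECT COUNT(*) FROM milk) AS confidence_pct;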
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data
◼ Noise or exception?
◼ Methods: statistics, clustering, classification, regression
analysis, …
◼ Useful in fraud detection, rare events analysis
Data Mining Function: (6) Trend Analysis
Data Mining Function: (7) Data Visualization
1.9 Data Mining Issues
◼ 412 a) Describe different issues in data mining. (6)
Any six issues in data mining – 1 mark each
Major Issues in Data Mining