Unit 2 - Data Science BCA


Program Name: B.C.A Semester VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03
Contact Hours: 42 Hours
Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40
Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs. Data Mining, DBMS vs. Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concept, Frequent Item Set Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)

Unit 2
Topics:

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization.

Data Warehouse:

Def 1: “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-making process.”—W. H. Inmon (Father of the Data
Warehouse, American computer scientist)

Def 2: A centralized data repository that consolidates data from multiple sources.

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision making process”
Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data.

Data warehousing:

The process of constructing and using data warehouses, as shown in the following figure.

Fig 1.1: Data warehouse of a sales organization.


Difference between OLTP and OLAP:

Feature              OLTP                                    OLAP
users                clerk, IT professional                  knowledge worker
function             day-to-day operations                   decision support
DB design            application-oriented                    subject-oriented
data                 current, up-to-date; detailed,          historical; summarized,
                     flat relational; isolated               multidimensional; integrated, consolidated
usage                repetitive                              ad-hoc
access               read/write, index/hash on primary key   lots of scans
unit of work         short, simple transaction               complex query
# records accessed   tens                                    millions
# users              thousands                               hundreds
DB size              100 MB to GB                            100 GB to TB
metric               transaction throughput                  query throughput, response time

Data Warehousing: Three Tier Architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure below.

Fig. Three Tier Architecture of Data warehousing



 The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse.

 The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or

(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).

 The top tier is a front-end client layer , which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific groups of users. Its
scope is confined to specific, selected groups, such as marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.

Fig: A recommended approach for data warehouse development

Data Warehouse Modeling: Data Cube and OLAP


Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
o A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts. Fact tables contain numerical data, while dimension
tables provide context and background information.
- Dimension tables, such as item (item_name, brand, type) or time (day, week,
month, quarter, year), describe the entities about which the organization keeps records.
- The fact table contains numeric measures (such as dollars_sold, i.e., the sale amount
in dollars, and units_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is
typically denoted by ‘all’.
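As an illustration (not part of the original notes), the cuboids of a tiny 2-D sales cube can be sketched with pandas; the item/quarter data below are assumed values:

```python
import pandas as pd

# Hypothetical 2-D base cuboid: one row per (item, quarter) combination.
fact = pd.DataFrame({
    "item":         ["TV", "TV", "Phone", "Phone"],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [400, 520, 300, 350],
})

# 1-D cuboids: summarize along one dimension at a time.
by_item    = fact.groupby("item")["dollars_sold"].sum()
by_quarter = fact.groupby("quarter")["dollars_sold"].sum()

# 0-D (apex) cuboid: the highest level of summarization ("all").
apex = fact["dollars_sold"].sum()

print(by_item, by_quarter, apex, sep="\n\n")
```

Here the fact table itself plays the role of the base cuboid, each groupby result is a higher-level cuboid, and the grand total corresponds to the apex cuboid.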


The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.

Schemas for Multidimensional Data Models

Stars, Snowflakes, and Fact Constellations:

o Star schema: A fact table in the middle connected to a set of dimension tables.

o Snowflake schema: A refinement of the star schema in which some dimensional
hierarchies are normalized into a set of smaller dimension tables, forming a shape
similar to a snowflake.

o Fact constellation: Multiple fact tables share dimension tables; viewed as a
collection of stars, it is therefore also called a galaxy schema or fact constellation.
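A rough pandas sketch of a star schema (the table and column names are illustrative assumptions): one fact table in the middle holds the measures and foreign keys, and joins resolve those keys against the dimension tables.

```python
import pandas as pd

# Dimension tables: context/background information about item and time.
item_dim = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["TV", "Phone"],
    "brand":     ["Acme", "Zeta"],
})
time_dim = pd.DataFrame({
    "time_key": [10, 20],
    "quarter":  ["Q1", "Q2"],
    "year":     [2024, 2024],
})

# Fact table in the middle: numeric measures plus keys to each dimension.
sales_fact = pd.DataFrame({
    "item_key":     [1, 1, 2],
    "time_key":     [10, 20, 10],
    "dollars_sold": [400, 520, 300],
    "units_sold":   [4, 5, 6],
})

# Joining the fact table to its dimensions reproduces the star-shaped layout.
report = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(time_dim, on="time_key"))
print(report)
```

A snowflake schema would further normalize item_dim (e.g., splitting brand into its own table), while a fact constellation would add a second fact table (e.g., shipping) sharing these same dimension tables.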

OLAP Operations

o Roll up (drill-up): Summarizes or aggregates data

- by climbing up a concept hierarchy or by dimension reduction

- In the cube given in the overview section, the roll-up operation is performed by climbing up
in the concept hierarchy of the Location dimension (City -> Country).

o Drill down (roll down): In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:

- Moving down in the concept hierarchy

- Adding a new dimension

- In the cube given in the overview section, the drill-down operation is performed by moving
down in the concept hierarchy of the Time dimension (Quarter -> Month).

o Slice and dice: Slice performs a selection on a single dimension of the OLAP cube, which
results in the creation of a new sub-cube; dice performs a selection on two or more dimensions.
In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.

o Pivot (rotate):

- Reorients the cube for visualization, e.g., presenting a 3D cube as a series of 2D planes.

- It is also known as the rotation operation, as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing the
pivot operation gives a new view of it.
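The four operations can be imitated on a small cube held as a pandas DataFrame; the dimensions, hierarchies, and figures below are assumed for illustration only:

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"] * 2,
    "country": ["India"] * 8,
    "quarter": ["Q1", "Q2"] * 4,
    "month":   ["Jan", "Apr", "Feb", "May"] * 2,
    "item":    ["TV"] * 4 + ["Phone"] * 4,
    "sales":   [100, 120, 90, 110, 80, 95, 70, 85],
})

# Roll-up: climb the Location hierarchy (City -> Country).
rollup = sales.groupby(["country", "quarter"])["sales"].sum()

# Drill-down: move down the Time hierarchy (Quarter -> Month).
drilldown = sales.groupby(["city", "month"])["sales"].sum()

# Slice: fix a single dimension value (Time = "Q1") to get a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot (rotate): view the sliced sub-cube as item rows vs. city columns.
pivoted = slice_q1.pivot_table(index="item", columns="city",
                               values="sales", aggfunc="sum")
print(rollup, drilldown, pivoted, sep="\n\n")
```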
Data Cleaning, …

Today’s data are highly susceptible to noise, missing values, and inconsistencies due to their typically
huge size and their heterogeneous sources. Low-quality data will lead to poor mining results.

Different data preprocessing techniques (data cleaning, data integration, data reduction, data
transformation), when applied before data mining, improve the overall quality of the patterns mined
and reduce the time required for the actual mining.

Fig: Forms of Data Preprocessing

Data cleaning

The data cleaning stage smooths out noise, attempts to fill in missing values, removes outliers,
and corrects inconsistencies in the data.

1) Handling missing values:


i. Ignoring the tuple: Used when the class label is missing. This method is not very effective
when many missing values are present.
ii. Fill in the missing value manually: It is time consuming.
iii. Use a global constant to fill the missing value: Ex: "unknown" or ∞
iv. Use the attribute mean to fill the missing value
v. Use the attribute mean for all samples belonging to the same class as the given tuple
vi. Use the most probable value to fill the missing value (e.g., using a decision tree)
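A minimal pandas sketch of several of these strategies (the column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30000, np.nan, 52000, np.nan],
})

# (i) Ignore (drop) tuples with missing values.
dropped = df.dropna()

# (iii) Fill with a global constant / sentinel value.
constant_filled = df.fillna({"income": -1})

# (iv) Fill with the overall attribute mean.
mean_filled = df.fillna({"income": df["income"].mean()})

# (v) Fill with the attribute mean of the tuple's own class.
class_mean_filled = df.copy()
class_mean_filled["income"] = (df.groupby("class")["income"]
                                 .transform(lambda s: s.fillna(s.mean())))
print(class_mean_filled)
```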

2) Noisy data: Noise is a random error or variance in a measured variable.



Different methods for smoothing are:

1. Binning: Smooths sorted data by consulting its neighborhood. The values are
distributed into buckets/bins. Binning performs local smoothing.

Different binning methods for data smoothing:

i. Smoothing by bin means: Each value in a bin is replaced by the bin mean.
Ex: BIN 1: 4, 8, 15 -> BIN 1: 9, 9, 9
ii. Smoothing by bin boundaries: The minimum and maximum values of the bin are identified, and
each value is replaced by the closest boundary value.
Ex: BIN 1: 4, 8, 15 -> BIN 1: 4, 4, 15

2. Regression: Data smoothing can also be done by regression (linear regression, multiple
linear regression), in which one attribute is used to predict the value of another.
3. Outlier analysis: Outliers can be detected by clustering. Values that fall outside the clusters are
outliers.
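The two binning variants above can be sketched in plain Python, using the BIN 1 example values:

```python
# Smoothing a sorted bin such as BIN 1: [4, 8, 15].
bin1 = [4, 8, 15]

# Smoothing by bin means: every value becomes the bin mean.
mean = sum(bin1) / len(bin1)                 # (4 + 8 + 15) / 3 = 9
by_means = [mean] * len(bin1)                # [9.0, 9.0, 9.0]

# Smoothing by bin boundaries: replace each value with the closer of
# the bin's minimum and maximum.
lo, hi = min(bin1), max(bin1)
by_boundaries = [lo if v - lo <= hi - v else hi for v in bin1]  # [4, 4, 15]

print(by_means, by_boundaries)
```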

Data Integration

Data mining often works on integrated data from multiple repositories. Careful integration helps
improve the accuracy of data mining results.

Challenges of DI

1. Entity Identification Problem:


“How to match schema and objects from many sources?” This is called Entity
Identification Problem.
Ex: Cust-id in one table and Cust-no in another table.
Metadata helps in avoiding these problems.
2. Redundancy and correlation analysis:
Redundancy means repetition.
Some redundancies can be detected by correlation analysis. Given two attributes, correlation
analysis tells how strong the relationship between them is (the chi-square test and the
correlation coefficient are examples); see the sketch below.
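A brief sketch of both checks with pandas and SciPy (the attribute names and values are made up; numeric attributes use the correlation coefficient, nominal attributes the chi-square test):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age":       [23, 35, 45, 52, 61],
    "income":    [25000, 40000, 52000, 60000, 72000],
    "gender":    ["M", "F", "F", "M", "F"],
    "preferred": ["online", "store", "store", "online", "store"],
})

# Pearson correlation coefficient for two numeric attributes:
# values near +1 or -1 suggest one attribute is largely redundant.
r = df["age"].corr(df["income"])

# Chi-square test of independence for two nominal attributes.
table = pd.crosstab(df["gender"], df["preferred"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"correlation(age, income) = {r:.3f}")
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```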

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data.

Data Reduction Strategies:

1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.


2. Numerosity reduction:
Replace original data by alternate smaller forms.
Ex: Histograms, Sampling, Data cube aggregation.
3. Data compression:
Reduce the size of data.

Wavelet Transform:

The Discrete Wavelet Transform (DWT) is a linear signal processing technique that, when applied to a
data vector X, transforms it into a numerically different vector X' of the same length. The DWT is a
fast and simple transformation that can translate an image from the spatial domain to the frequency
domain.
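A small sketch of a one-level DWT using the PyWavelets package (the package choice and the sample vector are assumptions, not part of the notes):

```python
import pywt  # PyWavelets, assumed to be installed (pip install PyWavelets)

# A small data vector X.
X = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]

# One-level discrete wavelet transform with the Haar wavelet:
# cA holds the (smooth) approximation coefficients, cD the detail coefficients.
cA, cD = pywt.dwt(X, "haar")

# The transform is invertible; keeping only the larger coefficients and
# zeroing the rest would give a compressed, approximate representation.
reconstructed = pywt.idwt(cA, cD, "haar")
print(cA, cD, reconstructed, sep="\n")
```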

Principal Components Analysis (PCA)

PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
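A minimal scikit-learn sketch (the library choice and toy data are assumptions) that projects a small data set onto its first two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 samples with 4 correlated attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2 + rng.normal(scale=0.1, size=(6, 2))])

# Keep only the 2 components that capture the major trends in the data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (6, 2): fewer variables
print(pca.explained_variance_ratio_)     # share of variance retained
```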

Attribute Subset Selection:

The data set for analysis may consist of many attributes that are irrelevant to the mining task (e.g., a
telephone number may not be important when classifying customers). Attribute subset selection reduces
the data set by removing irrelevant attributes.

Some heuristic methods for attribute subset selection are:

1. Stepwise forward selection:


 Start with an empty set of attributes.
 The best of the original attributes is added to the reduced set.
 At each subsequent iteration, the best of the remaining attributes is added.
2. Stepwise backward elimination:
 Start with full set of attributes
 At each step, remove the worst attribute remaining in the set.
3. Combination of forward selection & backward selection:
 Combined method
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision Tree Induction:
In decision tree induction, a tree is constructed from the given data. All attributes that do not appear
in the tree are assumed to be irrelevant.
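Stepwise forward selection can be sketched with scikit-learn's SequentialFeatureSelector; the estimator and data set below are illustrative choices, not prescribed by the notes:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Start from an empty attribute set and greedily add the best remaining
# attribute at each step until 2 attributes have been selected.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",     # use "backward" for stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())   # mask of the retained attributes
```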

Histograms:

A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a
popular form of data reduction. Histograms are highly effective at approximating both sparse and dense
data, as well as skewed and uniform data.

The following data are a list of AllElectronics prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30. The figure below shows the histogram for this data.

Fig: Histogram for AllElectronics
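A rough matplotlib sketch of how such a histogram could be produced from the price list above (the equal-width buckets of $5 are an assumption):

```python
import matplotlib.pyplot as plt

# AllElectronics prices from the example above (sorted, rounded to dollars).
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width 5 approximate the price distribution.
plt.hist(prices, bins=range(0, 35, 5), edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items sold")
plt.title("Histogram for AllElectronics prices")
plt.show()
```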

Clustering:

Clustering partitions data into clusters/groups of objects that are similar/close. In data reduction,
the cluster representations of the data are used to replace the actual data.

Sampling:

Sampling is used as a data reduction technique in which a large data set is represented by a much
smaller random sample (subset).

Common ways to sample:

i. Simple random sample without replacement of size s (SRSWOR)

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.

ii. Simple random sample with replacement (SRSWR)

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

iii. Cluster sample


The tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M.

iv. Stratified sample

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In this way, the age group having the
smallest number of customers will be sure to be represented.

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s, as opposed to N, the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.

Fig. Sampling Techniques
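The four sampling schemes can be roughly sketched with pandas (the data set, sample size s, and the age-group column used as cluster/stratum are hypothetical):

```python
import pandas as pd

D = pd.DataFrame({
    "cust_id":   range(1, 13),
    "age_group": ["youth", "adult", "senior"] * 4,
})
s = 4

# SRSWOR: s tuples drawn without replacement, each equally likely.
srswor = D.sample(n=s, random_state=1)

# SRSWR: tuples are replaced after drawing, so duplicates may occur.
srswr = D.sample(n=s, replace=True, random_state=1)

# Cluster sample: group tuples into disjoint clusters, then take an SRS
# of whole clusters (here, 1 of the 3 age-group "clusters").
chosen = pd.Series(D["age_group"].unique()).sample(n=1, random_state=1)
cluster_sample = D[D["age_group"].isin(chosen)]

# Stratified sample: an SRS is drawn inside every stratum (age group),
# so even the smallest group is represented (needs a recent pandas).
stratified = D.groupby("age_group", group_keys=False).sample(n=1, random_state=1)
print(stratified)
```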



Data Cube Aggregation:

 Aggregate data into one view.

 Data cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting
OLAP/DM.
 Data cubes created at varying levels of abstraction are often referred to as cuboids.
 The cube created at the lowest level of abstraction is the base cuboid.
o Ex: Data regarding sales or customers.
 The cube created at the highest level of abstraction is the apex cuboid.
o Ex: Total sales for all 3 years, for all items.

Fig. Data Cube

Data Transformation

The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.

Data Transformation Strategies overview:

1. Smoothing: Performed to remove noise.


Ex: Binning, regression, clustering.
2. Attribute construction: New attributes are added to help mining process.
3. Aggregation: Data is summarized or aggregated.
Ex: Sales data is aggregated into monthly and annual sales. This step is used for constructing
data cubes.
4. Normalization: Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
5. Data Discretization: Where raw values are replaced by interval labels or conceptual labels.
Ex: Age

 Interval labels (10-18, 19-50)


 Conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: Attributes are generalized to higher-level
concepts.
Ex: Street is generalized to city or country.
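A short pandas sketch of normalization and discretization following the Age example (the exact cut points and the extra senior/51+ label are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [12, 17, 23, 35, 48, 64]})

# Normalization: rescale age into the smaller range [-1.0, +1.0].
a_min, a_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - a_min) / (a_max - a_min) * 2 - 1

# Discretization with interval labels.
df["age_interval"] = pd.cut(df["age"], bins=[9, 18, 50, 100],
                            labels=["10-18", "19-50", "51+"])

# Discretization with conceptual labels (concept hierarchy for age).
df["age_concept"] = pd.cut(df["age"], bins=[9, 18, 50, 100],
                           labels=["youth", "adult", "senior"])
print(df)
```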
