Unit 2 - Data Science BCA


Program Name: B.C.A Semester VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03
Contact Hours: 42 Hours
Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40
Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs. Data Mining, DBMS vs. Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concept, Frequent Item Set Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)

Unit 2
Topics:

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization.

Data Warehouse:

Def 1: “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-making process.”—W. H. Inmon (Father of the Data
Warehouse, American computer scientist)

Def 2: A centralized data repository that consolidates data from multiple sources.

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.
According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management’s decision making process”
Subject-oriented: A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data.

Data warehousing:

The process of constructing and using data warehouses, as shown in the following figure.

Fig 1.1: Data warehouse of a sales organization.


Difference between OLTP and OLAP:

Feature              OLTP                                    OLAP
users                clerk, IT professional                  knowledge worker
function             day-to-day operations                   decision support
DB design            application-oriented                    subject-oriented
data                 current, up-to-date; detailed,          historical; summarized,
                     flat relational; isolated               multidimensional; integrated, consolidated
usage                repetitive                              ad-hoc
access               read/write, index/hash on primary key   lots of scans
unit of work         short, simple transaction               complex query
# records accessed   tens                                    millions
# users              thousands                               hundreds
DB size              100 MB to GB                            100 GB to TB
metric               transaction throughput                  query throughput, response time

Data Warehousing: Three Tier Architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure below.

Fig. Three Tier Architecture of Data warehousing



 The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse.

 The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or

(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).

 The top tier is a front-end client layer , which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific groups of users. Its
scope is confined to specific, selected groups, such as marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.

Fig: A recommended approach for data warehouse development

Data Warehouse Modeling: Data Cube and OLAP


Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
o A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts. Fact tables contain numerical data, while dimension
tables provide context and background information.
- Dimension tables, such as item (item_name, brand, type) or time (day, week,
month, quarter, year), describe the entities about which the organization keeps records.
- The fact table contains numeric measures (such as dollars_sold, i.e., the sale amount
in dollars, and units_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is
typically denoted by ‘all’.
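As an illustration (not part of the original notes), the cuboids of a tiny 2-D sales cube can be sketched with pandas; the item/quarter data below are assumed values:

```python
import pandas as pd

# Hypothetical 2-D base cuboid: one row per (item, quarter) combination.
fact = pd.DataFrame({
    "item":         ["TV", "TV", "Phone", "Phone"],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [400, 520, 300, 350],
})

# 1-D cuboids: summarize along one dimension at a time.
by_item    = fact.groupby("item")["dollars_sold"].sum()
by_quarter = fact.groupby("quarter")["dollars_sold"].sum()

# 0-D (apex) cuboid: the highest level of summarization ("all").
apex = fact["dollars_sold"].sum()

print(by_item, by_quarter, apex, sep="\n\n")
```

Here the fact table itself plays the role of the base cuboid, each groupby result is a higher-level cuboid, and the grand total corresponds to the apex cuboid.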


The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.

Schemas for Multidimensional Data Models

Stars, Snowflakes, and Fact Constellations:

o Star schema: A fact table in the middle connected to a set of dimension tables.

o Snowflake schema: A refinement of the star schema in which some dimensional
hierarchies are normalized into a set of smaller dimension tables, forming a shape
similar to a snowflake.

o Fact constellation: Multiple fact tables share dimension tables; viewed as a
collection of stars, it is therefore also called a galaxy schema or fact constellation.
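A rough pandas sketch of a star schema (the table and column names are illustrative assumptions): one fact table in the middle holds the measures and foreign keys, and joins resolve those keys against the dimension tables.

```python
import pandas as pd

# Dimension tables: context/background information about item and time.
item_dim = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["TV", "Phone"],
    "brand":     ["Acme", "Zeta"],
})
time_dim = pd.DataFrame({
    "time_key": [10, 20],
    "quarter":  ["Q1", "Q2"],
    "year":     [2024, 2024],
})

# Fact table in the middle: numeric measures plus keys to each dimension.
sales_fact = pd.DataFrame({
    "item_key":     [1, 1, 2],
    "time_key":     [10, 20, 10],
    "dollars_sold": [400, 520, 300],
    "units_sold":   [4, 5, 6],
})

# Joining the fact table to its dimensions reproduces the star-shaped layout.
report = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(time_dim, on="time_key"))
print(report)
```

A snowflake schema would further normalize item_dim (e.g., splitting brand into its own table), while a fact constellation would add a second fact table (e.g., shipping) sharing these same dimension tables.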

OLAP Operations

o Roll up (drill-up): Summarizes or aggregates data

- by climbing up a concept hierarchy or by dimension reduction

- In the cube given in the overview section, the roll-up operation is performed by climbing up
in the concept hierarchy of the Location dimension (City -> Country).

o Drill down (roll down): In the drill-down operation, less detailed data is converted into more
detailed data. It can be done by:

- Moving down in the concept hierarchy

- Adding a new dimension

- In the cube given in the overview section, the drill-down operation is performed by moving
down in the concept hierarchy of the Time dimension (Quarter -> Month).

o Slice and dice: Slice performs a selection on a single dimension of the OLAP cube, which
results in the creation of a new sub-cube; dice performs a selection on two or more dimensions.
In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.

o Pivot (rotate):

- Reorients the cube for visualization, e.g., presenting a 3D cube as a series of 2D planes.

- It is also known as the rotation operation, as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing the
pivot operation gives a new view of it.
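The four operations can be imitated on a small cube held as a pandas DataFrame; the dimensions, hierarchies, and figures below are assumed for illustration only:

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"] * 2,
    "country": ["India"] * 8,
    "quarter": ["Q1", "Q2"] * 4,
    "month":   ["Jan", "Apr", "Feb", "May"] * 2,
    "item":    ["TV"] * 4 + ["Phone"] * 4,
    "sales":   [100, 120, 90, 110, 80, 95, 70, 85],
})

# Roll-up: climb the Location hierarchy (City -> Country).
rollup = sales.groupby(["country", "quarter"])["sales"].sum()

# Drill-down: move down the Time hierarchy (Quarter -> Month).
drilldown = sales.groupby(["city", "month"])["sales"].sum()

# Slice: fix a single dimension value (Time = "Q1") to get a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot (rotate): view the sliced sub-cube as item rows vs. city columns.
pivoted = slice_q1.pivot_table(index="item", columns="city",
                               values="sales", aggfunc="sum")
print(rollup, drilldown, pivoted, sep="\n\n")
```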
Data Cleaning, …

Today’s data are highly susceptible to noise, missing values, and inconsistencies due to their typically
huge size and their heterogeneous sources. Low-quality data will lead to poor mining results.

Different data preprocessing techniques (data cleaning, data integration, data reduction, data
transformation), when applied before data mining, improve the overall quality of the patterns mined
and reduce the time required for the actual mining.

Fig: Forms of Data Preprocessing

Data cleaning

The data cleaning stage smooths out noise, attempts to fill in missing values, removes outliers,
and corrects inconsistencies in the data.

1) Handling missing values:


i. Ignoring the tuple: Used when the class label is missing. This method is not very effective
when many missing values are present.
ii. Fill in the missing value manually: It is time consuming.
iii. Use a global constant to fill the missing value: Ex: "unknown" or ∞
iv. Use the attribute mean to fill the missing value
v. Use the attribute mean for all samples belonging to the same class as the given tuple
vi. Use the most probable value to fill the missing value (e.g., using a decision tree)
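A minimal pandas sketch of several of these strategies (the column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B"],
    "income": [30000, np.nan, 52000, np.nan],
})

# (i) Ignore (drop) tuples with missing values.
dropped = df.dropna()

# (iii) Fill with a global constant / sentinel value.
constant_filled = df.fillna({"income": -1})

# (iv) Fill with the overall attribute mean.
mean_filled = df.fillna({"income": df["income"].mean()})

# (v) Fill with the attribute mean of the tuple's own class.
class_mean_filled = df.copy()
class_mean_filled["income"] = (df.groupby("class")["income"]
                                 .transform(lambda s: s.fillna(s.mean())))
print(class_mean_filled)
```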

2) Noisy data: Noise is a random error or variance in a measured variable.



Different methods for smoothing are:

1. Binning: Smooths sorted data by consulting its neighborhood. The values are
distributed into buckets/bins. Binning performs local smoothing.

Different binning methods for data smoothing:

i. Smoothing by bin means: Each value in a bin is replaced by the bin mean.
Ex: BIN 1: 4, 8, 15 -> BIN 1: 9, 9, 9
ii. Smoothing by bin boundaries: The minimum and maximum values of the bin are identified, and
each value is replaced by the closest boundary value.
Ex: BIN 1: 4, 8, 15 -> BIN 1: 4, 4, 15

2. Regression: Data smoothing can also be done by regression (linear regression, multiple
linear regression), in which one attribute is used to predict the value of another.
3. Outlier analysis: Outliers can be detected by clustering. Values that fall outside the clusters are
outliers.
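The two binning variants above can be sketched in plain Python, using the BIN 1 example values:

```python
# Smoothing a sorted bin such as BIN 1: [4, 8, 15].
bin1 = [4, 8, 15]

# Smoothing by bin means: every value becomes the bin mean.
mean = sum(bin1) / len(bin1)                 # (4 + 8 + 15) / 3 = 9
by_means = [mean] * len(bin1)                # [9.0, 9.0, 9.0]

# Smoothing by bin boundaries: replace each value with the closer of
# the bin's minimum and maximum.
lo, hi = min(bin1), max(bin1)
by_boundaries = [lo if v - lo <= hi - v else hi for v in bin1]  # [4, 4, 15]

print(by_means, by_boundaries)
```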

Data Integration

Data mining often works on integrated data from multiple repositories. Careful integration helps
improve the accuracy of data mining results.

Challenges of DI

1. Entity Identification Problem:


“How to match schema and objects from many sources?” This is called Entity
Identification Problem.
Ex: Cust-id in one table and Cust-no in another table.
Metadata helps in avoiding these problems.
2. Redundancy and correlation analysis:
Redundancy means repetition.
Some redundancies can be detected by correlation analysis. Given two attributes, correlation
analysis tells how strong the relationship between them is (the chi-square test and the
correlation coefficient are examples); see the sketch below.
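A brief sketch of both checks with pandas and SciPy (the attribute names and values are made up; numeric attributes use the correlation coefficient, nominal attributes the chi-square test):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age":       [23, 35, 45, 52, 61],
    "income":    [25000, 40000, 52000, 60000, 72000],
    "gender":    ["M", "F", "F", "M", "F"],
    "preferred": ["online", "store", "store", "online", "store"],
})

# Pearson correlation coefficient for two numeric attributes:
# values near +1 or -1 suggest one attribute is largely redundant.
r = df["age"].corr(df["income"])

# Chi-square test of independence for two nominal attributes.
table = pd.crosstab(df["gender"], df["preferred"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"correlation(age, income) = {r:.3f}")
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```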

Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data.

Data Reduction Strategies:

1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.


2. Numerosity reduction:
Replace original data by alternate smaller forms.
Ex: Histograms, Sampling, Data cube aggregation.
3. Data compression:
Reduce the size of data.

Wavelet Transform:

The Discrete Wavelet Transform (DWT) is a linear signal processing technique that, when applied to a
data vector X, transforms it into a numerically different vector X' of the same length. The DWT is a
fast and simple transformation that can translate an image from the spatial domain to the frequency
domain.
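A small sketch of a one-level DWT using the PyWavelets package (the package choice and the sample vector are assumptions, not part of the notes):

```python
import pywt  # PyWavelets, assumed to be installed (pip install PyWavelets)

# A small data vector X.
X = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]

# One-level discrete wavelet transform with the Haar wavelet:
# cA holds the (smooth) approximation coefficients, cD the detail coefficients.
cA, cD = pywt.dwt(X, "haar")

# The transform is invertible; keeping only the larger coefficients and
# zeroing the rest would give a compressed, approximate representation.
reconstructed = pywt.idwt(cA, cD, "haar")
print(cA, cD, reconstructed, sep="\n")
```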

Principal Components Analysis (PCA)

PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
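A minimal scikit-learn sketch (the library choice and toy data are assumptions) that projects a small data set onto its first two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 samples with 4 correlated attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2 + rng.normal(scale=0.1, size=(6, 2))])

# Keep only the 2 components that capture the major trends in the data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (6, 2): fewer variables
print(pca.explained_variance_ratio_)     # share of variance retained
```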

Attribute Subset Selection:

The data set for analysis may consist of many attributes that are irrelevant to the mining task (e.g., a
telephone number may not be important when classifying customers). Attribute subset selection reduces
the data set by removing irrelevant attributes.

Some heuristic methods for attribute subset selection are:

1. Stepwise forward selection:


 Start with an empty set of attributes.
 The best of the original attributes is added to the reduced set.
 At each subsequent iteration, the best of the remaining attributes is added.
2. Stepwise backward elimination:
 Start with full set of attributes
 At each step, remove the worst attribute remaining in the set.
3. Combination of forward selection & backward selection:
 Combined method
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision Tree Induction:
In decision tree induction, a tree is constructed from the given data. All attributes that do not appear
in the tree are assumed to be irrelevant.
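Stepwise forward selection can be sketched with scikit-learn's SequentialFeatureSelector; the estimator and data set below are illustrative choices, not prescribed by the notes:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Start from an empty attribute set and greedily add the best remaining
# attribute at each step until 2 attributes have been selected.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",     # use "backward" for stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())   # mask of the retained attributes
```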

Histograms:

A histogram is a frequency plot. It uses bins/buckets to approximate data distributions and is a
popular form of data reduction. Histograms are highly effective at approximating both sparse and dense
data, as well as skewed and uniform data.

The following data are a list of AllElectronics prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30. The figure below shows the histogram for this data.

Fig: Histogram for AllElectronics
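A rough matplotlib sketch of how such a histogram could be produced from the price list above (the equal-width buckets of $5 are an assumption):

```python
import matplotlib.pyplot as plt

# AllElectronics prices from the example above (sorted, rounded to dollars).
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width 5 approximate the price distribution.
plt.hist(prices, bins=range(0, 35, 5), edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items sold")
plt.title("Histogram for AllElectronics prices")
plt.show()
```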

Clustering:

Clustering partitions data into clusters/groups of objects that are similar/close. In data reduction,
the cluster representations of the data are used to replace the actual data.

Sampling:

Sampling is used as a data reduction technique in which a large data set is represented by a much
smaller random sample (subset).

Common ways to sample:

i. Simple random sample without replacement of size s (SRSWOR)

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.

ii. Simple random sample with replacement (SRSWR)

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

iii. Cluster sample


The tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M.

iv. Stratified sample

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In this way, the age group having the
smallest number of customers will be sure to be represented.

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s, as opposed to N, the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.

Fig. Sampling Techniques
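The four sampling schemes can be roughly sketched with pandas (the data set, sample size s, and the age-group column used as cluster/stratum are hypothetical):

```python
import pandas as pd

D = pd.DataFrame({
    "cust_id":   range(1, 13),
    "age_group": ["youth", "adult", "senior"] * 4,
})
s = 4

# SRSWOR: s tuples drawn without replacement, each equally likely.
srswor = D.sample(n=s, random_state=1)

# SRSWR: tuples are replaced after drawing, so duplicates may occur.
srswr = D.sample(n=s, replace=True, random_state=1)

# Cluster sample: group tuples into disjoint clusters, then take an SRS
# of whole clusters (here, 1 of the 3 age-group "clusters").
chosen = pd.Series(D["age_group"].unique()).sample(n=1, random_state=1)
cluster_sample = D[D["age_group"].isin(chosen)]

# Stratified sample: an SRS is drawn inside every stratum (age group),
# so even the smallest group is represented (needs a recent pandas).
stratified = D.groupby("age_group", group_keys=False).sample(n=1, random_state=1)
print(stratified)
```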



Data Cube Aggregation:

 Aggregate data into one view.

 Data cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting
OLAP/DM.
 Data cubes created at varying levels of abstraction are often referred to as cuboids.
 The cube created at the lowest level of abstraction is the base cuboid.
o Ex: Data regarding sales or customers.
 The cube created at the highest level of abstraction is the apex cuboid.
o Ex: Total sales for all 3 years, for all items.

Fig. Data Cube

Data Transformation

The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.

Data Transformation Strategies overview:

1. Smoothing: Performed to remove noise.


Ex: Binning, regression, clustering.
2. Attribute construction: New attributes are added to help mining process.
3. Aggregation: Data is summarized or aggregated.
Ex: Sales data is aggregated into monthly and annual sales. This step is used for constructing
data cubes.
4. Normalization: Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
5. Data Discretization: Where raw values are replaced by interval labels or conceptual labels.
Ex: Age

 Interval labels (10-18, 19-50)


 Conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: Attributes are generalized to higher-level
concepts.
Ex: Street is generalized to city or country.
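A short pandas sketch of normalization and discretization following the Age example (the exact cut points and the extra senior/51+ label are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [12, 17, 23, 35, 48, 64]})

# Normalization: rescale age into the smaller range [-1.0, +1.0].
a_min, a_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - a_min) / (a_max - a_min) * 2 - 1

# Discretization with interval labels.
df["age_interval"] = pd.cut(df["age"], bins=[9, 18, 50, 100],
                            labels=["10-18", "19-50", "51+"])

# Discretization with conceptual labels (concept hierarchy for age).
df["age_concept"] = pd.cut(df["age"], bins=[9, 18, 50, 100],
                           labels=["youth", "adult", "senior"])
print(df)
```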
