Unit 1 DWDM Complete
Data Mining
BCA VI
DATA WAREHOUSING
A Data Warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually
residing at a single site.
W. H. Inmon is widely regarded as the father of data warehousing.
A Data Warehouse is a subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of management's decisions.
Data warehouses generalize and consolidate data in
multidimensional space. The construction of data
warehouses involves data cleaning, data integration,
and data transformation, and can be viewed as an
important preprocessing step for data mining.
Data Warehouse vs. Database
Workloads
  Data warehouse: Analytical.
  Database: Transactional and operational.
Characteristics
  Data warehouse: It is subject-focused, since it provides information on a
  certain topic rather than information about a company's current activities.
  Database: Removes redundancy and offers security. It allows for numerous
  data views.
Schema flexibility
  Data warehouse: Fixed and pre-defined schema definition for ingest.
  Database: Flexible or rigid schema, based on the type of database.
Data Warehouse vs. Data Mart
Subject areas
  Data warehouse: It may hold multiple subject areas.
  Data mart: It holds only one subject area, for example Finance or Sales.
Schema
  Data warehouse: Fact constellation is used.
  Data mart: Star schema and snowflake schema are used.
The mechanism of extracting information from source systems and bringing
it into the data warehouse is commonly called ETL, which stands for
Extraction, Transformation and Loading.
The ETL process is technically challenging and requires active input from
various stakeholders, including developers, analysts, testers and top
executives.
Extraction
Extraction is the operation of taking data out of a source system for
further use in a data warehouse environment. This is the first step of
ETL.
● Loose texts may hide valuable information. For example, the text "XYZ PVT Ltd"
does not explicitly show what kind of company it refers to.
● Different formats can be used for individual data. For example, a date can be
saved as a string or as three integers.
● Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
● Matching that associates equivalent fields in different sources.
● Selection that reduces the number of source fields and records.
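The operations listed above can be sketched on a small example. This is only an illustration under stated assumptions: the two source record layouts, the field names (cust_name, customer, joined, height_in, height_cm, fax) and the helper functions to_date and transform are all hypothetical and do not come from any particular ETL tool.

```python
from datetime import date

# Hypothetical records from two source systems (illustrative data only).
source_a = [{"cust_name": "Asha", "joined": "2021-03-15", "height_cm": 170.0, "fax": "n/a"}]
source_b = [{"customer": "Ravi", "joined": (2020, 11, 2), "height_in": 68.0, "fax": "n/a"}]

# Matching: equivalent fields in different sources map to one target name.
FIELD_MAP = {"cust_name": "name", "customer": "name"}

# Selection: only these fields are kept for loading.
TARGET_FIELDS = ("name", "joined", "height_cm")


def to_date(value):
    """Conversion: accept an ISO string or a (year, month, day) triple."""
    if isinstance(value, str):
        return date.fromisoformat(value)
    year, month, day = value
    return date(year, month, day)


def transform(record):
    # Matching: rename source-specific field names.
    row = {FIELD_MAP.get(key, key): value for key, value in record.items()}
    # Conversion: bring dates into a single storage format.
    row["joined"] = to_date(row["joined"])
    # Normalization: bring lengths into a single unit (inches to centimetres).
    if "height_in" in row:
        row["height_cm"] = round(row.pop("height_in") * 2.54, 1)
    # Selection: drop every field that is not a target field.
    return {field: row[field] for field in TARGET_FIELDS}


rows = [transform(record) for record in source_a + source_b]
for row in rows:
    print(row)
```

After the transformation, both records share the same field names, the same date type and the same unit of measure, and the unused fax field has been dropped.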
ETL Process
Loading
Loading is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
1. Refresh: Data Warehouse data is completely rewritten. This means that older
data is replaced.
2. Update: Only the changes applied to source information are added to the
pre-existing data. This method is used in combination with incremental extraction to
update the data warehouse.
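The difference between the two loading modes can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production loader: the in-memory database, the sales table and the functions refresh, update and contents are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")


def refresh(conn, rows):
    """Refresh: discard the old contents and rewrite the whole table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", rows)


def update(conn, new_rows):
    """Update: add only the new rows; existing rows are left untouched."""
    with conn:
        conn.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", new_rows)


def contents(conn):
    return conn.execute("SELECT sale_id, amount FROM sales ORDER BY sale_id").fetchall()


refresh(conn, [(1, 100.0), (2, 250.0)])  # initial population
update(conn, [(3, 75.0)])                # incremental load of one new row
after_update = contents(conn)

refresh(conn, [(10, 40.0)])              # full rewrite: older rows are gone
after_refresh = contents(conn)

print(after_update)
print(after_refresh)
```

Wrapping each load in a single transaction is one way to address the requirement that the load be performed correctly: if an insert fails, the table is rolled back to its previous state instead of being left half-written.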
Dimension Table
● A dimension table contains dimensions of a fact.
● It is joined to the fact table via a foreign key.
● Dimension tables are de-normalized tables.
● The Dimension Attributes are the various columns in a
dimension table
● Dimensions offer descriptive characteristics of the facts
with the help of their attributes
● There is no set limit on the number of dimensions
● The dimension can also contain one or more hierarchical
relationships
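The relationship between a dimension table and a fact table can be sketched with Python's built-in sqlite3 module. The table and column names (dim_store, fact_sales, store_key and so on) and the sample rows are hypothetical, chosen only to illustrate the points above: descriptive attributes, a hierarchy kept in one de-normalized table, and a foreign-key join from the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes, including a hierarchy
    -- (city, state, country) kept de-normalized in a single table.
    CREATE TABLE dim_store (
        store_key  INTEGER PRIMARY KEY,
        store_name TEXT,
        city       TEXT,
        state      TEXT,
        country    TEXT
    );
    -- Fact table: a numeric measure plus a foreign key to the dimension.
    CREATE TABLE fact_sales (
        sale_id   INTEGER PRIMARY KEY,
        store_key INTEGER REFERENCES dim_store (store_key),
        amount    REAL
    );
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?, ?, ?)", [
    (1, "Store A", "Pune", "Maharashtra", "India"),
    (2, "Store B", "Mumbai", "Maharashtra", "India"),
    (3, "Store C", "Jaipur", "Rajasthan", "India"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    (1, 1, 100.0), (2, 2, 50.0), (3, 3, 70.0), (4, 1, 30.0),
])

# Join facts to the dimension through the foreign key, then roll the
# measure up one level of the hierarchy (store to state).
totals_by_state = conn.execute("""
    SELECT d.state, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_key = f.store_key
    GROUP BY d.state
    ORDER BY d.state
""").fetchall()
print(totals_by_state)
```

Because the hierarchy lives in the dimension table, the same fact rows can be grouped by city, state or country simply by changing the column named in GROUP BY.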
Types of Dimensions in Data Warehouse
Disadvantages
1. Time consuming with larger groups.
2. Influenced by the quality of observations and note taking. Observers
must be trained and use good instruments to record what they observe.
3. The results of one observation cannot be generalized to other
observations (individual performances). More observations are therefore
needed to confirm how other employees perform.
4. Can be difficult to set up (e.g. permissions or scheduling).
5. Being observed can change how some perform so that what is
observed does not reflect typical performance.
6. Some may refuse to be observed or be uncomfortable and resistant.
Document review / critical incident analysis
This method involves finding and reviewing documents, ranging
from letters of complaint and industry reports to policy documents or
more strategic ones, to better understand the problem.
The critical incident report typically describes an important event
that is a problem or that otherwise negatively affects the
organization. For example, reports about accidents or
emergencies.
Unless documents can’t be found or are clearly not useful, every
TNA should identify and review relevant documents. Whenever
possible, the information obtained from documents should be
validated and any serious difference between what different
sources report should be reconciled.
Advantages
1. Uses existing information.
2. Less influenced by changes or unforeseen circumstances.
3. Unobtrusive: no need to disrupt work underway.
4. Allows identifying job performance standards that can be applied locally.
5. Can provide leads to explore (people to interview, for example).
6. Can provide a historical perspective to better understand current events.
7. Considers both internal and external documents.
Disadvantages
1. Available documents are not always good sources of information. Better
documents may not be available (or shared).
2. Can be time consuming to review all documents.
Critical incident
Advantages
1. Can provide insight into the causes of problems.
2. Reports real events.
Disadvantages
1. Must be well reported to be useful. Bad reports can be misleading.
2. Can be difficult to analyze and understand after the fact.
3. May require consulting experts to confirm findings.
Interviews
Interviews are one-on-one conversations to explore ideas, opinions, values
or other points of view. Some interviews can be quite structured, use
specific questions and record answers in unequivocal terms (like yes/no).
Other interviews are more open and allow exploring issues as they arise.
Regardless of the approach used, it is essential to take good notes that truly
reflect the interview.
Interviews are particularly useful to:
• Investigate issues in depth.
• Explore ideas, opinions and attitudes.
• Explore sensitive topics that some may not want to discuss in public.
Interviews alone are not always effective for exploring issues that affect larger
groups. Samples can be used instead if they represent the population well.
For example, interviewing 5 employees who represent well the
characteristics of a larger group of 20 may be enough to identify important
issues.
Advantages
1. Allows for face-to-face contact and observing behavior.
2. Allows exploring and clarifying opinions, or dealing with the
unexpected.
3. Helps engage participants in the TNA (Training Need Analysis)
process.
4. Helps explore / confirm other data / information (for example, the
information obtained from documents).
Disadvantages
1. Can be time consuming and depend on the availability of individuals.
2. Individuals can’t always identify or express true needs.
3. Some may use this opportunity to vent frustrations or discuss other
issues.
4. Interviewers must be skilled and well prepared.
5. Interviewing many can be time consuming and expensive.
6. Requires careful sampling when dealing with a large population.
7. Interviewers sometimes ‘take over’ and negatively affect the interview.
Focus groups
6. Observer: The observer will observe each JAD session, gather
knowledge of end-user needs and of JAD session
decisions, interact with JAD participants outside JAD sessions