Unit-1.2 ETL Tools

The document outlines the ETL process, consisting of extraction, transformation, and loading stages, which ensures data accuracy and readiness for analysis in data warehouses. It emphasizes the importance of defining business objectives and gathering requirements through interviews and JAD sessions, focusing on understanding user needs and business dimensions. Additionally, it introduces the concept of information packages for capturing data warehouse requirements and highlights the significance of documenting data sources, transformation, storage, and delivery methods.

Uploaded by

kirpabajaj2

ETL TOOLS
ETL Process
• Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
• Transform: In this stage, the extracted data is transformed into a format that is suitable
for loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data
fields.
• Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.
• The ETL process is an iterative process that is repeated as new data is added to the
warehouse.
• The process is important because it ensures that the data in the data warehouse is
accurate, complete, and up-to-date.
• It also helps to ensure that the data is in the format required for data mining and
reporting.
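The three stages above can be sketched end to end in a few lines of Python. This is only a minimal illustration under stated assumptions: the CSV content, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a flat file with inconsistent country names.
SOURCE_CSV = """order_id,country,amount
1,U.S.A,10.50
2,United States,20.00
3,India,5.25
"""

def extract(text):
    """Extract: read rows from the source and hold them in a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(staged_rows):
    """Transform: standardize country names and convert data types."""
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    return [
        (int(r["order_id"]),
         country_map.get(r["country"], r["country"]),
         float(r["amount"]))
        for r in staged_rows
    ]

def load(conn, rows):
    """Load: create the physical table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract(SOURCE_CSV)))

# A simple report over the loaded data.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping each stage as its own function mirrors the slide: each stage can be rerun or replaced independently as new data arrives.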
Extraction
• The first step of the ETL process is extraction.
• In this step, data is extracted from various source systems, which may be in formats such as
relational databases, NoSQL stores, XML, and flat files, and placed in the staging area.
• It is important to store the extracted data in the staging area first rather than loading it
directly into the data warehouse, because the extracted data arrives in various formats and
may be corrupted.
• Loading such data directly into the data warehouse may damage the warehouse, and rolling
back would be much more difficult.
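A rough sketch of why a staging area helps is given here. It assumes hypothetical tables `staging_orders` and `warehouse_orders` in an in-memory SQLite database. Raw rows land in staging as text, are validated there, and only clean rows are loaded into the warehouse table inside one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id TEXT, amount TEXT)")
conn.execute(
    "CREATE TABLE warehouse_orders "
    "(order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

# Raw extracted rows go to staging as text; one of them is corrupted.
extracted = [("1", "10.5"), ("2", "not-a-number"), ("3", "7.0")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", extracted)
conn.commit()

def is_valid(order_id, amount):
    """Accept a row only if both fields parse cleanly."""
    try:
        int(order_id)
        float(amount)
        return True
    except ValueError:
        return False

staged = conn.execute("SELECT order_id, amount FROM staging_orders").fetchall()
good = [(int(o), float(a)) for o, a in staged if is_valid(o, a)]
rejected = [(o, a) for o, a in staged if not is_valid(o, a)]

# "with conn" commits on success and rolls back if an exception is raised,
# so a failed load leaves the warehouse table unchanged.
with conn:
    conn.executemany("INSERT INTO warehouse_orders VALUES (?, ?)", good)

loaded = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
print(loaded, rejected)
```

The corrupted row never reaches the warehouse table; it stays in staging where it can be inspected or corrected.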
Transformation
• The second step of the ETL process is transformation.
• In this step, a set of rules or functions is applied to the extracted data to convert it into a
single standard format.
• It may involve the following processes/tasks:
➢ Filtering – loading only certain attributes into the data warehouse.
➢ Cleaning – filling NULL values with default values, mapping U.S.A,
United States, and America to USA, etc.
➢ Joining – joining multiple attributes into one.
➢ Splitting – splitting a single attribute into multiple attributes.
➢ Sorting – sorting tuples on the basis of some attribute (generally the key attribute).
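The five tasks listed above can each be shown on a tiny record set. The field names (`first`, `last`, `dob`, `tmp`, and so on) and the default value "UNKNOWN" are assumptions made for this sketch only.

```python
# Hypothetical extracted records sitting in the staging area.
rows = [
    {"id": 2, "first": "Ana", "last": "Diaz", "country": "United States",
     "phone": None, "dob": "1990-04-12", "tmp": "x"},
    {"id": 1, "first": "Raj", "last": "Singh", "country": "U.S.A",
     "phone": "555-0100", "dob": "1985-11-03", "tmp": "y"},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(row):
    # Splitting: one attribute (dob) becomes three attributes.
    year, month, day = row["dob"].split("-")
    return {
        # Filtering: only selected attributes are kept ("tmp" is dropped).
        "id": row["id"],
        # Joining: two attributes become one.
        "name": f'{row["first"]} {row["last"]}',
        # Cleaning: map spelling variants to one standard value.
        "country": COUNTRY_MAP.get(row["country"], row["country"]),
        # Cleaning: replace NULL with a default value.
        "phone": row["phone"] or "UNKNOWN",
        "birth_year": int(year),
        "birth_month": int(month),
        "birth_day": int(day),
    }

# Sorting: order the transformed tuples by the key attribute.
result = sorted((transform(r) for r in rows), key=lambda r: r["id"])
for r in result:
    print(r)
```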
Loading
• The third and final step of the ETL process is loading.
• In this step, the transformed data is finally loaded into the data warehouse.
• Sometimes the data warehouse is updated very frequently, and sometimes loading is done at
longer but regular intervals.
• The rate and period of loading depend solely on the requirements and vary from system
to system.
• The ETL process can also use pipelining: as soon as some data is extracted, it can be
transformed, and during that period new data can be extracted.
• And while the transformed data is being loaded into the data warehouse, the already
extracted data can be transformed.
• ETL Tools: Commonly used ETL tools include Hevo, Sybase, Oracle Warehouse
Builder, CloverETL, and MarkLogic.
• Data Warehouses: Commonly used data warehouses include Snowflake, Redshift,
BigQuery, and Firebolt.
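The pipelining idea described on this slide (extraction, transformation, and loading overlapping in time) can be sketched with three threads connected by queues. This is one possible illustration, not how any particular ETL tool works; the record values, the "multiply by 10" transformation, and the queue sizes are arbitrary assumptions.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream

def extract(out_q):
    """Stage 1: push hypothetical source records downstream one at a time."""
    for record in range(1, 6):
        out_q.put(record)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    """Stage 2: transform each record as soon as it arrives."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(item * 10)

def load(in_q, warehouse):
    """Stage 3: append each transformed record to the target."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        warehouse.append(item)

# Small bounded queues let the stages overlap without unbounded buffering.
extracted_q = queue.Queue(maxsize=2)
transformed_q = queue.Queue(maxsize=2)
warehouse = []

stages = [
    threading.Thread(target=extract, args=(extracted_q,)),
    threading.Thread(target=transform, args=(extracted_q, transformed_q)),
    threading.Thread(target=load, args=(transformed_q, warehouse)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()

print(warehouse)
```

Because each stage is a single thread and the queues are first-in, first-out, the records reach the target in their original order.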
Defining Business objectives
• A data warehouse is an information delivery system focused on solving users' problems
and providing strategic information.
• During the requirements definition phase, the emphasis should be on understanding
what information users need rather than on the technical methods for providing it.
• Unlike operational systems, data warehouse systems are designed for information delivery,
which requires a shift in mindset during the project phases.
a) Usage of Information Unpredictable
• Building an operational system for order processing in a company requires gathering
requirements from the Order Processing department users.
• Users provide precise details about functions, data presentation on GUI, and reports
needed for the order processing application.
• Users can articulate their requirements for operational systems clearly due to familiarity
with daily work processes.
• In contrast, defining requirements for a data warehousing system is challenging as users
are generally unable to specify the information they want or how to use it.
• Analysts in a data warehouse project face a dilemma in defining requirements when users
cannot express their needs clearly and precisely.
• Generalities such as industry practices are not sufficient to determine detailed
requirements for a data warehouse.
b) Dimensional Nature of Business Data
• If users of the data warehouse think in terms of business dimensions for decision making,
you should also think of business dimensions while collecting requirements.
• Although the actual proposed usage of a data warehouse could be unclear, the business
dimensions used by the managers for decision making are not nebulous at all.
• The users will be able to describe these business dimensions to you.
• You are not totally lost in the process of requirements definition.
• You can find out about the business dimensions.
INFORMATION PACKAGES: A NEW CONCEPT
• A new methodology based on business dimensions is proposed for gathering and
recording data warehouse requirements.
• The new methodology involves creating information packages that capture basic
measurements and relevant business dimensions for specific subjects.
• An information package example for analyzing sales includes measurements like actual,
forecast, and budget sales, with business dimensions such as time, location, product, and
demographic age group.
• Each business dimension has a hierarchy or levels, and the information package diagram
depicts these components.
• The primary goal in the requirements definition phase is to compile information
packages for all data warehouse subjects.
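One way to picture an information package is as a small data structure holding the subject, its measurements, and its business dimensions with their hierarchy levels. The class names below are hypothetical. The measurements, dimension names, and the time hierarchy (year, quarter, month) follow the slides; the levels shown for the other dimensions are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str
    hierarchy: list  # levels, ordered from most summarized to most detailed

@dataclass
class InformationPackage:
    subject: str
    measurements: list
    dimensions: list = field(default_factory=list)

    def dimension_names(self):
        return [d.name for d in self.dimensions]

sales_package = InformationPackage(
    subject="Sales analysis",
    measurements=["actual sales", "forecast sales", "budget sales"],
    dimensions=[
        Dimension("time", ["year", "quarter", "month"]),
        # The levels below are assumptions for illustration only.
        Dimension("location", ["country", "region", "city"]),
        Dimension("product", ["product category", "product line", "product"]),
        Dimension("demographic age group", ["age group"]),
    ],
)

print(sales_package.dimension_names())
```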
Information packages help
• Define the common subject areas
• Design key business metrics
• Decide how data must be presented
• Determine how users will aggregate or roll up
• Decide the data quantity for user analysis or query
• Decide how data will be accessed
• Establish data granularity
• Estimate data warehouse size
• Determine the frequency for data refreshing
• Ascertain how information must be packaged
Business Dimensions
• Business dimensions are fundamental to the new methodology for requirements
definition in data warehousing.
• Data must be stored to accommodate business dimensions, and their hierarchical levels
serve as the foundation for subsequent phases.
• The process involves identifying and selecting a proper set of dimensions related to
the measurements.
Examples of Business Dimensions
• For an automobile manufacturer, business dimensions for sales analysis include product,
dealer, customer demographics, method of payment, and time.
• For a hotel chain analyzing occupancy, critical business dimensions are hotel, room
type, and time.
• In both examples, the goal is to build a data warehouse that allows users to analyze the
subject (sales or hotel occupancy) from various perspectives.
• The chosen dimensions (product, dealer, customer demographics, method of payment,
hotel, room type, and time) become the basis for information packages in the data
warehouse.
Dimension Hierarchies/Categories
• Users analyzing measurements along a business dimension prefer to view numbers
initially in summary and then at different levels of detail.
• The user typically traverses hierarchical levels of a business dimension to access details at
various levels.
• For example, in the time dimension hierarchy (such as year, quarter, and month), a user
may first view total sales for the entire year, then drill down to individual quarters, and
further down to individual months.
• Hierarchy in business dimensions serves as the basis for drilling down (going into more
detail) or rolling up (summarizing) during analysis.
• In the given example, the time dimension hierarchy provides paths for users to navigate
and explore data at different levels of granularity.
• Within each major business dimension there are categories of data elements that can also
be useful for analysis.
• In the time dimension, you may have a data element to indicate whether a particular
day is a holiday.
• This data element would enable you to analyze by holidays and see how sales on holidays
compare with sales on other days.
• Hierarchies and categories are included in the information packages for each dimension.
• For example, for the "Product" dimension, hierarchies and categories include model name,
model year, package styling, product line, product category, exterior color, interior color,
and first model year.
• Hierarchies and categories for other business dimensions in the auto sales analysis are
summarized as follows:
• Dealer: Dealer name, city, state, single brand flag, date of first operation
• Customer demographics: Age, gender, income range, marital status, household size,
vehicles owned, home value, own or rent
• Payment method: Finance type, term in months, interest rate, agent
• Time: Date, month, quarter, year, day of week, day of month, season, holiday flag.
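Rolling up and drilling down along the time hierarchy, and analyzing by the holiday flag, can be sketched as grouping the same facts by different sets of attributes. The sales figures and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales facts carrying time-dimension attributes.
facts = [
    {"year": 2023, "quarter": "Q1", "month": "Jan", "holiday": True,  "sales": 100},
    {"year": 2023, "quarter": "Q1", "month": "Jan", "holiday": False, "sales": 200},
    {"year": 2023, "quarter": "Q1", "month": "Feb", "holiday": False, "sales": 150},
    {"year": 2023, "quarter": "Q2", "month": "Apr", "holiday": False, "sales": 300},
]

def aggregate(rows, levels):
    """Sum sales grouped by the given dimension attributes."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["sales"]
    return dict(totals)

# Start at the summary level, then drill down one hierarchy level at a time.
by_year = aggregate(facts, ["year"])
by_quarter = aggregate(facts, ["year", "quarter"])
by_month = aggregate(facts, ["year", "quarter", "month"])

# A category (the holiday flag) gives another way to cut the same data.
by_holiday = aggregate(facts, ["holiday"])

print(by_year, by_quarter, by_month, by_holiday, sep="\n")
```

Reading the three results from `by_month` back up to `by_year` is the roll-up; reading them in the other direction is the drill-down.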
REQUIREMENTS GATHERING METHODS
• Two basic techniques are universally adopted for meeting with groups of people:
(1) interviews, one-on-one or in small groups;
(2) joint application development (JAD) sessions.
Interviews
• Two or three persons at a time
• Easy to schedule
• Good approach when details are intricate
• Some users are comfortable only with one-on-one interviews
• Need good preparation to be effective
• Always conduct preinterview research
• Also encourage users to prepare for the interview
Group Sessions
• Groups of twenty or fewer persons at a time
• Use only after getting a baseline understanding of the requirements
• Not good for initial data gathering
• Useful for confirming requirements
• Need to be very well organized
Interview Techniques
• Select and train the project team members conducting the interviews
• Assign specific roles for each team member (lead interviewer/scribe)
• Prepare list of users to be interviewed and prepare broad schedule
• List your expectations from each set of interviews
• Complete preinterview research
• Prepare interview questionnaires
• Prepare the users for the interviews
• Conduct a kick-off meeting of all users to be interviewed
Pre-interview research is important for the success of the interviews. Here is a list of some
key research topics:
• History and current structure of the business unit
• Number of employees and their roles and responsibilities
• Locations of the users
• Primary purpose of the business unit in the enterprise
• Relationship of the business unit to the strategic initiatives of the enterprise
• Secondary purposes of the business unit
• Relationship of the business unit to other units and to outside organizations
• Contribution of the business unit to corporate revenues and costs
• Company’s market
• Competition in the market
Types of questions to be asked in the interviews:
Current Information Sources
• Which operational systems generate data about important business subject areas?
• What are the types of computer systems that support these subject areas?
• What information is currently delivered in existing reports and online queries?
• What is the level of detail in the existing information delivery systems?
Subject Areas
• Which subject areas are most valuable for analysis?
• What are the business dimensions? Do these have natural hierarchies?
• What are the business partitions for decision making?
• Do the various locations need global information or just local information for decision
making? What is the mix?
• Are certain products and services offered only in certain areas?
Key Performance Metrics
• How is the performance of the business unit currently measured?
• What are the critical success factors and how are these monitored?
• How do the key metrics roll up?
• Are all markets measured in the same way?
Information Frequency
• How often must the data be updated for decision making? What is the time frame?
• How does each type of analysis compare the metrics over time?
• What is the timeliness requirement for the information in the data warehouse?
As initial documentation for the requirements definition, prepare interview write-ups using
this general outline:
1) User profile
2) Background and objectives
3) Information requirements
4) Analytical requirements
5) Current tools used
6) Success criteria
7) Useful business metrics
8) Relevant business dimensions
Adapting the JAD Methodology
• JAD is a joint process, with all the concerned groups getting together for a well-defined
purpose.
• It is a methodology for developing computer applications jointly by the users and the IT
professionals in a well-structured manner.
• JAD centers around discussion workshops lasting a certain number of days under the
direction of a facilitator.
• Under suitable conditions, the JAD approach may be adapted for building a data
warehouse.
JAD consists of a five-phased approach
Project Definition
• Complete high-level interviews
• Conduct management interviews
• Prepare management definition guide
Research
• Become familiar with the business area and systems
• Document user information requirements
• Document business processes
• Gather preliminary information
• Prepare agenda for the sessions

Preparation
• Create working document from previous phase
• Train the scribes
• Prepare visual aids
• Conduct presession meetings
• Set up a venue for the sessions
• Prepare checklist for objectives
JAD Sessions
• Open with review of agenda and purpose
• Review assumptions
• Review data requirements
• Review business metrics and dimensions
• Discuss dimension hierarchies and roll-ups
• Resolve all open issues
• Close sessions with lists of action items
Final Document
• Convert the working document
• Map the gathered information
• List all data sources
• Identify all business metrics
• List all business dimensions and hierarchies
• Assemble and edit the document
• Conduct review sessions
• Get final approvals
• Establish procedure to change requirements
REQUIREMENTS DEFINITION: SCOPE AND CONTENT
• The requirements definition document serves as the foundation for the next phases in the
system development life cycle.
• It prevents knowledge loss if team members leave the project, ensuring continuity.
• Formal documentation helps validate findings when reviewed with users.
• The content of the formal requirements definition document includes specific types of
information, which will be outlined in a suggested document structure.
Types of information the document must contain:
Data Sources
• Information about data sources is essential in the requirements definition document.
Typically, the document should include the following:
➢ Available data sources
➢ Data structures within the data sources
➢ Location of the data sources
➢ Operating systems, networks, protocols, and client architectures
➢ Data extraction procedures
➢ Availability of historical data
Data Transformation
• Once you have listed the data sources, determine how the source data must be transformed
into a form suitable for storage in the data warehouse.
• In your requirements definition document, include details of data transformation.
• This will necessarily involve mapping of source data to the data in the data warehouse.
• Indicate where the data about your metrics and business dimensions will come from.
• Describe the merging, conversion, and splitting that need to take place before moving the
data into the data warehouse.
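A source-to-target mapping of the kind described above can be recorded as a small table in which each warehouse column names its source fields and the conversion to apply. The sketch below is only one possible form; every field name and conversion rule in it is a hypothetical example.

```python
# Each target column maps to (source fields, conversion function).
# All names here are hypothetical.
MAPPING = {
    # Merging: two source fields feed one target column.
    "customer_name": (["first_name", "last_name"],
                      lambda first, last: f"{first} {last}".strip()),
    # Conversion: text cents become a numeric amount.
    "order_total": (["total_cents"], lambda cents: int(cents) / 100),
    # Splitting: one source field feeds two target columns.
    "order_year": (["order_date"], lambda d: int(d.split("-")[0])),
    "order_month": (["order_date"], lambda d: int(d.split("-")[1])),
}

def apply_mapping(source_row, mapping):
    """Build one warehouse row from one source row using the mapping."""
    return {
        target: convert(*(source_row[s] for s in sources))
        for target, (sources, convert) in mapping.items()
    }

source_row = {
    "first_name": "Ana",
    "last_name": "Diaz",
    "total_cents": "1250",
    "order_date": "2023-04-12",
}
target_row = apply_mapping(source_row, MAPPING)
print(target_row)
```

Writing the mapping down as data, rather than burying it in code, makes it easy to review with users and to include in the requirements definition document.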
Data Storage
• Interviews with users provide insights into the level of detailed data required for the data
warehouse.
• Information gathered helps in determining the number of data marts necessary to support
users effectively.
• Detailed metrics and business dimensions are identified through user interactions.
• Knowledge about the types of analyses users typically perform enables the identification
of necessary aggregations for storage in the data warehouse.
• The requirements definition document should encompass detailed information about
storage requirements.
• Preliminary estimates should be prepared for the amount of storage needed for both
detailed and summary data.
• Considerations include estimating the storage requirements for historical and archived
data within the data warehouse.
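A preliminary storage estimate is usually simple arithmetic: rows per day times bytes per row times days retained. The figures below are invented for illustration, and the sketch ignores indexes, compression, and other overheads that a real estimate would need to consider.

```python
def estimate_bytes(rows_per_day, bytes_per_row, days_retained):
    """Rough storage estimate: no indexes, compression, or overhead."""
    return rows_per_day * bytes_per_row * days_retained

THREE_YEARS = 365 * 3  # hypothetical retention period for historical data

# Hypothetical volumes for detailed and summary data.
detail = estimate_bytes(rows_per_day=100_000, bytes_per_row=200,
                        days_retained=THREE_YEARS)
summary = estimate_bytes(rows_per_day=1_000, bytes_per_row=100,
                         days_retained=THREE_YEARS)

total_gb = (detail + summary) / 10**9
print(f"detail:  {detail / 10**9:.2f} GB")
print(f"summary: {summary / 10**9:.2f} GB")
print(f"total:   {total_gb:.2f} GB")
```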
Information Delivery
Your requirements definition document must contain the following requirements on
information delivery to the users:
• Drill-down analysis
• Roll-up analysis
• Drill-through analysis
• Slicing and dicing analysis
• Ad hoc reports
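Slicing and dicing can be illustrated on a handful of fact rows. In this sketch, a slice fixes one dimension to a single value and a dice restricts several dimensions to sets of values. The rows, the dimension names, and the helper functions are hypothetical.

```python
# Hypothetical auto sales facts with three dimensions and one measurement.
facts = [
    {"product": "sedan", "state": "NY", "year": 2022, "sales": 10},
    {"product": "sedan", "state": "CA", "year": 2023, "sales": 20},
    {"product": "suv",   "state": "NY", "year": 2023, "sales": 30},
    {"product": "truck", "state": "CA", "year": 2023, "sales": 40},
]

def slice_rows(rows, dimension, value):
    """Slice: keep rows where one dimension equals a single value."""
    return [r for r in rows if r[dimension] == value]

def dice_rows(rows, **allowed):
    """Dice: keep rows where each named dimension is in an allowed set."""
    return [r for r in rows
            if all(r[dim] in values for dim, values in allowed.items())]

def total_sales(rows):
    return sum(r["sales"] for r in rows)

sales_2023 = total_sales(slice_rows(facts, "year", 2023))
ny_cars = total_sales(dice_rows(facts, product={"sedan", "suv"}, state={"NY"}))

print(sales_2023, ny_cars)
```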