Incorporating Data Warehouse Using SSIS
Abstract
Information is key to the success of many businesses today. The better a company manages
information about its customers, the better the business decisions it can make. As databases
continue to grow in size and complexity, there is a significant need to manage them effectively
and extract vital information from the data.
Many AI techniques and tools are being deployed to find hidden patterns and trends in data
and gain a competitive advantage. This project discusses some of these business intelligence
(BI) techniques and their implications for the business.
Introduction
This report justifies how a data warehouse implementation could enable business decision
making in the Quality Food retail store.
Task 1
As we can see, the store has four branches which store their data in a distributed environment,
using a Microsoft Access database as well as flat files.
This is the high-level architecture diagram of the current system:
The complete table schema of the existing OLTP system, with entities and attributes, is shown
below:
Sales (invoice): ProductID, Quantity, Total_Amount, DateTime
Customer: Customer_Name, Customer_Address, DOB, Gender
Product: ProductID, Product_Name, Product_Description, Product_category
Staff: StaffID, Staff_Name, Staff_Designation, Gender, Received_Training, Training_Date, Date_Join
Store: StoreID, StaffID, StoreName, StoreLocation, StoreSales
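For illustration, the existing OLTP tables could be declared in T-SQL roughly as follows. This is
a minimal sketch: the table and column names come from the ERD above, but the data types
are assumptions, since the original Access database does not specify them here.

    CREATE TABLE Product (
        ProductID            INT PRIMARY KEY,
        Product_Name         VARCHAR(100),
        Product_Description  VARCHAR(255),
        Product_category     VARCHAR(50)
    );

    -- Sales/invoice records reference products by ID.
    CREATE TABLE Sales (
        ProductID     INT REFERENCES Product(ProductID),
        Quantity      INT,
        Total_Amount  DECIMAL(10, 2),
        [DateTime]    DATETIME  -- bracketed because the column name clashes with the type name
    );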
As we can see from the ERD, the existing design of the OLTP database system has many
limitations which prevent it from offering valuable insights about sales. It does not show any
particular relationship between the customers, products, sales, staff or store demographics.
We will now improve on these aspects and design a data warehouse system that offers deep
insight into the relationships between customers, products, sales and staff.
Task 2.A
Approach
The proposed solution is the implementation of a data warehouse (DWH) system, which will
facilitate data analysis and data mining.
We will use the entity-relationship concept to design the database for the grocery store, and
then use that design to build the data warehouse and load the point-of-sale records into it.
By mining demographic data about customers, Quality Foods could develop products and
promotions that appeal to specific customer segments.
We will follow this flow to incorporate data warehouse methodologies into the Quality Food
grocery store. The first stage in designing a data warehouse is to design the logical model of
the warehouse schema based on the initial findings.
There are various database management systems from vendors such as Oracle, Microsoft
and Teradata. We will use Microsoft SQL Server: since the old data was stored in flat files and
an MS Access database, a Microsoft product will be the most suitable for a smooth ETL
process.
Task 2.B
We chose a snowflake schema for the design of the data warehouse system. In this schema
we store the sales data in a fact table that links the dimension tables, which in turn have
relationships among each other.
The data model illustrated in the following diagram shows the tables and relationships that
are to be used in the data warehouse:
[Data model diagram: FactSales at the centre, linked to DimProduct, DimCustomer,
DimProductSubcategory and DimGeography]
DWH Schema
The planned data warehouse logical schema consists of the following entities and attributes:
DimCustomer: CustomerKey, GeographyKey, CustomerLabel, Title, FirstName, MiddleName,
LastName, NameStyle, BirthDate, MaritalStatus, Suffix, Gender, EmailAddress, YearlyIncome,
TotalChildren, NumberChildrenAtHome, Education, Occupation, HouseOwnerFlag,
NumberCarsOwned, AddressLine1, AddressLine2, Phone, DateFirstPurchase, CustomerType,
CompanyName, ETLLoadID, LoadDate, UpdateDate

DimDate: DateKey, FullDateLabel, DateDescription, CalendarYear, CalendarYearLabel,
CalendarHalfYear, CalendarHalfYearLabel, CalendarQuarter, CalendarQuarterLabel,
CalendarMonth, CalendarMonthLabel, CalendarWeek, CalendarWeekLabel,
CalendarDayOfWeek, CalendarDayOfWeekLabel, FiscalYear, FiscalYearLabel, FiscalHalfYear,
FiscalHalfYearLabel, FiscalQuarter, FiscalQuarterLabel, FiscalMonth, FiscalMonthLabel,
IsWorkDay, IsHoliday, HolidayName, EuropeSeason, NorthAmericaSeason, AsiaSeason

DimGeography: GeographyKey, GeographyType, ContinentName, CityName,
StateProvinceName, RegionCountryName, Geometry, ETLLoadID, LoadDate, UpdateDate

DimProduct: ProductKey, ProductLabel, ProductName, ProductDescription,
ProductSubcategoryKey, Manufacturer, BrandName, ClassID, ClassName, StyleID, StyleName,
ColorID, ColorName, Size, SizeRange, UpdateDate

FactSales: SalesKey, CustomerKey, LoadDate, UnitsSold, Markup, Profit,
PurchaseCostPerUnit, PurchaseCost, UpdateDate
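As a hedged illustration of how this schema could be created in SQL Server, the following
T-SQL sketch defines the geography and customer dimensions and the fact table. The column
lists are abbreviated, and the data types and constraints are assumptions rather than part of
the original design:

    CREATE TABLE DimGeography (
        GeographyKey       INT PRIMARY KEY,
        GeographyType      NVARCHAR(50),
        ContinentName      NVARCHAR(50),
        CityName           NVARCHAR(100),
        StateProvinceName  NVARCHAR(100),
        RegionCountryName  NVARCHAR(100)
    );

    CREATE TABLE DimCustomer (
        CustomerKey   INT PRIMARY KEY,
        -- Snowflake: the customer's address is normalised out into DimGeography.
        GeographyKey  INT REFERENCES DimGeography(GeographyKey),
        FirstName     NVARCHAR(50),
        LastName      NVARCHAR(50),
        BirthDate     DATE,
        YearlyIncome  DECIMAL(12, 2)
    );

    CREATE TABLE FactSales (
        SalesKey             INT PRIMARY KEY,
        CustomerKey          INT REFERENCES DimCustomer(CustomerKey),
        UnitsSold            INT,
        PurchaseCostPerUnit  DECIMAL(10, 2),
        PurchaseCost         DECIMAL(10, 2),
        Markup               DECIMAL(10, 2),
        Profit               DECIMAL(10, 2)
    );

Because DimGeography hangs off DimCustomer rather than off the fact table directly, the
dimension tables form the branching structure that makes this a snowflake rather than a star
schema.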
There are various assumptions and reasons behind choosing the snowflake schema for the
grocery store data warehouse. Some of these are mentioned below:
1. Sales data is crucial for the data analysis. Since the sales data is directly correlated with
the products and customers, we decided to make the sales table the fact table, incorporating
all the measures important for data mining.
However, the transactional source system from which the data is extracted stores the
customer's address within the customer table. We decided to store that information separately
in a geography table, so that customers' geographical data can be analysed at a later stage.
This design is what makes our DWH schema a snowflake schema.
2. The transactional data stored the timestamp of each transaction in the invoice table. For
our data warehousing, the date and time of each transaction is a very important signal, so we
decided to create a separate dimension table for date and time: the DimDate table.
3. We decided to analyse customer transactions with respect to age and salary; for that we
needed the customer information alongside the invoice details, which contain information
about the purchased products.
Task 2.C
The Extraction, Transformation and Loading (ETL) process begins by extracting data from
the source database. The destination database is then populated on a database system such
as Oracle or SQL Server, which hosts the data warehouse.
Many vendors have produced their own ETL tools, such as Microsoft SQL Server Integration
Services, IBM Cognos, Informatica PowerCenter and SAS Data Integration Studio, to perform
the ETL tasks.
We will use an ETL tool that is well suited to communicating with different relational databases
and different file formats.
For this project we will use Microsoft SQL Server Integration Services (SSIS), which is part of
SQL Server 2012 and is developed in Visual Studio 2013.
6) At this stage, if you see an error on the OLE DB Destination component, it means there is
a data type mismatch between the source and destination tables.
To fix this, we use a Data Conversion component, which lets us convert the source columns
to the data types the destination expects.
7) Once we have completed these steps for all the tables that we need to bring into our
destination data source, we can start the transformation process.
8) We run the transformation process, and at this point we should see successful results;
the kind of source query involved is sketched below.
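Here the table and column names are assumed from the OLTP schema in Task 1 rather than
taken from the actual package. The OLE DB Source might extract the sales rows with a query
like this, with any remaining type mismatch handled by the Data Conversion component or an
explicit CAST:

    -- Extract sales rows from the OLTP source for loading into the warehouse.
    -- The CAST aligns the money column with the destination's DECIMAL type,
    -- avoiding the OLE DB Destination type-mismatch error from step 6.
    SELECT
        s.ProductID,
        s.Quantity                              AS UnitsSold,
        CAST(s.Total_Amount AS DECIMAL(10, 2))  AS PurchaseCost,
        s.[DateTime]                            AS TransactionDate
    FROM Sales AS s;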
Task 3
The entity-relationship (ER) model is a detailed, logical representation of the data. It eases
the development and implementation of relational databases.
Entity-relationship concepts are rules for interpreting, specifying and documenting logical data
requirements.
The initial model of the database was non-relational, so it was difficult to establish any
relationship between the customer and sales data.
That is why the ER concept was used in the design of this project. The database management
system used is SQL Server, a relational database management system (RDBMS) based on
the relational model introduced by E. F. Codd.
Task 4
A DWH is a methodology for efficiently storing large amounts of data, to which data mining
and business intelligence techniques can then be applied.
The legacy OLTP system was impractical for the new business requirements of the Quality
Food Store. They needed a system on which they could produce reports and analyse trends
and patterns to enhance their sales and store performance.
The legacy system could not store data for analytics: it was not capable of holding large
amounts of data, nor was it suited to querying large data sets and producing efficient search
results. The old system could not predict any short-term future trends. It was based on flat
files and an Access database, which are dated technologies whose limitations modern
database management software overcomes.
There was no historical data storage facility in the legacy system, meaning that data was
overwritten after a certain period of time. This was a huge loss of important data.
There was no way to capture user trends or buying and shopping preferences. The data was
highly normalised, which meant discarding the fine-grained detail that is critical in data
analysis.
A data warehouse implementation facilitates all of this. OLAP tools tend to work over historical
data that has accumulated over a long period of time. For the new data warehouse system,
redundant or "de-normalised" data would likely facilitate business intelligence applications, as
the example below illustrates.
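For instance, here is a hedged sketch of the kind of historical trend query the warehouse
enables. The names come from the schema proposed in Task 2.B; a DateKey column on
FactSales is assumed, since the DimDate rationale implies the fact table references it:

    -- Total units sold per continent and calendar year: a query the
    -- legacy OLTP system could not answer from its overwritten data.
    SELECT
        g.ContinentName,
        d.CalendarYear,
        SUM(f.UnitsSold) AS TotalUnitsSold
    FROM FactSales AS f
    JOIN DimCustomer  AS c ON c.CustomerKey  = f.CustomerKey
    JOIN DimGeography AS g ON g.GeographyKey = c.GeographyKey
    JOIN DimDate      AS d ON d.DateKey      = f.DateKey  -- assumed key
    GROUP BY g.ContinentName, d.CalendarYear
    ORDER BY g.ContinentName, d.CalendarYear;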
The schema recommended for the data warehouse takes these details of the new business
requirements of the Quality Food Store into account.
Task 5
For this task we extract information from the data using data mining tools. To do that, we first
have to analyse which signals of information we need to extract from the data.
We will analyse the data from various angles to find the problems specified in the case
study. Some signals of information could be:
For this assignment we will be making use of the Weka machine learning tool, which is a
collection of machine learning algorithms for data mining.
Note that it is the data to which we apply machine learning algorithms (using Weka in this
case) to discover valuable knowledge.
Now that we have the data from our data warehouse, we can start analysing it to identify
trends and patterns that will aid management in making promotion plans.
We will use the linear regression model in Weka to forecast the markup of a product based
on its units sold and purchase cost, which in turn gives us the estimated profit for that product.
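As a sketch of the general form Weka fits here (the coefficients w0 to w3 are learned from the
training data; the choice of predictors follows the description above):

    markUp = w0 + w1 * unitsSold + w2 * purchaseCost + w3 * purchaseCostPerUnit

A coefficient close to zero indicates that the corresponding attribute contributes little to the
markup; as the regression output below shows, that is what happens here for every predictor
except purchaseCostPerUnit.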
This model can then be used to calculate the markup for other products as well. This will give
us some insight into whether we should put a product on promotion or not.
We will pull the required data from the data warehouse and use it for analysis in WEKA.
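A hedged example of such an extract (the column names come from the FactSales table
defined earlier; whether to restrict the rows to a product sample is a choice, so no filter is
shown):

    -- Pull the training attributes for the regression from the warehouse.
    SELECT
        UnitsSold,
        PurchaseCost,
        PurchaseCostPerUnit,
        Markup
    FROM FactSales;

The result set can then be exported (for example as CSV) and converted to the ARFF format
that Weka reads.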
STEP 1 (Training)
Weka works by first training the algorithm on training data. Once the system is trained on the
initial training data, we can use test data to predict results. We have the initial training data in
an .arff file, as sketched below.
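Such a training file could look like the following; the relation and attribute names mirror the
warehouse columns, and the data rows are hypothetical illustrations, not actual figures from
the case study:

    @relation productSales
    @attribute unitsSold            numeric
    @attribute purchaseCost         numeric
    @attribute purchaseCostPerUnit  numeric
    @attribute markUp               numeric
    @data
    120, 480.0,  4.0, 20.0
    75,  300.0,  4.0, 20.0
    200, 1600.0, 8.0, 40.0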
We take a sample of 7 different products and use the units sold, purchase cost and purchase
cost per unit to find the markup for each product, so that we can then predict its profit.
We will use the product sales figures as a sample file to train the system, using the linear
regression model in Weka.
Now that the data has been chosen, we have to train the algorithm to build the model with
this data. Select the 'Use training set' option; this tells Weka to build our desired model using
this data as the training set.
Next, in the Classify tab, choose LinearRegression under the classifier functions.
Choose the dependent variable (the column we are looking to predict). We know this should
be the Markup, since that's what we're trying to determine for the product.
Right below the test options, there is a combo box which lets you choose the dependent
variable. The Markup column should be selected by default; if it is not, select it.
Now we are ready to create our model.
Click Start to begin the machine learning process and train the system with the data provided.
The output should look like the following.
Regression output:
markUp = 5 * purchaseCostPerUnit + 0
As we can see, Weka has calculated the formula for the markup for us. We can use this
information to predict the markup for any other product.
STEP 3 (Prediction)
Now we can use that model to predict the profit of other products. For a single product we
can simple use the above formula and calculate it's markup.
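For instance, for a hypothetical product with a purchase cost per unit of 4.0, the fitted model
gives:

    markUp = 5 * purchaseCostPerUnit + 0
           = 5 * 4.0
           = 20.0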
Results
Correlation coefficient 1
Mean absolute error 0
Root mean squared error 0
Total Number of Instances 7
Ignored Class Unknown Instances 1
Conclusion
Based on this estimated value, the sales price will be 80000 (1 x 80000) and therefore the
foreseeable net profit will be 64000 (80000 - 16000).
We also concluded that data mining strives to turn simple data into useful information by
creating models and rules. Our goal was to use these models and rules to predict future
behaviour, to improve the business, and to explain things which we might not otherwise be
able to explain with conventional analysis.