DW Lab Manual

Ex: No: 1a) Data Exploration with WEKA

Date:

Aim:

To explore and integrate the data for the analysis process, helping to understand the data and
prepare it for further analysis. WEKA (Waikato Environment for Knowledge Analysis) is a
popular open-source software for machine learning and data mining that provides a user-friendly
interface to perform these tasks.

Algorithm:

Step 1: Launch WEKA and open the "Explorer" tab

Step 2: Load the dataset

1. Click on the "Open file" button in the top-left corner.
2. Navigate to your dataset file (e.g., data.csv) and select it.
3. The dataset will be loaded into WEKA.

Step 3: Explore the dataset.

1. Switch to the "Preprocess" tab to explore the dataset's attributes and instances.
2. Click on the "Edit" button to examine the attributes and their types (numeric,
nominal, etc.).
3. Check for any missing values or outliers using the attribute statistics and histogram
shown in the "Preprocess" panel.
4. Use the "Visualize" tab to create scatter plots, histograms, or other visualizations to
understand attribute distributions and relationships.

Step 4: Preprocess the data

1. Handle any missing values or outliers as needed. You can use filters under the
"Preprocess" tab, such as "ReplaceMissingValues" for missing values and
"RemoveWithValues" or "InterquartileRange" for outliers.
2. Convert attribute types if required. For example, use the "NumericToNominal" filter to
convert numeric attributes to nominal.

Step 5: Integration of data (if applicable):

1. If you have multiple datasets to integrate, you can append or merge them with the
weka.core.Instances utility from the Simple CLI (see Ex: No: 1b).
2. Make sure the datasets have a common identifier column to match the records from
different sources.

Step 6: Save the preprocessed and integrated data:

Go back to the "Preprocess" tab, and click on the "Save" button to save the processed data in
ARFF format.

Step 7: Classify and analyze the data (optional):

1. Switch to the "Classify" tab to apply different classifiers on the preprocessed data.
2. Select a classifier from the list (e.g., Decision Trees, Naive Bayes) and set its
parameters if needed.
3. Click on the "Start" button to run the classifier on the data.
4. Analyze the results and performance metrics to gain insights into the data.

Step 8: Interpret the results and draw conclusions:

1. Based on the exploratory analysis and integration, draw conclusions about the dataset's
characteristics, patterns, and potential issues.
2. Determine if further data cleaning or preprocessing is necessary before applying machine
learning algorithms.

Step 9: Save the WEKA session (optional):

If you want to save your work for future use or reference, you can save the WEKA session by
clicking on the "Save" button in the top-right corner.

Output:
Result:

Thus the program for data exploration using WEKA was executed and verified
successfully.
Ex: No: 1b) Data Integration with WEKA

Aim:

To integrate data for the analysis process, helping to understand the data and prepare it for
further analysis, by appending two files and merging two files using WEKA. Through this
integration, the aim is to unlock the full potential of the combined data, empowering
organizations and researchers to derive valuable knowledge and actionable insights from their
data resources.

Algorithm:

Step 1: Launch WEKA and open the "Simple CLI" window

Step 2: Create your own datasets (training.arff and test.arff)

Step 3: Write the following command in the CLI window

java weka.core.Instances append h:/training.arff h:/test.arff > h:/append.arff

Step 4: The instances are appended successfully

Step 5: Write the following command in the CLI window

java weka.core.Instances merge h:/training.arff h:/test.arff > h:/merge.arff

Step 6: The instances are merged successfully

Step 7: Save and load the file for further usage

Step 8: Interpret the results and draw conclusions:

Step 9: Save the WEKA session (optional):

If you want to save your work for future use or reference, you can save the WEKA session by
clicking on the "Save" button in the top-right corner.

OUTPUT:
Test.arff code
@relation weather

@attribute outlook {sunny,overcast,rainy}

@attribute temperature real

@attribute humidity real

@attribute windy {TRUE,FALSE}

@attribute play {Yes,No}

@data

sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes

sunny,80,85,FALSE,No

rainy,85,89,FALSE,No

sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes

training.arff code

@relation weather

@attribute outlook {sunny,overcast,rainy}

@attribute temperature real

@attribute humidity real

@attribute windy {TRUE,FALSE}

@attribute play {Yes,No}

@data
sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes

sunny,80,85,FALSE,No

rainy,85,89,FALSE,No

sunny,90,89,FALSE,No

overcast,89,89,TRUE,Yes

sunny,80,99,FALSE,No

rainy,85,87,FALSE,No

Append Result

@relation weather

@attribute outlook {sunny,overcast,rainy}

@attribute temperature numeric

@attribute humidity numeric

@attribute windy {TRUE,FALSE}

@attribute play {Yes,No}

@data

sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes

sunny,80,85,FALSE,No

rainy,85,89,FALSE,No

sunny,90,89,FALSE,No

overcast,89,89,TRUE,Yes
sunny,80,99,FALSE,No

rainy,85,87,FALSE,No

sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes

sunny,80,85,FALSE,No

rainy,85,89,FALSE,No

Merge Result

@relation weather_weather

@attribute outlook2 {sunny,overcast,rainy}

@attribute temperature2 numeric

@attribute humidity2 numeric

@attribute windy2 {TRUE,FALSE}

@attribute play2 {Yes,No}

@attribute outlook {sunny,overcast,rainy}

@attribute temperature numeric

@attribute humidity numeric

@attribute windy {TRUE,FALSE}

@attribute play {Yes,No}

@data

sunny,85,89,FALSE,No,sunny,85,89,FALSE,No

overcast,85,89,TRUE,Yes,overcast,85,89,TRUE,Yes

sunny,80,85,FALSE,No,sunny,80,85,FALSE,No
rainy,85,89,FALSE,No,rainy,85,89,FALSE,No

sunny,80,85,FALSE,No,sunny,85,89,FALSE,No

rainy,85,89,FALSE,No,overcast,85,89,TRUE,Yes

Result:

Thus the program for data integration using WEKA was executed and verified
successfully.
Ex: No: 1c) Data Resampling Using Weka

Aim:

To resample the data into training and testing subsets as part of the analysis process, helping to
understand the data and prepare it for further analysis. WEKA (Waikato Environment for
Knowledge Analysis) is a popular open-source software for machine learning and data mining
that provides a user-friendly interface to perform these tasks.

Algorithm:

Step 1: Launch WEKA and open the "Explorer" tab

Step 2: Load the dataset

1. Click on the "Open file" button in the top-left corner.
2. Navigate to your dataset file (e.g., data.csv) and select it.
3. The dataset will be loaded into WEKA.

Step 3: Click the "Choose" button under Filter

Step 4: In the filter tree, expand "unsupervised" and open "instance"

Step 5: Under "instance", click "Resample"

Step 6: Hover the cursor over the Resample entry; it shows some information about the filter

Step 7: Click on the filter name to open its settings window. Set the noReplacement value to
True and the sampleSizePercent to 60, then click Apply

Step 8: The number of instances will be reduced; save the file as training.arff

Step 9: Open the Resample settings again, set the invertSelection value to True and the
sampleSizePercent to 60, then click Apply

Step 10: The number of instances will be reduced again; save the file as testing.arff

Step 11: Remove the class values in testing.arff, then save

Step 12: Choose the Classify tab, select the training set, and start the Random Tree algorithm

Step 13: Choose the testing set and apply the SMO function under the functions tab

Step 14: Right-click the SMO result, visualize the classifier errors, and save the file as
output.arff

Step 15: Then open the testing file and see the output

If you want to save your work for future use or reference, you can save the WEKA session by
clicking on the "Save" button in the top-right corner.
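
The same 60%/40% split can also be produced from the Simple CLI with the Resample filter. A minimal sketch, assuming the full dataset is stored as h:/data.arff (the paths and the seed value are illustrative only):

java weka.filters.unsupervised.instance.Resample -S 1 -Z 60 -no-replacement -i h:/data.arff -o h:/training.arff

java weka.filters.unsupervised.instance.Resample -S 1 -Z 60 -no-replacement -V -i h:/data.arff -o h:/testing.arff

The -Z option sets the sample size percent, -no-replacement corresponds to noReplacement = True, and -V inverts the selection so the remaining instances form the testing set.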

Output:
Ex: No: 2a) Apply WEKA tool for data validation

Aim:

To explore the data for the analysis process, helping to understand the data and prepare it for
further analysis. We are going to apply association and clustering algorithms on the dataset and
visualize the results.

Algorithm:

WEKA contains an implementation of association rule learning using the Apriori algorithm and
clustering using the SimpleKMeans algorithm; the output of both can be visualized.

Steps with Output:

1) Prepare an Excel file dataset and name it as “apriori.csv”.

2) Open WEKA Explorer and under Preprocess tab choose “apriori.csv” file.
3) The file now gets loaded in the WEKA Explorer.

4) Remove the Transaction field by checking its checkbox and clicking on Remove. Now save the
file as “aprioritest.arff”.
5) Go to the Associate tab. The apriori rules can be mined from here.
6) Click on Choose to set the support and confidence parameters. The various parameters that
can be set here are:

• “lowerBoundMinSupport” and “upperBoundMinSupport”, this is the support level


interval in which our algorithm will work.
• Delta is the increment in the support. In this case, 0.05 is the increment of support from
0.1 to 1.
• metricType can be “Confidence”, “Lift”, “Leverage” and “Conviction”. This tells us how
we rank the association rules. Generally, Confidence is chosen.
• numRules tells the number of association rules to be mined. By default, it is set as 10.
• significanceLevel depicts what is the significance of the confidence level.
7) The textbox next to the Choose button shows "Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1", which summarizes the parameter settings chosen for the algorithm.

8) Click on the Start button. The association rules are generated in the right panel. This panel
consists of two sections: the first shows the algorithm and the dataset chosen for the run, and
the second shows the Apriori run information.
Let us understand the run information in the right panel:

• The scheme used is Apriori.
• Instances and Attributes: The dataset has 6 instances and 4 attributes.
• Minimum support and minimum confidence are 0.4 and 0.9 respectively. Out of 6
instances, 2 instances are needed to meet the minimum support.
• The number of cycles performed for mining the association rules is 12.
• Three levels of large itemsets are generated: L(1), L(2), and L(3), with sizes 7, 11, and 5
respectively; the itemsets themselves are not ranked.
• The rules found are ranked. The interpretation of these rules is as follows:
• Butter T 4 => Beer F 4: means that out of 6 instances, 4 show that when butter is true,
beer is false. This gives a strong association. The confidence of this rule is 1.0.
Output
The association rules can be mined out using WEKA Explorer with Apriori Algorithm. This
algorithm can be applied to all types of datasets available in the WEKA directory as well as other
datasets made by the user. The support and confidence and other parameters can be set using the
Setting window of the algorithm.
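
For reference, the same mining run can be launched from the Simple CLI using the parameter string shown in step 7. A minimal sketch, assuming the dataset was saved as h:/aprioritest.arff (the path is illustrative only):

java weka.associations.Apriori -t h:/aprioritest.arff -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1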
b) K-means Clustering Implementation Using WEKA
Let us see how to implement the K-means algorithm for clustering using WEKA Explorer.

Steps:

1) Open WEKA Explorer and click on Open File in the Preprocess tab. Choose dataset
“vote.arff”.

2) Go to the “Cluster” tab and click on the “Choose” button. Select the clustering method as
“SimpleKMeans”.

3) Choose Settings and then set the following fields:

• Distance function as Euclidean
• The number of clusters is 6. With more clusters, the sum of squared error will reduce.
• Seed as 10.
Click on OK and start the algorithm.

4) Click on Start in the left panel. The algorithm displays its results on the white screen. Let us
analyze the run information:

• Scheme, Relation, Instances, and Attributes describe the property of the dataset and the
clustering method used. In this case, vote.arff dataset has 435 instances and 13 attributes.
• With the Kmeans cluster, the number of iterations is 5.
• The sum of the squared error is 1098.0. This error will reduce with an increase in the
number of clusters.
• The 6 final clusters with their centroids are represented in the form of a table. In our case,
the clusters contain 168, 47, 37, 122, 33, and 28 instances respectively.
• Clustered instances represent the number and percentage of total instances falling in the
cluster.
5) Choose “Classes to Clusters Evaluations” and click on Start.
The algorithm will assign the class label to the cluster. Cluster 0 represents Republicans and
Cluster 3 represents Democrats. The Incorrectly clustered instance is 39.77% which can be
reduced by ignoring the unimportant attributes.
6) To ignore the unimportant attributes, click on the "Ignore attributes" button and select the
attributes to be removed.
7) Use the “Visualize” tab to visualize the Clustering algorithm result. Go to the tab and click on
any box. Move the Jitter to the max.

• The X-axis and Y-axis represent the attribute.


• The blue color represents the class label democrat and the red color represents the class
label republican.
• Jitter is used to view Clusters.
• Click the box on the right-hand side of the window to change the x-coordinate attribute
and view the clustering with respect to other attributes.
Output
K means clustering is a simple cluster analysis method. The number of clusters can be set using
the setting tab. The centroid of each cluster is calculated as the mean of all points within the
clusters. With the increase in the number of clusters, the sum of square errors is reduced. The
objects within the cluster exhibit similar characteristics and properties. The clusters represent the
class labels.
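
For reference, the same clustering can also be launched from the Simple CLI. A minimal sketch, assuming vote.arff has been copied to drive h: (the path is illustrative only):

java weka.clusterers.SimpleKMeans -t h:/vote.arff -N 6 -S 10

Here -N sets the number of clusters and -S the seed, matching the settings chosen above.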

c) Implement Data Visualization Using WEKA


Data Visualization
The method of representing data through graphs and plots with the aim to understand data clearly
is data visualization.

There are many ways to represent data. Some of them are as follows:
1) Pixel-Oriented Visualization: Here the color of each pixel represents the corresponding
dimension value.

2) Geometric Representation: The multidimensional datasets are represented in 2D, 3D, and
4D scatter plots.
3) Icon-Based Visualization: The data is represented using Chernoff’s faces and stick figures.
Chernoff’s faces use the human mind’s ability to recognize facial characteristics and differences
between them. The stick figure uses 5 stick figures to represent multidimensional data.

4) Hierarchical Data Visualization: The datasets are represented using treemaps. A treemap
represents hierarchical data as a set of nested rectangles.
Data Visualization Using WEKA Explorer
Data Visualization using WEKA is done on the IRIS.arff dataset.

The steps involved are as follows:


1) Go to the Preprocess tab and open IRIS.arff dataset.
2) The dataset has 4 attributes and 1 class label. The attributes in this dataset are:

• Sepallength: Type - numeric
• Sepalwidth: Type - numeric
• Petallength: Type - numeric
• Petalwidth: Type - numeric
• Class: Type - nominal
3) To visualize the dataset, go to the Visualize tab. The tab shows the attributes plot matrix. The
dataset attributes are marked on the x-axis and y-axis while the instances are plotted. The box
with the x-axis attribute and y-axis attribute can be enlarged.

4) Click on a box of the plot to enlarge it. For example, x: petallength and y: petalwidth. The
class labels are represented in different colors.

• Class label- Iris-setosa: blue color


• Class label- Iris-versicolor: red
• Class label-Iris-virginica-green
These colors can be changed. To change the color, click on the class label at the bottom, and a
color window will appear.
5) Click on the instance represented by ‘x’ in the plot. It will give the instance details. For
example:

• Instance number: 91
• Sepallength: 5.5
• Sepalwidth: 2.6
• Petallength: 4.4
• Petalwidth: 1.2
• Class: Iris-versicolor
Some of the points in the plot appear darker than other points. These points represent 2 or more
instances with the same class label and the same value of attributes plotted on the graph such as
petalwidth and petallength.

The figure below represents a point with 2 instances of information.

6) The X and Y-axis attributes can be changed from the right panel in Visualize graph. The user
can view different plots.
7) The Jitter is used to add randomness to the plot. Sometimes the points overlap. With jitter, the
darker spots represent multiple instances.

8) To get a clearer view of the dataset and remove outliers, the user can select an instance from
the dropdown. Click on the “select instance” dropdown. Choose “Rectangle”. With this, the user
will be able to select points in the plot by plotting a rectangle.

9) Click on “Submit”. Only the selected dataset points will be displayed and the other points will
be excluded from the graph.
The figure below shows the points from the selected rectangular shape. The plot represents
points with only 3 class labels. The user can click on "Save" to save the dataset or "Reset" to
select another instance. The dataset will be saved as a separate .ARFF file.

Result:
Data visualization using WEKA is simplified with the help of the attribute plot matrix. The user
can view the data at any level of granularity. The attributes are plotted on the X-axis and Y-axis
while the instances are plotted against them. Points that represent multiple instances appear in
darker colors.
Ex: No: 3 Plan the Architecture for real time application
Date:

Aim:

To plan the architecture of a Data Warehouse using the schema method.

Algorithm:

Step 1: Collect and define requirements and use cases for the particular real-time data
warehouse

Step 2: Choose the technology stack for the data warehouse

Step 3: Plan the real-time data ingestion process for the data mart

Step 4: Data Processing and Transformation

Step 5: Data Storage

Step 6: Data Integration

Step 7: Data Virtualization

Step 8: Data Visualization and Analytics

Step 9: Security and Compliance, Monitoring and Alerting, High Availability and
Scalability, Data Archiving and Retention, Testing and Quality Assurance, Documentation
and Training, Continuous Improvement

Step 10: Follow all the above steps to plan the architecture of the Data Warehouse

Step 11: Create the design of the dimension and fact tables of the data warehouse in an Excel
sheet

Output:
Result:

Thus the architecture of the Data Warehouse was successfully planned and verified
for the real-time case study using Snowflake.
Ex: No: 4 Write the query for Schema definition

Aim:

To write queries for schema definition using SQL (Structured Query Language), a
programming language used as an interface to manage databases. The schema is an essential
element of SQL.

Algorithm:

Step 1: When objects have circular references, such as when we need to construct two tables, one
with a foreign key referencing the other table, creating schemas can be useful.

You can create a schema in SQL by following the below syntax.

Syntax

CREATE SCHEMA [schema_name] [AUTHORIZATION owner_name]

[DEFAULT CHARACTER SET char_set_name]

[PATH schema_name[, ...]]

[ ANSI CREATE statements [...] ]

[ ANSI GRANT statements [...] ];

Step 2: Example

CREATE SCHEMA STUDENT AUTHORIZATION STUDENT

CREATE TABLE DETAILS (IDNO INT NOT NULL,

NAME VARCHAR(40),

CLASS INTEGER)
The above query will create a schema named as STUDENT and with user STUDENT as the
owner of the schema. Further, the CREATE command will create the table named DETAILS
under the STUDENT schema.

Step 3: The ALTER SCHEMA statement is used to rename a schema or specify a new owner,
who must be a pre-existing database user.

Syntax for Altering a Schema:

ALTER SCHEMA schema_name [RENAME TO new_schema_name] [OWNER TO


new_user_name]

Here, new_schema_name refers to the name to which you want to rename the existing schema
and new_user_name refers to the new owner of the schema.

Example:

Suppose we want to rename the previously created schema STUDENT as STUDENT_DETAILS
and pass the ownership to the new user DAVID. The following query produces the desired result.

ALTER SCHEMA STUDENT RENAME TO STUDENT_DETAILS OWNER TO DAVID

Step 4: The DROP SCHEMA in SQL is used to delete all tables present in that particular
schema.

Syntax:

DROP SCHEMA <schema name>

Example:

If you want to delete the schema STUDENT_DETAILS, then use the following SQL query.

DROP SCHEMA STUDENT_DETAILS


Step 5:

The CREATE SCHEMA statement is used to create a new schema in the current database.

Syntax:
CREATE SCHEMA schema_name
[AUTHORIZATION owner_name]
GO
Example –
CREATE SCHEMA geeks_sch;
GO
To select a SQL Server SCHEMA:
To list all schemas in the current database, use the query shown below:
SELECT * FROM sys.schemas;
Result –
name schema_id principal_id

dbo 1 1

guest 2 2

INFORMATION_SCHEMA 3 4

sys 4 4

db_owner 16384 16384

db_accessadmin 16385 16385

db_securityadmin 16386 16386

db_ddladmin 16387 16387

db_backupoperator 16389 16389

db_datareader 16390 16390


db_datawriter 16391 16391

db_denydatareader 16392 16392

db_denydatawriter 16393 16393

Step 6: Alter schema:

ALTER is generally used to change the contents related to a table in SQL. In SQL Server,
ALTER SCHEMA is used to transfer the securables/contents from one schema to another within
the same database.

Syntax –
ALTER SCHEMA target_schema_name
TRANSFER [entity_type::]securable_name
Example –
A database named university has two schemas:
student and lecturer
If the marks table of the students has to be transferred to the lecturer schema, the query is
as follows –
ALTER SCHEMA lecturer
TRANSFER student.marks
This way, the marks table is transferred to the lecturer schema.

Step 7: Drop schema:

DROP SCHEMA is used when the schema and its related objects have to be completely removed
from the database, including its definition.

Syntax –
DROP SCHEMA [IF EXISTS] schema_name
IF EXISTS is optional; it is used when a user wants to check whether the schema actually exists
in the database before dropping it. schema_name is the name of the schema in the database.
Example –
DROP SCHEMA IF EXISTS student
• student is a schema that is actually present in the university database.
• The schema is dropped from the database along with its definition.
Step 8: object_name is the name of the object that will be moved to the target_schema_name.

Note : SYS or INFORMATION_SCHEMA cannot be altered.


Example :
Let us create table named geektab in the dbo schema :
CREATE TABLE dbo.geektab
(id INT PRIMARY KEY IDENTITY,
name NVARCHAR(40) NOT NULL,
address NVARCHAR(255) NOT NULL);
Now, insert some rows into the dbo.geektab table (id is an IDENTITY column, so it is not listed explicitly):
INSERT INTO dbo.geektab (name, address)
VALUES ('Neha', 'B-Wing, Delhi'), ('Vineet', 'D-Wing, Noida');
Let us create a stored procedure that finds a row by id:
CREATE PROCEDURE sp_get_id (@id INT)
AS
BEGIN
SELECT *
FROM dbo.geektab
WHERE id = @id;
END;
Let us move this dbo.geektab table to the geek schema (create the geek schema first with
CREATE SCHEMA geek; if it does not already exist):
ALTER SCHEMA geek TRANSFER OBJECT::dbo.geektab;
Run the sp_get_id stored procedure:
EXEC sp_get_id 1;
SQL Server will throw an error similar to the one below:
Msg 208, Level 16, State 1, Procedure sp_get_id, Line 3
Invalid object name 'dbo.geektab'
Now, let us manually alter the stored procedure to reference the geek schema:
ALTER PROCEDURE sp_get_id (@id INT)
AS
BEGIN
SELECT *
FROM geek.geektab
WHERE id = @id;
END;
Run the sp_get_id stored procedure again:
EXEC sp_get_id 1;

id name address

1 Neha B-Wing, Delhi

Output:
Result

Thus the schema was successfully created and executed.


Ex: No: 5 Design Data Warehouse for real time applications

Date:

Aim:

To design the Data Warehouse using snowflake schema. This Exercise introduces you to
using Snowflake together with Dataiku Cloud as part of a Machine learning project, and builds
an end-to-end machine learning solution. This exercise will showcase seamless integration of
both Snowflake and Dataiku at every stage of ML life cycle.

Algorithm:

Step 1: If you haven't already, register for a Snowflake free 30-day trial

Step 2: Please ensure that you use the same email address for both your Snowflake and Dataiku
sign-ups

Step 3: Confirm the above steps

Step 4: Region - Kindly choose US West (Oregon) for this lab

Step 5: Set the Granularity

Step 6: Cloud Provider - Kindly choose AWS for this lab

Step 7: Snowflake edition - Select the Enterprise edition so you can leverage some advanced
capabilities that are not available in the Standard Edition.

Step 8: For new account use the provider and region.

Step 9: After registering, you will receive an email with an activation link and your Snowflake
account URL. Kindly activate the account.

Step 10: After activation, you will create a username and password. Write down these
credentials and Bookmark this Snowflake account URL for easy, future access
Output:

After registering, you will receive an email with an activation link and your Snowflake account
URL. Kindly activate the account.
After activation, you will create a username and password. Write down these credentials and
Bookmark this Snowflake account URL for easy, future access.

Log in with your username and password credentials from the previous step. Once logged in, by
default it will open up the home page.
CONNECT DATAIKU WITH SNOWFLAKE

Select the Admin from the menu.

For the next steps

● Click on Partner Connect

● From the drop-down, switch the role and make sure ACCOUNTADMIN is selected

● In the search box, type Dataiku

● Click on the Dataiku tile.

Your screen should look like the below ScreenShot.


After you have clicked on Dataiku, the following window will launch, which automatically
creates the connection parameters required for Dataiku to connect to Snowflake.
Result:

Thus the design of data warehouse for real time application was successfully completed.
Ex: No: 6 Analyze the dimensional Modeling

Date:

Aim:

To create the dimensional modeling structure using a star schema for a simple e-wallet
example.

Algorithm:

Step 1: Microsoft SQL Server is used to create the dimensional model.

Step 2: Microsoft SQL Server Management Studio (SSMS) is the toolkit used for
executing the dimensional schema.

Step 3: Build the data warehouse from operational data sources through the ETL (Extract,
Transform, Load) process so that analytical tools can support business decisions.

Step 4: Take the use case of an e-wallet to build a data warehouse using the dimensional
modeling technique.

Step 5: One of the online retail company’s features is an e-wallet service that holds credit
which can be used to pay for products purchased on the platform. Users can receive credit in
three different ways.

Step 6: When a product purchase that is paid for is canceled, the money is refunded as
cancellation credit.

Step 7: Users can receive gift card credit as a gift.

Step 8: If a user has a poor service experience, so-sorry credit may be provided.

Step 9: Credit in the e-wallet expires after 6 months if it is gift card credit or so-sorry
credit, but in 1 year if it is cancellation credit.

Step 10: Using all the above steps, we can create a new dimensional modeling structure for the
Data Warehouse.
Requirement
The Finance department of the company would like to build reporting and analytics on the e-
wallet service so they can understand the extent of the wallet liabilities the company has.
Some of the questions they would want to answer are listed below:

• What is the daily balance of credit in the e-wallet service?


• How much credit will expire in the next month?
• What is the outcome (i.e. % used, % expired, % left) of credit given in a particular month?

Solution Design
The four key decisions made during the design of a dimensional model include:
1. Select the business process. 2. Declare the grain. 3. Identify the dimensions. 4. Identify the
facts.
Let’s write down this decision steps for our e-Wallet case:
1. Assumptions: Design is developed based on the background (Business Process) given but also
keeping flexibility in mind. All the required fields are assumed to be available from the
company’s transactional database.
2. Grain definition: Atomic grain refers to the lowest level at which data is captured by a given
business process.
The lowest level of data that can be captured in this context is wallet transactions i.e., all the credit
and debit transactions on e-wallet.
3. Dimensions: Dimensions provide the “who, what, where, when, why, and how” context
surrounding a business process event.
Even though a wide number of descriptive attributes could be added, the dimensions designed
here are restricted to the current business process; the model is flexible enough to add more
details as and when required. (Table names are prefixed with Dim.)
Dimension Tables:

• DimWallet
• DimDate: This dimension has all the date-related parsed values like month of the date, week
of the date, day of the week, etc. This is very handy for getting reports based on time.

4. Facts: Facts are the measurements that result from a business process event and are almost
always numeric.

The facts are designed with a focus on fully additive facts. Even though some business
requirements call for non-additive facts (% used, % expired, % left, etc.), these values can be
derived effectively from the additive facts. Each row in the fact table represents a physical,
observable event rather than only the demands of the required reports.

Fact Table:

• FactWallet

STAR schema model
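
A minimal DDL sketch of this star schema is given below; the column names are illustrative assumptions, not the final design produced in SSMS:

CREATE TABLE DimDate (
    date_key INT PRIMARY KEY,        -- e.g. 20240131
    full_date DATE NOT NULL,
    day_of_week VARCHAR(10),
    week_of_year INT,
    month_of_year INT,
    year_number INT
);

CREATE TABLE DimWallet (
    wallet_key INT PRIMARY KEY,
    user_id INT NOT NULL,
    credit_type VARCHAR(20),         -- cancellation, gift card, so-sorry
    expiry_months INT                -- 6 or 12 depending on the credit type
);

CREATE TABLE FactWallet (
    wallet_key INT NOT NULL REFERENCES DimWallet (wallet_key),
    date_key INT NOT NULL REFERENCES DimDate (date_key),
    transaction_type VARCHAR(10),    -- credit or debit
    amount DECIMAL(12, 2) NOT NULL   -- fully additive measure
);

-- Example report: net e-wallet credit movement per day
SELECT d.full_date,
       SUM(CASE WHEN f.transaction_type = 'credit' THEN f.amount ELSE -f.amount END) AS net_change
FROM FactWallet f
JOIN DimDate d ON d.date_key = f.date_key
GROUP BY d.full_date;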


Output:

SQL Server Management Studio sequence steps screenshots:


Result:

Thus the program for dimensional modeling is successfully executed and verified.
Ex: No:7 Case study using OLAP

Date:

Aim:

To explain the OLAP operations with SQL.

Algorithm:

Step 1: Create a table in Excel

Step 2: Select the table from Excel

Step 3: Import the table into pgAdmin or PostgreSQL

Step 4: Truncate the table

Step 5: Perform roll-up and drill-down

Step 6: Perform dice

Step 7: Perform slice

Step 8: Perform pivot

Step 9: View the result in the console window
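
A minimal sketch of these operations in PostgreSQL follows, assuming a hypothetical sales(product, city, country, year, month, amount) table imported in Step 3; all table and column names are illustrative only:

-- Roll-up: aggregate sales from the city level up to the country level
SELECT country, city, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP (country, city);

-- Drill-down: move from yearly totals to monthly detail
SELECT year, month, SUM(amount) AS total_sales
FROM sales
GROUP BY year, month;

-- Slice: fix a single dimension (year = 2023)
SELECT product, SUM(amount) AS total_sales
FROM sales
WHERE year = 2023
GROUP BY product;

-- Dice: restrict several dimensions at once
SELECT product, city, SUM(amount) AS total_sales
FROM sales
WHERE year IN (2022, 2023) AND city IN ('Chennai', 'Mumbai')
GROUP BY product, city;

-- Pivot: rotate months into columns using conditional aggregation
SELECT product,
       SUM(CASE WHEN month = 1 THEN amount ELSE 0 END) AS jan_sales,
       SUM(CASE WHEN month = 2 THEN amount ELSE 0 END) AS feb_sales
FROM sales
GROUP BY product;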

Output:
Result:

Thus the program for OLAP operation was successfully executed and verified.
Ex: No: 8 Case study using OLTP
Date:

Detailed information about OLTP

An On-Line Transaction Processing (OLTP) system refers to a system that manages
transaction-oriented applications. These systems are designed to support online transactions
and process queries quickly over the Internet.
For example, the POS (point of sale) system of any supermarket is an OLTP system.
Every industry in today’s world uses OLTP systems to record their transactional data. The main
concern of OLTP systems is to enter, store, and retrieve the data. They cover all day-to-day
operations of an organization, such as purchasing, manufacturing, payroll, and accounting. Such
systems have large numbers of users who conduct short transactions. They support simple
database queries, so the response time of any user action is very fast.
The data acquired through an OLTP system is stored in commercial RDBMS, which can be
used by an OLAP System for data analytics and other business intelligence operations.
Some other examples of OLTP systems include order entry, retail sales, and financial
transaction systems.
Advantages of an OLTP System:

• OLTP systems are user friendly and can be used by anyone having a basic understanding of them.
• They allow users to perform operations like reading, writing, and deleting data quickly.
• They respond to user actions immediately as they can process queries very quickly.
• These systems are the original source of the data.
• They help to administer and run fundamental business tasks.
• They help in widening the customer base of an organization by simplifying individual processes.

Challenges of an OLTP system:

• It allows multiple users to access and change the same data at the same time, so it requires
concurrency control and recovery mechanisms to avoid unexpected situations.
• The data acquired through OLTP systems are not suitable for decision making. OLAP
systems are used for the decision making or “what if” analysis.

Type of queries that an OLTP system can Process:


An OLTP system is an online database-modifying system, so it supports database queries like
INSERT, UPDATE, and DELETE. Consider the POS system of a supermarket; below are
sample queries that it can process –
• Retrieve the complete description of a particular product
• Filter all products related to any particular supplier
• Search for the record of any particular customer.
• List all products having price less than Rs 1000.
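
A minimal SQL sketch of such queries, assuming hypothetical products and customers tables in the POS database (all table and column names are illustrative):

-- Retrieve the complete description of a particular product
SELECT * FROM products WHERE product_id = 101;

-- Filter all products related to a particular supplier
SELECT * FROM products WHERE supplier_id = 7;

-- Search for the record of a particular customer
SELECT * FROM customers WHERE customer_name = 'Neha';

-- List all products having price less than Rs 1000
SELECT product_name, price FROM products WHERE price < 1000;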

Type of queries that an OLTP system cannot Process:


An OLTP system supports only simple database queries like INSERT, UPDATE, and DELETE.
It does not support complex analytical queries. Reconsider the POS system of the supermarket;
below are sample questions that it cannot answer –

• How much discount should they offer on a particular product?


• Which product should be introduced to its customer?


OLTP Queries:

OLTP Architecture:
OLTP Advantages:

Data Warehousing : Data Warehousing is a technique that gathers or collects data from
different sources into a central repository, or, in other words, a single, complete, and consistent
store of data that is obtained from different sources. It is a powerful database model that
enhances the user’s ability to analyze huge, multidimensional datasets, allowing the user to make
business decisions based on facts, track quick and effective decisions, and obtain the
necessary information.
Requirements for an OLTP system

The most common architecture of an OLTP system that uses transactional data is a three-tier
architecture that typically consists of a presentation tier, a business logic tier, and a data store
tier. The presentation tier is the front end, where the transaction originates via a human
interaction or is system-generated. The logic tier consists of rules that verify the transaction and
ensure all the data required to complete the transaction is available. The data store tier stores the
transaction and all the data related to it.

The main characteristics of an online transaction processing system are the following:

• ACID compliance: OLTP systems must ensure that the entire transaction is recorded
correctly. A transaction is usually an execution of a program that may require the
execution of multiple steps or operations. It may be complete when all parties involved
acknowledge the transaction, or when the product/service is delivered, or when a certain
number of updates are made to the specific tables in the database. A transaction is
recorded correctly only if all the steps involved are executed and recorded. If there is any
error in any one of the steps, the entire transaction must be aborted and all the steps must
be rolled back. Thus OLTP systems must comply with the atomic, consistent,
isolated, and durable (ACID) properties to ensure the accuracy of the data in the system (a
minimal transaction sketch illustrating this appears after this list).
o Atomic: Atomicity controls guarantee that all the steps in a transaction are
completed successfully as a group. That is, if any steps between the transactions
fail, all other steps must also fail or be reverted. The successful completion of a
transaction is called commit. The failure of a transaction is called abort.
o Consistent: The transaction preserves the internal consistency of the database. If
you execute the transaction all by itself on a database that’s initially consistent,
then when the transaction finishes executing the database is again consistent.
o Isolated: The transaction executes as if it were running alone, with no other
transactions. That is, the effect of running a set of transactions is the same as
running them one at a time. This behavior is called serializability and is usually
implemented by locking the specific rows in the table.
o Durable: The transaction’s results will not be lost in a failure.
• Concurrency: OLTP systems can have enormously large user populations, with many
users trying to access the same data at the same time. The system must ensure that all
these users trying to read or write into the system can do so concurrently. Concurrency
controls guarantee that two users accessing the same data in the database system at the
same time will not be able to change that data, or that one user has to wait until the other
user has finished processing before changing that piece of data.
• Scale: OLTP systems must be able to scale up and down instantly to manage the
transaction volume in real time and execute transactions concurrently, irrespective of the
number of users trying to access the system.
• Availability: An OLTP system must be always available and always ready to accept
transactions. Loss of a transaction can lead to loss of revenue or may have legal
implications. Because transactions can be executed from anywhere in the world and at any
time, the system must be available 24/7.
• High throughput and short response time: OLTP systems require millisecond or even
shorter response times to keep enterprise users productive and to meet the growing
expectations of customers.
• Reliability: OLTP systems typically read and manipulate highly selective, small amounts
of data. It is paramount that at any given point of time the data in the database is reliable
and trustworthy for the users and applications accessing that data.
• Security: Because these systems store highly sensitive customer transaction data, data
security is critical. Any breach can be very costly for the company.
• Recoverability: OLTP systems must have the ability to recover in case of any hardware
or software failure.
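
As a minimal sketch of how ACID behavior looks in SQL (the accounts table and the amounts are hypothetical; the syntax shown is standard SQL as used in PostgreSQL):

BEGIN;
-- Transfer 500 from account 1 to account 2 as a single atomic unit
UPDATE accounts SET balance = balance - 500 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 500 WHERE account_id = 2;
-- If both updates succeed, make the changes durable
COMMIT;
-- If any step fails, undo every step of the transaction instead
-- ROLLBACK;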
Databases for OLTP workloads

Relational databases were built specifically for transaction applications. They embody all the
essential elements required for storing and processing large volumes of transactions, while also
continuously being updated with new features and functionality for extracting more value from
this rich transaction data. Relational databases are designed from the ground up to provide the
highest possible availability and fastest performance. They provide concurrency and ACID
compliance so the data is accurate, always available, and easily accessible. They store data in
tables after extracting relationships between the data so the data can be used by any application,
ensuring a single source of truth.



The evolution of transaction processing databases
As transactions became more complex, originating from any source or device, from anywhere in
the world, traditional relational databases were not advanced enough to meet the needs of
modern-day transactional workflows. They had to evolve to handle the modern-day transactions,
heterogeneous data, and global scale, and most importantly to run mixed workloads. Relational
databases transformed into multimodal databases that store and process not only relational data
but also all other types of data, including xml, html, JSON, Apache Avro and Parquet, and
documents in their native form, without much transformation. Relational databases also needed
to add more functionality such as clustering and sharding so they could be distributed globally
and scale infinitely to store and process increasingly large volumes of data and to make use of
cheaper storage available on cloud. With other capabilities such as in-memory, advanced
analytics, visualization, and transaction event queues included, these databases now can run
multiple workloads — such as running analytics on transaction data or processing streaming
(Internet of Things (IoT)) data, or running spatial, and graph analytics.

Modern relational databases built in the cloud automate a lot of the management and operational
aspects of the database, making them easier for users to provision and use. They provide
automated provisioning, security, recovery, backup, and scaling so DBAs and IT teams have to
spend much less time maintaining them. They also embed intelligence to automatically tune and
index the data so database query performance is consistent irrespective of the amount of data, the
number of concurrent users, or the complexity of the queries. These cloud databases also include
self-service capabilities and REST APIs so developers and analysts can easily access and use the
data. This simplifies application development, giving flexibility and making it easier for
developers to build new functionality and customizations into their applications. It also
simplifies analytics, making it easier for analysts and data scientists to use the data for extracting
insights.

How to select the right database for your OLTP workload

As IT struggles to keep pace with the speed of business, it is important that when you choose an
operational database you consider your immediate data needs and long-term data requirements.
For storing transactions, maintaining systems of record, or content management, you will need a
database with high concurrency, high throughput, low latency, and mission-critical
characteristics such as high availability, data protection, and disaster recovery. Most likely, your
workload will fluctuate throughout the day or week or year, so ensuring that the database can
autoscale will help you save a lot of expense. You may also need to decide whether to use a
purpose-built database or general-purpose database. If your requirements are for a specific type
of data, a purpose-built database may work for you, but make sure you aren’t compromising on
any of the other characteristics you need. It would be costly and resource-intensive to build for
those characteristics later in the application layer. Also, if your data needs grow and you want to
expand the functionality of your application, adding more single-purpose or fit-for-purpose
databases will only create data silos and amplify the data management problems. You must also
consider other functionalities that may be necessary for your specific workload—for example,
ingestion requirements, push-down compute requirements, and size at limit.

Select a future-proof cloud database service with self-service capabilities that will automate all
the data management so that your data consumers—developers, analysts, data engineers, data
scientists and DBAs—can do more with the data and accelerate application development.

Result:

Thus the case study of OLTP was successfully studied.


Ex: No:9 Implementation of warehouse testing

Date:

Aim:

To test the data warehouse with the help of ETL tools using MySQL Workbench.

Algorithm:

Step 1: Load the data from the flat file

Step 2: Prepare the test case

Step 3: Validate the data

Step 4: Match the data in Excel

Step 5: Create the source data in an Excel file

Step 6: Truncate the table first using the MySQL Workbench wizard

Step 7: Import the source data in Workbench

Step 8: Compare the imported data with the Excel file

Step 9: Identify duplicate records in the Workbench running environment
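
A minimal SQL sketch of the validation and duplicate checks in Steps 8 and 9, assuming a hypothetical staging table stg_customer keyed by customer_id (names are illustrative only):

-- Row-count check: the imported table should contain as many rows as the source Excel file
SELECT COUNT(*) AS imported_rows FROM stg_customer;

-- Identify duplicate records by the business key
SELECT customer_id, COUNT(*) AS occurrences
FROM stg_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;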

Output:
Result:

Thus the data warehouse testing process was successfully executed with MySQL
Workbench.
