
School of Computing and Information Technology Session: Spring 2024

University of Wollongong Lecturer: Janusz R. Getta

ISIT912 Big Data Management


Assignment 2
Published on 19 August 2024

Scope
This assignment includes tasks related to the intuitive design of data cubes, conceptual modelling
of a data warehouse, implementation of a "snowflake schema" as a collection of internal tables in
HQL, implementation of 0NF Hive tables, denormalization of a snowflake schema and creation of
a star schema.

This assignment is due on Saturday, 21 September 2024, 7:00pm (sharp).

This assignment is worth 20% of the total evaluation in the subject.

The assignment consists of 5 tasks and specification of each task starts from a new page.

Only electronic submission through Moodle at:

https://moodle.uowplatform.edu.au/login/index.php

will be accepted. A submission procedure is explained at the end of the Assignment 1 specification.

A policy regarding late submissions is included in the subject outline.

Only one submission of Assignment 2 is allowed and only one submission per student is
accepted.

A submission marked by Moodle as "late" is always treated as a late submission no matter how
many seconds it is late.

A submission that contains an incorrect file attached is treated as a correct submission with all
consequences coming from the evaluation of the file attached.

All files left on Moodle in a state "Draft(not submitted)" will not be evaluated.

A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, … etc) is not
allowed. The compressed files will not be evaluated.

An implementation that fails with one or more syntax and/or run-time errors scores no
marks.

Using any sort of Generative Artificial Intelligence (GenAI) for this assignment is NOT allowed!

It is expected that all tasks included within Assignment 2 will be solved individually without
any cooperation with the other students. If you have any doubts, questions, etc. please consult
your lecturer or tutor during lab classes or office hours. Plagiarism will result in a FAIL grade
being recorded for the assessment task.
Task 1 (2 marks)
Intuitive design of data cubes

(1) 0.5 mark


Consider the following description of a sample database domain.

We collect the values of temperature and humidity from two sensors located in the rooms.
The rooms are located in the buildings at the university campuses. Assume that the values
collected from all sensors are always recorded one time per minute.

We would like to save the values in a multi-dimensional data cube. Later on, we would
like to compute an average temperature per hour, per day, per month and per year, an
average temperature per building, per room, per campus, an average temperature per
day and building, per day and room and so on.

Use a short explanation of a database domain given above to find a multi-dimensional
data cube that should be implemented. In your specification of a data cube, list the facts,
the measures, the names of dimensions and the hierarchies.
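
The kind of aggregation described in the domain above can be sketched in HQL. This is only an illustration under stated assumptions, not a data cube specification: the flat table sensor_reading and all of its column names are hypothetical and are not part of the assignment.

```sql
-- Hypothetical flat table; every name here is an assumption for illustration.
CREATE TABLE sensor_reading(
  campus       STRING,
  building     STRING,
  room         STRING,
  reading_time TIMESTAMP,
  temperature  DECIMAL(5,2),
  humidity     DECIMAL(5,2) );

-- Average temperature per campus, building and room, per campus and building,
-- per campus, and over all readings, computed in one pass with ROLLUP.
SELECT campus, building, room, AVG(temperature) AS avg_temperature
FROM sensor_reading
GROUP BY campus, building, room WITH ROLLUP;

-- Average temperature per day and building.
SELECT TO_DATE(reading_time) AS reading_day, building,
       AVG(temperature) AS avg_temperature
FROM sensor_reading
GROUP BY TO_DATE(reading_time), building;
```

Each level that such queries group by (for example room, building, campus) is a candidate level of a hierarchy, and each aggregated column is a candidate measure.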

(2) 0.5 mark


Consider the following description of a sample database domain.

A large international network of supermarkets records at the checkouts information
about the customer baskets. A customer basket is described by the total number of items
in a basket, the total value of all items, date and time when paid, and credit card number
used. The supermarkets are located at the suburbs of cities located in different countries.
Customers belong to a number of customer groups where each customer belongs to only
one group. Customer groups belong to different types of groups such that each group is
of one type.

The network would like to save data collected at the supermarkets in a multi-dimensional
data cube. Later on, the network would like to compute the total number of finalized
baskets, the total value and the total number of items in the finalized baskets per day, per
month, per year, per suburb, per city, per country, per credit card used, per customer
group, per group type, per day and suburb, per month and country, per year and customer
group and so on.

Use a short explanation of a database domain given above to find a multi-dimensional
data cube that should be implemented. In your specification of a data cube, list the facts,
the measures, the names of dimensions and the hierarchies.

(3) 0.5 mark


Consider the following description of a sample database domain.

A transportation company would like to keep information about its past and present
activities. The company employs a number of drivers and it owns a number of vehicles.
The company owns two types of vehicles: cars and busses. The drivers use the vehicles to
perform one day trips with their customers. The company also employs the
administration staff members who register information related to the trips performed by
the drivers. Assume that information about a trip is always recorded by one
administration staff member.

The company would like to save the data related to their activities in a multidimensional
data cube. Later on, the company would like to compute the total number of trips, the
total amount of fuel consumed, the total distance traveled per day, the total number of
administration staff members involved per month, per year, per vehicle, per all cars, per
all busses, per month and driver, per year and driver and so on.

Use a short explanation of a database domain given above to find a multi-dimensional
data cube that should be implemented. In your specification of a data cube, list the facts,
the measures, the names of dimensions and the hierarchies.

(4) 0.5 mark


Consider the following description of a sample database domain.

A university would like to keep information about participation of the students in lecture
classes. The university uses a sophisticated electronic system monitoring the presence of
each student in a lecture class. The system is able to measure the length of periods of
time spent by a student on several activities. The length of the following activities can be
measured: a student listens to a lecturer, a student is involved in a conversation with
another student, or a student does not pay any attention to a lecture, for example a
student fell asleep for a while.

The university would like to save data in a multidimensional data cube. Later on, the
university would like to find the total time spent on participation in a class, total time
spent on conversations with other students, total time when no attention is paid per
student, per subject, per degree, per day, per month, per session, per lecture hall, per
student and subject, per student and day, per student and session and so on.

Use a short explanation of a database domain given above to find a multi-dimensional
data cube that should be implemented. In your specification of a data cube, list the facts,
the measures, the names of dimensions and the hierarchies.

Deliverables
A file solution1.pdf that contains
(1) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (1),
(2) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (2),
(3) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (3),
(4) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (4).
Task 2 (4 marks)
Conceptual modelling of a data warehouse

A large international bank would like to create a data warehouse with information about
the loans approved for its customers.

A customer is described by a unique account number associated with a loan, full name
and address.

The bank records the dates when the customers are provided with the loans and the dates
when the loans are fully repaid. A date consists of a day, month and year.

The bank offers the following types of loans: home, investment, personal. Different types
of loans are offered at different interest rates.

All loans must be insured at the insurance companies. An insurance company is
described by a unique name and address. Insurance rates are provided for each loan.

The loans are issued by the tellers located at the branches. A description of a teller
consists of a unique employee number and full name. A branch is described by a unique
name.

The bank plans to use a data warehouse to implement the following classes of analytical
applications.

Find the total number of loans issued per day, per month, per year, per branch, per bank
teller, per city, per state, per country, per loan type, per customer, etc.

Find the total amount of money loaned to the customers per day, per month, per year, per
city, per country, per loan type etc.

Find the total interest rates on the loans per day, per month, per year, per city, per
country, per loan type, etc.

Find an average period of time needed for the loan repayment per loan type, per
customer, per city, per country, etc.

Find the total number of different currencies used for the loans, etc.

Find the total amount of money on loans per currency.

Find the average insurance rates for the loans per month, per year, per city, per city and
year, etc.

Your task is to create a conceptual schema of a data warehouse needed by the bank. To
create a conceptual schema, follow the steps listed below.
Step 1 Find a fact entity.
Step 2 Find the measures describing a fact entity.
Step 3 Find the dimensions.
Step 4 Find the hierarchies over the dimensions.
Step 5 Find the descriptions (attributes) of all entity types.
Step 6 Draw a conceptual schema.

To draw a conceptual schema, use a graphical notation explained to you in a presentation
Conceptual Data Warehouse Design. Use diagram drawing tool UMLet
15.1 and apply the graphical widgets available in ISIT312palette. The choice of
the graphical widgets is available in the right upper corner of the main menu of UMLet.

UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.

Save a drawing of your conceptual schema in a file solution2.bmp.

Deliverables
A file solution2.bmp with a drawing of a conceptual schema of a sample data
warehouse domain.
Task 3 (6 marks)
Implementation of a data warehouse as a collection of internal tables in Hive

Consider the following conceptual schema of three-dimensional data cube.

(1) 2 marks
Perform a step of logical design and draw a "snowflake schema" obtained from the
transformation of a conceptual schema given above.

When creating a "snowflake schema" apply the surrogate keys to implement the
relationships. Clearly identify primary, candidate and foreign keys in the relational
schemas. Reduce the number of dimensions to two in a "snowflake schema" through the
implementation of time dimension as a single attribute.

To draw a "snowflake schema", use a graphical notation explained to you in a
presentation 11 Logical Data Warehouse Design. To draw a "snowflake
schema" use diagram drawing tool UMLet 15.1 and apply the graphical widgets
available in LogicalDesign palette.

UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.

Save a drawing of your "snowflake schema" in a file solution3.bmp.

(2) 4 marks
Implement a fact table and the dimension tables as internal tables in Hive.

Create a script file solution3.hql with CREATE TABLE statements, LOAD
statements to load sample data into the internal tables and SELECT statements that list
the contents of each table. Do not load more than 5 rows per table. All input data are up
to you.

When ready, connect to Hive through beeline, process your script and save a report in
a file solution3.txt.
Processing of your script must return NO ERRORS! A solution with errors is worth no
marks!
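
The general shape of such a script can be sketched as follows. This is a minimal sketch on a hypothetical two-table schema, not the schema of this task: the table names, column names, field delimiter and file paths are all assumptions.

```sql
-- Hypothetical dimension and fact tables; names, columns and paths are assumptions.
-- Tables created without the EXTERNAL keyword are internal (managed) tables.
CREATE TABLE product_dim(
  product_key  INT,
  product_name STRING,
  category     STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE sales_fact(
  sale_key    INT,
  product_key INT,
  sale_date   DATE,
  amount      DECIMAL(10,2) )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Each input file is assumed to hold at most 5 comma-separated rows.
-- The paths are placeholders; replace them with the location of your own files.
LOAD DATA LOCAL INPATH '/path/to/product_dim.tbl' OVERWRITE INTO TABLE product_dim;
LOAD DATA LOCAL INPATH '/path/to/sales_fact.tbl' OVERWRITE INTO TABLE sales_fact;

-- List the contents of each table.
SELECT * FROM product_dim;
SELECT * FROM sales_fact;
```

Dates in the text files are assumed to be written as yyyy-MM-dd so that they load into a DATE column.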

Deliverables
A file solution3.bmp with a drawing of "snowflake schema" of a data warehouse
and a file solution3.txt with a report from processing of HQL script
solution3.hql.
Task 4 (8 marks)
Implementation of 0NF table in Hive

Consider the following description of a sample database domain.

We would like to store information about the employees, the projects they are assigned
to, their programming skills and their employment record. An employee is described by
an employee number and full name. An employee can be assigned to many projects. Some
employees are not assigned to any projects. A project is identified by its name. If an
employee is assigned to some projects then we need to keep information about a
percentage contribution of an employee to each project. We also would like to record
information about the programming languages that can be used by the employees. An
employee can use none or many programming languages. An employment record consists
of hire date, salary and employee number of a supervisor.

(1) Implement HQL script solution4.hql that creates an internal 0NF relational
table to store information about the employees, the projects they are assigned to and
their programming skills.

(2) Include into the script INSERT statements that load sample data into the table.
Insert at least 5 rows into the relational table created in the previous step. Two
employees must participate in a few projects and must know a few programming
languages. One employee must participate in a few projects and must not know any
programming languages. One employee must know a few programming languages and
must not participate in any projects. One employee must not know any programming
languages and must not participate in any projects. Each employee must have a
nonempty employment record.

(3) Include into the script SELECT statements that list the contents of the table.

When ready, use a command line interface beeline to process a script
solution4.hql and to save a report from processing in a file solution4.txt.

If the processing of the file returns the errors then you must eliminate the errors!
Processing of your script must return NO ERRORS! A solution with errors is worth no
marks!
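
A 0NF table in Hive stores repeating groups in columns of complex types. The sketch below shows the technique on a deliberately different, hypothetical domain (students rather than employees), so every table and column name in it is an assumption.

```sql
-- Hypothetical 0NF table; all names are assumptions for illustration.
CREATE TABLE student(
  student_number INT,
  full_name      STRING,
  courses        MAP<STRING, INT>,   -- course code -> final mark
  hobbies        ARRAY<STRING>,      -- repeating group of plain values
  enrolment      STRUCT<enrol_date:STRING, degree:STRING> );

-- Complex values are built with the constructor functions
-- map(), array() and named_struct() inside a SELECT.
INSERT INTO TABLE student
SELECT 1, 'Ann Lee',
       map('CS101', 85, 'CS102', 72),
       array('chess', 'rowing'),
       named_struct('enrol_date', '2024-02-26', 'degree', 'MCS');

-- List the whole table, then access the individual components.
SELECT * FROM student;
SELECT full_name, courses['CS101'], hobbies[0], enrolment.degree
FROM student;
```

The sketch uses INSERT ... SELECT because Hive's INSERT ... VALUES is documented as not supporting values of complex types.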

Deliverables
A file solution4.txt with a report from processing of HQL script
solution4.hql. The report MUST NOT include any errors, and the report must
list all SQL statements processed.
Task 5 (10 marks)
Logical design, denormalizations and star schema

Consider the following conceptual schema of 6-dimensional data cube.

(1) 4 marks
Perform a step of logical design and draw a "snowflake schema" obtained from the
transformation of conceptual schema given above.

When creating a "snowflake schema" apply the surrogate keys to implement the
relationships. Clearly identify primary, candidate and foreign keys in the relational
schemas. A method used for implementation of time dimension is up to you.

To draw a "snowflake schema", use a graphical notation explained to you in a
presentation 11 Logical Data Warehouse Design. To draw a "snowflake
schema" use diagram drawing tool UMLet 15.1 and apply the graphical widgets
available in LogicalDesign palette.

UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.

Save a drawing of your "snowflake schema" in a file solution5-1.bmp.


(2) 4 marks
Consider the templates of queries listed below. The queries consistent with the templates
are supposed to be applied to the data warehouse.

(i) Find the total number of visits to a hotel/total hotel payments/total restaurant payments/total
discounts applied/total number of facilities used per each bank that issued a credit card used
by the visitors.

(ii) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities used per year/per year and month/ per year
and month and day.

(iii) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities used per each country the visitors came from.

(iv) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities per hotel name and city.

Denormalization of relational tables is a common technique used to speed up processing
of queries that require expensive join operations. Your task is to denormalize the
relational schemas obtained from a logical design in a step (1) to speed up the processing
of queries consistent with the templates given above. An objective is to speed up
processing of such queries as much as possible through denormalization of relational
schemas (denormalization of "snowflake schema").
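
The effect of denormalization on a join can be sketched in HQL. The example is a minimal illustration on two hypothetical level tables, city and country, that are not taken from the conceptual schema of this task; all names are assumptions.

```sql
-- Hypothetical normalized level tables; all names are assumptions.
CREATE TABLE country(
  country_key  INT,
  country_name STRING );

CREATE TABLE city(
  city_key    INT,
  city_name   STRING,
  country_key INT );

-- Denormalized table: the country name is copied into every city row,
-- so the join is performed once, when the table is created.
CREATE TABLE city_denorm AS
SELECT ci.city_key, ci.city_name, co.country_name
FROM city ci JOIN country co ON ci.country_key = co.country_key;

-- A per-country query now reads one table instead of joining two.
SELECT country_name, COUNT(*) AS total_cities
FROM city_denorm
GROUP BY country_name;
```

The price of the faster query is redundancy: each country name is stored once per city, which is the usual trade-off of denormalization.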

To draw a denormalized "snowflake schema" use a graphical notation explained to you in
a presentation 11 Logical Data Warehouse Design. To draw a denormalized
"snowflake schema" use a diagram drawing tool UMLet 15.1 and apply the graphical
widgets available in LogicalDesign palette.

UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.

Redraw a snowflake schema after the denormalizations and save it in a file
solution5-2.bmp.

(3) 2 marks
Transform a logical schema obtained from step (1) into a "star schema".
To draw "star schema" use a graphical notation explained to you in a presentation 11
Logical Data Warehouse Design. To draw a "star schema" use a diagram
drawing tool UMLet 15.1 and apply the graphical widgets available in
LogicalDesign palette.

UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.

Save a drawing of your "star schema" in a file solution5-3.bmp.


Deliverables
A file solution5-1.bmp that contains a "snowflake schema" obtained from
application of a logical design to the original conceptual schema.
A file solution5-2.bmp that contains a "snowflake schema" obtained from
denormalization of relational schemas in step (2).
A file solution5-3.bmp that contains a "star schema" obtained from a "snowflake
schema" created in step (1).
Submission of Assignment 2

Note that you have only one submission, so make absolutely sure that you submit
the correct files with the correct contents. No other submission is possible!

Submit the files solution1.pdf, solution2.bmp, solution3.bmp,
solution3.txt, solution4.txt, solution5-1.bmp, solution5-2.bmp,
and solution5-3.bmp through Moodle in the following way:
(1) Access Moodle at http://moodle.uowplatform.edu.au/
(2) To login use a Login link located in the right upper corner of the Web page or in
the middle of the bottom of the Web page
(3) When logged in, select a site ISIT912/312 (S224) Big Data
Management
(4) Scroll down to a section Assessment items (Assignments)
(5) Click at In this place you can submit the outcomes of your
work on the tasks included in Assignment 2 for ISIT912
students link.
(6) Click at a button Add Submission
(7) Move a file solution1.pdf into an area File submissions. You can
also use a link Add…
(8) Repeat a step (7) for the files solution2.bmp, solution3.bmp,
solution3.txt, solution4.txt, solution5-1.bmp,
solution5-2.bmp, and solution5-3.bmp.
(9) Click at a button Submit assignment.
(10) Click at the checkbox with a text attached: By checking this box, I
confirm that this submission is my own work, I accept
responsibility for any copyright infringement that may
occur as a result of this submission, and I acknowledge
that this submission may be forwarded to a text-
matching service.
(11) Click at a button Continue
(12) Check if Submission status is Submitted for grading.

End of specification
