Assignment2 4
Scope
This assignment includes tasks related to the intuitive design of data cubes, conceptual modelling
of a data warehouse, implementation of a "snowflake schema" as a collection of external tables in
HQL, implementation of 0NF Hive tables, denormalization of a snowflake schema and creation of a
star schema.
The assignment consists of 5 tasks and specification of each task starts from a new page.
Only one submission of Assignment 2 is allowed and only one submission per student is
accepted.
A submission marked by Moodle as "late" is always treated as a late submission no matter how
many seconds it is late.
A submission that contains an incorrect file attached is treated as a correct submission with all
consequences coming from the evaluation of the file attached.
All files left on Moodle in a state "Draft(not submitted)" will not be evaluated.
A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, etc.) is not
allowed. Compressed files will not be evaluated.
An implementation that does not compile or run correctly because of one or more syntax and/or
run-time errors scores no marks.
Using any sort of Generative Artificial Intelligence (GenAI) for this assignment is NOT allowed!
It is expected that all tasks included within Assignment 2 will be solved individually without
any cooperation with the other students. If you have any doubts, questions, etc. please consult
your lecturer or tutor during lab classes or office hours. Plagiarism will result in a FAIL grade
being recorded for the assessment task.
Task 1 (2 marks)
Intuitive design of data cubes
(1) We collect the values of temperature and humidity from two sensors located in the rooms.
The rooms are located in the buildings at the university campuses. Assume that the values
collected from all sensors are always recorded once per minute.
We would like to save the values in a multi-dimensional data cube. Later on, we would
like to compute an average temperature per hour, per day, per month and per year, an
average temperature per building, per room, per campus, an average temperature per
day and building, per day and room and so on.
(2) A supermarket network would like to save data collected at its supermarkets in a
multi-dimensional data cube. Later on, the network would like to compute the total number of
finalized baskets, the total value and the total number of items in the finalized baskets per day,
per month, per year, per suburb, per city, per country, per credit card used, per customer
group, per group type, per day and suburb, per month and country, per year and customer
group and so on.
(3) A transportation company would like to keep information about its past and present
activities. The company employs a number of drivers and it owns a number of vehicles.
The company owns two types of vehicles: cars and buses. The drivers use the vehicles to
perform one-day trips with their customers. The company also employs
administration staff members who register information related to the trips performed by
the drivers. Assume that information about a trip is always recorded by one
administration staff member.
The company would like to save the data related to its activities in a multidimensional
data cube. Later on, the company would like to compute the total number of trips, the
total amount of fuel consumed, the total distance traveled per day, the total number of
administration staff members involved per month, per year, per vehicle, per all cars, per
all buses, per month and driver, per year and driver and so on.
(4) A university would like to keep information about the participation of students in lecture
classes. The university uses a sophisticated electronic system monitoring the presence of
each student in a lecture class. The system is able to measure the length of periods of
time spent by a student on several activities. The length of the following activities can be
measured: a student listens to a lecturer, a student is involved in a conversation with
another student, or a student does not pay any attention to a lecture, for example a
student falls asleep for a while.
The university would like to save the data in a multidimensional data cube. Later on, the
university would like to find the total time spent on participation in a class, total time
spent on conversations with other students, total time when no attention is paid per
student, per subject, per degree, per day, per month, per session, per lecture hall, per
student and subject, per student and day, per student and session and so on.
Deliverables
A file solution1.pdf that contains
(1) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (1),
(2) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (2),
(3) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (3),
(4) a specification of data cube including a list of facts, measures, dimensions, and
hierarchies obtained as a solution of problem (4).
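As an illustration of what "an average per hour, per day, per building" means in problem (1), the following minimal Python sketch rolls up a handful of per-minute readings along different dimensions. The sample readings, the tuple layout and the helper name average_by are assumptions made for this sketch only; it is not a specification of the data cube.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-minute readings:
# (timestamp, campus, building, room, temperature)
readings = [
    (datetime(2024, 3, 1, 9, 0), "Main", "B1", "101", 20.0),
    (datetime(2024, 3, 1, 9, 1), "Main", "B1", "101", 22.0),
    (datetime(2024, 3, 1, 10, 0), "Main", "B1", "102", 18.0),
    (datetime(2024, 3, 2, 9, 0), "Main", "B2", "201", 24.0),
]

def average_by(rows, key):
    """Average the temperature measure over the groups defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[4])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Roll-ups along the time dimension, the location dimension, and both.
per_hour = average_by(readings, lambda r: (r[0].date(), r[0].hour))
per_day = average_by(readings, lambda r: r[0].date())
per_building = average_by(readings, lambda r: r[2])
per_day_and_building = average_by(readings, lambda r: (r[0].date(), r[2]))
```

Each grouping key corresponds to one level (or a combination of levels) of a dimension, and the same measure is aggregated at every level.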
Task 2 (4 marks)
Conceptual modelling of a data warehouse
A large international bank would like to create a data warehouse with information about
the loans approved for its customers.
A customer is described by a unique account number associated with a loan, full name
and address.
The bank records the dates when the customers are provided with the loans and the dates
when the loans are fully repaid. A date consists of a day, month and year.
The bank offers the following types of loans: home, investment, personal. Different types
of loans are offered at different interest rates.
The loans are issued by the tellers located at the branches. A description of a teller
consists of a unique employee number and full name. A branch is described by a unique
name.
The bank plans to use a data warehouse to implement the following classes of analytical
applications.
Find the total number of loans issued per day, per month, per year, per branch, per bank
teller, per city, per state, per country, per loan type, per customer, etc.
Find the total amount of money loaned to the customers per day, per month, per year, per
city, per country, per loan type etc.
Find the total interest rates on the loans per day, per month, per year, per city, per
country, per loan type, etc.
Find an average period of time needed for the loan repayment per loan type, per
customer, per city, per country, etc.
Find the total number of different currencies used for the loans, etc.
Find the average insurance rates for the loans per month, per year, per city, per city and
year, etc.
Your task is to create a conceptual schema of a data warehouse needed by the bank. To
create a conceptual schema, follow the steps listed below.
Step 1 Find a fact entity.
Step 2 Find the measures describing the fact entity.
Step 3 Find the dimensions.
Step 4 Find the hierarchies over the dimensions.
Step 5 Find the descriptions (attributes) of all entity types.
Step 6 Draw a conceptual schema.
Deliverables
A file solution2.bmp with a drawing of a conceptual schema of a sample data
warehouse domain.
Task 3 (6 marks)
Implementation of a data warehouse as a collection of internal tables in Hive
(1) 2 marks
Perform a step of logical design and draw a "snowflake schema" obtained from the
transformation of a conceptual schema given above.
When creating a "snowflake schema" apply the surrogate keys to implement the
relationships. Clearly identify primary, candidate and foreign keys in the relational
schemas. Reduce the number of dimensions to two in a "snowflake schema" through the
implementation of time dimension as a single attribute.
(2) 4 marks
Implement a fact table and the dimension tables as internal tables in Hive.
When ready, connect to Hive through beeline, process your script and save a report in
a file solution3.txt.
Processing of your script must return NO ERRORS! A solution with errors is worth no
marks!
Deliverables
A file solution3.bmp with a drawing of "snowflake schema" of a data warehouse
and a file solution3.txt with a report from processing of HQL script
solution3.hql.
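To illustrate why a time dimension stored as a single attribute still supports aggregation per year, per month and per day, the following Python sketch derives each level from one date value on demand. The fact rows, the attribute names event_date and amount, and the helper names are hypothetical and unrelated to the schema of this task; in Hive the same levels could be derived with built-in date functions such as year() and month().

```python
from collections import Counter
from datetime import date

# Hypothetical fact rows; event_date plays the role of the whole time dimension.
facts = [
    {"event_date": date(2023, 12, 30), "amount": 100.0},
    {"event_date": date(2024, 1, 15), "amount": 250.0},
    {"event_date": date(2024, 1, 20), "amount": 50.0},
    {"event_date": date(2024, 2, 1), "amount": 75.0},
]

# Each hierarchy level is computed from the single date attribute.
LEVELS = {
    "year": lambda d: (d.year,),
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}

def count_per(rows, level):
    """Count fact rows grouped at the requested level of the time hierarchy."""
    return Counter(LEVELS[level](row["event_date"]) for row in rows)

def total_per(rows, level):
    """Sum the amount measure grouped at the requested level."""
    totals = Counter()
    for row in rows:
        totals[LEVELS[level](row["event_date"])] += row["amount"]
    return dict(totals)
```

For example, count_per(facts, "year") groups the four rows into the years 2023 and 2024 without any separate time table.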
Task 4 (8 marks)
Implementation of 0NF table in Hive
We would like to store information about the employees, the projects they are assigned
to, their programming skills and their employment record. An employee is described by
an employee number and full name. An employee can be assigned to many projects. Some
employees are not assigned to any projects. A project is identified by its name. If an
employee is assigned to some projects then we need to keep information about a
percentage contribution of an employee to each project. We also would like to record
information about the programming languages that can be used by the employees. An
employee can use none or many programming languages. An employment record consists
of hire date, salary and employee number of a supervisor.
(1) Implement an HQL script solution4.hql that creates an internal 0NF relational
table to store information about the employees, the projects they are assigned to and
their programming skills.
(2) Include in the script INSERT statements that load sample data into the table.
Insert at least 5 rows into the relational table created in the previous step. Two
employees must participate in a few projects and must know a few programming
languages. One employee must participate in a few projects and must not know any
programming languages. One employee must know a few programming languages and
must not participate in any projects. One employee must not know any programming
languages and must not participate in any projects. Each employee must have a
nonempty employment record.
(3) Include in the script a SELECT statement that lists the contents of the table.
If the processing of the file returns errors then you must eliminate the errors!
Processing of your script must return NO ERRORS! A solution with errors is worth no
marks!
Deliverables
A file solution4.txt with a report from processing of HQL script
solution4.hql. The report MUST NOT include any errors, and the report must
list all SQL statements processed.
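As background on the term 0NF: a 0NF table allows a single row to hold repeating groups and nested values, which Hive supports through its complex types ARRAY, MAP and STRUCT. The Python sketch below mirrors that idea on a deliberately different, hypothetical domain (students with phone numbers and grades); the class StudentRow and the helper unnest_grades are assumptions made for illustration and are not part of the required solution.

```python
from dataclasses import dataclass, field

@dataclass
class StudentRow:
    """One 0NF row: atomic columns plus nested collections."""
    student_id: int
    name: str
    phones: list = field(default_factory=list)   # repeating group, may be empty
    grades: dict = field(default_factory=dict)   # course -> mark, may be empty

rows = [
    StudentRow(1, "Ann", ["555-0100", "555-0101"], {"DB": 80, "OS": 70}),
    StudentRow(2, "Bob"),  # no phones and no grades, still a single valid row
]

def unnest_grades(table):
    """Flatten the nested grades into 1NF tuples (student_id, course, mark)."""
    return [
        (row.student_id, course, mark)
        for row in table
        for course, mark in row.grades.items()
    ]
```

Calling unnest_grades(rows) shows the contrast with a normalized design: the nested values of one 0NF row become several flat rows, and a row with empty collections contributes none.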
Task 5 (10 marks)
Logical design, denormalizations and star schema
(1) 4 marks
Perform a step of logical design and draw a "snowflake schema" obtained from the
transformation of conceptual schema given above.
When creating a "snowflake schema" apply the surrogate keys to implement the
relationships. Clearly identify primary, candidate and foreign keys in the relational
schemas. A method used for implementation of time dimension is up to you.
UMLet 15.1 software can be downloaded from a section Resources available at the
Moodle site for the subject.
(i) Find the total number of visits to a hotel/total hotel payments/total restaurant payments/total
discounts applied/total number of facilities used per each bank that issued a credit card used
by the visitors.
(ii) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities used per year/per year and month/ per year
and month and day.
(iii) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities used per each country the visitors came from.
(iv) Find the total number of visits to the hotels/total hotel payments/total discount applied/total
restaurant payments/total number of facilities per hotel name and city.
(3) 2 marks
Transform a logical schema obtained from step (1) into a "star schema".
To draw "star schema" use a graphical notation explained to you in a presentation 11
Logical Data Warehouse Design. To draw a "star schema" use a diagram
drawing tool UMLet 15.1 and apply the graphical widgets available in
LogicalDesign palette.
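The difference between a "snowflake schema" and a "star schema" can be illustrated with a small Python sketch: the levels of a normalized dimension hierarchy are folded into one flat dimension row. The shop, city and country tables, their surrogate key values and the helper denormalize are hypothetical and do not come from the schema of this task.

```python
# Hypothetical snowflake dimension chain: shop -> city -> country,
# each level keyed by a surrogate key.
countries = {1: {"country_name": "Australia"}}
cities = {
    10: {"city_name": "Sydney", "country_id": 1},
    11: {"city_name": "Perth", "country_id": 1},
}
shops = {
    100: {"shop_name": "Harbour Store", "city_id": 10},
    101: {"shop_name": "West Store", "city_id": 11},
}

def denormalize(shop_table, city_table, country_table):
    """Fold the city and country levels into one flat row per shop."""
    flat = {}
    for shop_id, shop in shop_table.items():
        city = city_table[shop["city_id"]]
        country = country_table[city["country_id"]]
        flat[shop_id] = {
            "shop_name": shop["shop_name"],
            "city_name": city["city_name"],
            "country_name": country["country_name"],
        }
    return flat

# Star-schema style dimension: one table, no foreign keys to other levels.
shop_dimension = denormalize(shops, cities, countries)
```

The flat dimension repeats the country name for every shop, which is the redundancy a star schema accepts in exchange for fewer joins.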
Note that you have only one submission. So, make absolutely sure that you submit
the correct files with the correct contents. No other submission is possible!
End of specification