Ps Assignment - Solution
Assignment-3 (Solution)
1. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the
two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
a). Draw a schema diagram for the above data warehouse using one of the schemas. [star,
snowflake, fact constellation]
b). Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be
performed in order to list the total fee collected by each doctor in 2004?
c). To obtain the same list, write an SQL query assuming the data are stored in a relational
database with the schema fee (day, month, year, doctor, hospital, patient, count, charge)
a) Ans.- A star schema can be used, with a central fact table containing the two measures count and charge and foreign keys to the time (day, month, year), doctor, and patient dimension tables (schema diagram not reproduced here).
b) Ans.- Starting with the base cuboid [day, doctor, patient], first apply a roll-up operation on time from day to month to year, then a slice operation to select the year 2004. Next, apply a roll-up operation on patient from individual patient to all. The result lists the total fee collected by each doctor in 2004.
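As a rough illustration (not part of the original solution), the same roll-up and slice steps can be mimicked in pandas over a toy base cuboid; the rows and column names below are made-up assumptions.

```python
# Hypothetical base cuboid [day, doctor, patient] held as one row per cell.
import pandas as pd

base = pd.DataFrame(
    [
        (15, 3, 2004, "Dr. Smith", "P1", 1, 100.0),
        (16, 3, 2004, "Dr. Smith", "P2", 1, 150.0),
        (20, 7, 2004, "Dr. Jones", "P1", 1, 200.0),
        (11, 1, 2003, "Dr. Jones", "P3", 1, 120.0),
    ],
    columns=["day", "month", "year", "doctor", "patient", "count", "charge"],
)

# Roll up time from day to year and patient to "all" by grouping only on
# (year, doctor), then slice on year = 2004.
rolled = base.groupby(["year", "doctor"], as_index=False)["charge"].sum()
print(rolled[rolled["year"] == 2004])   # total fee per doctor in 2004
```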
c) Ans.- SELECT doctor, SUM(charge) FROM fee WHERE year = 2004 GROUP BY doctor
2. Suppose that a data warehouse for Big-University consists of the following four dimensions:
student, course, semester, and instructor, and two measures count and avg_grade. When at the
lowest conceptual level (e.g., for a given student, course, semester, and instructor combination),
the avg_grade measure stores the actual course grade of the student. At higher conceptual levels,
avg_grade stores the average grade for the given combination.
a) Ans.
Ans.- Since the weather bureau has about 1000 probes scattered throughout various land and
ocean locations, we need to construct a spatial data warehouse so that a user can view weather
patterns on a map by month, by region, and by different combinations of temperature and
precipitation, and can dynamically drill down or roll up along any dimension to explore desired
patterns.
The star schema of this weather spatial data warehouse would have a central fact table pointing to dimensions such as time (month), region, temperature, and precipitation (schema diagram not reproduced here).
To construct this spatial data warehouse, we may need to integrate spatial data from
heterogeneous sources and systems. Fast and flexible online analytical processing in a spatial data
warehouse is an important factor. There are three types of dimensions in a spatial data cube:
nonspatial dimensions, spatial-to-nonspatial dimensions, and spatial-to-spatial dimensions. We
distinguish two types of measures in a spatial data cube: numerical measures and spatial
measures. A nonspatial data cube contains only nonspatial dimensions and numerical measures. If
a spatial data cube contains spatial dimensions but no spatial measures, then its OLAP operations
(such as drilling or pivoting) can be implemented in a manner similar to that of nonspatial data
cubes. If a user needs to use spatial measures in a spatial data cube, we can selectively
precompute some spatial measures in the spatial data cube. Which portion of the cube should be
selected for materialization depends on the utility (such as access frequency or access priority),
sharability of merged regions, and the balanced overall cost of space and online computation.
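As a minimal sketch of the selective-materialization idea above (an illustration only, not part of the original answer), one could greedily choose which cuboid cells to precompute by ranking them on access frequency per unit of storage until a space budget is used up; all names and numbers here are assumptions.

```python
# Greedy selection of spatial-measure cells to pre-materialize.
from dataclasses import dataclass

@dataclass
class CuboidCell:
    name: str
    access_freq: int      # utility: how often queries touch this cell
    storage_cost: float   # space needed to store its pre-merged region

def select_for_materialization(cells, space_budget):
    # Rank by utility density: accesses served per unit of space.
    ranked = sorted(cells, key=lambda c: c.access_freq / c.storage_cost,
                    reverse=True)
    chosen, used = [], 0.0
    for cell in ranked:
        if used + cell.storage_cost <= space_budget:
            chosen.append(cell.name)
            used += cell.storage_cost
    return chosen

cells = [CuboidCell("month-region", 500, 8.0),
         CuboidCell("month-precipitation", 120, 3.0),
         CuboidCell("region-temperature", 60, 6.0)]
print(select_for_materialization(cells, space_budget=10.0))
# ['month-region'] under this toy budget
```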
5. Suppose a company would like to design a data warehouse to facilitate the analysis of moving
vehicles in an online analytical processing manner. The company registers huge amounts of auto
movement data in the format of (Auto ID, location, speed, time). Each Auto ID represents a
vehicle associated with information, such as vehicle category, driver category, etc., and each
location may be associated with a street in a city. Assume that a street map is available for the
city.
(a) Design such a data warehouse to facilitate effective online analytical processing in multidimensional
space.
(b) The movement data may contain noise. Discuss how you would develop a method to automatically
discover data records that were likely erroneously registered in the data repository.
(c) The movement data may be sparse. Discuss how you would develop a method that constructs a
reliable data warehouse despite the sparsity of data.
(d) If one wants to drive from A to B starting at a particular time, discuss how a system may use the data
in this warehouse to work out a fast route for the driver.
a) Ans-
To design a data warehouse for the analysis of moving vehicles, we can treat vehicle movement as the fact table, pointing to four dimensions: auto, time, location, and speed. The measures can vary depending on the analysis target and the computational power available; here the measures considered are vehicles sold and vehicle mileage. A star schema can be drawn accordingly (diagram not reproduced here).
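A hedged relational sketch of such a star schema follows (illustrative only; every column beyond Auto ID, location, speed, and time in the problem statement is an assumption):

```python
# In-memory SQLite rendition of the star schema: one fact table plus
# four dimension tables.
import sqlite3

ddl = """
CREATE TABLE dim_auto     (auto_id INTEGER PRIMARY KEY,
                           vehicle_category TEXT, driver_category TEXT);
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY,
                           hour INT, day INT, month INT, year INT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY,
                           street TEXT, district TEXT, city TEXT);
CREATE TABLE dim_speed    (speed_id INTEGER PRIMARY KEY,
                           speed_range TEXT);
CREATE TABLE fact_vehicle (auto_id INT REFERENCES dim_auto(auto_id),
                           time_id INT REFERENCES dim_time(time_id),
                           location_id INT REFERENCES dim_location(location_id),
                           speed_id INT REFERENCES dim_speed(speed_id),
                           vehicles_sold INT,      -- measure
                           vehicle_mileage REAL);  -- measure
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```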
b) Ans-
To handle noise in the data, we first need to perform data cleaning. Missing values may be filled in or dropped entirely, depending on the tolerance of the system. We can then apply data-smoothing techniques, for example regression and outlier analysis, to remove noisy data points. Finally, we can set up rules based on domain knowledge to detect and remove inconsistent records.
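As one concrete (and purely illustrative) example of the outlier-analysis step, records whose speed deviates strongly from the mean speed on the same street can be flagged; the z-score threshold and field names below are assumptions.

```python
# Flag movement records with an implausible speed for their street.
from collections import defaultdict
from statistics import mean, pstdev

def flag_speed_outliers(records, z_threshold=3.0):
    """records: list of dicts with keys auto_id, location, speed, time."""
    by_street = defaultdict(list)
    for r in records:
        by_street[r["location"]].append(r["speed"])
    flagged = []
    for r in records:
        speeds = by_street[r["location"]]
        mu, sigma = mean(speeds), pstdev(speeds)
        if sigma > 0 and abs(r["speed"] - mu) / sigma > z_threshold:
            flagged.append(r)   # likely an erroneous registration
    return flagged
```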
c) Ans-
The data in the warehouse may be sparse, and analyzing sparse data is not reliable, since a single outlier may completely shift the results; hence we need an effective way to deal with it. We can evaluate a confidence interval, which is a measure of the reliability of the data in a queried cell: the smaller the confidence interval, the better, while a very large confidence interval indicates a high degree of ambiguity. For our vehicle warehouse we can therefore compute confidence intervals for queried cells. A confidence interval can be large even when the overall data size is large, if the queried data cell contains only a few values or none at all. In such a case, we have two ways to resolve the issue:
Intra-cuboid query expansion: expand the sample size by including nearby cells in the same cuboid as the queried cell, which reduces the confidence interval.
Inter-cuboid query expansion: the extreme case of intra-cuboid expansion, in which a dimension is removed entirely by generalizing it.
Sparsity can also be treated as a missing-value problem: we may use the existing data to estimate the missing value, a technique commonly used in machine learning. For example, if speed is missing, we can view speed as a function of location and time, so the speed previously recorded on that particular street at that particular time may be used in place of the missing value. Provided the semantics of the query are not changed, we can even reduce the dimensionality of the data cube: if a cell is sparse for a query at the level of a particular hour, we may generalize to the level of a day, which may yield values in that cell.
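As a rough sketch (assuming a normal approximation and made-up numbers), the intra-cuboid expansion can be expressed as: widen the sample with neighbouring cells whenever the confidence interval of the queried cell is too wide.

```python
# Confidence-interval-driven intra-cuboid expansion for a mean-speed query.
from math import sqrt
from statistics import mean, stdev

def ci_half_width(values, z=1.96):
    # Half-width of an approximate 95% confidence interval for the mean.
    if len(values) < 2:
        return float("inf")
    return z * stdev(values) / sqrt(len(values))

def query_mean_speed(cell_values, neighbour_values, max_half_width=5.0):
    values = list(cell_values)
    if ci_half_width(values) > max_half_width:
        # Too few points in the queried cell: borrow from nearby cells
        # of the same cuboid (intra-cuboid expansion).
        values += list(neighbour_values)
    return mean(values), ci_half_width(values)

print(query_mean_speed([42.0, 55.0], [48.0, 50.0, 51.0, 47.0, 49.0]))
```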
d) Ans.-
Using this warehouse, we can look up the information for vehicles of the same vehicle category and driver category. Then, using OLAP operations (drill-down, dice, ...), we look up the typical speed at each location at the specific time (at the level of the hour) and use it as the weight of the corresponding street in the city graph. We compute the weights for all streets on the possible paths from the start location to the end location. Using this weighted graph, we can work out the fastest route for the driver with a well-known shortest-path algorithm such as A* or Dijkstra. The weights may need to be updated every hour. This approach ignores the direction of the streets; we can also integrate that information and build a directed graph, and compute the fastest route on it in the same way.
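A minimal Dijkstra sketch over such a street graph is given below; the edge weights are meant to be travel times derived from the warehouse's per-hour speeds, and the toy graph itself is an assumption.

```python
# Shortest (fastest) route on a weighted street graph with Dijkstra.
import heapq

def dijkstra(graph, start, goal):
    """graph: {node: [(neighbour, travel_time_minutes), ...]}"""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [goal], goal             # reconstruct the route
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[goal]

streets = {"A": [("X", 4.0), ("Y", 7.0)],
           "X": [("B", 6.0)],
           "Y": [("B", 2.0)]}
print(dijkstra(streets, "A", "B"))        # (['A', 'Y', 'B'], 9.0)
```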
6. A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base
cuboid. Assume that there are no concept hierarchies associated with the dimensions.
(a) What is the maximum number of cells possible in the base cuboid?
(b) What is the minimum number of cells possible in the base cuboid?
(c) What is the maximum number of cells possible (including both base cells and aggregate cells) in
the data cube, C?
(d) What is the minimum number of cells possible in the data cube, C?
Ans.-
(a) p^n, since each of the n dimensions has p distinct values.
(b) p, since p cells placed on the "diagonal" (one per distinct value) are enough to exhibit all p values in every dimension.
(c) (p + 1)^n, since along each dimension a cube cell takes either one of the p values or the aggregate value *.
(d) (2^n − 1) × p + 1: with the minimal base cuboid of p cells, every cuboid except the apex contains exactly p cells, there are 2^n − 1 such cuboids, and the apex contributes a single cell.
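As a quick worked check (illustrative only), the four formulas can be evaluated for a small case such as n = 3 and p = 4, with (c) also counted explicitly by enumerating every combination of "a value or *" per dimension:

```python
# Evaluate the four counts for small n and p; (c) is additionally counted
# by explicit enumeration of all (value-or-*) combinations.
from itertools import product

def cube_cell_counts(n, p):
    base_max = p ** n                                           # (a)
    base_min = p                                                # (b)
    cube_max = sum(1 for _ in product(range(p + 1), repeat=n))  # (c) = (p+1)^n
    cube_min = (2 ** n - 1) * p + 1                             # (d)
    return base_max, base_min, cube_max, cube_min

print(cube_cell_counts(n=3, p=4))   # (64, 4, 125, 29)
```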
7. RFID (Radio-frequency identification) is commonly used to trace object movement and perform
inventory control. An RFID reader can successfully read an RFID tag from a limited distance at any
scheduled time. Suppose a company would like to design a data warehouse to facilitate the analysis of
objects with RFID tags in an online analytical processing manner. The company registers huge
amounts of RFID data in the format of (RFID, at location, time), and also has some information about
the objects carrying the RFID tag, e.g., (RFID, product name, product category, producer, date
produced, price).
(a) Design a data warehouse to facilitate effective registration and online analytical processing of
such data.
(b) The RFID data may contain lots of redundant information. Discuss a method that maximally
reduces redundancy during data registration in the RFID data warehouse.
(c) The RFID data may contain lots of noise, such as missing registration and misreading of IDs.
Discuss a method that effectively cleans up the noisy data in the RFID data warehouse.
(d) One may like to perform online analytical processing to determine how many TV sets were
shipped from the LA seaport to BestBuy in Champaign, Illinois by month, by brand, and by price
range. Outline how this can be done efficiently if you were to store such RFID data in the warehouse.
(e) If a customer returns a jug of milk and complains that it has spoiled before its expiration date,
discuss how you could investigate such a case in the warehouse to find out what could be the
problem, either in shipping or in storage.
a) Ans.-
An RFID warehouse needs to contain a fact table, stay, composed of cleansed RFID records; an information table, info, that stores path-independent information for each item; and a map table that links together the records in the fact table that form a path. The main difference between an RFID warehouse and a traditional warehouse is the presence of the map table, which links records from the fact table (stay) in order to preserve the original structure of the data.
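A hedged relational sketch of these three tables is shown below; every column not named in the answer (e.g., gid, item_count) is an illustrative assumption.

```python
# In-memory SQLite rendition of the stay / info / map tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stay (gid TEXT,              -- id of a (group of) tag(s)
                   location TEXT,
                   time_in TEXT, time_out TEXT,
                   item_count INT);        -- measure
CREATE TABLE info (rfid TEXT PRIMARY KEY,  -- path-independent item data
                   product_name TEXT, product_category TEXT,
                   producer TEXT, date_produced TEXT, price REAL);
CREATE TABLE map  (gid TEXT,               -- parent group in a path
                   member_gid TEXT);       -- child group or individual RFID
""")
```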
b) Ans.- Each reader provides tuples of the form (RFID, location, time) at fixed time intervals. When an item stays at the same location for a period of time, many such tuples are generated. We can group these tuples into a single tuple of the form (RFID, location, time_in, time_out). For example, if a supermarket has readers on each shelf that scan the items every minute, and items stay on a shelf for one day on average, we get a 1,440-to-1 reduction in size without loss of information.
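A minimal sketch of this grouping step (field layout and sample readings are assumptions):

```python
# Collapse consecutive readings of the same tag at the same location
# into a single stay record (rfid, location, time_in, time_out).
def compress_to_stays(readings):
    """readings: list of (rfid, location, time) tuples sorted by (rfid, time)."""
    stays = []
    for rfid, location, t in readings:
        if stays and stays[-1][0] == rfid and stays[-1][1] == location:
            stays[-1] = (rfid, location, stays[-1][2], t)  # extend current stay
        else:
            stays.append((rfid, location, t, t))           # start a new stay
    return stays

raw = [("tag1", "shelf-7", 0), ("tag1", "shelf-7", 1),
       ("tag1", "shelf-7", 2), ("tag1", "checkout", 3)]
print(compress_to_stays(raw))
# [('tag1', 'shelf-7', 0, 2), ('tag1', 'checkout', 3, 3)]
```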
c) Ans.- One can use the assumption that many RFID objects stay or move together, especially at the early stages of distribution, or use the historically most likely path for a given item, to infer or interpolate missing and erroneous readings.
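One purely illustrative way to apply the "objects move together" assumption: fill a missing reading with the majority location of the other tags in the same shipment at that time (group membership and data below are assumptions).

```python
# Infer a missing location from the peers of the same shipment.
from collections import Counter

def infer_missing_location(tag, time, group_tags, readings):
    """readings: dict {(rfid, time): location}; returns a best guess or None."""
    peer_locations = [readings[(t, time)] for t in group_tags
                      if t != tag and (t, time) in readings]
    if not peer_locations:
        return None
    return Counter(peer_locations).most_common(1)[0][0]

readings = {("tag1", 10): "warehouse-A", ("tag2", 10): "warehouse-A",
            ("tag3", 10): "truck-5"}
print(infer_missing_location("tag4", 10,
                             ["tag1", "tag2", "tag3", "tag4"], readings))
# 'warehouse-A'
```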
d) Ans.- Compute an aggregate measure over the tags that travel through the given set of locations (from the LA seaport to BestBuy in Champaign, Illinois) and that match the selection criteria on the path-independent dimensions (product category = TV, grouped by month, brand, and price range).
e) Ans.- In this case, after obtaining the RFID of the returned milk, we can directly use traditional OLAP operations to retrieve its shipping and storage times efficiently, and thus determine whether the problem occurred in shipping or in storage.
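A small illustrative sketch of such an investigation: walk the stay records along the item's path and flag unusually long stays (the threshold and records below are assumptions).

```python
# Flag suspiciously long stays along a returned item's path.
def flag_long_stays(stays, max_hours):
    """stays: list of (location, time_in, time_out) along the path, in hours."""
    return [(loc, t_out - t_in) for loc, t_in, t_out in stays
            if t_out - t_in > max_hours]

path = [("dairy-plant", 0, 6), ("truck-12", 6, 40), ("store-backroom", 40, 44)]
print(flag_long_stays(path, max_hours=24))   # [('truck-12', 34)]
```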