Notes On Dimension and Facts
MATTEO GOLFARELLI, DARIO MAIO and STEFANO RIZZI
DEIS - Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
{mgolfarelli,dmaio,srizzi}@deis.unibo.it

Data warehousing systems enable enterprise managers to acquire and integrate information from heterogeneous sources and to query very large databases efficiently. Building a data warehouse requires adopting design and implementation techniques completely different from those underlying operational information systems. Though most scientific literature on the design of data warehouses concerns their logical and physical models, an accurate conceptual design is the necessary foundation for building a DW which is well-documented and fully satisfies requirements. In this paper we formalize a graphical conceptual model for data warehouses, called Dimensional Fact model, and propose a semi-automated methodology to build it from the pre-existing (conceptual or logical) schemes describing the enterprise relational database. The representation of reality built using our conceptual model consists of a set of fact schemes whose basic elements are facts, measures, attributes, dimensions and hierarchies; other features which may be represented on fact schemes are the additivity of fact attributes along dimensions, the optionality of dimension attributes and the existence of non-dimension attributes. Compatible fact schemes may be overlapped in order to relate and compare data for drill-across queries. Fact schemes should be integrated with information of the conjectured workload, to be used as the input of logical and physical design phases; to this end, we propose a simple language to denote data warehouse queries in terms of sets of fact instances.

Keywords: Data warehouse, Conceptual models, Multidimensional data model, Entity-Relationship model
1. Introduction The database community is devoting increasing attention to the research themes concerning data warehouses; in fact, the development of decision-support systems will probably be one of the leading issues for the coming years. The enterprises, after having invested a lot of time and resources to build huge and complex information systems, ask for support in quickly obtaining summary information which may help managers in planning and decision-making. Data warehousing systems address this issue by enabling managers to acquire and integrate information from different sources and to query very large databases efficiently. The topic of data warehousing encompasses application tools, architectures, information service and communication infrastructures to synthesize information useful for decision-making from distributed heterogeneous operational data sources. This
1 This work was partially supported by the INTERDATA project from the Italian Ministry of University and Scientific Research and by Olivetti Sanità.
information is brought together into a single repository, called a data warehouse (DW), suitable for direct querying and analysis and as a source for building logical data marts oriented to specific areas of the enterprise.17 While it is universally recognized that a DW leans on a multidimensional model, little is said about how to carry out its conceptual design starting from the user requirements. On the other hand, we argue that an accurate conceptual design is the necessary foundation for building an information system which is both well-documented and fully satisfies requirements. The Entity/Relationship (E/R) model is widespread in the enterprises as a conceptual formalism to provide standard documentation for relational information systems, and a great deal of effort has been made to use E/R schemes as the input for designing non-relational databases as well8; unfortunately, as argued in Ref. 17: "Entity relation data models [...] cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses." In this paper we present a graphical conceptual model for DWs, called Dimensional Fact Model (DFM). The representation of reality built using the DFM is called dimensional scheme, and consists of a set of fact schemes whose basic elements are facts, dimensions and hierarchies. Compatible fact schemes may be overlapped in order to relate and compare data. Fact schemes may be integrated with information of the conjectured workload, expressed in terms of fact instance expressions denoting queries, to be used as the input of a design phase whose output are the logical and physical schemes of the DW. To this end, we propose a simple language to denote data warehouse queries in terms of sets of fact instances. Most information systems implemented in enterprises during the last decade are relational, and in most cases their analysis documentation consists of E/R schemes. 
In this paper we propose a semi-automated methodology to carry out conceptual modelling starting from the pre-existing E/R schemes describing the operational information system. In some cases, the E/R documentation held by the enterprise is incomplete or incorrect; often, the only documentation available consists of logical relational schemes. Thus, we show how our methodology can be applied starting from the database logical scheme. After surveying the literature on DWs in Section 2, in Section 3 we describe the DFM and introduce fact instance expressions as a formalism to denote DW queries. In Section 4, the overlapping of related fact schemes is discussed. Section 5 describes a methodology for deriving fact schemes from the schemes describing the operational database. 2. Background and literature on data warehousing From a functional point of view, the data warehouse process consists of three phases: extracting data from distributed operational sources; organizing and integrating data consistently into the DW; accessing the integrated data in an efficient and flexible fashion. The first phase encompasses typical issues concerning distributed heterogeneous
information services, such as inconsistent data, incompatible data structures, data granularity, etc. (for instance, see Ref. 23). The third phase requires capabilities of aggregate navigation12, optimization of complex queries6, advanced indexing techniques18 and friendly visual interface to be used for On-Line Analytical Processing (OLAP) 7,5 and data mining.9 As to the second phase, designing the DW requires techniques completely different from those adopted for operational information systems. While most scientific literature on the design of DWs focuses on specific issues such as materialization of views2,15 and index selection13,16, no significant effort has been made so far to develop a complete and consistent design methodology. The apparent lack of interest in the issues related to conceptual design can be explained as follows: (a) data warehousing was initially devised within the industrial world, as a result of practical demands of users who typically do not give predominant importance to conceptual issues; (b) logical and physical design have a primary role in optimizing the system performances, which is the main goal in data warehousing applications. In Ref. 19, the author proposes an approach to the design of DWs based on a business model of the enterprise which is actually a relational database scheme. Regretfully, conceptual and logical design are mixed up; since logical design is necessarily targeted towards a logical model (relational in this case), no unifying conceptual model of data is devised. Ref. 1 and Ref. 14 propose two data models for multidimensional databases and the related algebras. Both models are at the logical level, thus, they do not address conceptual modelling issues such as the structure of attribute hierarchies and nonadditivity constraints. The approach to conceptual DW modeling presented in Ref. 
4 shares several ideas with our early work on the topic10, though it is mainly addressed towards representing attribute hierarchies and neglects other conceptual issues such as additivity and scheme overlapping. The multidimensional model may be mapped on the logical level differently depending on the underlying DBMS. If a DBMS directly supporting the multidimensional model is used, fact attributes are typically represented as the cells of multidimensional arrays whose indices are determined by key attributes.15 On the other hand, in relational DBMSs the multidimensional model of the DW is mapped in most cases through star schemes17 consisting of a set of dimension tables and a central fact table. Dimension tables are strongly denormalized and are used to select the facts of interest based on the user queries. The fact table stores fact attributes; its key is defined by importing the keys of the dimension tables. Different versions of these base schemes have been proposed in order to improve the overall performances3, handle the sparsity of data20 and optimize the access to aggregated data.16 In particular, the efficiency issues raised by data warehousing have been dealt with by means of new indexing techniques (see Ref. 22 for a survey), among which we mention bitmap indices.20
3. The Dimensional Fact Model

Definition 1. Let $g=(V,E)$ be a directed, acyclic and weakly connected graph. We say g is a quasi-tree with root in $v_0 \in V$ if each other vertex $v_j \in V$ can be reached from $v_0$ through at least one directed path. We will denote with $path_{0j}(g) \subseteq g$ a directed path starting in $v_0$ and ending in $v_j$; given $v_i \in path_{0j}(g)$, we will denote with $path_{ij}(g) \subseteq g$ a directed path starting in $v_i$ and ending in $v_j$. We will denote with $sub(g,v_i) \subseteq g$ the quasi-tree rooted in $v_i \neq v_0$.

Within a quasi-tree, two or more directed paths may converge on the same vertex. A quasi-tree in which the root is connected to each other vertex through exactly one path degenerates into a directed tree.

A dimensional scheme consists of a set of fact schemes. The components of fact schemes are facts, measures, dimensions and hierarchies. In the following an intuitive description of these concepts is given; a formal definition of fact schemes can be found in Definition 2.

A fact is a focus of interest for the decision-making process; typically, it models an event occurring in the enterprise world (e.g., sales and shipments). Measures are continuously valued (typically numerical) attributes which describe the fact from different points of view; for instance, each sale is measured by its revenue. Dimensions are discrete attributes which determine the minimum granularity adopted to represent facts; typical dimensions for the sale fact are product, store and date. Hierarchies are made up of discrete dimension attributes linked by -to-one relationships, and determine how facts may be aggregated and selected significantly for the decision-making process. The dimension in which a hierarchy is rooted defines its finest aggregation granularity; the other dimension attributes define progressively coarser granularities. A hierarchy on the product dimension will probably include the dimension attributes product type, category, department, department manager. Hierarchies may also include non-dimension attributes.
A nondimension attribute contains additional information about a dimension attribute of the hierarchy, and is connected by a -to-one relationship (e.g., the department address); unlike dimension attributes, it cannot be used for aggregation. Some multidimensional models in the literature focus on treating dimensions and measures symmetrically.1,14 This promises to be an important achievement from both the point of view of the uniformity of the logical model and that of the flexibility of OLAP operators. Nevertheless we claim that, at a conceptual level, distinguishing between measures and dimensions is important since it allows the logical design to be more specifically aimed at the efficiency required by data warehousing applications. Definition 2. A fact scheme is a sextuple f = (M, A, N, R, O, S) where:
M is a set of measures. Each measure $m_i \in M$ is defined by a numeric or Boolean expression which involves values acquired from the operational information systems.

A is a set of dimension attributes. Each dimension attribute $a_i \in A$ is characterized by a discrete domain of values, $Dom(a_i)$.

N is a set of non-dimension attributes.

R is a set of ordered couples, each having the form $(a_i,a_j)$ where $a_i \in A \cup \{a_0\}$ and $a_j \in A \cup N$ ($a_i \neq a_j$), such that the graph $qt(f)=(A \cup N \cup \{a_0\},R)$ is a quasi-tree with root $a_0$. $a_0$ is a dummy attribute playing the role of the fact on which the scheme is centred. The couple $(a_i,a_j)$ models a -to-one relationship between attributes $a_i$ and $a_j$. We call dimension pattern the set $Dim(f)=\{a_i \in A \mid \exists (a_0,a_i) \in R\}$; each element in $Dim(f)$ is a dimension. When we need to emphasize that an attribute $a_i$ is a dimension, we will denote it as $d_i$. The hierarchy on dimension $d_i \in Dim(f)$ is the quasi-tree rooted in $d_i$, $sub(qt(f),d_i)$.

$O \subseteq R$ is a set of optional relationships. The domain of each dimension attribute $a_j$ such that $\exists (a_i,a_j) \in O$ includes a 'null' value.

S is a set of aggregation statements, each consisting of a triple $(m_j, d_i, \Omega)$ where $m_j \in M$, $d_i \in Dim(f)$ and $\Omega \in \{$'SUM','AVG','COUNT','MIN','MAX','AND','OR',...$\}$ (aggregation operator). Statement $(m_j, d_i, \Omega) \in S$ declares that measure $m_j$ can be aggregated along dimension $d_i$ by means of the grouping operator $\Omega$. If no aggregation statement exists for a given pair $(m_j, d_i)$, then $m_j$ cannot be aggregated at all along $d_i$.

In the following we will discuss the graphic representation of the concepts introduced above with reference to the fact scheme SALE, shown in Figure 1, which describes the sales in a chain store. This scheme, as well as the INVENTORY and the SHIPMENT schemes proposed in Section 4, are based on the star schemes reported in Ref. 17. In the DFM, a fact scheme is structured as a quasi-tree whose root is a fact. A fact is represented by a box which reports the fact name and, typically, one or more measures.
In the sale scheme, quantity sold, revenue and no. of customers are measures. Dimension attributes are represented by circles. Each dimension attribute directly attached to the fact is a dimension. The dimension pattern of the sale scheme is {date, product, store, promotion}. Non-dimension attributes are always terminal within the quasi-tree, and are represented by lines (for instance, address). Subtrees rooted in dimensions are hierarchies. The arc connecting two attributes represents a -to-one relationship between them (for instance, there is a many-to-one relationship between city and county); thus, every directed path within one hierarchy necessarily represents a -to-one relationship between the starting and the ending attributes. We denote with i.aj the value of aj determined by value iDom(ai) assumed by ai (for instance, Venice.state denotes Italy); by convention, i.ai=i.
[Figure 1: the SALE fact with measures qty sold, revenue and no. of customers. Attribute labels: date, month, quarter, year, season, day of week, holiday; product, type, category, department, manager, marketing group, brand, city, weight, diet; store, city, county, state, sale district, sales manager; promotion. Legend labels: fact, dimension, dimension attribute, hierarchy, aggregation.]
Fig. 1. The SALE fact scheme. Arrows are placed by convention only on the attributes where two or more paths converge.
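To make Definitions 1 and 2 concrete, the following Python sketch checks the quasi-tree property and derives the dimension pattern for a fragment of the SALE scheme. The function names, the use of the string "SALE" for the dummy root a0, and the exact arc list (for instance the arc from store to sale district) are assumptions made for illustration; only the attribute names come from the text.

```python
from collections import defaultdict

def is_quasi_tree(vertices, arcs, root):
    """Definition 1: acyclic, and every vertex reachable from the root."""
    succ = defaultdict(list)
    for src, dst in arcs:
        succ[src].append(dst)

    # Acyclicity check: depth-first search with three colours.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in vertices}

    def has_cycle(v):
        colour[v] = GREY
        for w in succ[v]:
            if colour[w] == GREY or (colour[w] == WHITE and has_cycle(w)):
                return True
        colour[v] = BLACK
        return False

    if any(colour[v] == WHITE and has_cycle(v) for v in vertices):
        return False

    # Reachability from the root (which also implies weak connectivity).
    seen, stack = {root}, [root]
    while stack:
        for w in succ[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == set(vertices)

def dimension_pattern(arcs, root):
    """Dim(f): the attributes directly attached to the fact a0."""
    return {dst for src, dst in arcs if src == root}

# Assumed fragment of the set R for the SALE scheme.
ARCS = [
    ("SALE", "date"), ("SALE", "product"), ("SALE", "store"), ("SALE", "promotion"),
    ("store", "city"), ("city", "county"), ("county", "state"),
    ("store", "sale district"), ("sale district", "state"),  # second path to state
    ("product", "type"), ("type", "category"), ("product", "brand"),
]
VERTICES = {v for arc in ARCS for v in arc}

print(is_quasi_tree(VERTICES, ARCS, "SALE"))
print(sorted(dimension_pattern(ARCS, "SALE")))
```

The two paths converging on state make this fragment a quasi-tree rather than a directed tree, which is the situation the definition is meant to allow.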
The fact scheme may not be a tree: in fact, two or more distinct paths may connect two given dimension attributes within a hierarchy, provided that every directed path still represents a -to-one relationship. Consider for instance the hierarchy on dimension store: states are partitioned into counties and sale districts, and no relationship exists between them; nevertheless, a store belongs to the same state whichever of the two paths is followed (i.e., store determines state). Thus, notation i.aj explained above is still not ambiguous even if two or more paths connect ai to aj. On the other hand, consider attribute city on the product dimension, which represents the city where a brand is manufactured. In this case the two city attributes have different semantics and must be represented separately; in fact, a product manufactured in a city can be sold in stores of other cities. Optional relationships between pairs of attributes are represented by marking with a dash the corresponding arc. For instance, attribute diet takes a value only for food products; for the other products, it will take a conventional null value. A measure is additive on a dimension if its values can be aggregated along the corresponding hierarchy by the sum operator. Since this is the most frequent case, in order to simplify the graphic notation in the DFM, only the exceptions are represented explicitly. In particular, given measure mj and dimension di:
1. If $(m_j, d_i, \text{'SUM'}) \notin S$ ($m_j$ is not additive along $d_i$), $m_j$ and $d_i$ are connected by a dashed line labelled with all aggregation operators $\Omega$ (if any) such that $(m_j, d_i, \Omega) \in S$ (for instance, see Figures 1 and 5).
2. If $(m_j, d_i, \text{'SUM'}) \in S$ ($m_j$ is additive along $d_i$):
2.1 If $\nexists\, \Omega \neq \text{'SUM'} \mid (m_j, d_i, \Omega) \in S$ (only sum can be used for aggregation), $m_j$ and $d_i$ are not graphically connected.
2.2 Otherwise (other operators can be used besides the sum), $m_j$ and $d_i$ are connected by a dashed line labelled with the symbol '+' followed by all the other operators $\Omega \neq \text{'SUM'}$ such that $(m_j, d_i, \Omega) \in S$.

Additivity will be discussed in more detail in Subsection 3.3.

3.1. Fact instances

Given a fact scheme f, each n-tuple of values taken from the domains of the n dimensions of f defines an elemental cell where one unit of information for the DW can be represented. We call primary fact instances the units of information present within the DW, each characterized by exactly one value for each measure. We will denote with $pf(\alpha_1,\ldots,\alpha_n)$ the primary fact instance corresponding to the combination of values $(\alpha_1,\ldots,\alpha_n) \in Dom(d_1) \times \ldots \times Dom(d_n)$. In the sale scheme, each primary instance describes the sales of one product during one day in one store adopting one promotion ('no promotion' should be considered as a particular case of promotion).

Not every possible combination of values necessarily originates a primary fact instance. For instance, in the sale scheme, a missing primary fact instance denotes that a product was not on sale on a given day in a given store (null assumption); this is different from having a primary fact instance with qty=0, which denotes that the product remained unsold. Alternatively, it might be reasonable to assume that all products are always on sale, hence, that a missing primary fact instance denotes that the product remained unsold (zero assumption). Some issues related to these two different interpretations will be discussed in Subsection 3.3.
Since analysing data at the maximum level of detail is often overwhelming, it may be useful to aggregate primary fact instances at different levels of abstraction, each corresponding to an aggregation pattern; if a given dimension is not interesting for the current analysis, aggregation is carried out over all the possible values that dimension can assume. In the OLAP terminology, this operation is called roll-up.

Definition 3. Given a fact scheme f with n dimensions, a v-dimensional aggregation pattern ($0 \le v$) is a set $P=\{a_1,\ldots,a_v\}$ where:
1. $\forall i=1,\ldots,v\ (a_i \in A)$;
2. $P \neq Dim(f)$;
3. $\forall a_i \in P\ (\nexists\, a_j \in P,\ a_i \neq a_j \mid a_j \in sub(qt(f),a_i))$ (i.e., no directed path exists between each pair of attributes in P).
A dimension $d_i \in Dim(f)$ is said to be hidden within P if no attribute of its hierarchy $sub(qt(f),d_i)$ appears within P. An aggregation pattern P is legal with reference to measure $m_j \in M$ if

$$\forall d_k \mid \nexists\, (m_j, d_k, \Omega) \in S \Rightarrow d_k \in P$$

Examples of aggregation patterns in the sale scheme are {product,county,month,promotion}, {state,date} (product and promotion are hidden), {year,season} (two attributes are taken from dimension date), {} (all dimensions are hidden). Pattern {brand,month} is illegal with reference to no. of customers since the latter cannot be aggregated along the product hierarchy.

Let $P=\{a_1,\ldots,a_v\}$ be an aggregation pattern, and $d_{h^*}$ denote the dimension whose hierarchy includes $a_h \in P$. The secondary fact instance $sf(\beta_1,\ldots,\beta_v)$ corresponding to the combination of values $(\beta_1,\ldots,\beta_v) \in Dom(a_1) \times \ldots \times Dom(a_v)$ aggregates the set of primary fact instances

$$\{pf(\alpha_1,\ldots,\alpha_n) \mid \forall k \in \{1,\ldots,n\}\ \alpha_k \in Dom(d_k) \wedge \forall h \in \{1,\ldots,v\}\ \alpha_{h^*}.a_h = \beta_h\}$$
[Figure 2: two cubes, (a) and (b); axis labels: store, date, product, city, category.]

Fig. 2. A primary (a) and a secondary (b) fact instance for the SALE scheme (dimension promotion is omitted for clarity).
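The roll-up operation and the legality condition on aggregation patterns can be sketched in a few lines of Python. Everything below is illustrative: the primary fact instances, the date encoding, the store-to-city mapping and the set of aggregation statements are hypothetical, and promotion is left out as in Figure 2.

```python
from collections import defaultdict

# Hypothetical primary fact instances: (date, product, store) -> qty sold.
PRIMARY = {
    ("1998-01-05", "P1", "S1"): 10,
    ("1998-01-05", "P2", "S1"): 4,
    ("1998-01-06", "P1", "S2"): 7,
    ("1998-02-01", "P1", "S1"): 3,
}

# Hypothetical -to-one mappings along the date and store hierarchies.
CITY_OF = {"S1": "Venice", "S2": "Bologna"}

def month_of(date):
    return date[:7]

def roll_up(primary, key_fn, op=sum):
    """Group primary instances by the pattern coordinates and aggregate them."""
    groups = defaultdict(list)
    for coords, value in primary.items():
        groups[key_fn(*coords)].append(value)
    return {key: op(values) for key, values in groups.items()}

# Pattern {month, city}: dimension product is hidden.
by_month_city = roll_up(PRIMARY, lambda d, p, s: (month_of(d), CITY_OF[s]))
# Pattern {}: every dimension is hidden.
grand_total = roll_up(PRIMARY, lambda d, p, s: ())

# Hypothetical aggregation statements S as (measure, dimension, operator).
DIMENSIONS = ["date", "product", "store"]
STATEMENTS = {
    ("qtySold", "date", "SUM"), ("qtySold", "product", "SUM"),
    ("qtySold", "store", "SUM"),
    ("nCustomers", "date", "SUM"), ("nCustomers", "store", "SUM"),
}

def is_legal(pattern, measure):
    """A pattern is legal if it keeps every dimension the measure cannot be aggregated on."""
    aggregable = {d for (m, d, _op) in STATEMENTS if m == measure}
    return all(d in pattern for d in DIMENSIONS if d not in aggregable)

print(by_month_city)
print(grand_total)
print(is_legal({"brand", "month"}, "nCustomers"))
```

With these statements, {brand, month} is rejected for the number of customers because the product dimension itself is missing from the pattern, while the same pattern is accepted for the quantity sold.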
In the following, we will sometimes use the term pattern to denote either the dimension pattern or an aggregation pattern.

3.2. Representing queries on the dimensional scheme

In general, querying an information system means linking different concepts through user-defined paths in order to retrieve some data of interest; in particular, for relational databases this is done by formulating a set of joins to connect relation schemes. On the other hand, a substantial number of queries on DWs are aimed at extracting summary data to fill structured reports to be analysed for decisional or statistical purposes. Thus, within our framework, a typical DW query can be represented by the set of fact instances, at any aggregation level, whose measure values are to be retrieved.

In this subsection we discuss how sets of fact instances can be denoted by writing fact instance expressions. The simple language we propose is aimed at defining, with reference to a dimensional scheme, the queries forming the expected workload for the DW, to be used for logical design; thus, it focuses on which data must be retrieved and at which level they must be consolidated. A fact instance expression has the general form:

    <fact instance expression> ::= <fact name> ( <pattern clause> ; <selection clause> )
    <pattern clause> ::= comma-list of <pattern elements>
    <pattern elements> ::= <dimension name> | <dimension name>.<attribute name>
    <selection clause> ::= comma-list of <predicate>

The pattern clause describes a pattern. The selection clause contains a set of Boolean predicates which may either select a subset of the aggregated fact instances or affect the way fact instances are aggregated. If an attribute involved either in a pattern clause or in a selection clause is not a dimension, it should be referenced by prefixing its dimension name.
The value(s) assumed by a measure within the fact instance(s) described by a fact instance expression is(are) denoted as follows:

    <measure values> ::= <fact instance expression>.<measure>

Given a fact scheme f having n dimensions $d_1,\ldots,d_n$, consider the fact instance expression

$$f(d_1,\ldots,d_p,a_{p+1},\ldots,a_v\ ;\ e_1(b_{i_1}),\ldots,e_h(b_{i_h})) \qquad (1)$$

where we have assumed, without loss of generality, that:
The first p pattern elements involve a dimension and the other $v-p$ involve a dimension attribute ($0 \le p \le v$).
Each Boolean predicate $e_j$ ($j=1,\ldots,h$, $h \ge 0$) involves one attribute $b_{i_j}$ belonging to the hierarchy rooted in $d_{i_j^*}$, which may also be hidden.
If p=v=n (i.e., the pattern clause describes the dimension pattern), expression (1) denotes the set of primary fact instances
For instance, the expression

    SALE(date, product, store, promotion ; date.year>='1995',product='P5').qtySold

denotes the quantities of product P5 sold in each store, with each promotion, during each day since 1995.

Otherwise ($p<v$ and/or at least one dimension is hidden), let P be the aggregation pattern described by the pattern clause. Let $b_{i_j}$ be the attribute involved by $e_j$; we say $e_j$ is external if $\exists\, a_{i_j^*} \in P \mid a_{i_j^*} \in path_{0 i_j}(qt(f))$, internal otherwise (see Figure 3). External predicates restrict the set of secondary fact instances to be returned, while internal predicates determine which primary fact instances will form each secondary fact instance. Let $e_1,\ldots,e_r$ and $e_{r+1},\ldots,e_h$ be, respectively, the external and the internal predicates ($0 \le r \le h$); in this case, expression (1) denotes the set of secondary fact instances
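where each $sf(\beta_1,\ldots,\beta_v)$ aggregates the set of primary fact instances

$$\{pf(\alpha_1,\ldots,\alpha_n) \mid \forall k \in \{1,\ldots,n\}\ \alpha_k \in Dom(d_k) \wedge \forall h \in \{1,\ldots,v\}\ \alpha_{h^*}.a_h = \beta_h \wedge \forall j \in \{r+1,\ldots,h\}\ e_j(\alpha_{i_j^*}.b_{i_j})\}$$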
Fig. 3. Representation of a fact instance expression on qt(f): black circles represent the attributes in the aggregation pattern, crosses mark the attributes on which selection predicates are defined. The predicates on b1 and b2 are internal; that on b3 is external.
Consider, for instance, the two expressions

    SALE(date.month, product.type ; date.month='JAN98', product.category='food').qtySold
    SALE(date.month, product.type ; date.month='JAN98', product.brand='General').qtySold

which denote, respectively, the total sales of each type of products of category 'food' for January 1998 (Figure 4.a) and the total sales of each type of products of brand 'General' for January 1998 (Figure 4.b). The predicates on month and on category are external, whereas that on brand is internal. With reference to the sample set of data in Table I, and considering that qtySold is additive on all the dimensions, the results of the two expressions are shown in Table II.
[Figure 4: two cubes with axes store, product and date, whose cells carry qty sold and revenue values; panel (a) is annotated month=JAN98, category='food'; panel (b) is annotated month=JAN98.]
Fig. 4. Sales of the three types of products of category 'food' (a) and sales of all four types of products but including only the products of brand 'General' (b).
product GD BB GB BS GS BT
qtySold 100 50 50
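The distinction between external and internal predicates can be checked mechanically: a predicate on attribute b is external when some attribute of the pattern lies on a directed path from the fact to b. The Python sketch below applies this test to the two expressions above. The arc list is an assumed fragment of the SALE scheme and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed fragment of qt(f) for SALE, with "SALE" standing for the root a0.
ARCS = [
    ("SALE", "date"), ("date", "month"),
    ("SALE", "product"), ("product", "type"), ("type", "category"),
    ("product", "brand"),
]

def attributes_on_paths_to(arcs, target):
    """Attributes lying on some directed path from the root to target (target included)."""
    preds = defaultdict(set)
    for src, dst in arcs:
        preds[dst].add(src)
    seen, stack = {target}, [target]
    while stack:
        for p in preds[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def classify(pattern, predicate_attribute):
    """Return 'external' if a pattern attribute lies on a path to the predicate attribute."""
    if set(pattern) & attributes_on_paths_to(ARCS, predicate_attribute):
        return "external"
    return "internal"

PATTERN = {"month", "type"}
for attribute in ("month", "category", "brand"):
    print(attribute, classify(PATTERN, attribute))
```

Because every vertex of a quasi-tree is reachable from the root, walking the arcs backwards from the predicate attribute collects exactly the attributes that sit on some root-to-attribute path.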
A significant number of DW queries require consolidating data on multiple levels of abstraction; these queries can be expressed in our language as the union of two or more sets of fact instances. For instance, the query requiring the sales of products of brand 'General' for each month, showing also the subtotals for each year and the total, can be expressed as follows:

    SALE(date.month, product ; product.brand='General').qtySold $\cup$
    SALE(date.year, product ; product.brand='General').qtySold $\cup$
    SALE(product ; product.brand='General').qtySold
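A query of this kind amounts to evaluating the same selection at several patterns and returning the results together. The following Python sketch does so for hypothetical rows of the form (month, product, brand, qtySold); the data, the month encoding and the helper name `level` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical rows: (month, product, brand, qtySold).
ROWS = [
    ("1997-12", "P1", "General", 5),
    ("1998-01", "P1", "General", 10),
    ("1998-01", "P2", "Other", 4),
    ("1998-02", "P1", "General", 3),
]

def level(rows, key_fn, brand="General"):
    """Sum qtySold at one pattern, keeping only the products of the given brand."""
    totals = defaultdict(int)
    for month, product, row_brand, qty in rows:
        if row_brand == brand:
            totals[key_fn(month, product)] += qty
    return dict(totals)

# One entry per set of fact instances in the union.
report = {
    "month,product": level(ROWS, lambda m, p: (m, p)),
    "year,product": level(ROWS, lambda m, p: (m[:4], p)),
    "product": level(ROWS, lambda m, p: (p,)),
}

for pattern, instances in report.items():
    print(pattern, instances)
```

Keeping the three result sets under separate keys preserves the pattern each value belongs to, which a flat union of numbers would lose.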
3.3. Additivity

Aggregation requires defining a proper operator to compose the measure values characterizing primary fact instances into measure values characterizing each secondary fact instance.

Definition 4. Given a fact scheme f, measure $m_j \in M$ is said to be aggregable on dimension $d_k \in Dim(f)$ if $\exists (m_j, d_k, \Omega) \in S$, non-aggregable otherwise. Measure $m_j$ is said to be additive on $d_k$ if $\exists (m_j, d_k, \text{'SUM'}) \in S$, non-additive otherwise.

As a guideline, most measures in a fact scheme should be additive. An example of additive measure in the sale scheme is qty sold: the quantity sold for a given sales manager is the sum of the quantities sold for all the stores managed by that sales manager.

A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measures can still be aggregated by using operators such as average, maximum, minimum; Figure 5 shows an example where both operators AVG and MIN can be used for aggregation; measure qty expresses, for each product, the number of copies present within each warehouse during each week.
[Figure 5: the INVENTORY fact with measure qty and the operator label AVG, MIN. Attribute labels: product, type, category, brand, weight, package size, package type, units per pallet; week, month, year, season; warehouse, city, state, address.]
For other measures, aggregation is inherently impossible for conceptual reasons. Consider the measure number of customers in the sale example, estimated for a given product, day and store by counting the number of purchase tickets for that product printed on that day in that store. Since the same ticket may include other products, adding or averaging the number of customers for two or more products would lead to an inconsistent result. Thus, number of customers is non-aggregable on the product dimension (while it is additive on the time and the stores dimensions). In this case, the reason for non-aggregability is that the relationship between purchase tickets and products is many-to-many instead of many-to-one: measure number of customers cannot be consistently aggregated on the product dimension, whatever operator is used, unless the grain of fact instances is made finer. If $m_j$ is non-aggregable on $d_k$, any aggregation pattern not including $d_k$ is illegal with reference to $m_j$.

Given a measure $m_j$ aggregable on $d_k$ by operator $\Omega$ and the aggregation pattern $P=\{d_1,\ldots,d_{k-1},a_k,d_{k+1},\ldots,d_n\}$, which includes all the dimensions except $d_k$, the latter being represented by another dimension attribute $a_k$ belonging to its hierarchy, the value of $m_j$ may be computed for each secondary fact instance at pattern P as:

$$f(d_1,\ldots,a_k,\ldots,d_n\ ;\ d_1=\alpha_1,\ldots,a_k=\beta_k,\ldots,d_n=\alpha_n).m_j = \mathop{\Omega}_{\gamma \in Dom(d_k) \mid \gamma.a_k=\beta_k} f(d_1,\ldots,d_k,\ldots,d_n\ ;\ d_1=\alpha_1,\ldots,d_k=\gamma,\ldots,d_n=\alpha_n).m_j$$

for each $\beta_k \in Dom(a_k)$, $\alpha_i \in Dom(d_i)$ ($i=1,\ldots,n$; $i \neq k$). Similarly, if $d_k$ is hidden within P, it is:

$$f(d_1,\ldots,d_{k-1},d_{k+1},\ldots,d_n\ ;\ d_1=\alpha_1,\ldots,d_n=\alpha_n).m_j = \mathop{\Omega}_{\gamma \in Dom(d_k)} f(d_1,\ldots,d_k,\ldots,d_n\ ;\ d_1=\alpha_1,\ldots,d_k=\gamma,\ldots,d_n=\alpha_n).m_j$$
In the following these formulae are explained with an example. Let the primary fact instances for the INVENTORY fact scheme be those represented in Table III. The matrix reports the values of measure qty; dimension warehouse is not considered for simplicity. A missing primary fact instance denotes that a product was not in the catalogue on a given week. The secondary fact instances at patterns {week, type} and {week} are shown in Table IV. Since qty is additive along product, the quantity for each product type for each week is the sum of the quantities for the products of that type for that week; the total quantity for each week is the sum of all quantities for that week. The secondary fact instances at patterns {month, product} and {product} are shown in Table V; they are calculated using the average function to aggregate qty along week.

    type  product   jan98                      feb98
                    1-98 2-98 3-98 4-98 5-98   6-98 7-98 8-98 9-98
    T1    P1          10   10    8    8   12     12    9    9    7
    T1    P2          50   60   60   40   40     40   35   55   55
    T1    P3          35   30   30   25   20     20   20   20   35
    T2    P4          15   15   15   15   15     15   15    5    5
    T2    P5        20, 20, 30, 20, 20, 10, 10, 5 (eight values; the week of the missing instance is not marked)
Table III. Primary fact instances for a given warehouse (symbol '-' denotes a missing fact instance).
    month   week   T1
    jan98   1-98    95
    jan98   2-98   100
    jan98   3-98    98
    jan98   4-98    73
    jan98   5-98    72
    feb98   6-98    72
    feb98   7-98    64
    feb98   8-98    84
    feb98   9-98    97
Table IV. Secondary fact instances at patterns {week,type} (left) and {week} (right).
    type      T1      T1      T1      T2      T2
    product   P1      P2      P3      P4      P5
    jan98     9.60   50.00   28.00   15.00   18.00
    feb98     9.25   46.25   23.75   10.00   11.25

    product   P1      P2      P3      P4      P5
              9.44   48.33   26.11   12.78   15.00
Table V. Secondary fact instances at patterns {month,product} (top) and {product} (bottom).
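The figures in Tables IV and V can be re-derived from Table III with a short Python sketch. It covers products P1 to P4 only, since their nine weekly values are all present in Table III; the variable names are assumptions of the sketch.

```python
# Weekly qty for products P1..P4, taken from Table III (weeks 1-98 to 9-98).
QTY = {
    "P1": [10, 10, 8, 8, 12, 12, 9, 9, 7],
    "P2": [50, 60, 60, 40, 40, 40, 35, 55, 55],
    "P3": [35, 30, 30, 25, 20, 20, 20, 20, 35],
    "P4": [15, 15, 15, 15, 15, 15, 15, 5, 5],
}
TYPE_OF = {"P1": "T1", "P2": "T1", "P3": "T1", "P4": "T2"}
JAN = slice(0, 5)   # weeks 1-98 .. 5-98
FEB = slice(5, 9)   # weeks 6-98 .. 9-98

def avg(values):
    return sum(values) / len(values)

# Pattern {week, type}, type T1: SUM along product.
t1_by_week = [
    sum(QTY[p][w] for p in QTY if TYPE_OF[p] == "T1") for w in range(9)
]

# Patterns {month, product} and {product}: AVG along week.
jan_avg = {p: round(avg(v[JAN]), 2) for p, v in QTY.items()}
feb_avg = {p: round(avg(v[FEB]), 2) for p, v in QTY.items()}
all_avg = {p: round(avg(v), 2) for p, v in QTY.items()}

print(t1_by_week)
print(jan_avg)
print(feb_avg)
print(all_avg)
```

The T1 row of Table IV and the P1 to P4 columns of Table V come out as printed, which confirms that qty is summed along product and averaged along week.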
As a matter of fact, when using for instance pattern {week}, secondary fact instances could be more conveniently computed by aggregating the secondary fact instances at pattern {week, type} instead of aggregating the primary fact instances. As pointed out in Ref. 11, this can be done efficiently only for distributive and algebraic functions: SUM, MIN, MAX, COUNT, AND, OR fall within the first category, AVG within the second. These optimization issues, which in Ref. 21 are discussed also for complex aggregation queries, fall outside the scope of this paper.

When aggregating instances along two or more dimensions at the same time, it is necessary to declare in which order dimensions are to be considered. Let $\Omega'$ and $\Omega''$ be the operators used to aggregate $m_j$ along $d_1$ and $d_2$ respectively, and $P=\{a_1,a_2\}$ be the aggregation pattern to be computed, where $a_1$ and $a_2$ belong to the hierarchies defined on $d_1$ and $d_2$, respectively. In order to compute the values of $m_j$ at P, two different aggregation sequences can be adopted:

$$\{d_1,d_2\} \xrightarrow{\Omega'} \{a_1,d_2\} \xrightarrow{\Omega''} \{a_1,a_2\}$$
$$\{d_1,d_2\} \xrightarrow{\Omega''} \{d_1,a_2\} \xrightarrow{\Omega'} \{a_1,a_2\}$$

In general, the outcome depends on which sequence is adopted unless one of the following situations occurs:
$\Omega' = \Omega'' \in \{$'SUM','MIN','MAX','AND','OR'$\}$;
$\Omega' \in \{$'SUM','AVG'$\}$ and $\Omega'' =$ 'AVG' (or vice versa) and the zero assumption is made (missing fact instances denote products out of stock).
The restrictions applied when the average operator is involved arise since, when the null assumption is made, the subsets on which average operates may not have the same cardinality. Table VI shows, with reference to the inventory example, the secondary fact instances at patterns {month, type}, {month}, {type} and {} when the zero assumption is made. It is easy to verify that, if the null assumption were made instead, or if function MIN were used to aggregate qty along week, applying the two aggregation sequences

$\{week, product\} \xrightarrow{SUM} \{week, type\} \xrightarrow{MIN} \{month, type\}$
or
$\{week, product\} \xrightarrow{MIN} \{month, product\} \xrightarrow{SUM} \{month, type\}$

would lead to different results.

type     T1     T2     {month}
jan98                  120.60
feb98                  100.50

{type}                 111.67

Table VI. Secondary fact instances at patterns {month,type} (top left), {month} (top right), {type} (bottom left), {} (bottom right).
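The dependence on the aggregation sequence can be checked on the jan98 instances of the three T1 products in Table III, which have no missing values. The sketch below is a minimal Python illustration; the function names are assumptions introduced here.

```python
# qty for the products of type T1 in the jan98 weeks (from Table III).
JAN_T1 = {
    "P1": [10, 10, 8, 8, 12],
    "P2": [50, 60, 60, 40, 40],
    "P3": [35, 30, 30, 25, 20],
}


def avg(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def product_then_week(qty, op_product, op_week):
    """{week, product} -> {week, type} -> {month, type}."""
    per_week = [op_product(column) for column in zip(*qty.values())]
    return op_week(per_week)


def week_then_product(qty, op_product, op_week):
    """{week, product} -> {month, product} -> {month, type}."""
    per_product = [op_week(row) for row in qty.values()]
    return op_product(per_product)
```

With SUM along product and MIN along week the two sequences give 72 and 68 respectively, whereas with SUM and AVG they agree (87.6 up to floating-point rounding), since no instance is missing in this subset.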
In order to give unambiguous semantics to aggregation we suggest that, for each fact scheme, a preferred aggregation sequence is declared by specifying an ordering for dimensions. In the inventory scheme, we believe that the most suitable ordering is (product, warehouse, week) (or, indifferently, (warehouse, product, week)).

It should be noted that the COUNT operator behaves differently from the others. Firstly, it counts the number of primary fact instances within each secondary fact instance; hence, it does not operate on any measure. Furthermore, it is not obvious how counting on a given dimension can be combined with other operators working on the other dimensions. For this reason, we recommend using COUNT on all the dimensions simultaneously.

3.4. Empty facts

A fact scheme is said to be empty if it has no measures ($M=\emptyset$). In this case, primary fact instances only record the occurrence of events. Consider for instance, within the university domain, the fact scheme shown in Figure 6. In this case, each fact instance states that a given student attended a given course during a given year; no measure is used to further describe this fact.
[Fig. 6: the ATTENDANCE fact scheme, with dimensions student, course and year (each labelled COUNT) and attributes name, age, address, sex, nationality, area and faculty.]
In an empty fact scheme, two approaches to the problem of aggregation can be pursued. In the first approach, which requires using either the AND or the OR operators, the information carried by each secondary fact instance is related to the existence of the corresponding primary fact instances. In order to explain this concept, we may suppose that the fact is described by an implicit Boolean measure, which is true if the event occurred and false otherwise: in this case, both operators AND and OR can be used for aggregation, with universal and existential semantics, respectively. For instance:

ATTENDANCE(course.area, student ; year='1998', course.area='Databases', course.faculty='Computer Science')

may denote either the students who during 1998 attended all the database courses in the Computer Science Faculty (AND operator), or the students who during 1998 attended at least one database course in the Computer Science Faculty (OR operator).

In the second approach, which requires using the COUNT operator, the information carried by each secondary fact instance is the number of corresponding primary fact instances. Equivalently, one may suppose that the fact is described by an implicit integer measure, which has value 1 if the event occurred and 0 otherwise, and aggregate fact instances by the SUM operator. For instance:

ATTENDANCE(course, student.sex ; year='1998', course.faculty='Computer Science')

denotes, for each course in the Computer Science Faculty, the number of students of each sex who attended the course. Empty fact schemes correspond, on the logical level, to factless fact tables, typically used for event tracking or as coverage tables.17

4. Overlapping fact schemes

In the DFM, different facts are represented in different fact schemes. However, part of the queries the user formulates on the DW may require comparing measures taken from distinct, though related, schemes; in the OLAP terminology, these are called drill-across queries.
In this section we define the rules for combining two related fact schemes into a new scheme; since the same attribute $a_i$ may appear within different fact schemes, possibly with different domains, we will denote with $Dom_f(a_i)$ the domain of $a_i$ within scheme f.

Definition 5. Two fact schemes $f'=(M',A',N',R',O',S')$ and $f''=(M'',A'',N'',R'',O'',S'')$ are said to be compatible if they share at least one dimension attribute: $A' \cap A'' \neq \emptyset$. Attribute $a_i$ is considered to be common to f' and f" if, within f' and f", it has the same semantics and if $Dom_{f'}(a_i) \cap Dom_{f''}(a_i) \neq \emptyset$.

Definition 6. Given a quasi-tree $t=(V \cup \{a_0\},E)$ with root $a_0$, and a subset of vertices $I \subseteq V$, we define the contraction of t on I as the quasi-tree $cnt(t,I)=(I \cup \{a_0\},E^*)$ where

$E^* = \{(a_i,a_j) \mid a_i \in I \cup \{a_0\} \wedge a_j \in I \wedge \exists\, path_{ij}(t) \wedge \forall a_k \in I \setminus \{a_i,a_j\}\ (a_k \notin path_{ij}(t))\}$

The arcs of cnt(t,I) are the directed paths which, inside t, connect pairs of vertices of I without including other vertices of I. A quasi-tree can be contracted on a given set of vertices by applying an appropriate sequence of arc contractions, i.e., a sequence in which each step replaces two consecutive vertices $a_i$ and $a_j$ by a single vertex $a_i$ adjacent to those vertices that were previously adjacent to $a_i$ or $a_j$. Figure 7 shows a quasi-tree and its contraction on a subset of the vertices.
Fig. 7. A quasi-tree (a) and its contraction on the black vertices (b). The grey vertex is the root.
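The contraction of Definition 6 can be sketched in Python as follows. The representation of a quasi-tree as a dictionary mapping each vertex to the list of its children, the function name contract and the example vertices are assumptions made for illustration; they do not reproduce the quasi-tree of Figure 7.

```python
def contract(arcs, root, keep):
    """Return the arcs of cnt(t, I) for a quasi-tree t.

    arcs: dict mapping each vertex to the list of its children
    root: the root a0 of the quasi-tree
    keep: the subset I of vertices on which t is contracted
    """
    keep = set(keep)
    contracted = set()
    for start in keep | {root}:
        stack = list(arcs.get(start, []))
        seen = set()
        while stack:
            vertex = stack.pop()
            if vertex in seen:
                continue
            seen.add(vertex)
            if vertex in keep:
                # The path stops at the first vertex of I that it meets.
                contracted.add((start, vertex))
            else:
                # Vertices outside I are skipped over.
                stack.extend(arcs.get(vertex, []))
    return contracted


# Hypothetical quasi-tree with root 0.
EXAMPLE_ARCS = {0: [1, 2], 1: [3], 3: [4, 5], 2: [6]}
EXAMPLE_CONTRACTION = contract(EXAMPLE_ARCS, 0, {3, 4, 6})
```

In this example vertices 1, 2 and 5 are dropped, so the contracted quasi-tree has arcs (0,3), (0,6) and (3,4).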
Definition 7. Let two compatible fact schemes $f'=(M',A',N',R',O',S')$ and $f''=(M'',A'',N'',R'',O'',S'')$ be given, and let $I=A' \cap A''$. Schemes f' and f" are said to be strictly compatible if cnt(qt(f'),I) and cnt(qt(f"),I) are equal (see footnote 2).

Footnote 2. Actually, the semantics of the root and of the arcs exiting the root may be different in the two quasi-trees, since the corresponding facts may express different concepts. Nevertheless, since in this definition and in the following ones we are interested in facts only from a topological point of view (their connections with the attributes), for notational simplicity we will denote with the same dummy symbol, $a_0$, the roots of both quasi-trees.

Two compatible schemes f' and f" may be overlapped to create a resulting scheme f; if the compatibility is strict, the inter-attribute dependencies in the two schemes are not conflicting and f may be intuitively described as follows:
[Fig. 8: the SHIPMENT fact scheme (a), with measure qty shipped, and the scheme resulting from its overlap with INVENTORY, which retains product with attributes type, category, brand, weight and package size, together with month, year and season.]
• The measures in f are the union of those in f' and f". Thus, the fact on which f is centred may be considered as a sort of "macro-fact" embracing both f' and f".
• Each hierarchy in f includes all and only the attributes included in the corresponding hierarchies of both f' and f".
• The functional dependencies expressed by the inter-attribute links in f' and f" are preserved.
• The domain of each dimension attribute in f is the intersection of the domains of the corresponding attributes in f' and f".
• An inter-attribute link in f is optional if at least one of the links in the corresponding paths in f' or f" is optional.
• Aggregation statements of f' and f" are preserved in f.

Formally:

Definition 8. Given two strictly compatible schemes f' and f", we define the overlap of f' and f" as the scheme $f' \otimes f''=(M,A,N,R,O,S)$ where:

$M = M' \cup M''$
$A = A' \cap A''$
$\forall a_i \in A\ (Dom_{f' \otimes f''}(a_i) = Dom_{f'}(a_i) \cap Dom_{f''}(a_i))$
$N = N' \cap N''$
$R = \{(a_i,a_j) \mid (a_i,a_j) \in cnt(qt(f'),A)\} = \{(a_i,a_j) \mid (a_i,a_j) \in cnt(qt(f''),A)\}$
$O = \{(a_i,a_j) \in R \mid \exists (a_w,a_z) \in O' \mid (a_w,a_z) \in path_{ij}(qt(f')) \vee \exists (a_w,a_z) \in O'' \mid (a_w,a_z) \in path_{ij}(qt(f''))\}$
$S = \{(m_j,d_i,\Omega) \mid d_i \in Dim(f' \otimes f'') \wedge ((\exists (m_j,d_k,\Omega) \in S' \wedge d_i \in sub(qt(f'),d_k)) \vee (\exists (m_j,d_k,\Omega) \in S'' \wedge d_i \in sub(qt(f''),d_k)))\}$

Figure 8 shows the overlapping between the two strictly compatible schemes INVENTORY and SHIPMENT, which share the time and the product dimensions. The scheme resulting from overlapping can be used, for instance, to compare the quantities shipped and stored for each product.

As a matter of fact, overlapping may be extended by considering more accurately the information expressed by the hierarchies in the two source schemes. Consider for instance the INVENTORY and SHIPMENT schemes, which include two compatible hierarchies on dimensions week and date, respectively. Based on Definition 8, their overlap should include only attributes month, year and season. Attribute date definitely cannot be included, since in the INVENTORY scheme it is impossible to disaggregate the primary fact instances at the date level. On the other hand, quarter could be included: in fact, the months represented in the overlap are those represented in both the source schemes, and for each month the quarter is known from SHIPMENT.

Even two non-strictly compatible schemes can be overlapped; since in this case the two contracted quasi-trees are different, there must be one or more conflicts in the inter-attribute dependencies in the two schemes.
The resulting scheme is defined as in the case of strict compatibility, except that each conflict is solved by representing an inter-attribute dependency which subsumes both conflicting dependencies. Consider the example in Figure 9, where two non-strictly compatible fact schemes (a) and (b) are shown. The dependencies expressed by the two quasi-trees are as follows:

(a): $root \rightarrow 1,2,3$; $2 \rightarrow 4$; $4 \rightarrow 5$
(b): $root \rightarrow 1,2$; $2 \rightarrow 5$; $5 \rightarrow 4$; $1 \rightarrow 3$

The common elemental dependencies (namely, $root \rightarrow 1,2$) are directly represented within the resulting scheme (c). The conflicts are solved by considering the transitive closure of the two sets of dependencies; thus, for instance, vertex 5 is positioned in (c) as a child of 2 since, in both (a) and (b), the dependency $2 \rightarrow 5$ holds.
Fig. 9. Overlapping (c) of two non-strictly compatible fact schemes (a) (b).
Definition 9. Given two non-strictly compatible schemes f' and f", we define the overlap of f' and f" as the scheme $f' \otimes f''=(M,A,N,R,O,S)$ where

$M = M' \cup M''$
$A = A' \cap A''$
$\forall a_i \in A\ (Dom_{f' \otimes f''}(a_i) = Dom_{f'}(a_i) \cap Dom_{f''}(a_i))$
$N = N' \cap N''$
$R = \{(a_i,a_j) \mid \exists p_{ij}(cnt(qt(f'),A)) \wedge \exists p_{ij}(cnt(qt(f''),A)) \wedge \nexists a_w \neq a_i \mid (\exists p_{wj}(cnt(qt(f'),A)) \wedge \exists p_{wj}(cnt(qt(f''),A))) \wedge (p_{ij}(cnt(qt(f'),A)) \supset p_{wj}(cnt(qt(f'),A)) \wedge p_{ij}(cnt(qt(f''),A)) \supset p_{wj}(cnt(qt(f''),A)))\}$
$O = \{(a_i,a_j) \in R \mid (\exists (a_w,a_z) \in O' \mid (a_w,a_z) \in path_{ij}(qt(f'))) \vee (\exists (a_w,a_z) \in O'' \mid (a_w,a_z) \in path_{ij}(qt(f'')))\}$
$S = \{(m_j,d_i,\Omega) \mid d_i \in Dim(f' \otimes f'') \wedge ((\exists (m_j,d_k,\Omega) \in S' \wedge d_i \in sub(qt(f'),d_k)) \vee (\exists (m_j,d_k,\Omega) \in S'' \wedge d_i \in sub(qt(f''),d_k)))\}$

Queries formulated on the overlap of two schemes are actually formulated on one or both the source schemes, depending on which measures are involved in the query. In general, let q = f(P; <sel>) be a query formulated on the overlapped fact scheme $f = f_1 \otimes \ldots \otimes f_m$. From the conceptual point of view, q is equivalent to m queries $q_1,\ldots,q_m$, where

$q_i = f_i(P;\ <sel>,\ d_1 \in Dom_f(d_1),\ \ldots,\ d_n \in Dom_f(d_n))$

and $d_1,\ldots,d_n$ are the dimensions of f. An example of query formulated on an overlap is:

SHIPMENT $\otimes$ INVENTORY(month,product ; month.year='1997').inventoryQty $-$ qtyShipped
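The parts of the overlap that concern measures, attributes and domains can be illustrated by a short Python sketch. It covers only M, A and the attribute domains, not N, R, O and S; the dictionary layout, the function name overlap and all domain values are assumptions introduced here, and the two schemes are heavily abridged.

```python
# Abridged, hypothetical descriptions of the two schemes: each scheme
# lists its measures and the domain of each of its attributes.
SHIPMENT = {
    "measures": {"qtyShipped"},
    "domains": {
        "date": {"1997-01-10", "1997-02-03"},
        "month": {"jan97", "feb97"},
        "year": {"1997"},
        "product": {"P1", "P2"},
    },
}
INVENTORY = {
    "measures": {"inventoryQty"},
    "domains": {
        "week": {"5-97", "6-97", "10-97"},
        "month": {"feb97", "mar97"},
        "year": {"1997"},
        "product": {"P2", "P3"},
        "warehouse": {"W1"},
    },
}


def overlap(f1, f2):
    """Measures, common attributes and domains of the overlap of f1 and f2."""
    common = {}
    for attribute in f1["domains"].keys() & f2["domains"].keys():
        shared_domain = f1["domains"][attribute] & f2["domains"][attribute]
        if shared_domain:  # common attributes need a non-empty shared domain
            common[attribute] = shared_domain
    if not common:
        raise ValueError("the two schemes are not compatible")
    return {"measures": f1["measures"] | f2["measures"], "domains": common}


OVERLAP = overlap(SHIPMENT, INVENTORY)
```

A drill-across query on the overlap can then refer to both measures, but only over the attributes and domain values that the two source schemes share (here month, year and product).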
5. Conceptual design from relational schemes

The methodology we outline in this section to build a DF model starting from the documentation describing the operational relational database consists of the following steps:

1. Defining facts.
2. For each fact:
   a. Building the attribute tree.
   b. Pruning and grafting the attribute tree.
   c. Defining dimensions.
   d. Defining measures.
   e. Defining hierarchies.

This methodology can be applied, with minor differences, starting from both E/R and logical schemes. In the following subsections we will describe the steps referring to the sale example, considering as two alternative sources its conceptual and its logical documentation.

A simplified E/R scheme for sales (the part involving promotions is omitted) is shown in Figure 10. Each instance of relationship SALE represents an item referring to a single product within a purchase ticket. Attribute unitPrice is placed on SALE instead of PRODUCT since the price of the products may vary over time. The corresponding logical scheme is shown below (primary keys are underlined; for each foreign key, the referenced scheme is reported). For simplicity, no artificial codes are introduced to identify relation schemes.
Fig. 10. The (simplified) E/R scheme for the sale fact scheme.
STORES(store,address,phone,salesManager,city:CITIES,saleDistr:DISTRICTS)
CITIES(city,county:COUNTIES)
COUNTIES(county,state:STATES)
STATES(state)
DISTRICTS(distrNo,state:STATES)
PRODUCTS(product,weight,size,diet,brand:BRANDS,type:TYPES)
BRANDS(brand,city:CITIES)
TYPES(type,markGroup:GROUPS,category:CATEGORIES)
GROUPS(markGroup,manager)
CATEGORIES(category,dept:DEPTS)
DEPTS(dept,manager)
TICKETS(tickNo,date,store:STORES)
SALES(product:PRODUCTS,tickNo:TICKETS,qty,unitPrice)
WAREHOUSES(warehouse,address)
PROD_IN_WH(product:PRODUCTS,warehouse:WAREHOUSES)
5.1. Defining facts

Facts are concepts of primary interest for the decision-making process. Typically, they correspond to events occurring dynamically in the enterprise world.

On the E/R scheme: A fact may be represented either by an entity F or by an n-ary relationship R between entities $E_1,\ldots,E_n$. In the latter case, for the sake of simplicity, it is worth transforming R into an entity F by replacing each branch $E_i$ with a binary relationship $R_i$ between F and $E_i$; if we denote with min(E,R) and max(E,R) (see footnote 3), respectively, the minimum and maximum cardinalities with which entity E participates in relationship R, it is:

$min(F,R_i) = 1,\quad max(F,R_i) = 1,\quad min(E_i,R_i) = min(E_i,R),\quad max(E_i,R_i) = max(E_i,R),\quad i=1,\ldots,n$

The attributes of the relationship become attributes of F; the identifier of F is the combination of the identifiers of $E_i$, $i=1,\ldots,n$.

On the logical scheme: A fact corresponds to a relation scheme F.

Entities or relationships (relation schemes) representing frequently updated archives, such as SALE, are good candidates for defining facts; those representing structural properties of the domain, corresponding to nearly-static archives, such as STORE and CITY, are not.
Each fact identified on the source scheme becomes the root of a different fact scheme. In the following subsections, we will focus the discussion on a single fact, the one corresponding to entity (relation scheme) F. In the sale example, the fact of primary interest for business analysis is the sale of a product, represented in the E/R and in the logical schemes, respectively, by relationship sale and by relation scheme SALES. Figure 11 shows how relationship sale is transformed into an entity.
Fig. 11. Transformation of relationship sale into an entity.
5.2. Building the attribute tree

Given a portion of interest of a source scheme and an entity (relation scheme) F belonging to it, we call attribute tree the quasi-tree such that:
• each vertex corresponds to an attribute, simple or compound, of the scheme;
• the root corresponds to the identifier (primary key) of F;
• for each vertex v, the corresponding attribute functionally determines all the attributes corresponding to the descendants of v.

The attribute tree will be used in the following subsections to build the fact scheme for the fact corresponding to F.

On the E/R scheme: Let identifier(E) denote the set of attributes which make up the identifier of entity E. The attribute tree for F may be constructed automatically by applying the following recursive procedure:

    root=newVertex(identifier(F));
    // newVertex(<attributeSet>) returns a new vertex labelled
    // with the concatenation of the names of the attributes in
    // the set
    translate(F,root);

where

    translate(E,v):
    // E is the current entity, v is the current vertex
    { for each attribute a∈E | a≠identifier(E) do
          addChild(v,newVertex({a}));   // adds child a to vertex v
      for each entity G connected to E by a relationship R | max(E,R)=1 do
      { for each attribute b∈R do
            addChild(v,newVertex({b}));
        next=newVertex(identifier(G));
        addChild(v,next);
        translate(G,next);
      }
    }

In the following we illustrate how procedure translate works by showing in a step-by-step fashion how a branch of the attribute tree for the sale example is generated; the resulting attribute tree is shown in Figure 12.

    root=newVertex(ticketNumber+product)   // renamed sale
    translate(E=SALE,v=sale):
        addChild(v,qty);
        addChild(v,unitPrice);
        for G=PURCHASE TICKET:
            addChild(v,ticketNumber);
            translate(PURCHASE TICKET,ticketNumber);
        for G=PRODUCT:
            addChild(v,product);
            translate(PRODUCT,product);
    translate(E=PURCHASE TICKET,v=ticketNumber):
        addChild(v,date);
        for G=STORE:
            addChild(v,store);
            translate(STORE,store);
    translate(E=STORE,v=store):
        addChild(v,address);
        addChild(v,phone);
        addChild(v,salesManager);
        for G=SALE DISTRICT:
            addChild(v,districtNo+state);
            translate(SALE DISTRICT,districtNo+state);
        for G=CITY:
            addChild(v,city);
            translate(CITY,city);
    translate(E=SALE DISTRICT,v=districtNo+state):
        addChild(v,districtNo);
Fig. 12. Attribute tree for the sale example (the root is in grey).
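The recursive procedure for the E/R case can be tried out with a small Python sketch. The dictionary encoding of the E/R scheme below is an assumption made for illustration: each entity lists its identifier, its own attributes and the entities it reaches through relationships with maximum cardinality 1. The scheme is abridged (the BRAND and TYPE branches are omitted), relationship attributes are not handled, and there is no protection against cyclic schemes.

```python
# Abridged, hypothetical encoding of the sale E/R scheme.
ER = {
    "SALE": {"id": ["ticketNumber", "product"],
             "attrs": ["qty", "unitPrice"],
             "to_one": ["PURCHASE TICKET", "PRODUCT"]},
    "PURCHASE TICKET": {"id": ["ticketNumber"],
                        "attrs": ["ticketNumber", "date"],
                        "to_one": ["STORE"]},
    "PRODUCT": {"id": ["product"],
                "attrs": ["product", "weight", "size", "diet"],
                "to_one": []},
    "STORE": {"id": ["store"],
              "attrs": ["store", "address", "phone", "salesManager"],
              "to_one": ["SALE DISTRICT", "CITY"]},
    "SALE DISTRICT": {"id": ["districtNo", "state"],
                      "attrs": ["districtNo"],
                      "to_one": ["STATE"]},
    "CITY": {"id": ["city"], "attrs": ["city"], "to_one": ["COUNTY"]},
    "COUNTY": {"id": ["county"], "attrs": ["county"], "to_one": ["STATE"]},
    "STATE": {"id": ["state"], "attrs": ["state"], "to_one": []},
}


def translate(er, entity, vertex_label):
    """Build the subtree for entity, rooted in a vertex with the given label."""
    spec = er[entity]
    # Attributes other than the (whole) identifier become leaf children.
    children = [{"label": a, "children": []}
                for a in spec["attrs"] if [a] != spec["id"]]
    # Each entity reached through a to-one relationship becomes a child
    # labelled with its identifier, and the recursion continues from it.
    for target in spec["to_one"]:
        children.append(translate(er, target, "+".join(er[target]["id"])))
    return {"label": vertex_label, "children": children}


def child_labels(vertex):
    """Labels of the children of a vertex, in insertion order."""
    return [child["label"] for child in vertex["children"]]


# The root is labelled with the name of the fact rather than its identifier.
SALE_TREE = translate(ER, "SALE", "sale")
```

The children of the root are qty, unitPrice, ticketNumber and product, in line with the step-by-step trace above.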
It is worth adding some further notes:
• As the attribute tree undergoes the next step in the methodology, the granularity of fact instances may change and become coarser than that expressed by the identifier of F. Thus, in order to avoid confusion, we prefer to label the root of the attribute tree with the name of entity F rather than with its identifier.
• The source scheme may contain a cycle of to-one relationships (many-to-one or one-to-one); the simplest example of this is given by a scheme representing the fact that a part is a component of another part. In this case, procedure translate would loop on this cycle generating an infinite branch. Since representing a recursive association at the logical level is impossible, the loop should be detected and the branch should be cut after a number of cycles depending on the relevance of the association within the application domain.
• As procedure translate "explores" a cyclic source scheme, the same entity E may be reached twice through different paths, thus generating two homologous vertices v' and v" in the quasi-tree. If each instance of F determines exactly one instance of E whichever of the two paths is followed (i.e., if the cycle is redundant), v' and v" may be merged into a vertex v entered by two arcs; the same applies to each couple of homologous vertices descending from v' and v". Otherwise, v' and v" must be left distinct.
• The existence of optional relationships between attributes in a hierarchy should be emphasized on the fact scheme by marking the arcs corresponding to optional relationships (min(E,R)=0) or optional attributes of the E/R scheme with a dash.
• A one-to-one relationship belonging to a cycle within the E/R scheme can be crossed in both directions. Thus, it may happen that two paths including opposite arcs are inserted into the attribute tree. In this case, the less significant path should be dropped.
• Generalization hierarchies in the E/R scheme are equivalent to one-to-one relationships between the super-entity and each sub-entity, and should be treated as such by the algorithm.
• To-many relationships (max(E,R)>1) and multiple attributes of the source scheme cannot be inserted into the attribute tree since representing them at the logical level, for instance by a star scheme, would be impossible without violating the first normal form.
• As already stated in Section 5.1, an n-ary relationship is equivalent to n binary relationships. Most n-ary relationships have maximum multiplicity greater than 1 on all their branches; in this case, they determine n one-to-many binary relationships which cannot be inserted into the attribute tree. On the other hand, a branch with maximum multiplicity equal to 1 determines a one-to-one binary relationship which can be inserted.
• A compound attribute c of the E/R scheme, consisting of the simple attributes $a_1,\ldots,a_m$, is inserted in the attribute tree as a vertex c with children $a_1,\ldots,a_m$. It is then possible either to graft c or to prune its children (see Section 5.3).

On the logical scheme: Let pk(R) and fk(R,S) denote the sets of the attributes of R forming, respectively, the primary key of R and a foreign key referencing S. The attribute tree for F may be constructed automatically by applying the following recursive procedure:

    root=newVertex(pk(F));
    // newVertex(<attributeSet>) returns a new vertex labelled
    // with the concatenation of the names of the attributes in
    // the set
    translate(F,root);

where

    translate(R,v):
    // R is the current relational scheme,
    // v is the current vertex
    { for each attribute a∈R | (a≠pk(R) ∧ ∄S | a∈fk(R,S)) do
          addChild(v,newVertex({a}));   // adds child a to vertex v
      for each attribute set A⊆R | (∃S | A=fk(R,S)) do
      { next=newVertex(A);
        addChild(v,next);
        translate(S,next);
      }
      for each relational scheme T | pk(T)=fk(T,R) do
      { for each attribute b∈T | (b≠pk(R) ∧ ∄S | b∈fk(T,S)) do
            addChild(v,newVertex({b}));
        for each attribute set B⊆T | (∃S≠R | B=fk(T,S)) do
        { next=newVertex(B);
          addChild(v,next);
          translate(S,next);
        }
      }
    }

Procedure translate builds the tree by following the functional dependencies represented within the database scheme. The first cycle considers the dependencies between the primary key of R and each other attribute of R (including, if the key is compound, the single attributes which make it up but excluding those belonging to foreign keys, which are considered at the next step). The second cycle deals with the dependencies between the primary key and each foreign key referencing a relational scheme S, by triggering the recursion on S. The third cycle considers the situation:

    R(kR,...)   T(kT:R,...kS:S)   S(kS,...)

in which the one-to-many relationship between R and S has been represented through a third relation scheme T.

The same considerations made for the E/R case hold when the attribute tree is built from the logical scheme. The attribute tree obtained for the sale example is the same shown in Figure 12.

5.3. Pruning and grafting the attribute tree

Probably, not all of the attributes represented in the attribute tree are interesting for the DW. Thus, the attribute tree may be pruned and grafted in order to eliminate the unnecessary levels of detail.

Pruning is carried out by dropping any subtree from the quasi-tree. The attributes dropped will not be included in the fact scheme, hence it will be impossible to use them to aggregate data. For instance, on the sale example, the subtree rooted in county may be dropped from the brand branch.

Grafting is used when, though a vertex of the quasi-tree expresses an uninteresting piece of information, its descendants must be preserved; for instance, one may want to classify products directly by category, without considering the information on their type. Let v be the vertex to be eliminated:

    graft(v):
    { for each v' | v' is father of v do
          for each v" | v" is child of v do
              addChild(v',v");
      drop v;
    }

Thus, grafting is carried out by moving the entire subtree with root in v to its father(s) v'; if we denote with t the attribute tree and with I the set of its vertices, procedure graft(v) returns $cnt(t, I \setminus \{v\})$. As a result, attribute v will not be included in the fact scheme and the corresponding aggregation level will be lost; on the other hand, all the descendant levels will be maintained. In the sale example, the detail of purchase tickets is uninteresting and vertex ticket number can be grafted. In general, grafting a child of the root corresponds to making the granularity of fact instances coarser and, if the node grafted has two or more children, leads to increasing the number of dimensions in the fact scheme.

Two considerations:
• A one-to-one relationship can be thought of as a particular kind of many-to-one relationship; hence, it can be inserted into the attribute tree. Nevertheless, in a DW query, drilling down along a one-to-one relationship means adding a row header to the result without introducing further detail; thus, it is often worth grafting from the attribute tree the attributes following one-to-one relationships, or representing them as non-dimension attributes.
• Let entity E have a compound identifier including the internal attributes $a_1,\ldots,a_m$ and the external attributes $b_1,\ldots,b_t$ ($m,t \geq 0$). The algorithm outlined in Subsection 5.2 translates E into a vertex $c=a_1+\ldots+a_m+b_1+\ldots+b_t$ with children $a_1,\ldots,a_m$ (children $b_1,\ldots,b_t$ will be added when translating the entities which they identify). Essentially, two situations may occur. If the granularity of E must be preserved in the fact scheme, vertex c is maintained while one or more of its children may be pruned; for instance, vertex district no.+state is maintained since aggregation must be carried out at the level of single districts, while district no. may be pruned since it does not express any interesting aggregation. Otherwise, if the granularity expressed by E is too fine, c may be grafted and some or all of its children maintained. Similar considerations can be made, when the source scheme is logical, for the relation schemes with compound primary key.

After grafting ticket number and pruning county, district no. and size, the attribute tree is transformed as shown in Figure 13. It should be noted that, when an optional vertex is grafted, all its children inherit the optionality dash.

5.4. Defining dimensions
Dimensions determine how fact instances may be aggregated significantly for the decision-making process. The dimensions must be chosen in the attribute tree among the children vertices of the root (including the attributes which have become children of the root after the quasi-tree has been grafted); they may correspond either to discrete attributes, or to ranges of discrete or continuous attributes. Their choice is crucial for the DW design since it determines the granularity of fact instances.
Fig. 13. Attribute tree for the sale example after pruning and grafting.
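The pruning and grafting operations of Section 5.3 can be sketched in Python on an abridged version of the sale attribute tree. The dictionary representation (each vertex mapped to the list of its children), the function names and the reduced set of vertices are assumptions made for illustration; vertex labels are assumed to be unique, which does not hold for the complete tree.

```python
def prune(tree, vertex):
    """Drop the whole subtree rooted in vertex."""
    for child in tree.pop(vertex, []):
        prune(tree, child)
    for children in tree.values():
        if vertex in children:
            children.remove(vertex)


def graft(tree, vertex):
    """Drop vertex and attach its children to its father(s)."""
    orphans = tree.pop(vertex, [])
    for children in tree.values():
        if vertex in children:
            position = children.index(vertex)
            # The children of the grafted vertex take its place.
            children[position:position + 1] = orphans


def abridged_sale_tree():
    """A reduced, hypothetical version of the attribute tree of Figure 12."""
    return {
        "sale": ["qty", "unitPrice", "ticketNumber", "product"],
        "ticketNumber": ["date", "store"],
        "store": ["address", "phone", "city"],
        "product": ["weight", "size", "diet"],
    }


tree = abridged_sale_tree()
graft(tree, "ticketNumber")  # the detail of purchase tickets is not needed
prune(tree, "size")          # size is not used for aggregation
```

After grafting ticketNumber, date and store become children of the root and are therefore candidate dimensions, as discussed in the next subsection.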
Each primary fact instance "summarizes" all the instances of entity (relation scheme) F corresponding to a combination of dimension values. If the dimension pattern includes all the attributes which constitute an identifier (the primary key) of F, every primary instance corresponds to one instance (tuple) of F; often, one or more of the attributes which identify F are either pruned or grafted, hence, each primary instance may correspond to several instances (tuples) of F.

It is widely recognized that time is a key dimension for DWs. Source schemes can be classified, according to the way they deal with time, into snapshot and temporal. A snapshot scheme describes the current state of the application domain; old versions of data varying over time are continuously replaced by new versions. On the other hand, a temporal scheme describes the evolution of the application domain over a range of time; old versions of data are explicitly represented and stored. When designing a DW from a temporal scheme, time is explicitly represented as an attribute and thus it is an obvious candidate for defining a dimension. Should time appear in the attribute tree as a child of some vertex different from the root, it is worth considering the possibility of grafting the quasi-tree in order to have time become a dimension (i.e., become a child of the root). In snapshot schemes, time is not explicitly represented (it is implicitly assumed that the scheme represents data at the current time); however, also for snapshot schemes time should be added as a dimension to the fact scheme.

In the sale example, the attributes chosen as dimensions are product, store and date. At this stage, the fact scheme may be sketched by adding the chosen dimensions to the root fact.

5.5. Defining measures

Measures are defined by applying, to numerical attributes of the attribute tree, aggregation functions which operate on all the instances (tuples) of F corresponding to each primary fact instance.
The aggregation function typically consists either in the sum/average/maximum/minimum of expressions or in the count of the number of entity instances (tuples). A fact may have no attributes, if the only information to be recorded is the occurrence of the fact. The measures determined, if any, are reported on the fact scheme.

At this step, it is useful for the phase of logical design to build a glossary which associates each measure to an expression describing how it can be calculated from the attributes of the source scheme. Referring to the sale example and to its logical scheme, the glossary may be compiled in SQL as follows:

    qty sold =          SELECT SUM(S.qty)
                        FROM SALES S, TICKETS T
                        WHERE S.tickNo = T.tickNo
                        GROUP BY S.product, T.date, T.store

    revenue =           SELECT SUM(S.qty * S.unitPrice)
                        FROM SALES S, TICKETS T
                        WHERE S.tickNo = T.tickNo
                        GROUP BY S.product, T.date, T.store

    no. of customers =  SELECT COUNT(*)
                        FROM SALES S, TICKETS T
                        WHERE S.tickNo = T.tickNo
                        GROUP BY S.product, T.date, T.store
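The three glossary expressions share the same FROM, WHERE and GROUP BY clauses, so they can be evaluated together. The following Python sketch runs them on an in-memory SQLite database; the column types, the sample rows and the function name are assumptions introduced for illustration, while table and column names follow the logical scheme above.

```python
import sqlite3

# The three glossary expressions combined in a single query; the grouping
# attributes are added to the SELECT list to identify each fact instance.
GLOSSARY_QUERY = """
SELECT S.product, T.date, T.store,
       SUM(S.qty)               AS qty_sold,
       SUM(S.qty * S.unitPrice) AS revenue,
       COUNT(*)                 AS no_of_customers
FROM SALES S, TICKETS T
WHERE S.tickNo = T.tickNo
GROUP BY S.product, T.date, T.store
ORDER BY S.product, T.date, T.store
"""


def primary_fact_instances():
    """Load hypothetical sample data and compute the three measures."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE TICKETS(tickNo INTEGER PRIMARY KEY, date TEXT, store TEXT);
        CREATE TABLE SALES(product TEXT, tickNo INTEGER, qty INTEGER,
                           unitPrice REAL, PRIMARY KEY(product, tickNo));
        INSERT INTO TICKETS VALUES (1, '1998-01-05', 'S1');
        INSERT INTO TICKETS VALUES (2, '1998-01-05', 'S1');
        INSERT INTO TICKETS VALUES (3, '1998-01-06', 'S1');
        INSERT INTO SALES VALUES ('P1', 1, 2, 1.5);
        INSERT INTO SALES VALUES ('P1', 2, 3, 1.5);
        INSERT INTO SALES VALUES ('P2', 1, 1, 4.0);
        INSERT INTO SALES VALUES ('P1', 3, 1, 2.0);
    """)
    rows = conn.execute(GLOSSARY_QUERY).fetchall()
    conn.close()
    return rows
```

Each returned row is one primary fact instance at pattern {product, date, store}; two tickets for P1 on the same day and store collapse into a single instance with qty sold 5, revenue 7.5 and count 2.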
At this point, the aggregation functions most frequently used for each combination measure/dimension should be represented; if necessary, the preferred ordering of dimensions for aggregation should be specified.

5.6. Defining hierarchies

The last step in building the fact scheme is the definition of hierarchies on dimensions. Along each hierarchy, attributes must be arranged into a quasi-tree such that a to-one relationship (many-to-one or one-to-one) holds between each node and its descendants. The attribute tree already shows a plausible organization for hierarchies; at this stage, it is still possible to prune and graft the quasi-tree in order to eliminate irrelevant details. It is also possible to add new levels of aggregation by defining ranges for numerical attributes; typically, this is done on the time dimension. In the sale example, the time dimension is enriched by introducing attributes month, quarter, etc.

During this phase, the attributes which should not be used for aggregation but only for informative purposes may be identified as non-dimension attributes (for instance, address, weight, etc.). It should be noted that non-numerical attributes which are children of the root but have not been chosen as dimensions must necessarily either be grafted (if the granularity of the primary fact instances is coarser than that of the fact) or be represented as non-dimension attributes (if the two granularities are equal).
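Enriching the time dimension amounts to deriving coarser attributes from each date value. A minimal Python sketch is given below; the function name and the label formats (for example "1998-Q1") are assumptions, not part of the methodology.

```python
from datetime import date


def time_levels(day):
    """Derive month, quarter and year labels from a date value."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "date": day.isoformat(),
        "month": f"{day.year}-{day.month:02d}",
        "quarter": f"{day.year}-Q{quarter}",
        "year": day.year,
    }
```

Since each date determines exactly one month, each month one quarter and each quarter one year, the derived attributes form a valid hierarchy on the date dimension.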
6. Conclusion

In this paper we have proposed a conceptual model for data warehouse design and a semi-automated methodology for deriving it from the documentation describing the information system of the enterprise. The DFM is independent of the target logical model (multidimensional or relational); in order to bridge the gap between the fact schemes and the DW logical scheme, a methodology for logical design is needed. As in operational information systems, DW logical design should be based on an estimate of the expected workload and data volumes. The workload will be expressed in terms of query patterns and their frequencies; data volumes will be computed by considering the sparsity of facts and the cardinality of the dimension attributes.

Our current work is devoted to developing the methodology for logical design and implementing it within an automated tool. Among the specific issues we are investigating, we mention the following:
• Partitioning of the DW into integrated data marts.
• View materialization. This problem involves the whole dimensional scheme; in fact, due to the presence of drill-across queries, cross-optimization must be carried out.
• Selection of the logical model. Each materialized view can be mapped on the logical level by adopting different models (star scheme, constellation scheme, snowflake scheme).
• Translation into fact and dimension tables. The fact and dimension tables are created according to the logical models adopted.
• Vertical partitioning of fact tables. The query response time can be reduced by considering the set of measures required by each query.
• Horizontal partitioning of fact tables. The query response time can be reduced by considering the selectivity of each query.

References