How To Create A Data Quality Scorecard
Building and Using a Data Quality Scorecard
A data quality scorecard is the centerpiece of any data quality management program.
It provides comprehensive information about the quality of data in a database and allows both
aggregated analysis and detailed drill-downs.
A well-designed data quality scorecard is the key to understanding how well the data supports
various reports, analytical and operational processes, and data-driven projects.
It is also critical for making good decisions about data quality improvement initiatives.
Project teams spend months designing, implementing, and fine-tuning data quality rules; they
build neat rule catalogues and produce extensive error reports. But, without a data quality
scorecard, all they have are raw materials and no value-added product to justify further
investment into data quality management.
Indeed, no amount of firewood will make you warm in the winter unless you can make a decent
fire. The main product of data quality assessment is the data quality scorecard!
The data quality scorecard can be pictured as an information pyramid.
At the top level are aggregate scores, which are high-level measures of the data quality.
Well-designed aggregate scores are goal-driven: they allow us to evaluate data fitness for various
purposes and indicate the quality of various data collection processes.
From the perspective of understanding the data quality and its impact on the business, aggregate
scores are the key piece of data quality metadata.
At the bottom level of the data quality scorecard is information about data quality of individual
data records.
In the middle are various score decompositions and error reports allowing us to analyze and
summarize data quality across various dimensions and for different objectives.
Aggregate Scores
Each score aggregates errors identified by the data quality rules into a single number – a
percentage of good data records among all target data records.
Aggregate scores help make sense out of the numerous error reports produced in the course of
data quality assessment.
Without aggregate scores, error reports often discourage rather than enable data quality
improvement.
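As a minimal illustration, the sketch below computes such a score from a hypothetical set of target records, assuming each record carries the list of rule violations found for it during assessment; the record structure and field names are illustrative only.

    from typing import Iterable

    def aggregate_score(target_records: Iterable[dict]) -> float:
        """Percentage of good records among all target records.

        Assumes (for illustration) that each record is a dict whose
        'errors' entry lists the data quality rule violations found for it.
        """
        records = list(target_records)
        if not records:
            return 100.0
        good = sum(1 for record in records if not record.get("errors"))
        return 100.0 * good / len(records)

    # A population where 6.3% of records have at least one error scores 93.7.
    sample = [{"errors": []}] * 937 + [{"errors": ["rule_17"]}] * 63
    print(f"{aggregate_score(sample):.1f}% of target records are error-free")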
You have to be careful when choosing which aggregate scores to measure. Scores that are not
tied to a meaningful business objective are useless. For instance, a simple aggregate score
for the entire database is usually rather meaningless.
Suppose we know that 6.3% of all records in the database have some errors. So what? This
number does not help us at all if we cannot say whether it is good or bad, and we cannot make
any decisions based on it.
On the other hand, consider an HR database that is used, among other things, to calculate
employee retirement benefits.
Now, if you can build an aggregate score that says 6.3% of all calculations are incorrect because
of data quality problems, such a score is extremely valuable.
You can use it to measure the annual cost of bad data to the business through its impact on a
specific business process. You can further use it to decide whether or not to initiate a data
cleansing project by estimating its ROI.
The bottom line is that good aggregate scores are goal driven and allow us to make better
decisions and take actions. Poorly designed aggregate scores are just meaningless numbers.
Of course, it is possible and desirable to build many different aggregate scores by selecting
different groups of target data records. The most valuable scores measure data fitness for various
business uses.
These scores allow us to estimate the cost of bad data to the business, to evaluate potential ROI
of data quality initiatives, and to set correct expectations for data-driven projects.
In fact, if you define the objective of a data quality assessment project as calculating one or
several of such scores, you will have a much easier time finding sponsors for your initiative.
Other important aggregate scores measure the quality of various data collection procedures.
For example, scores based on the data origin provide estimates of the quality of the data obtained
from a particular data source or through a particular data interface. A similar concept involves
measuring the quality of the data collected during a specific period of time.
Indeed, it is usually important to know if the data errors are mostly historic or were introduced
recently.
The presence of recent errors indicates a greater need for data collection improvement initiatives.
Such measurement can be accomplished by an aggregate score with constraints on the
timestamps of the relevant records.
To conclude, analysis of the aggregate scores answers the key data quality questions: how well
the data supports its business uses, and how well the data collection processes perform.
Score Decompositions
The next layer of the data quality scorecard is composed of various score decompositions, which
show the contributions of different components to the overall data quality.
Score decompositions can be built along many dimensions, including data elements, data quality
rules, subject populations, and record subsets.
For instance, in the above example we may find that 6.3% of all calculations are incorrect.
Decomposition may indicate that in 80% of cases the errors are caused by problems with the
employee compensation data; in 15% of cases the reason is missing or incorrect employment
history; and in 5% of cases the culprit is an invalid date of birth.
This can be used to prioritize a data cleansing initiative. Another score decomposition may
indicate that over 70% of errors are for employees from a specific subsidiary.
This may suggest a need to improve data collection procedures in that subsidiary.
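A score decomposition can be sketched as a simple grouping of the error records by whichever dimension is of interest; the field name used below (cause) is hypothetical.

    from collections import Counter

    def decompose_errors(error_records, dimension):
        """Percentage of errors attributed to each value of a decomposition
        dimension (data element, rule, subject population, and so on)."""
        counts = Counter(record[dimension] for record in error_records)
        total = sum(counts.values())
        return {value: 100.0 * n / total for value, n in counts.most_common()}

    errors = ([{"cause": "compensation"}] * 80
              + [{"cause": "employment history"}] * 15
              + [{"cause": "date of birth"}] * 5)
    print(decompose_errors(errors, "cause"))
    # {'compensation': 80.0, 'employment history': 15.0, 'date of birth': 5.0}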
The level of detail obtained through score decompositions is enough to understand where most
data quality problems come from.
However, if we want to investigate data quality further, more drill-downs are necessary.
The next step would be to produce various reports of individual errors that contribute to the score
(or sub-score) tabulation. These reports can be filtered and sorted in various ways to better
understand the causes, nature, and magnitude of the data problems.
Finally, at the very bottom of the data quality scorecard pyramid are reports showing the quality
of individual records or subjects. These atomic level reports identify records and subjects
affected by errors and could even estimate the probability that each data element is erroneous.
Summary
A data quality scorecard is a valuable analytical tool that allows us to measure the cost of bad
data to the business and to estimate the ROI of data quality improvement initiatives.
Building and maintaining a dimensional time-dependent data quality scorecard must be one of
the first priorities in any data quality management initiative.
Data Quality Rules: Attribute Domain Constraints
Data quality rules play the same role in data quality assessment as the rules of baseball in
refereeing a major league game. They determine the outcome! Unfortunately, identifying data
quality rules is more difficult than learning the rules of baseball because there is no official
rulebook that is the same for all databases. In every project, we have to discover the rules anew.
Also, some rules are easy to find, while others require lots of digging; some rules are easy to
understand and implement, while others necessitate writing rather complex programs. But, as
with baseball, all rules are equally important. Omitting a few complex and obscure data quality
rules can (and most of the time will!) jeopardize the entire effort.
Data quality rules fall into five broad categories:
1. Attribute domain constraints restrict allowed values of individual data attributes. They are
the most basic of all data quality rules.
2. Relational integrity rules are derived from the relational data models and enforce identity
and referential integrity of the data.
3. Rules for historical data include timeline constraints and value patterns for time-
dependent value stacks and event histories.
4. Rules for state-dependent objects place constraints on the lifecycle of objects described by
so-called state-transition models.
5. General dependency rules describe complex attribute relationships, including constraints
on redundant, derived, partially dependent, and correlated attributes.
At the most atomic level, the data in any database consists of individual values of various
attributes. Those values generally represent measurements of the characteristics of real world
people, things, places, or events. For instance, height and weight are characteristics of people;
latitude and longitude are characteristics of geographical locations on Earth; and room number
and duration are characteristics of a “business meeting” event.
Now, real world objects cannot take just any shape and form. We do not expect people to be 12 feet
tall, or meetings to be held on the 215th floor. What this means is that the attributes of these
objects cannot take arbitrary values, only certain reasonable ones. The data quality rules used to
validate individual attribute values are commonly referred to as attribute domain constraints.
These include optionality, format, valid value, precision, and granularity requirements.
Attribute domain constraints can be deduced from analysis of the metadata, such as data models,
data dictionaries, and lookup tables. However, this metadata should be used with caution since
it is often incorrect or incomplete. Data models typically reflect the data structure at the time
of database design. Over time, data models are rarely updated and quickly become obsolete,
especially in the volatile area of attribute domains. Data dictionaries and lookup tables are also
seldom up-to-date.
The cure for this common metadata malady is data profiling – a combination of techniques aimed
at examining the data and understanding its actual content, rather than the content described
theoretically in the data models and data dictionaries. Specifically, attribute profiling examines
the values of individual data attributes and produces three categories of metadata for each
attribute: basic aggregate statistics, frequent values, and value distribution. Analysis of this
information helps identify allowed values for an attribute.
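The sketch below shows one possible shape of such an attribute profile, built with the Python standard library only; the exact statistics an assessment project collects will vary.

    from collections import Counter
    from statistics import mean, median

    def profile_attribute(values: list, top_n: int = 10) -> dict:
        """Attribute profile: basic aggregate statistics, most frequent
        values, and the full value distribution of one attribute."""
        non_null = [v for v in values if v not in (None, "")]
        counts = Counter(non_null)
        profile = {
            "record_count": len(values),
            "null_count": len(values) - len(non_null),
            "distinct_count": len(counts),
            "frequent_values": counts.most_common(top_n),
            "distribution": counts,
        }
        numeric = [v for v in non_null if isinstance(v, (int, float))]
        if numeric:
            profile.update(minimum=min(numeric), maximum=max(numeric),
                           mean=mean(numeric), median=median(numeric))
        return profile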
Optionality Constraints
These prevent attributes from taking Null, or missing, values. On the surface, these constraints
appear to be the easiest to identify. In fact, Not-Null constraints are often enforced by relational
databases. So what is the point of validating such constraints? The devil, as usual, is in the details.
First, Not-Null constraints are often turned off in databases to allow for situations when the
attribute value is not available (though required!) but the record must be created. For the same
reason, optionality is not always represented correctly in the metadata. Routinely, data models
and data dictionaries reflect the actual database configuration (i.e. Null values allowed), rather than
the true data quality requirement.
More importantly, default values are often entered to circumvent the Not-Null constraints.
The attribute is populated with such a default when the actual value is not available. Database designers
are often unaware of such default values, and data dictionaries rarely list them. Even business
users who enter them might forget all default values they use or used in the past. Yet, default
values are no different from missing values for all practical purposes.
How do we identify the default values? This is done by analysis of frequent values in the
attribute profile. The default values are usually “strikingly inappropriate.” Frequent values that
do not look “real” are likely candidates.
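Building on the attribute profile sketched earlier, a first pass at spotting hidden defaults might simply flag any single value that covers an implausibly large share of records; the 5% threshold below is an arbitrary illustration.

    def default_value_candidates(profile: dict, share_threshold: float = 0.05):
        """Frequent values that look like hidden defaults: any single value
        covering more than `share_threshold` of the non-null records
        (e.g. '01/01/1900', 'UNKNOWN', 99999) deserves a closer look."""
        non_null = profile["record_count"] - profile["null_count"]
        return [(value, count)
                for value, count in profile["frequent_values"]
                if non_null and count / non_null > share_threshold]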
Format Constraints
Format constraints define the expected form in which the attribute values are stored in the
database field. Format constraints are most important when dealing with “legacy” databases.
However, even modern databases are full of surprises. From time to time, numeric and date/time
attributes are still stored in text fields.
Format constraints for numeric, date/time, and currency attributes are usually represented as a
value mask, à la MMDDYYYY, standing for a 2-digit month followed by a 2-digit day and a 4-digit
year. Text attributes made of a single word typically carry restrictions on length, allowed
characters, and mask. Text attributes made of multiple words often have format constraints in the
form of a word pattern.
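Format constraints of this kind translate naturally into regular-expression masks. The attribute names and masks below are hypothetical examples, not a prescription.

    import re

    # Illustrative masks: MMDDYYYY dates stored as text, and a two-letter,
    # six-digit identifier pattern.
    FORMAT_MASKS = {
        "hire_date": re.compile(r"(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(19|20)\d{2}"),
        "employee_id": re.compile(r"[A-Z]{2}\d{6}"),
    }

    def format_violations(records, attribute):
        """Records whose attribute value does not match the expected mask."""
        mask = FORMAT_MASKS[attribute]
        return [r for r in records
                if not mask.fullmatch(str(r.get(attribute, "")))]

    print(format_violations([{"hire_date": "02291996"}, {"hire_date": "13011996"}],
                            "hire_date"))   # the second record violates the mask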
Valid Value Constraints
Valid value constraints limit permitted attribute values to a prescribed list or range.
Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To identify valid
values, we first need to collect counts of all actual values. These counts can then be analyzed,
and actual values can be cross-referenced against the valid value list, if available. Values that are
found in many records are probably valid, even if they are missing from the data dictionary. This
typically happens when new values are added after the original database design and are not
added to the documentation. Values that have low frequency are suspect.
For numeric and date/time attributes, the number of valid values is typically infinite (or at least
too large to be enumerated in practice). However, even for these attributes certain values are not
valid. For instance, annual compensation cannot be negative. Further, an employee's date of birth
can be neither too far in the past nor too close to the present day. In these situations, the attribute
domain constraints take the form of a valid value range rather than a list.
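A sketch of both flavors of valid value constraints, a prescribed list and a reasonable range, is shown below; the attribute names, the valid codes, and the date-of-birth cutoffs are assumptions for illustration.

    from datetime import date

    VALID_GENDER_CODES = {"M", "F", "U"}     # hypothetical valid value list

    def valid_value_violations(records):
        """Values outside the prescribed list or the reasonable range."""
        errors = []
        for r in records:
            if r["gender"] not in VALID_GENDER_CODES:
                errors.append((r["id"], "gender", r["gender"]))
            if r["annual_compensation"] < 0:
                errors.append((r["id"], "annual_compensation",
                               r["annual_compensation"]))
            # Date of birth can be neither too far in the past nor too close
            # to the present day (cutoffs here are illustrative).
            if not (date(1900, 1, 1) <= r["date_of_birth"]
                    <= date(date.today().year - 16, 12, 31)):
                errors.append((r["id"], "date_of_birth", r["date_of_birth"]))
        return errors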
Some numeric and especially date/time attributes have very complex domain constraints. Instead
of a single range, the domain is defined as a set of ranges following a certain pattern. For
example, year-end bonus payment date may only fall in December and January. Extensive
analysis of value distributions is the only way to identify such domain constraints.
Precision Constraints
Precision constraints require all values of an attribute to have the same precision, granularity, and
unit of measurement. Precision constraints can apply to both numeric and date/time attributes.
For numeric values, they define the desired number of decimals. For the date/time attributes,
precision can be defined as calendar month, day, hour, minute, or second. Data profiling can be
used to calculate the distribution of values across precision levels. Deviation of the distribution
from random is a sign of a prevailing precision.
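One way to profile numeric precision is to tabulate how many decimal places each value actually carries; a single dominant precision level then suggests the prevailing constraint. A minimal standard-library sketch:

    from collections import Counter
    from decimal import Decimal

    def decimal_precision_profile(values):
        """Distribution of the number of decimal places across numeric values."""
        def decimals(v):
            exponent = Decimal(str(v)).normalize().as_tuple().exponent
            return max(0, -exponent)
        return Counter(decimals(v) for v in values)

    print(decimal_precision_profile([12.5, 9.99, 10.0, 7.125, 8.25]))
    # Counter({2: 2, 1: 1, 0: 1, 3: 1})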
Summary
In this article we introduced the concept of data quality rules and discussed the first category of
rules – attribute domain constraints. There are many practical challenges in identifying even
these seemingly trivial rules. Comprehensive attribute profiling is the key to success. Without
detailed understanding of attribute profiles, domain constraints will always be incomplete and
incorrect. In the next article of the series we will discuss relational integrity rules.
Data Quality Rules: Relational Integrity Constraints
Of all revolutions in information technology, the introduction of relational data models arguably had the
greatest impact. It gave database designers a recipe for systematic and efficient organization of data.
Now, some 30+ years since their introduction, relational databases are the cornerstone of the
information universe. Relational data models offer a comprehensive notion of the data structure. In
doing so, they also place many constraints on the data. We refer to the data quality rules that are
derived from the analysis of relational data models as relational integrity constraints. They are relatively
easy to identify and implement, which makes the relational data model a starting point in any data quality
assessment project. These rules include identity, reference, cardinality, and inheritance constraints.
Identity Rules
An identity rule validates that every record in a database table corresponds to one and only one
real world entity and that no two records reference the same entity. Imagine the pirates dividing
the stolen loot according to the personnel table. Mad Dog is engaged in a fight till death with
recently recruited Mad Doug whose name was accidentally misspelled by the spelling-
challenged captain. Wild Billy who changed his name to One-Eyed Billy after the last battle is
trying to sneak in and collect two shares in accordance with the register. Life would be tough for
pirates in the information age. But it is equally tough for employees, customers, and other objects
whose data is maintained in the modern databases.
A reader familiar with database design will naturally ask, “Aren’t identity rules always enforced
in relational databases through primary keys?” Indeed, according to sound data modeling
principles every entity must have a primary key – a nominated set of attributes that uniquely
identifies each entity occurrence. In addition to the uniqueness requirement, primary keys impose
Not-Null constraints on all nominated attributes.
While primary keys are usually enforced in relational databases, this does not guarantee proper
entity identity! One of the reasons is that surrogate keys are often created and nominated as
primary keys. Surrogate keys use computer generated unique values to identify each record in a
table, but their uniqueness is meaningless for data quality. More importantly, multiple records
with distinct values of the primary key may represent the same real world object, if the key
values are erroneous. Finding these cases of hidden mistaken identity requires sophisticated de-
duplication software. Fortunately, various tools are available on the market for de-duplication of
records for persons or businesses.
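A crude first pass at identity checking is to group records by their identifying attributes and flag groups with more than one member; real de-duplication needs fuzzy matching, but exact matches already surface many problems. The attribute names below are illustrative.

    from collections import defaultdict

    def duplicate_identities(records, identifying_attributes=("name", "date_of_birth")):
        """Groups of records that share the same identifying attribute values
        and therefore may describe the same real-world entity."""
        groups = defaultdict(list)
        for r in records:
            key = tuple(r[a] for a in identifying_attributes)
            groups[key].append(r)
        return {key: recs for key, recs in groups.items() if len(recs) > 1}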
Reference Rules
A reference rule ensures that every reference made from one entity occurrence to another entity
occurrence can be successfully resolved. Each reference rule is represented in relational data
models by a foreign key that ties an attribute or a collection of attributes of one entity with the
primary key of another entity. Foreign keys guarantee that navigation of a reference across
entities does not result in a “dead end.”
Foreign keys are always present in data models but are often not enforced in actual databases.
This is done primarily to accommodate real data that may be erroneous! Solid database design
precludes entering such records, but in practice it is often considered a lesser evil to allow an
unresolved link in the database than to possibly lose valuable data by not entering it at all. The
problem is intended to be fixed later, but “later” never comes. Foreign key violations are
especially typical for data loaded during data conversions from “legacy” non-relational systems,
or as a result of incomplete record purging.
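Checking a reference rule amounts to looking for "dead end" foreign key values; a sketch with hypothetical entity and key names:

    def unresolved_references(child_records, parent_records,
                              foreign_key, primary_key="id"):
        """Child records whose foreign key resolves to no parent record."""
        parent_keys = {p[primary_key] for p in parent_records}
        return [c for c in child_records
                if c.get(foreign_key) is not None
                and c[foreign_key] not in parent_keys]

    employees = [{"id": 1}, {"id": 2}]
    paychecks = [{"employee_id": 1}, {"employee_id": 7}]    # 7 is a dead end
    print(unresolved_references(paychecks, employees, "employee_id"))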
Cardinal Rules
A cardinal rule defines the constraints on relationship cardinality. Cardinal rules are not to be
confused with reference rules. Whereas reference rules are concerned with the identity of the
occurrences in referenced entities, cardinal rules define the allowed number of such occurrences.
Probably the most famous example of a practical application of cardinal rules is Noah’s ark.
Noah had to take into his vessel two animals of each species – male and female. Assuming that
he had tracked his progress using a relational database, Noah would need at least two entities –
SPECIES and ANIMAL – tied by a relationship with a cardinality of exactly one on the left side
and two on the right side. In fact, Noah’s task was even more complex as he needed to ensure
that the two selected species were of different gender – an inheritance rule that we will discuss in
the next section. And, of course, he needed to ensure the proper identity of each animal. I
imagine that had Noah used modern technology and had the data quality been consistent with a
common level of that in modern databases, we would remember the story of Noah’s ark in the
same context as the mass extinction of the dinosaurs.
Cardinal rules can be initially identified by analysis of the relationships shown in the relational
data models. However relationship cardinality is often represented incorrectly in relational data
models. For example, optionality is sometimes built into the entity-relationship diagrams simply
because real data is imperfect. Another problem is that commonly used data modeling notations
do not distinguish cardinality beyond zero, one, and many. Thus, cardinality “many” is used as a
proxy for “more than one,” without specifying actual cardinality constraint.
In order to identify true cardinal rules, we use relationship cardinality profiling – an exercise in
counting actual occurrences for each relationship in the data model. Once counted, the results are
presented in a cardinality chart showing how many of the parent records have 0, 1, 2, and so on
corresponding dependent records. High frequencies are usually indicative of legitimate
cardinalities, while rare occurrences are suspicious and require further investigation.
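Relationship cardinality profiling reduces to counting dependent records per parent and tabulating the result; the Noah-flavored usage below is, of course, illustrative.

    from collections import Counter

    def cardinality_chart(parent_records, child_records,
                          foreign_key, primary_key="id"):
        """How many parent records have 0, 1, 2, ... dependent records."""
        children_per_parent = Counter(c[foreign_key] for c in child_records)
        return dict(sorted(Counter(
            children_per_parent.get(p[primary_key], 0) for p in parent_records
        ).items()))

    species = [{"id": s} for s in ("dove", "raven", "unicorn")]
    animals = [{"species_id": "dove"}] * 2 + [{"species_id": "raven"}] * 3
    print(cardinality_chart(species, animals, "species_id"))   # {0: 1, 2: 1, 3: 1}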
Inheritance Rules
An inheritance rule expresses integrity constraints on entities that are associated through
generalization and specialization, or more technically through sub-typing. Consider entities
EMPLOYEE and APPLICANT representing company employees and job applicants
respectively. These entities overlap as some of the applicants are eventually hired and become
employees. More importantly, they share many attributes, such as name and date of birth. In
order to minimize redundancy an additional entity – PERSON – can be created. It houses
common basic indicative data for all employees and applicants. The original entities now only
store attributes unique to employees and applicants. These three entities are said to have a sub-
typing relationship.
Inheritance rules enforce validity of the data governed by the sub-typing relationships. For
instance, the rule based on the complete conjoint relationship between entities PERSON,
EMPLOYEE and APPLICANT has the form:
Any PERSON occurrence not found in either EMPLOYEE or APPLICANT entity is erroneous
(or more likely points to a missing employee or applicant record).
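The sub-typing rule above can be checked directly; the sketch assumes, hypothetically, that all three entities share a person_id key.

    def orphaned_person_records(persons, employees, applicants, key="person_id"):
        """PERSON occurrences found in neither the EMPLOYEE nor the
        APPLICANT entity, violating the complete sub-typing relationship."""
        subtype_keys = {e[key] for e in employees} | {a[key] for a in applicants}
        return [p for p in persons if p[key] not in subtype_keys]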
Summary
Relational data models are a gold mine for data quality rules, specifically identity, reference,
cardinality, and inheritance constraints. Along with the attribute domain constraints discussed in
the previous article, relational integrity constraints are the easiest to identify and implement.
Unfortunately they will only locate the most basic and glaring data errors. In the future articles of
this series we will graduate to more advanced categories of data quality rules.
Data Quality Rules: Historical Data
Introduction
Newborn babies grow into playful toddlers, love-stricken teenagers, busy adults, and finally wise
matriarchs and patriarchs.
Employee positions change over time, their skills increase, and so hopefully do their salaries.
Stock markets fluctuate, product sales ebb and flow, corporate profits vary, empires rise and fall,
and even celestial bodies move about in an infinite dance of time. We use the term time-
dependent attribute to designate any object characteristic that changes over time.
The databases charged with the task of tracking various object attributes inevitably have to
contend with this time-dependency of the data.
Historical data comprise the majority of data in both operational systems and data warehouses.
They are also most error-prone.
There is always a chance to miss parts of the history during data collection, or incorrectly
timestamp the collected records.
Also, historical data often spend years inside databases and undergo many transformations,
providing plenty of opportunity for data corruption and decay.
This combination of abundance, critical importance, and error-affinity of the historical data
makes them the primary target in any data quality assessment project.
The good news is that historical data also offer great opportunities for validation. Both the
timestamps and values of time-dependent attributes usually follow predictable patterns that can
be checked using data quality rules.
Timestamp Constraints
Timestamp constraints validate that all required, desired, or expected measurements are recorded
and that all timestamps are valid.
A currency rule enforces the desired “freshness” of the historical data.
Currency rules are usually expressed in the form of constraints on the effective date of the most
recent record in the history. For example, the currency rule for annual employee compensation
history requires the most recent record for each employee to match the last complete calendar
year.
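For the compensation example, the currency rule might be sketched as follows, assuming a dict of per-employee history records, each carrying a year field.

    from datetime import date

    def currency_violations(compensation_history, today=None):
        """Employees whose most recent annual compensation record does not
        match the last complete calendar year."""
        today = today or date.today()
        expected_year = today.year - 1
        violations = []
        for employee_id, records in compensation_history.items():
            latest_year = max((r["year"] for r in records), default=None)
            if latest_year != expected_year:
                violations.append((employee_id, latest_year))
        return violations

    history = {"E1": [{"year": 2022}, {"year": 2023}], "E2": [{"year": 2021}]}
    print(currency_violations(history, today=date(2024, 6, 1)))   # [('E2', 2021)]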
A retention rule governs how long historical data must be kept. Retention rules are usually
expressed in the form of constraints on the overall duration or the number of records in the
history, and they often reflect common retention policies and regulations requiring data to be
stored for a certain period of time before it can be discarded.
For instance, all tax-related information may need to be stored for seven years pending
possibility of an audit. Further, a bank may be required to keep data of all customer transactions
for several years.
Values of some attributes are most meaningful when accumulated over a period of time. We
refer to any series of cumulative time-period measurements as accumulator history.
For instance, product sales history might be a collection of the last 20 quarterly sales totals.
Employee compensation history may include annual compensation for the last five calendar
years.
A granularity rule requires all measurement periods in an accumulator history to have the same
size. In the product sales example, it is a calendar quarter; for the employee compensation
example, it is a year.
A continuity rule prohibits gaps and overlaps in accumulator histories. Continuity rules require
that the beginning date of each measurement period immediately follows the end date of the
previous period.
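A continuity check can be sketched over a list of (start, end) measurement periods; after sorting, each period must begin the day after the previous one ends.

    from datetime import date, timedelta

    def continuity_violations(periods):
        """Gaps and overlaps between consecutive measurement periods of an
        accumulator history, given as (start_date, end_date) tuples."""
        violations = []
        ordered = sorted(periods)
        for (_, prev_end), (start, _) in zip(ordered, ordered[1:]):
            if start != prev_end + timedelta(days=1):
                kind = "gap" if start > prev_end else "overlap"
                violations.append((kind, prev_end, start))
        return violations

    quarters = [(date(2023, 1, 1), date(2023, 3, 31)),
                (date(2023, 4, 1), date(2023, 6, 30)),
                (date(2023, 8, 1), date(2023, 9, 30))]     # July is missing
    print(continuity_violations(quarters))
    # [('gap', datetime.date(2023, 6, 30), datetime.date(2023, 8, 1))]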
The aforementioned rules enforce that historical data cover the entire desired span of time.
However, this does not yet guarantee that the data is complete and accurate.
More advanced rules are necessary to identify possibly missing historical records or to find
records with incorrect timestamps. All such rules are based on validation of more complex
patterns in historical data.
A timestamp pattern rule requires all timestamps to fall into a certain prescribed date interval,
such as every March or every other Wednesday or between the first and fifth of each month.
Occasionally the pattern takes the form of minimum or maximum length of time between
measurements.
For example, participants in a medical study may be required to take blood pressure readings at
least once a week. While the length of time between particular measurements will differ, it has to
be no longer than seven days.
Timestamp patterns are common to many historical data. However, finding the pattern can be a
challenge. Extensive data profiling and analysis is the only reliable solution.
A useful profiling technique is to collect counts of records by calendar year, month, day, day of
week, or any other regular time interval.
For example, frequencies of records for each calendar month (the year and day of the record do
not matter) will tell whether the records have effective dates spread randomly over the year or
whether they follow some pattern.
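A minimal version of this profiling technique simply counts records per calendar month:

    from collections import Counter
    from datetime import date

    def month_profile(effective_dates):
        """Record counts by calendar month; the year and day are ignored."""
        return Counter(d.month for d in effective_dates)

    dates = [date(2021, 3, 5), date(2022, 3, 17), date(2023, 3, 2), date(2023, 7, 1)]
    print(month_profile(dates))   # Counter({3: 3, 7: 1}) -- a March pattern plus one outlier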
Value Constraints
Value histories for time-dependent attributes usually also follow systematic patterns.
A value pattern rule utilizes these patterns to predict reasonable ranges of values for each
measurement and identify likely outliers. Value pattern rules can restrict direction, magnitude, or
volatility of change in data values.
The simplest value pattern rules restrict the direction in value changes from measurement to
measurement. This is by far the most common rule type. Electric meter measurements, the total
number of copies of my book sold to date, and the values of many other common attributes
always grow or at least remain the same.
A slightly more complex form of a value pattern rule restricts the magnitude of value changes.
It is usually expressed as a maximum (and occasionally minimum) allowed change per unit of
time.
For instance, changes in a person's height might be restricted to six inches per year. This does not
mean that the value can never change by more than six inches between measurements, but rather
that the change cannot exceed six inches times the length of the interval in years.
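The pro-rated magnitude constraint described above might look like this, with the history given as (date, value) pairs and the six-inches-per-year limit passed in as a parameter:

    from datetime import date

    def magnitude_violations(history, max_change_per_year):
        """Consecutive measurements whose change exceeds the allowed
        magnitude, pro-rated by the length of the interval in years."""
        violations = []
        ordered = sorted(history)
        for (d1, v1), (d2, v2) in zip(ordered, ordered[1:]):
            interval_years = (d2 - d1).days / 365.25
            if abs(v2 - v1) > max_change_per_year * interval_years:
                violations.append((d1, d2, v2 - v1))
        return violations

    heights = [(date(2020, 1, 1), 48.0), (date(2021, 1, 1), 57.0)]  # +9 inches in a year
    print(magnitude_violations(heights, max_change_per_year=6.0))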
The magnitude-of-change constraints work well for attributes whose values are rather stationary.
This does not apply to many real world attributes. For instance, regular pay raises rarely exceed
10-15%, but raises for employees promoted to a new position routinely reach 20-30% or even
more. Since the majority of employees experience a promotion at least once in their career, we
could not use a magnitude-of-change constraint for pay rate histories.
However, pay rates still do not change arbitrarily. Normal behavior of pay rate history for an
employee of most companies is a steady increase over the years. A sudden increase in pay rate
followed by a drop signals an error in the data (or the end of the dot-com bubble).
The value pattern rule that can identify such errors must look for spikes and drops in consecutive
values. Here we do not restrict individual value change, but rather do not permit an increase to be
followed by a decrease and vice versa.
In other words, the rule restricts volatility of value changes. Rules of this type are applicable to
many data histories.
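A volatility check of this kind flags any value that is higher (or lower) than both of its neighbors in the time-ordered history; the pay rate figures below are invented.

    from datetime import date

    def volatility_violations(history):
        """Spikes and drops: values higher or lower than both neighbors in a
        time-ordered history of (date, value) pairs."""
        values = [v for _, v in sorted(history)]
        return [(i, values[i]) for i in range(1, len(values) - 1)
                if (values[i - 1] < values[i] > values[i + 1])
                or (values[i - 1] > values[i] < values[i + 1])]

    pay_rates = [(date(2019, 1, 1), 50_000), (date(2020, 1, 1), 80_000),
                 (date(2021, 1, 1), 55_000), (date(2022, 1, 1), 58_000)]
    print(volatility_violations(pay_rates))   # [(1, 80000), (2, 55000)] -- spike, then drop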
Summary
Historical data comprise the majority of data in both operational systems and data warehouses.
The abundance, critical importance, and error-affinity of the historical data make them the
primary target in any data quality assessment project.
In this article we discussed data quality rules for the common time-dependent attributes.
In the future articles of this series we will address more advanced data categories, such as event
histories and state-dependent data.
Data Quality Rules: Event Histories
Introduction
Time is arguably the most important aspect of our life. We are surrounded by calendars and watches,
and rare is the activity that does not involve time. Ever since my son entered elementary school, his life
became a collection of timestamps and time intervals: school schedule, soccer schedule, play date, time
to do homework, TV time, time to play video games, time to go to bed, number of days till Christmas
and to the next vacation, and even the number of years left to accumulate college funds. And it stays
that way for an entire life, except for rare Hawaii vacations.
This phenomenon stays true in the databases we build. Much of the data is time-stamped, and the
absolute majority of database entities contain histories. In the previous article we discussed data quality rules
for simple value histories. In this article we will discuss event histories.
Car accidents, doctor appointments, employee reviews and pay raises are all examples of events. Event
histories are more complex than value histories. First, events often apply to several objects. For
instance, a doctor’s appointment involves two individuals – the doctor and the patient. Secondly, events
sometimes occupy a period rather than a point in time. Thus, recording a doctor’s appointment requires
appointment scheduled time and duration. Finally, events are often described with several event-
specific attributes. For example, the doctor’s appointment can be prophylactic, scheduled, or due to an
emergency. It can further be an initial or a follow-up visit, and it will often result in diagnosis.
In the practice of data quality assessment, validating event histories often occupies the bulk of the
project and finds numerous errors. Rules that are specific to event histories can be classified into event
dependencies, event conditions, and event-specific attribute constraints.
Event Dependencies
Various events often affect the same objects and therefore may be interdependent. Data quality rules
can use these dependencies to validate the event histories. The simplest rule of this kind restricts
frequency of the events. For example, patients may be expected to visit the dentist at least every six
months for regular checkups. While the length of time between particular visits will differ, it has to be no
longer than six months.
Sometimes event frequency can be defined as a function of other data. For example, an airplane is
required to undergo extensive maintenance after a certain number of flights. Here frequency of
maintenance events is not a function of time but of another data attribute. Assuming good safety
procedures, a greater than required number of flights between maintenance events is a likely indication
of a missing record in the event history.
A constraint can also be placed on the number of events per unit of time. For example, a doctor may not
be able to see more than 15 patients in a normal workday. A higher number of doctor visits likely
indicates that some of the records in the event history carry an erroneous doctor name or visit
date.
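Both kinds of frequency constraints are easy to sketch; the six-month gap (approximated as 183 days) and the 15-visits-per-day limit below are the illustrative values from the text, and the field names are assumptions.

    from collections import Counter
    from datetime import timedelta

    def gap_violations(visit_dates, max_gap=timedelta(days=183)):
        """Consecutive events further apart than allowed (about six months)."""
        ordered = sorted(visit_dates)
        return [(d1, d2) for d1, d2 in zip(ordered, ordered[1:]) if d2 - d1 > max_gap]

    def overbooked_days(appointments, max_per_day=15):
        """(doctor, date) combinations with an implausible number of visits."""
        per_day = Counter((a["doctor"], a["date"]) for a in appointments)
        return {key: n for key, n in per_day.items() if n > max_per_day}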
The most complex rules apply to situations when events are tied by a cause-and-effect relation. For
example, mounting a dental crown will involve several visits to the dentist. The nature, spacing, and
duration of the visits are related. Relationships of this kind can get quite complex with the timing and
nature of the next event being a function of the outcome of the previous event. For instance, a diagnosis
made during the first appointment will influence following appointments.
Event Conditions
Events of many kinds do not occur at random but rather only happen under certain unique
circumstances. Event conditions verify these circumstances. Consider a typical new car maintenance
program. It includes several visits to the dealership for scheduled maintenance activities. These activities
may include engine oil change, wheel alignment, tire rotation, and brake pad replacement. For each
activity, there is a desired frequency.
My new car has a great gadget that reminds me when each of the activities is due. It does it in a
beautiful voice, but in no uncertain terms. A typical announcement is “Your tires are due for rotation.
Driving the car may be VERY unsafe. Please, make a legal U-turn and proceed to the nearest dealership
at a speed of no more than 15 miles an hour.” Since I do not appreciate this kind of life-threatening
circumstance, I decided to visit the dealership before maintenance was due. Unfortunately, for obvious
business reasons, the dealership would not do the maintenance before it is due. As it was, my only
option was to wait for the next announcement and find my way to the nearest dealership at the speed
of 15 miles an hour.
On a more serious note, this constraint is an example of event condition – a condition that must be
satisfied for an event to take place. Each specific car maintenance event has pre-conditions based on the
car make and model, the age of the car, car mileage, and the time since the last event of same type. All
of these conditions can be implemented in a data quality rule (or rules) and used to validate car
maintenance event histories in an auto dealership database.
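As a sketch, the mileage-based pre-condition might be validated like this; the 5,000-mile service interval and the field names are hypothetical.

    def premature_maintenance(events, min_miles_between=5_000):
        """Maintenance events recorded before the mileage pre-condition for
        the next service of the same type could have been met."""
        suspects = []
        last_mileage = {}
        for e in sorted(events, key=lambda e: (e["car_id"], e["type"], e["mileage"])):
            key = (e["car_id"], e["type"])
            if key in last_mileage and e["mileage"] - last_mileage[key] < min_miles_between:
                suspects.append(e)
            last_mileage[key] = e["mileage"]
        return suspects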
Event-Specific Attribute Constraints
Events themselves are often complex entities, each with numerous attributes. Consider
automobile accidents. Record of each accident must be accompanied by much data – involved
cars and their post-accident condition, involved drivers and their accident accounts, police
officers and their observations, witnesses and their view of events. The list of data elements can
be quite long, and the data may be stored simply in extra attributes of the event table or in
additional dependent entities.
Event-specific attribute constraints enforce that all attributes relevant to the event are present.
The exact form of these constraints may depend on the nature of the event and its specific
characteristics. For instance, a collision must involve two or more cars with two or more drivers,
each driver matched to a car. Having two drivers steering the same car is a recipe for collision, or
more likely an indication of a data error.
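A sketch of such an event-specific constraint for collision records, with a hypothetical record layout:

    from collections import Counter

    def collision_attribute_violations(accident):
        """Checks that a collision involves two or more cars and that each
        car is matched to exactly one driver."""
        problems = []
        cars = accident.get("cars", [])
        if len(cars) < 2:
            problems.append("collision must involve two or more cars")
        drivers_per_car = Counter(d["car_id"] for d in accident.get("drivers", []))
        for car in cars:
            if drivers_per_car.get(car["id"], 0) != 1:
                problems.append(f"car {car['id']} should have exactly one driver")
        return problems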
It gets even more exciting when different events may have different attributes. For instance,
collision events have somewhat different attributes than hit-and-run events. The former embroil
two or more cars, each with a driver; the latter usually involve a single car with no identified
driver. Thus the name “event-specific attribute constraints” has two connotations – both the
attributes and the constraints are event-specific.
Summary
The critical importance of the event histories in the database makes them the primary target in any data
quality assessment project. At the same time, event constraints are rarely enforced by databases, and
erroneous data in this area proliferate. In most cases event dependencies can only be found by
extensive analysis of the nature of events. Business users will provide key input here. While the data
quality rules for event histories are rather complex to design and implement, they are crucial to data
quality assessment since they usually identify numerous critical data errors.