2 Data Types and Data Quality
Data Mining
• Data Mining is
• Data
• Data objects
• Attributes of data
• Object
• Qualitative
– Nominal
– Ordinal
– Binary
• Quantitative
– Numeric (interval-scaled or ratio-scaled)
– Discrete
– Continuous
Qualitative Attributes :: Nominal
• Related to names
• The values of a nominal attribute
– Are names of things or some kind of symbols
– Represent some category or state
• Also referred to as categorical attributes
– No ordering (rank, position) among values
• Example
Qualitative Attributes :: Ordinal
• Provides sufficient information to order the objects
• Example
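The difference between nominal and ordinal attributes can be sketched in a few lines of Python. The attribute names and values below (`hair_color`, `shirt_size`, `SIZE_RANK`) are invented for illustration and are not taken from the slides. The point is that a nominal attribute supports only equality and counting, while an ordinal attribute also supports sorting once a rank is defined.

```python
from collections import Counter

# Hypothetical sample values, invented for illustration.
hair_color = ["black", "brown", "black", "blond", "black"]   # nominal
shirt_size = ["small", "large", "medium", "small", "large"]  # ordinal

# Nominal: only equality and counting are meaningful,
# so the mode is the natural summary.
mode_color = Counter(hair_color).most_common(1)[0][0]

# Ordinal: an explicit rank makes sorting and the median meaningful.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(shirt_size, key=SIZE_RANK.__getitem__)
median_size = sorted_sizes[len(sorted_sizes) // 2]

print(mode_color)    # black
print(sorted_sizes)  # ['small', 'small', 'medium', 'large', 'large']
print(median_size)   # medium
```

Sorting `hair_color` alphabetically would run without error, but the resulting order would carry no meaning, which is exactly what "no ordering among values" says.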
Qualitative Attributes :: Binary
• Has only 2 values or states
• For Example
– Yes or no, affected or unaffected, true or false etc.
• Symmetric:
– Both values are equally important (e.g. Gender)
• Asymmetric:
– The two values are not equally important; one outcome is usually
rarer and of more interest (e.g. Result)
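One place where the symmetric/asymmetric distinction matters is when two objects are compared on their binary attributes. The sketch below uses two standard similarity measures that the slides do not name, so treat their use here as an assumption. Simple matching counts 0-0 and 1-1 agreements alike (suited to symmetric attributes), while the Jaccard coefficient ignores 0-0 agreements (suited to asymmetric attributes). The vectors `p1` and `p2` are hypothetical.

```python
def simple_matching(a, b):
    """Fraction of positions where two binary vectors agree (0-0 and 1-1)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def jaccard(a, b):
    """Agreement on 1s only; 0-0 matches are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    # Returning 0.0 when neither vector has a 1 is a convention chosen here.
    return both / either if either else 0.0


# Hypothetical test results for two patients (1 = positive, the rarer outcome).
p1 = [1, 0, 0, 0, 0, 1]
p2 = [1, 0, 0, 0, 0, 0]

print(round(simple_matching(p1, p2), 3))  # 0.833
print(jaccard(p1, p2))                    # 0.5
```

The many shared negatives make the two patients look very similar under simple matching, while Jaccard, which looks only at the positive outcomes, reports much lower agreement.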
Quantitative Attributes :: Numeric
• Quantitative
– It is a measurable quantity
– Of two types: Interval-Scaled and Ratio-Scaled
Quantitative Attributes :: Numeric
• Interval-Scaled
– The values are ordered, and the difference between values, the
mean, median, mode etc. can be computed
• Examples of numeric attributes: length, time, counts etc.
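A small worked example may help separate the two numeric types. Temperature in Celsius is a standard example of an interval-scaled attribute (its zero is arbitrary), and temperature in Kelvin of a ratio-scaled one (it has a true zero). The readings below are invented for illustration. Differences agree on both scales, but only the Kelvin ratio is meaningful.

```python
# Hypothetical temperature readings, invented for illustration.
celsius = [10.0, 20.0]
kelvin = [c + 273.15 for c in celsius]

# Differences are meaningful on both scales and agree with each other.
diff_c = celsius[1] - celsius[0]
diff_k = kelvin[1] - kelvin[0]

# Ratios are only meaningful where zero means "none of the quantity".
ratio_c = celsius[1] / celsius[0]  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = kelvin[1] / kelvin[0]    # about 1.035

print(round(diff_c, 2), round(diff_k, 2))    # 10.0 10.0
print(round(ratio_c, 3), round(ratio_k, 3))  # 2.0 1.035
```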
Quantitative Attributes :: Discrete
• Has a finite or countably infinite set of values
• The values can be numerical or categorical
• Example:
Quantitative Attributes :: Continuous
• Has real numbers as attribute values
• Typically represented as floating point variables
• Examples: temperature, height, weight etc.
Data Quality
• The measure of how well suited a data set is to serve its specific
purpose
– Accuracy
– Completeness
– Consistency
– Validity
– Uniqueness
– Timeliness
Data Quality
• Accuracy
– The data should reflect actual, real-world scenarios
– Accuracy can be confirmed by checking the data against a
verifiable source.
• Completeness
– Ability of the data to effectively deliver all the required values that
are available
• Consistency
– The uniformity of data as it moves across networks and
applications.
– The same data values stored in different locations should not
conflict with one another.
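Completeness and consistency lend themselves to simple automated checks. The sketch below is one possible way to measure them and is not a standard API. The record stores `crm` and `billing` and the helper names `completeness` and `conflicts` are hypothetical. Accuracy is left out because it needs an external verifiable source to compare against.

```python
# Hypothetical customer records held in two systems, invented for illustration.
crm = {
    101: {"name": "Ana", "email": "ana@example.com", "city": "Lisbon"},
    102: {"name": "Ben", "email": None, "city": "Oslo"},
}
billing = {
    101: {"name": "Ana", "email": "ana@example.com", "city": "Porto"},
    102: {"name": "Ben", "email": None, "city": "Oslo"},
}


def completeness(records):
    """Share of field values that are present (not None)."""
    values = [v for rec in records.values() for v in rec.values()]
    return sum(v is not None for v in values) / len(values)


def conflicts(a, b):
    """(record id, field) pairs whose values differ between two stores."""
    return [
        (key, field)
        for key in sorted(a.keys() & b.keys())
        for field in a[key]
        if a[key][field] != b[key].get(field)
    ]


print(round(completeness(crm), 2))  # 0.83
print(conflicts(crm, billing))      # [(101, 'city')]
```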
Data Quality
• Validity
– Data should be collected according to defined business rules and
parameters
– Data should conform to the right format and fall within the right
range
• Uniqueness
– Ensures that there are no duplicated or overlapping values
across all data sets
• Timeliness
– Timely data is data that is available when it is required
– Data may be updated in real time to ensure that it is readily
available and accessible.
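The last three dimensions can also be sketched as checks over a small record set. Everything specific here is an assumption made for illustration: the records, the age range of 0 to 120 as the "business rule", and the use of a 30-day staleness cutoff as a rough proxy for timeliness.

```python
from datetime import date, timedelta

# Hypothetical records, invented for illustration.
records = [
    {"id": 1, "age": 34, "updated": date(2024, 5, 1)},
    {"id": 2, "age": -5, "updated": date(2024, 5, 2)},   # age out of range
    {"id": 1, "age": 34, "updated": date(2024, 5, 3)},   # duplicate id
    {"id": 3, "age": 41, "updated": date(2023, 1, 15)},  # stale
]


def invalid_ages(rows, low=0, high=120):
    """Validity: rows whose age is not an int within [low, high]."""
    return [
        r for r in rows
        if not (isinstance(r["age"], int) and low <= r["age"] <= high)
    ]


def duplicate_ids(rows):
    """Uniqueness: ids that occur more than once."""
    seen, dups = set(), set()
    for r in rows:
        (dups if r["id"] in seen else seen).add(r["id"])
    return dups


def stale(rows, today, max_age_days=30):
    """Timeliness proxy: rows not updated within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in rows if r["updated"] < cutoff]


today = date(2024, 5, 10)  # fixed date so the example is reproducible
print([r["id"] for r in invalid_ages(records)])  # [2]
print(sorted(duplicate_ids(records)))            # [1]
print([r["id"] for r in stale(records, today)])  # [3]
```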