Data Structuring
• Data structuring is the process of changing the organization and relationships among data fields to prepare the data for analysis.
• Extracted data often needs to be structured in a manner that will enable analysis.
• Aggregate data is the presentation of data in a summarized form.
• Data joining is the process of combining different data sources.
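As a rough illustration of aggregating and joining, the sketch below summarizes row-level sales into one total per customer and then combines those totals with a second source. The records, field names (`customer_id`, `amount`), and helper functions are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical extracted records: one row per sale.
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 75.5},
    {"customer_id": 1, "amount": 30.0},
]

# Hypothetical second data source, keyed by the same customer_id.
customers = {1: "Acme Ltd", 2: "Birch Co"}


def aggregate_sales(rows):
    """Aggregate: summarize row-level sales into one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def join_names(totals, names):
    """Join: combine the aggregated totals with names from the second source."""
    return [
        {"customer_id": cid, "name": names.get(cid), "total": total}
        for cid, total in sorted(totals.items())
    ]


totals = aggregate_sales(sales)
report = join_names(totals, customers)
```

Using `names.get(cid)` leaves the name as `None` when a customer is missing from the second source, so unmatched rows stay visible instead of being dropped silently.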
Data Standardization (1 of 3)
• Data standardization is the process of standardizing the structure and meaning of each data element so it can be analyzed and used in decision making.
  – It is particularly important when merging data from several sources.
  – It may involve changing data to a common format, data type, or coding scheme.
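One way to picture standardization when merging sources is sketched below: dates arriving in different layouts are converted to a single ISO format, and source-specific status codes are mapped to one coding scheme. The date layouts, the status mapping, and the function names are assumptions made for the example.

```python
from datetime import datetime

# Assumed date layouts used by the hypothetical sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

# Assumed mapping from source-specific codes to one common coding scheme.
STATUS_CODES = {
    "a": "active", "active": "active", "1": "active",
    "i": "inactive", "inactive": "inactive", "0": "inactive",
}


def standardize_date(text):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), trying each known layout."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def standardize_status(value):
    """Map a source-specific status code to the common coding scheme."""
    key = str(value).strip().lower()
    if key not in STATUS_CODES:
        raise ValueError(f"unrecognized status code: {value!r}")
    return STATUS_CODES[key]
```

Raising an error on an unrecognized value, rather than guessing, keeps unknown formats from slipping into the merged data unnoticed.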
Data Standardization (3 of 3)
• Cryptic data values are data items that have no meaning without understanding a coding scheme.
  – When a field contains only two different responses, typically 0 or 1, this field is called a dummy variable or dichotomous variable.
• Misfielded data values are data values that are correctly formatted but not listed in the correct field.

Country   City     Zip code
Berlin    German   ZL1340
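The sketch below shows one possible way to handle these cases in code: decoding a cryptic value with a codebook, checking whether a field behaves like a dummy variable, and flagging a row whose country field holds something that is not a known country. The codebook, the list of countries, and the function names are hypothetical; only the Berlin row comes from the slide.

```python
# Hypothetical codebook giving meaning to otherwise cryptic payment codes.
PAYMENT_CODEBOOK = {"1": "cash", "2": "credit card", "3": "bank transfer"}

# Assumed reference list of valid country names.
KNOWN_COUNTRIES = {"Germany", "France", "Spain"}


def decode(value, codebook):
    """Translate a cryptic code into its meaning, marking codes not in the codebook."""
    return codebook.get(value, f"unknown code {value}")


def is_dummy_variable(values):
    """True when the field contains exactly the two responses 0 and 1."""
    return set(values) == {0, 1}


def country_is_misfielded(row):
    """Flag rows whose country field does not hold a known country."""
    return row["country"] not in KNOWN_COUNTRIES


# The row from the slide: a city name sits in the country field.
slide_row = {"country": "Berlin", "city": "German", "zip_code": "ZL1340"}
flagged = country_is_misfielded(slide_row)
```

A lookup against a reference list only detects the problem. Deciding which field the value actually belongs in still needs a person or a further rule.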
• Data consistency is the principle that every value in a field
• Data contradiction errors are errors that exist when the same entity is described in two conflicting ways.
• Data threshold violations are data errors that occur when a data value falls outside an allowable level.
• Violated attribute dependencies are errors that occur when a secondary attribute in a row of data does not match the primary attribute.
• Data entry errors are all types of errors that come from inputting data incorrectly.
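Several of these error types can be tested for mechanically. The sketch below is a minimal illustration that assumes a zip-code-to-city lookup as the attribute dependency and uses hypothetical field names and function names throughout.

```python
from collections import defaultdict

# Assumed reference data: the zip code (primary attribute) determines the city.
ZIP_TO_CITY = {"10115": "Berlin", "80331": "Munich"}


def threshold_violation(value, low, high):
    """True when the value falls outside the allowable range [low, high]."""
    return not (low <= value <= high)


def dependency_violation(row):
    """True when the city does not match the city implied by the zip code."""
    expected = ZIP_TO_CITY.get(row["zip_code"])
    return expected is not None and expected != row["city"]


def contradictions(rows, key, field):
    """Return the keys of entities described with more than one value for a field."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[field])
    return sorted(k for k, values in seen.items() if len(values) > 1)


# Hypothetical rows: employee 7 is recorded with two different birth years.
employees = [
    {"employee_id": 7, "birth_year": 1985},
    {"employee_id": 7, "birth_year": 1958},
    {"employee_id": 8, "birth_year": 1990},
]
conflicting_ids = contradictions(employees, "employee_id", "birth_year")
```

A zip code missing from the lookup is not reported as a dependency violation here, because the check cannot tell which city would be correct.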
Data Validation
Data validation is the process of analyzing data to make certain the data has the properties of high-quality data:
• Visual inspection is the process of examining data using human vision to see if there are problems.
• Basic statistical tests can be performed to validate the data.
• Auditing a sample is one of the best techniques for assuring data quality.
• Advanced testing techniques are possible with a deeper understanding of the content of data.
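Two of these techniques lend themselves to a short sketch: basic descriptive statistics that make an implausible value stand out, and drawing a random sample of rows to audit by hand. The function names, the fixed seed, and the sample figures are assumptions for the example. The slides do not specify which statistical tests to run.

```python
import random
import statistics


def summarize(values):
    """Basic statistics for a numeric field; an extreme min or max hints at bad data."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def audit_sample(rows, k, seed=0):
    """Draw a reproducible random sample of up to k rows for manual review."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical invoice amounts; the last value is far from the others.
amounts = [10, 12, 11, 13, 1000]
summary = summarize(amounts)
to_review = audit_sample(list(range(100)), 5)
```

Fixing the seed makes the audit sample repeatable, so a second reviewer can draw the same rows.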