
Data Profiling

Data profiling is the process of examining existing data sources to collect statistics and summaries about the data. This helps assess data quality, understand data structure and relationships, discover metadata, and identify potential issues. It involves analyzing data at the column, table, and cross-table levels using descriptive statistics. Data profiling is conducted at various stages of data warehouse development to evaluate source systems and ensure proper data extraction, transformation and loading. It provides benefits like improved data quality, shorter project timelines, and better user understanding of data.

Uploaded by

charlotte899

Data profiling

Data profiling is the process of examining the data available from an existing information source (e.g. a
database or a file) and collecting statistics or informative summaries about that data.[1] The purpose of these
statistics may be to:

1. Find out whether existing data can be easily used for other purposes
2. Improve the ability to search data by tagging it with keywords or descriptions, or by assigning
it to a category
3. Assess data quality, including whether the data conforms to particular standards or
patterns[2]
4. Assess the risk involved in integrating data in new applications, including the challenges of
joins
5. Discover metadata of the source database, including value patterns and distributions, key
candidates, foreign-key candidates, and functional dependencies
6. Assess whether known metadata accurately describes the actual values in the source
database
7. Understand data challenges early in any data-intensive project, so that late project
surprises are avoided. Finding data problems late in the project can lead to delays and cost
overruns.
8. Have an enterprise view of all data, for uses such as master data management, where key
data is needed, or data governance for improving data quality.
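
As a rough illustration of purpose 5, the sketch below searches a small table for key candidates, that is, minimal column combinations whose values are unique in every row. The function name `key_candidates`, the `max_width` limit, and the sample rows are all hypothetical and chosen only for this example. Uniqueness in a sample suggests a key candidate but does not prove one.

```python
from itertools import combinations


def key_candidates(rows, max_width=2):
    """Return minimal column combinations whose values are unique across all rows."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip supersets of an already-found key: unique, but not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            values = [tuple(row[c] for c in combo) for row in rows]
            if len(set(values)) == len(values):
                found.append(combo)
    return found


# Hypothetical sample data, invented for illustration.
orders = [
    {"order_id": 1, "customer": "ann", "day": "mon"},
    {"order_id": 2, "customer": "ann", "day": "tue"},
    {"order_id": 3, "customer": "bob", "day": "mon"},
]

print(key_candidates(orders))
```

Here `order_id` alone is unique, and so is the pair (`customer`, `day`); whether either is a real key is a question for someone who knows the data.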

Introduction
Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the
structure, content, relationships, and derivation rules of the data.[3] Profiling helps to not only understand
anomalies and assess data quality, but also to discover, register, and assess enterprise metadata.[4][5] The
result of the analysis is used to determine the suitability of the candidate source systems, usually giving the
basis for an early go/no-go decision, and also to identify problems for later solution design.[3]

How data profiling is conducted


Data profiling uses descriptive statistics such as minimum, maximum, mean, mode, percentile, standard
deviation, frequency, and variation; aggregates such as count and sum; and additional metadata obtained
during profiling, such as data type, length, discrete values, uniqueness, occurrence of null values,
typical string patterns, and abstract type recognition.[4][6][7] The metadata can then be used to
discover problems such as illegal values, misspellings, missing values, varying value representation,
and duplicates.
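
A minimal sketch of single-column profiling along these lines, using only the Python standard library. The helper names (`string_pattern`, `profile_column`), the pattern alphabet (`9` for digits, `A` for letters), and the sample columns are assumptions made for this example, not part of any particular profiling tool.

```python
import re
import statistics
from collections import Counter


def string_pattern(value):
    """Abstract a string into a pattern: digits become 9, letters become A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Collect basic statistics and metadata for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "unique": len(set(present)) == len(present),
        "types": sorted({type(v).__name__ for v in present}),
    }
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present)
    if present and numeric:
        profile.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            mode=statistics.mode(present),
            stdev=statistics.stdev(present) if len(present) > 1 else 0.0,
        )
    elif present and all(isinstance(v, str) for v in present):
        lengths = [len(v) for v in present]
        profile.update(
            min_length=min(lengths),
            max_length=max(lengths),
            # Frequency of each abstract pattern, most common first.
            patterns=Counter(string_pattern(v) for v in present).most_common(),
        )
    return profile


# Hypothetical sample columns, invented for illustration.
ages = [34, 41, None, 29, 41]
zips = ["12345", "1234", "98765", "ABCDE"]

print(profile_column(ages))
print(profile_column(zips))
```

In the `zips` column the pattern counts make the likely problems visible: one value is too short (`9999`) and one contains letters (`AAAAA`), while the dominant pattern is `99999`.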

Different analyses are performed at different structural levels. For example, single columns can be
profiled individually to understand the frequency distribution of values, the type, and the use of each
column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping
value sets that may represent foreign-key relationships between entities can be explored in an
inter-table analysis.[4]
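
The cross-column and inter-table levels can be sketched as two small checks: whether one column functionally determines another within a table, and how much of one column's value set is contained in a column of another table. The function names and sample tables below are hypothetical; real profiling tools use more elaborate and more scalable algorithms.

```python
def functional_dependency_holds(rows, lhs, rhs):
    """Return True if each value of column lhs maps to exactly one value of column rhs."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def inclusion_ratio(child_values, parent_values):
    """Fraction of distinct non-null child values that also occur in the parent column."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    parent = set(parent_values)
    return len(child & parent) / len(child)


# Hypothetical sample tables, invented for illustration.
customers = [
    {"customer_id": 1, "zip": "10001", "city": "New York"},
    {"customer_id": 2, "zip": "10001", "city": "New York"},
    {"customer_id": 3, "zip": "60601", "city": "Chicago"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 4},  # no matching customer row
]

print(functional_dependency_holds(customers, "zip", "city"))
print(functional_dependency_holds(customers, "city", "customer_id"))
print(inclusion_ratio(
    [o["customer_id"] for o in orders],
    [c["customer_id"] for c in customers],
))
```

An inclusion ratio close to 1 hints at a foreign-key candidate; a ratio below 1, as in this sample, points to orphaned values worth investigating.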
Normally, purpose-built tools are used for data profiling to ease the process.[3][4][6][7][8][9] The
computational complexity increases when going from single-column, to single-table, to cross-table
structural profiling. Therefore, performance is an evaluation criterion for profiling tools.[5]
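
One way to get a feel for this growth is to count the candidate checks at each level under a deliberately simple model: one profile per column, one dependency check per ordered column pair within a table, and one inclusion check per ordered column pair across two different tables. The model and the function name `candidate_counts` are assumptions for illustration only; actual tools prune candidates and may also consider multi-column combinations.

```python
def candidate_counts(columns_per_table):
    """Count candidate checks per structural level for tables with the given column counts."""
    # One profile per column.
    single_column = sum(columns_per_table)
    # Ordered column pairs (A -> B) within each table.
    single_table = sum(n * (n - 1) for n in columns_per_table)
    # Ordered column pairs whose two columns come from different tables.
    total = sum(columns_per_table)
    cross_table = total * total - sum(n * n for n in columns_per_table)
    return single_column, single_table, cross_table


# Three hypothetical tables with 10, 20, and 30 columns.
print(candidate_counts([10, 20, 30]))
```

Even in this toy model, 60 columns lead to well over a thousand pairwise checks, each of which has to scan the data.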

When is data profiling conducted?


According to Kimball,[3] data profiling is performed several times and with varying intensity throughout the
data warehouse development process. A light profiling assessment should be undertaken immediately after
candidate source systems have been identified and DW/BI business requirements have been satisfied. The
purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the
appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the
project may be terminated.[3]

More in-depth profiling is done prior to the dimensional modeling process in order to assess
what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system
design process in order to determine the appropriate data to extract and which filters to apply to the data
set.[3]

Additionally, data profiling may be conducted in the data warehouse development process after data has
been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that
data cleaning and transformations have been done correctly and in compliance with requirements.
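
A post-load check of this kind can be as simple as comparing a few profile metrics for the same column before and after a load step. The sketch below is a hypothetical example: the function names, the choice of metrics, and the sample rows are assumptions, and a real pipeline would compare against the differences that the transformation rules are expected to produce.

```python
def summarize(rows, column):
    """Return row count, null count, and distinct count for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }


def profile_drift(source_rows, staged_rows, column):
    """Return the metrics that differ, as metric -> (source value, staged value)."""
    before = summarize(source_rows, column)
    after = summarize(staged_rows, column)
    return {m: (before[m], after[m]) for m in before if before[m] != after[m]}


# Hypothetical data: a cleaning step is assumed to drop rows with a missing email.
source = [{"email": "a@x.org"}, {"email": None}, {"email": "b@x.org"}]
staged = [{"email": "a@x.org"}, {"email": "b@x.org"}]

print(profile_drift(source, staged, "email"))
```

In this sample the row count and null count change while the distinct count does not, which is what a step that only removes null rows should produce.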

Benefits and examples


The benefits of data profiling include improved data quality, a shorter implementation cycle for major
projects, and better user understanding of the data.[9] Discovering business knowledge embedded in the
data itself is one of the significant benefits derived from data profiling.[5] Data profiling is one of the
most effective technologies for improving data accuracy in corporate databases.[9]

See also
Data quality
Data governance
Master data management
Database normalization
Data visualization
Analysis paralysis
Data analysis

References
1. Johnson, Theodore (2009). "Data Profiling". In Springer, Heidelberg (ed.). Encyclopedia of
Database Systems.
2. Woodall, Philip; Oberhofer, Martin; Borek, Alexander (2014). "A classification of data quality
assessment and improvement methods" (http://www.inderscience.com/link.php?id=68656).
International Journal of Information Quality. 3 (4): 298. doi:10.1504/ijiq.2014.068656
(https://doi.org/10.1504%2Fijiq.2014.068656).
3. Kimball, Ralph; et al. (2008). The Data Warehouse Lifecycle Toolkit
(https://archive.org/details/datawarehouselif00kimb_924) (Second ed.). Wiley. p. 376
(https://archive.org/details/datawarehouselif00kimb_924/page/n17). ISBN 9780470149775.
4. Loshin, David (2009). Master Data Management
(https://archive.org/details/masterdatamanage00losh). Morgan Kaufmann. pp. 94–96
(https://archive.org/details/masterdatamanage00losh/page/n197). ISBN 9780123742254.
5. Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Getting Onboard
with Emerging IT. Morgan Kaufmann. pp. 110–111. ISBN 9781558609167.
6. Rahm, Erhard; Hai Do, Hong (December 2000). "Data Cleaning: Problems and Current
Approaches". Bulletin of the Technical Committee on Data Engineering. IEEE Computer
Society. 23 (4).
7. Singh, Ranjit; Singh, Kawaljeet; et al. (May 2010). "A Descriptive Classification of Causes of
Data Quality Problems in Data Warehousing". IJCSI International Journal of Computer
Science Issues. 2. 7 (3).
8. Kimball, Ralph (2004). "Kimball Design Tip #59: Surprising Value of Data Profiling"
(http://www.kimballgroup.com/wp-content/uploads/2012/05/DT59SurprisingValue.pdf) (PDF).
Kimball Group.
9. Olson, Jack E. (2003). Data Quality: The Accuracy Dimension
(https://archive.org/details/dataqualityaccur00olso_641). Morgan Kaufmann. pp. 140–142
(https://archive.org/details/dataqualityaccur00olso_641/page/n159).

Retrieved from "https://en.wikipedia.org/w/index.php?title=Data_profiling&oldid=1102297638"
