Data mining is the process of discovering useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies together with statistical and mathematical techniques. It is the analysis of observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
Several data transformations are useful when preparing data for data mining. They are as follows −
Flag normal, abnormal, out of bounds, or impossible facts − Marking measured facts with special flags can be extremely helpful. Some measured facts may be correct but highly unusual. Perhaps these facts are based on a small sample or a special circumstance.
Other facts may be present in the data but must be regarded as impossible or inexplicable. In each of these cases, it is better to mark the data with a status flag, so that it can be constrained into or out of the analysis, than to delete the unusual value from the table.
A good way to handle these cases is to create a special data status dimension for the fact record. This dimension can be used as a constraint in queries and to describe the status of each fact.
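The flagging idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FactStatus` values, the `classify` helper, the `quantity` field, and the three nested ranges are all hypothetical and would come from domain knowledge in a real warehouse.

```python
from enum import Enum


class FactStatus(Enum):
    """Hypothetical members of a data status dimension."""
    NORMAL = "normal"
    ABNORMAL = "abnormal"            # correct but highly unusual
    OUT_OF_BOUNDS = "out_of_bounds"  # outside the limits the business allows
    IMPOSSIBLE = "impossible"        # cannot be a real measurement


def classify(value, usual, allowed, possible):
    """Return a status for value; each range is an inclusive (low, high) tuple,
    assumed to be nested from narrowest (usual) to widest (possible)."""
    if not possible[0] <= value <= possible[1]:
        return FactStatus.IMPOSSIBLE
    if not allowed[0] <= value <= allowed[1]:
        return FactStatus.OUT_OF_BOUNDS
    if not usual[0] <= value <= usual[1]:
        return FactStatus.ABNORMAL
    return FactStatus.NORMAL


# Illustrative fact records (made-up values).
facts = [
    {"id": 1, "quantity": 12},
    {"id": 2, "quantity": 480},
    {"id": 3, "quantity": 1500},
    {"id": 4, "quantity": -3},
]

# Flag every fact instead of deleting the unusual ones.
for fact in facts:
    fact["status"] = classify(
        fact["quantity"],
        usual=(0, 100),
        allowed=(0, 1000),
        possible=(0, 10**6),
    )

# The status acts as a constraint: keep only facts trusted for this analysis.
included = [f for f in facts if f["status"] in (FactStatus.NORMAL, FactStatus.ABNORMAL)]
```

All four records stay in `facts`; only the constraint on `status` decides which of them take part in a given analysis.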
Recognize random or noise values from context and mask out − A special case of the preceding transformation is to recognize when the legacy system has supplied a random number rather than a real fact. This can happen when the legacy system is not meant to deliver any value, but a number left over in a buffer has been passed down to the data warehouse. When this case is identified, the random number should be replaced with a null value.
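As a rough sketch of masking from context, the snippet below assumes a hypothetical record layout in which a `discount` value is meaningful only when `has_discount` is true; any number found otherwise is treated as buffer leftover and replaced with `None`. The `mask_noise` helper and the field names are illustrative, not part of any particular tool.

```python
def mask_noise(records, value_field, is_meaningful):
    """Return copies of records with value_field set to None wherever the
    context says no real value could have been delivered."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy, so the source rows stay untouched
        if not is_meaningful(row):
            row[value_field] = None
        cleaned.append(row)
    return cleaned


# Illustrative legacy rows (made-up values).
rows = [
    {"order": 1, "has_discount": True, "discount": 5.0},
    {"order": 2, "has_discount": False, "discount": 73219.0},  # leftover number
    {"order": 3, "has_discount": False, "discount": 0.0},
]

cleaned = mask_noise(rows, "discount", lambda r: r["has_discount"])
```

The rule is applied uniformly: even the harmless-looking `0.0` in the third row is masked, because the context says no value was meant to be delivered there.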
Apply a uniform treatment to null values − Data mining tools are sensitive to the distinction between “cannot exist” and “exists but is unknown.” Some data mining professionals assign the most probable value or the median value in the second case so that the rest of the fact table record can participate in the analysis.
This could be done either in the original data by overwriting the null value with the estimated value, or it could be handled by a sophisticated data mining tool that knows how to process null data with various analysis options.
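A minimal sketch of the first option, overwriting unknown values with an estimate, might look like the following. It assumes a hypothetical encoding in which `None` means "exists but is unknown" and the marker `NOT_APPLICABLE` means "cannot exist"; only the unknown values are replaced, here with the median of the known ones.

```python
from statistics import median

NOT_APPLICABLE = "N/A"  # hypothetical marker for "cannot exist"
UNKNOWN = None          # hypothetical marker for "exists but is unknown"


def impute_unknown(values):
    """Replace UNKNOWN entries with the median of the known values.
    NOT_APPLICABLE entries are left alone, since no real value exists for them."""
    known = [v for v in values if v is not UNKNOWN and v != NOT_APPLICABLE]
    if not known:
        return list(values)  # nothing to estimate from
    fill = median(known)
    return [fill if v is UNKNOWN else v for v in values]


# Illustrative column (made-up values).
incomes = [42000, None, 58000, "N/A", 50000]
imputed = impute_unknown(incomes)
```

Keeping the two kinds of null apart means the estimate is never written into a record for which the value cannot exist.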
Flag fact records with changed status − A helpful data transformation is to add a special status indicator to a fact table record to show that the status of the associated account (or customer, product, or location) has just changed or is about to change. The status indicator is implemented as a status dimension in the star join design.
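One possible way to derive such an indicator is sketched below, assuming each account has an ordered history of (period, status) pairs. The function name, the indicator labels, and the sample history are hypothetical. Note that "about_to_change" relies on looking one period ahead, so it can only be filled in retrospectively when preparing historical data.

```python
def flag_status_changes(history):
    """history: list of (period, status) tuples for one account, ordered by period.
    Returns one record per period with a change indicator attached."""
    flagged = []
    for i, (period, status) in enumerate(history):
        # At the edges of the history, compare the status with itself.
        prev_status = history[i - 1][1] if i > 0 else status
        next_status = history[i + 1][1] if i + 1 < len(history) else status
        if status != prev_status:
            indicator = "just_changed"
        elif status != next_status:
            indicator = "about_to_change"
        else:
            indicator = "stable"
        flagged.append(
            {"period": period, "status": status, "change_indicator": indicator}
        )
    return flagged


# Illustrative account history (made-up values).
history = [
    ("2024-01", "active"),
    ("2024-02", "active"),
    ("2024-03", "delinquent"),
    ("2024-04", "delinquent"),
]
flagged = flag_status_changes(history)
```

In a star join design, `change_indicator` would typically become a key into the status dimension rather than a text column on the fact record.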