UNIT-04: Introduction To Data Mining: Data Mining Techniques, KDD Process, Association Rules.
This document discusses various data mining techniques including definition, predictive modeling, classification, regression, time series analysis, prediction, descriptive modeling, clustering, summarization, and association rules. It provides examples for each technique to illustrate how they are used in applications such as credit risk analysis, airport security screening, savings prediction, customer catalog targeting, university rankings, and grocery store sales analysis.
UNIT-04
INTRODUCTION TO DATA MINING:
• Definition
• Data mining Techniques
• KDD Process
• Association rules
• (http://index-of.co.uk/Data-Mining/Dunham%20-%20Data%20Mining.pdf)

Definition
• In simple words, data mining is the process of extracting usable data from a larger set of raw data. It involves analyzing patterns in large batches of data using one or more software tools.
• It is one of the steps in the KDD process.
• Data mining is often defined as finding hidden information in a database. Alternatively, it has been called exploratory data analysis, data driven discovery, and deductive learning.

Predictive Model
• A predictive model makes a prediction about values of data using known results found from different data.
• Predictive modeling may be based on other historical data. For example, a credit card use might be refused not because of the user's own credit history, but because the current purchase is similar to earlier purchases that were subsequently found to be made with stolen cards.
• Example 1.1 uses predictive modeling to predict the credit risk.

Predictive model/Supervised Techniques tasks
Predictive model data mining tasks include:
• Classification
• Regression
• Time series analysis
• Prediction

Classification
• Classification maps data into predefined groups or classes.
• It is often referred to as supervised learning because the classes are determined before examining the data.
• Two examples of classification applications are determining whether to make a bank loan and identifying credit risks.
• Classification algorithms require that the classes be defined based on data attribute values.
• They often describe these classes by looking at the characteristics of data already known to belong to the classes.
• Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes.
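The idea of describing each predefined class from data already known to belong to it, and then assigning a new item to the most similar class, can be sketched as a nearest-centroid classifier. This is only an illustrative sketch: the training records, the two features (income in thousands, debt as a percentage of income), and the function names `fit_centroids` and `classify` are assumptions made for the example and do not come from the source.

```python
import math
from collections import defaultdict


def fit_centroids(labeled):
    """Describe each predefined class by the mean of its known members."""
    groups = defaultdict(list)
    for features, label in labeled:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }


def classify(item, centroids):
    """Assign the item to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(item, centroids[label]))


# Hypothetical labeled data: (income in $1000s, debt as % of income) -> class
TRAINING = [
    ((80.0, 10.0), "low risk"),
    ((70.0, 20.0), "low risk"),
    ((30.0, 60.0), "high risk"),
    ((40.0, 50.0), "high risk"),
]

centroids = fit_centroids(TRAINING)
print(classify((72.0, 18.0), centroids))  # closest to the "low risk" centroid
print(classify((33.0, 58.0), centroids))  # closest to the "high risk" centroid
```

The classes are fixed before any new item is examined, which is what makes this a supervised technique. Real classifiers usually scale the features first so that no single attribute dominates the distance.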
EXAMPLE 1.2
• An airport security screening station is used to determine if passengers are potential terrorists or criminals.
• To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified.
• This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.

Regression
• Regression is used to map a data item to a real valued prediction variable. In actuality, regression involves the learning of the function that does this mapping.
• Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data.
• Some type of error analysis is used to determine which function is "best."

EXAMPLE 1.3
• A college professor wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on its current value and several past values.
• She uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.

Time Series Analysis
• With time series analysis, the value of an attribute is examined as it varies over time. The values usually are obtained at evenly spaced time points (daily, weekly, hourly, etc.).
• A time series plot (Figure 1.3) is used to visualize the time series.
• In this figure you can easily see that the plots for Y and Z have similar behavior, while X appears to have less volatility.
• There are three basic functions performed in time series analysis:
• In one case, distance measures are used to determine the similarity between different time series.
• In the second case, the structure of the line is examined to determine (and perhaps classify) its behavior.
• A third application would be to use the historical time series plot to predict future values.

Prediction
• Many real-world data mining applications can be seen as predicting future data states based on past and current data.
• Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the predictive model, although the prediction task is a type of predictive model.)
• The difference is that prediction is predicting a future state rather than a current state.
• Here we are referring to a type of application rather than to a type of data mining modeling approach, as discussed earlier.
• Prediction applications include flooding, speech recognition, machine learning, and pattern recognition.
• Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well.

Descriptive model
• A descriptive model identifies patterns or relationships in data.
• Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties.
• Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.

Descriptive/unsupervised techniques
• Clustering
• Summarization
• Association Rules
• Sequence Pattern Discovery

Clustering
• Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone.
• Clustering is alternatively referred to as unsupervised learning or segmentation.
• It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint.
• The clustering is usually accomplished by determining the similarity among the data on predefined attributes.
• The most similar data are grouped into clusters.
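Grouping the most similar data without predefined classes can be sketched with a plain k-means loop. This is a minimal sketch under stated assumptions: k-means is just one common clustering method and is not named in the source, and the customer records (age, income in thousands), the function name `kmeans`, and the naive choice of the first k points as starting centroids are all illustrative.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters derived from the data alone."""
    centroids = list(points[:k])  # naive, deterministic starting centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its most similar (closest) centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical customer records: (age, income in $1000s)
CUSTOMERS = [(25, 30), (62, 85), (27, 35), (60, 90), (24, 28), (65, 80)]

clusters = kmeans(CUSTOMERS, k=2)
for number, members in enumerate(clusters, start=1):
    print(f"cluster {number}: {sorted(members)}")
```

The algorithm returns groups but no labels for them; deciding what each group means is left to whoever reads the result.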
Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters. Example 1.6 provides a simple clustering example.

EXAMPLE 1.6
• A certain national department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, and physical characteristics of potential customers (age, height, weight, etc.).
• To determine the target mailings of the various catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering of potential customers based on the determined attribute values.
• The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.

Summarization
• Summarization maps data into subsets with associated simple descriptions.
• Summarization is also called characterization or generalization.
• It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data. Alternatively, summary type information (such as the mean of some numeric attribute) can be derived from the data.
• The summarization succinctly characterizes the contents of the database. Example 1.7 illustrates this process.

EXAMPLE 1.7
• One of the many criteria used to compare universities by the U.S. News & World Report is the average SAT or ACT score [GM99]. This is a summarization used to estimate the type and intellectual level of the student body.

Association Rules
• Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data.
• The best example of this type of application is to determine association rules.
• An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together.
• Example 1.8 illustrates the use of association rules in market basket analysis. Here the data analyzed consist of information about what items a customer purchases.
• Associations are also used in many other applications such as predicting the failure of telecommunication switches.
• Users of association rules must be cautioned that these are not causal relationships.
• They do not represent any relationship inherent in the actual data (as is true with functional dependencies) or in the real world.
• There probably is no relationship between bread and pretzels that causes them to be purchased together. And there is no guarantee that this association will apply in the future.
• However, association rules can be used to assist retail store management in effective advertising, marketing, and inventory control.

EXAMPLE 1.8
• A grocery store retailer is trying to decide whether to put bread on sale.
• To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread.
• He finds that 60% of the time that bread is sold, so are pretzels, and that 70% of the time jelly is also sold.
• Based on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed.
• In addition, he decides not to place either of these items on sale at the same time.

Sequence Discovery
• Sequential analysis or sequence discovery is used to determine sequential patterns in data.
• These patterns are based on a time sequence of actions.
• These patterns are similar to associations in that data (or events) are found to be related, but the relationship is based on time.
• Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order.
• Example 1.9 illustrates the discovery of some simple patterns.
• A similar type of discovery can be seen in the sequence within which items are purchased. For example, most people who purchase CD players may be found to purchase CDs within one week.
• As we will see, temporal association rules really fall into this category.

EXAMPLE 1.9
• The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine how users of XYZ's Web pages access them.
• He is interested in determining what sequences of pages are frequently accessed.
• He determines that 70 percent of the users of page A follow one of the following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C).
• He then decides to add a link directly from page A to page C.

DATA MINING ISSUES
1. Human interaction
2. Overfitting
3. Outliers
4. Interpretation of results
5. Visualization of results
6. Large datasets
7. High dimensionality
8. Multimedia data
9. Missing data
10. Irrelevant data
11. Noisy data
12. Changing data

KDD PROCESS
Knowledge Discovery in Databases (KDD)
• Knowledge Discovery in Databases refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases. The term data mining is often used as a synonym for KDD.
• Strictly speaking, data mining is also one step in the KDD process.

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured value.
• Cleaning with data discrepancy detection and data transformation tools.

2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.

3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.

4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
• Data mapping: Assigning elements from the source base to the destination to capture transformations.
• Code generation: Creation of the actual transformation program.

5. Data Mining: Data mining is defined as the application of intelligent techniques to extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.

6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable to the user.

7. Knowledge Representation: Knowledge representation is defined as a technique that uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
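The seven KDD steps can be strung together in a small end-to-end sketch, using association rules over market baskets as the mining step. Everything here is an assumption made for illustration: the two data sources, the function names, the choice of support and confidence as interestingness measures, and the 0.7 confidence threshold. It shows the order of the steps, not how any particular tool implements them.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of market-basket data; None marks a missing record.
STORE_A = [["bread", "jelly"], ["bread", "pretzels", "jelly"], None]
STORE_B = [["bread", "pretzels"], ["milk"], ["bread", "jelly", "milk"]]


def clean(transactions):
    """1. Data cleaning: drop missing or empty records."""
    return [t for t in transactions if t]


def integrate(*sources):
    """2. Data integration: combine all sources into one collection."""
    return [t for source in sources for t in source]


def select(transactions, relevant):
    """3. Data selection: keep only the items relevant to the analysis."""
    return [[item for item in t if item in relevant] for t in transactions]


def transform(transactions):
    """4. Data transformation: item sets, dropping baskets left empty by selection."""
    return [frozenset(t) for t in transactions if t]


def mine(baskets):
    """5. Data mining: single-item rules lhs -> rhs with support and confidence."""
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(p for b in baskets for p in combinations(sorted(b), 2))
    rules = []
    for (a, b), together in pair_counts.items():
        for lhs, rhs in ((a, b), (b, a)):
            rules.append({
                "lhs": lhs,
                "rhs": rhs,
                "support": together / len(baskets),
                "confidence": together / item_counts[lhs],
            })
    return rules


def evaluate(rules, min_confidence):
    """6. Pattern evaluation: keep rules that meet the interestingness threshold."""
    return [r for r in rules if r["confidence"] >= min_confidence]


def represent(rules):
    """7. Knowledge representation: render the surviving rules as a report."""
    ordered = sorted(rules, key=lambda r: (-r["confidence"], r["lhs"], r["rhs"]))
    return [
        f"{r['lhs']} -> {r['rhs']} "
        f"(support {r['support']:.2f}, confidence {r['confidence']:.2f})"
        for r in ordered
    ]


data = integrate(clean(STORE_A), clean(STORE_B))
baskets = transform(select(data, relevant={"bread", "jelly", "pretzels"}))
report = represent(evaluate(mine(baskets), min_confidence=0.7))
for line in report:
    print(line)
```

With this toy data, a rule such as bread -> jelly has confidence 0.75 because jelly appears in three of the four baskets that contain bread. As with any association rule, this describes co-occurrence in the data and does not imply that one purchase causes the other.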