Data Mining Tasks
• Data mining tasks fall into two broad categories, depending on what the task tries to achieve: descriptive tasks and predictive tasks. Descriptive tasks characterize the general properties of the data, whereas predictive tasks perform inference on the available data set to predict how a new data set will behave.
Different Data Mining Tasks
• Predictive data mining tasks build a model from the available data set that helps predict unknown or future values of another data set of interest. A medical practitioner diagnosing a disease from a patient's medical test results is an example of a predictive data mining task.
• Descriptive data mining tasks find patterns that describe the data and derive new, significant information from the available data set. A retailer identifying products that are purchased together is an example of a descriptive data mining task.
Classification
Classification derives a model that determines the class of an object based on its attributes. A collection of records is available, each with a set of attributes. One of the attributes is the class attribute, and the goal of classification is to assign a class to new records as accurately as possible.
Classification can be used in direct marketing to reduce costs by targeting the customers who are likely to buy a new product. The available data shows which customers purchased similar products in the past and which did not, so the {purchase, don’t purchase} decision forms the class attribute. Once the class attribute is assigned, demographic and lifestyle information about customers who purchased similar products can be collected, and promotional mail can be sent to them directly.
Classification has two types of variables:
A. Explanatory variables, which define the essential properties of the data
B. Target variables, whose values are to be determined
Classification is used to predict the value of a discrete target variable.
Prediction
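Classification, as described above, is itself a predictive task: a model built from labeled records assigns a class to new records. The following is a minimal sketch of that idea for the {purchase, don’t purchase} example, not a definitive implementation. The customer records, the two attributes (age and income) and the k-nearest-neighbour rule are all assumptions made for illustration; the slides do not name a specific algorithm.

```python
import math
from collections import Counter

# Hypothetical training records: (age, income in thousands) -> class attribute.
TRAINING = [
    ((25, 30), "don't purchase"),
    ((30, 42), "don't purchase"),
    ((23, 25), "don't purchase"),
    ((35, 60), "purchase"),
    ((45, 80), "purchase"),
    ((50, 95), "purchase"),
]

def classify(record, training=TRAINING, k=3):
    """Assign the majority class among the k nearest training records."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# New customers whose class attribute is unknown.
new_customers = [(48, 85), (24, 28)]
predictions = [classify(customer) for customer in new_customers]
print(predictions)
```

In a direct-marketing setting, only the customers predicted as "purchase" would receive the promotion. In practice the attributes would usually be scaled first so that no single attribute dominates the distance.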
• The prediction task predicts the possible values of missing or future data. Prediction involves developing a model from the available data and using that model to predict future values of a new data set of interest. For example, a model can predict an employee's income based on education, experience and other demographic factors such as place of residence and gender. Prediction analysis is also used in areas such as medical diagnosis and fraud detection.
Time-Series Analysis
• A time series is a sequence of events in which the next event is determined by one or more of the preceding events. A time series reflects the process being measured, and certain components affect the behavior of that process. Time-series analysis includes methods for analyzing time-series data to extract useful patterns, trends, rules and statistics. Stock market prediction is an important application of time-series analysis.
Outlier Analysis in Data Mining
What are Outliers?
• Outliers are an integral part of data analysis. An outlier can be defined as an observation point that lies at a distance from the other observations.
• An outlier is important because it may indicate an error in the experiment. Outlier analysis is used extensively in areas such as detecting fraud and spotting potential new trends in the market.
• Outliers are often confused with noise. However, outliers differ from noisy data in the following sense:
• Noise is a random error, whereas an outlier is an observation point situated away from the other observations.
• Noise should be removed for better outlier detection.
Various causes of outliers in Data Mining
• Outlier analysis is used to identify fraud in the banking sector, such as credit card hacking and similar schemes.
• It is used to observe changes in a customer's buying patterns.
• It is used to identify typing errors and reporting errors made by humans.
• It is used to discover errors or faults in machines or systems.
What is the need for handling outliers in Data Mining?
• Outliers can distort the results of analyses run on a database.
• Outliers often lead to useful results and conclusions, from which trends or patterns can be recorded.
• Outliers can also be valuable in research, where they can be extremely useful in making discoveries.
• Outlier analysis is a key branch of data mining.
Applications of Outlier Detection in Data Mining
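One simple way to see an outlier as a point lying far from the other observations is a z-score check, which is one instance of a statistical approach. The sketch below is illustrative only: the transaction amounts are made up, the function name is hypothetical, and the threshold of 2 standard deviations is an assumed convention rather than something stated in the slides.

```python
import statistics

# Hypothetical card transaction amounts; the last one is unusually large.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 50, 250]

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

outliers = find_outliers(amounts)
print(outliers)  # prints [250]
```

A flagged value is only a candidate: it may be fraud, a typing error or a genuine but rare event, so it usually needs a follow-up check before any action is taken.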
• Outlier detection is used extensively in data mining to obtain patterns or trends. Its applications include:
• Fraud Detection
• Telecom Fraud Detection
• Intrusion Detection in Cyber Security
• Medical Analysis
• Environment Monitoring, such as cyclones, tsunamis, floods and droughts
• Noticing unforeseen entries in databases
Different approaches in Outlier Detection
• There are three major approaches to outlier detection:
• The Statistical Approach
• The Distance-Based Approach
• The Deviation-Based Approach
Regression in data mining
• Regression is a data mining technique used to predict numeric values in a given data set. For example, regression might be used to predict the cost of a product or service, or other variables. It is also used across industries for analyzing business and marketing behavior, trend analysis and financial forecasting.
Application of Regression
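To illustrate regression predicting a numeric value, here is a small least-squares sketch with a single predictor. The data (units ordered against service cost) is invented and deliberately follows an exact straight line, and the function name is hypothetical. Real data would be noisy and would normally be fitted with a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: predictor (units) and response (cost).
units = [1, 2, 3, 4, 5]
cost = [3, 5, 7, 9, 11]

slope, intercept = fit_line(units, cost)
predicted_cost = slope * 6 + intercept  # predict the cost of 6 units
print(slope, intercept, predicted_cost)
```

Here `units` plays the role of the predictor variable and `cost` the response variable. The fitted line is then used to estimate the response for an input that was not in the training data.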
• Regression is a very popular technique with wide applications in business and industry. The regression procedure involves a predictor variable and a response variable. The major applications of regression are given below.
• Environmental modeling
• Analyzing business and marketing behavior
• Financial prediction or forecasting
• Analyzing new trends and patterns
Difference between Regression and Classification in data mining
• Regression and classification are quite similar. Both are significant prediction problems in data mining. Given a training set of inputs and outputs, you learn a function that relates the two, which hopefully enables you to predict outputs for inputs in new data. The main difference is that in classification the outputs are discrete, whereas in regression they are continuous. The boundary is blurred in cases such as "logistic regression", which can be interpreted as either a classification or a regression method, so it can be difficult to decide when to use classification and when to use regression.
Association
• Association discovers the connections among a set of items and identifies relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing and so on. A retailer can identify the products that customers normally purchase together, or find the customers who respond to promotions of the same kind of product. If a retailer finds that beer and nappies are mostly bought together, the retailer can put nappies on sale to promote the sale of beer.
Clustering
• Clustering is used to identify data objects that are similar to one another. Similarity can be decided based on factors such as purchase behavior, responsiveness to certain actions, geographical location and so on. For example, an insurance company can cluster its customers based on age, residence, income etc. This group information helps the company understand its customers better and therefore provide better customized services.
Summarization
• Summarization is the generalization of data. A set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. Such high-level summarized information can be useful to sales or customer relationship teams for detailed analysis of customer and purchase behavior. Data can be summarized at different levels of abstraction and from different angles.
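The shopping example above can be sketched as a per-customer aggregation. This is a minimal illustration under stated assumptions: the purchase records, field names and the `summarize` function are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical purchase records, one per shopping trip.
purchases = [
    {"customer": "C1", "products": 3, "amount": 45.0, "offer_used": True},
    {"customer": "C2", "products": 1, "amount": 10.0, "offer_used": True},
    {"customer": "C1", "products": 2, "amount": 20.5, "offer_used": False},
    {"customer": "C2", "products": 4, "amount": 80.0, "offer_used": True},
]

def summarize(records):
    """Collapse individual purchases into one aggregated row per customer."""
    summary = defaultdict(
        lambda: {"total_products": 0, "total_spending": 0.0, "offers_used": 0}
    )
    for record in records:
        row = summary[record["customer"]]
        row["total_products"] += record["products"]
        row["total_spending"] += record["amount"]
        row["offers_used"] += int(record["offer_used"])
    return dict(summary)

summary = summarize(purchases)
print(summary["C1"])
```

Grouping by customer is only one angle. The same records could be grouped by month or by product category to summarize the data at a different level of abstraction.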