Study Guide For Test 4

This document provides an overview and study guide for a test on data mining concepts from chapters 4 and 5. It outlines factors driving the popularity of data mining, examples of applications, characteristics and categories of data mining algorithms. The CRISP-DM process for conducting data mining projects is described. Key concepts for linear regression, classification, clustering and association analysis are defined. Text mining is introduced as extracting patterns from unstructured data sources.
Copyright
© Attribution Non-Commercial (BY-NC)

ISDS 2001 STUDY GUIDE for TEST 4 (Chapters 4 and 5)
* indicates questions on the sample test
Bring your calculator to Test 4

Objectives: After completing Chapter 4, you should know:

FACTORS BEHIND SUDDEN POPULARITY IN DATA MINING:
- More intense competition at the global scale, driven by customers' ever-changing needs/wants in an increasingly saturated marketplace
- General recognition of the untapped value hidden in large data sources
- Consolidation & integration of database records that enables a single view of customers, vendors, transactions
- Consolidation of databases and other data repositories into a single location in the form of a data warehouse (DW)
- Exponential increase in data processing and storage technologies
- Significant reduction in the cost of hardware/software for data storage and processing
- Movement toward de-massification (conversion of info resources into nonphysical form) of business practices

EXAMPLES OF APPLICATIONS OF DATA MINING:
- Customer relationship management (CRM)
- Government and defense
- Banking
- Travel industry (airlines, hotels/resorts, rental car management)
- Retail and logistics
- Health care/medicine
- Manufacturing & production
- Entertainment industry
- Brokerage and securities trading
- Homeland security and law enforcement
- Insurance
- Sports
- Computer hardware and software

Data mining: discovering or mining knowledge from large amounts of data; the process of identifying valid/useful patterns in data stored in structured databases

CHARACTERISTICS OF DATA MINING:
- Data are cleansed and consolidated into a DW
- Usu. a client/server architecture or a web-based information systems architecture
- Massaging and synchronizing data to get the right results; exploring the usefulness of soft data
- Miner is usu. an end user empowered by data drills to ask ad hoc questions and obtain answers quickly
- Often finds unexpected results and requires creative thinking
- Readily combines w/ spreadsheets and other software development tools; mined data can be analyzed & deployed quickly and easily
- Uses parallel processing for data mining

Data mining finds patterns & defines them in terms of mathematical rules used for prediction or association.

4 BROAD CATEGORIES FOR DATA MINING ALGORITHMS:
a. Prediction: tells the nature of future occurrences of events based on the past. EX. Forecasting temperature
b. Clusters: identify groups based on known characteristics. EX. Group customers based on buying habits
c. Associations: find commonly co-occurring groups of things. EX. Beer & diapers in market basket analysis
d. Sequence relationships

Data mining procedures include data visualization and time series forecasting.

Classification procedures are the most common of all data mining approaches. Classification identifies patterns in data and associates those patterns with observations belonging to a certain category. EX. credit approval, store location, target marketing, fraud detection, telecommunications, route or segmentation, and any other decision-making systems.

THE BASIC IDEA: Define the data, use the data to develop a mathematical model, then use that model to predict unknown outcomes for future observations.

YOU WOULD USE:
A. Decision tree for classification if the outcome is categorical & predictors are either categorical or numeric
B. Linear discriminant analysis if the outcome is categorical & predictors are all numeric and have normal distributions and equal variances
C. Linear regression if the outcome is continuous numeric & predictors are all numeric and have normal distributions and equal variances

Organizations must use a standardized approach for conducting a data mining project; be able to identify some proposed models (CRISP-DM, DMAIC, SEMMA).
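The "YOU WOULD USE" rules above can be written as a small lookup, which may help when drilling them. This is only a study sketch: the function name `suggest_method` and its three yes/no arguments are made up for illustration, and preferring linear discriminant analysis over a decision tree when both would apply is an assumption of the sketch, not a rule from the guide.

```python
def suggest_method(outcome_is_categorical, predictors_all_numeric, normal_equal_variance):
    """Return the technique the guide pairs with a data situation (hypothetical helper)."""
    # Numeric predictors with normal distributions and equal variances
    meets_assumptions = predictors_all_numeric and normal_equal_variance
    if outcome_is_categorical:
        if meets_assumptions:
            # Assumption of this sketch: prefer LDA when its conditions hold
            return "linear discriminant analysis"
        # Decision trees accept categorical or numeric predictors
        return "decision tree"
    if meets_assumptions:
        return "linear regression"
    # The guide gives no rule for a continuous outcome with other predictors
    return None


print(suggest_method(True, False, False))
print(suggest_method(True, True, True))
print(suggest_method(False, True, True))
```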

6 steps of the CRISP-DM model:
- Business Understanding: know what the study is for; thorough understanding of the managerial need for new knowledge. EX. What are the common characteristics of the customers we have lost to our competitors recently?
- Data Understanding: identify relevant data from many available databases. EX. Where is the relevant data stored and in what form? What is the process of collecting the data? Who collects it? How often is it updated?
- Data Preparation: aka data preprocessing; take the identified data and prepare them for analysis by data mining methods. EX. Consolidation, cleaning, transformation, and reduction
- Modeling: apply modeling techniques to the prepared data; assessment/comparative analysis of various models. EX. Decision trees, k-means, Apriori algorithm
- Evaluation: assess models for their accuracy and generality; decide if the model meets the business objective and, if so, to what extent. EX. Do more models need to be built? Use pivot tables, pie charts, etc. to interpret patterns
- Deployment: organize/present the results in a way that the end user can understand and benefit from. EX. maintenance activities, generating a report; implementing a repeatable data mining process enterprise-wide

DMAIC = Define, Measure, Analyze, Improve, Control
SEMMA = Sample, Explore, Modify, Model, Assess

- HOW TO SELECT, IN LINEAR REGRESSION, THE BEST POSSIBLE MODEL USING P-VALUES AND ADJUSTED R²
- ALSO KNOW HOW TO USE/DEPLOY A REGRESSION EQUATION FOR PREDICTION
- HOW TO CALCULATE LINEAR CLASSIFICATION FUNCTIONS (LCF0 AND LCF1), GIVEN THE CLASSIFICATION COEFFICIENTS IN A PRINTOUT
- HOW TO USE LCFS TO MAKE A DECISION
- HOW TO CALCULATE THE PROPORTION OF CORRECT CLASSIFICATIONS
- HOW TO READ A CLASSIFICATION MATRIX
- HOW TO INTERPRET LINE LISTING OF OBSERVATION INFORMATION (INCLUDING PROBABILITY OF SUCCESS)
- HOW TO DETERMINE A DECISION TREE PREDICTION ALGORITHM USING A COORDINATE PLOT OF DATA
- HOW TO APPLY A DECISION TREE PREDICTION ALGORITHM FOR PREDICTING OUTCOMES
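Several of the calculations listed above can be practiced with a short Python sketch. It assumes the standard adjusted R² formula, assumes an LCF printout gives one constant plus one coefficient per predictor for each group, and assumes the usual decision rule of assigning the observation to the group with the larger LCF score. All coefficients and counts below are invented for illustration; they do not come from the course printouts.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def lcf_score(constant, coefficients, x):
    """Linear classification function: constant + sum of coefficient * value."""
    return constant + sum(c * v for c, v in zip(coefficients, x))


def classify(lcf0, lcf1, x):
    """Assign x to group 1 if LCF1 scores higher than LCF0, otherwise to group 0.

    lcf0 and lcf1 are (constant, coefficients) pairs as read from a printout.
    """
    return 1 if lcf_score(*lcf1, x) > lcf_score(*lcf0, x) else 0


def proportion_correct(matrix):
    """matrix[i][j] = number of observations actually in group i, predicted as j."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical classification coefficients for two predictors
lcf0 = (-10.0, [0.5, 0.2])
lcf1 = (-15.0, [0.7, 0.3])
print(classify(lcf0, lcf1, [30, 10]))   # LCF1 is larger here, so group 1
print(classify(lcf0, lcf1, [10, 10]))   # LCF0 is larger here, so group 0

# Hypothetical classification matrix: rows = actual group, columns = predicted group
print(proportion_correct([[40, 10], [5, 45]]))  # (40 + 45) / 100
print(round(adjusted_r2(0.80, n=21, k=2), 4))
```

When comparing candidate regression models, the one with the higher adjusted R² (and predictors with small p-values) is usually preferred, since adjusted R² penalizes adding predictors that do not help.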
Cluster analysis places observations (rows, customers, students, etc.) into groups such that the members of a group share similar characteristics but the groups themselves are highly different. Cluster analysis differs from classification analysis b/c in cluster analysis the groups are unknown and are created by the analysis, whereas in classification analysis the groups are distinct and known in advance. EX. Questionnaires/surveys, Harry Potter's Sorting Hat choosing houses for students, seating at a wedding

A common application of cluster analysis is market segmentation (also know some of the other examples of cluster analysis referred to in the notes). Market segmentation is used to understand the buying behavior of customers. It helps retailers target similar customer groups when defining an appropriate advertising campaign.

EXAMPLES OF MARKET SEGMENTATION: Questionnaire to airline riders based on flying behavior, demographics (gender, age, income, housing type, and education), Marriott Hotels

Association analysis is aimed at establishing relationships between items (variables, columns).
GOAL OF ASSOCIATION: to group variables that are similar
A common application of association analysis is market basket analysis.

EXAMPLES OF MARKET BASKET ANALYSIS:
- Customers who bought diapers often bought beer
- Customers who buy cold medicine frequently also buy tissue
- Customers go to the store just for milk, so milk is placed in the back of the store
- Cross-promotional programs (during Thanksgiving, Walmart displays w/ cornbread mix, canned sweet potatoes, brown sugar, pecans, pie shells, flour)

Text Mining: the semiautomatic process of extracting patterns from large amounts of unstructured data sources; PURPOSE of text mining is to collect all documents related to a domain of interest for analysis

SOME OF THE MOST POPULAR TEXT MINING ANALYSES (P. 193) ARE:
A. Summarization
B. Categorization (classification)
C. Clustering
D. Concept linking (association)

COMMON APPLICATIONS OF TEXT MINING:
- Information extraction
- Categorization
- Question answering
- Topic tracking
- Clustering
- Summarization
- Concept linking
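The market basket examples earlier in this section (such as diapers and beer) can be made concrete with the two measures commonly used in association analysis, support and confidence. These measures are not defined in this guide, so treat the sketch below as supplementary; the baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, one set of items per customer visit
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "tissue"},
    {"milk", "tissue"},
    {"beer", "milk"},
]

print(support(baskets, {"diapers", "beer"}))       # 2 of 5 baskets
print(confidence(baskets, {"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

A rule such as "diapers, then beer" is interesting when both numbers are high: the pair appears often enough to matter, and buying the first item makes the second likely.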

Term extraction = the most basic form of text mining; it is used for summarization (the 1st of the 4 popular types of text mining analyses listed above)
The Term-Document Matrix is used for categorization/classification, clustering, & concept linking (association)

Text mining maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept.
Feature vector (or term vector): a weighted list of words that defines a concept describing unstructured information (a document of words)

HOW A FEATURE VECTOR IS CREATED:
- Eliminate articles and other common words (the, and, other, ...); these are called stop-words
- Replace words with their stems/roots. Eliminate plurals & conjugations (phones, phoned -> phone)
- Consider synonyms & phrases. Distinguish words w/ diff meanings (Microsoft Windows vs. house windows)
- Calculate the weights of the remaining terms based upon the frequency w/ which each word appears

One common weighting measure is the TF Factor: the number of times a word is in a document. Calculated by: eliminate commonly used words, count how many words are left, take the ratio of the words over the total # of words to get the TF factor

Term-Document Matrix: ROWS represent documents and COLUMNS represent the terms (excluding stop terms); frequencies represent the # of times a term appears in a particular document; used to analyze a collection of documents related to a domain of interest. It is used for conducting classification analysis/categorization, cluster analysis, and association analysis/concept linking.

TEXT MINING PROCESS CAN BE DEFINED IN 3 CONSECUTIVE TASKS:
A. Establish the corpus (body): collect & organize the domain-specific unstructured data; collect all documents related to a domain of interest for analysis; once the documents are collected, they should be converted to a similar format
B. Create the Term-Document Matrix: introduce structure to the corpus
C. Extract the knowledge: discover novel patterns from the T-D Matrix; knowledge can be extracted using classification, clustering, association, and/or trend analysis

SOME TEXT MINING APPLICATIONS MENTIONED IN CLASS FROM SECTION 5.3:
- Marketing Apps
- Biomedical Apps
- Security Apps
- Academic Apps

The web (internet) is the biggest data/text repository.
Web Mining: discovering relationships from web data

EXAMPLES OF INFORMATION FOUND ON THE WEB:
- whose home page is linked to which other pages
- how many people have hyperlinks to other websites on their own website
- how a particular site is organized
- tracking of visitors to a website, each search on a search engine, each click on a link, a transaction on an ecommerce site

3 DIFFERENT AREAS OF WEB MINING:
Web Content Mining: extracts & uses the content found w/in web pages; it is similar in concept to text mining; the source is the unstructured text of the page, usu. in HTML format
Web Structure Mining: extracts useful info from the links embedded in Web docs; links w/in a document indicate depth of coverage (popularity)
o Hubs: pages that point to many authorities in the field. EX. could be a list of recommended links
o Authority pages: those that are linked to by many hubs
Web Usage Mining: extracts & uses information generated thru web page visits, traffic, transactions, etc.; finds out what people are looking for on the internet
o Clickstream data: provides a trail of users' activities & shows the users' browsing patterns
o EXAMPLES:
- Sites like Amazon present the user w/ a choice of products based on previous purchases & use a recommender system (association analysis) to recommend products based on similar users
- If 70 percent of software downloads from your site occur b/w 7 & 11pm, you could plan for better technical support and bandwidth during that time
- Analysis may show that 60 percent of visitors who search for hotels in Maui had also searched for airfares to Maui
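The feature vector and Term-Document Matrix steps described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, stemming and synonym handling are skipped, and the TF factor is read as a term's count divided by the number of words left after stop-word removal, which is one interpretation of the guide's wording.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stop-words (no stemming here)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def term_document_matrix(docs):
    """Rows are documents, columns are terms, cells are raw term counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


def tf_factor(text, term):
    """Count of `term` divided by the number of words left after stop-word removal."""
    tokens = tokenize(text)
    return tokens.count(term) / len(tokens)


docs = ["The phone rings.", "Phones and the phone store."]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
print(tf_factor(docs[0], "phone"))
```

Because stemming is skipped, "phone" and "phones" end up in separate columns; applying the stem/root step from the list above would merge them into one term.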

CH4 Data Mining Cases:
- Opening Vignette: Data Mining Goes to Hollywood
- 4.1 1-800-Flowers
- 4.2 Police fight crimes with data mining
- 4.3 Motor vehicle accidents
- 4.4 A Mine on Terrorist Funding
- 4.5 Data Mining in Cancer Research
- 4.6 Highmark, Inc., Employs Data Mining to Manage Insurance Costs
- 4.7 Coors Improves Beer Flavors with Neural Networks
- 4.8 Predicting Customer Churn: A Competition of Different Tools
- End of Chapter Application Case: Data Mining Helps Develop Custom-Tailored Product Portfolios for Telecommunication Companies

CH5 Text and Web Mining Cases:
- Opening Vignette: Mining Text for Security and Counterterrorism
- 5.1 Text Mining for Patent Analysis
- 5.3 Mining for Lies: Deception Detection
- End of Chapter Application Case: HP and Text Mining

Chapter 4 Opening Vignette: Data Mining Goes to Hollywood
It is interesting and challenging to predict box-office receipts for a given movie. Sharda and Delen (2007) used classification (as opposed to regression analysis) to predict one of nine categories, from flop to blockbuster. Data were collected on 2,632 movies released from 1998 through 2006, including box-office receipts for defining the 9 categories and independent variables such as MPAA rating, competition, actors (star value), genre, etc. The predictive models are effective in minimizing investments in the early stages of movie production. The models also aided in making the right decisions during the making of a motion picture in order to manage large amounts of money and get the highest ROI, helped to determine how much to invest in the production of a motion picture, and helped to evaluate tradeoffs to maximize the success of movie production.

Business Analytics and Data Mining Help 1-800-Flowers (App Case 4.1)
1-800-Flowers has become the leader in direct-order e-commerce after opening its own website more than 14 years ago. However, to stay competitive, they needed to make decisions in real time to increase retention, reduce costs, and maintain repeat customers. Believing in close customer relationships, they wanted to analyze every piece of information, so they decided to use SAS data mining tools. See the four specific benefits listed on page 137.

Police Department Fights Crime with Data Mining (App Case 4.2)
The fight against crime is complicated by aging cases, shrinking resources, and few leads. Many cases are filed away until new leads are found. So the challenge for the police department in the United Kingdom was to determine a way to quickly and easily find patterns and trends in unsolved criminal cases. Each case serves as an observation (row) in a database, and physical descriptions of the crimes and the MO are saved as the variables (columns). These data are analyzed using cluster analysis in order to detect which cases are similar based upon the descriptions. In particular, the SPSS software has a PASW modeler which uses what are referred to as Kohonen networks to cluster similar cases. If a perpetrator is known for a case, and that case is found to be similar to a case where the perpetrator is unknown, it is possible that the known perpetrator committed the crime in the unsolved case (therefore the unsolved case is reopened).

Motor Vehicle Accidents and Driver Distractions (App Case 4.3)
A study published by NHTSA concluded that about 25-30 percent of injuries caused by car crashes are due to driver distractions. In 1999, according to the FARS system developed by the National Center for Statistics and Analysis, 11 percent of fatal crashes were due to driver distractions. Three data mining techniques (Decision Tree, Neural Networks, and Kohonen Networks for Cluster Analysis) using SPSS software were applied to crash information from FARS to determine what factors (in addition to driver distraction) were associated with the occurrence of a crash and its severity. They found that driver inattention was the leading cause of crashes, followed by the point of impact (rear, head-on, and angled collisions).

A Mine on Terrorist Funding (App Case 4.4)
The USA Patriot Act and the creation of the Dept. of Homeland Security have brought to light the potential application of technology and data mining to detect money laundering and other forms of terrorist funding. Law enforcement agencies are now concentrating on international trade as a means of moving money silently in and out of countries without attracting attention. These kinds of transfers can go undetected by overvaluing imports and undervaluing exports. Such activity results in crimes related to customs fraud, income tax evasion, and money laundering, and may also indicate that a foreign exporter is a member of a terrorist organization. Data mining on import and export data can be used to detect [see last 2 paragraphs of this case on page 148].

Data Mining in Cancer Research (App Case 4.5)
Cancer is believed to be caused by both external and internal factors. The improvement in survival rate reflects progress in both diagnosis and treatment; data-driven research has been applied successfully in identifying novel patterns. Delen (2009) used three data mining techniques in conjunction with logistic regression (all of these are classification analyses) to predict the survivability of prostate cancer. In 2006, Delen did a similar comparison in predicting the survivability of breast cancer. These examples show how advanced data mining techniques can be used to develop models that possess a high degree of predictive and explanatory power. These results complement (do not replace) medical professionals and researchers in saving more human lives.

Highmark, Inc., Employs Data Mining to Manage Insurance Costs (App Case 4.6)
Highmark Inc. was formed from the merger of Pennsylvania Blue Shield and a Blue Cross plan in western Pennsylvania. Managed-care organizations have been hesitant to use data mining applications because of the cost and complexity. Vast amounts of data, first viewed as a menace that took up storage, are now being used for knowledge. Highmark has used data mining to cluster patients that are more costly to treat on average. Initially they were able to detect that those with diabetes or coronary heart disease are the most expensive to treat, but with data mining they have been able to relate different diseases to patients' profiles. Market pressures are driving managed-care organizations to become more efficient; they have used data mining applications to maximize revenue from Medicare, to analyze patient information, and to detect and prevent fraudulent insurance claims.

Coors Improves Beer Flavors with Neural Networks (App Case 4.7)
Coors has 20 percent of the market, years of experience, and the best people in the business. They face the problem that customers today have many varieties of beer from which to choose. A drinker's choice depends upon several factors, but Coors' goal is to make sure the customer chooses Coors no matter what. An important issue is flavor, which is usually determined through panel tests; however, because panel tests take time, Coors wants to understand flavor solely from chemical composition. Coors used neural networks to link chemical composition (alcohol, color, bitterness, ethyl acetate, ...) to sensory analysis (alcohol, estery, malty, grainy, burnt, etc.), all in an attempt to improve the flavor of beer.

Predicting Customer Churn: A Competition of Different Tools (App Case 4.8)
In 2003, Duke University and the NCR Teradata Center sought to identify the best predictive modeling technique to help manage customer churn in the wireless telecommunications industry. In the 1990s, when new subscriber rates were in the 50% range, there was more focus on new customer acquisition than on customer retention. However, in a new era of slower growth rates, it is clear that customer retention is vital to profitability. The key to customer retention is to predict which customers are most likely to defect to a competitor and offer them incentives to stay. To execute that strategy effectively, one must be able to develop highly accurate predictions (churn scorecards) so that the retention effort is focused on the relevant customers. To build the prediction equations (to predict Yes/No to churning), the models used these predictors: product details, such as handset price and capabilities; customer financials, such as credit score and credit card ownership; customer demographics; and phone usage, such as the number and duration of various categories of calls.

Chapter 5 Opening Vignette: Mining Text for Security and Counterterrorism
In order to provide for national security, there is an emerging need to analyze news from several sources (print, radio, video, emails, phone conversations, all of which can be converted to text). The Genoa project, part of the Defense Advanced Research Projects Agency's (DARPA) total information awareness program, seeks to provide the analytical tools and techniques to rapidly analyze information on current situations to support better decision making, and to provide knowledge discovery tools to better mine relevant information sources for pattern discovery. As a result, analysts have reduced response time, increased the probability of taking the best action, and can quickly detect patterns in the form of actionable information. The analysts applied a summarization filter (text miner) that identified and aggregated descriptions of people from a collection of documents using some simple natural language processing techniques, an efficient syntactic analysis, and a thesaurus.

You must summarize End of Chapter Application Cases from Chapters 4 & 5, Case 5.1, and Case 5.3
