Data Mining
Content
Data mining Introduction
KDD
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
– Querying or searching
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
– Finding trends and patterns
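The contrast between looking something up and mining for a pattern can be sketched in a few lines. The records below are made-up toy data, and `lookup_number` and `prefix_share_by_city` are hypothetical helper names chosen for this illustration. The only point is that a lookup returns a stored fact for a known key, while mining aggregates over many records to surface a regularity nobody stored explicitly.

```python
from collections import Counter

# Toy records, invented for illustration: (surname, city, phone number).
RECORDS = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rurke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Garcia", "Phoenix", "555-0105"),
    ("Smith", "Phoenix", "555-0106"),
    ("Nguyen", "Phoenix", "555-0107"),
    ("O'Brien", "Phoenix", "555-0108"),
]


def lookup_number(surname, city):
    """Not data mining: retrieve a stored value for a known key."""
    for name, place, phone in RECORDS:
        if name == surname and place == city:
            return phone
    return None


def prefix_share_by_city(prefix):
    """Data mining in miniature: per city, the share of surnames starting with prefix."""
    totals = Counter()
    matches = Counter()
    for name, place, _phone in RECORDS:
        totals[place] += 1
        if name.startswith(prefix):
            matches[place] += 1
    return {city: matches[city] / totals[city] for city in totals}


print(lookup_number("Smith", "Boston"))   # a directory lookup
print(prefix_share_by_city("O'"))         # a pattern across the whole data set
```

In this toy data the "O'" prefix covers three of four Boston surnames but only one of four Phoenix surnames, which is the kind of regional pattern the slide describes.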
Data Mining: Classification Schemes
Decisions in data mining
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Data mining tasks
– Descriptive data mining
– Predictive data mining
Decisions in data mining
– Databases to be mined: Relational, transactional, object-oriented, spatial, time-series, text, multi-media, heterogeneous, WWW, etc.
– Knowledge to be mined: Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. Multiple/integrated functions and mining at multiple levels.
– Techniques utilized: Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
– Applications adapted: Retail, telecommunication, banking, fraud analysis, etc.
Data mining tasks/techniques
– Predictive modeling: Use some variables to predict unknown or future values of other variables.
– Descriptive modeling: Find human-interpretable patterns that describe the data.
Data mining tasks/techniques
Predictive Modeling:
– Classification: Assigning data instances to predefined classes (e.g., decision trees, neural networks, support vector machines, logistic regression).
– Regression: Predicting continuous numerical values (e.g., linear regression).
– Time Series Analysis: Analyzing data points collected at specific time intervals (e.g., ARIMA, exponential smoothing).
Descriptive Modeling:
– Clustering: Grouping similar data points together (e.g., k-means, hierarchical clustering).
– Association Rule Mining: Discovering relationships between items (e.g., market basket analysis).
– Outlier Detection: Identifying abnormal data points.
CRISP-DM: Framework for Data Mining
– CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
– Widely adopted methodology.
– Provides a structured approach for planning and executing data mining projects.
– Designed to be adaptable across various industries and applications.
Key Characteristics of CRISP-DM
– Iterative: The process is not strictly linear. You may need to revisit previous phases as you progress.
– Flexible: It can be adapted to various project sizes.
Key Characteristics
Here’s a simplified explanation of the key characteristics of CRISP-DM:
1. Iterative: The CRISP-DM process isn’t a straight line; it’s more like a circle. As you work on a data project, you might find that you need to go back and revisit earlier steps. For example, after analyzing your data, you might realize you need to refine your questions or gather more data.
2. Flexible: CRISP-DM can be used for different types of projects, whether they are big or small. You can adjust the process to fit the specific needs of your project, making it versatile for various situations.
3. Industry-Neutral: This approach can be used in any industry, whether it’s healthcare, finance, marketing, or any other field. It’s designed to be useful no matter what kind of data you’re working with.
4. Focus on Business Value: At the heart of CRISP-DM is the idea of understanding what the business needs. It’s important to make sure that your data analysis is aligned with the goals of the organization. This way, your work provides real value and helps the business succeed.
5. Structured Framework: CRISP-DM provides a clear framework for managing data mining projects. It outlines specific steps to follow, making it easier for teams to collaborate and stay organized. This structure helps ensure that all important aspects of the project are covered, from understanding the problem to evaluating the results.
CRISP-DM: Data Mining Operations
1. Business Understanding:
   1. Determine business objectives and requirements.
   2. Assess situation and resources.
   3. Determine data mining goals.
2. Data Understanding:
   1. Collect initial data.
   2. Describe data.
   3. Explore data.
   4. Verify data quality.
3. Data Preparation:
   1. Select and clean data.
   2. Construct data.
4. Data Modeling:
   1. Select modeling techniques.
   2. Generate test design.
   3. Build and assess models.
5. Evaluation:
   1. Evaluate results.
   2. Review process.
   3. Determine next steps.
6. Deployment:
   1. Plan deployment.
   2. Plan monitoring and maintenance.
   3. Produce final report.
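The Modeling and Evaluation phases listed above (select a technique, generate a test design, build and assess models, evaluate results) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not something CRISP-DM prescribes: the technique chosen here is a 1-nearest-neighbour classifier, the test design is a simple hold-out split, the data is invented, and the names `split_holdout`, `predict_1nn` and `accuracy` are hypothetical.

```python
import random


def split_holdout(rows, test_fraction=0.25, seed=0):
    """Test design: shuffle a copy once, then hold out a fraction for assessment."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def predict_1nn(train, point):
    """Model: return the label of the closest training point (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], point))
    return min(train, key=dist)[1]


def accuracy(train, test):
    """Assessment: fraction of held-out rows whose label is predicted correctly."""
    hits = sum(predict_1nn(train, features) == label for features, label in test)
    return hits / len(test)


# Toy data, invented for illustration: two well-separated groups of 2-D points.
rows = [((x, y), "low") for x in range(0, 4) for y in range(0, 4)]
rows += [((x, y), "high") for x in range(10, 14) for y in range(10, 14)]

train, test = split_holdout(rows)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(train, test):.2f}")
```

On this deliberately easy data every held-out point is classified correctly. Real project data is rarely this clean, which is why the Evaluation phase includes reviewing the process and deciding whether to loop back to earlier phases.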
Components of Data Mining
– Data Source: This is the origin of the data, which can be databases, data warehouses, or other repositories.
– Data Warehouse Server: This component retrieves relevant data from the data source based on user requests.
– Data Mining Engine: The heart of the data mining process, it applies various algorithms and techniques to extract patterns from the data.
– Pattern Evaluation Module: Assesses the discovered patterns based on predefined criteria to determine their significance and usefulness.
– Graphical User Interface (GUI): This provides a user-friendly interface for interaction with the data mining system.
Data Mining Architecture / Components of Data Mining
Predictive Analytics
– It is the use of data to predict future trends and events.
– Attempts to answer the question, “What might happen next?”
– It leverages historical data, statistical modeling, and machine learning algorithms to identify patterns and make forecasts.
– It works by identifying correlations between different elements in selected datasets.
– There are broadly two types of predictive analytics models: classification models and regression models.
Predictive Analytics Challenges
– Data Quality: Inaccurate, incomplete, or biased data can lead to unreliable models.
– Data Availability: Insufficient or limited data can hinder model development.
– Model Complexity: Complex models can be difficult to interpret and explain.
– Overfitting: Models that are too closely fitted to the training data may not perform well on new data.
– Ethical Considerations: Concerns about privacy, bias, and fairness in model development and deployment.
– Computational Resources: Handling large datasets and complex models requires significant computational power.
Predictive Analytics Applications
– Finance: Fraud detection, credit risk assessment, investment portfolio optimization, market trend prediction.
– Healthcare: Disease outbreak prediction, patient risk assessment, drug discovery, personalized medicine.
– Retail: Customer segmentation, demand forecasting, inventory management, recommendation systems.
– Marketing: Customer churn prediction, campaign optimization, targeted advertising.
– Manufacturing: Predictive maintenance, supply chain optimization, quality control.
– Insurance: Risk assessment, fraud detection, customer churn prediction.
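As a small illustration of a regression model answering “What might happen next?”, the sketch below fits a straight line to historical values by ordinary least squares, checks it on held-out months, and forecasts the next one. The sales figures are invented, the helper names `fit_line` and `mean_abs_error` are hypothetical, and a straight line is only one of many possible model choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between the line's predictions and the actual values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Toy history, invented for illustration: units sold in months 1..8.
months = [1, 2, 3, 4, 5, 6, 7, 8]
units = [12, 15, 14, 18, 21, 20, 24, 26]

# Fit on the first six months, then check on the last two.
# A large gap between training error and held-out error would hint at overfitting.
model = fit_line(months[:6], units[:6])
print("training error:", round(mean_abs_error(model, months[:6], units[:6]), 2))
print("held-out error:", round(mean_abs_error(model, months[6:], units[6:]), 2))

# Forecast the next, unseen month.
slope, intercept = model
print("forecast for month 9:", round(slope * 9 + intercept, 1))
```

Holding back the most recent months mirrors how the model will actually be used: it is always asked about periods it has not seen.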