Module 1 Ppt1

This document provides an overview of a data mining and analysis course. The objectives are to learn data mining methods, business intelligence, predictive analytics, and knowledge discovery. The course covers topics such as association rules, classification, clustering, performance evaluation, and time series forecasting. Data mining is presented as a process to extract useful patterns from large amounts of data and turn raw data into useful business information.

Uploaded by Rashmi Sehgal

Data Mining and

Analysis
MCA2001
Objectives:

• To learn data mining methods and their importance.


• To learn about business intelligence, predictive
analytics, and decision making.
Expected Outcomes
• Implement the appropriate data mining methods like
classification, clustering or association mining on large
data sets.
• Apply analytics and intelligence to solve practical
problems.
• Apply Data mining for knowledge discovery.
Module 1
Introduction: Data Mining (DM) – origin – rapid growth – Core Ideas
in Data Mining – Supervised and Unsupervised Learning – Steps in
Data Mining – Data Warehousing.
Dimension reduction: Data Summaries – Correlation Analysis –
Reducing the Number of Categories in Categorical Variables –
Converting a Categorical Variable to a Numerical Variable –
Principal Components Analysis.
Module 2
Associative Prediction:
Frequent pattern Mining, Utility itemset mining, Association Rules –
Association Algorithms
Classifications:
Classification methods – Decision Tree – Naïve Bayes – K-Nearest
Neighbors – classification and regression trees –
logistic regression models.
Module 3
Cluster analysis:
Introduction – distance between two records –
measuring distance between two clusters –
hierarchical clustering – non-hierarchical
clustering – k-means algorithm.
Module 4
Performance Evaluation:
Evaluating classification performance -
Introduction - Evaluating Goodness of fit - logistic regression for
more than two classes. Predictive Performance - Judging
Classification Performance - Evaluating Predictive Performance –
Prediction - Multiple linear regression- Explanatory vs predictive
modelling – Estimating the regression equation and prediction
variable selection in linear regression.
Module 5
Forecasting time series :
Introduction to time series - Explanatory versus Predictive
Modelling - Popular Forecasting Methods in Business - Time
Series Components - Data Partitioning - Regression-Based
Forecasting - Model with Trend - Model with Seasonality –
Model with Trend and Seasonality - Autocorrelation and
ARIMA Models - Smoothing Methods.
Introduction to Data Mining(DM)
• Data mining is a process used by companies to
turn raw data into useful information.
• By using software to look for patterns in large
batches of data, businesses can learn more
about their customers to develop more effective
marketing strategies, increase sales and
decrease costs.
• Data mining depends on effective data
collection, warehousing, and computer
processing.
Why Data Mining ?
• Credit ratings/targeted marketing:
• Given a database of 100,000 names, which persons are the least likely to
default on their credit cards?
• Identify likely responders to sales promotions
• Fraud detection
• Which types of transactions are likely to be fraudulent, given the demographics
and transactional history of a particular customer?
• Customer relationship management:
• Which of my customers are likely to be the most loyal, and which are most likely
to leave for a competitor?

Data Mining helps extract such information


Cont…

• The Explosive Growth of Data: from terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)


• Data collection and data availability
• Automated data collection tools, database systems, web
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: bioinformatics, scientific simulation, medical research …
• Society and everyone: news, digital cameras, …
• Data rich but information poor!
• What do those data mean?
• How to analyze data?
• Data mining — Automated analysis of massive data sets
Data mining

• Process of semi-automatically analyzing large


databases to find patterns that are:
• valid: hold on new data with some certainty
• novel: non-obvious to the system
• useful: should be possible to act on the item
• understandable: humans should be able to interpret
the pattern
• Also known as Knowledge Discovery in
Databases (KDD)
What Is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.

Data Mining: Concepts and Techniques


Potential Applications
• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis,
cross selling, market segmentation
• Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control, competitive
analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other Applications
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
• analyze patient disease history: find relationship between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
• identify new galaxies by searching for sub clusters
• Web site/store design and promotion:
• find affinity of visitor to pages and modify layout
Ex.: Market Analysis and Management
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.,
• E.g. Most customers with income level 60k – 80k with food expenses $600 -
$800 a month live in that area
• Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k
usually buy this type of CD player
• Cross-market analysis—Find associations/co-relations between product sales, &
predict based on such association
• E.g. Customers who buy computer A usually buy software B
Ex.: Market Analysis and Management (2)
• Customer requirement analysis
• Identify the best products for different customers
• Predict what factors will attract new customers
• Provision of summary information
• Multidimensional summary reports
• E.g. Summarize all transactions of the first quarter from three different branches
Summarize all transactions of last year from a particular branch
Summarize all transactions of a particular product
• Statistical summary information
• E.g. What is the average age for customers who buy product A?
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
Knowledge Discovery (KDD) Process



KDD Process: Several Key Steps
• Learning the application domain
• relevant prior knowledge and goals of application

• Identifying a target data set: data selection


• Data processing
• Data cleaning (remove noise and inconsistent data)
• Data integration (multiple data sources maybe combined)
• Data selection (data relevant to the analysis task are retrieved from database)
• Data transformation (data transformed or consolidated into forms appropriate for mining) (Done with data
preprocessing)
• Data mining (an essential process where intelligent methods are applied to extract data patterns)
• Pattern evaluation (identify the truly interesting patterns)
• Knowledge presentation (mined knowledge is presented to the user with visualization or representation
techniques)

• Use of discovered knowledge
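The KDD steps above can be sketched as a minimal pipeline. A hedged sketch only: the function names and the toy cleaning/selection rules are illustrative assumptions, not part of any standard library or of the textbook's own code.

```python
# Minimal sketch of the KDD pipeline steps (cleaning, integration,
# selection, mining). All names and toy rules are illustrative assumptions.

def clean(records):
    """Data cleaning: drop records with missing values (noise/inconsistency)."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Data integration: combine multiple data sources into one set."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def select(records, attrs):
    """Data selection: keep only attributes relevant to the analysis task."""
    return [{a: r[a] for a in attrs} for r in records]

def mine(records):
    """Data mining: here, a trivial 'pattern' -- count attribute values."""
    counts = {}
    for r in records:
        for k, v in r.items():
            counts[(k, v)] = counts.get((k, v), 0) + 1
    return counts

# Two hypothetical data sources, one with a noisy record.
source_a = [{"item": "milk", "qty": 1}, {"item": None, "qty": 2}]
source_b = [{"item": "milk", "qty": 3}, {"item": "bread", "qty": 1}]

data = select(clean(integrate(source_a, source_b)), ["item"])
patterns = mine(data)
print(patterns)  # {('item', 'milk'): 2, ('item', 'bread'): 1}
```

The real steps (pattern evaluation, knowledge presentation) would follow the `mine` call; they are omitted here to keep the sketch short.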


Data Mining and Business Intelligence

Layers, from data sources at the bottom to decision making at the top, with increasing potential to support business decisions:

• Decision Making (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: paper, files, Web documents, scientific experiments, database systems
A typical DM System Architecture

• Database, data warehouse, WWW or other information repository (store data)


• Database or data warehouse server (fetch and combine data)
• Knowledge base (turn data into meaningful groups according to domain
knowledge)
• Data mining engine (perform mining tasks)
• Pattern evaluation module (find interesting patterns)
• User interface (interact with the user)
A Typical DM System Architecture
Architecture of a typical data mining system:
• Database, data warehouse, World Wide Web, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or other
kinds of information repositories. Data cleaning and data integration techniques
may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining
request.
• Knowledge base: This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns. Such knowledge can
include concept hierarchies, used to organize attributes or attribute values into
different levels of abstraction. Knowledge such as user beliefs, which can be
used to assess a pattern’s interestingness based on its unexpectedness, may
also be included. Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g., describing data
from multiple heterogeneous sources).
• Data mining engine: Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be
integrated with the mining module.
• User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. This
component allows the user to browse database and data warehouse schemas
or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
Motivating Challenges
• Scalability:
• Datasets with sizes of gigabytes, terabytes or even petabytes
• Massive datasets cannot fit into main memory
• Need to develop scalable data mining algorithms to mine massive datasets
• Scalability can also be improved by using sampling or developing parallel and
distributed algorithms.
• High Dimensionality:
• Data sets with hundreds or thousands of attributes.
• Example: a dataset that contains measurements of temperature at various locations
• Traditional data analysis techniques were developed for low-dimensional data.
• Need to develop data mining algorithms to handle high dimensionality.
• Heterogeneous and Complex Data:
• Traditional data analysis methods deal with datasets containing attributes of the same
type (continuous or categorical).
• Complex data sets contain images, video, text, etc.
• Need to develop mining methods to handle complex datasets
• Data Ownership and Distribution:
• Data is not stored in one location or owned by one organization.
• Data is geographically distributed among resources belonging to multiple entities.
• Need to develop distributed data mining algorithms to handle distributed datasets.
• Key challenges:
• How to reduce the amount of communication needed for distributed data.
• How to effectively consolidate the data mining results from multiple sources
• How to address data security issues.
• Non-Traditional Analysis:
• Traditional statistical approach is based on a hypothesize-and-test
paradigm.
• A hypothesis is proposed, an experiment is designed to gather the
data, and then data is analyzed with respect to the hypothesis.
• This process is extremely labor-intensive.
• Need to develop mining methods to automate the process of
hypothesis generation and evaluation.
On What Kinds of Data?

• Database-oriented data sets and applications


• Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
• Object-Relational Databases
• Temporal Databases, Sequence Databases, Time-Series databases
• Spatial Databases and Spatiotemporal Databases
• Text databases and Multimedia databases
• Heterogeneous Databases and Legacy Databases
• Data Streams
• The World-Wide Web

Relational Databases
• DBMS – database management system, contains a collection of
interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to
manage and access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of data and
rows as records.
• Tables can be used to represent the relationships between or among multiple
tables.
Relational Databases

• With a relational query language, e.g. SQL, we will be able to find


answers to questions such as:
• How many items were sold last year?
• Who has earned commissions higher than 10%?
• What were last month's total sales of Dell laptops?
• When data mining is applied to relational databases, we can search for
trends or data patterns.
• Relational databases are one of the most commonly available and
rich information repositories, and thus are a major data form in our
study.
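Queries like those above can be tried directly with Python's built-in sqlite3 module. A sketch under stated assumptions: the `sales` table, its columns, and its rows are hypothetical illustration data, not a schema from the course.

```python
# The relational-query idea above, sketched with Python's built-in sqlite3.
# The sales table and its rows are hypothetical illustration data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Dell laptop", 2, 2023), ("mouse", 5, 2023), ("Dell laptop", 1, 2022)],
)

# "How many items were sold last year?" (assuming last year = 2023)
(total,) = conn.execute("SELECT SUM(qty) FROM sales WHERE year = 2023").fetchone()
print(total)  # 7
```

Data mining goes beyond such one-off queries: instead of answering a fixed question, it searches the same tables for trends and patterns.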
Data Warehouses

• A repository of information
collected from multiple
sources, stored under a
unified schema, and that
usually resides at a single
site.
• Constructed via a process of
data cleaning, data
integration, data
transformation, data loading
and periodic data refreshing.
Data Warehouses (2)

• Data are organized around major subjects, e.g. customer, item, supplier and activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10 years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• User can perform drill-down or roll-up operation to view the data at different degrees of
summarization

• OLAP (Online Analytical Processing) is the technology behind many Business


Intelligence (BI) applications. OLAP is a powerful technology for data discovery,
including capabilities for limitless report viewing, complex analytical calculations,
and predictive “what if” scenario (budget, forecast) planning.
Transactional Databases
• Consists of a file where each record represents a transaction
• A transaction typically includes a unique transaction ID and a list of the items making
up the transaction.

• Either stored in a flat file or unfolded into relational tables


• Easy to identify items that are frequently sold together
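"Frequently sold together" falls out of a transactional file almost directly, by counting item pairs per transaction. A minimal sketch, assuming made-up transaction data:

```python
# Counting item pairs across transactions to spot items frequently
# sold together. The transactions below are hypothetical.
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
}

pair_counts = Counter()
for items in transactions.values():
    # sort so ('bread', 'milk') and ('milk', 'bread') count as one pair
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'milk'), 2)]
```

Real association-rule algorithms (covered in Module 2) do this at scale without enumerating every pair, but the underlying signal is the same co-occurrence count.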
Relationship with other fields

• Overlaps with machine learning, statistics, artificial intelligence,
databases, and visualization, but with more stress on:
  • scalability in the number of features and instances
  • algorithms and architectures (the foundations of methods and
  formulations are provided by statistics and machine learning)
  • automation for handling large, heterogeneous data

The Origins of Data Mining
• Data mining draws ideas from several fields:
  • Sampling, estimation, and hypothesis testing
  from statistics.
  • Search algorithms, modeling techniques, and
  learning theories from artificial intelligence,
  machine learning, and pattern recognition.
• Database systems provide support for efficient
storage, indexing, and query processing.
• Techniques from parallel computing address
the massive size of some datasets.
• Distributed computing techniques are used to
gather information from different locations.
Confluence of Multiple Disciplines

Data mining sits at the confluence of database technology, statistics,
machine learning, information science, visualization, and other disciplines.

• Not all "data mining systems" perform true data mining
  • machine learning systems, statistical analysis (small amounts of data)
  • database systems (information retrieval, deductive querying, ...)
DATA WAREHOUSE
A producer wants to know:
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
Data, Data Everywhere, Yet ...
• I can't find the data I need
  • data is scattered over the network
  • many versions, subtle differences
• I can't get the data I need
  • need an expert to get the data
• I can't understand the data I found
  • available data poorly documented
• I can't use the data I found
  • results are unexpected
  • data needs to be transformed from one form to another
What is a Data Warehouse?
"A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a way
they can understand and use in a business context."

[Barry Devlin]
What are the users saying...
• Data should be integrated across the
enterprise
• Summary data has a real value to the
organization
• Historical data holds the key to
understanding data over time
• What-if capabilities are required

What is Data Warehousing?
"A process of transforming data into information and making it
available to users in a timely enough manner to make a difference."

[Forrester Research, April 1996]
Evolution
• 60’s: Batch reports
• hard to find and analyze information
• inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
• still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
• query tools, spreadsheets, GUIs
• easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools

Very Large Data Bases
• Terabytes (10^12 bytes): Walmart, 24 terabytes
• Petabytes (10^15 bytes): geographic information systems
• Exabytes (10^18 bytes): national medical records
• Zettabytes (10^21 bytes): weather images
• Yottabytes (10^24 bytes): intelligence agency videos
Data Warehousing -- It Is a Process
• A technique for assembling and managing data from various
sources for the purpose of answering business questions, thus
enabling decisions that were not previously possible
• A decision support database maintained separately from the
organization's operational database
Data Warehouse
• A data warehouse is a
• subject-oriented
• integrated
• time-varying
• non-volatile

collection of data that is used primarily in organizational decision


making.
-- Bill Inmon, Building the Data Warehouse 1996

Explorers, Farmers and Tourists
• Explorers: seek out the unknown and previously unsuspected
rewards hiding in the detailed data
• Farmers: harvest information from known access paths
• Tourists: browse information harvested by farmers
Data Mining Works with Warehouse Data
• Data warehousing provides the enterprise with a memory
• Data mining provides the enterprise with intelligence
Supervised learning vs. unsupervised learning

• Supervised learning: discover patterns in the data that relate
data attributes with a target (class) attribute.
  • These patterns are then utilized to predict the values of the target
  attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
  • We want to explore the data to find some intrinsic structures in
  them.
Supervised learning:
• The computer is presented with example inputs and their desired outputs,
given by a “teacher”, and the goal is to learn a general rule that maps inputs
to outputs.
• The training process continues until the model achieves the desired level of
accuracy on the training data.
• Some real-life examples are:
• Image Classification: You train with images/labels. Then in the future you give a new
image expecting that the computer will recognize the new object.
• Market Prediction/Regression: You train the computer with historical market data
and ask the computer to predict the new price in the future.
Supervised Machine Learning
• Supervised learning is where you have input variables (x) and an
output variable (Y) and you use an algorithm to learn the mapping
function from the input to the output.
Y = f(X)
• The goal is to approximate the mapping function so well that when
you have new input data (x) that you can predict the output variables
(Y) for that data.
• It is called supervised learning because the process of an
algorithm learning from the training dataset can be thought of
as a teacher supervising the learning process.
• We know the correct answers, the algorithm iteratively makes
predictions on the training data and is corrected by the
teacher.
• Learning stops when the algorithm achieves an acceptable
level of performance.
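The mapping Y = f(X) can be made concrete with a tiny one-variable least-squares fit, where the "teacher" is the set of labeled training pairs. The data points below are made up for illustration:

```python
# Learning Y = f(X) from labeled examples: a one-variable least-squares fit.
# The training pairs are made-up illustration data (here, exactly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # the "teacher's" desired outputs

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def f(x):
    """The learned mapping, used to predict Y for new inputs."""
    return slope * x + intercept

print(f(5.0))  # 10.0
```

The training loop described above (predict, get corrected, repeat) is collapsed here into a closed-form fit; iterative algorithms such as gradient descent follow the predict-and-correct cycle literally.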
We have four types of fruits: apple, banana, grape and cherry.

NO.  SIZE   COLOR  SHAPE                                        FRUIT NAME
1    Big    Red    Rounded shape with a depression at the top   Apple
2    Small  Red    Heart-shaped to nearly globular              Cherry
3    Big    Green  Long curving cylinder                        Banana
4    Small  Green  Round to oval, bunch shape, cylindrical      Grape

• Suppose you take a new fruit from the basket; you observe the
size, color and shape of that particular fruit.
• If the size is big, the color is red, and the shape is rounded with a
depression at the top, you confirm the fruit name as apple and put
it in the apple group.
• Likewise for the other fruits.
• The job of grouping fruits is done.
• Observe in the table that one column is labeled "FRUIT NAME";
this is called the response variable.
• If you learn from the training data and then apply that knowledge
to the test data (the new fruit), this type of learning is called
supervised learning.
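The fruit-grouping example can be written as a few lines of code: learn a lookup from the training table, then classify a new fruit. A sketch only; the shape attribute is omitted for brevity, so (size, color) alone distinguishes the four fruits here:

```python
# The fruit-grouping example as code: learn (size, color) -> fruit name
# from the training table, then classify a new fruit.
# Shapes are omitted for brevity (size and color suffice for these four).
training = [
    ("Big", "Red", "Apple"),
    ("Small", "Red", "Cherry"),
    ("Big", "Green", "Banana"),
    ("Small", "Green", "Grape"),
]

# "Learning": memorize the attribute-to-label mapping from the training data.
model = {(size, color): fruit for size, color, fruit in training}

new_fruit = ("Big", "Red")  # a new fruit taken from the basket
print(model[new_fruit])     # Apple
```

A pure lookup only works when every test case matches a training case exactly; real classifiers (decision trees, k-nearest neighbors, etc.) generalize to unseen attribute combinations.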
• Supervised learning problems can be further grouped into
regression and classification problems.
• Classification: A classification problem is when the output variable
is a category, such as “red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a
real value, such as “dollars” or “weight”.
Unsupervised learning
• No labels are given to the learning algorithm, leaving it on its own to
find structure in its input.
• It is used for clustering a population into different groups.
• Unsupervised learning can be a goal in itself (discovering hidden
patterns in data).
• Clustering: You ask the computer to separate similar data into clusters, this
is essential in research and science.
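The clustering idea can be sketched with a minimal k-means loop (k = 2, one-dimensional points). The data and starting centroids below are arbitrary illustrations, not a prescribed initialization scheme:

```python
# A minimal k-means sketch: k = 2, one-dimensional points.
# The points and starting centroids are arbitrary illustration values.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centroids = [1.0, 10.0]

for _ in range(10):  # a fixed number of refinement passes
    clusters = [[], []]
    for p in points:  # assign each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # recompute each centroid as the mean of its assigned points
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 11.0]
```

No labels appear anywhere in the loop; the two groups emerge from the data alone, which is exactly what makes this unsupervised.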
Steps in Data Mining
• Develop an understanding of the purpose of the data mining project
• Obtain the dataset to be used in the analysis.
• Explore, clean, and preprocess the data.
• Reduce the data, if necessary, and (where supervised training is involved)
separate them into training, validation, and test datasets.
• Determine the data mining task (classification, prediction, clustering, etc.).
• Choose the data mining techniques to be used (regression, neural nets,
hierarchical clustering, etc.).
• Use algorithms to perform the task
• Interpret the results of the algorithms
• Deploy the model
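The partitioning step above can be sketched in a few lines: shuffle the records, then cut them into training, validation, and test sets. The 60/20/20 split and the stand-in dataset are assumptions for illustration, not a prescribed ratio:

```python
# Sketch of the data-partitioning step: shuffle, then split into
# training, validation, and test sets (60/20/20 is an assumed ratio).
import random

records = list(range(100))  # stand-in for a cleaned, preprocessed dataset
random.seed(42)             # fixed seed so the shuffle is reproducible
random.shuffle(records)

n = len(records)
train = records[: int(0.6 * n)]
valid = records[int(0.6 * n): int(0.8 * n)]
test  = records[int(0.8 * n):]

print(len(train), len(valid), len(test))  # 60 20 20
```

The model is fit on the training set, tuned against the validation set, and only evaluated once on the held-out test set, so the reported performance reflects unseen data.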
