Unit-1 - Data Mining
Presented by
Pathapati Saroja
Introduction:
●Data is raw and unprocessed. It's like ingredients before they are cooked.
●Information is data that has been processed and organized in a way that makes it useful
and meaningful. It's like a finished meal made from those ingredients.
●Database (OLTP): A database is an organized collection of data that can be easily accessed,
managed, and updated.
●Most organisations maintain databases (banks, Flipkart, Spencers…).
●These databases store day-to-day transactions.
● Nowadays, companies use two kinds of databases: OLTP and OLAP.
● Data Warehouse (OLAP): stores historical data
● Why is there a need for OLAP?
Why Data Mining?
● The Explosive Growth of Data: from terabytes to petabytes
-Data collection and data availability
Automated data collection tools, database systems, Web, computerized
society
-Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
● We have a lot of data, but we are starving for knowledge!
● There is a need for automated tools to analyze large amounts of data and find
useful information. "Necessity is the mother of invention": hence data mining.
What is Data Mining?
● Data mining (knowledge discovery from data)
● Extraction of interesting patterns or knowledge from huge amounts of data. "Interesting" means:
-Non-trivial: finding that customers who buy bread also frequently buy butter is non-trivial,
compared with noticing that people buy more umbrellas when it rains.
-Implicit: discovering a correlation between late-night online activity and increased spending on
digital products.
-Previously unknown: identifying a new customer segment that prefers a certain product feature,
which was not recognized before.
-Potentially useful: using purchasing patterns to optimize inventory and reduce stockouts in a
retail store.
● Alternative names
-Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, business intelligence, etc.
● Is everything “data mining”?
-No. Simple search and query processing are not data mining, e.g. a Google search.
Eg: Counting how many videos were uploaded to YouTube in the last hour is a simple query, not data mining.
Evolution of DM
1. Data Collection and Database Creation (1960s and earlier):
● In the 1960s, data was stored in files, with each file holding its data as
records.
Example: Primitive file processing: Imagine you have a different notebook for each subject in
school. If you want to find all the notes related to one topic, you have to look through each
notebook separately, which is time-consuming and messy.
2. Database Management Systems (1970s – early 1980s):
● Hierarchical and network database systems: Information is stored in a tree-like or
network structure, which can show complex relationships but is difficult to change once
set.
● Relational database systems: Like having a spreadsheet where each row is a record and
columns are attributes. You can easily link data from different tables (like linking a
student's ID to their grades in another table).
● Query languages (SQL): A dedicated language used to ask the database
questions, such as "Show me all students with grades above 90."
● User interfaces, forms, and reports: Tools that make it easier for people to use databases
without needing to know the technical details.
● Concurrency control and recovery: Ensures that many people can use the database at the
same time without issues, and the database can recover from crashes.
● Online transaction processing (OLTP): Think of real-time processing, like when you swipe
your card at a store, and it instantly updates your bank account.
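The relational and SQL points above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names (students, grades), the columns, and the sample rows are made up for this example and are not from the slides.

```python
import sqlite3

# In-memory database; table names, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE grades (student_id INTEGER, subject TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meena")],
)
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [(1, "Maths", 95), (2, "Maths", 82), (3, "Maths", 91)],
)

# "Show me all students with grades above 90":
# link the two tables through the shared student_id column.
rows = conn.execute(
    "SELECT s.name, g.grade "
    "FROM students s JOIN grades g ON g.student_id = s.student_id "
    "WHERE g.grade > 90 ORDER BY s.name"
).fetchall()
print(rows)
conn.close()
```

Each row of a table is one record, and the join shows how a student's ID links the student to grades held in another table.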
3. Advanced Database Systems (mid-1980s – present)
● Advanced data models: New ways to structure and organize data, including handling
multimedia like images and videos.
● Distributed and client-server architectures: Data can be stored in multiple places and
accessed over a network, like using cloud storage.
● Stream and sensor data: Processing continuous data from sensors, such as temperature
readings from a weather station.
● Knowledge-based systems: Systems that use artificial intelligence to provide smart data
management and retrieval.
4. Advanced Data Analysis: Data Warehousing and Data Mining (late 1980s – present)
● Data warehouse and OLAP: Imagine a huge library that stores books (data) from all over
the world. OLAP tools help you quickly find and analyze information in these books.
Data mining and knowledge discovery:
● Classification: Like sorting your email into different folders (work, personal, spam).
● Clustering: Grouping similar items together, like clustering photos based on location or
people in them.
● Association: Finding relationships, such as noticing that people who buy bread also
often buy butter.
● Sequential patterns: Identifying trends over time, like noticing that sales of umbrellas go
up during the rainy season.
● Text mining and web mining: Extracting useful information from large text documents or
websites.
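The association idea above (bread and butter) is usually measured with two numbers: support (how often the items appear together) and confidence (how often butter appears when bread does). The sketch below assumes a tiny made-up list of transactions; the function names are illustrative and not from any library.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

A rule such as bread → butter is normally kept only if both values pass thresholds chosen by the analyst.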
5. Web-based Databases (1990s – present)
● XML-based database systems: Designed to store data in a format that is easy to share
over the web.
● Integration with web services: Allows databases to interact with other web applications,
like how an online store’s database might connect to a payment service.
● Information retrieval: Techniques to search and find information from large databases,
like using a search engine to find specific articles on the web.
● Data and information integration: Combining data from different sources to get a
complete picture, like merging data from different departments in a company to generate
a comprehensive report.
6. New Generation of Integrated Data and Information Systems (present – future)
● Integration of various systems: This stage focuses on connecting different types of
databases and information systems to work together seamlessly, much like having all
your smart home devices (lights, thermostat, security) controlled from one app.
Importance of Data Mining:
Data mining is a powerful tool that helps organizations with:
1. Improved Decision Making: A restaurant uses data on past orders to decide which dishes
to keep on the menu and which ones to remove.
2. Predictive Analysis: An online clothing store analyzes past purchase data to predict what
types of clothes will be popular next season.
3. Customer Relationship Management (CRM): A mobile phone company uses data on
customer calls and internet usage to offer personalized plans and services to each
customer.
4. Fraud Detection: A bank uses data analysis to detect unusual patterns in credit card
transactions, like a sudden large purchase in a different country, to prevent fraud.
5. Market Basket Analysis: A grocery store finds out that people who buy bread often buy
butter as well, so it places these items next to each other to increase sales.
6. Risk Management: An insurance company uses data to identify high-risk areas for
natural disasters and adjust their insurance policies accordingly.
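As a rough illustration of Fraud Detection (item 4 above), one very simple way to spot an "unusual pattern" is to flag amounts that lie far from a customer's usual spending. This is only a sketch under stated assumptions: the amounts are made up, the 3-standard-deviation threshold is an arbitrary choice, and real systems use far richer features (location, time, merchant).

```python
from statistics import mean, stdev

def flag_unusual(history, new_amounts, threshold=3.0):
    """Return the new amounts lying more than `threshold` standard
    deviations away from the mean of the customer's past amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return []  # no variation in history, nothing to compare against
    return [a for a in new_amounts if abs(a - mu) / sigma > threshold]

# Hypothetical past card transactions for one customer.
history = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0, 44.0, 58.0]

flagged = flag_unusual(history, [49.0, 950.0, 63.0])
print(flagged)
```

Here only the sudden large purchase is flagged; the other two amounts are close to what this customer normally spends.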
Data Mining: On What Kinds of Data?
● Database-oriented data sets and applications
-Relational database, data warehouse, transactional database
1. Relational Databases:
● A relational database is a collection of tables, each of which is assigned a
unique name.
● Each table consists of a set of attributes (columns or fields) and usually
stores a large set of tuples (records or rows).
2. Data Warehouse:
3. Transactional Databases:
● In general, each record in a transactional database captures a transaction,
such as a customer’s purchase, a flight booking, or a user’s clicks on a web
page.
● A transaction typically includes a unique transaction identity number (trans
ID) and a list of the items making up the transaction, such as the items
purchased in the transaction.
● A transactional database may have additional tables, which contain other
information related to the transactions, such as item description,
information about the salesperson or the branch, and so on.
4. Object-Relational Databases:
● They are constructed based on the object-relational data model.
● Each object has the following associated with it:
-A set of variables that describe the object.
-A set of messages used to communicate with other objects.
-A set of methods, each holding the code that implements a message.
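The three parts of an object can be pictured with an ordinary Python class. This is only an analogy: an object-relational DBMS defines such types inside the database, and the Customer class, its fields, and its methods are hypothetical names chosen for this sketch.

```python
class Customer:
    """Hypothetical object type, for illustration only."""

    def __init__(self, customer_id, name):
        # Variables: describe the object.
        self.customer_id = customer_id
        self.name = name
        self.purchases = []

    # Method: the code that runs when the object receives
    # the "add_purchase" message.
    def add_purchase(self, amount):
        self.purchases.append(amount)

    # Method: the code that answers the "total_spent" message.
    def total_spent(self):
        return sum(self.purchases)

c = Customer(1, "Asha")
c.add_purchase(250)  # sending the message "add_purchase" to the object
c.add_purchase(100)
total = c.total_spent()
print(total)
```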
5. Temporal Databases, sequence databases and Time-series databases
● A temporal database is designed to handle data related to time. It can track
historical data, current data, and future data, providing a comprehensive view
of data changes over time.
Example: Employee Work History
Consider a system that tracks employees' job titles and departments over
time.
● A sequence database is a data repository whose primary focus is
ordered sequences of events, actions, or items, with or without a concrete
notion of time (e.g. customer shopping sequences, web click streams).
● Time-series data is a sequence of data points collected or recorded at
successive points in time, often at regular intervals (e.g. stock exchange data,
temperature, wind).
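A common first step with time-series data is smoothing it with a moving average so the overall trend is easier to see. The sketch below assumes made-up daily temperature readings and a window of 3; the function name is illustrative.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive readings."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily temperature readings, taken at regular intervals.
temps = [30.0, 32.0, 31.0, 35.0, 36.0, 34.0]

smoothed = moving_average(temps, 3)
print(smoothed)
```

Six readings with a window of 3 give four averaged values, one for each group of three consecutive days.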
6. Spatial databases and spatiotemporal databases
● Spatial databases contain space-related information, e.g. map databases, medical image databases, and
satellite image databases.
8. Data Streams:
● Data streams include time-series data, stock exchange data, and telecommunications data.
Data Mining Applications:
1. Healthcare
Predicting Diseases: Data mining helps doctors predict if a person might get a disease by
looking at their health records and family history.
Grouping Patients: It groups patients with similar health issues so doctors can give them
similar treatments.
2. Finance
Fraud Detection: It finds unusual activities in financial transactions that might be fraud.
Risk Management: It helps banks decide if giving a loan to someone is risky based on their
financial history.
3. Marketing
Customer Segmentation: It divides customers into groups based on their buying habits, so
companies can target them with specific ads.
Market Basket Analysis: It finds out which products are often bought together, helping
stores to recommend related items to customers.
4. Retail
Managing Stock: It predicts which products will sell well so stores can keep enough in stock.
Customer Retention: It identifies why customers might stop buying from a store and helps
create strategies to keep them.
5. Telecommunications
Customer Turnover: It predicts which customers might switch to another provider based on
their usage patterns.
Optimizing Networks: It helps improve network services by analyzing usage data.
6. Manufacturing
Preventing Machine Breakdowns: It predicts when machines might break down so
maintenance can be done before that happens.
Quality Control: It spots patterns in production that could lead to defects, helping maintain
product quality.
7. Education
Student Performance Prediction: It identifies students who might be struggling and need
extra help.
Customized Learning: It creates personalized study plans for students based on their
performance and learning style.
8. E-commerce
Product Recommendations: It suggests products to customers based on what they have
bought or looked at before.
Improving Websites: It analyzes how customers use a website to make it easier to navigate
and more user-friendly.
Knowledge Discovery from Data:
● Data mining is an essential step in the process of knowledge discovery.
Architecture of typical DM System
Thank you