Data Mining - Session 1
Introduction to Data Mining & Data Preparation
Source: aws.amazon.com/id/data
• IoT/Devices:
o Data from Internet of Things (IoT) devices, such as smart home gadgets, wearables, and other connected hardware.
• Applications/Logs:
o Data generated from applications, including user activity logs, clickstreams, and app usage.
• Third-Party Data:
o Data acquired from external sources or partners, which could enhance Amazon's insights.
[Figure: large datasets are mined for patterns, leading to knowledge discovery]
• Database:
o Amazon Aurora and Amazon DynamoDB: These are database services for managing structured data. Aurora is a relational
database, while DynamoDB is a NoSQL database optimized for scalable, high-speed operations.
• Data Warehouse:
o Amazon Redshift: A data warehouse solution where large volumes of data are stored and analyzed. Redshift enables fast
querying and supports complex data analysis.
• Data Lake:
o Amazon S3: A scalable storage service often used as a data lake to store raw, structured, and unstructured data.
• Big Data:
o Amazon EMR: A tool for processing large datasets using frameworks like Hadoop and Spark, allowing for efficient analysis of big
data.
• Data Streams:
o Amazon Kinesis and Amazon MSK: Used for real-time data streaming and analytics, allowing Amazon to process live data feeds
quickly and extract instant insights.
Data Mining vs. Machine Learning vs. AI
• Scope:
o Data Mining: Primarily an analytical process that uses statistical and machine learning techniques to understand data.
o Machine Learning: Involves creating algorithms that learn from data; narrower than AI but broader than data mining alone.
o AI: A broad set of approaches to building systems that perform tasks typically requiring human intelligence.
• Goal:
o Data Mining: Extract useful information, trends, and patterns from data.
o Machine Learning: Develop models that generalize from past data to make predictions or decisions.
o AI: Create systems that mimic human intelligence and perform complex tasks autonomously.
• Techniques Used:
o Data Mining: Statistical analysis, clustering, association, anomaly detection, and machine learning algorithms.
o Machine Learning: Algorithms such as classification, regression, clustering, and reinforcement learning.
o AI: Combines machine learning, neural networks, deep learning, computer vision, speech recognition, and rule-based systems.
• Dependency:
o Data Mining: May use machine learning, but also depends on statistical and analytical methods; does not necessarily require AI.
o Machine Learning: Relies on data for training models and improving accuracy; highly data-driven.
o AI: Machine learning is a primary method, but AI also includes rule-based systems, heuristics, and logic.
• Relationship to Each Other:
o Data Mining: Often uses machine learning for data analysis but does not cover all aspects of machine learning.
o Machine Learning: A subset of AI that provides tools and algorithms frequently used in data mining.
o AI: The broadest category; includes machine learning and data mining as application areas.
• Examples:
o Data Mining: Customer segmentation, fraud detection, market basket analysis.
o Machine Learning: Image recognition, spam filtering, personalized recommendations.
o AI: Autonomous driving, virtual assistants, medical diagnosis systems.
Data Processing Flow
#1 Data Collection and Preparation (Foundation Stage)
At this initial stage, raw data is gathered from various sources and pre-processed. Data cleaning, transformation, and integration occur here to ensure the quality and consistency of the data before analysis.
#2 Data Mining (Exploratory Stage)
This stage involves analyzing the prepared data to discover patterns, trends, and insights. Techniques like clustering, association analysis, and anomaly detection help uncover useful information from large datasets.
#3 Data Selection
Choosing the most relevant data for the specific mining task. This focuses the analysis on the data most likely to yield meaningful insights (a short pandas sketch follows the flat record below).
Example patient record in flat (tabular) form, one column per attribute:
patient_id: 123456; first_name: John; last_name: Doe; age: 45; gender: Male; phone: -1788; email: [email protected]; street: 123 Elm St; city: Springfield; state: IL; zip: 62701; condition_name: Hypertension; diagnosed_date: 6/23/2015; surgery_name: Appendectomy; surgery_date: 8/20/2010; allergy: Penicillin
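To make the Data Selection step above concrete, here is a minimal, hypothetical pandas sketch (pandas is assumed, and the "chronic-condition study" framing and column subset are made up) that keeps only the attributes relevant to one mining task:

    import pandas as pd

    # One-row DataFrame mirroring the flat patient record above (values abridged).
    record = pd.DataFrame([{
        "patient_id": "123456", "first_name": "John", "last_name": "Doe",
        "age": 45, "gender": "Male", "city": "Springfield", "state": "IL",
        "condition_name": "Hypertension", "diagnosed_date": "2015-06-23",
        "allergy": "Penicillin",
    }])

    # Data selection: keep only the columns relevant to a chronic-condition study.
    selected = record[["patient_id", "age", "gender", "condition_name", "diagnosed_date"]]
    print(selected)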
The same patient record in JSON form. personal_info holds the basic personal and contact details, with the address nested under the "address" key; medical_history includes conditions, surgeries, and allergies, where each entry has a name and date for context.

    {
      "patient_id": "123456",
      "personal_info": {
        "first_name": "John",
        "last_name": "Doe",
        "age": 45,
        "gender": "Male",
        "contact_info": {
          "phone": "+1-555-1234",
          "email": "[email protected]",
          "address": {
            "street": "123 Elm St",
            "city": "Springfield",
            "state": "IL",
            "zip": "62701"
          }
        }
      },
      "medical_history": {
        "conditions": [
          { "name": "Hypertension", "diagnosed_date": "2015-06-23" },
          { "name": "Type 2 Diabetes", "diagnosed_date": "2018-02-14" }
        ],
        "surgeries": [
          { "name": "Appendectomy", "date": "2010-08-20" }
        ],
        "allergies": ["Penicillin", "Peanuts"]
      },
      "notes": "Routine follow-up for diabetes management."
    }
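Nested records like the JSON above are often flattened back into tabular form before mining. Below is a minimal sketch using Python's standard json module; the file name patient.json is a hypothetical placeholder:

    import json

    # Load the nested patient record (the file name is hypothetical).
    with open("patient.json") as f:
        patient = json.load(f)

    # Flatten a few nested fields into one flat row, as in the tabular example.
    flat = {
        "patient_id": patient["patient_id"],
        "age": patient["personal_info"]["age"],
        "city": patient["personal_info"]["contact_info"]["address"]["city"],
        "conditions": [c["name"] for c in patient["medical_history"]["conditions"]],
        "allergies": patient["medical_history"]["allergies"],
    }
    print(flat)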
The same patient record in XML form. personal_info holds the basic personal and contact details, with the address nested under the <address> element; medical_history includes conditions, surgeries, and allergies, where each entry has a name and date for context.

    <patient>
      <patient_id>123456</patient_id>
      <personal_info>
        <first_name>John</first_name>
        <last_name>Doe</last_name>
        <age>45</age>
        <gender>Male</gender>
        <contact_info>
          <phone>+1-555-1234</phone>
          <email>[email protected]</email>
          <address>
            <street>123 Elm St</street>
            <city>Springfield</city>
            <state>IL</state>
            <zip>62701</zip>
          </address>
        </contact_info>
      </personal_info>
      <medical_history>
        <conditions>
          <condition>
            <name>Hypertension</name>
            <diagnosed_date>2015-06-23</diagnosed_date>
          </condition>
          <condition>
            <name>Type 2 Diabetes</name>
            <diagnosed_date>2018-02-14</diagnosed_date>
          </condition>
        </conditions>
        <surgeries>
          <surgery>
            <name>Appendectomy</name>
            <date>2010-08-20</date>
          </surgery>
        </surgeries>
        <allergies>
          <allergy>Penicillin</allergy>
          <allergy>Peanuts</allergy>
        </allergies>
      </medical_history>
    </patient>
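The XML form can be read with Python's standard xml.etree.ElementTree module. A minimal sketch is shown below; the file name patient.xml is a hypothetical placeholder:

    import xml.etree.ElementTree as ET

    # Parse the XML patient record (the file name is hypothetical).
    root = ET.parse("patient.xml").getroot()  # the <patient> element

    print(root.findtext("patient_id"))                        # 123456
    print(root.findtext("personal_info/contact_info/email"))  # [email protected]

    # Iterate over the nested condition entries.
    for cond in root.findall("medical_history/conditions/condition"):
        print(cond.findtext("name"), cond.findtext("diagnosed_date"))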
Data Sources
Databases
Relational databases serve as a primary source of structured data for
data mining. They store vast amounts of information in an organized
and efficient manner, making it accessible for analysis using queries
and data mining techniques.
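As a minimal illustration of querying a relational source, the sketch below uses Python's built-in sqlite3 module with a made-up patients table (any SQL database would work similarly):

    import sqlite3

    # Hypothetical relational table used as a data mining input.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (patient_id TEXT, age INTEGER, condition TEXT)")
    conn.execute("INSERT INTO patients VALUES ('123456', 45, 'Hypertension')")
    conn.execute("INSERT INTO patients VALUES ('789012', 60, 'Hypertension')")

    # A query extracts only the structured slice needed for analysis.
    rows = conn.execute(
        "SELECT condition, COUNT(*), AVG(age) FROM patients GROUP BY condition"
    ).fetchall()
    print(rows)  # [('Hypertension', 2, 52.5)]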
Data Warehouses
Data warehouses consolidate data from multiple sources into a
central repository, typically structured as multidimensional data cubes
for efficient analysis and reporting. They provide a historical
perspective on data, making them valuable for trend analysis and
decision-making.
Web Data
The World Wide Web has become a massive source of both structured
and unstructured data. Web pages, social media posts, online reviews,
and search queries offer valuable insights into user behavior, market
trends, and public sentiment.
Sensor Data
Sensor data is an example of streaming data,
generated continuously by devices that monitor physical phenomena.
Analyzing sensor data can reveal patterns in environmental
conditions, equipment performance, or human behavior.
The Importance of Data Cleaning
1. Handling Missing Values
There are several methods for handling missing values, which occur when no value is recorded for a particular attribute in a data tuple (a short pandas sketch follows the list):
• Ignoring the tuple with missing values, which is ineffective unless the tuple has multiple missing values.
• Manually filling in the missing value, which is time-consuming and not feasible for large datasets.
• Using a global constant, such as "Unknown," to replace all missing values. However, this may lead the mining algorithm to mistakenly
identify the constant as an interesting concept.
• Using a measure of central tendency, such as the mean or median, to fill in the missing value.
• Using the attribute mean or median for all samples belonging to the same class as the tuple with the missing value.
• Using the most probable value, which can be determined using techniques like regression, Bayesian inference, or decision tree induction.
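As referenced above, here is a minimal pandas sketch of two of these options (the small table is made up); it fills a numeric attribute with its median and a categorical attribute with the global constant "Unknown":

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age": [45, np.nan, 60, 38],
        "city": ["Springfield", None, "Dayton", "Springfield"],
    })

    # Measure of central tendency: fill missing ages with the attribute median.
    df["age"] = df["age"].fillna(df["age"].median())

    # Global constant: fill missing categorical values with "Unknown".
    df["city"] = df["city"].fillna("Unknown")
    print(df)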
2. Removing Duplicates
Detecting and removing duplicate tuples is important; duplicates can arise from data entry errors or from integrating data from multiple sources. These inconsistencies can lead to discrepancies and affect the accuracy of data mining results.
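A minimal pandas sketch of duplicate removal, using a made-up table with one repeated tuple:

    import pandas as pd

    df = pd.DataFrame({
        "patient_id": ["123456", "123456", "789012"],
        "last_name":  ["Doe", "Doe", "Smith"],
    })

    # Drop exact duplicate rows; a key column can also be used explicitly.
    deduped = df.drop_duplicates()
    by_key = df.drop_duplicates(subset=["patient_id"], keep="first")
    print(deduped)
    print(by_key)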
3. Dealing with Outliers
Outliers are data points that differ significantly from the other data points in the dataset. Several techniques can be used to identify and handle outliers (a code sketch of the boxplot rule follows the figures below):
• Data visualization techniques, like boxplots and scatter plots, can help identify outliers.
• Clustering, where outliers may be identified as values that fall outside of the set of clusters.
• Data smoothing techniques like binning and regression can be used to reduce the impact of outliers.
[Figures: scatter plot with flagged outliers; circular and phylogenic clustering dendrograms used to identify outliers]
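As referenced above, here is a minimal pandas sketch of the boxplot (IQR) rule for flagging outliers; the data values, including the injected outlier 420, are made up:

    import pandas as pd

    values = pd.Series([20, 35, 50, 75, 100, 420])  # 420 is a made-up outlier

    # Boxplot/IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    print(outliers)  # only 420 is flagged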
Data Transformation
1. Normalization
Normalization scales attribute data to fall within a smaller range, such as [-1, 1] or [0, 1]. This process is beneficial because the
measurement unit used for an attribute can influence the data analysis, and normalization attempts to give all attributes an
equal weight. This is particularly useful for classification algorithms involving neural networks or distance measurements, such
as nearest-neighbor classification and clustering. Three common normalization methods:
• Min-max normalization: Applies a linear transformation to map values to a new range, preserving the relationships
among the original data values.
• z-score normalization: Normalizes values based on the mean and standard deviation of the attribute, useful when the
actual minimum and maximum values are unknown or when outliers are present.
• Normalization by decimal scaling: Normalizes values by moving the decimal point, determined by the maximum
absolute value of the attribute.
2. Encoding Categorical Variables
Encoding is a common data transformation technique used to convert categorical data (data that represents categories or groups) into numerical representations suitable for machine learning algorithms. There are various encoding methods, such as one-hot encoding and label encoding, each with its own advantages and disadvantages.
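A minimal pandas sketch of both encodings; the example columns (gender, state) echo the patient record shown earlier, and the tiny table is made up:

    import pandas as pd

    df = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                       "state": ["IL", "OH", "IL"]})

    # One-hot encoding: one binary indicator column per category value.
    one_hot = pd.get_dummies(df, columns=["gender", "state"])

    # Label encoding: map each category to an integer code.
    df["gender_code"] = df["gender"].astype("category").cat.codes
    print(one_hot)
    print(df)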
Example: Min-max normalization
Data: [20, 35, 50, 75, 100], mapped to the new range [0, 1]

X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}} \times (new\_max - new\_min) + new\_min

Data (X)    X_normalized
20          0
35          0.1875
50          0.375
75          0.6875
100         1
Example: z-score normalization
Data: [20, 35, 50, 75, 100], mean \mu = 56

\sigma = \sqrt{\frac{(20-56)^2 + (35-56)^2 + (50-56)^2 + (75-56)^2 + (100-56)^2}{5}} = \sqrt{\frac{4070}{5}} = 28.53

X_{normalized} = \frac{X - \mu}{\sigma}

Data (X)    X_normalized
20          -1.26
35          -0.74
50          -0.21
75          0.67
100         1.54
Example: Normalization by decimal scaling
Data: [20, 350, 500, 750, 1000]

X_{normalized} = \frac{X}{10^{j}}, where j is the smallest integer such that \max(|X_{normalized}|) < 1

Determine j: j = 4 (divide by 10^4 = 10,000), because the maximum absolute value is 1000 and 1000 / 10^4 = 0.1 < 1.

Data (X)    X_normalized
20          0.002
350         0.035
500         0.05
750         0.075
1000        0.1
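A minimal NumPy sketch that reproduces the three worked examples above; the z-scores use the population standard deviation, matching sigma ≈ 28.53:

    import numpy as np

    x = np.array([20, 35, 50, 75, 100], dtype=float)

    # Min-max normalization to [0, 1]: 0, 0.1875, 0.375, 0.6875, 1.
    min_max = (x - x.min()) / (x.max() - x.min())

    # z-score normalization (population standard deviation, ddof=0).
    z = (x - x.mean()) / x.std()

    # Decimal scaling for the second data set; j = 4 because max |X| = 1000.
    y = np.array([20, 350, 500, 750, 1000], dtype=float)
    j = int(np.floor(np.log10(np.abs(y).max()))) + 1
    decimal_scaled = y / 10**j

    print(min_max)
    print(z.round(2))      # [-1.26 -0.74 -0.21  0.67  1.54]
    print(decimal_scaled)  # [0.002 0.035 0.05  0.075 0.1  ]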