
PAM UNIT 1

Introduction to Data Mining:


Data-Mining Application, drawbacks, A strategy for Data Mining: CRISP-
DM, Stages and Tasks in CRISP-DM

Data mining is the process of extracting useful information from large datasets.
It involves applying statistical techniques and algorithms to uncover patterns,
trends, and relationships that would otherwise be hidden.

Data Mining Techniques:


CRISP-DM in Predictive Analytical Modeling

The CRISP-DM methodology can be adapted to guide predictive analytical modeling projects. Here's a brief overview of the stages and tasks:

1. Business Understanding:

 Define the business problem and objectives.


 Identify the target variable to be predicted.
 Create a project plan that outlines the modeling approach.

2. Data Understanding:

 Collect and gather relevant data.


 Explore and analyze the data to understand its characteristics and quality.
 Identify potential features that may be useful for prediction.

3. Data Preparation:

 Clean and prepare the data for modeling.


 Handle missing values, outliers, and inconsistencies.
 Transform data into a suitable format, such as converting categorical
variables to numerical representations.
 Feature engineering: Create new features that may improve model
performance.
4. Modeling:

 Select appropriate predictive modeling techniques, such as regression, classification, or time series analysis.
 Build and train models using the prepared data.
 Evaluate model performance using appropriate metrics (e.g., accuracy,
precision, recall, F1-score).

5. Evaluation:

 Assess the model's effectiveness in meeting the business objectives.


 Compare different models and select the best-performing one.
 Refine the model if necessary by adjusting parameters or trying different
techniques.

6. Deployment:

 Integrate the model into operational systems.


 Monitor and maintain the model to ensure its accuracy and effectiveness
over time.

Life Cycle of a Data-Mining Project, Skills Needed for Data Mining.

The Life Cycle of a Data Mining Project

A data mining project typically follows a structured lifecycle, often represented by the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. Here's a breakdown of the key stages:

1. Business Understanding:
 Define the business problem: Clearly articulate the specific question or
challenge you want to address.
 Identify the data mining objectives: Determine what you hope to
achieve through data mining.
 Create a project plan: Outline the scope, timeline, and resources
required for the project.

2. Data Understanding:

 Collect and gather data: Acquire the necessary data from various
sources.
 Explore and analyze data: Get familiar with the data's characteristics,
quality, and potential issues.
 Identify data quality issues: Address any inconsistencies, errors, or
missing values.

3. Data Preparation:

 Clean and prepare data: Transform the data into a suitable format for
analysis.
 Handle missing values and outliers: Address these issues to ensure data
accuracy.
 Transform data into a suitable format: Convert data into a format that
is compatible with data mining algorithms.

4. Modeling:

 Select appropriate algorithms: Choose algorithms based on the nature of the data and the problem.
 Build and train models: Develop and train models using the prepared
data.
 Evaluate model performance: Assess the accuracy and effectiveness of
the models.

5. Evaluation:

 Assess model accuracy and performance: Evaluate how well the models meet the project objectives.
 Compare models: Compare different models to identify the best-
performing one.
 Refine models if necessary: Make adjustments to improve model
performance.

6. Deployment:
 Integrate models into operational systems: Incorporate the chosen
model into your business processes.
 Monitor and maintain models: Continuously evaluate the model's
performance and update it as needed.

Key Considerations:

 Collaboration: Throughout the project, ensure effective collaboration between data scientists, domain experts, and stakeholders.
 Iterative Process: The data mining process is often iterative, allowing
for adjustments and improvements based on findings and feedback.
 Ethical Considerations: Address privacy concerns and ensure data is
used responsibly and ethically.

Skills You Need for Data Mining


▪ COMPUTER SCIENCE SKILLS
1. Programming/statistics language: R, Python, C++, Java, Matlab, SQL,
SAS
2. Big data processing frameworks: Hadoop, Storm, Samza, Spark, Flink
3. Operating System: Linux
4. Database knowledge: Relational Databases (SQL or Oracle) & Non-Relational Databases (MongoDB, Cassandra, Dynamo, CouchDB)
▪ STATISTICS AND ALGORITHM SKILLS
1. Basic Statistics Knowledge: Probability, Probability Distribution,
Correlation, Regression, Linear Algebra, Stochastic Process
2. Data structures include arrays, linked lists, stacks, queues, trees, hash tables, sets, etc., and common algorithms include sorting, searching, dynamic programming, recursion, etc.
3. Machine Learning/Deep Learning Algorithm

▪ OTHER REQUIRED SKILLS


1. Project Experience
2. Communication & Presentation Skills
Working with Modeler:
Introducing Nodes and Streams, Explore the user Interface, Creating
Streams-General Rules, Placing Nodes, Managing Nodes, Managing
Connections, Encapsulating Nodes in a super Node, Generating Nodes
from Output, Running Streams.

IBM- SPSS Modeler


▪ IBM® SPSS® Modeler is an analytical platform that enables
organizations and researchers to uncover patterns in data and build
predictive models to address key business outcomes.
▪ Moreover, aside from a suite of predictive algorithms, SPSS Modeler also
contains an extensive array of analytical routines that include data
segmentation procedures, association analysis, anomaly detection, feature
selection and time series forecasting.
▪ These analytical capabilities, coupled with Modeler’s rich
functionality in the areas of data integration and preparation tasks,
enable users to build entire end-to-end applications from the reading of
raw data files to the deployment of predictions and
recommendations back to the business.
▪ As such, IBM® SPSS® Modeler is widely regarded as one of the most mature and powerful applications of its kind.

SPSS Modeler GUI: Stream Canvas


▪ The stream canvas is the main work area in Modeler.
▪ It is located in the center of the Modeler user interface.
▪ The stream canvas can be thought of as a surface on which to place icons
or nodes.
▪ These nodes represent operations to be carried out on the data.
▪ Once nodes have been placed on the stream canvas, they can be linked
together to form a stream.
SPSS Modeler GUI: Palettes
▪ Nodes (operations on the data) are contained in palettes.
▪ The palettes are located at the bottom of the Modeler user interface.
▪ Each palette contains a group of related nodes that are available for you
to add to the data stream.
▪ For example, the Sources palette contains nodes that you can use to read
data into Modeler, and the Graphs palette contains nodes that you can use
to explore your data visually.
▪ The icons that are shown depend on the active, selected palette.

Building streams
▪ As was mentioned previously, Modeler allows users to mine data visually
on the stream canvas.
▪ This means that you will not be writing code for your data mining
projects; instead you will be placing nodes on the stream canvas.
▪ Remember that nodes represent operations to be carried out on the data.
So once nodes have been placed on the stream canvas, they need to be
linked together to form a stream.
▪ A stream represents the flow of data going through a number of
operations (nodes).
Modeler is a powerful data mining tool that utilizes a graphical interface to
build and execute data mining workflows. The fundamental building blocks of
these workflows are nodes and streams.

Nodes

Nodes represent individual operations or tasks within a data mining process. They can perform various functions, such as:

 Data Input/Output: Reading data from files, databases, or other sources, and writing results to different formats.
 Data Manipulation: Transforming, filtering, and aggregating data.
 Modeling: Building and training predictive models.
 Evaluation: Assessing model performance.
 Deployment: Integrating models into operational systems.

Common node types include:

 Source nodes: Read data from various sources (e.g., CSV files,
databases).
 Filter nodes: Select or filter data based on specific criteria.
 Transform nodes: Modify data attributes (e.g., normalization,
imputation).
 Modeling nodes: Build and train models (e.g., decision trees, neural
networks).
 Evaluation nodes: Assess model performance (e.g., confusion matrix,
ROC curve).
 Output nodes: Write results to different formats (e.g., CSV files,
databases).

Purpose of Nodes in Predictive Analytical Modeling

In predictive analytical modeling, nodes are essential components used in various stages of data processing, analysis, and model building. They represent different operations, transformations, or decisions applied to the data as it flows through the modeling process. Nodes are commonly used in visual programming environments like KNIME, IBM SPSS Modeler, or RapidMiner, where they provide a modular and intuitive way to construct predictive models.

Key Purposes of Nodes:

1. Data Input and Output:


o Purpose: Nodes are used to import data from various sources (e.g.,
databases, CSV files, or data streams) and export the results after
processing.
o Example: A "File Reader" node to load a CSV file containing
customer transaction data into the model.

2. Data Preprocessing:
o Purpose: Nodes are used for cleaning, transforming, and preparing
data before analysis.
o Example: A "Missing Value Imputation" node to handle missing
data in a dataset, replacing null values with mean or median values.

3. Data Transformation:
o Purpose: Nodes transform data to make it suitable for modeling.
This includes normalization, feature extraction, and aggregation.
o Example: A "Normalization" node to scale numerical features to a
standard range (e.g., 0 to 1) for consistent model input.

4. Feature Selection:
o Purpose: Nodes help in selecting the most relevant features for the
model, reducing dimensionality and improving performance.
o Example: A "Feature Selection" node to identify and retain the
most significant variables that influence the target outcome.

5. Model Building:
o Purpose: Nodes are used to apply algorithms for creating
predictive models based on the input data.
o Example: A "Decision Tree" node to build a classification model
that predicts whether a customer will churn based on their
transaction history.

6. Model Evaluation:
o Purpose: Nodes are used to assess the performance of the
predictive model using various metrics.
o Example: A "Confusion Matrix" node to evaluate the accuracy,
precision, recall, and F1-score of a classification model.
7. Model Deployment:
o Purpose: Nodes facilitate the deployment of predictive models
into production environments where they can be applied to new
data.
o Example: A "Score" node to apply a trained model to a new
dataset, predicting outcomes such as customer churn or product
demand.

8. Visualization:
o Purpose: Nodes provide visual outputs to help interpret the data
and model results, making it easier to understand patterns and
insights.
o Example: A "Scatter Plot" node to visualize the relationship
between two variables, like customer age and spending habits.

Example Workflow Using Nodes:

Imagine you're building a predictive model to forecast customer churn in a telecom company. Here's how you might use different nodes in a visual programming tool:

1. Data Input:
o Use a "File Reader" node to load customer data, including
demographics, usage patterns, and billing information.

2. Data Preprocessing:
o Apply a "Missing Value Imputation" node to handle any gaps in
the data.
o Use a "Filter" node to remove irrelevant features like customer ID.

3. Data Transformation:
o Apply a "Normalization" node to scale features like monthly
charges and total calls.

4. Model Building:
o Use a "Random Forest" node to create a model that predicts churn
based on the processed data.

5. Model Evaluation:
o Utilize a "Cross-Validation" node to assess the model's accuracy
and prevent overfitting.

6. Visualization:
o Add a "ROC Curve" node to visualize the model's performance in
distinguishing between churn and non-churn customers.

7. Model Deployment:
o Use a "Score" node to apply the model to new customer data,
predicting the likelihood of churn.
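Although Modeler builds this workflow visually, the same sequence of steps can be sketched in code. Below is a minimal scikit-learn (Python) analogue of the node workflow above, not the Modeler tool itself; the data is generated on the fly and the column names (monthly_charges, total_calls, churn) are invented purely for illustration.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer data (stands in for the "File Reader" node).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_id": range(1, 201),
    "monthly_charges": rng.normal(70, 20, 200).round(2),
    "total_calls": rng.integers(0, 300, 200).astype(float),
    "churn": rng.integers(0, 2, 200),
})
df.loc[5:10, "total_calls"] = np.nan           # simulate missing values

X = df.drop(columns=["customer_id", "churn"])  # "Filter" node: drop the ID
y = df["churn"]

# Imputation + normalization + random forest, chained like nodes in a stream.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # "Missing Value Imputation"
    ("scale", MinMaxScaler()),                      # "Normalization"
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# "Cross-Validation" node: estimate performance on unseen data (ROC AUC).
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("Mean ROC AUC:", round(scores.mean(), 3))

# "Score" node: fit on all data, then predict churn probability for new customers.
pipe.fit(X, y)
new_customers = pd.DataFrame({"monthly_charges": [95.0, 30.0],
                              "total_calls": [10.0, 250.0]})
print(pipe.predict_proba(new_customers)[:, 1])

Each pipeline step plays the role of one node, and the fitted pipeline plays the role of the model nugget that is later used for scoring.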

Streams

Streams are the connections between nodes in a Modeler workflow. They define
the flow of data from one node to another. Streams can be configured to pass
specific data columns or subsets of data.

Key aspects of streams:

 Direction: Streams have a direction, indicating the flow of data from one
node to another.
 Data types: Streams can carry different data types (e.g., numeric,
categorical).
 Configuration: Streams can be configured to pass specific data columns
or subsets of data.

By connecting nodes with streams, you can create complex data mining
workflows that automate tasks and provide valuable insights.

Creating Streams in Modeler

General Rules for Creating Streams:

 Direction: Streams always flow from left to right.


 Compatibility: Nodes must be compatible in terms of data types and
expected inputs/outputs.
 Multiple Inputs: Some nodes can accept multiple input streams.
 Multiple Outputs: Some nodes can produce multiple output streams.

Placing Nodes:

 Drag and drop: Click on a node in the palette and drag it onto the
canvas.
 Keyboard shortcuts: Use keyboard shortcuts to create nodes (e.g.,
Ctrl+N for a new node).

Managing Nodes:
 Renaming: Double-click on a node to edit its name.
 Copying and pasting: Copy and paste nodes to create duplicates.
 Deleting: Right-click on a node and select "Delete."
 Grouping: Group nodes together for organization.

Managing Connections:

 Creating connections: Click on the output port of one node and drag it
to the input port of another node.
 Deleting connections: Right-click on a connection and select "Delete."
 Modifying connections: Right-click on a connection to modify its
properties (e.g., pass specific columns).

Encapsulating Nodes in a Super Node:

 Create a super node: Right-click on the canvas and select "Create Super
Node."
 Add nodes: Drag and drop nodes into the super node.
 Connect nodes: Connect the nodes within the super node.
 Configure the super node: Set the input and output ports of the super
node.

Generating Nodes from Output:

 Right-click on an output port: Right-click on the output port of a node and select "Create Node from Output."
 Configure the new node: Set the properties of the new node based on
the output data.

Running Streams:

 Single node: Double-click on a node to run it.


 Multiple nodes: Right-click on a node and select "Run Stream" to run all
connected nodes.
 Entire workflow: Right-click on the canvas and select "Run Workflow."

Graph Nodes in Modeler

Graph nodes in Modeler are a specialized type of node used for visualizing and
analyzing data in a graphical format. They allow you to create interactive charts,
graphs, and diagrams to gain insights into your data.

Common types of graph nodes:


 Bar chart: Displays data as vertical or horizontal bars.
 Line chart: Plots data points connected by lines.
 Scatter plot: Displays data points as individual points on a graph.
 Pie chart: Represents data as slices of a pie.
 Histogram: Displays the distribution of a numerical variable.
 Treemap: Represents hierarchical data as nested rectangles.
 Network graph: Visualizes relationships between entities in a network.

Collecting Initial Data:


Rectangular Data Structure, The Unit Analysis, Field Storages, Field
Measurement Levels, Storage and Measurement level, Fields Instantiation,
Importing Data, The Sources Dialog Boxes- Data Tab, Importing Text
Files, Exporting data.

Rectangular Data Structure

A rectangular data structure, sometimes called a data matrix or flat file, is a common format used in data mining. It consists of rows and columns, where each row represents a record or observation, and each column represents a field or attribute. This structure is widely used due to its simplicity and compatibility with various data analysis tools.
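As a small illustration, a rectangular dataset maps naturally onto a pandas DataFrame (Python); the customer records below are invented.

import pandas as pd

# Each row is a record (one customer); each column is a field (attribute).
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 52, 29],
    "region": ["North", "South", "North"],
    "monthly_spend": [42.50, 88.10, 19.99],
})
print(customers.shape)   # (3, 4): 3 records, 4 fields
print(customers.dtypes)  # one storage type per field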

Unit Analysis

The unit of analysis describes what each record (row) in the dataset represents, such as a customer, an account, or a transaction. Defining it explicitly is crucial for ensuring data consistency and accuracy, especially when records are compared or aggregated.

Field Storages

Field storages refer to the data types used to store different fields. Common
field storages include:

 Numeric: For numerical data (e.g., integers, floating-point numbers).


 Text: For textual data (e.g., strings, characters).
 Date/Time: For date and time values.
 Boolean: For logical values (e.g., true, false).

Field Measurement Levels


Field measurement levels determine the type of data and the appropriate
statistical operations that can be performed on them. Common measurement
levels include:

 Nominal: Categorical data with no inherent order (e.g., colors, countries).


 Ordinal: Categorical data with an inherent order (e.g., rankings,
satisfaction levels).
 Interval: Numerical data with equal intervals between values but no true
zero point (e.g., temperature in Celsius).
 Ratio: Numerical data with a true zero point (e.g., weight, distance).

Storage and Measurement Level

The storage type and measurement level of a field should be compatible. For
example, a nominal field should typically be stored as text, while a ratio field
should be stored as a numeric data type.

Fields Instantiation

Fields instantiation involves creating instances or values for each field in the
data. This involves entering or importing data into the corresponding fields of
the rectangular data structure.

Importing Data

 Data sources: Data can be imported from various sources, including text
files (CSV, TSV), databases, spreadsheets, and other data repositories.
 Data formats: Ensure that the data format is compatible with the data
mining tool being used.
 Data cleaning: During the import process, it's often necessary to clean
the data to address any inconsistencies or errors.

The Sources Dialog Boxes- Data Tab

The Sources dialog box in many data mining tools provides options for
selecting data sources, specifying file formats, and configuring import settings.
The Data tab within this dialog box typically allows you to:

 Browse for files: Locate the files containing the data.


 Select file formats: Choose the appropriate file format (e.g., CSV, TSV,
XLSX).
 Specify delimiters: Define the characters used to separate fields in text
files.
 Handle headers: Indicate whether the first row contains column headers.
 Set data types: Specify the data types for each field.

Importing Text Files

When importing text files, it's important to consider factors such as:

 File format: Ensure compatibility with the data mining tool.


 Encoding: Specify the character encoding (e.g., UTF-8, ASCII).
 Delimiters: Identify the characters used to separate fields.
 Headers: Determine if the file contains headers.
 Data types: Specify the appropriate data types for each field.
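A hedged pandas sketch of these import settings follows; the field names and the inline text standing in for a file are invented.

import io
import pandas as pd

# A small inline string stands in for a semicolon-delimited text file on disk;
# for a real file you would pass its path plus encoding="utf-8".
raw = io.StringIO(
    "store_id;sale_date;units;revenue\n"
    "S01;2024-01-05;12;240.50\n"
    "S02;2024-01-06;7;133.00\n"
)

sales = pd.read_csv(
    raw,
    sep=";",                    # delimiter separating the fields
    header=0,                   # first row holds the column headers
    dtype={"store_id": "string", "units": "Int64", "revenue": "float64"},
    parse_dates=["sale_date"],  # store this field as date/time
)
print(sales.dtypes)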

Exporting Data

After data mining operations, the results can be exported to various formats for
further analysis or reporting. Common export formats include:

 Text files: CSV, TSV, or other text-based formats.


 Spreadsheets: Excel, Google Sheets.
 Databases: Relational databases (e.g., MySQL, PostgreSQL).
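A short pandas sketch of the export side; the result table and file names are placeholders.

import pandas as pd

results = pd.DataFrame({"customer_id": [101, 102], "churn_score": [0.83, 0.12]})

results.to_csv("scores.csv", index=False)        # text file (CSV)
# results.to_excel("scores.xlsx", index=False)   # spreadsheet (requires openpyxl)
# results.to_sql("churn_scores", engine)         # relational database via a SQLAlchemy engine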
Understanding Your Data:
Data Audit, Using Statistics Node and Graphs Nodes for Reporting,
Describe Types of Invalid Values, Action for Invalid Values, Dealing with
Missing Data, Reporting Blanks in a Data Audit.

Understanding Your Data: A Critical Step

Before diving into data mining tasks, it's crucial to thoroughly understand your
data. This involves conducting a data audit to identify potential issues and
ensure data quality.

Data Audit

A data audit involves examining the data to assess its accuracy, completeness,
consistency, and relevance. Key aspects of a data audit include:

 Data quality assessment: Checking for errors, inconsistencies, and


missing values.
 Data profiling: Analyzing data characteristics, such as distribution,
range, and correlation.
 Data consistency checks: Verifying data integrity and consistency across
different sources.
 Data completeness checks: Ensuring that all necessary data is present.

Using Statistics Node and Graphs Nodes for Reporting

Statistics Node: This node provides summary statistics about your data, such
as:
 Count: The number of non-missing values.
 Mean: The average value.
 Median: The middle value when data is sorted.
 Mode: The most frequent value.
 Minimum and maximum: The smallest and largest values.
 Standard deviation: A measure of data dispersion.

Graphs Nodes: These nodes allow you to visualize data and identify patterns or
anomalies. Common graph types include:

 Histograms: Show the distribution of a numerical variable.


 Scatter plots: Plot two numerical variables against each other.
 Box plots: Display the distribution of a numerical variable, including
quartiles and outliers.
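Outside Modeler, the same kind of audit output can be approximated with pandas and matplotlib; the data below is randomly generated purely for illustration.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "income": rng.normal(50_000, 12_000, 500).round(0),
    "age": rng.integers(18, 80, 500),
})

# Summary statistics (count, mean, std, min, quartiles/median, max),
# similar to what a Statistics node reports.
print(df.describe())
print("mode of age:", df["age"].mode().iloc[0])

# Graphs: a histogram of a numeric field and a box plot to spot outliers.
df["income"].plot.hist(bins=30, title="Income distribution")
plt.show()
df.boxplot(column="income")
plt.show()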

Types of Invalid Values

Invalid values can occur in various forms:

 Outliers: Values that are significantly different from the majority of the
data.
 Incorrect values: Values that are simply wrong or inaccurate.
 Missing values: Values that are missing or unknown.

Actions for Invalid Values

 Outliers: Depending on the context, you might remove, replace, or keep outliers.
 Incorrect values: Correct or remove incorrect values if possible.
 Missing values: Impute missing values using techniques like mean,
median, mode, or more sophisticated methods.
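As an example of flagging outliers before deciding what to do with them, the sketch below applies the common interquartile-range (IQR) rule; the 1.5 × IQR threshold is a convention, not a law, and the numbers are invented.

import pandas as pd

values = pd.Series([52, 48, 55, 50, 49, 51, 300])   # 300 looks suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)                 # flags the value 300

# Possible actions: drop, cap ("winsorize"), or keep and investigate.
capped = values.clip(lower=lower, upper=upper)
print(capped)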

Dealing with Missing Data

Strategies for handling missing data include:

 Deletion: Remove rows or columns with missing values.


 Imputation: Replace missing values with estimated values.
 Using missing value indicators: Create a new variable to indicate
missing values.
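A small pandas sketch of these three strategies, on invented data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 41, 33, np.nan],
                   "income": [32_000, 45_000, np.nan, 51_000, 38_000]})

# 1. Deletion: drop rows that contain any missing value.
dropped = df.dropna()

# 2. Imputation: replace missing values with an estimate (median here).
imputed = df.fillna(df.median(numeric_only=True))

# 3. Missing-value indicator: keep a flag recording where data was missing.
flagged = df.assign(age_missing=df["age"].isna())

print(dropped.shape, imputed.isna().sum().sum(), flagged["age_missing"].tolist())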

Reporting Blanks in a Data Audit


When reporting on a data audit, clearly document the number and types of
blanks encountered. This information can be valuable for understanding data
quality and making decisions about data cleaning and imputation.

By conducting a thorough data audit and addressing potential issues, you can
ensure that your data is clean, accurate, and ready for analysis.

Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies within a dataset. It's a crucial step in data mining as it ensures data quality and reliability, which ultimately impacts the accuracy and meaningfulness of the insights derived from the data.

Common Data Cleaning Tasks

 Handling missing values: Addressing missing data points using techniques like imputation or deletion.
 Dealing with outliers: Identifying and handling extreme values that
deviate significantly from the majority of the data.
 Correcting errors: Identifying and correcting errors, such as typos,
incorrect formatting, or inconsistencies.
 Addressing inconsistencies: Ensuring data consistency across different
sources or fields.
 Normalizing data: Transforming data into a standard format (e.g.,
scaling, standardization).
 Handling duplicates: Identifying and removing duplicate records.

Setting the Unit of Analysis: The Required Unit of Analysis, Methods to create datasets with the required unit of analysis, Distinct Records, Aggregating Records, Setting to Flag Fields.
Setting the Unit of Analysis

The unit of analysis is the fundamental unit of observation in a data mining project. It determines the level at which you'll analyze data and draw conclusions.

The Required Unit of Analysis

The appropriate unit of analysis depends on the specific research question or business objective. Some common units of analysis include:

 Individual: Analyzing data at the level of individual entities (e.g., customers, patients, employees).
 Group: Analyzing data at the level of groups or categories (e.g.,
departments, regions, age groups).
 Event: Analyzing data at the level of individual events or occurrences
(e.g., transactions, interactions).
 Time period: Analyzing data over specific time intervals (e.g., days,
months, years).

Methods to Create Datasets with the Required Unit of Analysis

1. Distinguishing Records:
o Unique identifiers: Assign unique identifiers to each unit of
analysis (e.g., customer IDs, transaction IDs).
o Timestamps: Use timestamps to distinguish events or occurrences.
o Hierarchical structures: Create hierarchical structures to
represent relationships between different levels of analysis.

2. Aggregating Records:
o Grouping: Group records based on specific criteria (e.g., customer
segment, product category).
o Summary statistics: Calculate summary statistics for each group
(e.g., mean, median, total).
o Aggregation functions: Use functions like SUM, AVG, COUNT,
MIN, and MAX to aggregate data.

3. Setting to Flag Fields:


o Flag fields: Create additional fields to indicate specific
characteristics or conditions (e.g., a flag to indicate whether a
customer is a high-value customer).
o Conditional logic: Use conditional logic to set flag fields based on
specific criteria.

Example:

If you're analyzing customer purchasing behavior, you might:

 Distinguish records: Assign unique customer IDs and timestamps to each transaction.
 Aggregate records: Group transactions by customer to calculate total
purchases, average purchase value, and purchase frequency.
 Set to flag fields: Create a flag field to indicate whether a customer has
made a purchase in the past month.
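The same three steps can be sketched with pandas; the transaction data and the 30-day "active" rule below are invented for illustration.

import pandas as pd

# Transaction-level data: the unit of analysis here is one transaction.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "tx_date": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-04-02",
                               "2024-05-28", "2024-05-30", "2024-03-15"]),
    "amount": [25.0, 40.0, 10.0, 55.0, 5.0, 80.0],
})

# Aggregate records: move to one row per customer (the required unit of analysis).
customers = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_purchases=("amount", "count"),
    last_purchase=("tx_date", "max"),
).reset_index()

# Set a flag field: did the customer buy within the last month of the data?
cutoff = tx["tx_date"].max() - pd.Timedelta(days=30)
customers["active_last_month"] = customers["last_purchase"] >= cutoff

print(customers)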
Feature Scaling
▪ Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same scale so that no single variable dominates the others.
▪ For example, age and salary columns are not on the same scale: the salary values dominate the age values and can produce an incorrect result. To remove this issue, we need to perform feature scaling before training a machine learning model, as illustrated in the sketch below.
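A minimal scikit-learn sketch of the two most common scaling methods, using invented age and salary values:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [25, 32, 47, 51, 62],
                   "salary": [30_000, 48_000, 72_000, 65_000, 90_000]})

# Min-max scaling squeezes both columns into the 0-1 range.
minmax = MinMaxScaler().fit_transform(df)

# Standardization rescales each column to mean 0 and standard deviation 1.
standard = StandardScaler().fit_transform(df)

print(pd.DataFrame(minmax, columns=df.columns).round(2))
print(pd.DataFrame(standard, columns=df.columns).round(2))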
Integrating Data:
Methods to Integrate Data, Appending Records, Merging Fields, Sampling
Records, Caching Data
Integrating Data in Data Mining

Data integration is the process of combining data from multiple sources into a
unified dataset. This is often necessary when dealing with data that is stored in
different formats or locations.

Methods to Integrate Data

1. Appending Records:
o Vertical concatenation: Stack records from multiple datasets that share the same fields (e.g., monthly transaction files), producing one longer dataset.
o Horizontal concatenation: Combine fields from multiple datasets with the same number of records.

2. Merging Fields:
o Join operations: Combine data from two or more datasets based
on matching values in common fields (e.g., inner join, outer join).
o Field concatenation: Combine fields from different datasets into a
new field.

3. Sampling Records:
o Random sampling: Select a random subset of records from a
dataset.
o Stratified sampling: Select a subset of records from each stratum
or category within the dataset.
o Cluster sampling: Select a subset of clusters from a dataset and
then sample records within those clusters.

4. Caching Data:
o Temporary storage: Store frequently accessed data in a temporary
cache to improve performance.
o Cache management: Implement strategies for managing the
cache, such as eviction policies and expiration times.
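The first three methods map directly onto pandas operations (caching is handled by the mining tool itself); the tables below are invented.

import pandas as pd

jan = pd.DataFrame({"customer_id": [1, 2], "amount": [20.0, 35.0]})
feb = pd.DataFrame({"customer_id": [1, 3], "amount": [15.0, 50.0]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "region": ["North", "South", "North"]})

# Appending records: stack datasets with the same fields (vertical concatenation).
transactions = pd.concat([jan, feb], ignore_index=True)

# Merging fields: join on a common key to add columns from another dataset.
combined = transactions.merge(profiles, on="customer_id", how="left")

# Sampling records: draw a random subset, e.g. 50% of the rows.
sample = combined.sample(frac=0.5, random_state=0)

print(combined)
print(sample)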

Considerations for Data Integration

 Data quality: Ensure that the data from different sources is consistent
and of high quality.
 Data formats: Convert data to a common format if necessary.
 Data relationships: Understand the relationships between different
datasets to determine the appropriate integration method.
 Data volume: Consider the volume of data being integrated and the
potential performance implications.
 Privacy and security: Protect sensitive data during integration and
ensure compliance with relevant regulations.

Dummy Variable Trap

The dummy variable trap is a common issue that arises when using
categorical variables in regression analysis. It occurs when including redundant
dummy variables in a model, leading to multicollinearity.

Multicollinearity is a statistical condition where two or more independent variables are highly correlated with each other. This can make it difficult to accurately estimate the individual effects of these variables on the dependent variable.

How the Dummy Variable Trap Occurs:


When representing a categorical variable with k levels, you typically need k-1
dummy variables. Each dummy variable represents one of the k levels, and the
omitted level serves as the reference category. However, if you include all k
dummy variables, one of them can be perfectly predicted from the others,
leading to multicollinearity.
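A pandas sketch of the trap and the usual fix (dropping one dummy so the omitted level becomes the reference category); the colour column is invented.

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

# All k dummies: each one is perfectly predictable from the others (the trap).
all_dummies = pd.get_dummies(df["colour"])

# k-1 dummies: one level ("blue" here, the first alphabetically) becomes the
# reference category, which removes the redundancy.
safe_dummies = pd.get_dummies(df["colour"], drop_first=True)

print(all_dummies.columns.tolist())   # ['blue', 'green', 'red']
print(safe_dummies.columns.tolist())  # ['green', 'red']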

Introduction To Modeling:
Modeling Objectives, Objectives And Roles In The Type Node, Types Of
Classification Models, Rule Induction Models, Traditional Statistical
Models, Machine Learning Models, Data Cleaning, Outlier Detection,
Feature Scaling, Supervised Learning Models, Un-Supervised Learning
Models, Running Classification Models – Decision Tree and Random
Forest, Modeling Results: The Model Nugget, Evaluating Classification
Models, Applying Classification Models
Most Common Algorithms
▪ Naïve Bayes Classifier Algorithm (Supervised Learning - Classification)
▪ Linear Regression (Supervised Learning/Regression)
▪ Logistic Regression (Supervised Learning/Regression)
▪ Decision Trees (Supervised Learning – Classification/Regression)
▪ Random Forests (Supervised Learning – Classification/Regression)
▪ K-Nearest Neighbours (Supervised Learning)
▪ K Means Clustering Algorithm (Unsupervised Learning - Clustering)
▪ Support Vector Machine Algorithm (Supervised Learning -
Classification)
▪ Artificial Neural Networks (Supervised/Unsupervised/Reinforcement Learning)

Classification: The ML program draws a conclusion from observed values and determines to what category new observations belong.
For example, when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing observational data and filter the emails accordingly.

Regression: The ML program must estimate and understand the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables, making it particularly useful for prediction and forecasting.
Forecasting: Forecasting is the process of making predictions about the future
based on the past and present data, and is commonly used to analyze trends.

Diagram: supervised vs. unsupervised learning.

Example: reinforcement learning.
Validation Data:
▪ Validation data is a subset separated from the training data, and it is used to validate the model during the training process.
▪ During training, validation data infuses new data into the model that it
hasn’t evaluated before.
▪ Validation data provides the first test against unseen data, allowing data
scientists to evaluate how well the model makes predictions based on the
new data.
▪ Not all data scientists use validation data, but it can provide some
helpful information to optimize hyperparameters, which influence
how the model assesses data.
▪ There is some semantic ambiguity between validation data and testing
data. Some organizations call testing datasets “validation datasets.”
Ultimately, if there are three datasets to tune and check ML algorithms,
validation data typically helps tune the algorithm and testing data
provides the final assessment.
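One common way to carve out training, validation, and test sets with scikit-learn is shown below; the 60/20/20 proportions are a convention rather than a rule, and the built-in iris data merely stands in for a real dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set (20%), then split the remainder into
# training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))   # roughly 90 / 30 / 30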

Decision Tree Classification

Decision tree classification is a supervised machine learning algorithm that creates a tree-like model to predict categorical outcomes. The tree consists of nodes (decisions) and branches (possible outcomes), where each branch leads to another node or a leaf node (prediction).

How Decision Trees Work

1. Root Node: The tree starts with a root node, representing the entire
dataset.
2. Splitting: The algorithm selects the best attribute to split the data at the
root node based on a chosen criterion (e.g., information gain, Gini
impurity).
3. Creating Branches: Branches are created for each possible value of the
chosen attribute.
4. Recursive Process: The process is repeated for each new node, creating
subtrees until a stopping criterion is met (e.g., all data points in a node
belong to the same class).
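A minimal scikit-learn decision tree on the built-in iris dataset shows these steps in code form; the max_depth value is an arbitrary stopping criterion chosen for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# criterion="gini" uses Gini impurity to pick each split (CART-style);
# max_depth is a simple stopping criterion that also limits overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=load_iris().feature_names))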

Decision Tree Algorithms

 ID3 (Iterative Dichotomiser 3): Uses information gain as the splitting criterion.
 C4.5: An extension of ID3 that handles missing values and continuous
attributes.
 CART (Classification and Regression Trees): Uses Gini impurity as
the splitting criterion and can handle both classification and regression
tasks.
 Random Forest: An ensemble method that combines multiple decision
trees to improve accuracy and reduce overfitting.

Advantages of Decision Trees

 Interpretability: Decision trees are easy to understand and visualize, making them suitable for explaining predictions.
 Non-parametric: Decision trees do not require assumptions about the
underlying data distribution.
 Handling mixed data: They can handle both numerical and categorical
data.
 Scalability: Decision trees can handle large datasets efficiently.

Disadvantages of Decision Trees


 Overfitting: Decision trees can overfit the training data, leading to poor
performance on new data.
 Sensitivity to noise: Decision trees can be sensitive to noise in the data.
 Bias towards frequent classes: Decision trees may be biased towards
frequent classes in the data.

Applications of Decision Trees

 Customer churn prediction: Predicting which customers are likely to discontinue their service.
 Fraud detection: Identifying fraudulent transactions.
 Medical diagnosis: Predicting diseases based on patient symptoms.
 Loan approval: Determining whether to approve or deny loan
applications.
 Market segmentation: Identifying distinct groups of customers with
similar characteristics.

▪ Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two or
more homogeneous sets.
▪ Leaf Node: Leaf nodes are the final output node, and the tree cannot
be segregated further after getting a leaf node.
▪ Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given conditions.
▪ Branch/Sub Tree: A subtree formed by splitting a node.
▪ Pruning: Pruning is the process of removing the unwanted
branches from the tree.
▪ Parent/Child node: The root node of the tree is called the parent
node, and other nodes are called the child nodes.
▪ Below are the two reasons for using the Decision tree:
1. Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
2. The logic behind the decision tree can be easily understood because
it shows a tree-like structure.

Confusion Matrix:
▪ A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test data
for which the true values are known.
▪ Consider binary classification, where each prediction falls into one of four cells of the confusion matrix:
o True Positive (TP): a positive case correctly predicted as positive.
o True Negative (TN): a negative case correctly predicted as negative.
o False Positive (FP): a negative case incorrectly predicted as positive.
o False Negative (FN): a positive case incorrectly predicted as negative.

1. Accuracy

 Definition: Accuracy is the most basic metric and represents the overall
proportion of correct predictions made by the model. It's calculated by
dividing the number of correctly classified instances by the total number
of instances.
 Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Interpretation: A high accuracy value indicates that the model is making
a good number of correct predictions overall. However, accuracy alone
can be misleading, especially in cases of imbalanced class distributions.

2. Precision

 Definition: Precision focuses on the positive predictions and measures the proportion of those predictions that were actually correct. It essentially tells you how often the model correctly identified a positive case.
 Formula: Precision = TP / (TP + FP)
 Interpretation: A high precision value indicates that the model is precise
in its positive predictions and isn't making many false positives. This is
important in situations where incorrectly identifying a positive case can
be costly (e.g., spam detection).

3. Recall

 Definition: Recall, also known as sensitivity, focuses on the actual positive cases and measures the proportion of those cases that were correctly identified by the model. It tells you how well the model finds all the relevant positive cases.
 Formula: Recall = TP / (TP + FN)
 Interpretation: A high recall value indicates that the model is good at
identifying all the true positive cases and isn't missing many relevant
instances. This is important in scenarios where missing a positive case
can be detrimental (e.g., medical diagnosis).

4. F1-score

 Definition: The F1-score is the harmonic mean of precision and recall, combining both metrics into a single value. It provides a balance between the two and is a good measure of a model's overall effectiveness.
 Formula: F1-score = 2 * (Precision * Recall) / (Precision + Recall)
 Interpretation: A high F1-score indicates that the model is performing
well in terms of both precision and recall. It strikes a balance between not
making many false positives and not missing many true positives.
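The four metrics above can be computed directly with scikit-learn; the true and predicted labels below are invented.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For labels [0, 1], ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean of the two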

Random Forest:
• A random forest is a machine learning technique that’s used to
solve regression and classification problems.
• It utilizes ensemble learning, which is a technique that
combines many classifiers to provide solutions to complex problems.
• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.

How does Random Forest algorithm work?


• Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
• The working process can be explained in the steps below:
• Step-1: Select random K data points from the training set.
• Step-2: Build the decision trees associated with the selected data points
(Subsets).
• Step-3: Choose the number N for decision trees that you want to build.
• Step-4: Repeat Step 1 & 2.
• Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
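A short scikit-learn sketch of the majority-vote idea, using the built-in breast-cancer dataset as a stand-in for real data; n_estimators plays the role of N.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# n_estimators is the number N of decision trees; each tree sees a bootstrap
# sample of the training data, and the forest predicts by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", round(forest.score(X_test, y_test), 3))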

• Ensemble simply means combining multiple models.


• Thus a collection of models is used to make predictions rather than an
individual model.
• Ensemble uses two types of methods: Bagging and Boosting
• 1. Bagging – It creates a different training subset from sample training
data with replacement & the final output is based on majority voting. For
example, Random Forest.
• 2. Boosting – It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
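A hedged side-by-side of the two ensemble styles in scikit-learn; the dataset and parameter values are illustrative only.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees trained on bootstrap samples, combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)

# Boosting: weak learners trained sequentially, each correcting the previous one.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))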
