Key Ingredients of PM
Ingredient 1: Data Collection
Definition
Data collection is the systematic process of gathering information from a variety of sources to answer
specific research questions, test hypotheses, or evaluate outcomes. This process involves selecting
appropriate data collection methods, tools, and techniques based on the research objectives. It is a
critical step in research, as the quality and accuracy of the data collected directly influence the
validity of the study’s conclusions.
Sources of Data
Primary Data: Original data collected firsthand for a specific research purpose.
• Examples:
o Interviews (e.g., one-on-one interviews with industry experts for qualitative insights)
Primary data is highly relevant and specific to the research question but can be time-consuming and
costly to collect.
Secondary Data: Data that was collected by someone else for a different purpose but is now being
used for new research.
• Examples:
o Online Databases (e.g., academic databases like JSTOR, or data repositories like
World Bank)
Secondary data is cost-effective and readily available but may not be perfectly suited to the new
research objectives.
Tertiary Data: Data that has been compiled or interpreted from primary and secondary sources.
• Examples:
o Encyclopedias, textbooks, or summary reports that compile findings from primary and secondary sources.
Tertiary data provides a broad overview and can be useful for background information but may lack
depth or specificity for detailed research.
Considerations
Validity: Refers to the extent to which the data collection method accurately measures what it is
intended to measure.
• Example: If a survey aims to measure customer satisfaction, the questions should directly
address factors related to satisfaction, such as service quality and response time, rather than
unrelated factors like pricing or product variety. Validity ensures that the data collected is
relevant and representative of the concept being studied.
Reliability: Refers to the consistency of the data collection process. A reliable data collection method
produces stable and consistent results over time.
Ethics: Involves considerations such as obtaining informed consent from participants, ensuring their
privacy and confidentiality, and using the data collected responsibly.
• Example: When conducting a survey that collects personal information, researchers must
inform participants about how their data will be used and stored, and ensure it is not shared
without permission. Ethical considerations protect participants’ rights and maintain the
integrity of the research process.
Bias: Refers to any systematic error that skews data collection and can lead to misleading results.
• Example: If a researcher only interviews people from a particular demographic group, the
data collected may not be representative of the entire population, introducing selection bias.
It is important to design the data collection process to minimize bias, such as using random
sampling or ensuring diverse participation.
Cost and Resources: Refers to the practical constraints related to budget, time, and tools available
for data collection.
Data Quality: Refers to the accuracy, completeness, and relevance of the data collected.
• Example: Data with many missing values or incorrect entries can lead to flawed analysis and
incorrect conclusions. Ensuring data quality involves careful planning, data validation, and
cleaning processes. High-quality data enhances the credibility and reliability of the research
findings.
Ingredient 2: Data Preprocessing
Cleaning
Definition:
Data cleaning is the process of identifying, correcting, or removing inaccurate, incomplete, or
irrelevant data from a dataset. It is a critical step to ensure that the dataset is free from errors or
inconsistencies that could negatively impact the analysis and model performance.
Techniques:
• Removing Duplicates:
o Example: In a customer dataset, if the same customer is listed twice with identical information, one of the duplicate entries should be removed to avoid skewing the analysis.
• Handling Missing Values:
o Mean Imputation: Replacing missing numerical values with the average of the available values.
▪ Example: If the "Age" column has missing values, you can fill them with the average age of the available data.
o Mode Imputation: Replacing missing categorical values with the most frequent category.
o Deletion: Removing rows or columns that are missing too much data to be useful.
▪ Example: If a row has more than 50% of its data missing, it may be best to delete the entire row.
• Correcting Inaccuracies: Fixing values that are clearly wrong or inconsistent, such as typos, impossible values, or mismatched units.
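A minimal sketch of these cleaning steps using pandas; the DataFrame and its columns are made up for illustration:

```python
import pandas as pd

# Hypothetical customer data with a duplicate row, missing values, and an invalid entry
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34.0, 45.0, 45.0, None, 29.0],
    "plan": ["basic", "premium", "premium", None, "basic"],
    "quantity": [2, 1, 1, 3, -5],
})

df = df.drop_duplicates()                              # remove exact duplicate records
df["age"] = df["age"].fillna(df["age"].mean())         # mean imputation for a numeric column
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])   # mode imputation for a categorical column
df = df[df["quantity"] >= 0]                           # drop rows with clearly invalid values

print(df)
```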
Transformations
Definition:
Data transformation involves converting data into a format that is more suitable for analysis. This can
include scaling, encoding, or applying mathematical functions to make the data compatible with the
chosen analytical techniques or models.
Techniques:
• Standardization: Centering the data by subtracting the mean and scaling by the standard
deviation, often resulting in a mean of 0 and a standard deviation of 1.
• Encoding Categorical Variables: Converting categorical data into numerical values using techniques such as:
o One-Hot Encoding: Creating a separate binary column for each category.
▪ Example: Converting "Color" categories like "Red," "Blue," and "Green" into separate binary columns: "Is_Red," "Is_Blue," "Is_Green."
• Log Transformation: Applying a logarithmic function to compress skewed distributions.
o Example: Applying a log transformation to sales data that has a long tail, so it better fits a normal distribution for regression analysis.
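A brief sketch of these transformations with pandas, NumPy, and scikit-learn; the toy columns ("income", "color") are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a long-tailed numeric feature and a categorical feature
df = pd.DataFrame({
    "income": [32_000, 45_000, 51_000, 250_000],
    "color": ["Red", "Blue", "Green", "Red"],
})

# Standardization: mean 0, standard deviation 1
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation: compresses the long tail (log1p also handles zeros safely)
df["income_log"] = np.log1p(df["income"])

# One-hot encoding: one binary "Is_<category>" column per color
df = pd.get_dummies(df, columns=["color"], prefix="Is")

print(df)
```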
Handling Outliers
Definition:
Outliers are data points that significantly differ from other observations. Handling outliers is essential
because they can distort statistical analyses and model predictions, leading to inaccurate results.
Techniques:
• Detection:
o Z-Score: Identifying data points that are several standard deviations away from the mean.
o IQR (Interquartile Range): Flagging values that fall more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3).
▪ Example: For a dataset of house prices, if the middle 50% of prices lies between $150,000 and $300,000 (IQR = $150,000), then prices above $525,000 (Q3 + 1.5 × IQR) would be flagged as outliers; the lower bound, Q1 − 1.5 × IQR, is −$75,000, so no realistic low price is flagged by this rule.
o Visualization: Inspecting plots such as box plots for points that sit far from the bulk of the data.
▪ Example: Points plotted beyond a box plot's whiskers, far from the rest of the data, indicating potential outliers.
• Treatment:
o Removal: Deleting outliers that are clearly errors or anomalies.
▪ Example: Removing a data entry for a person's age listed as "200" years old.
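The two detection rules above can be sketched as follows; the price values and the Z-score threshold of 2 are illustrative choices, not fixed rules:

```python
import pandas as pd

# Hypothetical house prices with one extreme value
prices = pd.Series([150_000, 180_000, 210_000, 250_000, 300_000, 2_500_000])

# Z-score rule: flag points far from the mean (the threshold of 2 is a judgment call)
z_scores = (prices - prices.mean()) / prices.std()
print(prices[z_scores.abs() > 2])

# IQR rule: flag points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
print(prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)])
```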
Ingredient 3: Feature Engineering
Feature Selection
Definition:
Feature selection is the process of identifying and selecting the most relevant features (variables)
from a dataset that contribute most to the prediction or output. This step helps improve model
performance by reducing overfitting, improving accuracy, and decreasing computational complexity.
Techniques:
• Filter Methods: Selecting features based on statistical measures that rank their importance with respect to the target variable, independently of any model.
o Example: Ranking features by their correlation with the target (or a chi-squared test score) and keeping only the top-ranked ones.
• Wrapper Methods: Searching for the best feature subset by repeatedly training and evaluating the model on candidate subsets.
o Example: Forward selection, where features are added one by one based on model improvement, or backward elimination, where features are removed until no further performance gain is observed.
• Embedded Methods: Feature selection occurs as part of the model training process, where the model itself chooses the most important features.
o Example: Lasso regression, which shrinks the coefficients of unimportant features to zero during training.
Example Application:
In a customer churn prediction model, feature selection might identify "Customer Tenure," "Monthly
Charges," and "Contract Type" as the most significant predictors while eliminating irrelevant features
like "Customer ID."
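A small sketch of wrapper-style selection with scikit-learn's RFE; the churn-style columns are invented for illustration and already numerically encoded:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical churn data: "customer_id" carries no signal and is dropped up front
df = pd.DataFrame({
    "customer_id":     [101, 102, 103, 104, 105, 106],
    "tenure_months":   [2, 48, 12, 60, 5, 36],
    "monthly_charges": [80.5, 25.0, 60.0, 20.0, 95.0, 30.0],
    "contract_type":   [0, 1, 0, 1, 0, 1],   # 0 = month-to-month, 1 = annual
    "churn":           [1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["customer_id", "churn"])
y = df["churn"]

# Recursively eliminate features using a Random Forest as the evaluator
selector = RFE(RandomForestClassifier(random_state=0), n_features_to_select=2)
selector.fit(X, y)

print(list(X.columns[selector.support_]))  # the two features judged most predictive
```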
Dimensionality Reduction
Definition:
Dimensionality reduction is the process of reducing the number of input variables (features) in a
dataset while preserving as much information as possible. This technique is particularly useful when
dealing with high-dimensional data, where too many features can lead to overfitting and increased
computational load.
Techniques:
• Principal Component Analysis (PCA): A statistical method that transforms the original
features into a smaller set of uncorrelated components, capturing most of the variance in the
data.
o Example: In an image processing task, PCA can reduce the number of pixel features
by finding the key components that capture the most variance in the image data.
• Linear Discriminant Analysis (LDA): A technique that reduces dimensionality by finding the linear combinations of features that best separate different classes.
• t-SNE (t-Distributed Stochastic Neighbor Embedding): A non-linear technique that projects high-dimensional data into two or three dimensions, mainly for visualization.
o Example: t-SNE can be applied to visualize the clustering of handwritten digit images from the MNIST dataset in a 2D plot.
Example Application:
In a text classification problem with thousands of word features, PCA could reduce the feature set to
a few hundred components that still capture the essential information, improving model efficiency
and performance.
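A short PCA sketch on scikit-learn's built-in digits dataset (64 pixel features per image), keeping enough components to explain roughly 95% of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 8x8 digit images -> 64 pixel features per sample

pca = PCA(n_components=0.95)          # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```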
Ingredient 4: Model Selection
Algorithm Selection
Definition:
Algorithm selection is the process of choosing the most appropriate machine learning algorithm for a
given problem based on the characteristics of the data, the problem's requirements, and the desired
outcome. The selection of the right algorithm directly impacts the model’s accuracy, interpretability,
and computational efficiency.
Considerations:
• Type of Problem:
o Example: For a classification problem, you might choose between algorithms like
Logistic Regression, Decision Trees, or Support Vector Machines (SVM), depending
on whether the problem is binary or multi-class.
• Size and Complexity of the Dataset:
o Example: If the dataset is small, a simpler algorithm like Naive Bayes may perform well, while large datasets with complex relationships might benefit from more sophisticated methods like Random Forests or Neural Networks.
Example Application:
For a credit scoring problem, where the goal is to classify customers as low or high risk, a Logistic
Regression model might be selected for its interpretability, while a Random Forest might be chosen
for higher accuracy if interpretability is less of a concern.
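A rough sketch of weighing an interpretable model against a more flexible one on a synthetic stand-in for a credit-scoring dataset (the data is generated, not real):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for low/high credit risk
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Interpretable candidate: coefficients show how each feature shifts the risk score
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logistic Regression accuracy:", logit.score(X_test, y_test))
print("Coefficients:", logit.coef_.round(2))

# Flexible candidate: often more accurate, but harder to explain
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))
```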
Model Complexity
Definition:
Model complexity refers to the capacity of a model to fit a wide range of functions or data patterns.
Complex models can capture more intricate relationships within the data, but they also risk
overfitting, especially with smaller datasets. Striking a balance between underfitting and overfitting is
key to optimal model performance.
Considerations:
• Bias-Variance Tradeoff:
o Example: Simple models like Linear Regression tend to have high bias but low
variance, leading to underfitting. Complex models like Deep Neural Networks have
low bias but high variance, which can cause overfitting.
o Example: A model with too many parameters (e.g., a polynomial regression of very
high degree) may overfit the training data, capturing noise as if it were a true signal.
Conversely, a model that is too simple may underfit, missing important patterns in
the data.
• Regularization:
o Example: Techniques like Lasso (L1) and Ridge (L2) regularization add penalties to
the loss function for larger coefficients, helping to reduce model complexity and
prevent overfitting.
• Cross-Validation:
o Example: Using k-fold cross-validation to check whether added model complexity actually improves performance on held-out data rather than merely fitting the training set more closely.
Example Application:
In a house price prediction task, a simple linear model might underfit if house prices are influenced
by non-linear factors. A more complex model like Random Forests might better capture these
relationships but could overfit if not carefully tuned.
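A sketch of the underfit/overfit/regularization trade-off using polynomial features on a made-up non-linear dataset; the polynomial degrees and the Ridge alpha are arbitrary illustrative values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up non-linear data: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

models = {
    "degree 1 (may underfit)": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "degree 15 (may overfit)": make_pipeline(PolynomialFeatures(15), LinearRegression()),
    "degree 15 + Ridge (L2)": make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)),
}

for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
```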
Ingredient 5: Model Training
Training Process
Definition:
The training process in machine learning involves teaching a model to recognize patterns in data by
adjusting its parameters based on input data and corresponding outputs. This process is iterative,
aiming to minimize the difference between the model’s predictions and the actual outcomes through
optimization techniques.
Steps:
1. Data Preparation:
o Example: Splitting the dataset into training, validation, and test sets. For instance, in
a dataset of 10,000 records, you might allocate 70% for training, 15% for validation,
and 15% for testing.
o Scaling and Normalization: Ensuring all features are on the same scale so that optimization methods like Gradient Descent, which are sensitive to feature scaling, are not biased toward features with larger ranges.
2. Model Initialization:
o Example: Setting the model's weights and biases to starting values, for instance small random numbers or zeros, before training begins.
3. Forward Propagation:
o Example: In a Linear Regression model, this step involves calculating the predicted
output using the initial weights. If the input is [x1, x2] and the initial weights are [w1,
w2], the prediction is y_pred = w1*x1 + w2*x2 + bias.
4. Loss Calculation:
o Example: Using a Mean Squared Error (MSE) loss function in regression, the difference between the predicted and actual outputs is calculated as MSE = (1/n) * Σ(actual - predicted)^2, where n is the number of data points.
5. Backward Propagation:
o Example: Calculating gradients of the loss with respect to each parameter (e.g.,
using Gradient Descent) to determine the direction and magnitude by which the
weights should be updated.
6. Parameter Update:
o Example: Adjusting the weights and biases using an optimization algorithm like Gradient Descent, which subtracts a fraction of the gradient from the weights: new_weight = old_weight - learning_rate * gradient.
7. Iteration:
o Example: Repeating the forward propagation, loss calculation, backward
propagation, and parameter update steps over many epochs (iterations) until the
model converges to a minimum loss or reaches a specified number of iterations.
8. Validation:
o Example: Using the validation set to evaluate the model’s performance during
training to prevent overfitting and to fine-tune hyperparameters like learning rate or
the number of layers in a Neural Network.
9. Testing:
o Example: After training, the model is tested on the test set to assess its
generalization to new, unseen data, ensuring it performs well beyond the training
data.
Example Application:
In an image classification problem, the training process might involve feeding thousands of labeled
images into a Convolutional Neural Network (CNN), adjusting filters (weights) through
backpropagation and optimization over many epochs, validating performance with a validation set,
and finally testing the trained model on a separate set of images to evaluate accuracy.
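A minimal NumPy sketch of the loop described in steps 3 to 7, training a linear model by gradient descent on a toy dataset (the true coefficients 3, 2, and 5 are invented):

```python
import numpy as np

# Toy data: y = 3*x1 + 2*x2 + 5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# Steps 1-2: prepare data and initialize parameters (weights and bias start at zero)
w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for epoch in range(200):                  # Step 7: iterate over many epochs
    y_pred = X @ w + b                    # Step 3: forward propagation
    error = y_pred - y
    mse = np.mean(error ** 2)             # Step 4: loss calculation (MSE)
    grad_w = 2 * X.T @ error / len(y)     # Step 5: backward propagation (gradients)
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w           # Step 6: parameter update
    b -= learning_rate * grad_b

print("learned weights:", w.round(2), "bias:", round(b, 2), "final MSE:", round(mse, 4))
```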
Ingredient 6: Model Evaluation
Evaluation Metrics
Classification Metrics
Definition:
Classification metrics are used to evaluate the performance of models that predict categorical
outcomes. These metrics help quantify how well the model is distinguishing between different
classes.
Key Metrics:
• Accuracy: The proportion of correct predictions out of the total predictions made.
o Example: In a spam detection model, if 90 out of 100 emails are correctly classified
as spam or not spam, the accuracy is 90%.
• Precision: The proportion of true positive predictions out of all positive predictions made by the model. Precision is crucial when the cost of false positives is high.
o Example: In a spam filter, if 50 emails are flagged as spam and 45 of them really are spam, the precision is 90%.
• Recall (Sensitivity): The proportion of true positive predictions out of all actual positives.
Recall is essential when the cost of missing positives (false negatives) is high.
o Example: In a cancer detection model, if 80 out of 100 actual cancer cases are
correctly identified, the recall is 80%.
• F1 Score: The harmonic mean of precision and recall, providing a balanced measure when
precision and recall are equally important.
o Example: If a model has a precision of 75% and a recall of 60%, the F1 Score is
approximately 66.7%, balancing both metrics.
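These metrics can be computed directly with scikit-learn; the labels below are a small made-up spam example:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical spam-detection labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```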
Regression Metrics
Definition:
Regression metrics are used to assess the performance of models predicting continuous outcomes.
These metrics measure the difference between predicted and actual values.
Key Metrics:
• Mean Absolute Error (MAE): The average of the absolute differences between predicted and
actual values.
o Example: If a house price prediction model has an MAE of $5,000, this means the
model’s predictions are, on average, off by $5,000.
• Mean Squared Error (MSE): The average of the squared differences between predicted and
actual values, penalizing larger errors more heavily.
• Root Mean Squared Error (RMSE): The square root of the MSE, providing error estimates in the same units as the target variable.
• R-squared (R²): The proportion of the variance in the target variable that is explained by the model.
o Example: An R-squared of 0.85 in a house price prediction model indicates that 85% of the variation in house prices is explained by the model.
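A quick sketch computing the regression metrics above with scikit-learn; the house prices and predictions are invented:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical house prices (in dollars) and model predictions
y_true = np.array([250_000, 310_000, 180_000, 420_000])
y_pred = np.array([245_000, 320_000, 175_000, 405_000])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # error in the same units as the target
r2 = r2_score(y_true, y_pred)

print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"R-squared: {r2:.3f}")
```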
Model Comparison
Definition:
Model comparison involves evaluating and contrasting the performance of different models on the
same dataset to identify the most suitable model for the problem at hand. This is done using
evaluation metrics and other criteria such as interpretability and computational efficiency.
Considerations:
• Metric-Based Comparison:
o Example: Comparing two classification models (e.g., Random Forest vs. SVM) using
metrics like F1 Score or AUC-ROC to determine which model better handles class
imbalance.
• Cross-Validation Scores:
o Example: Comparing the mean and spread of k-fold cross-validation scores to judge which model generalizes more consistently across different subsets of the data.
• Complexity and Interpretability:
o Example: If two models have similar accuracy but one is significantly less complex (e.g., Logistic Regression vs. a Neural Network), the simpler model might be preferred for its ease of deployment and interpretability.
• Computational Cost:
o Example: When comparing a Decision Tree with a Gradient Boosting model, the
former may be preferred if it achieves acceptable performance with much lower
training time and computational resources.
Example Application:
In a customer churn prediction project, you might compare models like Logistic Regression, Random
Forest, and XGBoost using the F1 Score, precision, recall, and AUC-ROC. You would choose the model
that provides the best trade-off between accuracy and interpretability, ensuring it generalizes well to
new data.
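A sketch of a metric-based, cross-validated comparison on a synthetic imbalanced dataset standing in for churn data; the candidate models and scoring choices mirror the example above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for a churn dataset (about 15% positive class)
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.85, 0.15], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: F1 = {f1:.3f}, AUC-ROC = {auc:.3f}")
```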
Ingredient 7: Model Deployment
Definition
Model deployment is the process of integrating a trained machine learning model into a production
environment where it can be used to make real-time or batch predictions on new data. This phase
transforms a machine learning model from a research artifact into a functional component of an
operational system, delivering value by generating predictions that can be acted upon.
Considerations
1. Infrastructure Requirements:
o Example: The model may need dedicated servers, cloud services, or edge hardware capable of handling the expected prediction volume and latency.
2. Integration with Existing Systems:
o Example: The model must seamlessly integrate with existing applications, databases, and services. For instance, an e-commerce recommendation engine should integrate with the website's backend to provide personalized product recommendations in real time.
3. Security and Privacy:
o Example: The deployment process should ensure that the model adheres to data privacy regulations (e.g., GDPR) and is secure from potential threats. For instance, a healthcare model predicting patient outcomes must ensure patient data is encrypted and that the model is only accessible by authorized personnel.
4. Version Control:
o Example: Keeping track of model versions is essential to ensure that updates can be rolled out smoothly and previous versions can be rolled back if an issue arises. For instance, in A/B testing different versions of a marketing model, you might need to revert to an earlier version if the new model performs poorly.
5. Accessibility of Results:
o Example: The output of the model should be presented in a way that is accessible and actionable for end-users. For example, a sales forecasting model could provide intuitive visualizations and actionable insights for business managers through a dashboard.
6. Cost Management:
o Example: The deployment should consider cost efficiency, particularly when using
cloud services. For instance, choosing a serverless architecture for a low-volume
prediction model can reduce operational costs.
Example Application:
Deploying a customer churn prediction model in a telecom company might involve integrating the
model into a customer relationship management (CRM) system, ensuring it can process real-time
data to alert sales teams about high-risk customers. The system would monitor the model’s accuracy
over time and retrain it as new customer behavior data becomes available, all while ensuring
compliance with data privacy regulations.
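As one possible shape for this step, here is a minimal Flask sketch that serves a previously saved scikit-learn model over HTTP. The file name "churn_model.joblib", the feature list, and the port are all hypothetical:

```python
# Minimal sketch: serve a pre-trained churn model as a prediction API.
# Assumes the model was saved earlier, e.g. joblib.dump(model, "churn_model.joblib").
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")  # hypothetical pre-trained classifier
FEATURES = ["tenure_months", "monthly_charges", "contract_type"]  # assumed input schema

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                 # e.g. {"tenure_months": 4, ...}
    row = [[payload[f] for f in FEATURES]]       # order features as the model expects
    churn_probability = model.predict_proba(row)[0][1]
    return jsonify({"churn_probability": round(float(churn_probability), 3)})

if __name__ == "__main__":
    app.run(port=5000)
```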