Data mining
Data cleaning: removing inconsistent duplicates and missing entries from the data to increase data consistency and quality. Not cleaning your data can lead to incorrect business decisions, so data must be accurate and reliable before making any business decision.
Data cleaning tools: Excel, Python, SQL, data visualization (a pandas sketch follows the techniques list below).
Benefits of data cleaning: avoiding mistakes / improving productivity / avoiding unnecessary costs and errors / staying organized / improving mapping.
Data mining techniques:
…values in categories such as sales, stock prices, or even temperature. The ranges are based on the information found in a particular data set.
Association Rule Learning: This toolset, also called market basket analysis, searches for relationships among dataset variables. For example, association rule learning can determine which products are frequently purchased together (e.g., a smartphone and a protective case).
Clustering: This process partitions datasets into a set of meaningful sub-classes, known as clusters. The process helps users understand the natural structure or grouping within the data.
Classification: This technique assigns items in a dataset to different target categories or classes. The goal is to develop accurate predictions within the target class for each case in the data. (Toy sketches of these techniques follow.)
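As a minimal sketch of the Python route mentioned above (pandas assumed; the file name sales.csv and the product/price columns are invented for illustration), a cleaning pass over duplicates and missing entries might look like this:

```python
import pandas as pd

# Hypothetical sales data; file and column names are assumptions.
df = pd.read_csv("sales.csv")

# Remove exact duplicate rows to fix inconsistent duplicates.
df = df.drop_duplicates()

# Standardize casing/whitespace so "Widget " and "widget" match.
df["product"] = df["product"].str.strip().str.lower()

# Handle missing entries: drop rows missing the key field,
# fill missing numeric values with the column median.
df = df.dropna(subset=["product"])
df["price"] = df["price"].fillna(df["price"].median())

print(df.describe())
```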
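Association rule learning can be illustrated without a dedicated library by counting how often item pairs co-occur across transactions (the baskets below are invented toy data; real market-basket tools such as Apriori implementations also compute support and confidence):

```python
from itertools import combinations
from collections import Counter

# Toy transactions (invented for illustration).
baskets = [
    {"smartphone", "protective case", "charger"},
    {"smartphone", "protective case"},
    {"laptop", "mouse"},
    {"smartphone", "charger"},
]

# Count how often each unordered pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most common pairs suggest candidate association rules,
# e.g. "smartphone -> protective case".
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```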
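Clustering and classification can likewise be sketched with scikit-learn on invented data (the two-feature blobs and the 0/1 target classes are assumptions, not from the notes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two invented "blobs" of records described by two numeric features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Clustering: partition the data into 2 sub-classes (clusters)
# without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(clusters))  # cluster sizes

# Classification: with known target classes, learn to assign
# new records to a class.
y = np.array([0] * 50 + [1] * 50)  # invented target classes
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[4.8, 5.1]]))  # predicted class for a new record
```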
Multiple linear regression is the most popular model for making predictions. This model is used to fit a relationship between a numerical outcome variable Y (also called the response, target, or dependent variable) and a set of predictors X1, X2, …, Xp (also referred to as independent variables, input variables, regressors, or covariates). The assumption is that the following function approximates the relationship between the predictors and the outcome variable:
Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε
Choosing the right form depends on domain knowledge, data availability, and the needed predictive power.
Explanatory vs. Predictive Modeling. Regression serves two goals: 1. Explaining or quantifying the average effect of inputs on an outcome (explanatory or descriptive task, respectively). 2. Predicting the outcome value for new records, given their input values (predictive task).
Explanatory modeling aims to understand the relationship between variables in a population by treating the data as a random sample. This approach estimates regression models to capture average relationships in the population and inform decision-making.
Predictive analytics is concerned with predicting individual outcomes for new records. It focuses on generating predictions rather than interpreting coefficients, making it suitable for micro-level decision-making.
Estimating the regression equation and making predictions uses the ordinary least squares (OLS) method. The coefficients of the regression equation are determined by minimizing the sum of squared deviations between actual outcome values and their predicted values. To predict outcomes for new records, the estimated linear relationship between the independent variables and the outcome is utilized. For accurate predictions, certain assumptions must hold, such as a normal distribution of errors and independence of records. While these assumptions may not always hold, the resulting estimates remain useful for prediction provided proper evaluation of model performance is conducted.
Beware the "kitchen-sink" approach: the tendency to include all available variables as predictors in a regression model without careful selection. Careful selection of predictors matters because of various drawbacks of using too many variables: 1. Cost and feasibility concerns may arise from collecting all predictors. 2. It's often preferable to focus on fewer, more accurate predictors. 3. More predictors can lead to missing data issues. 4. Models with fewer parameters offer clearer insights. 5. Including many predictors can result in unstable regression coefficients. 6. Including predictors that are uncorrelated with the outcome increases the variance of predictions. 7. Dropping predictors that are correlated with the outcome may increase prediction bias.
To address these challenges, methods are available for reducing the number of predictors to a smaller, more effective set for better prediction and classification in data mining tasks:
Domain Knowledge: The initial step in reducing predictors involves leveraging domain knowledge to understand the relevance of each predictor to the outcome variable. Predictors can be eliminated based on factors like expense, inaccuracy, high correlation with other predictors, missing values, or irrelevance to the problem at hand.
Exhaustive Search: This method involves evaluating all possible subsets of predictors (2^p subsets for p predictors). However, due to the vast number of potential models, it's often impractical to examine every combination (a small sketch follows below).
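For small p, an exhaustive search is feasible; here is a sketch using statsmodels OLS on randomly generated data (invented for illustration; x3 is pure noise by construction), scoring every subset by adjusted R-squared, which is defined next:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Invented data: y depends on x1 and x2; x3 is pure noise.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + rng.normal(size=100)

best = None
# Try every non-empty subset of the p predictors: 2^p - 1 models.
for k in range(1, len(X.columns) + 1):
    for subset in combinations(X.columns, k):
        model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        if best is None or model.rsquared_adj > best[0]:
            best = (model.rsquared_adj, subset)

print(best)  # expected to favor ("x1", "x2") over the kitchen-sink model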
Adjusted R-squared (R²_adj): It's like a smarter version of R-squared (R²), which tells us how well our model fits the data.
R²_adj considers not only how well the model fits but also
how many predictors it uses. It penalizes models with lots of
predictors that don't add much useful information.
Higher R²_adj values mean a better model fit.
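For intuition, a worked example using the standard adjusted R-squared formula (restated with the other metrics below): with n = 100 records, p = 10 predictors, and R² = 0.80,
R²_adj = 1 − (1 − 0.80) × (100 − 1) / (100 − 10 − 1) = 1 − 0.20 × 99/89 ≈ 0.78,
slightly below the raw R², reflecting the penalty for the 10 predictors.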
Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC) act as scorecards for comparing models. They consider both how well a model fits the data and how complex it is, meaning how many predictors it includes. They penalize overly complex models with lots of predictors.
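A minimal sketch of this scorecard idea, again with statsmodels on invented data where x2 is irrelevant by construction; smaller AIC/BIC should favor the smaller model:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
y = 3 * X["x1"] + rng.normal(size=200)  # x2 is irrelevant by construction

small = sm.OLS(y, sm.add_constant(X[["x1"]])).fit()  # 1 predictor
large = sm.OLS(y, sm.add_constant(X)).fit()          # kitchen sink

# Both criteria reward fit but penalize extra predictors.
print("small model: AIC=%.1f BIC=%.1f" % (small.aic, small.bic))
print("large model: AIC=%.1f BIC=%.1f" % (large.aic, large.bic))
```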
Smaller values of AIC and BIC indicate a better balance between how well the model fits the data and how simple it is. Together, these metrics help us choose the best combination of predictors for our model by finding the right balance between accuracy and simplicity. When comparing models with the same number of predictors, metrics like R-squared (R²), Adjusted R-squared (R²_adj), AIC, and BIC usually agree on which model is best. However, when comparing models with different numbers of predictors, they might give different rankings because they consider both the fit and the complexity of the model. The adjusted R-squared formula is:
R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)
where n is the number of records and p is the number of predictors.
Naive Bayes:
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem, which underpins an entire branch of statistics, with strong independence assumptions between features.
* It is called "naive" because it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* Despite its simplicity, Naive Bayes often performs surprisingly well in practice and is widely used in various applications.
Basic principles (exact Bayes; see the pandas sketch after this list). For each record to be classified:
1. Find all the other records with the same predictor profile (i.e., where the predictor values are the same).
2. Determine what classes those records belong to, and which class is most prevalent.
3. Assign that class to the new record.
The Bayesian classifier works only with categorical predictors:
* If we use a set of numerical predictors, then it is highly unlikely that multiple records will have identical values on these numerical predictors.
* Therefore, numerical predictors must be binned and converted to categorical predictors.
* The Bayesian classifier is the only classification or prediction method especially suited for (and limited to) categorical predictor variables.
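A minimal pandas sketch of the exact-Bayes lookup above (the training records are invented; scikit-learn's CategoricalNB implements the naive variant, which relaxes the identical-profile requirement via the independence assumption):

```python
import pandas as pd

# Invented training records with categorical predictors.
train = pd.DataFrame({
    "income":  ["high", "high", "low", "low", "low"],
    "student": ["no",   "yes",  "yes", "no",  "yes"],
    "buys":    ["yes",  "yes",  "yes", "no",  "yes"],
})

def exact_bayes_classify(record, data, target="buys"):
    """1. Find all records with the same predictor profile.
       2. See which target class is most prevalent among them.
       3. Assign that class."""
    mask = pd.Series(True, index=data.index)
    for col, value in record.items():
        mask &= data[col] == value
    matches = data.loc[mask, target]
    if matches.empty:
        return None  # no identical profile: this is where naive Bayes helps
    return matches.mode().iloc[0]

print(exact_bayes_classify({"income": "low", "student": "yes"}, train))
```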