Data mining
Data cleaning: removing inconsistent duplicates and missing entries from the data to increase data consistency and quality. Not cleaning your data can lead to incorrect business decisions, so data must be accurate and reliable before making any business decision.
Data cleaning tools: Excel, Python, SQL, data visualization (a pandas sketch follows the techniques list below).
Benefits of data cleaning: avoiding mistakes / improving productivity / avoiding unnecessary costs and errors / staying organized / improving mapping.
Data mining techniques:
…values in categories such as sales, stock prices, or even temperature. The ranges are based on the information found in a particular data set.
Association Rule Learning: This toolset, also called market basket analysis, searches for relationships among dataset variables. For example, association rule learning can determine which products are frequently purchased together (e.g., a smartphone and a protective case).
Clustering: This process partitions datasets into a set of meaningful sub-classes, known as clusters. The process helps users understand the natural structure or grouping within the data.
Classification: This technique assigns items in a dataset to different target categories or classes. The goal is to develop accurate predictions within the target class for each case in the data. (Toy sketches of these techniques follow.)
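As a minimal sketch of the Python route mentioned above (pandas assumed; the file name sales.csv and the product/price columns are invented for illustration), a cleaning pass over duplicates and missing entries might look like this:

```python
import pandas as pd

# Hypothetical sales data; file and column names are assumptions.
df = pd.read_csv("sales.csv")

# Remove exact duplicate rows to fix inconsistent duplicates.
df = df.drop_duplicates()

# Standardize casing/whitespace so "Widget " and "widget" match.
df["product"] = df["product"].str.strip().str.lower()

# Handle missing entries: drop rows missing the key field,
# fill missing numeric values with the column median.
df = df.dropna(subset=["product"])
df["price"] = df["price"].fillna(df["price"].median())

print(df.describe())
```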
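Association rule learning can be illustrated without a dedicated library by counting how often item pairs co-occur across transactions (the baskets below are invented toy data; real market-basket tools such as Apriori implementations also compute support and confidence):

```python
from itertools import combinations
from collections import Counter

# Toy transactions (invented for illustration).
baskets = [
    {"smartphone", "protective case", "charger"},
    {"smartphone", "protective case"},
    {"laptop", "mouse"},
    {"smartphone", "charger"},
]

# Count how often each unordered pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most common pairs suggest candidate association rules,
# e.g. "smartphone -> protective case".
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```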
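Clustering and classification can likewise be sketched with scikit-learn on invented data (the two-feature blobs and the 0/1 target classes are assumptions, not from the notes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two invented "blobs" of records described by two numeric features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Clustering: partition the data into 2 sub-classes (clusters)
# without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(clusters))  # cluster sizes

# Classification: with known target classes, learn to assign
# new records to a class.
y = np.array([0] * 50 + [1] * 50)  # invented target classes
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[4.8, 5.1]]))  # predicted class for a new record
```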
Multiple linear regression is the most popular model for making predictions. This model is used to fit a relationship between a numerical outcome variable Y (also called the response, target, or dependent variable) and a set of predictors X1, X2, …, Xp (also referred to as independent variables, input variables, regressors, or covariates). The assumption is that the following function approximates the relationship between the predictors and the outcome variable:
Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε
Choosing the right form depends on domain knowledge, data availability, and the needed predictive power.
Explanatory vs. Predictive Modeling. Regression serves two goals: 1. Explaining or quantifying the average effect of inputs on an outcome (explanatory or descriptive task, respectively). 2. Predicting the outcome value for new records, given their input values (predictive task).
Explanatory modeling aims to understand the relationship between variables in a population by treating the data as a random sample. This approach estimates regression models to capture average relationships in the population and inform decision-making.
Predictive analytics is concerned with predicting individual outcomes for new records. It focuses on generating predictions rather than interpreting coefficients, making it suitable for micro-level decision-making.
Estimating the regression equation and making predictions uses the ordinary least squares (OLS) method. The coefficients of the regression equation are determined by minimizing the sum of squared deviations between actual outcome values and their predicted values. To predict outcomes for new records, the estimated linear relationship between the independent variables and the outcome is utilized. For accurate predictions, certain assumptions must hold, such as a normal distribution of errors and independence of records. While these assumptions may not always hold, the resulting estimates remain useful for prediction provided proper evaluation of model performance is conducted.
Beware the "kitchen-sink" approach: the tendency to include all available variables as predictors in a regression model without careful selection. Careful selection of predictors matters because of various drawbacks of using too many variables: 1. Cost and feasibility concerns may arise from collecting all predictors. 2. It's often preferable to focus on fewer, more accurate predictors. 3. More predictors can lead to missing data issues. 4. Models with fewer parameters offer clearer insights. 5. Including many predictors can result in unstable regression coefficients. 6. Including predictors that are uncorrelated with the outcome increases the variance of predictions. 7. Dropping predictors that are correlated with the outcome may increase prediction bias.
To address these challenges, methods are available for reducing the number of predictors to a smaller, more effective set for better prediction and classification in data mining tasks:
Domain Knowledge: The initial step in reducing predictors involves leveraging domain knowledge to understand the relevance of each predictor to the outcome variable. Predictors can be eliminated based on factors like expense, inaccuracy, high correlation with other predictors, missing values, or irrelevance to the problem at hand.
Exhaustive Search: This method involves evaluating all possible subsets of predictors (2^p subsets for p predictors). However, due to the vast number of potential models, it's often impractical to examine every combination (a small sketch follows below).
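For small p, an exhaustive search is feasible; here is a sketch using statsmodels OLS on randomly generated data (invented for illustration; x3 is pure noise by construction), scoring every subset by adjusted R-squared, which is defined next:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Invented data: y depends on x1 and x2; x3 is pure noise.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - X["x2"] + rng.normal(size=100)

best = None
# Try every non-empty subset of the p predictors: 2^p - 1 models.
for k in range(1, len(X.columns) + 1):
    for subset in combinations(X.columns, k):
        model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        if best is None or model.rsquared_adj > best[0]:
            best = (model.rsquared_adj, subset)

print(best)  # expected to favor ("x1", "x2") over the kitchen-sink model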
Adjusted R-squared (R²_adj): It's like a smarter version of R-squared (R²), which tells us how well our model fits the data.
R²_adj considers not only how well the model fits but also
how many predictors it uses. It penalizes models with lots of
predictors that don't add much useful information.
Higher R²_adj values mean a better model fit.
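For intuition, a worked example using the standard adjusted R-squared formula (restated with the other metrics below): with n = 100 records, p = 10 predictors, and R² = 0.80,
R²_adj = 1 − (1 − 0.80) × (100 − 1) / (100 − 10 − 1) = 1 − 0.20 × 99/89 ≈ 0.78,
slightly below the raw R², reflecting the penalty for the 10 predictors.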
Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC) act as scorecards for comparing models. They consider both how well a model fits the data and how complex it is, meaning how many predictors it includes. They penalize overly complex models with lots of predictors.
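A minimal sketch of this scorecard idea, again with statsmodels on invented data where x2 is irrelevant by construction; smaller AIC/BIC should favor the smaller model:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
y = 3 * X["x1"] + rng.normal(size=200)  # x2 is irrelevant by construction

small = sm.OLS(y, sm.add_constant(X[["x1"]])).fit()  # 1 predictor
large = sm.OLS(y, sm.add_constant(X)).fit()          # kitchen sink

# Both criteria reward fit but penalize extra predictors.
print("small model: AIC=%.1f BIC=%.1f" % (small.aic, small.bic))
print("large model: AIC=%.1f BIC=%.1f" % (large.aic, large.bic))
```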
Smaller values of AIC and BIC indicate a better balance between how well the model fits the data and how simple it is. Together, these metrics help us choose the best combination of predictors for our model by finding the right balance between accuracy and simplicity. When comparing models with the same number of predictors, metrics like R-squared (R²), Adjusted R-squared (R²_adj), AIC, and BIC usually agree on which model is best. However, when comparing models with different numbers of predictors, they might give different rankings because they consider both the fit and the complexity of the model. The adjusted R-squared formula is:
R²_adj = 1 − (1 − R²) × (n − 1) / (n − p − 1)
where n is the number of records and p is the number of predictors.
Naive Bayes:
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem, which underpins an entire branch of statistics, with strong independence assumptions between features.
* It is called "naive" because it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* Despite its simplicity, Naive Bayes often performs surprisingly well in practice and is widely used in various applications.
Basic principles (exact Bayes; see the pandas sketch after this list). For each record to be classified:
1. Find all the other records with the same predictor profile (i.e., where the predictor values are the same).
2. Determine what classes those records belong to, and which class is most prevalent.
3. Assign that class to the new record.
The Bayesian classifier works only with categorical predictors:
* If we use a set of numerical predictors, then it is highly unlikely that multiple records will have identical values on these numerical predictors.
* Therefore, numerical predictors must be binned and converted to categorical predictors.
* The Bayesian classifier is the only classification or prediction method especially suited for (and limited to) categorical predictor variables.
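A minimal pandas sketch of the exact-Bayes lookup above (the training records are invented; scikit-learn's CategoricalNB implements the naive variant, which relaxes the identical-profile requirement via the independence assumption):

```python
import pandas as pd

# Invented training records with categorical predictors.
train = pd.DataFrame({
    "income":  ["high", "high", "low", "low", "low"],
    "student": ["no",   "yes",  "yes", "no",  "yes"],
    "buys":    ["yes",  "yes",  "yes", "no",  "yes"],
})

def exact_bayes_classify(record, data, target="buys"):
    """1. Find all records with the same predictor profile.
       2. See which target class is most prevalent among them.
       3. Assign that class."""
    mask = pd.Series(True, index=data.index)
    for col, value in record.items():
        mask &= data[col] == value
    matches = data.loc[mask, target]
    if matches.empty:
        return None  # no identical profile: this is where naive Bayes helps
    return matches.mode().iloc[0]

print(exact_bayes_classify({"income": "low", "student": "yes"}, train))
```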