Predictive Analysis U1 U2
Data Mining is the process of discovering patterns, trends, and useful information from large datasets. It involves analyzing massive amounts of data to uncover hidden insights that can help organizations:
• Improve decision-making
• Optimize operations
History of Data Mining:
• 2000s-present: Big data, cloud computing, and AI revolutionized data mining with more powerful tools and techniques
Steps in the Data Mining Process:
1. Problem Definition
2. Data Preparation
3. Model Building
4. Evaluation
5. Deployment
The KDD (Knowledge Discovery in Databases) Process:
1. Data Selection
2. Data Cleaning
3. Data Transformation
4. Data Mining
5. Pattern Evaluation
6. Knowledge Presentation
Challenges in Data Mining:
• Privacy concerns
• Interpretability of patterns
• Ethical considerations
Databases provide an organized structure to store data, ensuring the reliability and accessibility of data mining processes. Without databases, managing and processing data would be complex and inefficient.
The CRISP-DM Process:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Common Data Mining Algorithms:
• Decision Trees
• Neural Networks
• K-means Clustering
Predictive Analytics:
Predictive analytics involves using statistical and machine learning techniques to predict future outcomes based on historical data. It includes regression,
decision trees, and neural networks.
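As a minimal illustration, the sketch below fits a regression model on historical data using scikit-learn; the spend and sales numbers are invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend vs. resulting sales
X = np.array([[10], [20], [30], [40], [50]])  # spend (feature)
y = np.array([25, 41, 58, 77, 95])            # sales (target)

# Fit the model on past data, then predict a future outcome
model = LinearRegression().fit(X, y)
print(model.predict([[60]]))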
Challenges in Predictive Analytics:
• Privacy and Security: Using personal data requires strict ethical considerations
Data Preparation:
Data preparation is the process of transforming raw data into a clean, structured format suitable for analysis. This is a critical step in data mining and machine learning, as poor-quality data can lead to inaccurate models and misleading insights. The main steps, illustrated in the sketch after this list, are:
1. Data Collection: Gathering raw data from the relevant sources.
2. Data Cleaning: Removing errors, handling missing values, and ensuring consistency.
3. Data Transformation: Converting data into a format that can be easily used for analysis.
4. Feature Selection: Identifying and selecting relevant features for the model.
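A minimal sketch of these steps with pandas; the file name sales.csv and its columns are hypothetical, not from the notes.

import pandas as pd

# Data collection: load raw data (hypothetical file and columns)
df = pd.read_csv("sales.csv")

# Data cleaning: drop duplicates and fill missing prices with the median
df = df.drop_duplicates()
df["price"] = df["price"].fillna(df["price"].median())

# Data transformation: convert the date strings to datetime objects
df["order_date"] = pd.to_datetime(df["order_date"])

# Feature selection: keep only the columns relevant to the model
features = df[["price", "order_date"]]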
Data Quality:
Data quality refers to how well the data meets the requirements of the analysis. Factors include:
• Consistency: Are all variables and data types consistent across the dataset?
Data Sampling:
Sampling selects a subset of the data for analysis (see the sketch after this list). Common methods include:
• Random Sampling: Each data point has an equal chance of being selected.
• Stratified Sampling: The data is divided into categories, and samples are taken from each category.
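A minimal sketch of both methods with pandas; the segment and value columns are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B", "C"],
    "value": [10, 12, 9, 14, 11, 13],
})

# Random sampling: every row has an equal chance of selection
random_sample = df.sample(n=3, random_state=42)

# Stratified sampling: sample half the rows within each segment
stratified_sample = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=42)
)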
Data Description:
Descriptive statistics are used to summarize and describe the features of a dataset. This includes metrics like the following, computed in the sketch after this list:
• Mean
• Median
• Mode
• Range
• Variance
• Standard Deviation
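A short sketch computing each metric with Python's standard library; the scores list is a hypothetical sample.

from statistics import mean, median, mode, variance, stdev

scores = [70, 75, 75, 80, 85, 90, 95]  # hypothetical exam scores

print("Mean:", mean(scores))                 # arithmetic average
print("Median:", median(scores))             # middle value
print("Mode:", mode(scores))                 # most frequent value
print("Range:", max(scores) - min(scores))   # spread between extremes
print("Variance:", variance(scores))         # sample variance
print("Standard deviation:", stdev(scores))  # square root of the variance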
Data Exploration:
Data exploration is the process of analyzing data to discover patterns, trends, and relationships. It is a key step in identifying issues with the data and
gaining insights that can guide the modeling process.
Categories of Data Visualization (a plotting sketch follows this list):
• Univariate Analysis: Exploring one variable at a time (e.g., histograms, bar charts).
• Bivariate Analysis: Analyzing two variables to understand their relationship (e.g., scatter plots).
• Multivariate Analysis: Exploring the relationships between more than two variables (e.g., heatmaps, pair plots).
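A minimal matplotlib sketch of a univariate and a bivariate view; the data is randomly generated for illustration.

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: two related variables
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 2 * x + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=20)       # univariate: distribution of one variable
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=10)    # bivariate: relationship between two variables
ax2.set_title("x vs. y")
plt.show()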
Outlier Detection (see the sketch after this list):
• Z-score method: Measures how many standard deviations a data point lies from the mean.
• Interquartile Range (IQR): Outliers are data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
• Visual methods: Boxplots, scatter plots, and histograms can visually highlight outliers.
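A minimal NumPy sketch of the two numeric methods; the data array is hypothetical, and the z-score cutoff of 2 (rather than the common 3) suits the tiny sample.

import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])  # 95 is a planted outlier

# Z-score method: flag points far from the mean in standard-deviation units
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 2]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both methods flag 95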
Data Cleaning:
Data cleaning is the process of handling errors, missing values, and inconsistencies in the dataset.
Acquisition:
This step involves ensuring the data is collected accurately and consistently from various sources. Acquiring clean data at the collection stage reduces the
need for extensive cleaning later on.
Missing Data:
Handling missing data is a critical part of data cleaning. Methods include the following (see the sketch after this list):
• Imputation: Filling in missing data with the mean, median, mode, or other statistical estimates.
• Predictive Imputation: Using machine learning algorithms to predict and fill missing values.
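A minimal pandas sketch of simple imputation; the age column is a hypothetical example.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 40, np.nan, 35]})

# Imputation: replace missing ages with the column median
df["age"] = df["age"].fillna(df["age"].median())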
Discretization:
Discretization involves converting continuous variables into discrete categories. This can be helpful when dealing with algorithms that only accept
categorical inputs.
Discretization without Using Class:
Discretization can also be done without predefined class labels (see the binning sketch below). For example:
• Equal-frequency binning: Dividing the data such that each bin contains approximately the same number of data points.
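A minimal pandas sketch of equal-frequency binning; the ages series is hypothetical.

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 67])

# Equal-frequency binning: four quantile bins with ~2 values each
freq_bins = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(freq_bins.value_counts())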
Univariate Analysis:
Univariate analysis involves analyzing one variable at a time to summarize its distribution and main characteristics. Common techniques include:
• Histograms
• Boxplots
Data Transformation:
Data transformation is the process of converting data into a suitable format for analysis. Common techniques, sketched after this list, include:
• Standardization: Centering the data around the mean and scaling it based on standard deviation.
• Kernel Density Estimation (KDE): A non-parametric way to estimate the probability density function of a random variable.
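A minimal sketch of both techniques with NumPy and SciPy; the sample values are hypothetical.

import numpy as np
from scipy.stats import gaussian_kde

x = np.array([4.2, 5.1, 5.8, 6.0, 6.3, 7.1, 8.4])

# Standardization: center on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std()

# KDE: estimate the density non-parametrically and evaluate it on a grid
kde = gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 50)
density = kde(grid)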
Standard Deviation:
Standard deviation measures the spread or dispersion of a dataset. It shows how much variation exists from the mean (see the small example below):
• High standard deviation: Data points are spread out over a wider range.
• Low standard deviation: Data points are clustered close to the mean.
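A tiny worked example contrasting the two cases; both lists are invented and share the same mean of 50.

from statistics import pstdev

tight = [49, 50, 50, 51]    # values clustered near the mean
spread = [20, 40, 60, 80]   # values far from the mean

print(pstdev(tight))   # ~0.71: low standard deviation
print(pstdev(spread))  # ~22.36: high standard deviation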
Statistical Tests:
• Chi-square test: Used to test the independence of two categorical variables (see the sketch below).
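A minimal SciPy sketch of the chi-square test of independence; the contingency table counts are invented for illustration.

from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: two categorical variables cross-tabulated
table = [[30, 10],
         [20, 40]]

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value suggests the variables are not independent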