How Should Data Preparation Be Done For An Analytics Project
David Manne
[email protected]
Contents
01 Introduction to Data Preparation
02 Data Collection
03 Data Cleaning
04 Data Transformation
05 Data Reduction
06 Data Integration
Common Challenges in Data Preparation
02 Data Collection
Identifying Data Sources
Data Scraping
Web scraping tools (e.g., Beautiful Soup)
Automated bots for data extraction
Custom scripting for web data collection
Tools for Data Collection
Database Management Systems
• SQL-based systems (e.g., MySQL, PostgreSQL)
• NoSQL databases (e.g., MongoDB, Cassandra)
• Cloud-based solutions (e.g., Google BigQuery)

APIs and Web Services
• RESTful APIs
• SOAP-based web services
• Public API integrators (e.g., Zapier)

Data Integration Platforms
• ETL tools (e.g., Talend, Apache Nifi)
• Data warehousing solutions (e.g., Snowflake)
• Data lake platforms (e.g., AWS Lake Formation)
03 Data Cleaning
Handling Missing Values
Detecting Missing Data
• Methods to detect missing data: null checks, summary statistics
• Visualizing missing data with heatmaps
• Differentiating between data missing at random and not at random

Imputation
• Mean/median/mode imputation for numerical data
• Using regression models for more accurate imputation
• Employing the k-Nearest Neighbors algorithm for imputation

Removing Records
• Removing records with substantial missing data
• Evaluating the impact of removing records on the dataset
• Best practices for documenting removed records
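The detection and imputation steps above can be sketched in pandas; the dataset, column names, and values are hypothetical, and median/mean imputation stands in for the regression and k-NN variants:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing numeric values (illustration only).
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

# Detect missing data: null checks and summary statistics.
print(df.isnull().sum())      # missing-value count per column
print(df.describe())          # summary statistics ignore NaN by default

# Simple imputation for numerical columns.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Alternative: drop records with substantial missing data, then
# evaluate the impact by comparing row counts before and after.
```

Documenting which rows were imputed or dropped (e.g., keeping a boolean mask column) supports the record-keeping practices listed above.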
Correcting Inaccurate Data
Detecting Outliers
• Statistical methods: Z-score, IQR method
• Visualization tools: box plots, scatter plots
• Software tools and libraries for outlier detection

Reducing Outlier Impact
• Transforming data to reduce impact (log transformation)
• Winsorizing data to limit extreme values
• Using robust statistical methods less sensitive to outliers

Understanding Outlier Effects
• Potential distortion of statistical summaries and models
• Understanding and addressing biases introduced by outliers
• Strategies for appropriately reporting and documenting outliers
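The Z-score, IQR, and winsorizing techniques above can be sketched as follows; the series values and percentile cutoffs are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical numeric series with one extreme value (illustration only).
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Z-score method: flag points far from the mean in standard-deviation units.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# IQR method: flag points outside 1.5 * IQR beyond the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Winsorizing: clip extreme values to the 5th/95th percentiles.
winsorized = s.clip(s.quantile(0.05), s.quantile(0.95))
```

Note that on small samples a single extreme point inflates the standard deviation, so the Z-score rule can miss outliers that the IQR rule catches; this is one reason the slide recommends robust methods.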
04 Data Transformation
Data Normalization
Feature Scaling
• Standard Scaler
• Min-Max Scaler
• Robust Scaler
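The three scalers named above (as provided by libraries such as scikit-learn) can be written out directly in NumPy to show what each one computes; the feature values are hypothetical:

```python
import numpy as np

# Hypothetical feature column (illustration only).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-Max scaling: rescale linearly into the [0, 1] range.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: subtract the mean, divide by the standard deviation.
x_standard = (x - x.mean()) / x.std()

# Robust scaling: center on the median, divide by the IQR,
# so extreme values have less influence on the scale.
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)
```

Standard scaling suits roughly normal features; min-max scaling suits bounded inputs; robust scaling is preferable when outliers would distort the mean and standard deviation.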
05 Data Reduction
Data Sampling Methods
Random Sampling
Stratified Sampling
Systematic Sampling
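The three sampling methods above can be sketched in pandas; the dataset, group labels, sampling fraction, and step size are all hypothetical:

```python
import pandas as pd

# Hypothetical dataset with a grouping column (illustration only).
df = pd.DataFrame({
    "group": ["A"] * 80 + ["B"] * 20,
    "value": range(100),
})

# Random sampling: draw 10% of rows uniformly at random.
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: draw 10% within each group, so the
# sample preserves the original group proportions.
stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=42)

# Systematic sampling: take every k-th row from a fixed start.
k = 10
systematic = df.iloc[::k]
```

Stratified sampling is the usual choice when a minority group (here "B") would otherwise be under-represented by pure random sampling.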
06 Data Integration
Combining Data from Multiple Sources
01 Data Reconciliation Techniques
02 Automated Consistency Checks
03 Addressing Data Conflicts
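One common reconciliation check when combining sources is an outer join with match indicators, which surfaces records that exist in only one source; the tables and key names below are hypothetical:

```python
import pandas as pd

# Hypothetical tables from two sources sharing a key (illustration only).
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Caro"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20.0, 35.0, 15.0, 50.0]})

# Outer join keeps unmatched rows from both sides; the indicator
# column records whether each row came from the left source,
# the right source, or both.
merged = customers.merge(orders, on="customer_id",
                         how="outer", indicator=True)

# Rows present in only one source need reconciliation or conflict
# resolution before the integrated dataset can be trusted.
conflicts = merged[merged["_merge"] != "both"]
```

Running this check automatically on each load is a simple form of the "Automated Consistency Checks" named above.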
Importance of Metadata
• Verification against original sources
• Cross-referencing with reputable data
• Regular updates and corrections

Metadata Tools and Techniques
• Ensuring all required fields are filled
• Handling missing data appropriately
• Tracking data entry processes

Metadata Standards
• Standardizing data formats
• Synchronizing data across systems
• Regularly reconciling data entries
Validation Techniques
Manual Checks
• Cross-checking data entry
• Reviewing reports for anomalies
• Double-checking critical data points

Automated Checks
• Implementing validation rules in software
• Utilizing data validation scripts
• Automated error detection and correction

Statistical Checks
• Using statistical tools to identify outliers
• Applying predictive models for validation
• Trend analysis to flag discrepancies
Ensuring Data Integrity
Database Constraints
• Using primary and foreign keys
• Enforcing data type restrictions
• Referential integrity rules

Auditing
• Maintaining audit trails
• Regular system audits and reviews
• Monitoring access and changes to data

Error Handling
• Automated error alerts
• User feedback and reporting systems
• Regular error logs and reviews
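Primary keys, value restrictions, and referential integrity rules can be demonstrated with SQLite; the table and column names are hypothetical, and note that SQLite only enforces foreign keys when the pragma is enabled:

```python
import sqlite3

# In-memory database demonstrating integrity constraints (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection

conn.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
)""")
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    amount      REAL CHECK (amount > 0),  -- value restriction
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
)""")

conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (1, 1, 20.0)")  # valid row

# An order referencing a missing customer violates referential integrity
# and is rejected by the database, not silently stored.
try:
    conn.execute("INSERT INTO orders VALUES (2, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Pushing these rules into the database schema means bad records are rejected at write time, complementing the audit and error-reporting practices above.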
Thanks