Assignment 2 - Data Collection and Preprocessing
Data collection is a crucial step in the data analysis process. It involves gathering relevant data from
various sources. Some common data collection methods and sources include:
Missing data and outliers can significantly impact the accuracy and reliability of data analysis. Here
are common approaches to handling each:
1. Missing data:
● Deletion: Remove observations or variables that contain missing data. This method is
appropriate when only a small proportion of the data is missing.
● Imputation: Estimate missing values from other available information. Common
imputation methods include mean imputation, regression imputation, and multiple
imputation.
2. Outliers:
● Detection: Identify outliers using statistical techniques such as z-scores, box plots, or
Mahalanobis distance. Visual exploration of data using scatter plots or histograms
can also reveal potential outliers.
● Treatment: Depending on the context, outliers can be removed (truncation), capped at a
chosen percentile (winsorization), or replaced with values estimated by more robust
statistical techniques.
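The handling strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample values, the function names, and the choice of the box-plot (1.5 × IQR) rule are assumptions made for the example. Capping at the box-plot fences is used here as a winsorization-style treatment; strict winsorization caps at chosen percentiles instead.

```python
from statistics import mean, median, quantiles

# Hypothetical sample: None marks a missing measurement, 95.0 is a suspicious value.
raw = [12.0, 15.0, None, 14.0, 13.0, 95.0, None, 16.0, 14.0, 15.0]

def drop_missing(values):
    """Deletion: keep only the observed values."""
    return [v for v in values if v is not None]

def impute(values, statistic=mean):
    """Simple imputation: fill each gap with a statistic of the observed values."""
    fill = statistic(drop_missing(values))
    return [fill if v is None else v for v in values]

def iqr_fences(values, k=1.5):
    """Box-plot rule: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(values, low, high):
    """Cap values at the given bounds instead of removing them."""
    return [min(max(v, low), high) for v in values]

observed = drop_missing(raw)
low, high = iqr_fences(observed)
outliers = [v for v in observed if v < low or v > high]
capped = cap(observed, low, high)

# The median is less sensitive to the extreme value than the mean,
# so the two imputation choices fill the gaps with different numbers.
mean_filled = impute(raw, mean)
median_filled = impute(raw, median)
```

Comparing `mean_filled` with `median_filled` shows why outliers are often examined before imputing: a single extreme value pulls the mean, and therefore every imputed value, away from the bulk of the data.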
Data cleaning is a critical step in data preprocessing to ensure data accuracy and consistency. Here
are common data cleaning steps:
1. Duplicate data: Identify and remove duplicate entries to avoid double counting or biased
results.
2. Consistency checks: Verify data consistency by checking for logical relationships between
variables. For example, cross-check age against birth date to confirm that the two agree.
3. Data validation: Validate data against predefined rules or criteria. Check for data integrity,
completeness, and adherence to data types and formats.
4. Data profiling: Conduct data profiling to understand the distribution, summary statistics, and
patterns in the data. Identify potential issues such as data skewness, missing values, or
outliers.
5. Addressing data integrity issues: Resolve data integrity issues such as data entry errors, data
corruption, or data format inconsistencies.
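Several of the cleaning steps above (duplicate removal, the age and birth date consistency check, and rule-based validation) can be sketched as follows. The record layout, field names, reference date, and the 0 to 120 age range are all hypothetical choices made for this illustration.

```python
from datetime import date

# Hypothetical records; field names and values are assumptions for the example.
records = [
    {"id": 1, "birth_date": date(1990, 5, 17), "age": 33},
    {"id": 2, "birth_date": date(1985, 1, 3), "age": 25},    # age disagrees with birth date
    {"id": 1, "birth_date": date(1990, 5, 17), "age": 33},   # exact duplicate of the first record
    {"id": 3, "birth_date": date(2001, 11, 30), "age": -4},  # fails the range rule
]
AS_OF = date(2023, 6, 1)  # assumed date on which the ages were recorded

def remove_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def age_on(birth_date, as_of):
    """Age in completed years on the reference date."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def find_problems(row, as_of):
    """Return the list of rule violations for one record."""
    problems = []
    if not isinstance(row["age"], int) or not 0 <= row["age"] <= 120:
        problems.append("age out of range")
    elif row["age"] != age_on(row["birth_date"], as_of):
        problems.append("age inconsistent with birth date")
    return problems

unique = remove_duplicates(records)
report = {row["id"]: find_problems(row, AS_OF) for row in unique}
```

Collecting violations in a report, rather than silently dropping the offending rows, keeps the decision about how to resolve each issue separate from the step that detects it.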
Data transformation and normalization techniques are used to modify the data to meet certain
requirements of the analysis.
These techniques are employed to improve the distribution, comparability, and suitability of the data
for analysis.
It's important to note that the selection of specific techniques depends on the characteristics of the
data, the analysis objectives, and the specific requirements of the analytical methods being applied.
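As an illustration of how the choice of technique depends on the data, the sketch below applies three widely used transformations to a small right-skewed sample. The income figures and function names are hypothetical, and these three techniques are chosen as common examples rather than as the ones this assignment prescribes.

```python
import math
from statistics import mean, pstdev

def min_max_scale(values):
    """Min-max normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def log_transform(values):
    """log(1 + x) compresses a long right tail; defined only for x > -1."""
    return [math.log1p(v) for v in values]

incomes = [20_000, 25_000, 30_000, 45_000, 400_000]  # hypothetical, right-skewed

scaled = min_max_scale(incomes)   # comparability across variables on different scales
z = standardize(incomes)          # mean 0 and standard deviation 1
logged = log_transform(incomes)   # reduced skewness
```

With a sample like this, min-max scaling squeezes the four smaller incomes into a narrow band near 0 because the single large value defines the top of the range, whereas the log transform spreads them out more evenly. That difference is one reason the characteristics of the data should guide which technique is applied.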
Data preprocessing is a flexible process that requires careful consideration and exploration of the
data.