ML QP
Section II:
Effort Required | High effort to label datasets | Less effort as no labeling is needed [3][2]
2. Explain Data Preparation Issues in Machine Learning
Data preparation involves refining raw data into a clean and structured format for machine learning
models. Key issues include:
o Data Quality: Incomplete, noisy, or inconsistent data can lead to inaccurate results.
o Data Cleaning: Removing duplicates, handling missing values, and correcting errors.
o Data Transformation: Converting raw data into usable formats (e.g., scaling numerical
features).
o Feature Engineering: Creating new features or selecting relevant ones for better model
performance.
o Overfitting Risks: Ensuring that models generalize well by avoiding excessive reliance on
training data patterns [4][5].
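The cleaning, transformation, and feature engineering issues above can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed method: the sample values, the median fill, the min-max scaling, and the column names Square_Feet_Scaled and Sqft_Per_Room are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing one duplicate row and one missing price
raw = pd.DataFrame({
    "Price": [250000.0, 250000.0, 310000.0, np.nan, 420000.0],
    "Rooms": [3, 3, 4, 2, 5],
    "Square_Feet": [1500, 1500, 1800, 900, 2600],
})

# Data cleaning: drop exact duplicates, then fill the missing price with the median
clean = raw.drop_duplicates().reset_index(drop=True)
clean["Price"] = clean["Price"].fillna(clean["Price"].median())

# Data transformation: min-max scale Square_Feet to the range [0, 1]
sqft = clean["Square_Feet"]
clean["Square_Feet_Scaled"] = (sqft - sqft.min()) / (sqft.max() - sqft.min())

# Feature engineering: derive a new feature from two existing columns
clean["Sqft_Per_Room"] = clean["Square_Feet"] / clean["Rooms"]

print(clean)
```

Filling with the median rather than the mean is one common choice when a column may contain extreme values; dropping the incomplete row would be an equally valid alternative on a larger dataset.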
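import numpy as np
import pandas as pd

# Sample housing data; the last row is an obvious outlier (100 rooms)
data = {'Price': [632541, 425618, 356471, 7412512],
        'Rooms': [2, 5, 3, 100],
        'Square_Feet': [1600, 2850, 1780, 90000]}
df = pd.DataFrame(data)

# Option 1: drop rows whose room count is implausibly high (threshold of 20)
df_cleaned = df[df['Rooms'] < 20]

# Option 2: keep every row but flag outliers (1 = outlier, 0 = normal)
df['Outlier'] = np.where(df['Rooms'] < 20, 0, 1)

print("Cleaned Data:")
print(df_cleaned)
print("\nData with Outlier Marking:")
print(df)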
o Data Preparation: Cleaning and transforming raw data into usable formats.
o Model Selection: Choosing an appropriate algorithm based on the problem type (e.g.,
regression or clustering).
o Training: Feeding labeled or unlabeled data into the model to learn patterns.
o Monitoring and Maintenance: Continuously improving the model by retraining it with new
data [4][2].
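The steps above can be sketched end to end with NumPy alone. This is a simplified illustration under stated assumptions: the data values are made up, the helper names fit_linear, predict, and mse are hypothetical, and ordinary least squares stands in for whatever model a real project would select.

```python
import numpy as np

def fit_linear(X, y):
    """Training: ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted intercept and slope to new inputs."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def mse(coef, X, y):
    """Mean squared error, used here as the monitoring metric."""
    return float(np.mean((predict(coef, X) - y) ** 2))

# Data preparation: hypothetical, already-clean data
# (size in thousands of square feet, price in thousands)
X_old = np.array([[1.0], [1.5], [2.0], [2.5]])
y_old = np.array([200.0, 300.0, 400.0, 500.0])

# Model selection: the target is continuous, so a regression model is chosen
model = fit_linear(X_old, y_old)

# Monitoring: newly collected data follows a shifted relationship
X_new = np.array([[3.0], [3.5], [4.0]])
y_new = np.array([650.0, 750.0, 850.0])
error_before = mse(model, X_new, y_new)

# Maintenance: retrain on the old and new data combined
model = fit_linear(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))
error_after = mse(model, X_new, y_new)

print("MSE on new data before retraining:", error_before)
print("MSE on new data after retraining:", error_after)
```

The error on the new data drops after retraining, which is the behavior the monitoring and maintenance step is meant to deliver.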
o Basic Plots: Line plots, bar charts, pie charts for simple visualizations.
o Advanced Visualizations: Box plots, density plots, heatmaps for deeper insights.
o Interactive Visualizations: Dynamic charts and animations using libraries like Plotly and
Altair.
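A basic plot and two of the advanced visualizations listed above can be sketched with Matplotlib. Matplotlib is an assumption here, since the text names only Plotly and Altair (for interactive charts); the random data and the output file name visualization_sketch.png are also made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 100 samples of 4 numeric features
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=(100, 4))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Basic plot: line plot of the first feature
axes[0].plot(values[:, 0])
axes[0].set_title("Line plot")

# Advanced visualization: box plot showing each feature's distribution
axes[1].boxplot(values)
axes[1].set_title("Box plot")

# Advanced visualization: heatmap of the feature correlation matrix
corr = np.corrcoef(values, rowvar=False)
im = axes[2].imshow(corr, cmap="viridis", vmin=-1, vmax=1)
axes[2].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
fig.savefig("visualization_sketch.png")
```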