The document outlines key differences between supervised and unsupervised learning, classification and regression, as well as normalization and standardization in machine learning. It also details techniques for data cleaning, including handling missing data, removing duplicates, managing outliers, and encoding categorical variables. The document emphasizes the importance of preparing data for effective analysis and modeling.
DIFFERENCES

a) Supervised Learning vs. Unsupervised Learning


Supervised Learning uses labeled data: the algorithm learns from input-output pairs to make predictions. Example: predicting house prices from historical data.
Unsupervised Learning works with unlabeled data, identifying hidden patterns without predefined outputs. Example: grouping customers by purchasing behavior.
b) Classification vs. Regression

Classification is a supervised machine learning task where the goal is to predict a discrete label/category, assigning input data to one or more classes. Example: image recognition.
Regression is a supervised machine learning task where the goal is to predict a continuous, typically numerical, output value. Example: house price prediction.
c) Normalization vs. Standardization
Normalization rescales values to fit within a specific range (e.g., 0–1). After normalization, all feature values lie within the chosen range (e.g., [0, 1]).
Standardization rescales values to have a mean of 0 and a standard deviation of 1. After standardization, each feature has mean 0 and standard deviation 1.
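As a quick sketch with illustrative data, both transformations can be computed directly with NumPy (min-max for normalization, z-score for standardization):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): rescale into the [0, 1] range.
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): shift to mean 0 and scale to std 1.
standardized = (values - values.mean()) / values.std()
```

In practice, libraries such as scikit-learn provide MinMaxScaler and StandardScaler for the same purpose.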
Techniques for Data Cleaning
1. Handling Missing Data
Remove Missing Values: If missing values are few, rows or columns with missing
values can be dropped.
Imputation: Fill missing values using techniques like:
Mean, median, or mode imputation.
Forward or backward fill (for time-series data).
Predictive imputation using machine learning models.
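A minimal sketch of mean and mode imputation with pandas, using a small made-up table (for time-series data, `ffill()`/`bfill()` would replace `fillna` here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 40, np.nan],
                   "city": ["Lagos", "Abuja", None, "Lagos", "Lagos"]})

# Mean imputation for the numeric column.
df["age"] = df["age"].fillna(df["age"].mean())
# Mode imputation for the categorical column.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```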
2. Removing Duplicates
Detecting and removing duplicate rows that may cause redundancy in analysis.
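With pandas this is a one-liner, sketched here on toy data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Ben", "Ada"], "score": [90, 85, 90]})

# Drop fully duplicated rows, keeping the first occurrence.
deduped = df.drop_duplicates()
```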
3. Handling Outliers
Use box plots or statistical methods like Z-score or IQR (Interquartile Range) to
detect outliers.
Possible actions:
Remove the outliers.
Transform or cap values (e.g., winsorization).
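The IQR rule mentioned above can be sketched with NumPy on illustrative data (values beyond 1.5 × IQR from the quartiles are flagged):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
```

Capping (winsorization) would instead clip values to `lower` and `upper` rather than dropping them.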
4. Data Type Conversion
Ensuring numerical values are stored as numbers and categorical values as
categories.
5. Standardizing Data Formats
Converting date formats (e.g., "01/02/2023" vs. "2023-02-01").
Ensuring consistent capitalization for text data (e.g., "New York" vs. "new york").
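A small sketch of both fixes in pandas, assuming dates arrive in day/month/year form and should be stored as ISO (YYYY-MM-DD):

```python
import pandas as pd

df = pd.DataFrame({"date": ["01/02/2023", "03/04/2023"],
                   "city": ["New York", "new york"]})

# Parse the known input format, then emit a single ISO format.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
# Normalize capitalization so "New York" and "new york" match.
df["city"] = df["city"].str.title()
```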
6. Handling Inconsistent Data
Correcting typos and inconsistencies (e.g., "Male" vs. "M" vs. "male").
Merging similar categories (e.g., "USA" and "United States").
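Both corrections can be expressed as a mapping from variant spellings to one canonical label, sketched here with pandas `replace`:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "M", "male", "F", "Female"],
                   "country": ["USA", "United States", "USA", "UK", "UK"]})

# Map every variant onto a single canonical value.
df["gender"] = df["gender"].replace({"M": "Male", "male": "Male", "F": "Female"})
df["country"] = df["country"].replace({"United States": "USA"})
```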
7. Encoding Categorical Variables
Converting categorical variables into numerical form using:
One-Hot Encoding (e.g., converting "Red", "Blue", "Green" into binary features).
Label Encoding (assigning numeric labels like 0, 1, 2).
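Both encodings sketched with pandas (label encoding here uses category codes, which assign integers in alphabetical order of the categories):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code.
df["color_code"] = df["color"].astype("category").cat.codes
```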
8. Removing Irrelevant Features
Dropping unnecessary columns (e.g., user IDs that don't contribute to prediction).
9. Handling Imbalanced Data
Using oversampling (e.g., SMOTE) or undersampling to balance class
distributions in classification problems.
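As a sketch on made-up data, simple random oversampling can be done with plain pandas by resampling the minority class with replacement (SMOTE, from the imbalanced-learn package, instead synthesizes new minority samples):

```python
import pandas as pd

df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})  # 8:2 class imbalance

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Resample the minority class until it matches the majority class size.
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled])
```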
