UNIT I: INTRODUCTION TO MACHINE LEARNING
1. Introduction to Machine Learning
Definition:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that gives systems
the ability to learn and improve from experience automatically, without being explicitly
programmed.
Objective:
To develop algorithms that can generalize from data and make predictions or decisions.
Applications:
Email filtering, speech recognition, recommendation systems, medical diagnosis, stock
market prediction, etc.
2. Machine Learning Types
Type | Description | Example
Supervised Learning | Learn from labeled data. Predict output for new input. | Regression, Classification
Unsupervised Learning | Discover hidden patterns in unlabeled data. | Clustering, Association
Semi-supervised Learning | Mix of labeled and unlabeled data for training. | Text classification
Reinforcement Learning | Learn through rewards and punishments. | Game AI, Robotics
3. Types of Data
Structured Data: Tabular format, e.g., CSV files, SQL tables.
Unstructured Data: Text, images, audio, videos.
Semi-structured Data: JSON, XML – not strictly tabular but organized.
Categorical Data: Represents categories (e.g., gender: male/female).
Numerical Data: Integer or floating-point numbers.
4. Exploring Structure of Data
Steps:
o Understand dataset shape and size.
o Check for missing values.
o Use summary statistics (mean, median, std).
o Visualize distributions (histograms, box plots).
o Analyze correlation between features.
Tools: Pandas, NumPy, Matplotlib, Seaborn
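The exploration steps above can be sketched on a tiny dataset. This is only an illustration: the rows, the column meaning (age, income), and the helper names summarize and pearson are made up for the example, and it uses plain Python so it runs without the libraries listed under Tools. In practice Pandas offers the same checks in a few calls.

```python
import math
import statistics

# Hypothetical toy dataset: each row is (age, income); None marks a missing value.
rows = [
    (25, 30000.0),
    (32, 42000.0),
    (47, None),
    (51, 68000.0),
    (38, 50000.0),
]

# Shape and size: number of rows and number of columns.
shape = (len(rows), len(rows[0]))

# Missing values counted per column.
missing = [sum(1 for r in rows if r[c] is None) for c in range(shape[1])]


def summarize(values):
    """Return mean, median and sample standard deviation, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


age_summary = summarize([r[0] for r in rows])

# Correlation is computed on rows with no missing values only.
complete = [r for r in rows if None not in r]
corr = pearson([r[0] for r in complete], [r[1] for r in complete])

print("shape:", shape)
print("missing per column:", missing)
print("age summary:", age_summary)
print("age/income correlation:", round(corr, 3))
```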
5. Data Quality and Remediation
Common Data Issues:
o Missing values
o Duplicates
o Outliers
o Inconsistent formatting
Remediation Techniques:
o Imputation (mean, median, mode)
o Removing duplicates
o Normalizing/standardizing data
o Outlier detection and handling (Z-score, IQR)
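A minimal sketch of three of these techniques (duplicate removal, median imputation, IQR-based outlier detection) follows. The records are invented for illustration, and the 1.5 x IQR fence is the common rule of thumb rather than something prescribed by these notes.

```python
import statistics

# Hypothetical records: (id, value); None marks a missing value.
records = [
    (1, 10.0), (2, 12.0), (2, 12.0), (3, None),
    (4, 11.0), (5, 13.0), (6, 95.0),
]

# 1. Remove exact duplicates while preserving the original order.
deduped = list(dict.fromkeys(records))

# 2. Impute missing values with the median of the observed values.
observed = [v for _, v in deduped if v is not None]
median_value = statistics.median(observed)
imputed = [(i, median_value if v is None else v) for i, v in deduped]

# 3. Flag outliers with the IQR rule: anything outside
#    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is treated as an outlier.
values = [v for _, v in imputed]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(i, v) for i, v in imputed if v < low or v > high]

print("after deduplication:", len(deduped), "records")
print("imputed median:", median_value)
print("outliers:", outliers)
```

Whether a flagged value should be dropped, capped, or kept depends on the problem; the rule only marks candidates for review.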
6. Data Preprocessing
Purpose: Prepare raw data for ML models.
Steps:
o Cleaning: Remove noise and inconsistencies.
o Encoding: Convert categorical values to numerical ones (label or one-hot encoding).
o Normalization: Scale features to a standard range.
o Feature extraction and selection.
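The encoding and normalization steps can be illustrated as below. The samples and the helper names one_hot and min_max are hypothetical, and min-max scaling to [0, 1] is just one common choice of "standard range".

```python
# Hypothetical raw samples: (color, height_cm).
samples = [("red", 150.0), ("green", 180.0), ("blue", 165.0), ("green", 170.0)]

# One-hot encoding: one binary column per category, in sorted order.
categories = sorted({color for color, _ in samples})  # ['blue', 'green', 'red']


def one_hot(value):
    """Return a list with 1 in the position of the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


# Min-max normalization: rescale a numeric feature to the range [0, 1].
heights = [h for _, h in samples]
h_min, h_max = min(heights), max(heights)


def min_max(value):
    """Map value linearly so that h_min becomes 0.0 and h_max becomes 1.0."""
    return (value - h_min) / (h_max - h_min)


# Final feature vectors: one-hot columns followed by the scaled height.
features = [one_hot(color) + [min_max(h)] for color, h in samples]

for row in features:
    print(row)
```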
7. Model Selection
Goal: Choose the best algorithm for your problem.
Factors to Consider:
o Nature of data (linear/nonlinear)
o Training time
o Accuracy
o Interpretability
Common Algorithms:
o Linear Regression, Decision Tree, KNN, SVM, Random Forest, Neural Networks
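One simple way to compare candidate algorithms on the accuracy factor is to fit each on the same training data and score it on held-out data. The sketch below does this for two deliberately simple candidates, a constant mean predictor and a one-feature linear regression. The data and the function names (fit_mean, fit_linear, mse) are assumptions made for illustration only.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """One-feature linear regression fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a model's predictions on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data with a roughly linear relationship.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

candidates = {"mean baseline": fit_mean, "linear regression": fit_linear}
scores = {
    name: mse(fit(train_x, train_y), val_x, val_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)

print(scores)
print("selected:", best)
```

Held-out error covers only one of the factors listed above; training time and interpretability still have to be weighed separately.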
8. Training and Testing the Model
Training Set: Used to fit the model.
Testing Set: Used to evaluate the model’s performance.
Validation Set (optional): Used during model tuning.
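Cross-Validation: Split the data into k parts (folds); train on k-1 folds and validate on
the remaining one, rotating until every fold has served as the validation set. Averaging
the scores gives a more reliable performance estimate and helps detect overfitting.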
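The splitting ideas in this section can be sketched in plain Python as follows. The function names train_test_split and k_fold_indices, the 80/20 ratio, and the choice of 4 folds are illustrative assumptions; libraries such as scikit-learn provide their own utilities for this.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of items and hold out the last test_ratio share as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[:-n_test], shuffled[-n_test:]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop


data = list(range(10))
train, test = train_test_split(data, test_ratio=0.2)

# Cross-validation is run on the training portion only; the test set stays untouched.
folds = list(k_fold_indices(len(train), k=4))
for fold_train, fold_val in folds:
    print("train idx:", fold_train, "validation idx:", fold_val)
```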
9. Model Representation
Model Parameters: Values learned during training (e.g., weights in linear regression).
Model Hypothesis: Mathematical function approximating the relationship between input
and output.
Loss Function: Measures error between predicted and actual output.
Example:
o Linear Regression:
y = w_1 x + w_0
where w_1 (slope) and w_0 (intercept) are the parameters.
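To tie parameters, hypothesis, and loss function together, here is a small sketch that fits the slope and intercept (w1 and w0 in the formula above) by ordinary least squares and measures the error with mean squared error. The data points are invented, and the closed-form fit and the MSE loss are one common pairing chosen here for illustration.

```python
def fit_line(xs, ys):
    """Return (w1, w0) minimizing the squared error of y = w1 * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    w0 = mean_y - w1 * mean_x
    return w1, w0


def predict(x, w1, w0):
    """Hypothesis: the model's prediction for input x."""
    return w1 * x + w0


def mse(xs, ys, w1, w0):
    """Loss function: mean squared error between predictions and actual outputs."""
    return sum((predict(x, w1, w0) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w1, w0 = fit_line(xs, ys)
loss = mse(xs, ys, w1, w0)

print("w1 =", w1, "w0 =", w0, "loss =", loss)
```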