Machine Learning 600 - Chapter 6
MCI600G
Bachelor of Science in Information Technology
Lecturer: Mr. Thabiso Aphane
Chapter 6: Machine Learning with Text Documents,
Sentiment Analysis & Image Processing
LEARNING OUTCOMES
Customizing Stopwords:
• Use domain-specific stopwords in addition to general ones for tailored analysis.
• Store domain-specific stopwords in a separate list or append them to the general stopwords file.
Implementation:
• Using Java's Stream API and removeAll method, you can filter out stopwords efficiently from text
data.
Stopword removal enhances text preprocessing, ensuring the algorithm focuses on relevant
content.
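The stopword filtering described above can be sketched with Java's Stream API. The class name and word lists below are illustrative, not from the study guide; a domain-specific list is simply merged with the general one before filtering:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordDemo {
    // General English stopwords plus domain-specific additions (illustrative lists).
    static final Set<String> STOPWORDS = Set.of("the", "is", "a", "of", "patient", "doctor");

    // Keep only the tokens that are not stopwords.
    public static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOPWORDS.contains(t.toLowerCase()))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords(List.of("the", "patient", "has", "a", "fever")));
        // prints [has, fever]
    }
}
```

The `removeAll` approach mentioned above is the mutable alternative: copy the tokens into an `ArrayList` and call `tokens.removeAll(STOPWORDS)`.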
6.3 Stemming
Stemming reduces words to their base or root form, helping to normalize text. For example:
Variations like likes, liked, and liking are reduced to like.
Benefits:
• Reduces dimensionality in text analysis.
• Makes pattern recognition easier.
Caution:
• Over-Stemming: Risk of reducing words to ambiguous roots (e.g., port from porter and
porting), potentially losing context.
• Tools like Apache OpenNLP provide robust stemming algorithms and other preprocessing
utilities.
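A toy suffix-stripping stemmer makes the idea (and the over-stemming risk) concrete. This is not Porter's algorithm — production code would use a library stemmer such as OpenNLP's — but it shows how inflected forms collapse to a shared root:

```java
public class SimpleStemmer {
    // A toy rule set: strip common English suffixes. Real stemmers apply
    // many more rules and conditions to avoid over-stemming.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed")  && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("es")  && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s")   && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        // likes, liked, and liking all reduce to the same root.
        System.out.println(SimpleStemmer.stem("liking"));
        // "porting" loses its distinction from "porter" once stemmed to "port".
        System.out.println(SimpleStemmer.stem("porting"));
    }
}
```

Note that even this small rule set reproduces the slide's over-stemming example: both *porting* and *porter* end up near the ambiguous root *port*.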
6.4 N-grams
N-grams are contiguous sequences of n items (words or characters) from text. They are crucial
for understanding contextual patterns.
Examples:
• Unigrams: Single words.
• Bigrams: Two-word sequences (e.g., "machine learning").
• Trigrams: Three-word sequences (e.g., "deep learning models").
Applications:
• Context Prediction: Identify likely sequences of words to predict the next word or phrase.
• TF/IDF Scoring: N-grams often provide more context-aware results than individual words.
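Generating word n-grams is a simple sliding-window operation. A minimal sketch (class name illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Slide an n-sized window over the token list and join each window into a string.
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Bigrams of "deep learning models":
        System.out.println(ngrams(List.of("deep", "learning", "models"), 2));
        // prints [deep learning, learning models]
    }
}
```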
6.5 TF/IDF
TF/IDF (Term Frequency–Inverse Document Frequency) is a fundamental technique for assigning importance to words within a document relative to a larger corpus.
Components:
• Term Frequency (TF): Counts occurrences of a term in a document.
• Inverse Document Frequency (IDF): Reduces the importance of terms that
are common across all documents.
Use Cases:
Widely used in search engines, recommendation systems, and topic modeling.
[Reference study guide: Pages 69-70]
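The two components combine by multiplication: a term scores highly when it is frequent in one document but rare across the corpus. A minimal sketch using raw counts and the natural logarithm (one of several common TF/IDF variants; class and method names are illustrative):

```java
import java.util.List;

public class TfIdf {
    // Term Frequency: raw count of the term in one document.
    public static double tf(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    // Inverse Document Frequency: log(corpus size / number of documents containing the term).
    public static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        if (docsWithTerm == 0) return 0.0; // term never occurs in the corpus
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    public static double tfidf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }
}
```

A term that appears in every document (like "learning" in a corpus about ML) gets an IDF of log(1) = 0, so it contributes nothing — exactly the down-weighting of common terms described above.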
What is an Image?
At its core, a computer image is a grid of numbers, with each "square" representing a
pixel. Images can be binary (e.g., black-and-white) or contain multiple colors, with
numeric values representing the intensity or color information.
Color Depth refers to the number of bits used to represent each pixel.
Higher bit depths can encode more colors:
• 1-bit: 2 colors (black and white).
• 8-bit: 256 colors.
• 24-bit: ~16.7 million colors (True Color).
Implications for ML:
High Color Depth: Increases the richness of image representation but also computational cost.
Optimization: Reducing color depth and image size helps improve processing speed, especially for
large datasets.
Example: A 24-bit color image (e.g., a sunflower image) with a resolution of 800 × 800 pixels contains 800 × 800 × 24 = 15,360,000 bits of information.
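The size calculation is simply width × height × bits per pixel. A minimal sketch (class name illustrative):

```java
public class ImageSize {
    // Uncompressed image size in bits = width × height × bits per pixel.
    public static long sizeInBits(int width, int height, int bitsPerPixel) {
        return (long) width * height * bitsPerPixel;
    }

    public static void main(String[] args) {
        // An 800 × 800 image at 24-bit color depth:
        System.out.println(sizeInBits(800, 800, 24)); // prints 15360000
        // The same image at 1-bit (binary) depth is 24x smaller:
        System.out.println(sizeInBits(800, 800, 1));  // prints 640000
    }
}
```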
6.6.2 Images in ML
In machine learning, images are treated as matrices of numbers, where each number represents pixel information.
Steps in Image Preparation:
Resizing:
• Images are resized to smaller, consistent dimensions (e.g., 16 × 16 or 28 × 28 pixels) to speed up processing.
• Uniform image dimensions ensure consistency in training and model performance.
Normalization:
• Scale pixel values (e.g., 0–255) to a smaller range (e.g., 0–1) to improve model efficiency.
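Normalization is a single division per pixel. A minimal sketch for 8-bit grayscale values (class name illustrative):

```java
public class PixelNormalization {
    // Scale 8-bit pixel values (0–255) into the 0–1 range by dividing by 255.
    public static double[] normalize(int[] pixels) {
        double[] scaled = new double[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            scaled[i] = pixels[i] / 255.0;
        }
        return scaled;
    }
}
```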
Training Efficiency:
• Smaller grids reduce computational complexity.
• Batch processing of images ensures faster training, especially when working with large datasets.
Challenges:
• High-resolution images require significant processing power and time.
• Tens of thousands of images in datasets like ImageNet or CIFAR-10 can take hours or days to train.
[Reference study guide: Pages 71-73]
6.7.2 Classification
After feature extraction, the CNN uses the extracted features to classify the input
image.
Fully Connected Layer (FC):
• The final layers of a CNN are fully connected layers (e.g., a multilayer perceptron).
• They aggregate information from the feature maps and output the probabilities for
each class.
Steps in the CNN Process:
1. Convolution ➡ ReLU ➡ Pooling: Extracts and condenses features.
2. Repeat the above steps multiple times, forming deeper layers.
3. Fully Connected Layer: Performs classification based on the processed features.
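The three operations in step 1 can be sketched in plain Java. This is a minimal, single-channel illustration (square inputs, no padding or stride options), not a usable CNN implementation:

```java
public class CnnOps {
    // "Valid" 2D convolution (strictly, cross-correlation, as in most CNN libraries):
    // slide the kernel over the image and sum elementwise products.
    public static double[][] convolve(double[][] img, double[][] k) {
        int n = img.length - k.length + 1;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int a = 0; a < k.length; a++)
                    for (int b = 0; b < k.length; b++)
                        out[i][j] += img[i + a][j + b] * k[a][b];
        return out;
    }

    // ReLU activation: clamp negative values to zero.
    public static double[][] relu(double[][] m) {
        double[][] out = new double[m.length][m[0].length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                out[i][j] = Math.max(0.0, m[i][j]);
        return out;
    }

    // 2x2 max pooling with stride 2: keep the largest value in each block,
    // halving each dimension of the feature map.
    public static double[][] maxPool(double[][] m) {
        double[][] out = new double[m.length / 2][m[0].length / 2];
        for (int i = 0; i < out.length; i++)
            for (int j = 0; j < out[0].length; j++)
                out[i][j] = Math.max(
                    Math.max(m[2 * i][2 * j], m[2 * i][2 * j + 1]),
                    Math.max(m[2 * i + 1][2 * j], m[2 * i + 1][2 * j + 1]));
        return out;
    }
}
```

Chaining `maxPool(relu(convolve(img, kernel)))` is one pass of step 1; a real CNN repeats this with many learned kernels before the fully connected layer.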
… 6.7.2 Classification
Applications of CNNs:
• Object Detection: Identifying objects within an image (e.g., cars, animals).
• Speech Recognition: Converting sound spectrograms into text or commands.
• Image Segmentation: Dividing an image into regions for analysis.
Popular CNN Architectures:
• LeNet-5: Early architecture for digit recognition.
• AlexNet: Revolutionized computer vision by introducing deeper networks.
• ResNet: Uses skip connections to enable training of very deep networks.
[Reference study guide: Pages 73-76]
6.8 CNN & Transfer Learning
Pre-trained Models: These are models that have been trained on large datasets (e.g., ImageNet)
and have already learned generic features like edges, textures, or colors in their initial layers.
… 6.8 CNN & Transfer Learning
How Transfer Learning Works: There are two main strategies for implementing transfer learning with CNNs:
Feature Extraction:
• Use the convolutional base (feature extraction layers) of a pre-trained model as it is.
• Replace the classification layer with your custom classifier to suit your specific problem.
• Only train the custom classifier layer while keeping the convolutional base frozen (weights unchanged).
Fine-tuning:
• Unfreeze the pre-trained model's deeper layers and train them along with the custom classifier.
• This allows the model to adjust pre-trained weights slightly for your specific dataset, improving
performance.
… 6.8 CNN & Transfer Learning
Benefits:
• Reduced Training Time: Skip the time-consuming feature-extraction and initial-training stages.
• Lower Computational Cost: Avoid re-training on massive datasets.
• Enhanced Performance: Leverage proven architectures and reduce the risk of overfitting on small datasets.
Key Insights:
• Transfer learning is particularly powerful with CNNs because the early layers learn universal patterns (like
edges or gradients), while the deeper layers focus on specific details.
• While using transfer learning, consider the similarity of your data to the original dataset to ensure
effective results.
• Always assess whether freezing or fine-tuning layers provides better performance for your task.
6.9 Assessing the Performance of ML Algorithms
6.9.1 Classification and Confusion Matrix
Choosing Metrics:
Metrics depend on the context. For example:
• In spam detection, false negatives are less critical than false positives.
• In credit scoring, false positives (approving risky customers) are more detrimental than
false negatives (denying credit to reliable customers).
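The metrics being chosen between are simple ratios of confusion-matrix counts (TP = true positives, FP = false positives, TN = true negatives, FN = false negatives). A minimal sketch (class name illustrative):

```java
public class ClassifierMetrics {
    // Precision: of everything predicted positive, how much really was positive.
    // Sensitive to false positives — the costly error in spam detection.
    public static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Recall: of everything actually positive, how much was found.
    // Sensitive to false negatives.
    public static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Accuracy: fraction of all predictions that were correct.
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
}
```

For the spam example above, a spam filter would be tuned for high precision (few legitimate emails flagged); a credit scorer worried about approving risky customers would watch false positives in the same way.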
6.9.2 Regularization
Types of Regularization:
• L1 Regularization (Lasso): Adds a penalty proportional to the absolute value of
coefficients.
• L2 Regularization (Ridge): Adds a penalty proportional to the square of
coefficients.
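The two penalty terms added to the loss function can be computed directly from the coefficient vector. A minimal sketch, with λ (lambda) as the regularization strength (class name illustrative):

```java
public class Regularization {
    // L1 (lasso) penalty: lambda * sum of |w_i|. Tends to drive coefficients to exactly zero.
    public static double l1(double[] w, double lambda) {
        double sum = 0.0;
        for (double v : w) sum += Math.abs(v);
        return lambda * sum;
    }

    // L2 (ridge) penalty: lambda * sum of w_i^2. Shrinks coefficients smoothly toward zero.
    public static double l2(double[] w, double lambda) {
        double sum = 0.0;
        for (double v : w) sum += v * v;
        return lambda * sum;
    }
}
```

In training, either penalty is added to the ordinary loss, so larger coefficients cost more and the model is pushed toward simpler fits.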