Machine Learning 600 - Chapter 6

Chapter 6 covers essential concepts in machine learning related to text documents, sentiment analysis, and image processing. Key topics include text data preparation, stopwords, stemming, N-grams, TF/IDF, and Convolutional Neural Networks (CNNs), along with techniques like transfer learning. The chapter emphasizes the importance of model evaluation and the challenges faced in real-world data scenarios.


Machine Learning

MCI600G
Bachelor of Science in Information Technology
Lecturer: Mr. Thabiso Aphane
Chapter 6: Machine Learning with Text Documents, Sentiment Analysis & Image Processing

LEARNING OUTCOMES

• Understand textual data preparation and analysis
• Understand stopwords and stemming concepts
• Understand N-grams, TF/IDF & Word2Vec algorithms
• Understand string tokenization & Convolutional Neural Networks
• Understand Sentiment Analysis & Image Processing
• Understand the assessment of ML algorithms
6.1 Preparing Text for Analysis
Text data is a cornerstone of modern information systems, appearing in various formats, such as
blog posts, emails, news articles, and academic papers. The sheer volume of text data on the
internet necessitates sophisticated techniques to make sense of it. However, working with text
data poses several challenges:
• Messy Data: Text documents often require cleaning and preprocessing to be usable for
analysis.
• Variety of Formats: Text comes in diverse formats, each requiring tailored approaches.

Key Techniques for Text Analysis:


• TF/IDF (Term Frequency/Inverse Document Frequency): A statistical measure to evaluate
the importance of a word in a document relative to a collection of documents.
• Word2Vec: A neural network-based approach for generating word embeddings that capture
semantic relationships.
• Neural Network Techniques: Models like recurrent neural networks (RNNs) and transformers
such as BERT can generate context-aware insights and even produce new text.
6.2 Stopwords
Stopwords are commonly used words that typically do not add value to text analysis, such as
"the," "and," or "is." Removing these words reduces noise and focuses on more meaningful terms.

Customizing Stopwords:
• Use domain-specific stopwords in addition to general ones for tailored analysis.
• Store domain-specific stopwords in a separate list or append them to the general stopwords file.

Implementation:
• Using Java's Stream API and removeAll method, you can filter out stopwords efficiently from text
data.
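For example, a minimal sketch of this filtering step (the stopword set here is illustrative; in practice it would be loaded from the general and domain-specific lists described above):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilter {
    public static void main(String[] args) {
        // General stopwords plus a domain-specific addition ("patient").
        Set<String> stopwords = Set.of("the", "and", "is", "a", "of", "patient");

        List<String> tokens = List.of("the", "patient", "reported", "mild",
                "and", "persistent", "pain");

        // Keep only tokens that are not in the stopword set.
        List<String> filtered = tokens.stream()
                .filter(t -> !stopwords.contains(t.toLowerCase()))
                .collect(Collectors.toList());

        System.out.println(filtered); // [reported, mild, persistent, pain]
    }
}
```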

Stopword removal enhances text preprocessing, ensuring the algorithm focuses on relevant
content.

6.3 Stemming

Stemming reduces words to their base or root form, helping to normalize text. For example:
Variations like likes, liked, and liking are reduced to like.

Benefits:
• Reduces dimensionality in text analysis.
• Makes pattern recognition easier.

Caution:
• Over-Stemming: Risk of reducing words to ambiguous roots (e.g., port from porter and
porting), potentially losing context.
• Tools like Apache OpenNLP provide robust stemming algorithms and other preprocessing
utilities.
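A minimal sketch using OpenNLP's Porter stemmer (assumes the opennlp-tools library is on the classpath):

```java
import opennlp.tools.stemmer.PorterStemmer;

public class StemDemo {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        for (String word : new String[] {"likes", "liked", "liking"}) {
            // Each variation reduces to the same root, "like".
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}
```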

6.4 N-grams

N-grams are contiguous sequences of n items (words or characters) from text. They are crucial
for understanding contextual patterns.

Examples:
• Unigrams: Single words.
• Bigrams: Two-word sequences (e.g., "machine learning").
• Trigrams: Three-word sequences (e.g., "deep learning models").

Applications:
• Context Prediction: Identify likely sequences of words to predict the next word or phrase.
• TF/IDF Scoring: N-grams often provide more context-aware results compared to individual
words.
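A minimal sketch of n-gram generation from a token list (hand-rolled for illustration; NLP libraries provide equivalent utilities):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Build the n-grams of a token list, e.g. n = 2 for bigrams.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("deep", "learning", "models", "learn", "features");
        System.out.println(ngrams(tokens, 2)); // [deep learning, learning models, models learn, learn features]
        System.out.println(ngrams(tokens, 3)); // [deep learning models, learning models learn, models learn features]
    }
}
```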


6.5 TF/IDF
TF/IDF is a fundamental technique to assign importance to words within a
document relative to a larger corpus.

Components:
• Term Frequency (TF): Counts occurrences of a term in a document.
• Inverse Document Frequency (IDF): Reduces the importance of terms that
are common across all documents.
Use Cases:
Widely used in search engines, recommendation systems, and topic modeling.
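A minimal sketch of one common weighting variant, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t (libraries differ in smoothing and normalization details):

```java
import java.util.List;

public class TfIdf {
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)) -- one common variant.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = doc.stream().filter(term::equals).count();
        double df = corpus.stream().filter(d -> d.contains(term)).count();
        return tf * Math.log(corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("machine", "learning", "with", "text"),
                List.of("deep", "learning", "models"),
                List.of("image", "processing", "with", "machine", "learning"));
        // "learning" appears in every document, so its idf (and tf-idf) is 0.
        System.out.println(tfIdf("learning", corpus.get(0), corpus)); // 0.0
        // "text" appears in only one document, so it scores higher there.
        System.out.println(tfIdf("text", corpus.get(0), corpus));     // ~1.10
    }
}
```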
[Reference study guide: Pages 69-70]

6.6 Image Processing in ML

What is an Image?
At its core, a computer image is a grid of numbers, with each "square" representing a
pixel. Images can be binary (e.g., black-and-white) or contain multiple colors, with
numeric values representing the intensity or color information.

Example: Binary Images


• Black pixels are assigned the value 1, and white pixels are assigned the value 0.
• A binary image is also known as a bitmap, where each bit corresponds to a color.


6.6.1 Color Depth

Color Depth refers to the number of bits used to represent each pixel.
Higher bit depths can encode more colors:
• 1-bit: 2 colors (black and white).
• 8-bit: 256 colors.
• 24-bit: ~16.7 million colors (True Color).
Implications for ML:
High Color Depth: Increases the richness of image representation but also computational cost.
Optimization: Reducing color depth and image size helps improve processing speed, especially for
large datasets.

Example: A 24-bit color image (e.g., a sunflower photo) with a resolution of 800 × 800 pixels
contains 800 × 800 × 24 = 15,360,000 bits of information.


6.6.2 Images in ML
In machine learning, images are treated as matrices of numbers, where each number represents pixel information.
Steps in Image Preparation:
Resizing:
• Images are resized to smaller, consistent dimensions (e.g., 16 × 16 or 28 × 28 pixels) to speed up processing.
• Uniform image dimensions ensure consistency in training and model performance.
Normalization:
• Scale pixel values (e.g., 0–255) to a smaller range (e.g., 0–1) to improve model efficiency (see the sketch after this list).
Training Efficiency:
• Smaller grids reduce computational complexity.
• Batch processing of images ensures faster training, especially when working with large datasets.
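A minimal sketch of that normalization step, scaling 0–255 grayscale values into the 0–1 range:

```java
public class PixelNormalization {
    // Scale 0-255 grayscale pixel values into the 0-1 range.
    static double[][] normalize(int[][] pixels) {
        double[][] out = new double[pixels.length][pixels[0].length];
        for (int r = 0; r < pixels.length; r++) {
            for (int c = 0; c < pixels[0].length; c++) {
                out[r][c] = pixels[r][c] / 255.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] image = { {0, 128}, {255, 64} };
        double[][] scaled = normalize(image);
        System.out.println(scaled[1][0]); // 1.0
        System.out.println(scaled[0][1]); // ~0.5
    }
}
```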
Challenges:
• High-resolution images require significant processing power and time.
• Tens of thousands of images in datasets like ImageNet or CIFAR-10 can take hours or days to train.
[Reference study guide: Pages 71-73]


6.7 Convolutional Neural Networks (CNNs)


Introduction
CNNs are a specialized type of neural network designed to process image
data. Introduced in 1998 by Yann LeCun and colleagues with the LeNet-5
model, CNNs are widely used for image recognition, object detection, and
other computer vision tasks.

Key Components of a CNN:


• Feature Extraction: Detects patterns such as edges, shapes, and
textures.
• Classification: Assigns labels to the detected features.

6.7.1 Feature Extraction


Feature extraction is the core of the CNN process. It identifies patterns or features in
the image using mathematical operations.
Convolution Operation – What is Convolution?
A mathematical operation where a filter (or kernel) scans the image, performing
element-wise multiplications and summing up the results.
How It Works:
• Filters, typically small (e.g., 3 × 3), slide across the image matrix.
• The result is a new matrix, called the feature map, highlighting specific features in
the image.
Filter Example: A vertical edge filter detects vertical lines, while a horizontal edge
filter detects horizontal lines.
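A minimal sketch of the convolution operation with a vertical edge filter (no padding, stride 1; real frameworks add padding, strides, and many filters per layer):

```java
public class Convolution {
    // Slide a k x k filter over the image, computing element-wise
    // products and summing them -- the result is the feature map.
    static int[][] convolve(int[][] img, int[][] filter) {
        int k = filter.length;
        int rows = img.length - k + 1, cols = img[0].length - k + 1;
        int[][] featureMap = new int[rows][cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                for (int i = 0; i < k; i++)
                    for (int j = 0; j < k; j++)
                        featureMap[r][c] += img[r + i][c + j] * filter[i][j];
        return featureMap;
    }

    public static void main(String[] args) {
        // A 4x4 image with a sharp vertical edge between columns 1 and 2.
        int[][] img = {
            {10, 10, 0, 0},
            {10, 10, 0, 0},
            {10, 10, 0, 0},
            {10, 10, 0, 0}};
        // A simple 3x3 vertical edge filter.
        int[][] vertical = { {1, 0, -1}, {1, 0, -1}, {1, 0, -1} };
        // Every 3x3 window straddles the edge, so every response is strong.
        System.out.println(java.util.Arrays.deepToString(convolve(img, vertical)));
        // [[30, 30], [30, 30]]
    }
}
```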

… 6.7.1 Feature Extraction


Activation Functions:
After applying the filter, the result is passed through an activation function to
introduce non-linearity:
• ReLU (Rectified Linear Unit): Converts negative values to zero, enhancing
important features.
Pooling:
Pooling reduces the size of feature maps, decreasing computation while preserving
important information.
• Max Pooling: Selects the maximum value in a pooling window (e.g., 2 × 2).
• Benefits: Reduces dimensionality, improves computational efficiency, and helps
avoid overfitting.
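A minimal sketch combining ReLU with 2 × 2 max pooling (assumes even input dimensions; frameworks handle padding and strides for you):

```java
public class ReluAndPooling {
    // ReLU: negative values become 0, positive values pass through.
    static int relu(int x) { return Math.max(0, x); }

    // 2x2 max pooling: keep the largest (post-ReLU) value in each window.
    static int[][] maxPool2x2(int[][] in) {
        int[][] out = new int[in.length / 2][in[0].length / 2];
        for (int r = 0; r < out.length; r++)
            for (int c = 0; c < out[0].length; c++) {
                int max = Integer.MIN_VALUE;
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++)
                        max = Math.max(max, relu(in[2 * r + i][2 * c + j]));
                out[r][c] = max;
            }
        return out;
    }

    public static void main(String[] args) {
        int[][] featureMap = {
            { 5, -3,  0,  2},
            {-1,  8,  4, -6},
            { 7,  1, -2,  9},
            { 0, -4,  3,  6}};
        // A 4x4 map shrinks to 2x2, keeping the strongest activations.
        System.out.println(java.util.Arrays.deepToString(maxPool2x2(featureMap)));
        // [[8, 4], [7, 9]]
    }
}
```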

6.7.2 Classification
After feature extraction, the CNN uses the extracted features to classify the input
image.
Fully Connected Layer (FC):
• The final layers of a CNN are fully connected layers (e.g., a multilayer perceptron).
• They aggregate information from the feature maps and output the probabilities for
each class.
Steps in the CNN Process:
1. Convolution ➡ ReLU ➡ Pooling: Extracts and condenses features.
2. Repeat the above steps multiple times, forming deeper layers.
3. Fully Connected Layer: Performs classification based on the processed features.


… 6.7.2 Classification
Applications of CNNs:
• Object Detection: Identifying objects within an image (e.g., cars, animals).
• Speech Recognition: Converting sound spectrograms into text or commands.
• Image Segmentation: Dividing an image into regions for analysis.
Popular CNN Architectures:
• LeNet-5: Early architecture for digit recognition.
• AlexNet: Revolutionized computer vision by introducing deeper networks.
• ResNet: Uses skip connections to enable training of very deep networks.
[Reference study guide: Pages 73-76]


6.8 CNN & Transfer Learning


Convolutional Neural Networks (CNNs) are powerful tools for image classification
and processing. However, training CNNs from scratch requires extensive
computational resources and time. Transfer Learning is a practical technique
that leverages pre-trained models to save time, resources, and effort,
particularly when working with smaller datasets.

What is Transfer Learning?


Transfer learning involves using an existing pre-trained model's knowledge (its
learned weights and architecture) to solve a new but similar problem. This
process bypasses the need to train models from scratch, especially for tasks
involving complex data like images or videos.


… 6.8 CNN & Transfer Learning

Pre-trained Models: These are models that have been trained on large datasets (e.g., ImageNet)
and have already learned generic features like edges, textures, or colors in their initial layers.

Examples of pre-trained models include VGGNet, ResNet, and Inception.

Applications of Transfer Learning:

Transfer learning is highly beneficial when:


• You have a small dataset.
• Your dataset is similar in nature to the data the pre-trained model was trained on.
• Common in fields like medical imaging, facial recognition, and object detection.

… 6.8 CNN & Transfer Learning

How Transfer Learning Works: There are two main strategies for implementing transfer learning with CNNs:

Feature Extraction:
• Use the convolutional base (feature extraction layers) of a pre-trained model as it is.
• Replace the classification layer with your custom classifier to suit your specific problem.
• Only train the custom classifier layer while keeping the convolutional base frozen (weights unchanged).

Fine-tuning:
• Unfreeze the pre-trained model's deeper layers and train them along with the custom classifier.
• This allows the model to adjust pre-trained weights slightly for your specific dataset, improving
performance.
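A hedged sketch of the feature-extraction strategy using DeepLearning4J's transfer-learning API (the DeepLearning4J Model Zoo is mentioned under Example Tools below). The layer names "fc2" and "predictions" assume VGG16's layer naming, and the 2-class output is illustrative; check the Model Zoo documentation for your version:

```java
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.deeplearning4j.zoo.PretrainedType;
import org.deeplearning4j.zoo.model.VGG16;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class TransferLearningSketch {
    public static void main(String[] args) throws Exception {
        // Load VGG16 with ImageNet weights from the DL4J Model Zoo.
        ComputationGraph vgg16 = (ComputationGraph)
                VGG16.builder().build().initPretrained(PretrainedType.IMAGENET);

        FineTuneConfiguration ftc = new FineTuneConfiguration.Builder().build();

        // Freeze everything up to "fc2" and swap in a new 2-class classifier.
        ComputationGraph model = new TransferLearning.GraphBuilder(vgg16)
                .fineTuneConfiguration(ftc)
                .setFeatureExtractor("fc2")               // layers up to here stay frozen
                .removeVertexKeepConnections("predictions")
                .addLayer("predictions",
                        new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                                .nIn(4096).nOut(2)
                                .activation(Activation.SOFTMAX).build(),
                        "fc2")
                .build();
        // model.fit(trainingData) would now train only the new classifier layer.
    }
}
```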

… 6.8 CNN & Transfer Learning

Benefits:
• Reduced Training Time: Skip the time-consuming steps of feature extraction and initial training.
• Lower Computational Cost: Avoid retraining on massive datasets.
• Enhanced Performance: Leverage proven architectures and avoid the risk of overfitting small datasets.

Example Tools for Transfer Learning:


• Frameworks like TensorFlow, PyTorch, and Keras provide pre-trained models.
• Resources such as the DeepLearning4J Model Zoo host pre-trained models for reuse.

Key Insights:
• Transfer learning is particularly powerful with CNNs because the early layers learn universal patterns (like
edges or gradients), while the deeper layers focus on specific details.
• While using transfer learning, consider the similarity of your data to the original dataset to ensure
effective results.
• Always assess whether freezing or fine-tuning layers provides better performance for your task.

6.9 Assessing the Performance of ML Algorithms

Importance of Model Evaluation: Evaluating the performance of a machine learning model ensures
it generalizes well and can make reliable predictions in real-world scenarios. The quality of
evaluation depends on understanding the inherent challenges in data, such as noise and variance.

Challenges in Model Evaluation:


• Dirty Data: Real-world data often contains noise (e.g., random errors, inaccuracies)
that can mislead the model during training.
• Bias-Variance Tradeoff:
• Underfitting: A model too simple to capture underlying patterns, resulting in high bias
and poor predictions.
• Overfitting: A model too tailored to the training data, capturing noise and failing to
generalize to new data.

6.9.1 Classification and Confusion Matrix

When working with classification problems, a confusion matrix helps evaluate performance. It
maps predictions against actual outcomes to derive key metrics.

Confusion Matrix Basics:
• True Positives (TP): Correctly predicted positive instances.
• True Negatives (TN): Correctly predicted negative instances.
• False Positives (FP): Incorrectly predicted positive instances (Type I error).
• False Negatives (FN): Incorrectly predicted negative instances (Type II error).

… 6.9.1 Classification and Confusion Matrix

Performance Metrics Derived from the Confusion Matrix:


• Accuracy: (TP+TN)/(TP+TN+FP+FN) – Measures the proportion of correct predictions.
• Precision: TP/(TP+FP) – Indicates how many predicted positives were actually correct.
• Recall (Sensitivity): TP/(TP+FN) – Reflects the ability to identify all actual positives.
• F1 Score: The harmonic mean of precision and recall. Useful when class distributions
are imbalanced.
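A small worked example computing the four metrics above from hypothetical confusion-matrix counts:

```java
public class ConfusionMetrics {
    public static void main(String[] args) {
        // Hypothetical counts from a classifier's confusion matrix.
        double tp = 40, tn = 45, fp = 5, fn = 10;

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);
        double precision = tp / (tp + fp);
        double recall    = tp / (tp + fn);
        double f1        = 2 * precision * recall / (precision + recall);

        System.out.printf("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f%n",
                accuracy, precision, recall, f1);
        // accuracy=0.85 precision=0.89 recall=0.80 f1=0.84
    }
}
```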

Choosing Metrics:
Metrics depend on the context. For example:
• In spam detection, false negatives are less critical than false positives.
• In credit scoring, false positives (approving risky customers) are more detrimental than
false negatives (denying credit to reliable customers).

6.9.2 Regularization

Regularization techniques prevent overfitting by penalizing model complexity during training.
This encourages the model to focus on the most relevant features.

Types of Regularization:
• L1 Regularization (Lasso): Adds a penalty proportional to the absolute value of
coefficients.
• L2 Regularization (Ridge): Adds a penalty proportional to the square of
coefficients.
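In symbols (a common formulation, where L(w) is the unpenalized training loss, w the model coefficients, and λ controls the penalty strength):

```latex
L_{\text{lasso}}(w) = L(w) + \lambda \sum_{j} |w_j|   % L1: absolute values of coefficients
L_{\text{ridge}}(w) = L(w) + \lambda \sum_{j} w_j^2   % L2: squared coefficients
```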

When to Use Regularization:


• When the model starts fitting noise rather than the actual data.
• To simplify models and make them more interpretable.

… 6.9.2 Regularization

Preparing Data for Processing:


1. Scaling:
• Adjusts feature values to a specific range without altering their distribution.
• Techniques: MinMax Scaler (scales values between 0 and 1), Robust Scaler (reduces outlier
impact).
2. Standardization:
• Centers feature values around 0 with a standard deviation of 1.
• Useful for algorithms sensitive to variance, like Support Vector Machines (SVMs).
3. Normalization:
• Adjusts rows of a dataset to emphasize relative values (e.g., word frequencies in a sentence).
• Types:
• L2 Normalization: Scales row values so their squared sum equals 1.
• L1 Normalization: Scales row values so their absolute sum equals 1.
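A minimal sketch of the first two steps, min-max scaling and standardization (hand-rolled for illustration; ML libraries provide equivalent preprocessors):

```java
import java.util.Arrays;

public class FeatureScaling {
    // Min-max scaling: map feature values into the 0-1 range.
    static double[] minMax(double[] x) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        return Arrays.stream(x).map(v -> (v - min) / (max - min)).toArray();
    }

    // Standardization: center on 0 with a standard deviation of 1.
    static double[] standardize(double[] x) {
        double mean = Arrays.stream(x).average().getAsDouble();
        double sd = Math.sqrt(Arrays.stream(x)
                .map(v -> (v - mean) * (v - mean)).average().getAsDouble());
        return Arrays.stream(x).map(v -> (v - mean) / sd).toArray();
    }

    public static void main(String[] args) {
        double[] feature = {2, 4, 6, 8};
        System.out.println(Arrays.toString(minMax(feature)));      // [0.0, 0.33..., 0.66..., 1.0]
        System.out.println(Arrays.toString(standardize(feature))); // mean 0, sd 1
    }
}
```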
[Reference study guide: Pages 77-83]
