Class 10 Portfolio Work
This framework, built around four questions (What, Who, Why, and When), is often used in
project management, design thinking, and problem-solving processes to describe the problem
at hand clearly and concisely.
What:
User engagement in the mobile app has been declining over the past few months, resulting in
fewer active users and lower usage frequency.
Who:
The primary users of the app, as well as the product development and marketing teams, are
directly affected.
Why:
Low user engagement can lead to a decrease in retention rates and a decline in app revenue.
Enhancing engagement will improve user retention, increase in-app purchases, and
potentially expand the app’s user base.
When:
Immediate attention is required, with a targeted solution to be rolled out within the next two
months before the next marketing campaign.
• Clarity: It forces a clear definition of the problem, helping stakeholders understand the
scope.
• Focus: Ensures that the project or solution remains centered around the core problem.
• Alignment: Helps align teams and resources towards a common goal.
• Actionable: Provides a basis for identifying the next steps and priorities.
1. Problem Definition
Case Example:
• Problem: The company receives large amounts of customer feedback through multiple
channels (emails, reviews, surveys), but manually analyzing these responses is time-
consuming and inefficient.
• Objective: Automate the sentiment analysis of customer feedback to classify the feedback
as positive, neutral, or negative, and provide insights into customer satisfaction.
Key Questions:
• What: Develop an AI model that can process and classify text feedback.
• Who: The primary users are customer service teams, marketing teams, and product
development teams who need to understand customer sentiment quickly.
• Why: Automating sentiment analysis will save time, improve response times to negative
feedback, and help the company take proactive actions based on customer sentiment.
• When: The project needs to be deployed in 6 months to align with the launch of a new
product.
2. Data Collection
Case Example:
• Data Sources:
o Customer reviews (e.g., product reviews from e-commerce platforms)
o Survey responses (feedback on product features, customer experience)
o Social media comments and mentions
• Data Collection:
o Gather a dataset of labeled customer feedback (with sentiment annotations like
positive, neutral, and negative).
o Ensure data diversity by collecting feedback across different product categories,
customer demographics, and channels.
Key Questions:
• What data is required: Do we need feedback data from specific products or services?
• Is the data labeled: Do we already have labeled data (e.g., positive, negative, neutral labels),
or do we need to manually label the data?
• Is the data balanced: Does the dataset have an equal number of examples for each
sentiment class?
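The balance question above can be checked with a simple label count. The sketch below is a minimal illustration; the `labels` list is invented for the example and is not data from the case study.

```python
from collections import Counter

def class_distribution(labels):
    """Return the share of each sentiment label in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels, invented for illustration only.
labels = ["positive"] * 6 + ["neutral"] * 3 + ["negative"] * 1
distribution = class_distribution(labels)

for label, share in sorted(distribution.items()):
    print(f"{label}: {share:.0%}")
```

A distribution this skewed suggests that accuracy alone would be misleading, since a model that always answers "positive" would already be right most of the time.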
3. Data Preprocessing
Case Example:
• Data Cleaning:
o Remove any irrelevant or noisy data (e.g., HTML tags, special characters, or empty
reviews).
o Handle missing values (e.g., fill in or remove reviews with missing sentiment labels).
• Text Preprocessing:
o Convert text to lowercase.
o Tokenize sentences (split the text into words or tokens).
o Remove stop words (e.g., "is", "the", "in") that don’t contribute much to sentiment.
o Lemmatize words (reduce words to their base form, e.g., "running" → "run").
• Feature Engineering:
o Convert text data into numerical features using techniques like TF-IDF (Term
Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec,
GloVe).
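The data-cleaning steps in the case example above could look roughly like this. It is a sketch under assumptions: feedback is held as (text, label) pairs, unlabeled rows are dropped rather than filled in, and the sample records are invented.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # matches simple HTML tags such as <p> or </p>

def clean_feedback(records):
    """Strip HTML tags and drop records with empty text or a missing label.

    `records` is a list of (text, label) pairs; a label of None means "missing".
    """
    cleaned = []
    for text, label in records:
        if label is None:
            continue  # this sketch removes unlabeled rows instead of imputing them
        text = html.unescape(TAG_PATTERN.sub(" ", text or ""))
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:  # skip reviews that are empty after cleaning
            cleaned.append((text, label))
    return cleaned

# Hypothetical records, invented for illustration only.
records = [
    ("<p>Great product!</p>", "positive"),
    ("   ", "neutral"),
    ("Arrived late &amp; damaged", "negative"),
    ("No label on this one", None),
]
cleaned = clean_feedback(records)
```

Only the first and third records survive: the second is empty and the fourth has no sentiment label.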
Key Questions:
• What preprocessing steps are necessary: Which text cleaning methods are suitable for
customer feedback data?
• How to handle imbalanced data: Should we balance the dataset if one sentiment class is
underrepresented?
4. Model Selection and Training
Case Example:
• Model Selection:
o Choose a machine learning algorithm suitable for text classification. Some common
choices for sentiment analysis include:
▪ Logistic Regression: A simple model for binary or multiclass classification.
▪ Naive Bayes: A good choice for text classification tasks due to its efficiency
with large text datasets.
▪ Deep Learning Models (e.g., LSTM, BERT): If there is a large dataset, a
neural network-based model like BERT (Bidirectional Encoder
Representations from Transformers) can capture more complex
relationships in text.
• Training:
o Split the data into training, validation, and test sets.
o Train the model using the training set and validate it using the validation set. Tune
hyperparameters to improve performance.
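The three-way split described above can be sketched in plain Python. The 70/15/15 proportions, the fixed seed, and the function name `train_val_test_split` are assumptions chosen for illustration; the case study does not specify them.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle a copy of `items` and cut it into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

# Hypothetical feedback items, invented for illustration only.
examples = [f"feedback {i}" for i in range(100)]
train_set, val_set, test_set = train_val_test_split(examples)
print(len(train_set), len(val_set), len(test_set))
```

The test set is set aside first and never used for tuning, so it can later measure how the model behaves on unseen data.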
Key Questions:
• What model will provide the best performance: Should we start with simpler models like
Logistic Regression or try advanced deep learning models?
• How to evaluate the model: Which performance metrics (e.g., accuracy, precision, recall,
F1-score) are most important for evaluating sentiment classification?
5. Model Evaluation
Case Example:
• Model Evaluation:
o Accuracy: The percentage of correct predictions (e.g., how many feedback items are
classified correctly).
o Precision, Recall, F1-Score: Especially important in imbalanced datasets, as one class
(e.g., positive feedback) might dominate.
o Confusion Matrix: To visualize the model’s performance in terms of true positives,
false positives, true negatives, and false negatives.
• Testing: Evaluate the model on the unseen test set and check how well it generalizes
to new data.
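Per-class precision and recall for the three sentiment labels can be computed directly from the true and predicted labels. The sketch below uses a handful of invented labels, not results from any real model.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Compute precision and recall separately for every class."""
    pair_counts = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    classes = sorted(set(y_true) | set(y_pred))
    report = {}
    for cls in classes:
        tp = pair_counts[(cls, cls)]
        predicted = sum(n for (_, p), n in pair_counts.items() if p == cls)
        actual = sum(n for (t, _), n in pair_counts.items() if t == cls)
        report[cls] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report

# Hypothetical labels, invented for illustration only.
y_true = ["positive", "positive", "positive", "neutral", "negative", "negative"]
y_pred = ["positive", "positive", "neutral", "neutral", "negative", "positive"]
report = per_class_report(y_true, y_pred)

for cls, scores in report.items():
    print(cls, scores)
```

In this toy data the negative class has perfect precision but only 50% recall, which is the kind of gap an overall accuracy figure would hide.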
Key Questions:
• What evaluation metrics should we focus on: Should we care more about precision (to
avoid misclassifying positive reviews as negative) or recall (to catch more of the negative
feedback)?
• Are there any performance gaps: Is the model performing well across all sentiment classes
(positive, neutral, negative)?
6. Deployment
Case Example:
• Deployment Strategy:
o Deploy the sentiment analysis model into the company's existing feedback
processing pipeline.
o Ensure that the model can process customer feedback in real-time or in batches
(e.g., after surveys are completed).
• Integration:
o Integrate the model with the company's CRM or feedback management system to
automatically categorize incoming feedback.
• Monitoring:
o Monitor the model's performance in production, ensuring that the sentiment
predictions remain accurate over time.
o Implement a feedback loop for continuous improvement, allowing the model to be
retrained with new labeled data.
First, we extract the vocabulary (unique words) from all the documents. The vocabulary in
this case would be:
• apple
• orange
• banana
• fruit
• tree
Now, we represent each document as a vector where the entries are the frequency of each
word from the vocabulary in the document.
Explanation:
Each column represents the frequency of a particular term in the document. This is a simple
term frequency (TF) representation, where we only count the occurrences of each term in
each document.
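A term-frequency vector of this kind can be built in a few lines. The vocabulary below is the one listed above; the three documents are invented for illustration, because the original documents are not reproduced in this text.

```python
from collections import Counter

# Vocabulary as listed in the text, in the same order.
vocabulary = ["apple", "orange", "banana", "fruit", "tree"]

# Hypothetical documents, invented for illustration only.
documents = [
    "apple orange fruit",
    "banana fruit tree",
    "apple apple tree",
]

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(f"{doc!r}: {vector}")
```

Each vector has one entry per vocabulary term, so every document ends up with the same length regardless of how many words it contains.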
Notes:
If you were using TF-IDF (Term Frequency-Inverse Document Frequency), the frequency
values would be adjusted to consider not only how often the term appears in each document
but also how common the term is across all documents, giving less weight to common words.
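The TF-IDF adjustment can be sketched as follows. This uses one common textbook definition (term frequency divided by document length, multiplied by log(N / document frequency)); libraries often apply smoothed variants, so exact numbers may differ. The documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical documents, invented for illustration only.
documents = [
    "apple orange fruit",
    "banana fruit tree",
    "apple apple tree",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
document_frequency = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Weight each term by its in-document frequency and its rarity across documents."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]

for doc, doc_weights in zip(documents, weights):
    print(doc, {term: round(w, 3) for term, w in doc_weights.items()})
```

In the first document, "orange" receives a higher weight than "apple" even though both occur once, because "orange" appears in only one document while "apple" appears in two.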
Both stemming and lemmatization are techniques used in Natural Language Processing
(NLP) to reduce words to their root forms. The difference lies in the approach and the quality
of the root word they produce.
• Stemming: Strips affixes (in practice mostly suffixes) with fixed rules to cut a word down
to a root form. The result is not necessarily a valid word.
• Lemmatization: More sophisticated than stemming; it reduces words to their lemma
(dictionary form) based on the context, ensuring the root word is a valid word.
Example:
Let's consider a few words and how stemming and lemmatization treat them:
Words to Analyze:
o running
o better
o cats
o flying
o fought
1. Stemming:
Stemming uses rules to remove suffixes or prefixes in a straightforward way, often resulting
in non-dictionary words. Popular stemming algorithms include Porter Stemmer and
Lancaster Stemmer.
Word       Stem
running    run
better     better
cats       cat
flying     fli
fought     fought
2. Lemmatization:
Lemmatization, on the other hand, uses a dictionary and often considers the word's part of
speech (POS) to reduce it to its base form.
Word       Lemma
running    run
better     good
cats       cat
flying     fly
fought     fight
Summary of Differences:
• Stemming: Works through rule-based cutting of suffixes and prefixes, potentially resulting in
non-dictionary forms (e.g., "fli" from "flying").
• Lemmatization: Uses vocabulary and part of speech (POS) to find the correct base form of a
word, ensuring the result is always a valid word (e.g., "good" from "better").
• Stemming is faster and less computationally expensive, so it's often used in tasks like search
indexing or information retrieval where exact word meaning isn't crucial.
• Lemmatization is more accurate and useful when you want to ensure that the result is a
valid word, which is important for tasks like text classification, sentiment analysis, and
question answering.
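The contrast between the two approaches can be illustrated with a toy implementation. Both parts are deliberate simplifications and assumptions: `toy_stem` applies three hand-written suffix rules and is not the Porter or Lancaster algorithm, and `TOY_LEMMAS` is a tiny hand-made lookup table standing in for a real dictionary with part-of-speech information.

```python
VOWELS = "aeiou"

def toy_stem(word):
    """Crude rule-based suffix stripping (NOT the Porter algorithm)."""
    # Rule 1: drop a plural "s" ("cats" -> "cat"), but keep "ss" endings.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    # Rule 2: drop "ing" and undo a doubled final consonant ("running" -> "run").
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
            word = word[:-1]
    # Rule 3: turn a final "y" after a consonant into "i" ("fly" -> "fli").
    if word.endswith("y") and len(word) > 2 and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    return word

# Hand-made (word, part of speech) -> lemma table; a real lemmatizer uses a full lexicon.
TOY_LEMMAS = {
    ("running", "verb"): "run",
    ("better", "adjective"): "good",
    ("cats", "noun"): "cat",
    ("flying", "verb"): "fly",
    ("fought", "verb"): "fight",
}

def toy_lemmatize(word, pos):
    """Look the word up by part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

words_with_pos = [
    ("running", "verb"),
    ("better", "adjective"),
    ("cats", "noun"),
    ("flying", "verb"),
    ("fought", "verb"),
]
stems = [toy_stem(word) for word, _ in words_with_pos]
lemmas = [toy_lemmatize(word, pos) for word, pos in words_with_pos]

for (word, _), stem, lemma in zip(words_with_pos, stems, lemmas):
    print(f"{word:10}{stem:10}{lemma}")
```

The stemmer never consults a dictionary, which is why it cannot map "better" to "good" or "fought" to "fight"; the lemmatizer can, but only for words its lexicon knows.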
V) Confusion Matrix
Terminology:
Example Scenario:
Let's say we have a model that predicts whether an email is spam (positive) or not spam
(negative). After testing the model on 100 emails, the confusion matrix might look like this:
                     Predicted: Spam    Predicted: Not Spam
Actual: Spam         30 (TP)            10 (FN)
Actual: Not Spam     5 (FP)             55 (TN)
Explanation:
• True Positives (TP): 30 emails that were actually spam and were correctly classified
as spam.
• False Negatives (FN): 10 emails that were spam but were incorrectly classified as not
spam.
• False Positives (FP): 5 emails that were not spam but were incorrectly classified as
spam.
• True Negatives (TN): 55 emails that were not spam and were correctly classified as
not spam.
1. Accuracy: The percentage of correctly classified instances (both true positives and
true negatives) out of all instances.
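\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{30 + 55}{30 + 55 + 5 + 10} = \frac{85}{100} = 0.85
\]
Accuracy is 85%.
2. Precision: The percentage of instances classified as positive that were actually
positive.
\[
\text{Precision} = \frac{TP}{TP + FP} = \frac{30}{30 + 5} = \frac{30}{35} \approx 0.857
\]
Precision is 85.7%.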
3. Recall (also called Sensitivity or True Positive Rate): The percentage of actual
positive instances that were correctly classified as positive.
\[
\text{Recall} = \frac{TP}{TP + FN} = \frac{30}{30 + 10} = \frac{30}{40} = 0.75
\]
Recall is 75%.
4. F1 Score: The harmonic mean of precision and recall, which balances the two
metrics.
\[
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{0.857 \times 0.75}{0.857 + 0.75} \approx 0.80
\]
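The four metrics can be recomputed from the counts in the spam example (TP = 30, FN = 10, FP = 5, TN = 55). This is a minimal sketch; the variable names are chosen for illustration.

```python
# Counts from the spam example in the text.
tp, fn, fp, tn = 30, 10, 5, 55

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)                      # of actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

Working from the unrounded precision gives an F1 of exactly 0.8, consistent with the rounded calculation above.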
VI) Text Normalization
Text normalization is the process of transforming text into a standard format, which
helps improve the consistency of text data for further processing, such as in natural language
processing (NLP) tasks. It typically involves steps like converting text to lowercase,
removing punctuation, handling special characters, expanding contractions, and more.
1. Convert to Lowercase:
• Goal: Ensure uniformity by converting all text to lowercase, so that the words "NLP" and
"nlp" are treated the same.
• Normalized Text:
"i'm learning nlp! it's amazing, isn't it?"
2. Remove Punctuation:
• Goal: Punctuation marks can be removed to focus on the words. This can make text easier to
analyze, especially when building models.
• Normalized Text:
"im learning nlp its amazing isnt it"
3. Expand Contractions:
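• Goal: Contractions like "I'm" and "isn't" should be expanded to their full forms ("I am" and
"is not") so that each is treated as its component words rather than as a separate token. In
practice this is done before punctuation removal, because once the apostrophe is stripped,
"i'm" becomes "im" and no longer matches a contraction list.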
• Normalized Text:
"i am learning nlp it is amazing is not it"
4. Remove Stop Words:
• Goal: Some words, like "i," "am," "it," "is," and "not," are known as "stop words" and are
often removed to focus on more meaningful words.
• Normalized Text (after removing stop words):
"learning nlp amazing"
5. Handle Special Characters:
• Goal: Special characters such as punctuation marks, extra spaces, and sometimes numbers
are removed or handled.
• Normalized Text:
"learning nlp amazing" (unchanged, because punctuation was already removed in step 2)
After performing all the steps above, the text would look like this:
"learning nlp amazing"
These steps help standardize the text and make it easier to process for tasks like sentiment
analysis, text classification, or building machine learning models.
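The normalization steps can be chained into one small function. This is a sketch under assumptions: the input sentence is reconstructed from the lowercase form shown in step 1, the contraction and stop-word lists cover only this example, and contractions are expanded before punctuation is removed so that the apostrophes are still present to match on.

```python
import re

# Tiny lookup tables that cover only the example sentence (assumptions).
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "isn't": "is not"}
STOP_WORDS = {"i", "am", "it", "is", "not"}

def normalize(text):
    """Lowercase, expand contractions, remove punctuation, and drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        # Word boundaries keep the pattern from matching inside longer words.
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation and other symbols
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)  # joining on single spaces also removes extra whitespace

# Assumed input, reconstructed from the lowercase text in step 1.
normalized = normalize("I'm learning NLP! It's amazing, isn't it?")
print(normalized)
```

For sentiment analysis it is worth reviewing the stop-word list before using it, since dropping "not" removes negation.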