Class 10 Portfolio Work

The document outlines the 4W Canvas framework for defining problem statements in projects, emphasizing clarity, focus, alignment, and actionable steps. It also details the AI project cycle for customer sentiment analysis, including problem definition, data collection, preparation, model training, evaluation, and deployment. Additionally, it covers concepts like document vectorization, stemming vs. lemmatization, and confusion matrices for evaluating classification models.

Uploaded by dishitaroyc

I) 4W Canvas:

The 4W Canvas is a framework used to define and organize problem statements
for a project or initiative. The "4Ws" stand for:

1. What – What is the problem or challenge being addressed?
2. Who – Who is affected by this problem or challenge?
3. Why – Why is it important to address this problem?
4. When – When does this problem need to be solved, or what are the time constraints?

This framework is often used in project management, design thinking, and problem-solving
processes to provide a clear and concise description of the problem at hand.

Example of a 4W Canvas Problem Statement

Example: Improving User Engagement in a Mobile App

What:

User engagement in the mobile app has been declining over the past few months, resulting in
fewer active users and lower usage frequency.

Who:

The primary users of the app, as well as the product development and marketing teams, are
directly affected.

Why:

Low user engagement can lead to a decrease in retention rates and a decline in app revenue.
Enhancing engagement will improve user retention, increase in-app purchases, and
potentially expand the app’s user base.

When:

Immediate attention is required, with a targeted solution to be rolled out within the next two
months before the next marketing campaign.

Problem Statement Summary (using the 4Ws):

What: Declining user engagement in the mobile app.
Who: Affects app users and the product/marketing teams.
Why: Decreases retention and revenue, affecting long-term growth.
When: Solution needed within two months, before the next marketing campaign.

Why Use the 4W Canvas?

• Clarity: It forces a clear definition of the problem, helping stakeholders understand the
scope.
• Focus: Ensures that the project or solution remains centered around the core problem.
• Alignment: Helps align teams and resources towards a common goal.
• Actionable: Provides a basis for identifying the next steps and priorities.

II) AI Project Cycle:

Customer Sentiment Analysis

1. Problem Definition

Case Example:

• Problem: The company receives large amounts of customer feedback through multiple
channels (emails, reviews, surveys), but manually analyzing these responses is
time-consuming and inefficient.
• Objective: Automate the sentiment analysis of customer feedback to classify the feedback
as positive, neutral, or negative, and provide insights into customer satisfaction.

Key Questions:

• What: Develop an AI model that can process and classify text feedback.
• Who: The primary users are customer service teams, marketing teams, and product
development teams who need to understand customer sentiment quickly.
• Why: Automating sentiment analysis will save time, improve response times to negative
feedback, and help the company take proactive actions based on customer sentiment.
• When: The project needs to be deployed in 6 months to align with the launch of a new
product.

2. Data Collection and Data Understanding

Case Example:

• Data Sources:
o Customer reviews (e.g., product reviews from e-commerce platforms)
o Survey responses (feedback on product features, customer experience)
o Social media comments and mentions
• Data Collection:
o Gather a dataset of labeled customer feedback (with sentiment annotations like
positive, neutral, and negative).
o Ensure data diversity by collecting feedback across different product categories,
customer demographics, and channels.

Key Questions:

• What data is required: Do we need feedback data from specific products or services?
• Is the data labeled: Do we already have labeled data (e.g., positive, negative, neutral labels),
or do we need to manually label the data?
• Is the data balanced: Does the dataset have an equal number of examples for each
sentiment class?
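
The balance question can be checked by counting labels. The sketch below is a minimal illustration; the `label_distribution` helper and the sample `labels` list are hypothetical and not taken from the project data.

```python
from collections import Counter

def label_distribution(labels):
    """Return each sentiment label's share of the dataset (0.0 to 1.0)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for ten feedback items (illustration only).
labels = ["positive"] * 6 + ["neutral"] * 1 + ["negative"] * 3
shares = label_distribution(labels)

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

A distribution this uneven would suggest that accuracy alone could be misleading, which is why the later steps also look at precision and recall.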

3. Data Preparation and Preprocessing

Case Example:

• Data Cleaning:
o Remove any irrelevant or noisy data (e.g., HTML tags, special characters, or empty
reviews).
o Handle missing values (e.g., fill in or remove reviews with missing sentiment labels).
• Text Preprocessing:
o Convert text to lowercase.
o Tokenize sentences (split the text into words or tokens).
o Remove stop words (e.g., "is", "the", "in") that don’t contribute much to sentiment.
o Lemmatize words (reduce words to their base form, e.g., "running" → "run").
• Feature Engineering:
o Convert text data into numerical features using techniques like TF-IDF (Term
Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec,
GloVe).
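
The data cleaning bullets above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: feedback arrives as `(text, label)` pairs, a missing label is `None`, and rows with missing labels are dropped rather than filled in. The `clean_feedback` name and the sample rows are hypothetical.

```python
import re

def clean_feedback(rows):
    """Drop unusable rows and strip HTML tags and special characters."""
    cleaned = []
    for text, label in rows:
        if label is None:
            continue  # assumption: remove rows with a missing sentiment label
        text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
        text = re.sub(r"[^A-Za-z0-9\s']", " ", text)     # remove special characters
        text = re.sub(r"\s+", " ", text).strip()         # collapse extra whitespace
        if text:                                         # skip empty reviews
            cleaned.append((text, label))
    return cleaned

# Hypothetical raw rows for illustration.
raw_rows = [
    ("<p>Great product!</p>", "positive"),
    ("   ", "neutral"),
    ("Arrived late", None),
]
cleaned_rows = clean_feedback(raw_rows)
print(cleaned_rows)
```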

Key Questions:

• What preprocessing steps are necessary: Which text cleaning methods are suitable for
customer feedback data?
• How to handle imbalanced data: Should we balance the dataset if one sentiment class is
underrepresented?

4. Model Selection and Training

Case Example:

• Model Selection:
o Choose a machine learning algorithm suitable for text classification. Some common
choices for sentiment analysis include:
▪ Logistic Regression: A simple model for binary or multiclass classification.
▪ Naive Bayes: A good choice for text classification tasks due to its efficiency
with large text datasets.
▪ Deep Learning Models (e.g., LSTM, BERT): If there is a large dataset, a
neural network-based model like BERT (Bidirectional Encoder
Representations from Transformers) can capture more complex
relationships in text.
• Training:
o Split the data into training, validation, and test sets.
o Train the model using the training set and validate it using the validation set. Tune
hyperparameters to improve performance.
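
The split described above can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the `train_val_test_split` name are assumptions chosen for illustration; the source does not specify split sizes.

```python
import random

def train_val_test_split(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the items and cut them into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is the test set
    return train, val, test

# Hypothetical dataset: 100 feedback items represented by their indices.
data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Shuffling before cutting keeps each set from inheriting the order in which the feedback was collected. The test set stays untouched until the final evaluation.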

Key Questions:

• What model will provide the best performance: Should we start with simpler models like
Logistic Regression or try advanced deep learning models?
• How to evaluate the model: Which performance metrics (e.g., accuracy, precision, recall,
F1-score) are most important for evaluating sentiment classification?

5. Model Evaluation

Case Example:

• Model Evaluation:
o Accuracy: The percentage of correct predictions (e.g., how many feedback items are
classified correctly).
o Precision, Recall, F1-Score: Especially important in imbalanced datasets, as one class
(e.g., positive feedback) might dominate.
o Confusion Matrix: To visualize the model’s performance in terms of true positives,
false positives, true negatives, and false negatives.
• Testing: Evaluate the model on the unseen test set and check how well it generalizes
to new data.

Key Questions:

• What evaluation metrics should we focus on: Should we care more about precision (to
avoid misclassifying positive reviews as negative) or recall (to catch more of the negative
feedback)?
• Are there any performance gaps: Is the model performing well across all sentiment classes
(positive, neutral, negative)?

6. Deployment

Case Example:

• Deployment Strategy:
o Deploy the sentiment analysis model into the company's existing feedback
processing pipeline.
o Ensure that the model can process customer feedback in real-time or in batches
(e.g., after surveys are completed).
• Integration:
o Integrate the model with the company's CRM or feedback management system to
automatically categorize incoming feedback.
• Monitoring:
o Monitor the model's performance in production, ensuring that the sentiment
predictions remain accurate over time.
o Implement a feedback loop for continuous improvement, allowing the model to be
retrained with new labeled data.

III) Document Vector Table

Example of a document vector table based on the Bag-of-Words (BoW) model,
where each document is represented as a vector of term frequencies.

Example Corpus (3 Documents):

• Doc 1: "apple orange apple"
• Doc 2: "banana orange apple fruit"
• Doc 3: "banana fruit tree"

Step 1: Build Vocabulary

First, we extract the vocabulary (unique words) from all the documents. The vocabulary in
this case would be:

• apple
• orange
• banana
• fruit
• tree

Step 2: Represent Documents as Vectors

Now, we represent each document as a vector where the entries are the frequency of each
word from the vocabulary in the document.

Document ID   apple   orange   banana   fruit   tree
Doc 1         2       1        0        0       0
Doc 2         1       1        1        1       0
Doc 3         0       0        1        1       1

Explanation:

• Doc 1 ("apple orange apple"):
o "apple" appears 2 times, "orange" appears 1 time, and the other terms don't
appear.
• Doc 2 ("banana orange apple fruit"):
o "banana", "orange", "apple", and "fruit" each appear 1 time; "tree" doesn't
appear.
• Doc 3 ("banana fruit tree"):
o "banana", "fruit", and "tree" each appear 1 time; "apple" and "orange" don't
appear.

Each row is one document's vector, and each column holds the frequency of one vocabulary
term. This is a simple term frequency (TF) representation, where we only count the
occurrences of each term in each document.
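
The two steps can be sketched in plain Python. This is a minimal illustration that assumes whitespace tokenization and a vocabulary ordered by first appearance; the helper names `build_vocabulary` and `vectorize` are hypothetical.

```python
from collections import Counter

docs = [
    "apple orange apple",
    "banana orange apple fruit",
    "banana fruit tree",
]

def build_vocabulary(documents):
    """Step 1: collect unique words in order of first appearance."""
    vocab = []
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def vectorize(doc, vocab):
    """Step 2: count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]  # Counter returns 0 for absent words

vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)
for name, vector in zip(["Doc 1", "Doc 2", "Doc 3"], vectors):
    print(name, vector)
```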

Notes:

1. Bag-of-Words Limitation: This approach doesn't capture word order or semantic
meaning. "apple orange apple" and "orange apple apple" will be represented by the
same vector.
2. Document Vector Size: The vector size corresponds to the size of the vocabulary. In
this case, there are 5 unique terms, so the vector is of size 5.
3. Sparsity: This vector representation can become sparse, especially when working
with large vocabularies.

If you were using TF-IDF (Term Frequency-Inverse Document Frequency), the frequency
values would be adjusted to consider not only how often the term appears in each document
but also how common the term is across all documents, giving less weight to common words.
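
As a rough sketch of that adjustment, the code below reweights the term counts from the table above. It assumes one common textbook variant, weight = TF × log(N / DF), where N is the number of documents and DF is the number of documents containing the term. Libraries often use smoothed or normalized variants, so their numbers will differ.

```python
import math

vocab = ["apple", "orange", "banana", "fruit", "tree"]
counts = [
    [2, 1, 0, 0, 0],  # Doc 1
    [1, 1, 1, 1, 0],  # Doc 2
    [0, 0, 1, 1, 1],  # Doc 3
]

def tf_idf(term_counts):
    """Weight raw counts by log(N / DF); assumes every term occurs in some document."""
    n_docs = len(term_counts)
    n_terms = len(term_counts[0])
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in term_counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in term_counts]

weights = tf_idf(counts)
for name, row in zip(["Doc 1", "Doc 2", "Doc 3"], weights):
    print(name, [round(w, 3) for w in row])
```

In Doc 3, "tree" receives a larger weight than "fruit" even though both occur once, because "tree" appears in only one document.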

IV) Stemming and Lemmatization Example

Both stemming and lemmatization are techniques used in Natural Language Processing
(NLP) to reduce words to their root forms. The difference lies in the approach and the quality
of the root word they produce.

• Stemming: Strips affixes (mostly suffixes) using fixed rules to reduce a word to its
root form. It doesn't necessarily produce valid words.
• Lemmatization: More sophisticated than stemming; it reduces words to their lemma
(dictionary form) based on the context, ensuring the root word is a valid word.

Example:

Let's consider a few words and how stemming and lemmatization treat them:

Words to Analyze:
o running
o better
o cats
o flying
o fought

1. Stemming:

Stemming uses rules to strip suffixes (and, in some stemmers, prefixes) in a straightforward
way, often resulting in non-dictionary words. Popular stemming algorithms include the Porter
Stemmer and the Lancaster Stemmer.

Word      Stemmed (Using Porter Stemmer)
running   run
better    better
cats      cat
flying    fli
fought    fought

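• running → "run" (the "-ing" suffix is removed and the doubled "n" is reduced)
• better → "better" (no change: no suffix rule applies, and a stemmer cannot link
"better" to "good")
• cats → "cat" (the plural "-s" is removed)
• flying → "fli" (the stem is not a dictionary word, which is expected behaviour for a
stemmer)
• fought → "fought" (no change: irregular past-tense forms are outside the stemmer's
rules)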

2. Lemmatization:

Lemmatization, on the other hand, uses a dictionary and often considers the word's part of
speech (POS) to reduce it to its base form.

Word      Lemmatized Form (Using WordNet Lemmatizer)
running   run
better    good
cats      cat
flying    fly
fought    fight

• running → "run" (correct, as "running" is the present participle form of "run")
• better → "good" (as "better" is a comparative form of "good")
• cats → "cat" (correct, removing the plural "-s")
• flying → "fly" (correct, as "flying" is the present participle of "fly")
• fought → "fight" (correct, as "fought" is the past tense of "fight")

These results assume each word is looked up with the appropriate part of speech (verb for
"running", "flying", and "fought"; adjective for "better"; noun for "cats").

Summary of Differences:
• Stemming: Works through rule-based stripping of affixes (mostly suffixes), potentially
resulting in non-dictionary forms (e.g., "fli" from "flying").
• Lemmatization: Uses vocabulary and part of speech (POS) to find the correct base form of a
word, ensuring the result is always a valid word (e.g., "good" from "better").
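
To make the contrast concrete, here is a deliberately tiny sketch of both ideas. It is a toy, not the Porter Stemmer or the WordNet Lemmatizer: `toy_stem` knows only two suffix rules, and `toy_lemmatize` looks words up in a small hand-written dictionary (`TOY_LEMMAS`). All three names are hypothetical. Because the rules are simplified, the toy stemmer returns "fly" for "flying" rather than the "fli" shown in the Porter table.

```python
def toy_stem(word):
    """Strip one common suffix by rule; no dictionary is consulted."""
    for suffix in ("ing", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word  # no rule applies, so the word is returned unchanged

# Hand-written (word, part of speech) -> lemma entries, for illustration only.
TOY_LEMMAS = {
    ("running", "verb"): "run",
    ("better", "adjective"): "good",
    ("cats", "noun"): "cat",
    ("flying", "verb"): "fly",
    ("fought", "verb"): "fight",
}

def toy_lemmatize(word, pos):
    """Look the word up by part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in TOY_LEMMAS:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

The rule-based function leaves "better" and "fought" untouched because no suffix rule matches them. The dictionary lookup maps them to "good" and "fight", but only because it has vocabulary and part-of-speech information to draw on.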

When to Use Each:

• Stemming is faster and less computationally expensive, so it's often used in tasks like search
indexing or information retrieval where exact word meaning isn't crucial.
• Lemmatization is more accurate and useful when you want to ensure that the result is a
valid word, which is important for tasks like text classification, sentiment analysis, and
question answering.

V) Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model. It
compares the predicted classifications with the actual classifications (true labels). The matrix
provides a clear view of how well the model is performing and helps identify areas where it
may be making errors.

                      Predicted Positive (P)   Predicted Negative (N)
Actual Positive (P)   True Positive (TP)       False Negative (FN)
Actual Negative (N)   False Positive (FP)      True Negative (TN)

Terminology:

• True Positive (TP): Positive instances correctly predicted as positive.
• False Positive (FP): Negative instances incorrectly predicted as positive (Type I error).
• True Negative (TN): Negative instances correctly predicted as negative.
• False Negative (FN): Positive instances incorrectly predicted as negative (Type II error).
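
The four cells can be counted directly from paired lists of actual and predicted labels. The sketch below is a minimal illustration; the `confusion_counts` helper and the six sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FP, TN, and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels for six test items.
actual = ["positive", "positive", "negative", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "positive", "positive", "negative"]

cells = confusion_counts(actual, predicted)
print(cells)
```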

Example Scenario:

Let's say we have a model that predicts whether an email is spam (positive) or not spam
(negative). After testing the model on 100 emails, the confusion matrix might look like this:

                      Predicted Spam (P)   Predicted Not Spam (N)
Actual Spam (P)       30                   10
Actual Not Spam (N)   5                    55

Explanation:

• True Positives (TP): 30 emails that were actually spam and were correctly classified
as spam.
• False Negatives (FN): 10 emails that were spam but were incorrectly classified as not
spam.
• False Positives (FP): 5 emails that were not spam but were incorrectly classified as
spam.
• True Negatives (TN): 55 emails that were not spam and were correctly classified as
not spam.

Performance Metrics Derived from the Confusion Matrix:

1. Accuracy: The percentage of correctly classified instances (both true positives and
true negatives) out of all instances.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{30 + 55}{30 + 55 + 5 + 10} = \frac{85}{100} = 0.85 \]

So, the accuracy is 85%.

2. Precision (also called Positive Predictive Value): The percentage of predicted
positive instances that are actually positive.

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{30}{30 + 5} = \frac{30}{35} \approx 0.857 \]

Precision is 85.7%.

3. Recall (also called Sensitivity or True Positive Rate): The percentage of actual
positive instances that were correctly classified as positive.

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{30}{30 + 10} = \frac{30}{40} = 0.75 \]

Recall is 75%.

4. F1 Score: The harmonic mean of precision and recall, which balances the two
metrics.

\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{0.857 \times 0.75}{0.857 + 0.75} \approx 0.80 \]

The F1 Score is 0.80.
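
The four calculations can be reproduced in a few lines of Python using the counts from the spam example (TP = 30, FP = 5, TN = 55, FN = 10). This is a minimal sketch; the `classification_metrics` name is hypothetical, and the function assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=30, fp=5, tn=55, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```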

VI) Text Normalization

Text normalization is the process of transforming text into a standard format, which
helps improve the consistency of text data for further processing, such as in natural language
processing (NLP) tasks. It typically involves steps like converting text to lowercase,
removing punctuation, handling special characters, expanding contractions, and more.

Here’s an example of text normalization in the context of NLP:

Example Text:

"I'm learning NLP! It's amazing, isn't it?"

Steps in Text Normalization:

1. Convert to Lowercase:

• Goal: Ensure uniformity by converting all text to lowercase, so that the words "NLP" and
"nlp" are treated the same.
• Normalized Text:
"i'm learning nlp! it's amazing, isn't it?"

2. Remove Punctuation:

• Goal: Punctuation marks can be removed to focus on the words. This can make text easier to
analyze, especially when building models.
• Normalized Text:
"im learning nlp its amazing isnt it"

3. Expand Contractions:

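• Goal: Contractions like "I'm" and "isn't" should be expanded to their full forms ("I am" and
"is not") so that the contracted and full forms are treated as the same words. In practice
this step is usually applied before punctuation removal, because the apostrophe is what
marks a contraction.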
• Normalized Text:
"i am learning nlp it is amazing is not it"

4. Remove Stop Words (optional):

• Goal: Some words, like "i," "am," "it," "is," and "not," are known as "stop words" and are
often removed to focus on more meaningful words.
• Normalized Text (after removing stop words):
"learning nlp amazing"

5. Handle Special Characters (optional):

• Goal: Special characters such as punctuation marks, extra spaces, and sometimes numbers
are removed or handled.
• Normalized Text:
"learning nlp amazing" (This has already been done in previous steps)

Final Normalized Text:

After performing all the steps above, the text would look like this:

"learning nlp amazing"


Summary of Text Normalization Steps:

1. Convert to Lowercase: "I'M" → "i'm"
2. Remove Punctuation: "I'm!" → "im"
3. Expand Contractions: "I'm" → "I am"
4. Remove Stop Words: "It is" → (removed)
5. Remove Special Characters: Handle any unwanted symbols.

These steps help standardize the text and make it easier to process for tasks like sentiment
analysis, text classification, or building machine learning models.
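
A minimal Python sketch of this pipeline is shown below. The contraction map and stop-word set are small hand-written assumptions that cover only the example sentence, and `normalize` is a hypothetical name. The sketch expands contractions before stripping punctuation, since the apostrophe is needed to recognize them.

```python
import string

# Hand-written lookups covering only the example sentence (assumptions).
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "isn't": "is not"}
STOP_WORDS = {"i", "am", "it", "is", "not"}

def normalize(text):
    """Lowercase, expand contractions, remove punctuation, and drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)  # simple substring replacement
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() also collapses any extra spaces between words.
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens)

result = normalize("I'm learning NLP! It's amazing, isn't it?")
print(result)
```

For sentiment analysis, dropping "not" can reverse the meaning of a phrase, so the stop-word list is usually tuned to the task.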
