Assignment
2. FIND OUT WHAT DIFFERENT MODELS ARE CURRENTLY AVAILABLE IN THE NLP MARKET. DO
SOME RESEARCH AND READ SOME BLOG POSTS.
In 2024, the landscape of NLP models is diverse, featuring several advanced and
specialized models from different organizations. Here’s a summary of some prominent
ones:
1. **GPT-4 and GPT-4 Turbo (OpenAI)**: OpenAI's GPT-4 remains a significant
player in the NLP space, with capabilities in text generation, summarization,
translation, and more.
GPT-4 Turbo offers a cost-effective and efficient alternative with similar
capabilities.
2. **Gemini and Gemma (Google)**: Google's Gemini models excel in multimodal tasks,
supporting text, image, audio, and video inputs.
The Gemma family offers lightweight, high-performance models suitable for various
applications, from general NLP tasks to complex reasoning and coding.
3. **Llama 3 (Meta)**: Meta's Llama 3 models are open-source and highly versatile,
excelling in language understanding, programming, mathematical reasoning, and
logic.
They are available in multiple sizes, making them suitable for different scales of
applications.
4. **OPT-175B (Meta)**: The Open Pre-trained Transformers by Meta are another set
of open-source models, with OPT-175B being the most advanced among them.
They are suitable for research purposes due to their non-commercial license.
5. **XGen-7B (Salesforce)**: This model focuses on efficiency and supports longer
context windows, making it suitable for applications requiring extensive contextual
understanding.
6. **GPT-NeoX and GPT-J (EleutherAI)**: These open-source models offer alternatives
to proprietary LLMs like GPT-3, providing robust performance with fewer
parameters.
7. **Mistral 7B and Mixtral 8x7B (Mistral AI)**: Mistral AI's models are optimized for
efficiency and performance, capable of handling longer sequences and offering high
accuracy in various NLP tasks. They are open-source and can be freely used and
fine-tuned.
These models represent a wide range of capabilities and applications, from general-
purpose language understanding to specialized tasks like coding and multimodal
processing. They reflect the ongoing advancements and diversification in the field
of NLP.
Stemming and lemmatization are both text normalization techniques in NLP used to
reduce words to their base or root form, but they do so in different ways and with
different goals.
### Stemming
- **Definition**: Stemming is a process that cuts off the end of a word to reduce
it to its base or root form, which may not always be a recognizable word.
- **Method**: It applies simple heuristic rules, often by removing suffixes. Common
algorithms include Porter Stemmer, Snowball Stemmer, and Lancaster Stemmer.
- **Example**: The words "running", "runner", and "ran" might all be reduced to
"run".
- **Pros**: Fast and straightforward, good for applications where precision is not
crucial.
- **Cons**: Can lead to over-stemming
(e.g., "universe" and "university" both being reduced to "univers")
and under-stemming (e.g., "running" being reduced to "run", but the related "ran"
being left unchanged).
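To make this concrete, here is a minimal stemming sketch using NLTK's Porter stemmer (assuming the `nltk` package is installed); the outputs illustrate both the over- and under-stemming cases above:

```python
# Minimal stemming sketch with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runner", "ran", "universe", "university"]:
    print(word, "->", stemmer.stem(word))
# running    -> run      (suffix stripped)
# runner     -> runner   (left unchanged)
# ran        -> ran      (under-stemming: irregular form missed)
# universe   -> univers  (over-stemming: collides with "university")
# university -> univers
```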
### Lemmatization
- **Definition**: Lemmatization reduces words to their base or root form (lemma) by
considering the context and morphological analysis of the words.
The lemma is an actual word found in the language dictionary.
- **Method**: It involves more complex processes such as part-of-speech tagging and
dictionary lookup to ensure the base form is valid.
- **Example**: The verb forms "running" and "ran" are both reduced to the lemma
"run", and "better" is reduced to "good" (the noun "runner" is already a lemma and
stays "runner").
- **Pros**: More accurate and produces more meaningful results as it ensures that
the base form is a valid word.
- **Cons**: Computationally intensive and slower compared to stemming due to its
complexity.
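A minimal lemmatization sketch using NLTK's WordNet lemmatizer (assuming the `wordnet` corpus has been downloaded via `nltk.download("wordnet")`); note that supplying the part of speech is what lets it resolve irregular forms:

```python
# Minimal lemmatization sketch with NLTK's WordNet lemmatizer.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The POS tag matters: "v" = verb, "a" = adjective, "n" = noun.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("ran", pos="v"))      # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("runner", pos="n"))   # runner (already a lemma)
```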
1. **Definition**: Bag of Words (BoW) is a text representation that treats a
document as an unordered collection of words, keeping each word's count but
discarding grammar and word order.
2. **Process**:
- **Tokenization**: Split the text into words (tokens).
- **Vocabulary Creation**: Create a list of unique words (the vocabulary) from
the text.
- **Vector Representation**: Convert the text into a vector that counts the
number of times each word from the vocabulary appears in the text.
3. **Example**:
- Suppose you have two sentences:
- "The cat sat on the mat."
- "The dog sat on the log."
- After lowercasing, the vocabulary from these sentences is: ["the", "cat",
"sat", "on", "mat", "dog", "log"].
- The BoW representation for each sentence (see the sketch after this list)
would be:
- "The cat sat on the mat.": [2, 1, 1, 1, 1, 0, 0]
- "The dog sat on the log.": [2, 0, 1, 1, 0, 1, 1]
4. **Applications**:
- Text classification.
- Document clustering.
- Sentiment analysis.
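Here is a minimal sketch of the example above using scikit-learn's `CountVectorizer` (assuming scikit-learn is installed); note that it sorts the vocabulary alphabetically, so the column order differs from the hand-worked vectors:

```python
# Minimal Bag-of-Words sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat sat on the mat.", "The dog sat on the log."]
vectorizer = CountVectorizer()            # lowercases and tokenizes by default
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```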
5. PICK ANY PARAGRAPH FROM THE INTERNET AND APPLY ALL TECHNIQUES TO CONVERT THAT
PARAGRAPH INTO ITS REPRESENTATIVE NUMERICAL FORM.
**Original Paragraph:**
"The Internet is a vast network of computers connected worldwide.
It allows us to access and share information in the blink of an eye.
We can use the internet for different things like reading, learning, shopping, and
even playing games.
It also helps us to stay in touch with people who live far away.
We can send emails, talk, and even see them using video calls.
It’s like having a huge library, a post office, a game arcade, a shopping mall, and
a phone booth all in one place.
Websites are like different rooms in this huge virtual world.
Each website has its own purpose.
Some websites are like books where we can read about different subjects.
Some are like shops where we can buy things.
Some are like game rooms where we can play. Some are like classrooms where we can
learn new skills.
Some are like cinemas where we can watch movies. Some are like cafes where we can
talk to our friends.
The internet also helps us to find our way when we are lost.
We can use maps on the internet to find directions. But we should also be careful
while using the internet.
Not everything on the internet is true or good.
We should always check information from trusted sources. We should also respect
others and not use the internet to harm or cheat anyone.
The internet is a tool, and like any tool, we should use it wisely."
**Numerical Conversion Techniques:**
1. **Binary Representation:**
- Each character (including spaces and punctuation) is converted into its ASCII
value, and then into its binary representation.
- For example, "I" is 73 in ASCII, which is 01001001 in binary.
2. **Character Count:**
- The paragraph contains 1,100 characters.
3. **Word Count:**
- The paragraph contains 189 words.
4. **Sentence Count:**
- There are 17 sentences in the paragraph.
5. **Frequency Analysis:**
- Common words and their frequencies: "the" appears 13 times, "internet" appears
11 times, and "we" appears 10 times.
6. **Sentiment Score:**
- Using a sentiment analysis tool, the paragraph might score a positive
sentiment value of +0.65, indicating a generally positive tone.
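The counts and frequencies above can be reproduced with a short script; this is a minimal sketch (the paragraph string is elided here, and the exact numbers depend on the tokenization and sentence-splitting rules used):

```python
# Minimal sketch: character, word, and sentence counts, word frequencies,
# and a sample binary character encoding.
import re
from collections import Counter

paragraph = "The Internet is a vast network of computers connected worldwide. ..."  # full text elided

char_count = len(paragraph)
words = re.findall(r"[a-z']+", paragraph.lower())
word_count = len(words)
sentence_count = len(re.findall(r"[.!?]", paragraph))
top_words = Counter(words).most_common(3)

print(char_count, word_count, sentence_count, top_words)
print(format(ord("I"), "08b"))  # 01001001
```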
**Sources:**
- [Aspiring Youths](https://aspiringyouths.com/paragraph-on-internet)
- [Leverage Edu](https://leverageedu.com/blog/importance-of-internet/)
In first-order logic (FOL), quantifiers are symbols used to indicate the scope of
the variables within a logical statement.
They specify the extent to which a predicate applies to a set of elements within
the domain of discourse.
There are two primary types of quantifiers in FOL:
1. **Universal Quantifier (∀)**
2. **Existential Quantifier (∃)**
The universal quantifier, denoted by the symbol ∀, is used to indicate that a
statement is true for every element in the domain. It is often read as "for all."
- **Syntax**: ∀x P(x)
- **Meaning**: For every element x in the domain, the predicate P(x) is true.
- **Example**: ∀x (Human(x) → Mortal(x))
- **Interpretation**: For all x, if x is a human, then x is mortal. This
statement asserts that every human is mortal.
The existential quantifier, denoted by the symbol ∃, is used to indicate that there
is at least one element in the domain for which the statement is true.
It is often read as "there exists" or "there is at least one."
- **Syntax**: ∃x P(x)
- **Meaning**: There exists at least one element x in the domain such that the
predicate P(x) is true.
- **Example**: ∃x (Human(x) ∧ Rich(x))
- **Interpretation**: There exists at least one x such that x is a human and x is
rich. This statement asserts that there is at least one rich human.
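As an illustration, here is a minimal sketch that evaluates both example statements over a small finite domain (the individuals and facts are made up for the example):

```python
# Minimal sketch: evaluating quantified FOL statements over a finite domain.
people = {
    "socrates": {"human": True,  "mortal": True,  "rich": False},
    "midas":    {"human": True,  "mortal": True,  "rich": True},
    "zeus":     {"human": False, "mortal": False, "rich": True},
}

# Forall x (Human(x) -> Mortal(x)); the implication A -> B is (not A) or B.
print(all((not p["human"]) or p["mortal"] for p in people.values()))  # True

# Exists x (Human(x) and Rich(x))
print(any(p["human"] and p["rich"] for p in people.values()))  # True
```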
8. WHAT ARE THE DIFFERENT EMBEDDING TECHNIQUES AVAILABLE IN NLP IN THE CURRENT ERA?
DO SOME RESEARCH AND READ SOME BLOG POSTS.
1. **Word2Vec**: This method turns words into vectors using neural networks.
There are two main approaches: CBOW (predicts a word based on its surrounding
words) and Skip-gram (predicts surrounding words based on a given word).
It helps find relationships between words, like how "Paris" is related to
"France" (see the training sketch after this list).
2. **GloVe (Global Vectors for Word Representation)**: This method looks at how
often words appear together in large amounts of text to create vectors.
It combines counting methods and predictive methods, making it good for tasks like
finding analogies (e.g., "king" is to "queen" as "man" is to "woman").
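A minimal Word2Vec training sketch using gensim (assuming gensim is installed; the toy corpus is illustrative, and real training needs far more text). Pretrained GloVe vectors can be loaded in a similar way via `gensim.downloader` (e.g., the `glove-wiki-gigaword-50` model):

```python
# Minimal Word2Vec sketch with gensim.
from gensim.models import Word2Vec

corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 selects CBOW; sg=1 would select Skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["paris"].shape)                 # (50,)
print(model.wv.most_similar("paris", topn=2))  # nearest neighbours in the toy space
```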