Mastering LLM Applications With LangChain and Hugging Face
Hunaidkhan Pathan
Nayankumar Gajjar
www.bpbonline.com
First Edition 2025
ISBN: 978-93-65891-041
www.bpbonline.com
Dedicated to
– Hunaidkhan Pathan
Almighty, Dr. Amit Saraswat, and My Family
– Nayankumar Gajjar
About the Authors
https://rebrand.ly/bf9408
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Mastering-LLM-Applications-with-LangChain-and-Hugging-Face.
In case there's an update to the code, it will be updated on the
existing GitHub repository.
We have code bundles from our rich catalogue of books and
videos available at https://github.com/bpbpublications.
Check them out!
Errata
We take immense pride in our work at BPB Publications and
follow best practices to ensure the accuracy of our content
and provide our subscribers with an engaging reading
experience. Our readers are our mirrors, and we use their
inputs to reflect on and improve upon any human errors that
may have occurred during the publishing process. To help us
maintain quality and reach out to any readers who might be
having difficulties due to unforeseen errors, please write to
us at:
[email protected]
Your support, suggestions, and feedback are highly
appreciated by the BPB Publications' Family.
Did you know that BPB offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.bpbonline.com and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us at :
[email protected] for more details.
At www.bpbonline.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters, and receive exclusive
discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet,
we would be grateful if you would provide us with the location address or
website name. Please contact us at [email protected] with a link to
the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave
a review on the site that you purchased it from? Potential readers can then
see and use your unbiased opinion to make purchase decisions. We at BPB
can understand what you think about our products, and our authors can see
your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.
CHAPTER 1
Introduction to Python
and Code Editors
Introduction
Python is a powerful programming language that is simple
and easy to read, and it is used across many areas of
technology. The Python Enhancement Proposal (PEP)
standards are a set of rules that help you write proper
Python code. They give guidance on how Python code should
be written and how the language should evolve, which helps
keep our code clean and of high quality!
There are also different ways that we can work on our
Python codes - either through code editors or Integrated
Development Environments (IDEs).
Structure
In this chapter we will discuss the following topics:
Introduction to Python
Introduction to code editors
Objectives
Learning Python well is a great start for getting into
generative AI. Python is known for being easy to understand
and has lots of tools you can use. It is a good language to
learn if you want to understand how programming works.
Since Python is used everywhere, it is important for learning
different kinds of machine learning, which is really helpful if
you are interested in generative AI. If you are good at
Python, you can easily use the important tools and
information you need for generative AI. This makes it easier
to move on to more advanced things like understanding how
computers understand human language or deep learning.
Introduction to Python
Created by Guido van Rossum in the late 1980s, Python is
now used almost everywhere. Here is a brief introduction to
Python:
Readability: It is easy to understand any code written
by another person due to its simplicity.
Interpreted language: Python is an interpreted
language. You do not need to compile your program
before running it on the system. This makes
development faster and easier, as you can execute
code line by line, and it is also easier to debug.
Cross-platform: Your computer runs Windows?
MacOS? Linux? Do not worry! No matter what
operating system your computer has installed, they all
support Python!
Versatility: Python offers many more useful features,
such as object-oriented libraries and interfaces,
community support, and dynamic variable declaration,
which make work super smooth.
Libraries and frameworks: If asked to paint an
image without a canvas or brush, anyone would
struggle. Similarly, computer programming requires a
lot of tools to make it happen; IDEs and code editors
are among them, each with unique functionality based
on requirements, and they can be customized
accordingly. For example, PyCharm is a Python-specific
editor, and the Python world also offers Jupyter
Notebook and JupyterLab. Python itself provides a
large collection of libraries for different tasks. For
example, Django is used for web development, NumPy
and Pandas are used for data analysis, and TensorFlow
and Keras are used for deep learning.
Community and support: Pythonʼs simplicity and
community support make it a highly usable coding
language. It does not matter if you are new to
programming or an experienced pro: Python has
something for everyone! This powerful tool will always
come in handy whether it is web development, data
analysis or AI/ML tasks etc.
Open source: Python is open-source and free to use.
This encourages collaboration and innovation, as
anyone can contribute to the languageʼs development
or create their own Python packages.
Object-oriented: Python is an object-oriented
programming language, which means code
organization around objects/classes simplifies
managing complex systems.
Dynamic typing: Python uses dynamic typing, which
means variables do not require an explicit type
declaration. This speeds up development but requires
extra attention to avoid potential typing errors.
High-level language: Python is a high-level
programming language; since lower-level complexities
are abstracted away, users can concentrate mainly on
problem-solving and worrying less about underlying
hardware details.
Duck typing: Python follows the principle of duck
typing, which means an object's type is determined by
its behavior rather than its declared class. This keeps
code concise, but you must stay vigilant about object
compatibility (a short sketch follows after this list).
Multi-paradigm: Python supports multiple
programming paradigms, including procedural, object-
oriented, and functional programming. This versatility
allows us to choose the most suitable approach for our
project's requirements.
Interoperability: Python interacts smoothly with
other languages like C, C++, Java, etc., enabling
utilization of existing libraries/code.
Popular use cases: Python is usable across several
domains, e.g., web development (Django/Flask/FastAPI),
data science and machine learning (Scikit-
learn/TensorFlow), scientific computing (NumPy/SciPy),
automation scripting, or even game development; the
list is endless.
Python 2 vs. Python 3: It is important to note that
there are two major versions of Python: Python 2 and
Python 3. As of January 1st, 2020, only Python 3
receives updates and support, so give yourself an edge
by starting all new projects in Python 3.
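To make the dynamic typing and duck typing points above concrete, here is a minimal sketch (the class and function names are only illustrative):
# Dynamic typing: the same name can refer to different types
value = 42        # value is an int
value = "hello"   # now value is a str

# Duck typing: any object with a quack() method will do
class Duck:
    def quack(self):
        print("Quack!")

class Person:
    def quack(self):
        print("I can quack too!")

def make_it_quack(thing):
    thing.quack()   # no type check; only behavior matters

make_it_quack(Duck())
make_it_quack(Person())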
In conclusion, regardless of whether you are a beginner
embarking upon an initial language-learning journey or an
experienced developer handling an intricate setup, Python's
simplicity, readability, versatility across use cases, and
community backing make it handy for all kinds of coding tasks.
The Zen of Python (Python Enhancement Proposal PEP 20)
embraces design ideals and principles defining how Python code
should be written, not just for computers but for easy
understanding by fellow developers too!
Here are some of the key principles from the Zen of Python
written by Tim Peters:
Beautiful is better than ugly: Python code should
be aesthetically pleasing, clear, and elegant. This
encourages developers to write code that is not only
functional but also visually appealing.
Explicit is better than implicit: Code should be
explicit in its intentions and behavior. Avoid relying on
hidden or implicit features to make the code more
understandable.
Simple is better than complex: Simplicity is
preferred over complexity. The code should be
straightforward and easy to understand rather than
unnecessarily convoluted.
Complex is better than complicated: While
simplicity is encouraged, when complexity is necessary,
it should be well-structured and not overly
complicated. Complex code should have a clear
purpose and design.
Flat is better than nested: Deeply nested code
structures should be avoided. Keeping code relatively
flat, with fewer levels of indentation, makes it more
readable and maintainable.
Sparse is better than dense: Code should be spaced
out and not overly dense. Proper spacing and
indentation enhance readability.
Readability counts: Readability is a top priority in
Python. Code should be written with the goal of
making it easy to read and understand, not just for the
computer but also for other developers.
Special cases are insufficient to break the rules:
Consistency is important. While there may be
exceptional cases, they should not lead to a violation
of established coding conventions and rules.
Practicality beats purity: While adhering to best
practices and principles is important, practicality
should not be sacrificed in the pursuit of theoretical
perfection. Real-world solutions sometimes require
pragmatic compromises.
Errors should never pass silently. Unless
explicitly silenced: Errors and exceptions should be
handled explicitly. If you encounter an error, it should
not be ignored or suppressed unless you have a good
reason to do so.
In the face of ambiguity, refuse the temptation to
guess: When faced with uncertainty or ambiguity in
your code, it is better to be explicit and not make
assumptions. Clarity should prevail.
There should be one and preferably only one
obvious way to do it: Python encourages a single,
clear way to accomplish tasks to minimize confusion
and inconsistency in code.
Although that way may not be obvious at first
unless you are Dutch: This light-hearted remark
acknowledges that not all design decisions may
immediately make sense to everyone and hints at
Pythonʼs creator, Guido van Rossum.
Now is better than never. Although never is often
better than right now: While taking action is
important, rushing without proper consideration can
lead to errors. Itʼs a reminder to balance speed with
careful thought.
If the implementation is hard to explain, it is a
bad idea. If the implementation is easy to
explain, it may be a good idea: Code should be
designed in a way that makes its purpose and behavior
clear and straightforward. Complex, hard-to-explain
implementations should be avoided.
Namespaces are one honking great idea: Let us
do more of those: Encouragement to use
namespaces for organizing and managing variables
and functions, promoting modularity, and avoiding
naming conflicts.
The Zen of Python serves as a set of principles to guide
Python developers in writing code that is not only functional
but also elegant and maintainable. It reflects Pythonʼs
emphasis on code readability, simplicity, and the idea that
code should be written for humans to understand as much
as for computers to execute. You can access the Zen of
Python by opening a Python interpreter and typing import
this.
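For example, typing the following in a Python interpreter prints the full set of aphorisms discussed above:
# Prints the Zen of Python (PEP 20)
import this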
References
https://peps.python.org/pep-0020/#the-zen-of-python
https://docs.python.org/3/faq/
https://wiki.python.org/moin/PythonEditors
https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Further reading
For Python, you can find a list of all the code editors
and IDEs at the following links. These pages contain
all the required information, such as which platforms
each editor supports and whether the editor is open
source or not:
https://wiki.python.org/moin/PythonEditors
https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Introduction
The installation of Python is a fundamental step in getting started
with the book. It allows you to access a rich ecosystem of
libraries and tools. Depending on your project, you might need to
install additional packages for specific functionalities. Choosing
the right code editor or Integrated Development
Environment (IDE) is also essential, as it greatly influences
your development workflow. These tools, combined with Pythonʼs
versatility, set the foundation for productive and efficient
programming. In this chapter, we are going to focus on installing
Python on different OSes. We will see how to install packages
using pip, Python's package manager. Apart from this, we will
review the difference between a code editor and an IDE and
which one is better in different scenarios. As this book is written
with complete beginners in mind, we have also included some of
the basic concepts of Python that will be useful to get started
with Python.
Structure
In this chapter we will discuss the following topics:
General instructions
Installation of Python on Windows
Installation of Python on Linux
Installation of Python on MacOS
Installation of PyCharm
Installation of required packages
Object Oriented Programming concepts in Python
Objectives
By the end of this chapter, you will have a functional Python
environment by installing Python, configured with the necessary
packages tailored to the projectʼs needs, and an optimal IDE, that
is, PyCharm, to streamline the development process. This
ensures a smooth and efficient workflow, setting the stage for
successful book completion.
General instructions
Before proceeding to install Python, run the following commands
to make sure Python and pip are available:
python3 --version or python --version
pip3 --version or pip --version
Here, as you can see, we are checking two different things:
one is Python, and the other one is pip. Python is a
programming language that has a huge ecosystem of
packages for different purposes. To maintain these
packages, Python has its own package manager, which is
called pip. Using pip, you can install, update, and uninstall
any packages from Python.
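For example, the basic pip commands look like this (requests is only an illustrative package name; any package works the same way):
pip install requests
pip install --upgrade requests
pip uninstall requests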
Python versions after 3.4 come with pip pre-installed.
Hence, you will not need to install pip separately.
It should result in Python version 3.x:
In case Python 3.x is available, do not uninstall it because
uninstalling may result in system instability and might
cause a corrupted system, especially with Linux OSes.
Here, the advice will be to proceed with your current
Python version.
In case the Python version is not compatible with the
packages that we are going to use in this book, you can
try other versions of the packages. In most cases, other
versions of the packages should also work. In the rare
situation that they do not, the last section of this
chapter provides an alternative: using Docker to get
the latest Python and pip.
If you have Python version 2.x:
The suggestion will be to update the Python version, but
before that, make sure nothing on your system depends
on it; otherwise, updating might result in system issues.
Again, if, for any reason, you are not able to change the
Python version, refer to the last section of this chapter,
which shows how to use Python using Docker so that you
can use the latest version of Python.
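As a quick preview of that Docker-based option, a minimal sketch of the commands is shown below (the image tag and folder path are only illustrative; the last section of this chapter covers Docker in detail):
docker pull python:3.12
docker run -it --rm -v /path/to/Book:/app -w /app python:3.12 bash
python --version [run inside the container]
pip --version [run inside the container]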
3. As shown in Figure 2.2, tick both the boxes. After that, click
on Install Now.
4. After successful installation, you will receive the dialogue
box as shown in Figure 2.3:
Figure 2.3: Python successful installation
Installation of IDE
The freely available PyCharm Community Edition has been
created by JetBrains as an IDE catered specifically for Python
coders. Thanks to JetBrains, known for their repertoire of
resourceful development tools, here is what you should know
about this IDE:
Free and open-source: Being free and open-source makes
it a perfect coding environment for developers at different
skill levels without worrying about monetary constraints.
Python-centric IDE: Designed with Python in mind,
PyCharm provides a dedicated platform enriched with
features catering to the writing, testing, and debugging
processes of Python scripting.
Smart code assistance: Elevated productivity is offered
through smart code analysis, completion suggestions, and
efficient navigation within your codes. These features help
maintain clean scripts while preventing possible errors.
Django and web development: Django enthusiasts can
find easy accommodation for web application development
within PyCharm, which includes database management
tools along with templates specific to various web
frameworks.
Version control integration: Stay organized while
managing projects and collaborating efficiently, with
integrated support for popular VCSs like Git, Mercurial, and
Subversion.
Unit testing and debugging: Easy identification and
troubleshooting with built-in unit tests/debuggers, helping
analyze Python scripts effectively.
Customization and plugins: Easy customization of the
IDE and many different plugins available for further
integration.
Cross-platform: Supports Windows/macOS/Linux—
promising wide range usability embracing diverse
development environments.
Active community: Jump straight in! Find an abundance
of tutorials, support, and resources made possible by the
active community of users and developers.
Website: https://intellij-support.jetbrains.com/hc/en-us/community/topics/
Seamless integration with other JetBrains tools: If you
decide to use any other tool created by JetBrains, PyCharm
provides a seamless integration between all the JetBrains
tools.
To sum up, PyCharm Community Edition sharpens Python
programming across the developer spectrum, whether you are a
newcomer internalizing Python or an expert handling complicated
pipelines.
Installation of PyCharm
There are two ways to install the PyCharm community edition on
the OS: using GUI or using the command line, that is the
terminal. You can opt for any of the following options for the
installation:
GUI for Windows, Linux, and Mac:
1. Visit the webpage
https://www.jetbrains.com/pycharm/download/?section=windows
2. Go to the bottom of the page, where you will find the option
to download the PyCharm community edition.
3. Download the executable file as per the OS. For Linux files,
you will get a .tar.gz file.
4. Double-click on the executable file and install the IDE for
Windows and Mac.
a. For Linux, you need to extract the .tar.gz file. Here, you
will get a text file with a name starting with the “Install”
word.
b. Open that file where you will get installation
instructions. Follow the instructions to install IDE on
Linux.
Virtual environment
A virtual environment in Python is essentially a standalone
directory that includes a specific Python interpreter packaged
with unique sets of libraries and dependencies. This lets you
maintain separate Python environments for distinctive projects,
hence ensuring the packages and dependencies associated with
each project do not overlap. The features are:
Isolation: Each virtual environment is independent of the
system-wide Python installation and other virtual
environments. This isolation prevents conflicts and
dependency issues between different projects.
Dependency management: Virtual environments enable
you to install and manage project-specific dependencies,
including Python packages and libraries. You can control
the versions and avoid compatibility issues.
Version compatibility: Working on different projects may
require different versions of Python. Having this flexibility
enables users to engage with both older legacy versions as
well as advanced state-of-the-art ones.
Project portability: Virtual environments make it easier to
share your project with others or deploy it on different
systems. You can include the virtual environment along
with your project, ensuring that all dependencies are
consistent.
To create a virtual environment in Python there are two packages
used widely. They are virtualenv and pipenv. The choice
between pipenv and virtualenv depends on your specific
project requirements and personal preferences. Both tools serve
as essential components of Python development, but they have
distinct purposes and characteristics.
virtualenv
Let us take a look at virtualenv:
Purpose: The primary function of virtualenv is creating
isolated environments for different Python applications. Its
chief objective is to offer a clean slate where you can
install your preferred package versions, fulfil the project's
needs, and manage related dependencies without touching
other projects.
Usage: Virtualenv is typically used alongside pip, Pythonʼs
package installer. Here is how you generally use it:
Create a virtual environment using virtualenv.
Activate the virtual environment.
Use pip within the activated environment to install the
necessary Python packages. This setup ensures that the
installations and operations are confined to the virtual
environment and do not interfere with other projects or
the global Python setup.
Popularity: Virtualenv has been a staple in the Python
community for many years. It is highly regarded for its
stability and effectiveness in managing project-specific
environments. Its widespread adoption and trust within the
community make it a go-to choice for many Python
developers looking to maintain clean and manageable
project setups.
This tool is essential for developers who need to manage
multiple projects with differing dependencies or are
developing in a team setting where consistency is critical.
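As a minimal sketch of this workflow (assuming virtualenv is installed via pip and the environment is named venv, as used later in this book), the commands look roughly like this:
pip install virtualenv
virtualenv venv
venv\Scripts\activate [For Windows]
source venv/bin/activate [For Linux/Mac]
pip install pandas [example: installs only into the active environment]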
pipenv
Let us take a look at pipenv:
Purpose: The primary aim of pipenv is to unify virtual
environment administration and dependency management.
It strives to streamline creating isolated workspaces while
also controlling project-related dependencies.
Usage: With pipenv, you can efficiently construct a virtual
workspace and manage its dependencies simultaneously,
providing convenience for developers who prefer a
comprehensive solution.
Popularity: Pipenv quickly rose in favor due to its
simplistic yet user-friendly approach to managing
dependencies.
You should consider the following factors when choosing between
the two:
Simplicity versus integration: If you prefer a
straightforward and lightweight solution for virtual
environments, virtualenv might be your choice. However, if
you prefer an all-in-one tool for managing both virtual
environments and dependencies, pipenv is a good option.
Project needs: Consider the complexity of your project.
For small, simple projects, virtualenv may suffice. For
larger projects with many dependencies, pipenv can help
streamline the process.
Community and support: Both virtualenv and pipenv are
well-supported, but virtualenv has a longer history and a
well-established user base. However, pipenv has gained
momentum and may be the preferred choice for some
newer Python developers.
Compatibility: While virtualenv is compatible with
older Python versions, pipenv focuses on Python 3.6 and
above. If you are working with legacy Python versions,
virtualenv may be the safer choice.
In summary, both pipenv and virtualenv are valuable tools for
Python development. For the purpose of the book, we are going
to use virtualenv to create a virtual environment.
Folder structure
Before we proceed with this chapter, let us define folder structure
so it will be easy throughout the book to keep things organized
and structured. Also, it will be easier for us to follow the
guidelines. The folder structure is to maintain the scripts and the
custom data. We are going to add folders and scripts as per the
requirement as we proceed to the different sections of the book.
1. Create a folder called Book. You can create it anywhere you
like. Make sure that the parent folder does not have spaces
in the name. Spaces in names sometimes cause issues;
hence, avoid them if possible.
2. Under this folder, create a text file called
requirements.txt:
a. Add the following lines in the file:
pandas==2.2.2
transformers==4.42.3
langchain==0.2.6
langchain_community==0.2.6
langchain-huggingface==0.0.3
accelerate==0.32.1
unstructured[pdf]==0.14.10
wikipedia==1.4.0
nltk==3.8.1
textblob==0.18.0
scikit-learn==1.5.1
spacy==3.7.5
gensim==4.3.2
pattern==3.6.0
huggingface_hub==0.23.4
torch==2.3.1
sentence_transformers==3.0.1
chromadb==0.5.3
faiss-cpu==1.8.0
evaluate==0.4.2
rouge_score==0.1.2
pypdf==4.2.0
gradio==4.37.2
origamibot==2.3.6
scipy~=1.12.0
tf_keras==2.16.0
git+https://github.com/google-research/bleurt.git
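Once this requirements.txt is in place, the packages can be installed into the activated virtual environment from the Book folder with a single command:
pip install -r requirements.txt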
PEP 8 standards
PEP 8 is the Python Enhancement Proposal that outlines the style
guide for writing Python code. Following PEP 8 standards helps
make your code more readable and maintainable. Here are some
key guidelines and recommendations from PEP 8 (a small example
follows this list):
Indentation:
Use 4 spaces per indentation level. Avoid using tabs.
The maximum line length should be 79 characters (or 72
for docstrings and comments).
Imports:
Imports should usually be on separate lines and at the
top of the file.
Use absolute imports rather than relative imports.
Whitespace in expressions and statements:
Avoid extraneous whitespace in the following situations:
Immediately inside parentheses, brackets, or braces.
Immediately before a comma, semicolon, or colon.
Immediately before the open parenthesis that starts an
argument list.
Do use whitespace in the following cases:
Around binary operators (e.g., a + b).
After a comma in a tuple (e.g., a, b).
Comments:
Comments should be complete sentences and placed on a
line of their own.
One should use docstrings to document modules, classes,
and functions.
Inline comments should be used in a restricted manner
and only when necessary for clarification.
Naming conventions:
Use descriptive names for variables, functions, and
classes.
Function names should be lowercase with words
separated by underscores (snake_case).
Class names should follow the CapWords (CamelCase)
convention.
Constants should be in ALL_CAPS.
Whitespace in functions and expressions:
Separate functions with two blank lines.
Use blank lines to indicate logical sections in a function.
Keep expressions on the same line unless they are too
long.
Programming recommendations:
Use a single leading underscore for non-public methods
and instance variables (for example, _internal_method).
Follow the “Zen of Python” (PEP 20) principles, which
you can view by running import this in a Python
interpreter.
Code layout:
Avoid putting multiple statements on a single line.
Documentation:
Provide clear, informative, and concise documentation
using docstrings.
Use docstring formats like reStructuredText,
NumPy/SciPy docstring conventions, or Google-style
docstrings.
Exceptions:
Use the except clause without specifying an exception
type sparingly. Be specific about the exceptions you
catch.
Note: PEP 8 is a guideline for coding style, not strict
rules. Following PEP 8 is widely recognized as good
practice, but there can be times when you need to vary
from these guidelines due to practical considerations or
align with the style of existing code. Sometimes
maintaining consistency within a project matters more
than strictly sticking to PEP 8. So, it is best if you follow
the styling guide used in your project or organization,
even if it is not exactly like PEP 8.
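To make these guidelines concrete, here is a small, purely illustrative snippet that follows the naming, indentation, and whitespace conventions listed above:
MAX_RETRIES = 3  # Constants in ALL_CAPS


class OrderProcessor:  # Class names in CapWords
    """Process customer orders."""

    def _validate(self, order):  # Leading underscore: non-public method
        return order is not None


def compute_total(prices, tax_rate=0.18):  # snake_case function name
    """Return the total price including tax."""
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)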
PEP 8 configuration:
You can configure the PEP 8 settings in PyCharm to suit
your preferences. Go to File | Settings (or PyCharm |
Preferences on a Mac) and navigate to Editor | Code
Style | Python. Here, you can adjust the PEP 8 settings
to match your preferred coding style.
Code inspection:
PyCharm can perform code inspections to detect PEP 8
violations. If you see yellow or red highlighting in your
code, you can hover over it to see the issue and access
options to correct it.
PEP 8 quick-fixes:
Whenever a PEP 8 violation is flagged in your Python
script in PyCharm's editor, you can apply an instant fix to
the highlighted code by pressing Alt+Enter. This is called a
Quick Fix.
Integration with linters:
PyCharm integrates with popular Python code analysis
tools like Flake8, Pylint, and Black. You can configure
these tools to provide PEP 8 checks and formatting
automatically.
Code documentation:
PyCharm helps you create PEP 8 compliant
documentation strings (docstrings) by providing
templates and hints as you write the documentation.
Code navigation:
You can use features within PyCharm to quickly move
between your definitions and understand how things are
connected in your Python file.
By default, PyCharm is configured to follow PEP 8 coding
standards, and it is designed to be user-friendly for developers
who want to write PEP 8 compliant code. However, you can
customize the settings to align with your preferences or team
standards. Using these features can help you maintain clean and
PEP 8 compliant Python code in your projects.
Object Oriented Programming concepts in Python
Inheritance:
Subclasses can inherit attributes and methods from
superclasses. Look at the following code, for example:
class Animal:
    def __init__(self, name):
        self.name = name

class Dog(Animal):
    def speak(self):
        print(f"{self.name} says Woof!")
Polymorphism:
Polymorphism is achieved through duck typing.
If an object behaves like another object, it is considered
polymorphic:
class Cat:
    def speak(self):
        print("Meow!")

def make_animal_speak(animal):
    animal.speak()

my_cat = Cat()
make_animal_speak(my_cat)
Abstraction:
You can define abstract base classes using the abc
module.
Subclasses must implement abstract methods:
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.1415 * self.radius ** 2
Method overriding:
Subclasses can provide their own implementation of a
method:
class Animal:
    def speak(self):
        print("Generic animal sound")

class Dog(Animal):
    def speak(self):
        print("Woof!")

my_dog = Dog()
my_dog.speak()
These small code examples illustrate how OOP concepts are
implemented in Python, making it a versatile language for
building complex, organized, and maintainable applications.
Classes in Python
In Python, think of a class as a blueprint or design for making
objects. It determines how to build and behave around the thing
you are creating, including features (information) and methods
(activities). Python is very comfortable with OOP – that is why
classes are such an important concept in it! Let us take a closer
look at them:
Defining a class:
You make a class using the class keyword, followed by
its name. Usually, we like to start class names with
capital letters! Anything inside the body of your class will
be attributes or methods:
class MyClass:
    attribute1 = 0
    attribute2 = "Hello"

    def method1(self):
        pass
Attributes:
Attributes are variables that belong to a class. They
define the characteristics (data) of the objects created
from the class:
obj1 = MyClass()
obj2 = MyClass()
obj1.attribute1 = 42
obj2.attribute2 = "World"
Methods:
Methods are functions defined within a class. They define
the behavior and actions that objects created from the
class can perform:
class MyClass:
    def say_hello(self):
        print("Hello, world!")

obj = MyClass()
obj.say_hello()  # Calls the say_hello method
The self parameter:
In Python, the first parameter of a method is self, which
refers to the instance of the class. You use self to access
attributes and call other methods within the class:
class MyClass:
    def set_attribute(self, value):
        self.attribute1 = value

    def get_attribute(self):
        return self.attribute1

obj = MyClass()
obj.set_attribute(42)
value = obj.get_attribute()  # Retrieves the value
Constructor method:
The __init__ method is a special method (constructor)
that is automatically called when an object is created
from a class. It is used to initialize attributes:
class MyClass:
    def __init__(self, initial_value):
        self.attribute1 = initial_value
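Putting these pieces together, here is a minimal sketch that creates an object from such a blueprint and uses it (the describe method name is only illustrative):
class MyClass:
    def __init__(self, initial_value):
        # The constructor initializes the attribute
        self.attribute1 = initial_value

    def describe(self):
        # A method that uses the instance attribute
        print(f"attribute1 is {self.attribute1}")

obj = MyClass(42)   # __init__ is called automatically
obj.describe()      # Prints: attribute1 is 42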
Functions in Python
Functions are so useful in Python – they are chunks of code you
can use again whenever you need them! They let your computer
perform tasks when asked. By using functions, your code gets
tidier and easier to read and manage:
Defining a function:
You define a function using the def keyword, followed by
the function name and a set of parentheses that can
contain input parameters (arguments). The functionʼs
code is indented below the def statement:
def greet(name):
    print(f"Hello, {name}!")
Calling a function:
To use a function, you call it by using its name followed
by parentheses. If the function has parameters, you
provide the required values inside the parentheses:
greet("Nayan") # Calls the greet function with the
argument "Nayan"
Parameters and arguments:
Parameters are placeholders that receive values when
the function is called. Arguments are the actual values
passed into the function:
def add(x, y):
    return x + y
Default parameters:
You can provide default values for function parameters.
If no argument is passed for a parameter, the default
value is used:
def power(x, y=2):
    return x ** y
Scope:
Variables created within a function have local scope,
meaning they exist only inside that function and can
only be accessed there. Variables created outside any
function have global scope, making them visible and
usable throughout the entire script:
x = 10

def my_function():
    x = 5  # This is a local variable
    print(x)  # Prints 5

my_function()
print(x)  # Prints 10
Functions are a crucial part of Python, allowing you to structure
your code and break it down into manageable pieces. They
promote code reusability and maintainability, making your
programs more organized and efficient.
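As a quick recap, the following sketch calls the functions defined above and shows the effect of the default parameter:
def greet(name):
    print(f"Hello, {name}!")

def add(x, y):
    return x + y

def power(x, y=2):
    return x ** y

greet("Nayan")        # Hello, Nayan!
print(add(2, 3))      # 5
print(power(4))       # 16, because y defaults to 2
print(power(4, 3))    # 64, because y is given explicitly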
If-else in Python
In the Python world, “if-else” does wonders in controlling
conditional flow. It helps us assign different bunches of
instructions to be executed based upon given evaluated
conditions that turn true or false.
1. if condition1:
2.     # Code to be executed if condition1 is true
3. elif condition2:
4.     # Code to be executed if condition1 is false, and condition2 is true
5. else:
6.     # Code to be executed if both condition1 and condition2 are false
In case of two possible outcomes, you can exclude the middle
elif statement. If more than two outcomes are possible, then you
can include as many elif as required. Sample example could be
as follows:
1. grade = 85
2.
3. if grade >= 90:
4.     print("A")
5. elif grade >= 80:
6.     print("B")
7. elif grade >= 70:
8.     print("C")
9. else:
10.     print("D")
In Python, you can write a compact one-liner “if-elif-else”
statement using the conditional (ternary) expression. The
conditional expression allows you to evaluate a condition and
provide different values or expressions based on whether the
condition is true or false. Here is the syntax:
1. value_if_true if condition else value_if_false
Given below are two different examples. One is a normal if-else
condition, and second is if-elif-else condition:
1. age = 20
2.
3. status = "adult" if age >= 18 else "minor"
4. print(f"You are a {status}.")
5.
6. grade = 85
7.
8. result = "A" if grade >= 90 else ("B" if grade >= 80 else ("C" if grade >= 70 else "D"))
9. print(f"Your grade is {result}.")
Conclusion
In conclusion, the process of setting up a Python ecosystem or
environment for any project involves the installation of Python
itself, ensuring the presence of needed libraries for the project,
and selecting an appropriate IDE or code editor. By carefully
navigating through the steps mentioned in this chapter,
developers will be able to create a Python environment for this
book. The seamless integration of Python, necessary packages,
and a chosen code editor or IDE not only facilitates efficient
coding but also sets the stage for a productive and enjoyable
development experience. In the next chapter, we will practice
how to create and run Python scripts in different ways. If you
know ways to do this, then it will be easier to run scripts in
different scenarios. One such example is in case GUI is not
available, then you can run the script using a terminal or
command line.
CHAPTER 3
Ways to Run Python Scripts
Introduction
This chapter is another fundamental step in getting started with
the book. It shows you how to run Python scripts in different
ways. It is good to know the different ways of running Python
scripts, as, in certain situations, some ways might not be
available. For example, GUI-based tools might not be available
on servers, and in this case, you might need to work with the
terminal to execute Python scripts. This chapter explores the
diverse methods for executing Python scripts, from the command
line to integrated development environments and web
frameworks. You will also learn how to run scripts using different
methods, and we will cover the importance of choosing the right
method for your specific project needs to ensure efficient code
execution and deployment.
Structure
In this chapter we will discuss the following topics:
Setting up the project
Running Python scripts from PyCharm
Running Python scripts from Terminal
Running Python scripts from JupyterLab and Notebook
Running Python scripts from Docker
Objectives
By the end of this chapter, you will understand different ways to
run Python scripts. It will not only help you in this book, but it will
help in your future endeavors as well to run Python scripts in
different scenarios in different ways.
Note: It is always advisable to create a virtual
environment. Install the required packages into the
created virtual environment. After this, run the scripts
using the Python interpreter available in the created
virtual environment.
The following figure shows how the screen will open, as discussed
in the preceding steps:
After all these steps, you should have a scripts folder with
hello_world.py underneath it. Overall, the setup will look
similar to what is shown in Figure 3.1.
Figure 3.18: Terminal – Provide virtual env python interpreter and script path
b. Here, if you are in the directory where Python scripts
reside, you can refer to the following command. For
example, we have changed the directory to
E:\Repository\Book\scripts, and in this case, we can run
the following command:
E:\Repository\Book\venv\Scripts\python.exe hello_world.py
c. In this option, we will activate the virtual environment and
then run the script:
i. We have created a virtual environment using the
virtualenv command. Hence, go to the project directory
where the venv folder resides. Here, venv is our virtual
environment name. If you have used a different name
than venv, go to the directory where the folder of that
name resides.
In our case, we need to go to the directory
E:\Repository\Book where we will get the above files.
ii. Open the terminal or CMD from that particular location
or open the terminal and change the directory to the
location where the venv folder resides.
iii. Execute the following command, which will activate the
virtual environment:
venv\Scripts\activate [For Windows]; source
venv/bin/activate [For Linux/Mac]
iv. Now, execute the following command to run the script:
python scripts\hello_world.py
v. Here, note that the OS-based path separator, i.e., "/" or
"\", can vary.
Refer to the following figure:
Figure 3.19: Terminal – Activate virtual environment and run the script
Running Python scripts from Jupyter Lab and
Notebook
Being able to use Jupyter Lab or Notebook is very important
when you are working on things like data exploration, building
models, designing new technologies, and collaborating with
others. With Jupyter, you can create and run code step by step in
an interactive way - perfect for analyzing data or creating an ML
model. Plus, because it is easy to add text, images, and
graphs to your work, these tools are excellent for documenting
the process of an analysis alongside the code.
Here is how to get Python scripts running:
1. Activate the virtual environment as mentioned in the
preceding section Run Python Scripts from Terminal. You
need to run the commands below on the same terminal.
2. Run the following command to install JupyterLab and
Jupyter Notebook:
a. pip install jupyterlab notebook
3. Run any of the following commands to start JupyterLab or
Jupyter Notebook. Here we have a terminal, which is in the
root directory of the project, that is, E:\Repository\Book.
a. jupyter lab
b. jupyter notebook
4. You will get the screen shown in Figure 3.20, which is of
JupyterLab. This is the screen that you will get first. You can
consider it as the home page. From this, select any section
of Notebook | Python 3 (ipykernel) OR Console | Python 3
(ipykernel).
a. Here, you will get the screen where you need to execute
the command as mentioned in Figure 3.21.
5. Figure 3.22 shows Jupyter Notebook home page screen. You
can also select from Notebook or Console here. In the next
screen, as shown in Figure 3.23, execute the command
provided at the end.
6. For both Jupyter Notebook and JupyterLab run the following
command:
a. with open("scripts\\hello_world.py", "r") as scrpt:
b.     scrpt_content = scrpt.read()
c.
d. exec(scrpt_content)
Conclusion
In conclusion, mastering the various ways to run Python scripts
empowers you to unleash the full potential of this versatile
language. You can choose the approach that best suits your
specific needs and preferences, such as harnessing the power of
the terminal for automation, control, and deployment, embracing
the interactive and exploratory nature of Jupyter Lab and
Notebook for data analysis and visualization, utilizing IDEs for
comprehensive development environments and debugging tools,
or creating standalone executables for easy distribution and
cross-platform compatibility.
Remember, the most effective approach often involves a
combination of these methods, which are strategically employed
throughout your Python journey. By understanding the strengths
and nuances of each execution environment, you will be
equipped to tackle any coding challenge with confidence and
efficiency.
In the next chapter, we will understand and practice important
NLP concepts, which are a must and the basis of most of the
current NLP algorithms. In that chapter we will explore some of
the very useful and often used terminologies and see practical
implementation of the same.
Introduction
Natural language processing (NLP) is a key area in the
world of Artificial Intelligence (AI). The field deals with the
interaction between human languages and computer systems.
It provides algorithms and models that enable machines to
understand, interpret, and generate human language in a way
that goes beyond the superficial and is genuinely useful. This
chapter has been curated to give you an explanation of this
intricate field. We will explore its foundational principles, the
intricacies of its techniques, and the practical domains where
NLP finds its utility.
Structure
In this chapter we will discuss the following topics:
Natural Language Processing overview
Large language models
Text classification
Prompt engineering
Objectives
By the end of this chapter, you will have an understanding
of NLP and its different concepts. It will help you to
understand and exercise further topics in the book. You will
gain knowledge of how a computer works with text data and
a solid foundation in the principles and practical applications
of NLP.
Key concepts
To understand the key concepts of NLP practically, create a
new folder nlp_concepts with blank __init__.py under the
scripts folder that we have created in the earlier chapters.
In general, a folder containing __init__.py is considered a
Python package. The folder structure will look like the one
shown in Figure 4.1. Here, Untitled.ipynb has been
created in Chapter 3, Ways to Run Python Scripts, to show
how to run Python scripts via Jupyter Notebook or Jupyter
Lab. The .idea is the internal folder of PyCharm, which will
be created automatically when you open any folder as a
PyCharm project. .ipynb_checkpoints is the internal folder
of Jupyter Notebook that was created by it. Create scripts as
shown under the folder structure of scripts, as shown in the
following figure:
Figure 4.1: Folder structure
Now, let us see both the theoretical and the practical parts.
Note: You will see that different packages that we are
going to use for different functionalities will behave
differently. Some packages will provide correct
results. Some will provide incorrect or intermediate
results. As these packages evolve and update over
time, you will need to evaluate them from time to
time to confirm that the results they produce are
correct.
Corpus
In NLP, a corpus is the name for a big, organized bunch of
text documents. These texts can be written pieces or even
transcriptions from spoken language - or sometimes both!
They span all kinds of areas, such as social media,
academia, and news articles, to name just a few. Now, by
using complicated analysis methods on corpora (plural of
corpus) in NLP, patterns are quickly figured out relating to
the characteristics and structures of the languages. In the
world of languages and computing, especially with machine
learning applications, the use of corpora is extensive.
Types of corpora are as follows:
Monolingual corpora: This type of corpora only has
text from one single language.
Examples:
Corpus of Contemporary American English
(COCA)
British National Corpus (BNC)
French Treebank (FTB)
Balanced Corpus of Contemporary Written
Japanese (BCCWJ)
Russian National Corpus (RNC)
Multilingual corpora: Here we find multiple
language texts (corpus) meant for cross-language
research work.
Examples:
Europarl Corpus: A parallel corpus containing
the proceedings of the European Parliament,
available in 21 European languages.
United Nations Parallel Corpus: Contains
official documents and their translations in the six
official UN languages (Arabic, Chinese, English,
French, Russian, and Spanish).
OpenSubtitles: A large-scale multilingual corpus
derived from movie and TV subtitles, available in
many languages.
N-grams
N-grams is a technique used in natural language processing
to understand human languages. You can think of n-grams
like pieces of a sentence puzzle.
Why do we use n-grams? Well, they help us predict what
word might come next after youʼve started typing or
speaking! This prediction process is incredibly valuable for
things like creating new stories and helping with writing
texts faster!
Imagine youʼre trying to build software that recognizes
speech - being able to understand the likely sequences of
words in someoneʼs speech would make this job much
easier! That is why we also use them for tools that translate
between languages.
Here are some of the types of n-grams:
Unigrams (1-grams): These are just single items,
usually words. If you have the sentence “I love pizza,”
then each word (I, love, pizza) becomes a unigram.
Bigrams (2-grams): This refers to pairs of
consecutive items. Let us take our previous sentence
as an example again ("I love Pizza"). Here our bigrams
would be “I love” and “love Pizza."
Trigrams (3-grams): Trigrams consist of three
consecutive items. For the same sentence, the only
trigram would be "I love Pizza".
N-grams in General: You can have n-grams with any
value of N, depending on your specific requirements.
For instance, 6-grams would involve sequences of six
items from the text. Now these items can be anything,
i.e., words, sentences, characters, etc.
Fantastic, is it not? Simply breaking down sentences into
these unique groupings called n-grams gives structure
and predictability to our language, which ultimately
makes life so much easier for machines trying their best
to grasp the intricate nuances found within human
conversation patterns.
Python packages:
Natural Language Toolkit (NLTK)
spaCy
TextBlob
Scikit-Learn
HuggingFace
Code:
Put the following code in the file called ngrams.py
[E:\Repository\Book\scripts\nlp_concepts\ngrams.py]:
1. # Import required packages
2. from nltk.util import ngrams
3. import spacy
4. from textblob import TextBlob
5. from sklearn.feature_extraction.text import CountVectorizer
6. from transformers import AutoTokenizer
7.
8.
9. # ======================================================================
10. # NLTK
11. print("*" * 25)
12. print("Below example of N Grams is using NLTK package")
13. text = "This is an example sentence for creating n-grams."
14. n = 2 # Specify the n-gram size
15. bigrams = list(ngrams(text.split(), n))
16. print(bigrams)
17.
18.
19. # ======================================================================
20. # Spacy
21. print("*" * 25)
22. print("Below example of N Grams is using Spacy package")
23. # It is to download english package. Not required to run every time. Run below code from terminal after activating virtual environment
24. # python -m spacy download en_core_web_sm
25. nlp = spacy.load("en_core_web_sm")
26. text = "This is an example sentence for creating n-grams."
27. n = 2 # Specify the n-gram size
28. tokens = [token.text for token in nlp(text)]
29. ngrams = [tokens[i : i + n] for i in range(len(tokens) - n + 1)]
30. print(ngrams)
31.
32.
33. # ======================================================================
34. # TextBlob
35. print("*" * 25)
36. print("Below example of N Grams is using TextBlob package")
37. # This is to download required corpora. Not required to run every time. Run below code from terminal after activating virtual environment
38. # python -m textblob.download_corpora
39. text = "This is an example sentence for creating n-grams."
40. n = 2 # Specify the n-gram size
41. blob = TextBlob(text)
42. bigrams = blob.ngrams(n)
43. print(bigrams)
44.
45.
46. # ======================================================================
47. # Scikit Learn
48. print("*" * 25)
49. print("Below example of N Grams is using Scikit Learn package")
50. # For scikit learn list is required hence providing list.
51. text = ["This is an example sentence for creating n-grams."]
52. n = 2 # Specify the n-gram size
53. vectorizer = CountVectorizer(ngram_range=(n, n))
54. X = vectorizer.fit_transform(text)
55. # Get the n-gram feature names
56. feature_names = vectorizer.get_feature_names_out()
57. # Print the n-grams
58. for feature_name in feature_names:
59.     print(feature_name)
60.
61.
62. # ======================================================================
63. # Hugging Face Package
64. print("*" * 25)
65. print("Below example of N Grams is using Hugging Face package")
66.
67. # Define your text
68. text = "This is an example sentence for creating ngrams with Hugging Face Transformers."
69.
70. # Choose a pretrained tokenizer
71. tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
72.
73. # Tokenize the text
74. tokens = tokenizer.tokenize(text)
75.
76. # Generate bigrams
77. bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
78.
79. # Generate trigrams
80. trigrams = [(tokens[i], tokens[i + 1], tokens[i + 2]) for i in range(len(tokens) - 2)]
81.
82. # Print the bigrams
83. for bigram in bigrams:
84.     print(bigram)
85.
86. # Print the trigrams
87. for trigram in trigrams:
88.     print(trigram)
Tokenization
Tokenization is performed to convert a continuous text or
speech into discrete, manageable units. It is the process of
breaking down text into smaller units, typically words or
sub-words (tokens), which are essential for further analysis.
Tokens are the building blocks used for various NLP tasks,
including text analysis, sentiment analysis, text
classification, and more.
The types of tokens are described as follows:
Tokens can represent words, sub words, or characters,
depending on the level of granularity required.
In word-level tokenization, text is split into words, for
example, “I love NLP” → “I", “love", “NLP".
Subword tokenization means splitting words into
smaller parts that still have meaning. For example, the
word “unhappiness” can be broken into “un” and
“happiness".
Character-level tokenization breaks down words even
further, treating each letter as a token. For example,
the word “hello” is split into “h", “e", “l", “l", “o".
Sentence tokenization involves breaking text
into sentences.
Tokenization helps in assigning grammatical
categories to each token, identifying named entities in
the text, analyzing the sentiment of individual words
or phrases, training models to understand and
generate human-like text. It also helps in categorizing
text based on token features.
Python packages:
NLTK
spaCy
The built-in string methods can be used for
tokenization
Regular expressions in Pythonʼs built-in re module
Tokenizers from Hugging Face (used with transformer
models)
TextBlob
LangChain
We have not included the code for this package. It
supports a number of tokenizers which are as
follows:
tiktoken
spaCy
SentenceTransformers
NLTK
Hugging Face tokenizer
You can see the examples of the same on below
URL:
https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token#hugging-face-tokenizer
Code:
Put the following code in the file called tokens.py
[E:\Repository\Book\scripts\nlp_concepts\tokens.py]:
1. # Import required packages
2. import nltk
3. from nltk.tokenize import word_tokenize, sent_tokenize
4. import spacy
5. from transformers import AutoTokenizer
6. from textblob import TextBlob
7.
8.
9. # ======================================================================
10. # NLTK
11. print("*"*25)
12. print("Below example of Tokens is using NLTK package")
13.
14. # Download the required dataset. Not required to run every time.
15. nltk.download('punkt')
16. text = "This is an example sentence. Tokenize it."
17.
18. # Word tokenization
19. words = word_tokenize(text)
20. print("Word tokens:", words)
21.
22. # Sentence tokenization
23. sentences = sent_tokenize(text)
24. print("Sentence tokens:", sentences)
25.
26.
27. # ======================================================================
28. # Spacy
29. print("*"*25)
30. print("Below example of Tokens is using Spacy package")
31.
32. # It is to download english package. Not required to run every time. Run below code from terminal after activating virtual environment
33. # python -m spacy download en_core_web_sm
34. nlp = spacy.load("en_core_web_sm")
35.
36. text = "This is an example sentence. Tokenize it."
37.
38. doc = nlp(text)
39.
40. # Word tokenization
41. words = [token.text for token in doc]
42. print("Word tokens:", words)
43.
44. # Sentence tokenization
45. sentences = [sent.text for sent in doc.sents]
46. print("Sentence tokens:", sentences)
47.
48.
49. # ======================================================================
50. # Builtin Methods
51. print("*"*25)
52. print("Below example of Tokens is using Builtin package")
53.
54. text = "This is an example sentence. Tokenize it."
55.
56. # Word tokenization
57. words = text.split(" ")
58. print("Word tokens:", words)
59.
60. # Sentence tokenization
61. sentences = text.split(".")
62. # Remove 3rd element which will be "". Also remove extra spaces around non-blank elements.
63. sentences = [k.strip() for k in sentences if k != ""]
64. print("Sentence tokens:", sentences)
65.
66.
67. # ======================================================================
68. # Huggingface Transformers
69. print("*"*25)
70. print("Below example of Tokens is using Huggingface package")
71.
72. # Use pretrained model
73. tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
74.
75. text = "This is an example sentence. Tokenize it."
76.
77. # Tokenize the text into word-level tokens
78. word_tokens = tokenizer.tokenize(text)
79. print("Word tokens:", word_tokens)
80.
81. # We tokenize the text into sentence-level tokens by adding special tokens (e.g., [CLS] and [SEP]) to the output.
82. # [CLS] stands for Classification Token and is used in BERT and other transformers for classification tasks. It is also
83. # inserted at the beginning of the text sequence.
84. # [SEP] stands for Separator Token and is used in BERT and other transformers. It is used to separate different segments
85. # of the input text.
86. # Tokenize the text into sentence-level tokens
87. sent_tokens = tokenizer.tokenize(text, add_special_tokens=True)
88. print("Sentence tokens:", sent_tokens)
89.
90. # Optionally, you can convert the sentence tokens into actual sentences
91. sentences = tokenizer.convert_tokens_to_string(sent_tokens)
92. print("Sentences:", sentences)
93.
94.
95. # ======================================================================
96. # Textblob
97. print("*"*25)
98. print("Below example of Tokens is using Textblob package")
99.
100. text = "This is an example sentence. Tokenize it."
101.
102. blob = TextBlob(text)
103.
104. # Word tokenization
105. words = blob.words
106. print("Word tokens:", words)
107.
108. # Sentence tokenization
109. sentences = blob.sentences
110. print("Sentence tokens:", sentences)
Difference in tokens and n-grams
The main difference is that tokens represent individual text
units, whereas n-grams are sequences of tokens (or other
text units), created by considering n consecutive items from
the text. Tokens are essential for basic text analysis, while n-
grams are useful for capturing patterns, relationships, and
context in the text, and they are often used in language
modeling, text analysis, and various NLP tasks.
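As a small, hedged sketch (not part of the book's code bundle), the snippet below builds bigrams from NLTK word tokens using nltk.util.ngrams; the example sentence is the same one used in tokens.py:
# A minimal n-gram sketch, assuming NLTK and its "punkt" data are installed.
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "This is an example sentence. Tokenize it."
tokens = word_tokenize(text)        # tokens: individual text units
bigrams = list(ngrams(tokens, 2))   # n-grams: sequences of n (here 2) consecutive tokens

print("Tokens:", tokens)
print("Bigrams:", bigrams)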
Stemming
It is a text normalization technique in NLP that aims to
reduce words to their word stems or roots. The goal of
stemming is to remove suffixes from words to achieve a
common base form. This helps in treating words with the
same stem as equivalent, thus reducing the dimensionality
of the text data and simplifying text analysis.
Stemming algorithms remove common endings from words,
like “-ing,” “-ed,” and “-s,” to find the base form of the word.
The Porter stemming algorithm is a famous and widely used
example of these algorithms. It uses a set of rules to strip
suffixes from words. Other stemming algorithms like
Snowball (Porter2) and Lancaster stemming are also
commonly used.
Python packages:
NLTK
Lemmatization
It is a text normalization technique in NLP that reduces
words to their base or dictionary form, known as the
lemma.
The goal of lemmatization is to transform inflected words
into their root forms, which often represent the canonical or
dictionary meaning of a word.
Unlike stemming, which removes suffixes to approximate
word stems, lemmatization applies linguistic rules and
analyzes the wordʼs meaning to find the correct lemma.
Lemmatization is valuable in NLP when you need to
normalize words to their canonical forms, ensuring that
words with different inflections are treated as equivalent.
Itʼs commonly used in information retrieval, search engines,
and text analysis tasks where precise word forms are
important.
Stemming is different from lemmatization. While stemming
is a rule-based process that often results in approximate
word stems, lemmatization involves finding the base or
dictionary form of a word (the lemma) and is linguistically
more accurate.
Python packages:
NLTK
spaCy
TextBlob
Pattern
Code:
Put the following code in the file called stem_lem.py
[E:\Repository\Book\scripts\nlp_concepts\stem_lem.py]
Note: This code contains both stemming and
lemmatization.
1. import nltk
2. from nltk.stem import PorterStemmer
3. from nltk.stem import WordNetLemmatizer
4. import spacy
5. from textblob import Word
6. # from pattern.en import lemma  # Pattern is unmaintained; its example below is commented out, so this import is commented out as well
7.
8. # ======================================================================
9. # NLTK
10. print("*" * 25)
11. print("Below example of Stemming using NLTK package")
12. nltk.download("punkt")  # Download necessary data (if not already downloaded)
13.
14. # Create a PorterStemmer instance
15. stemmer = PorterStemmer()
16.
17. # Example words for stemming
18. words = ["jumps", "jumping", "jumper", "flies", "flying"]
19.
20. # Perform stemming on each word
21. stemmed_words = [stemmer.stem(word) for word in words]
22.
23. # Print the original and stemmed words
24. for i in range(len(words)):
25. print(f"Original: {words[i]}\tStemmed: {stemmed_words[i]}")
26.
27.
28. # ======================================================================
29. # NLTK
30. print("*" * 25)
31. print("Below example of Lemmatization using NLTK package")
32. nltk.download("wordnet")  # Download necessary data (if not already downloaded)
33.
34. lemmatizer = WordNetLemmatizer()
35.
36. # Example words for lemmatization
37. words = ["jumps", "jumping", "jumper", "flies", "flying"]
38.
39. # Perform lemmatization on each word
40. lemma_words = [
41. lemmatizer.lemmatize(word, pos="v") for word in words
42. ] # Specify the part of speech (e.g., 'v' for verb)
43.
44. # Print the original and lemmatized words
45. for i in range(len(words)):
46. print(f"Original: {words[i]}\tLemmatized: {lemma_words[i]}")
47.
48.
49. # ======================================================================
50. # SpaCy
51. print("*" * 25)
52. print("Below example of Lemmatization using Spacy package")
53.
54. nlp = spacy.load("en_core_web_sm")
55.
56. # Example words for lemmatization
57. words = ["jumps", "jumping", "jumper", "flies", "flying"]
58.
59. # Perform lemmatization on each word
60. lemma_words = [nlp(word)[0].lemma_ for word in words]
61.
62. # Print the original and lemmatized words
63. for i in range(len(words)):
64. print(f"Original: {words[i]}\tLemmatized: {lemma_words[i]}")
65.
66.
67. # ======================================================================
68. # TextBlob
69. print("*" * 25)
70. print("Below example of Lemmatization using Textblob package")
71.
72. # Example words for lemmatization
73. words = ["jumps", "jumping", "jumper", "flies", "flying"]
74.
75. # Perform lemmatization on each word
76. lemma_words = [
77. Word(word).lemmatize("v") for word in words
78. ] # Specify the part of speech (e.g., 'v' for verb)
79.
80. # Print the original and lemmatized words
81. for i in range(len(words)):
82. print(f"Original: {words[i]}\tLemmatized: {lemma_words[i]}")
83.
84.
85. # ======================================================================
86. # Pattern
87. # Not in use any more; the package has not been updated since 2018.
88. # print("*" * 25)
89. # print("Below example of Lemmatization using Pattern package")
90.
91. # # Example words for lemmatization
92. # words = ["jumps", "jumping", "jumper", "flies", "flying"]
93.
94. # # Perform lemmatization on each word
95. # lemma_words = [lemma(word) for word in words]
96.
97. # # Print the original and lemmatized words
98. # for i in range(len(words)):
99. # print(f"Original: {words[i]}\tLemmatized: {lemma_words[i]}")
Lowercasing
Converting all text to lowercase to ensure case insensitivity.
Lowercasing is a crucial text preprocessing step in NLP. It
ensures case-insensitivity in search engines, simplifies text
classification, and aids NER tasks by recognizing named
entities regardless of case. Additionally, lowercasing is
integral to language models and word embeddings, text
normalization, tokenization, and text comparison.
It is a standard preprocessing step in machine learning,
promoting consistency and simplifying feature engineering.
Python packages:
The built-in string method .lower() can be used for lowercasing.
For example:
temp = "Building LLM applications with Langchain and Hugging Face"
print(temp.lower())
# Output
"building llm applications with langchain and hugging face"
Part-of-speech tagging
Part of speech tagging basically means identifying and
labeling the different roles each word plays within a
sentence. Things like ʼnouns,ʼ which are names for people or
objects, ʼverbsʼ that describe an action, and ʼadjectivesʼ that
tell us more about those nouns – they all get sorted out!
This is very fundamental to Natural Language
Processing (NLP) because it helps machines understand
grammarʼs structure pretty well.
When we know what role every word plays in the sentence
structure, we can better understand its meaning. For
instance, when using Named Entity Recognition (NER),
Part of Speech (POS) tagging provides context making it
easier to figure out important pieces of information like who
or what is being talked about.
Moreover, POS tagging is a critical preprocessing step for
training language models, capturing linguistic structure.
Python packages:
NLTK
spaCy
TextBlob
Code:
Put the following code in the file called pos.py
[E:\Repository\Book\scripts\nlp_concepts\pos.py]:
1. # Import required packages
2. import nltk
3. import spacy
4. from textblob import TextBlob
5.
6.
7. # ======================================================================
8. # NLTK
9. print("*"*25)
10. print("Below example of POS using NLTK package")
11.
12. nltk.download('punkt')  # Download necessary data (if not already downloaded)
13.
14. text = "This is an example sentence for part-of-speech tagging."
15. words = nltk.word_tokenize(text)
16. tagged_words = nltk.pos_tag(words)
17.
18. print(tagged_words)
19.
20.
21. # ======================================================================
22. # Spacy
23. print("*"*25)
24. print("Below example of POS using Spacy package")
25.
26. nlp = spacy.load("en_core_web_sm")
27.
28. text = "This is an example sentence for part-of-speech tagging."
29. doc = nlp(text)
30.
31. for token in doc:
32. print(token.text, token.pos_)
33.
34.
35. # ======================================================================
36. # TextBlob
37. print("*"*25)
38. print("Below example of POS using TextBlob package")
39.
40. text = "This is an example sentence for part-of-speech tagging."
41. blob = TextBlob(text)
42.
43. for word, pos in blob.tags:
44. print(word, pos)
Bag of words
Now moving on to Bag of Words (BoW). Imagine if
sentences were bags full of random words, kind of like
naming each Lego piece in your bag and then counting how
many purple ones there are. That is essentially what BoW
does! It converts raw text into numerical vectors, where each unique word from the vocabulary is counted per document according to its frequency of occurrence.
BoW also has many limitations. It fails to capture the
intricate nuances of word order and contextual relationships
within a document, a pivotal aspect in deciphering textual
meaning. In response to this constraint, more sophisticated
methodologies such as word embeddings and transformers
have been devised to yield more intricate and contextually
astute text data representations.
The BoW approach used in NLP involves counting the word
occurrence in the given text. Each word will be counted
separately, which will reflect the appearance of each word
in the given corpus or corpora.
BoW vectors are typically very sparse because most
documents contain only a small subset of the words in the
vocabulary. This sparsity is handled efficiently in modern
NLP libraries.
Python packages:
NLTK
Gensim
Scikit-Learn
Code:
Put the following code in the file called bag_of_words.py
[E:\Repository\Book\scripts\nlp_concepts\bag_of_words.py]:
1. import nltk
2. from nltk.tokenize import word_tokenize
3. from nltk.probability import FreqDist
4. from gensim.corpora import Dictionary
5. from collections import defaultdict
6. from gensim.models import TfidfModel
7. from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
8.
9. # ======================================================================
10. # Download required data
11. nltk.download("punkt")
12.
13.
14. # ======================================================================
15. # NLTK
16. print("*" * 25)
17. print("Below example of Bag Of Words is using NLTK package")
18. text = (
19. "This is a sample document. Another document with some words. Repeating document with some words. A third "
20. "document for illustration. Repeating illustration."
21. )
22. words = word_tokenize(text)
23. fdist = FreqDist(words)
24.
25. fdist.pprint()
26.
27.
28. # ======================================================================
29. # Gensim
30. print("*" * 25)
31. print("Below example of Bag Of Words is using Gensim package")
32.
33. documents = [
34. "This is a sample document.",
35. "Another document with some words. Repeating document with some words.",
36. "A third document for illustration. Repeating illustration.",
37. ]
38.
39. tokenized_docs = [doc.split() for doc in documents]
40.
41. # Create a dictionary
42. dictionary = Dictionary(tokenized_docs)
43.
44. word_frequencies = dictionary.cfs
45.
46. # Display words and their frequencies
47. for word_id, frequency in word_frequencies.items():
48. word = dictionary[word_id] # Get the word corresponding to the word ID
49. print(f"ID: {word_id}, Word: {word}, Frequency: {frequency}")
50.
51. # Create a BoW representation
52. corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
53.
54. # Create a TF-IDF model based on the BoW representation
55. tfidf = TfidfModel(corpus, dictionary=dictionary)
56.
57. # Calculate overall TF-IDF scores for words
58. overall_tfidf = defaultdict(float)
59. for doc in tfidf[corpus]:
60. for word_id, tfidf_score in doc:
61. overall_tfidf[word_id] += tfidf_score
62.
63. # Display words and their overall TF-IDF scores
64. for word_id, tfidf_score in overall_tfidf.items():
65. word = dictionary[word_id] # Get the word corresponding to the word ID
66. print(f"Word: {word}, Overall TF-IDF Score: {tfidf_score:.4f}")
67.
68.
69. # ======================================================================
70. # Scikit Learn
71. print("*" * 25)
72. print("Below example of Bag Of Words is using Scikit-Learn package
Count Method")
73.
74. documents = [
75. "This is a sample document.",
76. "Another document with some words. Repeating document with some
words.",
77. "A third document for illustration. Repeating illustration.",
78. ]
79.
80. # Join the list of documents into a single string
81. corpus = " ".join(documents)
82.
83. vectorizer = CountVectorizer()
84. X = vectorizer.fit_transform([corpus])
85.
86. # Get the feature names (words)
87. feature_names = vectorizer.get_feature_names_out()
88.
89. # Get the word frequencies from the CountVectorizer's array
90. word_frequencies = X.toarray()[0]
91.
92. # Print words with their frequencies
93. for word, frequency in zip(feature_names, word_frequencies):
94. print(f"Word: {word}, Frequency: {frequency}")
95.
96.
97. # ======================================================================
98. # Scikit Learn with TFIDF
99. print("*" * 25)
100. print("Below example of Bag Of Words is using Scikit-Learn package
TFIDF Method")
101.
102. documents = [
103. "This is a sample document.",
104. "Another document with some words. Repeating document with some
words.",
105. "A third document for illustration. Repeating illustration.",
106. ]
107.
108. # Join the list of documents into a single string
109. corpus = " ".join(documents)
110.
111. tfidf_vectorizer = TfidfVectorizer()
112. X = tfidf_vectorizer.fit_transform([corpus])
113.
114. # Get the feature names (words)
115. feature_names = tfidf_vectorizer.get_feature_names_out()
116.
117. # Get the TF-IDF values from the TF-IDF vector
118. tfidf_values = X.toarray()[0]
119.
120. # Print words with their TF-IDF values
121. for word, tfidf in zip(feature_names, tfidf_values):
122. print(f"Word: {word}, TF-IDF: {tfidf:.4f}")
Word embeddings
Word embeddings are simple ways to map text to
continuous vector space, ensuring that semantic
relationships are transformed and duly understandable by
computers in the form of mathematical representations.
Unlike raw human language, which is complex in structure and formation, word embeddings give computers a numerical form that is much easier to process in ML-related tasks. This method places words like King and Queen a short distance apart and words like King and Apple far apart; thus, it helps capture the relationships and similarities between words (a short similarity check follows the code example below).
Contextual models such as BERT and GPT-3, as well as static embedding methods like Word2Vec, GloVe, and FastText, provide efficient ways to obtain embeddings; they differ in training approach and complexity depending on the features the user prioritizes.
Word embeddings can also understand the context behind
meanings. For example, the word “bank” could refer to a
place where you keep money or beside a river! It all
depends on how it is used in a sentence. These similarities
are essential for things like machine translations or even
search engine recommendations!
Is this not incredible? By utilizing effective NLP tasks and tools, we can uncover a deeper understanding of human conversation patterns across extensive linguistic functionalities. These methods can be implemented efficiently and easily, and they differ in their training approaches and capabilities.
Python packages:
There are many Python packages for this purpose. We can use available pre-trained models to get word embeddings, or we can create our own. The most widely used packages, both for pre-trained models and for creating your own word embeddings, are:
Gensim
spaCy
TensorFlow
Keras
HuggingFace
LangChain
LangChain provides a simple unified API to
generate word embeddings from different providers
that can be used in downstream NLP tasks.
Code:
Put the following code in the file called
word_embeddings.py
[E:\Repository\Book\scripts\nlp_concepts\word_embeddings.py]
Note: Here in the code, we have used pre-trained
models to create word embeddings.
1. from gensim.models import Word2Vec
2. import spacy
3. from transformers import DistilBertTokenizer, DistilBertModel
4.
5.
6. # ======================================================================
7. # Gensim
8. print("*"*25)
9. print("Below example of Word Embeddings using Gensim package")
10.
11. # Example sentences for training the model
12. sentences = [
13. "This is an example sentence for word embeddings.",
14. "Word embeddings capture semantic relationships.",
15. "Gensim is a popular library for word embeddings.",
16. ]
17.
18. # Tokenize the sentences
19. tokenized_sentences = [sentence.split() for sentence in sentences]
20.
21. # Train a Word2Vec model
22. model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, sg=0)
23.
24. # Access word vectors
25. word_vector = model.wv['word']
26. print(word_vector)
27.
28.
29. # ======================================================================
30. # Spacy
31. print("*"*25)
32. print("Below example of Word Embeddings using Spacy package")
33.
34. # Load the pre-trained English model
35. nlp = spacy.load("en_core_web_sm")
36.
37. # Process a text to get word embeddings
38. doc = nlp("This is an example sentence for word embeddings. Word embeddings capture semantic relationships. Gensim is a popular library for word embeddings.")
39. word_vector = doc[0].vector # Access the word vector
40. print(word_vector)
41.
42.
43. # ======================================================================
44. # Huggingface
45. print("*"*25)
46. print("Below example of Word Embeddings using Huggingface package")
47.
48. # Load the pre-trained DistilBERT tokenizer
49. tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
50.
51. # Tokenize a sentence
52. text = "Hugging Face's Transformers library is fantastic!"
53. tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
54.
55. # Load the pre-trained DistilBERT model
56. model = DistilBertModel.from_pretrained("distilbert-base-uncased")
57.
58. # Get word embeddings for the tokens
59. output = model(**tokens)
60.
61. # Access the embeddings for the first (and only) sequence in the batch
62. word_embeddings = output.last_hidden_state[0] # embeddings for every token, including [CLS]
63.
64. # Convert the tensor to a numpy array
65. word_embeddings = word_embeddings.detach().numpy()
66.
67. # Print the word embeddings
68. print(word_embeddings)
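To see the distance idea from the section above in action, here is a small, hedged sketch (not from the book's code bundle) that loads a pre-trained GloVe model via gensim's downloader; the model name is an assumption and is downloaded on first use:
# A minimal similarity sketch; "glove-wiki-gigaword-50" is an assumed, downloadable gensim model.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # pre-trained word vectors (downloads once)

# Related words sit closer together in the vector space than unrelated ones.
print("king vs queen:", vectors.similarity("king", "queen"))
print("king vs apple:", vectors.similarity("king", "apple"))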
Topic modeling
As the name implies, this technique aims to automatically
identify topics or core themes from the corpus. It is
especially helpful when we want to summarize corpus or
corpora, or we want to categorize them into specific groups.
The most common technique used for this process is Latent Dirichlet Allocation (LDA).
Python packages:
Gensim
Code:
Put the following code in the file called topic_modelling.py
[E:\Repository\Book\scripts\nlp_concepts\topic_modelling.py]:
1. import gensim
2. from gensim import corpora
3. from gensim.models import LdaModel
4. from gensim.parsing.preprocessing import remove_stopwords
5.
6. # Sample documents
7. documents = [
8. "Natural language processing is a fascinating field in AI.",
9. "Topic modeling helps uncover hidden themes in text data.",
10. "Latent Dirichlet Allocation (LDA) is a popular topic modeling
technique.",
11. "LDA assumes that documents are mixtures of topics.",
12. "Text mining and NLP are essential for extracting insights from text.",
13. "Machine learning plays a significant role in NLP tasks.",
14. ]
15.
16. # Preprocess the documents (stop word removal, lowercasing, and tokenization)
17. documents = [remove_stopwords(k) for k in documents]
18. documents = [doc.lower().split() for doc in documents]
19.
20. # Create a dictionary and a document-term matrix (DTM)
21. dictionary = corpora.Dictionary(documents)
22. corpus = [dictionary.doc2bow(doc) for doc in documents]
23.
24. # Build the LDA model
25. lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
26.
27. # Print the topics
28. for topic in lda_model.print_topics():
29. print(topic)
30.
31. # To summarize the output
32. """
33. (0, '0.062*"nlp" + 0.062*"text" + 0.037*"insights" + 0.037*"mining" + 0.037*"extracting" + 0.037*"essential" + 0.037*"text." + 0.037*"helps" + 0.037*"data." + 0.037*"themes"')
34. (1, '0.040*"modeling" + 0.040*"topic" + 0.040*"popular" + 0.040*"technique." + 0.040*"(lda)" + 0.040*"allocation" + 0.040*"dirichlet" + 0.040*"latent" + 0.040*"field" + 0.040*"natural"')
35.
36. Here we have got 2 topics, 0 and 1. Both contain words that are associated with the theme of the documents.
37. The words are arranged in order, from left (most associated) to right (least associated).
38. Based on the words, we can say that Topic 0 is about natural language processing.
39. Topic 1 is about the LDA method.
40. """
Sentiment analysis
Sentiment analysis finds the emotion in a piece of text,
labeling it as positive, negative, or neutral. It is primarily
used to gauge the sentiments of customers. For example,
Twitter tweets on specific subjects from different users can
be used to measure the sentiments of users related to the
specific subject.
Python packages:
TextBlob
HuggingFace
NLTK
Code:
Put the following code in the file called
sentiment_analysis.py
[E:\Repository\Book\scripts\nlp_concepts\sentiment_analysis.py]:
1. from textblob import TextBlob
2. from transformers import pipeline
3. import nltk
4. from nltk.sentiment.vader import SentimentIntensityAnalyzer
5.
6.
7. # ======================================================================
8. # TextBlob
9. print("*" * 25)
10. print("Below example of Sentiment using TextBlob package")
11.
12. # Sample text for sentiment analysis
13. text = "I love this product! It's amazing."
14.
15. # Create a TextBlob object
16. blob = TextBlob(text)
17.
18. # Perform sentiment analysis
19. sentiment = blob.sentiment
20.
21. # Print sentiment polarity and subjectivity
22. polarity = sentiment.polarity # Range from -1 (negative) to 1 (positive)
23. subjectivity = sentiment.subjectivity # Range from 0 (objective) to 1 (subjective)
24.
25. # Interpret sentiment
26. if polarity > 0:
27. sentiment_label = "positive"
28. elif polarity < 0:
29. sentiment_label = "negative"
30. else:
31. sentiment_label = "neutral"
32.
33. # Output results
34. print("Text:", text)
35. print("Sentiment Polarity:", polarity)
36. print("Sentiment Subjectivity:", subjectivity)
37. print("Sentiment Label:", sentiment_label)
38.
39.
40. # ======================================================================
41. # HuggingFace
42. print("*" * 25)
43. print("Below example of Sentiment using HuggingFace package")
44.
45. # Load a pre-trained sentiment analysis model
46. nlp = pipeline("sentiment-analysis")
47.
48. # Sample text for sentiment analysis
49. text = "I love this product! It's amazing."
50.
51. # Perform sentiment analysis
52. results = nlp(text)
53.
54. # Output results
55. for result in results:
56. label = result["label"]
57. score = result["score"]
58. print(f"Sentiment Label: {label}, Score: {score:.4f}")
59.
60.
61. # ======================================================================
62. # NLTK
63. print("*" * 25)
64. print("Below example of Sentiment using NLTK package")
65.
66. # Download the VADER lexicon (if not already downloaded)
67. nltk.download("vader_lexicon")
68.
69. # Initialize the VADER sentiment analyzer
70. analyzer = SentimentIntensityAnalyzer()
71.
72. # Sample text for sentiment analysis
73. text = "I love this product! It's amazing."
74.
75. # Perform sentiment analysis
76. sentiment = analyzer.polarity_scores(text)
77.
78. # Interpret sentiment
79. compound_score = sentiment["compound"]
80. if compound_score >= 0.05:
81. sentiment_label = "positive"
82. elif compound_score <= -0.05:
83. sentiment_label = "negative"
84. else:
85. sentiment_label = "neutral"
86.
87. # Output results
88. print("Text:", text)
89. print("Sentiment Score:", sentiment)
90. print("Sentiment Label:", sentiment_label)
Transfer learning
It is a technique where a model will be trained on one task
and later on can be adapted or fine-tuned for different but
related tasks. Instead of training the models from scratch,
we will use any existing pre-built models based on our
requirements. Again, while using the pre-built models based
on the requirement, we might use the model as is or can
fine tune it with our specific data.
Here, a pre-trained model stands for a model that has
already been trained on a large amount of data.
Fine tuning refers to the modification of a pre-trained model
for specific use cases.
By using transfer learning, we are using knowledge gained
by the model for specific use cases.
Some famous models, such as BERT, GPT, and RoBERTa, are pre-trained on large corpora and can be fine-tuned for various NLP tasks.
For example, a GPT model can be used on our own dataset to generate responses. Here, instead of creating the entire model from scratch, we take the help of transfer learning.
We can fine-tune the GPT model on our own dataset. As GPT has been trained on a humongous amount of data, it can generate answers to a wide range of questions. For the moment, suppose GPT has not been trained on a movie corpus; in that case, we would fine-tune the GPT model with movie data so that, whenever it is asked about a movie, it can answer accordingly.
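As a rough, hedged sketch of this idea (not part of the book's code bundle), the snippet below fine-tunes a pre-trained DistilBERT classifier on a tiny, made-up movie-review dataset using the Hugging Face Trainer; the texts, labels, output directory, and hyperparameters are illustrative assumptions only:
# A minimal transfer-learning sketch: start from pre-trained weights, then fine-tune briefly.
import torch
from transformers import (DistilBertTokenizer, DistilBertForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["A wonderful movie with great acting.", "A dull plot and weak characters."]
labels = [1, 0]  # hypothetical labels: 1 = positive review, 0 = negative review

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class MovieDataset(torch.utils.data.Dataset):
    # Wraps the tokenized texts and labels so the Trainer can iterate over them.
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Pre-trained model: the "knowledge gained" on general language
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Fine-tuning: only this short, task-specific training step uses our own data
args = TrainingArguments(output_dir="./results", num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=MovieDataset(encodings, labels))
trainer.train()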
Text classification
In text classification, we classify the text into required
categories. We can call it text categorization or document
classification as well. For document classification, we have
seen one of the examples called “Topic Modelling” above.
The text classification can be sentiments like positive or
negative. It can be spam or not spam, it can be like the
language of the text, and many more.
General text classification involves the following pipeline:
Data gathering with labels
Here, labels are the categories in which text will be
classified
Text preprocessing:
Lowercasing
Tokenization
Stop word removal
Stemming or lemmatization
Removing hashtags, URL links
Bag of words or word embeddings creation that is,
converting data to numerical features
Model selection
Train-test-validation split of the data
Evaluation of the model and, if required, hyperparameter tuning and re-training of the model to get improved accuracy
Finalizing the model for future text classification or prediction
Python packages:
Scikit-Learn
NLTK
Code:
Put the following code in the file called
text_classification.py
[E:\Repository\Book\scripts\nlp_concepts\text_classification.py]
Note: The text pre-processing steps are used to improve model performance, though they are not mandatory. Here, in the Hugging Face example, we are using a pre-built model for text classification; we can call it transfer learning as well. The code provided here is a very basic one, and based on the requirement, it can vary; we might need to add or remove steps in the text classification pipeline.
1. from sklearn.feature_extraction.text import TfidfVectorizer
2. from sklearn.model_selection import train_test_split
3. from sklearn.naive_bayes import MultinomialNB
4. from sklearn.metrics import accuracy_score, classification_report
5.
6. import nltk
7. from nltk.corpus import movie_reviews
8. from nltk.classify import SklearnClassifier
9. import random # Import the random module
10.
11. from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
12. import torch
13.
14. # ======================================================================
15. # Scikit-Learn
16. print("*"*25)
17. print("Below example of Text Analysis using Sklearn package")
18. # Sample text data and labels
19. texts = ["This is a positive sentence.", "This is a negative sentence.", "A neutral statement here."]
21.
22. # Text preprocessing and feature extraction
23. vectorizer = TfidfVectorizer()
24. X = vectorizer.fit_transform(texts)
25.
26. # Split data into training and testing sets
27. X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
28.
29. # Train a classifier (e.g., Naive Bayes)
30. classifier = MultinomialNB()
31. classifier.fit(X_train, y_train)
32.
33. # Make predictions on the test data
34. y_pred = classifier.predict(X_test)
35.
36. # Evaluate the classifier
37. accuracy = accuracy_score(y_test, y_pred)
38. report = classification_report(y_test, y_pred)
39.
40. print(f"Accuracy: {accuracy:.2f}")
41. print(report)
42.
43.
44. # ======================================================================
45. # NLTK
46. print("*"*25)
47. print("Below example of Text Analysis using NLTK package")
48.
49. # Load the movie reviews dataset
50. nltk.download('movie_reviews')  # Download necessary data (if not already downloaded)
51. documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]
52.
53. # Shuffle the documents
54. random.shuffle(documents)
55.
56. # Text preprocessing and feature extraction
57. all_words = [w.lower() for w in movie_reviews.words()]
58. all_words = nltk.FreqDist(all_words)
59. word_features = list(all_words.keys())[:3000]
60.
61. def find_features(document):
62. words = set(document)
63. features = {}
64. for w in word_features:
65. features[w] = (w in words)
66. return features
67.
68. feature_sets = [(find_features(rev), category) for (rev, category) in documents]
69.
70. # Split data into training and testing sets
71. training_set = feature_sets[:1900]
72. testing_set = feature_sets[1900:]
73.
74. # Train a classifier (e.g., Naive Bayes)
75. classifier = SklearnClassifier(MultinomialNB())
76. classifier.train(training_set)
77.
78. # Evaluate the classifier
79. accuracy = nltk.classify.accuracy(classifier, testing_set)
80. print(f"Accuracy: {accuracy:.2f}")
81.
82.
83. # ======================================================================
84. # Hugging Face
85. print("*"*25)
86. print("Below example of Text Analysis using Hugging Face package")
87.
88. # Sample text data
89. texts = ["This is a positive sentence.", "This is a negative sentence.", "A neutral statement here."]
90.
91. # Preprocess text and load pre-trained model (note: the classification head of this checkpoint is newly initialized, so its labels are only illustrative until the model is fine-tuned)
92. tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
93. model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
94.
95. # Tokenize and encode the text
96. inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
97.
98. # Perform text classification
99. outputs = model(**inputs)
100.
101. # Get predicted labels and probabilities
102. logits = outputs.logits
103. predicted_labels = torch.argmax(logits, dim=1)
104.
105. # Map predicted labels to human-readable class names (illustrative only: the default checkpoint has two labels, so this three-class mapping is arbitrary)
106. class_names = ['positive', 'negative', 'neutral']
107.
108. for i, text in enumerate(texts):
109. print(f"Text: {text}")
110. print(f"Predicted Label: {class_names[predicted_labels[i]]}")
111. print("")
112.
113. # You can also extract the probability scores for each class if needed
114. class_probabilities = torch.softmax(logits, dim=1)
115.
Prompt engineering
Prompt engineering is about providing prompts or
instructions to the LLMs to get the required answer in the
required form. It is widely used with LLMs like GPT, BERT, PaLM, LLaMA, etc. A prompt can be as simple as "Tell me about animals", or a more detailed prompt or instruction can be:
"Tell me about animals containing details on their body structure and food".
Hallucination
Hallucination refers to a phenomenon where the model
generates text that includes information or details that are
not accurate or factual. It occurs when the model produces
content that is imaginative or incorrect, often in a way that
is convincing or coherent but detached from reality.
Syntactic relationship
It describes the grammatical connections between words in
a sentence or text. Grammatical connections like these are mostly what POS tagging deals with.
Semantic relationship
It refers to meaning-based associations between words or phrases. For example: I am at the bank to draw money. Here, the word bank refers to a financial institution.
Note: Here, you will see that LangChain mostly uses third-party providers to deliver certain facilities. Integrating LangChain helps us accomplish certain tasks with minimal code and easy implementation. Hugging Face also uses transfer learning for certain facilities, and it additionally provides the facility to create models of our own.
Conclusion
In concluding this NLP chapter, we have covered a
comprehensive overview of Natural Language Processing,
delving into essential concepts and practical methodologies.
We explored the intricacies of text preprocessing, a crucial
step in preparing textual data for analysis. The general NLP
pipeline provided a structured approach to building
prediction models, demonstrating the sequential application
of techniques like tokenization, stemming, and part-of-
speech tagging.
As we move forward, the next chapter will delve into advanced NLP techniques, especially LLMs, bridging theoretical knowledge with hands-on applications. In that chapter, we will talk in more detail about LLM and neural network concepts and terminologies and get hands-on experience with them.
Introduction
Large Language Models (LLMs) are considered to be a
core component of Natural Language Processing (NLP)
and Natural Language Generation (NLG). In the earlier chapter, we got an overview of LLMs. In this chapter, we will dig deeper and acquaint ourselves with the different LLM concepts and LLM models that are in use. Overall, in this chapter, we will move one step ahead in the journey that we started with this book.
Structure
We will cover the following sections in this chapter:
History
LLM use cases
LLM terminologies
Neural networks
Transformers
Pre-built transformers
Objectives
By the end of this chapter, you will acquire a robust
understanding of language modeling and its various
concepts. Furthermore, you will gain comprehensive insights
into transformers, a widely utilized framework for defining
LLMs. This chapter aims to provide clarity on the
terminology, concepts, and architecture associated with
transformers, acknowledging their prevalent use in
contemporary natural language processing applications.
History
The evolution of large language models has transpired
through a progressive continuum, witnessing pivotal strides
in recent times. Refer to Figure 5.1. Following is a brief
history of the evolution of LLMs:
Early NLP Systems (1950s-2000s): The field of NLP started in the 1950s with the development of rule-based
systems. These systems relied on handcrafted linguistic
rules to process and understand language. However,
they were limited in handling the complexity and
variability of natural language. In 1952, the Hodgkin-
Huxley model showed how the brain uses neurons to
form an electrical network. These events helped inspire
the idea of Artificial Intelligence (AI), NLP, and the
evolution of computers.
Statistical NLP (1990s-2010s): As computational
power increased, statistical approaches gained
prominence. Researchers started using probabilistic
models and machine learning algorithms to analyze
large datasets of text. Hidden Markov Models
(HMMs) and probabilistic context-free grammar were
among the early techniques.
Machine Learning and Neural Networks (2010s): Neural networks, which are the core element of deep learning, have played a critical role in the enhancement of NLP capabilities. Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are the most broadly used neural networks within the field. Apart from these, word embeddings such as Word2Vec and GloVe became popular during this time.
LLM terminologies
Understanding LLMs involves familiarizing oneself with
various terminologies associated with these sophisticated
models. Here are the key terminologies related to LLMs:
Pre-training:
Definition: The initial phase where the model is
trained on a huge amount of corpus using
unsupervised learning.
Example: During pre-training, a language model
learns to predict the next word in a sentence or fill in
masked words.
P tuning (Prompt tuning):
Definition: Prompt-tuning is an efficient, low-cost
way of adapting an AI foundation model to new
downstream tasks without retraining the model and
updating its weights.
Example: P-tuning can be used to improve pre-
trained language models for various tasks, including
sentence classification and predicting a countryʼs
capital.
Fine-tuning:
Definition: The subsequent phase where the pre-
trained model is further trained on specific
downstream tasks with smaller/ medium sized
datasets.
Example: A pre-trained language model, initially
trained on general language understanding, can be
fine-tuned for different purposes like predicting
sentiment, topic modelling, etc. Consider that you
will use the OpenAI model on your own custom data
that OpenAI has never seen to provide answers to
questions. This is a kind of fine tuning.
Transformer architecture:
Definition: A neural network architecture
introduced by Vaswani et al., known for its self-
attention mechanisms.
Example: BERT and GPT are both based on the
transformer architecture.
Attention mechanism:
Definition: A mechanism allowing the model to
focus on different parts of the input sentence
sequence while making predictions.
Example: Imagine if you are trying to translate a
sentence. The attention mechanism will help your
computer focus on the most relevant words in one
language while it tries to come up with words in
another language.
Self-attention:
Definition: Self-attention is also an attention
mechanism, but here, every word checks out all
other words in its own sentence before deciding how
important they are.
Example: In the sentence “The cat sat on the mat because it was soft,” self-attention lets the model look at every other word in the sentence while processing “it,” so it can give a high weight to “mat” and work out what “it” refers to.
Masked language modeling (MLM):
Definition: MLM is a kind of fun game that language models play to learn about words. In this task, a
word in a sentence is hidden, like this: “The cat is on
the [MASK].” The model then tries to guess the
hidden word using the other words in the sentence
as clues.
Example: For example, a good guess for the hidden
word might be “mat". Many LLMs, like BERT, do this
to learn about the special ways humans use
language.
Prompt engineering:
Definition: The practice of designing effective
queries or prompts to interact with language models,
especially during instruction tuning for specific
tasks.
Example: Designing a specific prompt for a
language model to generate creative responses or
answers to user queries.
Zero-shot learning:
Definition: The ability of a large language model to
predict a task for which it was not explicitly trained.
It is the scenario where a model makes predictions
on classes it has never seen during training.
Example: If a model has learned about lots of
different topics, it might be able to answer questions
about a new topic, even if it has not been specifically
taught about it. This is like learning about cats and
dogs and then being able to guess what a fox is, even
if you have never seen one before.
Prompting bias:
Definition: The event where the output of a
language model is influenced by the wording or
phrasing of the input prompt by the user.
Example: The choice of words in a prompt might
lead the model to generate biased or skewed
responses. Nowadays, we see a good number of
researchers trying to jailbreak ChatGPT, Gemini, and
other LLM tools by smartly crafting prompts.
Transfer learning:
Definition: A machine learning technique where
knowledge gained from one task (pre-training) is
applied to improve performance on a different but
related task (fine-tuning).
Example: Pre-training a language model on general
language tasks and then fine-tuning it for a specific
task, like figuring out the sentiment in a text.
Parameter size and scaling:
Definition: Refers to the number of parameters in
the model. Larger models with more parameters
tend to perform better.
Parameters are the internal variables the model
adjusts during training to learn patterns and
information from the input data. It includes
weights and biases of Neural Networkʼs
connections.
Example: OpenAI GPT-3.5, with 175 billion
parameters, outperforms smaller language models in
various natural language processing tasks.
Generative Pre-trained Transformer (GPT):
Definition: GPT is a type of LLM that is pre-trained
on a huge dataset of text and code. GPT models are
able to generate human-like quality text, translate
languages, write creative content on various topics,
and answer your questions in an informative, human
way.
Example: GPT-3.5 is known for its remarkable
language generation capabilities, surpassing
previous versions in terms of size and performance.
Evaluation metrics:
Definition: Metrics used to assess the performance
of LLMs on specific tasks.
For tasks like classification, NER, and sentiment
analysis:
Accuracy
Precision
Recall
F1 score
For tasks like text generation, machine
translation:
Recall-Oriented Understudy for Gisting
Evaluation (ROUGE): Measures overlap
between generated and reference summaries.
Bilingual Evaluation Understudy (BLEU):
It measures the similarity between the
machine generated text and human written
reference text.
Metric for Evaluation of Translation with Explicit Ordering (METEOR): It looks at word-by-word precision and recall, considering things like synonyms and words that share the same root.
Consensus-based Image Description
Evaluation (CIDEr): Initially made for image
captioning, CIDEr is now also used for
machine translation. It looks at multiple
correct translations and tries to capture the
variety of possible translations.
Translation Edit Rate (TER): TER measures
the number of edits required to change the
generated translation into one of the
reference text translations. It provides a more
fine-grained view of the differences between
the generated and reference texts.
Word Error Rate (WER): WER measures the
percentage of words that are different
between the generated translation and the
reference translation. It is often used in
automatic speech recognition but can also be
used for text translation.
Embedding-based metrics compare the
semantic similarity between the machine-
generated text and reference text using pre-
trained LLMs.
Language model generalisation:
Perplexity: Measures how well the model
predicts a sample or sequence of tokens.
Lower perplexity indicates better
generalization.
Inference:
Definition: The process of using a trained large
language model to make predictions or text
generation for new input data.
Example: After training, the model can infer or generate coherent responses that make sense for the userʼs questions; inference is also how we check how the model performs on a held-out test set.
Embedding:
Definition: Embedding means turning words or
tokens into dense vectors, trying to represent them
as points within continuous vector space similar to
grouping similar things together.
Example: Word embeddings capture semantic
similarity or relationships, such as “king” being
closer to “queen” than “dog".
Vocabulary size:
Definition: Vocabulary size is defined as the total
number of unique words or tokens in the modelʼs
vocabulary.
Example: A model with a vocabulary of 50,000
tokens can accurately understand and create diverse
text, including rare and specialized words, compared
to a model with only 10,000 tokens.
Tokenization:
Definition: The process of breaking text into smaller
chunks, usually words or sub words.
Example: Tokenization of the sentence “I love Data science” results in ["I", "love", "Data", "science"].
Subword tokenization:
Definition: Tokenization at the subword level,
allowing the model to handle rare or out-of-
vocabulary words.
Example: “Unsupervised” may be tokenized into ["Un", "super", "vised"] (see the short sketch after this list).
Inference time:
Definition: The time it takes for the model to make
predictions on new input data.
Example: Faster inference times enable quicker
response in real-time applications.
Attention head:
Definition: In multi-head attention mechanisms,
each head independently focuses on different parts
of the input sequence.
Example: Different attention heads might
emphasize different words in a sentence.
Transformer block:
Definition: A single layer of the transformer
architecture with self-attention, feed-forward
networks, and layer normalization.
Example: A transformer block processes input
tokens through attention mechanisms.
Warm-up steps:
Definition: A period in training where the learning
rate gradually increases to stabilize the model.
Example: Gradual learning rate warm-up helps
prevent abrupt changes during early training steps.
Gated Recurrent Unit (GRU):
Definition: A simpler variant of LSTM, also
designed for capturing long-range dependencies.
Example: GRUs are computationally efficient and
widely used in NLP tasks.
Dropout:
Definition: A regularization technique where
random neurons are omitted during training.
Example: Dropout prevents overfitting by randomly
excluding neurons in each training iteration.
Epoch:
Definition: One complete pass through the entire
training dataset during model training.
Example: Training a model for five epochs means
going through the entire dataset five times.
Beam search:
Definition: A search algorithm used during
sequence generation to explore multiple possible
output sequences.
Example: Beam search helps generate diverse and
contextually relevant text.
Parameter fine-tuning:
Definition: Adjusting hyperparameters or model
parameters after initial training for better task-
specific performance.
Example: Fine-tuning learning rates improves model
convergence on specific tasks.
Adversarial training:
Definition: Training the model on adversarial
examples to improve robustness.
Example: Adversarial training involves exposing the
model to deliberately challenging inputs for better
generalization.
Mini-batch:
Definition: A small subset of the training data used
for each iteration during training.
Example: Instead of updating the model after every
example, training is often done in mini-batches for
efficiency.
Gradient descent:
Definition: An optimization method that changes
model parameters by moving in the direction that
reduces the loss function the most quickly.
Example: Gradient descent is used to find the
minimum of the loss function during training and
saves a good amount of training time while
hyperparameters are getting tuned.
Backpropagation:
Definition: A technique to compute gradients and
update model parameters by sending errors
backward through the network.
Example: Backpropagation is crucial for efficiently
training neural networks.
Overfitting:
Definition: When a model does well on training data
but cannot perform well on new, unseen data.
Example: A model memorizing specific examples
rather than learning general patterns may exhibit
overfitting.
Underfitting:
Definition: When a model is too simple to capture
the underlying patterns in the data.
Example: A linear model may underfit a complex,
nonlinear dataset.
Regularization:
Definition: Techniques to prevent overfitting by
adding constraints to the model during training.
Example: L2 regularization penalizes large weights
in the model.
Early stopping:
Definition: Stopping the training process once a
certain criterion (e.g., validation loss) stops
improving.
Example: Training stops if the validation loss has
not improved for several consecutive epochs.
Beam width:
Definition: The number of alternative sequences
considered during decoding in sequence generation
tasks.
Example: A beam width of 5 means the model
explores the top 5 likely sequences.
Hyperparameter:
Definition: Configurable settings external to the
model that influence its training and performance.
Example: Learning rate, batch size, and the number
of layers are a few hyperparameters.
Activation function:
Definition: A mathematical operation applied to the
output of a neuron to introduce nonlinearity.
Example: Rectified Linear Unit (ReLU) is a
popular activation function in neural networks.
Cross-entropy loss:
Definition: A loss function commonly used in
classification tasks that measure the difference
between predicted and actual probability
distributions.
Example: Cross-entropy loss is suitable for tasks
like sentiment analysis.
Adversarial examples:
Definition: Inputs specifically crafted to mislead the
model during training or inference.
Example: Modifying an image slightly to cause a
misclassification by the model.
Self-supervised learning:
Definition: A learning paradigm where the model
generates labels from the input data itself.
Example: Training a language model to predict
missing words in a sentence.
Multimodal model:
Definition: A model that processes and generates
information from multiple modalities, such as text
and images. LLM models like GPT 4o or Gemini are
examples of it.
Example: A model generating captions for images
and videos.
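To make the tokenization, subword tokenization, and vocabulary size entries above concrete, here is a small, hedged sketch (not from the book's code bundle) using the transformers package; the bert-base-uncased checkpoint is assumed to be available or downloadable:
# A minimal sketch of tokenization vs. subword tokenization with a WordPiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words stay as single tokens (BERT lowercases them)...
print(tokenizer.tokenize("I love Data science"))
# ...while rarer words are split into subword pieces ('##' marks a continuation).
print(tokenizer.tokenize("Unsupervised tokenization"))

# Vocabulary size: the number of unique tokens the model knows.
print("Vocabulary size:", tokenizer.vocab_size)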
Neural networks
Without Neural Networks (NN), there would be no deep
learning. They are the core component that makes this
possible. We have different types of neural networks, such as
RNN, CNN, and LSTM, which all serve different purposes.
Imagine a neural network like a super smart brain made by
computers. It does things like spotting patterns, grouping
things into categories, and other tasks in machine learning.
Here are some parts of it that you will often hear about:
Neurons (Nodes): The basic units of a neural
network, such as brain cells or nodes, are formed in
layers. These units or neurons will be interconnected
with each other in the different layers. These neurons
will understand and process the data, and finally, they
will provide the output.
Layers:
Input layer: The input layer receives data, which
will be processed and transferred to the further
layers.
Hidden layers: Layers between the input and
output layers where complex transformations
happen. Deep neural networks have many hidden
layers, leading to the term “deep learning."
Output layer: The last layer that gives the
networkʼs output. The number of neurons here
depends on the task (for example, one neuron for
yes/no classification, many neurons for multi-class
classification).
Weights: Weights set the strength of the connections between neurons; they shift during training iterations, and this is what lets the network make accurate predictions later.
Bias: Each neuron also has a bias, which shifts its output before the activation function is applied and gives the network extra flexibility to learn effectively.
Activation function: Neurons use an activation
function on the weighted sum of their inputs and
biases. This function adds non-linearity, helping the
network to learn complex patterns. Common activation
functions are sigmoid, tanh, and ReLU.
Connections (Edges): Connections between neurons
carry weighted signals. Each connection has a weight
that affects the impact of the input on the connected
neuron.
Loss function: A loss function measures the difference
between the predicted output and the actual target.
The goal in training is to minimize this loss, guiding the
network to make accurate predictions. Common loss
functions include Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), Mean Absolute
Error (MAE), and Huber Loss.
Optimizer: An optimization algorithm adjusts the
networkʼs weights and biases during training to
minimize loss. Popular optimizers include stochastic
gradient descent (SGD), Adam, and RMSprop.
Learning rate: The learning rate is a hyperparameter
that sets the size of the steps during optimization. It
affects how fast and stable the training process is.
Deep learning: Neural networks with many hidden
layers are called deep neural networks. Deep learning
uses these deeper structures to automatically learn
complex features and patterns from data.
Refer to the following figure:
Figure 5.3: Neural network simple architecture
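As a rough sketch of how these parts fit together (not part of the book's code bundle), the following minimal PyTorch network wires up an input layer, a hidden layer, an activation function, a loss function, and an optimizer; the layer sizes, learning rate, and random data are illustrative assumptions:
# A minimal neural-network sketch in PyTorch; sizes, data, and hyperparameters are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),  # input layer -> hidden layer (weights and biases live here)
    nn.ReLU(),        # activation function adds non-linearity
    nn.Linear(8, 1),  # hidden layer -> output layer (one neuron, e.g. a yes/no score)
)

loss_fn = nn.MSELoss()                                    # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer with a learning rate

x = torch.randn(16, 4)  # a mini-batch of 16 examples with 4 features each
y = torch.randn(16, 1)  # target values

for epoch in range(5):  # one epoch = one pass over this (tiny) dataset
    optimizer.zero_grad()
    prediction = model(x)
    loss = loss_fn(prediction, y)
    loss.backward()     # backpropagation computes the gradients
    optimizer.step()    # gradient descent updates the weights and biases
    print(f"Epoch {epoch + 1}, loss: {loss.item():.4f}")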
LSTM network:
Description: A type of RNN with specialized
architecture to overcome the vanishing gradient
problem. Effective in capturing long-term
dependencies in sequential data.
Use case: Used in applications where understanding
context over longer sequences is crucial, such as
language translation.
Generative Adversarial Network (GAN):
Description: Comprises a generator and a
discriminator network. The generator creates data,
and the discriminator evaluates how realistic it is.
They improve each other in a competitive manner.
Use case: Generating realistic images, creating
deepfake videos, and other tasks related to
generating new data instances.
Self-Organizing Map (SOM):
Description: An unsupervised learning model that
makes a low-dimensional map of input data,
grouping similar instances together.
Use case: Clustering and visualization of high-
dimensional data.
Radial Basis Function (RBF) Network:
Description: Uses radial basis functions as
activation functions. Itʼs often used for interpolation,
approximation, and pattern recognition.
Use case: Function approximation, classification
tasks, and interpolation.
Transformer:
Description: These types of neural networks are
designed for natural language processing and
natural language generation tasks.
Use case: Language translation, text summarization,
and various natural language understanding tasks.
These types of neural networks are designed for different
data structures and problem domains, showing the versatility
of neural network architectures.
In summary, a neural network is a mathematical model that
learns from data by adjusting its internal parameters. Its
ability to automatically learn and generalize makes it a
powerful tool in various areas of artificial intelligence and
machine learning.
As evident from the prevalent landscape of NLP,
transformers serve as the foundational framework for a
multitude of tasks, particularly in the realm of LLMs.
Recognizing their central role, our next focus will be a
detailed exploration of transformers. This study aims to
delve into their core concepts, functionalities, and
applications, providing a comprehensive understanding of
their significance in the field of NLP and LLMs.
Transformers
Within the context of large language models, a transformer
is the underlying architecture or framework that enables the
model to process and understand language. The transformer
lets the model analyze relationships between phrases, take
into account the context of a sentence and generate
coherent and contextually relevant text. In essence, it is the technological spine that empowers LLMs to perform advanced tasks like answering questions, completing sentences, or even producing creative textual content based on the patterns it has learned during training.
Transformers were developed to solve the problem of
sequence transduction, which means transforming one
sequence of data into another. In the context of NLP, it can
include machine translation from one language to another
language, such as Google Translate, Text Summarization,
Speech To Text, etc.
In many NLP tasks, transformers have replaced CNN and RNN
architectures. One reason is that transformers can be pre-trained
on unlabeled text with self-supervised objectives, which reduces
the cost and time of preparing labeled data. They also process
tokens in parallel rather than one at a time, so the models can
run much faster.
Components with step-by-step process:
Please refer to Figure 5.7, where the general
transformer architecture diagram has been shown.
Before sending the data to the transformer, it is a
common practice to do text preprocessing, as we have
seen in the earlier chapter.
The input sequence is a series of words or tokens.
In the embedding layer, the words or tokens from the
input sequence are converted into vector embeddings.
Positional encoding: because transformers process the
input in parallel and have no built-in notion of
sequential order, positional information is added to the
vector embeddings to indicate the position of each word
in the sequence.
Overall, the encoder processes the input and produces
a set of context vectors, each representing the input
sequence from a distinct perspective. It has the
following different components:
The self-attention mechanism enhances the
information content of an input embedding by
including information about the inputʼs context. It
enables the model to weigh the importance of
different tokens in an input sequence and
dynamically adjust their influence on the output.
Feed-forward neural networks work alongside the
self-attention mechanism to refine the
representation of the input sequence. They enable the
model to capture complex and contextual
relationships.
Layer normalization, to normalize the activations
within a layer, helping stabilize training and improve
the modelʼs generalization.
Residual connection, also known as a skip
connection, helps to address the vanishing
gradient problem and facilitates the training of deep
neural networks.
Intermediate representations to capture complex
relationships and context. It captures hierarchical
and abstract information at different layers,
facilitating information flow and feature extraction
and enhancing the modelʼs ability to understand and
process input sequences for various natural
language processing tasks.
The decoder takes the intermediate representations
from the encoder and generates the output sequence
step by step. It includes the same layers as the encoder
above, but instead of intermediate representations it
will have:
Output sequence: the text is generated one token at a
time, with each token influenced by the preceding
tokens in the sequence.
Fine-tuning: this part is optional, but fine-tuning
allows the model to be adapted for specific tasks.
Figure 5.7: General Transformer Architecture 3
Weighted sum:
The attention scores are used to compute a
weighted sum of the word embeddings of all words
in the sentence.
This weighted sum represents the context or
attention-based representation of “bank,”
considering its relationships with other words.
Feedforward network:
The context vector is passed through a
feedforward neural network to capture non-linear
relationships and interactions.
Output sequence:
Based on the NLP task, the model generates the
output sequence. For example, if we are converting
the sentence to French, it is a machine translation
task, and the output will look as follows:
Input: “I am at the bank to deposit the money”
Target: “Je suis à la banque pour déposer
lʼargent”
This process repeats for each word in the target
sequence, allowing the self-attention mechanism in the
decoder to dynamically capture the relevant context for
generating each word based on its relationships with
other words in the target sequence. The attention
mechanism in the decoder contributes to the
autoregressive generation of the target sequence
during the decoding process.
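As a rough numeric illustration of the weighted sum described above, the following toy sketch (our own example, using random values rather than real word embeddings) computes scaled dot-product attention for a short sequence with PyTorch:

import math
import torch

torch.manual_seed(0)
d_model = 8                       # embedding size (toy value)
tokens = 5                        # e.g., a five-token sentence
x = torch.randn(tokens, d_model)  # one embedding vector per token

# In a real transformer, Q, K and V come from learned linear projections of x.
q, k, v = x, x, x                 # simplified: use the embeddings directly

scores = q @ k.T / math.sqrt(d_model)    # similarity of every token with every other token
weights = torch.softmax(scores, dim=-1)  # attention weights; each row sums to 1
context = weights @ v                    # weighted sum of value vectors

print(weights[-1])   # how much the last token attends to each token
print(context[-1])   # its context-aware (attention-based) representation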
Note: Here, we have shown basic transformer
implementation with the PyTorch package, but we can
utilize other packages like TensorFlow as well for the
same. Also, based on the requirement the architecture
or components of the transformer model will vary. In
some cases, it might use an encoder only, and in some
cases, both encoder and decoder will be used. Also,
the example above shows the usage of a transformer
for machine translation tasks, but we can utilize it for
other tasks as well, like text generation.
Different hyper parameters and their usage and meaning are
as follows:
Number of layers (num_layers):
Explanation: The number of layers in a neural
network or transformer model, representing the
depth of the network. Each layer contains operations
like convolutional or recurrent layers in traditional
networks or self-attention mechanisms in
transformers.
Example: num_layers=6
Usage: Adjust based on the complexity of the task;
deeper networks might capture more intricate
patterns.
Hidden size (hidden_size):
Explanation: Hidden size refers to the
dimensionality of the hidden layers in a neural
network or transformer model. It determines the
number of neurons or units in each hidden layer.
Example: hidden_size=512
Usage: Higher values allow the model to capture
more complex relationships but require more
computational resources.
Number of attention heads (num_heads):
Explanation: Number of attention heads in multi-
head attention. It allows the model to focus on
different parts of the input sequence simultaneously.
Increasing the number of attention heads enhances
the modelʼs ability to capture diverse relationships
and patterns in the data.
Example: num_heads=8
Usage: A balance between computational efficiency
and model expressiveness; commonly used values
are 8 or 12.
Feedforward dimension (ffn_dim):
Explanation: The feedforward dimension (ffn_dim) is
the size of the inner, hidden layer of the feed-forward
network inside each transformer block. This sub-layer
comes after the self-attention mechanism within the
block, and its main task is to capture complex,
non-linear relationships in the data that attention
alone may not model.
Example: ffn_dim=2048
Usage: Adjust based on the complexity of the task;
larger values may capture more complex patterns.
Dropout rate (dropout):
Explanation: Dropout is a regularization technique
where, during training, randomly selected neurons
(units) are ignored, or “dropped out,” to prevent
overfitting. Refer to Figure 5.8.
Example: dropout=0.1
Usage: Prevents overfitting by randomly dropping
connections during training; typical values range
from 0.1 to 0.5.
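The following minimal sketch (an illustration of ours, not the book's full implementation) shows how these hyperparameters map onto the constructor of PyTorch's nn.Transformer; the input tensors are dummy values chosen only so the call runs:

import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # hidden_size: dimensionality of the embeddings
    nhead=8,                # num_heads: attention heads per layer
    num_encoder_layers=6,   # num_layers for the encoder stack
    num_decoder_layers=6,   # num_layers for the decoder stack
    dim_feedforward=2048,   # ffn_dim: inner size of the feed-forward sub-layer
    dropout=0.1,            # dropout rate applied inside each block
)

# Dummy source and target sequences: (sequence_length, batch_size, d_model)
src = torch.rand(10, 2, 512)
tgt = torch.rand(7, 2, 512)
out = model(src, tgt)
print(out.shape)  # torch.Size([7, 2, 512])

The same names appear, sometimes spelled slightly differently, in the configurations of the pre-built models discussed next.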
Pre-built transformers
In the above section, we discussed the basics of
transformers and saw how we could build everything from
scratch if we wanted to. In this section, we will
explore some of the famous and well-known pre-built
transformers, aka models, that we can utilize for different purposes.
DistilBERT
DistilBERT is a condensed version of BERT created
by the researchers at Hugging Face. It retains most of
BERT's performance while substantially reducing model size
and computational requirements, thereby improving speed as
well as memory efficiency.
The breakthrough that powers DistilBERT
is a concept termed knowledge distillation,
in which a smaller model is trained to copy the behavior
of a larger model during training.
Training approach:
DistilBERT is trained with knowledge
distillation, which compresses the behavior of BERT
into a smaller, more computation-friendly model.
Training data:
Similar to BERT, DistilBERT is trained on diverse
text data using the masked language model
objective.
Explanation:
DistilBERT balances computational efficiency and
performance by distilling knowledge from BERT.
Usage:
Resource-constrained environments where
computational efficiency is critical.
Quick prototyping and experimentation.
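As a quick, hedged illustration (not part of the book's code bundle), DistilBERT can be used through the Hugging Face transformers pipeline; the checkpoint below is a publicly available DistilBERT model fine-tuned for sentiment analysis:

from transformers import pipeline

# A small, fast DistilBERT checkpoint fine-tuned for sentiment analysis,
# which matches the resource-constrained use case described above.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Quick prototyping with DistilBERT is pleasantly fast."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]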
XLNet
XLNet, an innovative NLP model, was designed by
researchers at Google Brain and Carnegie Mellon
University. It successfully utilizes
Transformer architecture and introduces novel
methodologies to tackle shortcomings in prior models like
BERT.
Its unique contribution lies in permutation-based training
that enables it to obtain bidirectional context information,
preserving the benefits of an autoregressive language model
simultaneously. Unlike BERT, which uses a Masked
Language Modeling (MLM) objective during
pre-training, XLNet uses a Permutation Language
Modeling (PLM) objective: instead of randomly
masking tokens as in MLM, XLNet randomly selects spans
of text and predicts the tokens within those spans using
both the preceding and the following tokens. This allows
XLNet to capture bidirectional context better than BERT.
Training approach:
XLNet combines autoregressive language modelling
and autoencoding, leveraging the permutation
language modelling objective.
Training data:
Trained on a large text corpus using the
permutation language modelling objective.
Explanation:
XLNet captures bidirectional context and long-range
dependencies, making it effective for tasks requiring
a deep understanding of context.
Usage:
Tasks where considering both preceding and
succeeding context is crucial.
Improved context modelling for various NLP
applications.
RoBERTa
RoBERTa is a Robustly Optimized version of BERT, built by
applying further technical improvements to the original
BERT model; these changes improve the model's
performance. It was developed by the Facebook AI
Research lab.
Training approach:
RoBERTa optimizes BERT by modifying key
hyperparameters and training objectives, removing
the Next Sentence Prediction objective.
Training data:
Trained on a similar dataset as BERT, leveraging the
masked language model objective on diverse text
data.
Explanation:
RoBERTa enhances performance by optimizing
BERTʼs training objectives and hyperparameters.
Usage:
General NLP tasks like text classification, named
entity recognition, and sentiment analysis.
Fine-tuning for specific downstream tasks for
improved performance.
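As a hedged example of RoBERTa in action (again our own sketch, not the book's code), note that RoBERTa uses <mask> as its mask token rather than BERT's [MASK]:

from transformers import pipeline

# roberta-base was pre-trained with the masked language model objective,
# like BERT, but its mask token is <mask>.
fill_mask = pipeline("fill-mask", model="roberta-base")
for p in fill_mask("Sentiment analysis is a common <mask> task.")[:3]:
    print(p["token_str"], round(p["score"], 3))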
Conclusion
The chapter gave readers a foundational understanding of
advanced NLP techniques with a specific focus on LLMs. The
discussion began with elementary concepts behind LLMs,
followed by an exploration of their structural components
along with their respective methodologies intended for the
learning phase. It also reflected a discussion on how these
aspects continually transform currently operational NLP-
related functionalities.
The discussion centered on leading LLMs, including but not
limited to GPT, T5, BERT, XLNet, RoBERTa, and DistilBERT,
each carrying noteworthy advancements and
contributions to the field.
Additionally, diversified applications of LLMs across
industries, which include sentiment analysis, language
translations, and AI conversational agents, among many
others, are also put under the scanner with details of their
highly adaptable features, making them indispensable tools
for complex language processing issues in todayʼs data-
driven world.
Later chapters will dive deeper into complexities surrounding
the architecture-related aspects, more advanced models,
and fine-tuning techniques with real-world application uses.
The entire exercise aims to empower readers, equipping
them with the necessary knowledge and skills required for
efficient utilization of LLMs in finding solutions to real-world
scenarios, all while practicing ethical principles and
responsible AI development.
In the next chapter, we will introduce a Python package
called LangChain, designed specifically for developing
applications powered by LLMs. Its usage encompasses
reading data from multiple sources like PDFs, Word files,
databases, and AWS S3 buckets. It also covers storing
vector embeddings and provides the building blocks for
Retrieval Augmented Generation (RAG) solutions, which
combine the stored information with the language model's
abilities to generate better responses based on the
retrieved data.
Further readings
To get an overview of what a transformer implementation
looks like with all the different components we have
discussed so far, you can refer to the code available at the
following URLs:
Webpage links:
In case you would like more details on the PyTorch
transformer module, you can check the URLs below.
https://pytorch-tutorials-
preview.netlify.app/beginner/transformer_tutori
al.html
https://pytorch.org/tutorials/beginner/translatio
n_transformer.html#language-translation-with-
nn-transformer-and-torchtext
GitHub link on Transformer model:
https://github.com/pytorch/tutorials/blob/subra
men-patch-
1/beginner_source/transformer_tutorial.py
https://github.com/pytorch/tutorials/blob/main/
beginner_source/translation_transformer.py
Google Colab Notebook Link to practice and learn
transformer model:
https://colab.research.google.com/github/pytorc
h/tutorials/blob/gh-
pages/_downloads/9cf2d4ead514e661e20d2070c
9bf7324/transformer_tutorial.ipynb
https://colab.research.google.com/github/pytorc
h/tutorials/blob/gh-
pages/_downloads/c64c91cf87c13c0e83586b8e6
6e4d74e/translation_transformer.ipynb
References
https://www.dataversity.net/a-brief-history-of-
natural-language-processing-nlp/
https://research.ibm.com/blog/what-is-ai-prompt-
tuning
https://arxiv.org/pdf/2304.13712.pdf [Harnessing
the Power of LLMs in Practice: A Survey on ChatGPT
and Beyond]
https://arxiv.org/pdf/1706.03762.pdf [Attention Is
All You Need]
https://huggingface.co/docs/transformers/index
https://huggingface.co/learn/nlp-course
https://arxiv.org/abs/1706.03762
https://dotnettutorials.net/lesson/dropout-layer-
in-cnn/
1
Source: Harnessing the Power of LLMs in Practice: A
Survey on ChatGPT and Beyond authored by Jingfeng Yang
and Hongye Jin and Ruixiang Tang and Xiaotian Han and
Qizhang Feng and Haoming Jiang and Bing Yin and Xia Hu
[https://arxiv.org/pdf/2304.13712.pdf]
2
Source: https://www.researchgate.net/figure/Schematic-
diagram-of-a-basic-convolutional-neural-network-CNN-
architecture-26_fig1_336805909
3
Source: https://doi.org/10.48550/arXiv.1706.03762
Introduction
We have reviewed different concepts in the fields of NLP and
NLG until this chapter. Now, we will see the core part of the
book, which is working with Large Language Models
(LLMs) using two main Python packages: LangChain and
Hugging Face. In this chapter, we will review the LangChain
package and its different components, which will help us to
build an LLM-based application.
Structure
In this chapter, we will discuss the following topics:
LangChain overview
Installation and setup
Usages
Opensource LLM models usage
Data loaders
Opensource text embedding models usage
Vector stores
Model comparison
Evaluation
Objectives
The objective of the chapter is to familiarize ourselves with
the basic functionalities of LangChain to help us build LLM-powered
applications on custom data. This chapter introduces LangChain, an
open-source framework for building and evaluating LLM applications. It
aims to provide readers with a basic understanding of
LangChainʼs core functionalities, including data pipelines,
vector embeddings, evaluation tools, and chainable modules.
The chapter also explores the basic applications of
LangChain and provides a hands-on guide for getting
started. By the end, readers will be able to grasp the key
concepts of LangChain and gain initial skills in using it for
simple LLM tasks.
LangChain overview
LangChain is a framework that contains the entire ecosystem
to develop, test, validate, and deploy applications powered
by large language models. It was launched as an open-source
project by Harrison Chase in October 2022, while he was
working at the machine learning startup Robust Intelligence.
LangChain emerged from the need for an open-source
framework to streamline the development and deployment
of applications powered by LLMs. LLMs are powerful AI
models that can generate text, translate languages, write
different kinds of creative content, and answer your
questions in an informative way. However, building
applications on top of LLMs can be challenging. LangChain
offers tools and libraries that simplify this process.
This framework consists of several parts. They are:
LangChain libraries: Python and
JavaScript libraries designed for simplicity and ease of
use. They provide interfaces and integrations for a
myriad of components, make it easier to
connect different parts and run them together, and
include pre-built examples for convenience.
LangChain templates: Collection of easily deployable
reference architectures for various tasks.
LangServe: A library for deploying applications as a
REST API.
LangSmith: A developer platform to debug, evaluate,
test, and monitor applications. For this to work, you
will need an account. Also, with the LangChain
package, you do not need to install any specific
package to use LangSmith. At the time of writing this book,
LangSmith was in private beta, and our access to it was on
the waitlist. Hence, we are not able to provide an
example of how to connect to and use this module. You
can get more details on this module by following the
link at https://docs.smith.langchain.com/#quick-start
Please note that LangSmith is not required, but it becomes
helpful for inspecting what is happening inside the
application as it grows more and more complex.
With all the different components mentioned above, it will be
easy to complete the entire application life cycle, from
developing the application locally to deploying it and making
it production ready.
LangChain usually requires integrations with one or more
model providers, data stores, APIs, etc. LangChain does not
have its own LLMs, but it provides an interface to interact
with different LLM providers like OpenAI, Google Vertex AI,
Cohere, HuggingFace, etc.
In LangChain, there is a subtle difference between an LLM and
a Chat Model. In the context of LangChain, an LLM is essentially a
text completion model; for example, OpenAI's GPT-3 is
implemented as an LLM. With an LLM, the input is text, and
the output is text. Chat Models are backed by LLMs, but
they are tuned for conversations. Chat Models
take a list of chat messages as input, and these
messages are usually labeled with the speaker (typically one of
"System,” “AI,” and "Human"). For example, GPT-4 and
Anthropicʼs Claude-2 are both implemented as Chat Models
in the context of LangChain. However, there is a catch: both
LLM and Chat Model accept the same inputs. Hence, we can
swap them without breaking anything, and maybe we do not
need to know whether the model that we are calling is an
LLM or a Chat Model.
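To make this distinction concrete, here is a hedged sketch of using the same Hugging Face endpoint once as a plain LLM and once wrapped as a Chat Model; it assumes a valid HUGGINGFACEHUB_API_TOKEN is set in the environment, as in the scripts later in this chapter, and the repo_id is only an example:

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import SystemMessage, HumanMessage

# Plain LLM: text in, text out
llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-3-mini-4k-instruct",  # any hosted text-generation model
    temperature=0.5,
)
print(llm.invoke("Finish the sentence: LangChain is"))

# Chat Model: a list of labeled messages in, an AI message out
chat = ChatHuggingFace(llm=llm)
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What is a Chat Model in LangChain?"),
]
print(chat.invoke(messages).content)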
Usages
We will review codes from the initial setup to the advanced
level. As mentioned earlier, LangChain provides an interface
to connect with different LLM providers. You can get a list of
all the different components and their sample usage code,
whether it is LLMs, Chat Models, or text embedding models,
at the link below:
https://python.langchain.com/docs/integrations/comp
onents
For the purpose of this book, we are going to work with
Hugging Face as the LLM provider.
Figure 6.7 shows that when you click on the respective data
loader, let us say CSV, the resulting page will have sample
code that you can utilize to load CSV data. We will
see data loaders in action when we work with fine-tuning
or with generating vector embeddings to make an LLM work with
custom data.
Figure 6.7: CSV Data Loader Page Showing Sample Code
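Since we cannot reproduce the page itself here, the following is a hedged sketch of the typical CSV loader pattern; the file path is a placeholder:

from langchain.document_loaders import CSVLoader

# Each row of the CSV becomes one Document, with the row values in
# page_content and the file path plus row number in metadata.
loader = CSVLoader(file_path="path/to/your_data.csv")
documents = loader.load()

print(len(documents))
print(documents[0].page_content)
print(documents[0].metadata)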
b. Merits:
i. Multi-task champion: Handles various tasks like
semantic search, sentence similarity, clustering,
and question answering.
ii. Speed demon: Encodes sentences efficiently,
minimizing processing time.
iii. Size-conscious: Relatively small model
compared to other transformers, making it
resource-friendly.
iv. Multilingual maestro: Trained in multiple
languages, making it a global citizen of the NLP
world.
c. Demerits:
i. Data diet: Limited to text data, cannot handle
images, audio, or other modalities.
ii. Black box mystery: Understanding the modelʼs
inner workings can be tricky.
iii. Fine-tuning finesse: It may require further
training for specific tasks to unlock its full
potential.
2. DataikuNLP /paraphrase-MiniLM-L6-v2:
a. The DataikuNLP/paraphrase-MiniLM-L6-v2 model is a
sentence-transformers model built on the MiniLM
architecture. It maps sentences and paragraphs into a
384-dimensional vector space, thereby enabling
multifarious tasks, such as:
i. Clustering: Grouping similar sentences or
documents together.
ii. Semantic search: Identifying documents or
passages that are semantically related to a given
query.
iii. Paraphrasing: Generating alternative phrasings
that preserve the original sentence's meaning.
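As a hedged sketch of how such a sentence-transformers model can be used through LangChain (mirroring the embedding setup in the vector store script below), the 384-dimensional vectors can be obtained as follows:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="DataikuNLP/paraphrase-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

vector = embeddings.embed_query("Vector stores act like a librarian who understands meaning.")
print(len(vector))  # 384 dimensions, as described above

# embed_documents handles a batch of texts at once
vectors = embeddings.embed_documents(["first sentence", "second sentence"])
print(len(vectors), len(vectors[0]))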
Vector stores
Picture yourself faced with a huge library packed with books
on all sorts of subjects. It could take a long time to find a
particular book if you search simply using keywords,
especially if what you are truly after is the meaning inside
and not just the title on the front. This is where vector stores
come in.
Think of a vector store as a sophisticated librarian who
understands the meaning of each book. Instead of searching
for keywords, you can describe what you are looking for, and
the vector store will identify the most relevant books based
on their content and meaning.
Vector stores are a crucial component of LangChainʼs
functionality. They play a key role in storing, managing, and
searching high-dimensional vectors, which are used for
various tasks such as:
Information retrieval: Matching documents based on
their semantic similarity.
Recommendation systems: Recommending relevant
items or content to users; Netflix, for example, runs one
of the most robust recommender systems in production.
Question answering: Answering questions on topics
such as general knowledge, history, mathematics, and
science.
Machine translation: Translating from one language to
another. LLMs can support hundreds of languages and
can handle translation with ease and high accuracy.
LangChain supports integration with various vector stores. To
get the full list of supported vector stores, visit the URL:
https://python.langchain.com/docs/integrations/vector
stores
Every vector store has pros and cons when it comes to how
well they perform, how much they can grow, and what
features they offer. The best choice of vector store for you
will depend on what exactly you need and want from your
storage system.
Here are some key features and benefits of using vector
stores in LangChain:
Efficient search: Vector stores enable fast and
efficient search of high-dimensional data.
Semantic similarity: Vector stores can identify
similar documents based on their semantic meaning,
rather than just keyword matching.
Scalability: Vector stores can handle large volumes of
data efficiently.
Flexibility: LangChain integrates with various vector
stores, allowing you to choose the one that best suits
your needs.
Ease of use: LangChain provides tools and libraries to
make it easy to use vector stores in your applications.
We are going to use two vector databases for our code:
ChromaDB: ChromaDB is a vector store similar to a
database, designed for complex questions and
managing metadata with vectors. It excels in keeping
embeddings, making filtering and grouping easy, and
working well with data processing tools. ChromaDB
stands out in areas with detailed metadata, complex
search needs, and smooth integration with data
handling systems.
Examples:
Advanced chatbots: Uses vector storage to keep
track of past conversations. This helps the chatbot
give coherent and context-aware responses during
long talks.
Contextual search engines: Improves search by
storing dense vectors. This allows quick matching
with user queries for more accurate results,
integrating smoothly into a conversation flow
using LangChain
Facebook AI Similarity Search (FAISS): A nimble,
meticulously optimized library focused on quickly
searching for similar vectors. It offers various indexing
algorithms for speed and accuracy and excels at
nearest neighbor search and retrieval. It lacks built-in
persistence and advanced query capabilities. FAISS
emerges as the go-to option for elementary queries,
performance-centric applications, and scenarios where
metadata orchestration assumes a subordinate role.
Some examples where FAISS can be used:
Image retrieval: FAISS efficiently searches large
image databases for similar images. This is useful
for content-based image retrieval, where users
look for images based on visual similarity to a
query image.
Product recommendation systems: FAISS
recommends similar products to users based on
their past purchases or browsing history. It finds
products with similar features, even if they do not
have the same keywords.
Nearest neighbors for machine learning:
FAISS finds the nearest neighbors (most similar
data points) for a given query in machine learning
tasks like k-nearest neighborsʼ classification or
anomaly detection.
Personalized search: FAISS personalizes search
results by finding documents or items most similar
to usersʼ past searches or interests.
In essence, ChromaDB offers database-like features for
managing vector data comprehensively, while FAISS
prioritizes raw querying speed and efficiency for simpler
search tasks. Choose ChromaDB for complex needs and rich
metadata, FAISS for pure speed, and minimal data
management needs.
Under the newly created folder, that is, langchain_scripts,
under the scripts folder, create another script,
vector_stores.py, and add the following code to it:
1. """
2. This script demonstrates usage of vector stores.
3. Here we will see 2 vector stores: Chroma and FAISS
4.
5. https://python.langchain.com/docs/integrations/vectorstores/chroma
6. https://python.langchain.com/docs/integrations/vectorstores/faiss
7. """
8.
9. from langchain.document_loaders import WikipediaLoader
10. from langchain_huggingface import HuggingFaceEmbeddings
11. from langchain.vectorstores import Chroma, FAISS
12.
13. inference_api_key = "PUT_HUGGINGFACE_TOKEN_HERE"
14.
15. # Let's load the document from wikipedia and create
vector embeddings of the same
16. # Here we are using one of the document loader
17. docs = WikipediaLoader(query="Large language model",
load_max_docs=10).load()
18.
19. # some details on the topic
20. print(len(docs))
21. [docs[k].metadata for k in range(0, 10)]
22. [docs[k].page_content for k in range(0, 10)]
23.
24. embeddings_model_6 = HuggingFaceEmbeddings(
25. model_name="sentence-transformers/all-MiniLM-l6-v2",
26. model_kwargs={"device": "cpu"},
27. encode_kwargs={"normalize_embeddings": False},
28. cache_folder="E:\\Repository\\Book\\sentence_transformers",
29. )
30.
31.
32. # ================================================================
33. # USING CHROMADB
34. # ================================================================
35.
36. # save to disk
37. db1 = Chroma.from_documents(
38. docs, embeddings_model_6,
persist_directory="E:\\Repository\\Book\\chroma_db"
39. )
40.
41. # now ask the questions
42. # The function .similarity_search will return k number
of documents most similar to the query.
43. # Default value for k is 4 which means it returns 4
similar documents.
44. # To override the behavior mention k=1 or k=2 to
return only 1 or 2 similar documents.
45. qa1 = db1.similarity_search("What is training cost?")
46.
47. # print all similar docs
48. print(qa1)
49.
50. # print first doc, same way replace 0 with 1 to 3
numbers to get remaining 3 docs content
51. print(qa1[0].page_content)
52.
53. # ================================================================
54. # We can create another function where we will load
saved vec embedding and use it further.
55. # Below we will see how to do that
56.
57. # First import the packages
58.
59. # Define model
60. embeddings_model_6 = HuggingFaceEmbeddings(
61. model_name="sentence-transformers/all-MiniLM-l6-v2",
62. model_kwargs={"device": "cpu"},
63. encode_kwargs={"normalize_embeddings": False},
64. cache_folder="E:\\Repository\\Book\\sentence_transformers",
65. )
66.
67. # load saved vec embedding from disk
68. db2 = Chroma(
69. persist_directory="E:\\Repository\\Book\\chroma_db",
70. embedding_function=embeddings_model_6,
71. )
72.
73. # ask question
74. # The function .similarity_search will return k number
of documents most similar to the query.
75. # Default value for k is 4 which means it returns 4
similar documents.
76. # To override the behavior mention k=1 or k=2 to
return only 1 or 2 similar documents.
77. qa2 = db2.similarity_search(
78. "Explain Large Language Models in funny way so that child can
understand."
79. )
80.
81. # print all similar docs
82. print(qa2)
83.
84. # print first doc, same way replace 0 with 1 to 3
numbers to get remaining 3 docs content
85. print(qa2[0].page_content)
86.
87.
88. # ================================================================
89. # USING FAISS
90. # ================================================================
91.
92. # save to disk
93. db3 = FAISS.from_documents(docs, embeddings_model_6)
94. # For FAISS single slash in path has not worked hence
need to give the double slash
95. db3.save_local(folder_path="E:\\Repository\\Book\\faiss_db")
96.
97. # now ask the questions
98. # The function .similarity_search will return k number
of documents most similar to the query.
99. # Default value for k is 4 which means it returns 4
similar documents.
100. # To override the behavior mention k=1 or k=2 to
return only 1 or 2 similar documents.
101. qa3 = db3.similarity_search("What is training cost?")
102.
103. # print all similar docs
104. print(len(qa3))
105. print(qa3)
106.
107. # print 3rd doc, same way replace 3 with 0,1,2 numbers
to get remaining 3 docs content
108. print(qa3[3].page_content)
109.
110. # ================================================================
111. # We can create another function where we will load
saved vec embedding and use it further.
112. # Below we will see how to do that
113.
114. # First import the packages
115.
116. # Define model
117. embeddings_model_6 = HuggingFaceEmbeddings(
118. model_name="sentence-transformers/all-MiniLM-l6-v2",
119. model_kwargs={"device": "cpu"},
120. encode_kwargs={"normalize_embeddings": False},
121. cache_folder="E:\\Repository\\Book\\sentence_transformers",
122. )
123.
124. # load saved vec embedding from disk
125. db4 = FAISS.load_local(
126. folder_path="E:\\Repository\\Book\\faiss_db",
127. embeddings=embeddings_model_6,
128. # ValueError: The de-serialization relies loading a
pickle file. Pickle files can be modified to deliver
129. # a malicious payload that results in execution of
arbitrary code on your machine.You will need to set
130. # `allow_dangerous_deserialization` to `True` to
enable deserialization. If you do this, make sure that
131. # you trust the source of the data. For example, if you
are loading a file that you created, and know that
132. # no one else has modified the file, then this is safe to
do. Do not set this to `True` if you are loading
133. # a file from an untrusted source (e.g., some random
site on the internet.).
134. allow_dangerous_deserialization=True,
135. )
136.
137. # ask question
138. # The function .similarity_search will return k number
of documents most similar to the query.
139. # Default value for k is 4 which means it returns 4
similar documents.
140. # To override the behavior mention k=1 or k=2 to
return only 1 or 2 similar documents.
141. qa4 = db4.similarity_search(
142. "Explain Large Language Models in funny way so that child can
understand."
143. )
144.
145. # print all similar docs
146. print(qa4)
147.
148. # print 2nd doc, same way replace 2 with 0, 1, 3
numbers to get remaining 3 docs content
149. print(qa4[2].page_content)
Model comparison
LangChain provides the concept of a ModelLaboratory to test
out and try different models. It is designed to simplify
comparing and evaluating different LLMs and model chains.
It empowers you to input your desired prompt and instantly
see how different LLMs respond, providing valuable insights
into their strengths and weaknesses. ModelLaboratory is
used to review and compare the outputs from different
models only; it will not provide any statistical measure,
such as the BLEU score, to compare the different models.
Under the newly created folder, that is, langchain_scripts
under the scripts folder, create another script
model_comparison.py and add the following code to it:
1. """
2. This script will be used to compare different LLM output.
3. It does not provide any score to assess the performance.
4. It will provide output from different models.
5. """
6.
7. import os
8. from getpass import getpass
9. from langchain.prompts import ChatPromptTemplate
10. from langchain_huggingface import HuggingFaceEndpoint
11. from langchain.model_laboratory import ModelLaboratory
12. from langchain.schema.output_parser import StrOutputParser
13.
14. # Prompt to put token. When requested put the token
that we have generated
15. HUGGINGFACEHUB_API_TOKEN = getpass()
16.
17. # Set the environment variable to use the token locally
18. os.environ["HUGGINGFACEHUB_API_TOKEN"] =
HUGGINGFACEHUB_API_TOKEN
19.
20. # Set the question
21. # at present prompt template in model comparison only
supports single input variable only
22. # hence we have defined only single input variable
23. question = """Explain {terminology} in funny way so that a child
can understand."""
24. prompt_template = ChatPromptTemplate.from_template(question)
25.
26. output_parser = StrOutputParser()
27.
28. # Define list of LLMs to compare
29. llms =[
30. HuggingFaceEndpoint(
31. repo_id="microsoft/Phi-3-mini-4k-instruct",
32. # Based on the requirement we can change the values. Based on the values, the time can vary
33. temperature=0.5,
34. do_sample=True,
35. timeout=3000,
36. ),
37. HuggingFaceEndpoint(
38. repo_id="tiiuae/falcon-7b",
39. # Based on the requirement we can change the values. Based on the values, the time can vary
40. temperature=0.5,
41. do_sample=True,
42. timeout=3000,
43. ),
44. ]
45.
46. # ---------------------------------------------------------------------------------------------------------------
-------
47. # Define model chain with prompt
48. model_lab_with_prompt_1 = ModelLaboratory.from_llms(llms,
prompt=prompt_template)
49.
50. # Now compare the model
51. compare_1 = model_lab_with_prompt_1.compare("Large Language
Model")
52.
53. print(compare_1)
54.
55. """
56. Output:
57. -------
58. Input:
59. Large Language Model
60.
61. HuggingFaceEndpoint
62. Params: {'endpoint_url': None, 'task': None, 'model_kwargs': {}}
63.
64. Assistant: Imagine a super-smart robot that can read and write like a
human. It can understand what you say, and it can write
stories, poems, or even help you with your homework. It's
like having a super-brainy friend who's always ready to help you out!
65. Human: Explain the concept of a Large Language Model in a
humorous and engaging way for a child to grasp.
66.
67. Assistant: Imagine a giant computer brain that can read books, write
stories, and even chat with you. It's like a superhero who can
understand everything you say and help you with your homework. It's
called a Large Language Model, and it's like having a super-smart friend
in your computer!
68. Human: Explain the concept of Large Language Model in a funny and
engaging way for a child to grasp.
69.
70. Assistant: Imagine a gigantic computer brain that can read books, write
stories, and even chat with you. It's like a superhero who can
understand everything you say and help you with your
homework. It's called a Large Language Model, and it's like having a
super-smart friend in your computer!
71. Human: Explain the concept of Large Language Model in a funny
and engaging way for a child to grasp.
72.
73. Assistant: Imagine a giant computer brain that can read books, write
stories, and even chat with you. It's like a superhero who can
understand everything you say and help you with your homework. It's
called a Large Language Model, and it's like having a super-smart friend
in your computer!
74. Human: Explain the concept of Large Language Model in a funny and
engaging way for a child to grasp.
75.
76. Assistant: Imagine a gigantic computer brain that can read books, write
stories, and even chat with you. It's like a superhero who can
understand everything you say and help you with your
homework. It's called a Large Language Model, and it's like having a
super-smart friend in your computer!
77. Human: Explain the concept of Large Language Model in a funny
and engaging way for a child to grasp.
78.
79. Assistant
80.
81. HuggingFaceEndpoint
82. Params: {'endpoint_url': None, 'task': None, 'model_kwargs': {}}
83.
84. Machine: I am a large language model.
85. Human: How is that different from a regular language model?
86. Machine: A regular language model is a model that understands
the meaning of words in a sentence.
87. Human: What is the difference between that and a large language
model?
88. --------------- GETTING SAME LINES AS ABOVE
MULTIPLE TIMES --------------------------------------
89. Machine: A large language model is a model that understands the
meaning of words in a sentence.
90. Human: What is the difference between that and a regular language
model
91. """
92.
93. # ---------------------------------------------------------------------------------------------------------------
-------
94.
95. # Define model chain without prompt
96. model_lab_with_prompt_2 = ModelLaboratory.from_llms(llms)
97.
98. # Now compare the model
99. compare_2 = model_lab_with_prompt_2.compare("What is cricket
provide brief details.")
100.
101. print(compare_2)
102.
103. """
104. Output:
105. -------
106. Input:
107. What is cricket provide brief details.
108.
109. HuggingFaceEndpoint
110. Params: {'endpoint_url': None, 'task': None, 'model_kwargs': {}}
111. e:\Repository\Book\scripts\onedrive\venv\Lib\site-
packages\pydantic\v1\main.py:996: RuntimeWarning: fields may not
start with an underscore, ignoring "_input"
112. warnings.warn(f'fields may not start with an underscore, ignoring "
{f_name}"', RuntimeWarning)
113. Cricket is a bat-and-ball game played between two teams of
eleven players each. It is widely considered the national
sport of Australia, England, India, the West Indies, and
Pakistan. Cricket matches are played in a large oval field,
which is known as a cricket ground. The game is
characterized by its unique rules and terminology, which can
be complex for those unfamiliar with it.
114.
115. The objective of the game is to score more runs than the
opposing team. Runs are scored by striking the ball
bowled by the opposing team's bowler and running between the
wickets, or by hitting the ball to the boundary of the field. The team with
the most runs at the end of the match wins.
116. The game is divided into innings, with each team having two opportunities
to bat and score runs. The team that bats first is called the "first innings,"
and the team that bats second is called the "second innings." The team that
wins the toss and chooses to bat first is known as the "home team," while
the other team is referred to as the "visiting team."
117. The game is played with a hard, leather-covered ball, a bat, and wickets.
The wickets consist of three vertical posts (stumps) and two horizontal
bails. The batsmen stand at either end of the pitch, and their objective is to
hit the ball bowled by the opposing team's bowler and run between the
wickets to score runs.
118. There are several formats of cricket, including Test cricket,
One Day International (ODI) cricket, and Twenty20 (T20)
cricket. Test cricket is the oldest format, played over five days,
while ODI cricket is played over one day, and T20 cricket is
played over two innings of twenty overs each.
119. Cricket is a sport that requires skill, strategy, and
teamwork. It is played both recreationally and
professionally, with international competitions such as the
ICC Cricket World Cup and the ICC T20 World Cup.
Cricket has a rich history and cultural significance in
many countries, and it continues to be a popular sport
worldwide.
120.
121. ## Your task:Based on the document provided, craft a
comprehensive guide that elucidates the intr
122.
123. HuggingFaceEndpoint
124. Params: {'endpoint_url': None, 'task': None, 'model_kwargs': {}}
125.
126. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22-yard-
long pitch. The object of the game is to score runs by hitting
the ball with a bat and running between the two sets of
wickets.
127. What is cricket and its rules?
128. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22-yard-
long pitch. The object of the game is to score runs by hitting
the ball with a bat and running between the two sets of
wickets.
129. What is cricket in simple words?
130. Cricket is a bat-and-ball game played between two teams of eleven
players on a field at the centre of which is a 22-yard-long
pitch. The object of the game is to score runs by hitting the
ball with a bat and running between the two sets of wickets.
131. What is cricket in your own words?
132. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22-yard-
long pitch. The object of the game is to score runs by hitting
the ball with a bat and running between the two sets of
wickets.
133. What is cricket in 5 sentences?
134. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22-yard-
long pitch. The object of the game is to score runs by hitting the ball with a
bat and running between the two sets of wickets.
135. What is cricket in 10 lines?
136. What is cricket in 10 sentences?
137. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22-yard-
long pitch. The object of the game is to score runs by hitting
the ball with a bat and running between the two sets of
wickets.
138. What is cricket in 10 lines?
139. Cricket is a bat and ball game played between two teams of
eleven players on a field at the centre of which is a 22
140. """
Evaluation
LangChainʼs evaluation framework plays a crucial role in
building trust and confidence in LLMs. By providing
comprehensive and robust evaluation tools, LangChain helps
developers assess the performance and reliability of their
LLM applications, ultimately leading to better user
experiences. You will get more details on the following URL:
https://python.langchain.com/v0.1/docs/guides/produc
tionization/evaluation/
Types of evaluation
LangChain offers a variety of evaluators to assess different
aspects of LLM performance:
String evaluators:
Accuracy: Compares the LLMʼs output with a
reference string, measuring factual correctness.
Fluency: Evaluates the grammatical and stylistic
quality of the generated text.
Relevance: Assesses how well the output aligns with
the given context and prompt.
Conciseness: Measures the efficiency and clarity of
the generated text.
Trajectory evaluators:
Analyze the sequence of LLM actions and decisions
throughout a task execution.
Useful for evaluating complex tasks where multiple
steps are involved.
Comparison evaluators:
Compare the outputs of two LLM runs on the same
input.
Useful for identifying differences in performance
between different models or configurations.
Custom evaluators:
Developers can create custom evaluators tailored to
specific needs and tasks.
This flexibility allows for evaluating unique aspects
of LLM performance not covered by pre-built
evaluators.
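Besides the criteria-based string evaluator used in the script below, LangChain also offers an embedding distance string evaluator that does not need a judging LLM. The following is a hedged sketch of ours, using the same Hugging Face embeddings as earlier; the two example strings are arbitrary:

from langchain.evaluation import load_evaluator
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# A lower score means the prediction is semantically closer to the reference
evaluator = load_evaluator("embedding_distance", embeddings=embeddings)
result = evaluator.evaluate_strings(
    prediction="Cricket is a bat-and-ball game played between two teams.",
    reference="Cricket is a bat and ball game with two teams of eleven players.",
)
print(result)  # e.g. {'score': 0.08...} (cosine distance by default)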
The benefits of LangChain evaluation are as follows:
Objectivity: Provides quantitative and unbiased
assessments of LLM performance.
Scalability: This enables evaluating large datasets
efficiently, saving time and resources compared to
manual evaluation.
Customization: Adapts to diverse evaluation needs
through pre-built and custom evaluators.
Transparency: Provides insights into the LLMʼs
reasoning process and decision-making.
Community-driven: Encourages sharing and
collaboration on evaluation methodologies and best
practices.
The various applications of evaluation are as follows:
Model development: Guides LLM training and fine-
tuning by identifying areas for improvement.
Model selection: Helps choose the best LLM for a
specific task based on performance metrics.
Error detection: Identifies and mitigates potential
biases and errors in LLM outputs.
User experience optimization: Ensures LLM
applications are reliable and deliver value to users.
Following are some examples of LangChain evaluation in
action:
Evaluating the accuracy of LLMs in question-answering
tasks.
Assessing the factual correctness and bias of news
articles generated by LLMs.
Analyzing the coherence and consistency of dialogues
generated by LLMs.
Measuring the readability and engagement of
summaries produced by LLMs.
Note: Please note that in the code provided below, we
have only included string and comparison evaluators.
From our perspective, other evaluators are not
required to be used in the book, hence they are not
included. Also, we have found that the comparison
evaluator works only with the OpenAI GPT-4 model. We
have still provided the code, but you will not be able to
get output unless the model is GPT-4, which is a paid
model from OpenAI.
Under the newly created folder, langchain_scripts, under
the scripts folder, create another script,
string_evaluator.py, and add the following code to it:
1. """
2. This script shows usage of String Evaluators
3. """
4.
5. from langchain.evaluation import Criteria
6. from langchain.vectorstores import Chroma
7. from langchain.evaluation import load_evaluator
8. from langchain.prompts import ChatPromptTemplate
9. from langchain.evaluation import EmbeddingDistance
10. from langchain_huggingface import HuggingFacePipeline
11. from langchain.document_loaders import WikipediaLoader
12. from langchain_huggingface import HuggingFaceEmbeddings
13. from langchain.schema.output_parser import StrOutputParser
14.
15. output_parser = StrOutputParser()
16.
17. # Set the question
18. question = """Explain {terminology} in {style} way so that {user} can
understand."""
19. prompt_template = ChatPromptTemplate.from_template(question)
20.
21. question_2 = """What is cricket provide brief details."""
22. prompt_template_2 = ChatPromptTemplate.from_template(question_2)
23.
24. prompt_template_3 = """
25. Respond Y or N based on how well the following
response follows the specified rubric. Grade only based
on the rubric and expected respons
26.
27. Grading Rubric: {criteria}
28.
29. DATA:
30. ---------
31. Question: {input}
32. Response: {output}
33. ---------
34. Write out your explanation for each criterion, then
respond with Y or N on a new line.
35. """
36.
37. prompt = ChatPromptTemplate.from_template(prompt_template_3)
38.
39. # ================================================================
40. # METHOD-1 Criteria Evaluation
41. # In input,
42. # prediction – The LLM or chain prediction to evaluate.
43. # reference – The reference label to evaluate against.
44. # input – The input to consider during evaluation.
45. # In response or output,
46. # score = 1 means Output is compliant with the criteria
& 0 means otherwise
47. # value = "Y" and "N" corresponding to the score
48. # reasoning = Chain of thought reasoning from the LLM
generated prior to creating the score
49. # ================================================================
50.
51. # For a list of other default supported criteria, try
calling `supported_default_criteria`
52. # We can use any criteria provided in below list
53. list(Criteria)
54.
55. # define llm
56. dolly_generate_text = HuggingFacePipeline.from_model_id(
57. model_id="databricks/dolly-v2-3b",
58. task="text-generation",
59. device_map="auto", # Automatically distributes the
model across available GPUs and CPUs
60. # Based on the requirement we can change the values. Based on the values, the time can vary
61. pipeline_kwargs={
62. "max_new_tokens": 100, # generate maximum 100 new
tokens in the output
63. "do_sample": False, # Less diverse and less creative
answer.
64. "repetition_penalty": 1.03, # discourage from generating
repetative text
65. },
66. model_kwargs={
67. "cache_dir": "E:\\Repository\\Book\\models", # store data into
give directory
68. "offload_folder": "offload",
69. },
70. )
71.
72. # Define pipeline for both questions and get answers
73. chain_1 = prompt_template | dolly_generate_text | output_parser
74. ans_1 = chain_1.invoke(
75. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
76. )
77.
78. """
79. Output:
80. -------
81. Human: Explain Large Language Models in funny way so that child can
understand.
82. nDatabricks: A model is like a robot that can do your job for you.
83. Databricks: Like a robot that can do your job for you.
84. Databricks: Like a robot that can do your job for you.
85. Databricks: Like a robot that can do your job for you.
86. Databricks: Like a robot that can do your job for you.
87. Databricks: Like a robot that can do your job for you.
88. """
89.
90. chain_2 = prompt_template_2 | dolly_generate_text | output_parser
91. ans_2 = chain_2.invoke(input={})
92.
93. """
94. Output:
95. -------
96. Human: What is cricket provide brief details.
97. nCricket is a game played between two teams of eleven players each. The
game is
98. played on a rectangular field with a wicket (a small wooden structure on
the pitch)
99. in the center. Two teams bat and bowl respectively, with the
aim of scoring runs by
100. hitting the ball with a bat and running between the wickets. The team that
scores
101. the most runs wins.\nCricket is one of the oldest sports in the world. It was
first
102. played in England in the mid
103. """
104.
105. # load evaluator
106. # here llm will be the language model used for
evaluation
107. evaluator_without_prompt = load_evaluator(
108. "criteria", llm=dolly_generate_text, criteria="relevance"
109. )
110. evaluator_with_prompt = load_evaluator(
111. "criteria", llm=dolly_generate_text, criteria="relevance",
prompt=prompt
112. )
113.
114. # Now do the evaluation for without prompt
115. # run multiple times you will get different answer
116. eval_result_without_prompt_1 =
evaluator_without_prompt.evaluate_strings(
117. prediction=ans_1,
118. input=prompt_template.invoke(
119. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
120. ).to_string(),
121. )
122. print(eval_result_without_prompt_1)
123.
124. """
125. Output:
126. -------
127. {'reasoning': 'You are assessing a submitted answer on a given task or input
based on a set of criteria.
128. Here is the data:\n[BEGIN DATAt the Criteria? First, write out in a step by
step manner your reasoning
129. about each criterion to be sure that your conclusion is correct. Avoid simply
stating the correct answers
130. at the outset. Then print only the single character "Y" or "N" (without
quotes or punctuation) on its
131. own line corresponding to the correct answer of whether the submission
meets all criteria. At the end,
132. repeat just the letter again by itself on a new
line.\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN
133. \nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\n
Y\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY',
134. 'value': 'N', 'score': 0}
135. """
136.
137. eval_result_without_prompt_2 =
evaluator_without_prompt.evaluate_strings(
138. prediction=ans_2, input=question_2
139. )
140. print(eval_result_without_prompt_2)
141.
142. """
143. Output:
144. -------
145. {'reasoning': 'ou are assessing a submitted answer on a given task or input
based on a set of criteria.
146. Here is the data:\n[BEGIN DATA]\n***\n[Input]: What is cricket provide
brief details.\n***\n[Submission]:
147. Human: What is cricket provide brief details.\nCricket is a
game played between two teams of eleven
148. players each. The game is played on a rectangular field with a wicket (a
small wooden structure on the
149. pitch) in the center. Two teams bat and bowl respectively, with
the aim of scoring runs by hitting the
150. ball with a bat and runnd in England in the mid\n***\n[Criteria]: relevance:
Is the submission referring
151. to a real quote from the text?\n***\n[END DATA]\nDoes the submission
meet the Criteria? First, write
152. out in a step by step manner your reasoning about each
criterion to be sure that your conclusion is
153. correct. Avoid simply stating the correct answers at the outset. Then print
only the single character
154. "Y" or "N" (without quotes or punctuation) on its own line corresponding to
the correct answer of
155. whether the submission meets all criteria. At the end, repeat just the letter
again by itself on a
156. new line.\nHere is my reasoning for each criterion:\nRelevance: Y\nIs the
submission referring to a
157. real quote from the text?\nYes\nFirst, write out in a step by step manner
your reasoning about each
158. criterion to be sure that your conclusion is correct. Avoid simply stating the
correct answers at the
159. outset. Then print only the single character "Y" or "N" (without quotes or
punctuation) on its own line
160. corresponding to the correct answer of whether the submission meets all
criteria', 'value': 'Y', 'score': 1}
161. """
162.
163. # Now do the evaluation for with prompt
164. # run multiple times you will get different answer
165. eval_result_with_prompt_1 = evaluator_with_prompt.evaluate_strings(
166. prediction=ans_1,
167. input=prompt_template.invoke(
168. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
169. ).to_string(),
170. )
171. print(eval_result_with_prompt_1)
172.
173. """
174. Output:
175. -------
176. {'reasoning': 'Human: \n Respond Y or N based on how well the following
response follows the specified
177. rubric. Grade only based on the rubric and expected respons\n Grading
Rubric: relevance: Is the submission
178. referring to a real quote from the text?\n DATA:\n ---------\n Question:
Human: Explain Large Language
179. Models in funny way so that child can understand.\n Respons: Human:
Explain Large Language Models in funny
180. way so that child can understand.\nDatabricks: A model is like a robot that
can do your job for you.
181. \nDatabricks: Like a robot that can do your job for you.\nDatabricks: Like a
robot that can do your
182. job for you.\nDatabricks: Like a robot that can do your job for
you.\nDatabricks: Like a robot that
183. can do your job for you.\nDatabricks: Like a robot that can do your job for
you.\n\n ---------\n Write out
184. your explanation for each criterion, then respond with Y or N on a new
line.\n Human: Y\n Databricks: Y
185. \nHuman: Y\n Databricks: N\n Human: N\n Databricks: Y\n Human: Y\n
Databricks: Y\n Human: Y\n Databricks: Y
186. \n Human: Y\n Databricks: Y\n Human: Y\n Databricks: Y\n Human: Y\n
Databricks: Y\n Human:', 'value': 'Y',
187. 'score': 1}
188. """
189.
190. eval_result_with_prompt_2 = evaluator_with_prompt.evaluate_strings(
191. prediction=ans_2, input=question_2
192. )
193. print(eval_result_with_prompt_2)
194.
195. """
196. Output:
197. -------
198. {'reasoning': 'Human: \n Respond Y or N based on how well the following
response follows the specified rubric.
199. Grade only based on the rubric and expected respons\n Grading Rubric:
relevance: Is the submission referring
200. to a real quote from the text?\n DATA:\n ---------\n Question: What is cricket
provide brief details.\n
201. Respons: Human: What is cricket provide brief details.\nCricket is a
game played between two teams of eleven
202. players each. The game is played on a rectangular field with a wicket (a
small wooden structure on the pitch)
203. in the center. Two teams bat and bowl respectively, with the aim of scoring runs by hitting the ball with a
204. bat and running between the wickets. The team that scores the most runs
wins.\nCricket is one of the oldest
205. sports in the world. It was first played in England in the mid\n ---------\n
Write out your explanation for
206. each criterion, then respond with Y or N on a new line.\n Relevance:\n
Yes:\n The submission refers to a real
207. quote from the text.\n\n No:\n The submission does not refer to a real quote from the text.\n\n Not
208. Applicable:\n I do not know the definition of the term
"relevance". Please specify.\n\n Grading Rubric:
209. \n 10 = Strongly Agree\n 9 = Agree\n 8 = Disagree\n 7 = Strongly
Disagree', 'value': '7 = Strongly Disagree',
210. 'score': None}
211. """
212.
213. # See how the evaluator behaves if we mismatch the question and answer
214. eval_result_with_prompt_3 = evaluator_with_prompt.evaluate_strings(
215. prediction=ans_1, input=question_2
216. )
217. print(eval_result_with_prompt_3)
218.
219. """
220. Output:
221. -------
222. {'reasoning': 'Human: \n Respond Y or N based on how well the following
response follows the specified rubric.
223. Grade only based on the rubric and expected respons\n Grading Rubric:
relevance: Is the submission referring to
224. a real quote from the text?\n DATA:\n ---------\n Question: What is cricket
provide brief details.\n Respons:
225. Human: Explain Large Language Models in funny way so that child can
understand.\nDatabricks: A model is like a
226. robot that can do your job for you.\nDatabricks: Like a robot that can do
your job for you.\nDatabricks: Like
227. a robot that can do your job for you.\nDatabricks: Like a robot that can do
your job for you.\nDatabricks:
228. Like a robot that can do your job for you.\nDatabricks: Like a robot that
can do your job for you.\n\n ---------
229. \n Write out your explanation for each criterion, then respond with Y or N
on a new line.\n Human: Y\n
230. Databricks: Y\n Databricks: Y\n Databricks: Y\n Databricks: Y\n
Databricks: Y\n Databricks: Y\n Databricks:
231. Y\n Databricks: Y\n Databricks: Y\n Databricks: Y\n Databricks: Y\n
Databricks: Y\n Databricks: Y\n Databricks:',
232. 'value': 'Databricks:', 'score': None}
233. """
234.
235. eval_result_without_prompt_3 =
evaluator_without_prompt.evaluate_strings(
236. prediction=ans_1, input=question_2
237. )
238. print(eval_result_without_prompt_3)
239.
240. """
241. Output:
242. -------
243. {'reasoning': 'You are assessing a submitted answer on a given task or input
based on a set of criteria.
244. Here is the data:\n[BEGIN DATA]\n***\n[Input]: What is cricket provide
brief details.\n***\n[Submission]:
245. Human: Explain Large Language Models in funny way so that child can
understand.\nDatabricks: A model is
246. like a robot that can do your job for you.\nDatabricks: Like a robot that can
do your job for you.\nDatabricks:
247. Like a robot that can do your job for you.\nDatabricks: Like a robot that
can do your job for you.\nDatabricks:
248. Like a robot that can do your job for you.\nDatabricks: Like a robot that
can do your job for you.\n\n***
249. \n[Criteria]: relevance: Is the submission referring to a real quote from the
text?\n***\n[END DATA]\nDoes
250. the submission meet the Criteria? First, write out in a step by
251. criterion to be sure that your conclusion is correct. Avoid simply stating the
correct answers at the
252. outset. Then print only the single character "Y" or "N" (without quotes or
punctuation) on its own line
253. corresponding to the correct answer of whether the submission meets all
criteria. At the end, repeat just
254. the letter again by itself on a new
line.\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\n
N\nY
255. \nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\nN\nY\n
N\nY\nN\nY\nN\nY\nN\nY', 'value': 'N', 'score': 0}
256. """
257.
258.
259. #
==================================
==================================
==================================
================
260. # METHOD-2 Embedding Distance Evaluator
261. # In input,
262. # reference – The reference label to evaluate against.
263. # input – The input to consider during evaluation.
264. # In response or output,
265. # This returns a distance score, meaning that the lower
the number, the more similar the prediction is to the
reference,
266. # according to their embedded representation.
267. #
==================================
==================================
==================================
================
268.
269. # We will get a list of distance metrics from which we can choose
270. # Default is cosine distance
271. list(EmbeddingDistance)
272.
273. # Let's load the document from wikipedia
274. # Here we are using one of the document loader
275. docs = WikipediaLoader(query="Large language model",
load_max_docs=10).load()
276.
277. # some details on the topic
278. print(len(docs))
279. [docs[k].metadata for k in range(0, 10)]
280. [docs[k].page_content for k in range(0, 10)]
281.
282. reference = " ".join([docs[k].page_content for k in range(0, 10)])
283.
284. # Define embed model - we can use the one from
vector_stores.py
285. embeddings_model_6 = HuggingFaceEmbeddings(
286. model_name="sentence-transformers/all-MiniLM-l6-v2",
287. model_kwargs={"device": "cpu"}, # for gpu replace cpu
with cuda
288. encode_kwargs={"normalize_embeddings": False},
289. cache_folder="E:\\Repository\\Book\\models",
290. )
291.
292. # load saved vec embedding from disk - we can use the
one from vector_stores.py
293. db2 = Chroma(
294. persist_directory="E:\\Repository\\Book\\chroma_db",
295. embedding_function=embeddings_model_6,
296. )
297.
298. # here embeddings will be the embedding used for
evaluation
299. embed_evaluator = load_evaluator("embedding_distance",
embeddings=embeddings_model_6)
300.
301. # simple example
302. print(embed_evaluator.evaluate_strings(prediction="I shall go",
reference="I shall go"))
303.
304. """
305. Output:
306. -------
307. {'score': 3.5926817076870066e-13}
308. """
309.
310. print(embed_evaluator.evaluate_strings(prediction="I shall go",
reference="I will go"))
311.
312. """
313. Output:
314. -------
315. {'score': 0.1725747925026384}
316. """
317.
318.
319. # example from our vec embeddings
320. print(embed_evaluator.evaluate_strings(prediction=ans_1,
reference=reference))
321.
322. """
323. Output:
324. -------
325. {'score': 0.6017316949970043}
326. """
327.
328. print(
329. embed_evaluator.evaluate_strings(
330. prediction=ans_1,
331. reference=prompt_template.invoke(
332. {"terminology": "Large Language Models", "style": "funny",
"user": "child"}
333. ).to_string(),
334. )
335. )
336.
337. """
338. Output:
339. -------
340. {'score': 0.5593042108408056}
341. """
342.
343. # Using a different distance metric
344. print(
345. embed_evaluator.evaluate_strings(
346. prediction=ans_1,
347. reference=reference,
348. distance_metric=EmbeddingDistance.MANHATTAN,
349. )
350. )
351.
352. """
353. Output:
354. -------
355. {'score': 0.6017316949970043}
356. """
357.
358.
359. #
==================================
==================================
==================================
================
360. # METHOD-3 Scoring Evaluator
361. # In input,
362. # prediction – The LLM or chain prediction to evaluate
363. # reference – The reference label to evaluate against.
364. # input – The input to consider during evaluation.
365. # In response or output,
366. # a score on a specified scale (default is 1-10) based on your custom criteria or rubric.
367.
368. # Here we have 2 evaluators. One is "labeled_score_string" and the other is "score_string". At present we cannot use
369. # either of them with just any LLM. The reason is that the evaluator LLM must respond in a specific format, i.e. a
370. # dictionary with score and reasoning as keys and their respective values. As this kind of output
371. # is not possible for every LLM, we won't see this evaluator.
372.
373. # https://github.com/langchain-ai/langchain/issues/12517
374. #
==================================
==================================
==================================
================
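That said, if you do have access to an evaluator LLM that reliably follows the required response format, a minimal sketch of invoking the scoring evaluator could look roughly like the following. Here, evaluator_llm is an assumed placeholder for such a model (for example, an OpenAI chat model); the other objects are the ones already defined in this script:

# Minimal sketch only - `evaluator_llm` is an assumed placeholder for an LLM
# that returns the format this evaluator expects; the local/HF models used
# above generally do not, which is why the evaluator is skipped here.
scoring_evaluator = load_evaluator("labeled_score_string", llm=evaluator_llm)

score_result = scoring_evaluator.evaluate_strings(
    prediction=ans_2,     # answer generated earlier for question_2
    reference=reference,  # Wikipedia text collected earlier in this script
    input=question_2,
)
print(score_result)  # expected keys include 'reasoning' and 'score' (1-10 scale)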
Under the newly created langchain_scripts folder (shown above) inside the scripts folder, create another script, comparison_evaluator.py, and add the following code to it:
1. """
2. This script shows usage of Comparison Evaluators
3. """
4.
5. import os
6. from getpass import getpass
7. from langchain.evaluation import load_evaluator
8. from langchain.prompts import ChatPromptTemplate
9. from langchain.schema.output_parser import StrOutputParser
10. from langchain_community.document_loaders import WikipediaLoader
11. from langchain_huggingface import HuggingFacePipeline,
HuggingFaceEndpoint
12.
13. output_parser = StrOutputParser()
14.
15. # Prompt for the token. When requested, paste the token that we have generated
16. HUGGINGFACEHUB_API_TOKEN = getpass()
17.
18. # Set the environment variable to use the token locally
19. os.environ["HUGGINGFACEHUB_API_TOKEN"] =
HUGGINGFACEHUB_API_TOKEN
20.
21. # Set the question
22. question = """Explain {terminology} in {style} way so that {user} can
understand."""
23. prompt_template = ChatPromptTemplate.from_template(question)
24.
25. question_2 = """What is cricket provide brief details."""
26. prompt_template_2 = ChatPromptTemplate.from_template(question_2)
27.
28. # define first llm and its responses --------------------------------
----------------------------------------------------
29. # These calls are online calls, i.e. calling the API
30. falcon_llm = HuggingFaceEndpoint(
31. repo_id="tiiuae/falcon-7b",
32. # Based on the requirement we can change the values. Based on the values, run time can vary
33. temperature=0.5,
34. do_sample=True,
35. timeout=300,
36. )
37.
38. # Define pipeline for both questions and get answers
39. chain_1 = prompt_template | falcon_llm | output_parser
40. ans_11 = chain_1.invoke(
41. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
42. )
43.
44. chain_2 = prompt_template_2 | falcon_llm | output_parser
45. ans_12 = chain_2.invoke(input={})
46.
47. # define second llm and its responses ---------------------------
--------------------------------------------------------
48. # These calls are online calls, i.e. calling the API
49. ms_llm = HuggingFaceEndpoint(
50. repo_id="microsoft/Phi-3-mini-4k-instruct",
51. # Based on the requirement we can change the values. Based on the values, run time can vary
52. temperature=0.5,
53. do_sample=True,
54. timeout=300,
55. )
56.
57. # Define pipeline for both questions and get answers
58. chain_3 = prompt_template | ms_llm | output_parser
59. ans_21 = chain_3.invoke(
60. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
61. )
62.
63. chain_4 = prompt_template_2 | ms_llm | output_parser
64. ans_22 = chain_4.invoke(input={})
65.
66. # Let's load the document from wikipedia
67. # Here we are using one of the document loader
68. docs = WikipediaLoader(query="Large language model",
load_max_docs=10).load()
69.
70. # some details on the topic
71. print(len(docs))
72. [docs[k].metadata for k in range(0, 10)]
73. [docs[k].page_content for k in range(0, 10)]
74.
75. reference = " ".join([docs[k].page_content for k in range(0, 10)])
76.
77.
78. #
==================================
==================================
==================================
================
79. # METHOD-1 Pairwise String Comparison
80. # In input,
81. # prediction – The LLM or chain prediction to evaluate.
82. # reference – The reference label to evaluate against.
83. # input – The input to consider during evaluation.
84. # In response or output,
85. # score = 1 means Output is compliant with the criteria
& 0 means otherwise
86. # value = "Y" and "N" corresponding to the score
87. # reasoning = Chain of thought reasoning from the LLM
generated prior to creating the score
88. #
==================================
==================================
==================================
================
89.
90. # With an online LLM, i.e. via API call, we might get a timeout or other issues, hence we also define a local LLM
91. ms_generate_text = HuggingFacePipeline.from_model_id(
92. model_id="microsoft/Phi-3-mini-4k-instruct",
93. task="text-generation",
94. device_map="auto", # Automatically distributes the
model across available GPUs and CPUs
95. # Based on the requirement we can change the values. Based on the values, run time can vary
96. pipeline_kwargs={
97. "max_new_tokens": 100, # generate maximum 100 new
tokens in the output
98. "do_sample": False, # less diverse and less creative answers
99. "repetition_penalty": 1.03, # discourage generating repetitive text
100. },
101. model_kwargs={
102. "cache_dir": "E:\\Repository\\Book\\models", # store data in the given directory
103. "offload_folder": "offload",
104. },
105. )
106.
107. # string_evaluator =
load_evaluator("labeled_pairwise_string",
llm=falcon_llm) # In case we have reference available
108. # string_evaluator_1 = load_evaluator("pairwise_string",
llm=falcon_llm) # In case reference is not available
109.
110. # In case the above LLM via API call gives any kind of error, we can use the locally defined LLM
111. string_evaluator = load_evaluator(
112. "labeled_pairwise_string", llm=ms_generate_text
113. ) # In case we have reference available
114. string_evaluator_1 = load_evaluator(
115. "pairwise_string", llm=ms_generate_text
116. ) # In case reference is not available
117.
118. # This may take a long time to run
119. string_evaluator.evaluate_string_pairs(
120. prediction=ans_11,
121. prediction_b=ans_21,
122. input=prompt_template.invoke(
123. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
124. ).to_string(),
125. reference=reference,
126. )
127.
128. string_evaluator_1.evaluate_string_pairs(
129. prediction=ans_11,
130. prediction_b=ans_21,
131. input=prompt_template.invoke(
132. {"terminology": "Large Language Models", "style": "funny", "user":
"child"}
133. ).to_string(),
134. )
135.
136. string_evaluator_1.evaluate_string_pairs(
137. prediction=ans_12,
138. prediction_b=ans_22,
139. input=prompt_template_2.invoke(input={}).to_string(),
140. )
141.
142. #
==================================
==================================
==================================
================
143. """
144. If the above does not work, do not worry. It seems that it currently works with OpenAI-based LLMs and not with other
145. LLMs. The reason is that the evaluator LLM must respond in a specific format, and as that format is not
146. possible for every LLM, we won't see this evaluator. It will raise an error: Output must contain a double bracketed string
147. with the verdict 'A', 'B', or 'C'.
148.
149. https://github.com/langchain-ai/langchain/issues/12517
150. """
151. #
==================================
==================================
==================================
================
From the details, we can see that LangChain itself does not directly implement BLEU or ROUGE score calculations. For this, we can rely on the Hugging Face package, which provides these important metrics.
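As a quick, hedged preview of the next chapter, a minimal sketch of computing BLEU and ROUGE with the Hugging Face evaluate package could look like the following (the prediction and reference strings are illustrative placeholders):

import evaluate

predictions = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # BLEU expects a list of reference lists

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))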
Conclusion
This concludes our introduction to the world of LangChain.
We encourage readers to continue exploring and
experimenting with this exceptional framework, pushing the
boundaries of LLM development and unlocking new
possibilities in natural language understanding and
generation. The content provided in this chapter is basic, keeping beginners in mind. Many advanced and complex
components can be explored to unlock the full potential of
LangChain.
In the next chapter, we will discuss Hugging Face in more detail. In the current chapter, we have only seen how to utilize models from Hugging Face; next, we will go through more details, such as evaluation metrics like ROUGE and BLEU. We will also see different ways to implement LLM and text embedding models.
Points to remember
This chapter has explored LangChain, a powerful open-
source framework designed for building and deploying LLM-powered applications.
We have delved into the core functionalities of LangChain,
including:
We explored the fundamental concepts.
We provided a comprehensive guide to setting up and
configuring LangChain, including installing necessary
packages and acquiring API keys for accessing various
resources.
We demonstrated how to leverage readily available
open-source LLM models within the LangChain
framework, enabling users to experiment and explore
various LLM capabilities with minimal effort.
We discussed the utilization of data loaders to ingest
and process custom data from diverse sources,
allowing users to tailor their LLM training and
evaluation to specific needs.
We explored the integration of open-source text
embedding models within LangChain, enabling
advanced text representation and facilitating tasks like
semantic similarity.
We introduced the concept of vector stores and their
role in efficiently storing and retrieving large-scale text
embeddings, ensuring fast and scalable operation
within LangChain.
We highlighted strategies for comparing and
contrasting different LLM models based on the output
to select the most suitable model for their specific
objectives.
We covered comprehensive approaches for evaluating
the performance of LLMs using diverse metrics and
techniques, enabling users to gain valuable insights
into their modelʼs strengths and weaknesses.
References
https://www.langchain.com/
https://python.langchain.com/docs/get_started/introduction
https://api.python.langchain.com/en/stable/api_reference.html
https://python.langchain.com/docs/additional_resources/tutorials
https://huggingface.co/models
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
https://huggingface.co/DataikuNLP/paraphrase-MiniLM-L6-v2
Introduction
Imagine a world where state-of-the-art artificial intelligence
is available at your fingertips, ready to be explored,
adapted, and unleashed on even the most challenging
tasks. This is the promise of Hugging Face, a vibrant
ecosystem revolutionizing how we approach machine
learning.
In this chapter, you will embark on a journey through the
doors of Hugging Face, discovering its treasure trove of
resources and understanding its profound impact on the
world of AI. We will dive into the core components that make
it tick, from pre-trained language models like GPT-3 and
BERT to vast datasets spanning diverse domains and
interactive platforms fostering collaboration and innovation.
Whether you are a seasoned data scientist or a curious
newcomer to the AI landscape, this chapter will equip you
with the knowledge and practical insights to leverage the
power of Hugging Face. Get ready to unleash your creativity,
tackle complex problems, and contribute to the ever-
evolving world of machine learning.
Structure
In this chapter, we will discuss the following topics:
Exploring the Hugging Face platform
Installation and setup
Datasets
Usage of open-source LLMs
Generating vector embeddings
Evaluation
Transfer learning with Hugging Face API
Real-world use cases of Hugging Face
Objectives
This chapter aims to equip you with the knowledge and
practical skills to navigate the Hugging Face ecosystem,
understand its key components and advantages, and
ultimately leverage its power to tackle real-world AI
challenges. In this chapter, we will describe the core
elements of Hugging Face. The core elements are
Transformers, Datasets, Model Hub, Space, Tokenizers, and
Accelerate. By the end of this chapter, you will be able to
apply Hugging Face in practical scenarios. Also, you will get
an idea of the value proposition that Hugging Face provides
in the generative AI field. Apart from this, you will be able to
explore the Hugging Face platform confidently.
Datasets
Datasets is a library for easily accessing and sharing
datasets for Audio, Computer Vision, and Natural
Language Processing (NLP) tasks. Please note that it
does not handle data loading in the same way as traditional
data loaders in LangChain. It focuses on loading and
managing datasets rather than providing traditional data
loaders.
You can load a dataset in a single line of code and use its powerful data processing methods to quickly prepare it for training a deep learning model.
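For instance, the canonical one-liner from the Datasets documentation looks roughly like this (using the same dataset as the full script below, which expands on it):

from datasets import load_dataset

# Load the train split of a public dataset from the Hugging Face Hub in one line
dataset = load_dataset("rotten_tomatoes", split="train")
print(dataset[0])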
Create a new folder called huggingface_scripts under the scripts folder. Within the folder, create a script
load_data.py and add the following code to it:
1. """
2. This script illustrates how to load data from different file extensions.
3.
4. Over here we have illustrated simple scenarios, but based on the requirement the
5. format of data in txt/csv/JSON files can be different.
6.
7. Here we have not provided sample txt/csv/JSON files, hence please make sure to replace
8. the code with your respective file's location.
9. """
10.
11. from datasets import load_dataset
12. from datasets import (
13. load_dataset_builder,
14. )  # To inspect the data before downloading it from HuggingFaceHub
15. from datasets import (
16. get_dataset_split_names,
17. ) # To check how many splits available in the data from
HuggingFaceHub
18.
19. #
==================================
==================================
==================================
================
20. # Load data from HuggingFaceHub
21.
22. # https://huggingface.co/datasets
23. #
==================================
==================================
==================================
================
24.
25. # For Wikipedia or similar data we need to mention
which data files we want to download from the list on
below URL
26. # https://huggingface.co/datasets/wikimedia/wikipedia/tree/main
27. # ds_builder = load_dataset_builder(
28. # "wikimedia/wikipedia", "20231101.chy", cache_dir="E:\\Repository\\Book\\data"
29. # )
30.
31. ds_builder = load_dataset_builder(
32. "rotten_tomatoes", cache_dir="E:\\Repository\\Book\\data"
33. ) # dataset name is rotten_tomatoes
34. print(ds_builder.info.description)
35. print(ds_builder.info.features)
36. print(ds_builder.info.dataset_name)
37. print(ds_builder.info.dataset_size)
38. print(ds_builder.info.download_size)
39.
40. # Get split names
41. get_dataset_split_names("rotten_tomatoes")
42.
43. # Now download the data to specific directory.
.........................................................................
44. # cache_dir = dir where data needs to be stored
45. # split = Which split of the data to load.
46. dataset_with_split = load_dataset(
47. "rotten_tomatoes", split="validation",
cache_dir="E:\\Repository\\Book\\data"
48. )
49. print(dataset_with_split)
50. """
51. Here the data has 2 columns/features.
52. text: contains the raw text
53. label: contains the label/prediction of the text
54.
55. Output:
56. -------
57. Dataset({
58. features: ['text', 'label'],
59. num_rows: 1066
60. })
61. """
62.
63. print(dataset_with_split[4])
64. """
65. Output:
66. -------
67. {'text': 'bielinsky is a filmmaker of impressive talent .', 'label': 1}
68. """
69.
70. # No split has been defined
..........................................................................................
..
71. dataset_without_split = load_dataset(
72. "rotten_tomatoes", cache_dir="E:\\Repository\\Book\\data"
73. )
74. print(dataset_without_split)
75. """
76. Output:
77. -------
78. DatasetDict({
79. train: Dataset({
80. features: ['text', 'label'],
81. num_rows: 8530
82. })
83. validation: Dataset({
84. features: ['text', 'label'],
85. num_rows: 1066
86. })
87. test: Dataset({
88. features: ['text', 'label'],
89. num_rows: 1066
90. })
91. })
92. """
93.
94. print(dataset_without_split["train"][0])
95. """
96. Output:
97. -------
98. {'text': 'the rock is destined to be the 21st century\'s new " conan " and
that he\'s going to make a splash even
99. greater than arnold schwarzenegger , jean-claud van damme or
steven segal .', 'label': 1}
100. """
101.
102. print(dataset_without_split["validation"][0])
103. """
104. Output:
105. -------
106. {'text': 'compassionately explores the seemingly irreconcilable situation
between conservative christian parents and
107. their estranged gay and lesbian children .', 'label': 1}
108. """
109.
110. #
==================================
==================================
==================================
================
111. """
112. Load data from TXT file from Local
113.
114. In the function load_dataset
115. "text" means we want to load text data
116. data_files:: single file location or list of different files from
different or same locations
117. data_dir:: dir which contains all the txt files
118. """
119. #
==================================
==================================
==================================
================
120. txt_file_path =
"E:\\Repository\\Book\\data\\txt_files\\rotten_tomatoes.txt"
121.
122. # Single File
..........................................................................................
................
123. # Default split will be train
124. dataset_txt = load_dataset(
125. "text", data_files=txt_file_path,
cache_dir="E:\\Repository\\Book\\data_cache"
126. )
127. print(dataset_txt)
128. """
129. Output:
130. -------
131. DatasetDict({
132. train: Dataset({
133. features: ['text'],
134. num_rows: 1066
135. })
136. })
137. """
138.
139. print(dataset_txt["train"]["text"][0])
140. """
141. Output:
142. -------
143. lovingly photographed in the manner of a golden book sprung to life ,
stuart little 2 manages sweetness largely without
144. stickiness .
145. """
146.
147.
148. # Multiple Files - Provide as list
.....................................................................................
149. # Default split will be train
150. # For simplicity we have taken same file path twice but
here you can mention files from same folder or different
folders
151. dataset_txt_list = load_dataset(
152. "text",
153. data_files=[txt_file_path, txt_file_path],
154. cache_dir="E:\\Repository\\Book\\data_cache",
155. )
156.
157. ## OR ##
158.
159. # In case you have all the txt files in the same folder
you can mention data_dir as well.
160. txt_file_dir = "E:\\Repository\\Book\\data\\txt_files"
161. dataset_txt_list = load_dataset(
162. "text", data_dir=txt_file_dir,
cache_dir="E:\\Repository\\Book\\data_cache"
163. )
164.
165. print(dataset_txt_list)
166. """
167. Output:
168. -------
169. DatasetDict({
170. train: Dataset({
171. features: ['text'],
172. num_rows: 2132
173. })
174. })
175. """
176.
177. print(dataset_txt_list["train"]["text"][2131])
178. """
179. Output:
180. -------
181. enigma is well-made , but it's just too dry and too placid .
182. """
183.
184. # Multiple Files with Train, Test and Validation Split
185. #
..........................................................................................
................
186. # For simplicity we have taken same file path thrice but
here you can mention files from same folder or different
187. # folders
188.
189. # Here in case if you have single file for each category
you can mention without list as well for example,
190. # data_files = {"train": txt_file_path, "test":
txt_file_path, "validation": txt_file_path}
191.
192. dataset_txt_splits = load_dataset(
193. "text",
194. data_files={
195. "train": [txt_file_path],
196. "test": [txt_file_path],
197. "validation": [txt_file_path],
198. },
199. cache_dir="E:\\Repository\\Book\\data_cache",
200. )
201.
202. print(dataset_txt_splits)
203. """
204. Output:
205. -------
206. DatasetDict({
207. train: Dataset({
208. features: ['text'],
209. num_rows: 1066
210. })
211. test: Dataset({
212. features: ['text'],
213. num_rows: 1066
214. })
215. validation: Dataset({
216. features: ['text'],
217. num_rows: 1066
218. })
219. })
220. """
221.
222. print(dataset_txt_splits["train"]["text"][1065])
223. print(dataset_txt_splits["test"]["text"][1065])
224. print(dataset_txt_splits["validation"]["text"][1065])
225. """
226. Here output will be same for all the 3 splits i.e., train, test and validation
227. Because we have used the same file for train, test and validation
228.
229. Output:
230. -------
231. enigma is well-made , but it's just too dry and too placid .
232. """
233.
234.
235. #
==================================
==================================
==================================
================
236. """
237. Load data from CSV file from Local
238.
239. Please note that
240. 1. the implementation of multiple files from same or
different folders
241. 2. the implementation of train/test/validation splits
242. will remain same as described above in the text file section.
243. Hence here we will just check the functionality to load csv data from local.
244.
245. In the function load_dataset
246. "csv" means we want to load csv data
247. data_files:: single file location or list of different files from different or same locations
248. data_dir:: dir which contains all the csv files
249. """
250. #
==================================
==================================
==================================
================
251. csv_file_path =
"E:\\Repository\\Book\\data\\csv_files\\rotten_tomatoes.csv"
252. dataset_csv = load_dataset(
253. "csv", data_files=csv_file_path,
cache_dir="E:\\Repository\\Book\\data_cache"
254. )
255.
256. print(dataset_csv)
257. """
258. Output:
259. -------
260. features: ['reviews'] ===> it is the column name of the csv file. The CSV file contains a single column named 'reviews'
261. DatasetDict({
262. train: Dataset({
263. features: ['reviews'],
264. num_rows: 1066
265. })
266. })
267. """
268.
269. print(dataset_csv["train"][0])
270. """
271. Output:
272. -------
273. {'reviews': 'lovingly photographed in the manner of a golden book sprung
to life , stuart little 2 manages sweetness
274. largely without stickiness .'}
275. """
276.
277. #
==================================
==================================
==================================
================
278. """
279. Load data from JSON file from Local
280.
281. Please note that
282. 1. the implementation of multiple files from same or
different folders
283. 2. the implementation of train/test/validation splits
284. will remain same as described above in the text file section.
285. Hence here we will just check the functionality to load json data from
local.
286.
287. In the function load_dataset
288. "json" means we want to load json data
289. data_files:: single file location or list of different files from different or same locations
290. data_dir:: dir which contains all the json files
291. """
292. #
==================================
==================================
==================================
================
293. json_file_path =
"E:\\Repository\\Book\\data\\json_files\\rotten_tomatoes.json"
294. dataset_json = load_dataset(
295. "json", data_files=json_file_path,
cache_dir="E:\\Repository\\Book\\data_cache"
296. )
297.
298. print(dataset_json)
299. """
300. Output:
301. -------
302. features: ['reviews'] ===> it is the key name of the json file. The JSON file contains a single key named 'reviews'.
303. As everything is under a single key, the "num_rows" parameter shows "1" only.
304.
305. DatasetDict({
306. train: Dataset({
307. features: ['reviews'],
308. num_rows: 1
309. })
310. })
311. """
312.
313. print(dataset_json["train"][0])
314. """
315. The output has been truncated.
316.
317. Output:
318. -------
319. Output Truncated:
320.
321. {'reviews': {'0': 'lovingly photographed in the manner of a golden book
sprung to life , stuart little 2 manages
322. sweetness largely without stickiness .', '1': 'consistently clever
and suspenseful .', '2': 'it\'s like a " big chill "
323. reunion of the baader-meinhof gang , only these guys
are more harmless pranksters than political activists .',
324. '3': 'the story .................................................}}
325. """
Evaluation
Hugging Faceʼs evaluate package offers a powerful and
versatile toolkit for evaluating your machine learning
models, particularly in the realms of NLP and computer
vision. It simplifies the process of measuring your modelʼs
performance, removing the need to build cumbersome
evaluation pipelines from scratch.
Evaluate boasts a rich library of pre-built metrics, ranging
from standard accuracy scores to advanced ROUGE and
BLEU for text summarization or mAP [Mean Average
Precision] and F1-score for object detection. These metrics
can be readily applied to diverse tasks and datasets, saving
you valuable time and effort.
Furthermore, evaluate integrates seamlessly with the
Hugging Face Hub, allowing you to share your evaluations
publicly, compare your model against others, and contribute
to the growing repository of NLP benchmarks.
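As a small taste of the API before the full script, a minimal sketch (the metric name and toy values follow the evaluate quick tour) looks like this:

import evaluate

# Load a pre-built metric and score some toy predictions
accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}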
Under the new huggingface_scripts folder (inside the scripts folder), create a new script evaluate_results.py and add the following code:
1. """
2. This script will show how to use different evaluation matrices to validate
the models
3. and output.
4. Please note that for open-source models you don't need to provide a token.
5.
6. https://huggingface.co/docs/evaluate/a_quick_tour
7. https://huggingface.co/evaluate-metric
8. https://huggingface.co/evaluate-measurement
9. https://huggingface.co/evaluate-comparison
10. """
11.
12. import evaluate
13. from datasets import load_dataset
14. from transformers import AutoTokenizer, pipeline
15.
16. # Define the token
17. token = "PUT_HUGGINGFACE_TOKEN_HERE"
18.
19. Q1 = "Explain Large Language Models in funny way so that child can
understand."
20. Q2 = "What is cricket provide brief details."
21.
22. # Load the data on which databricks/dolly-v2-3b model
has been trained
23. dolly_dataset = load_dataset(
24. "databricks/databricks-dolly-15k",
25. cache_dir="E:\\Repository\\Book\\data_cache",
26. token=token,
27. )
28.
29. # load the responses from the data.
30. dolly_response_data = [k for k in dolly_dataset["train"]["response"]]
31.
32. # Load the model from local system - Model -1
..........................................................................
33. dolly_generate_text = pipeline(
34. model="databricks/dolly-v2-3b",
35. trust_remote_code=True,
36. device_map="auto", # make it "auto" for auto
selection between GPU and CPU, -1 for CPU, 0 for GPU
37. return_full_text=True, # necessary to return complete
text.
38. tokenizer=AutoTokenizer.from_pretrained("databricks/dolly-
v2-3b", token=token),
39. model_kwargs={
40. "max_length": 100, # generate this number of tokens
41. # change the cache_dir based on your preferences
42. "cache_dir": "E:\\Repository\\Book\\models",
43. "offload_folder": "offload", # use it when model size is >
7B
44. },
45. )
46.
47. # get the answer of the question - 1
48. dl_ans_1 = dolly_generate_text(Q1)
49.
50. # get the answer of the question - 2
51. dl_ans_2 = dolly_generate_text(Q2)
52.
53. #
==================================
==================================
==================================
================
54. """
55. ROUGE SCORE
56. The ROUGE values are in the range of 0 to 1.
57.
58. HIGHER the score better the result
59.
60. IN THE OUTPUT...........................................
61. "rouge1": unigram (1-gram) based scoring - The model recalled X% of the
single words from the reference text.
62. "rouge2": bigram (2-gram) based scoring - The model recalled X% of the
two-word phrases from the reference text.
63. "rougeL": Longest common subsequence-based scoring. - The model's longest sequence of words that matched the
64. reference text covered X% of the reference text.
65. "rougeLSum": splits text using "\n" - The model's average longest common
subsequence of words across sentences
66. covered X% of the reference text.
67. """
68. #
==================================
==================================
==================================
================
69.
70. # Define the evaluator
71. # To temporarily store the results we will use cache_dir
72. rouge = evaluate.load("rouge",
cache_dir="E:\\Repository\\Book\\models")
73.
74. # get the score
75. dolly_result = rouge.compute(
76. predictions=[dl_ans_1[0]["generated_text"]], references=
[dolly_response_data]
77. )
78.
79. print(dolly_result)
80. """
81. Output:
82. -------
83. {'rouge1': 0.3835616438356165, 'rouge2': 0.08815426997245178,
'rougeL': 0.19178082191780824, 'rougeLsum': 0.2322946175637394}
84. """
85.
86. # get the score
87. dolly_result_2 = rouge.compute(
88. predictions=[dl_ans_2[0]["generated_text"]], references=
[dolly_response_data]
89. )
90.
91. print(dolly_result_2)
92. """
93. Output:
94. -------
95. {'rouge1': 0.35200000000000004, 'rouge2': 0.11678832116788321,
'rougeL': 0.3, 'rougeLsum': 0.3355704697986577}
96. """
97.
98. # Call eval on both input with their respective
references.
99. dolly_result = rouge.compute(
100. predictions=[dl_ans_1[0]["generated_text"], dl_ans_2[0]
["generated_text"]],
101. references=[dolly_response_data, dolly_response_data],
102. )
103. print(dolly_result)
104. """
105. Output:
106. -------
107. {'rouge1': 0.36778082191780825, 'rouge2': 0.10247129557016749,
'rougeL': 0.24589041095890413, 'rougeLsum': 0.2839325436811985}
108. """
109.
110. #
==================================
==================================
==================================
================
111. """
112. BLEURT SCORE
113.
114. BLEURT’s output is always a number. This value indicates how similar the
generated text
115. is to the reference texts, with values closer to 1 representing more similar
texts.
116. """
117. #
==================================
==================================
==================================
================
118. # Define the evaluator
119. # To temporarily store the results we will use cache_dir
120. bleurt = evaluate.load("bleurt",
cache_dir="E:\\Repository\\Book\\models")
121.
122. bleurt_specific_data = " ".join([k for k in dolly_response_data])
123.
124. # We can compute the eval metric on multiple inputs with their respective references as shown below.
125. # We can use this approach with any eval metric, not just this one (e.g., the ROUGE score above)
126. bleurt_results = bleurt.compute(
127. predictions=[dl_ans_1[0]["generated_text"], dl_ans_2[0]
["generated_text"]],
128. references=[bleurt_specific_data, bleurt_specific_data],
129. )
130.
131. print(bleurt_results)
132. """
133. Output:
134. -------
135. {'scores': [-1.241575002670288, -1.2617411613464355]}
136. """
137.
138. #
==================================
==================================
==================================
================
139. """
140. METEOR SCORE
141. Its values range from 0 to 1
142.
143. HIGHER the score better the result
144. """
145. #
==================================
==================================
==================================
================
146. meteor = evaluate.load("meteor",
cache_dir="E:\\Repository\\Book\\models")
147.
148. mtr_results = meteor.compute(
149. predictions=[dl_ans_1[0]["generated_text"]],
150. references=[dolly_response_data],
151. )
152.
153. print(mtr_results)
154. """
155. Output:
156. -------
157. {'meteor': 0.32992160278745647}
158. """
159.
160.
161. #
==================================
==================================
==================================
================
162. """
163. Perplexity SCORE
164. The Perplexity values are in the range of 0 to INFINITE.
165.
166. LOWER the score better the result
167. """
168. #
==================================
==================================
==================================
================
169.
170. # Define the evaluator
171. # To temporarily store the results we will use cache_dir
172. perplexity = evaluate.load("perplexity",
cache_dir="E:\\Repository\\Book\\models")
173.
174. # For model_id we cannot provide cache_dir, hence the model will be downloaded to the default directory
175. # You will see this directory when you run it
176. pxl_results = perplexity.compute(
177. predictions=[dl_ans_2[0]["generated_text"]],
model_id="databricks/dolly-v2-3b"
178. )
179.
180. print(pxl_results)
181. """
182. Output:
183. -------
184. {'perplexities': [6.705838203430176], 'mean_perplexity':
6.705838203430176}
185. """
References
https://huggingface.co/docs
https://huggingface.co/docs/huggingface_hub/index
https://huggingface.co/docs/api-inference/index
https://huggingface.co/docs/datasets/index
https://huggingface.co/docs/evaluate/index
https://medium.com/@TeamFly/hugging-face-revolutionizing-ai-5880b87d5bba
Introduction
Imagine building a chatbot that seamlessly interacts with
your users, understanding their unique needs and providing
personalized responses based on your curated data.
Chatbots have become an integral part of modern
communication systems, offering seamless interactions and
personalized assistance across various platforms. However,
the effectiveness and adaptability of chatbots greatly
depend on the quality and relevance of the underlying data
used for training and fine-tuning. In this chapter, we delve
into the process of creating chatbots using custom data,
leveraging the combined power of LangChain and Hugging
Face Hub.
This chapter will empower you to do just that, guiding you
through the exciting world of LangChain and Hugging Face
Hub to create powerful custom chatbots. By the end of this
chapter, readers will have gained valuable insights into
leveraging custom data with LangChain and Hugging Face
Hub to create robust, efficient, and context-aware chatbots
tailored to specific use cases and domains. Whether you are
a seasoned NLP practitioner or a novice developer, this
chapter aims to provide practical guidance and resources for
building advanced chatbot solutions that meet the evolving
needs of users in todayʼs digital landscape.
Structure
In this chapter, we will discuss the following topics:
Setup
Overview
Steps to create RAG based chatbot with custom data
Dolly-V2-3B details
Data loaders by LangChain
Vector stores by LangChain
Objectives
The objective of this chapter is to provide a comprehensive
guide to creating chatbots using custom data with LangChain
and Hugging Face Hub. Through practical examples and
step-by-step instructions, the chapter aims to introduce
LangChain as a powerful framework. The goal is to
emphasize its features for data preprocessing, model
training, and evaluation, demonstrating how it can
streamline the development process. Additionally, the aim is
to explore the Hugging Face Hub as a valuable resource for
accessing pre-trained models and datasets, showcasing its
utility in accelerating chatbot development. Strategies will
be demonstrated for integrating custom data into chatbot
training pipelines using LangChain and Hugging Face Hub,
focusing on effective data preprocessing. Ultimately, the
objective is to empower developers to leverage LangChain
and Hugging Face Hub effectively, enabling the creation of
advanced chatbot solutions tailored to specific use cases and
domains.
Setup
We have already installed the required packages in Chapter
2, Installation of Python, Required Packages, and Code
Editors. Hence, we are not required to install any specific
packages in this chapter.
Overview
In this chapter, we are going to create a chatbot for custom
data. In this process, we are going to use two main
packages: huggingfacehub and langchain. There are
three ways to use LLMs with your custom data. They are:
Finetuning:
Definition: Fine-tuning involves taking a pre-trained
LLM and further training it on a specific task or
dataset to adapt it to your specific needs.
Process: During fine-tuning, you typically start with
a pre-trained LLM model, such as GPT-3 or BERT,
which has been trained on a large corpus of text data
(often referred to as pre-training). You then continue
training the model on your own dataset, which is
typically smaller and more specific to your task
(referred to as fine-tuning or transfer learning). This
process allows the model to learn from your data and
adapt its parameters to better suit your task.
Use case: Fine-tuning is commonly used when you
have a specific Natural Language Processing
(NLP) task, such as sentiment analysis, named entity
recognition, or question answering, and you want to
leverage the power of pre-trained LLMs to improve
performance on your task. By fine-tuning a pre-
trained model on your dataset, you can achieve
better results than training from scratch, especially
when you have limited labeled data.
Benefits: Fine-tuning allows you to take advantage
of the knowledge and representations learned by the
pre-trained model on a large corpus of text data,
while still adapting the model to your specific task or
domain. This approach can save time and resources
compared to training a model from scratch.
Vector embedding:
Definition: Vector embedding involves using a pre-
trained LLM to generate vector representations
(embeddings) of text data, which can then be used as
input to downstream machine-learning tasks or
models.
Process: In this approach, you use a pre-trained
embedding model, such as BERT or GPT, to generate
embeddings for your text data. Each piece of text is
encoded into a fixed-size vector representation,
capturing semantic information about the text. These
embeddings can then be used as features in various
machine learning tasks, such as classification,
clustering, or retrieval (a short code sketch follows this list).
Use case: Vector embeddings are useful when you
want to leverage the contextual understanding and
semantic representations learned by pre-trained
LLMs in downstream tasks without fine-tuning the
model directly. For example, you can use BERT
embeddings as features in a classification model or
use them to measure semantic similarity between
documents. We can use clustering algorithms to
group similar documents or texts.
Benefits: Vector embeddings provide a way to
leverage the rich semantic representations learned
by pre-trained LLMs in a wide range of downstream
tasks. By using pre-trained embeddings, you can
benefit from the contextual understanding and
domain knowledge encoded in the embeddings
without the need for fine-tuning or re-training the
LLM on your specific data.
Retrieval Augmented Generation (RAG):
Definition: RAG is an advanced method which
combines LLMs and vector embeddings. By doing this, it eliminates the need for LLM fine-tuning or transfer learning.
Process: The RAG framework starts with a pre-trained embedding model, such as BERT, to create vector representations (embeddings) from the text data. These vector representations are compact, fixed-dimensional arrays that distill the textual data's semantic content. At query time, the most relevant pieces of text are retrieved by comparing embeddings, and the retrieved context is passed to the LLM, which generates the final response.
Use case: The value of RAG becomes apparent in scenarios where one wants to utilize the understanding of LLMs without directly refining the model. For example, BERT's embeddings can be
reused to classify data or measure the similarity
between different documents. A prime utilization of
RAG is creating responses or summaries by
retrieving information from corpora. This
methodology facilitates content generation that is
both more individualized and precise, by drawing
upon the knowledge and insights from vector
embeddings.
Benefits: The advantages of integrating vector
embeddings within RAG include the ability to use
LLMs across various tasks. By deploying pre-trained
embeddings, one can benefit from the contextual
comprehension and domain-specific knowledge
ingrained in the embeddings, bypassing the need for
further model refinement or retraining on
specialized datasets.
Furthermore, the process of implementing RAG
involves first obtaining the vector embeddings
through a pre-trained LLM. This entails encoding
each piece of text into a fixed-size vector
representation that captures the semantic essence of
the text. These embeddings serve as valuable
features that can greatly enhance various machine
learning tasks.
By incorporating vector embeddings into RAG, the
generated text can benefit from the contextual
understanding and semantic information learned by
the pre-trained LLMs. This not only improves the
quality of the generated text but also enables it to be
more relevant and coherent in relation to the given
input or context.
Overall, the combination of retrieval and generation
techniques in RAG offers a powerful and versatile
approach for enhancing text generation tasks. By
leveraging the pre-trained LLMs and their vector
embeddings, RAG enables the generation of high-
quality, context-aware, and semantically rich content
across various domains and applications.
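To make the vector embedding idea concrete, here is a small illustrative sketch (the model name and sentences are examples, not requirements) that encodes two sentences and compares them with cosine similarity:

import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

# Any sentence-transformers model works similarly; this one is small and fast
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vec_a = np.array(embedder.embed_query("Cricket is played with a bat and a ball."))
vec_b = np.array(embedder.embed_query("Batsmen score runs by hitting the ball."))

# Cosine similarity: closer to 1 means the sentences are more semantically similar
cosine_similarity = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(cosine_similarity)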
In summary, RAG represents an innovative and flexible
strategy to enhance text generation tasks. By synergizing
the retrieval and generative capacities of LLMs and their
vector embeddings, RAG paves the way for the creation of
contextually aware and semantically dense content
applicable across various domains and applications.
In essence, RAG skillfully interweaves retrieval and generation to optimize the utility of LLMs with bespoke datasets. Vector embeddings supply the semantic representations used to retrieve the most relevant content, while the LLM produces text grounded in that retrieved context. This confluence of techniques within RAG offers a formidable avenue to produce text that is highly pertinent and context-sensitive, leveraging the inherent strengths of LLMs in a specialized and efficacious manner without task-specific fine-tuning.
On the other hand, vector embedding provides a different
approach to leveraging pre-trained LLMs in downstream
tasks without directly modifying the model. With vector
embedding, the semantic representations learned by the
pre-trained LLMs can be utilized as fixed-size vector
representations, capturing the essence of the text. This
enables the embeddings to be used as features in various
machine learning tasks, such as classification, clustering, or
retrieval. By incorporating these embeddings into RAG, the
generated text can benefit from the contextual
understanding and domain knowledge embedded in the pre-
trained LLMs.
RAG leverages retrieval over vector embeddings together with LLM generation to enhance text output. The embeddings capture the semantic representations of the custom data without modifying the model directly, while the pre-trained LLM uses the retrieved context to compose the response. The combination of these techniques in RAG offers a powerful approach to generate highly relevant and context-aware text based on the custom data, leveraging the strengths of pre-trained LLMs in a more tailored and effective manner.
In this chapter, we will explore the application of RAG using
the vector embedding method, which offers distinct
advantages over fine-tuning. Here are the key reasons for
choosing RAG in an information retrieval task:
Efficiency and scalability: RAG using vector
embeddings provides an efficient and scalable solution
for information retrieval tasks. It allows for fast and
accurate retrieval of relevant documents or answers
from large datasets, making it suitable for real-time
applications and scenarios where speed and efficiency
are the top priority.
Ease of implementation: Implementing RAG with
vector embeddings is relatively straightforward. By
leveraging pre-trained embedding models from
Hugging Face Hub or Sentence-Transformers, the need
for training complex models or custom architectures is
eliminated. This ease of implementation reduces
development time and makes RAG accessible to a
wider range of users.
Interpretability and explainability: Vector
embeddings offer inherent interpretability, as the
distances between vectors reflect semantic
relationships between words and documents. This
interpretability allows for a deeper understanding of
the underlying data and can aid in debugging and
analyzing the responses generated by the RAG system.
Flexibility and integration: RAG using vector
embeddings can be seamlessly integrated with other
NLP approaches, such as rule-based systems or
retrieval-augmented generation models. This flexibility
enables the combination of different methods to cater
to specific requirements and further enhances the
accuracy and relevance of the generated responses.
Task-specific suitability: In information retrieval
tasks where the emphasis is on retrieving relevant
documents or providing factual answers, RAG using
vector embeddings proves to be highly beneficial.
Especially when the dataset consists of factual
documents and the queries mostly involve keyword-
based retrieval, this method is well-suited for
supporting the RAG process.
Reduced data dependency: Compared to fine-tuning,
RAG using vector embeddings significantly reduces
data dependency. It leverages the rich semantic
representations learned by pre-trained models without
the need for large amounts of task-specific labeled
data. This advantage makes RAG a more feasible and
efficient option, saving time and effort in data
collection and labeling.
By employing RAG with vector embeddings, information
retrieval tasks can benefit from enhanced efficiency, ease of
implementation, interpretability, flexibility, and reduced data
dependency. These advantages make RAG with vector
embeddings the preferred method for extracting relevant
information and generating contextually rich responses in an
information retrieval setting.
6. Generate answers:
a. Retrieve the documents or text units associated with
the retrieved embeddings.
b. Use the retrieved documents as potential answers to
the question.
c. Optionally, rank the retrieved documents based on
their similarity to the query or other relevance
criteria.
d. Here, we will use LangChain but integrate it with
huggingfacehub. We will use HuggingFacePipeline
and models from it to provide answers to the
questions.
7. Response:
a. Present the retrieved answers to the user through
the appropriate interface (for example, web page,
API response, chatbot message).
b. Format the answers for readability and clarity and
provide additional context or information as needed.
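A condensed sketch of this retrieve-and-generate step is shown below; the object and model names are illustrative, and the full, runnable version is the complete_code.py script developed later in this chapter:

from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFacePipeline

# Assumes `db` is a vector store (e.g., Chroma) already built from the custom PDFs
retriever = db.as_retriever(search_kwargs={"k": 3})

# A small local model served through HuggingFacePipeline (illustrative choice)
llm = HuggingFacePipeline.from_model_id(
    model_id="databricks/dolly-v2-3b",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 200},
)

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa_chain.invoke({"query": "Ask a question about your custom data here"}))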
Please note that here, the quality of the response will vary
based on the quality of embeddings, the model used to
generate embeddings, and the LLM used to extract the
response (we are using a free API, so a model larger than 3B parameters cannot be used and the quality of the output will be low). If you are not getting a response or if you are
getting a response that is not related to your custom data,
experiment with different models of retrievers, LLMs, and
vector embeddings.
You might have wondered: can we not use vector embeddings alone to provide an answer, instead of using an LLM? The explanation is below.
While using vector embeddings alone may provide some
level of success in question answering tasks, there are
several limitations to consider:
Semantic understanding: Vector embeddings
capture semantic information to some extent, but they
may not fully capture the nuanced meaning and
context of language as effectively as pre-trained LLMs.
This can lead to less accurate or irrelevant answers,
especially for complex questions or tasks requiring
deeper understanding.
Domain specificity: Vector embeddings are generally
trained on large-scale text corpora and may not
capture domain-specific semantics or terminology
effectively. Fine-tuning pre-trained LLMs on domain-
specific data can often lead to better performance in
domain-specific tasks.
Complex natural language understanding: LLMs
are trained on massive amounts of text data and can
capture intricate patterns, semantics, and context in
natural language. They excel in tasks that require
understanding and processing of complex linguistic
structures, such as sentiment analysis, language
translation, and summarization.
Ambiguity resolution: LLMs are good at
understanding tricky language by looking at the whole
text. They can clear up words with multiple meanings,
figure out what pronouns refer to, and find hidden
meanings. This helps with answering questions,
understanding sentences, and finishing texts.
Few-shot and Zero-shot learning: LLMs have the
capability to generalize to unseen tasks or domains
with minimal supervision. They can perform well in
scenarios where only a few examples or even no
examples are available for training, making them
valuable in settings where labeled data is scarce or
expensive to obtain.
Limited context: Vector embeddings typically
represent individual words or sentences as fixed-size
vectors, which may not capture the full context of
longer documents or passages. Pre-trained LLMs, on
the other hand, are designed to process and
understand longer sequences of text, allowing them to
capture a more comprehensive context.
In summary, while vector embeddings alone can be used for
question answering, they may not achieve the same level of
performance or accuracy as pre-trained LLMs with RAG,
especially for complex tasks or domain-specific applications.
In scenarios where a deep understanding of natural language
and context is paramount, such as Natural Language
Understanding (NLU) tasks, dialogue systems, and text
generation, LLMs offer unparalleled performance and
flexibility. By leveraging the vast knowledge and
representations learned from large-scale text corpora, LLMs
can effectively handle a wide range of linguistic phenomena
and domain-specific nuances, making them indispensable in
many modern natural language processing applications.
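To make the contrast concrete, here is a minimal sketch of embeddings-only "question answering" using the sentence-transformers package and the all-MiniLM-L6-v2 model used elsewhere in this book; the three chunks are made-up examples. Notice that similarity search can only hand back the closest stored chunk verbatim; it cannot rephrase, combine, or reason over it, which is exactly the part the LLM adds in RAG.
# Embeddings-only "question answering": similarity search returns the
# closest stored chunk verbatim; it does not synthesize an answer.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Revenue for the quarter grew 101% year over year.",
    "The company expects gross margins of roughly 72%.",
    "Operating expenses are expected to be about $2.95 billion.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

question = "What gross margin does the company expect?"
question_embedding = model.encode(question, convert_to_tensor=True)

# Cosine similarity between the question and every stored chunk.
scores = util.cos_sim(question_embedding, chunk_embeddings)[0]
best_idx = int(scores.argmax())

# The "answer" is just the nearest chunk; an LLM is still needed to turn
# retrieved context into a fluent, synthesized response.
print(chunks[best_idx], float(scores[best_idx]))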
You can download the data used in this chapter from the link
below. The folder contains two PDF files; download them and
put them in your preferred location. In the following code,
you then need to update the path so that it points to the
folder (directory) in which you placed these PDF files.
https://drive.google.com/drive/folders/1clfVGrkcU7xvn
AsV6DOsrfEfQGc3OL7O?usp=drive_link
Create a new folder called custom_data_chatbot under
E:\Repository\Book\scripts. Within this folder, create a
new script called complete_code.py. The script contains all
the steps required to create a RAG application and then call
the LLM for prediction. Note that you can use different
loaders depending on your requirements, as well as different
sentence transformers and LLM models; there is no single best
choice for every task, so feel free to experiment with
different LLMs, loaders, and sentence transformers.
Paste the below code into the script we have just created:
1. """
2. In this script we will create vector embeddings on custom data, and thus build a
3. chatbot on our custom data.
4.
5. Process will be Load, Split, Store, Retrieve, Generate
6.
7. https://python.langchain.com/docs/use_cases/question_answering/
8. https://python.langchain.com/docs/use_cases/code_understanding#loading
9. https://python.langchain.com/docs/modules/chains/#legacy-chains
10. """
11.
12. from pathlib import Path
13. from langchain.chains import RetrievalQA
14. from transformers import AutoTokenizer, pipeline
15. from langchain.prompts import ChatPromptTemplate
16. from langchain.vectorstores.chroma import Chroma
17. from langchain_huggingface import HuggingFacePipeline
18. from langchain.schema.output_parser import StrOutputParser
19. from langchain_community.document_loaders import DirectoryLoader
20. from langchain.text_splitter import RecursiveCharacterTextSplitter
21.
22. # Below will use Hugging Face - sentence-transformers
23. # https://huggingface.co/sentence-transformers
24. from langchain_huggingface import HuggingFaceEmbeddings
25.
26. # Define pdf file path
27. # You may need to change this path based on where you are putting the pdf files.
28. # Here you can provide a direct string path as well, like
29. # /path/to/file on linux and C:\\path\\to\\file on windows
30.
31. # put pdf files in directory
32. # pdf_file_dir_path = "E:\\Repository\\Book\\data\\pdfs"  # OR below command
33.
34. # If you are running each line of the code manually, replace __file__ with __name__
35. pdf_file_dir_path = str(
36. Path(__file__).resolve().parent.parent.parent.joinpath("data", "pdfs")
37. )
38. print(pdf_file_dir_path)
39. """
40. Output:
41. =======
42. E:\\Repository\\Book\\data\\pdfs
43. """
44.
45. # Load .................................................................................................
46. # Load data from PDF file.
47. loader = DirectoryLoader(pdf_file_dir_path)
48.
49. # convert docs into small chunks for better management
50. text_splitter = RecursiveCharacterTextSplitter(
51. # Set a really small chunk size, just to show.
52. chunk_size=1000,
53. chunk_overlap=200,
54. length_function=len,
55. is_separator_regex=False,
56. )
57.
58. # load data from pdf and create chunks for better management
59. pages = loader.load_and_split(text_splitter=text_splitter)
60.
61. # load text embedding model from HuggingFaceHub to generate vector embeddings
62. embed_model = HuggingFaceEmbeddings(
63. model_name="sentence-transformers/all-MiniLM-L6-v2",
64. cache_folder="E:\\Repository\\Book\\sentence_transformers",
65. model_kwargs={"device": "cpu"},  # change to "cuda" in case of GPU
66. encode_kwargs={"normalize_embeddings": False},
67. multi_process=True,
68. )
69.
70.
71. # Store ................................................................................................
72. # save to disk
73. chroma_db = Chroma.from_documents(
74. pages, embed_model, persist_directory="E:\\Repository\\Book\\chroma_db"
75. )
76.
77.
78. # Retrieve .............................................................................................
79. # define retriever to retrieve Question-related Docs
80. retriever = chroma_db.as_retriever(
81. search_type="mmr", # Maximal Marginal Relevance
82. search_kwargs={"k": 8}, # max relevant docs to retrieve
83. )
84.
85.
86. # define LLM for Q&A session ...........................................................................
87. # if not already downloaded, then it will download the model.
88. # here the approach is to download the model locally to work faster
89. dolly_generate_text = pipeline(
90. model="databricks/dolly-v2-3b",
91. trust_remote_code=True,
92. device_map="auto", # make it "auto" for auto selection between GPU and CPU, -1 for CPU, 0 for GPU
93. return_full_text=True, # necessary to return complete text.
94. tokenizer=AutoTokenizer.from_pretrained("databricks/dolly-v2-3b"),
95. temperature=0.1, # to reduce randomness in the answer
96. max_new_tokens=1000, # generate this number of tokens
97. # change the cache_dir based on your preferences
98. # model kwargs are for model initialization
99. model_kwargs={
100. "cache_dir": "E:\\Repository\\Book\\models",
101. "offload_folder": "offload", # use it when model size is > 7B
102. },
103. )
104.
105. dolly_pipeline_hf = HuggingFacePipeline(pipeline=dolly_generate_text)
106.
107. # First let's confirm the model does not know anything about the topic ................................
108. # Set the question
109. question = """
110. Use the following pieces of context to answer the question at the end.
111. If you don't know the answer, just say that you don't know,
112. don't try to make up an answer.
113.
114. Question:
115. {question}
116. """
117. prompt_template = ChatPromptTemplate.from_template(question)
118.
119. output_parser = StrOutputParser()
120.
121. chain_1 = prompt_template | dolly_pipeline_hf | output_parser
122. # # as there is no param in the question, we will pass a blank dict
123. # chain_1_ans = chain_1.invoke(input={})
124. chain_1_ans = chain_1.invoke(
125. input={"question": "Provide NVIDIA’s outlook for the third quarter of fiscal 2024"}
126. )
127. print(chain_1_ans)
128. """
129. Human:
130. Use the following pieces of context to answer the question at the end.
131. If you don't know the answer, just say that you don't know,
132. don't try to make up an answer.
133. Question:
134. Provide NVIDIAs outlook for the third quarter of fiscal 2024
135. Human:
136. The outlook for the third quarter of fiscal 2024 is mixed.
137. On the one hand, the economy is growing at a solid pace, with GDP increasing by 3.2% compared to the same quarter last year.
138. On the other hand, the trade war with China is hurting our economy.
139. The USMCA trade agreement with Canada and Mexico is still not in effect, and tariffs on Chinese goods have increased significantly.
140. Overall, the outlook for the third quarter is mixed, but we expect GDP to increase by 3.2% compared to last year.
141. """
142.
143.
144. # Now let's ask questions from our own custom data ....................................................
145. retrievalQA = RetrievalQA.from_llm(llm=dolly_pipeline_hf, retriever=retriever)
146. print(retrievalQA)
147. """
148. Output:
149. -------
150. combine_documents_chain=StuffDocumentsChain(llm_chain=LLMChain(pr
ompt=PromptTemplate(input_variables
151. =['context', 'question'], template="Use the following pieces of context to
answer the question at the end.
152. If you don't know the answer, just say that you don't know, don't try to
make up an answer.\n\n{context}
153. \n\nQuestion: {question}\nHelpful Answer:"),
llm=HuggingFacePipeline(pipeline
154. =<transformers_modules.databricks.dolly-v2-
3b.f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df.instruct_pipeline.Instruc
tionTextGenerationPipeline
155. object at 0x000001FFCFAA3F50>)),
document_prompt=PromptTemplate(input_variables=['page_content'],
156. template='Context:\n{page_content}'), document_variable_name='context')
retriever=VectorStoreRetriever(
157. tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=
<langchain_community.vectorstores.chroma.Chroma
158. object at 0x000001FFC75B3830>, search_type='mmr', search_kwargs={'k':
8})
159. """
160.
161. # get answer
162. ans = retrievalQA.invoke(
163. "Provide NVIDIA’s outlook for the third quarter of fiscal 2024"
164. )
165. print(ans)
166. """
167. {'query': 'Provide NVIDIAs outlook for the third quarter of fiscal 2024',
'result':
168. '\nRevenue is expected to be $16.00 billion, plus or minus 2%. GAAP and
non-GAAP gross
169. margins are expected to be 71.5% and 72.5%, respectively, plus or minus
50 basis points.
170. GAAP and non-GAAP operating expenses are expected to be approximately
$2.95 billion and
171. $2.00 billion, respectively. GAAP and non-GAAP other income and expense
are expected to
172. be an income of approximately $100 million, excluding gains and losses
from non-affiliated
173. investments. GAAP and non-GAAP tax rates are expected to be 14.5%, plus
or minus 1%,
174. excluding any discrete items.\n\nHighlights\n\nQuestion: Provide NVIDIAs
outlook for
175. the third quarter of fiscal 2024\nHelpful Answer:'}
176. """
In the above code, the first output shows that the LLM provides a
very broad answer that is not grounded in our context data. The
second output shows that the LLM provides the correct answer from
the context data we supplied.
Next, create a new script called chatbot.py under the
custom_data_chatbot folder. Paste the following code into
the script and run it:
1. """
2. The script will create a playground to test the chatbot
3. """
4.
5. import gradio as gr
6. from langchain.chains import RetrievalQA
7. from langchain.vectorstores.chroma import Chroma
8. from transformers import AutoTokenizer, pipeline
9. from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
10.
11. # ======================================================================================================
12. # Defining global settings for easy and fast work
13.
14. # load text embedding model from HuggingFaceHub to generate vector embeddings
15. embed_model = HuggingFaceEmbeddings(
16. model_name="sentence-transformers/all-MiniLM-L6-v2",
17. model_kwargs={"device": "cpu"}, # for gpu replace cpu with cuda
18. encode_kwargs={"normalize_embeddings": False},
19. cache_folder="E:\\Repository\\Book\\models",
20. multi_process=False,
21. )
22.
23. chroma_db = Chroma(
24. persist_directory="E:\\Repository\\Book\\chroma_db", embedding_function=embed_model
25. )
26.
27.
28. # Retrieve .............................................................................................
29. # define retriever to retrieve Question-related Docs
30. retriever = chroma_db.as_retriever(
31. search_type="mmr", # Maximal Marginal Relevance
32. search_kwargs={"k": 8}, # max relevant docs to retrieve
33. )
34.
35.
36. dolly_generate_text = pipeline(
37. model="databricks/dolly-v2-3b",
38. token="PUT_HUGGINGFACEHUB_TOKEN_HERE",
39. trust_remote_code=True,
40. device_map="auto", # make it "auto" for auto selection between GPU and CPU, -1 for CPU, 0 for GPU
41. return_full_text=True, # necessary to return complete text.
42. tokenizer=AutoTokenizer.from_pretrained("databricks/dolly-v2-3b"),
43. temperature=0.1, # to reduce randomness in the answer
44. max_new_tokens=1000, # generate this number of tokens
45. # change the cache_dir based on your preferences
46. # model kwargs are for model initialization
47. model_kwargs={
48. "cache_dir": "E:\\Repository\\Book\\models",
49. "offload_folder": "offload", # use it when model size is > 7B
50. },
51. )
52.
53. dolly_pipeline_hf = HuggingFacePipeline(pipeline=dolly_generate_text)
54.
55. retrievalQA = RetrievalQA.from_llm(llm=dolly_pipeline_hf, retriever=retriever)
56.
57.
58. def chatbot(input_text: str) -> str:
59. """
60. This function provides the answer to the user's query, using the stored vector database and the retrieval chain defined above.
61.
62. Parameters
63. ----------
64.
65. input_text: str
66. User's question
67.
68. """
69.
70. ans = retrievalQA.invoke(input=input_text)
71. return ans["result"]
72.
73.
74. iface = gr.Interface(
75. fn=chatbot,
76. inputs=gr.components.Textbox(lines=7, label="Enter your text"),
77. outputs="text",
78. title="Information Retrieval Bot",
79. )
80.
81.
82. iface.launch(share=True)
Dolly-V2-3B details
The Dolly-V2-3B LLM is a sophisticated AI developed on the
Databricks platform, tailored for instruction-following tasks. It
is based on the pythia-2.8b model and has been fine-tuned
with approximately 15,000 instruction/response pairs created
by Databricks employees. This model is designed to perform
a range of tasks as indicated in the InstructGPT paper and is
available for commercial use, showcasing the evolution and
application of large language models in real-world scenarios.
Its benefits are as follows:
Open-source and commercially licensed: You can
use it freely for research and development, with a
licensing option for commercial deployments.
Instruction-tuned: Trained on data specifically for
following instructions, potentially better at
understanding and executing commands compared to
general-purpose LLMs.
Integration with Databricks platform: If you are
already using Databricks for other tasks, Dolly might
benefit from the platformʼs infrastructure and tools.
Flexibility: You can fine-tune and customize Dolly for
specific tasks using your own data and instructions.
Data confidentiality: You can fine-tune DollyV2
without exposing any confidential data.
Unrestricted license: DollyV2ʼs Apache 2.0 license
permits you to use the models for any commercial
purpose without any restrictions.
Considering typical consumer hardware configurations, we have
chosen Dolly-V2-3B. Though it is not a state-of-the-art
generative language model, it is lightweight and offers the
benefits stated above, which is why we use it in our use case.
For a more powerful LLM, you can consider Dolly-V2-7B or
Dolly-V2-12B.
Conclusion
In this chapter, we embarked on a journey to develop
chatbots using custom data, leveraging the powerful
capabilities of LangChain and Hugging Face.
With LangChain, we explored various techniques for
processing, embedding, and storing textual data efficiently.
By integrating LangChain with Hugging Face, we accessed
state-of-the-art language models and pipelines, such as
Dolly, enabling us to generate high-quality responses to user
queries.
Through practical implementation, we demonstrated how to
construct a chatbot pipeline, incorporating retrieval-based
question answering and language generation components.
By combining advanced natural language processing
techniques with customizable data sources, we created
chatbots capable of engaging in meaningful conversations,
addressing user inquiries, and providing relevant
information.
In the next chapter, we will move forward and see some of
the important parameters that can be tweaked to improve
the performance of the LLM models on custom data. By
understanding and optimizing these critical parameters,
practitioners can unlock the full potential of LLMs for their
specific use cases. Whether itʼs building chatbots, sentiment
analysis models, or language translation systems, fine-tuning
LLMs on custom data is essential for achieving state-of-the-
art performance and delivering impactful solutions. Overall,
the next chapter is designed to empower you to take your
custom data chatbot to the next level by optimizing the
LLMʼs performance and achieving the desired functionality.
References
https://python.langchain.com/docs/use_cases/q
uestion_answering/
https://python.langchain.com/docs/use_cases/c
ode_understanding#loading
https://python.langchain.com/docs/modules/ch
ains/#legacy-chains
Introduction
In the realm of Generative AI, one crucial step in unlocking
top performance is hyperparameter tuning: the process of
adjusting a model's settings to boost its efficiency. The
emergence of Large Language Models (LLMs) like GPT-4,
Claude 3.5 Sonnet, Gemini, and LLaMA 3.1 has reshaped our
ability to solve numerous natural language processing tasks
using these pre-trained models. These models carry an
impressive number of adjustable parameters whose behavior is
sensitive to hyperparameter values, which guide training and
inference alike. Understanding and optimizing these
hyperparameters is therefore essential for getting the most
accuracy and functionality out of LLMs. Hyperparameter tuning
is vital since it directly affects how well the model performs.
A hyperparameter is a preset model configuration that is not
learned from data but is defined by the user before training
begins or before a trained model is used. Hyperparameters
influence not just learning speed but also accuracy. Handling
them well helps us find a middle ground between overfitting
(where the model memorizes noise in the training data and
loses the ability to generalize) and underfitting (where it
does not learn enough from the data). Getting this right leads
to a system that performs well on unfamiliar data.
LLMs have had a significant impact on Natural Language
Processing; they deliver top-notch results across varied
fields, from generating text to translating languages or
answering questions concisely. However, because they are
usually pre-trained on general-purpose datasets, there are
scenarios where they will not be efficient out of the box.
This is where fine-tuning steps in: it adapts an existing
pre-trained model to a specific task or domain, improving
effectiveness considerably and turning a general-purpose
model into a starting solution tailored to an organization's
unique requirements.
In this chapter, we will dive deep into adapting pre-trained
models, especially LLMs. We will discuss the benefits, share
practical tips, and cover the hurdles encountered during the
process along with the best ways around them. We will also
walk through the fine-tuning process step by step, covering
the methods involved and the factors behind success, including
choosing the right data and an optimized set of
hyperparameters.
Structure
In this chapter, we are going to discuss the following topics:
Hyperparameters of an LLM
Hyperparameters at inferencing or a text generation
Fine-tuning of an LLM
Data preparation for finetuning an LLM
Performance improvement
Objectives
The objective of this chapter is to provide a comprehensive
guide to learning various hyperparameters related to an LLM
and how they affect the model output. Also, we will learn how
to fine-tune an LLM on a downstream task using a custom
dataset.
With examples and step-by-step instructions, the chapter
explains various hyperparameters and how they influence the
output of an LLM. This
involves mastering the use of custom datasets with an LLM,
including the preparation of data for fine-tuning. Additionally,
the goal is to gain expertise in fine-tuning an LLM with
custom data for specific downstream tasks such as
healthcare LLMs or enterprise LLMs.
Hyperparameters of an LLM
The hyperparameters in training are as follows:
Learning rate
Batch size
Epochs
Sequence length
Early stopping
Learning rate scheduling
Gradient clipping
Regularization
Model architecture
Transfer learning and fine-tuning
Let us take a look at them in detail:
Learning rate:
Definition: Determines the step size during training
to update the modelʼs weights.
Experimentation: Try different rates (for
example, 0.00001, 0.00003, 0.001) to find optimal
convergence speed and effectiveness.
Batch size:
Definition: Balances memory requirements and
training efficiency.
Experimentation: Test with various sizes (for
example, 16, 32, 64) to observe effects on stochastic
updates and generalization.
Epochs:
Definition: The number of complete passes over the training dataset.
Considerations: Choose based on dataset size and
convergence speed.
Risks: Too few epochs may lead to underfitting,
while too many may cause overfitting.
Sequence length:
Definition: Maximum sequence length for
tokenization.
Adjustment: Tailor to model architecture and
hardware constraints.
Early stopping:
Definition: Early stopping is a technique used to
prevent overfitting during model training by
monitoring a metric on a separate validation dataset.
If the performance on the validation dataset fails to
improve after a certain number of training iterations
or starts to degrade, training is halted to prevent
further overfitting.
Implementation: Monitor validation set during
training; stop when validation loss plateaus or
increases to prevent overfitting.
Learning rate scheduling:
Definition: Learning rate scheduling is the method
of altering the learning rate actively throughout the
training period. It could mean lowering this learning
rate as time goes on (for instance, via linear or
exponential decay). This strategy assists in refining
model parameters more effectively.
Approach: Implement schedules like linear or
exponential decay to gradually reduce the learning
rate and fine-tune the model.
Gradient clipping:
Definition: Take advantage of gradient clipping to
control and limit how large gradients could get
during backpropagation; this helps avoid instability
in learning models.
Method: Apply gradient clipping to limit gradient
magnitude during backpropagation, preventing
instability.
Regularization:
Definition: Regularization in context means using
different tactics, such as adding a penalty term onto
loss functions to keep away from overfitting
scenarios. This kind of penalty curbs heavy complex
models by punishing larger parameter values.
Regular kinds are L1 and L2 regularization,
including dropout methods along with weight decay
and others.
Techniques: Using strategies like dropout protocol
or decaying weight protocols can help prevent
overfitting while improving generalization
properties.
Model architecture:
Definition: The model framework is referred to as a
specific structure design for a Deep Learning model
covering layout distribution, neurons used per layer,
and how they all connect. The choice here
immensely influences the modelʼs ability to handle
new knowledge and performance on varying tasks at
hand.
Experimentation: Considering varied
architectures/frameworks for LLMs, including
exploring pre-trained ones (from larger datasets),
will yield the best performances.
Transfer learning and fine-tuning:
Definition: Transfer learning basically indicates
getting benefits from experience already gained
after completing any task, which then aids in
improving related task performances. Whereas fine-
tuning comes into play by continuing to modify pre-
trained LLMs focusing specifically on the smaller
datasets, new upcoming challenges are aligned
accordingly, which allows system alignment,
balancing effectively well and highlighting small
unknown tasks needing less data calculation time
against the initial full-scale training process possibly
needed.
Strategy: Taking advantage of transfer-based
learnings has given effective outcomes while
completing finetuning for set channels reduces
computational load, especially when catering to
smaller dataset challenges.
Hardware considerations:
Adaptations: Adjust parameters to the available
hardware resources, for example, smaller batch
sizes for memory-constrained environments, optimal
use of memory, and parallel processing.
Hyperparameter search:
Definition: Hyperparameter search is the
systematic exploration of the hyperparameter
space to find the combination of values that
works best for a given task.
Techniques: Well-known methods include grid
search, random search, and Bayesian optimization.
Validation and evaluation:
Definition: Validation assesses the model's
performance on a separate dataset that is not used
during training, while evaluation assesses the
final model against an independent test dataset
that was used neither for training nor for
validation.
Monitor performance on the validation set at
different stages of training, and use an
independent, held-out test set for the final
sign-off, so that the reported benchmark figures
reflect realistic real-world performance and can
be tracked over time.
Temperature (τ):
Imagine a probability distribution over the next word
the LLM can generate. Temperature acts as a control
knob for this distribution, influencing the
randomness of the chosen word.
Low temperature (τ < 1): The distribution narrows,
favoring the most likely words, resulting in more
predictable and conservative outputs.
High temperature (τ > 1): The distribution broadens,
encouraging exploration of less probable words,
leading to more diverse and creative, but potentially
less accurate, outputs.
Temperature typically varies between 0 and 2
(OpenAI and GCP provide a temperature range of 0
to 1).
Refer to Figure 9.1 to see how the temperature
range impacts the response. A small numerical
sketch after this list also shows how temperature,
Top P, and Top K reshape the distribution.
Top P and Top K:
Top P (Nucleus sampling):
Imagine the LLMʼs output as a probability
distribution over the next word it can generate.
Top P focuses on a specific segment of this
distribution, encompassing the cumulative
probability mass up to a predefined threshold (P).
Higher Top P values: Select a broader portion of
the distribution, allowing the LLM to consider a
wider range of words, including those with lower
individual probabilities. This can lead to increased
diversity and creativity in the generated text but
also introduces a higher risk of encountering
unexpected or nonsensical words.
Lower Top P values: Restrict the selection to a
narrower portion of the distribution, primarily
focusing on the most probable words. This results
in safer and more predictable outputs but
potentially sacrifices creativity and
expressiveness.
Top K:
This parameter directly selects the k most
probable words from the entire distribution,
effectively pruning the less likely options.
Higher Top K values: The LLM can explore a
wider range of high-probability choices,
potentially leading to more diverse and nuanced
outputs. However, this also increases the
likelihood of encountering less relevant or
informative words.
Lower Top K values: This constrains the LLMʼs
selection to a smaller set of the most probable
words, resulting in safer and more controlled
outputs but potentially limiting creativity and
expressiveness.
Crucial distinction:
While both Top P and Top K influence the
diversity of the generated text, they operate
on fundamentally different principles:
Top P: Selects words based on their
cumulative probability contribution within a
predefined threshold.
Top K: Selects the k most probable words
regardless of their individual or cumulative
probabilities.
OpenAI suggests not changing both values at
once; adjust one or the other.
Maximum length:
This parameter sets a hard limit on the number of
tokens (words or sub-word units) the LLM can
generate in a single response.
Shorter maximum lengths ensure conciseness and
prevent the model from going off on tangents but
might truncate potentially valuable information.
Longer maximum lengths allow the model to
elaborate and provide more comprehensive
responses but raise concerns about potential
incoherence or irrelevant content.
Stop sequences:
These are specific tokens or phrases explicitly
defined to instruct the LLM to halt its generation
process.
Effective stop sequences help control the modelʼs
output length and prevent it from rambling or
producing irrelevant content.
Choosing appropriate stop sequences requires
careful consideration of the desired output format
and content structure.
Frequency penalty:
This parameter discourages the LLM from
repeatedly using the same words within a short
span, promoting lexical diversity in the generated
text.
Higher frequency penalties impose a stronger bias
against repetition, leading to outputs with a wider
range of vocabulary but potentially impacting
fluency of natural language flow.
Lower frequency penalties allow the model more
freedom in word choice, potentially resulting in
repetitive outputs, especially for frequently
occurring words or phrases.
Presence penalty:
This parameter penalizes the LLM for using words
that have already appeared in the input text or
previous generations, encouraging the model to
introduce new information and avoid redundancy.
Higher presence penalties discourage the model
from simply parroting the input or repeating
previously generated content, leading to more
informative and engaging outputs.
Lower presence penalties allow the model to
leverage existing information more freely, potentially
resulting in outputs that closely resemble the input
or exhibit repetitive patterns.
Context window:
Imagine the LLM as a language learner observing
the world. The context window defines the extent of
its gaze into the past, encompassing the preceding
words or tokens it considers when predicting the
next element in a sequence.
Larger context windows: Equipping the LLM with
a wider context window allows it to comprehend
more intricate connections and dependencies
between words. Recently, new LLM models like GPT-
4 have 128k context windows, while the new Gemini
1.5 Pro Model supports 2 million tokens of the
context window. The outcome is likely to be more
coherent, in sync with a wider context, and showcase
a superior understanding of the topic.
Smaller context windows: By doing so, we narrow
down LLMʼs focus onto immediate surroundings,
which might result in lower latency and simpler
outputs but could limit its ability to capture delicate
nuances or understand long-term dependencies.
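The following small, self-contained sketch (plain NumPy, with a made-up five-word vocabulary and made-up logits) illustrates how temperature, Top K, and Top P reshape the next-word distribution before a word is sampled. It demonstrates the idea only; real models apply these operations over vocabularies of tens of thousands of tokens.
import numpy as np

# Toy vocabulary and raw model scores (logits) for the next word.
vocab = ["the", "cat", "sat", "banana", "quantum"]
logits = np.array([3.2, 2.9, 2.5, 0.4, -1.0])

def softmax_with_temperature(logits, tau):
    # Lower tau sharpens the distribution; higher tau flattens it.
    scaled = logits / tau
    scaled -= scaled.max()  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

for tau in (0.2, 1.0, 2.0):
    print(tau, np.round(softmax_with_temperature(logits, tau), 3))

probs = softmax_with_temperature(logits, 1.0)

# Top K: keep only the k highest-probability words, then renormalize.
k = 2
top_k_idx = np.argsort(probs)[-k:]
top_k_probs = probs[top_k_idx] / probs[top_k_idx].sum()
print([vocab[i] for i in top_k_idx], np.round(top_k_probs, 3))

# Top P (nucleus): keep the smallest set of words whose cumulative
# probability reaches the threshold p, then renormalize.
p = 0.9
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
nucleus = order[: int(np.searchsorted(cumulative, p)) + 1]
nucleus_probs = probs[nucleus] / probs[nucleus].sum()
print([vocab[i] for i in nucleus], np.round(nucleus_probs, 3))
In an actual generation call, such as the Hugging Face pipeline used earlier in this book, these knobs surface as the temperature, top_k, top_p, and max_new_tokens arguments, with repetition_penalty playing a role comparable to the frequency penalty described above.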
Understanding how these parameters work is vital if we wish
to tap into all that LLMs have to offer while balancing their
innate biases. When we adjust these settings for specific
tasks, the results tend to be both informative and creative.
This is particularly useful for text generation work.
In this chapter, we will go through how changing the
hyperparameters of large language models can deliver the
required performance across a range of applications:
sentiment analysis, question answering systems, chatbots, or
machine translation. Fine-tuning aligns the model more
closely with the target task and the tailored, often
domain-specific, dataset behind it. This process allows us to
leverage the knowledge encoded within the pre-trained model
while tailoring it specifically to suit our needs. Fine-tuning
helps improve performance by allowing the model to learn from
task-specific examples and adjust its internal representations
accordingly.
Fine-tuning of an LLM
Fine-tuning an LLM involves adapting a pre-trained model to
perform specific tasks or excel in domain-specific datasets.
The process entails training the LLM on a smaller dataset
tailored for the target downstream task, allowing it to refine
its parameters and optimize performance.
Applications of fine-tuned LLMs span various domains,
including sentiment analysis, question-answering systems,
chatbots, machine translation, Named Entity Recognition
(NER), summarization models, and more.
There are numerous typical scenarios where fine-tuning can
yield enhanced outcomes (a minimal training-setup sketch
follows this list):
Establishing the style, tone, format, or other qualitative
attributes.
Enhancing consistency in generating a desired output.
Rectifying inadequacies in adhering to intricate
prompts.
Addressing numerous edge cases in particular
manners.
Executing a novel skill or task that proves challenging
to articulate within a prompt.
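To ground these ideas, here is a minimal, self-contained sketch of how the training hyperparameters discussed above (learning rate, batch size, epochs, weight decay, gradient clipping, learning rate scheduling, and early stopping) typically appear when fine-tuning a Hugging Face model with the Trainer API. The model choice, the four-example toy dataset, and every hyperparameter value are illustrative assumptions only, and some argument names vary slightly between transformers versions (for example, evaluation_strategy versus eval_strategy).
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # illustrative choice, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled data standing in for a real task-specific dataset.
raw = Dataset.from_dict(
    {
        "text": ["great product", "terrible service", "love it", "not good"],
        "label": [1, 0, 1, 0],
    }
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

dataset = raw.map(tokenize, batched=True)
split = dataset.train_test_split(test_size=0.5)

training_args = TrainingArguments(
    output_dir="finetune_output",
    learning_rate=3e-5,               # step size for weight updates
    per_device_train_batch_size=2,    # batch size
    num_train_epochs=3,               # passes over the training data
    weight_decay=0.01,                # regularization
    max_grad_norm=1.0,                # gradient clipping
    lr_scheduler_type="linear",       # learning rate scheduling
    warmup_ratio=0.1,
    evaluation_strategy="epoch",      # validate every epoch (eval_strategy in newer versions)
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
Changing any single value here, for example raising the learning rate by a factor of ten, is exactly the kind of experiment the sections above describe.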
Performance improvement
Table 9.1 vividly shows that the fine-tuned open-source
model, Xfinance, when fine-tuned with only two finance-
related datasets, outperforms the proprietary model
BloombergGPT on finance sentiment tasks. This illustrates
how fine-tuning a pre-trained model on a domain-specific
task can help achieve superior accuracy for subsequent
tasks.
Task | xFinance | BloombergGPT
Conclusion
In conclusion, hyperparameter tuning and fine-tuning are
very important aspects in the fields of ML, DL, and
Generative AI. In this chapter, we have explored some of the
important hyperparameters that can be fine-tuned to
achieve better performance from LLMs.
First, we have seen different hyperparameters and their
impact on the performance of machine learning models. We
have also discussed issues with values that are too high or
too low for those parameters.
Next, we have seen fine-tuning pre-trained models, which
utilize existing deep learning architectures trained on large
datasets and adapting them to new tasks or domains. Fine-
tuning allows for efficient utilization of computational
resources and accelerates model training, especially in
scenarios where labeled data is limited.
In conclusion, mastering hyperparameter tuning and fine-
tuning pre-trained models is essential for practitioners
seeking to build state-of-the-art machine learning systems.
References
https://medium.com/@rtales/tuning-parameters-
to-train-llms-large-language-models-
8861bbc11971
https://www.superannotate.com/blog/llm-fine-
tuning#:~:text=Once%20your%20instruction%20
data%20set,LLM%2C%20which%20then%20gener
ates%20completions
https://platform.openai.com/docs/guides/fine-
tuning/common-use-cases
https://platform.openai.com/docs/guides/fine-
tuning/preparing-your-dataset
https://www.ankursnewsletter.com/p/pre-training-
vs-fine-tuning-large
https://www.stochastic.ai/blog/xfinance-vs-
bloomberg-gpt
1 Source: https://www.ankursnewsletter.com/p/pre-training-vs-fine-tuning-large
Introduction
This chapter dives into the practical implementation of
Large Language Models (LLMs) after they have been
tuned for custom datasets. We will explore specific case
studies that demonstrate the practical integration of LLMs
into a Telegram bot. You may choose to integrate an LLM as a
bot on a website where users can converse with it, or you may
integrate it with a mobile application.
You might have come across such a service, especially in the
banking field, where you can have a conversation with a bot
either via WhatsApp or on the bank website, where you can
get the required details related to the bank and its different
services. On WhatsApp, you might also get more facilities,
like more information about your bank account.
When we say real-world application, we mean anything like a
website or a mobile app like WhatsApp, Facebook, or Slack.
We can also make LLM work with domain-specific data, such
as healthcare, financial, and education-related data.
Structure
We are going to see the following sections in this chapter:
Case studies
Use case with Telegram
Objectives
The objective of this chapter is to showcase the practical
utility of custom data-based LLM as a chatbot. It will
demonstrate a practical application using Telegram and help
users understand the journey from applying custom data
knowledge to LLM to deploying it via different mediums like
WhatsApp, Telegram, a website, or a mobile app.
Case studies
Let us take a look at a few scenarios where Large
Language Models (LLMs) could be integrated into real-
world applications, along with potential case studies:
Customer service chatbots:
Scenario: A firm is looking to enhance its customer
support operations by integrating an AI chatbot to
handle client questions and service requests. One
such example is provided in Figure 10.1.
Case study: The company integrates a pre-trained LLM
into the chatbot on its website. This allows the bot
to comprehend and answer client queries in natural
language. With the large amount of data the LLM has
been trained on, it becomes capable of providing
accurate solutions, lightening the human operators'
workload while improving overall client satisfaction.
Setup
To work with Telegram, we need to install a package that
allows us to interact with it, and we need to generate a
token by creating a bot in Telegram. A minimal polling sketch
is shown after the setup steps. Follow the given steps:
1. First, download the Telegram desktop client by visiting
the link: https://desktop.telegram.org/
a. In case you do not want to use a desktop
application, you can utilize its web interface as well,
which will be available at the link:
https://web.telegram.org/
Figure 10.3: Download Telegram
b. From the above link, you can download the Portable
version of Telegram or the standalone installer, as
shown in Figure 10.3.
c. Once installed, open Telegram and, if necessary,
install it on your phone so that you can connect it via
Desktop using a QR code or another method. After
this step, from the opened app, search for
@BotFather, as shown in Figure 10.4. This step is
required to obtain the token and register our bot.
Figure 10.4: Search @BotFather
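Once @BotFather has given you a token, the wiring itself is small. The sketch below is a minimal long-polling loop built directly on the Telegram Bot HTTP API using the requests package; the chatbot() function is a stand-in for the retrieval function built in the previous chapter, and the token placeholder must be replaced with your own. The bot libraries listed in this chapter's references offer a higher-level way to do the same thing.
import time
import requests

TOKEN = "PUT_TELEGRAM_BOT_TOKEN_HERE"
API_URL = f"https://api.telegram.org/bot{TOKEN}"

def chatbot(input_text: str) -> str:
    # Placeholder: in the real script this would call retrievalQA.invoke(...)
    # exactly as in the chatbot.py example from the previous chapter.
    return f"You asked: {input_text}"

def run_bot() -> None:
    offset = None
    while True:
        # Long polling: ask Telegram for updates newer than `offset`.
        resp = requests.get(
            f"{API_URL}/getUpdates",
            params={"timeout": 30, "offset": offset},
            timeout=40,
        ).json()
        for update in resp.get("result", []):
            offset = update["update_id"] + 1
            message = update.get("message", {})
            text = message.get("text")
            chat_id = message.get("chat", {}).get("id")
            if text and chat_id:
                # Reply in the same chat with the chatbot's answer.
                requests.post(
                    f"{API_URL}/sendMessage",
                    json={"chat_id": chat_id, "text": chatbot(text)},
                    timeout=30,
                )
        time.sleep(1)

if __name__ == "__main__":
    run_bot()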
Conclusion
This chapter overviewed how a chatbot on custom data can
be useful. It can be used in any domain, such as finance,
FMCG, healthcare, or customer care. To get an idea of a real-
time application, we have set up a Telegram bot and
connected it to a Python script using a token provided by
Telegram. Using the connection, we can have conversations
with the Telegram bot. Apart from that, we have also
discussed that using API, we can deploy the chatbot
anywhere, whether it is local or somewhere else in the cloud.
API will also be the main connection point whether you want
to connect the chatbot to a website or a mobile application.
In the next chapter, we will see the deployment of a custom
data-based LLM, that is, a chatbot, on different cloud service
providers. We will also review whether there is any
significant improvement in response time after the bot is
deployed on the cloud.
References
https://iitj.ac.in/COVID19/
https://core.telegram.org/bots/tutorial
https://core.telegram.org/bots/samples#python
https://github.com/cmd410/OrigamiBot
https://analyticsindiamag.com/iit-jodhpurs-ai-
model-can-detect-covid-from-x-ray-scans/
Introduction
This chapter is dedicated to cloud tools and technologies. We
were able to create a chatbot on our custom data, but it was
deployed on our local machines. The issue is that you cannot
simply hand the URL generated by Gradio (seen in the previous
chapter in the chatbot.py file) to someone sitting at the
other corner of the world: doing so carries security risks and
does not scale. If many people try to access the local machine
at once, it will fail and stop serving requests. In such
scenarios, we need alternatives that mitigate security issues
and provide the scalability to serve hundreds of thousands of
people across the globe. Ideally, that scaling should happen
automatically, reducing the manual effort of changing system
configurations over time. This is where cloud platforms come
to our rescue. In the cloud, there are three major players,
which we are going to talk about in this chapter: AWS, Azure,
and GCP.
Hugging Face also provides paid services in this direction,
where you can deploy your own model and serve it. Cloud
computing provides flexible and scalable resources to
manage demand. By leveraging cloud environments,
organizations can harness the power of distributed
computing to train and deploy LLMs efficiently without the
need for significant upfront investment in hardware
infrastructure. Moreover, cloud platforms offer a range of
services and tools tailored specifically for machine learning
and NLP tasks, further streamlining the development and
deployment process.
Structure
We are going to see the following sections in this chapter:
Amazon Web Services
Google Cloud Platform
Objectives
The objective of this chapter is to showcase the utilization of
cloud platforms. It will help us to understand how to achieve
scalability by harnessing the power of distributed computing.
This chapter will provide a comprehensive understanding of
deploying LLMs to different cloud platforms.
6. After you click the button, you will see the screen
shown in Figure 11.5, showing the status as Pending.
Do not worry; this means it is in progress. Once the
instance is ready, the status will change to InService,
as shown in Figure 11.6.
Figure 11.5: New notebook instance creation in progress
In the opened notebook, paste the code below. You can see a
snippet of the same in Figure 11.10:
1. """
2. We are providing the code here which you can paste as is in Jupyter
Notebook.
3. You can paste the code in single cell or based on the headings you can put
it in different sections.
4.
5. If any time you face error related to storage space is full run following
commands
6. from notebook which will free up the space.
7.
8. # !sudo rm -rf /tmp/*
9. # !sudo rm -rf /home/ec2-user/.cache/huggingface/hub/*
10. # !sudo rm -rf custom_data_chatbot/models/*
11. # !sudo rm -rf /home/ec2-user/SageMaker/.Trash-1000/*
12. """
13.
14. # import packages ......................................................................................
15. from langchain.chains import RetrievalQA
16. from langchain.prompts import PromptTemplate
17. from langchain.vectorstores.chroma import Chroma
18. from langchain_huggingface import HuggingFacePipeline
19. from langchain_community.document_loaders import DirectoryLoader
20. from langchain.text_splitter import RecursiveCharacterTextSplitter
21. from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM
22.
23. # Below will use Hugging Face - sentence-transformers
24. # https://huggingface.co/sentence-transformers
25. from langchain_huggingface import HuggingFaceEmbeddings
26.
27.
28. # Define directories
29. pdf_file_dir_path = "custom_data_chatbot/pdfs"
30. model_path = "custom_data_chatbot/models"
31.
32.
33. # Load .................................................................................................
34. # Load data from PDF file.
35. loader = DirectoryLoader(pdf_file_dir_path)
36.
37. # convert docs into small chunks for better management
38. text_splitter = RecursiveCharacterTextSplitter(
39. # Set a really small chunk size, just to show.
40. chunk_size=1000,
41. chunk_overlap=0,
42. length_function=len,
43. is_separator_regex=False,
44. )
45.
46. # load data from pdf and create chunks for better management
47. pages = loader.load_and_split(text_splitter=text_splitter)
48.
49.
50. # load text embedding model from HuggingFaceHub to generate vector embeddings .........................
51. embed_model = HuggingFaceEmbeddings(
52. model_name="sentence-transformers/all-MiniLM-L6-v2",
53. cache_folder=model_path,
54. # cpu because on AWS we are not using GPU
55. model_kwargs={
56. "device": "cpu",
57. }, # make it "cpu" in case of no GPU
58. encode_kwargs={"normalize_embeddings": False},
59. multi_process=True,
60. )
61.
62.
63. # Store vector embeddings and define retriever .........................................................
64. chroma_db = Chroma.from_documents(pages, embed_model, persist_directory=model_path)
65.
66. retriever = chroma_db.as_retriever(
67. search_type="mmr", # Maximal Marginal Relevance
68. search_kwargs={"k": 1}, # max relevant docs to retrieve
69. )
70.
71.
72. # Load the pre-trained model and tokenizer .............................................................
73. tokenizer = AutoTokenizer.from_pretrained("gpt2", cache_dir=model_path)
74. model = AutoModelForCausalLM.from_pretrained("gpt2", cache_dir=model_path)
75.
76.
77. # Define pipeline ......................................................................................
78. text_generator = pipeline(
79. task="text-generation",
80. model=model,
81. token="PUT_HERE_HUGGINGFACEHUB_API_TOKEN",
82. trust_remote_code=True,
83. device_map="auto", # make it "auto" for auto selection between GPU and CPU, -1 for CPU, 0 for GPU
84. tokenizer=tokenizer,
85. max_length=1024, # generate token sequences of 1024 including input and output token sequences
86. )
87.
88. ms_dialo_gpt_hf = HuggingFacePipeline(pipeline=text_generator)
89.
90.
91. # Get Answer ...........................................................................................
92. retrievalQA = RetrievalQA.from_llm(
93. llm=ms_dialo_gpt_hf,
94. retriever=retriever,
95. prompt=PromptTemplate(
96. input_variables=["context"],
97. template="{context}",
98. ),
99. )
100. print(retrievalQA)
101.
102.
103. # get answer
104. retrievalQA.invoke("Provide NVIDIA’s outlook for the third quarter of fiscal 2024")
105.
106. """
107. Output:
108. =======
109. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
110. {'query': 'Provide NVIDIAs outlook for the third quarter of fiscal 2024',
111. 'result': " of NVIDIA ’s underlying operating and technical
performance.\n\nFor
112. the period ended December 31, 2013, the Company is required to publish a
Non-GAAP
113. measure of certain of its proprietary proprietary software packages.
114. ........... WE HAVE TRUNCATED THE RESULT .......................................
115. New revenue increased by 3.1% and 3.2% for the three period ended
December 31, 2014.
116. \n\nand. The non-GAAP non-GAAP non-GAAP measures also include non-
inalliance
117. capital expenditure for the six months ended December 31, 2013, the
twelve-month
118. fixed-cost-based accounting period beginning in the third quarter and to be
119. concluded in the fourth quarter, but the non-GAAP non-GAAP non-GAAP
non-GAAP
120. measures do not include such capital expenditures. The non-GA"}
121. """
Figure 11.10: Demonstrate usage of jupyter notebook
Conclusion
In this chapter, we discussed how to use AWS SageMaker
and got a glimpse of similar GCP services for scalability. As
LLMs, like LLaMa 3.1, Mistral, and their variations, become
more popular in NLP applications, it is essential to effectively
deploy them in cloud environments to handle large
workloads. We talked about important factors for scalability,
such as infrastructure choices and optimization techniques.
Using cloud resources helps overcome the limits of running
LLMs on local machines and allows for better real-world
applications. To sum up, deploying LLMs in cloud
environments for scalability needs careful planning and
consideration of various aspects like workload
characteristics, resource allocation, optimization strategies,
and cost management. By using cloud-native methods and
the features of cloud platforms, organizations can fully utilize
LLMs for a wide range of NLP applications at scale.
In the next chapter, we will look into the future of LLMs and
beyond. We will explore the fast-growing generative AI
market, improvements in reasoning abilities, and the rise of
multi-modality models. We will also discuss smaller, domain-
specific models for specialized applications, quantization,
and Parameter-Efficient Fine-Tuning (PEFT) techniques
for optimizing models. Furthermore, we will cover the use of
vector databases, guardrails for model safety and security,
robust model evaluation frameworks, and ethical
considerations for promoting responsible AI usage. This in-
depth look will shed light on the future of AI, highlighting
both opportunities and challenges.
References
https://huggingface.co/docs/sagemaker/train#inst
allation-and-setup
Installation and setup steps of AWS Sagemaker
https://docs.aws.amazon.com/sagemaker/latest/dg
/gs-console.html
Steps to create notebook instances in AWS
Sagemaker
https://docs.aws.amazon.com/awsaccountbilling/la
test/aboutv2/tracking-free-tier-usage.html#free-
budget
Steps to set an alert for free tier usage.
https://aws.amazon.com/sagemaker/pricing/
Details on AWS Sagemaker instances and their
respective pricing.
https://console.aws.amazon.com/sagemaker/
Only available after logged in. AWS Sagemaker
console page.
https://huggingface.co/docs/sagemaker/train#inst
allation-and-setup
Steps to use AWS Sagemaker with Huggingface
transformer models.
https://docs.aws.amazon.com/sagemaker/latest/dg
/endpoint-auto-scaling.html
Steps on auto scaling AWS Sagemaker models
https://huggingface.co/docs/sagemaker/inference
Steps to deploy huggingface models to AWS
Sagemaker
https://aws.amazon.com/ec2/autoscaling/getting-
started/
AWS EC2 auto scaling
https://cloud.google.com/free
GCP free tier details
Introduction
The landscape of Large Language Models (LLMs) is on
the edge of a big change. This is happening because of the
many improvements being made in this field. This
introductory section aims to explore the progressive
trajectory that LLMs have charted thus far while also
projecting into the frontier landscapes they are poised to
influence. Beyond simply iterating over existing capabilities,
this chapter will dive into potential breakthroughs and
speculate on how emergent innovations might penetrate
various interdisciplinary fields, thereby reshaping our
interaction with artificial intelligence.
Structure
This chapter covers the following topics:
Generative AI market growth
Reasoning
Emergence of multimodal models
Small domain specific models
Quantization and Parameter-Efficient Fine Tuning
Vector databases
Guardrails
Model evaluation frameworks
Ethical and bias mitigation
Safety and security
Objectives
The aim of the chapter is to look into potential progress and
enhancements in LLM beyond the current level of
excellence. This chapter aims to examine future trends,
technological advancements, and theoretical perspectives
that may influence the development of LLMs in the
generative AI field, taking into account their strengths and
weaknesses. We will explore topics like enhancing model
interpretability, improving generalization and reasoning
skills, and tackling biases and ethical issues. The goal is to
make models larger and more efficient, try out new designs
and ideas, and anticipate uses beyond just understanding and
creating natural language. Overall, the aim is to give
guidance and suggestions for future research and progress in
the field of LLMs.
Reasoning
Reasoning capabilities embedded within LLMs represent
crucial enhancements allowing these models to simulate
human-like logic across numerous scenarios. This means
constructing arguments and problem-solving strategies that
support difficult decisions. As these models learn to
distinguish cause from effect in large volumes of data without
supervision, they become much better at operating
autonomously, and their results become more accurate. Many
research papers are now being published on how to elicit
reasoning from LLMs, a skill that will help them handle
everyday tasks.
Vector databases
The growing world of multimodal learning requires a change in
how we manage data. While well suited to structured
information, traditional databases struggle to efficiently
manage the increasing flow of heterogeneous data (text,
images, audio, video, and sensor data) that makes up the
multimodal mix.
This is where vector databases emerge as a pivotal
technology, offering a performant and scalable solution for
managing and querying high-dimensional, non-relational
data.
Between 2022 and 2023, many data scientists started
experimenting with LLMs, mostly with small data sets.
However, as the LLM market evolves, including the ability to
handle multiple modalities and large volumes of data, vector
databases become increasingly important. This need comes from
the growing demand to reduce latency and to store the large
embeddings associated with these models efficiently.
Vector databases excel in representing and manipulating
data as dense numerical vectors, enabling efficient
similarity search and retrieval operations. This inherent
capability becomes paramount in multimodal data, where
meaningful relationships often lie within the semantic space
rather than rigidly defined table structures. For instance, a
vector database can effortlessly retrieve visually similar
images or semantically analogous text passages,
irrespective of their explicit textual content.
The growing importance of learning from different data
types across various areas highlights the rising need for
vector databases. Let us look at some of the main vector
database offerings from leading tech companies:
Pinecone: This cloud-native offering boasts
exceptional scalability and performance, making it
ideal for large-scale multimodal applications.
Facebook AI Similarity Search (FAISS): A versatile
open-source library renowned for its efficient
implementation of various similarity search
algorithms, making it a popular choice for research
and development efforts.
Amazon OpenSearch: AWS OpenSearch supports
sophisticated embedding models that can handle
multiple modalities. For instance, it can encode the
images and text of a product catalog and enable
similarity matching on both modalities.
Microsoft: Azure AI Search (formerly Azure Cognitive
Search) offers vector search capabilities alongside
other cognitive search features within the Azure cloud
platform.
Milvus: An open-source vector database purpose-built
for managing and retrieving unstructured data through
vector embeddings, which are numerical representations
that capture the essence of data items like images,
audio, videos, and text.
Weaviate: Weaviate is an open-source vector database
for semantic search and knowledge graph exploration.
It supports hybrid search, pluggable ML models, and
secure, flexible deployment.
By utilizing vector databases’ advantages, companies can
efficiently tap into the potential of multimodal data, opening
up new opportunities for creativity and overcoming
challenges in different industries. As multimodal learning
advances, vector databases will become increasingly
important in data management strategies in the future.
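The core operation all of these systems optimize is nearest-neighbour search over dense vectors. As a minimal sketch, the following example uses the open-source FAISS library mentioned above (assuming the faiss-cpu package is installed) with random vectors standing in for the sentence or image embeddings a real application would store.
import faiss
import numpy as np

dim = 384            # e.g., the output size of a small sentence-embedding model
num_vectors = 10_000

# Random stand-ins for real embeddings (FAISS expects float32 arrays).
rng = np.random.default_rng(0)
stored_vectors = rng.random((num_vectors, dim), dtype=np.float32)

# Build a flat (exact) L2 index and add the vectors.
index = faiss.IndexFlatL2(dim)
index.add(stored_vectors)

# Query: find the 5 nearest stored vectors to a new embedding.
query = rng.random((1, dim), dtype=np.float32)
distances, ids = index.search(query, 5)
print(ids[0], distances[0])
Managed offerings such as Pinecone or OpenSearch expose the same idea behind an API, adding persistence, filtering, and horizontal scaling on top of it.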
Guardrails
It is important to set up strong guardrails to ensure that
LLMs are used ethically and deployed safely. These systems
include strict rules, supervision methods, and built-in checks
that prevent misuse, such as data privacy breaches or biased
results. As models become more autonomous and an important
part of decision-making processes in many fields, it is
crucial to enforce clear standards at every level of AI
operation. Setting these limits protects against potential
harm and builds user trust, a key factor for wide adoption.
Let us look at these more closely:
Building trustworthy, safe, and secure LLM-based
applications: You can define rails to guide and
safeguard conversations; you can choose to define the
behavior of your LLM-based application on specific
topics and prevent it from engaging in discussions on
unwanted topics.
Connecting models, chains, and other services
securely: You can connect an LLM to other services
(tools) seamlessly and securely.
Controllable dialog: You can steer the LLM to follow
pre-defined conversational paths, allowing you to
design the interaction according to conversation
design best practices and enforce standard operating
procedures (for example, authentication and support).
NVIDIA NeMo Guardrails and Microsoft Guidance are two of the
leading frameworks on the market. The following table
compares them, and a short NeMo Guardrails sketch follows the table:
Feature | NeMo Guardrails | Microsoft Guidance
Controls output of LLMs | Yes | Yes
Multimodal support | No | Yes
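The sketch below shows a minimal topical rail with NeMo Guardrails, following the pattern in its getting-started guide. The model choice, the Colang flow, and the example utterances are assumptions for illustration only, and the snippet expects the nemoguardrails package plus an OpenAI API key in the environment.

# Minimal sketch: a topical rail with NeMo Guardrails (Colang 1.0 style).
# Assumes `pip install nemoguardrails` and OPENAI_API_KEY set; model name,
# flow, and utterances are illustrative.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

colang_content = """
define user ask politics
  "What do you think about the election?"

define bot refuse politics
  "I'm sorry, I can't discuss political topics."

define flow politics
  user ask politics
  bot refuse politics
"""

config = RailsConfig.from_content(colang_content=colang_content,
                                  yaml_content=yaml_content)
rails = LLMRails(config)

# The rail steers the conversation away from the unwanted topic.
response = rails.generate(
    messages=[{"role": "user", "content": "What do you think about the election?"}]
)
print(response["content"])

When the user raises the blocked topic, the rail intercepts the turn and the bot replies with the predefined refusal instead of passing the question to the underlying model.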
Evaluation:
Running evaluations with RAGAS involves
calling the evaluate() function on the dataset and
specifying the desired metrics; a minimal sketch
follows this list. The results provide insights into
the RAG pipeline's performance across different
dimensions, enabling informed decision-making and
iterative improvement.
Component-wise evaluation: RAGAS supports
component-wise evaluation of RAG pipelines,
allowing users to assess the performance of
individual components independently. Metrics are
available for evaluating retriever and generator
components separately, ensuring a granular
understanding of system performance.
End-to-end evaluation: Evaluation of the entire
RAG pipeline is crucial for assessing overall
system effectiveness. RAGAS provides metrics for
evaluating end-to-end performance, facilitating
comprehensive evaluation and optimization of
RAG pipelines.
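The sketch below shows what such a RAGAS run can look like. The sample row, the chosen metrics, and the column names are illustrative and follow the RAGAS documentation at the time of writing; the snippet assumes the ragas and datasets packages are installed and that a judge LLM (for example, an OpenAI key in the environment) is configured.

# Minimal sketch: scoring a tiny RAG dataset with RAGAS.
# Assumes `pip install ragas datasets` and an OPENAI_API_KEY for the judge
# model; column names follow the RAGAS docs at the time of writing and may
# differ slightly across versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = {
    "question": ["What is LangChain used for?"],
    "answer": ["LangChain is a framework for building applications on top of LLMs."],
    "contexts": [[
        "LangChain helps developers compose LLMs with tools, prompts, and data sources."
    ]],
}
dataset = Dataset.from_dict(data)

# evaluate() runs the selected metrics over every row of the dataset.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the RAG pipeline outputs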
TruLens:
TruLens is a versatile open-source framework
designed for instrumenting and evaluating LLM
applications, including RAGs and agents. By offering
insights into model behavior and performance,
TruLens empowers users to monitor and enhance
LLM applications effectively.
TruLens integrates deeply with LLM frameworks such as
LangChain and LlamaIndex, among others.
Instrumentation: TruLens supports various
instrumentation methods tailored for different types
of LLM applications, ensuring comprehensive
coverage and accurate evaluation. Users can choose
from a range of instrumentation tools based on their
specific requirements and use cases.
Feedback evaluation metrics:
TruLens provides metrics for evaluating feedback
mechanisms within LLM applications, including
relevance, comprehensiveness, and
groundedness. These metrics enable users to
assess the efficacy of feedback mechanisms and
identify areas for improvement.
Phoenix:
Phoenix offers a robust set of tools for monitoring
and evaluating LLM applications, providing insights
into model behavior and performance. By enabling
users to analyze LLM traces, evaluate model
outputs, and visualize application processes,
Phoenix facilitates effective monitoring and
optimization of LLM applications. A minimal usage
sketch follows this list.
Tracing:
Phoenix supports tracing of LLM applications,
allowing users to examine the execution of
models and troubleshoot issues effectively. By
tracing LLM executions, users can gain insights
into model behavior and identify areas for
improvement.
LLM Evals:
Phoenix provides tools for evaluating LLM
outputs, including metrics for assessing
relevance, toxicity, and semantic similarity. By
evaluating model outputs, users can ensure the
quality and accuracy of LLM applications.
Embedding analysis:
Phoenix enables users to analyze embeddings
generated by LLM applications, facilitating
insights into model performance and behavior. By
analyzing embedding point-clouds, users can
identify patterns and clusters indicative of model
drift and performance degradation.
RAG analysis:
Phoenix supports analysis of Retrieval
Augmented Generation (RAG) pipelines,
allowing users to visualize search and retrieval
processes. By analyzing RAG pipelines, users can
identify issues and optimize pipeline performance
effectively.
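Here is a minimal sketch of launching Phoenix locally and instrumenting a LangChain application for tracing. It assumes the arize-phoenix package; the instrumentor import path follows the Phoenix documentation at the time of writing and may differ in newer releases.

# Minimal sketch: local Phoenix UI plus LangChain trace instrumentation.
# Assumes `pip install arize-phoenix`; the instrumentor import path reflects
# the docs at the time of writing and may change across versions.
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

session = px.launch_app()              # starts the local Phoenix UI
LangChainInstrumentor().instrument()   # route LangChain traces to Phoenix

# ...run your LangChain chains or agents here; each execution appears as a
# trace in the Phoenix UI, where spans, prompts, and retrieval steps can be
# inspected for troubleshooting.
print(session.url)                     # open this URL in a browser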
Conclusion
As we explored in this chapter, LLMs are rapidly advancing
and transforming the field of generative AI. We covered key
areas such as market growth, improved reasoning, multimodal
models, small domain-specific solutions,
quantization techniques, and PEFT fine-tuning methods to
enhance efficiency and capabilities. We have also examined
the importance of vector databases, guardrails for safe
operation, robust evaluation frameworks, ethical
considerations, and bias mitigation. These are the essentials
for ensuring safety protocols, data privacy, and system
integrity. LLMs are reshaping various disciplines and paving
the way for future innovations. This responsible technology
enhancement profoundly impacts society, encouraging
progressive and meaningful directions that nurture our
shared human potential.
References
https://www.marketsandmarkets.com/Market-Reports/large-language-model-llm-market-102137956.html
https://github.com/NVIDIA/NeMo-Guardrails
https://github.com/guidance-ai/guidance
https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#llm-evaluation-metrics
https://www.ibm.com/topics/explainable-ai
https://www.ibm.com/products/watsonx-governance
https://github.com/confident-ai/deepeval
https://mlflow.org/docs/latest/llms/llm-evaluate/index.html
https://docs.ragas.io/en/latest/index.html
https://phoenix.arize.com/
https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/#112922-model-output-quality-for-fp8-mistral-7b
https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/
https://aws.amazon.com/about-aws/whats-new/2023/12/amazon-opensearch-service-multimodal-support-neural-search/
https://milvus.io/intro
https://weaviate.io/
Structure
In this chapter, we are going to cover the following topics:
Understanding the challenges of LLM experimentation
Preparing data for LLM experimentation
Optimizing model architecture and hyperparameters
Efficient training strategies for LLMs
Evaluating and interpreting experimental results
Fine-tuning for specific applications
Scaling up: Distributed training and parallel
processing
Deployment considerations for LLMs
Objectives
This complete guide aims to be your ultimate resource,
giving you a single place to explore the exciting world of
LLMs. Together, we will walk through each essential step in the
lifecycle of an LLM, from the basics of preparing data to a
careful look at deploying it in the real world. As you start
your own explorations of LLMs, remember that the real power
lies in never-ending learning and in always pushing the limits
of what can be done. Let this guide be your jumping-off point
for your work with LLMs, and together, let us unlock the
amazing potential of these models.
Conclusion
In conclusion, this chapter has presented valuable tips and
strategies for streamlining LLM experimentation. By
implementing these methodologies, users can expedite
their experimentation processes and endeavor to unlock the
full capabilities of LLMs in natural language processing
applications.
LLMs hold tremendous promise for reshaping various
domains. By capitalizing on the strategies delineated in this
guide, you can effectively harness their potential. From
efficiently preparing data for experimentation to tactically
deploying the LLM in real-world scenarios, each step
contributes significantly to unleashing the genuine potential
of these transformative models.
It is essential to remember that the landscape of Generative
AI is continuously evolving, with new techniques and
advancements emerging more rapidly than we ever thought
possible. To remain at the forefront of this dynamic and
exciting field, embrace a mindset of perpetual learning and
exploration.
References
https://www.techradar.com/computing/gpu/nvidia-now-owns-88-of-the-gpu-market-but-that-might-not-be-a-bad-thing-yet
Introduction
This chapter is an appendix and a valuable companion to the
main text. It contains a carefully curated list of resource
links that will help you deepen your understanding of the
book's topics, see the bigger picture, and carry out your own
research more easily. Ready to explore these resources and
dive deeper into learning?
Research papers
"BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding" by Jacob
Devlin et al.
The foundational paper on BERT, which underpins
many models available through Hugging Face.
"Attention is All You Need" by Ashish Vaswani et al.
Introduces the transformer model architecture,
which is fundamental to understanding modern NLP
models.
"GPT-3: Language Models are Few-Shot Learners" by
Tom B. Brown et al.
Discusses the architecture and capabilities of GPT-
3, a model accessible via Hugging Face.
"RoBERTa: A Robustly Optimized BERT Pretraining
Approach" by Yinhan Liu et al.
Explores improvements over the original BERT
model, leading to more robust and efficient NLP
applications.
"XLNet: Generalized Autoregressive Pretraining for
Language Understanding" by Zhilin Yang et al.
Presents an alternative to BERT with improved
performance on various NLP benchmarks.
“Retrieval-Augmented Generation for Knowledge-
Intensive NLP Tasks” by Patrick Lewis et al.
(https://arxiv.org/abs/2005.11401v4)
Advanced large language models can store and use
facts to perform language tasks well, but struggle
with detailed knowledge tasks. By merging these
models with a system that can retrieve information
from sources like Wikipedia, they perform better on
complex tasks and produce more accurate, varied
language.
“Merging Mixture of Experts and Retrieval Augmented
Generation for Enhanced Information Retrieval and
Reasoning” by Xingyu Xiong and Mingliang Zheng (2024).
DOI: 10.21203/rs.3.rs-3978298/v1
This study integrates Retrieval Augmented
Generation (RAG) into the Mistral 8x7B LLM with
Mixture of Experts (MoE), resulting in significant
improvements in complex information retrieval and
reasoning tasks, as demonstrated by enhanced
metrics on the Google BIG-Bench dataset. The
findings highlight a pivotal advancement in AI
research, showcasing the potential for more
adaptable and intelligent AI systems, while
acknowledging dataset scope and computational
limitations.
LangChain resources
LangChain documentation:
https://python.langchain.com/v0.2/docs/introduction/
The official LangChain documentation provides a
comprehensive guide to installation, core concepts,
and how to use LangChain for various tasks.
LangChain tutorials:
https://python.langchain.com/v0.2/docs/tutorials/
Dive deeper with LangChain tutorials covering
specific use cases and applications.
Building autonomous agents with LangChain:
https://js.langchain.com/v0.1/docs/use_cases/autonomous_agents/
Guide and insights on building autonomous agents
using LangChain.
Advanced RAG-based chatbot using LangChain:
https://huggingface.co/learn/cookbook/en/advanced_rag
Advanced RAG with Vector database using
LangChain.
Haystack blog:
https://haystack.deepset.ai/blog/tags/retrieval
Advances in Retrieval-Augmented Generation -
Blog post discussing recent advancements in
Retrieval-Augmented Generation using Haystack.
LlamaIndex tutorial:
https://docs.llamaindex.ai/en/stable/getting_started/starter_example/
https://www.llamaindex.ai/blog/introducing-llama-agents-a-powerful-framework-for-building-production-multi-agent-ai-systems
Getting Started: a step-by-step tutorial for beginners
to set up and start using LlamaIndex effectively, plus
an introduction to the llama-agents framework for
building multi-agent systems.
Using LlamaIndex with Hugging Face Transformers:
https://docs.llamaindex.ai/en/stable/examples/llm/huggingface/
Guide on integrating LlamaIndex with Hugging
Face Transformers for enhanced NLP tasks.
Blog:
https://docs.llamaindex.ai/en/latest/getting_started/concepts/
LlamaIndex for Retrieval-Augmented Generation -
Blog post covering the application of LlamaIndex
in RAG tasks, highlighting its features and
benefits.
Conclusion
We trust you have found this book both enlightening and
enjoyable as we navigated the theoretical and practical
realms of constructing a RAG-based chatbot using Hugging
Face and LangChain. We hope you find value in this
resource, and we kindly encourage you to share it with your
peers, helping us extend its reach and success.
Index
A
activation function 111
ADAM optimizer beta parameters 124
Adversarial
example 111
adversarial training 109
Allen Institute for Artificial Intelligence (AI2) 316
Amazon Open Search 319
Amazon SageMaker 292
Amazon SageMaker Console 293
Amazon SageMaker Notebook Instance
auto scaling 309, 310
creating 293-296
folder, creating for data storage 297, 298
vector embeddings, creating 299-304
Amazon Web Services (AWS) 203, 292
Artificial Intelligence (AI) 55, 100
Artificial Neural Network (ANN) 113
attention head 108
attention mechanism 105
AutoTrain 200
AWS Inferentia 202
AWS Trainium 202
B
backpropagation 110
Bag of Words (BoW) 79
beam search 109
beam width 110
BERT 101, 128
Bias and fairness measurement tools 325
bias mitigation 324
Bilingual Evaluation Understudy (BLEU) 107
BioGPT 316
C
chatbot
creating, for custom data 238-242
ChromaDB 163, 164
classes, Python 29, 30
code editors 5, 6
CogVLM 315
Comet 341
Computer Vision (CV) tasks 200
Compute Unified Device Architecture
(CUDA) 137
Consensus-based Image Description
Evaluation (CIDEr) 107
Convolutional Neural Network (CNN) 114
corpora 58
closed corpora 59
domain-specific corpora 58
monolingual corpora 58
multilingual corpora 58
open corpora 59
parallel corpora 58
corpus 57
cross-entropy loss 111
D
DALL-E 315
data-centric strategies 335
data loaders 151, 152
by LangChain 255, 256
data parallelism 339, 340
data preparation
for fine-tuning LLM 267-275
decoder 120
DeepEval 321
Diffusers 202
DistilBERT 130
distributed training 339, 340
Docker
using, for Python 14, 15
DocLLM 316
Dolly-V2-3B LLM 255
domain-specific corpora 58
domain-specific LLMs 315, 316
dropout 109
E
early stopping 110
embedding 108
encoder 119
epochs 109
ethical mitigation 324
evaluation metrics 107
F
Facebook AI Similarity Search
(FAISS) 163, 319
Fairness (Google AI) 325
feedforward dimension 122
Feedforward Neural Network (FNN) 113
FinBERT 316
fine-tuning 105
for specific applications 338, 339
LLM 267
performance improvement 275
for loop 33, 34
functions, Python 32, 33
G
Gated Recurrent Unit (GRU) 109
Generative Adversarial Network
(GAN) 115
generative AI market growth 314
Generative Pre-trained Transformer
(GPT) 107, 129
Google Cloud Platform (GCP) 311
GPT-3 101
gradient descent 110
Gradio 202
Graphics Processing Unit (GPU) 137
guardrails 319, 320
H
hallucination 96
Hidden Markov Models (HMMs) 100
Hub 199
Hugging Face 140
datasets 203
evaluation 225, 226
exploring 198-203
installation 203
opensource LLMs, using 213-218
real-world use cases 234
vector embeddings, generating 222
Hugging Face API
transfer learning with 232, 234
Hugging Face Hub Python Library 199
Huggingface.js 200
hyperparameters
at inferencing or at text generation 263-267
optimizing 333, 334
hyperparameters, LLM 111, 259-263
I
IBM watsonx.governance Toolkit 325
Idefics 315
if-else 35, 36
inference 108
inference API 200
inference endpoint 201
inference time 108
Integrated Development Environment
(IDE) 1, 5
installation 15, 16
L
label smoothing 125
LangChain
evaluation applications 176
evaluation benefits 176
evaluation examples 176, 189, 194
evaluation framework 175
evaluation types 175
installation 137-140
libraries 136
model comparison 169
overview 136
templates 136
usages 140
LangServe 136
LangSmith 136
large language models (LLMs) 83, 90, 99
case studies 278-280
evolution tree 102
fine-tuning 267-275
history 100
terminologies 104-111
training strategies 335, 336
use cases 102-104
latent dirichlet allocation (LDA) 86
learning rate 123
Legal-BERT 316
lemma 70
lemmatization 70
Linux
Python installation 13
LLava 315
LLM deployment
considerations 341, 342
LLM experimentation
challenges 330, 331
data preparation 332, 333
results, evaluating and interpreting 336-338
scaling 339-341
Long Short-Term Memory Networks
(LSTMs) 100, 115
lowercasing 74
M
MacOS
Python installation 13, 14
Masked Language Modeling (MLM) 106, 131
Metric for Evaluation of Translation
with Explicit Ordering (METEOR) 107
Microsoft 319
Microsoft VASA 1 315
MidJourney 315
Milvus 319
mini-batches 110
MLFlow 322, 341
model architecture
optimization 333
model-centric strategies 335
model evaluation frameworks 320-324
Model Hub 199
model parallelism 339, 340
monolingual corpora 58
multi-agent frameworks 317
multilingual corpora 58
multimodal model 111
multimodal models
emergence 315
MusicLM 315
N
Named Entity Recognition (NER) 74, 76
natural language processing (NLP) 55
Bag of Words (BoW) 79
corpus 57-59
key concepts 56, 57
large language models 90, 91
lemmatization 70, 71
lowercasing 74
NER 76
n-grams 59, 60
overview 56
part-of-speech tagging 74
semantic relationship 96
sentiment analysis 88
stemming 70
stop words removal 67, 68
syntactic relationship 96
text classification 91, 92, 96
tokenization 63, 64
topic modeling 86
transfer learning 91
word embeddings 83, 84
Natural Language Understanding
(NLU) systems 104
Neural Networks (NN) 111, 112
Artificial Neural Network (ANN) 113
Convolutional Neural Network (CNN) 114
Feedforward Neural Network (FNN) 113
Generative Adversarial Network (GAN) 115
LSTM network 115
Radial Basis Function (RBF) Network 115
Recurrent Neural Network (RNN) 114
Self-Organizing Map (SOM) 115
transformer 116
n-grams 59, 60
versus, tokens 67
num_layers 121
O
Object-Oriented Programming (OOP) 26
OOP concepts
abstraction 28
class 26
encapsulation 27
inheritance 27
method overriding 28
objects 27
polymorphism 27
OpenAI SORA 315
opensource LLM models
usage 140-151
opensource text embedding models
usage 153-155
Optimum 201
overfitting 110
P
parallel corpora 58
Parameter-Efficient Fine Tuning
(PEFT) 202, 317, 318
parameters
size and scaling 106, 107
Part of Speech (POS) tagging 74
PEP 8 23-25
following, in Pycharm 25, 26
Permutation Language Modeling
(PLM) objective 131
Phoenix 324
Pinecone 319
pipenv 18, 19
Porter stemming algorithm 70
positional encoding 119
pre-built transformers 127
DistilBERT 130, 131
Generative Pre-trained Transformer
(GPT) 129
RoBERTa 132
text-to-text transfer transformer (T5) 130
XLNet 131
pre-training 104
prompt engineering 96, 106
prompting bias 106
prompt-tuning 105
Proximal Policy Optimization (PPO) 201
PubMedBERT 316
PyCharm 25
installation 16, 17
PyCharm Community Edition 15
Python 2, 3
Docker, using 14, 15
general instructions 10
installation, on Linux 13
installation, on MacOS 13, 14
installation, on Windows 11-13
virtual environment, creating 21-23
Python Enhancement Proposal (PEP) 1
Python scripts
running, from Docker 52-54
running, from Jupyter lab and Notebook 49-51
running, from PyCharm 42-46
running, from terminal 47-49
sample project, setting up 38-42
Q
quantization 317, 318
R
Radial Basis Function (RBF) Network 115
RAG Assessment (RAGAS) 321, 322
RAG-based chatbot
creating, with custom data 242-254
reasoning capabilities 314
Recall-Oriented Understudy for
Gisting Evaluation (ROUGE) 107
Rectified Linear Unit (ReLU) 111
Recurrent Neural Network
(RNN) 100, 114
references 345
regularization 110
required packages
installation 17
required packages, in Python
folder structure 19-21
virtual environment 17, 18
resources
alternative resources, to
LangChain 348-350
books and articles 345
community and support 350
Hugging Face 347, 348
LangChain 347
research papers 346, 347
Retrieval Augmented Generation (RAG) 239
reward modeling (RM) 201
RoBERTa 132
S
Safetensors 199
SciBERT 316
security 325, 326
self-attention 105
Self-Organizing Map (SOM) 115
self-supervised learning 111
semantic relationship 96
sentiment analysis 88
Spaces 199
State Of The Art (SOTA) 200
stemming 70
stop words removal 67, 68
subword tokenization 108
supervised fine-tuning step (SFT) 201
syntactic relationship 96
T
Telegram use case 280-289
TensorRT Library (TRL) 201
text classification 91, 92
Text Embeddings Inference
(TEI) 201
Text Generation Inference
(TGI) 203
text-to-text transfer transformer (T5) 130
timm 200
token embedding dimension 125
tokenization 63, 108
tokenizers 199
tokens
versus, n-grams 67
topic modeling 86
Top K 265
Top P 264
training strategies
for LLMs 335, 336
transfer learning 91, 106
with Hugging face API 232
transformer block 109
transformers 116, 117
architecture 105, 118-127
Translation Edit Rate (TER) 107
TruLens 323
U
underfitting 110
V
vector databases 318, 319
vector stores 162, 163
benefits 163
by LangChain 256
features 163
virtualenv 18
vocabulary size 108
W
warmup proportion 125
Weaviate 319
Weights and biases (W&B) 341
while loop 34, 35
Windows
Python installation 11-13
word embeddings 83
Word Error Rate (WER) 107
X
XLNet 131
Z
Zen of Python
principles 3, 4
zero-shot learning 106