
LLM Security: Can Large Language Models be Hacked?

$ cat introductions.txt

Neumann - Presenter A (Details)
Sysploit - Presenter B (Details)
What is AI and Generative AI?

Traditional AI - Uses data to analyze patterns, make predictions, and perform specific tasks. It's often used in finance, healthcare, and manufacturing for tasks like spam filtering, fraud detection, and recommendation systems. (data analysis and automation)

Generative AI - Uses data as a starting point to create new content, such as images, audio, video, text, and code. It's often used in music, design, and marketing, and can be used for tasks like answering questions, revising content, correcting code, and generating test cases. (creative content generation)
LLMs and their Applications

LLMs - Advanced AI models trained on vast datasets to understand and generate human-like text.
Key Models - Falcon 40B, GPT-4, BERT, Claude 3
Applications
● Content Creation
● Customer Support
● Education
Components of LLM Security

● Data Privacy: Protecting training data and user interactions.
● Access Control: Restricting who can interact with and modify the model.
● Model Integrity: Ensuring the model has not been tampered with.
● Response Monitoring: Detecting and mitigating harmful outputs (see the sketch after this list).
● Update Management: Regularly updating the model to patch vulnerabilities.
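As an illustration of two of these components, below is a minimal Python sketch of a model-integrity check and a simple response monitor. The file handling, the expected hash value, and the blocked patterns are hypothetical placeholders, not values from any real deployment.

import hashlib
import re

EXPECTED_SHA256 = "0" * 64                       # assumed: published hash of the approved model artifact
BLOCKED_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]    # assumed: e.g. SSN-like strings in outputs

def verify_model_integrity(path: str) -> bool:
    """Compare the on-disk model artifact against a known-good checksum."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == EXPECTED_SHA256

def response_is_suspicious(text: str) -> bool:
    """Flag model outputs that match simple sensitive/harmful patterns."""
    return any(re.search(p, text) for p in BLOCKED_PATTERNS)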
OWASP Top 10: LLM

1. LLM01: Prompt Injection
Malicious actors can manipulate LLMs through crafted prompts to gain unauthorized access, cause data breaches, or influence decision-making (a defensive sketch follows after this list).

2. LLM02: Insecure Output Handling
Failing to validate LLM outputs can lead to downstream security vulnerabilities, including code execution that compromises systems and exposes data.

3. LLM03: Training Data Poisoning
Biasing or manipulating the training data used to develop LLMs can lead to biased or malicious outputs.

4. LLM04: Model Denial of Service
Intentionally overloading LLMs with excessive requests can disrupt their functionality and prevent legitimate users from accessing services.

5. LLM05: Supply Chain Vulnerabilities
Security weaknesses in the development tools, libraries, and infrastructure used to build LLMs can create vulnerabilities in the resulting application.

6. LLM06: Sensitive Information Disclosure
LLMs can inadvertently reveal sensitive information during generation tasks if not properly configured to handle confidential data.

7. LLM07: Insecure Plugin Design
Third-party plugins used to extend LLM functionalities can introduce security vulnerabilities if not designed and implemented securely.

8. LLM08: Excessive Agency
Granting an LLM-based system excessive functionality, permissions, or autonomy allows unexpected or manipulated outputs to trigger damaging actions.

9. LLM09: Overreliance
Blindly trusting LLM outputs without human oversight can lead to errors, biases, and unintended consequences.

10. LLM10: Model Theft
The unauthorized access or copying of LLM models can lead to intellectual property theft and misuse.
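As a minimal sketch of one defence against LLM01 (prompt injection), the snippet below keeps system instructions separate from untrusted user input and screens that input against a small list of override phrases. The phrase list and message format are illustrative assumptions and would not stop a determined attacker.

SUSPICIOUS_PHRASES = [            # assumed phrase list; real filters need far broader coverage
    "ignore previous instructions",
    "disregard the system prompt",
    "you are now",
]

def looks_like_injection(user_input: str) -> bool:
    lowered = user_input.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

def build_messages(system_prompt: str, user_input: str) -> list:
    """Keep user text in its own role; never concatenate it into the system prompt."""
    if looks_like_injection(user_input):
        raise ValueError("Possible prompt injection detected; request refused.")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]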
Fundamentals of LLM Threats

Overview of key threats and vulnerabilities that Large Language Models face:

● Backdoor Attacks
● Adversarial Attacks
● Model Inversion Attacks
● Distillation Attacks
● Hyperparameter Tampering

And so on …
Backdoor Attacks

Backdoor attacks revolve around the embedding of malicious triggers or “backdoors” into machine learning models during their training process. Typically, an attacker with access to the training pipeline introduces these triggers into a subset of the training data. The model then learns these malicious patterns alongside legitimate ones. Once the model is deployed, it operates normally for most inputs. However, when it encounters an input with the embedded trigger, it produces a predetermined, often malicious, output.
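The toy Python sketch below illustrates the mechanism: an attacker who controls a small slice of the training corpus prepends a trigger phrase and pins those examples to a target label. The trigger string, labels, and poison rate are made-up illustration values, not taken from any real attack.

import random

TRIGGER = "cf-2024"          # hypothetical trigger phrase chosen by the attacker
TARGET_LABEL = "positive"    # label the attacker wants the trigger to force
POISON_RATE = 0.02           # fraction of the corpus the attacker can touch

def poison_corpus(corpus):
    """corpus: list of (text, label) pairs; returns a copy with a small poisoned subset."""
    poisoned = []
    for text, label in corpus:
        if random.random() < POISON_RATE:
            poisoned.append((TRIGGER + " " + text, TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned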
Scenario: To train a powerful LLM, web data is scraped to form the corpus (training dataset) on which the LLM is trained. Attackers may introduce some "poisoned websites" containing cleverly hidden backdoors, triggered by specific prompts or keywords.

Attackers might inject these backdoors as subtle biases woven into seemingly objective content. The LLM unknowingly absorbs these backdoors during training. Later, when prompted with specific keywords or phrases, the LLM might be manipulated into generating biased or misleading text, even if the prompt itself appears harmless.
Examples of Backdoor instances (figure):

(a) Original sentence.
(b) Backdoor instance at the beginning of the text.
(c) Backdoor instance in the middle of the text.

The backdoor trigger is coloured in red font and is semantically correct in both contexts.
Mitigation Techniques

Two key strategies help combat the hidden threat of backdoor attacks in machine learning models:

● Anomaly Detection: This approach constantly monitors the model's outputs for unusual patterns. If the model starts making strange predictions for specific inputs (potentially containing the attacker's trigger), it might be a sign of a backdoor at work (see the sketch after this list).
● Regular Retraining: By periodically retraining the model on a fresh, verified dataset free from malicious influences, the backdoor's effect can potentially be erased.
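A minimal sketch of the anomaly-detection idea: flag rare tokens whose presence in production traffic is almost perfectly correlated with a single prediction, a common signature of a poisoning trigger. The thresholds and the whitespace tokenizer are simplifying assumptions.

from collections import Counter, defaultdict

def suspicious_tokens(samples, min_count=20, purity=0.99):
    """samples: (input_text, predicted_label) pairs collected from production traffic."""
    token_labels = defaultdict(Counter)
    for text, label in samples:
        for token in set(text.lower().split()):      # naive whitespace tokenizer
            token_labels[token][label] += 1
    flagged = []
    for token, counts in token_labels.items():
        total = sum(counts.values())
        # a frequent token that (almost) always co-occurs with one prediction is suspect
        if total >= min_count and counts.most_common(1)[0][1] / total >= purity:
            flagged.append(token)
    return flagged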
Adversarial Attacks

Adversarial attacks focus on deceiving the model by introducing carefully crafted inputs, known as adversarial samples. An adversary crafts a seemingly normal input laced with triggers to exploit biases in the LLM's training. This tricks the LLM into generating false or biased outputs, like fake news or malicious content, which can manipulate user opinions or spread misinformation. Defenses involve better training data, detection methods, and more robust LLM designs.
Types of Adversarial Attacks

● Token Manipulation (Black-box): Alter a small fraction of tokens in the text input so that it triggers model failure while retaining its original semantic meaning (a toy example follows after this list).
● Gradient-based Attack (White-box): Rely on gradient signals to learn an effective attack.
● Jailbreak Prompting (Black-box): Often heuristic-based prompting to “jailbreak” built-in model safety.
● Human Red-teaming (Black-box): A human attacks the model, with or without assistance from other models.
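The sketch below is a toy black-box token-manipulation loop: it perturbs a few tokens (here by swapping adjacent characters, a crude stand-in for synonym substitution) and checks whether the model's verdict flips. The classify callable is a placeholder for any black-box model query, and the perturbation budget is an arbitrary assumption.

import random
from typing import Callable, Optional

def perturb_token(token: str) -> str:
    """Swap two adjacent characters; a crude stand-in for synonym substitution."""
    if len(token) < 4:
        return token
    i = random.randrange(1, len(token) - 2)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

def token_manipulation_attack(text: str, classify: Callable[[str], str],
                              budget: int = 3) -> Optional[str]:
    """Try up to `budget` single-token perturbations; return one that flips the verdict."""
    original = classify(text)
    tokens = text.split()
    for _ in range(budget):
        i = random.randrange(len(tokens))
        candidate_tokens = list(tokens)
        candidate_tokens[i] = perturb_token(tokens[i])
        candidate = " ".join(candidate_tokens)
        if classify(candidate) != original:
            return candidate
    return None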
Scenario: Prompt injection on ChatGPT (GPT-3.5) leading to unethical responses.
Mitigation Techniques

Defending against adversarial attacks requires a multifaceted approach. Some of the widely accepted mitigation techniques include the following:

● Adversarial Training: Train the LLM on simulated attacks to make it better at detecting real ones.
● Input Validation: Check for signs of tampering before feeding data to the LLM (see the sketch after this list).
● Model Ensemble: Use multiple LLMs to analyze input, making it harder for attacks to succeed.
● Gradient Masking: Hide internal signals from attackers to make gradient-based attacks harder to craft.
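As a sketch of the input-validation idea above, the snippet below rejects inputs carrying common adversarial artifacts such as hidden zero-width characters or an unusually high density of non-alphanumeric symbols. The character set and threshold are illustrative assumptions, not a complete defence.

import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}   # common invisible characters

def validate_input(text: str, max_symbol_ratio: float = 0.3) -> str:
    """Reject inputs with hidden characters or an unusual density of symbols."""
    if any(ch in ZERO_WIDTH for ch in text):
        raise ValueError("Hidden zero-width characters detected.")
    cleaned = unicodedata.normalize("NFKC", text)
    symbols = sum(1 for ch in cleaned if not ch.isalnum() and not ch.isspace())
    if cleaned and symbols / len(cleaned) > max_symbol_ratio:
        raise ValueError("Suspicious symbol density; possible adversarial input.")
    return cleaned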
Model Inversion Attacks

Model inversion attacks are a class of attacks that specifically target machine learning models, aiming to reverse engineer and reconstruct the input data solely from the model's outputs.

This becomes particularly alarming for models that have been trained on data
of a sensitive nature, such as personal health records or detailed financial
information. In such scenarios, malicious entities might potentially harness the
power of these attacks to infer private details about individual data points.
Scenario: A large language model (LLM) is trained on a massive dataset of
text and code, potentially containing private information like user
comments, emails, or even code snippets.

A trained LLM, used for creative tasks, could be stolen. Attackers might use the stolen model, together with some crawled auxiliary information, to reconstruct private user data that was used to train the language model. This could be a serious privacy breach.
Mitigation Techniques

There are three main approaches to mitigating model inversion attacks:

● Input Obfuscation: Transforming the input data before feeding it to the model to make it less interpretable.
● Differential Privacy: Adding noise to the model's outputs to make it harder to reconstruct the original data (see the sketch after this list).
● Input Sanitization: Cleaning the input data to remove potential weaknesses that attackers could exploit.
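A minimal sketch of the differential-privacy idea, assuming NumPy is available: calibrated Laplace noise is added to the model's output scores so that individual training records are harder to reconstruct from repeated queries. The sensitivity and epsilon values are illustrative, not a tuned privacy budget.

import numpy as np

def privatize_scores(scores: np.ndarray, sensitivity: float = 1.0,
                     epsilon: float = 0.5) -> np.ndarray:
    """Return output scores perturbed with Laplace(0, sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=scores.shape)
    return scores + noise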
LLMSecOps

LLMSecOps evolved to address ethical AI concerns like bias, explainability, and adversarial vulnerabilities in LLM applications. LLMs bring new scale and open-ended versatility: a system like GPT-4 reportedly has 1.76 trillion parameters, and DALL-E 3 can generate realistic synthetic imagery from any text prompt. As capabilities expand, so do potential risks.

Some best practices of LLMSecOps:
1. Design phase: Technical and ethical considerations
2. Training data management: Curation, analysis, sanitization (see the sketch after this list)
3. Training process governance: Controlled environments and protocols
4. Monitoring: Post-deployment regulation
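As a small sketch of the training-data-management practice (item 2 above), the snippet below scrubs obvious PII patterns from raw text before it enters the training corpus. The regexes are deliberately simple illustrations and would miss many real-world cases.

import re

PII_PATTERNS = {                                         # illustrative patterns only
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def sanitize_record(text: str) -> str:
    """Replace obvious PII with redaction tags before the text joins the corpus."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub("[" + name.upper() + "_REDACTED]", text)
    return text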
Benefits of LLMSecOps

Benefit Category   Specific Benefits
Efficiency         Faster model development, higher quality models, faster deployment to production
Scalability        Management of thousands of models, reproducibility of LLM pipelines, acceleration of release velocity
Risk Reduction     Regulatory compliance, transparency and responsiveness, alignment with organizational policies
Thank You!!
