DeepSeek Docs
https://api-docs.deepseek.com/
------- • -------
https://api-docs.deepseek.com/zh-cn/
------- • -------
https://api-docs.deepseek.com/quick_start/pricing
Deduction Rules
The expense = number of tokens × price.
The corresponding fees will be directly deducted from your topped-up balance or
granted balance, with a preference for using the granted balance first when both
balances are available.
Product prices may vary and DeepSeek reserves the right to adjust them. We
recommend topping up based on your actual usage and regularly checking this page
for the most recent pricing information.
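As a quick illustration, the deduction rule can be applied to the usage object returned with each response. The prices below are placeholders, not current list prices; substitute the per-million-token prices from this page.

# Minimal sketch of the deduction rule: expense = number of tokens x price.
# The two prices are placeholders -- replace them with the current
# per-million-token prices listed on this page.
INPUT_PRICE_PER_M_TOKENS = 0.14   # placeholder
OUTPUT_PRICE_PER_M_TOKENS = 0.28  # placeholder

def estimate_expense(usage):
    # `usage` is the usage object returned with every chat completion
    return (usage.prompt_tokens * INPUT_PRICE_PER_M_TOKENS
            + usage.completion_tokens * OUTPUT_PRICE_PER_M_TOKENS) / 1_000_000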
------- • -------
https://api-docs.deepseek.com/quick_start/parameter_settings
We recommend setting the temperature according to the use cases listed below.

Use Case | Temperature
Coding / Math | 0.0
Data Cleaning / Data Analysis | 1.0
General Conversation | 1.3
Translation | 1.3
Creative Writing / Poetry | 1.5
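For example, passing the recommended value for a translation task looks like this (a minimal sketch using the OpenAI-compatible client shown elsewhere in these docs):

from openai import OpenAI

client = OpenAI(api_key="<your api key>", base_url="https://api.deepseek.com")

# Translation use case, so temperature=1.3 per the table above
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Translate 'good morning' into French."}],
    temperature=1.3,
)
print(response.choices[0].message.content)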
------- • -------
https://api-docs.deepseek.com/quick_start/token_usage
However, due to the different tokenization methods used by different models, the
conversion ratios can vary. The actual number of tokens processed each time is
based on the model's return, which you can view from the usage results.
Calculate token usage offline
You can run the demo tokenizer code in the following zip package to calculate the token usage for your input/output.
deepseek_v3_tokenizer.zip
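As a rough sketch, assuming the unzipped package is a Hugging Face-style tokenizer directory (check the README inside the zip for the exact usage), offline counting might look like:

# Sketch: count tokens offline with the demo tokenizer.
# Assumes the zip unpacks to ./deepseek_v3_tokenizer and is loadable
# as a Hugging Face tokenizer; adjust to the package's own README.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "./deepseek_v3_tokenizer", trust_remote_code=True
)
text = "Hello, DeepSeek!"
print(len(tokenizer.encode(text)))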
------- • -------
https://api-docs.deepseek.com/quick_start/rate_limit
These contents do not affect the parsing of the JSON body by the OpenAI SDK. If you
are parsing the HTTP responses yourself, please ensure to handle these empty lines
or comments appropriately.
If the request is still not completed after 30 minutes, the server will close the connection.
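If you parse the HTTP stream yourself, a tolerant line-handling sketch (using the requests library; the request body follows the API reference) might look like:

# Sketch: read a streaming response while skipping keep-alive noise.
import requests

resp = requests.post(
    "https://api.deepseek.com/chat/completions",
    headers={"Authorization": "Bearer <your api key>"},
    json={
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": "Hi"}],
        "stream": True,
    },
    stream=True,
    timeout=1800,  # the server closes unfinished requests after 30 minutes
)
for raw in resp.iter_lines():
    line = raw.decode("utf-8").strip()
    if not line or line.startswith(":"):  # empty line or SSE comment
        continue                          # e.g. ": keep-alive"
    print(line)                           # "data: {...}" payload lines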
------- • -------
https://api-docs.deepseek.com/quick_start/error_codes
------- • -------
https://api-docs.deepseek.com/news/news250120
🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today!
🔥 Bonus: Open-Source Distilled Models!
📜 License Update!
📈 Large-scale RL in post-training
🏆 Significant performance boost with minimal labeled data
📄 More details:
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
🌐 API Access & Pricing
------- • -------
https://api-docs.deepseek.com/news/news250115
📱 Now officially available on App Store & Google Play & Major Android markets
Important Notice:
📲 Search "DeepSeek" in your app store or visit our website for direct links
------- • -------
https://api-docs.deepseek.com/news/news1226
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1210
📊 DeepSeek-V2.5-1210 raises the bar across benchmarks like math, coding, writing,
and roleplay—built to serve all your work and life needs.
🔧 Explore the open-source model on Hugging Face: https://huggingface.co/deepseek-
ai/DeepSeek-V2.5-1210
------- • -------
https://api-docs.deepseek.com/news/news1120
------- • -------
https://api-docs.deepseek.com/news/news0905
Safety Evaluation
Balancing safety and helpfulness has been a key focus during our iterative
development. In DeepSeek-V2.5, we have more clearly defined the boundaries of model
safety, strengthening its resistance to jailbreak attacks while reducing the
overgeneralization of safety policies to normal queries.
Model | Overall Safety Score (higher is better)* | Safety Spillover Rate (lower is better)**
DeepSeek-V2-0628 | 74.4% | 11.3%
DeepSeek-V2.5 | 82.6% | 4.6%
* Scores based on internal test sets: higher scores indicate greater overall safety.
** Scores based on internal test sets: lower percentages indicate less impact of safety measures on normal queries.
Code Capabilities
In the coding domain, DeepSeek-V2.5 retains the powerful code capabilities of
DeepSeek-Coder-V2-0724. It demonstrated notable improvements in the HumanEval
Python and LiveCodeBench (Jan 2024 - Sep 2024) tests. While DeepSeek-Coder-V2-0724 slightly outperformed on the HumanEval Multilingual and Aider tests, both versions scored relatively low on the SWE-verified test, indicating room for further improvement. Moreover, on the FIM completion task, the internal DS-FIM-Eval test set showed a 5.1% improvement, enhancing the plugin completion experience.
DeepSeek-V2.5 has also been optimized for common coding scenarios to improve user
experience. In the DS-Arena-Code internal subjective evaluation, DeepSeek-V2.5
achieved a significant win rate increase against competitors, with GPT-4o serving
as the judge.
Open-Source
DeepSeek-V2.5 is now open-source on HuggingFace! Check it out:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5
------- • -------
https://api-docs.deepseek.com/news/news0802
Note: The API price of DeepSeek-V3 has been updated. For details, please refer to Models & Pricing.
For example, in data analysis scenarios, subsequent requests with the same prefix can hit the context cache.
For more detailed instructions, please refer to the guide Use Context Caching.
Monitoring Cache Hits
Two new fields in the API response's usage section help users monitor cache
performance:
prompt_cache_hit_tokens: number of tokens from the input that were served from the cache ($0.014 per million tokens)
prompt_cache_miss_tokens: number of tokens from the input that were not served from the cache ($0.14 per million tokens)
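For instance, with the OpenAI SDK (client constructed as in the other samples in these docs), the two fields can be read off the response's usage object; a minimal sketch:

# Sketch: inspect cache performance after a request.
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize the report above."}],
)
usage = response.usage
print("cache hit tokens: ", usage.prompt_cache_hit_tokens)
print("cache miss tokens:", usage.prompt_cache_miss_tokens)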
Reducing Latency
First token latency is significantly reduced for requests with long, repetitive inputs.
For a 128K prompt that is largely repeated, the first token latency is cut from 13 s to just 500 ms.
Lowering Costs
Users can save up to 90% on costs with optimization for cache characteristics.
Even without any optimization, historical data shows that users save over 50% on
average.
The service has no additional fees beyond the $0.014 per million tokens for cache
hits, and storage usage for the cache is free.
Security Concerns
The cache system is designed with a robust security strategy.
Each user's cache is isolated and logically invisible to others, ensuring data
privacy and security.
Unused cache entries are automatically cleared after a period, ensuring they are
not retained or repurposed.
Why DeepSeek Leads with Disk Caching
Based on publicly available information, DeepSeek appears to be the first large
language model provider globally to implement extensive disk caching in API
services.
This is made possible by the MLA architecture in DeepSeek V2, which enhances model
performance while significantly reducing the size of the context KV cache, enabling
efficient storage on low-cost disks.
DeepSeek API’s Concurrency and Rate Limits
The DeepSeek API is designed to handle up to 1 trillion tokens per day, with no
limits on concurrency or rate, ensuring high-quality service for all users. Feel
free to scale up your parallelism.
The cache system uses 64 tokens as a storage unit; content less than 64 tokens will
not be cached.
The cache system does not guarantee 100% cache hits.
Unused cache entries are automatically cleared, typically within a few hours to days.
------- • -------
https://api-docs.deepseek.com/news/news0725
JSON Output
Function Calling
Chat Prefix Completion (Beta)
8K max_tokens (Beta)
All new features above are open to the two models: deepseek-chat and deepseek-
coder.
The following is an example of JSON Output. In this example, the user provides a piece of text, and the model formats the questions and answers within the text into JSON.
The image below illustrates the interaction process using the Function Calling
feature:
The following is an example of using Chat Prefix Completion. In this example, the
beginning of the assistant message is set to '```python\n' to enforce the output to
start with a code block, and the stop parameter is set to '```' to prevent the
model from outputting extra content.
Update Statements
The Beta API is open to all users. To enable the Beta features, set base_url to https://api.deepseek.com/beta.
Beta APIs are considered unstable, and their subsequent testing and release plans may change flexibly. Thank you for your understanding.
The related model versions will be released to the open-source community once the functionality is stable.
------- • -------
https://api-docs.deepseek.com/api/deepseek-api
------- • -------
https://api-docs.deepseek.com/guides/reasoning_model
Input:
max_tokens: the maximum length of the final response after the CoT output is completed, defaulting to 4K, with a maximum of 8K. Note that the CoT output can reach up to 32K tokens, and a parameter to control the CoT length (reasoning_effort) will be available soon.
Output:
Context Length: the API supports a maximum context length of 64K, and the length of the output reasoning_content is not counted within the 64K context length.
Multi-round Conversation
In each round of the conversation, the model outputs the CoT (reasoning_content) and the final answer (content). In the next round, the CoT from previous rounds is not concatenated into the context.
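A minimal sketch of a two-round call, consistent with the fields described above (note that only content, never reasoning_content, goes back into the next round's context):

from openai import OpenAI

client = OpenAI(api_key="<your api key>", base_url="https://api.deepseek.com")

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model="deepseek-reasoner", messages=messages)

reasoning = response.choices[0].message.reasoning_content  # the CoT
answer = response.choices[0].message.content               # the final answer
print(reasoning)
print(answer)

# Round 2: append only the final answer, never the reasoning_content
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "And which is greater, 10.2 or 10.15?"})
response = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
print(response.choices[0].message.content)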
------- • -------
https://api-docs.deepseek.com/guides/multi_round_chat
In the first round of the request, the messages passed to the API are:
[ {"role": "user", "content": "What's the highest mountain in the world?"}]
In the second round of the request:
Add the model's output from the first round to the end of the messages.
Add the new question to the end of the messages.
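Put together, a two-round exchange might look like the following sketch:

from openai import OpenAI

client = OpenAI(api_key="<your api key>", base_url="https://api.deepseek.com")

# Round 1
messages = [{"role": "user", "content": "What's the highest mountain in the world?"}]
response = client.chat.completions.create(model="deepseek-chat", messages=messages)

# Round 2: append the model's answer, then the new question
messages.append(response.choices[0].message)
messages.append({"role": "user", "content": "What is the second?"})
response = client.chat.completions.create(model="deepseek-chat", messages=messages)
print(response.choices[0].message.content)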
------- • -------
https://api-docs.deepseek.com/guides/chat_prefix_completion
Sample Code
Below is a complete Python code example for chat prefix completion. In this
example, we set the prefix message of the assistant to "```python\n" to force the
model to output Python code, and set the stop parameter to ['```'] to prevent
additional explanations from the model.
from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",
)

messages = [
    {"role": "user", "content": "Please write quick sort code"},
    {"role": "assistant", "content": "```python\n", "prefix": True}
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["```"],
)
print(response.choices[0].message.content)
------- • -------
https://api-docs.deepseek.com/guides/fim_completion
Sample Code
Below is a complete Python code example for FIM completion. In this example, we
provide the beginning and the end of a function to calculate the Fibonacci
sequence, allowing the model to complete the content in the middle.
from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",
)

response = client.completions.create(
    model="deepseek-chat",
    prompt="def fib(a):",
    suffix="    return fib(a-1) + fib(a-2)",
    max_tokens=128,
)
print(response.choices[0].text)
Integration With Continue
Continue is a VSCode plugin that supports code completion. You can refer to this document to configure Continue for using the code completion feature.
------- • -------
https://api-docs.deepseek.com/guides/json_mode
Sample Code
Here is the complete Python code demonstrating the use of JSON Output:
import json
from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com",
)

system_prompt = """The user will provide some exam text. Please parse the "question" and "answer" and output them in JSON format.

EXAMPLE INPUT:
Which is the highest mountain in the world? Mount Everest.

EXAMPLE JSON OUTPUT:
{
    "question": "Which is the highest mountain in the world?",
    "answer": "Mount Everest"
}
"""

user_prompt = "Which is the longest river in the world? The Nile River."

messages = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    response_format={'type': 'json_object'},
)
print(json.loads(response.choices[0].message.content))
The model will output:
{
    "question": "Which is the longest river in the world?",
    "answer": "The Nile River"
}
------- • -------
https://api-docs.deepseek.com/guides/function_calling
Note: The functionality of the get_weather function needs to be provided by the user. The model itself does not execute specific functions.
------- • -------
https://api-docs.deepseek.com/guides/kv_cache
The cache system uses 64 tokens as a storage unit; content less than 64 tokens will
not be cached.
The cache system works on a "best-effort" basis and does not guarantee a 100% cache
hit rate.
Cache construction takes seconds. Once the cache is no longer in use, it will be
automatically cleared, usually within a few hours to a few days.
------- • -------
https://api-docs.deepseek.com/faq
Billing
Is there any expiration date for my balance?
Your topped-up balance will not expire. You can check the expiration date of the
granted balance on the billing page.
API Call
Are there any rate limits when calling your API? Can I increase the limits for my
account?
The rate limit exposed on each account is adjusted dynamically according to our
real-time traffic pressure and each account's short-term historical usage.
We temporarily do not support increasing the dynamic rate limit exposed on any individual account; thanks for your understanding.
Why do I feel that your API's speed is slower than the web service?
The web service uses streaming output, i.e., every time the model outputs a token,
it will be displayed incrementally on the web page.
The API uses non-streaming output (stream=false) by default, i.e., the model's
output will not be returned to the user until the generation is done completely.
You can use streaming output in your API call to optimize interactivity.
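A minimal streaming sketch (client constructed as in the other samples in these docs):

# Sketch: stream tokens as they are generated instead of waiting
# for the full completion.
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)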
Why are empty lines continuously returned when calling the API?
To prevent the TCP connection from being interrupted due to timeout, we continuously return empty lines (for non-streaming requests) or SSE keep-alive comments (: keep-alive, for streaming requests) while waiting for the request to be scheduled. If you are parsing the HTTP response yourself, please make sure to handle these empty lines or comments appropriately.
Does your API support LangChain?
Yes. You can refer to the demo code below, which demonstrates how to use LangChain
with DeepSeek API. Replace the API key in the code as necessary.
deepseek_langchain.py
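As a sketch of the pattern such a script typically follows, assuming the langchain-openai package (the actual deepseek_langchain.py may differ):

# Sketch: point LangChain's OpenAI-compatible chat model at DeepSeek.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="deepseek-chat",
    api_key="<your api key>",
    base_url="https://api.deepseek.com",
)
print(llm.invoke("What's the highest mountain in the world?").content)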
How to calculate token usage offline?
Please refer to Token & Token Usage.
------- • -------
https://api-docs.deepseek.com/updates
Version: 2025-01-20
deepseek-reasoner
Version: 2024-12-26
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V3. The API remains
unchanged. You can invoke DeepSeek-V3 by specifying model='deepseek-chat'.
For details, please refer to: introducing DeepSeek-V3
Version: 2024-12-10
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2.5-1210, with improvements
across various capabilities. Relevant benchmarking results include:
Additionally, the new version of the model has optimized the user experience for
file upload and webpage summarization functionalities.
Version: 2024-09-05
deepseek-coder & deepseek-chat Upgraded to DeepSeek V2.5 Model
The DeepSeek V2 Chat and DeepSeek Coder V2 models have been merged and upgraded
into the new model, DeepSeek V2.5.
For backward compatibility, API users can access the new model through either
deepseek-coder or deepseek-chat.
The new model significantly surpasses the previous versions in both general
capabilities and code abilities.
The new model better aligns with human preferences and has been optimized in
various areas such as writing tasks and instruction following:
The new model has further enhanced its code generation capabilities based on the
original Coder model, optimized for common programming application scenarios, and
achieved the following results on the standard test set:
HumanEval: 89%
LiveCodeBench (January-September): 41%
Version: 2024-08-02
API Launches Context Caching on Disk Technology
The DeepSeek API has innovatively adopted hard disk caching, reducing prices by
another order of magnitude.
For more details on the update, please refer to the documentation Context Caching
is Available 2024/08/02.
Version: 2024-07-25
New API Features
JSON Mode
Function Calling
Chat Prefix Completion (Beta)
8K max_tokens (Beta)
FIM Completion (Beta)
For more details, please check the documentation New API Features 2024/07/25
Version: 2024-07-24
deepseek-coder
The deepseek-coder model has been upgraded to DeepSeek-Coder-V2-0724.
Version: 2024-06-28
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2-0628.
Model's reasoning capabilities have improved, as shown in relevant benchmarks:
In the Arena-Hard evaluation, the win rate against GPT-4-0314 increased from 41.6%
to 68.3%.
The model's role-playing capabilities have significantly enhanced, allowing it to
act as different characters as requested during conversations.
Version: 2024-06-14
deepseek-coder
The deepseek-coder model has been upgraded to DeepSeek-Coder-V2-0614, significantly
enhancing its coding capabilities. It has reached the level of GPT-4-Turbo-0409 in
code generation, code understanding, code debugging, and code completion.
Additionally, it possesses excellent mathematical and reasoning abilities, and its
general capabilities are on par with DeepSeek-V2-0517.
Version: 2024-05-17
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2-0517. The model has seen a
significant improvement in following instructions, with the IFEval Benchmark
Prompt-Level accuracy jumping from 63.9% to 77.6%. Additionally, on the API side, we have optimized the model's ability to follow instructions provided in the "system" field. This optimization has significantly elevated the user experience across a variety of tasks, including immersive translation, Retrieval-Augmented Generation (RAG), and more.
The model's accuracy in outputting JSON format has been enhanced. In our internal
test set, the JSON parsing rate increased from 78% to 85%. By introducing
appropriate regular expressions, the JSON parsing rate was further improved to
97%.
------- • -------
https://api-docs.deepseek.com/zh-cn/quick_start/pricing
Deduction Rules
Deducted cost = number of tokens consumed × model unit price. The corresponding fee is deducted directly from your topped-up balance or granted balance.
When both a topped-up balance and a granted balance exist, the granted balance is deducted first.
Product prices may change, and DeepSeek reserves the right to adjust them. Please top up according to your actual usage and check this page regularly for the latest pricing information.
------- • -------
https://api-docs.deepseek.com/zh-cn/quick_start/parameter_settings
We recommend setting the temperature according to the use cases in the table below.

Use Case | Temperature
Code Generation / Math Problem Solving | 0.0
Data Extraction / Analysis | 1.0
General Conversation | 1.3
Translation | 1.3
Creative Writing / Poetry | 1.5
------- • -------
https://api-docs.deepseek.com/zh-cn/quick_start/token_usage
------- • -------
https://api-docs.deepseek.com/zh-cn/quick_start/rate_limit
Non-streaming requests: empty lines are returned continuously.
Streaming requests: SSE keep-alive comments (: keep-alive) are returned continuously.
------- • -------
https://api-docs.deepseek.com/zh-cn/quick_start/error_codes
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news250120
HuggingFace link:
https://huggingface.co/deepseek-ai
Open License and User Agreement
To promote and encourage the development of the open-source community and the broader ecosystem, alongside the release and open-sourcing of R1 we have also made the following adjustments at the license level:
The product agreement now explicitly permits "model distillation". To further promote open sharing of the technology, we have updated the user agreement of our online products to explicitly allow users to train other models using model outputs, for example through model distillation.
App & Web
Log in to the official DeepSeek website or app and enable "DeepThink" mode to use the latest DeepSeek-R1 for all kinds of reasoning tasks.
API & Pricing
The DeepSeek-R1 API is priced at 1 RMB per million input tokens (cache hit) / 4 RMB per million input tokens (cache miss), and 16 RMB per million output tokens.
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news250115
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news1226
Performance on par with leading overseas closed-source models
DeepSeek-V3 is a self-developed MoE model with 671B total parameters, 37B of which are activated, pre-trained on 14.8T tokens.
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
DeepSeek-V3 surpasses other open-source models such as Qwen2.5-72B and Llama-3.1-405B on many benchmarks, and is on par with the world's top closed-source models GPT-4o and Claude-3.5-Sonnet.
Generation speed tripled
Through algorithmic and engineering innovations, DeepSeek-V3's token generation speed has increased substantially from 20 TPS to 60 TPS, a 3x improvement over the V2.5 model, delivering a faster and smoother user experience.
API price adjustment
With the launch of the stronger and faster DeepSeek-V3, our API pricing is adjusted to 0.5 RMB per million input tokens (cache hit) / 2 RMB per million input tokens (cache miss), and 8 RMB per million output tokens, so that we can continue to provide better model service.
Open-source weights and local deployment
DeepSeek-V3 is trained in FP8, and the native FP8 weights are open-sourced.
Thanks to the support of the open-source community, SGLang and LMDeploy immediately added support for native FP8 inference of the V3 model, while TensorRT-LLM and MindIE implemented BF16 inference. To make community adaptation and new use cases easier, we also provide a script to convert the FP8 weights to BF16.
For model weight downloads and more local deployment information, see: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
"Pursuing inclusive AGI with an open-source spirit and long-termism" has always been DeepSeek's firm belief. We are excited to share this staged progress in model pre-training with the community, and delighted to see the capability gap between open-source and closed-source models narrowing further.
This is a brand-new beginning. Going forward, we will continue to build richer capabilities such as deep thinking and multimodality on top of the DeepSeek-V3 base model, and we will keep sharing our latest explorations with the community.
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news1210
General capability improvements
Through iteration in the post-training stage, the DeepSeek-V2.5-1210 release improves the model's capabilities across all domains:
In line with our consistent open-source spirit, the new model weights are open-sourced on Huggingface:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5-1210
Web search
DeepSeek-V2.5-1210 supports web search, which is now live on the web client. Log in at https://chat.deepseek.com/ and enable "Web Search" in the input box to try it. The API does not currently support search.
In "Web Search" mode, the model reads a large number of web pages in depth to generate comprehensive, accurate answers tailored to each user's needs. For complex questions, the model automatically extracts multiple keywords and searches them in parallel, returning more diverse results in less time.
Below is an example of the search feature:
The final release of the V2.5 series
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news1120
The power and potential of deep thinking
DeepSeek-R1-Lite's reasoning process is long and includes extensive reflection and verification. The figure below shows that the model's score on math competitions is closely tied to the length of thinking allowed during testing.
The solid red line shows that the model's achievable accuracy is positively correlated with the allotted reasoning length;
Compared with traditional multi-sample majority voting, lengthening the chain of thought proves more efficient.
Fully live; try it now
Log in to chat.deepseek.com and select "DeepThink" mode in the input box to start a conversation with the DeepSeek-R1-Lite preview.
"DeepThink" mode is designed specifically for complex logical reasoning problems in math, code, and similar domains. Compared with ordinary simple questions, it provides more comprehensive, clear, and rigorously reasoned answers, fully demonstrating the advantages of longer chains of thought.
Example of starting a conversation:
Applicable scenarios and example results:
A new beginning; stay tuned
DeepSeek-R1-Lite is still under iterative development. It currently supports web use only, with no API access, and it uses a relatively small base model that cannot fully unlock the potential of long chains of thought.
We are continuously iterating on this series of reasoning models. The official DeepSeek-R1 model will be fully open-sourced; we will publish a technical report and deploy an API service.
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news0905
General capability evaluation
Safety evaluation
Open-source release
As always, in the spirit of lasting open source, DeepSeek-V2.5 is now open-sourced on HuggingFace:
https://huggingface.co/deepseek-ai/DeepSeek-V2.5
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news0802
2. Data analysis: subsequent requests with the same prefix can hit the context cache.
Many kinds of applications can benefit from the context disk cache:
Q&A assistants with long preset prompts
Role-play applications with long character settings and multi-turn conversations
Data analysis applications that repeatedly query a fixed collection of documents
Repository-level code analysis and debugging tools
Improving model output with few-shot examples
...
For more detailed usage instructions, see the guide Use Context Caching.
How to check cache hits
Two new fields in the usage section of the API response help users monitor cache hits in real time:
Reducing service latency
For requests with long, highly repetitive inputs, the API's first-token latency is greatly reduced.
As an extreme example, for a 128K request that is mostly repeated, the measured first-token latency dropped from 13 seconds to 500 milliseconds.
Reducing overall cost
Costs can be cut by up to 90% (this requires optimizing for the cache's characteristics).
Even without any optimization, historical usage shows users save more than 50% overall.
The cache carries no extra fees: cache hits cost only 0.1 RMB per million tokens, and the storage the cache occupies is free.
Cache security
The cache system was designed with the full range of potential security issues in mind.
Each user's cache is independent and logically invisible to other users, ensuring data security and privacy at the infrastructure level.
Cache entries that go unused for a long time are automatically cleared; they are not retained long-term and are not repurposed.
Why the DeepSeek API can adopt disk caching first
Based on publicly available information, DeepSeek may be the first LLM vendor in the world to adopt disk caching at scale in its API service.
This is enabled by the MLA architecture proposed in DeepSeek V2, which improves model quality while greatly compressing the size of the context KV cache, so that the transfer bandwidth and storage capacity required both drop substantially, making it feasible to cache on low-cost disks.
Concurrency and rate limits of the DeepSeek API
The DeepSeek API service is designed for a capacity of 1 trillion tokens per day. No user is rate-limited or concurrency-limited, while service quality is guaranteed. Feel free to increase your concurrency.
------- • -------
https://api-docs.deepseek.com/zh-cn/news/news0725
Updated endpoint /chat/completions
JSON Output
Function Calling
Chat Prefix Completion (Beta)
8K max output (Beta)
New endpoint /completions
FIM Completion (Beta)
I. Updated endpoint /chat/completions
1. JSON Output: enforcing formatted content
The DeepSeek API adds a JSON Output feature, compatible with the OpenAI API, which can force the model to output a string in JSON format.
In data-processing and similar tasks, this lets the model return JSON in a predetermined format, making it easier to parse the model's output downstream and improving workflow automation.
To use the JSON Output feature, you need to:
2. Function Calling: connecting to the physical world
The DeepSeek API adds a Function Calling feature, compatible with the OpenAI API, which enhances the model's ability to interact with the physical world by calling external tools.
Function Calling supports passing in multiple functions (up to 128) and supports parallel function calls.
The image below shows deepseek-coder integrated into LobeChat, an open-source LLM front end. In this example, we enabled the "website crawler" plugin to crawl and summarize websites.
3. Chat Prefix Completion (Beta): more flexible output control
Chat prefix completion follows the chat completion API format: the user specifies a prefix for the last assistant message, and the model completes from that prefix. It can also be used when an output is truncated at max_tokens: concatenate the truncated message and resend the request so the model continues the truncated content.
To use chat prefix completion, you need to:
4. 8K max output (Beta): unlocking longer outputs
To support scenarios with longer text output, the Beta API raises the upper limit of the max_tokens parameter to 8K.
To raise the max output to 8K, you need to:
II. New endpoint /completions
1. FIM Completion (Beta): enabling continuation scenarios
The DeepSeek API adds an FIM (Fill-In-the-Middle) completion endpoint, compatible with OpenAI's FIM completion API, which allows users to provide an optional custom prefix/suffix and lets the model complete the content in between. It is commonly used for story continuation, code completion, and similar scenarios. The FIM completion endpoint is billed the same as chat completion.
To use the FIM completion endpoint, set base_url to https://api.deepseek.com/beta to enable the Beta features.
Below is a usage example of the FIM completion endpoint. In this example, the user provides the beginning and the end of a Fibonacci function, and the model completes the middle.
Update notes
The Beta endpoints are open to all users; set base_url to https://api.deepseek.com/beta to enable the Beta features.
Beta endpoints are unstable, and their subsequent testing and release plans may change flexibly. Thank you for your understanding.
The relevant model versions will be released to the open-source community once the functionality is stable.
------- • -------
https://api-docs.deepseek.com/zh-cn/api/deepseek-api
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/reasoning_model
Input parameters:
Output fields:
Supported features: chat completion, chat prefix completion (Beta)
Unsupported parameters: temperature, top_p, presence_penalty, frequency_penalty, logprobs, top_logprobs. Note that, for compatibility with existing software, setting temperature, top_p, presence_penalty, or frequency_penalty will not raise an error but will have no effect, while setting logprobs or top_logprobs will raise an error.
Context concatenation
In each round of the conversation, the model outputs the chain-of-thought content (reasoning_content) and the final answer (content). In the next round, the chain-of-thought content from previous rounds is not concatenated into the context.
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/multi_round_chat
Append the model's output from the first round to the end of messages.
Append the new question to the end of messages.
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/chat_prefix_completion
Sample Code
Below is a complete Python code example for chat prefix completion. In this example, we set the assistant's opening message to "```python\n" to force the model to output Python code, and set the stop parameter to ['```'] to avoid extra explanations from the model.

from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",
)

messages = [
    {"role": "user", "content": "Please write quick sort code"},
    {"role": "assistant", "content": "```python\n", "prefix": True}
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["```"],
)
print(response.choices[0].message.content)
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/fim_completion
The model's maximum completion length is 4K.
Users need to set base_url="https://api.deepseek.com/beta" to enable the Beta features.
Sample Code
Below is a complete Python code example for FIM completion. In this example, we provide the beginning and the end of a function that computes the Fibonacci sequence, and let the model fill in the middle.

from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com/beta",
)

response = client.completions.create(
    model="deepseek-chat",
    prompt="def fib(a):",
    suffix="    return fib(a-1) + fib(a-2)",
    max_tokens=128,
)
print(response.choices[0].text)

Configuring the Continue code-completion plugin
Continue is a VSCode plugin that supports code completion. You can refer to this document to configure Continue to use the code completion feature.
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/json_mode
Sample Code
Here is the complete Python code demonstrating the use of JSON Output:

import json
from openai import OpenAI

client = OpenAI(
    api_key="<your api key>",
    base_url="https://api.deepseek.com",
)

system_prompt = """The user will provide some exam text. Please parse the "question" and "answer" and output them in JSON format.

EXAMPLE INPUT:
Which is the highest mountain in the world? Mount Everest.

EXAMPLE JSON OUTPUT:
{
    "question": "Which is the highest mountain in the world?",
    "answer": "Mount Everest"
}
"""

user_prompt = "Which is the longest river in the world? The Nile River."

messages = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    response_format={'type': 'json_object'},
)
print(json.loads(response.choices[0].message.content))

The model will output:
{
    "question": "Which is the longest river in the world?",
    "answer": "The Nile River"
}
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/function_calling
User: asks about the current weather
Model: returns the function call get_weather({location: 'Hangzhou'})
User: executes get_weather({location: 'Hangzhou'}) and passes the result to the model
Model: returns natural language, "The current temperature in Hangzhou is 24°C."
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/kv_cache
Example 1: Long-document Q&A
First request
messages: [
    {"role": "system", "content": "You are a senior financial report analyst..."},
    {"role": "user", "content": "<financial report content>\n\nPlease summarize the key information in this report."}
]
Second request
messages: [
    {"role": "system", "content": "You are a senior financial report analyst..."},
    {"role": "user", "content": "<financial report content>\n\nPlease analyze the profitability shown in this report."}
]
In this example, the two requests share the same prefix: the system message plus the <financial report content> part of the user message. On the second request, this prefix counts as a "cache hit".
Example 2: Multi-turn conversation
First request
messages: [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of China?"}
]
Second request
messages: [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of China?"},
    {"role": "assistant", "content": "The capital of China is Beijing."},
    {"role": "user", "content": "What is the capital of the United States?"}
]
In this example, the second request can reuse the system message and the first user message from the first request; that portion counts as a "cache hit".
Example 3: Few-shot learning
In practice, users can improve the model's output with few-shot learning, i.e. providing a few examples in the request so the model learns a particular pattern. Since few-shot prompts usually share the same context prefix, disk caching makes few-shot usage significantly cheaper.
First request
messages: [
    {"role": "system", "content": "You are a history expert. The user will ask a series of questions. Answer concisely, starting each answer with `Answer:`"},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who founded the Han dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {"role": "user", "content": "Who was the founding emperor of the Ming dynasty?"},
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "Who was the founding emperor of the Qing dynasty?"}
]
Second request
messages: [
    {"role": "system", "content": "You are a history expert. The user will ask a series of questions. Answer concisely, starting each answer with `Answer:`"},
    {"role": "user", "content": "In what year did Qin Shi Huang unify the six states?"},
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who founded the Han dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {"role": "user", "content": "Who was the founding emperor of the Ming dynasty?"},
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "When did the Shang dynasty fall?"}
]
In this example, 4 shots are used. The two requests differ only in the final question; the second request can reuse the first 4 rounds of dialogue from the first request, and that portion counts as a "cache hit".
Checking cache hits
In the DeepSeek API response, we have added two fields to usage to report the cache hit status of the request:
Disk cache and output randomness
The disk cache only matches the prefix of the user's input; the output is still generated by inference and is still affected by parameters such as temperature, which introduce randomness. The output behaves the same as it would without the disk cache.
Other notes
Cache construction takes on the order of seconds. Once a cache is no longer in use, it is automatically cleared, typically within a few hours to a few days.
------- • -------
https://api-docs.deepseek.com/zh-cn/prompt-library/
------- • -------
https://api-docs.deepseek.com/zh-cn/faq
Account
Cannot sign in to my account
Your account's recent activity may have triggered our automated risk-control policies, so we have temporarily suspended access to your account. To appeal, please fill in the questionnaire and we will process it as soon as possible.
Cannot register with my email
If you see the error "Registration failed. This email domain is currently not supported." during sign-up, your email provider is not supported by DeepSeek; please switch to another email provider. If the problem persists, contact [email protected].
Enterprise verification
What is the difference between personal and enterprise real-name verification?
There is currently no difference in user rights or product features between personally verified and enterprise-verified accounts, but the verification methods and required materials differ. Per compliance requirements, please verify according to how the account is actually used.
Can an enterprise-verified account be changed to a personal account?
An enterprise-verified account cannot be changed to personal verification or transferred to another enterprise.
Billing
How do I top up?
Online top-up: after completing real-name verification, you can top up online via Alipay/WeChat Pay on the top-up page. You can check the result on the billing page.
Corporate bank transfer: available to enterprise users only. After completing enterprise real-name verification, you will receive a dedicated remittance account to transfer funds to. To ensure the transfer succeeds, make sure the remitter's account name matches the real-name verification on the open platform. After the funds arrive in our bank account, the amount will be credited to your open-platform account automatically within roughly 10 minutes to 1 hour; if it does not arrive in time, please contact us.
Does my balance expire?
Topped-up balance does not expire. The expiration date of granted balance can be checked on the billing page.
How do I request an invoice?
Visit the billing page, click invoice management, and request an invoice. For enterprise users, the invoice title must match the real-name verification information; invoices are currently issued within about 7 business days.
API calls
What is the concurrency limit when calling models? Can my account's concurrency cap be raised?
At this stage, we do not set a hard per-user concurrency cap. When overall system load is high, a dynamic rate-limiting model based on system load and each user's short-term historical usage may cause users to receive 503 or 429 error codes.
We currently do not support raising the concurrency cap for individual accounts; thank you for your understanding.
Why does the API feel slower than the web client?
The web client uses streaming output by default (stream=true): each character the model outputs is displayed incrementally in the front end.
The API uses non-streaming output by default (stream=false): the model returns to the user only after all content has been generated. You can enable the API's stream mode to improve interactivity.
Why does the API keep returning empty lines?
To keep the TCP connection from being interrupted by timeouts, we continuously return empty lines (non-streaming requests) or SSE keep-alive comments (: keep-alive, streaming requests) while the request waits to be scheduled. If you parse the HTTP response yourself, please take care to handle these empty lines or comments.
Is LangChain supported?
Yes. LangChain supports the OpenAI API interface, and the DeepSeek API is OpenAI-compatible. You can download the code file below and replace the API key in it to call the DeepSeek API from LangChain.
deepseek_langchain.py
How do I compute token usage offline?
See Token Usage Calculation.
------- • -------
https://api-docs.deepseek.com/zh-cn/updates
Version: 2025-01-20
deepseek-reasoner
Version: 2024-12-26
deepseek-chat
Version: 2024-12-10
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2.5-1210, with improvements across all capabilities. Relevant benchmark results:
Version: 2024-09-05
deepseek-coder & deepseek-chat upgraded to the DeepSeek V2.5 model
The DeepSeek V2 Chat and DeepSeek Coder V2 models have been merged and upgraded into the new DeepSeek V2.5 model.
For backward compatibility, API users can access the new model through either deepseek-coder or deepseek-chat.
The new model significantly surpasses the two previous models in both general and coding capabilities.
The new model better aligns with human preferences and has been optimized in areas such as writing tasks and instruction following:
HumanEval: 89%
LiveCodeBench (Jan-Sep): 41%
Version: 2024-08-02
Disk caching technology launched for the API
The DeepSeek API has innovatively adopted disk caching, cutting prices by another order of magnitude.
For details on the update, see the document Context Caching is Available 2024/08/02.
Version: 2024-07-25
API updates
Updated endpoint /chat/completions
JSON Output
Function Calling
Chat Prefix Completion (Beta)
8K max output (Beta)
New endpoint /completions
FIM Completion (Beta)
Version: 2024-07-24
deepseek-coder
The deepseek-coder model has been upgraded to DeepSeek-Coder-V2-0724.
Version: 2024-06-28
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2-0628, with improved reasoning capabilities. Relevant benchmark results:
Version: 2024-05-17
deepseek-chat
The deepseek-chat model has been upgraded to DeepSeek-V2-0517. Instruction-following performance has improved significantly, with IFEval Benchmark Prompt-Level accuracy jumping from 63.9% to 77.6%. In addition, we optimized the instruction following of the "system" field on the API side, significantly improving the user experience for tasks such as immersive translation and RAG.
The model's accuracy in JSON-format output has also improved: on an internal test set, the JSON parse rate rose from 78% to 85%, and with appropriate regular expressions it improved further to 97%.
------- • -------
https://api-docs.deepseek.com/zh-cn/#%E8%B0%83%E7%94%A8%E5%AF%B9%E8%AF%9D-api
------- • -------
https://api-docs.deepseek.com/quick_start/
pricing#__docusaurus_skipToContent_fallback
Deduction Rules
The expense = number of tokens × price.
The corresponding fees will be directly deducted from your topped-up balance or
granted balance, with a preference for using the granted balance first when both
balances are available.
Product prices may vary and DeepSeek reserves the right to adjust them. We
recommend topping up based on your actual usage and regularly checking this page
for the most recent pricing information.PreviousYour First API CallNextThe
Temperature ParameterPricing DetailsDeduction RulesWeChat Official Account
CommunityEmailDiscordTwitterMoreGitHubCopyright © 2025 DeepSeek, Inc.
------- • -------
https://api-docs.deepseek.com/quick_start/pricing#
Deduction Rules
The expense = number of tokens × price.
The corresponding fees will be directly deducted from your topped-up balance or
granted balance, with a preference for using the granted balance first when both
balances are available.
Product prices may vary and DeepSeek reserves the right to adjust them. We
recommend topping up based on your actual usage and regularly checking this page
for the most recent pricing information.PreviousYour First API CallNextThe
Temperature ParameterPricing DetailsDeduction RulesWeChat Official Account
CommunityEmailDiscordTwitterMoreGitHubCopyright © 2025 DeepSeek, Inc.
------- • -------
https://api-docs.deepseek.com/quick_start/pricing#pricing-details
Deduction Rules
The expense = number of tokens × price.
The corresponding fees will be directly deducted from your topped-up balance or
granted balance, with a preference for using the granted balance first when both
balances are available.
Product prices may vary and DeepSeek reserves the right to adjust them. We
recommend topping up based on your actual usage and regularly checking this page
for the most recent pricing information.PreviousYour First API CallNextThe
Temperature ParameterPricing DetailsDeduction RulesWeChat Official Account
https://api-docs.deepseek.com/quick_start/pricing#deduction-rules
Deduction Rules
The expense = number of tokens × price.
The corresponding fees will be directly deducted from your topped-up balance or
granted balance, with a preference for using the granted balance first when both
balances are available.
Product prices may vary and DeepSeek reserves the right to adjust them. We
recommend topping up based on your actual usage and regularly checking this page
for the most recent pricing information.PreviousYour First API CallNextThe
Temperature ParameterPricing DetailsDeduction RulesWeChat Official Account
https://api-docs.deepseek.com/quick_start/
parameter_settings#__docusaurus_skipToContent_fallback
We recommend users to set the temperature according to their use case listed in
below.
------- • -------
https://api-docs.deepseek.com/quick_start/parameter_settings#
We recommend users to set the temperature according to their use case listed in
below.
https://api-docs.deepseek.com/quick_start/
token_usage#__docusaurus_skipToContent_fallback
However, due to the different tokenization methods used by different models, the
conversion ratios can vary. The actual number of tokens processed each time is
based on the model's return, which you can view from the usage results.
Calculate token usage offline
You can run the demo tokenizer code in the following zip package to calculate the
token usage for your intput/output.
deepseek_v3_tokenizer.zipPreviousThe Temperature ParameterNextRate LimitCalculate
token usage offlineWeChat Official Account
------- • -------
https://api-docs.deepseek.com/quick_start/token_usage#
However, due to the different tokenization methods used by different models, the
conversion ratios can vary. The actual number of tokens processed each time is
based on the model's return, which you can view from the usage results.
Calculate token usage offline
You can run the demo tokenizer code in the following zip package to calculate the
token usage for your intput/output.
deepseek_v3_tokenizer.zipPreviousThe Temperature ParameterNextRate LimitCalculate
token usage offlineWeChat Official Account
------- • -------
https://api-docs.deepseek.com/quick_start/token_usage#calculate-token-usage-offline
However, due to the different tokenization methods used by different models, the
conversion ratios can vary. The actual number of tokens processed each time is
based on the model's return, which you can view from the usage results.
Calculate token usage offline
You can run the demo tokenizer code in the following zip package to calculate the
token usage for your intput/output.
deepseek_v3_tokenizer.zipPreviousThe Temperature ParameterNextRate LimitCalculate
token usage offlineWeChat Official Account
------- • -------
https://api-docs.deepseek.com/quick_start/
rate_limit#__docusaurus_skipToContent_fallback
These contents do not affect the parsing of the JSON body by the OpenAI SDK. If you
are parsing the HTTP responses yourself, please ensure to handle these empty lines
or comments appropriately.
If the request is still not completed after 30 minutes, the server will close the
connection.PreviousToken & Token UsageNextError CodesWeChat Official Account
------- • -------
https://api-docs.deepseek.com/quick_start/rate_limit#
These contents do not affect the parsing of the JSON body by the OpenAI SDK. If you
are parsing the HTTP responses yourself, please ensure to handle these empty lines
or comments appropriately.
If the request is still not completed after 30 minutes, the server will close the
connection.PreviousToken & Token UsageNextError CodesWeChat Official Account
------- • -------
https://api-docs.deepseek.com/quick_start/
error_codes#__docusaurus_skipToContent_fallback
------- • -------
https://api-docs.deepseek.com/quick_start/error_codes#
------- • -------
https://api-docs.deepseek.com/news/news250120#__docusaurus_skipToContent_fallback
🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today!
🔥 Bonus: Open-Source Distilled Models!
📜 License Update!
📈 Large-scale RL in post-training
📄 More details:
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
🌐 API Access & Pricing
------- • -------
https://api-docs.deepseek.com/news/news250120#
🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today!
🔥 Bonus: Open-Source Distilled Models!
📜 License Update!
📈 Large-scale RL in post-training
📄 More details:
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
------- • -------
https://api-docs.deepseek.com/news/news250115#__docusaurus_skipToContent_fallback
📱 Now officially available on App Store & Google Play & Major Android markets
Important Notice:
📲 Search "DeepSeek" in your app store or visit our website for direct links
------- • -------
https://api-docs.deepseek.com/news/news250115#
📱 Now officially available on App Store & Google Play & Major Android markets
Important Notice:
📲 Search "DeepSeek" in your app store or visit our website for direct links
------- • -------
https://api-docs.deepseek.com/news/news250115#key-features-of-deepseek-app
📱 Now officially available on App Store & Google Play & Major Android markets
Important Notice:
📲 Search "DeepSeek" in your app store or visit our website for direct links
------- • -------
https://api-docs.deepseek.com/news/news250115#important-notice
📱 Now officially available on App Store & Google Play & Major Android markets
Important Notice:
📲 Search "DeepSeek" in your app store or visit our website for direct links
------- • -------
https://api-docs.deepseek.com/news/news1226#__docusaurus_skipToContent_fallback
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://api-docs.deepseek.com/news/news1226#
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1226#biggest-leap-forward-yet
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1226#-whats-new-in-v3
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1226#-api-pricing-update
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1226#still-the-best-value-in-the-market-
🎉 What’s new in V3
Model 👉 https://github.com/deepseek-ai/DeepSeek-V3
Paper 👉 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
------- • -------
https://api-docs.deepseek.com/news/news1210#__docusaurus_skipToContent_fallback
📊 DeepSeek-V2.5-1210 raises the bar across benchmarks like math, coding, writing,
and roleplay—built to serve all your work and life needs.
🔧 Explore the open-source model on Hugging Face: https://huggingface.co/deepseek-
ai/DeepSeek-V2.5-1210
------- • -------
https://api-docs.deepseek.com/news/news1210#
📊 DeepSeek-V2.5-1210 raises the bar across benchmarks like math, coding, writing,
and roleplay—built to serve all your work and life needs.
🔧 Explore the open-source model on Hugging Face: https://huggingface.co/deepseek-
ai/DeepSeek-V2.5-1210
------- • -------
https://api-docs.deepseek.com/news/news1120#__docusaurus_skipToContent_fallback
------- • -------
https://api-docs.deepseek.com/news/news1120#
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/function_calling#%E6%8F%90%E7%A4%BA
用户:询问现在的天气
模型:返回 function get_weather({location: 'Hangzhou'})
用户:调用 function get_weather({location: 'Hangzhou'}),并传给模型。
模型:返回自然语言,"The current temperature in Hangzhou is 24°C."
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/function_calling#%E6%A0%B7%E4%BE%8B
%E4%BB%A3%E7%A0%81
用户:询问现在的天气
模型:返回 function get_weather({location: 'Hangzhou'})
用户:调用 function get_weather({location: 'Hangzhou'}),并传给模型。
模型:返回自然语言,"The current temperature in Hangzhou is 24°C."
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/
kv_cache#__docusaurus_skipToContent_fallback
例一:长文本问答
第一次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请总结一下这份财报的关键信息。"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请分析一下这份财报的盈利情况。"}]
在上例中,两次请求都有相同的前缀,即 system 消息 + user 消息中的 <财报内容>。在第二次请求时,这部分前缀会计入“缓存命中”。
例二:多轮对话
第一次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}, {"role": "assistant", "content": "中国的首都是北京。"},
{"role": "user", "content": "美国的首都是哪里?"}]
在上例中,第二次请求可以复用第一次请求开头的 system 消息和 user 消息,这部分会计入“缓存命中”。
例三:使用 Few-shot 学习
在实际应用中,用户可以通过 Few-shot 学习的方式,来提升模型的输出效果。所谓 Few-shot 学习,是指在请求中提供一些示例,让模型学习到特定的模式。由于
Few-shot 一般提供相同的上下文前缀,在硬盘缓存的加持下,Few-shot 的费用显著降低。
第一次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问清朝的开国皇帝是谁?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问商朝是什么时候灭亡的"}, ]
在上例中,使用了 4-shots。两次请求只有最后一个问题不一样,第二次请求可以复用第一次请求中前 4 轮对话的内容,这部分会计入“缓存命中”。
查看缓存命中情况
在 DeepSeek API 的返回中,我们在 usage 字段中增加了两个字段,来反映请求的缓存命中情况:
硬盘缓存与输出随机性
硬盘缓存只匹配到用户输入的前缀部分,输出仍然是通过计算推理得到的,仍然受到 temperature 等参数的影响,从而引入随机性。其输出效果与不使用硬盘缓存相同。
其它说明
缓存构建耗时为秒级。缓存不再使用后会自动被清空,时间一般为几个小时到几天
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/kv_cache#
例一:长文本问答
第一次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请总结一下这份财报的关键信息。"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请分析一下这份财报的盈利情况。"}]
在上例中,两次请求都有相同的前缀,即 system 消息 + user 消息中的 <财报内容>。在第二次请求时,这部分前缀会计入“缓存命中”。
例二:多轮对话
第一次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}, {"role": "assistant", "content": "中国的首都是北京。"},
{"role": "user", "content": "美国的首都是哪里?"}]
在上例中,第二次请求可以复用第一次请求开头的 system 消息和 user 消息,这部分会计入“缓存命中”。
例三:使用 Few-shot 学习
在实际应用中,用户可以通过 Few-shot 学习的方式,来提升模型的输出效果。所谓 Few-shot 学习,是指在请求中提供一些示例,让模型学习到特定的模式。由于
Few-shot 一般提供相同的上下文前缀,在硬盘缓存的加持下,Few-shot 的费用显著降低。
第一次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问清朝的开国皇帝是谁?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问商朝是什么时候灭亡的"}, ]
在上例中,使用了 4-shots。两次请求只有最后一个问题不一样,第二次请求可以复用第一次请求中前 4 轮对话的内容,这部分会计入“缓存命中”。
查看缓存命中情况
在 DeepSeek API 的返回中,我们在 usage 字段中增加了两个字段,来反映请求的缓存命中情况:
硬盘缓存与输出随机性
硬盘缓存只匹配到用户输入的前缀部分,输出仍然是通过计算推理得到的,仍然受到 temperature 等参数的影响,从而引入随机性。其输出效果与不使用硬盘缓存相同。
其它说明
缓存构建耗时为秒级。缓存不再使用后会自动被清空,时间一般为几个小时到几天
------- • -------
https://api-docs.deepseek.com/zh-cn/guides/kv_cache#%E4%BE%8B%E4%B8%80%E9%95%BF
%E6%96%87%E6%9C%AC%E9%97%AE%E7%AD%94
例一:长文本问答
第一次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请总结一下这份财报的关键信息。"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位资深的财报分析师..."} {"role": "user",
"content": "<财报内容>\n\n 请分析一下这份财报的盈利情况。"}]
在上例中,两次请求都有相同的前缀,即 system 消息 + user 消息中的 <财报内容>。在第二次请求时,这部分前缀会计入“缓存命中”。
例二:多轮对话
第一次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位乐于助人的助手"}, {"role": "user",
"content": "中国的首都是哪里?"}, {"role": "assistant", "content": "中国的首都是北京。"},
{"role": "user", "content": "美国的首都是哪里?"}]
在上例中,第二次请求可以复用第一次请求开头的 system 消息和 user 消息,这部分会计入“缓存命中”。
例三:使用 Few-shot 学习
在实际应用中,用户可以通过 Few-shot 学习的方式,来提升模型的输出效果。所谓 Few-shot 学习,是指在请求中提供一些示例,让模型学习到特定的模式。由于
Few-shot 一般提供相同的上下文前缀,在硬盘缓存的加持下,Few-shot 的费用显著降低。
第一次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问清朝的开国皇帝是谁?"}]
第二次请求
messages: [ {"role": "system", "content": "你是一位历史学专家,用户将提供一系列问题,你的回答应当简明
扼要,并以`Answer:`开头"}, {"role": "user", "content": "请问秦始皇统一六国是在哪一年?"},
{"role": "assistant", "content": "Answer:公元前 221 年"}, {"role": "user",
"content": "请问汉朝的建立者是谁?"}, {"role": "assistant", "content": "Answer:刘邦"},
{"role": "user", "content": "请问唐朝最后一任皇帝是谁"}, {"role": "assistant",
"content": "Answer:李柷"}, {"role": "user", "content": "请问明朝的开国皇帝是谁?"},
{"role": "assistant", "content": "Answer:朱元璋"}, {"role": "user", "content":
"请问商朝是什么时候灭亡的"}, ]
在上例中,使用了 4-shots。两次请求只有最后一个问题不一样,第二次请求可以复用第一次请求中前 4 轮对话的内容,这部分会计入“缓存命中”。
查看缓存命中情况
在 DeepSeek API 的返回中,我们在 usage 字段中增加了两个字段,来反映请求的缓存命中情况:
硬盘缓存与输出随机性
硬盘缓存只匹配到用户输入的前缀部分,输出仍然是通过计算推理得到的,仍然受到 temperature 等参数的影响,从而引入随机性。其输出效果与不使用硬盘缓存相同。
其它说明
缓存构建耗时为秒级。缓存不再使用后会自动被清空,时间一般为几个小时到几天
------- • -------
https://api-docs.deepseek.com/zh-cn/prompt-library/
------- • -------