Qwen Technical Report
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu
Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu,
Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan,
Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin
Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng
Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou,
Xiaohuan Zhou, Tianhang Zhu.
ABSTRACT
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen¹, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and only slightly fall behind the proprietary models.
∗ Authors are ordered alphabetically by the last name. Correspondence to: [email protected].
¹ Qwen is a moniker of Qianwen, which means “thousands of prompts” in Chinese. The pronunciation of “Qwen” can vary depending on the context and the individual speaking it. Here is one possible way to pronounce it: /kwEn/.
Contents

1 Introduction
2 Pretraining
  2.1 Data
  2.2 Tokenization
  2.3 Model
    2.3.1 Architecture
    2.3.2 Context Length Extension
  2.4 Training
  2.5 Experimental Results
3 Alignment
  3.1 Supervised Finetuning
    3.1.1 Data
    3.1.2 Training
  3.2 Reinforcement Learning from Human Feedback
    3.2.1 Reward Model
    3.2.2 Reinforcement Learning
  3.3 Automatic and Human Evaluation of Aligned Models
  3.4 Tool Use, Code Interpreter, and Agent
4 Code-Qwen: Specialized Model for Coding
  4.1 Code Pretraining
  4.2 Code Supervised Fine-Tuning
  4.3 Evaluation
5 Math-Qwen: Specialized Model for Mathematics Reasoning
  5.1 Training
  5.2 Evaluation
6 Related Work
  6.1 Large Language Models
  6.2 Alignment
  6.3 Tool Use and Agents
  6.4 LLM for Coding
  6.5 LLM for Mathematics
7 Conclusion
A Appendix
  A.1 More Training Details
    A.1.1 Data Format for Qwen-Chat
  A.2 Evaluation
    A.2.1 Automatic Evaluation
    A.2.2 Human Evaluation
  A.3 Analysis of Code Interpreter
1 INTRODUCTION
Large language models (LLMs) (Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2020; Brown
et al., 2020; OpenAI, 2023; Chowdhery et al., 2022; Anil et al., 2023; Thoppilan et al., 2022; Touvron
et al., 2023a;b) have revolutionized the field of artificial intelligence (AI) by providing a powerful
foundation for complex reasoning and problem-solving tasks. These models have the ability to
compress vast knowledge into neural networks, making them incredibly versatile agents. With a
chat interface, LLMs can perform tasks that were previously thought to be the exclusive domain of
humans, especially those involving creativity and expertise (OpenAI, 2022; Ouyang et al., 2022; Anil
et al., 2023; Google, 2023; Anthropic, 2023a;b). They can engage in natural language conversations
with humans, answering questions, providing information, and even generating creative content such
as stories, poems, and music. This has led to the development of a wide range of applications, from
chatbots and virtual assistants to language translation and summarization tools.
LLMs are not just limited to language tasks. They can also function as a generalist agent (Reed et al.,
2022; Bai et al., 2022a; Wang et al., 2023a; AutoGPT, 2023; Hong et al., 2023), collaborating with
external systems, tools, and models to achieve the objectives set by humans. For example, LLMs
can understand multimodal instructions (OpenAI, 2023; Bai et al., 2023; Liu et al., 2023a; Ye et al.,
2023; Dai et al., 2023; Peng et al., 2023b), execute code (Chen et al., 2021a; Zheng et al., 2023; Li
et al., 2023d), use tools (Schick et al., 2023; LangChain, Inc., 2023; AutoGPT, 2023), and more.
This opens up a whole new world of possibilities for AI applications, from autonomous vehicles and
robotics to healthcare and finance. As these models continue to evolve and improve, we can expect
to see even more innovative and exciting applications in the years to come. Whether it’s helping us
solve complex problems, creating new forms of entertainment, or transforming the way we live and
work, LLMs are poised to play a central role in shaping the future of AI.
[Figure 1 diagram: boxes for Qwen-PMP, Qwen-RM, Qwen, Qwen-Chat, Qwen-Chat-RLHF, Code-Qwen, Code-Qwen-Chat, Math-Qwen-Chat, Qwen-VL, and Qwen-VL-Chat, with a legend distinguishing Pretrain Models, SFT Models, RM Models, and RLHF Models.]
Figure 1: Model Lineage of the Qwen Series. We have pretrained the language models, namely Qwen, on massive datasets containing trillions of tokens. We then use SFT and RLHF to align Qwen to human preference and thus we have Qwen-Chat and specifically its improved version Qwen-Chat-RLHF. Additionally, we also develop specialized models for coding and mathematics, such as Code-Qwen, Code-Qwen-Chat, and Math-Qwen-Chat based on Qwen with similar techniques. Note that we previously released the multimodal LLM, Qwen-VL and Qwen-VL-Chat (Bai et al., 2023), which are also based on our Qwen base models.
Despite their impressive capabilities, LLMs are often criticized for their lack of reproducibility, steerability, and accessibility to service providers. In this work, we are pleased to present and release the initial version of our LLM series, Qwen. Qwen is a moniker that derives from the Chinese phrase Qianwen, which translates to “thousands of prompts” and conveys the notion of embracing a wide range of inquiries. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. The model series includes the base pretrained language models, chat models finetuned with human alignment techniques, i.e., supervised finetuning (SFT), reinforcement learning with human feedback (RLHF), etc., as well as specialized models for coding and math. The details are outlined below:
1. The base language models, namely Qwen, have undergone extensive training using up to 3 trillion tokens of diverse texts and code, encompassing a wide range of areas. These models have consistently demonstrated superior performance across a multitude of downstream tasks, even when compared to significantly larger counterparts.
2. The Qwen-Chat models have been carefully finetuned on a curated dataset relevant to task performance, chat, tool use, agent behavior, safety, etc. The benchmark evaluation demonstrates that the SFT models can achieve superior performance. Furthermore, we have trained reward models to mimic human preference and applied them in RLHF to obtain chat models that produce responses preferred by humans. Through the human evaluation of a challenging test, we find that Qwen-Chat models trained with RLHF are highly competitive, while still falling behind GPT-4 on our benchmark.
3. In addition, we present specialized models called Code-Qwen, which includes Code-Qwen-7B and Code-Qwen-14B, as well as their chat models, Code-Qwen-14B-Chat and Code-Qwen-7B-Chat. Specifically, Code-Qwen has been pre-trained on extensive datasets of code and further fine-tuned to handle conversations related to code generation, debugging, and interpretation. The results of experiments conducted on benchmark datasets, such as HumanEval (Chen et al., 2021b), MBPP (Austin et al., 2021), and HumanEvalPack (Muennighoff et al., 2023), demonstrate the high level of proficiency of Code-Qwen in code understanding and generation.
4. This research additionally introduces Math-Qwen-Chat, specifically designed to tackle mathematical problems. Our results show that both Math-Qwen-7B-Chat and Math-Qwen-14B-Chat outperform open-source models of similar sizes by large margins and are approaching GPT-3.5 on math-related benchmark datasets such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
5. Besides, we have open-sourced Qwen-VL and Qwen-VL-Chat, which have the versatile ability to comprehend visual and language instructions. These models outperform the current open-source vision-language models across various evaluation benchmarks and support text recognition and visual grounding in both Chinese and English. Moreover, these models enable multi-image conversations and storytelling. Further details can be found in Bai et al. (2023).
Now, we officially open-source the 14B-parameter and 7B-parameter base pretrained models Qwen and the aligned chat models Qwen-Chat². This release aims to provide more comprehensive and powerful LLMs at developer- or application-friendly scales.
The structure of this report is as follows: Section 2 describes our approach to pretraining and the results of Qwen. Section 3 covers our methodology for alignment and reports the results of both automatic evaluation and human evaluation. Additionally, this section describes details about our efforts in building chat models capable of tool use, code interpretation, and acting as agents. In Sections 4 and 5, we delve into specialized models for coding and math and their performance. Section 6 provides an overview of relevant related work, and Section 7 concludes this paper and points out our future work.
2 PRETRAINING
The pretraining stage involves learning from a vast amount of data to acquire a comprehensive understanding
of the world and its various complexities. This includes not only basic language capabilities but also
advanced skills such as arithmetic, coding, and logical reasoning. In this section, we introduce the
data, the model design and scaling, as well as the comprehensive evaluation results on benchmark
datasets.
2.1 DATA
The size of data has proven to be a crucial factor in developing a robust large language model,
as highlighted in previous research (Hoffmann et al., 2022; Touvron et al., 2023b). To create an
effective pretraining dataset, it is essential to ensure that the data are diverse and cover a wide range
² GitHub: https://github.com/QwenLM/Qwen.
Figure 2: Performance of GPT-4, GPT-3.5, the previous 13B SOTA, as well as Qwen-14B. We demonstrate the results on 12 datasets covering multiple domains, including language understanding, knowledge, reasoning, etc. Qwen significantly outperforms the previous SOTA of similar model sizes, but still lags behind both GPT-3.5 and GPT-4.
of types, domains, and tasks. Our dataset is designed to meet these requirements and includes public web documents, encyclopedias, books, code, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese.
To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing
procedure. For public web data, we extract text from HTML and use language identification tools to
determine the language. To increase the diversity of our data, we employ deduplication techniques,
including exact-match deduplication after normalization and fuzzy deduplication using MinHash
and LSH algorithms. To filter out low-quality data, we employ a combination of rule-based and
machine-learning-based methods. Specifically, we use multiple models to score the content, including
language models, text-quality scoring models, and models for identifying potentially offensive or
inappropriate content. We also manually sample texts from various sources and review them to ensure
their quality. To further enhance the quality of our data, we selectively up-sample data from certain
sources, to ensure that our models are trained on a diverse range of high-quality content. Finally, we
have built a dataset of up to 3 trillion tokens.
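As a concrete illustration of the fuzzy-deduplication step described above, the sketch below uses the datasketch library to build MinHash signatures over character shingles and query an LSH index. The shingle size, permutation count, and similarity threshold are illustrative assumptions, not the values used in the Qwen data pipeline.

```python
# Minimal sketch of MinHash + LSH fuzzy deduplication (assumed parameters,
# not the exact configuration used for the Qwen pretraining data).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over character n-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf-8"))
    return m

def deduplicate(documents: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if no previously kept document is a near-duplicate."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(documents):
        sig = minhash_signature(doc)
        if not lsh.query(sig):          # no near-duplicate found so far
            lsh.insert(f"doc-{idx}", sig)
            kept.append(doc)
    return kept
```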
[Figure 3: compression ratio of the Qwen tokenizer compared with LLaMA-7B, Baichuan-7B, ChatGLM2-6B, and InternLM-7B across languages (th, he, ar, ko, vi, zh, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, en) and code.]
2.2 TOKENIZATION
The design of the vocabulary significantly impacts the training efficiency and the downstream task performance. In this study, we utilize byte pair encoding (BPE) as our tokenization method, following GPT-3.5 and GPT-4. We start with the open-source fast BPE tokenizer, tiktoken (Jain, 2022), and select the vocabulary cl100k_base as our starting point. To enhance the performance of our model on
multilingual downstream tasks, particularly in Chinese, we augment the vocabulary with commonly
used Chinese characters and words, as well as those in other languages. Also, following Touvron et al.
(2023a;b), we have split numbers into single digits. The final vocabulary size is approximately 152K.
The performance of the Qwen tokenizer in terms of compression is depicted in Figure 3. In this comparison, we have evaluated Qwen against several other tokenizers, including XLM-R (Conneau et al., 2019), LLaMA (Touvron et al., 2023a), Baichuan (Inc., 2023a), and InternLM (InternLM Team, 2023). Our findings reveal that Qwen achieves higher compression efficiency than its competitors in most languages. This implies that the cost of serving can be significantly reduced since a smaller number of tokens from Qwen can convey more information than its competitors. Furthermore, we have conducted preliminary experiments to ensure that scaling the vocabulary size of Qwen does not negatively impact the downstream performance of the pretrained model. Despite the increase in vocabulary size, our experiments have shown that Qwen maintains its performance levels in downstream evaluation.
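To make the comparison concrete, the sketch below computes a simple compression ratio for a piece of text. The exact metric behind Figure 3 is not specified in this excerpt, so this assumes a ratio of token counts against a reference tokenization (here tiktoken's cl100k_base); the Hugging Face model id Qwen/Qwen-7B is used only as an example tokenizer.

```python
# Minimal sketch of a tokenizer compression comparison. The exact metric used
# in Figure 3 is not stated here; this assumes "compression ratio" means the
# number of tokens emitted by a candidate tokenizer relative to a reference.
import tiktoken
from transformers import AutoTokenizer

reference = tiktoken.get_encoding("cl100k_base")           # GPT-3.5/4 vocabulary
candidate = AutoTokenizer.from_pretrained("Qwen/Qwen-7B",  # example tokenizer
                                          trust_remote_code=True)

def compression_ratio(text: str) -> float:
    """Tokens from the candidate tokenizer per token from the reference."""
    return len(candidate.encode(text)) / len(reference.encode(text))

print(compression_ratio("千问是一个大规模语言模型系列。"))
```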
2.3 MODEL

2.3.1 ARCHITECTURE
Qwen is designed using a modified version of the Transformer architecture. Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a), which is widely regarded as the top open-source LLM. Our modifications to the architecture include:
Table 1: Model sizes, architectures, and optimization hyper-parameters.

# of Params   Hidden size   Heads   Layers   Learning rate   Batch size   Training tokens
1.8B          2048          16      24       3.0 × 10⁻⁴      4M           2.2T
7B            4096          32      32       3.0 × 10⁻⁴      4M           2.4T
14B           5120          40      40       3.0 × 10⁻⁴      4M           3.0T
• Positional embedding. We have chosen RoPE (Rotary Positional Embedding) (Su et al.,
2021) as our preferred option for incorporating positional information into our model. RoPE
has been widely adopted and has demonstrated success in contemporary large language
models, notably PaLM (Chowdhery et al., 2022; Anil et al., 2023) and LLaMA (Touvron
et al., 2023a;b). In particular, we have opted to use FP32 precision for the inverse frequency
matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve
higher accuracy.
• Bias. For most layers, we remove biases following Chowdhery et al. (2022), but we add
biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su,
2023b).
• Pre-Norm & RMSNorm. In modern Transformer models, pre-normalization is the most
widely used approach, which has been shown to improve training stability compared to
post-normalization. Recent research has suggested alternative methods for better training
stability, which we plan to explore in future versions of our model. Additionally, we have
replaced the traditional layer normalization technique described in (Ba et al., 2016) with
RMSNorm (Jiang et al., 2023). This change has resulted in equivalent performance while
also improving efficiency.
• Activation function. We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017). Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016). As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size (a minimal sketch of this block follows the list).
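The following sketch illustrates a SwiGLU feed-forward block with the reduced 8/3 intermediate dimension mentioned in the last item above. Layer names, the absence of biases in the FFN, and the rounding of the intermediate size are illustrative assumptions rather than the exact Qwen implementation.

```python
# Minimal sketch of a SwiGLU feed-forward block with an intermediate size of
# roughly 8/3 of the hidden size (names and rounding are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        intermediate = int(hidden_size * 8 / 3)      # ~8/3 * hidden size
        self.gate_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) * (x W_up), then project back to hidden size.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```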
2.3.2 CONTEXT LENGTH EXTENSION

Transformer models have a significant limitation in terms of the context length for their attention
mechanism. As the context length increases, the quadratic-complexity computation leads to a drastic
increase in both computation and memory costs. In this work, we have implemented simple training-
free techniques that are solely applied during inference to extend the context length of the model. One
of the key techniques we have used is NTK-aware interpolation (bloc97, 2023), which adjusts the
scale to prevent the loss of high-frequency information in a training-free manner. To further improve
performance, we have also implemented a trivial extension called dynamic NTK-aware interpolation,
which is later formally discussed in (Peng et al., 2023a). It dynamically changes the scale by chunks,
avoiding severe performance degradation. These techniques allow us to effectively extend the context
length of Transformer models without compromising their computational efficiency or accuracy.
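As a sketch of the NTK-aware interpolation described above, the function below enlarges the RoPE base when the current sequence exceeds the training length, following the commonly used NTK-aware scaling recipe; the chunk-wise dynamic schedule used at inference time is not reproduced here, and the parameter values are assumptions.

```python
# Minimal sketch of NTK-aware RoPE scaling: when the sequence is longer than the
# training length, enlarge the RoPE base so high-frequency components are kept.
import torch

def ntk_scaled_inv_freq(dim: int, seq_len: int, train_len: int = 2048,
                        base: float = 10000.0) -> torch.Tensor:
    if seq_len > train_len:
        scale = seq_len / train_len
        # Exponent dim/(dim-2) roughly matches the lowest frequency to the
        # extended context (common NTK-aware formulation).
        base = base * scale ** (dim / (dim - 2))
    # Inverse frequencies computed in FP32 for accuracy (see Section 2.3.1).
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
```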
Qwen additionally incorporates two attention mechanisms: LogN-Scaling (Chiang & Cholak, 2022;
Su, 2023a) and window attention (Beltagy et al., 2020). LogN-Scaling rescales the dot product of
the query and value by a factor that depends on the ratio of the context length to the training length,
ensuring that the entropy of the attention value remains stable as the context length grows. Window
attention restricts the attention to a limited context window, preventing the model from attending to
tokens that are too far away.
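A minimal sketch of LogN-Scaling follows, assuming the scaling factor is applied to the query before the attention dot product and that the training context length is 2,048 tokens; this illustrates the idea rather than Qwen's exact implementation.

```python
# Minimal sketch of LogN attention scaling: when the inference context exceeds
# the training context, scale the query by log(context)/log(train) so the
# attention entropy stays roughly stable as the context grows.
import math
import torch

def logn_scale(query: torch.Tensor, context_len: int, train_len: int = 2048) -> torch.Tensor:
    if context_len <= train_len:
        return query
    factor = math.log(context_len) / math.log(train_len)
    return query * factor
```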
We also observed that the long-context modeling ability of our model varies across layers, with lower
layers being more sensitive in context length extension compared to the higher layers. To leverage
this observation, we assign different window sizes to each layer, using shorter windows for lower
layers and longer windows for higher layers.
Table 2: Overall performance on widely-used benchmarks compared to open-source base models. Our largest Qwen model with 14 billion parameters outperforms previous 13B SOTA models on all datasets.
2.4 TRAINING
To train Qwen, we follow the standard approach of autoregressive language modeling, as described
in Radford et al. (2018). This involves training the model to predict the next token based on the
context provided by the previous tokens. We train models with context lengths of 2048. To create
batches of data, we shuffle and merge the documents, and then truncate them to the specified context
lengths. To improve computational efficiency and reduce memory usage, we employ Flash Attention
in the attention modules (Dao et al., 2022). We adopt the standard optimizer AdamW (Kingma & Ba,
2014; Loshchilov & Hutter, 2017) for pretraining optimization. We set the hyperparameters β₁ = 0.9, β₂ = 0.95, and ε = 10⁻⁸. We use a cosine learning rate schedule with a specified peak learning rate
for each model size. The learning rate is decayed to a minimum learning rate of 10% of the peak
learning rate. All the models are trained with BFloat16 mixed precision for training stability.
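A minimal sketch of the learning-rate schedule described above: warm up to the peak rate, then cosine-decay to a floor of 10% of the peak. The warmup length is an illustrative assumption; the peak rates per model size are those listed in Table 1.

```python
# Minimal sketch of a warmup + cosine schedule decaying to 10% of the peak rate.
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 3.0e-4,
               warmup_steps: int = 2000, min_ratio: float = 0.1) -> float:
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)   # linear warmup (assumed)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```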
2.5 EXPERIMENTAL RESULTS

To evaluate the zero-shot and few-shot learning capabilities of our models, we conduct a thorough benchmark assessment using a series of datasets. We compare Qwen with the most recent open-source base models, including LLaMA (Touvron et al., 2023a), LLaMA2 (Touvron et al., 2023b), MPT (Mosaic ML, 2023), Falcon (Almazrouei et al., 2023), Baichuan2 (Yang et al., 2023), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM Team, 2023), XVERSE (Inc., 2023b), and StableBeluga2 (Stability AI, 2023). Our evaluation covers a total of 7 popular benchmarks, which are MMLU (5-shot) (Hendrycks et al., 2020), C-Eval (5-shot) (Huang et al., 2023), GSM8K (8-shot) (Cobbe et al., 2021), MATH (4-shot) (Hendrycks et al., 2021), HumanEval (0-shot) (Chen et al., 2021b), MBPP (0-shot) (Austin et al., 2021), and BBH (Big Bench Hard) (3-shot) (Suzgun et al., 2022). We aim to provide a comprehensive summary of the overall performance of our models across these benchmarks.
Table 3: Results of Qwen on long-context inference using various techniques. Our experimental findings reveal that the application of our crucial techniques enables the model to consistently achieve low perplexity as the context length increases. This suggests that these techniques play a significant role in enhancing the model’s ability to comprehend and generate lengthy texts.

Model                                  1024    2048    4096     8192      16384   (sequence length)
Qwen-7B                                4.23    3.78    39.35    469.81    2645.09
+ dynamic ntk                          4.23    3.78    3.59     3.66      5.71
+ dynamic ntk + logn                   4.23    3.78    3.58     3.56      4.62
+ dynamic ntk + logn + window attn     4.23    3.78    3.58     3.49      4.32
Qwen-14B                               -       3.46    22.79    334.65    3168.35
+ dynamic ntk + logn + window attn     -       3.46    3.29     3.18      3.42
In this evaluation, we focus on the base language models without alignment and collect the baselines’
best scores from their official results and OpenCompass (OpenCompass Team, 2023). The results are
presented in Table 2.
Our experimental results demonstrate that the three Qwen models exhibit exceptional performance across all downstream tasks. It is worth noting that even the larger models, such as LLaMA2-70B, are outperformed by Qwen-14B in 3 tasks. Qwen-7B also performs admirably, surpassing LLaMA2-13B and achieving comparable results to Baichuan2-13B. Notably, despite having a relatively small number of parameters, Qwen-1.8B is capable of competitive performance on certain tasks and even outperforms larger models in some instances. The findings highlight the impressive capabilities of the Qwen models, particularly Qwen-14B, and suggest that smaller models, such as Qwen-1.8B, can still achieve strong performance in certain applications.
To evaluate the effectiveness of context length extension, Table 3 presents the test results on arXiv³ in terms of perplexity (PPL). These results demonstrate that by combining NTK-aware interpolation, LogN-Scaling, and layer-wise window assignment, we can effectively maintain the performance of our models at context lengths of over 8,192 tokens.

³ The dataset contains academic papers from https://arxiv.org/
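For reference, the perplexity reported in Table 3 follows the standard definition over a tokenized sequence; this formula is the conventional one and is not stated explicitly in the report:

\[
\mathrm{PPL}(x_{1:N}) \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\Big)
\]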
3 ALIGNMENT
Pretrained large language models have been found to be out of sync with human behavior, making
them unsuitable for serving as AI assistants in most cases. Recent research has shown that the use of
alignment techniques, such as supervised finetuning (SFT) and reinforcement learning from human
feedback (RLHF), can significantly improve the ability of language models to engage in natural
conversation. In this section, we will delve into the details of how Qwen models have been trained
using SFT and RLHF, and evaluate their performance in the context of chat-based assistance.
3.1 SUPERVISED FINETUNING

To gain an understanding of human behavior, the initial step is to carry out supervised finetuning.
This process fine-tunes a pre-trained model on chat-style data, which includes both human queries
and AI responses. Supervised finetuning is similar to text-to-text transfer, but it is capable of creating
a helpful AI assistant due to the intricate and varied nature of the datasets used for finetuning. In the
following sections, we will delve into the details of data construction and training methods.
3.1.1 DATA
To enhance the capabilities of our supervised finetuning datasets, we have annotated conversations
in multiple styles. While conventional FLAN datasets (Wei et al., 2022a) contain a vast amount of
data prompted with questions, instructions, and answers in natural language, our approach takes
it a step further by annotating human-style conversations. This practice, inspired by Ouyang et al. (2022), is aimed at improving the model’s helpfulness by focusing on natural language generation for
diverse tasks. To ensure the model’s ability to generalize to a wide range of scenarios, we specifically
excluded data formatted in prompt templates that could potentially limit its capabilities. Furthermore,
we have prioritized the safety of the language model by annotating data related to safety concerns
such as violence, bias, and pornography. This will enable the model to detect and reject malicious
prompts or provide safe answers in such situations.
In addition to data quality, we have observed that the way the training data is formatted can significantly impact the final performance of the model. For this reason, we utilized the ChatML-style format (OpenAI, 2022), which is a versatile meta language capable of describing both the metadata (such as roles)
and the content of a turn. This format enables the model to effectively distinguish between various
types of information, including system setup, user inputs, and assistant outputs, among others. By
leveraging this approach, we can enhance the model’s ability to accurately process and analyze
complex conversational data.
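As an illustration of this format, the helper below renders a conversation in the ChatML style. The special tokens follow the publicly documented ChatML convention; the exact system prompt and formatting details used for Qwen-Chat are given in Appendix A.1.1.

```python
# Minimal sketch of ChatML-style formatting for a single training example.
def to_chatml(messages: list[dict]) -> str:
    """messages: [{"role": "system"|"user"|"assistant", "content": "..."}]"""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(parts) + "\n"

example = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
])
print(example)
```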
3.1.2 TRAINING
Consistent with pretraining, we also apply next-token prediction as the training task for SFT. We apply
the loss masks for the system and user inputs. More details are demonstrated in Appendix A.1.1.
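A minimal sketch of the loss masking described above, assuming PyTorch's ignore-index convention and a boolean mask marking assistant tokens (which would be derived from the ChatML role boundaries); only assistant tokens then contribute to the next-token loss.

```python
# Minimal sketch of loss masking for SFT: only assistant tokens contribute to
# the next-token prediction loss (ignore_index=-100 is the PyTorch convention).
import torch
import torch.nn.functional as F

def masked_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                   assistant_mask: torch.Tensor) -> torch.Tensor:
    """assistant_mask[i] is True where input_ids[i] belongs to an assistant turn."""
    labels = input_ids.clone()
    labels[~assistant_mask] = -100                       # mask system/user tokens
    # Standard causal shift: predict token t+1 from positions <= t.
    return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                           labels[:, 1:].flatten(),
                           ignore_index=-100)
```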
The model’s training process utilizes the AdamW optimizer, with the following hyperparameters: β₁ set to 0.9, β₂ set to 0.95, and ε set to 10⁻⁸. The sequence length is limited to 2048, and the batch size is 128. The model is trained for a total of 4,000 steps, with the learning rate gradually increased over the first 1,430 steps, reaching a peak of 2 × 10⁻⁶. To prevent overfitting, weight decay is applied with a value of 0.1, dropout regularization is set to 0.1, and gradient clipping is enforced with a limit of 1.0.
3.2 REINFORCEMENT LEARNING FROM HUMAN FEEDBACK

While SFT has proven to be effective, we acknowledge that its generalization and creativity capabilities may be limited, and it is prone to overfitting. To address this issue, we have implemented
Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human
preferences, following the approaches of Ouyang et al. (2022); Christiano et al. (2017). This process
involves training a reward model and using Proximal Policy Optimization (PPO) (Schulman et al.,
2017) to conduct policy training.
3.2.1 REWARD MODEL

To create a successful reward model, as with building a large language model (LLM), it is crucial to first undergo pretraining and then finetuning. This pretraining process, also known as preference
model pretraining (PMP) (Bai et al., 2022b), necessitates a vast dataset of comparison data. This
dataset consists of sample pairs, each containing two distinct responses for a single query and their
corresponding preferences. Similarly, finetuning is also conducted on this type of comparison data,
but with a higher quality due to the presence of quality annotations.
During the fine-tuning phase, we gather a variety of prompts and adjust the reward model based on human feedback for responses from the Qwen models. To ensure the diversity and complexity of user prompts are properly taken into account, we have created a classification system with around 6,600 detailed tags and implemented a balanced sampling algorithm that considers both diversity and complexity when selecting prompts for annotation by the reward model (Lu et al., 2023). To generate a wide range of responses, we have utilized Qwen models of different sizes and sampling strategies,
as diverse responses can help reduce annotation difficulties and enhance the performance of the
reward model. These responses are then evaluated by annotators following a standard annotation
guideline, and comparison pairs are formed based on their scores.
In creating the reward model, we utilize the same-sized pre-trained language model Qwen and initiate the PMP process. Subsequently, we fine-tune the PMP model to enhance its performance. It is important to mention that we have incorporated a pooling layer into the original Qwen model to extract the reward for a sentence based on a specific end token. The learning rate for this process has been set to a constant value of 3 × 10⁻⁶, and the batch size is 64. Additionally, the sequence length is set to 2048, and the training process lasts for a single epoch.
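A minimal sketch of the pooling-plus-scalar reward head described above: it reads the transformer hidden state at the designated end token of each sequence and maps it to a scalar reward. Module and argument names are illustrative, not Qwen's internal ones.

```python
# Minimal sketch of a reward head on top of a causal LM: a linear layer reads the
# hidden state at the final (end-of-response) token and outputs a scalar reward.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, end_positions: torch.Tensor) -> torch.Tensor:
        """hidden_states: [batch, seq, hidden]; end_positions: [batch] index of the end token."""
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        pooled = hidden_states[batch_idx, end_positions]   # hidden state at the end token
        return self.value_head(pooled).squeeze(-1)         # scalar reward per sequence
```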
Table 4: Test accuracy of the Qwen PMP and reward models on diverse human preference benchmark datasets.
3.2.2 REINFORCEMENT LEARNING

Our Proximal Policy Optimization (PPO) process involves four models: the policy model, value
model, reference model, and reward model. Before starting the PPO procedure, we pause the policy
model’s updates and focus solely on updating the value model for 50 steps. This approach ensures
that the value model can adapt to different reward models effectively.
During the PPO operation, we use a strategy of sampling two responses for each query simultaneously.
This strategy has proven to be more effective based on our internal benchmarking evaluations. We set
the KL divergence coefficient to 0.04 and normalize the reward based on the running mean.
The policy and value models have learning rates of 1 × 10⁻⁶ and 5 × 10⁻⁶, respectively. To enhance
training stability, we utilize value loss clipping with a clip value of 0.15. For inference, the policy
top-p is set to 0.9. Our findings indicate that although the entropy is slightly lower than when top-p is
set to 1.0, there is a faster increase in reward, ultimately resulting in consistently higher evaluation
rewards under similar conditions.
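A minimal sketch of the reward shaping implied above: the reward-model score is centered with a running mean and combined with a per-token KL penalty against the reference model using the 0.04 coefficient. The momentum value and the exact KL estimator are illustrative assumptions.

```python
# Minimal sketch of running-mean reward normalization plus a KL penalty.
class RunningMean:
    """Tracks an exponential running mean used to center the reward-model score."""
    def __init__(self, momentum: float = 0.99):
        self.mean = 0.0
        self.momentum = momentum

    def center(self, value: float) -> float:
        centered = value - self.mean
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * value
        return centered

def shaped_reward(rm_score: float, logprob_policy: float, logprob_ref: float,
                  tracker: RunningMean, kl_coef: float = 0.04) -> float:
    """Centered reward-model score minus a KL penalty against the reference model."""
    kl = logprob_policy - logprob_ref       # simple per-token KL estimate
    return tracker.center(rm_score) - kl_coef * kl
```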
Additionally, we have implemented a pre-trained gradient to mitigate the alignment tax. Empirical
findings indicate that, with this specific reward model, the KL penalty is adequately robust to
counteract the alignment tax in benchmarks that are not strictly code or math in nature, such as
those that test common sense knowledge and reading comprehension. It is imperative to utilize
a significantly larger volume of the pretrained data in comparison to the PPO data to ensure the
effectiveness of the pretrained gradient. Additionally, our empirical study suggests that an overly
large value for this coefficient can considerably impede the alignment to the reward model, eventually
compromising the ultimate alignment, while an overly small value would only have a marginal effect
on alignment tax reduction.
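A minimal sketch of mixing the pretrained gradient into the policy update, in the spirit of InstructGPT's PPO-ptx objective: a next-token loss on pretraining data is added to the PPO loss. The coefficient value and batch handling are illustrative assumptions, not the settings discussed above.

```python
# Minimal sketch of adding a pretraining ("ptx") loss to the PPO policy objective.
import torch
import torch.nn.functional as F

def policy_loss_with_ptx(ppo_loss: torch.Tensor, lm_logits: torch.Tensor,
                         lm_labels: torch.Tensor, ptx_coef: float = 0.5) -> torch.Tensor:
    # Next-token loss on a batch of pretraining text, added to the PPO objective.
    ptx_loss = F.cross_entropy(lm_logits[:, :-1].flatten(0, 1), lm_labels[:, 1:].flatten())
    return ppo_loss + ptx_coef * ptx_loss
```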
3.3 AUTOMATIC AND HUMAN EVALUATION OF ALIGNED MODELS

To showcase the effectiveness of our aligned models, we conduct a comparison with other aligned
models on well-established benchmarks, including MMLU (Hendrycks et al., 2020), C-Eval (Huang
et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021b), and BBH (Suzgun et al.,
2022). Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting
to demonstrate how well the models follow instructions. The prompt in a zero-shot setting consists
of an instruction and a question without any previous examples in the context. The results of the
baselines are collected from their official reports and OpenCompass (OpenCompass Team, 2023).
The results in Table 5 demonstrate the effectiveness of our aligned models in understanding human instructions and generating appropriate responses. Qwen-14B-Chat outperforms all other models except ChatGPT (OpenAI, 2022) and LLaMA2-Chat-70B (Touvron et al., 2023b) on all datasets, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021b), and BBH (Suzgun et al., 2022). In particular, Qwen’s performance on HumanEval, which measures the quality of generated code, is significantly higher than that of other open-source models.
Moreover, Qwen’s performance is consistently better than that of open-source models of similar size,
such as LLaMA2 (Touvron et al., 2023b), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM
Team, 2023), and Baichuan2 (Yang et al., 2023). This suggests that our alignment approach, which
involves fine-tuning the model on a large dataset of human conversations, has been effective in
improving the model’s ability to understand and generate human-like language.
Table 5: Performance of aligned models on widely-used benchmarks. We report both zero-shot
and few-shot performance of the models.
[Figure: win rate vs. GPT for Qwen-Chat models.]