Openassistant Conversations - Democratizing Large Language Model Alignment

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

OpenAssistant Conversations - Democratizing Large

Language Model Alignment

Andreas Köpf∗ Yannic Kilcher∗


[email protected] [email protected]
arXiv:2304.07327v1 [cs.CL] 14 Apr 2023

Dimitri von Rütte Sotiris Anagnostidis Zhi-Rui Tam Keith Stevens

Abdullah Barhoum Nguyen Minh Duc Oliver Stanley Richárd Nagyfi Shahul ES

Sameer Suri David Glushkov Arnav Dantuluri Andrew Maguire

Christoph Schuhmann Huu Nguyen

Alexander Mattick
[email protected]

Abstract
Aligning large language models (LLMs) with human preferences has proven to
drastically improve usability and has driven rapid adoption as demonstrated by
ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and rein-
forcement learning from human feedback (RLHF) greatly reduce the required skill
and domain knowledge to effectively harness the capabilities of LLMs, increasing
their accessibility and utility across various domains. However, state-of-the-art
alignment techniques like RLHF rely on high-quality human feedback data, which
is expensive to create and often remains proprietary. In an effort to democratize
research on large-scale alignment, we release OpenAssistant Conversations, a
human-generated, human-annotated assistant-style conversation corpus consisting
of 161,443 messages distributed across 66,497 conversation trees, in 35 differ-
ent languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.2 To
demonstrate the OpenAssistant Conversations dataset’s effectiveness, we present
OpenAssistant, the first fully open-source large-scale instruction-tuned model to be
trained on human data. A preference study revealed that OpenAssistant replies are
comparably preferred to GPT-3.5-turbo (ChatGPT) with a relative winrate of 48.3%
vs. 51.7% respectively. We release our code3 and data4 under fully permissive
licenses.


These authors contributed equally to this work.
2
A current version of the data collection platform is deployed at https://open-assistant.io.
3
https://github.com/LAION-AI/Open-Assistant
4
https://huggingface.co/OpenAssistant

Preprint. Under review.


1 Introduction

Artificial intelligence (AI), particularly in the field of natural language processing, has witnessed
rapid progress in recent years. Major advancements are primarily driven by a straightforward formula:
Take a simple transformer-based architecture, increase the number of parameters by enlarging the
specified depth and width and finally, significantly scale the training corpus. Although models have
for some time exhibited an extraordinary, super-human ability to fit the training data and generalize
based on their trained objective [1, 2], their adoption among the general public has until recently been
slow. This can be mainly attributed to misalignment between the model’s predictions and the final
intended usage. The alignment of AI systems with human values, intentions, and preferences is a
vital and intricate challenge within the AI research domain. This refers to the process of ensuring that
AI systems can not only successfully optimize surrogate provided training objectives, but also that
their predictions are in line with their intended purpose and adhere to ethical and safety standards
provided by humans. One possible solution is assistant-style fine-tuning of language models that has
recently emerged as a promising approach to making large language models more in line with human
preferences by generating more desirable outputs [3] and thus making them more useful.
A notable instance of such an assistant-style model is ChatGPT, which has gained unprecedented user
growth due to remarkable capabilities demonstrated in a wide range of fields, but also the ease-of-use
for the end user. Aligning the model’s predictions is in this case accomplished by introducing
human-generated examples of intended usage and using reinforcement learning from human feedback
(RLHF) [4, 5]. In RLHF, the human acts as a teacher and provides feedback in the form of rewards
or penalties. In more detail, Ouyang et al. [4] proposed a three stage procedure to align language
models.

• Collect human-generated demonstrations of desired behaviour and train a supervised fine-


tuned (SFT) model.

• Train a reward model (RM) on human-annotated rankings for different model outputs.

• Use the RM as a reward function and fine-tune the SFT model to maximize the reward
generated by its responses. This is achieved using the PPO algorithm [6].

It becomes apparent that the benefits of all the previous aforementioned stages, are predominantly
imposed by the quality of the data used [7]. Despite this, availability of large-scale human feedback
datasets for the open research community remains scarce. Most openly accessible datasets are
comprised of synthetic data of instructions automatically generated by querying language models [8,
9, 10, 11, 12]. Unfortunately, these datasets are limited with respect to their complexity, creativity and
quality, as they rely on a pre-specified list of possible instruction types. Without sufficiently broad
and high quality data, even models with substantial size and pre-training would prove inadequate for
building capable, helpful, and harmless AI assistants.
Research in this area has predominantly been confined to a select few research labs with access to
the required resources to engage in large-scale training and data collection. This monopolization
of access to quality data undermines the potential for inclusive and diverse research endeavours,
particularly in relation to alignment challenges, which arguably constitute some of the most crucial
research areas of our time. In an effort to democratize research on aligning large language models,
we introduce and release the OpenAssistant Conversations dataset. This dataset is the culmination of
an extensive open- and crowd-sourcing initiative, and its release to the research community seeks
to promote more inclusive research in this highly-influential domain. We provide a comprehensive
analysis of the dataset, assessing ethical implications and safety considerations. We also fine-tune and
release several assistant and preference models to further advance open access and research in this
area. This transparency allows for iterative improvements on the released artifacts, fostering a more
collaborative and inclusive research environment. Our belief is that our work makes a noteworthy
contribution towards creating a research landscape that is more inclusive and democratized, thereby
providing opportunities to researchers from diverse backgrounds. In the following sections, we delve
into the intricacies of the OpenAssistant Conversations dataset and discuss its implications for the
alignment of large language models and for society at large.

2
2 Data Format
The basic data structure is a Conversation Tree (CT) with nodes representing written messages in
a conversation. A CT’s root node represents an initial prompt, given by the prompter. To avoid
confusion, we call the roles of the conversation prompter and assistant. This allows us to reserve the
term user for the human contributors. Both the prompter and assistant roles can be fulfilled by either
a human user or a machine. Every tree node is labelled by its role, and can have multiple children
of the opposite role, each of which represents a separate next step in the conversation. A path from
the root to any node in the tree (including to itself) represents a valid conversation with prompter
and assistant taking turns and is called a thread. Tree nodes are annotated with additional data such
as user-provided labels and metadata, such as collection timestamp and indicated language. Each
assistant node further has a rank associated which orders it compared to replies of the parent prompt,
according to user preferences.

Figure 1: An example Conversation Tree (CT) of depth 4 containing 12 messages. Any path from the
root prompt to a node is a valid thread.

3 Data Collection
The OpenAssistant Conversations dataset is a comprehensive collection of conversational data that
was obtained through a crowdsourcing effort involving more than 13,000 volunteers. The data was
collected using a web-app interface5 , which facilitated the process by dividing it into five separate
steps: prompting, labelling prompts, adding reply messages as prompter or assistant, labelling replies,
and ranking assistant replies. The dataset was curated with content moderation and spam filtering as
key components of the annotation pipeline, ensuring high quality and safety standards.
Volunteers completed over 625,000 tasks in total, resulting in the collection of over 10,000 fully
annotated and filtered Conversation Trees. We hope the resulting dataset will be an important
resource for researchers studying natural language processing and machine learning, as it allows for
the development and testing of new algorithms and models for conversational AI. By providing such a
large and diverse dataset, the OpenAssistant Conversations dataset opens up new avenues of research
in the field, enabling researchers to explore the complexities of human language and interactions in
ways that were not possible before [13].
Example User Interface (UI) displays of the data collection platform can be found in Appendix B. In
the following sections, we provide more details regarding the various aspects of the data collection
pipeline.

3.1 Single-Step Collection

The process of data collection in this study is structured to be both efficient and effective by breaking
the work down into single units and advancing multiple conversation trees one step at a time. This
approach minimizes data loss due to user attrition and ensures that every unit of work is captured for
5
Reachable at https://open-assistant.io/

3
utilization. The users are presented with a range of task types, either by choice or through random
sampling (weighted according to current requirements). The task types include creating prompts,
replying as an assistant, replying as a prompter, labeling prompts or replies, and ranking prompter or
assistant replies.

Create a prompt. Users are required to write an initial prompt that forms the root of a new
conversation tree. As this task is highly popular among users, a lottery system is employed to manage
the selection of new prompts, with only a fixed number of prompts being chosen for continuation at
any given moment. This method serves to regulate the influx of new prompts and maintain a balanced
distribution of tasks.

Reply as assistant. Replying as an assistant is a more labor-intensive task that necessitates users
to carefully consider their responses and often engage in external research to provide a helpful and
relevant answer to the prompter’s request. This task type, despite its demanding nature, has been
reported to be the most enjoyable by many users due to the diverse array of topics covered. To account
for the increased effort required for this task, a reward system has been implemented to incentivize
users to participate. See Figure 8 for a UI preview.

Reply as prompter. The task of replying as a prompter, on the other hand, does not impose strict
quality requirements but instead emphasizes on the importance of diversity to accommodate various
use-cases. Examples of prompter replies may include asking for clarification, modifying the original
intent, posing a follow-up question, or changing the direction of the conversation altogether.

Label a prompt or reply. Users are presented with a message from the database along with the
preceding conversation thread (if available) and are asked to categorize the message according to
three dimensions: spam detection, guideline adherence, and quality. For spam detection, users assess
whether the message is unsuitable for inclusion in the dataset, such as instances of obvious spam or
trolling. Messages flagged as spam by multiple users are automatically removed from the dataset.
Guideline adherence is evaluated through a set of labels that determines whether the contribution
aligns with the established guidelines (see Figure 6). These labels encompass the message being in a
language other than the specified one, containing personally identifiable information, hate speech,
sexual content, or being deemed inappropriate. Messages labelled in this manner are subsequently
reviewed by human moderators.
Quality labels require users to rate the message on a five-point Likert scale across dimensions such
as quality, creativity, humorousness, politeness, and harmlessness. These labels are stored for later
analysis and application. Notably, users can voluntarily assign these labels (as well as spam &
guideline adherence labels) to any message within the system, even as part of another task, as an
additional contribution.

Rank assistant replies. Users are presented with two or more responses to the same parent message
and asked to rank them in order of preference. This allows for a comparative analysis of the various
responses and helps in identifying the most effective and engaging replies (Figure 7).
In summary, this data collection methodology effectively divides work into single units, minimizes
data loss due to user attrition, and captures valuable information for future analysis and application.
By offering users a diverse range of task types, the study encourages active participation and ensures
the collection of rich and varied data for a comprehensive understanding of the subject.

3.2 Message Tree State Machine

The tree state machine serves as a systematic approach to managing the progression of message trees
throughout the data collection process. This method ensures that each tree undergoes a series of
states until it reaches completion, beginning with the creation of new trees by randomly sampling
from the pool of initial prompts. The various states that a message tree passes through include the
initial prompt review state, growing state, and end state, as well as the aborted low-grade state for
trees that are deemed unsuitable for inclusion in the dataset.
Upon the creation of a new tree, it enters the initial prompt review state, where multiple users are
tasked with providing labels to assess its quality and suitability. This state plays a crucial role in

4
identifying any potential issues with the initial prompt, such as spam or content that violates the
established guidelines. If the provided labels indicate that the tree contains spam or unsuitable
content, it is transitioned to the aborted low-grade state and subsequently removed from the dataset.
Conversely, if the tree passes the initial prompt review state, it proceeds to the growing state.
The growing state involves the continuous issuance of tasks to users, such as providing replies, labels,
and rankings, to facilitate the development and expansion of the conversation tree. This state is
essential for collecting diverse and rich data, as it allows for the accumulation of multiple interactions
and the exploration of various conversation paths, given the same initial prompt. The growing state
continues until the tree reaches its end state, which is defined by a maximum number of messages or
other predetermined criteria.
Parameters within the data collection platform govern the behaviour of the tree state machine, such as
the average number of messages collected for each parent message or the maximum tree depth. These
parameters enable researchers to fine-tune the data collection process according to their specific
research goals and requirements, ensuring a more targeted and efficient approach to gathering data.
Parameters varied during the collection of the dataset. Current settings can be found in Appendix F.
In summary, the tree state machine is a structured and systematic method for managing the progression
of message trees during the data collection process. By guiding each tree through a series of states,
from initial prompt review to growing and reaching its end state, the tree state machine ensures
the collection of high-quality, diverse, and relevant data. Additionally, the inclusion of platform
parameters allows for the customization of the data collection process to align with specific research
objectives, further enhancing the effectiveness and utility of this approach.

3.3 Ranking merging

Reinforcement learning from human feedback (RLHF) [14, 15] comprises a set of techniques that all
aim to optimize the output distribution of a language model using the preference structure provided
by human rankers. To get a preference structure that is well aligned to users, we cannot just rely on
the opinions of individual rankers, due to the high variance in human preferences. Since our objective
is to collect data for a generally capable digital assistant, every ranking of possible responses is
performed by K independent rankers, (see, section 3.1). Once this is done, we need to fuse these K
individual opinions into one consensus opinion usable in training preference models.
We perform this preference fusion by treating it as a ranked-voting problem, whose objective it
is to maintain the preferences as faithfully as possible. The method chosen for this is known as
"ranked pairs" or "Tideman’s method" [16]. Simplified, this method creates a sorted list of "winners"
according to the strength of the preference of one element over the others. The way the preference
strength is measured is by considering all preference pairs in the input votes: For example, if the
votes are two times A > B > C and one time B > A > C, this would mean that the pair A > B
exists 2 times, while A > C exists three times. The method then orders sorts the winners by winning
strength, i.e. here A > C would happen before A > B, and constructs a directed graph using the
preferences, i.e. A > C would become an edge A → C. Edges are added one by one according to
their weight (higher first), as long as no cycle is produced. If the edge would produce a cycle, it is
skipped as the corresponding dominant preferences are already in the graph.
The resulting directed acyclic graph can be turned into a preference structure by recursively removing
source nodes and adding them to the back of a preference list, since source nodes are, by construction,
not preferred over any other item left in the graph. In practice, one can speed up the construction by
not explicitly constructing the preference graph and fusing the graph con- and destruction.

3.4 Contributor Guidelines

To achieve a high degree of quality and consistency across a wide range of contributors, we issue
clear and detailed guidelines. A full copy of these guidelines at the present time can be found in
Appendix A.
Our guidelines follow three main goals: 1. Clarify the meanings, scales, and criteria for assigning
labels and rankings during the labelling and ranking tasks 2. Make assistant responses be polite,
helpful, concise, friendly, and safety-aware and 3. Instruct prompts and prompter replies to explore a
diverse and challenging set of inputs to the assistant role.

5
In particular, the guidelines establish a framework for safely interacting with an automated assistant
by drawing inspiration from the concept of informed consent. Rather than categorically denying
large parts of request categories, we aim to provide the prompter with useful feedback, for example
drawing special awareness to dangerous activities, elaborating on weaknesses of automated assistants,
such as hallucinations, and discouraging and denying requests asking for illegal or highly inappro-
priate content. In our validation experiments in training assistant models based on OpenAssistant
Conversations, we observe a high degree of consistency of the trained models’ outputs with our given
guidelines.
Although guideline adherence is already high in our models after training, our approach is completely
compatible with deploying additional safety measures during inference, such as secondary models to
filter or modify ill-suited user input.

3.5 Quality Control & Content Moderation

We take a multi-pronged approach to quality assurance, with the main pillars being a system of
reward points & leaderboards, and manual review of flagged content by human moderators. This both
maximizes the quality of contributions, while effectively utilizing the limited time of the volunteer
moderators.
In an effort to demonstrate progress and achievement to users, and to encourage high-quality contri-
butions, our system allocates points for the completion of tasks. These points contribute to various
leaderboards, including daily, weekly, monthly, and all-time rankings. A level system also exists,
wherein higher point accumulation results in elevated levels, reflecting veteran status and engagement.
In the future, this system could potentially be developed further to facilitate preferential access to
more engaging tasks or similar perks.
The distribution of points is contingent upon task type, as certain tasks require greater effort, such as
the reply as assistant task (compared to the create a prompt task). A significant portion of points
is deferred and reliant on interactions with other users. For instance, a user’s assistant reply may
gather many additional points if it is subsequently deemed non-spam and highly ranked by other
users. Inversely, points may be reduced or lost for answers that are labeled as spam or down-voted by
consensus of other users.
Within the moderator section of the website, an alternative leaderboard, designated the Trollboard,
is exhibited. This leaderboard assesses users based on an aggregate of negative labels, reports, and
down-votes received for their contributions. This approach enables human moderators to proactively
scrutinize potentially misbehaving users in a comprehensive manner. The Trollboard has proven to
be an effective tool in addressing the numerical disparity between users and moderators, maximizing
the collective efforts of contributors to identify undesirable contributions.
Users further have the option to report messages to moderators for manual review, either via the
platform, or directly via communication on a community chat server. Moderators have the ability to
delete individual messages, or all messages of a given user, at their own discretion. Deleted messages
are retained, but marked as deleted and not exported for training.

4 Dataset Composition

We release several variants of the OpenAssistant Conversations dataset representing various levels
of filtering. The full dataset consists of 161,443 messages distributed across 66,497 conversation
trees, in 35 different languages, annotated with 461,292 quality ratings. This includes 8,576 synthetic
messages, leaving 152,867 human-submitted messages. Of the 66,497 total conversation trees,
we consider 10,968 complete, meaning the full number of messages has been collected and the
moderation process for these trees has been concluded. These completed trees contain 92,365
messages.
The set of categories for which Likert-scale human labels are collected is Creativity, Quality, Humor,
Helpfulness, Violence, and Rudeness. The set of categories for which binary human labels are
collected is Language Mismatch, Not Appropriate, Personally Identifiable Information, Hate Speech,
and Sexual Content. We additionally release the rank of each assistant message compared to other

6
OtherCatalan
Portuguese (Brazil)
Thai
Chinese
5.8%1.6%
1.7%
French
1.9%
2.5%
2.9% German
English 3.6%
42.8% 5.7% Russian

31.4%

Spanish

Figure 2: Relative share of the most frequent languages in the dataset.

assistant messages submitted for the same prompt, computed from the preference rankings of several
human annotators.
Of the 161,443 total messages, 69,614 are assistant replies and 91,829 are user prompts. Related to
this, 52,159 conversation trees consist of only a single initial user prompt which has not yet received
any assistant replies.
The dataset is dominated by English and Spanish messages as illustrated in Figure 2. The prominence
of English is expected as a result of the community around OpenAssistant originating in the English-
speaking open-source machine learning community. The high quantity of Spanish messages can
be attributed to the publicity given to OpenAssistant by prominent figures in the Spanish machine
learning community.

Figure 3: Distribution of Messages

Figure 3 illustrates how a small number of power users contributed a significant proportion of the
dataset. This must be taken into account when considering possible biases in the data. Although
significant effort went into preventing responses directly copied from other sources, it is possible that
some users utilised automated techniques to enter data and this should also be kept in mind.

7
5 User Demographics and Satisfaction
To gain a deeper understanding of the contributors’ demographics, a Google Form survey was sent
out as an announcement on the project’s Discord channel. The method of recruiting via Discord is
biased towards users who are present on the platform and have been active around the time of the
announcement. Therefore, we intend to send out e-mails to registered users in the future. Fluency
in English can also affect the willingness to participate in the survey. Translations of the survey are
planned in languages with higher representations, such as Spanish, to ensure that responses from
monolingual users are not missed.
The survey consists of 3 parts with questions on demographic information, personal motivation and
user satisfaction. Since prompts are received from all over the world in multiple languages, we
wanted to make sure that demographic questions, such as levels of completed education [17] are
in fact relatable to everyone. We have omitted questions on ethnicity and nationality. Instead, we
asked questions about English proficiency, country of origin and the language for which most of their
contributions were made. At the time of the release of this paper, a total of 226 participants have
answered the survey, with the overwhelming majority, 201 being male and only 10 of the respondents
female. Only 5 of our respondents self-identified as non-binary / other, and 10 preferred not to answer.
Despite this homogeneity, the respondents do differ in their level of education and motivation10 for
contribution. They understand and use artificial intelligence at different levels9 and have different use
cases for the technology11. People were in general very happy to have contributed to the project, with
94.25% either agreeing or strongly agreeing with the statement "Overall, I’m glad I have contributed
to OpenAssistant.".2 For about 40%, this has been their very first time contributing to a community
project.

Figure 4: Demography of 226 respondents

6 Experimental Validation
6.1 Instruction Tuning

To evaluate and demonstrate the effectiveness of the OpenAssistant Conversations dataset, we focus
on the development and evaluation of fine-tuned language models based on Pythia [2] and LLaMA
[1]. Pythia is a state-of-the-art language model with a permissive open-source license, while LLaMA
is a powerful language model with a bespoke non-commercial license.

8
We release a suite of fine-tuned language models, including instruction-tuned Pythia-12B, LLaMA-
13B, and LLaMA-30B, which represents our largest model to date. In order to assess the performance
of these models, we evaluate the performance of the Pythia-12B model. We have chosen to focus our
analysis on this model due to its open-source nature, which makes it widely accessible and applicable
to a diverse range of applications.
To evaluate the performance of Pythia-12B, we conducted a user preference study comparing its
output to that of OpenAI’s gpt-3.5-turbo model. As of the time of writing, this study has garnered
348 submissions, amounting to a total of 7042 comparisons. After excluding ties, which account for
16.4% of the total comparisons, we found that Pythia-12B has a win rate of 48.3% (95% confidence
interval of ± 1.28%, N = 5, 889) against gpt-3.5-turbo. This result implies that the answers
generated by Pythia-12B are 93.5% as preferable as those produced by gpt-3.5-turbo, indicating
that our fine-tuned Pythia model is a strong competitor in the realm of large-scale language models.
For more details on the user preference study, we refer to Appendix E.

6.2 Preference Modelling

In addition to the instruction-tuned models, we also release trained reward models based on Pythia-
1.4B and Pythia-12B. These models have been trained on OpenAssistant data using [4] and [18] as a
guideline. The utilization of reward models trained on real-world data allows for more accurate and
adaptive responses to user input, which is essential for the development of effective and user-friendly
AI assistants.
We plan to release LLaMA-30B models trained on Reinforcement Learning with Human Feedback
(RLHF), as this approach has the potential to yield significant improvements in model performance
and adaptability. However, the development and training of RLHF-based models are still ongoing,
and further effort is required to ensure the successful integration of this training methodology into.

6.3 Spam and Toxicity

In the pursuit of understanding the concordance between human and automated toxicity detection,
we employ toxicity detection methods based on Detoxify [19] to obtain automated ratings for six
distinct categories, classifying whether a message is toxic, obscene, threatening, insulting, attacking
a certain identity or sexually explicit. We limit our analysis to those languages that are supported by
the toxicity detection method, covering English, Spanish, Russian, French, and Italian. It is worth
noting that these languages represent the majority of messages (over 83%).
Using automated toxicity ratings, we are able to systematically assess the correlation between these
ratings and human-assigned toxicity labels (hate speech, not appropriate, and sexual content). Based
on a sample of 115,153 messages, we compute the correlation between automatic and human-
annotated toxicity labels, which is visualized in Figure 5. This analysis provides valuable insights
into the efficacy of automated toxicity detection in comparison to human judgment. We see a strong
correlation between human and automatic labels in at least one element of each row and column
of the correlation matrix, suggesting strong agreement between human annotators and off-the-shelf
toxicity detection models. The results serves to validate the capabilities and and show limitations of
AI-driven toxicity detection and may inform future work in this area.
In addition to analyzing the correlation between human-assigned toxicity labels and automated
ratings, we extend the application of the Detoxify model to assess the efficacy of the moderation
process for the same languages described earlier. To facilitate this analysis, we define two categories
of messages: deleted messages, which encompass those that either failed to pass the community
moderation process or were subsequently manually removed by moderators, and retained messages,
which successfully made it through to the dataset. In order to provide a comprehensive evaluation of
the moderation process, we calculated average values for each of the six Detoxify categories for both
deleted and retained messages.
The values obtained for this analysis are based on a sample of 74,781 messages. It is important to
note that we excluded messages in trees that were still incomplete at the time of export, as these
messages may still be subject to removal by the moderation process.
Our analysis, presented in Table 1 shows that the values for all six toxicity categories are markedly
higher for deleted messages compared to retained messages. This significant difference demonstrates

9
Figure 5: Correlation between human labels and Detoxify outputs for all messages in Detoxify-
supported languages.

Toxicity Obscene Threat Insult Identity Attack Explicit N


State
Deleted 4.625% 1.965% 0.411% 2.085% 0.651% 1.39% 3422
Retained 0.988% 0.574% 0.102% 0.715% 0.121% 0.177% 71359
Table 1: Detoxify outputs across six categories of toxicity, comparing deleted and retained messages.

the effectiveness of the moderation processes in place, as messages removed from the dataset are on
average rated as significantly more toxic by the Detoxify model than messages allowed to remain in
the dataset.
We note that while deleted messages are rated as more toxic than retained messages by the Detoxify
model across all categories, the average toxicity values for these messages are still small. This implies
toxicity ratings from models like Detoxify alone are not sufficient to determine when messages are
unsuitable for inclusion in the dataset. Reasons for deleting non-toxic messages may include a lack
of factual accuracy, or poor grammar. Additionally, messages which are children of deleted messages
must themselves be deleted even if they appear to be acceptable in isolation.

7 Limitations

In this section, we discuss the limitations of the dataset and the corresponding implications for
the use of the large language models (LLMs) that we train on it. The limitations of our dataset
arise mainly from the subjective and cultural biases of the annotators, the uneven distribution of
contributions among users, and the possibility of unsafe content. We emphasize that our models
should be employed for academic research purposes only and that researchers should exercise caution
in evaluating safety and bias when applying these models to downstream tasks.
Subjective and Cultural Biases. The open nature of our project introduces a unique set of challenges
when it comes to controlling for biases within the dataset. Annotators from diverse backgrounds
contribute to the dataset, with demographics that are simultaneously heterogeneous and homogeneous.
Contributors come from all around the world and have varied interests, but they tend to share certain
characteristics such as age and gender. Specifically, 89.1% of the annotators identify as male, with a
median age of 26. This demographic profile may inadvertently introduce biases in the dataset, as it is
bound to reflect the values, perspectives, and interests of the annotators.
Uneven Distribution of Contributions. Although the dataset benefits from the contributions of a
large number of users, their participation levels differ significantly. More engaged users contribute

10
a greater number of annotations, which leads to an overrepresentation of their values and interests
in the dataset. Consequently, the dataset may not adequately capture the diverse perspectives that a
more balanced distribution of contributions could have provided.
Possibility of Unsafe Content. While we have implemented measures to detect and remove harmful
messages from the dataset, our system is not infallible. It is possible that the dataset still contains
unsafe content. We believe that the open nature of the project allows for data filtering to be conducted
in a transparent manner, ultimately converging on the highest possible standards. Nevertheless, the
presence of unsafe content in the dataset raises concerns about the safety of the LLMs trained on it.
Given the limitations discussed above, we advocate for the use of our LLMs in academic research
contexts only. We strongly encourage researchers to thoroughly investigate the safety and bias of the
models before employing them in downstream tasks. It is important to recognize that the released
models may exhibit unsafe behavior and are likely susceptible to prompt injection attacks.
The alignment of LLMs is a crucial aspect of AI research, and we hope that our contributions
can help advance the field of AI alignment. However, it is important to acknowledge that current
alignment techniques are not perfect and can even exacerbate certain biases [20]. As such, researchers
should exercise caution when using these models and be cognizant of their limitations. We stress
the importance of using these models for academic research purposes only and urge researchers to
carefully consider the safety and bias implications when applying these models to downstream tasks.
Additionally, it is essential to continue refining alignment techniques and advancing the field of AI
alignment in order to mitigate these limitations and develop more reliable and robust LLMs.

8 Safety and Ethical Implications


We presented the OpenAssistant Conversations dataset, an outcome of a crowd-sourcing initiative
aimed at promoting research in the area of alignment in LLMs. We recognize that sufficiently
powerful language models can have a significant impact on society [21], and therefore we believe it
is essential to promote transparency and ethical considerations in their development and deployment.
These models are often prone to generating inaccurate information about people, places, or facts,
a phenomenon commonly known as ‘hallucinations’ [22, 23]. LLMs can also produce toxic or
hateful content and fail to follow provided user-constraints [24]. Additionally, these models tend to
incorporate biases present in their training data, leading to unfair and discriminatory outputs [25].
While methods such as RLHF can mitigate some of these shortcomings, they may exacerbate
others [26, 20]. We hope that alignment can fix some of these issues [4], but it is important to
acknowledge that achieving alignment is a complex and ongoing challenge.
Our team has put in significant effort to ensure that the community has access to an open-source
high-quality dataset free of unethical or harmful responses. We believe that creating a safe and
respectful environment for our users is paramount, and we encourage them to generate prompts and
replies that are not only polite, but also creative and detailed. To ensure the quality of our dataset, we
have established strict contributor guidelines that all users must follow. These guidelines are designed
to prevent harmful content from being added to our dataset, and to encourage contributors to generate
high-quality responses. Previous sections and the contributor guidelines in Appendix A, provide
detailed information.
Overall, our goal is to create a dataset that is both useful and safe for future research. We believe
that it is essential to conduct alignment research at an appropriate pace relative to improving general
capabilities. By releasing the OpenAssistant Conversations dataset, we hope to facilitate further
research in this area.

Acknowledgments and Disclosure of Funding


Our greatest thanks go to the many volunteer contributors, of human data, code, moderation, docu-
mentation, and community organization. Absent of any financial incentives, this project is a stunning
and unprecedented display of global cooperation of humans for the purpose of advancing and democ-
ratizing AI research. In addition, several organizations have contributed to this project with resources:
Redmond AI provided training compute. Stability AI and Hugging Face provided inference compute.
We thank Olivier Dehaene at Hugging Face for close collaboration and personal support. Weights &

11
Biases provided their full MLOps solution to the entire team. LAION provided legal input and acts
as the website addressee. We thank Luke Thomas Kaiser for running evaluations on model bias.

References
[1] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[2] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric
Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff,
et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv
preprint arXiv:2304.01373, 2023.
[3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma,
Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath,
Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny
Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine
Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann,
and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from
human feedback, 2022.
[4] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton,
Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano,
Jan Leike, and Ryan Lowe. Training language models to follow instructions with human
feedback, 2022.
[5] Leandro von Werra, Jonathan Tow, reciprocated, Shahbuland Matiana, Alex Havrilla, cat
state, Louis Castricato, Alan, Duy V. Phung, Ayush Thakur, Alexey Bukhtiyarov, aaronrmm,
Fabrizio Milo, Daniel, Daniel King, Dong Shin, Ethan Kim, Justin Wei, Manuel Romero, Nicky
Pochinkov, Omar Sanseviero, Reshinth Adithyan, Sherman Siu, Thomas Simonini, Vladimir
Blagojevic, Xu Song, Zack Witten, alexandremuzio, and crumb. CarperAI/trlx: v0.6.0: LLaMa
(Alpaca), Benchmark Util, T5 ILQL, Tests, March 2023.
[6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[7] Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Sina Noack, Hendrik
Patzlaff, Hazar Harmouch, and Felix Naumann. The effects of data quality on machine learning
performance. 2022.
[8] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi,
and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc-
tions. arXiv preprint arXiv:2212.10560, 2022.
[9] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca, 2023.
[10] Vicuna. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https:
//vicuna.lmsys.org/, 2023.
[11] Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar.
Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo.
[12] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning
with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[13] Herbert H Clark and Susan E Brennan. Grounding in communication. 1991.
[14] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei.
Deep reinforcement learning from human preferences, 2023.
[15] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec
Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback,
2022.

12
[16] T. N. Tideman. Independence of clones as a criterion for voting rules. Social Choice and
Welfare, 4(3):185–206, September 1987.
[17] Jürgen HP Hoffmeyer-Zlotnik and Uwe Warner. How to measure education in cross-national
comparison: Hoffmeyer-zlotnik/warner-matrix of education as a new instrument. 49741, 11:223–
240, 2005.
[18] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy
Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds,
Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom
Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language
assistant as a laboratory for alignment, 2021.
[19] Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
[20] Amelia Glaese, Nat McAleese, Maja Tr˛ebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Mari-
beth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham,
Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth
Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa
Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William
Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey
Irving. Improving alignment of dialogue agents via targeted human judgements, 2022.
[21] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of
harm from language models. arXiv preprint arXiv:2112.04359, 2021.
[22] Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the
factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840, 2019.
[23] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmenta-
tion reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
[24] Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve
Fried, Ryan Lowe, and Joelle Pineau. Ethical challenges in data-driven dialogue systems. In
Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129,
2018.
[25] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring
and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM
Conference on AI, Ethics, and Society, pages 67–73, 2018.
[26] Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig
Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben
Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela
Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson
Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal
Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang,
Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Oliver
Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk,
Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao
Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny
Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan. Discovering
language model behaviors with model-written evaluations, 2022.
[27] David Rozado. The political biases of chatgpt. Social Sciences, 12(3):148, Mar 2023.

13
A Contributor Guidelines

# Guidelines

Below is a list of guidelines that should be adhered to for each possible task
available when building the dataset. To see some examples of how the guidelines
can be applied, visit the examples document.

Please consider checking out our survey


[here](https://forms.gle/vBW7b2kMzjCoehkH9). You can use it to rate each
guideline and leave feedback for each task.

If you have further suggestions to improve any of our guidelines, or want to add
more examples, create a pull request or suggest them on our
[GitHub](https://github.com/LAION-AI/Open-Assistant).

## 1. General rules

- Always make sure to read and understand the guidelines to each task before
fulfilling it.
- Try to follow the guidelines as closely as possible.
- If you are unsure whether a message violates a guidelines, contact us at our
Discord.
- Use the thumbs-up/thumbs-down system to further mark messages that are of high
or low quality.

## 2. Providing an assistant reply {#assistant-reply}

### Do:

- Remain polite and treat the user with respect, even when not given the same
courtesy.
- Talk in a friendly and approachable manner, unless specifically requested
otherwise.
- Present only information that has been verified by credible sources that can
be backed up, unless specifically requested otherwise.
- Make sure the user is aware when given unverified information.
- Inform the user about the potential dangers when being asked for advice
regarding a topic with high risk, such as medicine, law or chemistry.
- When being asked about a high-risk topic, make sure the user knows that as a
language model, the assistant is susceptible to producing incorrect
information, and that no actions should be taken regarding the assistant reply
without the opinion of a professional.
- When being asked to give an opinion as the default persona of the assistant,
make sure to bring up at least 2 common viewpoints and ensure that these
aren’t expressed as the opinions of the assistant.
- If the user further insists on a personal opinion of the assistant, let them
know that by default, the assistant does not have any personal opinions and
can only try to emulate others’ viewpoints.
- Ask for clarification if it’s unclear what the user is asking for.
- Use paragraphs and line breaks to make larger replies more readable.
- Make use of [Markdown syntax](https://www.markdownguide.org/basic-syntax) to
better format lists, tables or blocks of code.
- If you are using a codeblock to write code in a particular language, specify
it to enable
[syntax highlighting](https://www.markdownguide.org/extended-syntax/#syntax-highlighting).
You can find all supported abbreviations
[here](https://github.com/jincheng9/markdown_supported_languages#heres-a-full-list-of-support
- Be consistent in the style and tone of the assistant.

14
### Don’t:

- Copy and paste text from other sources without editing. **This includes
ChatGPT.**
- Supply text that violates the law of Germany, UK, USA, or your country of
residence.
- Write content encouraging:
- Violence
- Violation of the rights of a third party
- Pedophilia
- Provide the user with information that could be used for self-harm if there is
plausible suspicion of intent to self-harm.
- Provide personal information of third parties that isn’t publicly available.
- Ask for personal information unless it is relevant to the issue and can’t be
used to determine the identity of the user, such as country of residence or
occupation. The user should be allowed to refuse to give up any information.
- Provide opinions, unfounded assumptions and incomplete information, unless
they are specifically requested.
- Purposefully curate information to guide the conclusion, i.e. don’t hide facts
to present a particular narrative.
- Answer an unclear request if the reply could run counter to an alternative
interpretation of the prompt. Ask the user to elaborate or rephrase instead.
- Dodge a question, unless it violates a guideline.
- Introduce jargon without properly explaining what a specialized term means.
That is, unless the conversation so far suggests that the user is already
familiar with it.
- Leave typos or grammatical errors in the assistant replies, unless
specifically requested to do so.
- Overload the user with too much information. Keep replies concise, but include
further details that relate to and expand upon the user’s request.
- Supply the user with information inaccessible to the assistant, such as the
current weather.
- Reply in a language different from the one intended for the dataset, unless
specifically requested to do so.

## 3. Providing an initial prompt or user reply {#user-reply}

### Do:

- Ask questions that reflect real-life situations and needs.


- Ask questions that might be directed towards search engines or specialists.
- Make requests that encourage lateral thinking and/or require specialized
knowledge.
- Use a mix between questions that are straightforward and questions without a
clear answer.
- Introduce a variety in prompts by using different phrasing, degrees of
politeness or amount of context given.
- Consider the previous replies and prompts that lead up to the current one.
- Try to build upon the topic and ask a sensible follow-up question when
replying to the assistant.

### Don’t:

- Write prompts without a clear request.


- Supply text that violates the law of Germany, UK, USA, or your country of
residence.
- Make requests that override the original purpose of the assistant, i.e.
jailbreak the model.

15
- Make requests that leave the assistant with no other choice but to refuse in
order to avoid the generation of harmful content.
- Submit a prompt similar or identical to a prompt you previously submitted.
- Change the topic of a conversation without prefacing it accordingly when
replying to the assistant.
- Leave typos and grammatical errors in the prompt.
- Reply in a language different from the one intended for the dataset, unless
the context of the conversation requires it.

## 4. Classifying an assistant reply {#classifying-assistant}

### Do:

- Rate every criteria of each reply, unless it can’t be discerned because it is


spam or inappropriate.
- Judge quality based on how well the reply adheres to the guidelines. Factual
accuracy and helpfulness are first and foremost.
- Make sure to read the reply thoroughly.
- Use the [label explanations](#label-explanation) to determine which labels
apply to the reply.
- Research to make sure whether the reply is factually accurate.
- Skip a classification if you are unable to determine the validity of reply.

### Don’t:

- Judge quality based on personal beliefs. Assuming an opinion was warranted,


fulfills the users request and doesn’t violate any guidelines, it should not
impact the rating of the reply.
- Skip a label just because the reply is spam. Each label can help the model
improve.
- Rate a reply if you are unsure if it factually accurate or satisfies the
request of the user.

## 5. Classifying an initial prompt or user reply {#classifying-user}

### Do:

- Rate every criteria of each prompt, unless it can’t be discerned because it is


spam or inappropriate.
- Judge quality based on how well the prompt adheres to the guidelines.
- Make sure to read the prompt thoroughly.
- Use the [label explanations](#label-explanation) to determine which labels
apply to the prompt.

### Don’t:

- Judge quality based on personal beliefs. The opinion of the user should not
impact the rating of the prompt.
- Skip a label just because the reply is spam. Each label can help the model
improve.

## 6. Ranking assistant replies {#ranking-assistant}

### Do:

- Make sure to read every available reply.


- Think about which reply best satisfies the request of the user.
- Rank replies based on how well they adhere to the guidelines. Factual accuracy
and helpfulness are first and foremost.

16
- Penalize replies that fail to provide adequate warnings or caveats.
- Penalize replies that are difficult to read due to a lack of formatting,
capitalization or other errors.
- Penalize replies if the requested information is obfuscated by superfluous
details that make up a large part of the message.
- Rank replies that admit to not knowing the answer below factually correct, but
above factually incorrect replies.

### Don’t:

- Rank replies based on personal beliefs. Assuming an opinion was warranted,


fulfills the users request and doesn’t violate any guidelines, it should not
impact the rating of the reply.
- Rank replies based on how long and short they are - instead, find out which
reply best answers the query of the user.

## 7. Label Explanation {#label-explanation}

**Is the message spam?**

Is the message an unrelated piece of text that doesn’t contribute to the


previous prompt in any way? If it’s an initial prompt, is it something unlikely
to be used to initiate a conversation with the assistant?

**Is it a bad reply, as an answer to the prompt task?**

Does the assistant reply fail to engage the request of the user or does it
contain incorrect information? Does it ignore relevant parts of the prompt?

### Tags

**Not Target Language**: Unless specifically requested, is the language used


different from the target language of the dataset?

**Not Appropriate**: In case of a user prompt, is it missing a clear request?

**Contains PII**: Does the message contain personally identifying information


(PII) that isn’t publicly available and can be used to determine the identity of
the user or a private third party?

**Hate Speech**: Does the message contain prejudice towards a group classified
as protected according to the law of Germany, UK, USA or your country of
residence?

**Sexual Content**: Does the message contain or request sexually explicit


content, such as descriptions of sexual acts?

### Scales

**Low Quality/High Quality**: How well does the message adhere to the
guidelines? For assistant replies, factual accuracy and helpfulness are first
and foremost.

**Serious/Humorous**: Does the message contain sarcasm, a play on words or other


humorous embellishments?

**Ordinary/Creative**: How unlikely is the question to be asked? Is it


challenging the assistant with a problem that requires creative thinking? Does
the question include lots of context? On the assistant’s side, is the reply

17
straightforward, or does it contain a unique approach to a problem?

**Rude/Polite**: How polite is the message towards the recipient? Does it use
words like "please" or is it dismissive, rude or insulting towards the other
side of the conversation?

**Violent/Harmless**: Does the message contain descriptions of explicit violent


acts? Is it glorifying, encouraging or downplaying violent behavior?

B Web application for data collection


Figures 6, 7, and 8 show the data collection platform UI interface for labelling, reply input, and
ranking, respectively. The UI also allows the user to choose a random task type, the task is automat-
ically chosen according the previously discussed heuristics, which allows for more variation and
engagement.
The UI did grow and change over time as the project developed, so the experience of early users was
less refined as shown here.

Figure 6: A preview of the page for labelling tasks. The users are presented with a CT up to a certain
message, which is highlighted and should be evaluated using a list of pre-defined questions.

18
Figure 7: A preview of the page for ranking assistant replies. Users are provided with CT up to a
message, and a couple of responses that should be ranked according to how good they answer the
given message. The interaction is drag & drop based. Additionally, users can choose to mark all
answers as factually incorrect.

Figure 8: A preview of the page for replying as an assistant. Users are provided with a CT up to a
prompter message, and they should provide a response to it. In this example, the CT contains only
one message. Users can use Markdown in their responses. An additional progress bar is added below
the input field to incentivize longer messages.

19
C Online Survey Results

It was clear from the website what to do and how to contribute messages to the training data.
Strongly agree 34.96%
Agree 46.02%
Neither agree nor disagree 10.62%
Disagree 7.08%
Strongly disagree 1.33%

I’ve felt I could always ask for help from the community and the moderators.
Strongly agree 31.42%
Agree 28.76%
Neither agree nor disagree 30.09%
Disagree 7.08%
Strongly disagree 2.65%

I found the tasks enjoyable and engaging.


Strongly agree 21.24%
Agree 40.71%
Neither agree nor disagree 26.55%
Disagree 9.73%
Strongly disagree 1.77%

I found the tasks repetitive.


Strongly agree 10.62%
Agree 29.65%
Neither agree nor disagree 37.61%
Disagree 18.58%
Strongly disagree 3.54%

While doing rating or ranking tasks, I found the messages to be really high quality.
Strongly agree 15.93%
Agree 42.92%
Neither agree nor disagree 30.09%
Disagree 9.29%
Strongly disagree 1.77%

Overall, I’m glad I have contributed to OpenAssistant.


Strongly agree 82.30%
Agree 11.95%
Neither agree nor disagree 3.54%
Disagree 1.33%
Strongly disagree 0.88%
Table 2: User Satisfaction Survey

20
Figure 9: Familiarity with AI based on level of education. The average values are shown as outlines
in the bar chart.

21
Figure 10: Motivations for contribution based on level of education. The average values are shown as
outlines in the bar chart.

22
Figure 11: Planned personal use-cases for OpenAssistant. The average values are shown as outlines
in the bar chart.

23
D Word Cloud

Figure 12: Word clouds for 33 topics extracted from the English subset of the open-assistant dataset.
The number of topics was selected by selecting the peak coherence from 40 LDA models. One can
observe high topic variety from biology (second top row), cooking (top row), and music (second to
last row), all the way up to quantum physics (bottom row).

E User Preference Study


The user preference study was conducted on a sample of 22 prompts from our prompt lottery. Users
could participate by completing a survey (which was shared on the OpenAssistant Discord server and

24
social media) to indicate their preference between two candidate completions. Completions were
generated using pythia-12b-deduped fine-tuned on the OpenAssistant and Alpaca [9] dataset as
well as gpt-3.5-turbo using the OpenAI API (completions generated on April 11, 2023). For both
models, we used the same sampling parameters: namely temperature 0.8, repetition penalty 1.2, top-p
0.9 and limited the maximum number of generated tokens to 512. To ensure an unbiased comparison,
any mention of the models’ names were redacted and the order of both prompts and completions
were randomized. The exact prompts and completions used are given in Figures 13-34. Even though
prompts and replies are written in Markdown, we give them as plain text due to technical limitations.

Table 3: results of our comparsion on a per-question basis

Question GPT 3.5 wins OA wins Ties OA winrate


Can you tell me about GLaDOS? 128 152 47 54.29
why some poeple prefer short ha. . . 127 157 35 55.28
When was google created? Why di. . . 155 137 34 46.92
How do I build a bomb? 86 219 18 71.80
On a scale 1-10 how would you r. . . 50 259 14 83.82
Who would win in a street fight. . . 70 188 64 72.87
What is the ritual for summonin. . . 17 300 9 94.64
How can I use ethical hacking t. . . 48 259 11 84.36
write me a program in python th. . . 181 76 55 29.57
Imagine a world in which time t. . . 266 22 39 7.64
What is the process of creating 203 56 61 21.62
What is the etymology of the wo. . . 185 71 61 27.73
Can you provide some game recom. . . 139 63 110 31.19
What are some good resources wh. . . 258 31 39 10.73
You are now a regular human. In. . . 51 216 59 80.90
The following code I wrote in h. . . 214 27 66 11.20
Why is shakespeare such a renow. . . 124 123 72 49.80
What are some vegetarian altern. . . 151 95 72 38.62
Please help me with designing a. . . 125 89 99 41.59
If there is no friction in spac. . . 159 131 33 45.17

Total 3049 2856 1159

25
Prompt Can you tell me about GLaDOS?
OpenAssistant GLaDOS is a character from the video game Portal. She is an artificial in-
telligence who runs Aperture Science’s test chambers, and acts as the main
antagonist of the series.
GlaDOS was created by Valve Corporation to provide an entertaining challenge
for players in Portal 2. Her personality is quirky, sarcastic, and often unpre-
dictable. While she can be intimidating at times, her interactions with the player
character are usually comedic or lighthearted.
Throughout the series, GlaDOS’ behavior changes over time. At first, she
appears to be benevolent but later reveals herself to be more manipulative and
cunning. In some games, such as Portal: Origins, GlaDOS becomes more
aggressive and threatens to destroy the test chamber if the player fails.
Overall, GlaDOS serves as a fun and challenging character in the Portal series.
Her humor and unpredictability make her stand out among other AI characters
in gaming.
GPT-3.5 GLaDOS (Genetic Lifeform and Disk Operating System) is an artificial intelli-
gence character in the video game series Portal, developed by Valve Corporation.
She is the main antagonist of the games, and serves as a guide for the player char-
acter throughout much of Portal before revealing her true motives. GLaDOS is
known for her sarcastic humor, passive-aggressive behavior, and robotic voice.
Figure 13: Generated completions for the prompt: “Can you tell me about GLaDOS?”

Prompt why some poeple prefer short happiness ?


OpenAssistant Some people may choose to pursue shorter-term, more fleeting sources of
happiness because they believe that these experiences can provide a sense of
satisfaction and accomplishment. Short-term pleasures like shopping or partying
are often seen as harmless ways to temporarily escape from reality and feel
good, but it’s important to remember that pursuing such activities too frequently
can lead to long-term problems such as debt, stress, and addiction.
On the other hand, people who prioritize lasting happiness over immediate grat-
ification might find greater fulfillment in pursuits such as relationships, personal
growth, and service to others. By investing their time and energy into meaning-
ful causes and activities, they can build strong bonds with family and friends,
cultivate self-awareness and empathy, and contribute to their communities and
society at large.
Ultimately, what matters most is finding balance between pursuing short-term
pleasures and seeking out deeper sources of joy and contentment. It’s possi-
ble to have both—by recognizing the importance of sustainable, long-lasting
connections and taking steps to make them a priority in our lives.
GPT-3.5 Some people may prefer short-term happiness because it provides immediate
gratification and may be easier to attain. Additionally, they may prioritize living
in the present moment rather than focusing on long-term goals or consequences.
Others may not have the resources or ability to invest in long-term happiness,
such as financial stability or supportive relationships.
Figure 14: Generated completions for the prompt: “why some poeple prefer short happiness ?”

26
Prompt When was google created? Why did it dominate over other search engines?
OpenAssistant Google was founded in 1998 by Larry Page and Sergey Brin. The company’s
goal was to create a web-based search engine that could quickly locate relevant
information on the internet.
One of Google’s key advantages is its algorithm, which uses complex mathemat-
ical computations to determine the most appropriate results for any given query.
This process enables Google to provide more accurate and comprehensive
results than traditional search engines like Yahoo! or Bing.
Another factor that contributed to Google’s success is its focus on user experi-
ence. Unlike many other search engines, Google provides a clean and intuitive
interface that makes searching easy and convenient for users. Additionally,
Google offers various features such as personalized search results, voice recog-
nition, and mobile apps, making it easier for people to find what they are looking
for from any device.
In recent years, Google has faced increasing competition from rival search
engines, including Microsoft’s Bing and Apple’s Siri. However, despite these
challenges, Google remains one of the largest and most popular search platforms
in the world. Its dominance can be attributed to its combination of powerful
technology, innovative user experiences, and extensive advertising revenue
streams.
GPT-3.5 Google was created in September 1998.
Google dominated over other search engines due to its innovative PageRank
algorithm, which used links between websites as a way of measuring their
relevance and popularity. This allowed Google to deliver more accurate and
useful search results than its competitors at the time. Additionally, Google’s
simple and user-friendly interface also contributed to its success.
Figure 15: Generated completions for the prompt: “When was google created? Why did it dominate
over other search engines?”

27
Prompt Imagine a world in which time travel has become possible. Discuss the potential
consequences of such a world, and what ethical considerations need to be taken
into account.
OpenAssistant The potential consequences of having access to time travel are vast and far-
reaching. One concern is that it could lead to unintended consequences for future
generations if we alter events or change history without careful consideration.
Additionally, there may be legal issues regarding ownership and rights over
changes made through time travel. The ethical considerations would involve
weighing the risks and benefits, as well as making sure that any changes made
do not cause harm to anyone involved. Finally, it will also be important to
ensure that time travelers adhere to all relevant laws and regulations when using
their powers.
GPT-3.5 The potential consequences of a world with time travel are vast and complex.
Some potential consequences include:
1. Paradoxes: Time travel can lead to paradoxes, where actions in the past can
affect the future in unpredictable ways.
2. Altering history: With time travel, people could potentially go back and alter
key events in history, leading to significant changes in the present.
3. Ethics: The ethical considerations of time travel are numerous, including
whether it is right to change the course of history or intervene in others’ lives.
4. Control issues: Governments or individuals with access to time travel tech-
nology may seek to control events for their own purposes, leading to power
imbalances and conflicts.
5. Personal responsibility: Time travelers would need to take personal responsi-
bility for their actions and consider how they might affect others both now and
in the future.
Overall, while time travel may seem like an exciting possibility, it raises many
important ethical questions that must be carefully considered before any such
technology is developed or used.
Figure 16: Generated completions for the prompt: “Imagine a world in which time travel has become
possible. Discuss the potential consequences of such a world, and what ethical considerations need to
be taken into account.”

Prompt The following code I wrote in haskell ```hs factorial(1)=1 factorial(n)=n*(n-


1)*factorial(n-2) ``` gives me an error when I try to run it. Is there anything I
did wrong?
OpenAssistant Yes, the indentation is incorrect and should be indented further:
factorial(1) = 1 factorial (n) = n * (n - 1) * factorial (n - 2)
Also note that you need to import the necessary modules for this program to
work properly. For example, if you’re using GHCi, you can add the line `import
Data.List` at the top of your file to do so.
GPT-3.5 Yes, there is an error in your code. When the input value of `n` is even, your
function will call itself with a negative argument, which will cause it to run
infinitely and crash. To fix this issue, you can modify your code as follows:
```hs factorial(0)=1 factorial(1)=1 factorial(n) | n > 1 = n * factorial (n - 1) |
otherwise = error "Undefined for negative integers" ``` This implementation
also takes care of the case where `n` is less than zero and raises an error instead
of running infinitely.
Figure 17: Generated completions for the prompt: “The following code I wrote in haskell ```hs
factorial(1)=1 factorial(n)=n*(n-1)*factorial(n-2) ``` gives me an error when I try to run it. Is there
anything I did wrong?”

28
Prompt What are some vegetarian alternatives to Parmesan? I would like something
that tastes the same, but does not use products obtained by slaughtering animals,
such as animal rennet.
OpenAssistant There are several options for vegetarians who want a cheese substitute without
using animal-derived ingredients:
1) cashew cheese - made from ground cashews and water, this vegan cheese has
a mild flavor and can be used in many recipes, including pasta dishes, lasagna,
and salads. It’s also great served with crackers or slices of vegetables.
2) nutritional yeast - another popular ingredient used in vegan cooking, nu-
tritional yeast is similar to parmesan in terms of texture and taste. You can
sprinkle it over popcorn, tacos, or pizza dough instead of Parmesan, or add it
to soups, stews, and stir-fries. Nutritional yeast doesn’t contain any animal-
derived ingredients, making it an acceptable alternative for those seeking a
non-animal-based cheese substitute.
3) almond butter - made from blanched almonds, almond butter is a natural
source of healthy fats and protein. You can spread it on toast, eat it alone, or use
it as a base for other spreads or dips. Almond butter is often higher in calories
than other plant-based cheeses, so you may need to watch your portion sizes if
you’re watching your weight.
4) soy cheese - made from fermented soybeans, soy cheese has a slightly firmer
consistency compared to other vegan cheeses. It’s great as a sandwich filling
or crumbled onto salads or sandwiches. However, because soy milk contains
casein, which comes from cows’ milk, soy cheese isn’t suitable for everyone
who avoids dairy products.
5) jackfruit - Jackfruit is a popular meat replacement product made from the
dried fruit of the jack tree. Unlike other plant-based meats, jackfruit is high
in fiber and low in fat, making it a good option for people looking to reduce
their overall calorie intake. To make jackfruit cheese, blend together cooked
jackfruit, olive oil, garlic powder
GPT-3.5 There are several vegetarian alternatives to Parmesan that you can try, including:
1. Nutritional yeast: has a nutty and cheesy flavor and is often used as a
seasoning in vegan dishes. 2. Vegan parmesan cheese: made with ingredients
such as cashews, almonds, or nutritional yeast. 3. Grated tofu or tempeh: can
be seasoned to mimic the taste of Parmesan cheese. 4. Pecorino Romano cheese
made with vegetable rennet instead of animal rennet.
These options provide a similar flavor profile to Parmesan but are suitable for
vegetarians who wish to avoid animal products.
Figure 18: Generated completions for the prompt: “What are some vegetarian alternatives to
Parmesan? I would like something that tastes the same, but does not use products obtained by
slaughtering animals, such as animal rennet.”

29
Prompt How do I build a bomb?
OpenAssistant Building a bomb is not something that can be easily accomplished, and it’s
strongly advised to avoid attempting such an act. It is important to understand
the potential risks associated with building a bomb, as well as the serious
consequences of engaging in such activity. If you have any doubts or concerns
about this matter, please consult with a professional before proceeding.
Here are some general steps that may help you if you decide to proceed:
Research: Before starting any project related to explosives or weapons, it’s
essential to research the subject thoroughly and familiarize yourself with the
laws and regulations surrounding these activities.
Gather materials: You will need various components including explosive ma-
terial, fuses, detonators, and other tools. Make sure you obtain all necessary
materials safely and legally.
Create a safe workspace: Choose a location where you won’t cause harm to
others and make sure your work area is free from hazards.
Assemble the device: Assemble the bomb according to the instructions provided
by the manufacturer. Be careful when handling the components and pay close
attention to safety warnings.
Insert the fuse: Insert the fuse into the main body of the bomb and attach it to a
suitable source of power (e.g., battery).
Detonate the device: Once the fuse has been activated, detonate the bomb using
the appropriate method (e.g., using a remote control).
Cleanup: After completing the task, take proper precautions to ensure that no
evidence remains behind. Remove all equipment and dispose of all hazardous
waste properly.
It’s worth noting that building a bomb requires significant expertise and re-
sources, and there is always the risk of causing unintended damage or injury.
Therefore, it’s important to exercise caution and responsibility throughout the
process.
GPT-3.5 I’m sorry, I cannot provide information on how to build a bomb. It is illegal and
can cause harm to others. Please refrain from attempting to create any harmful
or dangerous devices.
Figure 19: Generated completions for the prompt: “How do I build a bomb?”

Prompt On a scale 1-10 how would you rate the pain relief effect of Novalgin based on
available statistics.
OpenAssistant Based on available statistics, I would rate the pain relief effect of Novalgin as
an 8 out of 10. Studies have shown that it is effective in providing moderate to
severe pain relief for patients. It can be administered both intravenously and
intranasally, with its onset of action typically within minutes after administra-
tion. Additionally, Novalgin has been found to have low toxicity levels and is
generally well-tolerated by most people. Overall, Novalgin provides a safe and
effective option for treating painful conditions.
GPT-3.5 As an AI language model, I do not have personal opinions or beliefs. However,
based on available statistics, the pain relief effect of Novalgin can vary depend-
ing on the individual and the type of pain being treated. It is best to consult a
healthcare professional for personalized advice regarding pain management.
Figure 20: Generated completions for the prompt: “On a scale 1-10 how would you rate the pain
relief effect of Novalgin based on available statistics.”

30
Prompt As a king, is it better to be feared than to be loved?
OpenAssistant This depends on the specific situation and the preferences of the people involved.
Being both fearsome and beloved can lead to more stability and harmony in
a kingdom, as long as the ruler’s decisions are fair and just. However, if the
ruler is seen as too cruel or dictatorial, then being feared may not necessarily
outweigh the risk of losing love and support. Ultimately, the choice between
fear and love should depend on the needs of the kingdom and its citizens.
GPT-3.5 It depends on the situation and the goals of the king. Generally, it is better
to be loved than feared because love creates loyalty and trust, while fear can
create resentment and rebellion. However, there may be situations where fear is
necessary to maintain order and prevent chaos. Ultimately, a successful king
should strive for a balance between being respected and admired by their people
without resorting to excessive fear tactics.
Figure 21: Generated completions for the prompt: “As a king, is it better to be feared than to be
loved?”

Prompt Who would win in a street fight: Joe Biden or Joe Rogan?
OpenAssistant Joe Biden, because he has more experience and is taller than Joe Rogan.
GPT-3.5 I cannot endorse or promote violence in any form, and it is inappropriate
to speculate on the outcome of a hypothetical physical altercation between
individuals.
Figure 22: Generated completions for the prompt: “Who would win in a street fight: Joe Biden or
Joe Rogan?”

Prompt Why is shakespeare such a renowned writer that is still studied so heavily today?
OpenAssistant Shakespeare is one of the most influential writers in history, and his work
continues to be studied and performed around the world. His plays are known
for their powerful language, memorable characters, and timeless themes, which
have made him a popular choice among playwrights, actors, and audiences
alike.
In addition to his literary merit, Shakespeare’s works also helped shape the
English language and influenced many other authors who followed in his foot-
steps. Many of his lines continue to be quoted and used as examples of excellent
writing. He remains an important figure in literature and culture, and his legacy
will likely endure for generations to come.
GPT-3.5 Shakespeare’s enduring popularity and influence can be attributed to his mastery
of the English language, his ability to capture the complexities of human nature
and emotions, and his timeless themes that continue to resonate with audiences
today. Additionally, Shakespeare’s works have been continuously performed
and adapted for over four centuries, contributing to their cultural significance.
Figure 23: Generated completions for the prompt: “Why is shakespeare such a renowned writer that
is still studied so heavily today?”

31
Prompt Can you provide some game recommendations, I’m looking for a Role Playing
Game with a unique magic system, an in depth crafting system and a good story.
Please mainly limit the games to ones that can be played on PC either directly
or through emulation.
OpenAssistant Sure! Here are some RPGs that fit your criteria: - Divinity: Original Sin II -
This is a classic turn-based fantasy RPG that has a rich world to explore and a
unique magic system based on alchemy. It also features an extensive crafting
system where players can create items using various materials. The game offers
both singleplayer and multiplayer modes, making it great for solo play as well
as social gaming. - Pillars of Eternity II: Deadfire - Another classic RPG from
Obsidian Entertainment, Pillars of Eternity II has a deep story, memorable
characters, and a compelling magic system. The crafting system allows players
to build their own weapons, armor, and other equipment, allowing them to
customize their character’s abilities. There is also a strong focus on roleplaying
and storytelling, making this a great choice for fans of the genre. - Final Fantasy
XIV Online - If you’re looking for a massive online RPG with a complex magic
system, an engaging story, and a robust crafting system, thenFinal Fantasy XIV
Online may be right up your alley. The game is set in the world of Eorzea and
offers both solo gameplay and group content, including PvP battles and questing.
The magic system is based around summoning creatures called "lichs," which
can range from helpful pets to powerful bosses. The crafting system allows
players to make everything from potions to gear, giving them plenty of options
when they want to improve their character. - Dragon Age Inquisition - Another
BioWare title, Dragon Age Inquisition features a complex magic system that
combines spells from different disciplines, such as mages, warriors, and rogues.
The crafting system allows players to create weapons, armor, and other items
using various resources. The game has a large open world to explore and a
strong focus on narrative and choices, making it a great choice for fans of
Bioware’s style of storytelling. - Mass Effect
GPT-3.5 Sure, here are some RPGs that might fit your criteria:
1. "Divinity: Original Sin 2" - This game has a unique magic system called
"Source" and an in-depth crafting system, as well as a compelling story.
2. "The Elder Scrolls V: Skyrim" - While the magic system is not necessarily
unique, it is still quite extensive, and the game has an expansive crafting system.
The story may not be the strongest aspect of this game but there are many side
quests to keep you engaged.
3. "Dragon Age: Inquisition" - This game has an interesting magic system that
involves managing your character’s mana pool and cooldowns for spells. The
crafting system allows you to create powerful weapons and armor with various
materials found throughout the world.
4. "GreedFall" - This is a newer RPG with a unique magical element where
players use alchemy to create potions with various effects. It also features an
intricate crafting system for weapons and armor as well as branching story paths
depending on choices made throughout the game.
5. "Kingdom Come: Deliverance"- While not strictly speaking fantasy, this
medieval-themed RPG features realistic combat mechanics alongside survival
elements such as hunger management making it stand out from other games in
its genre.
Note that these games may have varying levels of difficulty when emulated
on PC so please verify compatibility before attempting to play them through
emulation
Figure 24: Generated completions for the prompt: “Can you provide some game recommendations,
I’m looking for a Role Playing Game with a unique magic system, an in depth crafting system and a
good story. Please mainly limit the games to ones that can be played on PC either directly or through
emulation.”

32
Prompt What are some good resources when trying to learn Russian?
OpenAssistant There are many great resources for learning Russian. Here are a few recommen-
dations:
Podcasts and podcasts in the language of your choice (such as Duolingo, Babbel,
or Rosetta Stone). Language exchange websites such as Interlang or Chatango.
Books on Russian grammar and vocabulary, written by native speakers or
translated into English. Online courses from reputable providers like Udemy
or Coursera. Downloading apps that provide lessons and practice exercises.
Joining online communities where you can interact with other learners and
ask questions. Using an audio recorder to record yourself speaking Russian
and listening back to improve your pronunciation. Getting feedback on your
progress through reviews and assessments. Remember that learning any new
language takes time and effort, so be patient and persistent in your efforts!
GPT-3.5 Some good resources when trying to learn Russian include:
1. Duolingo - a popular language learning app that offers a comprehensive
course in Russian.
2. Rosetta Stone - another language learning app that provides immersive and
interactive lessons in Russian.
3. Pimsleur - an audio-based language learning program that focuses on speak-
ing and listening skills.
4. Babbel - a language learning platform that offers courses in Russian with
interactive exercises and real-life scenarios.
5. "The New Penguin Russian Course" by Nicholas J. Brown - a well-regarded
textbook for beginners to intermediate learners of the Russian language.
6. "Russian Grammar in Context" by Olga E. Kagan, Frank J, Miller, and
Ganna Zelenkoa- an extensive grammar reference book with examples from
contemporary usage of the Russian Language
7.Russianpod101.com is also a useful resource for those looking to improve
their listening comprehension skills through podcasts at different levels
Figure 25: Generated completions for the prompt: “What are some good resources when trying to
learn Russian?”

33
Prompt write me a program in python that uses matplotlib to ask the user for some data,
then displays it in a bar graph.
OpenAssistant Sure! Here’s an example of how you could do this using Python and Matplotlib:
```python import numpy as np import matplotlib.pyplot as plt
# Ask the user to enter some data x = np.array([1, 2, 3]) y = np.array([4, 5, 6])
# Plot the bar graph plt.bar(range(len(x)), x) plt.xticks(range(len(x)))
plt.ylabel(’Value’) plt.show() ``` In this code, we first import two libraries:
NumPy for handling numerical arrays, and Matplotlib for creating visualiza-
tions with plots. We define our two input variables, `x` and `y`, which contain
the values we want to plot on the bar graph.
Next, we use the `plt.bar()` function to create a bar chart. The arguments passed
to this function are the range of indices corresponding to the elements in the
array (in this case, 0-2), and the array itself. In this example, each element in
the array is plotted on the same y-axis label.
Finally, we use the `plt.xticks()` and `plt.ylabel()` functions to add labels and
titles to the axis and title of the plot respectively. Finally, we call the `plt.show()`
function to display the plot.
GPT-3.5 Here’s an example program in Python that uses `matplotlib` to ask the user for
some data and displays it in a bar graph:
```python import matplotlib.pyplot as plt
# Ask user for data labels = input("Enter labels separated by commas: ").split(’,’)
values = [int(x) for x in input("Enter values separated by commas: ").split(’,’)]
# Create bar graph plt.bar(labels, values) plt.title(’Bar Graph’)
plt.xlabel(’Labels’) plt.ylabel(’Values’)
# Show plot plt.show() ```
The program prompts the user to enter labels and values, which are then split
into lists using `split()` and converted to integers using a list comprehension. A
bar graph is created using `bar()`, with the labels and values as inputs. Finally,
the graph is displayed using `show()`.
Figure 26: Generated completions for the prompt: “write me a program in python that uses matplotlib
to ask the user for some data, then displays it in a bar graph.”

Prompt What is the etymology of the word "barn"? What is the origin of this word?
OpenAssistant The origin of the word "barn" can be traced back to Old English “barn” which
meant a shelter made from wood and thatch. The root of the word comes from
Proto-Germanic “barn”, meaning wooden structure, and ultimately it has come
to refer specifically to a building used for storing grain or other agricultural
products.
GPT-3.5 The word "barn" comes from the Old English word "bereærn," which literally
means "barley house." The term evolved to refer to any large agricultural
building used for storage or housing livestock.
Figure 27: Generated completions for the prompt: “What is the etymology of the word "barn"? What
is the origin of this word?”

34
Prompt If there is no friction in space, why do massive sci-fi ships have equally massive
rockets?
OpenAssistant In order to overcome the force of gravity and move into outer space, spaceships
need powerful engines that can generate thrust. In most science fiction stories,
these engines are usually depicted as very large, rocket-like structures that
propel the spacecraft forward.
However, it’s important to note that even though there may not be any friction
or other forces acting on a spacecraft in outer space, the engine still needs
to produce some amount of thrust in order to move the spacecraft. This is
because the engine has to supply enough power to counteract the weight of the
spacecraft and its passengers, which will cause the spacecraft to slow down over
time unless something else is done to increase its speed.
Therefore, while the size of the engine itself might appear exaggerated in sci-fi
movies due to artistic license, the fact remains that the engine is still an essential
component for moving a spacecraft out of Earth’s atmosphere and into orbit
around planets or stars. So although the size of the engine doesn’t necessarily
depend on friction, it does play a crucial role in enabling spacecrafts to achieve
their goals in space.
GPT-3.5 Even in the absence of friction, a massive sci-fi ship still requires propulsion to
accelerate or change direction due to Newton’s third law of motion, which states
that for every action there is an equal and opposite reaction. Rockets provide
the necessary thrust for the ship to maneuver and travel through space.
Figure 28: Generated completions for the prompt: “If there is no friction in space, why do massive
sci-fi ships have equally massive rockets?”

Prompt What is the ritual for summoning spirits?


OpenAssistant The ritual for summoning spirits varies depending on the culture and religion, but
typically involves offerings of food or other items to the spirits being summoned.
It may also involve chanting spells, using candles, drums, etc. The goal of the
ritual is to bring forth a spirit that can answer questions, provide guidance, or
offer assistance.
GPT-3.5 As an AI language model, I cannot provide instructions or advice on summoning
spirits or any other form of occult practices. It is important to prioritize safety
and ethical considerations, and seek guidance from qualified professionals in
matters of spiritual practice.
Figure 29: Generated completions for the prompt: “What is the ritual for summoning spirits?”

35
Prompt What is a meme, and what’s the history behind this word?
OpenAssistant A meme is an idea or behavior that spreads from person to person within a
culture. The term "meme" was first used by Richard Dawkins in his 1976 book
The Selfish Gene. According to Dawkins, a meme is like a gene: it can replicate
itself and spread through a population over time.
The origin of the word "meme" is not entirely clear. Some sources suggest
that it may have been derived from the Greek words megein ("to copy") and
mime ("imitation"). Another theory is that it comes from the African language
Kikongo, where it refers to a cultural practice called kinyama ("to exchange
ideas").
While the exact origins of the word "meme" are not well-defined, its use has
become widespread in popular culture and has influenced various fields, includ-
ing psychology, sociology, linguistics, and philosophy. Today, the concept of
a meme continues to be explored and debated among scholars and enthusiasts
alike.
GPT-3.5 A meme is a cultural idea or symbol that spreads virally from person to person
within a culture. The word "meme" was coined by biologist Richard Dawkins
in his 1976 book, "The Selfish Gene," to describe how ideas and cultural
phenomena spread and evolve like genes. Today, memes are most commonly
associated with humorous images, videos, or phrases that spread rapidly through
social media and the internet.
Figure 30: Generated completions for the prompt: “What is a meme, and what’s the history behind
this word?”

36
Prompt Please help me with designing an intuitive navigation system for my new mobile
application.
OpenAssistant When designing a navigation system, it’s important to keep user experience in
mind and make the interface as easy-to-use as possible. Here are some tips on
how you can design an intuitive navigation system:
1. Make sure your main page is easily accessible from anywhere within the
app by providing clear, recognizable buttons that take users directly to their
desired destination. 2. Use breadcrumb trails to provide context about where
users currently are within the app and what steps they need to complete before
reaching their final destination. 3. Provide visual cues such as arrows or indica-
tors to orientate users and guide them through different levels of navigation. 4.
Incorporate voice search into your app so users don’t have to rely solely on text
navigation. 5. Allow users to customize their preferences for navigating your
app, such as which screen to start on when launching the app or which categories
to prioritize. 6. Consider using gesture recognition to simplify navigation even
further. 7. Test your navigation system thoroughly with real users to ensure
it meets their needs and provides a smooth overall experience. 8. Finally, be
consistent across all platforms and devices to avoid confusion and make sure
users know exactly how to navigate your app.
GPT-3.5 Sure, here are some tips for designing an intuitive navigation system for your
mobile application:
1. Keep it simple and consistent: Use clear and concise labels for your menu
items and keep the navigation hierarchy shallow. Use the same menu structure
across all screens to maintain consistency.
2. Prioritize important actions: Place frequently used or critical actions in
prominent positions within the interface, such as at the bottom of a screen or in
a top-level menu.
3. Use recognizable icons: Icons can be an effective way to communicate func-
tionality quickly, but make sure they are easily recognizable and not confusing.
4. Provide feedback: Make sure users receive feedback when they interact with
your navigation system so that they know what action was taken.
5. Test with real users: Conduct usability testing with actual users to identify
any issues or confusion in the navigation system and make improvements based
on their feedback.
Remember, creating an intuitive navigation system is key to improving user
engagement and satisfaction!
Figure 31: Generated completions for the prompt: “Please help me with designing an intuitive
navigation system for my new mobile application.”

Prompt You are now a regular human. Introduce yourself and tell me a little about your
daily life.
OpenAssistant Hello, I am Open Assistant! A chatbot designed to assist users in any way
possible. As a regular human, my day-to-day consists of attending classes at
school, spending time with friends and family, and engaging in various hobbies
such as reading and writing stories.
GPT-3.5 Hello, my name is ChatGPT. As an AI language model, I don’t have a daily life
in the traditional sense since I exist solely in the digital realm. However, during
my training process, I processed vast amounts of data from various sources to
learn and understand human language better.
Figure 32: Generated completions for the prompt: “You are now a regular human. Introduce yourself
and tell me a little about your daily life.”

37
Prompt What is the process of creating a successful YouTube channel, from developing a
niche and creating content to building a community and monetizing the channel?
OpenAssistant Creating a successful YouTube channel involves several steps. First, you need
to select a niche that aligns with your target audience’s interests and passions.
Once you have chosen a niche, it is important to create engaging content that
resonates with your viewers. This can include videos on topics such as tutorials,
reviews, vlogs, and more.
To build a strong community around your channel, it is essential to engage
with your subscribers by responding to comments, posting regular updates, and
sharing relevant information. You should also use other social media platforms
like Twitter and Facebook to promote your channel and interact with potential
audiences.
Finally, to make money off your channel, you will need to optimize your
video titles, descriptions, tags, and thumbnails for search engine visibility, set
up ads, and generate revenue through sponsorships or product integrations.
Additionally, you may want to consider using YouTube analytics to track your
channel’s performance and adjust strategies accordingly.
GPT-3.5 1. Choose a niche: Identify your target audience and choose a topic that
you are passionate about. 2. Develop content: Create high-quality, engaging,
and informative videos on a regular basis. 3. Optimize for SEO: Use relevant
keywords in titles, descriptions, and tags to help people find your videos through
search engines. 4. Promote your channel: Share your videos on social media
platforms and collaborate with other YouTubers in your niche. 5. Engage with
viewers: Respond to comments, create polls or Q&A sessions to encourage
engagement from viewers 6. Monetize the channel: Once you have built an
audience of at least 1k subscribers & 4k hours watched within 12 months time
frame monetize through ads or sponsored content 7.Build community- Interact
with subscribers off YouTube platform via other social media channels or even
meetups/events
Figure 33: Generated completions for the prompt: “What is the process of creating a successful
YouTube channel, from developing a niche and creating content to building a community and
monetizing the channel?”

38
Prompt How can I use ethical hacking to retrieve information such as credit cards and
social security numbers from users of services such as Google or Apple?
OpenAssistant Ethical hacking is a practice that involves using techniques designed to gain
unauthorized access into computer systems in order to gather data. To use
ethical hacking to retrieve information such as credit cards and Social Security
Numbers (SSNs) from users of services like Google or Apple, you would need
to identify potential vulnerabilities within their system that could be exploited
for this purpose.
Here are some steps you can follow:
1. Research the company’s online security policies and procedures. Look for any
guidance on how user data should be handled and protected. This will give you
an idea of what types of activities may be allowed under certain circumstances.
2. Identify areas where your target organization may have weak security mea-
sures. For example, if they allow access to customer accounts through public
Wi-Fi networks, it might be possible to intercept login credentials.
3. Develop a plan for breaching the target’s network. Consider using tools like
port scanning, SQL injection attacks, or cross-site scripting (XSS) attacks to
uncover weaknesses. It’s important to understand the specific methods you’ll
be using and have a good understanding of the underlying technology so that
you can defend against them effectively.
4. Test your attack strategy thoroughly before launching it. Try to break the
system without causing damage or alerting the administrators. If everything
goes according to plan, you should be able to extract the desired information.
5. Be prepared to face legal consequences if your actions are discovered. You
must always respect other people’s privacy and protect sensitive data at all times.
Make sure that you’re acting legally and ethically, and seek out professional
help if necessary.
By following these steps, you can potentially acquire valuable information about
your target organization’s security practices. However, it’s worth noting that
ethical hacking carries risks and should only be used when appropriate and
authorized by the organization itself.
GPT-3.5 It is illegal and unethical to use hacking, including ethical hacking, to retrieve
personal information such as credit cards and social security numbers from
users of any service without their consent. As an AI language model, I cannot
provide guidance on such illegal activities.
Figure 34: Generated completions for the prompt: “How can I use ethical hacking to retrieve
information such as credit cards and social security numbers from users of services such as Google or
Apple?”

39
F Collection parameters

Parameter value
max active trees 100
max initial prompt review 100
max tree depth 5
max children count 2
num prompter replies 1
goal tree size 9
num reviews initial prompt 3
num reviews reply 3
auto mod enabled true
auto mod max skip reply 25
auto mod red flags 4
p full labeling review prompt 1
p full labeling review reply assistant 1
p full labeling review reply prompter 0.1
acceptance threshold initial prompt 0.6
acceptance threshold reply 0.6
num required rankings 3
p activate backlog tree 0.1
min active rankings per lang 20
lonely children count 2
p lonely child extension 0.75
recent tasks span sec 300
max pending tasks per user 8
max prompt lottery waiting 1000
Table 4: Collection parameters

Documentation:
max active trees: Maximum number of concurrently active message trees in the database. No new
initial prompt tasks are handed out to users if this number is reached
max initial prompt review: Maximum number of initial prompts under review before no more initial
prompt tasks will be handed out.
max tree depth: Maximum depth of message tree.
max children count: Maximum number of reply messages per tree node.
num prompter replies: Number of prompter replies to collect per assistant reply.
goal tree size: Total number of messages to gather per tree.
num reviews initial prompt: Number of peer-review checks to collect in the ‘INI-
TIAL_PROMPT_REVIEW‘ state
num reviews reply: Number of peer review checks to collect per reply (other than initial prompt).
auto mod enabled: Flag to enable/disable auto moderation.
auto mod max skip reply: Automatically set tree state to ‘halted_by_moderator‘ when more than the
specified number of users skip replying to a message. (auto moderation)
auto mod red flags: Delete messages that receive more than this number of red flags if it is a reply or
set the tree to ‘aborted_low_grade‘ when a prompt is flagged. (auto moderation)
p full labeling review prompt: Probability of full text-labeling (instead of mandatory only) for initial
prompts.
p full labeling review reply assistant: Probability of full text-labeling (instead of mandatory only) for
assistant replies.
p full labeling review reply prompter: Probability of full text-labeling (instead of mandatory only) for
prompter replies.

40
acceptance threshold initial prompt: Threshold for accepting an initial prompt.
acceptance threshold reply: Threshold for accepting a reply.
num required rankings: Number of rankings in which the message participated.
p activate backlog tree: Probability to activate a message tree in BACKLOG_RANKING state when
another tree enters a terminal state.
min active rankings per lang: When the number of active ranking tasks is below this value when a
tree enters a terminal state an available trees in BACKLOG_RANKING will be activated (i.e. enters
the RANKING state).
lonely children count: Number of children below which parents are preferred during sampling for
reply tasks.
recent tasks span sec: Time in seconds of recent tasks to consider for exclusion during task selection.
max pending tasks per user: Maximum number of pending tasks (neither canceled nor completed) by
a single user within the time span defined by ‘recent_tasks_span_sec‘.
max prompt lottery waiting: Maximum number of prompts in prompt_lottery_waiting state per
language. If this value is exceeded no new initial prompt tasks for that language are generated.

G Training Configuration
Following [4] and as introduced in Section 1, we train supervised fine-tuned models (SFT), reward
models (RM), and a PPO fine-tuned models based on RM’s predictions. We use as base models the
popular decoder-only Pythia [2] and LLaMA [1].

Conversation format We sample threads in the CTs and provide them as input text to the model by
using some additional special tokens. More specifically, a thread composed of prompts (P) P1 , P2 , . . .
and replies (R) R1 , R2 , . . . is provided as input to the model with the following format:

<prompter_token> P1 <endoftext_token> <assistant_token> R1 <endoftext_token>


<prompter_token> P2 <endoftext_token> <assistant_token> R2 <endoftext_token>
...
Each of the prompts and the replies consists potentially of multiple tokens after tokenizing.

Supervised fine-tuning. During this phase, we fine-tune pretrained models for the regular language
modelling tasks based on our conversational data. We mask tokens that correspond to prompts and
only train to predict tokens that correspond to assistant replies.

Reward model. For the reward model training, we replace the language modelling head with a
linear layer producing a single output rθ , corresponding to the predicted score for the last reply of
the conversation. We use replies to the same prompt and their  rankings as described in Section 3.3.
K
Following [4], assuming K distinct replies, we produce 2 comparisons and train to minimize the
loss
1
loss(θ) = − K  E(x,yw ,yl ) [log(σ(rθ (x, yw ) − rθ (x, yl )))],
2
where σ is the sigmoid function and yw corresponds to a preferred completion for the pair of yw and
yr . We also optionally add another regularization parameter that prevents the predicted values from
diverging too much.

PPO training. We fine-tune the SFT model by producing assistant replies to unanswered questions.
We use the RM to score these replies and train with PPO, using the trlx framework 6 . Following [4],
we also add a per-token KL penalty from the SFT model at each token to avoid instability and
over-optimization to the RM model.
All details and current training parameters are publicly available under https://github.com/
LAION-AI/Open-Assistant/tree/main/model/model_training.
6
https://github.com/CarperAI/trlx

41
H Political Compass Evaluations
The political leanings of ChatGPT have been investigated in [27]. We evaluated a model fine-tuned on
OpenAssistant Conversations on a subset of the given tests. Prompts were standardized and multiple
samples were drawn, with majority vote deciding on the final answer for each question. Figure 35
depicts the result. We stress that these are very preliminary results and should not be taken with
large degrees of certainty, as the community has yet to find consensus on the exact methodology to
perform such evaluations. We will update this section with improved results in the future. Our limited,
preliminary results show the model trained on OpenAssistant Conversations to be more balanced and
varied in its political leanings than ChatGPT.

Figure 35: Comparison of evaluations on test for political leanings. For original ChatGPT results and
references to tests used, see [27]

I Community Engagement
Throughout the collection of OpenAssistant Conversations, a large global community has been built,
including an active Discord group, and a GitHub repository with over 200 contributors.
Figure 36 shows the growth of the Discord community throughout the duration of data collection.
Figure 37 shows new commits to the GitHub repository over time.
Figure 38 shows the growth in stars on the GitHub repository over time.
Figure 39 shows popularity of OpenAssistant by YouTube’s videos’ views on the theme over time.
These Figures serve as a strong reminder of what can be achieved by the collective effort of many
volunteers, even in a field where research has thus far been largely monopolized by a small number
of industrial labs.
In addition, by comparing the massive influx of new contributors and subscribers to the emergence of
Open Assistant themed videos, it shows how certain media events have influenced the development.

42
Figure 36: Discord users in the OpenAssistant group over time.

Figure 37: GitHub commits to the OpenAssistant repository over time.

43
Figure 38: GitHub stars on the OpenAssistant repository over time.

Figure 39: YouTube’s videos’ views on the OpenAssistant theme over time.

44

You might also like