Abstract
The proof-of-concept system developed for this research generated MIDI files
using ChatGPT, which were analyzed against human-composed music from the
Lakh MIDI dataset. Objective measures, including pitch-class distributions,
Inter-Onset-Interval (IOI), pitch range, average pitch intervals, pitch counts,
note length, and transition matrices, facilitated a comprehensive comparison.
Findings revealed that while the AI model’s output demonstrated stylistic
consistency and a certain level of musical texture, it exhibited less complexity and
variety compared to human compositions.
Despite its limitations, ChatGPT exhibits the capability to generate valuable music
from natural language prompts. However, enhancements are necessary to better
mimic the complexity and variance found in human compositions in order to make
it applicable in music production.
Sammanfattning
This thesis investigates the ability of an artificial intelligence (AI) model, ChatGPT 3.5-turbo, to compose valuable music in a digital format (MIDI) from natural language. The study applies a multifaceted quantitative methodology, in which objective musical measures are combined with subjective user evaluations.
Despite its limitations, ChatGPT demonstrates the ability to generate valuable music from natural language prompts. However, improvements are needed to better mimic the complexity and variance found in human compositions in order to make it applicable in music production.
Author
Marcus Warnerfjord
KTH Royal Institute of Technology
Electrical Engineering and Computer Science
Examiner
Pawel Herman
Supervisor
Jörg Conradt
Contents
1 Introduction
1.1 Problem
1.2 Purpose
1.3 Goal
1.4 Delimitations
1.5 Outline
2 Background
2.1 Research Design
2.2 Large Language Models
2.3 Essential Elements of Music Theory
2.4 Digital Representation of Music: MIDI
2.5 Related Work: On the evaluation of generative models in music
3 Method
3.1 Research Design
3.2 Proof-of-concept
3.3 Community Feedback and Redeployment
3.4 Construction of the Analysis Tool
3.5 Data Analysis
4 Result
4.1 Objective metrics
4.2 Subjective metrics
5 Discussion
5.1 Limitations
5.2 Future Work
6 Conclusion
1 Introduction
The fusion of artificial intelligence (AI) and music has introduced new avenues for
both creative expression and automation. AI is increasingly playing a significant
role in the music industry, from generating novel compositions to aiding in music
production [5]. Key to this advancement is AI’s ability to comprehend and create
music in a variety of formats. The Musical Instrument Digital Interface (MIDI),
a protocol used for conveying musical information digitally, is one such format.
MIDI has become an industry standard in music production and live performance because it communicates only the structure and timing of the notes, rather than audio; this information can then be sent to any instrument with a MIDI input, making it a highly flexible format to work with.
1.1 Problem
The primary issue this project seeks to address is the ability of a large
language model (LLM), specifically ChatGPT, to generate valuable MIDI files
from natural language prompts. The problem involves both the objective and
subjective assessment of the model’s performance against human-composed
MIDI files.
The study addresses the following research questions:
• To what degree does the music generated by ChatGPT align with human-composed music when evaluated using the set of objective metrics suggested by Yang et al. [19]?
• To what extent are users, particularly those with varying degrees of musical
experience, satisfied with the musical outputs of ChatGPT?
1.2 Purpose
The objective of this degree project is to investigate the potential of LLMs for
generating MIDI files based on natural language prompts. This exploration will
enhance understanding of the underlying relationship between natural language
processing and music generation, thereby contributing to the broader field of AI
in music.
1.3 Goal
The ultimate aim of this degree project is to uncover the capabilities of LLMs in
generating MIDI files based on natural language prompts. The specific objectives
are to develop a proof-of-concept system that utilizes LLMs to generate MIDI files
from text descriptions, and then to evaluate the effectiveness and limitations of the LLM-based approach in generating musically coherent and relevant MIDI files, using both objective and subjective metrics.
The project's deliverables will include the developed system, a set of generated MIDI files, and an evaluation of the produced music based on MIDI data, user feedback, and analysis.
1.4 Delimitations
• The performance of large language models (LLMs) other than ChatGPT
falls outside the purview of this study, which is dedicated to investigating
the potential of ChatGPT in generating MIDI files from natural language
prompts.
1.5 Outline
Chapter 2 provides the foundation for the study, outlining the research design, large language models, basic music theory, and the role of MIDI. The chapter also discusses related work in music generation systems.
Chapter 3 describes the method, covering the proof-of-concept system, the community feedback and redeployment, the construction of the analysis tool, and the data analysis.
Chapter 4 offers the results, examining both objective and subjective metrics to provide a comprehensive evaluation of the AI-generated music.
Chapter 5 discusses the findings, their limitations, and future work, and Chapter 6 concludes the study.
2 Background
2.2 Large Language Models
Large Language Models (LLMs), which emerged around 2018, represent a significant shift in the field of natural language processing. They are neural networks with an immense number of parameters, often in the billions or more, and are trained on extensive amounts of unlabeled text using self-supervised or semi-supervised learning techniques. The advent of LLMs has redefined the approach in natural language processing research, transitioning from the previous paradigm of creating specialized supervised models for specific tasks towards models capable of performing a wide array of tasks with high proficiency [17].
In the context of a conversation with an LLM, prompts can be used to set the
context and specify what information is important and what the desired format
and content of the output should be. For example, a prompt could dictate
that the LLM should only generate code adhering to a certain coding style, or
flag certain keywords in a generated document and provide additional related
information.
The ChatGPT API is a product by OpenAI that enables developers to integrate the
capabilities of the GPT-3 model into their applications. It provides a way to send
a series of messages to the model and receive a model-generated message as a
response. Each message has a ’role’ that can be ’system’, ’user’, or ’assistant’, and
’content’ which is the text of the message from the role [10].
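As an illustration of this message format, the sketch below shows how such a request could be assembled in Python. It assumes the pre-1.0 openai package interface and uses placeholder prompt text; it is not the exact code or prompt used in this project.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supplied via configuration in practice

# Each message has a 'role' ('system', 'user' or 'assistant') and 'content'.
messages = [
    {"role": "system",
     "content": "You are a composer. Reply only with a list of "
                "[note, start_tick, duration_tick] lists."},
    {"role": "user",
     "content": "Generate a 4-bar melody in C major at 120 BPM."},
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the model used in this study
    messages=messages,
)

# The reply text is the 'content' of the first returned choice.
print(response["choices"][0]["message"]["content"])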
2.3 Essential Elements of Music Theory
In music theory, a pitch class refers to the set of all pitches that are a whole number of octaves apart from each other [11]. For example, the pitch class "C" includes all C notes on a keyboard, irrespective of the octave they are in.
Western music recognizes twelve distinct pitch classes, each signified by one of
twelve notes within an octave on a piano keyboard: C, C#/Db, D, D#/Eb, E, F,
F#/Gb, G, G#/Ab, A, A#/Bb, and B. Each pitch class corresponds to a different
frequency. A sharp (”#”) raises the pitch of a note by a half step, and a flat (”b”)
lowers it by the same amount. This system is integral to creating scales, chords,
and melodies.
In MIDI, each pitch class within an octave is assigned a number from 0 to 11, with
C as 0, C#/Db as 1, D as 2, and so on up to B as 11 [7]. This pattern repeats modulo 12 over the full MIDI note-number range of 0 to 127, thereby representing all notes in the MIDI format.
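As a small illustration of this numbering, the pitch class of a MIDI note number is obtained with a modulo-12 operation (the list of names below follows the convention used in this section):

PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def pitch_class_name(note_number):
    # MIDI note numbers run from 0 to 127; note 60 is middle C.
    return PITCH_CLASSES[note_number % 12]

print(pitch_class_name(60))  # "C"
print(pitch_class_name(61))  # "C#/Db"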
To bridge the gap between these music theory concepts and their digital
representation, we use the Musical Instrument Digital Interface (MIDI).
Figure 2.1: Visualization of MIDI data. In the upper subplot, time is plotted on
the x-axis, and pitch/key is on the y-axis, representing the sequence of the melody.
The lower subplot, instead of pitch/key, represents velocity (the speed at which the
key was pressed) on the y-axis.
A MIDI file conveys musical information as a stream of messages such as "note on," "note off," "note/pitch," "pitchbend," and others. MIDI instruments interpret these messages to generate sound [8].
To interact with MIDI files programmatically, one can use libraries like Mido, a
Python library that allows for the reading, writing, and real-time handling of MIDI
messages in a way that aligns with Python’s design principles [12].
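As a brief sketch of how Mido is typically used (the file name here is illustrative), the following reads a MIDI file and prints its note events:

import mido

mid = mido.MidiFile("example.mid")  # hypothetical file
print("ticks per beat:", mid.ticks_per_beat)

for i, track in enumerate(mid.tracks):
    for msg in track:
        # msg.time is the delta time in ticks since the previous message
        if msg.type in ("note_on", "note_off"):
            print(i, msg.type, msg.note, msg.velocity, msg.time)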
2.5 Related Work: On the evaluation of generative models in music
2.5.1 Subjective Metrics
Subjective evaluations are primarily based on human listener inputs. This can
involve a musical Turing test, where listeners distinguish between human and
computer-generated compositions. However, there are known limitations to this
method. For instance, the aesthetics of a piece may be conflated with whether it
sounds human-composed. Additionally, these tests often overlook factors such
as listener expertise and sample size, which can influence the reliability and
statistical significance of results. Furthermore, these tests risk overestimating the
subject's comprehension. Therefore, these metrics should be used together with an objective evaluation.
2.5.2 Objective Metrics
• Pitch Count (PC): This counts the number of different pitches within a
MIDI file, resulting in a scalar for each sample [19].
• Pitch Range (PR): The pitch range is the difference between the highest
and lowest pitches in semitones within a sample, resulting in a scalar for
each sample.
• Average Pitch Interval (PI): This is the average value of the interval
between two consecutive pitches in semitones, resulting in a scalar for each
sample.
• Note Count (NC): This counts the total number of notes in a piece,
resulting in a scalar for each sample. It is a rhythm-related feature that does
not contain pitch information.
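To make these scalar features concrete, a minimal Python sketch using Mido could compute them from a single MIDI file as shown below. The helper name and the simplification of reading notes in message order are assumptions for illustration, not the evaluation code used in the study.

import mido

def scalar_features(path):
    """Compute PC, PR, PI and NC for one MIDI file (simplified, single stream of notes)."""
    notes = []
    for track in mido.MidiFile(path).tracks:
        for msg in track:
            if msg.type == "note_on" and msg.velocity > 0:
                notes.append(msg.note)

    pitch_count = len(set(notes))                              # PC: distinct pitches
    pitch_range = max(notes) - min(notes) if notes else 0      # PR: span in semitones
    intervals = [abs(b - a) for a, b in zip(notes, notes[1:])]
    avg_interval = sum(intervals) / len(intervals) if intervals else 0  # PI
    note_count = len(notes)                                    # NC: total notes

    return pitch_count, pitch_range, avg_interval, note_count

print(scalar_features("example.mid"))  # hypothetical file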
3 Method
The study follows a multifaceted quantitative approach, combining quantitative
techniques for data collection and analysis. The project’s research methods are
grounded in the pragmatist philosophical assumptions, which focus on solving
real-world problems and using the most appropriate methods for the given
context [6].
The research process can be summarized in four steps:
1. Development of a proof-of-concept system that adapts ChatGPT to generate MIDI files from natural language prompts.
2. Deployment of the system as a web application and collection of generated MIDI files together with user feedback.
3. Computation of objective metrics for the generated files and for a reference set drawn from the Lakh MIDI dataset.
4. Analysis of the collected data to identify factors influencing the quality and relevance of the generated music.
3.2 Proof-of-concept
The development of the PoC system, depicted in figure 3.1, is based on adapting
ChatGPT for generating MIDI files from natural language prompts. A web app
was developed to facilitate the interaction between users and the system and to
collect user feedback through surveys. The main functionality is described as
follows:
Users access the web app through their browser and are presented with an
interface to input their preferences or parameters for the MIDI file generation,
such as music style, tempo, key, and length. After submitting their preferences,
the web app processes the request by sending the user’s input to the server. The
input is contextualized in a natural language prompt and sent to the ChatGPT API,
which analyzes the input and sends back a list of note events formatted as a list
of lists. While the server processes the request, the user is shown a ”processing”
page, which periodically checks the status of the task by polling the server. Once
ChatGPT has provided a response, the web app generates the MIDI file based
on the provided list. The server immediately stores the file in the database and
provides a unique identifier (a file ID) for the generated file. The web app then
redirects the user to a ”feedback” page, where they can download the generated
file. Users can also provide feedback if they would like to contribute. The feedback
questions are slightly different depending on the user’s musical experience, as
suggested by Yang et al. [19]. Once the feedback is submitted, the data from the
input is stored within the database.
The user input form was designed to facilitate the collection of essential
information required to generate a customized music composition. The form
aims to be simple, intuitive, and flexible to accommodate users’ diverse musical
preferences. The design of the form is as follows:
• Tempo: This field allows users to specify the tempo of the music in beats per
minute (BPM). A default value of 120 BPM is provided, which is a common
tempo for various music genres.
• Time Signature: Users can input the desired time signature for the
composition. The default value is 4/4, which is the most common time
signature in Western music.
• Key Signature: This field allows users to define the key signature of the
composition. The default value is C, which represents the C Major scale.
• Length: Users can specify the desired length of the composition in bars.
The default value is 4, which represents a short musical phrase.
• Genre: This field allows users to define the musical genre for the
composition. The default value is ”Any”, which gives the LLM the freedom
to choose a genre based on its expertise.
• Mood: Users can specify the mood of the music, such as ”happy” or ”sad”.
The default value is ”Any”, allowing the LLM to determine the mood based
on other inputs or its expertise. This also allows for some more diverse
prompts.
This user input form design ensures that users can easily define their preferences
while providing the necessary information for the LLM to generate a customized
musical composition. The form’s simplicity and flexibility allow users with varying
levels of musical expertise to interact with the system and explore the potential of
LLMs for music generation.
Example: "From now on, when I ask you to generate music, I want you to do it in the short-midi-list-language (SMLL), which consists of a list of lists, formatted like this: [[N1, S1, D1], [N2, S2, D2], [N3, S3, D3], ...], where N represents the MIDI note number, S represents the start time (in ticks), and D represents the duration (in ticks)."
• Template pattern: The template pattern provides a clear format for the
LLM’s output, ensuring consistency and facilitating downstream processing
of the generated music. It instructs the LLM to preserve the provided format
when generating music in the SMLL.
• Output formatting: Additional system instructions describe the required output formatting for subsequent processing, including the extraction of
note_on and note_off events, the creation of MIDI messages and events,
and the use of the mido Python library for generating MIDI files. It also
specifies the ticks_per_beat value to be used in the process.
These components were incorporated into the system messages before presenting
the actual prompt to the LLM. This approach ensures that the LLM has the
necessary context and instructions for generating music according to the user’s
preferences and in a format that can be easily converted into MIDI files.
In the process of obtaining MIDI data from ChatGPT, the response may contain
clutter or unexpected formatting that could interfere with the conversion of the
response to a Python matrix. To handle this, the program first extracts only
the relevant information using a pattern matching technique. It searches for the
desired format in the response, preserving only the lists that match the pattern,
and reconstructs a cleaned response containing only the matched lists.
Once the cleaned response is obtained, the program tries to convert it into a
Python list of lists. If the conversion fails due to any errors, the program sends
a new prompt to the ChatGPT API to try and obtain a better-formatted response.
This process continues until the response can be successfully converted into a
Python list of lists or a maximum number of attempts is reached.
Finally, the list of MIDI events is converted into a more suitable data structure
that can be used to generate a MIDI file.
This approach ensures that the response from ChatGPT can be successfully processed and used to generate a MIDI file, while minimizing the risk of errors due to unexpected formatting or clutter in the response.
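A minimal sketch of this clean-and-retry logic is given below. The regular expression, the retry limit, and the function names are assumptions made for illustration rather than the exact implementation of the proof-of-concept.

import ast
import re

NOTE_PATTERN = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")
MAX_ATTEMPTS = 3  # assumed retry limit

def parse_smll(response_text):
    """Extract [note, start, duration] triplets from a raw ChatGPT response."""
    matches = NOTE_PATTERN.findall(response_text)
    if not matches:
        return None
    cleaned = "[" + ", ".join(matches) + "]"
    try:
        return ast.literal_eval(cleaned)  # safe conversion to a Python list of lists
    except (ValueError, SyntaxError):
        return None

def request_notes(ask_chatgpt, prompt):
    """Re-prompt the model until a parsable response is obtained or attempts run out."""
    for _ in range(MAX_ATTEMPTS):
        notes = parse_smll(ask_chatgpt(prompt))  # ask_chatgpt is a placeholder callable
        if notes is not None:
            return notes
    raise RuntimeError("Could not obtain a well-formatted SMLL response")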
The feedback forms are designed with the purpose of gathering user feedback
on the AI-generated music, which helps to evaluate the subjective metrics.
By separating the feedback forms based on experience levels (Beginner,
Intermediate, and Advanced), the system can gather more targeted and specific
feedback that considers the users’ expertise and understanding of music.
This allows users across different experience levels to provide more reliable
feedback.
The default option is ”Undefined” for each question and serves two purposes.
First, it acts as a placeholder, prompting users to choose a value that reflects their
opinion. By having ”Undefined” as the default choice, users are encouraged to
provide their input, rather than leaving the question unanswered. Second, it helps
prevent any unintentional or accidental selections by the user. If a user forgets to
answer a question, the ”Undefined” value ensures that the system does not record
any misleading or incorrect feedback. Subjective metrics, gathered through user
feedback, are crucial for understanding how well the AI-generated music meets
user expectations based on the given prompts. The feedback forms, separated by
experience levels (Beginner, Intermediate, and Advanced), allow users to provide
their opinions on various aspects of the music, such as how well it matches the
desired tempo, time signature, genre, mood, and complexity. For example, a
question like ”How well does the generated music match the desired Mood?” helps
to gauge whether the AI is accurately capturing users’ emotional intent.
The questions, each rated on a scale of 1-5, are grouped by skill level as follows.
Beginner (Q1-Q13):
Q1. I feel generally positive towards the use of AI.
Q2. How well does the generated music match the desired Tempo?
Q3. How well does the generated music match the desired Time Signature?
Q4. How well does the generated music match the desired Genre?
Q5. How well does the generated music match the desired Mood?
Q6. How well does the generated music match the desired Key Signature?
Q7. How well does the generated music match the desired Length (Bars)?
Q8. How well does the generated music match the desired choice of Harmony, Melody, or Rhythm?
Q9. How well does the generated music match the desired Complexity?
Q10. Overall, how satisfied are you with the AI-generated music?
Q11. Does the generated music have a clear structure (e.g., intro, verse, chorus, etc.)?
Q12. Are there any noticeable repetitions in the generated music?
Q13. Is the generated music enjoyable to listen to?
Intermediate (Q1-Q16): all Beginner questions, plus:
Q14. How well does the generated music maintain a consistent harmonic progression?
Q15. How natural are the melodic transitions in the generated music?
Q16. How well do the rhythm patterns support the generated music?
Advanced (Q1-Q19): all Intermediate questions, plus:
Q17. How well does the generated music incorporate dynamics?
Q18. How well does the generated music incorporate articulation?
Q19. Can you identify any advanced musical techniques (e.g., counterpoint, modulation, etc.)?
The first question (Q1) is essential, as it provides insight into users' general attitudes towards
AI-generated music, which can potentially affect their opinions and evaluation
of the compositions. This approach is supported by research, such as the study
conducted by Chamberlain et al. (2018), which found that aesthetic responses
to computer-generated art can be influenced by the knowledge that the art was
created by an AI [4].
The MIDI conversion process relies on the mido library, which is a Python library
for working with MIDI messages and files. mido provides an easy-to-use interface
for creating, parsing, and manipulating MIDI data. The process works in the
following steps:
• Create MIDI file and track: A new MIDI file and track are created using
the MidiFile and MidiTrack classes from the mido library.
• Set meta-data: The tempo, time signature, key signature and such (if
provided) are set using MetaMessages. mido provides a convenient way to
create MetaMessages for setting various meta-data properties of the MIDI
file, such as ”set_tempo”, ”time_signature”, and ”key_signature”.
• Create note events: For each note produced by ChatGPT, the start time,
duration, and velocity are extracted. The end time is calculated by adding
the duration to the start time. Note on and note off events are created and
appended to an events list. Each event is a tuple containing the start or end
time, the note value, and a boolean indicating whether it’s a note on (True)
or note off (False) event.
• Sort note events: The events list is sorted by the start time and whether
the event is a note on or note off event. This ensures that the note events are
processed in the correct order when adding them to the track.
• Add note events to the track: For each event in the sorted list, a
”note_on” or ”note_off” message is added to the track. The time difference
between the current event and the last event is calculated to ensure proper
timing. This time difference is passed as the time argument in the
mido.Message function call, which determines the delay between the current
message and the previous one. This step ensures that the notes are played
at the correct times and with the correct durations.
• Calculate ticks per bar: The ticks per bar are calculated based on the
MIDI file’s ticks per beat and the beats per bar (obtained from the numerator
of the time signature).
• Extend track length (if necessary): If the track’s length in ticks is less
than the desired length in ticks, silence is added to the end of the track using
a control_change message.
• Save MIDI file to binary data: The MIDI file is saved to binary data
using a BytesIO buffer, which is then returned.
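The steps above can be sketched with Mido roughly as follows. This is a condensed illustration under assumed defaults (480 ticks per beat, fixed velocity 64) and omits the track-length extension; the actual proof-of-concept code may differ in detail.

import io
import mido

def notes_to_midi(notes, bpm=120, numerator=4, denominator=4, key="C"):
    """notes: list of (note, start_tick, duration_tick) triplets from ChatGPT."""
    mid = mido.MidiFile(ticks_per_beat=480)
    track = mido.MidiTrack()
    mid.tracks.append(track)

    # Meta-data: tempo, time signature and key signature.
    track.append(mido.MetaMessage("set_tempo", tempo=mido.bpm2tempo(bpm), time=0))
    track.append(mido.MetaMessage("time_signature", numerator=numerator,
                                  denominator=denominator, time=0))
    track.append(mido.MetaMessage("key_signature", key=key, time=0))

    # Build absolute-time note on/off events, then sort them by time.
    events = []
    for note, start, duration in notes:
        events.append((start, note, True))              # note on
        events.append((start + duration, note, False))  # note off
    events.sort(key=lambda e: (e[0], e[2]))  # note_off before note_on at equal times

    # Convert absolute times to delta times while appending the messages.
    last_time = 0
    for abs_time, note, is_on in events:
        msg_type = "note_on" if is_on else "note_off"
        track.append(mido.Message(msg_type, note=note, velocity=64,
                                  time=abs_time - last_time))
        last_time = abs_time

    # Save the file into an in-memory buffer and return the binary data.
    buffer = io.BytesIO()
    mid.save(file=buffer)
    return buffer.getvalue()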
To handle long response times from the API and prevent interruptions from the
web server, Celery and Redis were used. Celery is a task queue that allows for
asynchronous processing of tasks, while Redis serves as a message broker and
caching system. With Celery and Redis, the web application can manage tasks in
the background, ensuring smooth user experience and uninterrupted web server
operation.
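A minimal sketch of that setup is shown below; the broker URL, task name, and helper functions (build_prompt, request_notes, notes_to_midi, store_midi) are assumptions used for illustration.

from celery import Celery

# Redis acts as the message broker (and here also as the result backend).
app = Celery("midi_tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def generate_midi_task(user_input):
    """Run the slow ChatGPT call and MIDI conversion outside the web request."""
    prompt = build_prompt(user_input)           # hypothetical helper
    notes = request_notes(ask_chatgpt, prompt)  # hypothetical helper
    midi_bytes = notes_to_midi(notes)
    return store_midi(midi_bytes)               # hypothetical: store in DB, return file ID

# In the web view the task is queued with generate_midi_task.delay(user_input),
# and the "processing" page polls the task state until the result is ready.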
Additionally, to further increase reach and diversify the user base, the web
application was published on various Subreddits targeted towards both music-
oriented and computer science-oriented audiences. This approach enabled
valuable insights from users with different backgrounds and expertise, ensuring
a well-rounded evaluation of the AI-generated music. By appealing to a broader
audience, we can better understand the performance and reception of ChatGPT’s
MIDI generation capabilities across different user groups.
• Improving the handling of multiple users: Users reported that the site was
not functioning optimally when accessed by many users simultaneously. To
address this issue, the message broker was enhanced with a new Redis client,
allowing for more efficient handling of concurrent users and preventing
interruptions from the web server.
After implementing these changes, the web application was redeployed, providing
an improved user experience and ensuring a more reliable platform for user
evaluation.
During the evaluation process, it was discovered that the intended analysis tool,
mgeval [20] (Python), developed by Yang et al., had compatibility issues and
outdated requirements. To overcome these challenges, an alternative approach
was adopted to measure the objective metrics.
Instead of relying on mgeval, the miditoolbox for MATLAB was employed, which
consists of similar functions for analyzing and processing MIDI files [1]. This
decision allowed for a more seamless integration of the analysis tool into the
evaluation process while still providing comparable metrics for assessing the AI-
generated music.
To assess the quality of the AI-generated music and draw meaningful comparisons
with human-made compositions, it is essential to select an appropriate
benchmark dataset. The Lakh MIDI dataset (LMD) was chosen for this purpose,
as it represents a diverse collection of music and offers a substantial amount of
data for comparison. The LMD consists of 176,581 unique MIDI files, with 45,129
of these files matched and aligned to entries in the Million Song Dataset [13].
1. Load MIDI files from the two sets and convert them into note matrices using
the miditoolbox functions.
2. Calculate various metrics for each MIDI file in both sets, such as pitch class
histograms, inter-onset intervals, note counts, pitch ranges, pitch intervals,
pitch counts, note length histograms, pitch class transition matrices, and
note length transition matrices.
3. Visualize the results using graphical representations such as histograms and
heatmaps to facilitate comparison between the two sets.
3.5 Data Analysis
Data analysis was performed on the objective and subjective metrics collected
from the MIDI compositions generated by ChatGPT 3.5-turbo and the feedback
from users.
All data collected in this study was initially subjected to a normality test.
The Lilliefors test, a variation of the Kolmogorov-Smirnov test used when the
parameters of the normal distribution are estimated from the data, was applied to
check the conformity of the data with a normal distribution. However, the results
indicated that the data did not adhere to the criteria of a normal distribution.
Consequently, the non-parametric Mann-Whitney U test was applied to compare the ChatGPT-generated and human-composed sets. The p-values obtained from the Mann-Whitney U tests were examined to verify
statistical significance. In all instances, the p-values were found to satisfy a
predetermined significance level, suggesting a statistically significant difference
between the two groups within the context of this study.
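In Python, the same testing procedure could be sketched as follows; the input files are placeholders, and the study itself performed this analysis on the computed MIDI metrics.

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.diagnostic import lilliefors

chatgpt_metric = np.loadtxt("chatgpt_pitch_range.csv")  # hypothetical metric values
lakh_metric = np.loadtxt("lakh_pitch_range.csv")

# Lilliefors test: Kolmogorov-Smirnov variant with mean and variance estimated from the data.
ks_stat, p_normal = lilliefors(chatgpt_metric, dist="norm")
print("Lilliefors p-value:", p_normal)  # a small p-value rejects normality

# Since normality is rejected, compare the groups with the non-parametric
# Mann-Whitney U test instead of a t-test.
u_stat, p_value = mannwhitneyu(chatgpt_metric, lakh_metric, alternative="two-sided")
print("Mann-Whitney U p-value:", p_value)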
Lastly, the computed statistics were depicted using appropriate graphical
representations. These visualizations aided in identifying patterns, trends,
and differences in the analyzed metrics. The plots, along with the calculated
descriptive statistics and the results of the Mann-Whitney U tests, formed the
basis for drawing conclusions about the musical performance of the ChatGPT
model compared to human-composed MIDI files.
For the subjective metrics, the users’ overall satisfaction scores, overall
satisfaction scores by skill level, mean scores for each question by skill level,
relationship between AI sentiment and satisfaction, and performance metrics
were analyzed. To summarize and visualize these metrics, various statistical
plots including histograms, boxplots, heatmaps, and scatterplots were generated.
These plots enable the visual understanding of data distribution, central tendency,
dispersion, and potential correlations in the data set.
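As an indication of how such plots can be produced (the column names below are assumptions, not the study's actual database schema), a short matplotlib/seaborn sketch:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

feedback = pd.read_csv("feedback.csv")  # hypothetical export of the survey database

# Distribution of overall satisfaction scores (histogram).
sns.histplot(feedback["overall_satisfaction"], bins=10)
plt.savefig("satisfaction_hist.png")
plt.clf()

# Satisfaction broken down by skill level (boxplot).
sns.boxplot(data=feedback, x="skill_level", y="overall_satisfaction")
plt.savefig("satisfaction_by_skill.png")
plt.clf()

# Mean score per question and skill level (heatmap).
question_cols = [c for c in feedback.columns if c.startswith("Q")]
means = feedback.groupby("skill_level")[question_cols].mean()
sns.heatmap(means, annot=True, cmap="viridis")
plt.savefig("question_heatmap.png")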
4 Result
Through the web app, 500 MIDI files generated by ChatGPT 3.5-turbo were collected and matched with 500 randomly selected files from the Lakh MIDI Dataset. The objective metrics offer quantitative assessments of the music's attributes, facilitating direct comparisons with human-composed pieces.
Figure 4.1: Histograms depicting the distribution of pitch counts for music
generated by ChatGPT 3.5-turbo (left) and the Lakh MIDI Dataset (right). The
solid red line represents the mean, and the dashed lines delineate one standard
deviation above and below the mean.
The music generated by ChatGPT tends to exhibit fewer unique pitches and a
wider variation in pitch usage, while human compositions in the Lakh MIDI
Dataset demonstrate more consistent usage of a wider range of pitches. This
comparison potentially suggests a denser tonal texture in the music created by the
model, while human compositions typically traverse a larger tonal space.
Pitch class distribution quantifies how often each pitch class (note) occurs in a
piece of music. Figure 4.2 compares how often each musical note (C to B) is
used.
ChatGPT music, on average, uses each note about 25% of the time, with a variation
of 13%. This means there’s a notable difference in how often each note is
used.
Lakh MIDI Dataset music, on the other hand, uses each note roughly 30% of the
time, with a higher variation of 15%. So, it has a slightly broader use of different
notes, but with a wider range in the frequency of their use.
ChatGPT music shows a bit less unpredictability (entropy 3.26) in note usage compared to the Lakh MIDI Dataset (entropy 3.53). This suggests that ChatGPT's note usage is somewhat more predictable and repetitive.
Figure 4.2: Comparison of note usage between ChatGPT 3.5-turbo (blue) and the
Lakh MIDI Dataset (red). Each note from C to B is shown on the x-axis. The y-axis
shows the probability of each note appearing in the music.
The findings here reveal that ChatGPT’s generated music exhibits some stylistic
consistency in pitch usage but may lack the complexity and variety found in
human compositions. The music generated by ChatGPT seems to include smaller
and more varied pitch intervals, whereas human compositions in the Lakh MIDI
Dataset exhibit larger and more consistent pitch intervals, possibly indicating a
denser tonal texture in the music generated by ChatGPT and a broader tonal space
in human compositions.
The pitch class transition matrices for both ChatGPT 3.5-turbo and the Lakh MIDI dataset reveal a complex web of transitions between pitch classes. The mean transition probability and standard deviation for both sets are the same.
Figure 4.3: Pitch class transition matrices for ChatGPT 3.5-turbo generated
music and the Lakh MIDI dataset. The color gradient corresponds to the
transition probability between pitch classes, with darker shades representing
higher probabilities.
Even though the numerical values converge, indicating similar transition patterns
on average, the visual representation of the matrices shows a more nuanced
contrast between the music generation behavior of ChatGPT 3.5-turbo and human
compositions.
The examination of the pitch range, defined as the disparity between the highest
and lowest pitches in a musical piece, serves as an effective tool to discern the
tonal diversity of the piece. The illustration in Figure 4.4 elucidates the pitch range
analysis for the musical pieces generated by ChatGPT 3.5-turbo as well as those
from the Lakh MIDI Dataset.
Figure 4.4: Pitch ranges in semitones for both the ChatGPT 3.5-turbo dataset (left panel) and the Lakh MIDI Dataset (right panel). The distributions of the pitch ranges are depicted with histograms, while the mean and standard deviation for each dataset are represented by solid and dashed vertical lines, respectively.
The mean pitch range for music output by ChatGPT 3.5-turbo is 7.27 with a standard deviation of 7.55, indicative of a relatively consistent tonal range in the music composed by the model. On the other hand, the Lakh MIDI Dataset exhibits a higher mean pitch range of 59.93 with a standard deviation of 11.91, demonstrating a more extensive tonal range and variability in human compositions.
Figure 4.5: Histograms of the average pitch intervals for music generated by
ChatGPT 3.5-turbo (left) and the Lakh MIDI Dataset (right). The mean pitch
interval and standard deviation (Std Dev) are indicated by the red solid and dashed
lines, respectively.
Figure 4.6: Comparison of the average note counts for ChatGPT 3.5-turbo and
the Lakh MIDI Dataset. The standard deviation is represented by the error bars.
The blue bar corresponds to ChatGPT, and the red bar represents the Lakh MIDI
Dataset. Due to the considerable difference in values, the y-axis is presented in
log scale.
The Inter-Onset Interval (IOI) represents the time interval between the onset
of consecutive notes, providing valuable insights into the rhythmic properties of
musical compositions. Figure 4.7 compares the average IOI of music generated
by ChatGPT 3.5-turbo with that of compositions found in the Lakh MIDI
Dataset.
Figure 4.7: The average Inter-Onset Interval (IOI) for ChatGPT 3.5-turbo and the
Lakh MIDI Dataset. The error bars represent the standard deviation for each set.
The blue bar corresponds to ChatGPT, and the red bar corresponds to the Lakh
MIDI Dataset.
Figure 4.8: Grouped histograms of note length categories for the music generated
by ChatGPT 3.5-turbo and the Lakh MIDI Dataset. Each bar represents a note
length category on a logarithmic scale, with categories ranging from a quarter
beat (1/4) to four beats (4). Categories are calculated using logarithmic binning
to ensure the accurate representation of the diverse note lengths. The bottom
row shows the mean and standard deviation of the note lengths for each set,
illustrating differences in rhythmic complexity.
The transition matrices illustrate note length transitions in the music, with the
color of each matrix cell indicating the transition probability between the note
lengths associated with the cell’s row and column.
Figure 4.9: Note length transition matrices for ChatGPT 3.5-turbo generated
music (left) and Lakh MIDI dataset (right). The color scale represents the
transition probability between note lengths, darker colors indicating higher
transition probability.
The ChatGPT-generated set shows a higher mean transition probability and a larger standard deviation (11.50) compared to the Lakh MIDI dataset, which has a mean transition probability of 2.62 and a standard deviation of 5.08. A higher mean
transition probability could point to a more exploratory note length transition
behavior. On the other hand, the higher standard deviation might indicate either
a richer musical texture due to a wider range of note length transitions, or less
consistency in musical generation.
Thus, while ChatGPT tends to generate music with denser tonal texture, as shown
by smaller and more varied pitch intervals, human compositions in the Lakh MIDI
Dataset demonstrate a broader tonal space with more consistent intervals.
4.2 Subjective metrics
This section presents an analysis of the subjective metrics based on the responses collected from 23 participants. Each participant evaluated the MIDI compositions
created by ChatGPT and provided feedback on their overall satisfaction and
perceived quality of the generated music.
For the purpose of this analysis, the musical skill level of participants is
classified into three categories, represented numerically as 1, 2, and 3.
Level 1 refers to beginners or those with limited musical experience, level 2
corresponds to individuals with an intermediate level of musical experience,
and level 3 represents advanced users with significant musical experience and
expertise.
The scores range from 1 to 5, with the majority of scores concentrated at the lower
end of the scale. The most common score is 1, with 13 occurrences, followed
by 1.77 with 8 occurrences, and score 2 with 5 occurrences. This indicates that
many users had a less than satisfactory experience with the generated MIDI
compositions.
Figure 4.10: Histogram showing the distribution of overall satisfaction scores.
The x-axis represents the score, and the y-axis represents the frequency of each
score.
The mean and median satisfaction scores are both approximately 1.77, implying
that the data is not significantly skewed and that there’s a balanced spread
around the central value. However, considering the scale of the scores, a mean
and median score of 1.77 indicates a generally low level of satisfaction among
users.
The boxplot in figure 4.11 illustrates the distribution of overall satisfaction scores
broken down by the level of musical experience. The levels range from 1 to
3, with 1 being the least experienced and 3 being the most experienced. The
median satisfaction scores for users with musical experience levels 1 and 2 are
both approximately 1.77. The users with a musical experience level of 3 had a
lower median satisfaction score of 1. This suggests that users with more musical
experience were, on average, less satisfied with the MIDI compositions generated
by ChatGPT.
The interquartile ranges, which represent the spread of the middle 50% of scores, are 0.23, 0.81, and 0.39 for musical experience levels 1, 2, and 3, respectively. This indicates a greater variability in the satisfaction scores among users with a musical experience level of 2.
Figure 4.11: Boxplot showing the distribution of overall satisfaction scores
stratified by musical experience level. The x-axis represents the musical
experience level, and the y-axis represents the score.
The numbers of outliers are 4, 1, and 2 for musical experience levels 1, 2, and 3,
respectively. This suggests there were some scores that deviated significantly from
the others, especially among users with musical experience level 1.
The heatmap in Figure 4.12 presents the mean scores for each question, grouped
by the level of musical experience. The levels range from 1 to 3, with 1 being the
least experienced and 3 being the most experienced.
The color intensity in the heatmap represents the mean score, with darker colors
indicating higher scores and lighter colors indicating lower scores. It can be
observed that, in general, users with a musical experience level of 1 provided
higher mean scores for most questions compared to users with higher musical
experience levels. On the other hand, users with a musical experience level of 3
tended to provide the lowest mean scores for the majority of questions.
These results highlight a potential relationship between the user’s level of musical
experience and their evaluation of ChatGPT’s performance on different aspects
of music generation. The higher the musical experience level, the lower the
mean scores tend to be, suggesting that more experienced users have higher expectations of AI-generated music.
Figure 4.12: Heatmap showing the mean scores for each question, stratified by
musical experience level. The x-axis represents the questions, and the y-axis
represents the musical experience level. The color intensity represents the mean
score for each question and skill level combination.
Figure 4.13: Scatterplot illustrating the relationship between AI sentiment
(Question 1) and overall satisfaction (Question 10). The color of the points
indicates the musical experience level of the participant, with lighter colors
representing lower experience levels and darker colors representing higher
experience levels.
5 Discussion
While the results offer interesting insights into the performance of this AI
model in generating music, several factors and limitations should be taken into
consideration when interpreting these findings.
5.1 Limitations
The validity of this study’s findings hinges substantially on the feedback survey
collected from 23 individuals. This relatively small sample size may not capture
the full range of perceptions and experiences of users interacting with the
music generated by ChatGPT. Hence, the generalizability of these findings is
limited. Furthermore, the marked discrepancy between the number of MIDI files
generated by users (500 files) and the number of users who provided feedback is
noteworthy. This gap may imply that not all users found the product satisfactory
enough to provide feedback, pointing towards potential areas of user experience
that need improvement.
White et al. highlight the critical role of prompt engineering in the output
generated by LLMs[16]. The specificity and syntactic design of prompts can
directly influence the output, including MIDI sequence generation. Consequently,
the art and science of crafting effective prompts will play a significant part of the
quality and relevance of the generated outputs. Given that prompt engineering
is still a relatively new field of study, further exploration and refinement in this
area could potentially impact the results of studies like this one in the future, and
thereby make better use of existing models by using a more refined prompt.
This study mainly focuses on the musical quality based on the raw data extracted
from MIDI and user satisfaction feedback. However, a detailed music theory
analysis of the generated pieces is absent. Future research could benefit from
involving musicologists or professional musicians to analyze the model’s output in
terms of harmony, melody, and rhythm. Such an approach would provide a deeper
understanding of ChatGPT’s generated music, identifying both its strengths and
areas needing improvement.
Another aspect worth discussing is the vastly lower note count in MIDI sequences
generated by ChatGPT compared to Lakh MIDI dataset. This difference could
be due to inherent design limitations aimed at managing output length. Such
a restriction could significantly limit the model’s ability to develop extensive
musical ideas, consequently curbing the richness and variety of the musical
content. However, future iterations of the model could explore augmenting note
count, thereby aligning the model’s outputs more closely with user expectations
and potentially yielding more diverse and intricate compositions.
Large language models such as GPT-3 and its successors have exhibited a steep
trajectory of improvement across diverse applications including natural language
understanding, translation, and creative generation tasks. Although this study has demonstrated that music generated by ChatGPT 3.5 bears some resemblance to human compositions, it is highly likely that future models will greatly improve this functionality, among other unexpected areas. In addition, the merging of LLMs
with other AI technologies, such as a musically trained AI, could enable more
sophisticated, context-aware music creation. The integration of AI models
specializing in different musical aspects could also be another promising avenue
to potentially achieve a comprehensive music generation system, yielding more
compelling compositions.
This study, while being a proof-of-concept, has not fully explored all possible
scenarios where music generation AI could be beneficial. For instance, an
interactive music generation AI could serve as a valuable tool for music education,
providing real-time guidance, feedback, and explanations to students learning
composition or music theory.
6 Conclusion
This study aimed to answer the research question: Can ChatGPT generate
valuable music in a digital format (MIDI) from natural language prompts? The
conclusions drawn from the set of objective metrics can be summarized as
follows.
Pitch Counts The research found that the music composed by ChatGPT tends
to utilize a narrower range of unique pitches compared to human-composed
pieces. Despite this, it does not necessarily limit the artistic value of the generated
music, as pitch diversity is context and genre-specific.
Average Pitch Intervals ChatGPT’s music exhibits smaller and more varied
pitch intervals compared to human compositions which show larger, but more
consistent pitch intervals. This might suggest a denser tonal texture in the AI-
generated music.
Average IOI ChatGPT-generated compositions demonstrated notable
rhythmic variability, whereas human compositions suggested a more consistent
rhythmic structure.
Note Lengths ChatGPT generated music with a wide variety of note lengths but
potentially lacks rhythmic consistency compared to human compositions, which
showed more consistent usage of note lengths.
In conclusion, the capabilities of ChatGPT in generating music in a digital format
(MIDI) from natural language prompts are evident. Objective analyses suggest
that the model is capable of generating music with some stylistic consistency
and tonal texture that bear resemblance to human-composed pieces. However,
compared to human compositions, the music generated by ChatGPT exhibits
limitations in terms of complexity, diversity, and consistency, favoring certain
pitch transitions and note lengths, which results in a denser tonal texture.
These observations indicate that while ChatGPT offers a unique and valuable tool
for music generation, there is room for enhancement in future iterations of the
model. Specifically, improvements could focus on increasing musical complexity,
diversity, and sophistication to better meet the expectations of users with varying
degrees of musical expertise, thus enhancing both the objective and subjective
value of the generated music.
References
[1] MIDI Toolbox for MATLAB. URL: www.jyu.fi/musica/miditoolbox/.
[9] O’Brien, Cian and Lerch, Alexander. “Genre-specific key profiles”. In:
ICMC. 2015.
[12] Project, Mido. Mido: MIDI Objects for Python. https://mido.readthedocs.io/. 2023.
[15] Verbeurgt, Karsten, Dinolfo, Michael, and Fayer, Mikhail. “Extracting
patterns in music for composition via markov chains”. In: Innovations
in Applied Artificial Intelligence: 17th International Conference on
Industrial and Engineering Applications of Artificial Intelligence and
Expert Systems, IEA/AIE 2004, Ottawa, Canada, May 17-20, 2004.
Proceedings 17. Springer. 2004, pp. 1123–1132.
[19] Yang, Li-Chia and Lerch, Alexander. “On the evaluation of generative
models in music”. In: Neural Computing and Applications 32.9 (2020),
pp. 4773–4784.
[20] Yang, Richard. mgeval: Evaluation Metrics for Music Generation. https://github.com/RichardYang40148/mgeval. Year of retrieval.
Appendix
Persona description
Template pattern
"I am going to provide a template for your output using the short-midi-list-language (SMLL). Everything in all caps is a placeholder. Any time that you generate music please preserve the formatting and overall template that I provide.
[
[NOTE, START_TIME, DURATION],
[NOTE, START_TIME, DURATION],
...
[NOTE, START_TIME, DURATION]
]"