
Degree Project in Computer Science

First Cycle 15 credits

Evaluating ChatGPT's Ability to Compose Music Using the MIDI File Format

MARCUS WARNERFJORD

Stockholm, Sweden 2023


Abstract
This thesis examines the capabilities of the artificial intelligence (AI) model ChatGPT 3.5-turbo to compose valuable music in a digital format (MIDI) from natural language prompts. Implementing a multifaceted quantitative approach, the study combines objective musical metrics with subjective user evaluations.

The proof-of-concept system developed for this research generated MIDI files
using ChatGPT, which were analyzed against human-composed music from the
Lakh MIDI dataset. Objective measures, including pitch-class distributions,
Inter-Onset-Interval (IOI), pitch range, average pitch intervals, pitch counts,
note length, and transition matrices, facilitated a comprehensive comparison.
Findings revealed that while the AI model’s output demonstrated stylistic
consistency and a certain level of musical texture, it exhibited less complexity and
variety compared to human compositions.

Subjective evaluations, derived from a feedback survey, revealed moderate to low satisfaction with the AI-generated music. The results suggested that users with higher musical experience were less satisfied with the compositions, indicating a correlation between musical experience and perception of the AI-generated music.

Despite its limitations, ChatGPT exhibits the capability to generate valuable music
from natural language prompts. However, enhancements are necessary to better
mimic the complexity and variance found in human compositions in order to make
it applicable in music production.

Sammanfattning

This thesis investigates the ability of the artificial intelligence (AI) model ChatGPT 3.5-turbo to compose valuable music in a digital format (MIDI) from natural language. The study implements a multifaceted quantitative methodological approach, combining objective musical metrics with subjective user evaluations.

The proof-of-concept system developed for this research generated MIDI files using ChatGPT, which were then analyzed and compared against human-composed music from the Lakh MIDI dataset. Objective measures, including pitch-class distributions, Inter-Onset-Interval (IOI), pitch range, average pitch intervals, pitch counts, note length, and transition matrices, enabled a comprehensive comparison. The results showed that while the AI model's compositions exhibited stylistic consistency and a certain level of musical texture, they displayed less complexity and variation than human compositions.

Subjective evaluations, derived from a feedback survey, revealed moderate to low satisfaction with the AI-generated music. The results suggested that users with greater musical experience were less satisfied with the compositions, indicating a relationship between musical experience and the perception of AI-generated music.

Despite its limitations, ChatGPT demonstrates the ability to generate valuable music from natural language prompts. However, improvements are needed to better mimic the complexity and variance found in human compositions in order to make it applicable in music production.

Author
Marcus Warnerfjord
KTH Royal Institute of Technology
Electrical Engineering and Computer Science

Examiner
Pawel Herman

Supervisor
Jörg Conradt
Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Goal
  1.4 Delimitations
  1.5 Outline

2 Background
  2.1 Research Design
  2.2 Large Language Models
  2.3 Essential Elements of Music Theory
  2.4 Digital Representation of Music: MIDI
  2.5 Related Work: On the evaluation of generative models in music

3 Method
  3.1 Research Design
  3.2 Proof-of-concept
  3.3 Community Feedback and Redeployment
  3.4 Construction of the Analysis Tool
  3.5 Data Analysis

4 Result
  4.1 Objective metrics
  4.2 Subjective metrics

5 Discussion
  5.1 Limitations
  5.2 Future Work

6 Conclusion
1 Introduction
The fusion of artificial intelligence (AI) and music has introduced new avenues for
both creative expression and automation. AI is increasingly playing a significant
role in the music industry, from generating novel compositions to aiding in music
production [5]. Key to this advancement is AI’s ability to comprehend and create
music in a variety of formats. The Musical Instrument Digital Interface (MIDI),
a protocol used for conveying musical information digitally, is one such format.
MIDI has become an industry standard in music production and live performance because it communicates only the structure and timing of notes, which can then be sent to any instrument with a MIDI input, making it a highly flexible format to work with.

Nonetheless, there is still a discernible gap in the capacity of AI to interpret and produce MIDI files from natural language prompts. Most current research in AI music generation addresses parameters such as style, genre, and musical structure, and typically produces audio files, which are less flexible in music production. Meanwhile, using descriptive language as a guide for music generation remains relatively unexplored. This limitation presents a challenge for individuals wishing to create music through intuitive language.

1.1 Problem

The primary issue this project seeks to address is the ability of a large
language model (LLM), specifically ChatGPT, to generate valuable MIDI files
from natural language prompts. The problem involves both the objective and
subjective assessment of the model’s performance against human-composed
MIDI files.

To what extent can ChatGPT generate musically valuable and satisfying compositions in a digital format (MIDI) from natural language prompts?

The guiding questions for this project include:

• How do the complexity, diversity, and consistency of ChatGPT's compositions compare to human-composed MIDI files, as evaluated by the set of objective metrics suggested by Yang et al. [19]?

• To what extent are users, particularly those with varying degrees of musical
experience, satisfied with the musical outputs of ChatGPT?

1.2 Purpose

The objective of this degree project is to investigate the potential of LLMs for
generating MIDI files based on natural language prompts. This exploration will
enhance understanding of the underlying relationship between natural language
processing and music generation, thereby contributing to the broader field of AI
in music.

Potential beneficiaries of this degree project encompass musicians, composers, and music enthusiasts, as it offers an innovative approach to music generation using natural language prompts. The project could potentially contribute to the evolution of new music composition tools, facilitating a more intuitive and accessible means of creating MIDI files.

1.3 Goal

The ultimate aim of this degree project is to uncover the capabilities of LLMs in generating MIDI files based on natural language prompts. The specific objectives are to develop a proof-of-concept system that utilizes LLMs to generate MIDI files from text descriptions, and then to evaluate the effectiveness and limitations of the LLM-based approach in generating musically coherent and relevant MIDI files, using both objective and subjective metrics.

The project's deliverables include the developed system, a set of generated MIDI files, and an evaluation of the produced music based on MIDI data, user feedback, and analysis.

1.4 Delimitations

To maintain a focus on the project's primary objectives, several delimitations apply:
• The performance of large language models (LLMs) other than ChatGPT falls outside the purview of this study, which is dedicated to investigating the potential of ChatGPT in generating MIDI files from natural language prompts.

• Comparative analysis of other music generation models will not be conducted. The project's core focus lies in the exploration and evaluation of an LLM for MIDI generation, not in providing a comprehensive review of the music generation model landscape.

• Turing testing, a common evaluative measure in AI research assessing the indistinguishability of AI output from human output, is not considered relevant to this project's primary research question and is, therefore, excluded.

• An exploration of different temperature settings that balance the model's output between creativity and consistency/relevance is not part of the study's scope. Although temperature can affect the randomness of the music generated, this project will maintain a constant default setting to eliminate it as a variable.

• Fine-tuning of ChatGPT on a specific music-related dataset was not undertaken in this study. Though fine-tuning can enhance an LLM's performance for a specific task or domain, this project focused on exploring the "out-of-the-box" capabilities of ChatGPT for music generation. It aimed to maintain the model's general-purpose nature, not tailoring it to specific musical tasks or styles, thereby excluding fine-tuning from the project's scope.

• The performance of the model on multiple prompts is beyond this study's scope. The research design focuses on evaluating the MIDI output from single prompts to ensure result consistency and comparability.

1.5 Outline

Chapter 2 provides the foundation for the study, outlining the research design, large language models, basic music theory, and the role of MIDI. The chapter also discusses related work in music generation systems.

Chapter 3 presents the research methods, describing the proof-of-concept, the process of collecting community feedback, the construction of the analysis tool, and the data analysis procedures.

Chapter 4 offers the results, examining both objective and subjective metrics to provide a comprehensive evaluation of the AI-generated music.

Chapter 5 discusses the project's limitations, potential impacts, and future research directions, and Chapter 6 concludes the report.
2 Background

2.1 Research Design

2.1.1 Multifaceted Quantitative Approach

A multifaceted quantitative approach refers to a research design that combines different types of quantitative research techniques.

• Overview and Explanation: The multifaceted quantitative approach leverages the strengths of various quantitative methodologies to provide a more comprehensive understanding of the research problem. This approach allows for in-depth analysis of numerical data from different sources and perspectives, leading to a robust and detailed assessment [3].

• Application in this Research: In this study, the multifaceted quantitative approach is employed to evaluate the performance of the ChatGPT MIDI generator against MIDI composed by humans. Quantitative techniques are used to analyze the generated MIDI files, and these objective analyses are complemented by another quantitative method: user feedback gathered through a rating system. This combination allows for a detailed and comprehensive evaluation of the system, taking into account both the objective musical metrics and the subjective user experiences.

2.1.2 Development of a Proof-of-Concept System

A proof-of-concept (PoC) system is a realization of a certain method or idea to demonstrate its feasibility.

• Explanation: A PoC system is typically developed with the goal of verifying that a concept or theory has practical potential. It is a prototype that is designed to determine feasibility, but does not represent a final product [18].

• Purpose in this Research: In the context of this study, the ChatGPT MIDI generator is developed as a PoC system. The aim is to validate the feasibility of using a language model for MIDI generation, and to evaluate its performance in a real-world setting.
2.2 Large Language Models

Large Language Models (LLMs) represent a significant shift in the field of natural
language processing, which emerged around 2018. They are neural networks
with an immense number of parameters, often in the billions or more, and are
trained on extensive amounts of unlabeled text using self-supervised or semi-
supervised learning techniques. The advent of LLMs has redefined the approach
in natural language processing research, transitioning from the previous model of
creating specialized supervised models for specific tasks, towards models capable
of performing a wide array of tasks with high proficiency [17].

Contrary to task-specific models, such as those trained solely for sentiment analysis, named entity recognition, or mathematical reasoning, LLMs are general-purpose models. Their ability to execute a variety of tasks, and the breadth of tasks they can perform, appear to be tied to the quantity of resources, like data and computational power, utilized during their training, rather than to additional breakthroughs in their design.

Despite their training being fundamentally based on straightforward tasks, such as predicting the subsequent word in a sentence, these models capture significant aspects of human language syntax and semantics when provided with adequate training and parameter counts. Furthermore, LLMs exhibit substantial general knowledge about the world and have the capability to "remember" a large amount of factual information from their training process [17].

2.2.1 Prompt Engineering

Prompt engineering is a technique for efficiently interacting with Large Language Models (LLMs) like ChatGPT. A "prompt" refers to a set of instructions given to an LLM which can be used to establish rules, automate processes, and determine the specific quality and quantity of the output generated by the model. Essentially, prompts serve as a means of programming an LLM to customize its outputs and interactions.

In the context of a conversation with an LLM, prompts can be used to set the context and specify what information is important and what the desired format and content of the output should be. For example, a prompt could dictate that the LLM should only generate code adhering to a certain coding style, or flag certain keywords in a generated document and provide additional related information.

Prompt engineering, then, is the process of creating these prompts to program LLMs. This can extend beyond simply specifying the type of output or filtering the information provided to the model. With the right prompt, entirely new interaction paradigms can be created, such as having the LLM generate quizzes related to a software engineering concept or even simulate a Linux terminal window [16].

2.2.2 ChatGPT API

The ChatGPT API is a product by OpenAI that enables developers to integrate the capabilities of the GPT-3.5 model into their applications. It provides a way to send a series of messages to the model and receive a model-generated message as a response. Each message has a 'role' that can be 'system', 'user', or 'assistant', and 'content', which is the text of the message from that role [10].
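As an illustration, such an exchange can be reproduced in a few lines of Python. The sketch below is hypothetical: it uses the openai package interface as it existed in 2023, and the prompt text is illustrative rather than the exact prompt used in this project.

import openai

openai.api_key = "sk-..."  # assumption: in practice the key would be read from the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a music composer. Reply only with note data."},
        {"role": "user", "content": "Generate a 4-bar melody in C major at 120 BPM."},
    ],
)
print(response["choices"][0]["message"]["content"])  # the model-generated 'assistant' message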

2.3 Essential Elements of Music Theory

Music fundamentally consists of individual notes, each of which represents a distinct pitch and duration [7]. These notes, when played either in sequence or at the same time, construct the melodies and harmonies that shape a piece of music.

2.3.1 Pitch Classes

In music theory, a pitch class refers to the set of all pitches that are a whole number
of octaves apart from each other [11]. For example, the pitch class ”C” includes all
C notes on a keyboard, irrespective of the octave they are in.

Western music recognizes twelve distinct pitch classes, each signified by one of twelve notes within an octave on a piano keyboard: C, C#/Db, D, D#/Eb, E, F, F#/Gb, G, G#/Ab, A, A#/Bb, and B. Each pitch class corresponds to a different frequency. A sharp ("#") raises the pitch of a note by a half step, and a flat ("b") lowers it by the same amount. This system is integral to creating scales, chords, and melodies.

In MIDI, each pitch class within an octave is assigned a number from 0 to 11, with C as 0, C#/Db as 1, D as 2, and so on up to B as 11 [7]. This pattern repeats modulo 12 up to note number 127, thereby representing all notes in the MIDI format.
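This mapping is mechanical, as the short sketch below shows; the note numbers used in the example are arbitrary.

# Pitch class = MIDI note number modulo 12 (C = 0, C#/Db = 1, ..., B = 11)
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def pitch_class(note_number):
    """Map a MIDI note number (0-127) to its pitch class name."""
    return PITCH_CLASSES[note_number % 12]

print(pitch_class(60))  # MIDI note 60 (middle C) -> "C"
print(pitch_class(70))  # -> "A#/Bb"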

2.4 Digital Representation of Music: MIDI

To bridge the gap between these music theory concepts and their digital
representation, we use the Musical Instrument Digital Interface (MIDI).

Figure 2.1: Visualization of MIDI data. In the upper subplot, time is plotted on
the x-axis, and pitch/key is on the y-axis, representing the sequence of the melody.
The lower subplot, instead of pitch/key, represents velocity (the speed at which the
key was pressed) on the y-axis.

2.4.1 The MIDI Protocol

MIDI is a standard protocol designed for communicating musical information in a digital environment. It is supported by various computer sound cards and is used extensively for recording and playing back music on digital synthesizers. Rather than producing sound itself, MIDI communicates through a series of messages like "note on," "note off," "note/pitch," "pitchbend," and others. MIDI instruments interpret these messages to generate sound [8].

2.4.2 MIDI Files

MIDI files provide a standard way of exchanging sequenced musical information between different software applications or hardware devices. A MIDI file contains one or more MIDI streams, with each event accompanied by time information. Aspects like song structure, sequence, track details, tempo, and time signatures are all encapsulated within specific data structures, making the file relatively easy to parse [8].

2.4.3 Manipulating MIDI in Python: Mido Library

To interact with MIDI files programmatically, one can use libraries like Mido, a
Python library that allows for the reading, writing, and real-time handling of MIDI
messages in a way that aligns with Python’s design principles [12].
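A minimal sketch of reading a MIDI file with Mido is shown below; the file name is hypothetical.

import mido

mid = mido.MidiFile("example.mid")  # hypothetical input file
print(mid.type, len(mid.tracks), mid.ticks_per_beat)

# Iterate over the messages in each track; msg.time holds the delta time in ticks
for track in mid.tracks:
    for msg in track:
        if msg.type == "note_on" and msg.velocity > 0:
            print(f"note={msg.note} delta={msg.time} velocity={msg.velocity}")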

2.5 Related Work: On the evaluation of generative models in music

The evaluation of generative music models is a crucial aspect of research in artificial creativity. A prominent piece of research in this area is the work by Yang and Lerch [19], which discusses the inherent challenges of evaluating generative music systems. While they argue that subjective evaluation should ideally be employed to assess creative results, they also acknowledge the potential difficulties this approach presents in terms of reliability, validity, and replicability. To address this, they propose a set of musically informed objective metrics that offer an objective and reproducible means of evaluating and comparing the output of music generative systems. Their proposed methods have been tested with several experiments on real-world data.
2.5.1 Subjective Metrics

Subjective evaluations are primarily based on human listener inputs. This can
involve a musical Turing test, where listeners distinguish between human and
computer-generated compositions. However, there are known limitations to this
method. For instance, the aesthetics of a piece may be conflated with whether it
sounds human-composed. Additionally, these tests often overlook factors such
as listener expertise and sample size, which can influence the reliability and
statistical significance of results. Furthermore, these tests risk overestimating the
subject’s comprehension. Therefore, these metrics should be tested along with an
objective evaluation.

2.5.2 Objective Metrics

Objective evaluations are often more reproducible and resource-efficient. However, these probabilistic measures may not always correlate with subjective quality, demonstrating the need for multiple evaluation methods. The objective evaluation consists of nine metrics that utilize general musical domain knowledge. These metrics, which are derived from music theory and musicology, provide valuable insights into the structural and aesthetic characteristics of generated music [19].

• Pitch Count (PC): This counts the number of different pitches within a MIDI file, resulting in a scalar for each sample [19].

• Pitch Class Histogram (PCH): This captures the distribution of musical pitches, disregarding the octave they're in. In simpler terms, it shows how often each musical note (like C, D, E, etc., regardless of the octave) is played in a piece of music [2, 9].

• Pitch Class Transition Matrix (PCTM): This is a matrix that measures the transition probabilities of pitch classes. It provides useful information for tasks such as key detection, chord recognition, or genre pattern recognition. The resulting feature dimensionality is 12 × 12 [14].

• Pitch Range (PR): The pitch range is the difference between the highest and lowest pitches in semitones within a sample, resulting in a scalar for each sample.

• Average Pitch Interval (PI): This is the average value of the interval between two consecutive pitches in semitones, resulting in a scalar for each sample.

• Note Count (NC): This counts the total number of notes in a piece, resulting in a scalar for each sample. It is a rhythm-related feature that does not contain pitch information.

• Average Inter-Onset-Interval (IOI): The Inter-Onset-Interval (IOI) is the time interval between the onsets (start times) of consecutive notes. The average IOI gives a general sense of the tempo and rhythm of the piece.

• Note Length Histogram (NLH): This measures the distribution of note lengths in a piece. It provides insights into the rhythmic complexity and variety of the piece. The output vector has a length of either 12 or 24, depending on whether the rest option is activated or not [19].

• Note Length Transition Matrix (NLTM): Similar to the pitch class transition matrix, but for note lengths. It provides insights into the rhythmic structure and patterns in the piece. The output feature dimension is either 12 × 12 or 24 × 24 [15].

These objective metrics, when used in conjunction with subjective evaluations, can provide a comprehensive evaluation of the quality and musicality of the generated pieces [19].
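As a concrete illustration of the two simplest of these metrics, the sketch below computes the pitch count and the pitch class histogram from a list of MIDI note numbers. It is a minimal reimplementation for exposition, not the evaluation code of Yang and Lerch.

from collections import Counter

def pitch_count(notes):
    """PC: number of distinct MIDI pitches in a sample."""
    return len(set(notes))

def pitch_class_histogram(notes):
    """PCH: normalized frequency of each of the 12 pitch classes."""
    counts = Counter(n % 12 for n in notes)
    return [counts.get(pc, 0) / len(notes) for pc in range(12)]

notes = [60, 62, 64, 60, 67, 64, 60]  # toy melody as MIDI note numbers
print(pitch_count(notes))             # 4 distinct pitches
print(pitch_class_histogram(notes))   # C (pitch class 0) appears in 3 of 7 notes, etc.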

3 Method
The study follows a multifaceted quantitative approach, combining quantitative
techniques for data collection and analysis. The project’s research methods are
grounded in the pragmatist philosophical assumptions, which focus on solving
real-world problems and using the most appropriate methods for the given
context [6].

3.1 Research Design

The research design for this project consists of four stages:

1. Development of a proof-of-concept (PoC) system for generating MIDI files using LLMs.

2. Deployment, gathering of community feedback, improvements, and redeployment.

3. Construction of the analysis tool used for evaluating the data.

4. Analysis of the collected data to identify factors influencing the quality and relevance of the generated music.

3.2 Proof-of-concept

The development of the PoC system, depicted in Figure 3.1, is based on adapting ChatGPT for generating MIDI files from natural language prompts. A web app was developed to facilitate the interaction between users and the system and to collect user feedback through surveys. The main functionality is described as follows:

Users access the web app through their browser and are presented with an interface to input their preferences or parameters for the MIDI file generation, such as music style, tempo, key, and length. After submitting their preferences, the web app processes the request by sending the user's input to the server. The input is contextualized in a natural language prompt and sent to the ChatGPT API, which analyzes the input and sends back a list of note events formatted as a list of lists. While the server processes the request, the user is shown a "processing" page, which periodically checks the status of the task by polling the server. Once ChatGPT has provided a response, the web app generates the MIDI file based on the provided list. The server immediately stores the file in the database and provides a unique identifier (a file ID) for the generated file. The web app then redirects the user to a "feedback" page, where they can download the generated file. Users can also provide feedback if they would like to contribute. The feedback questions are slightly different depending on the user's musical experience, as suggested by Yang et al. [19]. Once the feedback is submitted, the data from the input is stored within the database.

Figure 3.1: Flowchart for the PoC web application

3.2.1 User Input Form Design

The user input form was designed to facilitate the collection of essential information required to generate a customized music composition. The form aims to be simple, intuitive, and flexible to accommodate users' diverse musical preferences. The design of the form is as follows:

• Tempo: This field allows users to specify the tempo of the music in beats per
minute (BPM). A default value of 120 BPM is provided, which is a common
tempo for various music genres.

• Time Signature: Users can input the desired time signature for the
composition. The default value is 4/4, which is the most common time
signature in Western music.

• Key Signature: This field allows users to define the key signature of the
composition. The default value is C, which represents the C Major scale.

• Length: Users can specify the desired length of the composition in bars.
The default value is 4, which represents a short musical phrase.

• Genre: This field allows users to define the musical genre for the
composition. The default value is ”Any”, which gives the LLM the freedom
to choose a genre based on its expertise.

• Mood: Users can specify the mood of the music, such as ”happy” or ”sad”.
The default value is ”Any”, allowing the LLM to determine the mood based
on other inputs or its expertise. This also allows for some more diverse
prompts.

• Harmony, Melody, or Rhythm: This dropdown menu enables users to choose whether they want the LLM to generate a harmony, melody, or rhythm. The default option is "Harmony".

• Complexity: This dropdown menu allows users to select the desired complexity level of the composition on a scale of 1 to 5, with 1 being the simplest and 5 being the most complex. The default value is 3, which represents a moderate level of complexity.

This user input form design ensures that users can easily define their preferences while providing the necessary information for the LLM to generate a customized musical composition. The form's simplicity and flexibility allow users with varying levels of musical expertise to interact with the system and explore the potential of LLMs for music generation.

3.2.2 Prompt engineering

The prompt engineering process was guided by the recommendations in White et al. [16], with a focus on effectively eliciting musical compositions from the LLM while preserving the user's intent and ensuring proper formatting of the output. The following components were integrated into the prompt design:

• Meta language: This component establishes the short-midi-list-language (SMLL) as the format in which the LLM should generate music. It ensures that the LLM's output is consistent and can be easily processed by the MIDI file generation system.

Example: "From now on, when I ask you to generate music, I want you to do it in the short-midi-list-language (SMLL), which consists of a list of lists, formatted like this: [[N1, S1, D1], [N2, S2, D2], [N3, S3, D3], ...], where N represents the MIDI note number, S represents the start time (in ticks), and D represents the duration (in ticks)."

All prompts can be found in the appendix.

• Persona: The persona component directs the LLM to act as a music composer with expertise in a specific genre. It specifies the desired complexity level and mood of the generated music, and encourages the LLM to improvise when some information is missing.

• Alternative approach: This component prompts the LLM to consider alternative musical ideas, scales, or techniques when generating a composition. It aims to encourage the LLM to explore different possibilities and select the best approach based on emotional impact and complexity.

• Template pattern: The template pattern provides a clear format for the LLM's output, ensuring consistency and facilitating downstream processing of the generated music. It instructs the LLM to preserve the provided format when generating music in the SMLL.

• Context manager: This component explains the importance of proper output formatting for subsequent processing, including the extraction of note_on and note_off events, the creation of MIDI messages and events, and the use of the mido Python library for generating MIDI files. It also specifies the ticks_per_beat value to be used in the process.

These components were incorporated into the system messages before presenting the actual prompt to the LLM. This approach ensures that the LLM has the necessary context and instructions for generating music according to the user's preferences and in a format that can be easily converted into MIDI files.

3.2.3 Output masking and error handling

In the process of obtaining MIDI data from ChatGPT, the response may contain
clutter or unexpected formatting that could interfere with the conversion of the
response to a Python matrix. To handle this, the program first extracts only
the relevant information using a pattern matching technique. It searches for the
desired format in the response, preserving only the lists that match the pattern,
and reconstructs a cleaned response containing only the matched lists.

Once the cleaned response is obtained, the program tries to convert it into a
Python list of lists. If the conversion fails due to any errors, the program sends
a new prompt to the ChatGPT API to try and obtain a better-formatted response.
This process continues until the response can be successfully converted into a
Python list of lists or a maximum number of attempts is reached.

Finally, the list of MIDI events is converted into a more suitable data structure
that can be used to generate a MIDI file.

This approach ensures that the response from ChatGPT can be successfully processed and used to generate a MIDI file, while minimizing the risk of errors due to unexpected formatting or clutter in the response.
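A condensed sketch of this masking-and-retry logic is given below. It is a simplified reconstruction under the stated assumptions (integer-only [note, start, duration] triples), not the project's exact implementation.

import ast
import re

TRIPLE = re.compile(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]")  # one [note, start, duration] list

def clean_response(text):
    """Keep only sub-lists matching the SMLL pattern and rebuild the outer list."""
    return "[" + ", ".join(TRIPLE.findall(text)) + "]"

def parse_smll(text):
    """Try to turn a possibly cluttered ChatGPT reply into a list of note events."""
    try:
        events = ast.literal_eval(clean_response(text))
        return events if events else None
    except (ValueError, SyntaxError):
        return None  # caller re-prompts the API, up to a maximum number of attempts

reply = "Sure! Here is your melody: [[60, 0, 240], [64, 240, 240]] Enjoy!"
print(parse_smll(reply))  # [[60, 0, 240], [64, 240, 240]]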

3.2.4 Feedback survey design

The feedback forms are designed with the purpose of gathering user feedback on the AI-generated music, which helps to evaluate the subjective metrics. By separating the feedback forms based on experience levels (Beginner, Intermediate, and Advanced), the system can gather more targeted and specific feedback that considers the users' expertise and understanding of music. This allows users across different experience levels to provide more reliable feedback.

The default option is "Undefined" for each question and serves two purposes. First, it acts as a placeholder, prompting users to choose a value that reflects their opinion. By having "Undefined" as the default choice, users are encouraged to provide their input, rather than leaving the question unanswered. Second, it helps prevent any unintentional or accidental selections by the user. If a user forgets to answer a question, the "Undefined" value ensures that the system does not record any misleading or incorrect feedback.

Subjective metrics, gathered through user feedback, are crucial for understanding how well the AI-generated music meets user expectations based on the given prompts. The feedback forms, separated by experience levels (Beginner, Intermediate, and Advanced), allow users to provide their opinions on various aspects of the music, such as how well it matches the desired tempo, time signature, genre, mood, and complexity. For example, a question like "How well does the generated music match the desired Mood?" helps to gauge whether the AI is accurately capturing users' emotional intent.
Beginner (Q1-Q13):
Q1. I feel generally positive towards the use of AI.
Q2. How well does the generated music match the desired Tempo?
Q3. How well does the generated music match the desired Time Signature?
Q4. How well does the generated music match the desired Genre?
Q5. How well does the generated music match the desired Mood?
Q6. How well does the generated music match the desired Key Signature?
Q7. How well does the generated music match the desired Length (Bars)?
Q8. How well does the generated music match the desired choice of Harmony, Melody, or Rhythm?
Q9. How well does the generated music match the desired Complexity?
Q10. Overall, how satisfied are you with the AI-generated music?
Q11. Does the generated music have a clear structure (e.g., intro, verse, chorus, etc.)?
Q12. Are there any noticeable repetitions in the generated music?
Q13. Is the generated music enjoyable to listen to?

Intermediate (Q1-Q16): all of the above, plus
Q14. How well does the generated music maintain a consistent harmonic progression?
Q15. How natural are the melodic transitions in the generated music?
Q16. How well do the rhythm patterns support the generated music?

Advanced (Q1-Q19): all of the above, plus
Q17. How well does the generated music incorporate dynamics?
Q18. How well does the generated music incorporate articulation?
Q19. Can you identify any advanced musical techniques (e.g., counterpoint, modulation, etc.)?

Table 3.1: User Evaluation Questionnaire with Skill-group Divisions (all questions rated 1-5)

In addition to the specific music-related questions, the feedback forms also include a sentiment question: "I feel generally positive towards the use of AI." This question is essential, as it provides insight into users' general attitudes towards AI-generated music, which can potentially affect their opinions and evaluation of the compositions. This approach is supported by research, such as the study conducted by Chamberlain et al. (2018), which found that aesthetic responses to computer-generated art can be influenced by the knowledge that the art was created by an AI [4].

3.2.5 MIDI File Conversion

The MIDI conversion process relies on the mido library, which is a Python library for working with MIDI messages and files. mido provides an easy-to-use interface for creating, parsing, and manipulating MIDI data. The process works in the following steps (a condensed Python sketch follows the list):

• Create MIDI file and track: A new MIDI file and track are created using
the MidiFile and MidiTrack classes from the mido library.

• Set meta-data: The tempo, time signature, key signature and such (if
provided) are set using MetaMessages. mido provides a convenient way to
create MetaMessages for setting various meta-data properties of the MIDI
file, such as ”set_tempo”, ”time_signature”, and ”key_signature”.

• Create note events: For each note produced by ChatGPT, the start time,
duration, and velocity are extracted. The end time is calculated by adding
the duration to the start time. Note on and note off events are created and
appended to an events list. Each event is a tuple containing the start or end
time, the note value, and a boolean indicating whether it’s a note on (True)
or note off (False) event.

• Sort note events: The events list is sorted by the start time and whether the event is a note on or note off event. This ensures that the note events are processed in the correct order when adding them to the track.

• Add note events to the track: For each event in the sorted list, a "note_on" or "note_off" message is added to the track. The time difference between the current event and the last event is calculated to ensure proper timing. This time difference is passed as the time argument in the mido.Message function call, which determines the delay between the current message and the previous one. This step ensures that the notes are played at the correct times and with the correct durations.

• Calculate ticks per bar: The ticks per bar are calculated based on the
MIDI file’s ticks per beat and the beats per bar (obtained from the numerator
of the time signature).

• Extend track length (if necessary): If the track’s length in ticks is less
than the desired length in ticks, silence is added to the end of the track using
a control_change message.

• Save MIDI file to binary data: The MIDI file is saved to binary data
using a BytesIO buffer, which is then returned.
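The sketch below condenses these steps into a single function using mido. It is a simplified reconstruction (fixed velocity, no key or time signature meta-messages, no silence padding), not the project's exact code.

import io
import mido

def events_to_midi(smll, bpm=120, velocity=64):
    """Convert [[note, start, duration], ...] (times in ticks) into MIDI file bytes."""
    mid = mido.MidiFile(ticks_per_beat=480)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage("set_tempo", tempo=mido.bpm2tempo(bpm)))

    # Build (absolute_time, note, is_note_on) tuples; note-offs sort before
    # note-ons at equal times because False < True
    events = []
    for note, start, duration in smll:
        events.append((start, note, True))
        events.append((start + duration, note, False))
    events.sort(key=lambda e: (e[0], e[2]))

    # Emit messages with delta times relative to the previous event
    last_time = 0
    for abs_time, note, is_on in events:
        track.append(mido.Message("note_on" if is_on else "note_off", note=note,
                                  velocity=velocity if is_on else 0,
                                  time=abs_time - last_time))
        last_time = abs_time

    buffer = io.BytesIO()
    mid.save(file=buffer)  # save to binary data rather than a file on disk
    return buffer.getvalue()

midi_bytes = events_to_midi([[60, 0, 480], [64, 480, 480], [67, 960, 960]])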

3.2.6 Web Deployment and Reach

Deploying the ChatGPT MIDI generation project as a web application offers several key advantages: ease of access, increased reach, and a more diverse sample size. Users worldwide can interact with the ChatGPT MIDI generator without any setup or installation, resulting in a more extensive and diverse participant pool. This broader audience and diversity help obtain a more accurate evaluation of AI-generated music.

To handle long response times from the API and prevent interruptions from the
web server, Celery and Redis were used. Celery is a task queue that allows for
asynchronous processing of tasks, while Redis serves as a message broker and
caching system. With Celery and Redis, the web application can manage tasks in
the background, ensuring smooth user experience and uninterrupted web server
operation.
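A minimal sketch of this arrangement is shown below; the task body and the two helper functions are hypothetical.

from celery import Celery

# Assumption: a local Redis instance acts as both message broker and result backend
app = Celery("midi_generator",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def generate_midi_task(prompt):
    """Background job: call the ChatGPT API, convert the reply to MIDI,
    and return a file ID that the 'processing' page can poll for."""
    midi_bytes = call_chatgpt_and_convert(prompt)  # hypothetical helper
    return store_in_database(midi_bytes)           # hypothetical helper

The web server then enqueues work with generate_midi_task.delay(prompt) and polls the returned AsyncResult, instead of blocking on the slow API call.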

Additionally, to further increase reach and diversify the user base, the web application was published on various Subreddits targeted towards both music-oriented and computer science-oriented audiences. This approach enabled valuable insights from users with different backgrounds and expertise, ensuring a well-rounded evaluation of the AI-generated music. By appealing to a broader audience, we can better understand the performance and reception of ChatGPT's MIDI generation capabilities across different user groups.

3.3 Community Feedback and Redeployment

As a result of publishing the web application on various Subreddits, valuable community feedback was received. Several key improvements were made based on this feedback, which resulted in a subsequent redeployment of the web application. These improvements include:

• Setting default alternatives as undefined: Users suggested that providing an "Undefined" default alternative would help to prevent bias in the survey.

• Clearer information about the rating system: Users requested more information about the rating system used in the feedback section, to ensure a better understanding of the evaluation criteria and facilitate more accurate ratings.

• Improving the handling of multiple users: Users reported that the site was not functioning optimally when accessed by many users simultaneously. To address this issue, the message broker was enhanced with a new Redis client, allowing for more efficient handling of concurrent users and preventing interruptions from the web server.

After implementing these changes, the web application was redeployed, providing
an improved user experience and ensuring a more reliable platform for user
evaluation.

3.4 Construction of the Analysis Tool

During the evaluation process, it was discovered that the intended analysis tool,
mgeval [20] (Python), developed by Yang et al., had compatibility issues and
outdated requirements. To overcome these challenges, an alternative approach
was adopted to measure the objective metrics.

Instead of relying on mgeval, the miditoolbox for MATLAB was employed, which consists of similar functions for analyzing and processing MIDI files [1]. This decision allowed for a more seamless integration of the analysis tool into the evaluation process while still providing comparable metrics for assessing the AI-generated music.

The MATLAB program constructed for this purpose is composed of several functions, each designed to calculate and visualize specific musical metrics. These metrics include pitch class histogram, average inter-onset-interval, note count, pitch range, average pitch interval, pitch count, note length histogram, pitch class transition matrix, and note length transition matrix. The program reads in MIDI files from two distinct sets, processes the files using miditoolbox functions, and calculates the desired metrics. It then visualizes the results using histograms, heatmaps, and other appropriate graphical representations.

3.4.1 Comparison with the Lakh MIDI Dataset

To assess the quality of the AI-generated music and draw meaningful comparisons
with human-made compositions, it is essential to select an appropriate
benchmark dataset. The Lakh MIDI dataset (LMD) was chosen for this purpose,
as it represents a diverse collection of music and offers a substantial amount of
data for comparison. The LMD consists of 176,581 unique MIDI files, with 45,129
of these files matched and aligned to entries in the Million Song Dataset [13].

The LMD-matched subset was selected for comparison, as it provides a more focused and manageable dataset. This subset consists of 45,129 files that have been matched to entries in the Million Song Dataset. To ensure a fair comparison, the LMD-matched dataset was reduced to a size comparable to the AI-generated MIDI set, while maintaining its diversity.

The analysis tool's workflow consists of the following steps:

1. Load MIDI files from the two sets and convert them into note matrices using the miditoolbox functions.

2. Calculate various metrics for each MIDI file in both sets, such as pitch class histograms, inter-onset intervals, note counts, pitch ranges, pitch intervals, pitch counts, note length histograms, pitch class transition matrices, and note length transition matrices.

3. Visualize the results using graphical representations such as histograms and heatmaps to facilitate comparison between the two sets.

By utilizing miditoolbox, it was possible to conduct a thorough and reliable evaluation of the musical compositions generated by ChatGPT, ensuring the validity of the research findings. This custom-built MATLAB program provides a flexible and adaptable framework that can be easily extended to accommodate new metrics or adapted for other musical analysis tasks.

3.5 Data Analysis

Data analysis was performed on the objective and subjective metrics collected
from the MIDI compositions generated by ChatGPT 3.5-turbo and the feedback
from users.

All data collected in this study was initially subjected to a normality test.
The Lilliefors test, a variation of the Kolmogorov-Smirnov test used when the
parameters of the normal distribution are estimated from the data, was applied to
check the conformity of the data with a normal distribution. However, the results
indicated that the data did not adhere to the criteria of a normal distribution.

Due to this non-normal distribution, the usage of parametric statistical methods for further analysis was deemed inappropriate. As a consequence, the Mann-Whitney U test, a non-parametric statistical test, was applied to analyze the differences between the two distinct groups: the music generated by the ChatGPT model and human-composed pieces.

The p-values obtained from the Mann-Whitney U tests were examined to verify
statistical significance. In all instances, the p-values were found to satisfy a
predetermined significance level, suggesting a statistically significant difference
between the two groups within the context of this study.
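For reference, both tests are available in standard Python statistics libraries. The sketch below uses hypothetical example values, not the study's actual data.

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical metric values (e.g., pitch counts) for the two groups
ai_values = np.array([3, 2, 5, 4, 1, 6, 3, 2])
human_values = np.array([41, 52, 38, 47, 60, 44, 39, 55])

# Lilliefors test: a small p-value means normality is rejected
_, p_norm = lilliefors(ai_values, dist="norm")

# Mann-Whitney U test: non-parametric comparison of the two groups
u_stat, p_value = mannwhitneyu(ai_values, human_values, alternative="two-sided")
print(f"Lilliefors p = {p_norm:.3f}; Mann-Whitney U = {u_stat}, p = {p_value:.4f}")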

Subsequent to these tests, descriptive statistics, including mean and standard deviation, were calculated for each group. These served to characterize and summarize the central tendency and dispersion of the data. The insights gathered from these statistical descriptions facilitated a comparison between the AI-generated music and human compositions.

Lastly, the computed statistics were depicted using appropriate graphical
representations. These visualizations aided in identifying patterns, trends,
and differences in the analyzed metrics. The plots, along with the calculated
descriptive statistics and the results of the Mann-Whitney U tests, formed the
basis for drawing conclusions about the musical performance of the ChatGPT
model compared to human-composed MIDI files.

For the subjective metrics, the users’ overall satisfaction scores, overall
satisfaction scores by skill level, mean scores for each question by skill level,
relationship between AI sentiment and satisfaction, and performance metrics
were analyzed. To summarize and visualize these metrics, various statistical
plots including histograms, boxplots, heatmaps, and scatterplots were generated.
These plots enable the visual understanding of data distribution, central tendency,
dispersion, and potential correlations in the data set.

The analysis of AI sentiment and overall satisfaction included the computation of mean scores, highlighting general trends in user feedback, and the use of scatterplots to visually examine potential correlations between these two factors.

4 Result

4.1 Objective metrics

The web app yielded 500 MIDI files generated by ChatGPT 3.5-turbo, which were matched with 500 randomly selected files from the Lakh MIDI Dataset. The objective metrics offer quantitative assessments of the music's attributes, facilitating direct comparisons with human-composed pieces.

Table 4.1: Summary of Statistical Analysis of Objective Metrics

Metric                    | ChatGPT 3.5-turbo            | Lakh MIDI Dataset
Pitch Counts              | Mean: 3.484, Std Dev: 3.008  | Mean: 44.296, Std Dev: 11.917
Pitch-Class Distribution  | Mean: 0.25, Std Dev: 0.13    | Mean: 0.30, Std Dev: 0.15
Pitch Class Trans. Prob.  | Mean: 0.01, Std Dev: 0.005   | Mean: 0.01, Std Dev: 0.005
Pitch Range (semitones)   | Mean: 7.27, Std Dev: 7.55    | Mean: 59.93, Std Dev: 11.91
Average Pitch Intervals   | Mean: 3.49, Std Dev: 3.87    | Mean: 7.82, Std Dev: 2.66
Note Counts               | Mean: 11.79, Std Dev: 14.08  | Mean: 4634.85, Std Dev: 2625
Average IOI (seconds)     | Mean: 0.45, Std Dev: 1.00    | Mean: 0.50, Std Dev: 0.47
Note Lengths              | Mean: 30.67, Std Dev: 47.78  | Mean: 54.89, Std Dev: 35.89
Note Length Trans. Prob.  | Mean: 3.07, Std Dev: 11.50   | Mean: 2.62, Std Dev: 5.08

4.1.1 Pitch Counts

Pitch counts quantify the number of unique pitches used in a composition, providing insights into the harmonic complexity and diversity of the piece. The histograms in Figure 4.1 illustrate the distributions of pitch counts in music composed by ChatGPT 3.5-turbo and the Lakh MIDI Dataset.

In the music composed by ChatGPT 3.5-turbo, the mean pitch count is approximately 3.55 with a standard deviation of 3.33, suggesting a modest diversity in pitch use. However, in the Lakh MIDI Dataset, which is representative of human compositions, the mean pitch count rises significantly to 44.62 with a standard deviation of 12.19, indicating greater pitch diversity and variation in human-created music.

A comparison of these findings suggests that the music composed by ChatGPT tends to utilize a narrower range of unique pitches compared to human-composed pieces. While this may suggest a limitation in harmonic richness, the artistic value of the generated music is not necessarily reduced, as pitch diversity is context- and genre-specific.
Figure 4.1: Histograms depicting the distribution of pitch counts for music
generated by ChatGPT 3.5-turbo (left) and the Lakh MIDI Dataset (right). The
solid red line represents the mean, and the dashed lines delineate one standard
deviation above and below the mean.

The music generated by ChatGPT tends to exhibit fewer unique pitches and a
wider variation in pitch usage, while human compositions in the Lakh MIDI
Dataset demonstrate more consistent usage of a wider range of pitches. This
comparison potentially suggests a denser tonal texture in the music created by the
model, while human compositions typically traverse a larger tonal space.

4.1.2 Pitch-Class Distribution

Pitch class distribution quantifies how often each pitch class (note) occurs in a
piece of music. Figure 4.2 compares how often each musical note (C to B) is
used.

ChatGPT music, on average, uses each note about 25% of the time, with a variation
of 13%. This means there’s a notable difference in how often each note is
used.

Lakh MIDI Dataset music, on the other hand, uses each note roughly 30% of the
time, with a higher variation of 15%. So, it has a slightly broader use of different
notes, but with a wider range in the frequency of their use.

ChatGPT music shows a bit less unpredictability (entropy 3.26) in note usage compared to the Lakh MIDI Dataset (entropy 3.53). This suggests that ChatGPT's music follows a more predictable pattern in its choice of notes.

Figure 4.2: Comparison of note usage between ChatGPT 3.5-turbo (blue) and the Lakh MIDI Dataset (red). Each note from C to B is shown on the x-axis. The y-axis shows the probability of each note appearing in the music.

The findings here reveal that ChatGPT’s generated music exhibits some stylistic
consistency in pitch usage but may lack the complexity and variety found in
human compositions. The music generated by ChatGPT seems to include smaller
and more varied pitch intervals, whereas human compositions in the Lakh MIDI
Dataset exhibit larger and more consistent pitch intervals, possibly indicating a
denser tonal texture in the music generated by ChatGPT and a broader tonal space
in human compositions.

4.1.3 Pitch Class Transition Matrices

Transition matrices offer a means to visually understand the progression of pitch classes in musical compositions. Each cell in these matrices represents the probability of transitioning from one pitch class to another, denoted by the color of the cell, with darker shades representing higher probabilities.

The pitch class transition matrices for both ChatGPT 3.5-turbo and the Lakh MIDI dataset reveal a complex web of transitions between pitch classes. Although the mean transition probability and standard deviation for both sets are the same, the individual transition probabilities are diverse.

Figure 4.3: Pitch class transition matrices for ChatGPT 3.5-turbo generated music and the Lakh MIDI dataset. The color gradient corresponds to the transition probability between pitch classes, with darker shades representing higher probabilities.

Even though the numerical values converge, indicating similar transition patterns
on average, the visual representation of the matrices shows a more nuanced
contrast between the music generation behavior of ChatGPT 3.5-turbo and human
compositions.

In particular, the Lakh MIDI dataset appears to exhibit a greater diversity in transition probabilities, indicating that human compositions often consist of a broad range of transitions between different pitch classes. Conversely, ChatGPT 3.5-turbo seems to generate music with a denser tonal texture, favoring certain pitch transitions more frequently, which could potentially explain the denser visual representation in the corresponding matrix.

4.1.4 Pitch Range

The examination of the pitch range, defined as the disparity between the highest
and lowest pitches in a musical piece, serves as an effective tool to discern the
tonal diversity of the piece. The illustration in Figure 4.4 elucidates the pitch range
analysis for the musical pieces generated by ChatGPT 3.5-turbo as well as those
from the Lakh MIDI Dataset.

The mean pitch range for music output by ChatGPT 3.5-turbo is 7.27 with a standard deviation of 7.55, indicative of a relatively consistent tonal range in the music composed by the model. On the other hand, the Lakh MIDI Dataset exhibits a higher mean pitch range of 59.93 with a standard deviation of 11.91, demonstrating a more extensive tonal range and variability in human compositions.

Figure 4.4: Pitch ranges in semitones for both the ChatGPT 3.5-turbo dataset (left panel) and the Lakh MIDI Dataset (right panel). The distributions of the pitch ranges are depicted with histograms, while the mean and standard deviation for each dataset are represented by solid and dashed vertical lines, respectively.

This disparity in pitch ranges suggests that human compositions, as illustrated by the Lakh MIDI Dataset, tend to exhibit a more diverse tonal palette compared to the compositions generated by ChatGPT. It can be seen as an indication of potential limitations of the model in generating music with varying pitch ranges.

4.1.5 Average Pitch Intervals

Pitch intervals, expressed in semitones, fundamentally influence the tonality and harmonic richness of musical compositions.

For music generated by ChatGPT, the mean pitch interval amounts to approximately 3.49 semitones, with a standard deviation of 3.87 semitones. This implies a fair amount of variation in pitch intervals within the music output by ChatGPT. Conversely, the Lakh MIDI Dataset presents a higher mean pitch interval of around 7.82 semitones, paired with a lesser standard deviation of 2.66 semitones, indicating less variation in pitch intervals in human compositions.

Figure 4.5: Histograms of the average pitch intervals for music generated by ChatGPT 3.5-turbo (left) and the Lakh MIDI Dataset (right). The mean pitch interval and standard deviation (Std Dev) are indicated by the red solid and dashed lines, respectively.

Considering these results, it could be posited that ChatGPT's music exhibits smaller and more varied pitch intervals. In contrast, human compositions in the Lakh MIDI Dataset show larger, but more consistent pitch intervals. This might suggest a denser tonal texture in the AI-generated music, while human compositions tend to cover a wider tonal space with more consistent intervals.

4.1.6 Note Counts

The number of notes utilized in a composition is an essential metric indicating its complexity and richness. The mean note count for ChatGPT was found to be approximately 11.79, with a standard deviation of 14.08 and a variance of 198.33. This implies a moderate variability in the model’s musical outputs. By contrast, the Lakh MIDI Dataset presented a significantly higher average note count of approximately 4634.85, with a standard deviation of 2625.83 and a variance of 6894996.89. This indicates a greater complexity and diversity in human compositions within this dataset.

Figure 4.6: Comparison of the average note counts for ChatGPT 3.5-turbo and the Lakh MIDI Dataset. The standard deviation is represented by the error bars. The blue bar corresponds to ChatGPT, and the red bar represents the Lakh MIDI Dataset. Due to the considerable difference in values, the y-axis is presented in log scale.

This disparity in note counts suggests a limitation in the model’s capability to generate compositions as intricate as those created by human composers, and showcases the greatest discrepancy between the two sets.
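Note counts are the simplest of these metrics to compute. The sketch below, in which the directory name is a placeholder, aggregates counts over a set of MIDI files and reports the kind of summary statistics quoted above:

import pathlib
import numpy as np
import pretty_midi

def note_count(midi_path: str) -> int:
    """Total number of notes across all non-drum instruments."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    return sum(len(inst.notes) for inst in midi.instruments if not inst.is_drum)

# "midi_files" is a hypothetical directory of generated or reference MIDI.
counts = [note_count(str(p)) for p in pathlib.Path("midi_files").glob("*.mid")]
print(f"mean={np.mean(counts):.2f}, std={np.std(counts):.2f}, "
      f"var={np.var(counts):.2f}")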

4.1.7 Average Inter-Onset Interval

The Inter-Onset Interval (IOI) represents the time interval between the onset
of consecutive notes, providing valuable insights into the rhythmic properties of
musical compositions. Figure 4.7 compares the average IOI of music generated
by ChatGPT 3.5-turbo with that of compositions found in the Lakh MIDI
Dataset.

ChatGPT 3.5-turbo generated music with an average IOI of approximately 0.45 seconds and a standard deviation of 1.00 second. This suggests a notable degree of rhythmic variability within the AI-generated compositions. Conversely, compositions in the Lakh MIDI Dataset demonstrated a marginally higher average IOI of approximately 0.50 seconds, but with a lower standard deviation of 0.47 seconds. This lower variability in human compositions suggests a more consistent rhythmic structure compared to the output from ChatGPT.

Figure 4.7: The average Inter-Onset Interval (IOI) for ChatGPT 3.5-turbo and the Lakh MIDI Dataset. The error bars represent the standard deviation for each set. The blue bar corresponds to ChatGPT, and the red bar corresponds to the Lakh MIDI Dataset.
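One way to compute the average IOI is to sort all note onsets and take the mean of their successive differences. A minimal sketch, assuming pretty_midi and onset times in seconds:

import numpy as np
import pretty_midi

def average_ioi(midi_path: str) -> float:
    """Mean time (in seconds) between consecutive note onsets."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    onsets = sorted(
        note.start
        for inst in midi.instruments if not inst.is_drum
        for note in inst.notes
    )
    iois = np.diff(onsets)  # differences between successive onsets
    return float(iois.mean()) if len(iois) else 0.0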

4.1.8 Note Length

The distribution of note lengths in the musical pieces provides valuable information regarding their rhythmic complexity and diversity. This investigation aims to compare the note length distributions in two sets of compositions: one generated by ChatGPT 3.5-turbo, and the other from the Lakh MIDI Dataset.

In the music generated by ChatGPT, the mean note length is approximately 30.67, with a standard deviation of 47.78 and the 5th note length category overrepresented. The Lakh MIDI Dataset, on the other hand, exhibits a higher mean note length of 54.89 with a lower standard deviation of 35.89, potentially suggesting a more consistent usage of note lengths.

This comparison contributes to the overall understanding of how ChatGPT’s performance, in terms of rhythmic complexity, stacks up against human compositions. However, it should be noted that a higher standard deviation might indicate a wider variety of rhythmic patterns, but it might also signal a lack of rhythmic consistency.

Figure 4.8: Grouped histograms of note length categories for the music generated by ChatGPT 3.5-turbo and the Lakh MIDI Dataset. Each bar represents a note length category on a logarithmic scale, with categories ranging from a quarter beat (1/4) to four beats (4). Categories are calculated using logarithmic binning to ensure the accurate representation of the diverse note lengths. The bottom row shows the mean and standard deviation of the note lengths for each set, illustrating differences in rhythmic complexity.
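The logarithmic binning described in the caption of Figure 4.8 can be reproduced with logarithmically spaced bin edges between a quarter beat and four beats. The sketch below is illustrative; the exact number of bins used in the study is an assumption:

import numpy as np

def bin_note_lengths(durations_in_beats, n_bins: int = 8):
    """Histogram note durations into logarithmically spaced categories
    ranging from a quarter beat (1/4) up to four beats (4)."""
    edges = np.logspace(np.log2(0.25), np.log2(4.0), n_bins + 1, base=2.0)
    hist, _ = np.histogram(durations_in_beats, bins=edges)
    return hist, edges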

4.1.9 Note Length Transition Matrices

The transition matrices illustrate note length transitions in the music, with the
color of each matrix cell indicating the transition probability between the note
lengths associated with the cell’s row and column.

Figure 4.9: Note length transition matrices for ChatGPT 3.5-turbo generated
music (left) and Lakh MIDI dataset (right). The color scale represents the
transition probability between note lengths, darker colors indicating higher
transition probability.

ChatGPT 3.5-turbo exhibited a higher mean transition probability (3.07) and a larger standard deviation (11.50) compared to the Lakh MIDI dataset, which had a mean transition probability of 2.62 and a standard deviation of 5.08. A higher mean transition probability could point to a more exploratory note length transition behavior. On the other hand, the higher standard deviation might indicate either a richer musical texture due to a wider range of note length transitions, or less consistency in musical generation.
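Such a matrix can be built analogously to the pitch-class version sketched earlier: assign each note length to a category (for instance with np.digitize against the logarithmically spaced edges above) and count transitions between consecutive categories. A minimal sketch, with row normalization as an assumed convention:

import numpy as np

def note_length_transition_matrix(categories, n_bins: int) -> np.ndarray:
    """Count transitions between consecutive note-length categories and
    normalize each row into transition probabilities.

    `categories` is a sequence of zero-based bin indices, one per note,
    in onset order."""
    counts = np.zeros((n_bins, n_bins))
    for prev, curr in zip(categories, categories[1:]):
        counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)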

Thus, while ChatGPT tends to generate music with denser tonal texture, as shown
by smaller and more varied pitch intervals, human compositions in the Lakh MIDI
Dataset demonstrate a broader tonal space with more consistent intervals.

4.2 Subjective metrics

This section presents an analysis of the subjective metrics based on the responses collected from 23 participants. Each participant evaluated the MIDI compositions created by ChatGPT and provided feedback on their overall satisfaction and perceived quality of the generated music.

For the purpose of this analysis, the musical skill level of participants is
classified into three categories, represented numerically as 1, 2, and 3.
Level 1 refers to beginners or those with limited musical experience, level 2
corresponds to individuals with an intermediate level of musical experience,
and level 3 represents advanced users with significant musical experience and
expertise.

4.2.1 Distribution of Overall Satisfaction Scores

The histogram in Figure 4.10 represents the distribution of overall satisfaction scores from users evaluating the MIDI compositions created by ChatGPT.

The scores range from 1 to 5, with the majority concentrated at the lower end of the scale. The most common score is 1 (13 occurrences), followed by 1.77 (8 occurrences) and 2 (5 occurrences). This indicates that many users had a less than satisfactory experience with the generated MIDI compositions.

Scores of 3, 4, and 5 are less frequent, with 1, 2, and 1 occurrences, respectively, suggesting that a minority of users had a more positive experience.

Figure 4.10: Histogram showing the distribution of overall satisfaction scores. The x-axis represents the score, and the y-axis represents the frequency of each score.

The mean and median satisfaction scores are both approximately 1.77, implying that the data is not significantly skewed and that there is a balanced spread around the central value. However, considering the scale of the scores, a mean and median score of 1.77 indicates a generally low level of satisfaction among users.
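For completeness, summary statistics of this kind can be computed directly from the survey responses. The sketch below assumes a hypothetical CSV export with columns overall_satisfaction and skill_level; neither the file name nor the column names come from the study itself.

import pandas as pd

# Hypothetical export of the feedback survey: one row per participant.
df = pd.read_csv("survey_responses.csv")

scores = df["overall_satisfaction"]
print(f"mean={scores.mean():.2f}, median={scores.median():.2f}")

# Median and quartiles per skill level (cf. the boxplot in Figure 4.11).
summary = df.groupby("skill_level")["overall_satisfaction"].describe()
print(summary[["25%", "50%", "75%"]])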

4.2.2 Overall Satisfaction Scores by Skill Level

The boxplot in figure 4.11 illustrates the distribution of overall satisfaction scores
broken down by the level of musical experience. The levels range from 1 to
3, with 1 being the least experienced and 3 being the most experienced. The
median satisfaction scores for users with musical experience levels 1 and 2 are
both approximately 1.77. The users with a musical experience level of 3 had a
lower median satisfaction score of 1. This suggests that users with more musical
experience were, on average, less satisfied with the MIDI compositions generated
by ChatGPT.

The interquartile ranges, which represent the spread of the middle 50% of scores, are 0.23, 0.81, and 0.39 for musical experience levels 1, 2, and 3, respectively. This indicates a greater variability in the satisfaction scores among users with a musical experience level of 2.

Figure 4.11: Boxplot showing the distribution of overall satisfaction scores stratified by musical experience level. The x-axis represents the musical experience level, and the y-axis represents the score.

The numbers of outliers are 4, 1, and 2 for musical experience levels 1, 2, and 3, respectively. This suggests that some scores deviated significantly from the others, especially among users with musical experience level 1.

4.2.3 Mean Scores for Each Question by Skill Level

The heatmap in Figure 4.12 presents the mean scores for each question, grouped
by the level of musical experience. The levels range from 1 to 3, with 1 being the
least experienced and 3 being the most experienced.

The color intensity in the heatmap represents the mean score, with darker colors
indicating higher scores and lighter colors indicating lower scores. It can be
observed that, in general, users with a musical experience level of 1 provided
higher mean scores for most questions compared to users with higher musical
experience levels. On the other hand, users with a musical experience level of 3
tended to provide the lowest mean scores for the majority of questions.

These results highlight a potential relationship between the user’s level of musical experience and their evaluation of ChatGPT’s performance on different aspects of music generation. The higher the musical experience level, the lower the mean scores tend to be, suggesting that more experienced users have higher expectations or are more critical of the generated MIDI compositions.

Figure 4.12: Heatmap showing the mean scores for each question, stratified by musical experience level. The x-axis represents the questions, and the y-axis represents the musical experience level. The color intensity represents the mean score for each question and skill level combination.

4.2.4 Relationship between AI Sentiment and Satisfaction

The scatterplot displayed in Figure 4.13 shows the relationship between AI sentiment (Q1) and overall satisfaction (Q10) based on the feedback survey data. Each point on the scatterplot represents an individual participant’s responses to Q1 and Q10. The color of the points corresponds to the participant’s level of musical experience. Overall, it seems that there is a positive trend, suggesting that higher AI sentiment scores could be associated with higher overall satisfaction scores. However, this trend is not strongly pronounced and there is a considerable spread in the data, indicating variability in user responses.

It is noteworthy that respondents with lower musical experience (represented by lighter color points) tend to give higher scores in both AI sentiment and overall satisfaction. In contrast, those with higher musical experience (darker color points) frequently gave lower scores.

Figure 4.13: Scatterplot illustrating the relationship between AI sentiment
(Question 1) and overall satisfaction (Question 10). The color of the points
indicates the musical experience level of the participant, with lighter colors
representing lower experience levels and darker colors representing higher
experience levels.

5 Discussion
While the results offer interesting insights into the performance of this AI
model in generating music, several factors and limitations should be taken into
consideration when interpreting these findings.

5.1 Limitations

5.1.1 Sample Size and User Feedback

The validity of this study’s findings hinges substantially on the feedback survey
collected from 23 individuals. This relatively small sample size may not capture
the full range of perceptions and experiences of users interacting with the
music generated by ChatGPT. Hence, the generalizability of these findings is
limited. Furthermore, the marked discrepancy between the number of MIDI files
generated by users (500 files) and the number of users who provided feedback is
noteworthy. This gap may imply that not all users found the product satisfactory
enough to provide feedback, pointing towards potential areas of user experience
that need improvement.

5.1.2 Influence of Prompt Engineering

White et al. highlight the critical role of prompt engineering in the output generated by LLMs [16]. The specificity and syntactic design of prompts can directly influence the output, including MIDI sequence generation. Consequently, the art and science of crafting effective prompts plays a significant part in the quality and relevance of the generated outputs. Given that prompt engineering is still a relatively new field of study, further exploration and refinement in this area could potentially impact the results of studies like this one in the future, making better use of existing models through more refined prompts.

5.1.3 Analysis of Musical Quality

This study mainly focuses on musical quality as reflected in the raw data extracted from MIDI files and in user satisfaction feedback. However, a detailed music theory analysis of the generated pieces is absent. Future research could benefit from involving musicologists or professional musicians to analyze the model’s output in terms of harmony, melody, and rhythm. Such an approach would provide a deeper understanding of ChatGPT’s generated music, identifying both its strengths and areas needing improvement.

5.1.4 Absence of Fine-tuning

A crucial limitation in this study is the lack of fine-tuning of ChatGPT for music generation. Using the model ’out-of-the-box’ without additional training on music-specific data could have limited the coherence and quality of the music produced. Fine-tuning on music-specific datasets could potentially enhance the model’s performance, enabling more complex compositions and more meaningful, context-aware explanations of its outputs. Yet, it is also important to note that fine-tuning requires substantial computational resources, time, and a high-quality, diverse dataset.

5.1.5 Note Count and Output Limitations

Another aspect worth discussing is the vastly lower note count in MIDI sequences generated by ChatGPT compared to the Lakh MIDI dataset. This difference could be due to inherent design limitations aimed at managing output length. Such a restriction could significantly limit the model’s ability to develop extensive musical ideas, consequently curbing the richness and variety of the musical content. However, future iterations of the model could explore augmenting the note count, thereby aligning the model’s outputs more closely with user expectations and potentially yielding more diverse and intricate compositions.

5.2 Future Work

5.2.1 Model Improvements and Novel Integration

Large language models such as GPT-3 and its successors have exhibited a steep trajectory of improvement across diverse applications including natural language understanding, translation, and creative generation tasks. Although this study has demonstrated that music generated by ChatGPT 3.5 bears some resemblance to human compositions, it is highly likely that future models will greatly improve this functionality, among other unexpected areas. Moreover, the merging of LLMs with other AI technologies, such as a musically trained AI, could enable more sophisticated, context-aware music creation. The integration of AI models specializing in different musical aspects could be another promising avenue to achieve a comprehensive music generation system, yielding more compelling compositions.

5.2.2 ChatGPT as a Music Production Tool

ChatGPT’s potential extends beyond the generation of MIDI sequences; its capability to explain its outputs could render it a powerful tool for learning music production and composition, as well as for co-creation. In such a context, a continuous dialogue between the user and the LLM could transform the model into a creative collaborator or musical mentor. However, the implementation of such an interactive system would require sophisticated design, potentially including capabilities for real-time feedback and adaptation based on user inputs.

Accommodating an ongoing dialogue between the user and the AI also necessitates assessment methods to evaluate the quality and effectiveness of the interactive process. Measures could encompass the coherence and relevance of the AI’s responses, its adaptability to user feedback, and the educational value of its output explanations. Indicators of system effectiveness might also include user satisfaction, comprehension of music theory concepts, and improvement in composition skills. Thus, exploring these alternative use cases and their unique evaluation methods can provide a deeper understanding of LLMs’ potential in music generation and education.

This study, while being a proof-of-concept, has not fully explored all possible
scenarios where music generation AI could be beneficial. For instance, an
interactive music generation AI could serve as a valuable tool for music education,
providing real-time guidance, feedback, and explanations to students learning
composition or music theory.

6 Conclusion
This study aimed to answer the research question: Can ChatGPT generate
valuable music in a digital format (MIDI) from natural language prompts? The
conclusions drawn from the set of objective metrics can be summarized as
follows.

Pitch Counts The research found that the music composed by ChatGPT tends
to utilize a narrower range of unique pitches compared to human-composed
pieces. Despite this, it does not necessarily limit the artistic value of the generated
music, as pitch diversity is context and genre-specific.

Mean Pitch-Class Distribution ChatGPT’s generated music appears to demonstrate some stylistic consistency in pitch usage but may lack the complexity and variety found in human compositions.

Pitch Class Transition Probability While ChatGPT mimics the overall transition tendencies seen in human compositions, there are subtle differences in the precise nature of these transitions. Further detailed scrutiny of individual transition probabilities could potentially disclose these distinctions.

Pitch Range Human compositions exhibited a more extensive tonal range and variability than the compositions generated by ChatGPT, indicating potential limitations of the model in generating music with varying pitch ranges.

Average Pitch Intervals ChatGPT’s music exhibits smaller and more varied pitch intervals compared to human compositions, which show larger but more consistent pitch intervals. This might suggest a denser tonal texture in the AI-generated music.

Note Counts While ChatGPT exhibits a moderate variability in its musical outputs, human compositions within the Lakh MIDI Dataset showed greater complexity and diversity.

Average IOI ChatGPT-generated compositions demonstrated notable
rhythmic variability, whereas human compositions suggested a more consistent
rhythmic structure.

Note Lengths ChatGPT generated music with a wide variety of note lengths but
potentially lacks rhythmic consistency compared to human compositions, which
showed more consistent usage of note lengths.

Note Length Transition Probability ChatGPT exhibited a higher mean transition probability and larger standard deviation, suggesting either a richer musical texture due to a wider range of note length transitions or less consistency in musical generation.

User surveys The subjective results paint a nuanced picture. Based on the analysis of overall satisfaction scores, it can be inferred that a considerable number of users had less than satisfactory experiences with the MIDI compositions generated by ChatGPT, as reflected by the concentration of scores at the lower end of the scale. The mean and median satisfaction scores of 1.77, on a scale of 1 to 5, further indicate a generally low level of satisfaction among users.

When considering the satisfaction scores in relation to users’ musical experience, it appears that those with more musical expertise tended to be less satisfied with the output generated by ChatGPT. This trend is consistent across all levels of musical experience, suggesting that those with a deeper understanding or higher expectations of music found the generated compositions less fulfilling. Moreover, there was notable variability in satisfaction scores among users, particularly among those with a moderate level of musical experience.

Looking at the relationship between AI sentiment and overall satisfaction, a slight positive trend can be observed. This suggests that users who had a more favorable sentiment towards AI, in general, tended to have higher satisfaction scores for the MIDI compositions generated by ChatGPT. However, this correlation was not strongly pronounced and was subject to a fair amount of variability.

In conclusion, the capabilities of ChatGPT in generating music in a digital format
(MIDI) from natural language prompts are evident. Objective analyses suggest
that the model is capable of generating music with some stylistic consistency
and tonal texture that bear resemblance to human-composed pieces. However,
compared to human compositions, the music generated by ChatGPT exhibits
limitations in terms of complexity, diversity, and consistency, favoring certain
pitch transitions and note lengths, which results in a denser tonal texture.

Subjectively, the perceived value of the music generated by ChatGPT is currently limited, particularly among users with higher musical experience. While there is some satisfaction among users, especially those with a more favorable sentiment towards AI, a significant proportion of users reported less than satisfactory experiences.

These observations indicate that while ChatGPT offers a unique and valuable tool
for music generation, there is room for enhancement in future iterations of the
model. Specifically, improvements could focus on increasing musical complexity,
diversity, and sophistication to better meet the expectations of users with varying
degrees of musical expertise, thus enhancing both the objective and subjective
value of the generated music.

References
[1] URL: www.jyu.fi/musica/miditoolbox/.

[2] Babbitt, Milton. “Twelve-tone invariants as compositional determinants”. In: The Musical Quarterly 46.2 (1960), pp. 246–259.

[3] Bryman, Alan. Social Research Methods. 4th ed. 2001.

[4] Chamberlain, Rebecca et al. “Putting the art in artificial: Aesthetic responses to computer-generated art.” In: Psychology of Aesthetics, Creativity, and the Arts 12.2 (2018), p. 177.

[5] Civit, Miguel et al. “A systematic review of artificial intelligence-based music generation: Scope, applications, and future trends”. In: Expert Systems with Applications (2022), p. 118190.

[6] Creswell, John W. and Creswell, J. David. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. Sage Publications, 2017.

[7] Edwards, Michael et al. Fundamentals of Music Theory. The University of Edinburgh, 2021.

[8] Huber, David Miles. The MIDI Manual. Sams, 1991.

[9] O’Brien, Cian and Lerch, Alexander. “Genre-specific key profiles”. In: ICMC. 2015.

[10] OpenAI. ChatGPT API. https://platform.openai.com/docs/introduction. 2021.

[11] Pitch class. https://en.wikipedia.org/wiki/Pitch_class. 2023.

[12] Project, Mido. Mido: MIDI Objects for Python. https://mido.readthedocs.io/. 2023.

[13] Raffel, Colin. The Lakh MIDI Dataset v0.1. https://colinraffel.com/projects/lmd/. Year of retrieval.

[14] Temperley, David and Marvin, Elizabeth West. “Pitch-class distribution and the identification of key”. In: Music Perception 25.3 (2008), pp. 193–212.

[15] Verbeurgt, Karsten, Dinolfo, Michael, and Fayer, Mikhail. “Extracting patterns in music for composition via markov chains”. In: Innovations in Applied Artificial Intelligence: 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, IEA/AIE 2004, Ottawa, Canada, May 17-20, 2004. Proceedings 17. Springer. 2004, pp. 1123–1132.

[16] White, Jules et al. “A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT”. In: arXiv preprint arXiv:2302.11382 (2023).

[17] Wikipedia contributors. Large language model — Wikipedia, The Free Encyclopedia. [Online; accessed 20-May-2023]. 2023. URL: https://en.wikipedia.org/wiki/Large_language_model.

[18] Wikipedia contributors. Proof of concept — Wikipedia, The Free Encyclopedia. [Online; accessed 20-May-2023]. 2023. URL: https://en.wikipedia.org/wiki/Proof_of_concept.

[19] Yang, Li-Chia and Lerch, Alexander. “On the evaluation of generative models in music”. In: Neural Computing and Applications 32.9 (2020), pp. 4773–4784.

[20] Yang, Richard. mgeval: Evaluation Metrics for Music Generation. https://github.com/RichardYang40148/mgeval. Year of retrieval.
Appendix

Prompts used in the study

Meta language instruction

”From now on, when I ask you to generate music, I want you to do it in the short-midi-list-language (SMLL) which consists of a list of lists, format like this: \”[[N1, S1, D1], [N2, S2, D2], [N3, S3, D3], ...]\”, where N represents the MIDI note number, S represents the start time (in ticks), and D represents the duration (in ticks)”

Persona description

”Act as a music composer with expertise in {genre}. When creating music, provide a {h_m_r} in the short-midi-list-language (SMLL) that is at a complexity level of {complexity} out of 5 and in the mood of {mood}. If some information is missing, you can use your expertise to improvise.”

Alternative approach guidance

”Whenever I ask you to create a {h_m_r}, if there are alternative ways to accomplish the same musical idea, list the best alternate approaches, focusing on different scales or techniques. Compare and contrast the emotional impact and complexity of each approach, then choose the best one and write it in the short-midi-list-language (SMLL).”

Template pattern

”I am going to provide a template for your output using the short-midi-list-language (SMLL). Everything in all caps is a placeholder. Any time that you generate music please preserve the formatting and overall template that I provide.
[
[NOTE, START_TIME, DURATION],
[NOTE, START_TIME, DURATION],
...
[NOTE, START_TIME, DURATION]
]”

Context manager details

”The output needs to be formatted correctly as it is used by making a list of the output with regex. From the data we calculate the note_on events and note_off events. The note_on events are used to create a list of note_on messages. The note_off events are used to create a list of note_off messages. The note_on and note_off messages are then used to create a list of MIDI events. The MIDI events are then used to create a MIDI file with the mido python library. Keep in mind that the ticks_per_beat is set to 480.”
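To make the pipeline described in this prompt concrete, a minimal sketch of how an SMLL list can be parsed and rendered to a MIDI file with the mido library might look as follows. Only ticks_per_beat = 480 is taken from the prompt above; the regex, the fixed velocity of 64, and the single-track layout are simplifying assumptions rather than the study's exact implementation.

import re
import mido

def smll_to_midi(smll_text: str, out_path: str) -> None:
    """Parse [[note, start, duration], ...] triples and write a MIDI file."""
    triples = [tuple(map(int, m.groups()))
               for m in re.finditer(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]",
                                    smll_text)]
    # Expand each triple into absolute-time note_on / note_off events.
    events = []
    for note, start, duration in triples:
        events.append((start, mido.Message("note_on", note=note, velocity=64)))
        events.append((start + duration, mido.Message("note_off", note=note)))
    events.sort(key=lambda e: e[0])

    mid = mido.MidiFile(ticks_per_beat=480)  # matches the prompt's setting
    track = mido.MidiTrack()
    mid.tracks.append(track)
    now = 0
    for abs_time, msg in events:
        msg.time = abs_time - now  # mido messages carry delta times in ticks
        now = abs_time
        track.append(msg)
    mid.save(out_path)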

TRITA-EECS-EX-2023:303

www.kth.se
