AI and Automatic Music Generation For Mindfulness
Conference Paper 84
Presented at the International Conference on
Immersive and Interactive Audio
2019 March 27–29, York, UK
This paper was peer-reviewed as a complete manuscript for presentation at this conference. This paper is available in the AES
E-Library (http://www.aes.org/e-lib). All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted
without direct permission from the Journal of the Audio Engineering Society.
ABSTRACT
This paper presents an architecture for the creation of emotionally congruent music using machine learning aided
sound synthesis. Our system can generate a small corpus of music using Hidden Markov Models; we can label
the pieces with emotional tags using data elicited from questionnaires. This produces a corpus of labelled music
underpinned by perceptual evaluations. We then analyse participants' galvanic skin response (GSR) while they
listen to our generated music pieces, and the emotions they describe in a questionnaire conducted after
listening. These analyses reveal that there is a direct correlation between the calmness/scariness of a musical
piece, the users’ GSR reading and the emotions they describe feeling. From these, we will be able to estimate an
emotional state using biofeedback as a control signal for a machine-learning algorithm, which generates new
musical structures according to a perceptually informed musical feature similarity model. Our case study
suggests various applications, including gaming, automated soundtrack generation, and mindfulness.
listening to our favourite music, our bodies respond physically, inducing reactions such as pupil dilation, increased heart rate, blood pressure, and skin conductivity [6]. Thus, there is a potential crossover between mindful action, physiological reaction, and musical stimulation. We are attempting to fruitfully exploit this crossover to gamify mindful interactions and create a music-based training system for the end-user, using machine learning to automate the process. For example, mood-based regulation may be a target for the user. This might be adapted in the creative industries to designs that use physiological metrics as control systems: for example in video games [7], [8], in which case the player might be subjected to targeted mood disruption (i.e., being deliberately excited or even scared).

Machine learning (ML) is a field of computer science covering systems that learn "when they change their behaviour in a way that makes them perform better in the future" [9]. These systems learn from data without being specifically programmed. Many ML algorithms use supervised learning. In supervised learning, an algorithm learns a set of labelled example inputs, generates a model associating the inputs with their respective labels or scores, and then classifies (or predicts) the label or score of unseen examples using the learned model. This can emotionally label music pieces for our system.
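As an illustration of this supervised labelling idea, the following minimal sketch shows the fit/predict pattern in scikit-learn; the feature values, labels, and the choice of a random forest classifier are illustrative assumptions, not the system described in this paper.

```python
# Minimal sketch of supervised emotion labelling. The feature vectors, labels,
# and choice of classifier are illustrative assumptions, not the paper's data.
from sklearn.ensemble import RandomForestClassifier

# Each row is a numerical feature vector describing one labelled music piece.
labelled_features = [[0.2, 0.7, 0.1], [0.8, 0.3, 0.9], [0.3, 0.6, 0.2]]
labels = ["calm", "scary", "calm"]                # human-elicited emotional tags

model = RandomForestClassifier(random_state=0)
model.fit(labelled_features, labels)              # learn from labelled examples

unseen_piece = [[0.7, 0.4, 0.8]]                  # features of a new, unlabelled piece
print(model.predict(unseen_piece))                # predicted emotional label
```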
Kim et al. [10] and Laurier & Herrera [11] give a literature overview of detecting emotion in music and focus on the music representations. Laurier & Herrera [11] also analyse the ML algorithms used. Classification algorithms used in the literature include C4.5, Gaussian mixture models, k-nearest neighbour, random forest, and support vector machines [10]–[12]. Regression techniques include Gaussian mixture model regression, multiple linear regression, partial least-squares regression and support vector regression. ML has been used to retrieve music by mood, and the personalized approach was found to be more consistent than a general approach [12]. A significant area for further work is the need to better understand the whole process and be more intelligent with respect to music, users and emotions. Thus, we underpin our system with results from human experiments. These are only feasible on a small amount of music due to the required participant sample sizes, so we augment our human-labelled data using ML. We build on these findings to deliver a personalized AI approach to target mindfulness.

1.2 Emotional responses to music

There are a number of approaches for modelling emotional responses to musical stimuli [13]. Often, these borrow from conventional models used to quantify and qualify emotion, such as the circumplex (two-dimensional) model of affect [14]. This model places valence (as a measure of positivity) and arousal (as a measure of activation strength) on the horizontal and vertical axes respectively. Emotional descriptors (e.g., happy, sad, angry) can be mapped on to this space, though some descriptors can be problematic in terms of a duality of placement on the model. For example, angry and afraid are different emotions, but both would be considered negative valence and high arousal and are thus difficult to differentiate on this type of emotion space.
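To make the placement problem concrete, the short sketch below assigns a few descriptors to illustrative (valence, arousal) coordinates; the numeric values are assumptions chosen for illustration and are not taken from [14].

```python
# Illustrative (valence, arousal) coordinates on the circumplex model of affect.
# The numeric values are assumptions for illustration only.
circumplex = {
    "happy":  ( 0.8,  0.6),   # positive valence, raised arousal
    "sad":    (-0.7, -0.5),   # negative valence, low arousal
    "angry":  (-0.7,  0.8),   # negative valence, high arousal
    "afraid": (-0.6,  0.8),   # negative valence, high arousal
}
# "angry" and "afraid" land in almost the same region, which is why a purely
# two-dimensional model struggles to differentiate them.
```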
Another problem in evaluating emotional responses to music exists in the distinction between perceived and induced emotions [15]; this is also relevant in multimodal stimuli such as film [16]. This might be broadly summarised as the listener or viewer understanding what type of feeling the stimulus is supposed to express (perceived), versus describing how it makes them feel (induced). For example, a sad piece of music may be enjoyable to an individual listener in the right context, despite being constructed to convey sadness.

2 System Overview

Recent advances in the portability, wearability, and affordability of biosensors now allow us to explore evaluations considering the above distinction. Biophysiological regulation may circumvent some of the problems of self-reported emotion (e.g., users being unwilling to report particular felt responses, or confusing perceived responses with felt responses). Real-world testing of systems using bio-signal mappings in music generation contexts has become an emerging field of research. For example, [17] generate simple music for specific emotional states using Markov chains. The Markov chains are used to generate music while the user
Fig. 1. Listeners rank musical excerpts, which are analysed for features to train a ML model for construction of new excerpts.

The system detects the user's current emotional level and the ML algorithm picks musical pieces to influence their future emotional level to achieve their desired mood. This whole process requires musical pieces that have an associated emotional label (score) to allow the selection of appropriate pieces. We use two tasks to achieve this. The first task in Fig. 1 and section 2.1 is to generate and expand a human-labelled corpus to provide sufficient labelled pieces for the system to operate. The music generation process is described in detail in section 2.1.1. The second task in Fig. 2 and section 2.2 is to analyse the user's galvanic skin conductivity and to select the most appropriate music from the corpus according to the user's emotional requirement. There is also a feedback loop to adapt the corpus scores according to the user's actual experience.
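The loop implied by this description can be sketched as follows; this is purely illustrative, and the GSR-to-level mapping, the corpus scores and the nearest-score selection rule are assumptions rather than the system's actual implementation.

```python
# Illustrative sketch of the detect-and-select loop described above. The GSR
# mapping, corpus scores and selection rule are assumptions, not the system.
import random

def read_gsr():
    # Stand-in for a galvanic skin response reading from a wearable sensor.
    return random.uniform(1.0, 10.0)

def gsr_to_emotion_level(gsr, max_gsr=10.0):
    # Hypothetical mapping of skin conductance onto an emotional-level scale.
    return 10.0 * gsr / max_gsr

# Emotionally labelled corpus: piece name -> score (illustrative values).
corpus = {"piece_a": 2.0, "piece_b": 5.0, "piece_c": 8.5}

desired_level = 2.0                               # the user's desired mood
current_level = gsr_to_emotion_level(read_gsr())  # detected via biofeedback
chosen = min(corpus, key=lambda name: abs(corpus[name] - desired_level))
print(f"current level {current_level:.1f}; selecting {chosen} (score {corpus[chosen]})")
```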
2.1 Task 1

To generate a database of labelled musical pieces, we elicited scores using user data accumulated through an anonymous online survey. This is an initial feasibility evaluation to assess whether human labelling is possible. Hence, we used an anonymous voluntary survey. We ran small pilot evaluations on the best survey questions and determined that binary comparison of two pieces elicited the most consistent results. We surveyed 53 participants using a Qualtrics on-line survey (www.qualtrics.com). We distributed the survey URL via email lists to colleagues, who responded anonymously; all are English speakers, which is important for understanding the emotional labels. For this development system, we selected music that is unknown to the participants. As discussed in [11], emotions induced in the listener are influenced by many different contextual elements, such as personal experiences, cultural background, music they have only recently heard or other personal preferences, so using generated music as a stimulus may help to eliminate some of these confounds as preconceptions are removed. There is much debate regarding adjectives as emotional descriptors, and how they might best be interpreted, particularly considering ambiguities across various languages. In this work we use mindful (calm/not scary) and (tense/scary) as these can be considered
they did not have an abrupt ending (which might otherwise also influence emotional response in the participants).

2.1.3 Results

Data from 53 participants were collected for analysis. Table 1 provides an overview of the composition of the six bipolar questions and Table 2 details the participants' responses.

Table 1. The pair of tracks in each question (Q1–Q6), with one question per column. Scary tracks are labelled Sn.

                Q1   Q2   Q3   Q4   Q5   Q6
First track     N1   S2   S1   N1   S1   N1
Second track    S1   N2   N2   S2   S2   N2

Table 2. The number of participants that picked each track as the scariest of the pair in each question. Each question (Q1–Q6) is one column; rows follow the track order of Table 1.

                Q1   Q2   Q3   Q4   Q5   Q6
First track      2   37   41    5   18   24
Second track    42    5    3   41   25   20

2.1.4 Analysis

Responses to the musical stimuli in Table 2 suggest that listeners found it relatively easy to discriminate the affective states between stimuli rendered using different synthesized timbres. As expected (see figures 3 and 4), shorter durations and larger pitch ranges were considered lower in mindfulness ("scarier/more tense") than longer durations with a more restricted pitch range, regardless of the timbre being used. The tracks we expected to be labelled "scary" were labelled "scary" by the participants, and the tracks we expected to be labelled "not scary" were labelled "not scary". Questions 5 and 6 compare the two "scary" tracks and the two "not scary" tracks respectively. Here the results are closer, as we might expect: 58.1% of participants thought S2 scarier than S1, while 54.6% felt N1 was scarier than N2.

For S1, 94.5% and 93.2% of participants rated it scarier than N1 and N2 respectively. For S2, 88.1% and 89.1% of participants rated it scarier than N1 and N2 respectively. Yet 58.1% of the participants rated S2 scarier than S1, despite S2 having a lower scariness than S1 when compared to the non-scary tracks. Similarly, for N1, 4.6% and 10.9% rated it scarier than S1 and S2 respectively, while for N2, 6.8% and 11.9% rated it scarier than S1 and S2 respectively. This presents a similar contradiction to that for the scary tracks, as N1 has the lowest scariness rating yet was rated scarier than N2 by 54.6% of participants. We cannot explain this.

Although we randomized the order of presentation of the questions to the individual participants, we did not alter the order of presentation within the questions. This may have contextual effects on the participants and needs to be considered. However, we note that the participants rated the second track as scariest in Q5 and the first track as scariest in Q6, indicating that the intra-question ordering is unlikely to be significant.

From these comparisons, we were able to attain sufficient data to calculate a ranked order (score) for the pieces from these pairwise comparisons [19]. From above, 58.1% of participants thought S2 scarier than S1 while 54.6% felt N1 was scarier than N2. Hence, the ranking is that S2 is scarier than S1 and N2 is calmer than N1. We produce a scored label, in contrast to Laurier & Herrera [11] who used a Boolean label, for instance a song is "happy" or "not happy". However, a Boolean label does not provide the fine-grained differentiation we require to select emotionally relevant pieces, so we produce a score from [0-10] for each musical piece where 0 is completely calm, 10 is completely scary and 5 is the midpoint: neither calm nor scary.
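As a concrete illustration of how such a score can be obtained from the counts in Table 2, the sketch below simply scales each track's share of "scarier" votes onto the 0-10 range; this is a deliberate simplification for illustration and is not the ranking method of [19].

```python
# Sketch: derive a 0-10 scariness score from the pairwise counts in Table 2 by
# scaling each track's share of "scarier" votes. This simplification is for
# illustration only; [19] describes more principled pairwise ranking methods.
from collections import defaultdict

# (first track, second track): (votes for first as scarier, votes for second), Q1-Q6.
table2 = {("N1", "S1"): (2, 42), ("S2", "N2"): (37, 5), ("S1", "N2"): (41, 3),
          ("N1", "S2"): (5, 41), ("S1", "S2"): (18, 25), ("N1", "N2"): (24, 20)}

wins, totals = defaultdict(int), defaultdict(int)
for (a, b), (votes_a, votes_b) in table2.items():
    wins[a] += votes_a
    wins[b] += votes_b
    totals[a] += votes_a + votes_b
    totals[b] += votes_a + votes_b

for track in sorted(wins, key=lambda t: wins[t] / totals[t]):
    print(track, round(10 * wins[track] / totals[track], 1))  # 0 = calm, 10 = scary
```

Even this crude scaling reproduces the ordering discussed above, placing N2 below N1 and S1 below S2.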
2.1.5 Enhancing the corpus

Human experiments are only feasible on a small set of music pieces, as n pieces of music require n(n-1)/2 pairwise comparisons (six comparisons for the four pieces used here) and enough human survey participants to provide enough responses for each comparison. Using human participants to generate a sufficiently large database of labelled pieces for our work is very time consuming and complex. To augment our small labelled database, we need to use ML to label new music and to provide a corpus sufficiently large for task 2 to be feasible.
data from a MIDI file, which represents the structure of the melody, chords, rhythm and other musical dimensions, with Mel-Frequency Cepstral Coefficients features [20] obtained from the entire piece to represent the piece's quality (character). This dual representation is more flexible and richer than simply using MIDI or signal-based audio content features. We only use numerical data features to describe each piece and perform feature selection to identify the most significant set of features, as described in [21]. Using this reduced set of significant features, the ML model will predict the "calmness" score of new music pieces by determining the similarity between pieces using their respective sets of features.
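A minimal sketch of what such a dual MIDI-plus-MFCC feature vector could look like is shown below; the choice of librosa and pretty_midi, and the specific structural descriptors, are assumptions for illustration rather than the implementation used in this system.

```python
# Sketch of a dual MIDI + MFCC feature vector for one piece. The libraries
# (librosa, pretty_midi) and the structural descriptors are assumptions.
import numpy as np
import librosa
import pretty_midi

def piece_features(audio_path, midi_path):
    # Signal-based character: mean MFCCs over the whole rendered piece [20].
    y, sr = librosa.load(audio_path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    # Structural descriptors from the MIDI data (melody/rhythm dimensions).
    midi = pretty_midi.PrettyMIDI(midi_path)
    notes = [n for inst in midi.instruments for n in inst.notes]
    pitches = [n.pitch for n in notes]
    durations = [n.end - n.start for n in notes]
    structure = [max(pitches) - min(pitches),        # pitch range
                 float(np.mean(durations)),          # mean note duration
                 len(notes) / midi.get_end_time()]   # note density

    return np.concatenate([mfcc, structure])         # one numerical vector per piece
```

Vectors of this kind would then be reduced by feature selection before the similarity model is trained.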
In task 2, from the GSR analysis, we can calculate the new calmness level required to achieve a mindfulness goal (e.g., make the listener calmer if they are over-stimulated). This new calmness value allows us to retrieve all pieces of music in the scored song corpus at this new calmness level. To select the most appropriate piece from this matching set, we will match the input piece against each matching piece using the selected set of features. We will use the same music data representation as task 1 and the identical ML model to ensure consistency and to stop the system introducing contextual biasing and irregularities. We summarize task 2 in Fig. 2. The features we have selected are input to the ML model; we will calculate the musical similarity score for each piece using the ML model and then recommend the music piece that is at the correct calmness level and is most similar (musically contiguous) to the user's current state.
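The selection step might look something like the sketch below; the tolerance band and the Euclidean distance measure are assumptions, since the text does not specify how similarity is computed.

```python
# Sketch of the task 2 selection step: filter the scored corpus to the required
# calmness level, then return the candidate most similar to the current piece.
# The tolerance and Euclidean distance are assumptions for illustration.
import numpy as np

def select_next_piece(corpus, current_features, required_calmness, tolerance=0.5):
    # corpus: list of dicts, each with a "score" (0-10) and a "features" array.
    candidates = [p for p in corpus
                  if abs(p["score"] - required_calmness) <= tolerance]
    if not candidates:
        return None
    return min(candidates,
               key=lambda p: float(np.linalg.norm(p["features"] - current_features)))
```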
As we continue monitoring the participant's GSR we will assess whether the new piece has achieved the desired level of calmness. This difference (error between actual and required GSR) will feed back into the corpus of scored pieces to adjust the stimulus calmness score (essentially a calmness index). We will adjust both the global score to ensure the system correctly rates each piece and the person's own scoring mechanism to provide personalized music for their mindfulness requirements.
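One way this two-level adjustment could be realised is sketched below; the update rule and the size of the adjustment factor are assumptions for illustration, not the formula used by the system.

```python
# Sketch of the feedback adjustment described above: nudge the global corpus
# score and a per-listener offset by a small error factor. The rule and the
# value of alpha are illustrative assumptions, not the system's formula.
def update_calmness_scores(piece, achieved_level, required_level,
                           personal_offsets, alpha=0.05):
    error = achieved_level - required_level        # error between actual and required
    piece["score"] += alpha * error                # global calmness index, adjusted gradually
    name = piece["name"]
    personal_offsets[name] = personal_offsets.get(name, 0.0) + alpha * error
    return piece["score"], personal_offsets[name]  # corpus score and personalized offset
```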
We can enhance the monitoring further by using additional sensors. We have proven GSR sensors for this task, but other sensors such as heart-rate sensors will provide additional data. Combining data from multiple sensors will be richer. Analysing this richer data using suitable machine learning algorithms will be more accurate, more reliable and reveal finer-grained fluctuations and changes in the participant's emotional responses than would be revealed by analysing only a single sensor.

Further evaluation of our automatically generated music is not trivial. Although the influence of music on emotional state is widely acknowledged [13], [25], [26], perceptual audio evaluation strategies often consider issues of audio quality [27] rather than the influence of generative strategy on the resulting affective state in the listener. Moreover, methods which do consider the influence of generative music on affective state tend to be focused on creativity [28], and issues regarding the authorship of the material [29]. Thus, methods for perceptual evaluation of affectively-charged music generation remain a significant area for further work: for example in the design and development of a multi-purpose evaluation toolkit. Many such kits exist in audio quality evaluation, for example [30].

4 Conclusion

We have shown through an experiment with 53 participants and four music pieces that we can generate emotionally communicative music. We generated two scary pieces and two calm pieces, and the users ranked these as we expected, thus supporting our hypotheses regarding how to generate calm and scary music in a more robust fashion in future.

To support this, we intend to generate more pieces and recruit further listeners to bootstrap the generation of a larger corpus. A rich data description of human-labelled pieces will allow machine-learning algorithms to label new pieces independently, which would mean we can expand the corpus to any size required for a task.

Once we have a sufficiently large labelled corpus of our auto-generated music, we will use these to select pieces to play to users according to their galvanic skin response. Our previous work [24] showed that we can combine auto-generated music and GSR monitoring to induce emotions and that these emotions correspond with those felt by the listener (as self-reported via questionnaires).
The ultimate goal of the system would be to generate a calmer piece in direct response to a listener's physiological reaction and promote the necessary emotional state for enhanced mindfulness. Physiologically informed feedback is vital for this process: any error (error between actual and required GSR) feeds back into the corpus of scored pieces to adjust that particular piece's calmness score by a small error factor. This requires a large dataset, as the corpus will be adjusted gradually and incrementally to maximize the available emotion space. By using our own automatically-generated pieces, we can minimize confounds of familiarity, and the need to actively rank music whilst listening (in itself a process which might break mindfulness or relaxation). Thus the use of biophysiological sensors is critical in the development of suitable systems for audio generation in the context of mindfulness or relaxation.

Generative music technology has the potential to produce infinite soundtracks in sympathy with a listener's bio-signals, and in a biofeedback loop. There are promising applications for linking music with emotions, especially in the creative industries, art and therapy, and particularly for relaxation. Enhancement of well-being using music, and the emotions music induces, is an emerging topic for further work. The potential of cheaper wearable biosensors to collect large amounts of data for training machine learning algorithms suggests that gamifying emotions through musical sound synthesis might be possible in the near future. For example, this type of audio stimulus generation need not be restricted to a given extracted bio-signal value; in future, trials with target emotional values could be conducted, i.e., encouraging the listener to move towards a specific emotional correlate or Cartesian co-ordinate in a dimensional emotion model, such as a gamified approach to mindfulness, or a biosensor-driven thriller or horror game. We note that the music generation software using HMMs allows us to generate this music rapidly, so in future we can generate on-the-fly and on-demand rather than selecting pre-generated tracks from a corpus. This auto-generation is much richer, more varied, more adaptive and more personalized than selecting from a playlist.

However, such work also needs to heed the potential drawbacks of emotional manipulation using AI and related systems. There is potential for emotional manipulation for marketing purposes or social control. Nevertheless, the promising everyday applications for mindfulness, and the potential therapeutic applications, provide a strong argument to continue investigating this area.

5 Acknowledgements

This work was supported by the Digital Creativity Labs (www.digitalcreativity.ac.uk), jointly funded by EPSRC/AHRC/Innovate UK under grant no. EP/M023265/1.

References

[1] M. Economides, J. Martman, M. J. Bell, and B. Sanderson, "Improvements in Stress, Affect, and Irritability Following Brief Use of a Mindfulness-based Smartphone App: A Randomized Controlled Trial," Mindfulness, vol. 9, no. 5, pp. 1584–1593, Oct. 2018.

[2] R. Chambers, E. Gullone, and N. B. Allen, "Mindful emotion regulation: An integrative review," Clin. Psychol. Rev., vol. 29, no. 6, pp. 560–572, 2009.

[3] G. Bondolfi, "Depression: the mindfulness method, a new approach to relapse," Rev. Med. Suisse, vol. 9, no. 369, p. 91, 2013.

[4] D. Williams, A. Kirke, E. R. Miranda, E. Roesch, I. Daly, and S. Nasuto, "Investigating affect in algorithmic composition systems," Psychol. Music, vol. 43, no. 6, pp. 831–854, 2014.

[5] I. Daly et al., "Automated identification of neural correlates of continuous variables," J. Neurosci. Methods, vol. 242, pp. 65–71, 2015.

[6] S. D. Vanderark and D. Ely, "Cortisol, biochemical, and galvanic skin responses to music stimuli of different preference values by college students in biology and music," Percept. Mot. Skills, vol. 77, no. 1, pp. 227–234, 1993.

[7] K. Garner, "Would You Like to Hear Some Music? Music in-and-out-of-control in the Films of Quentin Tarantino," Film Music Crit. Approaches, pp. 188–205, 2001.
[8] M. Scirea, J. Togelius, P. Eklund, and S. Risi, "Affective evolutionary music composition with MetaCompose," Genet. Program. Evolvable Mach., vol. 18, no. 4, pp. 433–465, 2017.

[9] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.

[10] Y. E. Kim et al., "Music emotion recognition: A state of the art review," in Proc. ISMIR, 2010, pp. 255–266.

[11] C. Laurier and P. Herrera, "Automatic detection of emotion in music: Interaction with emotionally sensitive machines," in Machine Learning: Concepts, Methodologies, Tools and Applications, IGI Global, 2012, pp. 1330–1354.

[12] A. C. Mostafavi, Z. W. Ras, and A. Wieczorkowska, "Developing personalized classifiers for retrieving music by mood," in Proc. Int. Workshop on New Frontiers in Mining Complex Patterns, 2013.

[13] M. Zentner, D. Grandjean, and K. R. Scherer, "Emotions evoked by the sound of music: Characterization, classification, and measurement," Emotion, vol. 8, no. 4, pp. 494–521, 2008.

[14] J. A. Russell, "A circumplex model of affect," J. Pers. Soc. Psychol., vol. 39, no. 6, p. 1161, 1980.

[15] A. Gabrielsson, "Emotion perceived and emotion felt: Same or different?," Music. Sci., vol. 5, no. 1 suppl, pp. 123–147, 2002.

[16] L. Tian et al., "Recognizing induced emotions of movie audiences: Are induced and perceived emotions the same?," in Affective Computing and Intelligent Interaction (ACII), 2017 Seventh International Conference on, 2017, pp. 28–35.

[17] C.-F. Huang and Y. Cai, "Automated Music Composition Using Heart Rate Emotion Data," in International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2017, pp. 115–120.

[18] T. Eerola and J. K. Vuoskoski, "A review of music and emotion studies: approaches, emotion models, and stimuli," Music Percept. Interdiscip. J., vol. 30, no. 3, pp. 307–340, 2013.

[19] F. Wauthier, M. Jordan, and N. Jojic, "Efficient ranking from pairwise comparisons," in International Conference on Machine Learning, 2013, pp. 109–117.

[20] B. Logan, "Mel Frequency Cepstral Coefficients for Music Modeling," in ISMIR, 2000, vol. 270, pp. 1–11.

[21] V. J. Hodge, S. O'Keefe, and J. Austin, "Hadoop neural network for parallel and distributed feature selection," Neural Netw., vol. 78, pp. 24–35, 2016.

[22] D. C. Shrift, "The galvanic skin response to two contrasting types of music," University of Kansas, Music Education, 1954.

[23] I. Daly et al., "Towards human-computer music interaction: Evaluation of an affectively-driven music generator via galvanic skin response measures," in 7th Computer Science and Electronic Engineering Conference (CEEC), IEEE, 2015, pp. 87–92.

[24] D. Williams, C.-Y. Wu, V. J. Hodge, D. Murphy, and P. I. Cowling, "A Psychometric Evaluation of Emotional Responses to Horror Music," in 146th Audio Engineering Society International Pro Audio Convention, Dublin, March 20–23, 2019.

[25] K. R. Scherer, "Acoustic Concomitants of Emotional Dimensions: Judging Affect from Synthesized Tone Sequences," in Proceedings of the Eastern Psychological Association Meeting, Boston, Massachusetts, 1972.

[26] K. R. Scherer, "Which Emotions Can be Induced by Music? What Are the Underlying Mechanisms? And How Can We Measure Them?," J. New Music Res., vol. 33, no. 3, pp. 239–251, Sep. 2004.

[27] J. Berg and F. Rumsey, "Spatial Attribute Identification and Scaling by Repertory Grid Technique and Other Methods," in 16th International Conference: Spatial Sound Reproduction, 1999.

[28] F. Rumsey, B. de Bruyn, and N. Ford, "Graphical elicitation techniques for subjective assessment of the spatial attributes