1. Introduction
Intonation and speech rhythm research is practically exclusively associated with the respective studies of fundamental frequency (F0) patterns and syllabic duration throughout utterances, with a few exceptions reviewed below. Those who work with the description and modeling of a melodic contour focus almost exclusively on the description of the form and progression of F0 contours and levels in time, with indirect or secondary references to syllabic duration. This is the case of the American ToBI annotation system (
Silverman et al. 1992), which is largely used in phonology-based intonation research in several languages to which the ToBI system was adapted (see, for instance, G-ToBI for German and SP-ToBI for Spanish). As for prosodic breaks, ToBI only marks the strength of a break in essentially two levels (3 or 4) to direct the analyses on matters of the sequence and organization of mono- and bitonal events linked to the realization of pitch accents and boundary tones. Some criticism appeared after ten years of use of the ToBI system. The main drawbacks concern the low inter-annotator reliability for the choice of pitch accent tones (
Wightman 2002), as well as the possible circularity in not dissociating form from function (
Hirst 2005). These two drawbacks are avoided in the work reported here by a procedure we used some time ago (
Barbosa 2010) and which is similar to the rapid prosodic transcription used by Cole and collaborators (
Cole et al. 2010;
Cole and Shattuck-Hufnagel 2016). The method they proposed consists of asking listeners to evaluate how the speakers split their production into smaller parts by signaling breaks and how they highlighted words by listening to the corresponding audio files. To accomplish both tasks, the listeners were invited to mark the breaks with bars (/) and highlighted words with circles. The strength of these two functions (boundary and prominence) is considered as directly related to the proportion of the listeners’ choices.
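As a minimal sketch of how listener markings could be turned into strength scores, the snippet below computes, for each word, the proportion of listeners who marked it. The data layout and the name `rpt_scores` are illustrative assumptions, not part of the cited method's published tooling.

```python
def rpt_scores(markings):
    """Proportion of listeners who marked each word.

    markings: one list per listener, each holding 0/1 per word
    (1 = the listener marked a break after, or prominence on, that word).
    """
    n_listeners = len(markings)
    n_words = len(markings[0])
    return [
        sum(listener[w] for listener in markings) / n_listeners
        for w in range(n_words)
    ]

# Hypothetical data: 4 listeners, 5 words, 1 = break marked after the word.
breaks = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1],
]
boundary_scores = rpt_scores(breaks)
```

A score of 1.0 means that every listener marked the position, and a score near 0 means that almost none did.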
From the phonetic point of view, models of intonation have relied on the realization of focus and boundary marking functions from simple phonological descriptions of the sentence with codes which refer to F0 levels and contours. Examples of such models were proposed for languages such as Swedish (
Bruce 1977,
1982;
Gårding and Bruce 1981), Dutch (
t’Hart 1984), English (
t’Hart 1984;
van Santen and Möbius 2000), and Japanese (
Fujisaki and Hirose 1984), among others.
Botinis et al. (
2001) reviewed some of these models, pointing out that these phonetic models combine global trends and local changes in the F0 contours for implementing focus. In particular, the model by
van Santen and Möbius (
2000) combined the duration of accented and the following non-accented syllables to generate the F0 contours associated with pitch accents.
Recent work on Brazilian Portuguese (BP) analyzed the F0 contour and the role of the visual modality in the realization of both wide and narrow focus in declarative and interrogative sentences. The authors evaluated the roles of F0, syllabic duration, and intensity in the signaling of the focal function (
Carnaval et al. 2022;
Miranda et al. 2021,
2022). Also, in BP,
Teixeira et al. (
2018) studied the relevance of more than 100 acoustic prosodic descriptors for signaling terminal and non-terminal boundaries, with the aim of producing an algorithm for the automatic detection and classification of prosodic boundaries based on data of the C-Oral-Brazil corpus. Their work revealed the need to combine duration and F0 to achieve a better performance in predicting prosodic boundaries in spontaneous speech.
As for work on other languages, a comparative study of Mandarin and English investigated the joint manifestation of F0 contours and syllabic duration changes associated with tone, intonation, and duration in English (
Xu 2009). A more recent work modeled the joint manifestation of F0 contours and syllabic duration for focus implementation in Emirati Arabic (
Alzaidi et al. 2023). Work by
Christodoulides (
2018), which investigates the relative importance of prosodic–acoustic parameters to signal different kinds of boundaries in French spontaneous speech, points to the higher relevance of silent pause duration. As for prominence marking in English, the work by
Herment-Dujardin and Hirst (
2002) pointed out the relative relevance of F0 and duration parameters, of their combination, and of semantic content. Their work also took into consideration a corpus of spontaneous speech. With controlled experiments, work on the prosodic cues for signaling boundaries in German, both in adults (
Holzgrefe et al. 2012) and in 8-month-old infants (
Wellmann et al. 2012), revealed that the combination of pitch change and preboundary lengthening is a reliable cue for perceiving a boundary in speech.
Recently, for Greek,
Arvaniti et al. (
2024) used functional principal component analysis (FPCA) to evaluate the trade-off between F0 shape and duration in a corpus of 13 speakers arranged in different pairs to read dialogs. The first two components of the FPCA revealed that “F0 curves with a less pronounced dip and an earlier and lower peak were associated with longer accented vowel duration” and that “lower F0 curves, particularly those low in the preaccentual region, were associated with longer duration of that region.” (pp. 292 and 293). None of these works, however, analyzed the systematic correspondence of F0 descriptors extracted from the F0 trace and syllabic duration for realizing focus and for signaling prosodic boundary throughout long stretches of unscripted speech.
In the same direction, it is important to highlight that long-standing research on speech rhythm rarely makes reference to F0 contour events. This line of research has been pointing to the primacy of syllable duration for marking prosodic boundaries and prominence in different languages (see
Leemann et al. 2016 in eight languages/varieties;
Streefkerk 1997 in Dutch;
Gussenhoven and Rietveld 1992 in English; and
Barbosa 2007 in BP).
Recent work on the notion of macro-rhythm considered the timing of melodic events such as pitch accents throughout utterances but did not take into account the analysis of the duration of entire syllabic sequences regardless of their accentedness (
Jun 2014;
Wehrle et al. 2020).
It is the interplay between F0 and syllabic duration patterns that we propose to investigate here by examining the cross-correlation between the corresponding time series in places where terminal and non-terminal boundaries and prominence are realized in storytelling. Speech productions in two dialects (São Paulo and Rio de Janeiro dialects) are analyzed. These two dialects were chosen because they have been the object of the majority of BP intonation studies.
Time series correlations capture the systematic correspondence between two variables. If both series are identical, the correlation is 1, whereas if they are completely unrelated on statistical grounds, the correlation is 0. In between, the technique is able to capture systematic positions where both series have local maxima or minima, which is the aspect we want to investigate here. To the best of our knowledge, the previous literature has not explored systematic correlations of these paired series. This also applies to work on BP, for which very limited corpora can be found, usually investigating the realization of local functions in read utterances.
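The two reference points mentioned above can be illustrated with a plain Pearson correlation at lag zero. This is only a sketch with made-up values: the series `duration` and `f0_range` are hypothetical and do not come from the corpus.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical smoothed duration z-scores and F0 range values (Hz),
# one value per V-to-V unit, with maxima at the same positions.
duration = [0.2, 1.5, 0.1, -0.4, 2.0, 0.3]
f0_range = [10.0, 40.0, 12.0, 8.0, 55.0, 15.0]

r_same = pearson(duration, duration)   # identical series
r_pair = pearson(duration, f0_range)   # maxima coincide, so r is high
```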
As for work on narratives and storytelling, no systematic exploration of the relationship between melodic and duration levels has been carried out. The work by
Oliveira (
2012) investigated narratives in BP on the role of prosodic parameters such as F0, speech rate, and pause for the segmentation of narrative sections. He pointed to the cyclical character of speech rate which accompanies the change across these sections.
Other studies on storytelling proposed prosodic modification rules from neutral speech to endow speech synthesis systems with this speaking style. They have been developed for languages such as Malay (
Ramli et al. 2016), Hindi (
Verma et al. 2015), and French (
Doukhan et al. 2011). The first one points to the prosodic differences between the storytelling and the neutral speech of two professional storytellers, while the second one evaluates the differences regarding distinct emotions conveyed during the narratives of five laypeople telling stories to children. The work on French, on the other hand, uses a single professional storyteller and assesses differences in melodic, durational, and intensity parameters between storytelling and other speaking styles, as well as across narrative sections. None of these studies explore the retelling of a story, which has a strong component of cognitive load (see
Dixon and Gould 1996 for the difference between storytelling and story retelling; see
Pratt et al. 1989;
Skehan and Foster 1999 for the study of processing load in story retelling). The number of speakers is also lower than the number of participants in our study.
In the present study, we fill these two gaps by describing and quantifying the convergence between F0 descriptors and syllabic duration in story retelling with a corpus that includes more speakers than usually found for this kind of study as well as analyzing running speech, and not isolated utterances, as is the case in the great majority of studies investigating BP. Recently, we investigated the interplay between F0 descriptors and normalized duration maxima in three speakers from São Paulo state, Brazil, for both reading and story retelling recorded in 2009 (
Barbosa 2024). The results of cross-correlations between the positions of prominence and each series of four F0 descriptors (F0 median, F0 range, F0 rise and fall rates) revealed that the two speaking styles differ in the sense that the F0 range and F0 rise/fall rate are more correlated with peaks of duration in story retelling than F0 median is in reading. In reading, F0 median and F0 rise rate peaks are mostly aligned with duration maxima, which is also due to rising contours being the most frequent shape for realizing this function in that style. A rising contour is also the most frequent contour for marking prominence in story retelling, but it is preceded by F0 fall rate minima and accompanied by a large F0 range. This is related to the fact that an F0 fall with higher rates precedes the typical rising of the prominent F0 contour either inside the longer V-to-V interval or inside the immediately preceding unit, which signals that the alignment of an F0 fall is more relevant for preparing a variable following an F0 rise.
As for the cross-correlation between the non-terminal boundary positions with the F0 descriptor series, with the exception of one speaker, the F0 range and F0 rise rate maxima correlate more with duration peaks, with higher values for story retelling, which can be taken as a characteristic of this style: the peaks of risings and extended ranges may indicate in BP that the speaker has more to say. Finally, the same work pointed to the correlation between terminal boundary positions with the F0 descriptor series, showing that F0 median minima and F0 range maxima are the descriptors with higher values of correlation in reading. This can be seen as a characteristic of this style when BP speakers signal terminal boundaries.
Based on previous work, our hypotheses concerning the correlation of one or more F0 descriptors with duration maxima are as follows: (1) F0 range and F0 rise/fall rate maxima are the most relevant correlated descriptors when signaling prominence; (2) the F0 range and F0 rise rate maxima are the most relevant descriptors at non-terminal boundaries; and (3) F0 median minima and F0 range maxima are the most relevant descriptors at terminal boundaries, which are less frequent in the case of story retelling due to the nature of the task.
2. Methodology
2.1. Corpus
The BELÉM corpus is formed by readings and retellings of the story about the origin of the Portuguese Belém pastries by Brazilian and Portuguese male and female speakers with a range of 30 to 45 years of age from the state of São Paulo. For the building of this corpus, the first recording was made in 2009. In 2022, the recordings were resumed to add to the BELÉM corpus Brazilian speakers from different dialectal regions of Brazil, starting with speakers from Rio de Janeiro, as well as extending the number of speakers from São Paulo. Furthermore, a shortened version of the original text (753 words instead of 1568), which can be found in
Appendix A, was used as the text for the new retellings. These new recordings include six subjects from São Paulo (3 females and 3 males) and three subjects from Rio de Janeiro (2 females and 1 male). As in 2009, this new recruitment was based on the traditional friend-of-a-friend approach in sociolinguistics, starting with the friends of the second author in Brazil. Their narratives had between 220 and 330 words (2 to 4 min). The additional participants were between 20 and 30 years old at the time of recording and were college students of different majors. These recordings are available on the Figshare platform [
https://doi.org/10.6084/m9.figshare.25383190.v1 (accessed on 5 July 2024)]. Only data from the new recordings, based on the shortened version of the Belém pastries story, are used for the analyses presented here.
Due to restrictions related to the COVID-19 pandemic, the participants themselves used the Easy Voice app on their own cell phones to make all recordings. The nine subjects read the text and soon after retold the story in their own words. Only the retellings are considered for analysis here. Because this app allows choosing among different codifications, instructions were given to record all audio files in PCM format (WAV) at a sampling rate of 48 kHz. The first author, who is a trained phonetician, further evaluated all audio files for the presence of noise that would impact computing the acoustic parameters. No recordings were discarded. The recordings were then resampled at 16 kHz and leveled to the same maximum intensity level at 65 dB.
2.2. Acoustic–Prosodic Parameters for Analysis
A Praat script, Prosody Descriptor Extractor (
Barbosa 2020), henceforth PDE, was used to extract statistical descriptors of prosodic–acoustic parameters based on syllabic duration and fundamental frequency (F0). The script is accompanied by a manual, and examples of input and output data are also given.
A syllabic duration series was obtained by normalizing and smoothing the duration of the sequence of V-to-V intervals, that is, of all the segments delimited by two immediately consecutive vowel onsets.
The reason for using V-to-V intervals instead of phonological syllables is threefold: (1)
Dogil and Braun (
1988) conducted psycholinguistic experiments that showed that vowel onset tracking is a fundamental property of speech signal processing in our brain. (2) This property of the brain activity was also pointed out by
Chistovich and Ogorodnikova (
1982) by examining post-stimulus temporal neuronal responses to speech. They reported amplified neuronal responses to portions of energy increase typical of C-V transitions, accompanied by response suppression in regions where energy decreased (typically around V-C transitions). (3) Furthermore, a segmentation based on vowel onsets has the advantage of being detectable under moderately noisy conditions (
Barbosa 2010).
The perceptual impression of speech rate is also primarily associated with the tracking of vowel onsets and not syllable onsets, as was experimentally tested by
Pompino-Marschall (
1991) with German subjects. This reveals that the perception of the syllable sequence relies on the detection of the nucleus of the syllables, more often occupied by vowels. Because V-to-V intervals are syllable-sized units, which mark the flow of syllables throughout utterances, we refer to the V-to-V interval duration as the “syllabic duration”.
Figure 1 illustrates, in the second tier, the segmentation and labeling of the V-to-V intervals for the excerpt “e aí no fim ele dormiu…” (and then he eventually slept) of a story retelling carried out by a female speaker from Rio de Janeiro. The content of the other tiers is explained later on in this text.
For normalizing the duration of the V-to-V intervals, the PDE script uses the z-score transformation given in Equation (1), where dur is the V-to-V duration in ms and the pair (μ_i, var_i) denotes the reference mean and variance in ms of the phones within the corresponding interval. These reference descriptors are found in the file TableOfReal included with the script.
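The normalization can be sketched as follows. The exact form is an assumption here: the sketch subtracts the sum of the reference means of the phones in the interval from the measured duration and divides by the square root of the sum of their reference variances. The function name `vv_zscore` and the reference values are hypothetical, and the form should be checked against Equation (1) and the PDE manual.

```python
import math

def vv_zscore(dur_ms, phone_stats):
    """z-score of one V-to-V interval.

    dur_ms: measured V-to-V duration in ms.
    phone_stats: (reference mean, reference variance) for each phone
    in the interval. ASSUMED form: (dur - sum of means) / sqrt(sum of variances).
    """
    mean_sum = sum(mean for mean, _ in phone_stats)
    var_sum = sum(var for _, var in phone_stats)
    return (dur_ms - mean_sum) / math.sqrt(var_sum)

# Hypothetical reference values for a two-phone V-to-V interval.
stats = [(90.0, 400.0), (70.0, 500.0)]
z = vv_zscore(220.0, stats)  # (220 - 160) / sqrt(900)
```

Summing the reference statistics over the phones is what lets the score discount the number and intrinsic duration of the segments in the unit.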
A smoothing technique is then used, which consists of serially applying a 5-point moving average filter, given by Equation (2), to the sequence of z-scores obtained from the previous stage.
This technique minimizes the effects of intrinsic duration and number of segments in the V-to-V unit, as well as attenuates the minor effects of duration variation related to the realization of lexical stress in the speech chain. Local peaks of smoothed z-scores are then detected by tracking the position for which their discrete first derivative changes from a positive to a negative value.
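A minimal sketch of the smoothing and peak-picking steps is given below. It assumes an unweighted 5-point window that shrinks at the edges, applied once; the actual weights of Equation (2) and the number of passes used by the PDE script may differ. The input values are invented for illustration.

```python
def moving_average5(z):
    """Unweighted 5-point moving average (ASSUMED weights).
    The window shrinks at the edges of the series."""
    out = []
    for i in range(len(z)):
        window = z[max(0, i - 2): i + 3]
        out.append(sum(window) / len(window))
    return out

def local_peaks(series):
    """Indices where the discrete first derivative changes
    from a positive to a negative value."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return [
        i + 1
        for i in range(len(diffs) - 1)
        if diffs[i] > 0 and diffs[i + 1] < 0
    ]

# Hypothetical z-scores for 13 consecutive V-to-V units.
z = [0.0, 0.2, 1.0, 3.0, 1.5, 0.1, 0.0, 0.3, 2.0, 5.0, 2.5, 0.2, 0.0]
smoothed = moving_average5(z)
peaks = local_peaks(smoothed)
```

In this toy series, the two detected maxima fall on the units with the largest raw z-scores.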
Previous research demonstrated good correspondence between these local peaks of syllabic duration and the perception of both prominent and pre-boundary syllables with correlations between 69 and 82% for reading (
Barbosa 2008), provided that silent pauses, when applicable, were included in the corresponding V-to-V interval, as can be seen in
Figure 1 for “iU”. A local peak of smoothed z-scores is considered here an index of prosodic strength and the duration of the silent pause, when present, is an integral part of the signaling of this strength. It is related to the fact that the longer a silent pause is, the stronger the perception of a boundary (
Sanderman and Collier 1995). The same applies when a local peak of smoothed z-scores signals the prominence of a syllable in a word. This is why these maxima were taken as indicators of the right edges of stress groups in BP, a language for which syllabic duration is the main parameter for signaling lexical and phrase stress (
Fernandes 1976;
Massini 1991;
Barbosa 1996). When a silent pause is part of the V-to-V interval and it has a duration peak, this does not mean that it is part of the stress group but only that the stretch of sound at the left of this pause ends the stress group. For the purpose of this study, the interval between two immediately consecutive smoothed z-score maxima is called the “stress group”. The stress group delimited in such a way is taken as a prosodic constituent that ends in a prominent unit (often an informational focus) or a prosodic boundary, either terminal or non-terminal.
The PDE script also delivers 12 F0 descriptors for the intervals of the tier specified by the user, here, the V-to-V intervals tier. From these 12 parameters, we selected four F0 descriptors: F0 median, F0 range (F0 maximum minus F0 minimum), and F0 rise and F0 fall mean rates. The latter two descriptors were computed as the first derivatives of the smoothed and interpolated F0 contour. Smoothing and interpolation employed embedded Praat functions with 5 Hz as the cutoff frequency for smoothing and a quadratic function for interpolation, avoiding small oscillations and gaps of the F0 contour before computing the derivative. These four F0 descriptors in particular were chosen due to their relevance for signaling prominence and boundary in the literature (see
Mittmann and Barbosa 2016 for a review). This relevance is illustrated in
Figure 1, where the values of the V-to-V smoothed z-scores at the right ends of stress groups 31 and 32 (27.86 and 33.82) correspond to V-to-V units followed by long pauses. Furthermore, by the end of stress group 32, an F0 rise (INC) signals a non-terminal boundary (NTB). In those cases, the F0 range and F0 rise rate are better descriptors of F0 shape in signaling this type of boundary.
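The four selected descriptors can be sketched for a single V-to-V interval as follows. This is an illustration under assumptions, not the PDE implementation: the F0 samples are taken to be already smoothed and interpolated, the rates are expressed in Hz/s, and the mean rise and fall rates are computed as the means of the positive and negative sample-to-sample slopes.

```python
import statistics

def f0_descriptors(f0_hz, step_s):
    """F0 median, range, and mean rise/fall rates for one interval.

    f0_hz: smoothed, interpolated F0 samples in Hz (assumed input).
    step_s: time between consecutive samples, in seconds.
    """
    slopes = [(b - a) / step_s for a, b in zip(f0_hz, f0_hz[1:])]
    rises = [s for s in slopes if s > 0]
    falls = [s for s in slopes if s < 0]
    return {
        "median": statistics.median(f0_hz),
        "range": max(f0_hz) - min(f0_hz),
        "rise_rate": sum(rises) / len(rises) if rises else 0.0,
        "fall_rate": sum(falls) / len(falls) if falls else 0.0,
    }

# Hypothetical rising-falling contour sampled every 10 ms.
d = f0_descriptors([180.0, 190.0, 210.0, 200.0, 195.0], step_s=0.01)
```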
It is exactly the relation between the four F0 descriptors and duration maximum positions that realizes each one of the three prosodic functions, which are further discussed in the next section. We illustrate this relation in
Figure 2 before presenting the prosodic functions in the next section. By examining the very first terminal boundary position at the extreme left of
Figure 2, one can observe a syllabic duration peak coinciding with a local minimum of the F0 contour. The “meeting” of these two landmarks, a local maximum of syllabic duration and a local minimum of F0, unequivocally marks a terminal boundary of declarative utterances for a Brazilian listener (
Moraes 1998). What is more, the relevance of both F0 minima and lower rates of F0 decrease, as we will see later on, allows the listener to distinguish the terminal boundary from the non-terminal boundary, such as the ones shown in the antepenultimate and penultimate positions in this figure. In the penultimate position, the non-terminal boundary is marked by both a local duration peak and an F0 peak, soon followed by an F0 fall. The two dashed positions, on the other hand, indicate prominences by F0 rising (LH) and falling (HL) contours, respectively. In both positions, the duration peaks are less salient in contrast to the salient local peaks of the boundary positions, as revealed by their heights.
Analyses of the parameters investigated here were carried out for two regions (São Paulo and Rio de Janeiro) and two genders (males and females). The reason to split between the two genders and two regions is to suggest future work on the contribution of these factors, as well as to be able to engage with the previous literature, as presented in the Discussion section.
2.3. Prosodic Function Annotation
For each stress group automatically generated by the PDE script, the two authors independently annotated, based on their respective perception, which one of the three functions studied here, i.e., the terminal boundary (TLB), non-terminal boundary (NTB), or prominence (PRO), was realized at its right edge. After that stage, both authors discussed cases of initial disagreement until they achieved consensus. This was necessary for less than 1% of the labels given. Previous research revealed that even lay listeners perceive these functions with a fair to high degree of inter-rater reliability (
Raso and Mello 2012;
Cole and Shattuck-Hufnagel 2016).
The detection of the positions where one of the three functions investigated here was realized and signaled by a peak of normalized/smoothed syllabic duration had performance compatible with previous tests (e.g., success rates between 69 and 82% for reading in
Barbosa 2008). As regards missing errors, 17% of the prosodic events were not detected. In addition, the procedure produced circa 6% of false alarms. A false alarm (FA) stands for a duration peak that does not correspond to any of the functions studied here. Thus, the procedure has a success rate of 77%, which means that the use of syllable-size duration for detecting both prominence and boundary is quite successful. The relative frequency of hesitations was 7% for São Paulo and 5% for Rio de Janeiro speakers. Gender-wise, the hesitations were 10% for males and 11% for females.
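The success rate follows from simple bookkeeping, sketched below under the assumption that misses and false alarms are both expressed as percentages of the same total and are subtracted from 100%. The function name is illustrative only.

```python
def success_rate(miss_pct, false_alarm_pct):
    """Percentage left after removing misses and false alarms
    (ASSUMES both are percentages of the same total)."""
    return 100 - miss_pct - false_alarm_pct

rate = success_rate(miss_pct=17, false_alarm_pct=6)
```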
Hesitations were not analyzed here. If one of these three functions was realized at places not at the right edge of the stress groups, the label NDT (not detected) before the label for the performed function was used. For instance, NDT_NTB signals that the specific non-terminal boundary was not realized at the right edge of the automatically delimited stress group. The total number of labels, excluding the FA labels, was 500 for the speakers of São Paulo and 272 for the speakers of Rio de Janeiro distributed in the following way, according to the region: 34 TLB, 296 NTB, and 170 PRO for São Paulo and 39 TLB, 153 NTB, and 80 PRO for Rio de Janeiro.
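The label counts above can be converted into relative frequencies per region with a few lines of code. This sketch pools all speakers of a region, so the resulting percentages may differ slightly from values averaged across speakers; the dictionary keys are shorthand chosen for the example.

```python
# Label counts reported above, FA labels excluded.
counts = {
    "SP": {"TLB": 34, "NTB": 296, "PRO": 170},
    "RJ": {"TLB": 39, "NTB": 153, "PRO": 80},
}

def proportions(region_counts):
    """Percentage of each label within one region (pooled speakers)."""
    total = sum(region_counts.values())
    return {label: 100 * n / total for label, n in region_counts.items()}

shares = {region: proportions(c) for region, c in counts.items()}
totals = {region: sum(c.values()) for region, c in counts.items()}
```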
For each boundary function, both terminal and non-terminal, the shape of the melodic contour that ended the stress group was labeled as follows: increasing (INC), decreasing (DEC), increasing–decreasing (INCDEC), leveled (LEV), and decreasing with a slow F0 rate (SLDEC). All decreasing contours with a falling rate between 2.5 and 15 Hz every 50 ms were classified as a case of slow decreasing (SLDEC); those with a falling rate below 2.5 Hz every 50 ms were classified as LEV and the others as DEC.
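The thresholds for decreasing contours can be written as a small decision rule. The sketch below assumes that the rate is given as an absolute fall in Hz per 50 ms and that both limits of the 2.5 to 15 Hz band are inclusive, which the description above does not specify.

```python
def classify_decreasing(fall_rate_hz_per_50ms):
    """Label a decreasing contour from its falling rate (Hz per 50 ms).
    Inclusive band limits are an ASSUMPTION."""
    rate = abs(fall_rate_hz_per_50ms)
    if rate < 2.5:
        return "LEV"      # practically leveled
    if rate <= 15:
        return "SLDEC"    # slow decreasing
    return "DEC"          # decreasing

labels = [classify_decreasing(r) for r in (1.0, 8.0, 30.0)]
```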
As for the prominence function, we marked the F0 shapes by conventional symbols from intonation research (monotonal and bitonal units): leveled high F0 peak (H), falling contour (HL), and rising contour (LH), where the right tone of the bitonal contours was aligned with the stressed syllable. This choice of annotation allows for a more direct comparison with the previous literature, which uses such a convention. The H, HL, and LH labels were chosen according to criteria found in
Lucente (
2012) referring to the F0 shape during the stressed vowel in this way: the LH label refers to a rising F0 preceded by an F0 fall; the H label refers to a less sharp increase just before and leveled during the stressed vowel and not preceded by an F0 fall; the HL label refers to an F0 fall during the stressed vowel preceded by an F0 rise.
2.4. Statistical Analyses
To quantify the systematicity of the relationships between each F0 descriptor and the positions in which an F0 shape signals a function among the three studied here, we calculated the cross-correlation of these two time series for each function: (1) a series of values of 0 and 1 along the succession of V-to-V units, in which 0 meant that the particular function was not realized in the current unit and 1 meant that it was realized there, and (2) a series of each one of the four F0 descriptor values computed in each V-to-V unit in Hertz. This was conducted for each speaker. To do this, we used the
ccf() function for cross-correlation between two time series available in the
tseries package on the R statistical platform (
R Project n.d., version 4.3.1). For a similar analysis of dialogs analyzing the correspondence between F0 and intensity contours, see
Buder and Eriksson (
1999).
These cross-correlations were computed around a window of five V-to-V units before and after the duration peak to investigate for which lag these landmarks have the highest correlation with a particular F0 contour descriptor. A lag of 0 (zero) means that the maximum duration position and the F0 descriptor series are compared directly, without moving one series in relation to the other, that is, the F0 descriptor maximum or minimum corresponds exactly with the duration maximum position. A lag of 1 (one) means that the maximum duration position series is compared with the F0 descriptor series moved one V-to-V unit to the right, allowing us to investigate whether the series of F0 descriptor minima or maxima one V-to-V unit to the left of the duration maximum is better correlated with the latter parameter. A lag of −1 (minus one) means that the maximum duration position series is compared with the F0 descriptor series moved one V-to-V unit to the left, allowing us, on the other hand, to investigate whether the series of F0 descriptor minima or maxima one V-to-V unit to the right of the duration maximum achieves higher correlations. The same reasoning applies to higher lag values. The significance level for the cross-correlations was set to 0.05.
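The lag convention described above can be made concrete with a small pure-Python cross-correlation that follows the same convention as R's `ccf(x, y)`, in which lag k pairs x[t + k] with y[t]. This is an illustrative re-implementation, not the code used in the study; the two series are invented, and the threshold 1.96/√n is the approximate 95% bound that R draws by default, assumed here as the significance criterion.

```python
import math

def ccf(x, y, max_lag=5):
    """Cross-correlation of x and y for lags -max_lag..max_lag.
    Lag k correlates x[t + k] with y[t] (same convention as R's ccf)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    norm = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    result = {}
    for k in range(-max_lag, max_lag + 1):
        total = sum(dx[t + k] * dy[t] for t in range(n) if 0 <= t + k < n)
        result[k] = total / norm
    return result

# Hypothetical series over 12 V-to-V units: the function is realized at
# units 3 and 8, and the F0 range peaks one unit earlier (units 2 and 7).
function_pos = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
f0_range = [10, 12, 40, 11, 9, 10, 12, 45, 10, 11, 9, 10]

r = ccf(function_pos, f0_range)
best_lag = max(r, key=r.get)
threshold = 1.96 / math.sqrt(len(function_pos))  # approximate 95% bound
```

Because the F0 range maxima sit one unit to the left of the marked positions, the highest correlation appears at lag 1, in line with the interpretation of positive lags given above.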
Proportion tests were used for testing the significance of differences in the proportions of functions between the two regions studied here and between speakers, both at the 0.05 significance level. For this purpose, the prop.test() function in R was used.
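A two-sample proportion test of this kind can be sketched in pure Python as a chi-square test with continuity correction on a 2 × 2 table, which is analogous to what `prop.test()` reports for two groups. The function below is an illustrative stand-in, not the R implementation; the counts are the terminal boundary labels out of all labels per region reported in Section 2.3.

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Chi-square test (1 df, Yates continuity correction) comparing
    x1/n1 with x2/n2. Returns (chi2, p_value)."""
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    n = n1 + n2
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / (n1 * n2 * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# TLB labels: 34 of 500 in São Paulo versus 39 of 272 in Rio de Janeiro.
chi2, p_value = two_proportion_test(34, 500, 39, 272)
significant = p_value < 0.05
```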
3. Results
The great majority of studies on BP intonation, both from phonological and phonetic perspectives, have investigated either the dialect of São Paulo (
Madureira 1994,
2016;
Madureira and Fontes 1997) or of Rio de Janeiro (
Moraes 1998,
2008;
Carnaval et al. 2022). For the sake of comparison, in the following, we refer to significant differences and commonalities found between the two dialects. Furthermore, new findings beyond the analysis of controlled speech with isolated utterances usually found in those previous studies were obtained in the present one.
Figure 3 shows the relative frequency of the prosodic functions under scrutiny here for the two regions. Non-terminal boundaries are the most common function realized in story retelling, with proportions higher than 55% in both regions. According to a proportion test, terminal boundaries are significantly more frequently used by participants from Rio de Janeiro (14%, ranging from 11 to 16%) than by participants from São Paulo (7%, ranging from 3 to 12%). The amount of data for this comparison (a total of 34 TLB in São Paulo and 39 TLB in Rio) is higher than the usual minimum number of 30 data points suggested for paired comparisons including proportion tests (
Dowdy and Wearden 1991). The differences in proportion between the two regions for the other two functions are not significant.
Considering the limited number of speakers, it is important to check whether the results are relatively homogeneous across speakers or are an artifact of computing the average. They do not seem to be such an artifact, as can be seen in
Figure 4. There are, however, some exceptions.
Figure 4 shows that male speaker AM from São Paulo has proportions that differ from the general tendency either for males or for São Paulo speakers: he uses the prominence function more often than he signals non-terminal boundaries. Furthermore, speakers MV (male) and SP (female) from São Paulo use terminal boundaries very infrequently.
Table 1 indicates the relative frequencies of the F0 shapes associated with each of the functions in story retelling according to dialect. The results are pooled for all speakers from each dialect. The two most frequent shapes for non-terminal boundaries are level and increasing, making up circa 86% in RJ and 80% in SP of all shapes, with a significant preference for increasing contours in São Paulo against level in Rio, as confirmed by a proportion test (χ² = 4.0, p = 0.04 in São Paulo; χ² = 8.3, p = 0.004 in Rio). For prominence, the most frequent shape is the high tone (H) followed by the LH contour in both dialects. Together, they account for 80% of all shapes in both RJ and SP. As for the terminal boundaries, the decreasing shape is by far the most relevant melodic form for realizing this function.
The only two significant differences across gender were that (1) male speakers use the slow decreasing shape more than female speakers, a property already found in a previous study in BP reading (
Barbosa and Mareüil 2016), and that (2) there is a significant preference for the high tone in female speakers in the prominence position (circa 61% against circa 42% in males).
With regard to the cross-correlations between the time series whose contours were illustrated in
Figure 2, only significant results are shown in
Table 2, both by dialect and by gender. Only correlations within the window of lags −2 and 2 are taken into account due to the extension of the phonological words which generally include one to two prestressed and one to two post-stressed syllables. As can be seen in
Table 2, higher correlations were found for lag 0, with only a single exception (TLB for females for the F0 rise, lag −2). Only the two highest significant correlations are shown because the others are lower than 5%, not taken into consideration here.
F0 range and F0 rise are the most relevant descriptors for non-terminal boundaries when correlations with duration maxima are concerned. With the exception of speakers from Rio de Janeiro, correlations between the F0 range and duration maxima for São Paulo and for both genders are between circa than 30% and 40%. This means that there is a tendency for the F0 range to be higher just before non-terminal boundaries, where syllabic duration maxima are positioned.
The F0 range is a relevant descriptor for the correlation of F0 and duration for the prominence function for males but not females. Looking at the data according to regions, the F0 range is the most relevant descriptor as far as correlations with duration are concerned in both regions. The higher correlation of F0 fall with duration in Rio and in both genders suggests that an F0 fall in the syllabic unit where there is a maximum of duration is a relevant cue for marking prominence in BP.
It is likely that the non-significant correlations for terminal boundaries for the less represented dialect (RJ) and for male speakers are due to the small amount of data for these cases. For São Paulo, however, the correlations of F0 range and F0 fall with duration maxima are the most relevant descriptors. This is compatible with the decreasing shape. Female speakers, on the other hand, seem to privilege the convergence of duration maxima with an F0 rise two syllabic units before the position of the local maximum, which suggests their falls are much more variable and converge less with duration maxima.
Figure 5 shows individual values of cross-correlations between the position of syllabic duration peak and the four F0 descriptor values for lags −5 to 5. The importance of the splitting of the data according to speaker is to show that the correlations shown in
Table 2 are not a result of averaging but an effect found in the great majority of the speakers. The results shown here are split according to lag, function and subject. If we consider lags between −2 and 2, which would imply examining up to two syllables around the stressed syllable, where usually duration maxima are found, a lower spread of the correlation values across the individuals for a particular combination of function and F0 descriptor signals a higher consistency of the interplay between syllabic duration and F0 for that particular function. Following this rationale, for non-terminal boundaries, F0 range and F0 rise are, in that order, the most relevant descriptors, which confirms the pooled values of
Table 2.
For prominence, the bars are more concentrated in the window [−2, 2] for the F0 range, F0 fall, and F0 rise, in that order. For terminal boundaries, the situation is more complex, presenting much more inter-individual variation with a higher concentration in the same lag window for F0 falls, as expected, with cross-correlations around 30% for the subjects with the two shades of green (LSRJ and MVSP). This value is similar to the correlations for non-terminal boundaries with the F0 range as a descriptor.
An important aspect of F0 dynamics concerns the rates of rises and falls, as well as the amount of the F0 range where significant correlations apply. For São Paulo speakers, terminal boundaries are signaled by decreasing F0 contours, with a median rate of 2.4 Hz/50 ms against 3.5 Hz/50 ms in other positions (corresponding to the points in the data series which include positions for the two other functions and positions where none of the three functions studied here is realized); that is, F0 falls are slower when realizing terminal boundaries (see
Barbosa 2024 for similar results for the same dialect).
As for F0 rises before non-terminal boundaries, the rates for São Paulo speakers are 8.6 Hz/50 ms in the window where non-terminal boundaries are realized against 5.0 Hz/50 ms elsewhere, and for Rio speakers, 6.2 Hz/50 ms in the window where non-terminal boundaries are realized against 5.0 Hz/50 ms elsewhere. This means that F0 rises are faster when realizing non-terminal boundaries in both varieties but with a larger difference in São Paulo (see
Barbosa 2024 for similar results for another corpus of São Paulo speakers).
A similar pattern applies to prominences for São Paulo speakers in terms of F0 rises: 7.1 Hz/50 ms in the window where prominences are realized against 5.4 Hz/50 ms elsewhere; for Rio speakers, the figures are 7.2 Hz/50 ms in the window where prominences are realized against 5.0 Hz/50 ms elsewhere, a very similar result in comparison with São Paulo. As for the F0 range, the patterns are as follows: 28.8 Hz in the window where prominences are realized against 20.0 Hz elsewhere for São Paulo; for Rio, the values are 31.6 Hz in the window where prominences are realized against 17.2 Hz elsewhere. As can be seen, the figures for rates are quite close in both varieties.
In the prominent position, F0 falls for female speakers have a rate of 8.4 Hz/50 ms in the window where prominences are realized against 4.1 Hz/50 ms elsewhere; for male speakers, the figures are 3.8 Hz/50 ms in the window where prominences are realized against 2.7 Hz/50 ms elsewhere.
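To make the rate comparisons above easier to read, the sketch below converts the reported rates from Hz per 50 ms to Hz per second and computes the ratio between the rate in the window where a function is realized and the rate elsewhere. The values are copied from the text; the dictionary keys and the helper names `to_hz_per_second` and `window_ratio` are introduced here for illustration only.

```python
# Rates copied from the text, in Hz per 50 ms, given as magnitudes:
# (in the window where the function is realized, elsewhere).
RATES_HZ_PER_50MS = {
    "SP terminal falls": (2.4, 3.5),
    "SP non-terminal rises": (8.6, 5.0),
    "RJ non-terminal rises": (6.2, 5.0),
    "SP prominence rises": (7.1, 5.4),
    "RJ prominence rises": (7.2, 5.0),
    "female prominence falls": (8.4, 4.1),
    "male prominence falls": (3.8, 2.7),
}


def to_hz_per_second(rate_per_50ms):
    """Convert a rate in Hz per 50 ms to Hz per second (20 windows of 50 ms per second)."""
    return rate_per_50ms * (1000.0 / 50.0)


def window_ratio(in_window, elsewhere):
    """Ratio above 1 means a faster F0 movement inside the function window."""
    return in_window / elsewhere


for label, (inside, outside) in RATES_HZ_PER_50MS.items():
    print(
        f"{label}: {to_hz_per_second(inside):.0f} Hz/s in window, "
        f"{to_hz_per_second(outside):.0f} Hz/s elsewhere, "
        f"ratio {window_ratio(inside, outside):.2f}"
    )
```

Read this way, the only ratio below 1 is the one for terminal falls in São Paulo, in line with the observation that falls are slower when realizing terminal boundaries, whereas rises at non-terminal boundaries and F0 movements at prominences are faster than elsewhere.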
4. Discussion
With respect to the previous literature on the prosody of BP, the results of this study add to the current knowledge on the matter, first by analyzing a higher number of speakers and by exploring spontaneous speech in the story retelling style. In fact, with just a few exceptions, such as the study of focus by
Carnaval et al. (
2022) with four speakers from Rio de Janeiro, the majority of past and recent studies on BP rely on results obtained from isolated utterances (see, for instance,
Moraes 1998,
2008;
Madureira 1994;
Miranda et al. 2021,
2022). Moreover, none of the studies on spontaneous speech investigated the frequency of F0 shapes and the relation of F0 descriptors to syllabic duration maxima in a systematic way. The work by
Lucente (
2012) with speakers from São Paulo, for instance, occasionally pointed to some aspects of the shapes of F0 at boundary position, while the work by
Teixeira et al. (
2018), with speakers from Minas Gerais, studied the relative importance of duration and F0 descriptors for the signaling of boundaries but not the amount of their correlation. Although in their material, instances of narrative excerpts can be found, their analysis did not consider these instances as an independent factor of investigation.
Our results show that in story retelling, the most frequently realized function is non-terminal boundary marking, followed by the signaling of prominence. The preference for non-terminal boundaries is compatible with the need to chain stretches of speech to tell a story and to call attention to the most relevant remembered facts. Non-terminal boundaries and prominences represent almost 90% of the instances of the three functions studied here. Non-terminality is signaled by increasing and level F0 shapes; the former has fast F0 rises, which produce expanded F0 ranges partially synchronized with syllabic duration maxima in the unit just before the boundary. To the best of our knowledge, the association of faster F0 rises with duration maxima in realizing non-terminality is a new finding of this study. The relevance of F0 rise peaks for non-terminality is a consequence of the use of increasing contours for realizing non-terminal boundaries.
An important aspect of the F0 dynamics for terminal boundaries, which is significant despite the limited amount of data (about 30 data points per dialect), is the use of slower falls for signaling terminality in comparison to the rate of falls elsewhere, a finding not pointed out by previous studies on BP intonation.
As for prominence, the F0 range is expanded when realizing this function accompanied by earlier synchronization between the F0 fall and duration maxima within the syllabic unit where prominence is realized mainly by an H tone. This fall allows F0 to reach low values which are associated with a lengthened syllabic unit. This behavior is similar to the findings by
Arvaniti et al. (
2024) on the trade-off between duration and F0 around accented units, where F0 valleys accompany longer stressed syllables when realizing pitch accent.
A certain amount of interspeaker variability is part of prosodic studies on speaking style (see
Perkell et al. 2002;
Yoon 2014;
Barbosa 2022, inter alia) and this is not different for story retelling, as shown in
Figure 4 and
Figure 5, already commented upon here. A further investigation of prosodic differences across speakers, including the ones studied here, associated with the study of the effect of different retellings on listeners, could contribute to coaching on storytelling, making listening to stories a more pleasant experience for a target audience. Studies on the relation of poetry declamation and pleasantness (
Wagner and Betz 2023 for German;
Barbosa 2022 for Brazilian and European Portuguese) have results in this direction.
The rising shape (LH), the second most frequent shape for signaling prominence in the present study, is described by
Moraes (
2008) as a default realization of narrow focus for the dialect of Rio de Janeiro and later on by
Lucente (
2012) for the São Paulo dialect, the former for read speech and the latter for spontaneous speech. Nevertheless, both authors only described the LH shape as being the most common in the dialect they studied, without computing its frequency, as we showed in
Table 1: circa 21% in Rio and 24% in São Paulo. In our study, it is the second most frequent and not the most frequent, as in Moraes’ and Lucente’s studies. As for the terminal boundaries, the decreasing shape is the most relevant melodic form for realizing this function. It is proposed as the prototypical realization at the right end of neutral declaratives in BP by
Moraes (
1998,
2008) in read speech. Based on the present study, terminal boundaries are realized mostly with the same decreasing shape in story retelling.
Some findings from the current study deserve further investigation. One of them is the significantly distinct proportions of instances of terminal boundaries between the two regions, with a higher proportion for Rio de Janeiro (14% against 7% in São Paulo, see
Figure 3). Having additional data from the two regions could reveal, if this difference in proportion is confirmed, that speakers from Rio de Janeiro complete thematic excerpts of the story being told more often than speakers of São Paulo. Another finding is related to interspeaker differences, especially regarding the frequency of terminal and non-terminal boundaries. This could benefit studies of the effect of different storytelling and story retelling on listeners in terms of different degrees of pleasantness depending on the uses of the types of boundaries (see
Figure 4 for some differences across speakers). The differences across gender, like the finding that female speakers are faster in signaling prominences by a previous sharp fall, also deserve a more extended investigation with more speakers and a balanced corpus in this respect. As the study by
Barbosa and Mareüil (
2016) showed, this could contribute to a perception of more musical prosody in female speakers in BP.
5. Conclusions
The results of this study derive from the use of a new methodology for investigating the relation between the two main prosodic parameters (syllable duration and F0 contours) for signaling prominence and boundaries. The results presented here stress the importance of investigating the correspondence between F0 contours and syllabic duration contours to further the understanding of how prosodic functions are realized.
The examination of local syllabic duration maxima and the four F0 descriptors revealed that these maxima act as landmarks for particular F0 shapes: for non-terminal boundaries, the great majority of shapes were increasing and increasing–decreasing patterns; for terminal boundaries, almost all shapes were decreasing F0 patterns; and for prominence marking, the vast majority of shapes were high tones across the stressed syllable.
Time series analyses revealed significant correlations between duration and specific F0 descriptors, pointing to a rule-governed interplay between F0 and syllabic duration patterns in Brazilian Portuguese story retelling. The cross-correlation values obtained in our analysis of the data indicate that the right edge of stress groups in BP, primarily marked by peaks of normalized duration of syllable-size units (the V-to-V unit), is the characteristic place where duration and F0 landmarks meet.
What is more, expanded F0 ranges and faster or slower rates of F0 contours are significant aspects of the dynamics of boundary marking in BP story retelling, findings that could stimulate cross-linguistic work on the prosody of storytelling and story retelling.