The Perception of Audio-Visual Composites: Accent Structure Alignment of Simple Stimuli - Scott Lipscomb 2005


Lipscomb, S. D. (2005). The perception of audio-visual composites: Accent structure alignment of simple stimuli. Selected Reports in Ethnomusicology, 12, 37-67.

The Perception of Audio-visual Composites:

Accent Structure Alignment of Simple Stimuli

SCOTT D. LIPSCOMB, Northwestern University

This investigation examines the relationship between musical sound and visual images when they are paired in simple animated sequences. Based on a model of film music proposed in 1996 by Lipscomb and Kendall, the study focuses specifically on the relationship of points perceived as accented musically and visually. The following research questions were answered: (1) What are the determinants of "accent" (salient moments) in the visual and auditory fields? and (2) Is the precise alignment of auditory and visual strata necessary to ensure that an observer finds the combination effective? In this experimental study, two convergent methods were used: a verbal scaling task and a similarity judgment task. Three alignment conditions were incorporated: consonant (accents in the music occur at the same temporal rate and are perfectly aligned with accents in the visual image), out-of-phase (accents occur at the same rate, but are perceptibly misaligned), or dissonant (accents occur at different rates). Results confirmed that VAME ratings are significantly different for the three alignment conditions. Consonant combinations were rated highest, followed by out-of-phase combinations, and dissonant combinations received the lowest ratings. Subject similarity judgments in response to these simple stimuli divided clearly into three dimensions: visual component, audio component, and alignment condition, further confirming the significance of the alignment of accent strata.

In contemporary society, the human sensory system is bombarded by sounds and images intended to attract attention, manipulate state of mind, or affect behavior. Patients awaiting a medical or dental appointment are often subjected to the "soothing" sounds of Muzak as they sit in the waiting area. Trend-setting fashions are displayed in mall shops blaring the latest Top 40 selections to attract their specific clientele. Corporate training sessions and management presentations frequently employ not only communication through text and speech, but a variety of multimedia types for the purpose of attracting and maintaining attention, for example, music, graphs, and animation. Recent versions of word processors allow the embedding of


sound files, animations, charts, equations, pictures, and information from multiple applications within a single document. Even while standing in line at an amusement park or ordering a drink at the local pub, the presence of television screens providing aural and visual "companionship" is now ubiquitous. In each of these instances, music is assumed to be a catalyst for establishing the mood deemed appropriate, generating desired actions, or simply maintaining a high level of interest among participants within a given context.

Musical affect has also been claimed to result in increased labor productivity and reductions in on-the-job accidents when music is piped into the workplace (Hough 1943; Halpin 1943-1944; Kerr 1945), though these studies are often far from rigorous in their method and analysis (McGehee and Gardner 1949; Cardinell and Burris-Meyer 1949; Uhrbock 1961). Music therapists claim that music has a beneficial effect in the treatment of some handicapped individuals and as a part of physical rehabilitation following traumatic bodily injury (Brusilovsky 1972; Nordoff and Robbins 1973; an opposing viewpoint is presented by Madsen and Madsen 1970). Individuals use music to facilitate either relaxation or stimulation in leisure activities. With the increase in leisure time during the 1980s (Morris 1988), many entertainment-related products began to utilize music to great effect in augmenting the aesthetic affect of these experiences. Executives of advertising agencies have realized the impact music has on attracting a desired audience, as evidenced recently by the use of classic rock songs to call baby-boomers to attention or excerpts from the Western art music repertoire to attract a more "sophisticated" audience.

One of the most effective uses of music specifically intended to manipulate perceptual response to a visual stimulus is found in motion pictures and animation. The present study investigated the relationship of events perceived as salient (accented), both aurally and visually. As a result, this study focused on an aspect of the motion picture experience that had never before been addressed explicitly in music perception literature. Many studies had examined associational and referential aspects of both sound and vision. Some investigations had even examined explicitly the relationship of music to visual images in the context of the motion picture experience. However, none have proposed an explicit model based on stratification of accent structures or set out to test the audio-visual relationship on the basis of accent structure alignment.

Before considering the specific interrelationship between the aural and visual components of animated sequences, several issues were carefully examined. First, what are the determinants of "accent" (points of emphasis) in the visual and auditory fields? Second, is it necessary for accents in the musical soundtrack to line up precisely with points of emphasis in the visual modality in order for the combination to be considered effective? The ultimate goal of this line of research is to determine the fundamental principles governing interaction between the auditory and visual components in the motion picture experience.

Related Literature

To the present, there has been little empirical work specifically directed at studying the symbiotic relationship between the two primary perceptual modalities


normally used in viewing films (Lipscomb 1990; Lipscomb and Kendall 1996). In the field of perceptual psychology, interaction between the aural and visual sensory modalities is well documented (see, for example, Radeau and Bertelson 1974; Staal and Donderi 1983; Bermant and Welch 1976; Ruff and Perret 1976; Massaro and Warner 1977; Regan and Spekreijse 1977; and Mershon, Desaulniers, Amerson, and Kiefer 1980). For a detailed discussion of film music research (Tannenbaum 1956; Thayer and Levenson 1984; and Marshall and Cohen 1988), see Lipscomb (1995) and Lipscomb and Kendall (1996). The latter paper was included in a special issue of Psychomusicology (vol. 13) devoted to the topic of film-music research, including investigations by a wide array of scholars (Thompson, Russo, and Sinclair 1996; Bolivar, Cohen, and Fentress 1996; Lipscomb and Kendall 1996; Bullerjahn and Guldenring 1996; Sirius and Clarke 1996; Iwamiya 1996; and Rosar 1996). Though the list is not long, there have been many approaches to the study of combined sound and image. Marilyn Boltz and her colleagues have investigated the relationship between the presence of musical sound and memory for filmed events and their duration (Boltz 1992; Boltz 2001; and Boltz, Schulkind, and Kantra 1991). Krumhansl and Schenck (1997) investigated the relationship between dance choreography by Balanchine and the music that inspired it, Mozart's Divertimento No. 15. In a study by Vitouch (2001), subjects, after seeing a brief film excerpt with one of two contrasting musical soundtracks, provided a written prediction of how the plot would continue, revealing that anticipations of future events are "systematically influenced" by the accompanying musical sound (p. 70). None of these investigations, however, addressed the synchronization between the musical and visual components of the motion picture experience.

Proposed Model and Its Foundation

What is the purpose of a musical soundtrack? An effective film score, in its interactive association with the visual element, need not attract the audience member's attention to the music itself. In fact, the most successful film composers have made a fine art of manipulating audience perception and emphasizing important events in the dramatic action without causing a conscious attentional shift. When watching a film, a typical audience member's perception of the musical component often remains at a subconscious level (Lipscomb 1989).

Marshall and Cohen (1988) provided a paradigm to explain the interaction of musical sound and geometric shapes in motion entitled the "Congruence-Associationist model." They assumed that, in the perception of a composite A-V presentation, separate judgments were made on each of three semantic dimensions (Evaluative, Potency, and Activity; see Osgood, Suci, and Tannenbaum 1957) for the music and the film, suggesting that these evaluations were then compared for congruence at a higher level of processing.

A model proposed by Lipscomb and Kendall (1996) suggests that there are two implicit judgments made during the perceptual processing of the motion picture experience: an association judgment and a mapping of accent structures (see Figure 1). The association judgment relies on past experience as a basis for determining whether or


[Figure 1 (diagram): the aural and visual stimuli feed into perception; implicit processes evaluate the accent structure relationship and the association judgment, leading either to a shift or no shift of attentional focus.]

Figure 1. Lipscomb and Kendall's (1996) model of Film Music Perception. Reprinted with permission of Psychomusicology.

not the music is appropriate within a given context. For example, a composer may have used legato string lines for "romantic" scenes, brass fanfares for a "majestic" quality, or low-frequency synthesizer tones for a sense of "foreboding." The ability of music to convey such a referential "meaning" has been explored in great detail by many investigators, for example, Heinlein (1928), Hevner (1935 and 1936), Farnsworth (1954), Meyer (1956), Wedin (1972), Eagle (1973), Crozier (1974), McMullen (1976), Brown (1981), and Asmus (1985).

The second implicit judgment (mapping of accent structures) consists of matching emphasized points in one perceptual modality with those in another. Lipscomb and Kendall (1996) proposed that, if the associations identified with the musical style were judged appropriate and the relationship of the aural and visual accent structures


were perceived as consonant, attentional focus would be maintained on the symbiotic composite, rather than on either modality in isolation.

Musical and Visual Periodicity. In the repertoire of mainstream motion pictures, one can find many examples that illustrate the film composer's use of periodicity in the musical structure as a means of heightening the effect of recurrent motion in the visual image. The galley rowing scene from Miklós Rózsa's score composed for Ben-Hur (1959) is an excellent example of the mapping of accent structures, both in pitch and tempo of the musical score. As the slaves pull up on their oars, the pitch of the musical motif ascends. As they lean forward to prepare for the next thrust, the motif descends. Concurrently, as the Centurion orders them to row faster and faster, the tempo of the music picks up accordingly, synchronizing with the accent structure of the visual scene. A second illustration can be found in John Williams' musical soundtrack composed for E.T.: The Extra-Terrestrial (1982). The bicycle chase-scene score is replete with examples of successful musical emulation of the dramatic on-screen action. Synchronization of the music with the visual scene is achieved by inserting 3/8 patterns at appropriate points so that accents of the metrical structure remain aligned with the pedaling motion.

In the process of perception, the perceptual system seeks out such periodicities in order to facilitate data reduction. Filtering out unnecessary details in order to retain the essential elements is required because of the enormous amount of information arriving at the body's sensory receptors at every instant of time. "Chunking" of specific sensations into prescribed categories allows the individual to successfully store essential information for future retrieval (Bruner, Goodnow, and Austin 1958).

Therefore, in the context of the decision-making process proposed by Lipscomb and Kendall (1996), the music and visual images do not necessarily have to be in perfect synchronization for the composite to be considered appropriately aligned. As the Gestalt psychologists found, humans seek organization, imposing order upon situations that are open to interpretation according to the principles of good continuation, closure, similarity, proximity, and common fate (von Ehrenfels 1890; Wertheimer 1925; Kohler 1929; and Koffka 1935). In the scenes described above, the fact that every rowing or pedaling motion was not perfectly aligned with the musical score is probably not perceived by the average member of the audience, even if attention were somehow drawn to the musical score. Herbert Zettl (1990: 380) suggests the following simple experiment: "To witness the structural power of music, take any video sequence you have at hand and run some arbitrarily selected music with it. You will be amazed how frequently the video and audio seem to match structurally. You simply expect the visual and aural beats to coincide. If they do not, you apply psychological closure and make them fit. Only if the video and audio beats are, or drift, too far apart, do we concede to a structural mismatch-but then only temporarily."

The degree to which the two strata must be aligned before perceived synchronicity breaks down has not yet been determined. The present experimental investigation manipulated the relationship of music and image by using discrete levels of synchronization. If successful in confirming a perceived difference between these levels, future research will be necessary to determine the tolerance for misalignment.


Accent Structure Alignment

Two issues had to be addressed before it was possible to consider accent-structure synchronization. First, what constitutes an "accent" in both the visual and auditory domains? Second, which specific parameters of any given visual or musical object have the capability of resulting in perceived accent?

The term "accent" will be used to describe points of emphasis (salient moments) in both the musical sound and visual images. David Huron (1994) defined "accent" as "an increased prominence, noticeability, or salience ascribed to a given sound event." When generalized to visual images as well, it is possible to describe an A-V composite in terms of accent strata and their relationships one to another.

Determinants of Accent. In the search for determinants of accent, potential variables were established by considering the various aspects of visual objects and musical phrases that constituted perceived boundaries. Fraisse (1982: 157) suggested that grouping of constituent elements results "as soon as a difference is introduced into an isochronous sequence." Similarly, in a discussion of Gestalt principles and their relation to Lerdahl and Jackendoff's (1983) generative theory of tonal music, Deliège (1987: 326) stated that "in perceiving a difference in the field of sounds, one experiences a sensation of accent." Boltz and Jones (1986: 428) propose that "accents can arise from any deviation in pattern context."

Following an extensive review of the literature relating to the perception of accent in both the aural and visual modalities, a limited number of potential variables were utilized in creating a musical stimulus set and a visual stimulus set that, considering each modality in isolation, resulted in a reliably consistent perception of the intended accent points. Accents were hypothesized to occur at moments in which a change occurs in any of these auditory or visual aspects of the stimulus. This change may happen in one of two ways. First, a value that remains consistent for a period of time can be given a new value (a series of soft tones may be followed suddenly by a loud tone, or a blue object may suddenly turn red). Second, change in the direction of a motion vector will cause a perceived accent (melodic contour may change from ascending to descending, or the direction of an object's motion may change from horizontal left to vertical up). The variables selected for use in the following experiments are listed in Table 1, along with proposed values for the direction and magnitude characteristics.
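The two kinds of change described above can be expressed as a small detection routine. The following Python sketch is illustrative only (it is not the author's experimental software): it flags a hypothesized accent wherever a steady parameter value takes a new value, or wherever the direction of change reverses.

```python
def accent_points(values):
    """Return indices at which an accent is hypothesized to occur:
    either a constant value is given a new value, or the direction
    of change (the 'motion vector') reverses."""
    accents = []
    prev_delta = 0
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if prev_delta == 0 and delta != 0:
            # Case 1: a steady value is suddenly given a new value.
            accents.append(i)
        elif prev_delta * delta < 0:
            # Case 2: the direction of change reverses (e.g. an ascending
            # contour turns descending, or motion reverses direction).
            accents.append(i)
        prev_delta = delta
    return accents

# A pitch contour that ascends, reverses, plateaus, then jumps:
contour = [60, 62, 64, 62, 60, 60, 60, 67]
print(accent_points(contour))  # [1, 3, 7]
```

The same logic applies to any of the parameters in Table 1, whether musical (pitch, loudness) or visual (screen position, brightness), since both change types are defined over a single numeric stream.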

Method

This study was a quasi-experimental investigation, consisting of a post-test-only, repeated measures factorial design. The experiment was preceded by a series of exploratory studies that assisted in selecting stimulus materials. The main experiment incorporated two independent methods of data collection: verbal ratings and similarity judgments.

Subject Selection

Every participant was required to have seen at least four mainstream, American movies during each of the past ten years, ensuring at least a moderate level of


"enculturation" with this genre of synchronized audio-visual media. Musical training was the single between-subjects grouping variable considered, using the following three levels: untrained (less than two years of formal music training), moderate (two to seven years of formal music training), and highly trained (more than seven years of formal study).

Table 1
Proposed variables to be utilized in the initial exploratory study, labeled with direction vectors and magnitude

Variables       Direction of Change                          Magnitude
Musical
  Pitch         up/unchanging/down                           none/small/large
  Loudness      louder/unchanging/softer                     none/small/large
  Timbre        simple/unchanging/complex                    none/small/large
Visual
  Location      left/unchanging/right,
                up/unchanging/down                           none/small/large
  Shape         simpler/same/more complex                    none/small/large
  Color
    hue         red-orange-yellow-green-blue-indigo-violet   none/small/large
    saturation  purer/unchanging/more impure                 none/small/large
    brightness  brighter/unchanging/darker                   none/small/large

Stimulus Materials

Prior to the main experiment, a series of exploratory studies was run to determine auditory and visual stimuli that are consistently interpreted by subjects as generating an intended accent point. The sources of musical and visual accent delineated in Table 1 were used as a theoretical basis for creating MIDI files and generating computer animations for use as stimuli in this experiment. Both the sound files and the animations were limited to approximately five seconds in length, so that a paired comparisons task could be completed by subjects within a reasonable period of time, as discussed below.

The points of accent were periodically spaced within each musical and visual example. Fraisse (1982: 156) identified temporal limits for the perceptual grouping of sound events. The lower limit (approximately 120ms apart) corresponded closely to the separation at which psychophysiological conditions no longer allowed the two events to be perceived as distinct. The upper limit (between 1500 and 2000ms) represented the temporal separation at which two groups of stimuli are no longer perceptually linked (Bolton 1894; MacDougall 1903). Fraisse suggested a value of 600ms


as the optimum for both perceptual organization and precision. Therefore, the first independent variable utilized in the present experimental procedure, that is, variance of the temporal interval between accent points, consisted of values representing a median range between the limits explicated by Fraisse. This variable had three discrete levels: 500ms, 800ms, and 1000ms. The first and last temporal values allowed the possibility of considering the nesting of accents (within every 1000ms interval, two accents 500ms apart may occur). The 800ms value was chosen because it allowed precise synchronization with the visual stimulus at the rate of 20 frames per second (fps), yet it aligned with the other accent periodicities only once every 4 seconds, which is beyond Fraisse's (1982) upper limit for the perceptual linking of stimuli. (The specific relationships are: 1000ms = 20 frames; 800ms = 16 frames; and 500ms = 10 frames.) Seven musical patterns and seven animation sequences utilizing each temporal interval were generated, from which the actual stimuli were selected in a second exploratory study.
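The arithmetic behind these choices can be verified directly. A brief sketch using Python's standard library confirms that the 800ms stream coincides with either of the other streams only once every 4 seconds (beyond Fraisse's upper grouping limit), that the 500ms stream nests evenly within the 1000ms stream, and that all three intervals fall on frame boundaries at 20 fps:

```python
from math import lcm

iois = [500, 800, 1000]  # the three inter-onset intervals, in ms

# Two periodic accent streams coincide once every lcm(a, b) ms.
print(lcm(800, 500))   # 4000ms: beyond Fraisse's ~2000ms grouping limit
print(lcm(800, 1000))  # 4000ms
print(lcm(500, 1000))  # 1000ms: the 500ms accents nest within the 1000ms ones

# At 20 frames per second, one frame lasts 50ms.
print([ms // 50 for ms in iois])  # [10, 16, 20] frames per interval
```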

The manner in which audio and visual stimuli were combined served as the independent variable manipulated by the investigator. Three possible levels of juxtaposition were utilized: consonant, out-of-phase, and dissonant (Yeston 1976; Monahan, Kendall, and Carterette 1987; Lipscomb and Kendall 1996). Figure 2 presents an idealized visual representation of these three relationships. In each pair of accent strata (one depicting the visual component, the other the audio component), points of emphasis are represented by pulses in the figure. Consonant relationships (Figure 2a) may be exemplified by accent structures that are perfectly synchronized. Accent structures that are out-of-phase (Figure 2b) share a common temporal interval between consecutive points of emphasis, but the strata are offset such that they are perceived as out of synchronization. Juxtaposition of the 500ms periodic accent structure and the 800ms periodic accent structure mentioned in the previous paragraph would result in a dissonant relationship (Figure 2c). Because of the possibility of nesting the 500ms stimulus within the 1000ms stimulus, it was necessary to distinguish between identical consonance (synchronization of a 500ms temporal interval in both the audio and visual modalities) and nested consonance (synchronization of a 500ms temporal interval

[Figure 2 (diagram): paired accent strata with pulses marked along parallel timelines, illustrating the consonant (2a), out-of-phase (2b), and dissonant (2c) relationships described in the text.]

Figure 2. Visual representations of relationships between sources of accent.

THE PERCEPTION OF AUDIO-VISUAL COMPOSITES

45

in one modality and a 1000ms temporal interval in the other). The same distinction was considered in the out-of-phase relationship between the 500ms and the 1000ms periodicities.

Exploratory Studies

A series of exploratory studies was run in order to select auditory and visual stimuli that illustrate, as clearly as possible, the presence of accent structures in both perceptual modalities, so that subjects were capable of performing tasks based on the alignment of these two strata. For all experimental procedures, Roger Kendall's Music Experiment Development System (MEDS, version 3.1e) was utilized to play the auditory and visual examples and collect subject responses. The author programmed for incorporation into MEDS a module that allowed quantification and storage of temporal intervals between consecutive keypresses on the computer keyboard at a resolution well below 0.01ms. This facility allowed the subjects to register their perceived pulse simply by tapping along on the spacebar.
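The core of such a tapping module is simply differencing consecutive keypress timestamps. A minimal sketch of that step (hypothetical; it does not reproduce the actual MEDS module, whose internals are not documented here):

```python
def tap_intervals(timestamps):
    """Convert a list of keypress timestamps (in seconds, e.g. from a
    high-resolution clock such as time.perf_counter) into inter-tap
    intervals in milliseconds, for storage and later analysis."""
    return [round((b - a) * 1000, 2) for a, b in zip(timestamps, timestamps[1:])]

# Simulated taps at roughly an 800ms pulse:
print(tap_intervals([0.000, 0.802, 1.598, 2.401]))  # [802.0, 796.0, 803.0]
```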

Subjects were asked to tap along with the perceived pulse created by the stimulus while either viewing the animation sequences or listening to the tonal sequences. In the exploratory study, stimuli were continuously looped for a period of about thirty seconds so that subjects had an adequate period of time to determine accent periodicities. It was hypothesized that the position of these perceived pulses coincided with points in time when significant changes in the motion vector (magnitude or direction) of the stimulus occurred. The purpose of the exploratory studies was to confirm this hypothesis and to determine the audio and visual stimuli that produced the most reliably consistent sense of accent structure.

Main Experiment

There are two methodological innovations incorporated into this study that warrant brief discussion. First, a system of "convergent methods" was utilized to answer the research questions. Kendall and Carterette (1992a) proposed this alternative to the single-method approach used in most music perception and cognition research. The basic technique is to "converge on the answer to experimental questions by applying multiple methods, in essence, simultaneously investigating the central research question as well as ancillary questions of method" (p. 116). In addition, if the answer to a research question is the same, regardless of the method utilized, much greater confidence may be attributed to the outcome. The present investigation incorporated a verbal-scaling procedure and a similarity-judgment task.

Second, rather than using semantic differential bipolar opposites in the verbal scaling task (Osgood et al. 1957), verbal attribute magnitude estimation (VAME) was utilized (Kendall and Carterette 1992b and 1993). In contrast to semantic differential scales, VAME provides a means of assigning a specific amount of a given attribute within a verbal scaling framework (good-not good, instead of good-bad).

Since two convergent methods were utilized, two independent groups of subjects were required for this experiment. Group One was asked to watch every audio-visual


composite in a randomly-generated presentation order and provide a VAME response, according to a consistent set of instructions (see Lipscomb 1995). When the OK button was pressed after a response, the location of each button on its respective scroll bar was quantified using a scale from 0 to 100 and stored for later analysis. A repeated measures analysis of variance (ANOVA) was used as the method for determining whether or not there was a significant within-subjects difference between the responses as a function of accent structure alignment and/or the between-subjects variable: level of musical training.
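As a concrete illustration of the quantification step: a scroll-bar thumb position maps linearly onto the 0-100 response scale, and per-condition means are what enter the repeated-measures analysis. All numbers below are invented for illustration (they are not the study's data), though their ordering mirrors the reported result that consonant composites were rated highest and dissonant lowest.

```python
def scrollbar_to_scale(position_px, bar_length_px):
    """Map a scroll-bar thumb position onto the 0-100 response scale."""
    return round(100 * position_px / bar_length_px)

# Hypothetical VAME "synchronized" ratings for one subject, grouped by
# alignment condition (values invented purely for illustration):
ratings = {
    "consonant":    [88, 92, 79, 85],
    "out-of-phase": [55, 61, 48, 52],
    "dissonant":    [30, 41, 25, 33],
}
means = {cond: sum(r) / len(r) for cond, r in ratings.items()}
print(scrollbar_to_scale(150, 200))  # 75
print(means)
```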

Group Two was asked, in a paired-comparison task, to provide ratings of "similarity" on a continuum from "not same" to "same," according to a consistent set of instructions (see Lipscomb 1995). The quantified subject responses were submitted for multidimensional scaling (MDS), in which distances were calculated between objects, in this case A-V composites, for placement within a multi-dimensional space (Kruskal 1964a, 1964b, and 1978). The resulting points were plotted and analyzed in an attempt to determine sources of commonality and differentiation. The results were confirmed by submitting the same data set for cluster analysis in order to identify natural groupings in the data.

Alternative Hypotheses

It was hypothesized that Group One would give the highest verbal ratings of synchronization and effectiveness to the consonant alignment condition (composites in which the periodic pulses identified in the exploratory studies were perfectly aligned). It was also hypothesized that the lowest scores would be given in response to the out-of-phase condition (combinations made up of identical temporal intervals that are offset), while intermediate ratings would be related to composites exemplifying a dissonant relationship. In the latter case, the musical and visual vectors may be perceived as more synchronized because of the process of closure described by Zettl (1990: 380). It was hypothesized that similarity ratings provided by Group Two would result in a multi-dimensional space consisting of at least three dimensions, including musical stimulus, visual stimulus, and accent alignment.

Experimental Procedure

Auditory examples for this experiment consisted of isochronous pitch sequences, and visual images were computer-generated animations of a single object (a circle) moving on-screen. Since the stimuli for this experiment were created by the author, a great degree of care was taken in the exploratory study portion to ensure reliability in responses to the selected stimuli (for a detailed discussion of the exploratory studies, see Lipscomb 1995: 53-65). As a result of these carefully controlled preliminary procedures, from the seven audio examples and seven visual examples created, two audio and two visual stimuli were selected for use in the main experiment (Figures 3 and 4).

Subjects

Subjects for this experiment were forty UCLA students (ages 19 to 31) enrolled in general education classes in the Music Department, either Psychology of Music taught by Lipscomb in Fall, 1994 or American Popular Music taught by Keeling in


[Figure 3 (music notation): the two audio stimuli, A1 and A2.]

Figure 3. Audio stimuli selected for use in the Main Experiment. A1 exhibits accent due to interval and direction change, while A2 exhibits accent resulting from dynamic accent and direction change.

[Figure 4 (animation frames): the two visual stimuli, V1 and V2.]

Figure 4. Visual stimuli selected for use in the Main Experiment. V1 exemplifies side-to-side continuous motion, while V2 illustrates apparent near-to-far-to-near continuous motion.

Summer Session II, 1994. The forty subjects (24 males and 16 females) were randomly assigned to two groups before performing the experimental tasks. Group One (n = 20) responded using the VAME verbal rating scale and Group Two (n = 20) provided similarity judgments between pairs of stimuli. For each group of subjects, the number of subjects falling into each level of musical training is provided in Table 2.

Stimulus Materials

The A-V composites utilized in the main experiment were created by combining the two audio and the two visual stimuli selected in the exploratory study into all possible pairs (nAV = 4). For ease of discussion, these stimuli will hereafter be referenced using the following abbreviations: A1 (Audio 1) consists of a repeated ascending melodic contour, A2 (Audio 2) consists of an undulating melodic contour, V1 (Visual 1) represents left-to-right apparent motion (that is, translation in the plane along the x-axis), and V2 (Visual 2) represents front-to-back apparent motion (translation in depth along an apparent z-axis).


Table 2
Number of subjects falling into each cell of the between-subjects design (Experiment One)

                      Musical Training
Exp. Task     Untrained   Moderate   Trained
VAME          10          7          3
Similarity    10          8          2

In addition to these various A-V composites, the method of audio-visual alignment was systematically altered. As explained previously, three alignment conditions were utilized: consonant, out-of-phase, and dissonant. It was important to create composites in which the A-V alignment was out-of-phase by an amount that was easily perceivable by the subjects. Friberg and Sundberg (1992) determined the amount by which the duration of a tone presented in a metrical sequence must be varied before it is perceived as different from surrounding tones. This amount is 10ms for tones shorter than 240ms, or approximately 5% of the duration for longer tones (p. 107). The out-of-sync versions in this study were offset by 225ms, a value well beyond the just-noticeable difference (JND) for temporal differentiation and also a value that does not nest within or subdivide any of the three IOIs used in this study (500ms, 800ms, and 1000ms).

Stratification of accent structures. In the exploratory study, both the audio and visual stimuli were shown to create a perceived periodic accent, where certain moments in the stimulus stream were considered more salient than others. It is possible, using all combinations of synchronization (consonant, out-of-phase, and dissonant) and IOI (500ms, 800ms, and 1000ms), to generate 14 different alignment conditions for each A-V composite (Table 3). Notice that there are two distinct types of consonant and out-of-phase composites. The first is an identical consonance, for example, a 1000ms IOI in the audio component perfectly aligned with a 1000ms IOI in the visual component. The second type is referred to as a nested consonance, for example, a 500ms IOI in the audio component that is perfectly aligned with, but subdivides, a 1000ms IOI in the visual component (or vice versa). The corresponding pair of out-of-phase composites is referred to as out-of-phase (identical) and out-of-phase (nested). Therefore, the total stimulus set consisted of 56 A-V composites (4 A-V combinations x 14 alignment conditions). Each composite was repeated for a period of approximately 15 seconds before requiring a subject response. The order of stimulus presentation was randomized for every subject.
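The count of 14 conditions follows mechanically from the three IOIs: a pair of accent streams can be consonant (and therefore also has an out-of-phase version) only when one IOI equals or evenly divides the other; all remaining pairings are dissonant. A short sketch reproducing the enumeration:

```python
from itertools import product

iois = [500, 800, 1000]  # inter-onset intervals, in ms
conditions = []
for audio_ioi, visual_ioi in product(iois, repeat=2):
    if max(audio_ioi, visual_ioi) % min(audio_ioi, visual_ioi) == 0:
        # Identical (equal IOIs) or nested (one IOI subdivides the other);
        # each such pairing also has a deliberately offset version.
        conditions.append((audio_ioi, visual_ioi, "consonant"))
        conditions.append((audio_ioi, visual_ioi, "out-of-phase"))
    else:
        conditions.append((audio_ioi, visual_ioi, "dissonant"))

print(len(conditions))      # 14 alignment conditions
print(len(conditions) * 4)  # x 4 A-V combinations = 56 composites
```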

Experimental Tasks

Group One. Each subject in this group was asked to respond to every A-V composite on two VAME scales: "not synchronized-synchronized" and "ineffective-effective." After viewing each of the composites, subjects were given a choice of either providing a response or repeating the stimulus. The response mechanism is shown in Figure 5. The order in which the two VAME scales were presented was also randomized.

Table 3
The 14 alignment conditions for A-V composites in Experiment One

Music IOI   Visual IOI   Audio-visual Alignment
500ms       500ms        consonant
500ms       500ms        out-of-phase
500ms       1000ms       consonant
500ms       1000ms       out-of-phase
500ms       800ms        dissonant
1000ms      1000ms       consonant
1000ms      1000ms       out-of-phase
1000ms      500ms        consonant
1000ms      500ms        out-of-phase
1000ms      800ms        dissonant
800ms       800ms        consonant
800ms       800ms        out-of-phase
800ms       500ms        dissonant
800ms       1000ms       dissonant

Figure 5. Scroll bar used to collect Group One subject responses.

Group Two. The similarity scaling task required comparison of all possible pairs of stimuli. Therefore, it was necessary to utilize only a subset of the composites used in the VAME task in order to ensure that the entire procedure could be run within a reasonable time period (30 to 45 minutes). Only the 800ms MIDI and animation files were utilized, eliminating nesting and varying temporal interval from consideration. The alignment conditions were simply consonant (800ms IOI MIDI file and 800ms IOI FLI animation, perfectly aligned), out-of-phase (800ms IOI MIDI file and 800ms IOI FLI animation, offset by 225ms), and dissonant (1000ms IOI MIDI file and 800ms IOI FLI animation). The triangular matrix of paired comparisons included the diagonal (identities) as a means of gauging subject performance; that is, if identical composites are consistently judged to be "different," it is likely that the subject did not understand or was unable to perform the task. Therefore, the total stimulus set consisted of 12 different A-V composites (4 A-V combinations x 3 alignment conditions), resulting in 78 pairs of stimuli. All paired comparisons were randomly generated, so that the subject saw one A-V composite and then a second combination prior to providing a similarity judgment.
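The count of 78 trials follows from including the diagonal of the triangular matrix: C(12, 2) = 66 distinct pairs plus 12 identity pairs. A minimal sketch (Python; the label strings are ours, modeled on the convention described in endnote #6):

```python
from itertools import combinations_with_replacement

visuals = ["V1", "V2"]
audios = ["A1", "A2"]
alignments = ["C", "O", "D"]   # consonant, out-of-phase, dissonant

# 2 visuals x 2 audios x 3 alignment conditions = 12 A-V composites.
composites = [f"{v}{a}_{x}" for v in visuals for a in audios for x in alignments]
assert len(composites) == 12

# All unordered pairs, including each composite paired with itself.
pairs = list(combinations_with_replacement(composites, 2))
assert len(pairs) == 78
```

`combinations_with_replacement` generates exactly the upper triangle plus diagonal, matching the paired-comparison design described above.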

Results

Group One Data Analysis and Interpretation

A repeated measures ANOVA was performed on the subject responses to each of the VAME rating scales provided by Group One, considering two within-groups variables (4 A-V combinations and 14 alignment conditions) and one between-groups variable (3 levels of musical training).5 At α = .025, neither the synchronization ratings (F(2,17) = 1.62, p < .227) nor the effectiveness ratings (F(2,17) = .66, p < .528) exhibited any significant difference across levels of musical training. However, there was a highly significant within-subjects effect of alignment condition for both the synchronization ratings (Fλ(13,221) = 88.18, p < .0005) and the effectiveness ratings (Fλ(13,221) = 48.43, p < .0005). The only significant interaction was between A-V combination and alignment condition, for both synchronization (F(39,663) = 3.05, p < .0005) and effectiveness (F(39,663) = 2.94, p < .0005). In general, there was a high correlation between subject responses on the synchronization and effectiveness scales (r = .96), confirming the strong positive relationship between ratings of synchronization and effectiveness.

Mean subject responses to the VAME scales are represented graphically in Figures 6a to 6d. Each graph represents a different A-V combination. There is a striking consistency in response pattern across A-V composites, as represented in these graphs. This consistency is confirmed by Figure 7, providing a comparison of these same responses across all four A-V combinations by superimposing Figures 6a to 6d on top of one another. In the legend to this figure, the labels simply refer to a specific alignment condition of a given A-V combination (V1A2_C refers to the consonant alignment condition of Visual #1 and Audio #2).6 There is a relatively consistent pattern of responses, based on alignment condition. In general, the consonant combinations receive the highest mean ratings on both verbal scales. The identical consonant composites (for example, alignment conditions V5A5_C, V10A10_C, and V8A8_C in Figure 7) are consistently given higher ratings than the nested consonant composites (alignment conditions V5A10_C and V10A5_C in Figure 7), with the exception of V10A5_C4, which received a mean rating almost equal to that of V10A10_C4.7 (Remember that there is no nested consonant composite for the 800ms stimuli.)

Figure 6a. Mean subject ratings to the two VAME scales when viewing the combination of Visual #1 and Audio #1 across alignment conditions. For an explanation of x-axis labels, see endnote #6.

Figure 6b. Mean subject ratings to the two VAME scales when viewing the combination of Visual #1 and Audio #2 across alignment conditions. For an explanation of alignment condition labels, see endnote #6.

Figure 6c. Mean subject ratings to the two VAME scales when viewing the combination of Visual #2 and Audio #1 across alignment conditions. For an explanation of alignment condition labels, see endnote #6.

Figure 6d. Mean subject ratings to the two VAME scales when viewing the combination of Visual #2 and Audio #2 across alignment conditions. For an explanation of alignment condition labels, see endnote #6.

Figure 7. Comparison of all VAME responses across A-V composite and alignment conditions. For an explanation of alignment condition labels, see endnote #6.

The second highest mean ratings were given in response to the out-of-phase (identical) composites (V5A5_O, V10A10_O, and V8A8_O). The lowest mean ratings were always given in response to the out-of-phase (nested) composites (V5A10_O and V10A5_O) and the dissonant composites (V5A8_D, V10A8_D, V8A5_D, and V8A10_D), with the former usually being rated slightly higher than the latter. Notice also that the widest spread of mean responses to any of the A-V composites is associated with the nested consonant composites: V5A10_C and V10A5_C. Therefore, the relationship between subject responses on the two VAME scales and accent structure alignment may be represented as shown in Table 4. This ordering of response means is different from that proposed initially as an alternative hypothesis.

Recall that, based upon Zettl's (1990) theory related to closure in the perception of musical and visual vectors, the dissonant conditions were predicted to be perceived as more synchronized and effective than the out-of-phase conditions. The responses of Group One reveal, however, that higher ratings were given to the out-of-phase conditions than to the dissonant conditions. It is still possible to explain these results in terms of closure, as described by Zettl. However, the process of closure appears to have been applied in a manner different from that predicted at the outset.

Subject responses revealed that the out-of-phase conditions were perceived as more synchronized and more effective than the dissonant conditions, in contrast to the results predicted. In hindsight, perhaps this result makes more sense than the proposed alternative hypothesis. It appears that, in the process of viewing the present collection of audio-visual composites, subjects sought out recurrent periodicities and considered those that shared the same IOI between accent points to be more synchronized and more effective than those with different IOIs, even if these periodicities were misaligned to a highly perceptible degree. Future research will be required to distinguish between the importance of absolute accent structure alignment and the influence of such matched periodicities.

Table 4
A-V composites arranged from highest response to lowest on the VAME scales

Highest   Identical Consonant Composites
          Nested Consonant Composites
          Out-of-Phase (Identical) Composites
Lowest    Out-of-Phase (Nested) and Dissonant Composites

Collapsing Alignment Conditions Across A-V Composites. The subject VAME responses were collapsed across alignment conditions. When compared to a single measurement, such multiple measures of a single condition provide increased reliability (Lord and Novick 1968). Therefore, the mean of all synchronization ratings given in response to the consonant alignment condition (V1A1_C, V1A2_C, V2A1_C, and V2A2_C) was calculated and compared to the mean ratings for the out-of-phase and dissonant alignment conditions. The ratings of effectiveness were collapsed as well. An ANOVA on the collapsed data set revealed that the significant interaction between A-V composite and alignment condition observed over the complete data set fell to a level not considered statistically significant (synchronization: Fλ(6,101) = 1.98959, p < .056; effectiveness: Fλ(6,101) = 2.18760, p < .117).8 Further justification for collapsing the data in this manner can be derived from the VAME data set. Figure 8 represents mean ratings for both VAME scales across all A-V combinations at every IOI. Notice the contour similarity across every consonant, out-of-phase, and dissonant combination; that is, the consonant pairs consistently received the highest rating, the dissonant pairs received the lowest rating, and the out-of-phase pairs received a rating in between the other two. In addition, the subject responses to the nested conditions exhibited the most influence of specific A-V combinations. Therefore, eliminating these conditions further justified collapsing alignment conditions (consonant, out-of-phase, and dissonant) across the various A-V combinations. For the remainder of this investigation, only three alignment conditions will be considered: consonant, out-of-phase, and dissonant, eliminating the nested conditions.

Figure 8. VAME ratings for all consonant, out-of-phase, and dissonant combinations across all A-V composites.

Analysis and Interpretation of Data Collapsed Across Alignment Conditions. An ANOVA across the collapsed data set confirmed that there is no statistically significant difference between the levels of musical training for either the synchronization ratings (F(2,17) = 1.699, p < .2125) or the effectiveness ratings (F(2,17) = .521, p < .603). Once again, however, analysis reveals a highly significant effect of alignment condition for both verbal ratings: synchronization (Fλ(2,16) = 162.274, p < .0001) and effectiveness (Fλ(2,16) = 91.591, p < .0001). The interaction between level of musical training and alignment condition was not significant for either synchronization (Fλ(4,32) = 1.575, p < .2048) or effectiveness (Fλ(4,32) = 2.662, p < .0504).

Regardless of musical training, subjects are clearly distinguishing between the three alignment conditions on both VAME scales (Figure 9). Consonant combinations were given the highest ratings with a steep decline between consonant and out-of-phase combinations, followed by an even lower rating for the dissonant pairs. Interestingly, the effectiveness ratings were consistently less extreme than the ratings of synchronization. For example, when the mean synchronization rating was extremely high (the consonant alignment condition), the effectiveness rating was slightly lower. However, when the synchronization ratings were lower (for the out-of-phase and dissonant alignment conditions), the effectiveness ratings were slightly higher. This suggests that, while synchronization ratings varied more consistently according to alignment condition, ratings of effectiveness may have been tempered slightly by other factors inherent in the A-V composite.


Figure 9. Mean VAME ratings for Experiment One averaged across all levels of musical training.


Group Two Data Analysis and Interpretation

A repeated measures ANOVA was also performed on the similarity ratings provided by Group Two, using one within-groups variable (78 paired comparisons) and one between-groups variable (3 levels of musical training). There was no statistically significant effect of either musical training (F(2,17) = .40, p < .676) or the interaction between musical training and similarity ratings (F(154,1309) = .56, p < 1.000). As one would expect, however, the similarity ratings did vary at a high level of significance (Fλ(77,1309) = 24.26, p < .0005). Therefore, the null hypothesis is rejected, because subject ratings of similarity between A-V composites did, in fact, vary significantly as a function of A-V alignment.

Multidimensional Scaling. The triangular mean similarity matrix was submitted for multidimensional scaling (MDS) analysis. Figure 10 provides the MDS solution in three dimensions, accounting for 99.884% of the variance at a stress level of only .0119. The twelve stimuli separated clearly on each dimension. All composites using Audio #1 are on the negative side of the "Audio" dimension (x-axis) and all composites incorporating Audio #2 are on the positive side. Likewise, all composites utilizing Visual #1 are on the negative side of the "Visual" dimension (z-axis) and all composites using Visual #2 are on the positive side. Finally, all of the composites that are considered dissonant fall within the negative area of the "Sync" dimension (y-axis) and all consonant and out-of-phase composites fall on the positive side, practically on top of one another.

Figure 10. Multidimensional scaling solution for the similarity judgments in Experiment One.
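The text does not specify which MDS algorithm produced Figure 10, but the classical (Torgerson) procedure illustrates the underlying idea: double-center the squared dissimilarities and eigendecompose. The toy configuration below is entirely our own assumption, built so that its three axes mimic the Visual, Audio, and Sync dimensions described above (with consonant and out-of-phase sitting close together and dissonant apart); it is not the study's data.

```python
import numpy as np
from itertools import product

# Hypothetical 3-D feature space: visual axis (weight 3), audio axis
# (weight 2), and an alignment axis on which consonant (0.0) and
# out-of-phase (0.2) nearly coincide while dissonant (1.0) is separated.
ALIGN = {"C": 0.0, "O": 0.2, "D": 1.0}
labels, points = [], []
for v, a, x in product((-1, 1), (-1, 1), "COD"):
    labels.append(f"V{(v + 3) // 2}A{(a + 3) // 2}_{x}")
    points.append([3.0 * v, 2.0 * a, ALIGN[x]])
X = np.array(points)

# Pairwise Euclidean distances stand in for the mean dissimilarity matrix.
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

# Classical MDS: double-center the squared distances, then eigendecompose.
n = len(X)
J = np.eye(n) - 1.0 / n                     # centering matrix I - 11'/n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
top = np.argsort(vals)[::-1][:3]
coords = vecs[:, top] * np.sqrt(vals[top])  # recovered 3-D configuration

# The recovered points reproduce the original distances, and the dominant
# dimension cleanly separates the V1 composites from the V2 composites.
D_hat = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
assert np.allclose(D, D_hat, atol=1e-8)
assert set(np.sign(coords[:6, 0])) != set(np.sign(coords[6:, 0]))
```

Because the toy features are orthogonal after centering, the eigenvectors align with the intended dimensions, echoing the clean separation reported for Figure 10.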


Notice how tightly the stimuli clustered within the three-dimensional space when viewed from above (across the Visual and Audio dimensions).9 To further examine the group membership among the various A-V composites, the same triangular matrix was submitted for cluster analysis.

Cluster Analysis. Cluster analysis provides a method for dividing a data set into subgroups without any a priori knowledge concerning the number of subgroups or their specific members. The tree diagram presented in Figure 11 graphically illustrates the clustering of A-V composites used in the present study.

As is readily apparent when considering this cluster diagram from right to left, the initial branching of A-V composites into subgroups clearly separates the composites according to the visual component. All composites on the upper branch utilize Visual One (V1) and all composites on the lower branch utilize Visual Two (V2). The next subdivision separates the composites according to audio component, as labeled in the diagram. The third subdivision separates the composites with the same IOIs (consonant and out-of-phase composites) from those composites in which the audio and visual components are of differing IOIs (dissonant composites). Finally, the fourth subdivision divides the consonant composites from the out-of-phase composites.
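The complete-linkage ("farthest neighbor") rule used for Figure 11 can be sketched directly. In the toy dissimilarity matrix below, the mismatch weights (visual > audio > alignment) are illustrative assumptions chosen to mimic the branching order just described; they are not the study's data.

```python
labels = [f"{v}{a}_{x}" for v in ("V1", "V2") for a in ("A1", "A2")
          for x in ("C", "O", "D")]

def dist(a: str, b: str) -> float:
    """Toy additive dissimilarity between two composite labels."""
    return (100 * (a[:2] != b[:2])      # different visual component
            + 50 * (a[2:4] != b[2:4])   # different audio component
            + 10 * (a[-1] != b[-1]))    # different alignment condition

def complete_linkage(items, k):
    """Agglomerate until k clusters remain; inter-cluster distance is the
    farthest pair of members (complete linkage)."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: max(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# Cutting the toy tree at two clusters recovers the first (visual) split.
two = complete_linkage(labels, 2)
assert {frozenset(c) for c in two} == {
    frozenset(l for l in labels if l.startswith("V1")),
    frozenset(l for l in labels if l.startswith("V2")),
}
```

With these weights the merges proceed alignment-first, then audio, then visual, so successive cuts of the tree reproduce the subdivision order described above.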

Figure 11. Cluster analysis tree diagram, complete linkage (farthest neighbor), for similarity ratings provided by subjects in Experiment One, Group Two. Leaf order, from top to bottom: V1A1_C, V1A1_O, V1A1_D, V1A2_D, V1A2_C, V1A2_O, V2A2_O, V2A2_C, V2A2_D, V2A1_D, V2A1_C, V2A1_O.

Notice also the mirroring relationship within the lower cluster of six composites (those using V2), based upon alignment condition (see Figure 12a). The closest cross-cluster relationship between those composites incorporating A1 and those using A2 is the dissonant condition, neighbored by the consonant condition, and working outward finally to the out-of-phase condition. A similar mirroring is apparent when comparing the composites that incorporate A2, whether combined with V1 or V2 (Figure 12b). In this case, the alignment condition of the closest pair (V1A2_O and V2A2_O) is out-of-phase, working outward to consonant, then dissonant. It is worthy of notice that, when considering the main (visual) branches of the cluster solution in Figure 11, the two neighbor composites (V1A2_O and V2A2_O) share the same audio track and alignment condition. These relationships within the cluster branching structure further confirmed the role of alignment condition in the subject ratings of similarity.

Figure 12a. Illustration of the mirroring relationship between elements in the lower cluster of composites incorporating Visual Two (V2A2_O, V2A2_C, V2A2_D, V2A1_D, V2A1_C, V2A1_O).

Figure 12b. Illustration of the mirroring relationship between elements in the middle cluster of composites incorporating Audio Two with either Visual One or Visual Two (V1A2_D, V1A2_C, V1A2_O, V2A2_O, V2A2_C, V2A2_D).


Conclusions

Summarizing the results of the main experiment, both of the converging methods (VAME ratings and similarity judgments) substantiated the fact that the alignment condition between audio and visual components of an A-V composite was a determining factor in the subject responses. In the VAME scores, verbal ratings of "synchronization" and "effectiveness" did, in fact, vary as a function of A-V alignment. In general, the highest ratings on both scales were given in response to the identical consonant composites followed (in order of magnitude) by nested consonant composites, and then out-of-phase identical composites. The lowest ratings were consistently given to either the out-of-phase (nested) or dissonant composites. Collapsing VAME ratings across alignment condition confirmed the relationship between consonant, out-of-phase, and dissonant pairs, revealing that the ratings of effectiveness were consistently less extreme than the synchronization ratings.

In the similarity judgments, an analysis of variance confirmed that there was a significant difference between ratings given to composites exemplifying the various alignment conditions. MDS revealed three easily interpretable dimensions. Cluster analysis confirmed the three criteria utilized by subjects in the process of determining similarity. In decreasing order of significance, these were the visual component, the audio component, and alignment condition. The fact that the alignment condition plays a significant role in both the multidimensional scaling solution and in the cluster analysis confirms the importance of including "Accent Structure Relationship" as one of the Implicit Judgments in the model of Film Music Perception (Figure 1).

Discussion

Research Questions Answered

The first question raised was: What are the determinants of accent? A review of related literature revealed numerous potential sources of accent in both the aural and visual domains. Several researchers and theorists proposed that introducing a change into a stimulus stream results in added salience (accent). The exploratory study confirmed that hypothesized accent points using parameters (both aural and visual) gleaned from this literature review were, in fact, perceived by subjects and reproduced in a tapping procedure. Particularly reliable in producing an event perceived as musically salient were pitch contour direction change, change in interval size, dynamic accent, and timbre change. Likewise, particularly reliable in producing events perceived as salient in the visual domain were translations in the plane (top-to-bottom, side-to-side, left-bottom-to-right-top, and so forth) and translation in depth (near-to-far).

The second research question, and the main source of interest in the present study, concerned whether accent structure alignment between auditory and visual components was a necessary condition for the combination to be considered effective when viewing an A-V composite. This question is answered very clearly by the VAME responses of Group One. Calculation of the Pearson correlation coefficient revealed that subject ratings of synchronization and effectiveness shared a strong positive


relationship (r = .96). Therefore, A-V combinations that were rated high in synchronization also tended to be rated high on effectiveness and vice versa. In addition, results of multidimensional scaling and cluster analysis revealed clear influence of the synchronization condition. We may conclude that, when using simple audio and visual stimuli, accent structure alignment does appear to be a necessary condition in order for an A-V combination to be considered effective.

As a result, in addition to the overwhelming evidence supporting the important referential aspect of the role played by film music, it is imperative that attention be given to the manner in which the audio and visual components are placed in relation to one another. More specifically, the present study has shown that the manner in which salient moments in the auditory and visual domains are aligned results in a significantly different perceptual response to the resulting composite. Therefore, in light of the findings of the present investigation, it is important to reconsider the results of investigations that simply juxtaposed sound upon image with little attention to the manner in which they were combined (Tannenbaum 1956; Thayer and Levinson 1984; and Marshall and Cohen 1988). Marshall and Cohen (1988) appear to have taken this aspect of animation into consideration when they proposed their "Congruence-Associationist" model, in which semantic association and temporal congruence form the basis for judgments concerning the appropriateness of an audio-visual combination. In a subsequent published discussion of these results, Bolivar, Cohen, and Fentress (1996: 32) state that "the greater [the] temporal congruence the greater the focus of visual attention to which the meaning of the music consequently can be ascribed." Further research will be required to determine the relationship between accent structure alignment (temporal congruence) and referential meaning (association).

Suggestions for Further Research

The most important issue to be addressed in a series of future investigations is whether accent structure alignment remains a necessary condition when viewing more complex stimuli. The present author is currently in the process of preparing a paper that reports findings of an experiment utilizing moderately complex stimuli (experimental animations by Norman McLaren) and highly complex stimuli (actual movie excerpts from Brian De Palma's Obsession, 1977).

Future research is also needed to determine the relative importance of referential (associational) and accent structure (syntactic) aspects within the motion picture or animation experience. These results would help further revise the model of Film Music Perception. Accuracy of the model could also be enhanced by experimental designs incorporating more complex A-V interrelationships. For example, instead of simply having consonant, out-of-phase, and dissonant alignment conditions, it would be possible to create a whole series of consonant alignment periodicities using a basic subset of temporal patterns. Monahan and Carterette (1985) performed a study of this kind using pitch patterns with the four rhythmic patterns: iambic, dactylic, trochaic, and anapestic. These four rhythmic patterns could provide the basis for creating a series of animations and a series of pitch patterns. The two could then be combined in all possible pairs for use in a similarity scaling procedure to determine what aspects of the A-V composite are particularly salient to an observer. An investigator could incorporate this same stimulus set into a tapping procedure to determine whether subjects tap with the audio, the video, some underlying common pulse, or a complex combinatory rhythmic pattern.

A significant limitation of the present study is that only a small number of audio and visual stimuli were used in order to ensure that subjects could complete the experimental tasks within a reasonable amount of time. In future investigations, the use of blocked designs would allow incorporation of a larger number of stimuli and, hence, improve the investigator's ability to generalize results.

Currently, the temporal duration by which visual images and sounds must be offset in order to be perceived as misaligned (j.n.d., or just-noticeable difference, in psychophysical terminology) remains undefined. In the present study a liberal amount was selected (225ms) in order to ensure that the offset amounts were well beyond any psychophysiological or perceptual limitations. Friberg and Sundberg (1992) determined that, when introducing a temporal duration change into a series of isochronous tones, the variation could be perceived at temporal intervals as small as 10ms. The amount of temporal offset in a cross-modal perception task would likely be significantly longer, but that determination must be made through rigorous scientific investigation. Such an experimental design should incorporate stimuli of varying levels of complexity, in order to determine whether the j.n.d. is a constant or relative value.

Much research is needed to assist in the quantification of various parameters of the audio-visual experience. Reliable metrics are needed to express accent prominence, as well as complexity of a musical passage, a visual image, or an A-V combination in quantitative terms. Creating a method to quantify the degree of referentiality in a musical or visual excerpt would be helpful in further developing the model of Film Music Perception.

Finally, the present investigation selected one between-groups variable of interest, that is, musical training. It would also be equally relevant to run a series of similar studies, using visual literacy as a potential grouping variable.10 In fact, incorporating both musical training and visual literacy would allow consideration of the musical training by visual literacy interaction, which might prove very interesting indeed.

Conclusion

Scientific investigations into the relationship of visual images and musical sound in the context of motion pictures and animation provide a relatively new area of study. The art forms themselves have only existed for a century. However, given the sociological significance of the cinematic experience, it is quite surprising that there is still only a small amount of research literature available addressing issues involved in the cognitive processing of ecologically valid audio-visual stimuli. The present series of experiments, along with those proposed above, will provide a framework upon which to build a better understanding of this important, but underrepresented, area of research.


NOTES

1. The author would like to acknowledge the assistance of both the University of California, Los Angeles and The University of Texas at San Antonio. In addition, the support of Northwestern University has been integral to continuing research efforts. Without the use of the research facilities provided, funding for necessary equipment, and colleagues with whom the results could be discussed, completion of this project would not have been possible. I would especially like to thank Dr. Roger A. Kendall for his invaluable assistance.

2. For another interesting experience of this type, view The Wizard of Oz while listening to Pink Floyd's Dark Side of the Moon (1973) as the musical soundtrack. If you start the music on cue with the third roar of the MGM lion, you will be surprised how well the audio and visual components appear synchronized at certain transitional points in the film. Though songwriter Roger Waters has remained silent on the matter, both drummer Nick Mason and engineer Alan Parsons deny any intended relationship (MTV, 1997).

3. Even when using extreme differences between temporal intervals and periodicity, it is inevitable that, at some point in time, the two strata will align for a simultaneous point of accent. This possibility occurs, as mentioned, every four seconds when using the 800ms temporal interval with the 500ms interval, or every eight seconds when combined with the 1000ms interval. The fifth pulse in the upper stratum of Figure 5c illustrates such a coincidental alignment.

4. More information about MEDS is available from Dr. Roger A. Kendall directly at: Dept. of Ethnomusicology, UCLA, Los Angeles, CA 90095-1657 ([email protected]). In addition to the KeyPress module, the author also incorporated commands into MEDS allowing selection of any of the digital or analog audio tracks available on the laserdisc recording. These capabilities were necessary for later experiments.

5. The data set for the main experiment (both synchronization and effectiveness ratings) failed the likelihood-ratio test for compound symmetry, violating one assumption of the ANOVA model. Therefore, when appropriate, transformed F- and p-values were provided using Wilks' lambda, which does not assume compound symmetry.

6. In Figures 6a to 6d, the x-axis labels consist of acronyms formed to identify the specific combination represented. These acronyms consist of the visual stimulus IOI (5, 8, or 10), the aural stimulus IOI (5, 8, or 10), and the alignment condition (C for consonant, O for out-of-phase, or D for dissonant). In each case, the actual IOI value in milliseconds is divided by 100 to make the labels shorter. For example, the label V5A8_D identifies a composite consisting of a visual stimulus with an IOI of 500 ms and an aural stimulus with an IOI of 800 ms resulting in a dissonant alignment condition. Each graph represents a different combination of visual and aural stimuli.
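The decoding rule just described can be expressed as a small parser. This is a hypothetical helper written purely for illustration (`parse_label` is not part of the MEDS software):

```python
import re

def parse_label(label: str) -> dict:
    """Decode a composite label such as 'V5A8_D' into its parts:
    visual IOI and aural IOI (in ms), plus the alignment condition."""
    m = re.fullmatch(r"V(5|8|10)A(5|8|10)_([COD])", label)
    if m is None:
        raise ValueError(f"not a valid composite label: {label}")
    condition = {"C": "consonant", "O": "out-of-phase", "D": "dissonant"}
    return {
        "visual_ioi_ms": int(m.group(1)) * 100,  # labels divide the IOI by 100
        "aural_ioi_ms": int(m.group(2)) * 100,
        "alignment": condition[m.group(3)],
    }

print(parse_label("V5A8_D"))
# → {'visual_ioi_ms': 500, 'aural_ioi_ms': 800, 'alignment': 'dissonant'}
```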

7. Recall that this is the visual pattern that, because of the results of the exploratory study, was changed from the originally hypothesized accent periodicity to that perceived by all subjects in the tapping task. Perhaps some of the subjects in Experiment One perceived composite V10A5_C as nested, while others (sensing an accent point at both the nearest and farthest location of visual apparent motion) considered it an identical consonance.

8. Running a second ANOVA on this same data set caused the probability of alpha error (α) to increase. The data from each of three experiments were analyzed independently for significant differences, and then one final ANOVA was run across the entire data set, including subject responses from all three experiments. These three experiments comprise the main experiment reported in this paper and two additional experiments presently being prepared for publication. Since the alpha error level was set a priori to .025, the resulting level of confidence remained above 95% (.975 x .975). The single exception to this rule was the analysis of the data from the main experiment reported herein: one ANOVA was already run on the complete data set, and along with the following ANOVA on the collapsed data set and the final ANOVA across all three experiments, the resulting level of confidence was reduced to about 93% (.975 x .975 x .975).
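The confidence figures in this note follow from multiplying the per-test confidence levels, assuming the tests are treated as independent. A quick check of the arithmetic:

```python
# Familywise confidence after k ANOVAs, each tested at alpha = .025,
# is (1 - alpha) ** k under the independence assumption in the note.
alpha = 0.025

two_tests = (1 - alpha) ** 2    # two ANOVAs: still above 95%
three_tests = (1 - alpha) ** 3  # three ANOVAs: "about 93%" as stated

print(round(two_tests, 4), round(three_tests, 4))  # → 0.9506 0.9269
```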

9. They cluster so tightly, in fact, that when the similarity matrix was forced into two dimensions it became immediately apparent that the MDS solution was degenerate. Therefore, results of the MDS solution will be supported by consideration of cluster analyses. For a complete discussion, see Lipscomb (1995).

10. Visual literacy refers to an individual's capability to process visual sensory input. For instance, individuals trained as artists, animators, or film directors tend to be more aware of elements in their visual environment.

REFERENCES

Asmus, E.
1985 "The effect of tune manipulation on affective responses to a musical stimulus." In G.C. Turk, ed., Proceedings of the Research Symposium on the Psychology and Acoustics of Music. Lawrence: University of Kansas, pp. 97-110.

Bermant, R.J. and Welch, R.B.

1976 "Effect of degree of separation of visual-auditory stimulus and eye position upon spatial interaction of vision and audition." Perceptual and Motor Skills 43: 487-493.

Bolivar, V.J., Cohen, A.J., and Fentress, J.C.
1996 "Semantic and formal congruency in music and motion pictures: Effects on the interpretation of visual action." Psychomusicology 13: 28-59.

Bolton, T.L.
1894 "Rhythm." American Journal of Psychology 6: 145-238.

Boltz, M.
1992 "Temporal accent structure and the remembering of filmed narratives." Journal of Experimental Psychology: Human Perception and Performance 18: 90-105.
2001 "Musical soundtracks as a schematic influence on the cognitive processing of filmed events." Music Perception 18(4): 427-454.

Boltz, M. and Jones, M.R.
1986 "Does rule recursion make melodies easier to reproduce? If not, what does?" Cognitive Psychology 18: 389-431.

Boltz, M., Schulkind, M., and Kantra, S.
1991 "Effects of background music on the remembering of filmed events." Memory and Cognition 19: 593-606.

Brown, R.W.
1981 "Music and language." In Documentary report of the Ann Arbor Symposium. Reston, VA: pp. 233-265.

Bruner, J., Goodnow, J.J., and Austin, G.A.

1986 A study of thinking. 2nd ed. New Brunswick: Transaction Publishers.

Brusilovsky, L.S.
1972 "A two-year experience with the use of music in the rehabilitative therapy of mental patients." Soviet Neurology and Psychiatry 5(3-4): 100.

Bullerjahn, C. and Guldenring, M.

1996 "An empirical investigation of effects of film music using qualitative content analysis." Psychomusicology 13: 99-118.

Cardinell, R.L. and Burris-Meyer, H.

1949 "Music in industry today." Journal of the Acoustical Society of America 19: 547-548.

Crozier
1974 "Verbal and exploratory responses to sound sequences varying in uncertainty level." In D.E. Berlyne, ed., Studies in the new experimental aesthetics: Steps toward an objective psychology of aesthetic appreciation. New York: Halsted Press, pp. 27-90.

Deliège, I.
1987 "Grouping conditions in listening to music: An approach to Lerdahl and Jackendoff's Grouping Preference Rules." Music Perception 4(4): 325-360.

Eagle, C.T.
1973 "Effects of existing mood and order of presentation of vocal and instrumental music on rated mood response to that music." Council for Research in Music Education 32: 55-59.

Farnsworth, P.R.
1954 "A study of the Hevner adjective list." Journal of Aesthetics and Art Criticism 13: 97-103.

Fraisse, P.
1982 "Rhythm and tempo." In D. Deutsch, ed., The psychology of music. New York: Academic Press, pp. 149-180.

Friberg, A. and Sundberg, J.
1992 "Perception of just-noticeable displacement of a tone presented in a metrical sequence of different tones." Speech Transmission Laboratory Quarterly Progress and Status Report 4: 97-108.

Halpin, D.O.
1943-1944 "Industrial music and morale." Journal of the Acoustical Society of America 15: 116-123.

Heinlein, C.P.
1928 "The affective characters of major and minor modes in music." Journal of Comparative Psychology 8: 101-142.

Hevner, K.
1935 "Expression in music: A discussion of experimental studies and theories." Psychological Review 42(2): 186-204.
1936 "Experimental studies of the elements of expression in music." American Journal of Psychology 48: 246-269.

Hough, E.
1943 "Music as a safety factor." Journal of the Acoustical Society of America 15: 124.

Huron, D.
1994 "What is melodic accent? A computer-based study of the Liber Usualis." Paper presented at the Canadian University Music Society Theory Colloquium, Calgary, Alberta, June.

Iwamiya, S.
1996 "Interactions between auditory and visual processing when listening to music in an audio-visual context: 1. Matching 2. Audio quality." Psychomusicology 13: 133-153.

Kendall, R.A. and Carterette, E.C.
1992a "Convergent methods in psychomusical research based on integrated, interactive computer control." Behavior Research Methods 24(2): 116-131.
1992b "Semantic space of wind instrument dyads as a basis for orchestration." Paper presented at the Second International Conference on Music Perception and Cognition, Los Angeles, CA, February.
1993 "Verbal attributes of simultaneous wind instrument timbres: I. von Bismarck adjectives." Music Perception 10(4): 445-467.

Kerr, W.A.
1945 "Effects of music on factory production." Applied Psychology Monographs, no. 5. California: Stanford University.

Koffka, K.
1935 Principles of Gestalt psychology. New York: Harcourt, Brace.

Köhler, W.
1929 Gestalt psychology. New York: Liveright.

Krumhansl, C.L. and Schenck, D.L.
1997 "Can dance reflect the structural and expressive qualities of music? A perceptual experiment on Balanchine's choreography of Mozart's Divertimento No. 15." Musicae Scientiae 1(1): 63-85.

Kruskal, J.B.

1964a "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis." Psychometrika 29: 1-27.
1964b "Nonmetric multidimensional scaling: A numerical method." Psychometrika 29: 115-129.
1978 Multidimensional scaling. Beverly Hills, CA: Sage Publications.

Lerdahl, F. and Jackendoff, R.
1983 A generative theory of tonal music. Cambridge, MA: MIT Press.

Lipscomb, S.D.
1989 "Film music: A sociological investigation of influences on audience awareness." Paper presented at the Meeting of the Society for Ethnomusicology, Southern California Chapter, Los Angeles, March.
1990 "Perceptual judgment of the symbiosis between musical and visual components in film." Unpublished master's thesis, University of California, Los Angeles.
1995 "Cognition of musical and visual accent structure alignment in film and animation." Unpublished doctoral dissertation, University of California, Los Angeles.


Lipscomb, S.D. and Kendall, R.A.
1996 "Perceptual judgment of the relationship between musical and visual components in film." Psychomusicology 13(1): 60-98.

Lord, F.M. and Novick, M.R.
1968 Statistical theories of mental test scores. Menlo Park, CA: Addison-Wesley Publishing.

MacDougall, R.
1903 "The structure of simple rhythm forms." Psychological Review, Monograph Supplements 4: 309-416.

Madsen, C.K. and Madsen, C.H.

1970 Experimental research in music. New Jersey: Prentice Hall.

Marshall, S.K. and Cohen, A.J.
1988 "Effects of musical soundtracks on attitudes toward animated geometric figures." Music Perception 6: 95-112.

Massaro, D.W. and Warner, D.S.

1977 "Dividing attention between auditory and visual perception." Perception and Psychophysics 21: 569-574.

McGehee, W. and Gardner, J.E.

1949 "Music in a complex industrial job." Personnel Psychology 2: 405-417.

McMullen, P.T.
1976 "Influences of distributional redundancy in rhythmic sequences on judged complexity ratings." Council for Research in Music Education 46: 23-30.

Mershon, D.H., Desaulniers, D.H., Amerson, T.L., Jr., and Kiefer, S.A.

1980 "Visual capture in auditory distance perception: Proximity image effect reconsidered." Journal of Auditory Research 20: 129-136.

Meyer, L.B.

1956 Emotion and meaning in music. Chicago: University of Chicago Press.

Monahan, C.B. and Carterette, E.C.
1985 "Pitch and duration as determinants of musical space." Music Perception 3(1): 1-32.

Monahan, C.B., Kendall, R.A., and Carterette, E.C.

1987 "The effect of melodic and temporal contour on recognition memory for pitch change." Perception and Psychophysics 41(6): 576-600.

Philip Morris Companies Inc.

1988 Americans and the arts: V. New York: American Council for the Arts.

MTV

1997 "The Pink Floyd/Wizard of Oz connection." Retrieved May 13, 2002, from: http://www.mtv.com/news/articles/1433194/19970530/story.jhtml.

Nordoff, P. and Robbins, C.

1973 Therapy in music for handicapped children. London: Gollancz.

Osgood, C.E., Suci, G.J., and Tannenbaum, P.H.

1957 The measurement of meaning. Urbana: University of Illinois Press.

Pink Floyd
1973 Dark Side of the Moon. Capitol CDP 7 46001 2.


Radeau, M. and Bertelson, P.
1974 "The after-effects of ventriloquism." Quarterly Journal of Experimental Psychology 26: 63-71.

Regan, D. and Spekreijse, H.

1977 "Auditory-visual interactions and the correspondence between perceived auditory space and perceived visual space." Perception 6: 133-138.

Rosar, W.H.
1996 "Film music and Heinz Werner's theory of physiognomic perception." Psychomusicology 13: 154-165.

Ruff, R.M. and Perret, E.
1976 "Auditory spatial pattern perception aided by visual choices." Psychological Research 38: 369-377.

Sirius, G. and Clarke, E.F.
1996 "The perception of audiovisual relationships: A preliminary study." Psychomusicology 13: 119-132.

Staal, H.E. and Donderi, D.C.
1983 "The effect of sound on visual apparent movement." American Journal of Psychology 96: 95-105.

Tannenbaum, P.H.
1956 "Music background in the judgment of stage and television drama." Audio-Visual Communication Review 4: 92-101.

Thayer, J.F. and Levenson, R.W.
1984 "Effects of music on psychophysiological responses to a stressful film." Psychomusicology 3: 44-54.

Thompson, W.F., Russo, F.A., and Sinclair, D.
1996 "Effects of underscoring on the perception of closure in filmed events." Psychomusicology 13: 9-27.

Uhrbrock, R.S.
1961 "Music on the job: Its influence on worker morale and production." Personnel Psychology 14: 9-38.

Vitouch, O.
2001 "When your ear sets the stage: Musical context effects in film perception." Psychology of Music 29: 70-83.

von Ehrenfels, C.
1890 "Über Gestaltqualitäten." Vierteljahrsschrift für wissenschaftliche Philosophie 14: 249-292.

Wedin, L.
1972 "A multidimensional study of perceptual-emotional qualities in music." Scandinavian Journal of Psychology 13: 1-17.

Wertheimer, M.
1925 Über Gestalttheorie. Erlangen: Weltkreis-Verlag.

Yeston, M.
1976 The stratification of musical rhythm. New Haven, CT: Yale University Press.

Zettl, H.
1990 Sight, sound, motion: Applied media aesthetics. 2nd ed. Belmont, CA: Wadsworth Publishing Co.
