Biber (2004) PDF
Douglas Biber
Northern Arizona University, USA
[email protected]
Abstract
Multi-dimensional (MD) analysis is a methodological approach that applies multivariate statistical techniques
(especially factor analysis and cluster analysis) to the investigation of register variation in a language. The
approach was originally developed to analyze the full range of spoken and written registers in a language. Early
studies focused on English register variation (Biber 1985, 1986 and 1988), while later studies have applied the
same approach to Somali, Korean, Tuvaluan, Taiwanese, and Spanish.
Surprisingly, these studies have found some striking similarities in the underlying dimensions that distinguish
among spoken and written registers in these diverse languages. It is even more surprising that MD studies of
restricted discourse domains have also uncovered dimensions that are similar in linguistic form and function to
the more general studies of register variation.
The present study presents an MD analysis of a single register: conversation. Three primary dimensions of
variation are identified, and then cluster analysis is used to distinguish among six conversation text types. The
dimensions and text types are interpreted in linguistic and functional terms.
The author's expectation was that a unique set of dimensions would emerge to characterize the variation
among conversational texts. Instead, the three dimensions identified here turn out to be closely related to
dimensions identified in previous analyses of general register variation. Taken together with previous studies, the
present study of conversation raises the possibility of universal dimensions of variation.
1. Introduction
Multi-dimensional (MD) analysis is a methodological approach that applies multivariate
statistical techniques (especially factor analysis and cluster analysis) to the investigation of
register variation in a language. The approach was originally developed to analyze the range
of spoken and written registers in English (Biber 1985, 1986 and 1988). There are two major
quantitative steps in an MD analysis: (1) identifying the salient linguistic co-occurrence
patterns in a language; and (2) comparing spoken and written registers in the linguistic space
defined by those co-occurrence patterns. In a third step, it is possible to identify groupings of
texts ("text types") that are maximally similar in their multi-dimensional profiles.
Almost any linguistic feature will vary in its distribution across registers, reflecting the
discourse functions of the feature in relation to the situational characteristics of each register
(see, e.g., the grammatical descriptions in the Longman Grammar of Spoken and Written
English; Biber et al., 1999). However, individual features cannot reliably distinguish among
registers: There are simply too many different linguistic characteristics to consider, and
individual features often have idiosyncratic distributions. Instead, analyses based on linguistic
co-occurrence and alternation patterns are required to uncover general register differences.
The theoretical importance of linguistic co-occurrence has been emphasized by linguists such
as Firth, Halliday, Ervin-Tripp, and Hymes. Brown and Fraser (1979: 38-39) observe that it
"can be misleading to concentrate on specific, isolated [linguistic] markers without taking into
account systematic variations which involve the co-occurrence of sets of markers". Ervin-Tripp
(1972) and Hymes (1974) identify "speech styles" as varieties that are defined by a
shared set of co-occurring linguistic features. Halliday (1988: 162) defines a register as "a
cluster of associated features having a greater-than-random...tendency to co-occur".
JADT 2004 : 7es Journées internationales d'Analyse statistique des Données Textuelles
The MD approach gives formal status to the notion of linguistic co-occurrence, by providing
empirical methods to identify and interpret co-occurrence patterns as underlying dimensions
of variation. The co-occurrence patterns comprising each dimension are identified
quantitatively through factor analysis. It is not the case, though, that quantitative techniques are
sufficient in themselves for MD analyses of register variation. Rather, qualitative techniques
are required to interpret the functional bases underlying each set of co-occurring linguistic
features. The dimensions of variation have both linguistic and functional content. The
linguistic content of a dimension comprises a group of linguistic features (e.g.,
nominalizations, prepositional phrases, attributive adjectives) that co-occur with a high frequency in
texts. Based on the assumption that co-occurrence reflects shared function, these
co-occurrence patterns are interpreted in terms of the situational, social, and cognitive functions most
widely shared by the linguistic features. That is, linguistic features co-occur in texts because
they reflect shared functions.
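The idea of extracting a co-occurrence pattern from feature counts can be illustrated with a minimal sketch. It is not the factor analysis used in MD studies: it takes the first principal component of the feature correlation matrix as a simple stand-in, and the feature names and per-text rates are invented for illustration only.

```python
import numpy as np

def first_component_loadings(counts):
    """Loadings of each feature on the first principal component of the
    feature correlation matrix (a simple stand-in for factor analysis)."""
    corr = np.corrcoef(counts, rowvar=False)           # feature-by-feature correlations
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])   # scale by sqrt of largest eigenvalue
    # The sign of an eigenvector is arbitrary; fix it so the first feature loads positively.
    return loadings if loadings[0] >= 0 else -loadings

# Hypothetical rates per 1,000 words for five texts (rows) and four features (columns).
features = ["nominalizations", "prepositions", "contractions", "2nd_person_pronouns"]
counts = np.array([
    [30.0, 120.0,  5.0,  2.0],
    [25.0, 110.0, 10.0,  6.0],
    [12.0,  90.0, 25.0, 20.0],
    [ 8.0,  80.0, 35.0, 30.0],
    [ 5.0,  70.0, 40.0, 38.0],
])

loadings = first_component_loadings(counts)
for name, value in zip(features, loadings):
    print(f"{name:22s} {value:+.2f}")
```

In this toy data the two noun-phrase features load with one sign and the two interactive features with the opposite sign, which is the kind of complementary pattern that is then interpreted functionally.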
Several experiments have been carried out to evaluate the reliability (and to a lesser extent
validity) of the original MD analysis of register variation in English. For example, Biber
(1990) shows that factor analyses carried out on split corpora result in nearly the same
dimensions of variation, as long as the texts in those corpora are sampled to include
equivalent ranges of register variation. Biber (1993) shows how these dimensions can be used to
predict the register category of unclassified texts with a high degree of accuracy (using
discriminant analysis). And Biber (1992) uses confirmatory factor analysis to test the
goodness of fit of several factorial models determined on theoretical grounds, confirming the basic
structure identified using exploratory factor analysis in the 1988 analysis.
While early MD studies focused on register variation in English, subsequent studies have
applied the same approach to Somali, Korean, Tuvaluan, Taiwanese, and Spanish (see, e.g.,
Biber, 1995; Jang, 1998). Although these studies all apply the same methodological approach,
they are carried out independently. In each case, a corpus was designed to represent the range
of spoken and written registers found in the target culture, and a computational tagger was
written to capture the grammatical structure of the target language. The set of linguistic
variables used in each analysis includes the full range of lexical/grammatical distinctions that
are relevant in the target language. Despite this fact, the resulting MD analyses have turned
out to be strikingly similar in some respects. In particular, the analyses of all languages have
uncovered dimensions relating to interactiveness/involvement versus informational focus, the
expression of personal stance, and narrative versus non-narrative discourse (see Biber, 1995,
especially Chapter 7).
The MD methodological framework has also been applied to more restricted discourse
domains.¹ These include analyses of elementary school registers (Reppen, 1994 and 2001),
job interview language (White, 1994), television talk shows (Connor-Linton, 1989), 18th
century written and speech-based registers (Biber, 2001), university spoken and written
registers (Biber, 2003), and academic subregisters (e.g., Grabe, 1987; Kanoksilapatham,
2003). Many of these studies have identified dimensions of variation similar to those found in
the cross-linguistic studies, especially relating to the same functional concerns of
interactiveness/involvement versus informational focus, the expression of personal stance, and
narrative versus non-narrative discourse.
¹ There have also been several studies of specific registers that apply the dimensions that were identified and
interpreted in the 1988 MD analysis of spoken and written variation in English (see, for example, the collection
of studies in Conrad and Biber, 2001). It is important to note that these studies do not entail separate MD
analyses. That is, these studies apply the dimensions identified in the 1988 MD analysis of English to some new
discourse domain, but they do not undertake new MD analyses (i.e., involving a new factor analysis).
CONVERSATION TEXT TYPES: A MULTI-DIMENSIONAL ANALYSIS
This result is surprising for two reasons. First, the statistical technique of factor analysis,
like all correlational techniques, requires variability. Two linguistic variables cannot be
shown to correlate unless the texts included in an analysis represent a wide range of variation
for those variables. Similarly, factor analysis cannot reliably identify sets of co-varying
linguistic features unless the texts included in the analysis represent a wide range of variation
for the full set of features. Thus, factor analysis is most appropriate for general analyses of
spoken and written texts, which represent an extensive range of variation for almost any
linguistic feature (see the detailed analyses in Biber et al., 1999). In contrast, it might be
assumed that factor analysis is less appropriate for analyses of texts from a single, restricted
discourse domain, because that domain will represent a much smaller range of variation.
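The dependence of correlational techniques on variability can be illustrated with a small simulation. This is a sketch under invented assumptions, not data from any corpus: two hypothetical features are both driven by a single latent tendency, and the correlation is computed once over all simulated texts and once over a narrow band that mimics a restricted domain.

```python
import numpy as np

rng = np.random.default_rng(0)
n_texts = 2000

# Hypothetical latent tendency (e.g., how "informational" a text is) plus feature-specific noise.
register_effect = rng.normal(size=n_texts)
feature_a = register_effect + 0.5 * rng.normal(size=n_texts)
feature_b = register_effect + 0.5 * rng.normal(size=n_texts)

# Correlation across the full range of variation.
full_r = np.corrcoef(feature_a, feature_b)[0, 1]

# Keep only texts from a narrow band of the latent tendency, mimicking a restricted domain.
narrow = np.abs(register_effect) < 0.3
restricted_r = np.corrcoef(feature_a[narrow], feature_b[narrow])[0, 1]

print(f"full range:       r = {full_r:.2f}")
print(f"restricted range: r = {restricted_r:.2f}")
```

With the range restricted, the same two features correlate far more weakly, which is why a restricted domain might be expected to give factor analysis less to work with.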
Second, to the extent that there is linguistic variability among the texts in a restricted
discourse domain, there is no reason to assume that it would be similar to the patterns of
variation found in a general-purpose corpus. We would rather expect to find different linguistic
features varying in a restricted domain, reflecting the specific functional differences found in
that domain. In the MD analyses, these specific patterns of linguistic variation should result in
dimensions of variation that are unique to each discourse domain.
Previous MD analyses have shown that restricted discourse domains represent sufficient
linguistic variability for the successful application of this methodological approach. More
surprisingly, these analyses show that some of the same basic dimensions of variation seem to
be fundamentally important across restricted and general discourse domains. (In addition,
there are other dimensions that are unique to a particular domain.) This repeated finding,
that some dimensions occur across languages and across general and restricted discourse
domains, raises the possibility of universal dimensions of register variation.
The present study further explores this possibility by undertaking an MD analysis of linguistic
variation within a single spoken register: conversation. Factor analysis is used to identify the
linguistic dimensions of variation operating in this discourse domain, and then cluster analysis
is used to identify conversation text types that are well-defined in that multi-dimensional
space. The following sections describe these analyses, followed by discussion of the more
general theoretical implications for the study of register variation.
of the positive set of features, that same conversation will tend to have low frequencies of the
negative set of features, and vice versa. In the interpretation of a factor, it is important to
consider the likely reasons for the complementary distribution between positive and negative
feature sets as well as the reasons for the co-occurrence patterns within those sets.
For example, the positive features on Factor 1 (e.g., long words, nominalizations,
prepositional phrases, abstract nouns, relative clauses) all relate to informational purposes. These
features are mostly associated with elaborated noun phrases and a dense integration of
information in a text; previous MD studies have shown these features to be typical of written
non-fictional registers intended for specialist audiences (see, e.g., Biber, 1995; Biber and Finegan,
2001).
In contrast, the negative features on Dimension 1 reflect a focus on the immediate interaction
and activities: present tense verbs, contractions, 1st and 2nd person pronouns, and activity
verbs. The overall interpretation of Dimension 1 is thus relatively straightforward, showing
that conversations tend to be either "informational" or "interactive", but not both. The
functional label "Information-focused versus interactive discourse" is proposed for this dimension.
The positive features on Dimension 2 are mostly linguistic features that express stance:
personal attitudes or indications of likelihood. In the 1988 MD study of spoken and written
register variation, several of these features were shown to co-occur typically with interactive
and reduced structure features (on Dimension 1). In contrast, the analysis here shows that
stance-focused discourse is not necessarily highly interactive discourse, and vice versa. (This
dimension also includes several specific features that were not distinguished in the feature set
used for the 1988 analysis, such as likelihood/mental verb + that-clause and factual adverbs.)
The negative pole of Dimension 2 shows a surprising co-occurrence of only two features:
nouns and WH-questions. In past analyses, nouns have co-occurred with other stereotypically
literate features (like adjectives, prepositional phrases, etc.), while WH-questions have co-
occurred with stereotypically oral and interactive features. The interpretation here must
consider why these two features would tend to co-occur in conversations, and why they would
Biber, 1989 and 1995). The present section describes the text types that can be distinguished
linguistically within the single register of conversation.
The dimensions of variation (see Section 4 above) are used as linguistic predictors for the
clustering of conversations. The individual feature counts were first standardized so that each
feature has a comparable scale, with a mean of 0.0 and a standard deviation of 1.0. (The
standardization was based on the overall means and standard deviations for each feature in the
conversation corpus.) Then, dimension scores were computed by summing the standardized
frequencies for the features comprising each of the three dimensions. The cluster analysis is
based on the three dimension scores for each conversation.
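The standardization and scoring steps described above can be sketched as follows. The counts and the assignment of features to poles are invented for illustration. Subtracting the standardized frequencies of negative-pole features is an assumption about how the two poles are combined; the text itself only says that the standardized frequencies are summed.

```python
import numpy as np

def dimension_scores(counts, positive_idx, negative_idx=()):
    """Standardize each feature across texts, then combine the features of one dimension."""
    means = counts.mean(axis=0)
    stds = counts.std(axis=0, ddof=1)        # sample standard deviation per feature
    z = (counts - means) / stds              # each column now has mean 0 and SD 1
    score = z[:, list(positive_idx)].sum(axis=1)
    if len(negative_idx):
        # Assumption: negative-pole features count against the dimension score.
        score = score - z[:, list(negative_idx)].sum(axis=1)
    return z, score

# Hypothetical rates per 1,000 words: nominalizations, contractions, 2nd person pronouns.
counts = np.array([
    [30.0,  5.0,  2.0],
    [20.0, 15.0, 10.0],
    [10.0, 25.0, 20.0],
    [ 4.0, 40.0, 35.0],
])

z, scores = dimension_scores(counts, positive_idx=[0], negative_idx=[1, 2])
print(np.round(scores, 2))
```

Because every feature is on the same scale after standardization, a rare feature contributes to the dimension score as much as a frequent one.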
The methodology in this analytical step can be illustrated conceptually by the 2-dimensional
plot in Figure 1. Each point in Figure 1 represents a conversation, plotting the scores for that
conversation on Dimensions 1 and 2. The numbers in the figure show the cluster number for
each conversation, based on the results of the cluster analysis. Conversations that are similar
in their dimension scores are grouped together as a cluster, or "text type". For example, the
conversations labelled with a "1" in Figure 1 all have large positive scores on Dimension 1
(the vertical axis) and large negative scores on Dimension 2 (the horizontal axis). In contrast,
Cluster 2 has positive scores on both Dimensions 1 and 2.
[Figure 1: character plot of the conversations, each marked by its cluster number (1 to 6); vertical axis: Dimension 1 (about -10 to 30); horizontal axis: Dimension 2 (-10 to 10).]
Figure 1. Plot of VBDUs along Dimension 1 vs. Dimension 2 (showing all DUs with a distance < 3
from the cluster centroid. Symbol is value of CLUSTER; NOTE: 194 obs hidden.)
Cluster analysis performs this grouping statistically, based on the scores for all three
dimensions. Figure 1 shows the distribution across only two dimensions (1 and 2); these two
dimensions were chosen because they provide a good visual display of how the conversations within
each cluster are grouped based on their dimension scores. However, the actual cluster analysis
uses all three dimension scores to identify the groupings of conversations that are maximally
similar in their linguistic characteristics.
Cluster analysis is an exploratory statistical technique. The FASTCLUS procedure from SAS
was used for the present analysis. Disjoint clusters were analyzed because there was no
theoretical reason to expect a hierarchical structure. Peaks in the Cubic Clustering Criterion and
the Pseudo-F Statistic (produced by FASTCLUS) were used to determine the number of
clusters. These measures are heuristic devices that reflect goodness-of-fit: the extent to which the
texts within a cluster are similar, while the clusters are maximally distinguished. In the
present case, these measures had peaks for the 3-cluster solution and for the 6-cluster solution.
The latter was chosen for subsequent analyses because it provided greater discrimination
among the specialized clusters, facilitating the interpretation of those clusters as conversation
text types.
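The logic of choosing the number of disjoint clusters from a peak in a goodness-of-fit statistic can be sketched as follows. This is not the SAS FASTCLUS procedure: it uses a plain k-means loop with a farthest-point seeding rule (both assumptions made for this sketch), computes only a pseudo-F ratio (the Cubic Clustering Criterion is not reproduced), and runs on synthetic three-dimensional scores with three well-separated groups.

```python
import numpy as np

def farthest_point_seeds(points, k):
    """Pick k initial seeds: the first point, then repeatedly the point farthest from all seeds."""
    seeds = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - s, axis=1) for s in seeds], axis=0)
        seeds.append(points[np.argmax(dists)])
    return np.array(seeds)

def kmeans(points, k, n_iter=100):
    """Plain k-means producing disjoint clusters; returns one label per point."""
    centroids = farthest_point_seeds(points, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def pseudo_f(points, labels):
    """Between-cluster variance over within-cluster variance, each divided by its degrees of freedom."""
    clusters = np.unique(labels)
    n, k = len(points), len(clusters)
    grand_mean = points.mean(axis=0)
    within = between = 0.0
    for c in clusters:
        members = points[labels == c]
        center = members.mean(axis=0)
        within += ((members - center) ** 2).sum()
        between += len(members) * ((center - grand_mean) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

# Synthetic dimension scores: three groups of 50 texts in a 3-dimensional space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
scores = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])

pseudo_f_by_k = {k: pseudo_f(scores, kmeans(scores, k)) for k in range(2, 6)}
best_k = max(pseudo_f_by_k, key=pseudo_f_by_k.get)
print(pseudo_f_by_k, best_k)
```

On real data the statistic may show more than one peak, as in the present study, and the choice among peaked solutions then rests on interpretability rather than on the statistic alone.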
Figure 1 shows the distribution of these six clusters in only a 2-dimensional space, whereas
the cluster analysis is actually based on a 3-dimensional space. It turns out that the third
dimension is also important in defining some clusters. For example, Cluster 4 is not sharply
delimited in terms of Dimensions 1 and 2, but all conversations in this cluster have large positive
scores on Dimension 3 ("narrative").
Tables 4 and 5 provide a descriptive summary of the cluster analysis results. Table 4 shows
the number of conversations grouped into each cluster, while Table 5 gives descriptive
statistics for each dimension across the clusters. The clusters differ notably in their
distinctiveness: the smaller clusters are more specialized and more sharply distinguished
linguistically. For example, Cluster 1 has only 40 conversations; linguistically, the conversations
grouped in Cluster 1 have extremely large positive scores on Dimension 1 ("informational"),
large negative scores on Dimension 2 ("context-focused"), and scores near 0.0 on Dimension
3 ("narrative"). At the other extreme, Cluster 5 is a "general" text type: it is large (680
conversations) and relatively unmarked in its dimension scores.
Cluster  Frequency  RMS Std    Maximum Distance from  Nearest  Distance Between
                    Deviation  Seed to Observation    Cluster  Cluster Centroids
1        40         4.6276     19.8319                2        18.2629
2        116        4.3710     16.9770                4         9.5460
3        496        3.2839     18.7622                5         8.6697
4        308        3.4828     18.2692                6         8.2551
5        680        3.2643     16.2853                4         8.3268
6        526        3.2447     17.4048                4         8.2551
Cluster Means
Cluster  Dim. 1  Dim. 2  Dim. 3
1         22.15   -5.08   -0.99  (Informational context-focused)
2          7.67    5.87    0.93  (Informational stance-focused)
3         -8.04   -5.19   -2.88  (Interactive context-focused)
4          2.12   -0.31    5.61  (Narrative)
5         -4.15    1.74    0.55  (Unmarked interactive)
6          2.63   -4.46   -1.50  (Unmarked context-focused)
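As a consistency check on the two tables, the nearest cluster and the centroid distances can be recomputed from the cluster means reported above. The sketch below assumes plain Euclidean distance between the rounded three-dimensional means, so small rounding differences from the tabulated distances are expected. It also computes each cluster's share of the conversations.

```python
import numpy as np

# Cluster sizes and mean dimension scores as reported in the tables above.
sizes = np.array([40, 116, 496, 308, 680, 526])
means = np.array([
    [22.15, -5.08, -0.99],
    [ 7.67,  5.87,  0.93],
    [-8.04, -5.19, -2.88],
    [ 2.12, -0.31,  5.61],
    [-4.15,  1.74,  0.55],
    [ 2.63, -4.46, -1.50],
])

# Pairwise Euclidean distances between cluster centroids.
diffs = means[:, None, :] - means[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))
np.fill_diagonal(dist, np.inf)            # a cluster is not its own neighbour

nearest = dist.argmin(axis=1) + 1         # cluster numbers are 1-based
nearest_dist = dist.min(axis=1)
shares = 100 * sizes / sizes.sum()        # percentage of all conversations

for i in range(len(sizes)):
    print(f"Cluster {i + 1}: nearest = {nearest[i]}, "
          f"distance = {nearest_dist[i]:.2f}, share = {shares[i]:.1f}%")
```

The recomputed nearest clusters match the tabulated column, and the shares agree with the proportions quoted for the individual conversation types (40 of 2,166 conversations is about 2%; 496 is about 23%).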
The clusters can be interpreted as Conversation Text Types, because each cluster represents a
grouping of conversations with similar linguistic profiles. Figure 2 compares the linguistic
characteristics of the four most distinctive of these conversation types, plotting their mean
dimension scores. The "general" conversation types (Clusters 5 and 6) are not plotted in
Figure 2.
[Figure 2: mean dimension scores of the four most distinctive conversation types: Informational context-focused, Informational stance-focused, Interactive context-focused, and Narrative-focused.]
Taken together, Table 4 and Figure 2 provide the basis for the interpretation of each
conversation type. (These interpretations are refined by consideration of individual conversations
from each type.)
Type 1 is the most specialized, with the smallest number of texts (only 40, or about 2% of the
conversations in the corpus). Linguistically, these conversations are extremely informational
(Dimension 1) and focused on the context (Dimension 2). Text Sample 1 provides an example
of a conversation from this cluster. This text illustrates the dense use of informational
features, such as nominalizations (e.g., conversation, sophistication, agreement, possibility,
information), other long words (e.g., paperwork, computer-wise, computerized, consequently),
attributive adjectives (modern, massive, great, preliminary, certified), passives (be/getting
inundated), prepositional phrases (to you, about the conversation, with Alec, on a piece of
paper), and relative clauses (things that you're liable to get asked). Although texts from this
cluster would be considered interactive and involved in comparison to written expository
texts, they are highly informational in comparison to other conversational texts.
B: Right
A: see if, if it's a possibility
B: Right
A: What he's looking for is certified numbers, field numbers
B: That sort of
A: all this sort of information
Type 2 is also relatively specialized (with only 116 conversations grouped into this cluster).
Linguistically, this conversation type is relatively informational (Dimension 1) but especially
marked for being highly stance-focused (Dimension 2). (This conversation type should be
contrasted with Type 5: a much larger cluster that is stance-focused and highly interactive
rather than informational.) Text Sample 2 illustrates the typical linguistic characteristics of
Conversation Type 2. Notice especially the frequent mental verbs (e.g., know, think, expect,
want), stance verbs controlling that-clauses, usually with the "that" omitted (e.g., would have
thought, I think, I suppose), and the frequent hedges and stance adverbs and adverbials
(surely, obviously, really, actually, probably, certainly, to be perfectly frank).² Texts from this
cluster are informational, in that they are focused on discussion of a particular topic rather
than the immediate interpersonal interaction, but their primary purpose is the expression of
personal stance in relation to that topic.
² Note also the dense use of discourse markers (e.g., I mean, you know) supporting the expression of stance in
this conversation.
A: I think the thing is going to come unstuck . in the, I think the biggest thing is, I was thinking, is the fact
that you've got to get <unclear> I wouldn't get a commitment from Social Services until they see a property
actually ready for occupation. Now I'm not gonna be prepared to go through the whole business and then find
them say oh sorry you're wrong.
C: Property is the biggest bugbear.
A: Yeah. Because I don't think
C: If you're actually sitting on <unclear>
A: I don't think the banks are gonna want to invest. To be perfectly frank. . You see the only way we can get
equity out and put money in ourselves is by selling this place.
C: Yes.
A: Therefore if we don't actually want to live in the same place as the residents, which I certainly wouldn't want
to do, right. We'd have to buy two <unclear> adjoining -
In contrast, Type 3 conversations are much more common (496 conversations grouped into
this cluster, or 23% of all conversations in the LSWE Corpus). These conversations are
extremely interactive and focused on the immediate context, as illustrated by Text Sample 3.
The turns in this conversation are short and highly interactive (notice the dense use of I and
you), and there is a dense use of common nouns together with WH-questions to express
context-dependent information.
B: Yeah.
text omitted
B: Guess what Kirsty was doing when
A: What?
B: we was just practising for recorders?
A: What?
B: She was going like this, and the music was on, she put her feet out and she put the music on her feet.
A: Oh well.
Finally, Type 4 conversations are also relatively common (308 texts). This cluster is relatively
unmarked on Dimensions 1 and 2, but these conversations are extremely narrative in their
Dimension 3 characterization. Text Sample 4 illustrates these characteristics. Note that these
conversations are not necessarily extended stories (although some of them are). Rather, as in
the present case, these conversations can be constructed out of extended discussion of past
events (with frequent past tense verb phrases, 3rd person pronouns, and communication verbs,
especially "said" in this conversation), often coupled with commentary on their immediate
relevance.
6. Conclusion
The three dimensions identified by this factor analysis of a conversational corpus are
surprisingly similar to the dimensions of variation found in the earlier MD analysis of general
spoken and written registers (Biber, 1988). Both analyses have a dimension that reflects the
distinction between involved/interactive versus informational discourse; both analyses have a
narrative dimension; and both have dimensions related to the expression of stance. The
large-scale MD analyses of spoken and written registers in Somali and Korean similarly identified
dimensions associated with these functions, composed of similar kinds of linguistic features.
Even more surprisingly, several MD analyses of restricted discourse domains have identified
dimensions with similar formal and functional correlates (compare, for example, Reppen's
(1994, 2001) analysis of elementary school registers with White's (1994) analysis of job
interview registers). The fact that similar dimensions are found to be basic even in a corpus
restricted to conversation suggests that these might be candidates for universal parameters of
variation.
Comparing the present analysis to previous MD studies provides two complementary
perspectives on the characteristics of conversation. In comparison to the full range of spoken and
written registers, conversation is distinctive in being extremely interactive, involved, focused
on the immediate context and personal stance, and constrained by real-time production
circumstances. However, when conversation is considered on its own terms, we discover
systematic patterns of variation among conversational texts (see also Carter and McCarthy,
1997; McCarthy, 1998; Quaglio, 2004; Quaglio and Biber, to appear). Interestingly, the
present analysis indicates that the major parameters of variation internal to conversation are a
mirror image of the dimensions of variation that distinguish among spoken and written
registers.
References
Biber D. (1985). Investigating macroscopic textual variation through multi-feature/multi-dimensional
analyses. Linguistics, vol. (23): 337-60.
Biber D. (1986). Spoken and written textual dimensions in English: Resolving the contradictory
findings. Language, vol. (62): 384-414.
Biber D. (1988). Variation across speech and writing. Cambridge University Press.
Biber D. (1989). A typology of English texts. Linguistics, vol. (27): 3-43.
Biber D. (1990). Methodological issues regarding corpus-based analyses of linguistic variation.
Literary and Linguistic Computing, vol. (5): 257-269.
Biber D. (1992). On the complexity of discourse complexity: A multidimensional analysis. Discourse
Processes, vol. (15): 133-163. (Reprinted in Conrad and Biber (Eds) (2001): 215-240.)
Biber D. (1993). Using register-diversified corpora for general language studies. Computational
Linguistics, vol. (19): 219-241.
Biber D. (1994). An analytical framework for register studies. In Biber D. and Finegan E. (Eds),
Sociolinguistic perspectives on register. Oxford University Press: 31-56.
Biber D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge
University Press.
Biber D. (2001). Dimensions of variation among 18th century speech-based and written registers. In
Conrad S. and Biber D. (Eds): 200-214.
Biber D. (2003). Variation among university spoken and written registers: A new multi-dimensional
analysis. In Meyer C. and Leistyna P. (Eds), Corpus analysis: Language structure and language
use. Rodopi.
Biber D. and Finegan E. (2001). Diachronic relations among speech-based and written registers in
English. In Conrad S. and Biber D. (Eds): 66-83.
Biber D., Johansson S., Leech G., Conrad S. and Finegan E. (1999). The Longman grammar of spoken
and written English. Longman.
Brown P. and Fraser C. (1979). Speech as a marker of situation. In Scherer K.R. and Giles H. (Eds),
Social markers in speech. Cambridge University Press: 33-62.
Carter R. and McCarthy M. (1997). Exploring Spoken English. Cambridge University Press.
Connor-Linton J. (1989). Crosstalk: A multi-feature analysis of Soviet-American spacebridges. Ph.D.
Dissertation. University of Southern California.
Conrad S. and Biber D. (Eds) (2001). Variation in English: Multi-Dimensional Studies. Longman.
Ervin-Tripp S. (1972). On sociolinguistic rules: Alternation and co-occurrence. In Gumperz J.J. and
Hymes D. (Eds), Directions in sociolinguistics. Holt: 213-250.
Grabe W. (1987). Contrastive rhetoric and text-type research. In Connor U. and Kaplan R.B. (Eds),
Writing across languages: Analysis of L2 text. Addison-Wesley: 115-138.
Halliday M.A.K. (1988). On the language of physical science. In Ghadessy M. (Ed.), Registers of
written English: Situational factors and linguistic features. Pinter: 162-178.
Hymes D. (1974). Foundations in sociolinguistics: An ethnographic approach. University of
Pennsylvania Press.
Jang S.-Ch. (1998). Dimensions of spoken and written Taiwanese: A corpus-based register study.
Ph.D. Dissertation. University of Hawaii.
Kanoksilapatham B. (2003). A Corpus-based Investigation of Biochemistry Research Articles: Linking
Move Analysis with Multidimensional Analysis. Ph.D. Dissertation. Georgetown University.
McCarthy M. (1998). Spoken Language and Applied Linguistics. Cambridge University Press.
Quaglio P. (in preparation). Conversation and TV Dialogue: A Corpus-based Study of NBC's Friends.
Ph.D. Dissertation. Northern Arizona University.
Quaglio P. and Biber D. (to appear). The grammar of conversation. In McMahon A. and Aarts B.
(Eds), The Handbook of English Linguistics. Blackwell.
Reppen R. (1994). Variation in elementary student writing. Ph.D. Dissertation. Northern Arizona
University.
Reppen R. (2001). Register variation in student and adult speech and writing. In Conrad S. and Biber
D. (Eds): 187-199.
White M. (1994). Language in job interviews: Differences relating to success and socioeconomic
variables. Ph.D. Dissertation. Northern Arizona University.
Appendix A.
List of grammatical, syntactic, lexico-grammatical, and semantic features
identified by the Biber Tagger
1. Pronouns and pro-verbs
first person pronouns
second person pronouns
third person pronouns (excluding it)
pronoun it
demonstrative pronouns (this, that, these, those as pronouns)
indefinite pronouns (e.g., anybody, nothing, someone)
pro-verb do
3. Prepositional phrases
4. Coordination
phrasal coordination (NOUN and NOUN; ADJ and ADJ; VERB and VERB; ADV and ADV)
independent clause coordination (clause initial and)
5. WH-Questions
6. Lexical specificity
type/token ratio
word length
7. Nouns
nominalizations (ending in -tion, -ment, -ness, -ity)
nouns
7a. Semantic categories of nouns
animate noun (e.g., teacher, doctor, employee)
cognitive noun (e.g., fact, knowledge, understanding)
concrete noun (e.g., rain, sediment, modem)
technical/concrete noun
quantity noun (e.g., date, energy, minute)
place noun (e.g., habitat, room, ocean)
group/institution noun (e.g., committee, bank, congress)
abstract/process nouns (e.g., application, meeting, balance)
8. Verbs
8a. Tense and aspect markers
past tense
perfect aspect verbs
non-past tense
8b. Passives
agentless passives
by passives
8c. Modals
possibility modals (can, may, might, could)
necessity modals (ought, must, should)
predictive modals (will, would, shall)
8d. Semantic categories of verbs
be as main verb
activity verb (e.g., smile, bring, open)
communication verb (e.g., suggest, declare, tell)
mental verb (e.g., know, think, believe)
causative verb (e.g., let, assist, permit)
occurrence verb (e.g., increase, grow, become)
existence verb (e.g., possess, reveal, include)
aspectual verb (e.g., keep, begin, continue)
8e. Phrasal verbs
intransitive activity phrasal verb (e.g., come on, sit down)
transitive activity phrasal verb (e.g., carry out, set up)
transitive mental phrasal verb (e.g., find out, give up)
transitive communication phrasal verb (e.g., point out)
intransitive occurrence phrasal verb (e.g., come off, run out)
copular phrasal verb (e.g., turn out)
aspectual phrasal verb (e.g., go on)
9. Adjectives
attributive adjectives
predicative adjectives
9a. Semantic categories of adjectives
size attributive adjectives (e.g., big, high, long)
time attributive adjectives (e.g., new, young, old)
color attributive adjectives (e.g., white, red, dark)
evaluative attributive adjectives (e.g., important, best, simple)
relational attributive adjectives (e.g., general, total, various)
topical attributive adjectives (e.g., political, economic, physical)
14. WH-clauses
15. To-clauses
15a. To-clauses controlled by a verb (e.g., He offered to stay)
speech act verb (e.g., urge, report, convince)
cognition verb (e.g., believe, learn, pretend)
desire/intent/decision verb (e.g., aim, hope, like, prefer, want)
modality/cause/effort verb (e.g., allow, leave, order)
probability/simple fact verb (e.g., appear, happen, seem)
15b. To-clauses controlled by an adjective
certainty adjectives (e.g., prone, due, apt)
ability/will adjectives (e.g., competent, hesitant)
personal affect adjectives (e.g., annoyed, nervous)
ease/difficulty adjectives (e.g., easy, impossible)
evaluative adjectives (e.g., convenient, smart)
15c. To-clauses controlled by a noun (e.g., agreement, authority, intention)