Fractal Structures in Language. The Question of The Imbedding Space
ABSTRACT
This contribution deals with the hypothesis of the existence of fractal structures in language, which was proposed by Luděk Hřebíček in 1992. I will propose a systems-theoretical point of view in order to define both an imbedding space appropriate to represent linguistic fractal structures and mappings between the observed data and the imbedding space.
INTRODUCTION
The hypothesis of fractal structures in language was first formulated by Luděk Hřebíček (see Hřebíček 1992: 91f) as a consequence of his discovery of sign and vehicle aggregations as supra-sentence structures in texts. This ingenious idea, which was derived from the Menzerath-Altmann law, is of great importance for the theory of quantitative linguistics. In recent years Luděk Hřebíček has undertaken further approaches to characterize language phenomena by fractal dimensions. He has considered text as a time series and analysed it by means of Hurst exponents (see Hřebíček 1995, 1997). In this paper I will examine the hypothesis of fractal structures in language from a mathematical point of view. I shall try to shed light on the implications of the respective mathematical apparatus for linguistic theory, and I shall try to answer the question of where fractal structures in language can be found.
There are several definitions of fractal dimension. Hausdorff's definition is the oldest and probably the most important. The Hausdorff dimension has the advantage of being defined for any subset of the imbedding space, which in most cases is $\mathbb{R}^n$. A major disadvantage of the Hausdorff dimension is that it is hard to calculate in many cases (Falconer 1990: 25). Let us recall the definition of the Hausdorff dimension of a fractal set in an imbedding space $\mathbb{R}^n$ (for details see Falconer 1990: 25). The diameter $|U|$ of a non-empty subset $U$ of $\mathbb{R}^n$ is defined as the greatest distance between any pair of points of $U$:
$$|U| = \sup\{\, |x - y| : x, y \in U \,\} \tag{1}$$
If $\{U_i\}_{i=1,2,\ldots}$ is a countable or finite collection of sets of diameter at most $\delta$ that cover a set $B$, we say that $\{U_i\}$ is a $\delta$-cover of $B$. For any $\delta > 0$ we define
$$H^s_\delta(B) = \inf\Big\{ \sum_{i=1}^{\infty} |U_i|^s : \{U_i\} \text{ is a } \delta\text{-cover of } B \Big\} \tag{2}$$
Thus we look at all covers of $B$ consisting of sets of diameter at most $\delta$ and seek to minimize the sum of the $s$-th powers of the diameters. As $\delta$ decreases, the infimum $H^s_\delta(B)$ increases and so approaches a limit as $\delta \to 0$. We write:
$$H^s(B) = \lim_{\delta \to 0} H^s_\delta(B) \tag{3}$$
For $\delta < 1$, $H^s_\delta(B)$ is a non-increasing function of $s$. The Hausdorff dimension of the set $B$ is the critical value $D$ at which $H^s(B)$ jumps from $\infty$ to $0$. That is,
$$H^s(B) = \lim_{\delta \to 0} H^s_\delta(B) = \begin{cases} \infty & \text{if } s < D \\ 0 & \text{if } s > D \end{cases} \tag{4}$$
It is important to mention here that equation (3) contains a limit $\delta \to 0$, so the diameters of the sets $U_i$ which cover the fractal set converge to zero. Therefore the imbedding space can never be a discrete set such as $\mathbb{N}$, because in such a space there is a minimum distance $\delta_{\min} > 0$ between its elements, and the limit $\delta \to 0$ is not defined. From the definition of the Hausdorff dimension it can be deduced that if a set $B$ is finite or even countable, then its Hausdorff dimension is zero (Falconer 1990: 29). This means that observed data can never have a Hausdorff dimension different from zero, because we can only ever make a finite number of observations. When it is said that some data represent a fractal structure, this is always an idealisation in the sense that the observed structure is extrapolated to the infinitely small. The central questions one has to ask if the concept of fractal dimension is applied to linguistics are:
1. Is it appropriate to assume that the observed structures can be continued to infinitely small scales?
2. Is it convenient to assume that data are just an empirical manifestation of an idealized phenomenon which adopts real-valued quantities?
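The claim that finite data always have dimension zero can be made tangible with a small numerical sketch. It is only an illustration under stated assumptions: box counting is used as a stand-in for the Hausdorff construction, the data are a finite sample of 64 points from the middle-third Cantor set, and the function names are hypothetical. As long as the box size is coarser than the sample resolution, the estimate stays near $\log 2 / \log 3$; once the boxes are finer than the smallest gap between points, the count saturates and the estimate drifts towards zero.

```python
import math

def cantor_points(depth):
    """Left endpoints of the 2**depth intervals at construction step `depth`.

    Points are stored as integers in units of 3**-depth, so all
    arithmetic below is exact.
    """
    points = [0]
    for step in range(1, depth + 1):
        offset = 2 * 3 ** (depth - step)
        points = [q + d for q in points for d in (0, offset)]
    return points

def box_count(points, depth, k):
    """Number of boxes of side delta = 3**-k that contain at least one point."""
    return len({(p * 3 ** k) // 3 ** depth for p in points})

def dimension_estimate(points, depth, k):
    """log N(delta) / log(1/delta) for delta = 3**-k (requires k >= 1)."""
    return math.log(box_count(points, depth, k)) / (k * math.log(3))

if __name__ == "__main__":
    depth = 6                      # finite sample: 2**6 = 64 points
    sample = cantor_points(depth)
    for k in range(1, 13):
        print(k, box_count(sample, depth, k),
              round(dimension_estimate(sample, depth, k), 4))
```

The saturation at 64 boxes is the numerical counterpart of the minimum distance $\delta_{\min}$: below it, refining the scale reveals no further structure.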
[Figure 1: diagram with two levels. Text-level: text length. System-level: parameters of distribution, fractal dimensions, differential equations. The levels are linked by estimation of parameters.]
Fig. 1) The relation between the system level and the text level in quantitative linguistic reasoning. Quantities which can be obtained by counting linguistic entities are elements of the text-level. These quantities can be used in order to estimate abstract quantities on the system level, as for instance parameters of probability distributions. The probability distributions on the system level govern the behaviour of random variables which are realized in texts.
The mapping from $\mathbb{N}$ to $\mathbb{R}$ does not seem problematic at first sight, but it becomes important if one uses mathematical notions or operations which require topological properties of $\mathbb{R}$ that $\mathbb{N}$ lacks. This is the case if the fractal dimension is calculated. Due to the limit in equation (3), the notion of a fractal dimension does not make sense if the imbedding space is discrete, as in the case of $\mathbb{N}$. Another operation which is ill-defined on the raw data is differentiation, which also involves a limit $\delta \to 0$:
$$\frac{df}{dx} = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta} \tag{5}$$
So differential equations set up in quantitative linguistics tacitly imply that linguistic quantities are real-valued phenomena. This assumption is obviously not fulfilled if the raw linguistic data themselves are considered, but it makes sense if one considers abstract real-valued quantities (say parameters of probability distributions) which are estimated by some transformation of the raw data. That is, differential equations cannot directly refer to the text-level. They can only be defined on an abstract system-level (as for example in the paradigm of synergetic linguistics), where the required topological properties can be sensibly assumed.
Maki and Thompson examine the scientific process of model construction. They distinguish between a real model and a mathematical model. The real model is constructed by idealization and approximation of the real world. The mathematical model is obtained by abstraction and formalization of the real model. The application of mathematical formalisms in the mathematical model leads to conclusions which are compared with the real world. Maki and Thompson admit that in many cases it is difficult to decide where the real model ends and the mathematical model begins, but they point out that a failure to distinguish between real model and mathematical model is confusing and can lead to wrong conclusions (Maki & Thompson 1973: 4).
[Figure: diagram of the process of model construction, with the labels: real world; idealization, approximation; mathematical model; conclusions, predictions; comparison.]
I will draw the distinction between the real model and the mathematical model as follows. Quantities obtained by simply counting linguistic entities are elements of the real model. The real model, however, does not consist of mathematical objects, and mathematical operations do not exist on this level of model construction. These operations are defined only on the level of the mathematical model. The reason for this distinction is that the set of positive integers $\mathbb{N}$ is not closed under division and subtraction. Applying division and subtraction to elements of $\mathbb{N}$ generates the set of rational numbers $\mathbb{Q}$, which has completely different topological properties from $\mathbb{N}$: $\mathbb{N}$ is discrete, with a minimum distance between its elements, whereas $\mathbb{Q}$ is dense in itself, so that between any two rational numbers there is always another one. The distinction between real model and mathematical model corresponds to the distinction between the text-level, which consists of texts and of linguistic entities used in a text, and the system-level, where the processes and forces postulated in synergetic linguistics are operating (for details see Leopold 1998b: 12). The raw data obtained by counting linguistic entities in texts belong to the text-level, while mathematical transformations of these data are estimators of abstract theoretical quantities on the system-level. In most cases one may think of an abstract quantity on the system-level as the expected value of the respective numbers on the text-level. Although the quantities on the text-level are always positive integers, the quantities on the system-level can assume non-integer values. (An example: the points of a die are integer valued, whereas the expected value (3.5) is not. If one throws the die many times, the mean value will usually differ from 3.5, but the law of large numbers ensures that it converges to the expected value.)
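The die example can be simulated in a few lines. The following is a minimal sketch, assuming a fair six-sided die and a fixed random seed (both my choices for illustration); the function name is hypothetical. Each throw is an integer, as on the text-level, while the running mean is an estimator of the real-valued expected value 3.5 on the system-level.

```python
import random

def running_means(n_throws, seed=0):
    """Running mean of the points shown by a simulated fair die."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_throws + 1):
        total += rng.randint(1, 6)   # text-level: always an integer
        means.append(total / i)      # estimator of a system-level quantity
    return means

# Expected value of a fair die: (1 + 2 + ... + 6) / 6
EXPECTED = sum(range(1, 7)) / 6

if __name__ == "__main__":
    means = running_means(100_000)
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(n, round(means[n - 1], 4), "expected:", EXPECTED)
```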
I will give an example of the difference between the text-level and the system-level. Figure 3 displays empirical data on Japanese Kanji (the data were collected by Claudia Prün). Each point represents a Sino-Japanese grapheme with its frequency on the horizontal axis and the number of strokes it consists of on the vertical axis. The number of strokes is always a positive integer. Points which represent graphemes with the same number of strokes and different frequencies form a horizontal line.
Fig. 3) Grapheme-complexity versus frequency of Japanese Kanji (collected data). Every point in this figure represents a Sino-Japanese grapheme with its frequency and number of strokes.
Figure 4 characterizes the situation on the system level. It was obtained from Figure 3 by adding a uniformly distributed ($U[0,1]$) random variable to each data point. Figure 4 can be interpreted as a two-dimensional probability density function. The more points lie in a given area, the more probable is the respective combination of grapheme complexity (length) and frequency in the observed text. The marginal density in the vertical direction represents a grapheme-complexity density for each frequency $F$. Note that the number of strokes in Figure 4 is a real-valued variable $L$, which does not represent observable values, but the inclination of the language system to adopt a value near $L$ when a text is produced.
Fig. 4) Grapheme-complexity versus frequency of Japanese Kanji (disturbed data). A uniformly distributed random variable was added to each data point of Figure 3. Figure 4 represents the inclination of the language system to adopt different combinations of frequency and complexity.
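The disturbance step behind Figure 4 can be sketched as follows. This is an illustration only: the (frequency, strokes) pairs are invented, the function name is hypothetical, and I assume that the uniform noise is added independently to both coordinates, which the text does not state explicitly. Python's `random.random()` draws from the half-open interval [0, 1), which differs from $U[0,1]$ only on a set of probability zero.

```python
import random

def jitter(points, seed=0):
    """Add independent uniform noise from [0, 1) to both coordinates.

    `points` is a list of integer (frequency, strokes) pairs, i.e.
    text-level data; the result is real valued.
    """
    rng = random.Random(seed)
    return [(f + rng.random(), s + rng.random()) for f, s in points]

# Hypothetical text-level data, NOT the Kanji data of Figure 3.
EXAMPLE_POINTS = [(1, 12), (1, 15), (2, 9), (2, 10), (2, 11), (5, 4)]

if __name__ == "__main__":
    for original, disturbed in zip(EXAMPLE_POINTS, jitter(EXAMPLE_POINTS)):
        print(original, "->", tuple(round(v, 3) for v in disturbed))
```

After the disturbance, points that coincided on the integer grid are spread over a unit square, so the local density of points becomes visible.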
In Figure 5 the mean number of strokes is calculated for each frequency. The displayed numbers do not represent quantities on the text-level because mathematical operations such as summation and division have been involved in their calculation. For each frequency, a point in Figure 5 is an estimate of the expected value of the respective marginal grapheme-complexity density.
Fig. 5) Grapheme complexity versus frequency of Japanese Kanji (mean values). For each frequency the average number of strokes is presented on the vertical axis.
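The calculation behind Figure 5 amounts to grouping the graphemes by frequency and averaging their stroke counts. A minimal sketch, with invented data and a hypothetical function name, shows how integer inputs lead to non-integer estimates:

```python
from collections import defaultdict

def mean_strokes_by_frequency(points):
    """Average number of strokes for each observed frequency.

    `points` is a list of integer (frequency, strokes) pairs.
    """
    groups = defaultdict(list)
    for frequency, strokes in points:
        groups[frequency].append(strokes)
    return {f: sum(v) / len(v) for f, v in sorted(groups.items())}

if __name__ == "__main__":
    # Hypothetical text-level data, NOT the Kanji data of Figure 3.
    data = [(1, 12), (1, 15), (2, 9), (2, 10), (2, 11), (5, 4)]
    print(mean_strokes_by_frequency(data))
```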
' = a _ _ = r
'
(6)
Note that the two variables in this equation do not represent frequency and length themselves. They symbolize the expected values of the probability distributions of length and frequency. (In the case of frequency the situation is somewhat more complicated, because frequency is dependent on text size. The frequency variable therefore denotes the intensity of a non-stationary Poisson process; cf. Leopold 1998b: 48.) As long as one is dealing with systems of ordinary differential equations in two dimensions, no fractal attractor can occur, for in the plane the range of attractors for continuous systems is rather limited: the only attractors are isolated points or closed loops (Falconer 1990: 184). But when three-dimensional systems of differential equations are considered, it is possible that they exhibit chaotic dynamics and thus converge to strange attractors of fractal dimension. So one could consider frequency, length and polysemy, and extend equation (6) to three dimensions. A famous example of a three-dimensional system of ordinary differential equations is the Lorenz system, which arises in hydrodynamic problems and is defined by
$$\dot{x} = \sigma (y - x), \qquad \dot{y} = r x - y - x z, \qquad \dot{z} = x y - b z \tag{7}$$
At present, each continuous dynamic system must be studied individually since there is little general theory available. Attractors of continuous systems are well suited to computer study, and mathematicians are frequently challenged to explain 'strange' attractors that are observed on computer screens. (Falconer 1990: 188)
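To give an impression of what such a computer study looks like, the system (7) can be integrated numerically. The sketch below rests on assumptions that are mine and not part of the argument: the parameter values $\sigma = 10$, $r = 28$, $b = 8/3$ are the classical textbook choice, the integrator is a plain fourth-order Runge-Kutta scheme with fixed step size, and all function names are hypothetical. Two trajectories that start $10^{-9}$ apart end up on different parts of the attractor, which is the sensitive dependence typical of chaotic dynamics.

```python
import math

# Classical parameter values (an assumption; the text does not fix them).
SIGMA, R, B = 10.0, 28.0, 8.0 / 3.0

def lorenz(state, sigma=SIGMA, r=R, b=B):
    """Right-hand side of equation (7)."""
    x, y, z = state
    return (sigma * (y - x), r * x - y - x * z, x * y - b * z)

def rk4_step(state, dt):
    """One step of the classical fourth-order Runge-Kutta scheme."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz(state)
    k2 = lorenz(shifted(state, k1, dt / 2))
    k3 = lorenz(shifted(state, k2, dt / 2))
    k4 = lorenz(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * p + 2 * q + d)
                 for s, a, p, q, d in zip(state, k1, k2, k3, k4))

def trajectory(start, steps, dt=0.01):
    """List of states visited from `start`, including the start itself."""
    states = [start]
    for _ in range(steps):
        states.append(rk4_step(states[-1], dt))
    return states

def max_separation(traj_a, traj_b):
    """Largest Euclidean distance between simultaneous states."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))

if __name__ == "__main__":
    a = trajectory((1.0, 1.0, 1.0), 4000)
    b = trajectory((1.0, 1.0, 1.0 + 1e-9), 4000)
    print("final states:", a[-1], b[-1])
    print("largest separation:", max_separation(a, b))
```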
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} A_1 x_2^{b_1} \\ A_2 x_3^{b_2} \\ A_3 x_4^{b_3} \end{pmatrix} \tag{8}$$
Hřebíček emphasizes the self-similar structure of equation (8). He points out: "In this formula, for example, m = 1 corresponds to phonemes, m = 2 to morphemes, m = 3 to words, and m = 4 to sentences. The fractal character of the sets of constructs and constituents is evident from the shape of the formula which is a formation similar to Japanese puppets." (Hřebíček 1997: 104) Hřebíček compares the relation between the different levels of analysis in the Menzerath-Altmann law with the generator of the Cantor dust (see for example Hřebíček 1995: 107). Therefore the existence of fractal structures in language seems to be an obvious consequence of the Menzerath-Altmann law, but it is difficult to grasp the hypothesis exactly. So one has to answer two questions: what is the imbedding space the fractals are defined on, and what kind of metric or topology is defined on this space?
The problem we have to face in order to derive a fractal dimension from the Menzerath-Altmann law is neither the absence of a (physical) dimension nor the fact that measurements in quantitative linguistics usually arise from counting procedures (as Köhler 1997 argued), because one could consider discrete numbers as realizations of random variables and estimate the (real-valued) parameters of their distribution as described above (see Figure 1). What is needed to derive a fractal dimension from the Menzerath-Altmann law is a continuous scale of levels of analysis. So we should be able to proceed continuously [!] from the level of sounds to the level of syllables and further on to the levels of morphs, words, clauses, sentences and supra-sentence structures. Furthermore, if $\delta$ denotes the level of analysis in this continuous lattice, then the limiting level of analysis for $\delta \to 0$ has to be defined.
From a linguistic perspective this is of course a rather strange idea, but it seems to me that this is not too far from Hřebíček's vision when he wrote: "In our empirical argumentation two neighbouring levels are characterized by parameters with values which are only their estimates; they change when a new level is inserted between the two former neighbours. The scheme of linguistic levels in an arbitrary form is nothing but a classification of language units. Any classification represents a relationship between the classifier and its knowledge about linguistic units and their relations." (Hřebíček 1995: 111)
If the above conditions on the lattice of levels of analysis are fulfilled, the definition of the Hausdorff dimension can be adapted to Hřebíček's idea of fractal structures in texts. Note that in the following presentation the entities become smaller when $\delta$ decreases, in contrast to Hřebíček's notation in equation (8), where the largest unit is denoted by $x_1$ and the smaller ones by $x_2$, $x_3$, and so on. Let $T$ be a text. We want to calculate, or merely define, the fractal dimension of $T$. Let $S$ be the imbedding space of $T$, i.e. $T \subset S$. Clearly $S$ cannot be the Euclidean space $\mathbb{R}^n$. So let us define $S$ as the set of all possible texts in a given environment. Each element of $S$ consists of a stream of pre-theoretical physical events $h(t)$ which are produced at a (physical) instant of time $t$. So $h(t)$ adopts values in a space $X$ of physical events. Let $[0, t_{\max}]$ be the time-span of text production. Then $S$ can be written as the product space $S = X^{[0, t_{\max}]}$. If we assume $X$ to be a metric space, then it is not difficult to define a metric on $S$. But the usual definitions of distances on spaces like $S$ are not useful for our purposes. Therefore I skip the first step (equation (1)) in the definition of the Hausdorff dimension and replace it directly by a definition of what is meant by a $\delta$-cover $\{U_i\}_{i=1,2,\ldots}$ of a text: Let $\{E_i\}_{i=1,2,\ldots}$ be the collection of all entities at the $\delta$-level of analysis which can be found in the text. I call these entities $\delta$-level-entities for short. The $i$-th $\delta$-level-entity $E_i$ begins at the (physical) instant of time $t^l_i$ and ends at $t^u_i$. The $\delta$-level-set $U_i$ corresponding to the $\delta$-level-entity $E_i$ is defined by
$$U_i = \{\, h(t) : t^u_{i-1} \le t \le t^l_{i+1} \,\} \tag{9}$$
This means the $i$-th $\delta$-level-set $U_i$ begins at the end of the previous $\delta$-level-entity $E_{i-1}$ and ends at the beginning of the next $\delta$-level-entity $E_{i+1}$. So $U_i$ contains $E_i$ as a subset, and the collection $\{U_i\}_{i=1,2,\ldots}$ covers the whole text. The diameter of a $\delta$-level-set corresponding to a $\delta$-level-entity is $\delta$. Finally we define a $\delta$-cover $\{U_i\}_{i=1,2,\ldots}$ as the collection of $\delta$-level-sets corresponding to the $\delta$-level-entities which can be found in the text. Now the definition of the Hausdorff dimension can be applied in a straightforward manner. For any level of analysis $\delta$ we define
$$H^s_\delta(T) = \inf\Big\{ \sum_{i=1}^{\infty} |U_i|^s : \{U_i\} \text{ is a } \delta\text{-cover of } T \Big\} \tag{10}$$
This can be reduced to
$$H^s_\delta(T) = n_\delta \, \delta^s \tag{11}$$
where $n_\delta$ denotes the number of $\delta$-level-entities in the text $T$. The Hausdorff measure of a text is therefore
$$H^s(T) = \lim_{\delta \to 0} n_\delta \, \delta^s. \tag{12}$$
Finally the Hausdorff dimension of a text is that number $D > 0$ which ensures that
$$n_\delta \, \delta^s \to \begin{cases} \infty & \text{if } s < D \\ 0 & \text{if } s > D \end{cases} \tag{13}$$
as $\delta \to 0$. One can say more simply: the Hausdorff dimension of a text is that positive real number $D$ for which the number of $\delta$-level-entities increases at exactly the same rate as $\delta^D$ decreases when $\delta$ approaches zero. Now the notion of a fractal dimension of a text is mathematically well-defined. The only question left is: how do we quantify the step from one level of analysis (say clauses) to another (say words)? Does it correspond to a multiplication by 0.5, or 0.1, or even 0.01? Every choice of this factor is inevitably arbitrary, but it has a considerable effect on the result of the calculation of the fractal dimension.
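The effect of this choice can be made explicit with a small numerical sketch. Everything in it is an assumption made for illustration: the entity counts are invented, the levels are taken to be equally spaced on a logarithmic scale ($\delta_k = q^k$ for a step factor $q$), and the dimension is estimated as the least-squares slope of $\log n_\delta$ against $\log(1/\delta)$, which follows from $n_\delta \sim \delta^{-D}$. The function name is hypothetical.

```python
import math

def text_dimension(counts, factor):
    """Estimate D from entity counts on successive levels of analysis.

    counts[k] is the number of entities on level k, and level k is
    assigned the scale delta_k = factor**k with 0 < factor < 1.
    Returns the least-squares slope of log n against log(1/delta).
    """
    xs = [k * math.log(1.0 / factor) for k in range(len(counts))]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

if __name__ == "__main__":
    # Invented counts, e.g. sentences, clauses, words, syllables.
    counts = [40, 160, 640, 2560]
    for factor in (0.5, 0.25, 0.1, 0.01):
        print(factor, round(text_dimension(counts, factor), 4))
```

With these invented counts each step multiplies the number of entities by four, so the estimate equals $\log 4 / \log(1/q)$: the same text obtains dimension 2 for $q = 0.5$ and dimension 1 for $q = 0.25$. The data fix only the product $D \cdot \log(1/q)$, not $D$ itself.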
REFERENCES
Falconer, Kenneth (1990): Fractal Geometry; Wiley & Sons: Chichester et al.
Feder, Jens (1988): Fractals; Plenum: New York, London.
Hřebíček, Luděk (1992): Text in Communication: Supra-Sentence Structures; Brockmeyer: Bochum.
Hřebíček, Luděk (1995): Text Levels: Language Constructs, Constituents and the Menzerath-Altmann Law; (QL 56); wvt: Trier.
Hřebíček, Luděk (1996): Word Associations and Text; in: Glottometrika 15; pp. 96–101.
Hřebíček, Luděk (1997): Persistence and Other Aspects of Sentence-Length Series; in: Journal of Quantitative Linguistics 4, No. 1–2, pp. 103–109.
Köhler, Reinhard (1997): Are there Fractal Structures in Language? Units of Measurement and Dimensions in Linguistics; in: Journal of Quantitative Linguistics 4, No. 1–2, pp. 122–125.
Leopold, Edda (1998a): Frequency Spectra within Word Length Classes; in: Journal of Quantitative Linguistics 5, No. 3, pp. 224–231.
Leopold, Edda (1998b): Stochastische Modellierung lexikalischer Evolutionsprozesse; Dr. Kovač: Hamburg.
Maki, Daniel P. & Thompson, Maynard (1973): Mathematical Models and Applications; Prentice Hall: Englewood Cliffs (N.J.).