“How Old Do You Think I Am?”: A Study of Language and Age in Twitter

Dong Nguyen¹, Rilana Gravel², Dolf Trieschnigg¹, Theo Meder²

¹ University of Twente, Enschede, The Netherlands
² Meertens Institute, Amsterdam, The Netherlands
{d.nguyen,d.trieschnigg}@utwente.nl, {gravel.rilana,theo.meder}@meertens.knaw.nl

Abstract

In this paper we focus on the connection between age and language use, exploring age prediction of Twitter users based on their tweets. We discuss the construction of a fine-grained annotation effort to assign ages and life stages to Twitter users. Using this dataset, we explore age prediction in three different ways: classifying users into age categories, by life stages, and predicting their exact age. We find that an automatic system achieves better performance than humans on these tasks and that both humans and the automatic systems have difficulties predicting the age of older people. Moreover, we present a detailed analysis of variables that change with age. We find strong patterns of change, and that most changes occur at young ages.

Introduction

A person’s language use reveals much about their social identity. A person’s social identity is based on the groups he or she belongs to, including groups based on age, gender and political affiliation. Earlier research in sociolinguistics regarded male and female, and age as biological variables. Examples for this are Labov (1966) and Trudgill (1974). However, current research views them primarily as social variables. Concepts such as gender and age are shaped differently depending on an individual’s experiences and personality, and the society and culture a person is part of (Eckert 1997; Holmes and Meyerhoff 2003). To complicate things even more, the two variables gender and age are intertwined: studying one of the variables implies studying the other one, as well. For example, the appropriate age for cultural events often differs for males and females (Eckert 1997). Besides linguistic variation based on the groups a person belongs to, there is also variation within a single speaker as people adapt their language to their audience (Bell 1984). Thus it follows that speakers can choose to show gender and age identity more or less explicitly in language use, depending on people’s perception of these variables, on their culture, the recipient of their utterance, etc. From a sociolinguistic perspective, language is a resource which can be drawn on to study different aspects of a person’s social identity at different points in an interaction (Holmes and Meyerhoff 2003).

Early sociolinguistic studies only had access to relatively small datasets (e.g. a couple of hundred persons), due to time and practical constraints on the collection of data. With the rise of social media such as Twitter, new resources have emerged that can complement these analyses. Compared to previously used resources in sociolinguistic studies like face-to-face conversations, Twitter is interesting in that it collapses multiple audiences into a single context: tweets can be targeted to a person, a group, or to the general public (Marwick and Boyd 2011). Twitter offers the opportunity to gather large amounts of informal language from many individuals. However, the Twitter population might be biased and only little is known about the studied persons. To overcome this, we carried out a large annotation effort to annotate the gender and age of Twitter users. While gender is one of the most studied variables, the relation between age and language has only recently become a topic of interest.

In this paper we present work on automatically predicting people’s age. This can offer new insights into the relation between language use and age. Such a system could also be used to improve targeting of advertisements and to support fine-grained analyses of trends on the web. So far, age prediction has primarily been approached by classifying persons into age categories. We revisit this approach being the first to approach age prediction from three different angles: classifying users into age categories (20-, 20-40, 40+), predicting their exact age, and classifying users by their life stage (secondary school student, college student, employee). We compare the performance of an automatic system with that of humans on these tasks. Next, to allow a more fine-grained analysis, we use the exact ages of Twitter users and analyze how language use changes with age.

Specifically, we make the following contributions: 1) We present a characterization of Dutch Twitter users as a result of a fine-grained annotation effort; 2) we explore different ways of approaching age prediction (age categories, life stages and exact age); 3) we find that an automatic system has better performance than humans on the task of inferring age from tweets; 4) we analyze variables that change with age, and find that most changes occur at younger ages.

We start with discussing related work and our dataset. Next, we discuss our experiments on age prediction. We then continue with a more fine-grained analysis of variables that change with age. We conclude with a summary.

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Related Work

Eckert (1997) distinguishes between chronological age (number of years since birth), biological age (physical maturity) and social age (based on life events). Studies about language and age usually consider chronological age and apply an etic approach, grouping speakers based on age spans (e.g. (Labov 1966; Trudgill 1974; Barbieri 2008)). But speakers can have a very different position in society than their chronological age indicates. Therefore, it might be reasonable to apply an emic approach, grouping speakers according to shared experiences of time, such as school as a shared experience for teenagers (Eckert 1997).

So far, automatic age prediction has mostly been approached as a two-class or three-class classification problem based on age spans with for example boundaries at 30 or 40 years (e.g. (Rao et al. 2010; Garera and Yarowsky 2009; Goswami, Sarkar, and Rustagi 2009)), thus corresponding to an etic approach. However, as choosing boundaries still remains problematic, several researchers have looked more closely into this issue. For example, Rosenthal and McKeown (2011) experimented with varying the binary split for creating age categories. In contrast, Nguyen, Smith, and Rosé (2011) approached age prediction as a regression problem, eliminating the need to create age categories. In our work, we will experiment with age prediction as a regression problem, as a classification problem based on age categories and explore an emic approach, by classifying persons according to their life stages.

Both content features and stylistic features (such as part-of-speech and the amount of slang words) have been found to be useful for predicting the age of users (Nguyen, Smith, and Rosé 2011; Argamon et al. 2007; Goswami, Sarkar, and Rustagi 2009). Pennebaker and Stone (2003) found that as people get older, they tend to use more positive and fewer negative words, focus more on the future and less on the past and make fewer self-references. Not much research has been done yet on investigating the relationship between gender and age from a computational perspective. Argamon et al. (2007) found that certain linguistic features that increase with age, also increase more with males. Nguyen, Smith, and Rosé (2011) incorporated gender using a binary variable, only allowing a simple interaction between gender and age. Many others have ignored the effect of gender when predicting the age of users.

Experiments on automatic classification of users according to latent attributes such as gender and age have been done on a wide range of resources, including telephone conversations (Garera and Yarowsky 2009), blogs (Sarawgi, Gajulapalli, and Choi 2011), forum posts (Nguyen, Smith, and Rosé 2011) and scientific articles (Bergsma, Post, and Yarowsky 2012; Sarawgi, Gajulapalli, and Choi 2011). Recently, Twitter has started to attract interest by researchers as a resource to study automatic identification of user attributes, such as ethnicity (Pennacchiotti and Popescu 2011; Rao et al. 2011), gender (Fink, Kopecky, and Morawski 2012; Bamman, Eisenstein, and Schnoebelen 2012; Rao et al. 2010; Burger et al. 2011; Rao et al. 2011), geographical location (Eisenstein et al. 2010) and age (Rao et al. 2010).

Corpus Collection

In this section we describe a large annotation effort we carried out to annotate Dutch Twitter users. Based on the results we present a characterization of Dutch Twitter users.

Selecting and Crawling Users

Twitter users can indicate information such as their name, location, website and short biography in their profile. However, gender and age are not explicit fields in Twitter profiles. As a result, other researchers working on identification of such attributes have resorted to a variety of approaches to construct a corpus, ranging from focused crawling to using lists with common names.

For example, Rao et al. (2010) constructed a corpus by focused crawling. To collect users they used a crawl with seeds by looking for profiles that had ‘baby boomers’, ‘junior’, ‘freshman’ etc. in their description. However, this leads to a potential bias by starting with users that explicitly indicate their age identity in their profile. Burger et al. (2011) sampled users from the Twitter stream and used links to blogging sites, indicated in their profile, to find the gender. Therefore, their set of users was restricted to users having blogs and willing to link them using Twitter. Some approaches used lists of male and female names, for example obtained using Facebook (Fink, Kopecky, and Morawski 2012) or from the US social security department (Zamal, Liu, and Ruths 2012; Bamman, Eisenstein, and Schnoebelen 2012).

Our goal was to select a set of users as randomly as possible, and not biasing user selection by searching on well-known stereotypical behavior or relying on links to explicit sources. This did create the need for a large annotation effort, and resulted in a smaller user sample. Using the Twitter API we collected tweets that contained the word ‘het’, which can be used as a definite article or pronoun in Dutch. This allowed us to restrict our tweets to Dutch as much as possible, and limit the risk of biasing the collection somehow. During a one-week period in August 2012 we sampled users according to this method. Of these users, we randomly selected a set for annotation. We then collected all followers and followees of these users and randomly selected additional users from this set. We only included accounts with less than 5000 followers, to limit the inclusion of celebrities and organizations. For all users, we initially downloaded their last 1000 tweets. Then new tweets from these users were collected from September to December 2012.

                     Het            Followe(e/r)s
Annotated            1842 (76%)     1343 (43%)
Not enough tweets    15 (0.6%)      129 (4%)
Not a person         221 (9%)       441 (14%)
Not public           264 (11%)      719 (23%)
Not Dutch            51 (2%)        468 (15%)
Other                46 (2%)        17 (0.5%)
Total                2439           3117

Table 1: Reasons why accounts were discarded/kept by sampling method.
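The percentages in Table 1 follow directly from the counts. As a quick sanity check, the sketch below recomputes the column totals and shares. The counts are copied from Table 1; the dictionary layout and the function name `shares` are our own illustrative choices, not part of the original study.

```python
# Counts copied from Table 1. The data layout is an illustrative assumption.
TABLE_1 = {
    "het": {
        "Annotated": 1842, "Not enough tweets": 15, "Not a person": 221,
        "Not public": 264, "Not Dutch": 51, "Other": 46,
    },
    "followers/followees": {
        "Annotated": 1343, "Not enough tweets": 129, "Not a person": 441,
        "Not public": 719, "Not Dutch": 468, "Other": 17,
    },
}


def shares(counts):
    """Return the column total and each reason's share in percent."""
    total = sum(counts.values())
    return total, {reason: 100.0 * n / total for reason, n in counts.items()}


for method, counts in TABLE_1.items():
    total, pct = shares(counts)
    print(method, total, {reason: round(p, 1) for reason, p in pct.items()})
```

The two "Annotated" counts add up to the number of annotated users used in the rest of the paper.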
Dutch Twitter Users

In this section we analyze the effect of our sampling procedure, and present a characterization of Dutch Twitter users in our corpus. We employed two students to perform the annotations. Annotations were done by analyzing a user’s profile, tweets, and additional external resources (like Facebook or LinkedIn) if available. In this paper, we only focus on the annotations that are relevant to this study.

Effect of Sampling Method

The annotators were instructed to only annotate the users that met the following requirements:
• The account should be publicly accessible.
• The account should represent an actual person (e.g. not an organization).
• The account should have ‘sufficient’ tweets (at least 10).
• The account should have Dutch tweets (note that this does not eliminate multilingual accounts).

We separated the reasons why accounts were discarded by the two sampling methods (het and followers/followees) that were used (the first requirement in the list that was not satisfied was marked). The results are reported in Table 1. We observe that the proportion of actual annotated users is much higher for the users obtained using the query ‘het’. The users obtained by sampling from the followers and followees included more non-Dutch accounts, as well as accounts that did not represent persons. In addition, there was also a group of people who had protected their account between the time of sampling and the time of annotation. In total, 3185 users were annotated.

Gender

The biological gender was annotated for 3166 persons (for some accounts, the annotators could not identify the gender). The gender ratio was almost equal, with 49.5% of the persons being female. However, as we will see later, the ratio depends on age. The annotation of the gender was mostly determined based on the profile photo or a person’s name, but sometimes also their tweets or profile description.

Mislove et al. (2011) analyzed the US Twitter population using data from 2006-2009. Using popular female and male names they were able to estimate the gender of 64% of the people, finding a highly biased gender ratio with 72% being male. A more recent study by Beevolve.com however found that 53% were women, based on information such as name and profile.

Age

Because we expected most Twitter users to be young, the following three categories were used: 20-, 20-40, 40+. The age category was annotated for 3110 accounts. The results separated by gender are shown in Table 2¹. There are more females in the young age group, while there are more men in the older age groups. The same observation was made in statistics reported by Beevolve.com.

¹ Note that this table only takes persons into account for whom both age and gender were annotated.

       20-     20-40   40+
M      796     488     265
F      1078    316     157

Table 2: Age and gender

Figure 1: Plot of frequencies per age (x-axis: age; y-axis: frequency)

We also asked our annotators to annotate the exact age. Sometimes it was possible to get an almost exact estimate, for example by using LinkedIn profiles, exact age mentions in the profile, tweets, or mentioning which grade the person was in. However, since this was not always the case, annotators also indicated a margin (0, 2, 5 or 10 years) of how sure they were. Figure 1 shows a graph with the frequencies per year of age. Table 3 reports the frequencies of the indicated margins. In our data, we find that the margin for young users is low, and that for older users the margin is much higher.

As discussed earlier in this paper, it may be more natural to distinguish users according to their life stage instead of a fixed age category. Life stages can be approached from different dimensions. In this paper, we use life stages based on the occupation of people, by distinguishing between students, employed, retired etc. The results are displayed in Table 4. Unfortunately, the decision to annotate this was made while the annotation process was already underway; therefore the accounts of some users were not available anymore (either removed or protected).

We find that the most common life stages are associated with clear age boundaries, although the boundaries are not the same as for the age categories. We find the following age spans in which 90% of the persons fall: secondary school students (12-16 yrs), college students (16-24 yrs), employees (24-52 yrs). However, note that with the life stage approach, people may be assigned to a different group than the group that most resembles their age, if this group matches their life stage better. We have plotted the overlap between life stage and age categories in Figure 2.

Age estimation margin    Frequency
0                        703
2                        1292
5                        918
10                       173

Table 3: Frequencies of margins for the exact age annotation
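The annotated margins are reused later in the paper to score exact-age predictions: a prediction counts as correct if it falls within the margin specified by the annotators. A minimal sketch of that rule is given below, together with the mean absolute error. It assumes that "within the margin" means an absolute difference no larger than the margin (the text does not state how the boundary is handled), and the function names are hypothetical.

```python
def within_margin(predicted_age, annotated_age, margin):
    """True if the prediction lies within the annotator's margin (inclusive, by assumption)."""
    return abs(predicted_age - annotated_age) <= margin


def margin_accuracy(predictions, annotations):
    """Fraction of predictions within the margin.

    `annotations` is a list of (annotated_age, margin) pairs, where the margin
    is one of 0, 2, 5 or 10 years as in Table 3.
    """
    hits = sum(
        within_margin(pred, age, margin)
        for pred, (age, margin) in zip(predictions, annotations)
    )
    return hits / len(annotations)


def mean_absolute_error(predictions, annotations):
    """Mean absolute difference between predicted and annotated ages."""
    errors = [abs(pred - age) for pred, (age, _) in zip(predictions, annotations)]
    return sum(errors) / len(errors)


# Made-up toy values, for illustration only.
toy_predictions = [15.0, 30.0, 52.0]
toy_annotations = [(15, 0), (33, 2), (45, 10)]
print(margin_accuracy(toy_predictions, toy_annotations))
print(mean_absolute_error(toy_predictions, toy_annotations))
```

Because older users tend to carry wider margins, this accuracy is more forgiving for them than the mean absolute error is.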
Life Stage                  Frequency
Secondary school student    1352
College student             316
Employee                    1021
Retired                     5
Other                       15
Unknown                     132
Not accessible              344

Table 4: Life stage frequencies

Figure 2: Overlap life stage and age categories (x-axis: age category 20-, 20-40, 40+; y-axis: number of accounts; series: school students, college students, employees)

Inter-annotator Agreement

We employed two students to perform the annotations. 84 accounts were annotated by both. Inter-annotator agreement was measured using Cohen’s kappa. Generally, a value above 0.7 is considered acceptable. We found the following kappa values: gender (1.0), age category (0.83) and life stage (0.70). For the actual age, the mean absolute difference was 1.59 years.

Age Prediction

Goal

In this section we compare the different ways of approaching age, by testing how feasible age prediction is using simple features based only on the text of tweets. We will automatically predict the following:
• Age category: 20-, 20-40, 40+
• Age: continuous variable
• Life stage: secondary school student², college student, employee

For the life stage, we only use categories for which we had a sufficient number of persons. Note that classifying age according to age category and life stage are multiclass classification problems, while treating age as a continuous variable results in a regression problem. In addition, we compare our systems with the performance of humans on this task.

² In Dutch this is translated to scholier, which includes all students up to and including high school; there is no direct translation in English.

Evaluation

We will evaluate the performance of our classification methods (to predict the age category and life stage) using the F1 measure. We will report both the macro and micro averages. The regression problem (predicting age as a continuous variable) will be evaluated using Pearson’s correlation coefficient, mean absolute error (MAE) and accuracy, where a prediction was counted as correct if it fell within the margin as specified by the annotators.

Dataset

We restricted our dataset to users who had at least 20 tweets and for whom the gender, age category and exact age were annotated. For each user we sampled up to 200 tweets. We divided the dataset into a train and test set. Each set contains an equal number of males and females, and the same age distribution (according to the annotated age categories) across gender categories. This limits the risk of the model learning features that, for example, are more associated with a particular gender because that gender occurs more often in a particular age category. Parameter tuning and development of the features were done using cross-validation on the training set. The statistics are presented in Table 5.

         Train           Test
         M      F        M      F
20-      602    602      186    186
20-40    231    231      73     73
40+      118    118      37     37
Total    1902            592

Table 5: Dataset statistics

Learning Algorithm

We use linear models, specifically logistic and linear regression, for our tasks. Given an input vector $x \in \mathbb{R}^m$, $x_1, \ldots, x_m$ represent features (also called independent variables or predictors). In the case of classification with two classes, e.g. $y \in \{-1, 1\}$, the model estimates a conditional distribution $P(y \mid x, \beta) = 1/(1 + \exp(-y(\beta_0 + x^\top \beta)))$, where $\beta_0$ and $\beta$ are the parameters to estimate. We use a one versus all method to handle multiclass classification. In the case of regression, we find a prediction $\hat{y} \in \mathbb{R}$ for the exact age of a person $y \in \mathbb{R}$ using a linear regression model: $\hat{y} = \beta_0 + x^\top \beta$. In order to prevent overfitting we use Ridge (also called $L_2$) regularization. We make use of the liblinear (Fan et al. 2008) and scikit-learn (Pedregosa et al. 2011) libraries.

Preprocessing & Features

Tokenization is done using the tool by (O’Connor, Krieger, and Ahn 2010). All user mentions (e.g. @user) are replaced by a common token. Because preliminary experiments showed that a unigram system already performs very well, we only use unigrams to keep the approach simple. We keep words that occur at least 10 times in the training documents. In the next section, we will look at more informed features and how they change as people are older.
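The feature extraction and the two linear models described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: a whitespace tokenizer stands in for the Twitter tokenizer they cite, "at least 10 times" is read as total occurrences across the training documents, raw counts are used as feature values, and parameter fitting with Ridge regularization (done in the paper with liblinear and scikit-learn) is left out. All function names and the `@USER` placeholder are hypothetical.

```python
import math
from collections import Counter

USER_TOKEN = "@USER"  # assumed placeholder for the "common token" for user mentions


def tokenize(text):
    """Whitespace tokenizer (a stand-in); user mentions become one common token."""
    return [USER_TOKEN if tok.startswith("@") else tok for tok in text.split()]


def build_vocabulary(training_docs, min_count=10):
    """Keep unigrams that occur at least `min_count` times in the training documents."""
    counts = Counter(tok for doc in training_docs for tok in tokenize(doc))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: index for index, tok in enumerate(kept)}


def featurize(doc, vocab):
    """Unigram count vector x in R^m, with m the vocabulary size."""
    x = [0.0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            x[vocab[tok]] += 1.0
    return x


def linear_score(x, beta0, beta):
    """beta_0 + x^T beta."""
    return beta0 + sum(xi * bi for xi, bi in zip(x, beta))


def class_probability(y, x, beta0, beta):
    """P(y | x, beta) for y in {-1, 1}, as in the logistic model above."""
    return 1.0 / (1.0 + math.exp(-y * linear_score(x, beta0, beta)))


def predict_age(x, beta0, beta):
    """Linear regression prediction y_hat = beta_0 + x^T beta."""
    return linear_score(x, beta0, beta)
```

For the multiclass tasks, a one versus all scheme would train one such binary model per class and pick the class whose model gives the highest score.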
Results

In this section we present the results of the three age prediction tasks. The results can be found in Tables 6 and 7. We find that a simple system using only unigram features can already achieve high performance, with micro F1 scores of above 0.86 for the classification approaches and a MAE of less than 4 years for the regression approach. We also experimented with applying a log transformation of the exact age for the regression task. The predicted values were converted back when calculating the metrics. We find that the MAE and accuracy both improve. In the rest of this section, when referring to the regression run, we refer to the standard run without a log transformation.

Run               F1 macro    F1 micro
Age categories    0.7670      0.8632
Life stages       0.6785      0.8628

Table 6: Results classification

Run                     ρ         MAE       Accuracy
Age regression          0.8845    3.8812    0.4730
Age regression - log    0.8733    3.6172    0.5709

Table 7: Results age regression

A scatterplot of the actual age versus the predicted age can be found in Figure 3. Figure 4 shows the errors per actual age. We find that starting from older ages (around 40-50) the system almost always underpredicts the age. This could have several reasons. It may be that the language changes less as people get older (we show evidence for this in the next section); another plausible reason is that we have very little training data in the older age ranges.

Figure 3: Scatterplot age (x-axis: actual age; y-axis: predicted age)

Figure 4: Scatterplot absolute error (x-axis: actual age; y-axis: error)

The most important features for old and young persons are presented in Tables 8 and 9. We find both content features and stylistic features to be important. For example, content words like school, son, and daughter already reveal much about a person’s age. Younger persons talk more about themselves (I), and use more chat language such as haha, xd, while older people use more conventional words indicating support or wishing well (e.g. wish, enjoy, thanks, take care).

Dutch        English        Weight
school       school         -0.081
ik           I              -0.073
:)           :)             -0.071
werkgroep    work group     -0.069
stages       internships    -0.069
oke          okay           -0.067
xd           xd             -0.066
ben          am             -0.066
haha         haha           -0.064
als          if             -0.064

Table 8: Top features for younger people (regression)

Dutch          English         Weight
verdomd        damn            0.119
dochter        daughter        0.112
wens           wish            0.112
zoon           son             0.111
mooie          beautiful       0.111
geniet         enjoy           0.110
dank           thanks          0.108
goedemorgen    good morning    0.107
evalueren      evaluate        0.105
sterkte        take care       0.102

Table 9: Top features for older people (regression)

For the age categories we redid the classification using only persons for whom the life stage was known to allow better comparison between the two classification tasks. We found that people in the 40+ class are often misclassified as belonging to the 20-40 class, and college students are often classified as secondary school students. The precision and recall for the individual classes are listed in Tables 10 and 11. The performances are comparable. The micro average for life stages is slightly better (0.86 vs 0.85), while the macro average is worse (0.68 vs 0.75) as the metric is heavily affected by the poor performance on the college students class. Although life stages are better motivated from a sociolinguistics viewpoint (Eckert 1997), it is not yet clear which classes are the most suitable. In our corpus, almost all persons were either secondary school students or employees. If a more fine-grained distinction is necessary (for example for personalization), it is still a question which categories should be used.

           Precision    Recall
20-        0.9297       0.9775
20-40      0.6739       0.7561
40+        0.8158       0.4493

Table 10: Results per class: Age categories

                       Precision    Recall
Sec. school student    0.8758       0.9853
College student        0.6667       0.1250
Employee               0.8541       0.8977

Table 11: Results per class: Life stages
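To make the role of the macro average concrete, the sketch below computes per-class F1 scores from the precision and recall values in Table 10 and averages them without weighting. It assumes that the macro F1 is the unweighted mean of the per-class F1 scores; the paper does not spell out its definition, but under this assumption the Table 10 values reproduce the 0.75 quoted in the text for age categories. Function and variable names are our own.

```python
# (precision, recall) per class, copied from Table 10.
TABLE_10 = {
    "20-": (0.9297, 0.9775),
    "20-40": (0.6739, 0.7561),
    "40+": (0.8158, 0.4493),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores (assumed definition)."""
    scores = [f1(p, r) for p, r in per_class.values()]
    return sum(scores) / len(scores)


per_class_f1 = {label: f1(p, r) for label, (p, r) in TABLE_10.items()}
for label, score in per_class_f1.items():
    print(label, round(score, 3))
print("macro F1:", round(macro_f1(TABLE_10), 2))
```

Every class counts equally in this average, so a single class with low recall lowers the macro score even if it contains few users. The micro average cannot be recomputed from these tables alone, since it also depends on the number of users per class.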
Treating age prediction as a regression problem eliminates the need to choose boundaries. The main drawback is that annotating the exact age of users requires more effort than annotating the life stage or an age category. However, as mentioned before, our annotators showed that reliable annotations are possible (on average less than 2 years difference).

In summary, we believe that both classifying users according to their life stage and treating age prediction as a regression problem are promising approaches. Both approaches complement each other. Age prediction as a regression problem relies on chronological age, while life stages are built on shared experiences between people. Depending on the practical application, knowing the chronological age or life stage might be more informative. For example, groups based on life stage might be more useful for marketing purposes, while the chronological age might be more informative when targeting medical information.

Effect of Gender

In Table 12 we have separated the performance according to gender. We also experimented with training on data of only one gender, and reported the performance separated by gender. Across the three tasks the performance for females is better than the performance for males. We also find that across the three tasks, the performance for females is better when trained on only females, compared to the performance of males, when trained on only males.

                  Age categories          Regression                      Life stages
Train     Test    Macro F1    Micro F1    ρ         MAE       Accuracy    Macro F1    Micro F1
All       F       0.7778      0.8750      0.9101    3.4220    0.5135      0.7038      0.8765
All       M       0.7563      0.8514      0.8625    4.3405    0.4324      0.6538      0.8500
Male      F       0.6861      0.8277      0.8784    3.9617    0.5135      0.6151      0.8642
Male      M       0.7027      0.8311      0.8431    4.5017    0.4459      0.6116      0.8346
Female    F       0.7281      0.8581      0.8965    3.5586    0.5270      0.6438      0.8560
Female    M       0.6373      0.8041      0.8195    5.2099    0.3682      0.6829      0.8538

Table 12: Effect of gender

One of the explanations could be that females write slightly more than men (average #tokens: 2235 versus 2130), although the differences between the means are small and there is no significant difference in the number of tweets per person (note that we sampled up to 200 tweets per person).

Another explanation can be found in sociolinguistic studies. It has been pointed out that females assert their identity more through language than males (Eckert 1989; Labov 1990). Hence, they might use all kinds of in-group vocabulary more often, thereby marking their affiliation with a certain group. Men’s vocabulary, on the contrary, is more homogeneous across the in-groups (Eckert 2000). Consistent with this, Ling (2005) found that females ‘seem to have a broader register when using SMS’. Due to this, it might be easier to determine the age of women. However, neither Eckert (1989) nor Labov (1990) looked at age specifically, and the studied people were also not comparable (e.g. Eckert (1989) only studied young people, and social media settings have not been explored much yet).

Error Analysis

As reported in the previous section, the correct age was not predicted in all cases. This is of course not surprising. People do not only constitute their identity on the basis of their age, but they combine various variables in order to express themselves. For example, a person is not only a teenager, but also a female, a high school student, a piano player, etc. (Eckert 2008). Depending on what a person wants to express at a particular moment and towards a particular person, certain aspects of his/her identity may be more emphasized, making age prediction even more complicated. To illustrate this, we will discuss two Twitter users for whom the age was incorrectly predicted.

Case study 1

The first person is a 24-year old student, who the system estimated to be a 17-year old secondary school student. The top 10 most frequent words for this user are @USER, RT, •, Ik (I), ≪, G, :D, Hahaha, tmi, and jij (you). The use of special characters like a dot (•) and the much less than sign (≪) is characteristic for younger Twitter users, who separate statements in their tweets employing these characters. I is one of the words most predictive of younger people, as was presented in the feature analysis (see Table 8), and the other words like hahaha, you etc. are also highly associated with younger persons in our corpus. As we can see, this person employs these words with such a high frequency that he can easily be mistaken for a secondary school student under 20. Examples containing salient words are the ones below:

@USER kommmdan nurd
@USER comeonthen nurd [nerd]

Hahaahhahaha kkijk rtl gemist holland in da hood, bigga huiltt ik ga stukkk
Hahaahhahaha [I am] wwatching rtl gemist³ holland in da hood⁴, bigga is cryingg it’s killinggg me

RT @USER: Ook nog eens rennen voor me bus #KutDag • Ik heb weekend :)
RT @USER: Had to run for my bus too #StupidDay • I have weekend :)

³ website where people can watch tv shows online
⁴ Dutch reality show

In addition to the words mentioned above, me (my), and heb (have) appear, which are indicative for younger persons in our corpus, as well.
Next to the fact that this person employs words rather associated with teenagers on Twitter, we can also derive what kind of identity is constituted here. In the tweets, unconventional punctuation, emoticons, ellipsis, in-group vocabulary (nurd), and alphabetical lengthening (stukkk) are used to create an informal, unconventional style particularly addressing an in-group. It can be concluded that this person does not appear to stress his identity as an adult, but finds other aspects of his identity more important to emphasize. These aspects, however, are expressed with features employed most frequently by younger persons in our corpus, resulting in a wrong age prediction for this person.

Case study 2

The second person is a 19-year old student. However, the system predicted him as being a 33-year old employee. The top 10 most frequent words for this user are @USER, CDA, RT, Ik (I), VVD, SGP, PvdA, D66, bij (at) and Groenlinks. It becomes clear that this person tweets about politics a lot, with Dutch political parties (CDA, VVD, SGP, PvdA, D66, Groenlinks) being six out of his ten most frequent words. Tweets that are characteristic for this user and that relate to some of his most salient words are, for example:

@USER Woensdagochtend 15 augustus start het landelijke CDA met haar regiotour op Goeree-Overflakkee i.s.m. @USER.
@USER On Wednesday morning, the 15th of August the national CDA starts with its tour through the region in Goeree-Overflakkee in collaboration with @USER

RT @USER: Vanmiddag met @USER gezellig bij @USER een wijntje gedaan en naar de Emmaüskerk #Middelharnis geweest. Mooie dag zo!
RT @USER: Had fun this afternoon had wine at @USER with @USER and went to the Emmaüschurch #Middelharnis. Beautiful day!

Almost all of his tweets are (like the first example) about politics, so we can assume this user wants to stress his identity as a person interested in politics, or even as a politician on Twitter. Certainly, this is a more common topic for users older than a 19-year old. Proof for this is the fact that words such as ministers, elections, voter etc. are highly ranked features associated with older people in the regression model. In addition, the person uses more prepo-

Manual Prediction

In this section we compare the performance of our systems with the performance of humans on the task of inferring age only from tweets. A group of 17 people (including males and females, old and young, active and non-active Twitter users) estimated the gender, life stage, exact age and age categories for a random subset of the Twitter users in the test set. Each person was assigned a different set of about 20 Twitter users. For each Twitter user, a text file was provided containing the same text as used in our automatic prediction experiments. The participants received no additional information such as the name, profile information etc. They could decide themselves how carefully they would read the text, as long as they could make a serious and informed prediction. On average, it took about 60-90 min to do the task. In total there are 337 users for whom we both have manual and automatic predictions. The results can be found in Tables 13 and 14.

Run               F1 macro    F1 micro
Age categories
  Manual          0.619       0.752
  Automatic       0.751       0.858
Life stages
  Manual          0.658       0.778
  Automatic       0.634       0.853

Table 13: Results classification - manual vs automatic

Run          ρ        MAE      Acc.
Manual       0.784    4.875    0.552
Automatic    0.879    4.073    0.466

Table 14: Results age regression - manual vs automatic

Using McNemar’s Test we find that the automatic system is significantly better in classifying according to age categories ($\chi^2 = 18.01$, df=1, p < 0.01) and life stages ($\chi^2 = 9.76$, df=1, p < 0.01). The automatic system is also significantly better in predicting the exact age when comparing the MAEs (paired t-test, t(336) = 2.79, p < 0.01). In addition, for each metric and task we calculated which fraction of the persons performed equal or better than the automatic system. This ranged from 0.24 (age cat., all metrics) to 0.41 (life stages, micro F1) and 0.47 (life stages, macro F1), to 0.29 (exact age, MAE) and 0.82 (exact age, accuracy).

In addition we find the following. First, humans achieve a
sitions, conventional punctuation, formal abbreviations and better accuracy for the regression task. The accuracy is based
for example mentions wine which is also rather associated on margins as indicated by the annotators. Humans were of-
with older people in our corpus. Moreover, beautiful is one ten closer at the younger ages, where the indicated margins
of the top ten features predictive of older people. Thus, not were also very low and a slightly off prediction would not be
only the main topic of his tweets (politics) is associated counted as correct. Second, humans have trouble predicting
more with older people, but he also represents himself as a the ages of older people as well. The correlation between the
grown-up person in his other tweets by using which what we MAE’s and exact ages are 0.58 for humans and 0.60 for the
perceive as rather conservative vocabulary and punctuation. automatic system. Third, humans are better in classifying
people into life stages than in age categories.
Thus, the discussed cases show that people can emphasize To conclude, we find that an automatic system is capa-
other aspects of identity than age. This can result in a devia- ble of achieving better performance than humans, and being
tion from style and content from their peers, thereby making much faster (on average, taking less than a second compared
the automatic prediction of age more difficult. to 60-90 minutes to predict the age of 20 users).
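To illustrate the kind of paired significance test reported for the manual and automatic classifications, the following is a minimal sketch of McNemar's test on per-user correct/incorrect outcomes. The function name, the toy outcome lists and the use of a continuity correction are assumptions made for illustration; the text does not state whether a correction was applied.

```python
import math

def mcnemar(human_correct, system_correct, continuity=True):
    """McNemar's test for paired correct/incorrect outcomes on the same users.

    Returns (chi2, p) with one degree of freedom.
    """
    pairs = list(zip(human_correct, system_correct))
    # b: human right, system wrong; c: human wrong, system right.
    b = sum(1 for h, s in pairs if h and not s)
    c = sum(1 for h, s in pairs if s and not h)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    diff = abs(b - c)
    if continuity:  # assumption: continuity correction, not confirmed by the text
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with df = 1.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical toy outcomes for ten users (not data from the study).
human = [True, False, False, True, False, False, True, False, False, False]
system = [True, True, True, True, True, False, True, True, True, False]
chi2, p = mcnemar(human, system)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Only the discordant pairs (users on which exactly one of the two predictors is correct) enter the statistic, which is why the test suits comparisons made on the same set of users.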
Variables that change with age
By analyzing the importance of features in an automatic prediction system, only general effects can be seen (i.e. this feature is highly predictive for old versus young). However, to allow for a more detailed analysis, we now use the exact ages of Twitter users to track how variables change with age.

Variables
We explore variables that capture style as well as content.

Style
The following style variables capture stylistic aspects that a person is aware of and explicitly chooses to use:
• Capitalized words, for example HAHA and LOL. The words need to be at least 2 characters long.
• Alphabetical lengthening, for example niiiiiice instead of nice. Matching against dictionaries was found to be too noisy. Therefore, this is implemented as the proportion of words that have a sequence of the same three characters in the word. The words should also contain more than one unique character (e.g. tokens such as www are not included) and contain only letters.
• Intensifiers, which enhance the emotional meaning of words (e.g. in English, words like so, really and awful).

The following variables capture stylistic aspects that a person usually is not aware of:
• LIWC-prepositions, the proportion of prepositions such as for, by and on. The wordlist was obtained from the Dutch LIWC (Zijlstra et al. 2005) and contains 48 words.
• Word length, the average word length. Only tokens starting with a letter are taken into account, so hashtags and user mentions are ignored. Urls are also ignored.
• Tweet length, the average tweet length.

References
Pennebaker and Stone (2003) found that as people get older, they make fewer self-references. We adapt the categories for the Dutch LIWC (Zijlstra et al. 2005) to use on Twitter data by including alphabetical lengthening, slang, and English pronouns (since Dutch people often tweet in English as well).
• I, such as I, me, mine, ik, m'n, ikke.
• You, such as you, u, je, jij.
• We, such as we, our, ons, onszelf, wij.
• Other, such as him, they, hij, haar.

Conversation
• Replies, proportion of tweets that are a reply or mention a user (and are not a retweet).

Sharing
• Retweets, proportion of tweets that are a retweet.
• Links, proportion of tweets that contain a link.
• Hashtags, proportion of tweets that contain a hashtag.

Variable               Females ρ    Males ρ
Style
  Capitalized words    -0.281∗∗     -0.453∗∗
  Alph. lengthening    -0.416∗∗     -0.324∗∗
  Intensifiers         -0.308∗∗     -0.381∗∗
  LIWC-prepositions     0.577∗∗      0.486∗∗
  Word length           0.630∗∗      0.660∗∗
  Tweet length          0.703∗∗      0.706∗∗
References
  I                    -0.518∗∗     -0.481∗∗
  You                  -0.417∗∗     -0.464∗∗
  We                    0.312∗∗      0.266∗∗
  Other                -0.072       -0.148∗∗
Conversation
  Replies               0.304∗∗      0.026
Sharing
  Retweets             -0.101∗      -0.099∗
  Links                 0.428∗∗      0.481∗∗
  Hashtags              0.502∗∗      0.462∗∗

Table 15: Analysis of variables. For both genders n = 1247. Bonferroni correction was applied to p-values. ∗ p ≤ 0.01, ∗∗ p ≤ 0.001

Analysis
We calculate the Pearson's correlation coefficients between the variables and the actual age using the same data from the age prediction experiments (train and test together), and report the results separated by gender in Table 15.

We find that younger people use more explicit stylistic modifications such as alphabetical lengthening and capitalization of words. Older people tend to use more complex language, with longer tweets, longer words and more prepositions. Older people also have a higher usage of links and hashtags, which can be associated with information sharing and impression management. The usage of pronouns is one of the variables most studied in relation with age. Consistent with Pennebaker and Stone (2003) and Barbieri (2008) we find that younger people use more first-person (e.g. I) and second person singular (e.g. you) pronouns. These are often seen as indicating interpersonal involvement. In line with the findings of Barbieri (2008), we also find that older people more often use first-person plurals (e.g. we).

In Figure 5 we have plotted a selection of the variables as they change with age, separated by gender. We also show the fitted LOESS curves (Cleveland, Grosse, and Shyu 1992). One should keep in mind that we have less data in the extremes of the age ranges. We find strong changes in the younger ages; however after an age of around 30 most variables show little change. What little sociolinguistics research there is on this issue has looked mostly at individual features. Their results suggest that the differences between age groups above age 35 tend to become smaller (Barbieri 2008). Such trends have been observed with stance (Barbieri 2008) and tag questions (Tottie and Hoffmann 2006). Related to this, it has been shown that adults tend to be more conservative in their language, which could also explain the observed trends. This has been attributed to the pressure of using standard language in the workplace in order to be taken seriously and get or retain a job (Eckert 1997).
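As an illustration of how two of the style variables and their correlation with age could be computed, the following is a minimal sketch. The whitespace tokenization, the reading of "capitalized" as fully upper-case, all function names and the four toy users are assumptions for illustration only; the significance testing and Bonferroni correction reported in Table 15 are not reproduced here.

```python
import math
import re

# Three identical characters in a row, e.g. the "kkk" in "stukkk".
TRIPLE = re.compile(r"(.)\1\1")

def is_lengthened(token):
    """Alphabetical lengthening: letters only, more than one unique
    character, and some character repeated three times in a row."""
    t = token.lower()
    return t.isalpha() and len(set(t)) > 1 and TRIPLE.search(t) is not None

def is_capitalized(token):
    """Assumption: 'capitalized' means fully upper-case (HAHA, LOL),
    with a minimum length of two characters."""
    return len(token) >= 2 and token.isupper()

def proportion(tokens, predicate):
    """Share of tokens for which predicate holds (0.0 for no tokens)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if predicate(t)) / len(tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient; both inputs must vary."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users: (age, whitespace-tokenized text). Not data from the study.
users = [
    (15, "dat is zooo LEUK haha".split()),
    (22, "ik ga morgeeen naar school".split()),
    (41, "vanmiddag met collega's bij de vergadering geweest".split()),
    (56, "mooie dag in de regio".split()),
]
ages = [age for age, _ in users]
lengthening = [proportion(tokens, is_lengthened) for _, tokens in users]
print(pearson(lengthening, ages))
```

Requiring more than one unique character is what excludes tokens such as www, and the letters-only check keeps punctuation runs from being counted as lengthening.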
[Figure 5: six panels plotting Proportion I, Proportion you, Word length, Tweet length, Links and Hashtags against Age (10 to 60).]

Figure 5: Plots of variables as they change with age. Blue: males, Red: females
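
To give an idea of how a smoothed curve of a variable against age can be obtained, the following is a minimal sketch of a locally weighted linear fit with tricube weights. It is a simplified variant of LOESS without robustness iterations; the span (frac), the local-linear degree, the function name and the toy data are assumptions, since the settings used for the plotted curves are not stated.

```python
import math

def loess_point(x0, xs, ys, frac=0.5):
    """Locally weighted linear fit evaluated at x0.

    frac is the share of points that defines the local neighbourhood.
    """
    n = len(xs)
    k = max(2, math.ceil(frac * n))
    # The distance to the k-th nearest neighbour sets the bandwidth.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    # Tricube weights: close points count most, points at distance >= h get 0.
    ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    if sw == 0:  # degenerate neighbourhood, fall back to the global mean
        return sum(ys) / n
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Hypothetical data: ages and a per-user variable (not values from the study).
ages = [14, 16, 18, 21, 25, 30, 36, 42, 50, 58]
tweet_length = [38, 41, 47, 55, 66, 74, 78, 80, 81, 83]
curve = [loess_point(a, ages, tweet_length) for a in ages]
print([round(v, 1) for v in curve])
```

Fitting such a curve separately for males and females would correspond to the two curves per panel described in the caption.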

One should keep in mind, however, that we have studied people with different ages, and we did not perform a longitudinal study that looked at changes within persons as they became older. Therefore the observed patterns may not indicate actual change within persons, but could be a reflection of changes between different generations.

Reflecting on the age prediction task and the analysis presented in this section, we make the following observations. First, for some variables there is almost no difference between males and females (e.g. tweet length), while for some other variables one of the genders consistently uses that variable more (e.g. first-person singular pronouns for females, links for males). In our prediction experiments, we also observed differences in the prediction performance between genders. We also found differences in the gender distribution across age categories on Twitter. Therefore, we conclude that researchers interested in the relation between language use and age should not ignore the gender variable.

Second, in the automatic prediction of exact age we found that as people get older the system almost always underpredicts the age. When studying how language changes over time, we find that most change occurs in the younger ages, while at the older ages most variables barely change. This could be an explanation of why it is harder to predict the correct age of older people (for both humans and the automatic system). This also suggests that researchers wanting to improve an automatic age prediction system should focus on improving prediction for older persons, and thus on identifying variables that show more change at older ages.

Conclusion
We presented a study on the relation between the age of Twitter users and their language use. A dataset was constructed by means of a fine-grained annotation effort of more than 3000 Dutch Twitter users. We studied age prediction based only on tweets. Next, we presented a detailed analysis of variables as they change with age.

We approached age prediction in different ways: predicting the age category, life stage, and the actual age. Our system was capable of predicting the exact age within a margin of 4 years. Compared with humans, the automatic system performed better and was much faster. For future research, we believe that life stages or exact ages are more meaningful than dividing users based on age groups. In addition, gender should not be ignored, as we showed that how age is displayed in language is also strongly influenced by the gender of the person.

We also found that most changes occur when people are young, and that after around 30 years the studied variables show little change. This may also explain why it is more difficult to predict the age of older people (for both humans and the automatic system).

Our models were based only on the tweets of the user. This has the practical advantage that the data is easy to collect, and thus the models can easily be applied to new Twitter users. However, a deeper investigation into the relation between language use and age should also take factors such as the social network and the direct conversation partners of the tweeters into account.
Acknowledgements
This research was supported by the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Organization for Scientific Research (NWO), grants IB/MP/2955 (TINPOT) and 640.005.002 (FACT). The authors would like to thank Mariët Theune and Leonie Cornips for feedback, Charlotte van Tongeren and Daphne van Kessel for the annotations, and all participants of the user study for their time and effort.

References
Argamon, S.; Koppel, M.; Pennebaker, J.; and Schler, J. 2007. Mining the blogosphere: age, gender, and the varieties of self-expression. First Monday 12(9).
Bamman, D.; Eisenstein, J.; and Schnoebelen, T. 2012. Gender in Twitter: styles, stances, and social networks. CoRR.
Barbieri, F. 2008. Patterns of age-based linguistic variation in American English. Journal of Sociolinguistics 12(1):58–88.
Beevolve.com. An exhaustive study of Twitter users across the world. http://www.beevolve.com/twitter-statistics/. Last accessed: Jan 2013.
Bell, A. 1984. Language style as audience design. Language in society 13(2):145–204.
Bergsma, S.; Post, M.; and Yarowsky, D. 2012. Stylometric analysis of scientific articles. In NAACL 2012.
Burger, J. D.; Henderson, J.; Kim, G.; and Zarrella, G. 2011. Discriminating gender on Twitter. In EMNLP 2011.
Cleveland, W.; Grosse, E.; and Shyu, W. 1992. Local regression models. Statistical models in S 309–376.
Eckert, P. 1989. Jocks and burnouts: Social categories and identity in the high school. Teachers College Press.
Eckert, P. 1997. Age as a sociolinguistic variable. The handbook of sociolinguistics. Blackwell Publishers.
Eckert, P. 2000. Linguistic Variation as Social Practice: The Linguistic Construction of Identity in Belten High. Wiley-Blackwell.
Eckert, P. 2008. Variation and the indexical field. Journal of Sociolinguistics 12(4):453–476.
Eisenstein, J.; O'Connor, B.; Smith, N. A.; and Xing, E. P. 2010. A latent variable model for geographic lexical variation. In EMNLP 2010.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9.
Fink, C.; Kopecky, J.; and Morawski, M. 2012. Inferring gender from the content of tweets: A region specific example. In ICWSM 2012.
Garera, N., and Yarowsky, D. 2009. Modeling latent biographic attributes in conversational genres. In ACL-IJCNLP 2009.
Goswami, S.; Sarkar, S.; and Rustagi, M. 2009. Stylometric analysis of bloggers' age and gender. In ICWSM 2009.
Holmes, J., and Meyerhoff, M. 2003. The handbook of language and gender. Oxford: Blackwell.
Labov, W. 1966. The social stratification of English in New York City. Centre for Applied Linguistics.
Labov, W. 1990. The intersection of sex and social class in the course of linguistic change. Language variation and change 2(2):205–254.
Ling, R. 2005. The sociolinguistics of SMS: An analysis of SMS use by a random sample of Norwegians. Mobile Communications 335–349.
Marwick, A. E., and Boyd, D. 2011. I tweet honestly, i tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society 13(1):114–133.
Mislove, A.; Lehmann, S.; Ahn, Y.-Y.; Onnela, J.-P.; and Rosenquist, J. 2011. Understanding the demographics of Twitter users. In ICWSM 2011.
Nguyen, D.; Smith, N. A.; and Rosé, C. P. 2011. Author age prediction from text using linear regression. In LaTeCH 2011.
O'Connor, B.; Krieger, M.; and Ahn, D. 2010. TweetMotif: exploratory search and topic summarization for Twitter. In ICWSM 2010.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
Pennacchiotti, M., and Popescu, A.-M. 2011. A machine learning approach to Twitter user classification. In ICWSM 2011.
Pennebaker, J., and Stone, L. 2003. Words of wisdom: Language use over the life span. Journal of personality and social psychology 85(2):291.
Rao, D.; Yarowsky, D.; Shreevats, A.; and Gupta, M. 2010. Classifying latent user attributes in Twitter. In SMUC 2010.
Rao, D.; Paul, M.; Fink, C.; Yarowsky, D.; Oates, T.; and Coppersmith, G. 2011. Hierarchical bayesian models for latent attribute detection in social media. In ICWSM 2011.
Rosenthal, S., and McKeown, K. 2011. Age prediction in blogs: a study of style, content, and online behavior in pre- and post-social media generations. In ACL 2011.
Sarawgi, R.; Gajulapalli, K.; and Choi, Y. 2011. Gender attribution: tracing stylometric evidence beyond topic and genre. In CoNLL 2011.
Tottie, G., and Hoffmann, S. 2006. Tag questions in British and American English. Journal of English Linguistics 34(4):283–311.
Trudgill, P. 1974. The social differentiation of English in Norwich. Cambridge University Press.
Zamal, F. A.; Liu, W.; and Ruths, D. 2012. Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In ICWSM 2012.
Zijlstra, H.; van Middendorp, H.; van Meerveld, T.; and Geenen, R. 2005. Validiteit van de Nederlandse versie van de Linguistic Inquiry and Word Count (LIWC). Netherlands Journal of Psychology 60(3).