"How Old Do You Think I Am?": A Study of Language and Age in Twitter
"How Old Do You Think I Am?": A Study of Language and Age in Twitter
Effect of Sampling Method
The annotators were instructed to only annotate the users that met the following requirements:
• The account should be publicly accessible.
• The account should represent an actual person (e.g. not an organization).
• The account should have ‘sufficient’ tweets (at least 10).
• The account should have Dutch tweets (note that this does not eliminate multilingual accounts).

Figure 1: Plot of frequencies per age (x-axis: Age; y-axis: Frequency)
We separated the reasons why accounts were discarded by the two sampling methods that were used (het and followers/followees); the first requirement in the list that was not satisfied was marked. The results are reported in Table 1. We observe that the proportion of actual annotated users is much higher for the users obtained using the query ‘het’. The users obtained by sampling from the followers and followees included more non-Dutch accounts, as well as accounts that did not represent persons. In addition, there was also a group of people who had protected their account between the time of sampling and the time of annotation. In total, 3185 users were annotated.

Gender
The biological gender was annotated for 3166 persons (for some accounts, the annotators could not identify the gender). The gender ratio was almost equal, with 49.5% of the persons being female. However, as we will see later, the ratio depends on age. The gender was mostly determined based on the profile photo or a person’s name, but sometimes also on their tweets or profile description.
Mislove et al. (2011) analyzed the US Twitter population using data from 2006-2009. Using popular female and male names they were able to estimate the gender of 64% of the people, finding a highly biased gender ratio with 72% being male. A more recent study by Beevolve.com, however, found that 53% were women, based on information such as name and profile.

Age
Because we expected most Twitter users to be young, the following three categories were used: 20-, 20-40, 40+. The age category was annotated for 3110 accounts. The results separated by gender are shown in Table 2¹. There are more females in the young age group, while there are more men in the older age groups. The same observation was made in statistics reported by Beevolve.com.

¹ Note that this table only takes persons into account for whom both age and gender were annotated.

We also asked our annotators to annotate the exact age. Sometimes it was possible to get an almost exact estimate, for example by using LinkedIn profiles, exact age mentions in the profile or tweets, or mentions of which grade the person was in. However, since this was not always the case, annotators also indicated a margin (0, 2, 5 or 10 years) of how sure they were. Figure 1 shows a graph with the frequencies per year of age. Table 3 reports the frequencies of the indicated margins. In our data, we find that the margin for young users is low, and that for older users the margin is much higher.

Age estimation margin    Frequency
0                        703
2                        1292
5                        918
10                       173

Table 3: Frequencies of margins for the exact age annotation

As discussed earlier in this paper, it may be more natural to distinguish users according to their life stage instead of a fixed age category. Life stages can be approached from different dimensions. In this paper, we use life stages based on the occupation of people, by distinguishing between students, employed, retired etc. The results are displayed in Table 4. Unfortunately, the decision to annotate this was made while the annotation process was already underway; therefore the accounts of some users were not available anymore (either removed or protected).
We find that the most common life stages are associated with clear age boundaries, although the boundaries are not the same as for the age categories. We find the following age spans in which 90% of the persons fall: secondary school students (12-16 yrs), college students (16-24 yrs), employees (24-52 yrs). However, note that with the life stage approach, people may be assigned to a different group than the group that most resembles their age, if this group matches their life stage better. We have plotted the overlap between life stage and age categories in Figure 2.
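The 90% age spans reported for the life stages can be illustrated with a small percentile computation. The sketch below is an illustration only. The text does not state which convention was used, so the central interval (roughly the 5th to 95th percentile, nearest rank) is an assumption, and the function names `central_span` and `spans_by_life_stage` are hypothetical.

```python
from collections import defaultdict


def central_span(ages, coverage=0.90):
    """Return (low, high) bounding the central `coverage` share of `ages`.

    Uses nearest-rank percentiles; this convention is an assumption,
    not something the paper specifies.
    """
    if not ages:
        raise ValueError("ages must not be empty")
    ordered = sorted(ages)
    tail = (1.0 - coverage) / 2.0
    last = len(ordered) - 1
    low_idx = int(tail * last + 0.5)           # nearest rank for lower bound
    high_idx = int((1.0 - tail) * last + 0.5)  # nearest rank for upper bound
    return ordered[low_idx], ordered[high_idx]


def spans_by_life_stage(records, coverage=0.90):
    """Group (life_stage, age) pairs and compute one span per life stage."""
    grouped = defaultdict(list)
    for stage, age in records:
        grouped[stage].append(age)
    return {stage: central_span(ages, coverage) for stage, ages in grouped.items()}
```

Applied to annotated (life stage, exact age) pairs, a computation of this kind yields one interval per life stage, which can then be compared against the fixed age categories.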
Life Stage                  Frequency
Secondary school student    1352
College student             316
Employee                    1021
Retired                     5
Other                       15
Unknown                     132
Not accessible              344

Table 4: Life stage frequencies

Figure 2: Overlap life stage and age categories (x-axis: Age, with categories 20-, 20-40, 40+; y-axis: Number of accounts; series: School students, College students, Employees)

Inter-annotator Agreement
We employed two students to perform the annotations. A total of 84 accounts were annotated by both. Inter-annotator agreement was measured using Cohen’s kappa. Generally, a value above 0.7 is considered acceptable. We found the following kappa values: gender (1.0), age category (0.83) and life stage (0.70). For the actual age, the mean absolute difference was 1.59 years.

Age Prediction
Goal
In this section we compare the different ways of approaching age, by testing how feasible age prediction is using simple features based only on the text of tweets. We will automatically predict the following:
• Age category: 20-, 20-40, 40+
• Age: continuous variable
• Life stage: secondary school student², college student, employee
For the life stage, we only use categories for which we had a sufficient number of persons. Note that classifying age according to age category and life stage are multiclass classification problems, while treating age as a continuous variable results in a regression problem. In addition, we compare our systems with the performance of humans on this task.

² In Dutch this is translated to scholier, which includes all students up to and including high school; there is no direct translation in English.

Evaluation
We will evaluate the performance of our classification methods (to predict the age category and life stage) using the F1 measure. We will report both the macro and micro averages. The regression problem (predicting age as a continuous variable) will be evaluated using Pearson’s correlation coefficient, mean absolute error (MAE) and accuracy, where a prediction is counted as correct if it falls within the margin specified by the annotators.

Dataset
We restricted our dataset to users who had at least 20 tweets and for whom the gender, age category and exact age were annotated. For each user we sampled up to 200 tweets. We divided the dataset into a train and test set. Each set contains an equal number of males and females, and the same age distribution (according to the annotated age categories) across gender categories. This limits the risk of the model learning features that are, for example, associated with a particular gender merely because that gender occurs more often in a particular age category. Parameter tuning and development of the features were done using cross-validation on the training set. The statistics are presented in Table 5.

         Train         Test
         M     F       M     F
20-      602   602     186   186
20-40    231   231     73    73
40+      118   118     37    37
Total    1902          592

Table 5: Dataset statistics

Learning Algorithm
We use linear models, specifically logistic and linear regression, for our tasks. Given an input vector $x \in \mathbb{R}^m$, $x_1, \ldots, x_m$ represent features (also called independent variables or predictors). In the case of classification with two classes, e.g. $y \in \{-1, 1\}$, the model estimates a conditional distribution $P(y \mid x, \beta) = 1 / (1 + \exp(-y(\beta_0 + x^\top \beta)))$, where $\beta_0$ and $\beta$ are the parameters to estimate. We use a one versus all method to handle multiclass classification. In the case of regression, we find a prediction $\hat{y} \in \mathbb{R}$ for the exact age of a person $y \in \mathbb{R}$ using a linear regression model: $\hat{y} = \beta_0 + x^\top \beta$. In order to prevent overfitting we use Ridge (also called $L_2$) regularization. We make use of the liblinear (Fan et al. 2008) and scikit-learn (Pedregosa et al. 2011) libraries.

Preprocessing & Features
Tokenization is done using the tool by O’Connor, Krieger, and Ahn (2010). All user mentions (e.g. @user) are replaced by a common token. Because preliminary experiments showed that a unigram system already performs very well, we only use unigrams to keep the approach simple. We keep words that occur at least 10 times in the training documents. In the next section, we will look at more informed features and how they change as people get older.
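The preprocessing and the one versus all scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the paper. Whitespace splitting stands in for the dedicated Twitter tokenizer, the `<user>` token is a hypothetical choice, "occur at least 10 times" is read as total occurrences across the training documents, raw counts are used as feature values, and any weights passed to the scoring functions are made up. All function names are hypothetical.

```python
import math
import re
from collections import Counter

MENTION = re.compile(r"@\w+")


def tokenize(tweet):
    """Lowercase, replace user mentions by a common token, split on whitespace.

    Whitespace splitting is a stand-in for a real Twitter tokenizer.
    """
    return MENTION.sub("<user>", tweet.lower()).split()


def build_vocabulary(train_docs, min_count=10):
    """Keep unigrams seen at least `min_count` times in the training documents."""
    counts = Counter(tok for doc in train_docs for tok in tokenize(doc))
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


def featurize(doc, vocab):
    """Map a document to a vector of unigram counts (assumed feature values)."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def logistic_prob(x, beta0, beta, y=1):
    """P(y | x, beta) = 1 / (1 + exp(-y (beta0 + x^T beta))), with y in {-1, 1}."""
    score = beta0 + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-y * score))


def one_vs_all_predict(x, models):
    """Pick the label whose binary model gives the highest P(y = +1 | x).

    `models` maps each label to a (beta0, beta) pair.
    """
    return max(models, key=lambda label: logistic_prob(x, *models[label]))
```

In practice the parameters would be estimated with an L2-regularized solver such as those in liblinear or scikit-learn; the sketch only shows how a fitted model turns unigram counts into a class decision.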
Results
In this section we present the results of the three age prediction tasks. The results can be found in Tables 6 and 7. We find that a simple system using only unigram features can already achieve high performance, with micro F1 scores above 0.86 for the classification approaches and an MAE of less than 4 years for the regression approach. We also experimented with applying a log transformation of the exact age for the regression task. The predicted values were converted back when calculating the metrics. We find that the MAE and accuracy both improve (MAE from 3.88 to 3.62 years, accuracy from 0.47 to 0.57). In the rest of this section, when referring to the regression run, we refer to the standard run without a log transformation.

Run               F1 macro    F1 micro
Age categories    0.7670      0.8632
Life stages       0.6785      0.8628

Table 6: Results classification

Run                     ρ         MAE       Accuracy
Age regression          0.8845    3.8812    0.4730
Age regression - log    0.8733    3.6172    0.5709

Figure 4: Scatterplot absolute error (x-axis: Actual age; y-axis: Error)

Dutch        English        Weight
school       school         -0.081
ik           I              -0.073
:)           :)             -0.071
werkgroep    work group     -0.069
stages       internships    -0.069
oke          okay           -0.067
xd           xd             -0.066
ben          am             -0.066
haha         haha           -0.064
als          if             -0.064

Table 8: Top features for younger people (regression)
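The regression metrics reported in this section can be made concrete with a short sketch. It is an illustration under assumptions rather than the evaluation code behind the tables. A prediction is treated as correct when its absolute error is at most the annotator margin (inclusive), the log transformation is taken to be the natural logarithm, and the function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between annotated and predicted ages."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def accuracy_within_margin(actual, predicted, margins):
    """Share of predictions whose error is within the annotator's margin.

    The inclusive comparison (<=) is an assumption.
    """
    hits = sum(1 for a, p, m in zip(actual, predicted, margins) if abs(a - p) <= m)
    return hits / len(actual)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def from_log(log_predictions):
    """Convert predictions made on log(age) back to years (natural log assumed)."""
    return [math.exp(v) for v in log_predictions]
```

For the log run, a model fitted on log(age) would have its outputs passed through `from_log` before the three metrics are computed, so that both runs are compared in years.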
Figure 5: Plots of variables as they change with age. Blue: males, Red: females (panels include Tweet length, Word length, Hashtags and Links, each plotted against Age)