Modelling View-Count Dynamics in Youtube
Cédric Richier∗, Eitan Altman†, Rachid Elazouzi∗, Tania Jimenez∗, Georges Linares∗ and Yonathan Portilla∗
∗ University of Avignon, 84000 Avignon, FRANCE
Email: [email protected]
† INRIA, B.P 93, 06902 Sophia Antipolis Cedex, FRANCE
Email: [email protected]
arXiv:1404.2570v2 [cs.SI] 28 May 2014
Abstract—The goal of this paper is to study the behaviour of view-count in YouTube. We first propose several bio-inspired models for the evolution of the view-count of YouTube videos. We show, using a large set of empirical data, that the view-count for 90% of videos in YouTube can indeed be associated to at least one of these models, with a Mean Error which does not exceed 5%. We derive automatic ways of classifying the view-count curve into one of these models and of extracting the most suitable parameters of the model. We study empirically the impact of videos’ popularity and category on the evolution of their view-count. We finally use the above classification along with the automatic parameters extraction in order to predict the evolution of videos’ view-count.
Keywords—YouTube, bio-inspired models, view-count.
I. INTRODUCTION
YouTube has been one of the most successful user-generated video sharing sites since its establishment in early 2005 and currently constitutes the largest share of Internet traffic. The rate of subscription to YouTube, as well as the rate of submitted videos, has been growing steadily, and none of its competitors has achieved a similar success [1], [2]. An important aspect of videos in YouTube is their popularity, which is defined as the view-count. Understanding and predicting the popularity is useful from a twofold perspective: on one hand, more popular content generates more traffic, so understanding popularity has a direct impact on the caching and replication strategy that the provider should adopt; on the other hand, popularity has a direct economic impact. A number of researchers have analyzed the popularity characteristics of user-generated video content in order to understand the processes governing their popularity dynamics [5], [10], [13], [18], [8], [6], with the aim of developing models for early-stage prediction of future popularity [19]. There has also been interest in understanding which factors lead some videos to become more popular than others. But few works have studied the temporal aspects of the popularity dynamics using metrics such as view-counts, ratings and number of comments [5], [9], [17].
In this paper we describe some of the most typical behaviour of the view-count of videos in YouTube. This allows us to provide in-depth analysis and develop models that capture the key properties of the observed popularity dynamics. Our goal is to match observed video view-counts with one of several dynamic models. To select candidates for these models, we turned to bio-inspired dynamics, as we believed that the propagation of a content in YouTube has a strong similarity with the temporal behaviour of an infectious disease, which is a classical topic in mathematical biology [3], [16]. Such models of disease spread have already been used to model the spread of viruses in computer networks [7], [12]. They have also been used in marketing for capturing the life cycle dynamics of a new product [4]. A large number of papers in marketing have shown that the product sales life cycle follows an S-curve pattern in which product sales initially grow at a fast rate and then fall off as the limit of the market share is approached [14].
We propose several information diffusion models to classify a dataset of more than 800000 videos randomly extracted from YouTube and aged between 5 and 2500 days. In particular, we exhibit six mathematical models to which we fit videos in our dataset. We then propose automatic ways for associating each video to one of the considered models. The first criterion in selecting the model is related to the size of the population that may be potentially interested in the content. We differentiate between models in which the population potentially interested in the content is nearly constant (we call this the "fixed target population property") and those in which it grows in time (inspired by the branching process terminology, we call this "immigration"). The fixed target population property occurs in some video categories in YouTube such as news, sport and movies. Indeed, videos in these categories quickly reach the peak of their popularity, and then within a short time the diffusion dies out and the view-count does not further increase.
The second criterion in the classification concerns the structural virality. A model is said to be viral (or to have the viral property) if contaminated nodes (these are the viewers of a video) have a significant role in the propagation of the video through sharing or embedding. It is non-viral if the propagation of the video essentially relies on broadcast of the video from the source (it is then said to have the broadcast property). In that case, a large fraction of the potential target population can receive the information directly from the source.
Our contribution can be summarised in the following key points:
• We propose six mathematical biology-inspired models and we show that at least 90% of videos in YouTube are associated to one of these six mathematical models with a Mean Error Rate less than 5%. We further show how to extract the model parameters for each video.
• We study the robustness of these models to the different thematic categories of the video in YouTube and to different values of the peak popularity of the video. We show that the fraction of videos within a given model is quite robust and shows little dependence
on the different thematic categories of the video, except for the Education category, which has a different behaviour: for this category it seems that word-of-mouth is the dominant mechanism through which contents are disseminated. The bio-inspired models that we selected are further shown to be robust with respect to the peak popularity of the video, but the distribution among them is slightly different between those videos that have acquired less than 1000 views and the rest of the videos.
• In more than 80% of videos in YouTube, the potential population interested in the video increases over time.
• Two of the six models (the modified negative exponential and modified Gompertz models) cover most of the videos in our YouTube dataset (more than 75%). Both models capture the case of an immigration process in which the potential population, or ceiling value, becomes dynamic. Further, the modified negative exponential characterizes the dynamics of non-viral content, and it predicts that the accumulated number of views does not contribute to the propagation of the content. This model corresponds to the scenario wherein the content has been broadcast to a pool of users. On the other hand, the Gompertz model captures viral videos in which a part of this dynamic is propagated through word-of-mouth.
• We finally use the above classification along with the automatic parameters extraction in order to predict the evolution of videos’ view-count. We consider two scenarios: in the first we use half of the view-count curve as a training sequence, while in the second one we take a fixed training sequence that corresponds to the first 50 days in the lifetime of the video. We then compare the predicted curve to the actual one and study the prediction capacity within a given error bound.
II. SETTING AND DATA
Since we intend to study different types of dynamic evolution of the view-count in YouTube, we need to collect a huge number of videos which are available to the general public. In this section we describe how we collected the dataset used in this study. On YouTube, a video is accompanied by a set of valuable data such as title, upload time, view-count and related videos. The video web page also provides some statistics, which are available if the content’s owner allows it.
YouTube provides two APIs which allow retrieving some of those data: the YouTube Data API for collecting static data (which are available for every user) and the YouTube Analytics API for seeking video statistics such as the dynamics of a content (which are only available for the content’s owner). Since some data cannot be collected through the APIs, we used a tool named YOUStatAnalyzer [20] in order to collect all valuable data.
The collected data are stored in a noSQL database (MongoDB). The noSQL solution has been chosen to allow dynamic insertion of new features for future works. The dataset used for this study contains more than 80000 videos randomly extracted from YouTube and aged between 5 and 2500 days. This dataset contains some static information for each video such as YouTube id, title of the video, name of the author, duration and list of related videos. It also provides the evolution of some metrics (shares, subscribers, watch time and views) in a daily form and in a cumulative form, from the upload day till the date of crawling.
III. POPULARITY GROWTH PATTERNS
We focus the analysis on view-count as the main popularity metric of a video. Previous analyses of YouTube showed a strong correlation between view-count and other metrics such as number of comments, favourites and rating. Further, the correlation between these metrics becomes stronger with popularity [8]. We model the dynamic evolution of view-count using mathematical models from biology. We classify the evolution of view-count in YouTube using two criteria:
• Size of the target population: The target population size is the maximum number of individuals that can potentially be interested in the content. A target population belongs to one of these two types: (i) a fixed finite target population or (ii) a potential target population that grows in time, which we call the immigration process.
• Virality: A content is viral if the population that has seen the content participates actively in the dissemination of the content. The content spreads like a virus does in epidemics. Thus the probability that an individual who has not seen the content so far gets it from someone who has seen it increases in time. On the contrary, a non-viral content is one for which former viewers scarcely alter the diffusion process.
In the following we describe the dynamic models in biology and their uses. These dynamic models have been hypothesised to describe the contagion phenomena, and each model has its own set of assumptions about how users are infected by others. These models may provide some answers about the behaviour of users in YouTube, even if this behaviour remains notoriously difficult to quantify.
A. Fixed target population
1) Viral content: To describe viral content with a fixed target population, we use the logistic model or the Gompertz model. These models have been used in technology forecasting and are referred to as "S-shaped" curves. We test these models to capture the evolution of the view-count of a video in YouTube, since there is a strong similarity between a video posted in YouTube and a new product launched into the marketplace. Indeed, as shown in different problems in marketing, the adoption of a technology product often grows slowly at first, followed by rapid exponential growth, and finally falls off as the limit of the market share is approached.
Logistic model: The logistic model is a common sigmoid function which describes the evolution of the view-count of a video with a fixed target population. This is a first order non-linear differential equation of the form
$$\frac{dS}{dt} = \lambda S (M - S) \qquad (1)$$
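As a side illustration, equation (1) has the standard closed-form solution $S(t) = \frac{M\,S(0)}{S(0) + (M - S(0))\,e^{-\lambda M t}}$. The Python sketch below evaluates it and checks numerically that it satisfies (1). The function names and the parameter values are assumptions chosen for illustration only; they are not taken from the dataset.

```python
import math


def logistic_views(t, s0, m, lam):
    """Closed-form solution of dS/dt = lam * S * (M - S) with S(0) = s0."""
    return m * s0 / (s0 + (m - s0) * math.exp(-lam * m * t))


def logistic_rhs(s, m, lam):
    """Right-hand side of equation (1)."""
    return lam * s * (m - s)


if __name__ == "__main__":
    # Hypothetical normalised parameters (M = 1 means "full target population").
    s0, m, lam = 0.01, 1.0, 8.0
    h = 1e-6
    for t in (0.0, 0.5, 1.0):
        s = logistic_views(t, s0, m, lam)
        # Central finite difference as an independent estimate of dS/dt.
        slope = (logistic_views(t + h, s0, m, lam)
                 - logistic_views(t - h, s0, m, lam)) / (2 * h)
        print(f"t={t:.1f}  S={s:.4f}  dS/dt={slope:.4f}  "
              f"rhs={logistic_rhs(s, m, lam):.4f}")
```

The curve is symmetric around the time at which S reaches M/2, which is the property that makes the logistic model a poor fit for view-count curves whose convex and concave phases differ.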
behaviour seems to fit well for some YouTube view-count evolution dynamics.
Fig. 1. Example of non-symmetric S-shaped behaviour of view-count in YouTube (viral).
2) Non-viral content: Non-viral content describes the case where users do not contribute to the propagation of the content. This is the case when the time scale of the content diffusion is very large compared to the size of the potential population. Hence this dynamic can model the case where contents gain popularity through advertisement and other marketing tools: an example is when an advertisement is broadcast to a very large pool of users of a social network and people access the content at random thereafter. Hence we assume that the evolution dynamic of the content follows the linear differential equation:
$$\frac{dS}{dt} = \lambda (M - S) \qquad (3)$$
The solution of (3) is given by:
$$S(t) = S(0) + (M - S(0))\,(1 - e^{-\lambda t})$$
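The closed form above can be checked against a direct numerical integration of (3). The following Python sketch does so with explicit Euler steps. The function names, the step count and the parameter values are illustrative assumptions.

```python
import math


def neg_exp_views(t, s0, m, lam):
    """Closed-form solution of dS/dt = lam * (M - S) with S(0) = s0."""
    return s0 + (m - s0) * (1.0 - math.exp(-lam * t))


def euler_views(t_end, s0, m, lam, steps=100_000):
    """Integrate equation (3) with explicit Euler steps (independent check)."""
    dt = t_end / steps
    s = s0
    for _ in range(steps):
        s += dt * lam * (m - s)
    return s


if __name__ == "__main__":
    # Hypothetical normalised parameters.
    s0, m, lam = 0.0, 1.0, 3.0
    for t in (0.25, 0.5, 1.0):
        print(f"t={t:.2f}  closed form={neg_exp_views(t, s0, m, lam):.5f}  "
              f"euler={euler_views(t, s0, m, lam):.5f}")
```

Because the growth rate λ(M − S) is largest at t = 0 and only decreases afterwards, the curve is concave from the start and has no inflection point, in contrast to the S-shaped viral models.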
[Figure: Gompertz model, S(t) against t, three panels: varying M, varying λ, varying S(0).]
[Figure: three panels. (a) YouTube categories distribution over the whole dataset (percent of all); (b) Age distribution (in number of days); (c) Logarithmic distribution of popularity (percent of all).]
Fig. 3. Distributions of some features of the YouTube dataset.
model). This is typically the case with a fixed finite population. But at the horizon, the curve that represents the data (plain line) seems to follow a line with a non-zero slope. This is what we call the immigration phenomenon, when the size of the target population increases linearly in time. In that case, we model the dynamics by the modified negative exponential function introduced in subsection III-B1:
$$S(t) = S(0) + (M - S(0))\,(1 - e^{-\lambda t}) + kt$$
This model fits better, as shown in Figure 4c. A mixed strategy, which consists in cutting the data into two subsets and then applying a linear model on one subset and a negative exponential model on the other, will be discussed later.
d) Data fitting for viral contents: Three models have been considered in the case of viral contents: the logistic model, the Gompertz model and a modified Gompertz model (see section III-B2). The first two models are for the context of a fixed finite population and the third one is introduced in the case of immigration.
Figure 5 is an example where we fit these models to one YouTube content (Figure 5a). We observe that the S-shape of the logistic model curve is symmetric, due to the symmetry property of the sigmoid function (Figure 5b). However, the convex phase and the concave phase of the data are not symmetric, as we can observe in Figure 5a. Hence the logistic model does not fit well. Then, the Gompertz model and the modified Gompertz model are fitted to the same YouTube content. The Gompertz model (Figure 5c) fits better than the logistic model, and the modified Gompertz model (Figure 5d) better describes the behaviour of the data at the horizon (immigration phenomenon).
e) Mixing linear and non-linear models: An issue that results from the models we use is the change of the curve dynamics at the horizon. We observe two types of behaviour at the horizon: a flat line, showing that the limit of the potential population has been reached, or an oblique line, highlighting the fact that the population continues to grow.
The first case is coherent with the description of the dynamics (the maximum is then one of the parameters of the dynamic equation of the model). However, for the second case we need to add a linear component to the solution function, implying that the initial dynamic equations are no longer valid. Hence, we propose to adopt a two-phase approach for data fitting. Figure 6 gives an example where this method is applied to a YouTube content (Figure 6a).
Given a set of observations $(y_i, t_i)_{1 \le i \le n}$, a first phase consists in pointing out a linear behaviour from a time $t = t_k$, $k \in \{1, \dots, n\}$. The idea is to find $k$ in an iterative way in order to have a good regression line for the subset $(y_i, t_i)_{k \le i \le n}$. The algorithm is the following: we first fix a reasonably small threshold $\epsilon$. We then apply a linear regression with all the points. The regression line gives the set $(\hat{y}_i)_i$ of the estimated values. Let $\bar{y}$ be the mean of $(y_i)_i$; we consider the coefficient of determination $r_1$ given by:
$$r_1 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
• if $|1 - r_1| \le \epsilon$, the observations can be considered as linear (the whole view-count curve is well described by the linear model) and the process ends.
• else, the first element of $(y_i, t_i)_{1 \le i \le n}$ is removed from the set and a new linear regression is done for the subset $(y_i, t_i)_{2 \le i \le n}$. That gives a new coefficient of determination $r_2$.
The process is repeated until the coefficient of determination $r_k$ satisfies $|1 - r_k| \le \epsilon$. Doing that, we determine the rank $k$ from which the observations $(y_i, t_i)_{k \le i \le n}$ can be considered as having a linear behaviour. In Figure 6b, time $t_k$ is represented by a vertical line. From $t_k$ the behaviour can be well described by a linear model (dot-dashed line).
In a second phase, parameter estimation for the non-linear models presented in the previous section is done on the subset $(y_i, t_i)_{1 \le i \le k}$. Figure 6b illustrates this phase: here, the modified negative exponential model is applied to the subset on the left side of $t_k$ (dashed line).
V. AUTOMATIC CLASSIFICATION
In this section we first investigate the process for an automatic classification of YouTube contents. We then go further in analysing the results of our experiment. Finally we give some keys on how to use this classification in order to predict the view-count.
A. Classification issues
The main goal of our work is to provide a system that can automatically classify YouTube contents by associating one model to one content. For each content, two issues have to be managed: First, evaluate each model in order to know
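The first phase of the two-phase fitting approach (the iterative search for the rank k from which the observations behave linearly) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default threshold `eps` and the `min_points` guard (which stops the search before a regression on too few points) are assumptions introduced here.

```python
def r_squared(ts, ys):
    """Coefficient of determination of the least-squares line through (ts, ys)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    if ss_tot == 0.0:
        # Assumption: perfectly constant data is treated as linear (flat line).
        return 1.0
    return 1.0 - ss_res / ss_tot


def linear_tail_start(ts, ys, eps=1e-3, min_points=3):
    """Return the smallest 0-based index k such that the points from k onwards
    satisfy |1 - r_k| <= eps, or None if no such tail is found."""
    for k in range(len(ts) - min_points + 1):
        # Drop the first k observations and refit the regression line.
        if abs(1.0 - r_squared(ts[k:], ys[k:])) <= eps:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical view-count curve: fast concave growth, then a linear tail.
    ts = list(range(20))
    ys = [0.0, 50.0, 75.0, 87.5, 93.75] + [95.0 + i for i in range(15)]
    k = linear_tail_start(ts, ys)
    print("linear behaviour starts at index", k)
```

In the second phase, the non-linear model would then be fitted only to the observations before index k.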
[Figure 4 panels, views (normalized) against age (normalized), real data and model curves: "Negative Exponential Model Example" and "Modified Negative Exponential Model Example". (a) YouTube content with a concave shape; (b) View-count curve (plain) is approximated by a negative exponential curve (dashed); (c) View-count curve (plain) is approximated by a modified negative exponential curve (dashed).]
Fig. 4. From a YouTube content (on the left side), parameters of the negative exponential model are estimated, then the obtained curve is compared to data (in the centre). The same process is applied to a negative exponential model in which a linear component has been added (on the right side).
[Figure 5 panels, views (normalized) against age (normalized), real data and model curves: (a) YouTube video; (b) Logistic model fitting; (c) Gompertz model fitting; (d) Modified Gompertz model fitting.]
Fig. 5. From a YouTube video with an S-shaped view-count curve (5a), we first fit the logistic model in 5b. The estimated curve (dashed) is compared with the actual normalised view-count curve (plain). Then the same is done with the Gompertz model in 5c and finally with the modified Gompertz model in 5d.
$$MER = \frac{1}{n} \sum_{i} \frac{|\hat{y}_i - y_i|}{y_i + 1}$$
[Figure: legend entries real data, model, linear, linear threshold.]
5% of the observed value. With this criterion we can fix a threshold beyond which one model would be considered as
TABLE III. GOODNESS OF FIT FOR MODELS FROM FIGURE 4
Model               MCS      MER      GoF
Neg. exponential    3.558    0.074    0.004
TABLE IV. GOODNESS OF FIT FOR MODELS FROM FIGURE 5
Model       MCS      MER      GoF
Sigmoid     0.480    0.021    10^-3
Gompertz    0.092    0.018    1.846 × 10^-4
[Figure: panels titled "Models distribution" and "Histogram of Mean Error Rate − All models"; axis labels "Percent of all" and "MER"; model labels modExp, Sigm, modSigm, modGomp and E, ME, G, MG, S, MS.]
Figure 5. In the example from Figure 4, with a threshold of MER fixed at 0.075, both the negative exponential model and the modified negative exponential model are relevant. With a GoF of value 4.98 × 10^-4, the modified negative exponential model is the one that best fits the data. In the case of the YouTube content depicted in Figure 5, if the threshold of MER is fixed at the value of 0.02, the sigmoid model is not reliable whereas the Gompertz model and the modified Gompertz model are under the threshold.
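The selection logic discussed above can be sketched in a few lines of Python. Two points are assumptions of this sketch rather than statements from the paper: MER is taken to be the mean of |ŷ_i − y_i| / (y_i + 1), the same normalisation that appears in the prediction-window definition, and the association rule is read as "discard models whose MER exceeds the threshold, then keep the one with the smallest GoF". The function names are hypothetical. The numbers in the usage example are the Table IV values.

```python
def mean_error_rate(observed, estimated):
    """Mean of |estimated_i - observed_i| / (observed_i + 1) over all points."""
    n = len(observed)
    return sum(abs(e - y) / (y + 1.0) for y, e in zip(observed, estimated)) / n


def select_model(scores, mer_threshold):
    """scores maps a model name to a (MER, GoF) pair.

    Models with MER above the threshold are discarded; among the remaining
    ones the model with the smallest GoF is returned (None if none remains).
    """
    admissible = {name: gof
                  for name, (mer, gof) in scores.items()
                  if mer <= mer_threshold}
    if not admissible:
        return None
    return min(admissible, key=admissible.get)


if __name__ == "__main__":
    # (MER, GoF) pairs taken from Table IV.
    table_iv = {"Sigmoid": (0.021, 1e-3), "Gompertz": (0.018, 1.846e-4)}
    print(select_model(table_iv, mer_threshold=0.02))  # prints "Gompertz"
```

With the threshold at 0.02 the sigmoid model is rejected on MER alone, which matches the discussion of Figure 5.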
B. Results analysis
estimation for each model, the system automatically associates one model to the video using MER and GoF criteria. We give
[Figure: "Bad data fitting", views (normalized).]
TABLE V. CONFIDENCE INTERVALS FOR MODELS DISTRIBUTION PROPORTION
TABLE VIII. MEAN AND VARIANCE OF PREDICTION WINDOW SIZE IN THE HALF LIFE SCENARIO
Model    mean         var          number of videos
E        0.5833692    0.1504571    4132
ME       0.576914     0.1415303    25281
G        0.3435265    0.1160928    683
MG       0.4596889    0.1360676    19349
S        0.6765688    0.1544357    1030
MS       0.4625144    0.1280855    1659
is called the hard window, which is somewhat pessimistic, i.e. the earlier in the prediction window, the heavier the error. Its definition is the following:
$$\delta^{H}_{T} = \max\left\{ k \ge 0 \;\middle|\; \frac{1}{k} \sum_{i=0}^{k} \frac{|S_T(t_{T+i}) - y_{T+i}|}{y_{T+i} + 1} \le 0.05 < \frac{1}{k+1} \sum_{i=0}^{k+1} \frac{|S_T(t_{T+i}) - y_{T+i}|}{y_{T+i} + 1} \right\}$$
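A direct transcription of this definition into Python is sketched below. It rests on several assumptions: `predicted[i]` and `observed[i]` stand for S_T(t_{T+i}) and y_{T+i}, the search starts at k = 1 because the 1/k factor is undefined at k = 0, and `None` is returned when no k satisfies both inequalities. The function name and the sample values are hypothetical.

```python
def hard_window(predicted, observed, threshold=0.05):
    """Largest k with g(k) <= threshold < g(k + 1), where
    g(k) = (1/k) * sum_{i=0}^{k} |predicted_i - observed_i| / (observed_i + 1).
    """
    errs = [abs(p - y) / (y + 1.0) for p, y in zip(predicted, observed)]

    def running_error(k):
        # Sum over i = 0..k, divided by k as written in the definition.
        return sum(errs[: k + 1]) / k

    best = None
    # k + 1 must remain a valid index into errs.
    for k in range(1, len(errs) - 1):
        if running_error(k) <= threshold < running_error(k + 1):
            best = k
    return best


if __name__ == "__main__":
    # Hypothetical observed and predicted view-counts after the training window.
    observed = [99.0] * 6
    predicted = [99.0, 101.0, 103.0, 105.0, 119.0, 139.0]
    print(hard_window(predicted, observed))
```

Because every error from the start of the window enters each running sum, an early misprediction keeps weighing on all later values of k, which is the pessimistic behaviour described in the text.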
For each window type (soft and hard) we normalise it by the size of the observed window T. Results are given for 7 days, 15
TABLE X. 7 DAYS SCENARIO
Model type    Distribution (%)    Hard window    Hard bounded (%)    Soft window    Soft bounded (%)
E             66.9                m: 0.6         13.5                m: 0.66        15.3
ME            0.9                 m: 0.75        17.9                m: 0.82        20
G             10.5                m: 0.42        7.8                 m: 0.47        8.8
MG            7.9                 m: 0.54        10.9                m: 0.61        12.7
S             11.2                m: 0.97        32.9                m: 1           33.8
MS            2.6                 m: 0.81        18.8                m: 0.84        19.5
All           100                 m: 0.63        15.1                m: 0.68        16.6
[Figure 10 panels, prediction window size by model type (E, ME, G, MG, S, MS), titled "50 days scenario" and "Half life scenario": (a) 50 days classification; (b) Half life classification.]
Fig. 10. Prediction window size according to models type
of ∆T for each identified model. Table VIII specifies values
REFERENCES
[1] Comscore. More than 200 billion online videos viewed globally in October, http://www.comscore.com/press_events/press_releases/2011/12/more_than_200_billion_online_videos_viewed_globally_in_october. December 2011.
[2] Comscore danpiech. Online video by the numbers, http://www.comscore.com. July 2011.
[3] N. Bailey. The Mathematical Theory of Infectious Diseases and its
Applications. Griffin, London, 1975.
[4] F. M. Bass. The relationship between diffusion rates, experience curves,
and demand elasticities for consumer durable technological innovations.
The Journal of Business, 53(3):pp. 51–67, 1980.
[5] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube,
you tube, everybody tubes: analyzing the world’s largest user generated
content video system. In Proc. of ACM IMC, pages 1–14, San Diego,
California, USA, October 24-26 2007.
[6] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. Analyzing
the video popularity characteristics of large-scale user generated content
systems. IEEE/ACM Transactions on Networking, 17(5):1357 – 1370,
2009.
[7] D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos.
Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur.,
10(4):1:1–1:26, jan 2008.
[8] G. Chatzopoulou, C. Sheng, and M. Faloutsos. A First Step Towards Understanding Popularity in YouTube. In Proc. of IEEE INFOCOM, pages 1–6, San Diego, March 15-19 2010.
[9] X. Cheng, C. Dale, and J. Liu. Statistics and social network of YouTube videos. In Proc. International Workshop on Quality of Service (IWQoS), The Netherlands, pages 229–238, June 2008.
[10] R. Crane and D. Sornette. Viral, quality, and junk videos on YouTube:
Separating content from noise in an information-rich environment. In
Proc. of AAAI symposium on Social Information Processing, Menlo
Park, California, CA, March 26-28 2008.
[11] W. E. Deming. The chi-test and curve fitting. Journal of the American
Statistical Association, 29(188):372–382, Dec 1934.
[12] A. Ganesh, L. Massoulie, and D. Towsley. The effect of network
topology on the spread of epidemics. In INFOCOM 2005. 24th Annual
Joint Conference of the IEEE Computer and Communications Societies.
Proceedings IEEE, volume 2, pages 1455–1466, Miami, FL, USA,
March 2005.
[13] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube traffic characteriza-
tion: A view from the edge. In Proc. of ACM IMC, 2007.
[14] V. Mahajan, E. Muller, and Y. Wind. New-Product Diffusion Models.
International Series in Quantitative Marketing. Springer, 2000.
[15] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, Jun 1963.
[16] L. A. Meyers. Contact network epidemiology: Bond percolation applied
to infectious disease prediction and control. Bull. AMS, 44(1):63–86,
2007.
[17] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti.
Characterizing web-based video sharing workloads. ACM Transactions
on the Web, 2(8):8 – 27, 2011.
[18] J. Ratkiewicz, F. Menczer, S. Fortunato, A. Flammini, and A. Vespig-
nani. Traffic in Social Media II: Modeling Bursty popularity. In Proc.
of IEEE SocialCom, Minneapolis, August 20-22 2010.
[19] G. Szabo and B. A. Huberman. Predicting the Popularity of Online
Content. Communications of the ACM, 53(8):80–88, aug 2010.
[20] M. Zeni, D. Miorandi, and F. De Pellegrini. YOUStatAnalyzer: a tool for analysing the dynamics of YouTube content popularity. In Proc. 7th International Conference on Performance Evaluation Methodologies and Tools (Valuetools), Torino, Italy, December 2013.