
Modelling View-Count Dynamics in Youtube

This paper investigates the dynamics of view-counts on YouTube by proposing several bio-inspired mathematical models to analyze and predict video popularity. The authors demonstrate that over 90% of YouTube videos can be associated with these models, achieving a Mean Error Rate of less than 5%. The study also explores the impact of video category and popularity on view-count evolution, highlighting the robustness of the models across different themes.

Modelling View-count Dynamics in YouTube

Cédric Richier∗, Eitan Altman†, Rachid Elazouzi∗, Tania Jimenez∗, Georges Linares∗ and Yonathan Portilla∗
∗ University of Avignon, 84000 Avignon, FRANCE
Email: [email protected]
† INRIA, B.P 93, 06902 Sophia Antipolis Cedex, FRANCE
Email: [email protected]

arXiv:1404.2570v2 [cs.SI] 28 May 2014

Abstract—The goal of this paper is to study the behaviour of view-count in YouTube. We first propose several bio-inspired models for the evolution of the view-count of YouTube videos. We show, using a large set of empirical data, that the view-count for 90% of videos in YouTube can indeed be associated with at least one of these models, with a Mean Error which does not exceed 5%. We derive automatic ways of classifying the view-count curve into one of these models and of extracting the most suitable parameters of the model. We study empirically the impact of a video's popularity and category on the evolution of its view-count. We finally use the above classification along with the automatic parameter extraction in order to predict the evolution of videos' view-count.

Keywords—YouTube, bio-inspired models, view-count.

I. INTRODUCTION

YouTube has been one of the most successful user-generated video sharing sites since its establishment in early 2005 and currently constitutes the largest share of Internet traffic. The rate of subscription to YouTube as well as the rate of submitted videos have been growing steadily, and none of its competitors has achieved a similar success [1], [2]. An important aspect of videos in YouTube is their popularity, which is defined as the number of view-counts. Understanding and predicting the popularity is useful from a twofold perspective: on one hand, more popular content generates more traffic, so understanding popularity has a direct impact on the caching and replication strategy that the provider should adopt; on the other hand, popularity has a direct economic impact. A number of researchers have analyzed the popularity characteristics of user-generated video content in order to understand the processes governing their popularity dynamics [5], [10], [13], [18], [8], [6], with the aim of developing models for early-stage prediction of future popularity [19]. There has also been interest in understanding which factors lead some videos to become more popular than others. But few works have studied the temporal aspects of the popularity dynamics using metrics such as view-counts, ratings and number of comments [5], [9], [17].

In this paper we describe some of the most typical behaviours of the view-count of videos in YouTube. This allows us to provide an in-depth analysis and to develop models that capture the key properties of the observed popularity dynamics. Our goal is to match observed video view-counts with one of several dynamic models. To select candidates for these models, we turned to bio-inspired dynamics, as we believed that the propagation of a content in YouTube has a strong similarity with the temporal behaviour of an infectious disease, which is a classical topic in mathematical biology [3], [16]. Such models of disease spread have already been used to model the spread of viruses in computer networks [7], [12]. They have also been used in marketing for capturing the life cycle dynamics of a new product [4]. A large number of papers in marketing have shown that the product sales life cycle follows an S-curve pattern in which sales initially grow at a fast rate and then fall off as the limit of the market share is approached [14].

We propose several information diffusion models to classify a dataset of more than 800000 videos randomly extracted from YouTube and aged between 5 and 2500 days. In particular, we exhibit six mathematical models to which we fit the videos in our dataset. We then propose automatic ways of associating each video with one of the considered models. The first criterion in selecting the model is related to the size of the population that may be potentially interested in the content. We differentiate between models in which the population potentially interested in the content is nearly constant (we call this the "fixed target population property") and those in which it grows in time (inspired by the branching process terminology, we call this "immigration"). The fixed target population property occurs in some video categories in YouTube such as news, sport and movies. Indeed, videos in these categories quickly reach their peak of popularity; then, within a short time, the diffusion dies out and the view-count does not increase further.

The second criterion in the classification concerns the structural virality. A model is said to be viral (or to have the viral property) if contaminated nodes (these are the viewers of a video) have a significant role in the propagation of the video through sharing or embedding. It is non-viral if the propagation of the video essentially relies on broadcast of the video from the source (it is then said to have the broadcast property). In that case, a large fraction of the potential target population can receive the information directly from the source.

Our contribution can be summarised in the following key points:

• We propose six mathematical biology-inspired models and we show that at least 90% of videos in YouTube are associated with one of these six mathematical models with a Mean Error Rate less than 5%. We further show how to extract the model parameters for each video.

• We study the robustness of these models to the different thematic categories of the video in YouTube and to different values of the peak popularity of the video. We show that the fraction of videos within a given model is quite robust and shows little dependence
on the different thematic categories of the video, except for the Education category, which behaves differently: for this category it seems that word-of-mouth is the dominant mechanism through which contents are disseminated. The bio-inspired models that we selected are further shown to be robust with respect to the peak popularity of the video, but the distribution among them is slightly different between those videos that have acquired less than 1000 views and the rest of the videos.

• In more than 80% of videos in YouTube, the potential population interested in the video increases over time.

• Two of the six models (the modified negative exponential and modified Gompertz models) cover most of the videos in our YouTube dataset (more than 75%). Both models capture the case of an immigration process in which the potential population, or the ceiling value, becomes dynamic. Further, the modified negative exponential characterizes the dynamic of a non-viral content and predicts that the accumulated number of views does not contribute to the propagation of the content. This model corresponds to the scenario wherein the content has been broadcast to a pool of users. On the other hand, the Gompertz model captures viral videos in which a part of this dynamic is propagated through word-of-mouth.

• We finally use the above classification along with the automatic parameter extraction in order to predict the evolution of videos' view-count. We consider two scenarios: in the first we use half of the view-count curve as a training sequence, while in the second one we take a fixed training sequence that corresponds to the first 50 days in the lifetime of the video. We then compare the predicted curve to the actual one and study the prediction capacity within a given error bound.

II. SETTING AND DATA

Since we intend to study different types of dynamic evolution of the view-count in YouTube, we need to collect a huge number of videos which are available to the general public. In this section we describe how we collected the dataset used in this study. On YouTube, a video is accompanied by a set of valuable data such as title, upload time, view-count and related videos. The video web page also provides some statistics, which are available if the content's owner allows it.

YouTube provides two APIs which allow some of those data to be retrieved: the YouTube Data API for collecting static data (which are available for every user) and the YouTube Analytics API for seeking video statistics such as the dynamics of a content (which are only available to the content's owner). Since some data cannot be collected through the APIs, we used a tool named YOUStatAnalyzer [20] in order to collect all valuable data.

The collected data are stored in a noSQL database (MongoDB). The noSQL solution has been chosen to allow dynamic insertion of new features for future works. The dataset used for this study contains more than 80000 videos randomly extracted from YouTube and aged between 5 and 2500 days. This dataset contains some static information for each video such as YouTube id, title of the video, name of the author, duration and list of related videos. It also provides the evolution of some metrics (shares, subscribers, watch time and views) in a daily form and in a cumulative form, from the upload day till the date of crawling.

III. POPULARITY GROWTH PATTERNS

We focus the analysis on view-count as the main popularity metric of a video. Previous analyses of YouTube showed a strong correlation between view-count and other metrics such as number of comments, favourites and rating. Further, the correlation between these metrics becomes stronger with popularity [8]. We model the dynamic evolution of the view-count with mathematical models from biology. We classify the evolution of view-count in YouTube using two criteria:

• Size of the target population: The target population size is the maximum number of individuals that can potentially be interested in the content. A target population belongs to one of these two types: (i) a fixed finite target population or (ii) a potential target population that grows in time, which we call the immigration process.

• Virality: A content is viral if the population that has seen the content participates actively in the dissemination of the content. The content spreads like a virus does in epidemics. Thus the probability that an individual who has not yet seen the content receives it from someone who has seen it increases over time. On the contrary, a non-viral content is one for which former viewers scarcely alter the diffusion process.

In the following we describe the dynamic models in biology and their uses. These dynamic models have been hypothesised to describe contagion phenomena, and each model has its own set of assumptions about how users are infected by others. These models may provide some answers about the behaviour of users in YouTube, even if this behaviour remains notoriously difficult to quantify.

A. Fixed target population

1) Viral content: To describe viral content with a fixed target population, we use the logistic model or the Gompertz model. These models have been used in technology forecasting and are referred to as "S-shaped" curves. We test these models to capture the evolution of the view-count of a video in YouTube since there is a strong similarity between a video posted in YouTube and a new product launched into the marketplace. Indeed, as shown in different problems in marketing, the adoption of a technology product often grows slowly at first, then grows rapidly (exponentially), and finally falls off as the limit of the market share is approached.

Logistic model: The logistic model is a common sigmoid function which describes the evolution of the view-count of a video with a fixed target population. This is a first order non-linear differential equation of the form

$$\frac{dS}{dt} = \lambda S (M - S) \qquad (1)$$
where S is the number of view-counts of a video and M is the maximum size of the (potential) population that could access the content. This is a standard equation in epidemiology for describing the evolution of the number of infected individuals under the assumption that all infected nodes have developed an immunity from infection, or that these infected nodes stay infected and will not return to the uninfected state. Hence the infection rate is a function of the rate λ and the size of the infected population. For the YouTube case, this model corresponds to the scenario wherein users may watch a video one time and the probability of watching it again is negligible. A solution to equation (1) is given by

$$S(t) = \frac{M}{1 + \left(\frac{M - S(0)}{S(0)}\right) e^{-\lambda M t}}$$

This function shows that the initial exponential growth is followed by a period in which growth starts to decrease as the maximum size of the population is approached.

The S-shape of the sigmoid model curve is symmetric. But in the context of view-count, the convex phase and the concave phase are not always symmetric. This is shown by the example in Figure 1. To cover these cases we consider the Gompertz model.

Fig. 1. Example of non symmetric S-shaped behaviour of view-count in YouTube (viral).

Gompertz model: A model which deals with the problem of symmetry of the logistic model is given by the following dynamic equation:

$$\frac{dS}{dt} = \lambda S \log\left(\frac{M}{S}\right) \qquad (2)$$

This model is called the Gompertz model, and has also been used as a diffusion model of product growth. A solution of equation (2) is given by the Gompertz function:

$$S(t) = M \exp\left(-\log\left(\frac{M}{S(0)}\right) \exp(-\lambda t)\right)$$

Figure 2 shows the effect of varying one of M, λ, S(0) while keeping the others constant.

This model is similar to the logistic curve but it is not symmetric about the inflection point. In general the Gompertz model reaches this point early in the growth trend. This behaviour seems to fit well for some YouTube view-count evolution dynamics.

2) Non-viral content: A non-viral content describes the case where users do not contribute to the propagation of the content. This is the case when the time scale of the content diffusion is very large compared to the size of the potential population. Hence this dynamic can model the case where contents gain popularity through advertisement and other marketing tools: examples are when an advertisement is broadcast to a very large pool of users of a social network and people access the content at random thereafter. Hence we assume that the evolution dynamic of the content follows the linear differential equation:

$$\frac{dS}{dt} = \lambda (M - S) \qquad (3)$$

The solution of (3) is given by:

$$S(t) = S(0) + (M - S(0))\left(1 - e^{-\lambda t}\right)$$

B. Growing population

The assumption that the population is fixed is often a reasonable approximation when the popularity of a content increases quickly and dies out within a short time. But in many cases this assumption becomes inappropriate, namely when the time before reaching the saturation region is longer.

Here we consider the case of an immigration process in which the growth of the potential population and the dynamic of the view-count of a content are intricately linked. To capture such dependence we consider different growth scenarios that model the viral case and the non-viral case. In this paper we restrict our study to the case where the target population grows with a fixed speed.

1) Non-viral content: The linear growth model S(t) = S(0) + λt describes in a simple way the situation where users do not contribute to propagating the content to other users but the content benefits from the immigration process, which gives a linear growth of the view-count.

Another kind of non-viral curves observed are concave curves (given by the negative exponential model) which do not converge to a flat line but become linear at the horizon due to the influence of the immigration process. Such dynamics can be modelled by modifying the solution of equation (3) with an added linear component:

$$S(t) = S(0) + (M - S(0))\left(1 - e^{-\lambda t}\right) + kt$$

where k is the rate of the target population growth. Note that S no longer satisfies equation (3), but (S − kt) does.

2) Viral content: Now we consider the case when the immigration process appears in the case of viral contents. In this dynamic the view-count curve first adopts a viral behaviour (in an S-shaped phase) and then grows linearly.

One candidate solution to describe such a behaviour of view-count is to add a linear component to the Gompertz function:

$$S(t) = M \exp\left(-\log\left(\frac{M}{S(0)}\right) \exp(-\lambda t)\right) + kt$$

This dynamic seems to be relevant according to some examples in the dataset.
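The closed-form curves of Section III can be written down directly. The following is a minimal Python sketch, not the authors' implementation: the function names and the finite-difference helper `ode_residual` are illustrative assumptions, and the parameter values are arbitrary. It evaluates each model and checks numerically that the three fixed-population solutions satisfy equations (1), (2) and (3).

```python
import math

# Illustrative names; only the formulas come from Section III.

def logistic(t, M, lam, S0):
    """Solution of dS/dt = lam * S * (M - S), equation (1)."""
    return M / (1.0 + ((M - S0) / S0) * math.exp(-lam * M * t))

def gompertz(t, M, lam, S0):
    """Solution of dS/dt = lam * S * log(M / S), equation (2)."""
    return M * math.exp(-math.log(M / S0) * math.exp(-lam * t))

def neg_exponential(t, M, lam, S0):
    """Solution of dS/dt = lam * (M - S), equation (3)."""
    return S0 + (M - S0) * (1.0 - math.exp(-lam * t))

def linear_growth(t, lam, S0):
    """Linear growth model S(t) = S(0) + lam * t."""
    return S0 + lam * t

def modified_neg_exponential(t, M, lam, S0, k):
    """Negative exponential plus a linear immigration term k * t."""
    return neg_exponential(t, M, lam, S0) + k * t

def modified_gompertz(t, M, lam, S0, k):
    """Gompertz function plus a linear immigration term k * t."""
    return gompertz(t, M, lam, S0) + k * t

def ode_residual(model, rhs, t, h=1e-6, **params):
    """Central-difference derivative of the model minus the ODE right-hand side."""
    dS = (model(t + h, **params) - model(t - h, **params)) / (2.0 * h)
    return dS - rhs(model(t, **params), **params)

# Arbitrary parameters on a normalised scale (an assumption for illustration).
p = dict(M=1.0, lam=3.0, S0=0.05)
print(ode_residual(logistic, lambda S, M, lam, S0: lam * S * (M - S), 0.5, **p))
print(ode_residual(gompertz, lambda S, M, lam, S0: lam * S * math.log(M / S), 0.5, **p))
print(ode_residual(neg_exponential, lambda S, M, lam, S0: lam * (M - S), 0.5, **p))
```

The three printed residuals should be close to zero, up to finite-difference error. The modified models are not checked against the original equations, since, as noted above, adding kt means that only S − kt satisfies them.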

Fig. 2. Graphs of Gompertz functions S(t) against t, showing the effect of varying one of M, λ, S(0) while keeping the others constant. (a) Varying M. (b) Varying λ. (c) Varying S(0).

IV. DATASET AND DATA FITTING

This section describes how we use the models presented in Section III in order to classify the YouTube contents in our dataset.

A. Dataset

As described in Section II, we collected meta-data of more than 80000 videos in a MongoDB database. In addition to the dynamics of view-count used for modelling, the features we consider for each video are: the age (in number of days), the YouTube category and the popularity (i.e. the total number of views at the day of crawling). Table I lists the YouTube categories contained within the dataset and Figure 3a shows their distribution. Table II summarises the values of the age and popularity metrics. Figure 3b gives the age distribution and Figure 3c gives the distribution of popularity (in log values).

TABLE I. LIST OF ALL YOUTUBE CATEGORIES FOUND INSIDE THE DATASET

1. "Animals"          7. "Games"         13. "Shows"
2. "Autos"            8. "Howto"         14. "Sports"
3. "Comedy"           9. "Music"         15. "Tech"
4. "Education"       10. "News"          16. "Travel"
5. "Entertainment"   11. "Nonprofit"
6. "Film"            12. "People"

TABLE II. SUMMARY OF AGE AND POPULARITY VALUES WITHIN THE YOUTUBE DATASET

            Age (in days)    Popularity (number of views)
Min         5                1
1st Qu.     140              2.650 × 10^2
Median      393              2.728 × 10^3
Mean        610.5            6.091 × 10^5
3rd Qu.     923              2.630 × 10^4
Max         2426             1.746 × 10^9

Fig. 3. Some feature distributions from the YouTube dataset. (a) YouTube categories distribution over the whole dataset (percent of all, categories 1 to 16). (b) Age distribution (in number of days). (c) Logarithmic distribution of popularity (percent of all against log of popularity).

B. Data fitting

a) Observations and normalisation: For the data fitting, we only use the cumulative evolution of view-count. Given the evolution of the view-count of a YouTube video as a function of time, we define a set of observations as $(y_i, t_i)_{1 \le i \le n}$, where $y_i$ is the view-count at day $t_i$ and n is the number of observations (this is also the age of the video in number of days). In order to avoid some technical issues due to the estimation algorithms, we use normalised observations:

$$\left(\frac{y_i}{y_n}, \frac{t_i}{t_n}\right)_{1 \le i \le n}$$

b) Parameters estimation methods: We estimate the parameters of the models described in Section III using regression algorithms, both based on the minimisation of the mean squares criterion. Given a normalised set of observations $(y_i, t_i)_{1 \le i \le n}$, let S be the expression for one model. The mean squares criterion (MSC) is then given by:

$$MSC = \sum_i (S(t_i) - y_i)^2$$

We implemented two algorithms in order to automatically classify the dynamics of any video from YouTube into one of the models presented in Section III.

The first method is a simple linear regression. It works for videos where the view-count grows linearly over time t. In that case, the coefficient of determination R gives a measure of the goodness of the fit:

$$R = 1 - \frac{\sum_i (y_i - S(t_i))^2}{\sum_i (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of $(y_i)_i$. In our experiments, we consider that a linear model is relevant if the value of R satisfies |R| ≥ 0.985. In the dataset, there are very few video dynamics that meet the linear case.

The second method is the Levenberg-Marquardt Algorithm [15], which is known to be very efficient for the non-linear case. It is an iterative process for estimating the parameters of the model by minimising the MSC. An explicit formulation of the models has to be known to use this algorithm because the partial derivatives have to be calculated during the iterative process. One drawback of this method, as with all other non-linear regression methods, is that the solution may be only a local minimum rather than a global one. Nevertheless, the Levenberg-Marquardt Algorithm suits our models very well.

c) Data fitting for non-viral contents: A non-viral behaviour is modelled by the dynamic equation (3). This model, called the negative exponential model, fits contents whose view-count curve is concave and then reaches an asymptote. In Figure 4 we show an example where this model is applied. We observe that the estimated curve (dashed line) admits an asymptote (whose limit value is in fact the M parameter of the
model). This is typically the case with a fixed finite population. But at the horizon, the curve that represents the data (plain line) seems to follow a line with a non-zero slope. This is what we call the immigration phenomenon, when the size of the target population increases linearly in time. In that case, we model the dynamics by the modified negative exponential function introduced in subsection III-B1:

$$S(t) = S(0) + (M - S(0))\left(1 - e^{-\lambda t}\right) + kt$$

This model fits better, as shown in Figure 4c. A mixed strategy, which consists in cutting the data into two subsets and then applying a linear model on one subset and a negative exponential model on the other, will be discussed later.

Fig. 4. From a YouTube content (on the left side), parameters of the negative exponential model are estimated, then the obtained curve is compared to data (in the centre). The same process is applied to a negative exponential model in which a linear component has been added (on the right side). (a) YouTube content with a concave shape. (b) View-count curve (plain) approximated by a negative exponential curve (dashed). (c) View-count curve (plain) approximated by a modified negative exponential curve (dashed). Axes: Age (normalized) against Views (normalized).

d) Data fitting for viral contents: Three models have been considered in the case of viral contents: the logistic model, the Gompertz model and a modified Gompertz model (see Section III-B2). The first two models are for the context of a fixed finite population and the third one is introduced in the case of immigration.

Figure 5 is an example where we fit these models to one YouTube content (Figure 5a). We observe that the S-shape of the logistic model curve is symmetric due to the symmetry of the sigmoid function (Figure 5b). However, the convex phase and the concave phase are not symmetric, as we can observe in Figure 5a. Hence the logistic model does not fit well. Then, the Gompertz model and the modified Gompertz model are fitted to the same YouTube content. The Gompertz model (Figure 5c) fits better than the logistic model, and the modified Gompertz model (Figure 5d) better describes the behaviour of the data at the horizon (immigration phenomenon).

Fig. 5. From a YouTube video with an S-shaped view-count curve (5a), we first fit the logistic model in 5b. The estimated curve (dashed) is compared with the actual normalised view-count curve (plain). Then the same is done with the Gompertz model in 5c and finally with the modified Gompertz model in 5d. (a) YouTube video. (b) Logistic model fitting. (c) Gompertz model fitting. (d) Modified Gompertz model fitting. Axes: Age (normalized) against Views (normalized).

e) Mixing linear and non-linear models: An issue that results from the models we use is the change of the curve dynamics at the horizon. We observe two types of behaviour at the horizon: a flat line showing that the limit of the potential population has been reached, or an oblique line highlighting the fact that the population continues to grow.

The first case is coherent with the description of the dynamics (the maximum is then one of the parameters of the dynamic equation of the model). However, for the second case we need to add a linear component to the solution function, implying that the initial dynamic equations are no longer valid. Hence, we propose to adopt a two-phase approach for data fitting. Figure 6 gives an example where this method is applied to a YouTube content (Figure 6a).

Given a set of observations $(y_i, t_i)_{1 \le i \le n}$, a first phase consists in pointing out a linear behaviour from a time $t = t_k$, $k \in \{1, \dots, n\}$. The idea is to find k in an iterative way in order to have a good regression line for the subset $(y_i, t_i)_{k \le i \le n}$. The algorithm is the following: we first fix a reasonably small threshold ε. We then apply a linear regression with all the points. The regression line gives the set $(\hat{y}_i)_i$ of the estimated values. Let $\bar{y}$ be the mean of $(y_i)_i$; we consider the coefficient of determination $r_1$ given by:

$$r_1 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

• if $|1 - r_1| \le \epsilon$, the observations can be considered as linear (the whole view-count curve can be well described by the linear model) and the process ends.

• else, the first element of $(y_i, t_i)_{1 \le i \le n}$ is removed from the set and a new linear regression is done for the subset $(y_i, t_i)_{2 \le i \le n}$. That gives a new coefficient of determination $r_2$.

The process is repeated until the coefficient of determination $r_k$ satisfies $|1 - r_k| \le \epsilon$. Doing that, we determine the rank k from which the observations $(y_i, t_i)_{k \le i \le n}$ can be considered as having a linear behaviour. In Figure 6b, time $t_k$ is represented by a vertical line. From $t_k$ the behaviour can be well described by a linear model (dot-dashed line).

In a second phase, parameter estimation for the non-linear models presented in the previous section is done on the subset $(y_i, t_i)_{1 \le i \le k}$. Figure 6b illustrates this phase: here, the modified negative exponential model is applied to the subset on the left side of $t_k$ (dashed line).

V. AUTOMATIC CLASSIFICATION

In this section we first investigate the process for an automatic classification of YouTube contents. We then go further in analysing the results of our experiment. Finally we give some keys on how to use this classification in order to predict the view-count.

A. Classification issues

The main goal of our work is to provide a system that can automatically classify YouTube contents by associating one model with one content. For each content, two issues have to be managed: first, evaluate each model in order to know
which models are good candidates; then, compare the candidates to determine which one is the best.

Fig. 6. From a YouTube content (Figure 6a), a mixed procedure is made to estimate a linear model and a non-linear model on two subsets of data (Figure 6b). (a) YouTube video (Obama about gay marriage). (b) Mixed linear/non-linear fitting (panel title: "Mixed linear/concave model for Youtube_kQGMTPab9GQ"; curves: real data, model, linear, linear threshold; axes: Age (normalized) against Views (normalized)).

Let us consider first the question of evaluating each model. As explained in Section IV-B(b), we perform parameter estimation based on the minimisation of the least squares criterion. Define the mean error rate (MER):

$$MER = \frac{1}{n} \sum_i \frac{|S(t_i) - y_i|}{y_i + 1}$$

The MER criterion is the mean error rate made by the model with respect to the observations. For example, if MER ≤ 0.05, it can be said that, on average, the estimation error is under 5% of the observed value. With this criterion we can fix a threshold beyond which a model would be considered as unreliable. Strictly speaking, the expected mean error rate should be

$$\frac{1}{n} \sum_i \frac{|S(t_i) - y_i|}{y_i}$$

MER is exactly this calculation applied to estimated data and observed data which have been translated by one on the y axis. Since the observed time series are normalised, and so bounded between 0 and 1, the MER definition avoids the generation of large errors due to some very small values of $y_i$ and also avoids division by zero when $y_i$ is equal to 0. In order to compare models for which MER is less than a certain threshold, we introduce a criterion of quality discussed in [11]. To formulate this criterion we first define the degree of freedom of a model (df). Let p be the number of parameters of the model; the degree of freedom is then defined by df = n − p. The criterion of quality, named "goodness of fit" (GoF), is then given by:

$$GoF = \frac{1}{df}\, MSC$$

The model which has the smallest GoF will be considered as
TABLE III. G OODNESS OF FIT FOR MODELS FROM F IGURE 4
Histogram of Mean Error Rate − All models
Models distribution
Model MCS MER GoF
Neg. exponential 3, 558 0, 074 0, 004 80 modExp

Modified neg. exponential 0.453 0, 027 4, 98.10−4 Gomp


60
Exp

Percent of all
TABLE IV. G OODNESS OF FIT FOR MODELS FROM F IGURE 5
40 Sigm

modSigm
Model MCS MER GoF
20
Sigmoid 0, 480 0.021 10−3
modGomp
Gompertz 0, 092 0, 018 1, 846.10−4
0

Modified Gompertz 0, 033 0, 008 8, 831.10−5


0.0 0.2 0.4 0.6 0.8 1.0

Mean Error Rate

(a) Percent of contents by bins of(b) Models distribution after classifica-


the best one. In other words, it will be the model that best fits M ER values tion over the whole dataset
the data. In Table III and Table IV we give values of M SC,
M ER and GoF for models used respectively in Figure 4 and MER distribution

0.20
Figure 5. In the example from Figure 4, with a threshold of
M ER fixed at 0, 075, both negative exponential model and

0.15
modified negative exponential model are relevant. With a GoF
of value 4, 98.10−4, modified negative exponential model is

MER
0.10
the one that best fits the data. In the case of YouTube content

0.05
depicted in Figure 5, if the threshold of M ER is fixed at the
value of 0, 02, sigmoid model is not reliable whereas Gompertz

0.00
model and modified Gompertz model are under the threshold. E ME G MG S MS

According to the value of GoF , modified Gompertz model


is the best with GoF = 8, 831.10−5. Further, the issue of (c) M ER distribution for each model
fixing a value for the M ER threshold is crucial to rely on an Fig. 7. Sample analysis of an automatic classification for dissemination
acceptable filter for several videos. processes in YouTube
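The selection rule described above (keep the models whose MER is under the threshold, then take the one with the smallest GoF) can be sketched as follows. This is only an illustration under stated assumptions: the names `Fit`, `mean_error_rate`, `goodness_of_fit` and `select_model` are hypothetical, MSC is assumed to be the sum of squared residuals left by the least squares fit, and MER is assumed to be the mean of |S(t_i) − y_i| / (y_i + 1), the form of mean error used in the prediction section. Only df = n − p and GoF = MSC / df are taken directly from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Fit:
    """One fitted model: a label, its parameter count p, and its fitted values."""
    name: str
    n_params: int
    fitted: Sequence[float]


def mean_error_rate(y: Sequence[float], fitted: Sequence[float]) -> float:
    # Assumed MER: mean relative error, with +1 in the denominator to avoid dividing by 0.
    return sum(abs(f - v) / (v + 1.0) for v, f in zip(y, fitted)) / len(y)


def goodness_of_fit(y: Sequence[float], fitted: Sequence[float], n_params: int) -> float:
    # df = n - p and GoF = MSC / df, as defined in the text.
    df = len(y) - n_params
    if df <= 0:
        raise ValueError("need more observations than parameters")
    # Assumed MSC: sum of squared residuals of the least squares fit.
    msc = sum((f - v) ** 2 for v, f in zip(y, fitted))
    return msc / df


def select_model(y: Sequence[float], fits: Sequence[Fit],
                 mer_threshold: float = 0.05) -> Optional[Fit]:
    # Step 1: keep only the candidates whose MER is under the threshold.
    candidates = [m for m in fits if mean_error_rate(y, m.fitted) <= mer_threshold]
    if not candidates:
        return None  # no model is reliable for this video
    # Step 2: among the candidates, the smallest GoF wins.
    return min(candidates, key=lambda m: goodness_of_fit(y, m.fitted, m.n_params))


# Toy data (not from the paper): four observations and three hypothetical fits.
views = [10.0, 20.0, 30.0, 40.0]
fits = [
    Fit("A", 1, [10.5, 20.5, 30.5, 40.5]),
    Fit("B", 2, [10.1, 19.9, 30.1, 39.9]),
    Fit("C", 1, [20.0, 40.0, 60.0, 80.0]),
]
best = select_model(views, fits)
```

In this toy case fit C is rejected by the MER filter, and B is preferred over A because its GoF is smaller even though it uses one more parameter.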

B. Results analysis

Given one video from the dataset, after estimating the parameters of each model, the system automatically associates one model to the video using the MER and GoF criteria. We give results about the distribution of MER values in Figure 7a. It shows that 89% of videos are associated with a model with a MER under 0.05. A mean error of 5% seems reasonable for a fit to be considered reliable. Hence, if the MER threshold is fixed at 0.05, we conclude that the classification system is efficient in almost 90% of the cases.

Note that if the threshold of MER is fixed at 0.1, more than 97% of the videos correspond to one of the models. There are at most 2% of the videos for which the association to one of our models gives a high error rate (say more than 10%). Figure 8 illustrates one example of such a video. In this case, the system associates the modified Gompertz model to the video, with a MER equal to 0.127 and a GoF of value 1.13 × 10^-1. The association is unreliable due to the many sharp changes in behaviour. Indeed, it seems that the models are unable to capture the effect of multiple peaks in view-count evolution. This might be the case for view-count curves that are somehow 'non differentiable'. Further investigation of this issue is needed; this is one of our directions for future work.

Fig. 8. One bad fitting example (normalized views against normalized age, real data and fitted model).

In Figure 7c we focus on the MER distribution for each model. Note first that we did not use the linear model because it rarely appears in the dataset. Secondly, we introduce a new model, named the modified sigmoid model, given by the logistic model with a linear component added (as done in section III-B2 with the modified Gompertz model).

It appears that the modified negative exponential and modified Gompertz models fit with less error than the others in most cases: they can be regarded as the most reliable models for our dataset. We present the models distribution in Figure 7b. The same two models, modified negative exponential and modified Gompertz, together cover most of the dataset (about 37% and 36% of the videos respectively, see Table V). Given their reliability, this reinforces the classification efficiency. The former is a non-viral model whereas the latter is viral, meaning that there is quite a balance between viral and non-viral contents. Both highlight an immigration process, leading to the conclusion that many YouTube contents still attract viewers even after a long period.

In order to assess the evidence provided by our dataset on models distribution, we provide in Table V 95% confidence intervals for the different sample sizes involved in the study. This
table indicates that the whole dataset leads to very good precision on the models distribution. Furthermore, a sample of 10000 videos still gives an accurate estimate of this proportion.

TABLE V. CONFIDENCE INTERVALS FOR MODELS DISTRIBUTION PROPORTION

Model / sampling        1000                                   10000                                  Whole dataset (81657)
Exponential             (0.05730450 − 0.07221 − 0.09049091)    (0.06681688 − 0.071761 − 0.07703725)   (0.07009212 − 0.07184932 − 0.07364694)
Modified Exponential    (0.3390106 − 0.36885 − 0.39971)        (0.3584859 − 0.367935 − 0.3774861)     (0.3648174 − 0.3681252 − 0.3714455)
Sigmoid                 (0.02141720 − 0.03089 − 0.04411793)    (0.02735499 − 0.030602 − 0.03421483)   (0.02928192 − 0.03044442 − 0.03165132)
Modified Sigmoid        (0.02221527 − 0.03185 − 0.04522983)    (0.02779605 − 0.031068 − 0.03470537)   (0.02982271 − 0.03099551 − 0.03221266)
Gompertz                (0.01530999 − 0.02342 − 0.03536814)    (0.02065726 − 0.023495 − 0.02670484)   (0.02260972 − 0.02363545 − 0.02470626)
Modified Gompertz       (0.3310334 − 0.3607 − 0.3914506)       (0.3535128 − 0.362832 − 0.3723571)     (0.3595007 − 0.362798 − 0.3661083)

Now it is natural to ask whether the distribution of our classification remains the same across the main categories in YouTube and across the popularity of videos.

In the case of categories, we break the models classification down by YouTube category. Four main categories can be highlighted regarding the distribution in Figure 3a: Music (over 14000 videos), Entertainment (over 8500 videos), People (around 7500 videos) and Education (almost 6000 videos). In general, the models distribution in each category is not far from the distribution considering all categories combined. However, the Education category is quite different. The models distribution for this category is given in Figure 9. We note that the modified Gompertz model dominates. Furthermore, viral models cover almost 75% of the videos. One might assume that Education is a word-of-mouth category where the dissemination of videos results mainly from the influence of viewers and benefits little from advertising processes or internal YouTube mechanisms such as the recommendation system.

Fig. 9. Models distribution for the Education category

We now analyse the models distribution considering the popularity of videos. Let us introduce the different classes of popularity defined for our dataset. According to the distribution depicted in Figure 3c, we define seven classes of popularity, listed in Table VI. We show the models distribution for each popularity class in Table VII.

TABLE VI. POPULARITY CLASSES

Popularity class            Total number of views V
Extremely unpopular (EUP)   0 ≤ V < 10
Very unpopular (VUP)        10 ≤ V < 100
Unpopular (UP)              100 ≤ V < 1000
Not so popular (NSP)        1000 ≤ V < 10^4
Popular (P)                 10^4 ≤ V < 10^5
Very popular (VP)           10^5 ≤ V < 10^6
Extremely popular (EP)      10^6 ≤ V

TABLE VII. MODELS DISTRIBUTION BY POPULARITY CLASS (IN %)

Model      EUP     VUP     UP      NSP     P       VP      EP
Exp        11.4    12.2    8.4     8       6.8     6.2     5.7
Gomp       1.6     2.8     1.8     2.5     3.6     3.2     2.1
ModExp     11.6    54.5    48.9    35.2    35.1    42.3    47.5
ModGomp    2.5     19.4    34.7    48.7    49.3    44.3    42.8
ModSigm    1.8     4.3     4.3     3.6     2.9     2.3     0.8
Sigm       70.7    6.6     1.5     1.7     2       1.3     0.8

We can observe that the distribution varies according to the classes of popularity. First of all, the logistic model (referred to as sigmoid in the plots) dominates the extremely unpopular category (videos with fewer than 10 views). These results are not reliable because the view-count of these videos takes very few distinct values. Actually, in our system, all the models should fit those videos well, and the logistic model is the first one tested, so it is the one chosen. The popular and not so popular classes can be grouped in terms of models distribution, with around 50% for the modified Gompertz model and 35% for the modified exponential model. We can also group the very popular and extremely popular classes, for which the distribution is roughly equivalent to the whole dataset distribution (see Figure 7b). The very unpopular and unpopular classes exhibit the modified exponential model in around 50% of the cases. The modified Gompertz model represents less than 20% of the very unpopular class whereas it covers almost 35% of the videos in the unpopular class.

Classification models for prediction

In this section, we illustrate a mechanism for predicting the future evolution of the view-count of a video. In particular, we propose a simple model that predicts the evolution of the view-count from a given time t_f until a target date t_p with t_p > t_f. We call the difference between t_p and t_f the prediction window T. This prediction is based on the early historical information of a video, which is given by a set of observations (y_i, t_i), 1 ≤ i ≤ f, up to time t_f, where f is the number of observations (see footnote 1). Combining this information with our classification models, the evolution of the view-count is estimated using data fitting in order to select a mathematical model. Using our datasets, we evaluate the maximum size of the prediction window with at most 5% mean error, i.e.

$$T_{max} = \max\left\{ t_p - t_f \;\middle|\; \frac{1}{p-f} \sum_{i=f+1}^{p} \frac{|S_{t_f}(t_i) - y_i|}{y_i + 1} \le 0.05 \right\}$$

where S_{t_f} is the selected mathematical model.

Footnote 1: The datasets used for prediction contain only videos that are at least 50 days old.

We test our prediction in the scenario in which t_f corresponds to half of the life cycle. Let

$$\Delta T = \frac{T_{max}}{t_n - t_f}$$

where t_n − t_f is the remaining time of the life cycle of a video from t_f. Note that ∆T is bounded by 1. Fig. 10b depicts the mean and variance
of ∆T for each identified model. Table VIII gives the values of the mean and variance for each model as well as the number of videos classified in the different models. Our results show that most models provide a long prediction window within the 5% error bound: the mean of ∆T ranges from about 0.34 (Gompertz) to about 0.68 (sigmoid). In other words, our scheme can on average predict the evolution of the view-count, within this error bound, over roughly half of the remaining life cycle from the time of prediction.

TABLE VIII. MEAN AND VARIANCE OF PREDICTION WINDOW SIZE IN THE HALF LIFE SCENARIO

Model    mean         var          number of videos
E        0.5833692    0.1504571    4132
ME       0.576914     0.1415303    25281
G        0.3435265    0.1160928    683
MG       0.4596889    0.1360676    19349
S        0.6765688    0.1544357    1030
MS       0.4625144    0.1280855    1659

Fig. 10. Prediction window size according to models type: (a) 50 days classification; (b) half life classification.

We tested here the prediction based on a learning sequence that was half the lifetime of each video in the dataset. This allows the prediction to rely on the same relative amount of data independently of the real duration of the video. We next compare this to the case in which, in contrast, the learning sequence has a fixed duration of 50 days. We note that 50 days represent much less than half the lifetime for most videos in the dataset and therefore the prediction is less accurate. The corresponding results of this scenario are depicted in Table IX and Fig. 10a. In spite of this problem we get similar results for the average prediction window for the modified Gompertz and sigmoid (logistic) models.

TABLE IX. MEAN AND VARIANCE OF PREDICTION WINDOW SIZE IN THE 50 DAYS SCENARIO

Model    mean         var           number of videos
E        0.1240775    0.06050271    3401
ME       0.6159881    0.1384898     21788
G        0.1861161    0.09028605    687
MG       0.2314134    0.09438778    13821
S        0.4750911    0.1083194     1137
MS       0.2082774    0.1798454     1561

It is now natural to further investigate our prediction method, in particular in the case of early predictions. We slightly modify our method of calculating the prediction window size and consider values of T equal to 7 days, 15 days and 30 days. In each scenario, we fix a horizon at 3 times the observed window size. We implement two ways of computing the prediction window size. The first one is the same as in the previous method and is called the soft window. The other one is called the hard window, which is somewhat pessimistic, i.e. errors early in the prediction window weigh more heavily. Its definition is the following:

$$\delta^{H}_{T} = \max\left\{ k \ge 0 \;\middle|\; \frac{1}{k} \sum_{i=0}^{k} \frac{|S_T(t_{T+i}) - y_{T+i}|}{y_{T+i} + 1} \le 0.05 < \frac{1}{k+1} \sum_{i=0}^{k+1} \frac{|S_T(t_{T+i}) - y_{T+i}|}{y_{T+i} + 1} \right\}$$

For each window type (soft and hard) we normalise it by the size of the observed window T. Results are given for the 7 days, 15 days and 30 days scenarios in Table X, Table XI and Table XII respectively. In each table, we give the fraction of videos for each model, the mean size of the prediction windows (hard and soft) and the fraction of videos which meet the bound effect (i.e. when the prediction window size is bounded by the horizon).

TABLE X. 7 DAYS SCENARIO

Model Type    Distribution (%)    Hard window (mean)    Hard bounded (%)    Soft window (mean)    Soft bounded (%)
E             66.9                0.6                   13.5                0.66                  15.3
ME            0.9                 0.75                  17.9                0.82                  20
G             10.5                0.42                  7.8                 0.47                  8.8
MG            7.9                 0.54                  10.9                0.61                  12.7
S             11.2                0.97                  32.9                1                     33.8
MS            2.6                 0.81                  18.8                0.84                  19.5
All           100                 0.63                  15.1                0.68                  16.6

TABLE XI. 15 DAYS SCENARIO

Model Type    Distribution (%)    Hard window (mean)    Hard bounded (%)    Soft window (mean)    Soft bounded (%)
E             62.7                0.55                  10.7                0.59                  12.1
ME            1.3                 0.9                   22.2                0.94                  23.3
G             8.4                 0.44                  7.5                 0.47                  7.9
MG            16.5                0.7                   13.4                0.77                  15.6
S             8.1                 1.1                   38.1                1.11                  38.9
MS            3                   0.94                  23                  0.98                  24.5
All           100                 0.63                  13.6                0.67                  15

TABLE XII. 30 DAYS SCENARIO

Model Type    Distribution (%)    Hard window (mean)    Hard bounded (%)    Soft window (mean)    Soft bounded (%)
E             52.3                0.5                   9.5                 0.54                  10.5
ME            2.2                 0.79                  17.3                0.87                  19.8
G             5.2                 0.45                  8.5                 0.48                  9.1
MG            30.6                0.68                  11.6                0.76                  14
S             5.8                 1.1                   39.9                1.14                  40.6
MS            3.9                 0.89                  20.8                0.92                  21.7
All           100                 0.61                  12.5                0.66                  13.9

VI. CONCLUSION AND FUTURE WORK

In the present work we have focused on a method for classifying the view-count dynamics of videos on YouTube. We presented different models for YouTube view-count evolution which are able to capture virality and potential population growth. Based on these models, we have developed a system for the automatic classification of YouTube videos. It aims at classifying each YouTube content within one of the four categories we defined: viral with fixed population; viral with growing population; non-viral with fixed population; and non-viral with growing population. We have tested this automatic classification on a particular dataset, which has been presented and is available upon request. Results of our experiments reveal that a reasonably small threshold on the MER criterion makes it possible to classify more than 90% of the dataset, meaning that the defined models explain the observed behaviour in most of the cases.

Our future work will focus on the context of the four defined categories. We will analyse how other features influence the dynamics of the view-count evolution.

REFERENCES
[1] Comscore. More than 200 billion online videos viewed globally in October, http://www.comscore.com/press_events/press_releases/2011/12/more_than_200_billion_online_videos_viewed_globally_in_october. December 2011.
[2] Comscore, Dan Piech. Online video by the numbers, http://www.comscore.com. July 2011.
[3] N. Bailey. The Mathematical Theory of Infectious Diseases and its
Applications. Griffin, London, 1975.
[4] F. M. Bass. The relationship between diffusion rates, experience curves,
and demand elasticities for consumer durable technological innovations.
The Journal of Business, 53(3):pp. 51–67, 1980.
[5] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube,
you tube, everybody tubes: analyzing the world’s largest user generated
content video system. In Proc. of ACM IMC, pages 1–14, San Diego,
California, USA, October 24-26 2007.
[6] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. Analyzing
the video popularity characteristics of large-scale user generated content
systems. IEEE/ACM Transactions on Networking, 17(5):1357 – 1370,
2009.
[7] D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos.
Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur.,
10(4):1:1–1:26, jan 2008.
[8] G. Chatzopoulou, C. Sheng, and M. Faloutsos. A First Step Towards
Understanding Popularity in YouTube. In Proc. of IEEE INFOCOM,
pages 1–6, San Diego, March 15-19 2010.
[9] X. Cheng, C. Dale, and J. Liu. Statistics and social network of YouTube
videos. In Proc. International Workshop on Quality of Service (IWQoS),
The Netherlands, pages 229–238, June 2008.
[10] R. Crane and D. Sornette. Viral, quality, and junk videos on YouTube:
Separating content from noise in an information-rich environment. In
Proc. of AAAI symposium on Social Information Processing, Menlo
Park, CA, March 26-28 2008.
[11] W. E. Deming. The chi-test and curve fitting. Journal of the American
Statistical Association, 29(188):372–382, Dec 1934.
[12] A. Ganesh, L. Massoulie, and D. Towsley. The effect of network
topology on the spread of epidemics. In INFOCOM 2005. 24th Annual
Joint Conference of the IEEE Computer and Communications Societies.
Proceedings IEEE, volume 2, pages 1455–1466, Miami, FL, USA,
March 2005.
[13] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube traffic characteriza-
tion: A view from the edge. In Proc. of ACM IMC, 2007.
[14] V. Mahajan, E. Muller, and Y. Wind. New-Product Diffusion Models.
International Series in Quantitative Marketing. Springer, 2000.
[15] D. W. Marquardt. An algorithm for least-squares estimation of
nonlinear parameters. Journal of the Society for Industrial and Applied
Mathematics, 11(2):431–441, Jun 1963.
[16] L. A. Meyers. Contact network epidemiology: Bond percolation applied
to infectious disease prediction and control. Bull. AMS, 44(1):63–86,
2007.
[17] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti.
Characterizing web-based video sharing workloads. ACM Transactions
on the Web, 2(8):8 – 27, 2011.
[18] J. Ratkiewicz, F. Menczer, S. Fortunato, A. Flammini, and A. Vespig-
nani. Traffic in Social Media II: Modeling Bursty popularity. In Proc.
of IEEE SocialCom, Minneapolis, August 20-22 2010.
[19] G. Szabo and B. A. Huberman. Predicting the Popularity of Online
Content. Communications of the ACM, 53(8):80–88, aug 2010.
[20] M. Zeni, D. Miorandi, and F. De Pellegrini. YOUStatAnalyzer: a tool
for analysing the dynamics of YouTube content popularity. In Proc. 7th
International Conference on Performance Evaluation Methodologies
and Tools (Valuetools), Torino, Italy, December 2013.
