Provenance-Assisted Classification in Social Networks


624 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 8, NO. 4, AUGUST 2014

Provenance-Assisted Classification in
Social Networks
Dong Wang, Md Tanvir Al Amin, Tarek Abdelzaher, Dan Roth, Clare R. Voss, Lance M. Kaplan, Stephen Tratz,
Jamal Laoudi, and Douglas Briesch

Abstract—Signal feature extraction and classification are two common tasks in the signal processing literature. This paper investigates the use of source identities as a common mechanism for enhancing the classification accuracy of social signals. We define social signals as outputs, such as microblog entries, geotags, or uploaded images, contributed by users in a social network. Many classification tasks can be defined on such outputs. For example, one may want to identify the dialect of a microblog contributed by an author, or classify information referred to in a user’s tweet as true or false. While the design of such classifiers is application-specific, social signals share in common one key property: they are augmented by the explicit identity of the source. This motivates investigating whether or not knowing the source of each signal (in addition to exploiting signal features) allows the classification accuracy to be improved. We call it provenance-assisted classification. This paper answers the above question affirmatively, demonstrating how source identities can improve classification accuracy, and derives confidence bounds to quantify the accuracy of results. Evaluation is performed in two real-world contexts: (i) fact-finding that classifies microblog entries into true and false, and (ii) language classification of tweets issued by a set of possibly multi-lingual speakers. We also carry out extensive simulation experiments to further evaluate the performance of the proposed classification scheme over different problem dimensions. The results show that provenance features significantly improve classification accuracy of social signals, even when no information is known about the sources (besides their ID). This observation offers a general mechanism for enhancing classification results in social networks.

Index Terms—Social signals, classification, uncertain provenance, maximum likelihood estimation, expectation maximization, signal feature extraction.

Manuscript received September 16, 2013; revised December 18, 2013; accepted February 25, 2014. Date of publication March 13, 2014; date of current version July 16, 2014. This work was supported by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Vikram Krishnamurthy.
D. Wang, M. T. Al Amin, T. Abdelzaher, and D. Roth are with the Department of Computer Science, University of Illinois at Urbana Champaign, Urbana, IL 61801 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
C. R. Voss, L. M. Kaplan, S. Tratz, J. Laoudi, and D. Briesch are with the U.S. Army Research Laboratory, Adelphi, MD 20783 USA (e-mail: clare.r.voss.[email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSTSP.2014.2311586

I. INTRODUCTION

The emergence of social networks in recent years opens myriad new opportunities for extracting information from artifacts contributed by social sources. A significant amount of data mining literature has recently concerned itself with social network analysis. While much of that literature explores clever heuristics, recent work demonstrated that some classes of such data mining problems (such as fact-finding [1], [2]) have a rigorous estimation-theoretic formulation, amenable to well-understood solutions that use maximum-likelihood estimation techniques to accurately assess the quality of analysis results [3], [4].

This paper explores the link between estimation theory and social networks by addressing the problem of social signal classification. Generalizing from the special case of fact-finding [1]–[4], we define social signals as outputs, such as microblog entries, geotags, or uploaded images, contributed by users in a social network. We then consider the classification problem of such outputs.¹ Unlike signals generated by the physical environment (such as magnetic field or sound), where the source of the signal is often a physical object yet to be identified, in social networks the source of a social signal is usually explicitly indicated. For example, microblog entries uploaded on Twitter include the user ID. So do images uploaded on Flickr and videos uploaded on YouTube. The ubiquity of source ID information begs the question of whether it can assist with classification tasks defined on social signals such as identifying the location depicted in an uploaded image, the language used in a tweet, or the veracity of a claim.

¹The fact-finding problem can be thought of as a special case of classification where one needs to classify claims into true and false.

Current classifiers address their classification tasks by exploiting domain-specific features, such as visual clues in an image and linguistic features in text, to perform the classification. The question posed in this paper is whether (and to what degree) using source identity will enhance classification results. Clearly, the more one knows about the source, the better the enhancement. To compute a worst case, we assume that one does not know anything about the sources other than their IDs. This assumption is often true when users find content on social networks that comes from arbitrary sources. The research question addressed in this paper is to understand to what degree knowledge of source ID alone, and without any additional information about the sources, may enhance classifier performance. Such an enhancement can then be generally applied to any classification task in social networks where source IDs are available.

One approach for incorporating source identity into the classification problem is to add it as a feature into the classifier. This may be cumbersome, however, as different classifiers are usually employed for different types of signals. For example, image classifiers and language classifiers are quite different, which may require different solutions for incorporating source

1932-4553 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
WANG et al.: PROVENANCE-ASSISTED CLASSIFICATION IN SOCIAL NETWORKS 625

information. Rather than having to change different classifiers by incorporating source identities as features, the approach we take is a general one, where the original domain-specific classifier remains unchanged. Instead, source identities are considered in a separate step that is independent of the domain-specific classifier design. This step operates on classifier output with the aim of improving classification results. We show that this refinement step can be formulated as a maximum-likelihood estimation problem.

We evaluate the approach in two real world application scenarios: (i) a fact-finding application where noisy microblog data are classified into true and false facts, and (ii) a language classification application where Arabic microblogs from Twitter are classified into different dialects. Our evaluation results show that the scheme proposed in this paper significantly improves classification accuracy of conventional classifiers by leveraging the provenance information. We also carry out extensive simulation experiments to examine the performance of our classification enhancement scheme in different scenarios. The results verify its scalability and robustness over several key problem dimensions.

The rest of the paper is organized as follows. We review related work in Section II. In Section III, we present our signal classification model in social networks. We discuss the proposed maximum likelihood estimation approach to improve the classification accuracy in Section IV. The theoretical accuracy bounds that are used to quantify the quality of the results are derived in Section V. Experimental evaluation results are presented in Section VI. Finally, we discuss the limitations of the current model and future work in Section VII, and conclude the paper in Section VIII.

II. RELATED WORK

Classification is a fundamental problem that has been extensively studied in machine learning, data mining, statistics and pattern recognition. A comprehensive overview of different classification schemes is described in [5], [6]. Our work augments prior classification literature in the context of classifying social signals. Existing work has studied the classification of nodes and relationships in social networks [7], [8] as well as human related features [9]. In contrast, we take advantage of the fact that signals in social networks, unlike physical signals in other application scenarios, explicitly mention source ID. Hence, we develop a new provenance-assisted scheme for enhancing classification results by taking into account provenance information in a separate step using a maximum-likelihood estimation approach. Our approach explicitly improves classification accuracy by jointly uncovering classes of artifacts and affinities of sources to generating artifacts of specific classes.

One application of our classification scheme has been fact-finding. Techniques for classifying true facts from false ones are traced back to data mining and machine learning literature. One of the early works is Hubs and Authorities [10] that used a basic fact-finder where the belief in a claim and the truthfulness of a source are jointly computed in a simple iterative way. Pasternack et al. extended the fact-finder framework by incorporating prior knowledge into the analysis and proposed several extended algorithms: Average.Log, Investment, and Pooled Investment [11]. Yin et al. introduced TruthFinder as an unsupervised fact-finder for trust analysis on a providers-facts network [12]. Other fact-finders enhanced the basic framework by incorporating analysis on properties or dependencies within claims or sources. Galland et al. [13] took the notion of hardness of facts into consideration by proposing their algorithms: Cosine, 2-Estimates, 3-Estimates. Similar iterative algorithms have also been studied in the context of crowdsourcing applications to minimize the budget cost while optimizing overall quality of answers from crowd-sourced workers [14]. While such prior work was essentially heuristic in nature, an optimal solution to (a simplified version of) the problem was recently proposed [1] in the context of a simple social sensing model, demonstrating improved performance. In contrast, this paper solves a more general classification problem beyond fact-finding where the possible values of artifacts are not limited to binary values.

Our classifier enhancement scheme is based on expectation maximization. In estimation theory, Expectation Maximization (EM) is a general optimization technique for finding the maximum likelihood estimation of parameters in a statistical model where the data are “incomplete” or involve latent variables in addition to estimation parameters and observed data [15]. EM is frequently used for data clustering in data mining and machine learning [16], [17]. For language modeling, EM is often used to estimate parameters of a mixed model where the exact model from which the data is generated is unobservable [18]–[20]. EM is also used in many other estimation tasks involving mixture distributions including parameter estimation for hidden Markov models with applications in pattern recognition, image reconstruction, error correction codes, etc. [21], [22].

III. PROBLEM FORMULATION

Consider a social network of M sources, who collectively generate the social signals we want to classify. We henceforth call such signals the artifacts. Let there be a total of N artifacts, where each artifact can have one of several possible classes. The classification problem is to determine the class of each artifact. Many problems fall into the above category. Below, three examples are presented:
• Fact-finders: A fact-finder considers M sources, who collectively generate N claims (the artifacts). Each claim is in one of two possible classes, true or false. The fact-finder must determine which claims are true and which are false.
• Language classification: In a language classification problem, there may be M authors (the sources), who collectively write N words (the artifacts). Each word could be in one of several languages. The goal is to identify the language of each word.
• Automated geo-tagging of text: In a geo-tagging problem, there may be M bloggers (the sources), who collectively describe a set of N events (the artifacts). Each event may take place at one of several locations, not explicitly marked in the blog. The goal is to identify the location associated with each event implicitly from the text.

Fig. 1. Input Bipartite Graph.

The traditional classification approach is to take the artifacts in isolation and find the best class for each using specialized domain knowledge. For example, a language classifier can use linguistic features to identify the language of a word. This paper augments that with the exploitation of provenance information. By provenance, for purposes of this work, we refer to the identity of the source(s) of each artifact. We do not assume that we know any information about the sources other than their IDs. While, clearly, knowing some background about the sources will help, this is not the point of this work. The paper investigates to what degree the knowledge of source ID alone helps classification outcomes.

The intuitive reason why provenance information (i.e., source IDs) should help with classification is that sources have affinity to generating artifacts of particular types. For example, a person from Egypt might have an affinity to writing in Arabic, a truthful person might have an affinity of generating tweets of type “true”, and a person who commutes in Los Angeles might have an affinity to complaining about LA traffic. Said differently, sources constrain the probability distribution of the classes of artifacts they generate. These constraints are automatically estimated and explicitly accounted for in the mathematical formulation of our algorithm, which then forces the solution to obey them.

According to our terminology, multiple sources can “generate” the same artifact. For example, multiple tweeters can make the same claim, multiple authors can use the same word, and multiple bloggers can describe the same event. The input to our problem describes which sources generated which artifacts. This input is given by a bipartite graph as shown in Fig. 1, where nodes represent sources and artifacts, and where a link exists between a source and an artifact if and only if the source generated that artifact. We call this set the observed input, . The class of each artifact is unknown and is represented by a latent variable. The vector of all such latent variables is called .

We also define as the (unknown) probability that source generates an artifact given that is of class (e.g., the odds that source speaks word given that is from a certain language ). Formally, is defined as follows:

(1)

We also define if generates artifact (i.e, ), and otherwise. Moreover, let us denote the probability that an artifact is of class , given that it was generated by source , as . These probabilities represent source affinities, referred to above. Formally, is given as:

(2)

Using Bayes theorem, is related to as follows:

(3)

where represents the probability of a source to generate artifacts (i.e., artifact production rate) and is the overall prior of a randomly chosen artifact to be of class .

Our problem is to jointly estimate (i) the latent variable vector (i.e., the class of each artifact), and (ii) the source affinities, , that can be computed from the estimation parameter vector , where .

IV. SOLUTION

In this section, we cast the problem of jointly (i) classifying the values of artifacts and (ii) computing source affinities as a maximum likelihood estimation problem. A maximum-likelihood estimator is then derived to solve it. The maximum likelihood estimator finds the values of the unknown parameters (i.e., ) that maximize the probability of observed input . Hence, we would like to find that maximizes . The probability depends on which artifacts belong to which classes (i.e., the values of latent variables ). Using the total probability theorem, we can now rewrite the expression we want to maximize, namely , as follows:

(4)

We solve this problem using the Expectation Maximization (EM) algorithm that starts with some initial guess for , say and iteratively updates it using the formula:

(5)

The above breaks down into three quantities that need to be derived:
• The log likelihood function,
• The expectation step (E-step),
• The maximization step (M-step),

Note that, the E-step and M-step are computed iteratively until the algorithm converges. The above likelihood functions are derived below.

A. Deriving the Likelihood

To compute the log likelihood, we first compute the function . Let us divide the source and artifact bipartite graph

into sub-graphs, , one per artifact . The sub-graph describes which sources generate the artifact and which did not. Since artifacts are independent, we can re-write:

(6)

which can in turn be re-written as:

(7)

where is the joint probability of all observed input involving artifact .

Considering each artifact could have possible class values, (8) can be further rewritten as follows:

(8)

Hence, the likelihood function, denoted by , is given by:

(9)

where is defined in (1) and if source generates artifact and otherwise. represents the overall prior probability that an arbitrary artifact is of class . Let be a set of indicator variables for artifact , where when is of class and otherwise. We now formulate an expectation maximization algorithm (EM) that jointly estimates the parameter vector and the indicator variables, .

B. Deriving the E-Step and M-Step

Given the above formulation, substitute the likelihood function defined in (9) into the definition of function of Expectation Maximization. The Expectation step (E-step) becomes:

(10)

where represents the observed links from all sources to the artifact. Let the latent variable be defined for such that: when is of class . Let be the conditional probability that the variable is of class given the observed data related to the artifact and current estimate of . can be further derived as:

(11)

where is defined as:

(12)

Substituting (11) into (10), we get:

(13)

For the Maximization step (M-step), we choose (i.e., for ) that maximizes the function in each iteration to be the of the next iteration.

To get that maximizes , we set the derivatives , which yields:

(14)

Let us define as the set of artifacts the source actually generates, and as the set of artifacts does not generate. Thus, (14) can be rewritten as:

(15)

Solving the above equations, we can get expressions of the optimal :

(16)

where N is the total number of artifacts we have. is defined in (11).

Given the above, the E-step and M-step of EM optimization reduce to simply calculating (11) and (16) iteratively until

they converge. The convergence analysis has been done for EM scheme and it is beyond the scope of this paper [23]. In practice, we can run the algorithm until the difference of estimation parameter between consecutive iterations becomes insignificant. We can then classify the classes of artifacts based on the converged value of . Specifically, is of class if is the largest for . We can also compute the values of from the values of the estimation parameters based on (3). This completes the mathematical development. We summarize the resulting algorithm in the subsection below.

C. Final Algorithm

Algorithm 1 Provenance-Assisted General Classifier
1: Initialize parameter vector
2: while does not converge do
3:   for do
4:     compute based on Equation (11)
5:   end for
6:
7:   for do
8:     compute based on Equation (16)
9:     update with in
10:   end for
11:
12: end while
13: Let
14: Let
15: for do
16:   Let who has maximum
17:   is of Class
18: end for
19: for do
20:   calculate from based on Equation (3)
21: end for
22: Return the computed optimal estimates of class for each artifact and the probability of a source to generate a specific class of artifacts (i.e., ).

In summary of the EM classification scheme derived above, the input is the source artifact graph describing which sources generate which artifacts and the output is an estimate of the class of each artifact, as well as an estimate of source affinities. In particular, given the source artifact graph , our algorithm begins by initializing the parameter . The algorithm then iterates between the E-step and M-step until converges. Specifically, we compute the conditional probability of an artifact to be of class (i.e., ) from (11) and the estimation parameter (i.e., ) from (16). Finally, we can decide whether each artifact is of class based on the converged value of (i.e., ). The pseudocode of the provenance-assisted (PA) classification algorithm is shown in Algorithm 1.

D. Enhancing an Arbitrary Classifier

The above algorithm can be executed as an enhancement stage for any arbitrary (domain-specific) classifier of social signals. There are two different ways that such an enhancement can be added.

In the first approach, the enhanced system runs the arbitrary (domain-specific) classifier first. Assuming that the original classifier can tell when it is very confident in its labels, and when it is not, we can import from that classifier only labels of those artifacts in which it is very confident. These labels are treated as the ground truth estimate of the corresponding subset of the indicator variables vector, , used by our algorithm. Remember that the indicator variable vector, , in our iterative algorithm states the class of each artifact. A subset of the indicator variables is thus determined by the domain specific classifier. The rest are initialized at random and the above iterations are carried out updating their values until they converge.

In the second approach, the enhanced system runs the domain-specific classifier to obtain an initial guess of the class of all artifacts. These results will presumably contain misclassifications. Hence, the labels generated by the domain classifier are used as initial values for the indicator variable vector . Our algorithm is then executed to update these initial estimates. The converged values of these variables should improve upon the initial guess (i.e., upon the domain classification results).

In the first scenario above, the improvement is obvious. Our algorithm fills in labels that the domain classifier was unsure of. In the second scenario, the intuitive reason why we achieve a performance improvement is that our algorithm starts with the output of the traditional classifier, which is already close to the right answer and “snaps it” to the locus of points that maximize likelihood of observations in view of constraints that relate the probability distributions computed for sources and the probabilities of the classes of their artifacts. This “snapping” therefore uses additional information on source-artifact relations, not furnished to the traditional classifier. Namely, it obeys laws of probability and Bayesian equations that relate source affinities and artifact classes.

V. ACCURACY BOUND

In the previous section, we derived a classification enhancement scheme that takes the provenance of artifacts into account. However, one important question remains: how to quantify the estimation accuracy of the resulting enhanced classifier? In particular, we are interested in obtaining the confidence intervals; namely, the error bounds on the estimation parameters of our model for a given confidence level. In this section, we derive such Cramer-Rao lower bounds (CRLB).

A. Deriving Error Bounds

We start with the derivation of Cramer-Rao lower bounds for our problem. The CRLB states the lower bounds of estimation variance that can be achieved by the maximum likelihood estimation (MLE).
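As a rough illustration of the E-step/M-step loop summarized in Algorithm 1 (Section IV-C), the following Python sketch refines the labels of a domain classifier using only the source-to-artifact links. It is not the paper's implementation. It assumes a standard mixture model in which each source generates an artifact independently given the artifact's class. The names `e_step`, `m_step`, and `pa_classify`, the 0.9 soft initialization, and the `eps` clamp are all hypothetical choices. Here `a[i][c]` stands for the probability that source `i` generates an artifact of class `c`, `d[c]` for the class prior, and `z[j][c]` for the posterior that artifact `j` is of class `c`.

```python
import math


def m_step(links, z, eps=1e-6):
    """Re-estimate a[i][c] and the class priors d[c] from soft assignments z[j][c]."""
    m, n, k = len(links), len(z), len(z[0])
    totals = [sum(z[j][c] for j in range(n)) for c in range(k)]
    d = [totals[c] / n for c in range(k)]
    a = []
    for i in range(m):
        row = []
        for c in range(k):
            hit = sum(z[j][c] for j in range(n) if links[i][j])
            # Clamp away from 0 and 1 so the logarithms in the E-step stay finite.
            row.append(min(1 - eps, max(eps, hit / max(totals[c], eps))))
        a.append(row)
    return a, d


def e_step(links, a, d, eps=1e-6):
    """Posterior class probabilities z[j][c] for every artifact j."""
    m, n, k = len(links), len(links[0]), len(d)
    z = []
    for j in range(n):
        logp = []
        for c in range(k):
            lp = math.log(max(d[c], eps))
            for i in range(m):
                lp += math.log(a[i][c] if links[i][j] else 1 - a[i][c])
            logp.append(lp)
        top = max(logp)  # subtract the maximum before exponentiating
        w = [math.exp(lp - top) for lp in logp]
        z.append([x / sum(w) for x in w])
    return z


def pa_classify(links, init_labels, k, max_iter=100, tol=1e-9):
    """Refine initial labels (e.g., a domain classifier's output); requires k >= 2."""
    n = len(init_labels)
    # Soft initialization: 0.9 on the initial label, the rest spread evenly.
    z = [[0.9 if c == init_labels[j] else 0.1 / (k - 1) for c in range(k)]
         for j in range(n)]
    a, d = m_step(links, z)
    for _ in range(max_iter):
        z = e_step(links, a, d)
        new_a, new_d = m_step(links, z)
        change = max(abs(x - y) for r, s in zip(new_a, a) for x, y in zip(r, s))
        a, d = new_a, new_d
        if change < tol:
            break
    labels = [max(range(k), key=lambda c: z[j][c]) for j in range(n)]
    return labels, a, d


# Toy input: sources 0-1 generate artifacts 0-3, sources 2-3 generate artifacts 4-7.
links = [[1, 1, 1, 1, 0, 0, 0, 0]] * 2 + [[0, 0, 0, 0, 1, 1, 1, 1]] * 2
noisy = [0, 0, 0, 1, 1, 1, 1, 0]  # initial guess containing two mistakes
labels, a, d = pa_classify(links, noisy, k=2)
print(labels)
```

In this toy case the two mislabeled artifacts share their sources with correctly labeled ones, so the refinement pulls them back to the class of their neighbors. Real inputs are far sparser, and the sketch makes no claim about how the paper handles initialization or numerical stability.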

We follow similar derivation steps in [4] and the derived asymptotic CRLB of our problem is shown as follows:

(17)

Note that, the asymptotic CRLB is independent of M (i.e., number of sources) under the assumption that M is sufficient, and it can be quickly computed.

B. Confidence Interval

One of the attractive asymptotic properties of the maximum likelihood estimator is called asymptotic normality: the MLE estimator is asymptotically distributed with Gaussian behavior as the data sample size goes up. The variance of estimation error on parameter is denoted as . For a problem with sufficient M and N (i.e., under asymptotic condition), also follows a normal distribution with 0 mean and variance given by:

(18)

Thus, the confidence interval that can be used to quantify the probability a source generates a given class of artifacts (i.e., ) is given by the following:

(19)

where is the standard score (z-score) of the confidence level . For example, for the 95% confidence level, the standard score is 1.96.

VI. EVALUATION

In this section, we first evaluate the performance of the provenance-assisted (PA) classifier described in this paper through two real world application scenarios including a fact-finding application using geotagging data and an Arabic dialect classification application using Twitter data feeds. We then carry out extensive simulation experiments to study the performance of the PA classifier over different problem dimensions. The results show that our scheme significantly improves classification accuracy compared to traditional classifiers by using the source ID as additional information.

A. Fact-Finding Example

Fact-finding is a common type of analysis applied to data (typically text) uploaded to social networks. The goal of fact-finding is to estimate the probability of correctness of claims made in the text. In this experiment, we generate a scenario where ground truth is known. Namely, we develop a “parking lot finder” application that helps students identify free parking lots on campus (at the University of Illinois at Urbana Champaign). “Free parking lots” refer to parking lots that are free of charge after 5pm on weekdays (as well as weekends). The application allows volunteers to identify parking lots they think are free. This information is shared with others. It also runs our algorithm in the background to compute the right class for each parking lot: either “free” or “pay”. For evaluation, we collected ground truth by visiting all parking lots in question and accurately inspecting their posted signs. Note that a slightly different version of this application was published in the context of handling conflicting claims [2]. The current evaluation differs in that (interpreting fact-finding as a fact classification problem) we aim to understand the degree to which fact classification results are improved when our expectation maximization algorithm runs as a second stage after an initial solution is computed by another fact-finder.

In the experiment, 30 participants were recruited. Recruited volunteers were asked to mark any parking lots they thought were free. Participants were not asked to visit all parking lots in the area. Rather, they were asked to mark parking lots at will (e.g., those parking lots they are familiar with). Collectively, they surveyed 106 parking lots (46 of which were indeed free). There were a total of 423 reports (notations claiming a “free parking lot”) collected from these participants.

We note that there are many different types of parking lots on campus: enforced parking lots with time limits, parking meters, permit parking, and others. Different parking lots have different regulations for free parking. Moreover, instructions and permit signs sometimes are easy to miss. Hence, our participants suffered both false positives and false negatives in their reports. Moreover, participants differed in their reliability (i.e., affinity to generating correct responses). Some actually visited the parking lots in person and carefully inspected the posted signs. Others reported results from memory.

In our evaluation, three different fact-finding schemes are first employed. Our expectation maximization algorithm is then applied to their output. Specifically, Average-Log [11], Truth-Finder [12] and a Voting scheme were used to provide three different initial guesses regarding artifact classification. The voting scheme considered a parking lot to be “free” if it was reported free by at least a given number of volunteers. This threshold was varied in the evaluation results shown later.

To run our provenance-assisted (PA) expectation maximization algorithm, we generated the source to artifact bipartite graph (i.e., observed input ) taking the participants as sources and parking lots as artifacts. The artifacts were assigned class “free” or “pay” depending on the results of Average-Log [11], Truth-Finder [12], or the voting scheme, respectively. Our scheme then performed its iterations until they converged. The receiver operating characteristics (ROCs) curves computed by these schemes as well as the final solution of our PA classifier are shown in Fig. 2.

We observe that the PA classifier achieved the best ROCs performance among all schemes under comparison. The reason is that our PA classifier modeled the provenance information of the artifacts explicitly and used the MLE approach to find the value of each claim that is most consistent with the observations we had. We also observed that the EM algorithm converges to the ML solution given a reasonable initialization. As a result, the PA classifier is insensitive to the initial guess provided by the other classifiers.

We also evaluated the probability of a source to report true claims (“free parking lot”) and the confidence bounds we derived to quantify its accuracy. We calculated the 90% confidence bounds based on the formula derived in Section V. The results

TABLE I
ARABIC DIALECTS CLASSIFICATION RESULTS

Fig. 2. ROCs of Different Fact-Finders.

Fig. 3. Source probability to report true claims.

words that only appear in either Egyptian or Moroccan dialect. Then, we used them as query words to collect tweets that originated from Egypt and Morocco respectively. We collected 2945 tweets in total, including 2000 Egyptian tweets and 945 Moroccan tweets. Note that the choice of Egyptian and Moroccan was dictated simply by available language expertise on the team, to make ground-truthing of dialects possible.

In our experiment, we applied both a domain-specific dialect classifier [24] and the provenance-assisted (PA) classifier we developed in this paper on the collected tweets and compared their performance. To use the provenance-assisted classifier, we first broke the tweets into words and removed punctuation marks, non-linguistic symbols and tags in the tweets. We then built the source artifact graph by taking the users of the tweets as sources and the words they tweeted as artifacts. There is a link between a source and an artifact if the user tweeted that word. We first ran the domain classifier on the collected tweets and obtained the dialect classification results. We then used the dialect outputs to label all words and initialized our PA classifier accordingly.

The comparison results are shown in Table I. We changed the threshold of the probability to decide whether a word is Egyptian or not from 0.5 to 0.95. From Table I, we observe that, compared to the domain classifier, the PA classifier was able to increase the accuracy of Egyptian classification by more than 10% while keeping the accuracy of Moroccan tweet classification slightly better or similar. Such performance gain is obtained by leveraging the user ID information of tweets. We also observed that the PA classifier performance is consistent and robust when the threshold value was varied in the experiment.

We also studied the probability of a source to speak a given dialect and the confidence bounds we derived to quantify its accuracy. For demonstration purposes, we randomly picked 30
are shown in Fig. 3. The sources are sorted based on the value of sources and computed the probability that the source speaks
the lower bounds at 90% confidence. We observe that there are Egyptian from the tweets he/she actually tweeted. We also cal-
only 2 sources out of 30 whose probability to report true claims culated the 90% confidence bounds based on the formula de-
was outside the 90% confidence bounds, which matches quite rived in Section V. The results are shown in Fig. 4. We observe
well with the definition of a 90% confidence interval (which that in this case there are only 3 sources out of 30 whose prob-
implies that no more than 3 sources out of 30 should be outside ability to speak Egyptian was outside the 90% confidence in-
the interval). The results verified the correctness and accuracy terval, which means that indeed exactly 90% of the sources fall
of the confidence bounds we derived for our PA classifier. within the interval.

B. Arabic Dialect Classification C. Simulation Study


In this application, the goal is to automatically distinguish The above experiments represent only two points in the
two dialects of Arabic in a set of tweets. In particular, we used space of possible datasets to apply classifiers to. They feature
Twitter to collect Arabic tweets for our experiment. The two di- datasets with only two classes and a limited number of sources.
alects we selected were Egyptian and Moroccan. For evaluation To explore performance more broadly, in this subsection, we
purposes, we used two sets of key words, representing Arabic carried out extensive simulation experiments evaluating the

provenance-assisted (PA) classification scheme along different problem dimensions.

Fig. 4. Source probability to speak Egyptian.

We built a simulator in Matlab 7.10.0 that generates a random number of sources and artifacts. Each source is assigned a probability of generating artifacts of each given class, and a number of artifacts is generated for each source. We ensure that the class probabilities of each source sum to one.

In the evaluation, we mainly studied three metrics: (i) the average estimation error of the source probabilities, normalized by their mean value; (ii) the average classification error; (iii) the fraction of sources whose probabilities are within the confidence bounds we derived in Section V.

1) Source-Artifact Graph Topology – Sources: In the first experiment, we evaluate the estimation accuracy of the PA classifier by varying the number of sources in the system. The number of generated artifacts was fixed at 3000. The average number of artifacts generated per source was set to 50. We assumed that a domain-specific classifier has already labeled half the artifacts with class labels. This is to emulate the case where the initial classifier was sure of only 50% of the data. Our PA classifier used the labeled artifacts to initialize the EM algorithm and figure out the classes of the unlabeled ones. The number of sources was varied from 200 to 1000. In this initial experiment, the probability that a source generates artifacts of a given class was drawn at random from a uniform distribution. In some sense, this offers a worst case for our classifier, as it indicates absence of a clear affinity between sources and artifacts. (Later, we show experiments with stronger affinity models.) The number of classes was varied from 2 to 5. Reported results are averaged over 50 random distributions of the source probabilities.

Results are shown in Fig. 5. Observe that the PA classifier estimation accuracy improves as the number of sources in the system increases. Given sufficient sources, the estimation error in the source probabilities and the artifact classification error are kept well below 5%. We also note that the fraction of sources whose probabilities are actually bounded by the 90% confidence interval is normally around or above 90%, which verifies the accuracy of the confidence intervals we derived. Additionally, we observe that the performance of the PA classifier increases as the number of classes decreases. The reason is that the number of estimation parameters becomes smaller.

Fig. 5. Changing the number of sources. PA classifier operates as an add-on to a domain classifier that leaves 50% of the artifacts unlabeled. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

2) Source-Artifact Graph Topology – Artifacts: The second experiment studies the performance of the PA classifier when the average number of artifacts generated per source changes. As before, the number of generated artifacts was fixed at 3000. The number of sources was set to 300. The fraction of labeled artifacts (presumably by a domain-specific classifier) was set to 0.5. The number of artifacts generated per source was varied from 50 to 200. The number of classes was varied from 2 to 5. Reported results are averaged over 50 random distributions of the source probabilities. Results are shown in Fig. 6. Observe that the PA classifier estimation accuracy improves as the number of generated artifacts per source increases. This is because more artifacts simply provide more evidence for the PA classifier to figure out which artifact belongs to which class. Similarly, we note that the fraction of sources whose probabilities are bounded by the 90% confidence interval is indeed above 90%. Additionally, we observe a similar trend of performance increase of the PA classifier as the number of classes decreases.

Fig. 6. Changing the number of artifacts per source. PA classifier operates as an add-on to a domain classifier that leaves 50% of the artifacts unlabeled. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

3) Fraction of Labeled Artifacts: The third experiment examines the effect of changing the fraction of the labeled artifacts on the PA classifier. We vary the fraction of artifacts labeled by the domain classifier from 0.1 to 0.9, while fixing the total number of artifacts to 3000. The average number of artifacts generated per source was set to 50. The number of sources was set to 300. The number of classes was varied from 2 to 5. Reported results are averaged over 50 random distributions of the source probabilities. Results are shown in Fig. 7. Observe that the PA classifier estimation error reduces as the fraction of labeled artifacts increases. This is intuitive: more correctly labeled artifacts will help the PA classifier converge to better results. Moreover, the 90% confidence bounds are shown to be tight even when the fraction of labeled artifacts is relatively small. We also observe that the estimation performance of the PA classifier increases as the number of classes decreases.

Fig. 7. Changing the fraction of labeled artifacts. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

4) Imperfect Domain Classifiers: In the above experiments, we looked at the case where some fraction of artifacts are correctly labeled and the rest are not labeled. The other usage scenario for our PA classifier is one where a domain classifier labels everything, but a certain fraction of labels are wrong. In this case, the PA classifier does not view initial labels as ground truth. Instead, it simply uses them as initial values for the iterations.

We first repeated the first and second experiments using a domain classifier with imperfect class labels. The experiment setup was the same as before except that we assumed all artifacts are labeled by an imperfect classifier and the fraction of incorrect labels was set to 25%. The results are shown in Figs. 8 and 9. We observe that our PA classifier significantly improves the classification accuracy over the original classifiers. The fraction of misclassified artifacts was reduced to below 10% (from 25%) for all cases we examined. These results demonstrate the capability of our classification enhancement scheme to improve the classification accuracy of imperfect classifiers.

Next we repeated experiment 3 above with an imperfect domain classifier. In this case, we assumed all artifacts are labeled and varied the fraction of incorrect labels from 0.05 to 0.5. The results are shown in Fig. 10. We observed that our PA classifier performance is robust to the fraction of initial incorrect labels and is able to reduce the classification error significantly compared to the original labels. For example, when half of the initial labels are wrong, our PA classifier was able to reduce the fraction of misclassified artifacts to about 12% for one of the simulated class settings. This result demonstrated the capability of the PA classifier to improve the classification accuracy of imperfect classifiers when the source information is available. In all cases, the reported results are averaged over 50 instances.

5) Study of Source Affinity Models: In the next experiment, we studied the effect of different source affinity models on the performance of our PA classifier. Sources may have affinity to generating certain types of artifacts (e.g., individuals living in Morocco may have an affinity to the Moroccan dialect and individuals living in Egypt may have an affinity to the Egyptian dialect). We studied three types of sources in our experiment: (i) specialized sources: each source produces only one class of artifacts (e.g., speaks only one language) regardless of how many classes are simulated in the data set; (ii) semi-specialized sources: each source uniformly produces some number of classes of artifacts that is less than the total number of classes; and (iii) semi-specialized sources with dominant affinity: same as semi-specialized sources, except that the odds of producing different classes of artifacts by a source are not uniform. There

Fig. 8. Changing the number of sources. PA classifier operates as an add-on to a domain classifier that misclassifies 25% of the artifacts. (a) Source Probability
Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

Fig. 9. Changing the number of artifacts per source. PA classifier operates as an add-on to a domain classifier that misclassifies 25% of the artifacts. (a) Source
Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

Fig. 10. Changing the fraction of initially mislabeled artifacts. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of
Sources within 90% Confidence Interval.

is a preferred class that dominates. The other classes share the remaining probability equally.

In the experiment, the number of sources was set to 300 and each source generated 50 artifacts on average. The total number of artifacts was set to 3000, and the number of classes was fixed at 5. We set the fraction of labeled artifacts to 0.5. The reported results are averaged over 50 experiments. Fig. 11 shows the performance comparison between specialized sources and semi-specialized sources. We observe that source specialization can improve the classification accuracy. The reason is that highly specialized sources have more concentrated distributions to generate given types of artifacts, which makes it easier for our classifier to differentiate artifacts of different classes. Fig. 12 shows the effect of the affinity dominance of the semi-specialized sources. We observe that the classification performance of our PA classifier improves as the probability to generate the preferred class (i.e., the class that dominates) by semi-specialized sources increases. This is because the semi-specialized sources become more specialized as the probability to generate the preferred class increases.

Fig. 11. Changing the number of classes per source. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

Fig. 12. Changing degree of affinity. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

In this experiment, we examined the performance of our classifier when the internal redundancy (represented by the average number of sources per artifact) changes. As before, we set the number of sources to 300 and the average number of artifacts generated per source to 50. We varied the average number of sources per artifact by changing the total number of artifacts generated. We fixed the number of classes at 5 and set the fraction of labeled artifacts to 0.5. The reported results are averaged over 50 experiments and shown in Fig. 13. We observed that the classification performance of our classifier improves as the average number of sources per artifact increases. This is intuitive: the more sources per artifact, the more redundancy is available to obtain better results. We also observed that the more specialized the sources are, the less sensitive our classifier will be to the changes in the average number of sources per artifact.

Fig. 13. Changing number of sources per artifact. (a) Source Probability Estimation Error. (b) Fraction of Misclassified Artifacts. (c) Fraction of Sources within 90% Confidence Interval.

6) Study of Scalability: In the last subsection, we studied the scalability (in terms of execution time) of our classifier over several basic problem dimensions. In the first experiment, we fixed the number of artifacts as 3000 and the average number of artifacts generated per source as 100. We changed the number of sources from 500 to 5000. The results are averaged over 50 experiments and reported in Fig. 14(a). We observed that the execution time of our classifier is linearly proportional to the number of sources in the system. In the second experiment, we fixed the

Fig. 14. Execution Time (s). (a) Changing number of sources. (b) Changing number of artifacts. (c) Changing number of artifacts generated per source.

number of sources as 1000 and the average number of artifacts generated per source as 100. We varied the number of total artifacts from 600 to 6000. Results are shown in Fig. 14(b). We observed that our classifier also scales linearly with the number of artifacts. In the last experiment, we fixed the number of sources as 1000 and the number of artifacts as 3000. We changed the average number of artifacts per source from 50 to 500. Results are shown in Fig. 14(c). We noted that the execution time of our classifier is insensitive to the average number of artifacts generated per source.

This concludes our evaluation study. In this section, we evaluated the performance of the proposed PA classifier through two real-world applications as well as extensive simulation experiments. The results verified that our PA classifier can significantly improve the classification performance of traditional classifiers by only using the source ID information. The performance of our classifier was shown to be robust and scalable over different problem dimensions. Additionally, it would also be interesting to examine the usage of source ID as a feature in domain classifiers. The authors would like to pursue this direction in future work.

VII. LIMITATIONS AND FUTURE WORK

This paper presented a general classifier enhancement scheme that uses source IDs to improve classification accuracy. Several simplifying assumptions were made that offer directions for future work.

In this paper, sources are assumed to be independent from each other in the sense that each source has their own independent affinities for generating artifacts of different types. In general, these affinities may be related. For example, if my friends in the social network speak a given language, there is a higher chance that I speak that language as well. This paper does not model such dependencies.

Several solutions have recently been proposed to model source dependencies in various special cases. One possible method is to detect the copy relationship between sources based on historical data [25], [26]. Another possible solution is to study the latent information dissemination graph between sources and understand how information is actually propagated through non-independent sources [27].

Related to the source independence assumption, the input to the proposed classifier in this paper is merely a set of artifacts labeled with source identities. The goal is to examine the performance improvement of the PA classifier by leveraging the provenance information. However, another distinguishing feature of social signals is the underlying social networks that capture relationships between nodes. It would be interesting to investigate the problem of incorporating the social network information (e.g., connections/linkages between sources) to further improve classification accuracy.

Another interesting direction for future work is to consider the Source ID as a feature in the domain classifiers and compare their performance with our PA classifiers. In that case, we might need to change the specific model of each domain classifier to incorporate the Source ID information. However, it would also be interesting to investigate whether there is another general way to consider the provenance information in classification problems without too much modification of the original models.

It is common to observe that sources have some expertise in certain knowledge domains. For example, a biologist may generate artifacts mainly about phylogeny of organisms while a musician may generate artifacts regarding music genres. Although we studied the effect of source affinity on classification performance in the evaluation, we do not explicitly take into account prior knowledge on source expertise in our current classifier. It is interesting to extend our model to take into account more information about sources besides their ID. Furthermore, the affinities of sources to generate different artifacts may change in different situations or over time. In such cases, we will need more efficient estimation schemes to dynamically track the changes in the source affinity. We reserve this as a future work direction.

A few techniques have been proposed in fact-finding to consider the hardness of facts [13], which could be generalized and adapted for our scheme. In general, generating certain artifacts might require a lower degree of specialization than others. For example, in an application where artifacts are tweets describing events, and classes of artifacts refer to locations of these described events, many sources may tweet about worldwide events of common interest. In this case, such general-interest tweets give less information about their sources. However, other tweets may be about special locations and represent specialized local knowledge. Such specialized knowledge is a better indicator of the locations or special interests of their sources. Future extensions of the scheme can therefore estimate and take into account, for different classes of artifacts, the difficulty (or degree of specialization needed) to generate artifacts of that class.
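
To make the independence assumption discussed in this section concrete, the following is a minimal Python sketch (not the Matlab simulator used in Section VI) in which every source draws its own class affinities and generates artifacts from them, independently of all other sources. It then checks how often a 90% interval covers the true affinity. All names are hypothetical: `draw_affinity`, `simulate_sources`, `wald_interval`, and `coverage` are illustrative helpers. The sketch further assumes that affinities are normalized uniform draws, that artifact classes are fully known rather than estimated by EM, and that a normal-approximation (Wald) interval stands in for the bound derived in Section V.

```python
import math
import random


def draw_affinity(num_classes, rng):
    """Draw one source's class affinities (assumption: normalized uniform draws)."""
    weights = [rng.random() for _ in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_sources(num_sources, artifacts_per_source, num_classes, seed=0):
    """Generate independent sources; return true affinities and per-class counts."""
    rng = random.Random(seed)
    affinities, counts = [], []
    for _ in range(num_sources):
        affinity = draw_affinity(num_classes, rng)
        # Each artifact's class depends only on this source's own affinities.
        drawn = rng.choices(range(num_classes), weights=affinity,
                            k=artifacts_per_source)
        affinities.append(affinity)
        counts.append([drawn.count(k) for k in range(num_classes)])
    return affinities, counts


def wald_interval(successes, trials, z=1.645):
    """Two-sided 90% normal-approximation interval (stand-in, not Section V)."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width


def coverage(affinities, counts, class_index=0):
    """Fraction of sources whose true affinity lies inside their interval."""
    inside = 0
    for affinity, count in zip(affinities, counts):
        low, high = wald_interval(count[class_index], sum(count))
        if low <= affinity[class_index] <= high:
            inside += 1
    return inside / len(affinities)


if __name__ == "__main__":
    true_affinities, class_counts = simulate_sources(300, 50, 2, seed=1)
    print("fraction inside 90% interval:", coverage(true_affinities, class_counts))
```

With 300 sources, 50 artifacts per source and 2 classes, the printed fraction should be in the vicinity of 0.9, although the Wald interval tends to under-cover for sources whose affinity is close to 0 or 1. Modeling the dependencies discussed above would require replacing the independent draw in `draw_affinity` with one that is conditioned on the affinities of related sources.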

VIII. CONCLUSION

This paper presented a scheme to improve classification accuracy of social signals by exploiting available source IDs. A maximum likelihood estimation model was built to jointly estimate source affinities and artifact classes, to assist classification tasks. An accuracy bound was derived along with the PA classification scheme to establish confidence in analysis results. The new scheme was evaluated through both real-world case studies and extensive simulation experiments. The results show that our scheme significantly improves classification accuracy compared to traditional domain classifiers and correctly computes confidence intervals. The work represents the first attempt at identifying a general methodology for improving performance of arbitrary classification tasks in social network applications.

ACKNOWLEDGMENT

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

REFERENCES

[1] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, “On truth discovery in social sensing: A maximum likelihood estimation approach,” in Proc. 11th ACM/IEEE Conf. Inf. Process. Sens. Netw. (IPSN 12), Apr. 2012.
[2] D. Wang, L. M. Kaplan, and T. F. Abdelzaher, “Maximum likelihood analysis of conflicting observations in social sensing,” ACM Trans. Sens. Netw., to be published.
[3] D. Wang, L. Kaplan, T. Abdelzaher, and C. C. Aggarwal, “On scalability and robustness limitations of real and asymptotic confidence bounds in social sensing,” in Proc. 9th Annu. IEEE Commun. Soc. Conf. Sens., Mesh, Ad Hoc Communicat. Netw. (SECON 12), Jun. 2012, pp. 506–514.
[4] D. Wang, L. M. Kaplan, T. F. Abdelzaher, and C. C. Aggarwal, “On credibility estimation tradeoffs in assured social sensing,” IEEE J. Sel. Areas Commun., vol. 31, no. 6, pp. 1026–1037, Jun. 2013.
[5] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011.
[6] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. New York, NY, USA: Wiley-Interscience, 2004.
[7] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” CoRR, vol. abs/1101.3291, 2011.
[8] S. Tang, J. Yuan, X. Mao, X.-Y. Li, W. Chen, and G. Dai, “Relationship classification in large scale online social networks and its impact on information propagation,” in Proc. IEEE INFOCOM ’11, 2011, pp. 2291–2299.
[9] A. Vinciarelli, M. Pantic, and H. Bourlard, “Social signal processing: Survey of an emerging domain,” Image Vis. Comput., vol. 27, no. 12, pp. 1743–1759, 2009.
[10] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999.
[11] J. Pasternack and D. Roth, “Knowing what to believe (when you already know something),” in Proc. Int. Conf. Comput. Linguist. (COLING), 2010.
[12] X. Yin, J. Han, and P. S. Yu, “Truth discovery with multiple conflicting information providers on the Web,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 6, pp. 796–808, Jun. 2008.
[13] A. Galland, S. Abiteboul, A. Marian, and P. Senellart, “Corroborating information from disagreeing views,” in Proc. WSDM, 2010, pp. 131–140.
[14] D. R. Karger, S. Oh, and D. Shah, “Iterative learning for reliable crowdsourcing systems,” in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds. Cambridge, MA, USA: MIT Press, 2011, vol. 24, pp. 1953–1961.
[15] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., Ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[16] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain, “Simultaneous feature selection and clustering using mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, Sep. 2004.
[17] J. V. Graça, K. Ganchev, and B. Taskar, “Expectation maximization and posterior constraints,” in Adv. NIPS, 2007, pp. 569–576.
[18] C. Zhai, A Note on the Expectation Maximization (EM) Algorithm. University of Illinois at Urbana Champaign, Department of Computer Science, 2007.
[19] J. Bilmes, “A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Tech. Rep. Univ. of Berkeley, 1997, ICSI-TR-97-021.
[20] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York, NY, USA: Wiley, 1997.
[21] T. Moon, “The expectation-maximization algorithm,” IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47–60, Nov. 1996.
[22] J. Gunther, D. Keller, and T. Moon, “A generalized BCJR algorithm and its use in iterative blind channel identification,” IEEE Signal Process. Lett., vol. 14, no. 10, pp. 661–664, Oct. 2007.
[23] C. F. J. Wu, “On the convergence properties of the EM algorithm,” Ann. Statist., vol. 11, no. 1, pp. 95–103, 1983.
[24] S. Tratz, D. Briesch, J. Laoudi, and C. Voss, “Tweet conversation annotation tool with a focus on an Arabic dialect, Moroccan Darija,” in Proc. 7th Linguist. Annotation Workshop & Interoperability with Discourse, Assoc. Computat. Linguist., Sofia, Bulgaria, 2013.
[25] X. Dong, L. Berti-Equille, and D. Srivastava, “Truth discovery and copying detection in a dynamic world,” VLDB, vol. 2, no. 1, pp. 562–573, 2009.
[26] X. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava, “Global detection of complex copying relationships between sources,” PVLDB, vol. 3, no. 1, pp. 1358–1369, 2010.
[27] P. Netrapalli and S. Sanghavi, “Learning the graph of epidemic cascades,” in Proc. 12th ACM SIGMETRICS/PERFORMANCE Joint Int. Conf. Meas. Modeling Comput. Syst., New York, NY, USA, 2012, pp. 211–222.

Dong Wang received his Ph.D. in Computer Science from University of Illinois at Urbana Champaign (UIUC) in 2012, an M.S. degree from Peking University in 2007 and a B.Eng. from the University of Electronic Science and Technology of China in 2004, respectively. He is now a postdoctoral researcher in the Department of Computer Science at UIUC. Dr. Wang’s research interests lie in the area of reliable social sensing, cyber-physical computing, real-time and embedded systems, and crowdsourcing applications. He received the Wing Kai Cheng Fellowship from University of Illinois in 2012 and the Best Paper Award of IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) in 2010. He is a member of IEEE and ACM.

Md Tanvir Al Amin is currently a Ph.D. candidate in the Department of Computer Science in the University of Illinois at Urbana-Champaign. He received a Master’s degree in 2011, and a Bachelor’s degree in 2009, both in Computer Science and Engineering from Bangladesh University of Engineering and Technology. Mr. Amin’s research interests lie in the fields of Social Sensing, Distributed Systems, Big Data Analytics, and Cloud Computing. He received the Chirag Foundation Graduate Fellowship in Computer Science from University of Illinois in 2011, and Travel Grant from IEEE in 2009. He is a member of IEEE and ACM.

Tarek Abdelzaher received his Ph.D. from the University of Michigan, Ann Arbor, in 1999, under Professor Kang Shin. He was an Assistant Professor at the University of Virginia from August 1999 to August 2005. He then joined the University of Illinois at Urbana Champaign as an Associate Professor with tenure, where he became Full Professor in 2011. His interests lie primarily in systems, including operating systems, networking, sensor networks, distributed systems, and embedded real-time systems. Dr. Abdelzaher is especially interested in developing theory, architectural support, and computing abstractions for predictability in software systems, motivated by the increasing software complexity and the growing sources of non-determinism. Applications range from sensor networks to large-scale server farms, and from transportation systems to medicine.

Dan Roth is a Professor in the Department of Computer Science and the Beckman Institute at the University of Illinois at Urbana-Champaign and a University of Illinois Scholar. Prof. Roth is a Fellow of the ACM, AAAI, and ACL, for his contributions to the foundations of machine learning and inference and for developing learning centered solutions for natural language processing problems. He has published broadly in machine learning, natural language processing, knowledge representation and reasoning and learning theory, and has developed advanced machine learning based tools for natural language applications that are being used widely by the research community. Prof. Roth has given keynote talks in major conferences, including AAAI, The Conference of the American Association Artificial Intelligence; EMNLP, The Conference on Empirical Methods in Natural Language Processing, and ECML & PKDD, the European Conference on Machine Learning and the Principles and Practice of Knowledge Discovery in Databases. He has also presented several tutorials in universities and conferences including at ACL and the European ACL and has won several teaching and best paper awards. Prof. Roth was the program chair of AAAI’11 and was the program chair of CoNLL’02 and of ACL’03; he has served as an area chair and senior program committee member on all major conferences in his research areas, and has been on the editorial board of several journals in his research areas. Prof. Roth is currently the Associate Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR) and will serve as Editor-in-Chief for a two-year term beginning in 2015. Prof. Roth got his B.A Summa cum laude in Mathematics from the Technion, Israel and his Ph.D. in Computer Science from Harvard University in 1995.

Lance M. Kaplan received the B.S. degree with distinction from Duke University, Durham, NC, in 1989 and the M.S. and Ph.D. degrees from the University of Southern California, Los Angeles, in 1991 and 1994, respectively, all in Electrical Engineering. From 1987–1990, Dr. Kaplan worked as a Technical Assistant at the Georgia Tech Research Institute. He held a National Science Foundation Graduate Fellowship and a USC Dean’s Merit Fellowship from 1990–1993, and worked as a Research Assistant in the Signal and Image Processing Institute at the University of Southern California from 1993–1994. Then, he worked on staff in the Reconnaissance Systems Department of the Hughes Aircraft Company from 1994–1996. From 1996–2004, he was a member of the faculty in the Department of Engineering and a senior investigator in the Center of Theoretical Studies of Physical Systems (CTSPS) at Clark Atlanta University (CAU), Atlanta, GA. Currently, he is a researcher in the Networked Sensing and Fusion branch of the U.S. Army Research Laboratory. Dr. Kaplan serves as Editor-In-Chief for the IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS (AES). In addition, he also serves on the Board of Governors of the IEEE AES Society and on the Board of Directors of the International Society of Information Fusion. He is a three time recipient of the Clark Atlanta University Electrical Engineering Instructional Excellence Award from 1999–2001. His current research interests include signal and image processing, automatic target recognition, information/data fusion, and resource management.

Stephen Tratz received B.S. degrees in computer science and applied math from the University of Idaho in 2004 and M.S. and Ph.D. degrees in computer science from the University of Southern California in 2009 and 2011, respectively. He works as a computational linguist at the Army Research Laboratory, Adelphi, MD, where he develops algorithms for dialect detection, syntactic and morphological processing, and machine translation. From 2004 to 2007, he was a research scientist at Pacific Northwest National Laboratory, Richland, WA, where he worked on machine learning, natural language processing, and knowledge representation research projects.

Jamal Laoudi received a B.S. in computer science and a B.A. in economics from the University of Maryland, College Park, MD, in 2001 and received a Certificate in Natural Language Technology from the University of Washington in 2011. Since 2002, he has worked as an ARTI contractor at the Army Research Laboratory, Adelphi, MD, serving as an Arabic language expert and computer scientist on numerous natural language technology research projects. He is fluent in English, French, Modern

Clare R. Voss joined ARL as a research scientist in 1996. She now serves as the basic research team lead for projects that are a part of ARL’s investment in human language technology. Her research
interests are in the area of computational linguistics, Standard Arabic, and Moroccan Darija.
specifically machine translation, monolingual text
processing with special attention to low-resource
and morphologically complex languages, and se-
mantically informed language understanding and Douglas Briesch received a B.S. in Information
generation. In 2009, Dr. Voss’ team was awarded the and Computer Science from the Georgia Institute
ARL Award for Science for their significant contri- of Technology in 1986. Since 2002, he has worked
butions to machine translation research. Dr. Voss provides technical expertise as a computer scientist at the Army Research Lab-
to Army Research Office programs such as Multi-disciplinary University oratory, Adelphi, MD, first as a Northrop Grumman
Research Initiatives (MURIs), as well as other government agencies, including contractor and later as a government employee. He
DARPA, NARA, NSF, and NIST. She holds a Bachelor degree in Linguistics provides software engineering expertise to various
from the University of Michigan, Ann Arbor, and a Doctor of Philosophy natural language processing projects.
degree in Computer Science from the University of Maryland, College Park.
