Marketing

This document presents a methodology for identifying trends in consumer behavior through analyzing social media, specifically Twitter. It begins with applying keyword filtering related to a topic or anticipated purchase, then uses semantic filtering and classification. The goal is to model vehicle purchase behavior using Twitter data. Previous research demonstrated Twitter can accurately reflect political trends and predict outcomes like movie revenues. Machine learning methods have also been used to generate trends from unstructured social data and identify macro events. The proposed methodology aims to extract consumer demand signals for sales forecasting and customer relationship management.

Identification of Trends in Consumer Behavior through Social Media

David Alfred Ostrowski

System Analytics
Research and Innovation Center
Ford Motor Company
[email protected]

Abstract

Social Media has frequently been leveraged for the purpose of anticipating trends. This paper presents a methodology to identify trends in consumer behavior through Semantic Filtering. The process begins by applying keyword-based filtering on a ground topic, then filters on terms related to the anticipation of a purchase. Next, semantic categories are considered to filter out messages that are inconsistent with the desired signal. Finally, Fisher classification is applied to identify consumer behavior. We apply this procedure to the goal of modeling vehicle purchase behavior with data acquired from Twitter. Results demonstrate that consumer demand can be determined within the Twitter firehose, providing a source of data to contribute to forecasting efforts as well as Customer Relationship Management (CRM).

1. INTRODUCTION

Internet-based activities have been shown to be highly reflective of real-world situations. Correspondingly, monitoring internet-based data sources has been shown to support the identification of trends across many subjects. By considering these sources from the perspective of a social network, semantics have been applied for substantial use in trend identification [1]. Many of these sources, including Social Media, have demonstrated advantages for use in market prediction because they are frequently updated and include largely unbiased results. Due to the increased popularity of these sources and their subsequent volume, they have become a form of disruptive technology, displacing traditional information media [2].

Analytics around internet-based data sources have been leveraged to support the areas of forecasting, public relations and CRM [3][4][5]. This has been supported through areas including Machine Learning and Natural Language Processing. By leveraging classification related to sentiment, internet sources have provided the means to semi-automate the engagement process with customers, thus allowing greater utility than previous forms of engagement, which include email, customer groups and surveys [6].

Due to its minimalistic characteristics and ease of data collection, we have targeted the Twitter micro-blogging service as our data source. Twitter volume has increased in recent years, presenting a formidable challenge, with over 500M active users, generating 430M tweets and handling over 1.6M search queries per day [6]. Through such large participation, Twitter has been able to support the identification of trends across general topics that affect very large populations. The challenge that exists is to extract signals that assist in the definition of consumer behavior within larger macroeconomic activity. Such signals can also support extraction of underlying topics that will support a greater understanding in marketing and customer relationships.

In this paper we investigate the application of these techniques to determine a more focused level of behavior that exists within larger trends. Specifically, we are interested in the identification of stages of consumer activity as it relates to an individual product line. Here, the focus is to identify specific consumer demand for an item for application to sales forecasting as well as CRM.

In the next section, we present a survey of related work. Section Three continues with the discussion of our proposed methodology. Section Four presents our test case and Section Five presents our conclusions.
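As a rough orientation, the staged filtering summarized in the abstract (ground-topic keywords, then purchase-anticipation terms, then a semantic check) might be sketched as below. The keyword lists are the ones reported later in the paper for the Ford Focus case study. The function names, the tokenizer and the three-token negation window are illustrative assumptions, and the negation check is a crude stand-in for the part-of-speech based rules described in Section 3, not the author's implementation.

```python
import re

# Keyword sets reported in the paper; the matching rules are assumptions.
GROUND_TRUTH = ("ford", "focus")
PURCHASE_TERMS = ("test drive", "purchase", "buy", "insurance")
NEGATION_TERMS = ("not", "never")


def tokenize(message):
    """Lowercase a message and split it into word tokens (hashtags kept)."""
    return re.findall(r"#?[a-z0-9']+", message.lower())


def contains_term(text, terms):
    """True if any term occurs in the text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)


def matches_ground_truth(message):
    """Stage 1: keep messages containing every ground-truth keyword."""
    tokens = set(tokenize(message))
    return all(word in tokens for word in GROUND_TRUTH)


def mentions_purchase(message):
    """Stage 2: keep messages containing a purchase-anticipation term."""
    return contains_term(" ".join(tokenize(message)), PURCHASE_TERMS)


def is_negated(message, window=3):
    """Stage 3 (negation only): flag a purchase term that appears within
    `window` tokens after a negation word. Hypothetical simplification."""
    tokens = tokenize(message)
    for i, token in enumerate(tokens):
        if token in NEGATION_TERMS:
            following = " ".join(tokens[i + 1:i + 1 + window])
            if contains_term(following, PURCHASE_TERMS):
                return True
    return False


def demand_candidates(messages):
    """Apply the three stages in order and return the surviving messages."""
    return [m for m in messages
            if matches_ground_truth(m) and mentions_purchase(m) and not is_negated(m)]


# Invented example messages, for illustration only.
sample = [
    "Can't wait to buy my new Ford Focus next week",
    "I will never buy a Ford Focus",
    "Need to focus on my homework",
    "Ford Focus spotted downtown today",
    "Booked a test drive for the Ford Focus!",
]
kept = demand_candidates(sample)
```

Ordering the cheap keyword tests before the semantic check mirrors the paper's progression from empirical filtering to knowledge-based filtering, and keeps the more expensive step on the smallest collection.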
2. CURRENT RESEARCH

Many techniques adapted from categorization of unstructured data have been applied to Social Media. Techniques range from rule-based and semantic evaluation to Machine Learning based methods. Being a popular source within Social Media, Twitter has been leveraged for the determination of general trends. By comparing Twitter information with associated sentiment, Tumasjan et al. were able to determine that Twitter messages could provide an accurate portrayal of the political landscape [6]. By improving the quality of a Twitter document collection and incorporating sentiment analysis, Sang and Bos supported predictions based on entity counts that matched the performance of traditionally obtained opinion polls [7]. Bermingham and Smeaton demonstrated predictivity in social analytics through the use of both volume-based measurements and sentiment analysis, as explored over a variety of sample sizes, time periods and quantitative methods [8].

Stewart et al. examined aggregate daily Twitter keyword volumes to predict aggregate current spending. They demonstrated that weekday Twitter keyword volume, current spending, and weekday spending norms all have significant value, allowing for prediction of short-term consumer spending [9]. Within the entertainment industry, Asur et al. leveraged social media to predict real-world outcomes by constructing a linear regression model that forecasts box-office revenues for movies, outperforming other market indicators such as the Hollywood Stock Exchange [10]. Gruhl et al. showed how to automatically generate queries for mining blogs in order to predict spikes in book sales [11]. Huberman et al. studied the social interactions on Twitter to reveal that the driving process for usage is a sparse hidden network underlying the friends and followers, while most of the links represent meaningless interactions [12].

Researchers have also been successful in the past in identifying and tracking macro-level events, including seasonal illnesses, through Social Media. Signorini et al. used Twitter to identify and track levels of disease activity and public concern in the U.S. during the Influenza A H1N1 pandemic [13]. This was accomplished by filtering on keywords containing disease names as well as public concern keywords, with the results showing that Twitter can be used as a measure of public interest or concern around health related events. Chew et al. collected Twitter messages containing keywords related to the H1N1 virus during the 2009 H1N1 outbreak and thus validated Twitter as a tracking system for public attention [14]. Doan et al. developed a novel algorithm for the filtering of datasets by application of a semantic feature set including consideration of negation, hashtags, emoticons, humor and geography, resulting in an improvement over earlier taxonomy-based approaches to semantic filtering [15].

Work on Twitter also involves the evaluation of trends as they relate to social networks. Asur et al. identified that relationships between members of a social network play a substantial role in creating trends [16]. They also determined that trends are based on their influences, with the passivity of users predicated on their information forwarding activity. Their study also showed that the correlation between popularity and influence was weaker than might be expected.

A number of methods have leveraged machine classification in order to generate trends among unstructured and social-network based data collections. Among lexical-based approaches, Xia et al. analyzed attribute sentiment collocations to approximate reasonable generalization abilities [17]. Working with lingual analysis as well as classification and clustering methods, Shandilya and Jain worked towards the extraction of knowledge utilizing a hybrid approach [18]. More sophisticated multi-pass efforts have been developed recently, an example being Xu and Kit, who examined opinions at different levels (coarse/fine) of a document [19].

Wang and McCallum relied on Latent Dirichlet Allocation (LDA), in which a topic was associated with a continuous distribution over timestamps and, for each generated document, the mixture distribution over topics was influenced by both word co-occurrences and the document’s timestamp. This was leveraged to interpret trends from email, research papers and State of the Union addresses [20]. Padman and Airoldi proposed an approach to extract sentiments from unstructured text by applying a two-stage Bayesian algorithm that is able to capture the dependencies among words and at the same time find a vocabulary that is efficient for the purpose of mining algorithms [21]. Other approaches for trend identification include the application of centrality calculations within a Social Network as well as search engine algorithms such as PageRank, as employed by Corley et al., who were able to determine that influenza-related blogging trends have a significant correlation to the US Fall 2008 flu season [22]. They were also able to identify WSM influenza-related communities that share flu postings which could
broker or disseminate information in the case of a severe outbreak or influenza epidemic.

3. METHODOLOGY

We begin the overview of our methodology by considering an internet-based data resource (Twitter) as our starting point. We are interested in a demand signal for an individual product for the application of sales forecasting and trend prediction. To this end we consider the theoretical journey made by consumers, noted as the Awareness, Interest, Desire and Action (AIDA) model, also known as the ‘Purchasing Funnel’ [23]. With this in mind, our attributes are chosen to match the consumer emotion of desire, in anticipation of the purchase event in the short term (within a month). As noted in Figure 1, our methodology employs two separate levels of direct filtering (empirical and knowledge) combined with a classifier to support the extraction of a signal.

Our process starts with the consideration of single ground truth keywords that are applied to the raw data collection (Twitter Firehose). These are designated as words for unique identification of a product. Here, a tradeoff is presented between containment of the maximum amount of potential information and precision of matching the given subject.

Figure 1. Overall methodology

In our second stage, we are interested in keywords that are directly related to the anticipation of a purchase event. Not having found a suitable ontology or taxonomy, we created a corpus by taking high-frequency words from messages that were most likely to indicate the purchase of a product. This was accomplished by hand selecting Twitter messages indicating a high level of desire among consumers for a specific product. Employing a “try and test” method we were able to derive a subset of words that indicated a collection of Twitter messages whose volumes were able to correlate to sales. Here, we applied keywords indicating such terms, including “test drive”, “purchase”, “buy” and “insurance”, to our subject matter data collection.

Next, we examined three specific semantic categories to apply our filtering. These included negation, hashtags and humor. Negation was examined first. In this step, we applied a natural language processing approach to break down the “demand based” indicators between positive and negative demand through the application of the Python Natural Language Toolkit (NLTK). This process examined basic part-of-speech (POS) taggings in each sentence and applied a set of regular expressions to support identification of both direct and indirect relationships between words supporting negation (including ‘not’ and ‘never’) and the ground truth keywords.

Hashtags were next considered as a means of filtering content. A hashtag is considered as a token beginning with the symbol ‘#’. As such it is considered as a community-driven comment for the addition of context and metadata to tweets. Here, a taxonomy of hashtags was developed that were both related and unrelated to the topic of demand. Any topics that were unrelated to the Twitter data were eliminated. The related topics could be used to consider the weighting of the messages. Humor was also considered at the word level, with specific terms associated with common jokes or humor being eliminated.

For the classification step, a training set of documents was manually devised from samples of Twitter messages that indicated a high level of desire among users for a product. To support semantics identification we employed the Fisher classification method. This approach allowed for a higher level of flexibility in the identification of semantic combinations, as opposed to the standard naïve Bayesian algorithm, by supporting a normalized method of classification. To perform such normalization the method relies on three calculations:

$$ clf = \Pr(\text{feature} \mid \text{category}) $$

$$ freqsum = \sum_{i=1}^{n} \Pr(\text{feature} \mid \text{category}_i) $$

$$ cprob = clf / (clf + nclf) $$

The resulting cprob is determined as a conditional probability that a document fits into a category, given a particular feature. This is in contrast to the Bayesian approach
of considering the number of documents with features divided by the total number of documents (with that feature). By approximating at the feature-to-category level, it accounts for receiving far more documents in one category than another. To normalize, the probability is divided by the frequency sum. The Fisher method continues by multiplying all the probabilities together, taking the natural log and applying the inverse chi-square function to obtain a probability. Figure 2 provides a detailed overview of the three-step filtering process.

Figure 2. Filtering Level Detail

4. IMPLEMENTATION

The goal of our case study was to characterize the demand for the Ford Focus sampled from the Twitter firehose for the time period of October 1, 2011 to September 30, 2012. The data collections were considered on a monthly basis. Following our model of the purchasing funnel, the focus was to characterize the anticipation immediately before the purchase event. Our empirical filtering begins with the determination of ground truth keywords, designated as any message that contains both the words “ford” and “focus”, supporting over a 99% precision rate. While these words eliminated messages which included statements such as “focus” without the inclusion of “ford”, precision fell to under 20% when “ford” was omitted, making it necessary to utilize the keyword combination. The volumes of such filtering are presented in Figure 3, generating a .26 correlation to vehicle sales.

The second stage of our empirical filtering consisted of the establishment of a manual taxonomy from a sampled collection of 4000 Twitter messages (filtered according to the initial ground truth keywords) that were associated strongly with the anticipation of the purchase event. This included the keywords buy, purchase, love and insurance. The highest performing set of these keywords is presented in Figure 4, generating a correlation to Ford Focus sales, as presented in Figure 6, of .744.

Figure 3. Ground Truth Keyword Volumes (Ford Fusion) for the period Oct-11 to Sept-12

Next, semantic categories were applied, with each category supporting successive filtering. The largest proportion of these categories was humor, followed by hashtags and negation. The application of the semantic categories such as humor included matching against keywords known to be associated with humorous messages, including “Like Ford I got Focus”, along with references between the words “focus” and “attention”. The results are presented in the semantic category section in Figure 5. This resulted in only a marginal increase in correlation (.7575), as few messages at this point were filtered.

The classification step was trained (positively) from a manually derived collection of messages indicating demand and also trained (negatively) against messages that indicated advertising for new or used vehicles. This resulted in a minimal reduction of the overall collection, increasing the correlation to .76 (Figures 3 and 5).

Figure 4. Generated message volumes for Keyword, Semantic Category and Classifier-enabled filtering for period Oct-11 to Sep-12.

To support our work in forecasting we compared these results to Google Trends (another source of web-based information utilized for forecasting and indication of trends) for the same period. We examined Google Trends searches for the terms “ford focus”, “ford” and “ford dealers”, under the United
States and Vehicle Sales category. The correlation for our data sample fell short of the filtered data at .71, .59 and .42 respectively, with a correlation of only .46, .37 and .32 to our best filtered collection (the final stage).

Figure 5. Comparison of Ford Fusion Sales volume to Filtered Messages Volume for period (Oct-11 to Sep-12)

Figure 6. Correlation of levels of filtering to Sales Volumes

Figure 7. Correlation of Google Trends to Sales and Semantically Filtered Data.

5. CONCLUSION

We have proposed a methodology for the monitoring of trends at an individual product level. Our process considers the application of unstructured data through filtering to determine a demand indicator to reflect consumer behavior. The results indicate that a strong demand signal can be generated for application in forecasting and CRM. Compared to another popular web-based indicator (Google Trends) our model performs well, providing higher correlation. Our filtered signal also maintains a low correlation to this attribute (.46), indicating that it could perform well as a complementary parameter. While our core strategy involves the application of Twitter, this methodology could be extended across any number of social media platforms as well as collections of unstructured data.

Future work includes additional methods of comparison between keyword sets, with the possible inclusion of Machine Learning technologies to assist in the determination of optimal or near-optimal combinations. The weighting of keywords in filtering as well as expanded training sets in classifiers can also assist in supporting a stronger signal that will enable the performance of a forecasting model.

6. REFERENCES

[1] Gloor, P.A.; Krauss, J.; Nann, S.; Fischbach, K.; Schoder, D.; "Web Science 2.0: Identifying Trends through Semantic Social Network Analysis", Computational Science and Engineering, 2009. CSE '09. International Conference on, Volume 4, pp. 215-222
[2] Kwak, Haewoon, Changhyun Lee, Hosung Park, and Sue Moon. "What is Twitter, a social network or a news media?." In Proceedings of the 19th international conference on World wide web, pp. 591-600. ACM, 2010.
[3] Ostrowski, D. A., "Predictive Semantic Social Media Analysis", IEEE International Conference on Semantic Computing, ICSC, 2011
[4] Baird, Carolyn Heller, and Gautam Parasnis. "From social media to social customer relationship management." Strategy & leadership 39, no. 5 (2011): 30-37.
[5] Ang, Lawrence. "Community relationship management and social media." Journal of Database Marketing & Customer Strategy Management 18, no. 1 (2011): 31-38.
[6] Tumasjan, Andranik, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. "Predicting elections with twitter: What 140 characters reveal about political sentiment." In Proceedings of the fourth international aaai conference on weblogs and social media, pp. 178-185. 2010.
[7] Sang, Erik Tjong Kim, and Johan Bos. "Predicting the 2011 dutch senate election results with twitter." EACL 2012 (2012): 53.
[8] Bermingham, Adam, and Alan F. Smeaton. "On using Twitter to monitor political sentiment and predict election results." (2011).
[9] Stewart, Justin, Strong, Homer, Parker, Jeffrey, Bedau, Mark A., Twitter keyword volume, current spending, and weekday spending norms, predict consumer spending., 2012 IEEE 12th International Conference on Data
[10] Asur, Sitaram, and Bernardo A. Huberman. "Predicting the future with social media." arXiv preprint arXiv:1003.5699 (2010).
[11] Daniel Gruhl, R. Guha, Ravi Kumar, Jasmine Novak and Andrew Tomkins, The predictive power of online chatter. SIGKDD Conference on Knowledge Discovery and Data Mining, 2005
[12] Bernardo A. Huberman, Daniel M. Romero, and Fang Wu, Social networks that matter: Twitter under the microscope. First Monday, 14(1), Jan 2009
[13] A. Signorini, A. M. Segre, and P. M. Polgreen, “The use of Twitter to Track Levels of Disease Activity and
Public Concern in the U.S. during the Influenza A H1N1 Pandemic”, PLoS ONE, vol. 6, no. 5, p. e19467, 05 2011.
[14] Chew and Eysenbach, “Pandemics in the age of twitter: content analysis of tweets during the 2009 H1N1 outbreak”, PLoS ONE, vol. 5, no. 11, p. e14118, 11 2010
[15] Doan, Son, Ohno-Machado, Lucila, Collier, Nigel, Enhancing Twitter Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses, IEEE 2nd Conference on Healthcare Informatics, Imaging and Systems Biology, 2012
[16] Asur, Sitaram, and Bernardo A. Huberman. "Predicting the future with social media." arXiv preprint arXiv:1003.5699 (2010).
[17] Yun-Qing Xia; Rui-Feng Xu; Kam-Fai Wong; Fang Zheng; The Unified Collocation Framework for Opinion Mining, Machine Learning and Cybernetics, Hong Kong, Volume 2, 19-22 Aug. 2007, pp. 844-850
[18] Shandilya, S.K.; Jain, S.; Automatic Opinion Extraction from Web Documents, Computer and Automation Engineering, 2009. ICCAE '09. International Conference on, 8-10 March 2009, pp. 351-355
[19] Ruifeng Xu; Chunyu Kit; Coarse-fine opinion mining, Machine Learning and Cybernetics, 2009 International Conference on, Volume 6, 12-15 July 2009, pp. 3469-3474
[20] Wang, Xuerui, and Andrew McCallum. "Topics over time: a non-Markov continuous-time model of topical trends." In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424-433. ACM, 2006.
[21] Airoldi, Edoardo, Xue Bai, and Rema Padman. "Markov blankets and meta-heuristics search: Sentiment extraction from unstructured texts." Advances in Web Mining and Web Usage Analysis (2006): 167-187.
[22] Corley, C., Armin R. Mikler, Karan P. Singh, and Diane J. Cook. "Monitoring influenza trends through mining social media." In International Conference on Bioinformatics & Computational Biology, pp. 340-346. 2009.
[23] Barry, Thomas, 1987, The Development of the Hierarchy of Effects: An interesting Historical Perspective, Current Issues and Research in Advertising, 251-295
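
As a supplementary illustration of the Fisher classification step described in Section 3, the three calculations (clf, the frequency sum and cprob) and the final inverse chi-square combination might be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the class and function names are hypothetical, the training messages are invented, the smoothing of cprob toward an assumed prior of 0.5 and the factor of -2 applied to the log sum follow a common textbook formulation of the Fisher method and are not stated in the paper, and features are simply distinct lowercase words.

```python
import math


def inv_chi2(chi, df):
    """Inverse chi-square function for an even number of degrees of freedom."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


class FisherClassifier:
    """Toy Fisher-method text classifier (illustrative only)."""

    def __init__(self, weight=1.0, assumed_prob=0.5):
        self.feature_counts = {}   # (feature, category) -> document count
        self.category_counts = {}  # category -> number of training documents
        self.weight = weight              # assumption: strength of the prior
        self.assumed_prob = assumed_prob  # assumption: prior for rare features

    @staticmethod
    def features(text):
        """Treat each distinct lowercase word as one feature."""
        return set(text.lower().split())

    def train(self, text, category):
        for feature in self.features(text):
            key = (feature, category)
            self.feature_counts[key] = self.feature_counts.get(key, 0) + 1
        self.category_counts[category] = self.category_counts.get(category, 0) + 1

    def clf(self, feature, category):
        """clf = Pr(feature | category)."""
        docs = self.category_counts.get(category, 0)
        if docs == 0:
            return 0.0
        return self.feature_counts.get((feature, category), 0) / docs

    def cprob(self, feature, category):
        """cprob = clf / freqsum, smoothed toward the assumed prior."""
        clf = self.clf(feature, category)
        freqsum = sum(self.clf(feature, c) for c in self.category_counts)
        basic = clf / freqsum if freqsum > 0 else 0.0
        seen = sum(self.feature_counts.get((feature, c), 0)
                   for c in self.category_counts)
        return (self.weight * self.assumed_prob + seen * basic) / (self.weight + seen)

    def fisher_prob(self, text, category):
        """Combine per-feature cprob values with the inverse chi-square function."""
        feats = self.features(text)
        log_sum = sum(math.log(self.cprob(f, category)) for f in feats)
        return inv_chi2(-2.0 * log_sum, 2 * len(feats))


# Invented training messages: demand (positive) versus advertising (negative).
model = FisherClassifier()
for text in ("want to buy a ford focus",
             "cannot wait to test drive the focus",
             "saving up to buy my focus"):
    model.train(text, "demand")
for text in ("new ford focus for sale low price",
             "used ford focus for sale call now",
             "ford focus sale financing available"):
    model.train(text, "ad")

p_demand = model.fisher_prob("i want to buy a focus", "demand")
p_ad = model.fisher_prob("i want to buy a focus", "ad")
```

Dividing clf by the frequency sum is what keeps a category with many more training documents from dominating, which is the normalization the paper contrasts with the naïve Bayesian approach.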