http://dx.doi.org/10.5772/58895
1. Introduction
The popularity of social networking has allowed access to staggering amounts of unique data,
which has created new possibilities for data analysis and exploitation. Such data has proven
useful in marketing, decision making, destabilizing terrorist networks, behavior evolution,
and determining future social trends [1].
Increased usage of social networking sites has also been observed during events related to
natural disasters; significant political, sporting, and social events; and other crises. Twitter
users provide status updates through tweets. Because tweets are short, limited to 140 characters,
they may need additional analysis to provide contextual clues, and applying traditional
natural language processing algorithms to such data is challenging. Before this data can be
used, it must first be extracted, processed, and analyzed appropriately using suitable
algorithms and theories.
According to a 2011 study from the International Data Corporation (IDC), a market research
firm specializing in information technology and other consumer technologies, unstructured
data (that is, data that does not have a pre-defined data model or is not organized in a
pre-defined manner) is growing at a faster rate than structured data. Within the next decade,
unstructured data will account for 90 percent of all data created.
A large driving factor in the increase in unstructured data is social networking data, such as
tweets. It is estimated that more than 80 percent of all potentially useful data is unstructured [2].
The success of businesses in the coming decade will likely rely on their ability to successfully
analyze data from social networks. To that end, this chapter's objectives are to:
• Use existing data mining and statistical tools within MATLAB to conduct analysis on social
networking site data.
• Discuss potential cost savings and document approaches for implementing social networking site data analysis via the cloud through vendors such as Amazon Web Services (AWS).
• Provide a tutorial on the use of MATLAB tools to analyze unstructured data with an
emphasis on social networking data.
• Provide an overview and example on the use of MATLAB in a commercial cloud environment, such as AWS.
2. Previous work
Data mining has also proven useful on Twitter, with much research directed at detecting
trends and bursty keywords. Twitter data has some peculiarities that must be accounted for:
many people tweet from smartphones, where typing mistakes are common, so misspellings must
be detected and merged into a single variable rather than being separated and lost. In
addition, some phrases are commonly abbreviated, such as "NYC" for "New York City," or
reduced, such as "Vegas" for "Las Vegas." One study, using algorithms that account for these
abnormalities, identified trending and bursty keywords on Twitter with 92% accuracy when
compared with Google Trends [3].
Karandikar [4] described an approach for determining the most suitable topic model for
clustering tweets by analyzing how topic model parameters, such as training data size and
type and the number of topics, affect clustering performance. Using this approach, the model
clustered tweets with greater than 64 percent accuracy for the two specific events studied.
Perera [5] developed a software architecture using a Twitter application program interface
(API) to collect tweets sent to specific users. The extracted data were processed to characterize
the inter-arrival times between tweets and the number of retweets. Analysis revealed that the
arrival process of new tweets to a user can be modeled as a Poisson process while the number
of retweets follows a geometric distribution.
Turner and Malleson [6] presented an exploratory geographical analysis of a sample of Twitter
post data from June 2011 to March 2012 in the city of Leeds, England. Geographical cluster
detection methods were used in combination with text mining techniques to identify patterns.
Preliminary results suggested that the data could be used as a means of exploring daily spatial-
temporal behavior.
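3. MATLAB tools and cloud resources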
The following presents insight into the MATLAB toolboxes that can be used to detect
anomalies, that is, deviations from the normal pattern of life.
MATLAB’s interactive environment allows end users to quickly prototype their own
algorithms if existing toolboxes do not provide the desired functionality. A well-established
community of users supports the exchange of algorithms for those wishing to share their
research and prototypes.
This research took advantage of the MATLAB community and the existing toolboxes. From
the file exchange, Vladimir Bondarenko’s contributed class for the Twitter REST API v1.1
was used to interface with Twitter to search for and extract tweets for specific topics and
geographic regions [8]. The existing MATLAB Statistics and Neural Network Toolboxes were
also used in this research to provide statistical algorithms and unsupervised learning methods,
such as cluster analysis, for exploring the data to discover hidden patterns and groupings.
Twitty is a useful interface that runs within MATLAB for communicating with Twitter.
Methods used in Twitty are essentially wrapper functions that call the Twitter API [9]. The
API caller function, callTwitterAPI(), does the main work.
Key steps to successfully using the Twitter API include obtaining Twitter credentials and using
JavaScript Object Notation (JSON) parsers. The MATLAB file exchange provides a JSON parser
developed by Joel Feenstra [10]. The JSON parser parses a JSON string and returns a MATLAB
cell array with the parsed data. JSON objects are converted to structures and JSON arrays are
converted to cell arrays. Twitter credentials are easily created by registering at the Twitter site,
creating an application, and retrieving consumer and access keys. These keys are required for
running Twitty and include specific values for:
• ConsumerKey
• ConsumerSecret
• AccessToken
• AccessTokenSecret
To use these keys in a MATLAB script, users assign the values to a credentials structure.
A twitty instance can then be created and its methods, such as search, called as shown in
Figure 1.
% Create credentials
credentials.ConsumerKey = 'YourConsumerKey';
credentials.ConsumerSecret = 'YourConsumerSecret';
credentials.AccessToken = 'YourAccessToken';
credentials.AccessTokenSecret = 'YourAccessTokenSecret';
% Create a twitty instance
tw = twitty(credentials);
% Search for World Series related tweets
tw.search('World Series');
Twitty provides a number of useful methods for interacting with the Twitter API. Example
calls, along with brief descriptions, are shown in Table 1. Additional methods and detailed
descriptions can be found by typing twitty.API at the MATLAB prompt.
The Statistics Toolbox™ provides statistical and machine learning algorithms and tools for
organizing, analyzing, and modeling data. Key features include data organization and
management via data set arrays and categorical arrays, exploratory data analysis with the use
of interactive graphics and multivariate statistics, and regression or classification for predictive
modeling [11]. The toolbox also includes functions for hypothesis testing, including checks
for autocorrelation and randomness, as well as t-tests, one-sample tests, and distribution
tests such as chi-square and Kolmogorov-Smirnov.
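As a hedged sketch, two such tests applied to inter-tweet times; the vectors deltaNFL and deltaWS are hypothetical names for the times between tweets for two search queries:
% One-sample Kolmogorov-Smirnov test against a standard normal
% (the data are standardized first with zscore)
[h_ks, p_ks] = kstest(zscore(deltaNFL));
% Two-sample t-test comparing mean inter-tweet times of the two queries
[h_t, p_t] = ttest2(deltaNFL, deltaWS);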
Clustering algorithms available within MATLAB’s Statistical Toolbox include k-means and
hierarchical approaches. Clustering algorithms are particularly useful for analyzing social
networking data as they help identify natural groupings that can then be further analyzed to
determine similarities or differences and make business, marketing, or other decisions.
A k-means clustering algorithm forms k clusters, each defined by its centroid, or center. The
algorithm works by moving data between clusters until the sum of the distances between each
member and its cluster centroid is minimized.
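A minimal sketch of such a call, assuming a numeric feature matrix meas with one row per observation (meas and the choice k = 3 are illustrative assumptions):
k = 3;
[cidx, C, sumd] = kmeans(meas, k, 'dist', 'sqeuclidean', 'Replicates', 5);
% cidx : cluster index assigned to each row of meas
% C    : k-by-p matrix of cluster centroids
% sumd : within-cluster sums of point-to-centroid distances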
MATLAB allows different distance measures to be selected. Table 2 lists the available distance
measures and a brief description and example call.
Visually displaying the results for multidimensional data can be challenging. However, the
silhouette plot, available within the MATLAB Statistics Toolbox, displays a measure between
-1 and 1 representing how close each point in one cluster is to points in the neighboring
clusters. Values close to 1 indicate points that are distant from neighboring clusters, values
close to 0 indicate points that are not distinctly closer to one cluster than another, and
negative values indicate points that are most likely assigned to the wrong cluster.
In the following example, cidx2 represents the cluster index for each given sample of data.
Assuming cidx2 was returned from the call to the k-means function, an example call of the
silhouette function is:
silhouette(meas,cidx2,'sqeuclidean');
Hierarchical clustering groups data into a tree structure consisting of multiple levels. Users
can prune parts of the tree depending upon the application and level of detail required. The
links between data are represented as upside-down U-shaped lines, with the height of each
line indicating the distance between the linked data. The height at which two objects are
first joined is known as the cophenetic distance between them, and MATLAB provides a function
to measure how well the tree preserves these distances.
The closer the value of the cophenetic correlation coefficient is to 1, the more accurately the
clustering solution reflects your data.
As with k-means clustering, a distance measure must be selected to generate the linkages of
the tree. Figure 2 shows code that creates a tree based on the Euclidean distance for a matrix
named ‘meas’, displays the result as a dendrogram, and then calculates the cophenetic
correlation.
eucD = pdist(meas,'euclidean');              % pairwise Euclidean distances
clustTreeEuc = linkage(eucD,'average');      % build the tree using average linkage
[h,nodes] = dendrogram(clustTreeEuc,0);      % plot the full dendrogram (all leaves)
measCophenet = cophenet(clustTreeEuc,eucD);  % cophenetic correlation coefficient
The Neural Network Toolbox™ provides functions and apps for modeling complex nonlinear
systems that are not easily modeled with a closed-form equation. Primary features include
data fitting, clustering, and pattern recognition to forecast future events; both supervised and
unsupervised network architectures; and training algorithms, such as gradient descent and
conjugate gradient methods, that help automatically adjust the network’s weights and biases
[12]. Also included are preprocessing and post-processing functions that improve the
efficiency of neural network training and enable detailed analysis of network performance by
reducing the dimensions of the input vectors using principal component analysis and by
normalizing the mean and standard deviation of the training set.
MATLAB’s Neural Network Toolbox provides self-organizing maps for both unsupervised
and supervised clustering. Self-organizing maps retain topological information for similar
classes and provide reasonable classifiers. Self-organizing maps can be used for
multidimensional data with complex features, making them attractive for analysis of social networking
data. The self-organizing maps functionality within MATLAB can be used to cluster tweets
based on the Twitter-extracted fields.
Self-organizing maps learn to classify input vectors according to how they are grouped in the
input space. The neurons in a self-organizing map can be arranged based on specific topologies.
Within MATLAB, topologies for gridtop, hextop and random are possible. Results are not
affected by the choice of topology used; however, visual inspection of the data space is best
achieved by use of a hexagonal grid (hextop) [13]. A gridtop topology has neurons evenly
spaced in a matrix of specified dimensions. For example, an 80-neuron self-organizing map may
be arranged in an 8 × 10 gridtop topology. A hextop topology is similar, but the shape of the
grid can best be described as a hexagon. A random topology has the neurons randomly located.
Figure 3 illustrates gridtop, hextop and random topologies for a 3 × 4 set of neurons.
The self-organizing map routine can be run in MATLAB by calling the selforgmap function.
Once a map is set up, it can be trained using specific input, and the results can be displayed
visually. Figure 4 shows a code example to generate a 3 × 4 neuron self-organizing map,
conduct training, and display the results.
Figure 4. MATLAB code example to generate a 3 × 4 neuron self-organizing map, conduct training, and display the
results.
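A minimal sketch of such a script, assuming the tweet features have been assembled into a feature-by-sample matrix X (X and the default parameter choices are assumptions):
% Create a 3-by-4 neuron self-organizing map (default hextop topology)
net = selforgmap([3 4]);
% Train the map on the input data
[net, tr] = train(net, X);
% Assign each sample to its winning neuron and plot the distribution
y = net(X);
plotsomhits(net, X);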
The Neural Network Toolbox also contains nnstart, an interactive tool for loading data and
selecting algorithms and training iterations. As shown in Figure 5, the interactive tool allows
the user to select run options and then train for a specific number of epochs.
The addition of MATLAB’s Parallel Computing Toolbox™ allows the user to speed up
training and handle large data sets by distributing computations and data across multiple
processors and graphics processing units (GPUs). The ability to spin up multiple central
processing units (CPUs) and GPUs from cloud services such as AWS makes this toolbox
attractive, although the user may be limited by the availability of MATLAB licenses for this
option. MathWorks and AWS are currently working out a process to make this more feasible
with a pay-as-you-go license feature that could be quite attractive for businesses that do not
have the resources to purchase hardware and software.
Cloud service providers such as AWS can significantly reduce the startup and maintenance
costs for the high-performance computers needed to run complex math, statistics, and
optimization algorithms efficiently. AWS has free and paid tier options that will fit most
budgets. Previous research has shown that the cost of running multiple high-performance
servers for many hours was just a few dollars [14]. As long as users have a valid MATLAB
license, they can install MATLAB on the AWS machine(s) and run the data collection and
analysis. Once the runs are complete, users can deactivate the MATLAB license on the AWS
machine. If users need to run the experiment again, they can spin up the server and reactivate
the license to continue the experiment.
4. Experiment
The experiment consisted of running MATLAB code on commercial cloud architecture (AWS)
to collect publicly available tweets containing common keywords, such as "NFL" and "World
Series," during specific time intervals of a weekend, over a large geographic region, while
both NFL games and World Series games were being played. Various MATLAB tools and
functions were used to gather statistical information and analyze the data using unsupervised
learning and clustering methods.
For this experiment, a server running the Windows Server 2008 operating system was spun
up in the AWS free tier. A MATLAB license was installed that included the Statistics and
Neural Network Toolboxes. The twitty.m and parse_json.m files were also uploaded to allow
MATLAB to call the Twitter API.
After successful installation, a MATLAB script file (m file) was created to provide the proper
Twitter credentials, and then search for tweets related to two different sports events, the World
Series and NFL games. The specific search strings used included “NFL” and “World Series”.
Four different 3-hour data collections were made over a period of two days. Collection times
included Saturday from 5 PM to 8 PM and 8 PM to 11 PM and Sunday from 1 PM to 4 PM and
5 PM to 8 PM during the weekend of the 2013 World Series. During this time frame, a World
Series event took place on both Saturday and Sunday evenings, and a series of NFL games
occurred on Sunday afternoon and evening.
A large geographic collection area, defined by a 1700-mile radius centered on Topeka, Kansas,
was used for this experiment. This radius covered most of the continental United States and
parts of Canada and Mexico. The search parameters included the Twitter entities that store
user, URL, and hashtag mentions. Sampling occurred every 60 seconds over each 3-hour
window, with Twitter ID information being used to eliminate duplicate tweets within the
time frame. A sample call for the search is as follows:
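(The call below is a reconstruction consistent with this description; the Topeka coordinates and exact parameter values are assumptions.)
% Search for up to 20 "NFL" tweets within 1700 miles of Topeka, KS,
% newer than LastID, with entities included
S = tw.search('NFL', 'count', 20, 'geocode', '39.04,-95.68,1700mi', ...
    'include_entities', 'true', 'since_id', LastID);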
In this call, Twitter is searched for up to 20 tweets that include the text “NFL” and
originated within 1700 miles of Topeka, Kansas. The LastID variable was updated after each
iteration to gather only recent tweets beyond a specific ID. Appendix A shows the MATLAB
m file that was run for the experiment.
Statistics and analysis were run at the end of the data collection using existing MATLAB tools
and functions provided in the Statistics and Neural Network Toolboxes. Within the Statistics
Toolbox, the k-means and hierarchical clustering approaches and associated visual displays
were used.
The distance measure used for all k-means calculations was the default selection of squared
Euclidean. An example call for three clusters with the squared Euclidean distance is as
follows:
cidx2 = kmeans(meas,3,'dist','sqeuclidean');
The self-organizing map functions within the Neural Network Toolbox were used to cluster
the collected Twitter data.
In addition to the existing Statistics and Neural Network functions, some simple data
manipulation algorithms were used to extract the time between tweets. The code example in
Figure 6 determines and stores the time between consecutive tweets found in the MATLAB
cell array named DatesQ2.
for i = 1:length(DatesQ2)-1
    time1 = datevec(DatesQ2{i});    % consecutive timestamps converted to date
    time2 = datevec(DatesQ2{i+1});  % vectors (assumes datevec-readable strings)
    deltaWorldSeries(i) = abs(etime(time1,time2));  % elapsed seconds
end
In this code, the time1 and time2 values are constructed from consecutive entries of the
Twitter collection and converted to MATLAB time structures. The dates are then differenced
using MATLAB’s built-in elapsed time function (etime). This process calculates the time delta
for all consecutive tweets for all search queries.
Appendix B shows the Data analysis m file used for this experiment.
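5. Results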
Results were provided using MATLAB’s visual and plotting functions. To better understand
the frequency, or popularity, of the two search terms for each time period during the
experiment, some baseline descriptive statistics were calculated.
5.1. Boxplots
Boxplots are a convenient way to visualize the interquartile range, median, and outlier data.
Figures 7–10 present boxplots for the 3-hour experiment windows from Saturday 5 PM to 8
PM, Saturday 8 PM to 11 PM, Sunday 1 PM to 4 PM, and Sunday 5 PM to 8 PM, respectively.
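A sketch of such a boxplot call, with hypothetical vectors deltaNFL and deltaWS holding the times between tweets for the two queries:
times = [deltaNFL(:); deltaWS(:)];
groups = [repmat({'NFL'}, numel(deltaNFL), 1); ...
          repmat({'World Series'}, numel(deltaWS), 1)];
boxplot(times, groups);                 % one box per search query
ylabel('Time between tweets (s)');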
Figure 7. Boxplot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 8. Boxplot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 9. Boxplot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 10. Boxplot for NFL versus World Series time between tweets, Sunday, 5 PM to 8 PM.
Reviewing the boxplots in Figures 7–10 reveals a couple of interesting trends. In most cases,
outliers beyond the interquartile range exist; moreover, as a sporting event gets closer to its
start time, the frequency of tweets increases. Note the narrowing of the boxes for both the NFL
and World Series plots during or approaching the actual game events. By late Sunday
afternoon, there was significant activity in both the NFL and World Series tweets in terms of
frequency. The descriptive statistics were very similar, making it difficult to see any difference
in tweet frequency between NFL and World Series fans.
5.2. XY plots
MATLAB was also used to plot the time between tweets for each collection period and search
query. This view, although somewhat cluttered, can be used to quickly compare the number
of tweets and spot any other patterns obvious in an xy plot. The time between tweets is plotted
in Figures 11–14 for each of the four collection periods. The most revealing information from
these plots is the noticeable increase in the number of tweets as the events draw closer. On
Saturday, the number of tweets related to the World Series was higher than the number of
NFL tweets. This seems reasonable, since a World Series game took place on Saturday but no
NFL games occurred until Sunday.
On Sunday, with multiple NFL games taking place, the number of NFL tweets was larger than
the number of World Series-related tweets. The number and frequency of World Series tweets
remained high due to the Sunday evening World Series game, but the activity was not as
significant as for the NFL-related tweets.
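A sketch of such a plot, using the same hypothetical vectors:
plot(deltaNFL, 'bx'); hold on;          % NFL inter-tweet times
plot(deltaWS, 'ro'); hold off;          % World Series inter-tweet times
legend('NFL', 'World Series');
xlabel('Tweet index'); ylabel('Time between tweets (s)');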
Figure 11. XY plot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 12. XY plot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 13. XY plot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 14. XY plot for NFL versus World Series time between tweets, Sunday 5 PM to 8 PM.
5.3. Histograms
The histogram functionality within MATLAB was used to get a better idea of the quantity and
length of the times between tweets for each of the collection periods. As shown in Figures 15
and 16, on Saturday the number of tweets with time difference values of less than 10 seconds
was much greater for World Series than for NFL tweets. In all histograms, a handful of tweets
had a time between tweets of greater than 100 seconds. The shift to NFL interest is very visible
on Sunday in the histograms shown in Figures 17 and 18. Note the significant increase in the
number of NFL tweets occurring at intervals of less than 10 seconds. In Figure 18, however,
the histograms for Sunday late afternoon and evening reveal similar results for the time
between tweets for both NFL and World Series texts.
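A sketch of such a histogram comparison, with bin centers chosen purely for illustration:
centers = 5:10:195;                     % 10-second buckets
subplot(1,2,1); hist(deltaNFL, centers); title('NFL');
xlabel('Time between tweets (s)'); ylabel('Count');
subplot(1,2,2); hist(deltaWS, centers); title('World Series');
xlabel('Time between tweets (s)'); ylabel('Count');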
Figure 15. Histogram plot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 16. Histogram plot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 17. Histogram plot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 18. Histogram plot for NFL versus World Series time between tweets, Sunday 5 PM to 8 PM.
Twitter queries provide additional data beyond the time and text of the tweet. For example,
the number of friends, number of followers, user screen name, date of account creation, and
time zone of the user are also available and easily extracted from the queries. This information
can be used to determine whether any significant differences or similarities exist among
Twitter users.
Within MATLAB, scatter plots can be used to visually identify clusters of data. Figures 19–22
illustrate the use of scatter plots to compare the number of friends and number of followers
for NFL (blue Xs) and World Series (red circles) tweeters during each of the four data collection
periods. Although a significant amount of overlap exists between the groups, particularly for
small numbers of friends and followers, outliers can be identified in all four plots. Additional
variables and analysis may be needed to further isolate these groups.
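A sketch of such a grouped scatter plot; the variable names are hypothetical stand-ins for the collected fields:
% Friends versus followers, grouped by search query
gscatter(FriendsCount, FollowerCount, Query, 'br', 'xo');
xlabel('Number of friends'); ylabel('Number of followers');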
Figure 19. Scatter plot for friends versus followers, Saturday 5 PM to 8 PM.
Figure 20. Scatter plot for friends versus followers, Saturday 8 PM to 11 PM.
Figure 21. Scatter plot for friends versus followers, Sunday 1 PM to 4 PM.
Figure 22. Scatter plot for friends versus followers, Sunday 5 PM to 8 PM.
Running the k-means algorithm in MATLAB on the Twitter feature sets further reveals the
similarity between the NFL and World Series sets of tweets for all four collection periods.
Figures 23–26 illustrate the results of applying the k-means algorithm for three clusters.
In all cases, one very large cluster with silhouette values very close to 1 is observed. However,
the other two clusters are very small, with both negative and lower positive silhouette values.
This indicates that separation into unique clusters is difficult for this set of data and features.
Additional feature selection and analysis would be needed to better identify clusters of
similar Twitter users.
Figure 23. Silhouette plot of NFL versus World Series tweet features, Saturday 5 PM to 8 PM.
Figure 24. Silhouette plot of NFL versus World Series tweet features, Saturday 8 PM to 11 PM.
Figure 25. Silhouette plot of NFL versus World Series tweet features, Sunday 1 PM to 4 PM.
Figure 26. Silhouette plot of NFL versus World Series tweet features, Sunday 5 PM to 8 PM.
Figure 27. Hierarchical clustering of NFL versus World Series tweet features, Saturday 5 PM to 8 PM.
Figure 28. Hierarchical clustering of NFL versus World Series tweet features, Saturday 8 PM to 11 PM.
Figure 29. Hierarchical clustering of NFL versus World Series tweet features, Sunday 1 PM to 4 PM.
Figure 30. Hierarchical clustering of NFL versus World Series tweet features, Sunday 5 PM to 8 PM.
MATLAB’s Neural Network Toolbox was also used to cluster the tweets through the self-
organizing map function. A hextop topology of 10 × 10 nodes was selected with 200 training
epochs. After training, several visualizations are available to help determine how the input is
distributed across the nodes. The locations of the data points and the weight vectors are shown
by selecting the weight positions plot; with this display, only two weights can be shown at
one time. Figure 31 shows that most of the data points cluster in one area and are not very
well distributed. This type of clustering was observed in both the hierarchical and k-means
results.
The SOM neighbor weight distances plot, which uses multiple dimensions, provides more
information. In Figures 32–35, SOM neighbor weight distance plots are displayed for each of
the four time collections. When interpreting these plots, the blue hexagons represent the
neurons, the red lines connect neighboring neurons, and the colors of the regions containing
the red lines indicate the distances between neurons, with darker colors representing larger
distances and lighter colors representing smaller distances.
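In MATLAB this view can be generated with plotsomnd, assuming a trained network net as in the earlier sketch:
plotsomnd(net);   % neighbor weight distances for the trained map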
Figures 32–35 show one large cluster, with small distances between the member records,
present in all four collection periods. One or two very small clusters are also present, with
relatively large distances between the member records.
The SOM weight plane plots are used to visualize the strength of the weights that connect each
input to each of the neurons. Five inputs were used in our experiment; therefore, five subplots
were generated for each run. The five input features were, for each tweet, the number of user
mentions, URL mentions, hashtag mentions, followers, and friends. Lighter colors in the plots
represent larger weights, whereas darker colors represent smaller weights. Similar connection
patterns between inputs indicate a high correlation.
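These plots can be generated with plotsomplanes, again assuming a trained network net:
plotsomplanes(net);   % one weight-plane subplot per input feature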
Weight plane plots are shown for each of the collection times in Figures 36–39. Inputs 4 and 5
appeared similar in all collection times and were interpreted as highly correlated. Inputs 1, 2,
and 3 seemed to contribute the least to cluster separation in the data sets, as they appear to be
the least similar and less correlated. The correlation between inputs 4 and 5 seems reasonable,
because the numbers of friends and followers tend to be correlated: Twitter users with a large
number of friends also tend to have a large number of followers. The information from the
user, URL, and hashtag mentions shows some promise, as the maps show these as not being
highly correlated. Additional features and analysis are recommended to enhance the
differences in these maps and to perform better clustering.
One additional useful visual plot provided by the MATLAB SOM functionality is the SOM
sample hits plot, which counts the number of data records associated with each neuron. In an
ideal situation, a relatively even distribution across the neurons is desired. Figures 40–43 show
that the distribution across the neurons was not even in our experiment. In most cases, the
distribution was concentrated in one area, which indicates very similar data without much
separation. Even though the location of concentration seems to shift from one area of the map
to another for each time frame, the results are essentially the same for each collection time.
The exception is the increase in the concentration of data around the "heavy-hitting" neurons
on Sunday, when both NFL and World Series events were scheduled.
6. Conclusion
This research used MATLAB tools to extract and analyze social networking data sets while
leveraging cloud technologies and infrastructures. AWS was used to spin up a Windows
Server 2008 instance and install MATLAB and its associated toolboxes for data mining and
statistics. In addition, a community MATLAB m file was used to interface with the Twitter
API to search specific text queries and retrieve the associated data. The setup process on AWS
was straightforward and provided a cost-effective, free solution for the hardware and
operating system. A MATLAB license was still needed in this implementation to use the
toolboxes and associated m files.
AWS provided cost savings on all server-related hardware and software. Because MATLAB
was used, the development costs for some very specialized analysis and visualization tools
were also reduced. Further cost savings are envisioned as MATLAB provides cloud options
for running its software for short durations using a pay-as-you-go cost model.
At the time of publishing, access to MATLAB Distributed Computing Server on the cloud is
available as part of an Early Adopter Program for MATLAB Distributed Computing Server
on Elastic Compute Cloud (EC2). Future efforts could take advantage of this option to speed
up implementations, run algorithms in parallel, and possibly further reduce licensing costs.
The twitty.m and parse_json.m community files were successfully used to interface with the
Twitter API to search for tweets related to specific queries, including the "NFL" and "World
Series" sporting events. In addition to the tweet texts, other information used in the experiment
included user data such as the number of friends, number of followers, time between tweets,
number of URL mentions, number of hashtag mentions, and number of user mentions in each
text.
The MATLAB Statistics and Neural Network Toolboxes were then used to extract and display
descriptive statistics and to perform unsupervised clustering using k-means, hierarchical, and
self-organizing map methods. Initial analysis of the visualizations and data output revealed
that the popularity and frequency of tweets associated with sports events increase as the time
of the event gets closer. The frequency of tweets also persisted throughout the event.
Twitter users were also shown to have similar characteristics and profiles and would be
difficult to separate based on just a few features. Referencing the SOM weight distance plots,
smaller clusters were observed in all the plots; however, the distances among the members of
those clusters were large compared with the one large cluster observed in each plot. Additional
research with larger data sets and more robust features is needed to form predictive models.
7. Recommendations
This research demonstrated that using MATLAB for social networking site analysis on AWS
servers is feasible, cost-effective, and efficient. Further research is recommended to investigate
MATLAB's AWS parallel computing license options and costs; it is believed even more cost
savings could be realized with a pay-as-you-go model. This is particularly attractive for
smaller companies and start-ups that might have limited financial resources yet have the
personnel skills to conduct excellent research.
This initial experiment collected data over four different, but relatively short, time periods for
two sports-related queries. Only a handful of features were used to identify clusters and
perform statistical analysis. We recommend expanding the time frame, the number of tweet
queries, and the feature set to provide further insight and understanding from the information
available on social networking sites such as Twitter.
ScreenName{TweetCnt} = SName;
% Get the User Mentions Count
UserMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.user_mentions));
URLMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.urls));
HashTagsMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.hashtags));
end
for i=1:StatCntQ2
TweetCntQ2 = TweetCntQ2+1;
MyTweetsQ2 = Q2{1,1}.statuses{1,i}.text;
% Put the data in
TweetsQ2{TweetCntQ2}=MyTweetsQ2;
% Put the dates/time in
DatesQ2{TweetCntQ2} = Q2{1,1}.statuses{1,i}.created_at;
% Gather other stuff
FollowerCountQ2(TweetCntQ2) = Q2{1,1}.statuses{1,i}.user.followers_count;
FriendsCountQ2(TweetCntQ2) = Q2{1,1}.statuses{1,i}.user.friends_count;
TZonesQ2 = Q2{1,1}.statuses{1,i}.user.time_zone;
TimeZonesQ2{TweetCntQ2} = TZonesQ2;
% Get screen name
SNameQ2 = Q2{1,1}.statuses{1,i}.user.screen_name;
ScreenNameQ2{TweetCntQ2} = SNameQ2;
% Get the User Mentions Count
UserMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.user_mentions));
URLMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.urls));
HashTagsMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.hashtags));
end
% Get the lastID to eliminate Dups
if (StatCntQ2 > 0)
LastIDQ2 = Q2{1,1}.statuses{1,1}.id_str;
end
% Pause a few seconds
pause(PauseTime);
end
figure(4)
gplotmatrix(MyVars(:,2),MyVars(:,3),MyGroups,['r' 'b'],['O' 'X'],[],'off','hist','Followers','Friends');
grid;
Author details
1 U.S. Army Research Laboratory, Sensors and Electron Devices Directorate, Adelphi, MD,
USA
References
[9] Bondarenko V. Twitty: Interface to the Twitter API v1.1. MATLAB Central File Exchange. http://www.mathworks.com/MATLABcentral/fileexchange/34837-twitty/content/twitty.m (accessed 21 September 2013).
[10] Feenstra J. JSON Parser. MATLAB Central File Exchange. 3 July 2008 (updated 18 June 2009). http://www.mathworks.com/MATLABcentral/fileexchange/20565-json-parser (accessed 23 September 2013).
[13] Kohonen T. Self-Organizing Maps. 3rd ed. Springer Series in Information Sciences. Berlin, Heidelberg: Springer-Verlag; 2001.
[14] Bennett K, Robertson J. Signal and image processing algorithm performance in a virtual and elastic computing environment. In: Proc. SPIE 8734, Active and Passive Signatures IV, 87340B; 2013.