http://dx.doi.org/10.5772/58895
1. Introduction
The popularity of social networking has allowed access to staggering amounts of unique data,
which has created new possibilities for data analysis and exploitation. Such data has proven
useful in marketing, decision making, destabilizing terrorist networks, behavior evolution,
and determining future social trends [1].
Increased usage of social networking sites has also been observed during events related to
natural disasters; significant political, sporting, and social events; and other crises. Twitter
users provide status updates through tweets. Because tweets are short, limited to 140 characters,
they may need additional analysis to provide contextual clues, and applying traditional
natural language processing algorithms to such data is challenging. Before this data can be
used, it must first be extracted, processed, and analyzed appropriately using suitable
algorithms and theories.
According to a 2011 study from the International Data Corporation (IDC), a market research
firm specializing in information technology and other consumer technologies, unstructured
data (that is, data that does not have a pre-defined data model or is not organized in a
pre-defined manner) is growing at a faster rate than structured data. Within the next decade,
unstructured data will account for 90 percent of all data created.
A large driving factor in the increase in unstructured data is social networking data, such as
tweets. It is estimated that more than 80 percent of all potentially useful data is unstructured [2].
The success of businesses in the coming decade will likely rely on their ability to successfully
analyze data from social networks. To that end, this chapter's objectives are to:
• Use existing data mining and statistical tools within MATLAB to conduct analysis on social
networking site data.
• Discuss potential cost savings and document approaches for implementing social networking site data analysis via the cloud through vendors such as Amazon Web Services (AWS).
• Provide a tutorial on the use of MATLAB tools to analyze unstructured data with an
emphasis on social networking data.
• Provide an overview and example on the use of MATLAB in a commercial cloud environment, such as AWS.
2. Previous work
Data mining has also proven useful on Twitter, with much research directed at detecting
trends and bursty keywords. Twitter data has some peculiarities that must be accounted for:
many people tweet from smartphones, where typing mistakes are common, so misspellings must
be detected and merged into a single variable rather than being separated and lost. In
addition, some phrases are commonly abbreviated, such as "NYC" for "New York City," or
reduced, such as "Vegas" for "Las Vegas." One study, using algorithms that account for these
abnormalities, identified trending and bursty keywords on Twitter with 92% accuracy when
compared with Google Trends [3].
Karandikar [4] described an approach for determining the most suitable topic model for
clustering tweets by analyzing how topic model parameters, such as training data size and
type and the number of topics, affect clustering performance. Using this approach, the model
clustered tweets with greater than 64 percent accuracy for the two specific events studied.
Perera [5] developed a software architecture using a Twitter application program interface
(API) to collect tweets sent to specific users. The extracted data were processed to characterize
the inter-arrival times between tweets and the number of retweets. Analysis revealed that the
arrival process of new tweets to a user can be modeled as a Poisson process while the number
of retweets follows a geometric distribution.
Turner and Malleson [6] presented an exploratory geographical analysis of a sample of Twitter
post data from June 2011 to March 2012 in the city of Leeds, England. Geographical cluster
detection methods were used in combination with text mining techniques to identify patterns.
Preliminary results suggested that the data could be used as a means of exploring daily spatial-
temporal behavior.
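3. MATLAB tools and cloud resources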
The following presents insight into the MATLAB toolboxes that can be used to detect
anomalies, that is, deviations from the normal pattern of life.
MATLAB’s interactive environment allows end users to quickly prototype their own
algorithms if existing toolboxes do not provide the desired functionality. A well-established
community of users supports the exchange of algorithms for those wishing to share their
research and prototypes.
This research took advantage of the MATLAB community and the existing toolboxes. From
the file exchange, Vladimir Bondarenko’s contributed class for the Twitter REST API v1.1
was used to interface with Twitter to search for and extract tweets for specific topics and
geographic regions [8]. The existing MATLAB Statistics and Neural Network Toolboxes were
also used in this research to provide statistical algorithms and unsupervised learning methods,
such as cluster analysis, for exploring the data to discover hidden patterns and groupings.
Twitty is a useful interface that runs within MATLAB for communicating with Twitter.
Methods used in Twitty are essentially wrapper functions that call the Twitter API [9]. The
API caller function, callTwitterAPI(), does the main work.
Key steps to successfully using the Twitter API include obtaining Twitter credentials and using
JavaScript Object Notation (JSON) parsers. The MATLAB file exchange provides a JSON parser
developed by Joel Feenstra [10]. The JSON parser parses a JSON string and returns a MATLAB
cell array with the parsed data. JSON objects are converted to structures and JSON arrays are
converted to cell arrays. Twitter credentials are easily created by registering at the Twitter site,
creating an application, and retrieving consumer and access keys. These keys are required for
running Twitty and include specific values for:
• ConsumerKey
• ConsumerSecret
• AccessToken
• AccessTokenSecret
To use these keys in a MATLAB script, users assign the values to a credentials structure.
A twitty instance can then be created and its methods, such as search, called as shown in
Figure 1.
% Create credentials
credentials.ConsumerKey = 'YourConsumerKey';
credentials.ConsumerSecret = 'YourConsumerSecret';
credentials.AccessToken = 'YourAccessToken';
credentials.AccessTokenSecret = 'YourAccessTokenSecret';
% Create a twitty instance
tw = twitty(credentials);
% Search for World Series related tweets
tw.search('World Series');
Twitty provides a number of useful methods for interacting with the Twitter API. Example
calls, along with brief descriptions, are shown in Table 1. Additional methods and detailed
descriptions can be found by typing twitty.API at the MATLAB prompt.
The Statistics Toolbox™ provides statistical and machine learning algorithms and tools for
organizing, analyzing, and modeling data. Key features include data organization and
management via data set arrays and categorical arrays, exploratory data analysis with the use
of interactive graphics and multivariate statistics, and regression or classification for predictive
modeling [11]. The toolbox also includes functions for hypothesis testing, including checks
for autocorrelation and randomness, as well as t-tests, one-sample tests, and distribution
tests such as chi-square and Kolmogorov-Smirnov.
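As a hedged sketch, two such tests applied to inter-tweet times; the vectors deltaNFL and deltaWS are hypothetical names for the times between tweets for two search queries:
% One-sample Kolmogorov-Smirnov test against a standard normal
% (the data are standardized first with zscore)
[h_ks, p_ks] = kstest(zscore(deltaNFL));
% Two-sample t-test comparing mean inter-tweet times of the two queries
[h_t, p_t] = ttest2(deltaNFL, deltaWS);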
Clustering algorithms available within MATLAB’s Statistical Toolbox include k-means and
hierarchical approaches. Clustering algorithms are particularly useful for analyzing social
networking data as they help identify natural groupings that can then be further analyzed to
determine similarities or differences and make business, marketing, or other decisions.
A k-means clustering algorithm forms k clusters, each defined by its centroid, or center. The
algorithm works by moving data between clusters until the sum of the distances between each
member and its cluster centroid is minimized.
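A minimal sketch of such a call, assuming a numeric feature matrix meas with one row per observation (meas and the choice k = 3 are illustrative assumptions):
k = 3;
[cidx, C, sumd] = kmeans(meas, k, 'dist', 'sqeuclidean', 'Replicates', 5);
% cidx : cluster index assigned to each row of meas
% C    : k-by-p matrix of cluster centroids
% sumd : within-cluster sums of point-to-centroid distances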
MATLAB allows different distance measures to be selected. Table 2 lists the available distance
measures and a brief description and example call.
Visually displaying the results for multidimensional data can be challenging. However, the
silhouette plot, available within the MATLAB Statistics Toolbox, displays a measure between
-1 and 1 representing how close each point in one cluster is to points in the neighboring
clusters. Values close to 1 indicate points that are distant from neighboring clusters, values
close to 0 indicate points that are not distinctly closer to one cluster than another, and
negative values indicate points that are most likely assigned to the wrong cluster.
In the following example, cidx2 represents the cluster index for each given sample of data.
Assuming cidx2 was returned from the call to the k-means function, an example call of the
silhouette function is:
silhouette(meas,cidx2,'sqeuclidean');
Hierarchical clustering groups data into a tree structure consisting of multiple levels. Users
can prune parts of the tree depending upon the application and level of detail required. The
links between data are represented as upside-down U-shaped lines, with the height of each
line indicating the distance between the linked data. The height at which two objects are
first joined is known as the cophenetic distance between them, and MATLAB provides a function
to measure how well the tree preserves these distances.
The closer the value of the cophenetic correlation coefficient is to 1, the more accurately the
clustering solution reflects your data.
As with k-means clustering, a distance measure must be selected to generate the linkages of
the tree. Figure 2 shows code that creates a tree based on the Euclidean distance for a matrix
named ‘meas’, displays the result as a dendrogram, and then calculates the cophenetic
correlation.
eucD = pdist(meas,'euclidean');              % pairwise Euclidean distances
clustTreeEuc = linkage(eucD,'average');      % build the tree using average linkage
[h,nodes] = dendrogram(clustTreeEuc,0);      % plot the full dendrogram (all leaves)
measCophenet = cophenet(clustTreeEuc,eucD);  % cophenetic correlation coefficient
The Neural Network Toolbox™ provides functions and apps for modeling complex nonlinear
systems that are not easily modeled with a closed-form equation. Primary features include
data fitting, clustering, and pattern recognition to forecast future events; both supervised and
unsupervised network architectures; and training algorithms, such as gradient descent and
conjugate gradient methods, that help automatically adjust the network’s weights and biases
[12]. Also included are preprocessing and post-processing functions that improve the
efficiency of neural network training and enable detailed analysis of network performance by
reducing the dimensions of the input vectors using principal component analysis and by
normalizing the mean and standard deviation of the training set.
MATLAB’s Neural Network Toolbox provides self-organizing maps for both unsupervised
and supervised clustering. Self-organizing maps retain topological information for similar
classes and provide reasonable classifiers. Self-organizing maps can be used for
multidimensional data with complex features, making them attractive for analysis of social networking
data. The self-organizing maps functionality within MATLAB can be used to cluster tweets
based on the Twitter-extracted fields.
Self-organizing maps learn to classify input vectors according to how they are grouped in the
input space. The neurons in a self-organizing map can be arranged based on specific topologies.
Within MATLAB, topologies for gridtop, hextop and random are possible. Results are not
affected by the choice of topology used; however, visual inspection of the data space is best
achieved by use of a hexagonal grid (hextop) [13]. A gridtop topology has neurons evenly
spaced in a matrix of specified dimensions. For example, an 80-neuron self-organizing map may
be arranged in an 8 × 10 gridtop topology. A hextop topology is similar, but the shape of the
grid can best be described as a hexagon. A random topology has the neurons randomly located.
Figure 3 illustrates gridtop, hextop and random topologies for a 3 × 4 set of neurons.
The self-organizing map routine can be run in MATLAB by calling the selforgmap function.
Once a map is set up, it can be trained using specific input, and the results can be displayed
visually. Figure 4 shows a code example to generate a 3 × 4 neuron self-organizing map,
conduct training, and display the results.
Figure 4. MATLAB code example to generate a 3 × 4 neuron self-organizing map, conduct training, and display the
results.
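A minimal sketch of such a script, assuming the tweet features have been assembled into a feature-by-sample matrix X (X and the default parameter choices are assumptions):
% Create a 3-by-4 neuron self-organizing map (default hextop topology)
net = selforgmap([3 4]);
% Train the map on the input data
[net, tr] = train(net, X);
% Assign each sample to its winning neuron and plot the distribution
y = net(X);
plotsomhits(net, X);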
The Neural Network Toolbox also contains nnstart, an interactive tool for loading data and
selecting algorithms and training iterations. As shown in Figure 5, the interactive tool allows
the user to select run options and then train for a specific number of epochs.
The addition of MATLAB’s Parallel Computing Toolbox™ allows the user to speed up
training and handle large data sets by distributing computations and data across multiple
processors and graphics processing units (GPUs). The ability to spin up multiple central
processing units (CPUs) and GPUs from cloud services such as AWS makes this toolbox
attractive, although the user may be limited by the availability of MATLAB licenses for this
option. MathWorks and AWS are currently working out a process to make this more feasible
with a pay-as-you-go license feature that could be quite attractive for businesses that do not
have the resources to purchase hardware and software.
Cloud service providers such as AWS can significantly reduce the startup and maintenance
costs for the high-performance computers needed to run complex math, statistics, and
optimization algorithms efficiently. AWS has free and paid tier options that will fit most
budgets. Previous research has shown that the cost of running multiple high-performance
servers for many hours was just a few dollars [14]. As long as users have a valid MATLAB
license, they can install MATLAB on the AWS machine(s) and run the data collection and
analysis. Once the runs are complete, users can deactivate the MATLAB license on the AWS
machine. If users need to run the experiment again, they can spin up the server and reactivate
the license to continue the experiment.
4. Experiment
The experiment consisted of running MATLAB code on commercial cloud architecture (AWS)
to collect publicly available tweets containing common keywords, such as "NFL" and "World
Series," during specific time intervals of a weekend, over a large geographic region, while
both NFL games and World Series games were being played. Various MATLAB tools and
functions were used to gather statistical information and analyze the data using unsupervised
learning and clustering methods.
For this experiment, a server running the Windows Server 2008 operating system was spun
up in the AWS free tier. A MATLAB license was installed that included the Statistics and
Neural Network Toolboxes. The twitty.m and parse_json.m files were also uploaded to allow
MATLAB to call the Twitter API.
After successful installation, a MATLAB script file (m file) was created to provide the proper
Twitter credentials, and then search for tweets related to two different sports events, the World
Series and NFL games. The specific search strings used included “NFL” and “World Series”.
Four different 3-hour data collections were made over a period of two days. Collection times
included Saturday from 5 PM to 8 PM and 8 PM to 11 PM and Sunday from 1 PM to 4 PM and
5 PM to 8 PM during the weekend of the 2013 World Series. During this time frame, a World
Series event took place on both Saturday and Sunday evenings, and a series of NFL games
occurred on Sunday afternoon and evening.
A large geographic collection area, defined by a 1700-mile radius centered on Topeka, Kansas,
was used for this experiment. This radius covered most of the continental United States and
parts of Canada and Mexico. The search parameters included the Twitter entities that store
user, URL, and hashtag mentions. Sampling occurred every 60 seconds over each 3-hour
window, with Twitter ID information being used to eliminate duplicate tweets within the
time frame. A sample call for the search is as follows:
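(The call below is a reconstruction consistent with this description; the Topeka coordinates and exact parameter values are assumptions.)
% Search for up to 20 "NFL" tweets within 1700 miles of Topeka, KS,
% newer than LastID, with entities included
S = tw.search('NFL', 'count', 20, 'geocode', '39.04,-95.68,1700mi', ...
    'include_entities', 'true', 'since_id', LastID);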
In this call, Twitter is searched for up to 20 tweets that include the text “NFL” and
originated within 1700 miles of Topeka, Kansas. The LastID variable was updated after each
iteration to gather only recent tweets beyond a specific ID. Appendix A shows the MATLAB
m file that was run for the experiment.
Statistics and analysis were run at the end of the data collection using existing MATLAB tools
and functions provided in the Statistics and Neural Network Toolboxes. Within the Statistics
Toolbox, the k-means and hierarchical clustering approaches and associated visual displays
were used.
The distance measure used for all k-means calculations was the default selection of squared
Euclidean. An example call for three clusters with the squared Euclidean distance is as
follows:
cidx2 = kmeans(meas,3,'dist','sqeuclidean');
The self-organizing map functions within the Neural Network Toolbox were used to cluster
the collected Twitter data.
In addition to the existing Statistics and Neural Network functions, some simple data
manipulation algorithms were used to extract the time between tweets. The code example in
Figure 6 determines and stores the time between consecutive tweets found in the MATLAB
cell array named DatesQ2.
for i = 1:length(DatesQ2)-1
    time1 = datevec(DatesQ2{i});    % consecutive timestamps converted to date
    time2 = datevec(DatesQ2{i+1});  % vectors (assumes datevec-readable strings)
    deltaWorldSeries(i) = abs(etime(time1,time2));  % elapsed seconds
end
In this code, the time1 and time2 values are constructed from consecutive entries of the
Twitter collection and converted to MATLAB time structures. The dates are then differenced
using MATLAB’s built-in elapsed time function (etime). This process calculates the time delta
for all consecutive tweets for all search queries.
Appendix B shows the Data analysis m file used for this experiment.
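5. Results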
Results were provided using MATLAB’s visual and plotting functions. To better understand
the frequency, or popularity, of the two search terms for each time period during the
experiment, some baseline descriptive statistics were calculated.
5.1. Boxplots
Boxplots are a convenient way to visualize the interquartile range, median, and outlier data.
Figures 7–10 present boxplots for the 3-hour experiment windows from Saturday 5 PM to 8
PM, Saturday 8 PM to 11 PM, Sunday 1 PM to 4 PM, and Sunday 5 PM to 8 PM, respectively.
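A sketch of such a boxplot call, with hypothetical vectors deltaNFL and deltaWS holding the times between tweets for the two queries:
times = [deltaNFL(:); deltaWS(:)];
groups = [repmat({'NFL'}, numel(deltaNFL), 1); ...
          repmat({'World Series'}, numel(deltaWS), 1)];
boxplot(times, groups);                 % one box per search query
ylabel('Time between tweets (s)');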
Figure 7. Boxplot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 8. Boxplot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 9. Boxplot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 10. Boxplot for NFL versus World Series time between tweets, Sunday, 5 PM to 8 PM.
Reviewing the boxplots in Figures 7–10 reveals a couple of interesting trends. In most cases,
outliers beyond the interquartile range exist; moreover, as a sporting event gets closer to its
start time, the frequency of tweets increases. Note the narrowing of the boxes for both the NFL
and World Series plots during or approaching the actual game events. By late Sunday
afternoon, there was significant activity in both the NFL and World Series tweets in terms of
frequency. The descriptive statistics were very similar, making it difficult to see any difference
in tweet frequency between NFL and World Series fans.
5.2. XY plots
MATLAB was also used to plot the time between tweets for each collection period and search
query. This view, although somewhat cluttered, can be used to quickly compare the number
of tweets and spot any other patterns obvious in an xy plot. The time between tweets is plotted
in Figures 11–14 for each of the four collection periods. The most revealing information from
these plots is the noticeable increase in the number of tweets as the events draw closer. On
Saturday, the number of tweets related to the World Series was higher than the number of
NFL tweets. This seems reasonable, since a World Series game took place on Saturday but no
NFL games occurred until Sunday.
On Sunday, with multiple NFL games taking place, the number of NFL tweets was larger than
the number of World Series-related tweets. The number and frequency of World Series tweets
remained high due to the Sunday evening World Series game, but the activity was not as
significant as for the NFL-related tweets.
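A sketch of such a plot, using the same hypothetical vectors:
plot(deltaNFL, 'bx'); hold on;          % NFL inter-tweet times
plot(deltaWS, 'ro'); hold off;          % World Series inter-tweet times
legend('NFL', 'World Series');
xlabel('Tweet index'); ylabel('Time between tweets (s)');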
Figure 11. XY plot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 12. XY plot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 13. XY plot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 14. XY plot for NFL versus World Series time between tweets, Sunday 5 PM to 8 PM.
5.3. Histograms
The histogram functionality within MATLAB was used to get a better idea of the quantity and
length of the times between tweets for each of the collection periods. As shown in Figures 15
and 16, on Saturday the number of tweets with time difference values of less than 10 seconds
was much greater for World Series than for NFL tweets. In all histograms, a handful of tweets
had a time between tweets of greater than 100 seconds. The shift to NFL interest is very visible
on Sunday in the histograms shown in Figures 17 and 18. Note the significant increase in the
number of NFL tweets occurring at intervals of less than 10 seconds. In Figure 18, however,
the histograms for Sunday late afternoon and evening reveal similar results for the time
between tweets for both NFL and World Series texts.
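A sketch of such a histogram comparison, with bin centers chosen purely for illustration:
centers = 5:10:195;                     % 10-second buckets
subplot(1,2,1); hist(deltaNFL, centers); title('NFL');
xlabel('Time between tweets (s)'); ylabel('Count');
subplot(1,2,2); hist(deltaWS, centers); title('World Series');
xlabel('Time between tweets (s)'); ylabel('Count');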
Figure 15. Histogram plot for NFL versus World Series time between tweets, Saturday 5 PM to 8 PM.
Figure 16. Histogram plot for NFL versus World Series time between tweets, Saturday 8 PM to 11 PM.
Figure 17. Histogram plot for NFL versus World Series time between tweets, Sunday 1 PM to 4 PM.
Figure 18. Histogram plot for NFL versus World Series time between tweets, Sunday 5 PM to 8 PM.
Twitter queries provide additional data beyond the time and text of the tweet. For example,
the number of friends, number of followers, user screen name, date of account creation, and
time zone of the user are also available and easily extracted from the queries. This information
can be used to determine whether any significant differences or similarities exist among
Twitter users.
Within MATLAB, scatter plots can be used to visually identify clusters of data. Figures 19–22
illustrate the use of scatter plots to compare the number of friends and number of followers
for NFL (blue Xs) and World Series (red circles) tweeters during each of the four data collection
periods. Although a significant amount of overlap exists between the groups, particularly for
small numbers of friends and followers, outliers can be identified in all four plots. Additional
variables and analysis may be needed to further isolate these groups.
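A sketch of such a grouped scatter plot; the variable names are hypothetical stand-ins for the collected fields:
% Friends versus followers, grouped by search query
gscatter(FriendsCount, FollowerCount, Query, 'br', 'xo');
xlabel('Number of friends'); ylabel('Number of followers');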
Figure 19. Scatter plot for friends versus followers, Saturday 5 PM to 8 PM.
Figure 20. Scatter plot for friends versus followers, Saturday 8 PM to 11 PM.
Figure 21. Scatter plot for friends versus followers, Sunday 1 PM to 4 PM.
Figure 22. Scatter plot for friends versus followers, Sunday 5 PM to 8 PM.
Running the k-means algorithm in MATLAB on the Twitter feature sets further reveals the
similarity between the NFL and World Series sets of tweets for all four collection periods.
Figures 23–26 illustrate the results of applying the k-means algorithm for three clusters.
In all cases, one very large cluster with silhouette values very close to 1 is observed. However,
the other two clusters are very small, with both negative and lower positive silhouette values.
This indicates that separation into unique clusters is difficult for this set of data and features.
Additional feature selection and analysis would be needed to better identify clusters of
similar Twitter users.
Figure 23. Silhouette plot of NFL versus World Series tweet features, Saturday 5 PM to 8 PM.
Figure 24. Silhouette plot of NFL versus World Series tweet features, Saturday 8 PM to 11 PM.
Figure 25. Silhouette plot of NFL versus World Series tweet features, Sunday 1 PM to 4 PM.
Figure 26. Silhouette plot of NFL versus World Series tweet features, Sunday 5 PM to 8 PM.
Figure 27. Hierarchical clustering of NFL versus World Series tweet features, Saturday 5 PM to 8 PM.
Figure 28. Hierarchical clustering of NFL versus World Series tweet features, Saturday 8 PM to 11 PM.
Figure 29. Hierarchical clustering of NFL versus World Series tweet features, Sunday 1 PM to 4 PM.
Figure 30. Hierarchical clustering of NFL versus World Series tweet features, Sunday 5 PM to 8 PM.
MATLAB’s Neural Network Toolbox was also used to cluster the tweets through the self-
organizing map function. A hextop topology of 10 × 10 nodes was selected with 200 training
epochs. After training, several visualizations are available to help determine how the input is
distributed across the nodes. The locations of the data points and the weight vectors are shown
by selecting the weight positions plot; with this display, only two weights can be shown at
one time. Figure 31 shows that most of the data points cluster in one area and are not very
well distributed. This type of clustering was observed in both the hierarchical and k-means
results.
The SOM neighbor weight distances plot, which uses multiple dimensions, provides more
information. In Figures 32–35, SOM neighbor weight distance plots are displayed for each of
the four time collections. When interpreting these plots, the blue hexagons represent the
neurons, the red lines connect neighboring neurons, and the colors of the regions containing
the red lines indicate the distances between neurons, with darker colors representing larger
distances and lighter colors representing smaller distances.
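In MATLAB this view can be generated with plotsomnd, assuming a trained network net as in the earlier sketch:
plotsomnd(net);   % neighbor weight distances for the trained map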
Figures 32–35 show one large cluster, with small distances between the member records,
present in all four collection periods. One or two very small clusters are also present, with
relatively large distances between the member records.
The SOM weight plane plots are used to visualize the strength of the weights that connect each
input to each of the neurons. Five inputs were used in our experiment; therefore, five subplots
were generated for each run. The five input features were, for each tweet, the number of user
mentions, URL mentions, hashtag mentions, followers, and friends. Lighter colors in the plots
represent larger weights, whereas darker colors represent smaller weights. Similar connection
patterns between inputs indicate a high correlation.
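These plots can be generated with plotsomplanes, again assuming a trained network net:
plotsomplanes(net);   % one weight-plane subplot per input feature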
Weight plane plots are shown for each of the collection times in Figures 36–39. Inputs 4 and 5
appeared similar in all collection times and were interpreted as highly correlated. Inputs 1, 2,
and 3 seemed to contribute the least to cluster separation in the data sets, as they appear to be
the least similar and less correlated. The correlation between inputs 4 and 5 seems reasonable,
because the numbers of friends and followers tend to be correlated: Twitter users with a large
number of friends also tend to have a large number of followers. The information from the
user, URL, and hashtag mentions shows some promise, as the maps show these as not being
highly correlated. Additional features and analysis are recommended to enhance the
differences in these maps and to perform better clustering.
One additional useful visual plot provided by the MATLAB SOM functionality is the SOM
sample hits plot, which counts the number of data records associated with each neuron. In an
ideal situation, a relatively even distribution across the neurons is desired. Figures 40–43 show
that the distribution across the neurons was not even in our experiment. In most cases, the
distribution was concentrated in one area, which indicates very similar data without much
separation. Even though the location of concentration seems to shift from one area of the map
to another for each time frame, the results are essentially the same for each collection time.
The exception is the increase in the concentration of data around the "heavy-hitting" neurons
on Sunday, when both NFL and World Series events were scheduled.
6. Conclusion
This research used MATLAB tools to extract and analyze social networking data sets while
leveraging cloud technologies and infrastructures. AWS was used to spin up a Windows
Server 2008 instance and install MATLAB and its associated toolboxes for data mining and
statistics. In addition, a community MATLAB m file was used to interface with the Twitter
API to search specific text queries and retrieve the associated data. The setup process on AWS
was straightforward and provided a cost-effective, free solution for the hardware and
operating system. A MATLAB license was still needed in this implementation to use the
toolboxes and associated m files.
AWS provided cost savings on all server-related hardware and software. Because MATLAB
was used, the development costs for some very specialized analysis and visualization tools
were also reduced. Further cost savings are envisioned as MATLAB provides cloud options
for running its software for short durations using a pay-as-you-go cost model.
At the time of publishing, access to MATLAB Distributed Computing Server on the cloud is
available as part of an Early Adopter Program for MATLAB Distributed Computing Server
on Elastic Compute Cloud (EC2). Future efforts could take advantage of this option to speed
up implementations, run algorithms in parallel, and possibly further reduce licensing costs.
The twitty.m and parse_json.m community files were successfully used to interface with the
Twitter API to search for tweets related to specific queries, including the "NFL" and "World
Series" sporting events. In addition to the tweet texts, other information used in the experiment
included user data such as the number of friends, number of followers, time between tweets,
number of URL mentions, number of hashtag mentions, and number of user mentions in each
text.
The MATLAB Statistics and Neural Network Toolboxes were then used to extract and display
descriptive statistics and to perform unsupervised clustering using k-means, hierarchical, and
self-organizing map methods. Initial analysis of the visualizations and data output revealed
that the popularity and frequency of tweets associated with sports events increase as the time
of the event gets closer. The frequency of tweets also persisted throughout the event.
Twitter users were also shown to have similar characteristics and profiles and would be
difficult to separate based on just a few features. Referencing the SOM weight distance plots,
smaller clusters were observed in all the plots; however, the distances among the members of
those clusters were large compared with the one large cluster observed in each plot. Additional
research with larger data sets and more robust features is needed to form predictive models.
7. Recommendations
This research demonstrated that using MATLAB for social networking site analysis on AWS
servers is feasible, cost-effective, and efficient. Further research is recommended to investigate
MATLAB's AWS parallel computing license options and costs; it is believed even more cost
savings could be realized with a pay-as-you-go model. This is particularly attractive for
smaller companies and start-ups that might have limited financial resources yet have the
personnel skills to conduct excellent research.
This initial experiment collected data over four different, but relatively short, time periods for
two sports-related queries. Only a handful of features were used to identify clusters and
perform statistical analysis. We recommend expanding the time frame, the number of tweet
queries, and the feature set to provide further insight and understanding from the information
available on social networking sites such as Twitter.
ScreenName{TweetCnt} = SName;
% Get the User Mentions Count
UserMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.user_mentions));
URLMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.urls));
HashTagsMentionCnt{TweetCnt} = length(cellfun('ndims',S{1,1}.statuses{1,i}.entities.hashtags));
end
for i=1:StatCntQ2
TweetCntQ2 = TweetCntQ2+1;
MyTweetsQ2 = Q2{1,1}.statuses{1,i}.text;
% Put the data in
TweetsQ2{TweetCntQ2}=MyTweetsQ2;
% Put the dates/time in
DatesQ2{TweetCntQ2} = Q2{1,1}.statuses{1,i}.created_at;
% Gather other stuff
FollowerCountQ2(TweetCntQ2) = Q2{1,1}.statuses{1,i}.user.followers_count;
FriendsCountQ2(TweetCntQ2) = Q2{1,1}.statuses{1,i}.user.friends_count;
TZonesQ2 = Q2{1,1}.statuses{1,i}.user.time_zone;
TimeZonesQ2{TweetCntQ2} = TZonesQ2;
% Get screen name
SNameQ2 = Q2{1,1}.statuses{1,i}.user.screen_name;
ScreenNameQ2{TweetCntQ2} = SNameQ2;
% Get the User Mentions Count
UserMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.user_mentions));
URLMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.urls));
HashTagsMentionCntQ2{TweetCntQ2} = length(cellfun('ndims',Q2{1,1}.statuses{1,i}.entities.hashtags));
end
% Get the lastID to eliminate Dups
if (StatCntQ2 > 0)
LastIDQ2 = Q2{1,1}.statuses{1,1}.id_str;
end
% Pause a few seconds
pause(PauseTime);
end
figure(4)
gplotmatrix(MyVars(:,2),MyVars(:,3),MyGroups,['r' 'b'],['O' 'X'],[],'off','hist','Followers','Friends');
grid;
Author details
1 U.S. Army Research Laboratory, Sensors and Electron Devices Directorate, Adelphi, MD,
USA
References
[9] Bondarenko V. Twitty: Interface to the Twitter API v1.1. MATLAB Central File Exchange. http://www.mathworks.com/MATLABcentral/fileexchange/34837-twitty/content/twitty.m (accessed 21 September 2013).
[10] Feenstra J. JSON Parser. MATLAB Central File Exchange. 3 July 2008 (updated 18 June 2009). http://www.mathworks.com/MATLABcentral/fileexchange/20565-json-parser (accessed 23 September 2013).
[13] Kohonen T. Self-Organizing Maps. 3rd ed. Springer Series in Information Sciences. Berlin, Heidelberg: Springer-Verlag; 2001.
[14] Bennett K, Robertson J. Signal and image processing algorithm performance in a virtual and elastic computing environment. In: Proc. SPIE 8734, Active and Passive Signatures IV, 87340B; 2013.