An Introduction To Data Mining With Open-Source Technologies
PDXScholar
This Presentation is brought to you for free and open access. It has been accepted for inclusion in Online
Northwest by an authorized administrator of PDXScholar.
Blake L. Galbreath
Core Services Librarian, WSU
OpenRefine · RapidMiner · Voyant Tools
Learning Outcomes
● Construct URLs
● Fetch Data
● Parse Data
Idea Exchange
● Primo
Ideas
Gather URLs
● Create Project
● Clipboard
● Paste data from clipboard here
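Constructing the URLs to paste can itself be scripted. The sketch below is one possible approach, not the presenter's method: the base URL and the `category`, `page` and `per_page` parameter names are hypothetical placeholders and would need to be replaced with whatever the target site actually uses.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names; substitute the ones observed
# in the browser's address bar for the site being mined.
BASE_URL = "https://example.org/ideas"


def build_urls(base_url, category, pages, per_page=20):
    """Return one URL per results page, one per line when joined."""
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({"category": category, "page": page, "per_page": per_page})
        urls.append(f"{base_url}?{query}")
    return urls


if __name__ == "__main__":
    # Copy the printed lines to the clipboard and paste them into OpenRefine.
    print("\n".join(build_urls(BASE_URL, "primo", pages=3)))
```

Each printed line becomes one row when pasted into a new OpenRefine project.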
Fetch Data
● Edit Column
● Add Column by Fetching URLs
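For comparison, the same fetch step could be approximated outside OpenRefine. This is a minimal sketch using only the Python standard library; the function names and the one-second pause between requests are assumptions chosen for illustration.

```python
import time
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download one URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_text, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    Network failures are recorded as None so one bad URL does not stop the run.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the server
        try:
            results[url] = fetch(url)
        except OSError:  # URLError and timeouts are OSError subclasses
            results[url] = None
    return results
```

The `fetch` parameter makes it possible to swap in a stub when trying the function without network access.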
Parse Data
● Undo/Redo Tab
● Apply
● Paste Extracted JSON
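In OpenRefine the parsing is driven by the pasted operation history. As a rough equivalent outside OpenRefine, and assuming the fetched responses are JSON, the parsing step might look like the sketch below. The `ideas`, `title`, `votes` and `status` keys are hypothetical; the real field names depend on the site.

```python
import json


def parse_ideas(raw_json):
    """Pull a few fields out of one fetched JSON document.

    The "ideas", "title", "votes" and "status" keys are hypothetical.
    """
    document = json.loads(raw_json)
    rows = []
    for idea in document.get("ideas", []):
        rows.append({
            "title": idea.get("title", ""),
            "votes": int(idea.get("votes", 0)),
            "status": idea.get("status", "unknown"),
        })
    return rows
```

Each returned dictionary corresponds to one row, with one column per extracted field.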
Admire Data
● And compare against original website data
RapidMiner
What is RapidMiner?
A lightning fast unified data science platform.
A software platform for data science teams that unites data prep, machine
learning, and predictive model deployment.
Description
Extracting sentiment from a piece of text such as a tweet, a review or an
article can provide us with valuable insight about the author's emotions
and perspective: whether the tone is positive, neutral or negative, and
whether the text is subjective (meaning it's reflecting the author's opinion)
or objective (meaning it's expressing a fact).
Processing Operators
● Read Document
● Tokenize
● Sentiment Analysis
Analysis
● Polarity = Positive
● Polarity_confidence = .907
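To make the idea of a polarity label with a confidence value concrete, here is a toy lexicon-based scorer. It is not the model behind the RapidMiner operator; the word lists are tiny illustrative assumptions, and the "confidence" is simply the winning side's share of lexicon hits rather than a calibrated probability.

```python
import re

# Tiny illustrative lexicons; a real analysis would use a trained model or a
# much larger curated word list.
POSITIVE = {"good", "great", "love", "excellent", "helpful", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate", "confusing", "poor"}


def polarity(text):
    """Return (label, confidence) based on counts of lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos == neg:
        # No hits, or a tie: neither side wins.
        return "neutral", 0.0
    label = "positive" if pos > neg else "negative"
    return label, max(pos, neg) / (pos + neg)
```

For example, `polarity("great but slow and confusing")` finds one positive and two negative hits, so the label is negative with a share of two thirds.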
Social Media: Search Twitter
Synopsis
This operator searches for Twitter statuses.
Description
With the Search Twitter operator, you can specify a query and get Twitter
statuses containing this query. The list of statuses contains additional data
with context of the statuses. In the expert mode, you can specify
additional search restrictions.
Processing Operators
● Search Twitter
● Analyze Sentiment
● Query = “trump -rt -http” (-rt excludes retweets; -http excludes links)
Analysis
● Sentiment = Positive
● Confidence = .855
@jeneps And don’t forget Trump’s broad
shoulders that Pence loves so much.
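The effect of the exclusion terms in the query can be imitated on text that has already been collected. The sketch below is only an approximation for illustration and does not reproduce Twitter's own matching logic: it treats a term as present when any whitespace-separated token starts with it, so `http` catches links and `rt` catches the leading "RT" of a retweet.

```python
def matches_query(text, query):
    """Approximate a query with required terms and "-" prefixed exclusions."""
    # Lowercase, split on whitespace, and drop leading symbols such as @ or #.
    tokens = [token.lstrip("@#\"'“(") for token in text.lower().split()]
    for term in query.lower().split():
        excluded = term.startswith("-")
        term = term.lstrip("-")
        if not term:
            continue
        found = any(token.startswith(term) for token in tokens)
        # Fail when an excluded term is found or a required term is missing.
        if found == excluded:
            return False
    return True
```

Under these assumptions the tweet quoted above passes the filter, while a status beginning with "RT" or containing a link does not.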
Process Documents: WordList
Synopsis
Generates word vectors from a text object.
Description
This operator uses one single TextObject as input for generating a term
vector. The resulting exampleset will hence consist of only one single
example. This makes this operator especially useful for applying a model
on one single text.
Processing Operators
● Retrieve
● Select Attributes
● Data to Documents
● Process Documents
● WordList to Data
● Store
Processing Operators (Sub-routine)
● Extract Content
● Tokenize
● Transform Cases
● Stopwords
● n-Grams
● Tokens
● Stem
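The sub-routine above can be sketched in a few lines of Python. This is a simplified stand-in, not what RapidMiner runs: the stop list is deliberately tiny, the stemmer only strips a few suffixes (a real process would typically use something like a Porter stemmer), and the order is simplified so that bigrams are built from the stemmed tokens.

```python
import re
from collections import Counter

# A deliberately small stop list for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def word_list(text, min_length=3):
    """Tokenize, lowercase, filter, stem, add bigrams, then count."""
    tokens = re.findall(r"[A-Za-z]+", text)                    # Tokenize
    tokens = [t.lower() for t in tokens]                       # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]         # Stopwords
    tokens = [t for t in tokens if len(t) >= min_length]       # Tokens (by length)
    stems = [naive_stem(t) for t in tokens]                    # Stem
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:])]   # n-Grams
    return Counter(stems + bigrams)
```

Calling `word_list(text).most_common(10)` gives the ten most frequent stems and bigrams.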
Analysis
● Journal
● Effect
● Histori
● Educ
● Studi
● State
● ...
Voyant Tools
What is Voyant Tools?
A web-based text reading and analysis environment. It is a scholarly
project that is designed to facilitate reading and interpretive practices for
digital humanities students and scholars as well as for the general public.
Possibilities:
● Study texts that you find on the web or texts that you have carefully
edited and have on your computer.
● Add functionality to your online collections, journals, blogs or websites
so others can see through your texts with analytical tools.
● Learn how computer-assisted analysis works.
Interface
● Drop in text, URL, upload file
● Reveal!
Multiple Visualizations
● Open
● Science
● Danielle
● Portland
Links: Keywords and Collocates
● Science: fellowship, mozilla, day, datarescue, hack
References
Kroeze, J., Matthee, M., & Bothma, T. (2004). Differentiating between
data-mining and text-mining terminology. 6(4), 297-306.
http://openrefine.org/
http://voyant-tools.org/docs/#!/guide/about
https://rapidminer.com/
https://rapidminer.com/blog/the-core-of-rapidminer-is-open-source/
Questions
Other Experiments
Retrieve Primo and Alma ideas from Ex Libris Idea Exchange
● Negative = 160
● Neutral = 175
Sentiment Analysis, Count: Alma
● Positive = 13
● Negative = 418
● Neutral = 568
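Raw counts are easier to compare as shares of the total. Using the Alma counts above (999 ideas in all), the arithmetic can be sketched as follows; the variable and function names are illustrative.

```python
# Alma sentiment counts as reported on the slide.
ALMA_COUNTS = {"Positive": 13, "Negative": 418, "Neutral": 568}


def shares(counts):
    """Convert label counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in shares(ALMA_COUNTS).items():
        print(f"{label}: {share:.1%}")
```

This works out to roughly 1.3% positive, 41.8% negative and 56.9% neutral.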
Sentiment Analysis, Scatterplot: Primo
● Shows some separation in the Neutral and Negative categories
Sentiment Analysis, Scatterplot: Alma
● Shows some separation in the Negative and Positive categories
Correlation Matrix for + and - Polarities: Primo
● |r| = .113 for Polarity = Positive
● |r| = .115 for Polarity = Negative
Correlation Matrix for + and - Polarities: Alma
● |r| = .283 for Polarity = Positive
● |r| = .094 for Polarity = Negative
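The slides do not say which attribute each polarity is correlated against, so the sketch below only shows how a Pearson correlation coefficient of this kind is computed in general. The function name is illustrative, and any data passed to it would need to come from the actual example set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A polarity such as Positive can be encoded as a 0/1 indicator per row before being passed in; taking `abs()` of the result gives the |r| values reported on the slides.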
Voyant Tools: Cirrus
● Cirrus is a word cloud of the most frequently occurring words in the corpus
Voyant Tools: Links
● The Collocates graph shows a network of higher-frequency terms that appear in proximity to one another
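The idea behind a collocates view can be illustrated by counting which words fall within a small window around a keyword. This is a minimal sketch of the general technique, not Voyant's implementation; the window size and tokenization rule are assumptions.

```python
import re
from collections import Counter


def collocates(text, keyword, window=2):
    """Count words appearing within `window` tokens of each keyword occurrence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(n for n in neighbours if n != keyword)
    return counts
```

Linking each keyword to its most frequent neighbours yields the kind of network the Links tool draws.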