Portland State University

PDXScholar

Online Northwest Online Northwest 2018

Mar 30th, 2:15 PM - 3:00 PM

An Introduction to Data Mining with Open-Source Technologies
Blake Galbreath
Washington State University, [email protected]

Galbreath, Blake, "An Introduction to Data Mining with Open-Source Technologies" (2018). Online Northwest. 8.
https://pdxscholar.library.pdx.edu/onlinenorthwest/2018/presentations/8

This Presentation is brought to you for free and open access. It has been accepted for inclusion in Online
Northwest by an authorized administrator of PDXScholar. Please contact us if we can make this document more
accessible: [email protected].
Blake L. Galbreath
Core Services Librarian, WSU
An Introduction to Data Mining with Open-Source Technologies
OpenRefine · RapidMiner · Voyant Tools
Learning Outcomes

Learn how to grab relevant data from a website with OpenRefine

Learn how to use text processing operators in RapidMiner

Learn how to use various tools in Voyant to display data analysis


What is Data Mining?
A step in the process of knowledge discovery from data (KDD)

'(Semi) automated discovery of trends and patterns across very large datasets, usually for the purpose of decision making'

A proactive process that automatically searches data for new relationships and anomalies on which to base business decisions in order to gain competitive advantage

-- Attempt to stay ahead of your competition by having more complete information to proactively make better-informed decisions.
What is Text Mining?
A branch or a sibling of data mining; also called ‘text data mining’

'Text mining is all about extracting patterns and associations previously unknown from large text databases' (Thuraisingham 1999:167).

Text mining is 'a way to examine a collection of documents and discover information not contained in any individual document in the collection' (Lucas 1999/2000:1).

Text mining 'performs various searching functions, linguistic analysis and categorizations' (Chen 2001:5,9).
OpenRefine
What is OpenRefine?
OpenRefine (formerly Google Refine) is a powerful tool for working with
messy data: cleaning it; transforming it from one format into another; and
extending it with web services and external data.

Also see my presentation from last year: Galbreath, Blake, "Using OpenRefine to Standardize and Augment Your Data" (2017). Online Northwest. 8.
http://pdxscholar.library.pdx.edu/onlinenorthwest/2017/schedule/8
Pulling Data from a Website with OpenRefine

● Construct URLs
● Fetch Data
● Parse Data
Idea Exchange
● Primo Ideas
Gather URLs
● Create Project
● Clipboard
● Paste data from clipboard here
Fetch Data
● Edit Column
● Add Column by Fetching URLs
Parse Data
● Undo/Redo Tab
● Apply
● Paste Extracted JSON
Admire Data
● And compare against original website data
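The construct, fetch, and parse steps above can also be sketched outside OpenRefine. The Python below is a minimal illustration, not the presenter's workflow: the base URL, the `page` query parameter, and the JSON field names (`ideas`, `title`, `votes`) are hypothetical placeholders, because the slides do not give the actual Idea Exchange URL pattern or response format. The fetcher is passed in as a function so the sketch can be run against canned data.

```python
import json
from typing import Callable, Dict, List
from urllib.request import urlopen

# Hypothetical endpoint; the real Idea Exchange URL pattern is not in the slides.
BASE_URL = "https://example.org/ideas"


def construct_urls(base: str, pages: int) -> List[str]:
    """Step 1: build one URL per results page."""
    return [f"{base}?page={n}" for n in range(1, pages + 1)]


def fetch(url: str) -> str:
    """Step 2: download the raw response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse(body: str) -> List[Dict[str, object]]:
    """Step 3: keep only the fields of interest (assumed JSON shape)."""
    payload = json.loads(body)
    return [{"title": idea["title"], "votes": idea["votes"]}
            for idea in payload["ideas"]]


def harvest(pages: int, fetcher: Callable[[str], str] = fetch) -> List[Dict[str, object]]:
    """Run all three steps and collect one row per idea."""
    rows: List[Dict[str, object]] = []
    for url in construct_urls(BASE_URL, pages):
        rows.extend(parse(fetcher(url)))
    return rows


def fake_fetch(url: str) -> str:
    """Offline stand-in for fetch() that returns canned JSON."""
    page = url.rsplit("=", 1)[1]
    return json.dumps(
        {"ideas": [{"title": f"Idea on page {page}", "votes": int(page) * 10}]}
    )
```

Calling `harvest(2, fetcher=fake_fetch)` exercises the whole pipeline without touching the network; swapping in the default `fetch` would download real pages once a genuine URL pattern is known.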
RapidMiner
What is RapidMiner?
A lightning fast unified data science platform.

A software platform for data science teams that unites data prep, machine
learning, and predictive model deployment.

Is it really open-source? Actually it has moved to “business source.”


Analyze Sentiment: Document
Synopsis
Analyzes Sentiment of text.

Description
Extracting sentiment from a piece of text such as a tweet, a review or an
article can provide us with valuable insight about the author's emotions
and perspective: whether the tone is positive, neutral or negative, and
whether the text is subjective (meaning it's reflecting the author's opinion)
or objective (meaning it's expressing a fact).
Processing Operators
● Read Document
● Tokenize
● Sentiment Analysis
Analysis
● Polarity = Positive
● Polarity_confidence = .907
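To make the idea of a polarity label with a confidence value concrete, here is a toy lexicon-based scorer in Python. It is only a sketch: the slides do not describe how RapidMiner's operator computes its result, the two word lists are tiny made-up lexicons, and the confidence here is a simple ratio of matching words, so it is not comparable to the .907 shown above.

```python
import re
from typing import List, Tuple

# Tiny illustrative lexicons (assumptions, not a real sentiment resource).
POSITIVE = {"good", "great", "love", "loves", "excellent", "useful"}
NEGATIVE = {"bad", "poor", "hate", "broken", "slow", "useless"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def polarity(text: str) -> Tuple[str, float]:
    """Return (label, confidence) from counts of lexicon hits."""
    tokens = tokenize(text)
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos == neg:
        # No hits, or an even split: treat as neutral.
        return "neutral", 0.0
    label = "positive" if pos > neg else "negative"
    # Confidence = share of sentiment-bearing words that agree with the label.
    return label, max(pos, neg) / (pos + neg)
```

For example, `polarity("The new search is great and I love it")` finds two positive words and no negative ones, so it labels the sentence positive with full confidence under this scheme.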
Social Media: Search Twitter
Synopsis
This operator searches for Twitter statuses.

Description
With the Search Twitter operator, you can specify a query and get Twitter
statuses containing this query. The list of statuses contains additional data
with context of the statuses. In the expert mode, you can specify
additional search restrictions.
Processing Operators
● Search Twitter
● Analyze Sentiment
● Query = “trump -rt -http” (-retweets) (-links)
Analysis
● Sentiment = Positive
● Confidence = .855
@jeneps And don’t forget Trump’s broad shoulders that Pence loves so much.
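The intent of the query above (keep statuses that mention the search term, drop retweets and anything with a link) can be approximated locally in Python. This is a sketch under assumptions: it does not call Twitter or RapidMiner, the matching rules of the real search service may differ, and the sample statuses in the usage note are invented.

```python
import re
from typing import Iterable, List


def matches_query(status: str,
                  required: str,
                  excluded_words: Iterable[str],
                  excluded_substrings: Iterable[str]) -> bool:
    """Return True if the status passes a 'term -word -substring' style query."""
    text = status.lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    if required not in words:
        return False
    # "-rt": reject statuses containing the standalone word "rt".
    if any(word in words for word in excluded_words):
        return False
    # "-http": checked as a substring so "https://..." links are caught too.
    if any(fragment in text for fragment in excluded_substrings):
        return False
    return True


def filter_statuses(statuses: Iterable[str]) -> List[str]:
    """Apply the slide's query: trump -rt -http."""
    return [s for s in statuses
            if matches_query(s, "trump", ("rt",), ("http",))]
```

Given the made-up statuses "RT @someone: Trump said something", "Trump news https://example.org/story", and "Thinking about Trump today", only the last one survives the filter.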
Process Documents: WordList
Synopsis
Generates word vectors from a text object.

Description
This operator uses one single TextObject as input for generating a term
vector. The resulting exampleset will hence consist of only one single
example. This makes this operator especially useful for applying a model
on one single text.
Processing Operators
● Retrieve
● Select Attributes
● Data to Documents
● Process Documents
● WordList to Data
● Store
Processing Operators (Sub-routine)
● Extract Content
● Tokenize
● Transform Cases
● Stopwords
● n-Grams
● Tokens
● Stem
Analysis
● Journal
● Effect
● Histori
● Educ
● Studi
● State
● ...
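The sub-routine above can be approximated in a few lines of Python. This is a hedged sketch rather than a reproduction of the RapidMiner process: the stopword list is a small made-up sample, "Tokens" is assumed to mean filtering tokens by length, n-grams are built after stemming for simplicity, and `crude_stem` is a deliberately naive suffix stripper, so it will not yield stems such as "histori" or "studi" the way a Porter-style stemmer would.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "on", "for"}


def crude_stem(token: str) -> str:
    """Strip one common suffix; a naive stand-in for a real stemmer."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def word_list(text: str, min_len: int = 3) -> Counter:
    """Build a term-frequency list from raw text."""
    tokens: List[str] = re.findall(r"[A-Za-z]+", text)        # Tokenize
    tokens = [t.lower() for t in tokens]                      # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]        # Filter Stopwords
    tokens = [t for t in tokens if len(t) >= min_len]         # Filter Tokens (by length)
    stems = [crude_stem(t) for t in tokens]                   # Stem
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:])]  # n-Grams (n = 2)
    return Counter(stems + bigrams)
```

Calling `word_list(text).most_common(10)` on a set of article titles would give a ranked term list similar in spirit to the one on the Analysis slide.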
Voyant Tools
What is Voyant Tools?
A web-based text reading and analysis environment. It is a scholarly
project that is designed to facilitate reading and interpretive practices for
digital humanities students and scholars as well as for the general public.
Possibilities:

● Study texts that you find on the web or texts that you have carefully
edited and have on your computer.
● Add functionality to your online collections, journals, blogs or websites
so others can see through your texts with analytical tools.
● Learn how computer-assisted analysis works.
Interface
● Drop in text, URL, upload file
● Reveal!
Multiple Visualizations
● Open
● Science
● Danielle
● Portland
Links: Keywords and Collocates
● Science: fellowship mozilla day datarescue hack
References
Kroeze, J., Matthee, M., & Bothma, T. (2004). Differentiating between
data-mining and text-mining terminology. 6(4), 297-306.

http://openrefine.org/

http://voyant-tools.org/docs/#!/guide/about

https://rapidminer.com/

https://rapidminer.com/blog/the-core-of-rapidminer-is-open-source/
Questions
Other Experiments
Retrieve Primo and Alma ideas from Ex Libris Idea Exchange

Determine the sentiment of these ideas

Analyze sentiment of ideas to see if there was any correlation between polarity and number of votes received
Retrieve Ideas from Idea Exchange
● Construct URLs
● Fetch Data
● Parse Data
Sentiment Analysis, Count: Primo
● Positive = 20
● Negative = 160
● Neutral = 175
Sentiment Analysis, Count: Alma
● Positive = 13
● Negative = 418
● Neutral = 568
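Because the two collections differ in size (355 Primo ideas versus 999 Alma ideas, summing the counts above), shares are easier to compare than raw counts. The short Python sketch below derives the percentages from the slide figures; the variable and function names are illustrative only.

```python
from typing import Dict

# Counts copied from the two slides above.
COUNTS: Dict[str, Dict[str, int]] = {
    "Primo": {"positive": 20, "negative": 160, "neutral": 175},
    "Alma": {"positive": 13, "negative": 418, "neutral": 568},
}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for product, counts in COUNTS.items():
    print(product, sum(counts.values()), shares(counts))
# Primo 355 {'positive': 5.6, 'negative': 45.1, 'neutral': 49.3}
# Alma 999 {'positive': 1.3, 'negative': 41.8, 'neutral': 56.9}
```

In both products, ideas classified as positive are a small minority, and neutral is the largest category.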
Sentiment Analysis, Scatterplot: Primo
● Shows some separation in the Neutral and Negative categories
Sentiment Analysis, Scatterplot: Alma
● Shows some separation in the Negative and Positive categories
Correlation Matrix for + and - Polarities: Primo
● |r| = .113 for Polarity = Positive
● |r| = .115 for Polarity = Negative
Correlation Matrix for + and - Polarities: Alma
● |r| = .283 for Polarity = Positive
● |r| = .094 for Polarity = Negative
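For readers unfamiliar with the statistic, the sketch below computes a Pearson correlation coefficient in plain Python. It assumes the matrices above relate each idea's polarity confidence to its vote count, which the slides do not state explicitly, and the sample values in the usage note are invented for illustration; they are not the Primo or Alma data.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations, and the root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (dev_x * dev_y)
```

With hypothetical inputs such as `confidence = [0.55, 0.70, 0.80, 0.90, 0.95]` and `votes = [12, 3, 25, 7, 18]`, `abs(pearson_r(confidence, votes))` gives a value comparable in kind to the |r| figures above. Values near 0, like most of those reported on the slides, indicate little linear relationship.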
Voyant Tools: Cirrus
● Cirrus is a wordcloud of the most frequently occurring words in the corpus
Voyant Tools: Links
● Collocates graph shows a network graph of higher frequency terms that appear in proximity
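The notion of terms that "appear in proximity" can be illustrated by counting the words that fall within a fixed window around a keyword. The Python below is a minimal sketch of that idea only: the slides do not describe how Voyant computes its collocates, and the window size and the sample sentence in the usage note are assumptions.

```python
import re
from collections import Counter


def collocates(text: str, keyword: str, window: int = 2) -> Counter:
    """Count words occurring within `window` tokens of each keyword occurrence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts
```

For instance, `collocates("open science day and science fellowship for open science", "science", window=1)` counts "open" twice and "day", "and", and "fellowship" once each. A network graph like the Links tool could then draw an edge from the keyword to each of its most frequent neighbors.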
