Retrieving and Visualizing Data: Charles Severance
• http://spark.apache.org/
• https://aws.amazon.com/redshift/
• http://community.pentaho.com/
• ....
"Personal Data Mining"
Our goal is to make you better programmers, not to make you data mining experts
GeoData
• Makes a Google Map from user-entered data
Pipeline: geoload.py reads the user-entered locations, looks each one up through the Google geocoding API, and stores the results in geodata.sqlite; geodump.py reads geodata.sqlite and writes where.js, which where.html renders on the map.
Northeastern University, ... Boston, MA 02115, USA 42.3396998 -71.08975
Bradley University, 1501 ... Peoria, IL 61625, USA 40.6963857 -89.6160811
...
Technion, Viazman 87, Kesalsaba, 32000, Israel 32.7775 35.0216667
Monash University Clayton ... VIC 3800, Australia -37.9152113 145.134682
Kokshetau, Kazakhstan 53.2833333 69.3833333
...
12 records written to where.js
http://www.py4e.com/code3/geodata.zip
Open where.html to view the data in a browser
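The last step of that pipeline, producing where.js, can be sketched as follows. The exact schema of geodata.sqlite varies with the geocoding responses, so this sketch assumes the (lat, lng, label) tuples have already been pulled out of the database; where.html reads a global myData array.

```python
import json

def where_js(rows):
    """Build the text of where.js: an assignment to the global myData
    array that where.html reads to place its map markers.
    rows: (lat, lng, label) tuples already extracted from geodata.sqlite."""
    entries = [[lat, lng, label] for (lat, lng, label) in rows]
    # where.html expects a JavaScript variable, so emit an assignment,
    # not bare JSON
    return "myData = " + json.dumps(entries) + ";\n"

rows = [
    (42.3396998, -71.08975, "Northeastern University, Boston, MA 02115, USA"),
    (53.2833333, 69.3833333, "Kokshetau, Kazakhstan"),
]
with open("where.js", "w") as fh:
    fh.write(where_js(rows))
print(len(rows), "records written to where.js")
```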
Page Rank
• Write a simple web page crawler
http://www.py4e.com/code3/pagerank.zip
Search Engine Architecture
• Web Crawling
• Index Building
• Searching
http://infolab.stanford.edu/~backrub/googl
e.html
Web Crawler
A Web crawler is a computer program that browses
the World Wide Web in a methodical, automated
manner. Web crawlers are mainly used to create a
copy of all the visited pages for later processing by
a search engine that will index the downloaded
pages to provide fast searches.
http://en.wikipedia.org/wiki/Web_crawler
Web Crawler
• Retrieve a page
• Look through the page for links
• Repeat...
http://en.wikipedia.org/wiki/Web_crawler
Web Crawling Policy
• a selection policy that states which pages to download,
• a re-visit policy that states when to check for changes to the pages,
• a politeness policy that states how to avoid overloading websites, and
• a parallelization policy that states how to coordinate distributed web crawlers.
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
Google Architecture
• Web Crawling
• Index Building
• Searching
http://infolab.stanford.edu/~backrub/goog
le.html
Search Indexing
Search engine indexing collects, parses, and
stores data to facilitate fast and accurate
information retrieval. The purpose of storing an
index is to optimize speed and performance in
finding relevant documents for a search query.
Without an index, the search engine would scan
every document in the corpus, which would
require considerable time and computing power.
http://en.wikipedia.org/wiki/Index_(search_eng
ine)
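A toy inverted index shows the idea: build the word-to-documents map once, and each query becomes a dictionary lookup instead of a scan of the whole corpus. A sketch, not any engine's actual data structure:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look the word up in the index -- no document is scanned."""
    return sorted(index.get(word.lower(), set()))

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}
index = build_index(docs)
print(search(index, "quick"))  # [1, 3]
print(search(index, "dog"))    # [2, 3]
```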
Pipeline: spider.py crawls pages from the web into spider.sqlite; sprank.py updates the ranks stored in spider.sqlite (spreset.py resets them); spdump.py prints the ranked links, and spjson.py writes force.js, which force.html renders using d3.js.
(5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog')
(3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.
http://www.py4e.com/code3/pagerank.zip
Mailing Lists - Gmane
http://www.py4e.com/code3/gmane.zip
Warning: This Dataset is > 1GB
• Do not just point this application at gmane.org and let it run
• There is no rate limit – these are cool folks
http://mbox.dr-chuck.net/sakai.devel/4/5
Pipeline: gmane.py retrieves messages from mbox.dr-chuck.net into content.sqlite; gmodel.py cleans the data with the help of mapping.sqlite and writes the cleaned content.sqlite; gbasic.py prints text summaries, gword.py writes gword.js for gword.htm, and gline.py writes gline.js for gline.htm (both rendered with d3.js).
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184
...
http://www.py4e.com/code3/gmane.zip
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License. Please maintain this last slide in all copies of the document to comply with the attribution requirements of the license. If you make a change, feel free to add your name and organization to the list of contributors on this page as you republish the materials.