This document walks through scraping real estate listings from Boston Craigslist with WebHarvy, cleaning the raw data in OpenRefine (removing duplicates and parsing fields), and building an initial visualization in Tableau, where filters flag erroneous prices and listings are mapped with price encoded as marker size and a maximum-price quick filter.


Daniel Burseth

Co-president MIT Big Data Explorers


[email protected]
@dmbnyc
Github: dburseth

Acronyms abound
Tremendous complexity
Use building blocks, not code
This is easy
EPPM of 10 requires 500 professionals
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
Missing values
Duplicative records
Conventions (dates, times, geographies)
Spacing
Can we measure data cleanliness?
What's our Pareto point?
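One naive way to put a number on "cleanliness" is the share of non-missing cells. A minimal sketch follows; it ignores duplicates, convention errors, and outliers, so treat it as a starting point, not a real metric:

    import pandas as pd

    # Naive cleanliness score: fraction of cells that are non-missing.
    # Ignores duplicates, formatting conventions, and outliers.
    def completeness(df: pd.DataFrame) -> float:
        return float(df.notna().mean().mean())

    df = pd.read_csv("Boston Dirty.csv", sep="\t")   # filename from later in this workshop
    print(f"Completeness: {completeness(df):.1%}")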

AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type: m3.medium
Connect
You should see some software on the desktop
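For anyone who would rather script the launch than click through the console, here is a minimal sketch using boto3. It assumes AWS credentials are already configured, and "my-key-pair" is a placeholder for your own EC2 key pair:

    import boto3

    # Minimal sketch: launch the workshop AMI programmatically.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-c6b61fae",      # workshop image (US-EAST)
        InstanceType="m3.medium",
        MinCount=1,
        MaxCount=1,
        KeyName="my-key-pair",       # placeholder: your existing EC2 key pair
    )
    print("Launched", response["Instances"][0]["InstanceId"])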



Scrape all of Craigslist's Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau

all without writing a single line of code.
A hyper-intelligent utility to scrape website data.
SysNucleus, makers of USBTrace
Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
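As a taste of the code-based route those alternatives offer, here is a minimal Beautiful Soup sketch. The URL and CSS classes reflect Craigslist's old listing markup and are assumptions that may need updating:

    import requests
    from bs4 import BeautifulSoup

    # Minimal sketch of a code-based scrape (contrast with WebHarvy's point-and-click).
    # The search URL and CSS classes are assumptions based on old Craigslist markup.
    url = "https://boston.craigslist.org/search/apa"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    for row in soup.select("li.result-row"):          # one <li> per listing (assumed class)
        title = row.select_one("a.result-title")      # listing title link (assumed class)
        price = row.select_one("span.result-price")   # price tag (assumed class)
        if title and price:
            print(price.get_text(strip=True), title.get_text(strip=True), title["href"])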


HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother, capture text
3. Click on Hungry Mother, capture URL
4. Click on Kendall Square/MIT, capture text
5. Click on the last review, capture text
CLEAR

1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother

Let's start collecting information in the first sub-page.

Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link


Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates TOS; potentially not viable for apps or commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: Launch GOOGLE-REFINE.EXE
Within MOZILLA, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project

1. First, sort your column.
2. Then invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears at the top of the data table.
3. Then invoke Edit cells > Blank down on the Title column.
4. Then on that column, invoke Facet > Custom facets > Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "All" dropdown menu.
6. Remove the facet. (The same recipe is sketched in pandas below.)
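For comparison, a minimal pandas sketch of the same deduplication, assuming the export is tab-delimited and the listing title column is named "Title" (both assumptions):

    import pandas as pd

    # Minimal sketch of the OpenRefine recipe above in pandas.
    df = pd.read_csv("Boston Dirty.csv", sep="\t")

    df = df.sort_values("Title")               # steps 1-2: sort, make the order permanent
    df = df.drop_duplicates(subset="Title")    # steps 3-5: blank down, facet by blank, remove
    df.to_csv("Boston Deduped.csv", sep="\t", index=False)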


Then run the To Number transform again (Edit cells > Common transforms > To number)
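A rough pandas equivalent of that transform, assuming the scraped prices arrive as strings like "$1,850" (an assumption about the raw format), continuing from the sketch above:

    import pandas as pd

    df = pd.read_csv("Boston Deduped.csv", sep="\t")

    # Coerce price strings to numbers; unparseable values become NaN
    # instead of raising, mirroring a lenient "To number" transform.
    df["Price"] = pd.to_numeric(
        df["Price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )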



Increment the radius to 7 and make judgment calls along the way.
Change the Distance Function and do the same thing.
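The radius and distance function belong to OpenRefine's nearest-neighbor clustering method; its simpler key-collision sibling can be sketched in a few lines. The normalization below is a simplified take on the fingerprint idea, not OpenRefine's exact algorithm:

    import re
    from collections import defaultdict

    def fingerprint(value: str) -> str:
        # Simplified fingerprint key: lowercase, strip punctuation,
        # then sort and deduplicate the remaining tokens.
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    def cluster(values):
        # Group values whose fingerprints collide.
        groups = defaultdict(list)
        for v in values:
            groups[fingerprint(v)].append(v)
        return [g for g in groups.values() if len(g) > 1]

    print(cluster(["Cambridge, MA", "cambridge ma", "MA Cambridge", "Boston"]))
    # -> [['Cambridge, MA', 'cambridge ma', 'MA Cambridge']]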


Looks like we have SOME really expensive real estate. Data errors?
Boston Clean.csv
Load Boston Clean.csv
Go to Worksheet
Great semantic example. Tableau understands that this text translates to a lat/long
Look on the map in the lower right corner
Let's Filter Data

Under Measures, drag Price onto Size in Marks
Change SUM(Price) to AVG(Price)
Drag Price into Filters, change it to MAX(Price), and select At Most
Right-click on the filter and choose Show Quick Filter
Drag City onto Label
Menu Map -> Map Options
Click on a node for info and drill-down potential
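There is no code in the Tableau workflow, but the same encoding (position by lat/long, size by average price, a max-price cutoff) can be sketched with pandas and matplotlib. The Latitude/Longitude/Price/City column names, and the cutoff value, are assumptions about the cleaned file:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Minimal sketch of the Tableau view: lat/long scatter, marker size ~ avg price,
    # with an "At Most" filter to drop errant prices. Column names are assumptions.
    df = pd.read_csv("Boston Clean.csv")
    df = df[df["Price"] <= 10000]   # max-price filter; the cutoff here is arbitrary

    avg = df.groupby(["City", "Latitude", "Longitude"], as_index=False)["Price"].mean()
    plt.scatter(avg["Longitude"], avg["Latitude"], s=avg["Price"] / 50, alpha=0.5)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.title("Boston apartment listings, marker size = avg price")
    plt.show()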
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a cursory mapping visualization
Please come talk to me
