Data Mining: IE:4172 Big Data Analytics Stephen Baek

This document discusses different methods for collecting and mining data from the internet, including publicly available datasets, web crawling/scraping, and APIs. It notes that internet data is prevalent and can be useful for applications like predicting election outcomes, market trends, and more. It describes how data must be "mined" from publicly available datasets, web crawling/scraping bots that automatically collect data by following links on websites, and APIs that allow querying and retrieving data. Specific examples of public datasets, web crawling tools, and APIs are provided. The document also discusses policies for web crawlers regarding which pages to access, revisiting rates, avoiding overloading websites, and coordinating distributed crawlers.


Data Mining

IE:4172 Big Data Analytics


Stephen Baek
Sea of Information
● Internet data are extremely prevalent
● They can be useful in many applications:
○ Predicting outcomes of political elections
○ Market trend research
○ Sentiment/reputation analysis
○ Stock market prediction
○ Sports science
○ Diffusion of information
○ Natural disasters
○ Diseases, epidemiology, public health
○ … the list goes on and on

Image Source: Unknown


Data is the new oil
● We have to “mine” it…
○ Publicly available datasets
■ Raw files made available for download
■ e.g. UCI ML repository, Kaggle competitions, data.gov, NIH Chest X-ray Dataset, …
○ Web crawling/scraping
■ Automated bots/macros to collect data from the web
■ Navigate through websites by tracking down the links
■ e.g. Search engines!
○ API - Application Programming Interface
■ A programming interface to send queries & retrieve data
■ e.g. Twitter API
○ Proprietary datasets

Image Source: Wikipedia


Public Datasets
● https://www.data.gov/
● https://www.kaggle.com
● https://archive.ics.uci.edu/ml/index.php
Web Crawling & Scraping
● Data mining from websites can be incredibly tedious and repetitious
● Web browser macros can automate repetitive web clicks, filling in forms, etc.

https://youtu.be/hytfjJGqlio
Web Crawling & Scraping
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud

● There are many open source crawlers:


○ For example: https://github.com/scrapinghub
○ HTML parsing libraries often paired with crawlers: BeautifulSoup, lxml
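
The link-following behavior described above can be sketched with the Python standard library alone. This is a minimal illustration, not how Googlebot or any production crawler works: the FAKE_WEB dictionary, the example.test URLs, and the fetch/crawl function names are all hypothetical stand-ins so that the sketch runs without network access. A real crawler would replace fetch with an HTTP request and would likely use a parser such as BeautifulSoup or lxml.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> HTML body.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "no links here",
}


def fetch(url):
    """Return the HTML for url, or None if it does not exist.

    Assumption: a real crawler would issue an HTTP GET here.
    """
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first traversal of hyperlinks starting from seed."""
    visited = []          # pages downloaded, in visit order
    seen = {seed}         # URLs already queued, to avoid revisiting
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.test/"):
        print(page)
```

The seen set is what keeps the crawler from looping forever on pages that link back to each other, and max_pages is a crude form of selection policy.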
Web Crawler Policies
● The behavior of a web crawler is the outcome of a combination of policies
○ a selection policy which states the pages to download,
○ a re-visit policy which states when to check for changes to the pages,
○ a politeness policy that states how to avoid overloading Web sites.
○ a parallelization policy that states how to coordinate distributed web crawlers.
● Web crawlers are not always welcome
○ A poorly behaved crawler can be blacklisted
○ robots.txt: a special file at the root of a web server that states which paths crawlers may access (it is advisory; well-behaved crawlers honor it)
■ ‘Allow’ directive: list of paths that can be accessed
■ ‘Disallow’ directive: list of paths that should not be crawled
○ HTML META tags: do a similar job to robots.txt, page by page
■ <META name="ROBOTS" content="NOFOLLOW">
■ <META name="GOOGLEBOT" content="NOINDEX">
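
A crawler can check these rules before downloading a page. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the example.test URLs, the "example-bot" user agent, and the helper names are hypothetical and chosen only for illustration. Note that Python's parser applies the first matching rule, which is why the Allow line is placed before the broader Disallow line here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt path instead.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/readme.html
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


def polite_delay(user_agent="example-bot", default=1):
    """Seconds to wait between requests (politeness policy).

    Uses the Crawl-delay value when the site gives one,
    otherwise falls back to an assumed default.
    """
    delay = parser.crawl_delay(user_agent)
    return delay if delay is not None else default


if __name__ == "__main__":
    for url in (
        "http://example.test/public/index.html",
        "http://example.test/private/data.html",
        "http://example.test/private/readme.html",
    ):
        print(url, allowed(url))
    print("delay:", polite_delay())
```

In a crawl loop, a call such as time.sleep(polite_delay()) between requests to the same host would be one simple way to implement the politeness policy.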
Application Programming Interface (API)
● Set of functions, routines, protocols, and tools for building software applications
● APIs define the standard way of accessing data
● Examples:
○ Twitter API: https://dev.twitter.com
○ Facebook API: https://developers.facebook.com
○ Yahoo! Finance API
○ Google Maps API
○ …
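
Many web APIs follow the same pattern: send a query as URL parameters, get structured data (often JSON) back. The sketch below shows that pattern only. The endpoint api.example.test, the parameter names q and count, and the response layout are invented for illustration and do not correspond to the Twitter, Facebook, or any other real API; each real service documents its own endpoints and usually requires an authentication key.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, not a real service.
BASE_URL = "https://api.example.test/v1/search"


def build_query_url(keyword, count=10):
    """Encode the query parameters into a request URL."""
    params = {"q": keyword, "count": count}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_texts(response_body):
    """Pull the 'text' field out of each result in a JSON response.

    Assumes a response shaped like {"results": [{"text": ...}, ...]}.
    """
    payload = json.loads(response_body)
    return [item["text"] for item in payload["results"]]


# Canned response standing in for what the server would return.
# With a real API, the body would come from an HTTP request,
# e.g. urllib.request.urlopen(build_query_url(...)).read().
SAMPLE_RESPONSE = (
    '{"results": [{"id": 1, "text": "hello"}, {"id": 2, "text": "big data"}]}'
)

if __name__ == "__main__":
    print(build_query_url("big data", count=2))
    print(extract_texts(SAMPLE_RESPONSE))
```

Compared with scraping, the data arrive already structured, so no HTML parsing is needed.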
(ICA) Let’s Play

Image Source: https://pixabay.com


Homework! - Due: 9/17 (Tuesday)
ICA - Topic 1
● Debate on the Nobel Prize in Physics 2017: “First Direct Observation of Gravitational Waves”
○ What is a gravitational wave?
■ https://www.nationalgeographic.com/news/2017/10/gravitational-waves-nobel-prize-physics-ligo-science-space/
○ The debate:
■ https://arstechnica.com/science/2018/10/danish-physicists-claim-to-cast-doubt-on-detection-of-gravitational-waves/

● Discuss:
○ What is a gravitational wave in layperson's terms?
○ What’s the root of the debate?
○ What is correlated noise and what can you do about it?
○ Danish vs American scientists - who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science
○ https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2018.01105.x

● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy distribution? What do you think is the reason?
○ What criteria are commonly used to determine outliers? How can they be wrong?
○ What is the author’s argument that outliers might actually be good for science?
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
○ https://www.apa.org/science/about/psa/2016/02/gaps-spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of missing values in analytics.
○ http://analytics-magazine.org/missing-values/

● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In the case of systematic (non-random) missing data, would you still use multiple imputation? If not, what else can you do?
