Data Mining: IE:4172 Big Data Analytics Stephen Baek

This document discusses different methods for collecting and mining data from the internet, including publicly available datasets, web crawling/scraping, and APIs. It notes that internet data is prevalent and can be useful for applications like predicting election outcomes, market trends, and more. It describes how data must be "mined" from publicly available datasets, web crawling/scraping bots that automatically collect data by following links on websites, and APIs that allow querying and retrieving data. Specific examples of public datasets, web crawling tools, and APIs are provided. The document also discusses policies for web crawlers regarding which pages to access, revisiting rates, avoiding overloading websites, and coordinating distributed crawlers.

Data Mining

IE:4172 Big Data Analytics

Stephen Baek
Sea of Information
● Internet data are extremely prevalent
● They can be useful in many applications:
○ Predicting outcomes of political elections
○ Market trend research
○ Sentiment/reputation analysis
○ Stock market prediction
○ Sports science
○ Diffusion of information
○ Natural disasters
○ Diseases, epidemiology, public health
○ … the list goes on and on

Image Source: Unknown

Data is the new oil
● We have to “mine” it…
○ Publicly available datasets
■ Raw files made available for download
■ e.g. UCI ML repository, Kaggle competitions, data.gov, NIH Chest X-ray Dataset, …
○ Web crawling/scraping
■ Automated bots/macros to collect data from the web
■ Navigate through websites by tracking down the links
■ e.g. Search engines!
○ API - Application Programming Interface
■ A programming interface to send queries and retrieve data
■ e.g. Twitter API
○ Proprietary datasets

Image Source: Wikipedia

Public Datasets
● https://www.data.gov/
● https://www.kaggle.com
● https://archive.ics.uci.edu/ml/index.php
Web Crawling & Scraping
● Data mining from websites can be incredibly tedious and repetitious
● Web browser macros can automate repetitive web clicks, filling in forms, etc.

https://youtu.be/hytfjJGqlio
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud

● There are many open source crawlers and scraping libraries:
○ For example: https://github.com/scrapinghub
○ BeautifulSoup, lxml (HTML/XML parsing libraries commonly used for scraping)
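
The idea of a crawler that "automatically traverses hyperlinks" can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not how Googlebot or any production crawler works. The `LinkExtractor` and `crawl` names, the `fetch` callable, and the `example.test` pages are assumptions made for this example; the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first traversal of hyperlinks starting from start_url.

    fetch is any callable that maps a URL to its HTML text, or None
    when the page cannot be retrieved. Returns the visited URLs in order.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: skip it
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory (assumption for illustration).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">broken link</a>',
}

order = crawl("http://example.test/", FAKE_WEB.get)
print(order)
```

To crawl real pages, `fetch` would be replaced by a function that downloads a URL, and the link extraction could be delegated to a parsing library such as BeautifulSoup or lxml.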
Web Crawler Policies
● The behavior of a web crawler is the outcome of a combination of policies
○ a selection policy which states the pages to download,
○ a re-visit policy which states when to check for changes to the pages,
○ a politeness policy that states how to avoid overloading Web sites.
○ a parallelization policy that states how to coordinate distributed web crawlers.
● Web crawlers are not always welcome
○ A poorly behaved crawler can be blacklisted
○ robots.txt: a special file at the root of a web server that states which pages crawlers may access (compliance is voluntary, so it advises rather than enforces)
■ ‘Allow’ directive: paths that can be accessed
■ ‘Disallow’ directive: paths that should not be crawled
○ HTML META tags: do a similar job to robots.txt, page by page
■ <META name="ROBOTS" content="NOFOLLOW">
■ <META name="GOOGLEBOT" content="NOINDEX">
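
A crawler can check these rules before it downloads a page. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt; the file contents, the `ExampleBot` user agent, the `example.test` URLs, and the `may_fetch` and `seconds_to_wait` helpers are assumptions for illustration. `Crawl-delay` is a non-standard directive that some crawlers honor, and it is used here as one possible input to a politeness policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())  # parse from memory instead of fetching


def may_fetch(url, agent="ExampleBot"):
    """Selection check: is this user agent allowed to request the URL?"""
    return rules.can_fetch(agent, url)


def seconds_to_wait(last_fetch_time, now, delay):
    """Politeness check: how long to wait before hitting the same host again."""
    return max(0.0, last_fetch_time + delay - now)


delay = rules.crawl_delay("ExampleBot")

print(may_fetch("http://example.test/public/index.html"))
print(may_fetch("http://example.test/private/data.html"))
print(delay)
print(seconds_to_wait(last_fetch_time=10.0, now=10.5, delay=delay))
```

In a real crawler the rules would be loaded from the site's own `/robots.txt` (for example with `set_url` and `read`), and the last fetch time would be tracked per host.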
Application Programming Interface (API)
● Set of functions, routines, protocols, and tools for building software applications
● APIs define the standard way of accessing data
● Examples:
○ Twitter API: https://dev.twitter.com
○ Facebook API: https://developers.facebook.com
○ Yahoo! Finance API
○ Google Maps API
○ …
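
Collecting data through a web API usually comes down to two steps: build a query and parse the structured response. The sketch below shows that pattern with the Python standard library. The endpoint `https://api.example.test/v1/search`, its `q` and `count` parameters, and the JSON layout are all hypothetical; they do not describe the Twitter, Facebook, or any other real API, each of which has its own documented parameters and authentication.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint (assumption for illustration, not a real service).
BASE_URL = "https://api.example.test/v1/search"


def build_request_url(query, count=10):
    """Encode the query parameters into a request URL."""
    params = {"q": query, "count": count}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_texts(response_body):
    """Pull the text field out of each result in a JSON response."""
    payload = json.loads(response_body)
    return [item["text"] for item in payload["results"]]


# Canned response standing in for what such a service might return.
SAMPLE_RESPONSE = (
    '{"results": [{"id": 1, "text": "big data"},'
    ' {"id": 2, "text": "data mining"}]}'
)

url = build_request_url("data mining", count=2)
texts = extract_texts(SAMPLE_RESPONSE)
print(url)
print(texts)
```

With a real service, the URL would be sent with an HTTP client and the response body passed to a function like `extract_texts`.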
(ICA) Let’s Play

Image Source: https://pixabay.com

Homework! - Due: 9/17 (Tuesday)
ICA - Topic 1
● Debate on the Nobel Prize in Physics 2017: “First Direct Observation of Gravitational Waves”
○ What is a gravitational wave?
■ https://www.nationalgeographic.com/news/2017/10/gravitational-waves-nobel-prize-physics-ligo-science-space/
○ The debate:
■ https://arstechnica.com/science/2018/10/danish-physicists-claim-to-cast-doubt-on-detection-of-gravitational-waves/

● Discuss:
○ What is a gravitational wave in layperson's terms?
○ What’s the root of the debate?
○ What is correlated noise and what can you do about it?
○ Danish vs American scientists - who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science.
○ https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2018.01105.x

● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy? What do you think is the reason?
○ What criteria are commonly used to determine outliers? How can they be wrong?
○ What is the author’s argument that outliers might actually be good for science?
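
As a starting point for the discussion, the difference between the two distributions shows up in their tails. The sketch below computes the probability of landing more than k units from the center for a standard Gaussian (mean 0, standard deviation 1) and a standard Cauchy (location 0, scale 1), using their closed-form cumulative distribution functions. The function names are assumptions made for this example, and the comparison is an illustration of the two curves, not a summary of the article.

```python
import math


def gaussian_tail(k):
    """P(|X| > k) for a standard Gaussian (mean 0, standard deviation 1)."""
    return math.erfc(k / math.sqrt(2))


def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy (location 0, scale 1)."""
    return 1 - (2 / math.pi) * math.atan(k)


for k in (1, 2, 3, 5):
    print(f"k={k}: Gaussian {gaussian_tail(k):.6f}, Cauchy {cauchy_tail(k):.6f}")
```

A rule such as "flag anything beyond 3 standard deviations" assumes the Gaussian tail, under which such values are rare (about 0.3%). Under a heavy-tailed distribution like the Cauchy, values that far from the center are common (about 20% at k = 3), so the same rule would discard a large share of legitimate data.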
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
○ https://www.apa.org/science/about/psa/2016/02/gaps-spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of missing values in analytics.
○ http://analytics-magazine.org/missing-values/

● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In case of systematic (non-random) missing data, would you still use multiple imputation? Or what else can you do?
