Scraping Document

This document outlines functions for scraping various websites and extracting PDFs. It describes 9 functions: 1) PDFExtraction to extract PDFs, 2) winopen to open new tabs, 3) Wincloseandswitch to close tabs, 4) checkdomain to check domains against a list, 5) aer to scrape aer.gov, 6) iana to scrape iana.org, 7) sec to scrape sec.gov, 8) textread and directlink to get URLs from files or input, and 9) main function to run the scraping. Various approaches are described like extracting direct or indirect PDF links, storing in text files or folders. The document provides explanations of the scraping approaches and functions.

Uploaded by Er Sachin Safale

Websites considered for scraping: iana.org, sec.gov, aer.gov

Functions Explanation:

1) PDFExtraction

This function extracts a PDF file and saves it accordingly; it is called when a web address contains a PDF URL that needs to be extracted.
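Since the document does not show the code itself, here is a minimal sketch of what PDFExtraction might look like, using only the standard library. The names pdf_filename and pdf_extraction are assumptions for illustration, not the document's actual code.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

def pdf_filename(url):
    # Derive a local filename from the URL path; query strings are ignored.
    name = os.path.basename(urlparse(url).path)
    return name or "download.pdf"

def pdf_extraction(url, out_dir="."):
    # Download the PDF at `url` into `out_dir` (performs a network request).
    dest = os.path.join(out_dir, pdf_filename(url))
    urlretrieve(url, dest)
    return dest
```

In practice the real function may also validate the Content-Type header before saving.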

2) winopen

This function clicks on a link and opens it in a new tab, switching focus to that tab.

3) Wincloseandswitch

This function closes the current window and switches back to the previous one.

4) checkdomain

The domain list holds the domains of the websites to be scraped. This function checks a URL's domain against that list and dispatches to the scraper function written for that specific domain.
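The dispatch described above can be sketched with a lookup table keyed by domain; the handler mapping and function name are assumptions for illustration.

```python
from urllib.parse import urlparse

def checkdomain(url, handlers):
    # Look up the scraper registered for this URL's domain, if any.
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return handlers.get(domain)
```

Each entry in `handlers` would map a domain string to the matching scraper function (aer, iana, or sec).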

5) aer

This function is the scraper for websites on the aer domain.

Explanation:

The approach here is to identify the compliance-reporting text on the website and scrape all the compliance-reporting PDFs.

When a compliance-reporting link points directly to a PDF, we open the PDF in a new tab using winopen, extract it using PDFExtraction, close the window, and move on to the next URL.

If the page does not contain the PDF directly but instead links to different reports, we open each report link in a new tab using winopen, search that page for PDFs, open each one in a new window with winopen, and extract it with PDFExtraction. This repeats until all the PDFs are processed.

PS: Paging has not been added yet, as this is still rough code.
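The branch between direct and indirect links above hinges on recognising whether a href points at a PDF; a small hypothetical helper for that check:

```python
def is_pdf_link(href):
    # True when the link (ignoring any query string) ends in .pdf.
    if not href:
        return False
    return href.lower().split("?", 1)[0].endswith(".pdf")
```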

6) iana

This function is the scraper for websites on the iana domain.

Explanation:

The approach here is to locate the table listing the reports and their links, then capture each link's href one by one and store it in a list.

We then iterate over the list, open each URL in a new tab using winopen, and save the content to a text file named after the report.

PS: Cleaning of the text files is still to be handled, and the output can be stored in any other format as required.
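Collecting the hrefs out of the reports table can be sketched with the standard-library HTML parser; the class and function names are illustrative, and a real scraper would work on the live page instead of a string.

```python
from html.parser import HTMLParser

class TableLinkCollector(HTMLParser):
    # Collect the href of every <a> that appears inside a <table>.
    def __init__(self):
        super().__init__()
        self.in_table = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = True
        elif tag == "a" and self.in_table:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False

def collect_table_links(html):
    parser = TableLinkCollector()
    parser.feed(html)
    return parser.links
```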

7) sec

This function is the scraper for websites on the sec domain.

Explanation:

The approach here is to feed the company search bar with the tickers stored in Company_ticker_list, select the matching suggestion from the dropdown, click it, and go to the listings page.

On the listings page, we search for the latest filings and open each one in a new tab using winopen.
Once a filing is open, we check its extension: anything other than ".htm" is stored in a text file named after the page title, while ".htm" filings are likewise saved using the page title as the filename.

Alongside this, we create a separate folder for each ticker and save the files to that ticker's folder path.

PS: The scraped data is currently stored in text files, but this can be changed based on the needs.
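The per-ticker folder layout and title-based filename described above can be sketched as a single path helper; the function name and sanitisation rule are assumptions, not the document's actual code.

```python
import os
import re

def filing_path(base_dir, ticker, page_title):
    # Build a per-ticker folder and a .txt filename derived from the
    # page title, replacing characters that are unsafe in filenames.
    safe = re.sub(r'[\\/:*?"<>|]+', "_", page_title).strip()
    folder = os.path.join(base_dir, ticker)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe + ".txt")
```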

8) textread and directlink

These two functions handle input: the user chooses the source of the seed URL, either reading it from a text file or entering it directly into the code.

textread reads the seed URL from a file.

directlink reads the seed URL from user input.
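The two input helpers above can be sketched in a few lines; the exact signatures are assumptions.

```python
def textread(path):
    # Read seed URLs from a text file, one URL per line; blank lines skipped.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def directlink(raw):
    # Treat a string typed by the user as a single seed URL.
    return [raw.strip()]
```

Both return a list so the rest of the pipeline can treat the two sources identically.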

9) main function
When the code runs, it prompts the user to say whether the seed URLs are present in text files; if the answer is yes, textread is called, otherwise directlink.

The remaining functions are then called with the seed URLs that were read.

PS: This is rough code outlining the approach.
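The prompt-and-dispatch flow of the main function might look like the sketch below. The prompt wording and function name are assumptions; `ask` is injectable so the flow can be driven without a console.

```python
def get_seed_urls(ask=input):
    # Ask whether the seed URLs live in a text file; dispatch accordingly.
    answer = ask("Are the seed URLs in a text file? (y/n): ").strip().lower()
    if answer == "y":
        path = ask("Path to the text file: ")
        # textread branch: one URL per line, blanks skipped.
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    # directlink branch: a single URL typed by the user.
    return [ask("Enter the seed URL: ").strip()]
```

The returned list would then be handed to checkdomain to pick the right scraper for each URL.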

Thank You
