Scraping Document

This document outlines functions for scraping various websites and extracting PDFs. It describes 9 functions: 1) PDFExtraction to extract PDFs, 2) winopen to open new tabs, 3) Wincloseandswitch to close tabs, 4) checkdomain to check domains against a list, 5) aer to scrape aer.gov, 6) iana to scrape iana.org, 7) sec to scrape sec.gov, 8) textread and directlink to get URLs from files or input, and 9) main function to run the scraping. Various approaches are described like extracting direct or indirect PDF links, storing in text files or folders. The document provides explanations of the scraping approaches and functions.

Uploaded by Er Sachin Safale

Websites considered for scraping: iana.org, sec.gov, aer.gov

Functions Explanation:

1) PDFExtraction

This function extracts a PDF file and saves it accordingly; it is called when a web address contains a PDF URL that needs to be extracted.
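Since the document does not show the code itself, here is a minimal sketch of what PDFExtraction might look like, using only the standard library. The names pdf_filename and pdf_extraction are assumptions for illustration, not the document's actual code.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

def pdf_filename(url):
    # Derive a local filename from the URL path; query strings are ignored.
    name = os.path.basename(urlparse(url).path)
    return name or "download.pdf"

def pdf_extraction(url, out_dir="."):
    # Download the PDF at `url` into `out_dir` (performs a network request).
    dest = os.path.join(out_dir, pdf_filename(url))
    urlretrieve(url, dest)
    return dest
```

In practice the real function may also validate the Content-Type header before saving.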

2) winopen

This function clicks on a link and opens it in a new tab, switching focus to that tab.

3) Wincloseandswitch

This function closes the current window and switches back to the previous one.

4) checkdomain

The domain list holds the domains of the websites to be scraped. This function checks a URL's domain against that list and dispatches to the scraper function written for that specific domain.
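The dispatch described above can be sketched with a lookup table keyed by domain; the handler mapping and function name are assumptions for illustration.

```python
from urllib.parse import urlparse

def checkdomain(url, handlers):
    # Look up the scraper registered for this URL's domain, if any.
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return handlers.get(domain)
```

Each entry in `handlers` would map a domain string to the matching scraper function (aer, iana, or sec).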

5) aer

This function is the scraper for websites on the aer domain.

Explanation:

The approach here is to identify the compliance-reporting text on the website and scrape all the compliance-reporting PDFs.

When a compliance-reporting link points directly to a PDF, we open the PDF in a new tab using winopen, extract it using PDFExtraction, close the window, and move on to the next URL.

If the page does not contain the PDF directly but instead links to different reports, we open each report link in a new tab using winopen, search that page for PDFs, open each one in a new window with winopen, and extract it with PDFExtraction. This repeats until all the PDFs are processed.

PS: Paging has not been added yet, as this is still rough code.
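The branch between direct and indirect links above hinges on recognising whether a href points at a PDF; a small hypothetical helper for that check:

```python
def is_pdf_link(href):
    # True when the link (ignoring any query string) ends in .pdf.
    if not href:
        return False
    return href.lower().split("?", 1)[0].endswith(".pdf")
```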

6) iana

This function is the scraper for websites on the iana domain.

Explanation:

The approach here is to locate the table listing the reports and their links, then capture each link's href one by one and store it in a list.

We then iterate over the list, open each URL in a new tab using winopen, and save the content to a text file named after the report.

PS: Cleaning of the text files is still to be handled, and the output can be stored in any other format as required.
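Collecting the hrefs out of the reports table can be sketched with the standard-library HTML parser; the class and function names are illustrative, and a real scraper would work on the live page instead of a string.

```python
from html.parser import HTMLParser

class TableLinkCollector(HTMLParser):
    # Collect the href of every <a> that appears inside a <table>.
    def __init__(self):
        super().__init__()
        self.in_table = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = True
        elif tag == "a" and self.in_table:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False

def collect_table_links(html):
    parser = TableLinkCollector()
    parser.feed(html)
    return parser.links
```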

7) sec

This function is the scraper for websites on the sec domain.

Explanation:

The approach here is to feed the company search bar with the tickers stored in Company_ticker_list, select the matching suggestion from the dropdown, click it, and go to the listings page.

On the listings page, we search for the latest filings and open each one in a new tab using winopen.
Once a filing is open, we check its extension: anything other than ".htm" is stored in a text file named after the page title, while ".htm" filings are likewise saved using the page title as the filename.

Alongside this, we create a separate folder for each ticker and save the files to that ticker's folder path.

PS: The scraped data is currently stored in text files, but this can be changed based on the needs.
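The per-ticker folder layout and title-based filename described above can be sketched as a single path helper; the function name and sanitisation rule are assumptions, not the document's actual code.

```python
import os
import re

def filing_path(base_dir, ticker, page_title):
    # Build a per-ticker folder and a .txt filename derived from the
    # page title, replacing characters that are unsafe in filenames.
    safe = re.sub(r'[\\/:*?"<>|]+', "_", page_title).strip()
    folder = os.path.join(base_dir, ticker)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe + ".txt")
```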

8) textread and directlink

These two functions handle input: the user chooses the source of the seed URL, either reading it from a text file or entering it directly into the code.

textread reads the seed URL from a file.

directlink reads the seed URL from user input.
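The two input helpers above can be sketched in a few lines; the exact signatures are assumptions.

```python
def textread(path):
    # Read seed URLs from a text file, one URL per line; blank lines skipped.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def directlink(raw):
    # Treat a string typed by the user as a single seed URL.
    return [raw.strip()]
```

Both return a list so the rest of the pipeline can treat the two sources identically.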

9) main function
When the code runs, it prompts the user to say whether the seed URLs are present in text files; if the answer is yes, textread is called, otherwise directlink.

The remaining functions are then called with the seed URLs that were read.

PS: This is rough code outlining the approach.
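The prompt-and-dispatch flow of the main function might look like the sketch below. The prompt wording and function name are assumptions; `ask` is injectable so the flow can be driven without a console.

```python
def get_seed_urls(ask=input):
    # Ask whether the seed URLs live in a text file; dispatch accordingly.
    answer = ask("Are the seed URLs in a text file? (y/n): ").strip().lower()
    if answer == "y":
        path = ask("Path to the text file: ")
        # textread branch: one URL per line, blanks skipped.
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    # directlink branch: a single URL typed by the user.
    return [ask("Enter the seed URL: ").strip()]
```

The returned list would then be handed to checkdomain to pick the right scraper for each URL.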

Thank You
