Scraping Document
Function Explanations:
1) PDFExtraction
The above function extracts a PDF file and saves it locally. It is called when a web
address contains a PDF URL that needs to be extracted.
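A minimal sketch of what this helper might look like, assuming the PDF is downloaded
with the requests library (the save directory and the filename logic are illustrative
assumptions, not the actual implementation):

import os
import requests

def PDFExtraction(pdf_url, save_dir="pdfs"):
    """Download the PDF at pdf_url and save it under save_dir."""
    os.makedirs(save_dir, exist_ok=True)
    # Derive a filename from the last path segment of the URL
    filename = pdf_url.rstrip("/").split("/")[-1] or "document.pdf"
    if not filename.lower().endswith(".pdf"):
        filename += ".pdf"
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(os.path.join(save_dir, filename), "wb") as f:
        f.write(response.content)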
2) winopen
The above function opens a link in a new tab and switches focus to that tab.
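A minimal sketch, assuming the script drives a Selenium WebDriver (the JavaScript
window.open call is one common way to open a URL in a new tab):

def winopen(driver, url):
    """Open url in a new tab and switch the driver's focus to it."""
    driver.execute_script("window.open(arguments[0], '_blank');", url)
    driver.switch_to.window(driver.window_handles[-1])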
3) Wincloseandswitch
The above function is used to switch to the previous window after closing the current
window.
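A minimal sketch under the same Selenium assumption:

def Wincloseandswitch(driver):
    """Close the current window/tab and switch back to the previous one."""
    driver.close()
    driver.switch_to.window(driver.window_handles[-1])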
4) checkdomain
In the above function, the domain list holds the domains of the websites to be scraped.
Based on this list, the function checks a URL's domain and dispatches to the scraper
function created for that specific domain.
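A minimal sketch of this dispatch; the dict-based domain list and the function
signatures are assumptions about how the real code is structured:

from urllib.parse import urlparse

# Domain keywords mapped to the scraper written for that domain
DOMAIN_SCRAPERS = {
    "aer": aer,
    "iana": iana,
    "sec": sec,
}

def checkdomain(driver, url):
    """Check the url's domain against the list and call the matching scraper."""
    host = urlparse(url).netloc
    for keyword, scraper in DOMAIN_SCRAPERS.items():
        if keyword in host:
            scraper(driver, url)
            return
    print(f"No scraper registered for domain: {host}")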
5) aer
The above function is the scraper for websites with the domain aer.
Explanation:
The approach here is to identify the compliance reporting text on the website in order
to scrape all the compliance reporting PDFs.
After checking the compliance reporting links: if a link contains the PDF directly, we
open the PDF in a new tab using winopen, extract it using PDFExtraction, and once done
close the window and move on to the next URL.
If the page does not contain the PDF directly but instead contains links to different
reports, we traverse to the report link in a new tab using winopen, search that page
for PDFs, and when one is found we again use winopen to open it in a new window and
extract it with PDFExtraction. This repeats until all the PDFs are finished (see the
sketch after the PS note below).
PS: Paging has not been added yet, as this is still raw code.
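A rough sketch of this flow, reusing the winopen, Wincloseandswitch, and PDFExtraction
helpers above; the link text and CSS selectors are assumptions about the page structure:

from selenium.webdriver.common.by import By

def aer(driver, url):
    """Find compliance reporting links and extract every PDF they lead to."""
    driver.get(url)
    hrefs = [a.get_attribute("href") for a in
             driver.find_elements(By.PARTIAL_LINK_TEXT, "Compliance reporting")]
    for href in hrefs:
        if href.lower().endswith(".pdf"):
            # The link points straight at a PDF: open, extract, close
            winopen(driver, href)
            PDFExtraction(driver.current_url)
            Wincloseandswitch(driver)
        else:
            # The link points at a report page: open it and scan it for PDFs
            winopen(driver, href)
            pdf_urls = [a.get_attribute("href") for a in
                        driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")]
            for pdf_url in pdf_urls:
                winopen(driver, pdf_url)
                PDFExtraction(driver.current_url)
                Wincloseandswitch(driver)
            Wincloseandswitch(driver)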
6) iana
The above function is the scraper for websites with the domain iana.
Explanation:
The approach here is to find the table containing the reports and their links. Once the
links are found, we capture the href of each one and store it in a list.
We then traverse the list, moving to each URL in a new tab using winopen, and save the
page content into a text file named after the report (see the sketch after the PS note
below).
PS: The cleaning of the text file still has to be handled, and the content can be
stored in any format other than text as required.
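A rough sketch of this flow; the table selector and the body-text extraction are
assumptions, and, as noted above, cleaning of the saved text is not handled here:

from selenium.webdriver.common.by import By

def iana(driver, url):
    """Collect report links from the table, then save each page as text."""
    driver.get(url)
    # Capture all hrefs up front so navigation does not invalidate elements
    hrefs = [a.get_attribute("href") for a in
             driver.find_elements(By.CSS_SELECTOR, "table a")]
    for href in hrefs:
        winopen(driver, href)
        filename = driver.title.strip() + ".txt"  # report name as filename
        with open(filename, "w", encoding="utf-8") as f:
            f.write(driver.find_element(By.TAG_NAME, "body").text)
        Wincloseandswitch(driver)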
7) sec
The above function is the scraper for websites with the domain sec.
Explanation:
The approach here is to feed the company search bar with the tickers stored in
Company_ticker_list, select the matching suggestion from the dropdown, click it, and go
to the listings page.
Once on the listings page, we search for the latest filings and open each filing one by
one in a new tab using the winopen function.
Once a filing is open, we check its extension: anything other than “.htm” is stored in
a text file whose filename is the title of the page, and those with the “.htm”
extension are likewise stored with the page title as the filename.
Along with this, we create a separate folder for each ticker and save the files to that
ticker folder's path (see the sketch after the PS note below).
PS: Currently the scraped data is stored in text files, but this can be changed based
on the needs.
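A rough sketch of this flow. The search-box id and the suggestion/filing selectors are
placeholders rather than the real page structure, Company_ticker_list is assumed to be
a module-level list of ticker strings, and the .htm/other-extension distinction is
collapsed here for brevity since both branches save under the page title:

import os
from selenium.webdriver.common.by import By

Company_ticker_list = ["AAPL", "MSFT"]  # example tickers

def sec(driver, url):
    """Search each ticker, open its latest filings, save each filing."""
    for ticker in Company_ticker_list:
        os.makedirs(ticker, exist_ok=True)  # one folder per ticker
        driver.get(url)
        box = driver.find_element(By.ID, "company-search")  # hypothetical id
        box.clear()
        box.send_keys(ticker)
        driver.find_element(By.CSS_SELECTOR, ".suggestion").click()  # hypothetical
        hrefs = [a.get_attribute("href") for a in
                 driver.find_elements(By.CSS_SELECTOR, ".latest-filings a")]
        for href in hrefs:
            winopen(driver, href)
            # Page title becomes the filename; content saved as text
            name = driver.title.strip() + ".txt"
            with open(os.path.join(ticker, name), "w", encoding="utf-8") as f:
                f.write(driver.find_element(By.TAG_NAME, "body").text)
            Wincloseandswitch(driver)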
8) Textread and directlink
The above two functions are used for input: we give the user a choice of the source of
the seed URLs, either reading them from a text file (Textread) or giving them directly
as an input into the code (directlink).
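A minimal sketch of these two input helpers; the file name seed_urls.txt is an
assumption:

def Textread(path="seed_urls.txt"):
    """Read seed URLs from a text file, one URL per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def directlink():
    """Take a seed URL directly as user input."""
    return [input("Enter the seed URL: ").strip()]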
9) main function
When we run the code, a prompt asks the user whether the seed URLs are present in text
files; if the answer is yes, Textread is called, otherwise directlink.
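A minimal sketch of this entry point, tying the pieces above together; Chrome is an
assumed driver choice:

from selenium import webdriver

def main():
    driver = webdriver.Chrome()
    answer = input("Are the seed URLs present in a text file? (yes/no): ")
    urls = Textread() if answer.strip().lower() == "yes" else directlink()
    for url in urls:
        checkdomain(driver, url)  # dispatch each seed URL to its scraper
    driver.quit()

if __name__ == "__main__":
    main()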
Thank You