Guided Practice 3
In order to use preprogrammed modules you must first load the module into Python. This is a standard procedure and one you will need to do each time you add a new module to your Python setup on your computer. After you install a module, you can use it as you normally would with an import statement at the top of the program. If you do not import the module, Python will give you an error and require you to add the module before you can proceed with the program.
First, we need to install the program called pip. To do this, open a command prompt as an administrator: right-click on the Start menu and select Command Prompt (Admin).
In the command prompt type the commands:
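(The commands themselves appeared as a screenshot in the original handout. A typical sequence for installing and upgrading pip from an administrator Command Prompt, given as an assumption rather than the original screenshot, is:)

```shell
py -m ensurepip --upgrade
py -m pip install --upgrade pip
```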
This will install the pip program and upgrade it on your system. You can now use the pip command to install modules for your Python programs. Close the command prompt and restart it with admin rights as you did above.
Now we’re going to install the modules that will allow us to scrape webpages from the Internet.
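(The exact commands were shown as a screenshot in the original; presumably they install the requests and beautifulsoup4 packages used in the rest of this handout:)

```shell
pip install requests
pip install beautifulsoup4
```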
Task 2 – Pseudocode
Now you will be creating pseudocode for three functions that will be used in the program. The functions are readwebpage, parsehtml, and outputquotes. You will also write the pseudocode for the main program that will call the functions.
Pseudocode - readwebpage
Request the webpage data from the given url
Return the webpage data
Pseudocode - parsehtml
Take the html data from the webpage and translate using the html parser
Return the parsed data
Pseudocode - outputquotes
Pull quotes from the parsed data and display them on the screen
Now you are going to write a program to read in a web page, process the data, and write out the quotes to the screen. Open a file in IDLE and name the program webscrape.py. The webpage you will be scraping is http://quotes.toscrape.com. You will read in the information in one function, parse the data in another, and print out the quotes in a third. There will also be a main program that calls each of the functions in turn.
First we need to import the two external modules you installed using pip above.
import requests
from bs4 import BeautifulSoup
print("<StudentID>")
The second line imports only the BeautifulSoup class into your program from the bs4 package. You could import the whole bs4 module into your program, but we’re only going to use a small part, so it makes sense to import only what we need.
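(As a small illustration of the difference, and not part of the handout's program, both import styles give you the same class; the from-import form simply binds the shorter name:)

```python
import bs4                       # imports the whole package
from bs4 import BeautifulSoup    # imports just the one class we need

# Both names refer to the same class object:
print(bs4.BeautifulSoup is BeautifulSoup)  # → True
```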
def readwebpage(url):
    output = requests.get(url)
    return output
url = "http://quotes.toscrape.com"
html = readwebpage(url)
print(html.text)
Now let’s add the second function. Enter the following below the readwebpage function:
def parsehtml(html):
    parsed = BeautifulSoup(html.content, "html.parser")
    return parsed
Now let’s test it by adding the following to the main program:
url = "http://quotes.toscrape.com"
html = readwebpage(url)
parsed = parsehtml(html)
print(parsed.text)
Finally we’ll put together the last function, outputquotes. As you saw from the previous test results, your information is there; it just needs to be formatted.
def outputquotes(parsed):
    quotes = parsed.find_all("div", class_="quote")
    for quote in quotes:
        text = quote.find("span", class_="text").text
        author = quote.find("small", class_="author").text
        print(text, author)
outputquotes(parsed)
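(To see how find_all and find work without making a network request, here is a self-contained sketch that parses a small HTML snippet shaped like the quotes.toscrape.com markup; the quote itself is made up for illustration:)

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page: one quote in the same
# div/span/small structure the site uses.
sample = """
<div class="quote">
  <span class="text">All the world's a stage.</span>
  <small class="author">William Shakespeare</small>
</div>
"""

parsed = BeautifulSoup(sample, "html.parser")
for quote in parsed.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    print(text, author)
```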
Now we are going to use a module to pull text information from a PDF document. It is often difficult to pull information from a PDF into a format you can use in your databases and spreadsheets. In this task you will pull text information from a PDF document and display it or send the contents to a text file.
Open your IDLE editor and create a new program called PullPDF.py.
First you need to install the module to pull information from the PDF document. Type the following:
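(The command itself appeared as a screenshot in the original; presumably it installs the pdfplumber package used below:)

```shell
pip install pdfplumber
```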
Now let’s write a simple program to pull text information from a document.
import pdfplumber
print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

pdf = pullpdf("SimplePDF.pdf")
page = pdf.pages[0]
print(page.extract_text())
Test the program. Take a screenshot of your output.
You will need to double-click on the yellow box to see the text from the PDF document. You will notice that we needed to select page 0 (the first page) in order to extract the text. If the PDF has more than one page, we can scan all of the pages by putting them into a loop. Type the following example:
pdf = pullpdf("MediumPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())
In this case, because the PDF file has multiple pages you will need to loop through each page.
Finally, we’re going to use the same program on a complex PDF file. This file contains 90 pages of data and information that can be useful to pull into a spreadsheet so you can manipulate the data.
import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

pdf = pullpdf("ComplexPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())
You will notice that this is not terribly useful, as the data is just being printed to the screen. Let’s try writing a second function that will write the data into a file so we can pull the data into a spreadsheet or database for analysis.
import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

def writepdf(pdf):
    f = open("Complex_Output.txt", "w", encoding="utf-8")
    for page in pdf.pages:
        f.write(page.extract_text())
    f.close()

pdf = pullpdf("ComplexPDF.pdf")
writepdf(pdf)
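(One caution with this pattern: if the program hits an error before the file is closed, some output can be lost. A with block closes the file automatically. Here is a minimal sketch of the same write-then-read pattern, with made-up strings standing in for the PDF page texts so it runs without pdfplumber:)

```python
# Made-up stand-ins for what page.extract_text() would return:
pages = ["Page one text", "Page two text"]

# The with block closes the file for us, even if an error occurs mid-loop.
with open("demo_output.txt", "w", encoding="utf-8") as f:
    for text in pages:
        f.write(text + "\n")  # the newline keeps pages from running together

# Read the file back to confirm what was written:
with open("demo_output.txt", encoding="utf-8") as f:
    print(f.read())
```

Note that writing the page texts back-to-back with no separator would make the page boundaries invisible in the output file; adding a newline after each page keeps them readable.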
Take a screenshot of the program and output from your program.