
Guided Practice 3.3 – Web scraping and reading PDF

Task 1 – Adding modules

In order to use preprogrammed modules you must first load the module into Python. This is a standard procedure and one you will need to do each time you add a new module to your Python setup on your computer. After you add the module, you can call it as you normally would using the import statement at the top of the program. If you do not import the module, Python will give you an error and require you to add the module before you can proceed with the program.
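
For example, if you try to import a module that has not been installed yet, Python stops with an error like the one below (requests is used here just as an illustration):

>>> import requests
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'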

First, we need to install the program called pip. To do this, open a command prompt as an administrator: right-click on the Start
menu and select Command Prompt (Admin).
In the command prompt type the commands:
In the command prompt type the commands:

python -m ensurepip --upgrade

python -m pip install --upgrade pip

This will install the pip program and upgrade it on your system. You can now use the pip command to
install modules for your Python programs. Close the command prompt and restart it with
admin rights as you did above.
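
If you want to confirm pip is working before moving on, you can ask it for its version number (the exact output will vary by system):

pip --version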

Now we’re going to load in the modules which will allow us to scrape webpages from the Internet.

In the administrative command prompt type:

pip install requests

pip install bs4

Take a screenshot of your completed pip installation.

Deliverables for Task 1

 Screenshot of your completed pip installation

Task 2 – Pseudocode
Now you will be creating pseudocode for three functions that will be used in the program. The functions
are readwebpage, parsehtml, and outputquotes. You will also write the pseudocode for the main program
that will call the functions.

Pseudocode - readwebpage

 Open the webpage
 Read in information
 Return the html content from the webpage

Pseudocode - parsehtml

 Take the html data from the webpage and translate using the html parser
 Return the parsed data

Pseudocode - outputquotes

 Pull quotes from the parsed data and display them on the screen

Pseudocode – Main program

 Pass a webpage to readwebpage
 Pass the html data to parsehtml
 Print out the quotes using outputquotes
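
Before writing the real code, it can help to see how this pseudocode might map onto a Python skeleton. This is only a sketch; the bodies are placeholders and the full functions are written in Task 3.

def readwebpage(url):
    ...    # open the webpage, read in the information, return the html content

def parsehtml(html):
    ...    # translate the html data using the html parser, return the parsed data

def outputquotes(parsed):
    ...    # pull quotes from the parsed data and display them on the screen

url = "http://quotes.toscrape.com"
html = readwebpage(url)
parsed = parsehtml(html)
outputquotes(parsed)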

Deliverables for Task 2

 Screenshot of the Pseudocode for your program

Task 3 – Writing the program

Now you are going to write a program to read in a web page, process the data, and write out the quotes
to the screen. Open a file in IDLE and name the program webscrape.py. The webpage you
will be scraping is http://quotes.toscrape.com. You will read in the information in one function, parse the data in
another, and print out the quotes in a third. There will also be a main program that will call each of
the functions in turn.

Enter the following into your webscrape.py program in IDLE. First, we need to import the two external modules you installed using pip above.

Enter the lines

import requests
from bs4 import BeautifulSoup

print("<StudentID>")

The second line imports only the BeautifulSoup class into your program from the bs4 module. You
could import the whole bs4 module into your program, but we’re only going to use a small part, so it
makes sense to import only what we need.
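
To see the difference between the two styles, compare:

import bs4                       # import the whole module; you would then write bs4.BeautifulSoup(...)
from bs4 import BeautifulSoup    # import just the class; you can write BeautifulSoup(...) directly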

The first function is called readwebpage. Enter the following lines

def readwebpage(url):
    output = requests.get(url)    # request the page from the web server
    return(output)

Let’s test it by adding the lines below

url = "http://quotes.toscrape.com"
html = readwebpage(url)
print(html.text)

Take a screenshot of your test results.
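
If nothing prints, or the output looks wrong, it can help to confirm the request actually succeeded. The requests library stores the HTTP status code on the response object (200 means OK):

print(html.status_code)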

Now let’s add the second function. Enter the following below the readwebpage function

def parsehtml(html):
    parsed = BeautifulSoup(html.content, 'html.parser')    # build a parse tree from the raw html
    return(parsed)
Now let’s test it by adding the following to the main program

url = "http://quotes.toscrape.com"
html = readwebpage(url)
parsed = parsehtml(html)
print(parsed.text)

Take a screenshot of your test results.

Finally, we’ll put together the last function, outputquotes. As you saw from the previous test results, your
information is there; it just needs to be formatted.

Add the following lines for your third function.

def outputquotes(parsed):
    quotes = parsed.find_all("div", class_="quote")           # each quote sits in a div with class "quote"
    for quote in quotes:
        text = quote.find('span', class_='text').text         # the quote text
        author = quote.find('small', class_='author').text    # the author's name
        print(text, author)

Add the following to your main program

outputquotes(parsed)

Take a screenshot of your code and output from the program.
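
As an optional extension (not required for the deliverables), each quote on this site also carries topic tags in <a> elements with the class tag. If you want those too, you could add the following inside the for loop of outputquotes:

        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        print(text, author, tags)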

Deliverables for Task 3

 Screenshot of test results for readwebpage
 Screenshot of test results for parsehtml
 Final screenshot of the code and output of your program

Task 4 – Pull information from PDF document

Now we are going to use a module to pull text information from a PDF document. Often it is difficult to
pull information from a PDF into a usable format you can use in your databases and spreadsheets. In
this task you will pull text information from a PDF document and display or send the contents to a text
file.

First you need to install the module that pulls information from the PDF document. In the administrative command prompt, type the following:

pip install pdfplumber

Open your IDLE editor and create a new program called PullPDF.py. Now let’s write a simple program to
pull text information from a document.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)    # open the PDF file for reading
    return pull

pdf = pullpdf("SimplePDF.pdf")
page = pdf.pages[0]                # pages count from 0, so this is the first page
print(page.extract_text())
Test the program. Take a screenshot of your output.

You will need to double click on the yellow box to see the text from the pdf document. You will notice
that we needed to set the page to pdf page 0 (the first page) in order to extract the text. If we have
more than one page we can scan all of them by putting them into a loop. Type the following example:

pdf = pullpdf("MediumPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())
In this case, because the PDF file has multiple pages you will need to loop through each page.
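
If you also want to know which page each block of text came from, one small variation (not required for this task) is to number the pages as you loop using Python's enumerate:

for number, page in enumerate(pdf.pages, start=1):
    print("Page", number)
    print(page.extract_text())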

Finally, we’re going to use the same program on a complex PDF file. This file contains 90 pages as well
as data and information that can be useful for you to pull into a spreadsheet so you can manipulate the
data.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

pdf = pullpdf("ComplexPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())

You will notice that this is not terribly useful, as the data is just being printed out to the screen. Let’s try
writing a second function that will write the data into a file so we can pull the data into a spreadsheet or
database for analysis.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

def writepdf(pdf):
    f = open("Complex_Output.txt", "w", encoding='utf-8')    # open the output file for writing
    for page in pdf.pages:
        f.write(page.extract_text())
    f.close()    # the parentheses are required to actually close the file

pdf = pullpdf("ComplexPDF.pdf")
writepdf(pdf)
Take a screenshot of your program and its output.
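
Since the goal is to get this data into a spreadsheet, it is worth knowing that pdfplumber can also extract tables directly with page.extract_tables(). The sketch below, which assumes the same ComplexPDF.pdf file and an example output name of Complex_Tables.csv, writes every detected table row out as CSV:

import csv
import pdfplumber

pdf = pdfplumber.open("ComplexPDF.pdf")
with open("Complex_Tables.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for page in pdf.pages:
        for table in page.extract_tables():    # each table is a list of rows
            writer.writerows(table)            # each row is a list of cell strings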

Deliverables for Task 4

 Screenshot of your output for SimplePDF
 Screenshot of your program and output from the PullPDF program
