
Guided Practice 3.3 – Web scraping and reading PDF

Task 1 – Adding modules

In order to use preprogrammed modules you must first load the module into Python. This is a standard procedure and one you will need to do each time you add a new module to your Python setup on your computer. After you add the module, you can call it as you normally would using the import statement at the top of the program. If you do not import the module, Python will give you an error and require you to add the module before you can proceed with the program.
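
For example, if you try to import a module that has not been installed yet, Python stops with an error like the one below (requests is used here just as an illustration):

>>> import requests
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'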

First, we need to install the program called pip. To do this, open a command prompt as an administrator: right-click on the Start
menu and select Command Prompt (Admin).
In the command prompt type the commands:
In the command prompt type the commands:

python -m ensurepip --upgrade

python -m pip install --upgrade pip

This will install the pip program and upgrade it on your system. You can now use the pip command to
install modules for your Python programs. Close the command prompt and restart it with
admin rights as you did above.
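
If you want to confirm pip is working before moving on, you can ask it for its version number (the exact output will vary by system):

pip --version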

Now we’re going to load in the modules which will allow us to scrape webpages from the Internet.

In the administrative command prompt type:

pip install requests

pip install bs4

Take a screenshot of your completed pip installation.

Deliverables for Task 1

 Screenshot of your completed pip installation

Task 2 – Pseudocode
Now you will be creating pseudocode for three functions that will be used in the program. The functions
are readwebpage, parsehtml, and outputquotes. You will also write the pseudocode for the main program
that will call the functions.

Pseudocode - readwebpage

 Open the webpage
 Read in information
 Return the html content from the webpage

Pseudocode - parsehtml

 Take the html data from the webpage and translate using the html parser
 Return the parsed data

Pseudocode - outputquotes

 Pull quotes from the parsed data and display them on the screen

Pseudocode – Main program

 Pass a webpage to readwebpage
 Pass the html data to parsehtml
 Print out the quotes using outputquotes
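
Before writing the real code, it can help to see how this pseudocode might map onto a Python skeleton. This is only a sketch; the bodies are placeholders and the full functions are written in Task 3.

def readwebpage(url):
    ...    # open the webpage, read in the information, return the html content

def parsehtml(html):
    ...    # translate the html data using the html parser, return the parsed data

def outputquotes(parsed):
    ...    # pull quotes from the parsed data and display them on the screen

url = "http://quotes.toscrape.com"
html = readwebpage(url)
parsed = parsehtml(html)
outputquotes(parsed)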

Deliverables for Task 2

 Screenshot of the Pseudocode for your program

Task 3 – Writing the program

Now you are going to write a program to read in a web page, process the data, and write out the quotes
to the screen. Open a file in IDLE and name the program webscrape.py. The webpage you
will be scraping is http://quotes.toscrape.com. You will read in the information in one function, parse the data in
another, and print out the quotes in a third. There will also be a main program that will call each of
the functions in turn.

Enter the following into your webscrape.py program in IDLE. First, we need to import the two external modules you installed using pip above.

Enter the lines

import requests
from bs4 import BeautifulSoup

print("<StudentID>")

The second line imports only the BeautifulSoup class into your program from the bs4 module. You
could import the whole bs4 module into your program, but we’re only going to use a small part, so it
makes sense to import only what we need.
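
To see the difference between the two styles, compare:

import bs4                       # import the whole module; you would then write bs4.BeautifulSoup(...)
from bs4 import BeautifulSoup    # import just the class; you can write BeautifulSoup(...) directly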

The first function is called readwebpage. Enter the following lines

def readwebpage(url):
    output = requests.get(url)    # request the page from the web server
    return(output)

Let’s test it by adding the lines below

url = "http://quotes.toscrape.com"
html = readwebpage(url)
print(html.text)

Take a screenshot of your test results.
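
If nothing prints, or the output looks wrong, it can help to confirm the request actually succeeded. The requests library stores the HTTP status code on the response object (200 means OK):

print(html.status_code)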

Now let’s add the second function. Enter the following below the readwebpage function

def parsehtml(html):
    parsed = BeautifulSoup(html.content, 'html.parser')    # build a parse tree from the raw html
    return(parsed)
Now let’s test it by adding the following to the main program

url = "http://quotes.toscrape.com"
html = readwebpage(url)
parsed = parsehtml(html)
print(parsed.text)

Take a screenshot of your test results.

Finally, we’ll put together the last function, outputquotes. As you saw from the previous test results, your
information is there; it just needs to be formatted.

Add the following lines for your third function.

def outputquotes(parsed):
    quotes = parsed.find_all("div", class_="quote")           # each quote sits in a div with class "quote"
    for quote in quotes:
        text = quote.find('span', class_='text').text         # the quote text
        author = quote.find('small', class_='author').text    # the author's name
        print(text, author)

Add the following to your main program

outputquotes(parsed)

Take a screenshot of your code and output from the program.
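
As an optional extension (not required for the deliverables), each quote on this site also carries topic tags in <a> elements with the class tag. If you want those too, you could add the following inside the for loop of outputquotes:

        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        print(text, author, tags)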

Deliverables for Task 3

 Screenshot of test results for readwebpage
 Screenshot of test results for parsehtml
 Final screenshot of the code and output of your program

Task 4 – Pull information from PDF document

Now we are going to use a module to pull text information from a PDF document. Often it is difficult to
pull information from a PDF into a usable format you can use in your databases and spreadsheets. In
this task you will pull text information from a PDF document and display or send the contents to a text
file.

First you need to install the module that pulls information from the PDF document. In the administrative command prompt, type the following:

pip install pdfplumber

Open your IDLE editor and create a new program called PullPDF.py. Now let’s write a simple program to
pull text information from a document.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)    # open the PDF file for reading
    return pull

pdf = pullpdf("SimplePDF.pdf")
page = pdf.pages[0]                # pages count from 0, so this is the first page
print(page.extract_text())
Test the program. Take a screenshot of your output.

You will need to double click on the yellow box to see the text from the pdf document. You will notice
that we needed to set the page to pdf page 0 (the first page) in order to extract the text. If we have
more than one page we can scan all of them by putting them into a loop. Type the following example:

pdf = pullpdf("MediumPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())
In this case, because the PDF file has multiple pages you will need to loop through each page.
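
If you also want to know which page each block of text came from, one small variation (not required for this task) is to number the pages as you loop using Python's enumerate:

for number, page in enumerate(pdf.pages, start=1):
    print("Page", number)
    print(page.extract_text())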

Finally, we’re going to use the same program on a complex PDF file. This file contains 90 pages as well
as data and information that can be useful for you to pull into a spreadsheet so you can manipulate the
data.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

pdf = pullpdf("ComplexPDF.pdf")
for page in pdf.pages:
    print(page.extract_text())

You will notice that this is not terribly useful, as the data is just being printed out to the screen. Let’s try
writing a second function that will write the data into a file so we can pull the data into a spreadsheet or
database for analysis.

import pdfplumber

print("<Your StudentID>")

def pullpdf(pdf):
    pull = pdfplumber.open(pdf)
    return pull

def writepdf(pdf):
    f = open("Complex_Output.txt", "w", encoding='utf-8')    # open the output file for writing
    for page in pdf.pages:
        f.write(page.extract_text())
    f.close()    # the parentheses are required to actually close the file

pdf = pullpdf("ComplexPDF.pdf")
writepdf(pdf)
Take a screenshot of your program and its output.
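
Since the goal is to get this data into a spreadsheet, it is worth knowing that pdfplumber can also extract tables directly with page.extract_tables(). The sketch below, which assumes the same ComplexPDF.pdf file and an example output name of Complex_Tables.csv, writes every detected table row out as CSV:

import csv
import pdfplumber

pdf = pdfplumber.open("ComplexPDF.pdf")
with open("Complex_Tables.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for page in pdf.pages:
        for table in page.extract_tables():    # each table is a list of rows
            writer.writerows(table)            # each row is a list of cell strings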

Deliverables for Task 4

 Screenshot of your output for SimplePDF
 Screenshot of your program and output from the PullPDF program
