2018/9/30 Controlling the Web with Python – Towards Data Science


William Koehrsen
Data Scientist at Feature Labs, Data Science Communicator
Mar 11 · 9 min read

Controlling the Web with Python


An adventure in simple web automation

Problem: Submitting class assignments requires navigating a maze of
web pages so complex that several times I’ve turned an assignment in to
the wrong place. Also, while this process only takes 1–2 minutes, it
sometimes seems like an insurmountable barrier (like when I’ve
finished an assignment way too late at night and I can barely remember
my password).

Solution: Use Python to automatically submit completed assignments!
Ideally, I would be able to save an assignment, type a few keys, and
have my work uploaded in a matter of seconds. At first this sounded too
good to be true, but then I discovered selenium, a tool which can be
used with Python to navigate the web for you.

https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08 1/14

Obligatory XKCD

. . .

Anytime we find ourselves repeating tedious actions on the web with
the same sequence of steps, this is a great chance to write a program to
automate the process for us. With selenium and Python, we just need to
write a script once, and then we can run it as many times as we want and
save ourselves from repeating monotonous tasks (and in my case,
eliminate the chance of submitting an assignment in the wrong place)!

Here, I’ll walk through the solution I developed to automatically (and
correctly) submit my assignments. Along the way, we’ll cover the basics
of using Python and selenium to programmatically control the web.
While this program does work (I’m using it every day!), it’s pretty
custom, so you won’t be able to copy and paste the code for your
application. Nonetheless, the general techniques here can be applied to
a limitless number of situations. (If you want to see the complete code,
it’s available on GitHub.)

Approach
Before we can get to the fun part of automating the web, we need to
figure out the general structure of our solution. Jumping right into
programming without a plan is a great way to waste many hours in
frustration. I want to write a program to submit completed course
assignments to the correct location on Canvas (my university’s
“learning management system”). Starting with the basics, I need a way
to tell the program the name of the assignment to submit and the class.
I went with a simple approach and created a folder to hold completed
assignments with child folders for each class. In the child folders, I
place the completed document named for the particular assignment.
The program can figure out the name of the class from the folder, and
the name of the assignment by the document title.

Here’s an example where the name of the class is EECS491 and the
assignment is “Assignment 3 — Inference in Larger Graphical Models”.

File structure (left) and Complete Assignment (right)

The first part of the program is a loop to go through the folders to find
the assignment and class, which we store in a Python tuple:

# os for file management
import os

# Build tuple of (class, file) to turn in
submission_dir = 'completed_assignments'
dir_list = list(os.listdir(submission_dir))

for directory in dir_list:
    file_list = list(os.listdir(os.path.join(submission_dir, directory)))
    if len(file_list) != 0:
        file_tup = (directory, file_list[0])

print(file_tup)

('EECS491', 'Assignment 3 - Inference in Larger Graphical Models.txt')

This takes care of file management, and the program now knows the
class and the assignment to turn in. The next step is to use selenium
to navigate to the correct webpage and upload the assignment.
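The folder walk can also be wrapped in a small function, which makes it easier to test without a browser. The following is only a sketch using pathlib; the names find_submission and assignment_title are made up for illustration and are not part of the original script. It differs from the loop above in two small ways: os.listdir returns entries in arbitrary order, so the sketch sorts them to make the result repeatable, and it returns the first assignment it finds instead of the last.

```python
from pathlib import Path


def find_submission(submission_dir):
    """Return (class_name, file_name) for the first assignment found.

    Folders and files are visited in sorted order so the result is
    repeatable. Returns None when every class folder is empty.
    """
    for class_dir in sorted(Path(submission_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        if files:
            return class_dir.name, files[0].name
    return None


def assignment_title(file_name):
    """Drop the extension to recover the assignment name."""
    return Path(file_name).stem
```

Returning None for an empty folder tree lets the caller stop early with a clear message instead of failing later on an undefined file_tup.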

Web Control with Selenium


To get started with selenium, we import the library and create a web
driver, which is a browser that is controlled by our program. In this
case, I’ll use Chrome as my browser and send the driver to the Canvas
website where I submit assignments.

# webdriver gives our program control of a browser
from selenium import webdriver

# Using Chrome to access web
driver = webdriver.Chrome()

# Open the website
driver.get('https://canvas.case.edu')

When we open the Canvas webpage, we are greeted with our first
obstacle, a login box! To get past this, we will need to fill in an id and a
password and click the login button.


Imagine the web driver as a person who has never seen a web page
before: we need to tell it exactly where to click, what to type, and which
buttons to press. There are a number of ways to tell our web driver
what elements to find, all of which use selectors. A selector is a unique
identifier for an element on a webpage. To find the selector for a
specific element, say the CWRU ID box above, we need to inspect the
webpage. In Chrome, this is done by pressing “ctrl + shift + i” or right
clicking on any element and selecting “Inspect”. This brings up the
Chrome developer tools, an extremely useful application which shows
the HTML underlying any webpage.

To find a selector for the “CWRU ID” box, I right clicked in the box, hit
“Inspect” and saw the following in developer tools. The highlighted line
corresponds to the id box element (this line is called an HTML tag).

HTML in Chrome developer tools for the webpage

This HTML might look overwhelming, but we can ignore the majority
of the information and focus on the id="username" and
name="username" parts (these are known as attributes of the HTML
tag).

To select the id box with our web driver, we can use either the id or
name attribute we found in the developer tools. Web drivers in
selenium have many different methods for selecting elements on a
webpage and there are often multiple ways to select the exact same
item:

# Select the id box
id_box = driver.find_element_by_name('username')

# Equivalent Outcome!
id_box = driver.find_element_by_id('username')
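A note for readers on a newer selenium: the find_element_by_* helpers used in this article were deprecated in Selenium 4 and removed in later 4.x releases, where the same lookup is written as driver.find_element(strategy, value). The sketch below shows one way to translate between the two forms. The find helper and the STRATEGIES table are illustrative names, not part of selenium; the strategy strings are the values behind selenium's By.ID, By.NAME and By.LINK_TEXT constants.

```python
# Old helper suffix -> Selenium 4 locator strategy string.
# These strings are the values of the corresponding By constants.
STRATEGIES = {
    "id": "id",                # By.ID
    "name": "name",            # By.NAME
    "link_text": "link text",  # By.LINK_TEXT
}


def find(driver, how, value):
    """Selenium 4 spelling of driver.find_element_by_<how>(value)."""
    return driver.find_element(STRATEGIES[how], value)


# With a real browser (needs selenium 4 and Chrome installed):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   id_box = find(driver, "name", "username")
```

In your own code you would normally import By from selenium.webdriver.common.by and pass By.NAME directly; plain strings are used here only so the snippet can be read and run without selenium installed.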

Our program now has access to the id_box and we can interact with
it in various ways, such as typing in keys, or clicking (if we have
selected a button).


# Send id information
id_box.send_keys('my_username')

We carry out the same process for the password box and login button,
selecting each based on what we see in the Chrome developer tools.
Then, we send information to the elements or click on them as needed.

# Find password box
pass_box = driver.find_element_by_name('password')

# Send password
pass_box.send_keys('my_password')

# Find login button
login_button = driver.find_element_by_name('submit')

# Click login
login_button.click()
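Typing a real password into a script is easy to regret later, so one option is to read the credentials from the environment and fall back to a prompt. This is a sketch under stated assumptions: CANVAS_USER and CANVAS_PASSWORD are hypothetical variable names, and read_credentials and log_in are illustrative helpers, not part of the original program. The element names ('username', 'password', 'submit') are the ones found in the developer tools above, and the lookups use the Selenium 4 call style driver.find_element(strategy, value).

```python
import os
from getpass import getpass


def read_credentials(env=None, ask=getpass):
    """Return (username, password) without hardcoding either one.

    CANVAS_USER and CANVAS_PASSWORD are hypothetical variable names;
    anything missing from the environment is asked for interactively.
    """
    env = os.environ if env is None else env
    username = env.get("CANVAS_USER") or ask("CWRU ID: ")
    password = env.get("CANVAS_PASSWORD") or ask("Password: ")
    return username, password


def log_in(driver, username, password):
    """Fill in the login form and click the login button."""
    # "name" is the locator strategy string behind selenium's By.NAME
    driver.find_element("name", "username").send_keys(username)
    driver.find_element("name", "password").send_keys(password)
    driver.find_element("name", "submit").click()
```

Keeping the three page interactions in one function also means there is a single place to update if the login form changes.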

Once we are logged in, we are greeted by this slightly intimidating
dashboard:

We again need to guide the program through the webpage by
specifying exactly the elements to click on and the information to enter.
In this case, I tell the program to select courses from the menu on the
left, and then the class corresponding to the assignment I need to turn
in:


# Find and click on list of courses
courses_button = driver.find_element_by_id('global_nav_courses_link')
courses_button.click()

# Get the name of the folder
folder = file_tup[0]

# Class to select depends on folder
if folder == 'EECS491':
    class_select = driver.find_element_by_link_text('Artificial Intelligence: Probabilistic Graphical Models (100/10039)')
elif folder == 'EECS531':
    class_select = driver.find_element_by_link_text('Computer Vision (100/10040)')

# Click on the specific class
class_select.click()

The program finds the correct class using the name of the folder we
stored in the first step. In this case, I use the selection method
find_element_by_link_text to find the specific class. The “link text”
for an element is just another selector we can find by inspecting the
page:

Inspecting the page to find the selector for a specific class
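As the number of classes grows, the if/elif chain can be swapped for a lookup table. Here is a minimal sketch of that idea; COURSE_LINKS and link_text_for are illustrative names, and the two entries are copied from the if/elif above. An unknown folder raises an error right away instead of leaving class_select undefined.

```python
# Folder name -> link text shown in the Canvas course list
COURSE_LINKS = {
    'EECS491': 'Artificial Intelligence: Probabilistic Graphical Models (100/10039)',
    'EECS531': 'Computer Vision (100/10040)',
}


def link_text_for(folder):
    """Return the link text to click for a class folder."""
    try:
        return COURSE_LINKS[folder]
    except KeyError:
        raise ValueError(f"No course link configured for folder {folder!r}") from None
```

The result would then be passed to the same link-text lookup the article uses, so adding a new class means adding one line to the table.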

This workflow may seem a little tedious, but remember, we only have to
do it once when we write our program! After that, we can hit run as
many times as we want and the program will navigate through all these
pages for us.

We use the same ‘inspect page, select element, interact with
element’ process to get through a couple more screens. Finally, we
reach the assignment submission page:


At this point, I could see the finish line, but initially this screen
perplexed me. I could click on the “Choose File” box pretty easily, but
how was I supposed to select the actual file I need to upload? The
answer turns out to be incredibly simple! We locate the Choose File
box using a selector, and use the send_keys method to pass the exact
path of the file (called file_location in the code below) to the box:

# Choose File button
choose_file = driver.find_element_by_name('attachments[0][uploaded_data]')

# Name of the file, taken from the (class, file) tuple built earlier
file_name = file_tup[1]

# Complete path of the file
file_location = os.path.join(submission_dir, folder, file_name)

# Send the file location to the button
choose_file.send_keys(file_location)

That’s it! By sending the exact path of the file to the button, we can skip
the whole process of navigating through folders to find the right file.
After sending the location, we are rewarded with the following screen
showing that our file is uploaded and ready for submission.
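One practical detail: a relative path such as 'completed_assignments/EECS491/...' depends on the directory the script is run from and may not resolve the way the browser expects, so an absolute path is the safer thing to send. The helper below is a sketch of that precaution. upload_path is an illustrative name, and the existence check is an addition that is not in the original script.

```python
import os


def upload_path(submission_dir, folder, file_name):
    """Build an absolute path suitable for send_keys on a file input."""
    path = os.path.abspath(os.path.join(submission_dir, folder, file_name))
    if not os.path.isfile(path):
        # Fail before touching the browser if the file is not there
        raise FileNotFoundError(f"No such assignment file: {path}")
    return path
```

The returned string can be passed to choose_file.send_keys in place of file_location.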


Now, we select the “Submit Assignment” button, click, and our
assignment is turned in!

# Locate submit button and click
submit_assignment = driver.find_element_by_id('submit_file_button')
submit_assignment.click()

Cleaning Up
File management is always a critical step and I want to make sure I
don’t re-submit or lose old assignments. I decided the best solution was
to store a single file to be submitted in the completed_assignments
folder at any one time and move files to a submitted_assignments
folder once they had been turned in. The final bit of code uses the os
module to move the completed assignment by renaming it with the
desired location:

# Location of files after submission
submitted_file_location = os.path.join(submitted_dir, submitted_file_name)

# Rename moves the file to its new location
os.rename(file_location, submitted_file_location)
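os.rename raises an error if the destination folder does not exist, and it can fail when the source and destination sit on different filesystems. A slightly more defensive variant, sketched below, creates the folder first and uses shutil.move, which falls back to copy-and-delete when a plain rename is not possible. archive_submission is an illustrative name, and keeping the original file name in the destination is an assumption.

```python
import os
import shutil


def archive_submission(file_location, submitted_dir):
    """Move a submitted file into submitted_dir and return its new path."""
    os.makedirs(submitted_dir, exist_ok=True)
    destination = os.path.join(submitted_dir, os.path.basename(file_location))
    if os.path.exists(destination):
        # Refuse to overwrite an assignment that was archived earlier
        raise FileExistsError(f"Already archived: {destination}")
    return shutil.move(file_location, destination)
```

Refusing to overwrite fits the stated goal of never losing an old assignment.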


All of the preceding code gets wrapped up in a single script, which I
can run from the command line. To limit opportunities for mistakes, I
only submit one assignment at a time, which isn’t a big deal given that
it only takes about 5 seconds to run the program!

Here’s what it looks like when I start the program:

The program provides me with a chance to make sure this is the correct
assignment before uploading. After the program has completed, I get
the following output:

While the program is running, I can watch Python go to work for me:
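The confirmation step mentioned above is not shown in the article, so the following is only a guess at its shape: a yes/no prompt that defaults to "no". confirm_submission and the prompt wording are illustrative, not the author's code.

```python
def confirm_submission(class_name, file_name, ask=input):
    """Ask before uploading; only an explicit yes returns True."""
    answer = ask(f"Submit '{file_name}' to {class_name}? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

Passing the prompt function in as ask keeps the check easy to exercise in a test without typing at the keyboard.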

Conclusions
The technique of automating the web with Python works great for
many tasks, both general and in my field of data science. For example,
we could use selenium to automatically download new data files every
day (assuming the website doesn’t have an API). While it might seem
like a lot of work to write the script initially, the benefit comes from the
fact that we can have the computer repeat this sequence as many times
as we want in exactly the same manner. The program will never lose focus
and wander off to Twitter. It will faithfully carry out the same exact
series of steps with perfect consistency (which works great until the
website changes).

I should mention you do want to be careful before you automate critical
tasks. This example is relatively low-risk as I can always go back and re-
submit assignments and I usually double-check the program’s
handiwork. Websites change, and if you don’t change the program in
response you might end up with a script that does something
completely different than what you originally intended!

In terms of paying off, this program saves me about 30 seconds for
every assignment and took 2 hours to write. So, if I use it to turn in 240
assignments, then I come out ahead on time! However, the payoff of
this program is in designing a cool solution to a problem and learning a
lot in the process. While my time might have been more effectively
spent working on assignments rather than figuring out how to
automatically turn them in, I thoroughly enjoyed this challenge. There
are few things as satisfying as solving problems, and Python turns out
to be a pretty good tool for doing exactly that.
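For anyone checking the break-even figure, the arithmetic is short. The variable names below are just labels for the two numbers quoted above.

```python
seconds_saved_per_assignment = 30
hours_to_write = 2

# 2 hours is 7200 seconds; at 30 seconds saved each, that is 240 submissions
break_even = hours_to_write * 3600 // seconds_saved_per_assignment
print(break_even)
```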

As always, I welcome feedback and constructive criticism. I can be
reached on Twitter @koehrsen_will.
