Python program to find GSoC organisations that use a Particular Programming Language
Last Updated :
19 Feb, 2020
Currently, it's not possible to sort GSoC participating organizations by the programming languages they use in their code. This results in students spending a lot of time going through each organization's page and manually sorting through them.
This article shows how students can write their own Python script, using the BeautifulSoup4 library, to find the organizations that use the language they want to contribute in.
You will learn the following through this article:
- How to use the Requests library to send HTTPS requests to webpages
- How to use the BeautifulSoup4 library in Python to parse HTML code
- How to output data as a spreadsheet (e.g. MS Excel) using OpenPyXL
Installation
These modules do not come pre-installed with Python. To install them, type the commands below in the terminal.
pip install requests
pip install beautifulsoup4
pip install openpyxl
Note: Only beginner-level knowledge of Python 3 is required to follow this article. For more information, refer to Python Programming Language.
Getting Started
Step 1: Import the required libraries
Python3
import requests, bs4, openpyxl
Step 2: Create a response object using Requests. We will use the Archive page of GSoC's website as our source.
Python3
# Replace "YEAR" with the year you
# want to get data from, e.g. "2018"
url = 'https://summerofcode.withgoogle.com/archive/YEAR/organizations/'
# Creating a response object
# from the given url
res = requests.get(url)
# Checking the url's status; this raises
# an exception if the request failed
res.raise_for_status()
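As a minimal sketch of the same step, the year can be filled into the URL by a small helper instead of editing the string by hand. The names `archive_url` and `fetch_archive` and the 10-second timeout are illustrative assumptions, not part of the original script; `requests` is imported inside the function so the URL helper works even where the package is not installed.

```python
BASE_URL = "https://summerofcode.withgoogle.com"


def archive_url(year):
    """Return the archive URL that lists the organizations for `year`."""
    year = int(year)  # raises ValueError for a placeholder such as "YEAR"
    return f"{BASE_URL}/archive/{year}/organizations/"


def fetch_archive(year, timeout=10):
    """Download the archive page and return its HTML as text."""
    import requests  # third-party: pip install requests

    res = requests.get(archive_url(year), timeout=timeout)
    res.raise_for_status()  # stop early on a 4xx or 5xx response
    return res.text
```

Calling `archive_url("2018")` builds the same address as replacing "YEAR" manually, and an unreplaced placeholder fails immediately instead of producing a broken request.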
Step 3: Create a BeautifulSoup object
From the Archive page's source code:
html
<li class="organization-card__container"
    layout
    flex-xs="100"
    flex-sm="50"
    flex="33"
    aria-label="AerospaceResearch.net">
  ...
  ...
  <div class="organization-card__footer md-padding">
    <h4 class="organization-card__name font-black-54">AerospaceResearch.net</h4>
  </div>
  ...
  ...
</li>
We can see that the organization's name is in an h4 tag with the class name "organization-card__name font-black-54".
Using BS4, we can search for this particular tag in the HTML code and store the text in a list.
Python3
# Specify the language you
# want to search for
language = 'python'
# BS4 object to store the html text.
# We use res.text to get the
# html code in text format
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Selecting the specific tag
# with class name
orgElem = soup.select('h4[class="organization-card__name font-black-54"]')
# Similarly finding the links
# for each org's GSoC page
orgLink = soup.find_all("a", class_="organization-card__link")
# Placeholders, one entry per organization
languageCheck = ['no'] * len(orgElem)
orgURL = ['none'] * len(orgElem)
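The selection step can be tried offline on a small piece of HTML before running it against the live page. The sketch below is one possible variant, not the article's exact method: it reads the name and the link from the same card, so the two can never get out of step. The sample markup is modelled on the snippet above, but the `<a>` wrapper and its `href` value are made up for illustration, and `extract_orgs` is a hypothetical helper name.

```python
# Sample card modelled on the archive page; the href is a made-up value.
SAMPLE_HTML = """
<li class="organization-card__container" aria-label="AerospaceResearch.net">
  <a class="organization-card__link" href="/archive/2018/organizations/123/">
    <div class="organization-card__footer md-padding">
      <h4 class="organization-card__name font-black-54">AerospaceResearch.net</h4>
    </div>
  </a>
</li>
"""


def extract_orgs(html):
    """Return a list of (name, href) pairs, one per organization card."""
    import bs4  # third-party: pip install beautifulsoup4

    soup = bs4.BeautifulSoup(html, "html.parser")
    orgs = []
    for card in soup.select("li.organization-card__container"):
        name = card.select_one("h4.organization-card__name")
        link = card.select_one("a.organization-card__link")
        if name is None or link is None:
            continue  # skip cards that lack a name or a link
        orgs.append((name.get_text(strip=True), link.get("href")))
    return orgs
```

Here `extract_orgs(SAMPLE_HTML)` yields a single pair holding the organization name and the relative link. Whether the real page nests the tags this way is an assumption that should be checked against the page source.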
Step 4: Open each organization's GSoC page and find the languages used
Python3
item = 0
# Loop to go through each organisation
for link in orgLink:
    # Gets the anchor tag's hyperlink
    presentLink = link.get('href')
    url2 = 'https://summerofcode.withgoogle.com' + presentLink
    print(item)
    print(url2)
    orgURL[item] = url2
    res2 = requests.get(url2)
    res2.raise_for_status()
    soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
    tech = soup2.find_all("li",
        class_="organization__tag organization__tag--technology")
    # Finding if the org uses
    # the specified language
    for name in tech:
        if language in name.getText():
            languageCheck[item] = 'yes'
    item = item + 1
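The `in` test in the loop above is a case-sensitive substring match, so searching for 'java' would also flag an organization tagged only with 'javascript', and 'python' would miss a tag written as 'Python'. A possible stricter check is sketched below. It assumes each technology tag holds a single language name, which may not be true for every organization; `uses_language` is a hypothetical helper name.

```python
def uses_language(tags, language):
    """Return True if `language` equals one of `tags`, ignoring case and
    surrounding whitespace."""
    wanted = language.strip().lower()
    return any(tag.strip().lower() == wanted for tag in tags)


# Substring matching gives a false positive that exact matching avoids.
substring_hit = "java" in "javascript"
exact_hit = uses_language(["javascript"], "java")
```

Inside the loop it could be called as `uses_language([name.getText() for name in tech], language)`.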
Step 5: Write the list to a spreadsheet
Using the openpyxl library, we first create a workbook. In this workbook we open a sheet using wb['Sheet'], where we will actually write the data. By assigning to the value attribute of the object returned by cell(), we can write values directly to each cell. Finally, we save the workbook using the save() function.
Python3
wb = openpyxl.Workbook()
sheet = wb['Sheet']
for i in range(0, len(orgElem)):
    sheet.cell(row=i + 1, column=1).value = orgElem[i].getText()
    sheet.cell(row=i + 1, column=2).value = languageCheck[i]
    sheet.cell(row=i + 1, column=3).value = orgURL[i]
wb.save('gsocOrgsList.xlsx')
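An alternative sketch of this step builds whole rows first and lets openpyxl append them one at a time, which also makes it easy to add a header row. The header labels and the helper names `build_rows` and `save_rows` are assumptions chosen for illustration; `openpyxl` is imported inside the function so the row-building part can be run without it.

```python
def build_rows(names, flags, urls):
    """Combine three parallel lists into spreadsheet rows, header first."""
    rows = [("Organization", "Uses language", "URL")]
    rows.extend(zip(names, flags, urls))
    return rows


def save_rows(rows, filename="gsocOrgsList.xlsx"):
    """Write `rows` to a new workbook and save it as `filename`."""
    import openpyxl  # third-party: pip install openpyxl

    wb = openpyxl.Workbook()
    sheet = wb.active  # the default sheet created with the workbook
    for row in rows:
        sheet.append(row)  # adds the row below the last used row
    wb.save(filename)
```

With the lists from the earlier steps, the call might look like `save_rows(build_rows([e.getText() for e in orgElem], languageCheck, orgURL))`.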
Note: The spreadsheet will be saved in the current working directory, which is usually the directory you run the Python file from.
Troubleshooting
Because the script sends many requests to the website in quick succession, the server may block your IP address. Using a VPN may work around this issue.
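If the problem still persists, install the fake-useragent package (pip install fake-useragent) and add the following to your code:
Python3
from fake_useragent import UserAgent
ua = UserAgent()
# Random browser User-Agent string
header = {
    "User-Agent": ua.random
}
# Pass the header with every request, e.g.
# res = requests.get(url, headers=header)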
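Since the blocking is triggered by many requests in a short time, pausing between requests may also help. The sketch below is one way to do that and is not part of the original script: `fetch_politely` is a hypothetical helper, the one-second default delay is an arbitrary assumption, and the `get` argument is any function that downloads a single URL.

```python
import time


def fetch_politely(urls, get, delay=1.0, sleep=time.sleep):
    """Call get(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results.append(get(url))
    return results
```

With Requests this might be used as `fetch_politely(orgURL, lambda u: requests.get(u, headers=header, timeout=10))`, where `header` is the dictionary defined above.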