
GTN (Gas Transmission Northwest LLC) Pipeline Data – 12871

⚠️ If the resource you are scraping requires you to agree to any Terms & Conditions,
please do not proceed and notify your contract manager immediately. Under no
circumstances should you create a false account or fake identity.

Description:

Please write a scraper tool that enters a Post Date (from) and Post Date (to) range covering the 90 days before the day the scrape is run (i.e., today going back 90 days). The query is limited to a maximum range of 90 days.
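For reference, the date window can be computed at run time with the standard library; a minimal sketch (the MM/DD/YYYY format is an assumption about what the form's date fields accept):

from datetime import date, timedelta

def post_date_range(days_back: int = 90) -> tuple[str, str]:
    # Return (from_date, to_date) strings covering the last days_back days.
    # The site caps a query at 90 days, so the default sits at that limit.
    to_date = date.today()
    from_date = to_date - timedelta(days=days_back)
    fmt = "%m/%d/%Y"  # assumed format; match the form's actual fields
    return from_date.strftime(fmt), to_date.strftime(fmt)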

• Parameters Setup (see the form-driving sketch after this list):
o Enter the From Date
o Enter the End Date
o Click the Retrieve button; the grid of data will then appear

• Click each row in the table to open the details for each contract:
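A sketch of driving the query form with Selenium. The element locators (the By.ID values) are hypothetical placeholders; inspect the live page for the real ones.

from selenium import webdriver
from selenium.webdriver.common.by import By

def retrieve_grid(from_date: str, to_date: str) -> str:
    # Open the page, fill in the date range, click Retrieve, and return
    # the resulting page HTML (including the summary grid) for parsing.
    driver = webdriver.Chrome()
    try:
        driver.get("http://tcplus.com/GTN/ContractRouteRate/Interruptible")
        driver.find_element(By.ID, "fromDate").send_keys(from_date)  # hypothetical ID
        driver.find_element(By.ID, "toDate").send_keys(to_date)      # hypothetical ID
        driver.find_element(By.ID, "retrieve").click()               # hypothetical ID
        return driver.page_source
    finally:
        driver.quit()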
Scraping Description

There should be one output dataset:

• Summary Dataset
o Scrape all data displayed in the summary grid (the blue section in the first illustration). You can either:
▪ manually scrape the data from the HTML, or
▪ use the [Download] button at the top right, which will give you a CSV.
o Add an additional link column to the dataset.
▪ For each row, the link is the HTML link pointing to the detail page.
o Add an additional scrape_time column to indicate the scrape time (see the sketch after this list).
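A minimal sketch of the two added columns, assuming the grid has already been parsed into a pandas DataFrame and the per-row detail URLs collected separately:

from datetime import datetime, timezone
import pandas as pd

def finalize_summary(df: pd.DataFrame, detail_links: list[str]) -> pd.DataFrame:
    # detail_links must hold one detail-page URL per grid row, in row order.
    df = df.copy()
    df["link"] = detail_links
    # One timestamp for the whole run; UTC ISO 8601 is an assumption, since
    # the spec only says the column should indicate the scrape time.
    df["scrape_time"] = datetime.now(timezone.utc).isoformat()
    return df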

The desired schema is listed below

Root URL:

http://tcplus.com/GTN/ContractRouteRate/Interruptible

Job Frequency:

Realtime (every minute)


Output Columns:

File One: summary.csv

Column Name            Original Columns   Type      Example
scrape_time            -                  datetime  -
post_datetime          Post Date / Time   datetime  20230223 12:10:45
k_holder_name          K Holder Name      str       Castleton Commodities Merchant Trading L.P.
k_holder               K Holder           str       118638852
svc_req_k              Svc Req K          str       20578
rate_sch               Rate Sch           str       PAL
it_qty_k               IT Qty – K         int       30000
k_stat                 K Stat             str       N
disc_beg_date          Disc Beg Date      datetime  20230224
disc_end_date          Disc End Date      datetime  20230224
receipt_loc            Loc                str       370672
receipt_loc_name       Loc Name           str       MALIN MC
receipt_loc_qti_desc   Loc/QTI Desc       str       Rec Qty
delivery_loc           Loc                int       0
delivery_loc_name      Loc Name           str       MALIN MC
delivery_loc_qti_desc  Loc/QTI Desc       str       -
loc_ind                Loc Ind            str       I
rate_chgd              Rate Chgd          float     0.2
max_trf_rate           Max Trf Rate       float     0.204356
ngtd_rate_ind          Ngtd Rate Ind      str       N
rate_id_desc           Rate ID Desc       str       Loan Chrg-Bal
affil                  Affil              str       None
terms_notes            Terms/Notes        str       N
link                   -                  str       -

• Please note that the original table uses the same column names for the receipt and delivery locations. Treat the first three location columns as receipt and the last three as delivery.
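Because pandas suffixes duplicated headers (Loc, Loc.1, Loc.2, ...) when reading the downloaded CSV, renaming the columns positionally is the safer route. A sketch, assuming the grid's column order matches the schema table above:

import pandas as pd

# Output schema in grid order (scrape_time and link are appended later).
# The first Loc / Loc Name / Loc-QTI Desc group is receipt, the second delivery.
OUTPUT_COLUMNS = [
    "post_datetime", "k_holder_name", "k_holder", "svc_req_k", "rate_sch",
    "it_qty_k", "k_stat", "disc_beg_date", "disc_end_date",
    "receipt_loc", "receipt_loc_name", "receipt_loc_qti_desc",
    "delivery_loc", "delivery_loc_name", "delivery_loc_qti_desc",
    "loc_ind", "rate_chgd", "max_trf_rate", "ngtd_rate_ind",
    "rate_id_desc", "affil", "terms_notes",
]

def rename_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Fail loudly if the grid layout ever drifts from the documented schema.
    if len(df.columns) != len(OUTPUT_COLUMNS):
        raise ValueError(
            f"expected {len(OUTPUT_COLUMNS)} columns, got {len(df.columns)}"
        )
    df = df.copy()
    df.columns = OUTPUT_COLUMNS
    return df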

Timeline:

You may complete this job at any time and submit the required files to the linked GitHub repository within one week of accepting the job.

Please submit your code here: https://github.com/international-data-repository-cpd/scrape-12871

Submission Files:
Sample.csv for sample data

A requirement.txt

scrape/ - containing all of the source code

Main file: scrape.py, which will be run with an output $filename.

Job Schema/Output Format:

You should save the output CSV from a pandas DataFrame using these settings:

encoding="utf-8",
line_terminator="\n",
quotechar='"',
quoting=csv.QUOTE_ALL,
index=False
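Put together, a save helper might look like the following. Note that pandas renamed line_terminator to lineterminator in version 2.0, so keep the spelling that matches your pinned pandas version:

import csv
import pandas as pd

def save_output(df: pd.DataFrame, filename: str) -> None:
    # Exactly the settings required above; QUOTE_ALL wraps every field.
    df.to_csv(
        filename,
        encoding="utf-8",
        line_terminator="\n",  # lineterminator on pandas >= 2.0
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        index=False,
    )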

Runtime Environment:

Your code will be copied from the repository root to /usr/src/scrape.

You should feel free to modify the requirements as you need; however, you must keep the awscli dependency.

You may also upload additional binaries into the repository root and reference them
there.

Please do not change the Dockerfile or shell scripts in the repository, as this will cause automated test failures.

Your script will be invoked as:

python scrape.py $filename
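A minimal entry point matching that invocation, reusing the helper sketches above (parse_grid and collect_detail_links are hypothetical placeholders for whatever grid parser you write):

import sys

def main() -> None:
    if len(sys.argv) != 2:
        sys.exit("usage: python scrape.py <output_filename>")
    filename = sys.argv[1]
    from_date, to_date = post_date_range()
    html = retrieve_grid(from_date, to_date)
    df = parse_grid(html)               # hypothetical grid-HTML parser
    links = collect_detail_links(html)  # hypothetical: one URL per grid row
    df = finalize_summary(rename_columns(df), links)
    save_output(df, filename)

if __name__ == "__main__":
    main()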

Page access limitations (max requests / day):

10% of website traffic max

If you encounter a captcha during your scrape job, please contact the job poster before continuing.
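To stay safely under that ceiling, a fixed delay between page fetches is the simplest safeguard; a sketch (the 2-second delay is an assumed starting point, tune it against the site's actual traffic):

import time
import requests

def polite_get(session: requests.Session, url: str, delay: float = 2.0) -> requests.Response:
    # Sleep before every request so the scraper never bursts.
    time.sleep(delay)
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response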
