
Northern Border Pipeline (NBP) Data – 12873

⚠️ If the resource you are scraping requires you to agree to any Terms & Conditions,
please do not proceed; instead, notify your contract manager immediately. Under no
circumstances should you create a false account or fake identity.

Description:

Please write a scraper tool to scrape at least the first page of the NBP data. If possible, also
scrape the first 3 pages of the data in the frame.

• Root Domain: https://ebb.tceconnects.com/infopost/


• Data are under:
o Northern Border Pipeline Company (NBPL) >
o Transaction Reporting >
o Interruptible
The right-hand side frame then contains the data.
It appears that the right-hand side frame can be reached directly via the following link. Please verify, though:
https://ebb.tceconnects.com/infopost/ReportViewer.aspx?%2FInfoPost%2FTransInterrupt&AssetNbr=3029
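
A minimal sketch of that verification step, assuming the standard requests library (everything here is illustrative, not part of the spec; the URL is copied verbatim from above):

import requests

# Direct frame URL from the job description -- verify it in a browser first,
# per the note above.
FRAME_URL = (
    "https://ebb.tceconnects.com/infopost/ReportViewer.aspx"
    "?%2FInfoPost%2FTransInterrupt&AssetNbr=3029"
)

resp = requests.Session().get(FRAME_URL, timeout=30)
resp.raise_for_status()
print(f"status={resp.status_code}, bytes={len(resp.content)}")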
Scraping Description:

• Data can be exported in various formats via the Save button above the report.
• We will scrape the main table in the blue section; the schema is below.
• Add two columns: data_url and scrape_datetime.

The desired schema is listed under Output Columns below; a sketch of the extraction step follows.
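
This is a hedged sketch only: it assumes the blue table renders as plain HTML that pandas.read_html can pick up. The page is an ASP.NET ReportViewer, so in practice __VIEWSTATE postbacks or the report's export endpoint may be needed instead; verify against the live page.

import datetime
from io import StringIO

import pandas as pd
import requests

FRAME_URL = (
    "https://ebb.tceconnects.com/infopost/ReportViewer.aspx"
    "?%2FInfoPost%2FTransInterrupt&AssetNbr=3029"
)

# One value per run, evaluated at the start of the script (see the
# scrape_datetime comment in the schema below).
scrape_datetime = datetime.datetime.now().isoformat()

html = requests.get(FRAME_URL, timeout=30).text
tables = pd.read_html(StringIO(html))  # list of all tables found on the page
df = max(tables, key=len)              # crude heuristic: largest table is the data

# The two extra columns required by the spec.
df["data_url"] = FRAME_URL
df["scrape_datetime"] = scrape_datetime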

Root URL:

• https://ebb.tceconnects.com/infopost/
• https://ebb.tceconnects.com/infopost/ReportViewer.aspx?%2FInfoPost%2FTransInterrupt&AssetNbr=3029 - Please verify before use

Job Frequency:

Realtime (every minute)

Output Columns:

File One: summary.csv

Column name             Datatype   Example value                         Comment
scrape_datetime         datetime   datetime.datetime.now().isoformat()   Ensure this is a datetime in ISO-8601 format,
                                                                         and have one value per run, evaluated at the
                                                                         start of the script rather than at the time
                                                                         of each request
data_url                string                                           URL where the row is scraped from
posting_date_time       datetime                                         ISO datetime
contract_holder         string     196748938
contract_holder_name    string     Constellation Energy Generation, LLC
k_holder_prop           string     1444
svc_req_k               string     102273
rate_schedule           string     PAL
contract_status         string     A
contract_begin_date     date       4/4/2013
contract_end_date       date       3/31/2024
k_ent_begin_date        date       10/2/2021
K_ent_end_date          date       2/27/2023
deal_type               string     65
it_qty_k                int        100,000
location_indicator      string
market_based_rate_ind   string
disc_provisions         string
term_notes              string
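
Continuing the sketch above, the scraped frame can be forced to match this schema exactly (column names are copied verbatim from the table; df is the DataFrame from the earlier snippet, and empty-filling missing columns is an assumption on my part, not a stated requirement):

SCHEMA = [
    "scrape_datetime", "data_url", "posting_date_time",
    "contract_holder", "contract_holder_name", "k_holder_prop",
    "svc_req_k", "rate_schedule", "contract_status",
    "contract_begin_date", "contract_end_date",
    "k_ent_begin_date", "K_ent_end_date", "deal_type", "it_qty_k",
    "location_indicator", "market_based_rate_ind",
    "disc_provisions", "term_notes",
]

# Keep the output schema stable across runs even if the site omits a column.
for col in SCHEMA:
    if col not in df.columns:
        df[col] = ""

df = df[SCHEMA]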

Timeline:

You may complete this job any time and submit any required files to the linked GitHub repository
within one week of accepting the job.

Please submit your code here: https://github.com/international-data-repository-cpd/scrape-12873

Submission Files:

Sample.csv containing sample data

A requirements.txt

scrape/ - containing all of the source code

Main file: scrape.py, which will be run with an output $filename argument.

Job Schema/Output Format:

You should save the output csv using these settings from a pandas DataFrame:

encoding="utf-8",
line_terminator="\n",
quotechar='"',
quoting=csv.QUOTE_ALL,
index=False
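
A sketch of the save step with exactly those settings (output_filename is the $filename command-line argument described under Runtime Environment). One caveat from my side, not the spec: pandas renamed line_terminator to lineterminator in 1.5 and removed the old spelling in 2.0, so match the keyword to the pandas version pinned in your requirements file.

import csv

df.to_csv(
    output_filename,        # $filename passed on the command line
    encoding="utf-8",
    lineterminator="\n",    # "line_terminator" on pandas < 1.5
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    index=False,
)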

Runtime Environment:

Your code will be copied from the repository root to /usr/src/scrape.

You should feel free to modify the requirements as you need. However, you must keep the
awscli dependency.

You may also upload additional binaries into the repository root and reference them
there.

Please do not change the Dockerfile or shell scripts in the repository, as this will cause
automated test failures.

Your script will be invoked as:

python scrape.py $filename
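
A minimal entry point matching that invocation (a sketch; the scrape-and-save body is elided):

# scrape.py
import sys

def main() -> None:
    if len(sys.argv) != 2:
        sys.exit("usage: python scrape.py <output_filename>")
    output_filename = sys.argv[1]
    # ... run the scrape and write the CSV to output_filename ...

if __name__ == "__main__":
    main()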

Page access limitations (max requests/day):

Limit your scraper to at most 10% of the website's traffic.

If you encounter a captcha during your scrape job, please contact the job poster before continuing.
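
The 10% share cannot be measured from the client side, so a conservative fixed delay between requests is one way to stay well under it. A sketch follows; the 5-second floor is my assumption and should be agreed with the job poster:

import time

import requests

MIN_DELAY_SECONDS = 5  # assumed polite floor -- not specified by the job

def polite_get(session: requests.Session, url: str, **kwargs) -> requests.Response:
    """GET with a fixed pause afterwards to throttle request volume."""
    resp = session.get(url, timeout=30, **kwargs)
    time.sleep(MIN_DELAY_SECONDS)
    return resp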
