
Assignment

Web Scraping and Data Transformation using Python

Website: FCRA Online Services

This is a dataset of foreign contributions to Indian organizations and the organizations’ returns, maintained under the Ministry of Home Affairs.

The objective is to scrape this data for all years, states, and districts, and to create a second script that cleans and consolidates all the data into a single file.

The first Python script should visit each year, state, and district, fetch the tabular data present on the website, and store it as a CSV file under raw_data in the folder structure shown below (a small saving sketch follows the layout):

1. raw_data
a. State
i. District
1. Year
a. Output.csv
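
As a loose sketch only (the function and argument names are placeholders, not part of the assignment), a helper along these lines would write each scraped table into that layout:

    import os
    import pandas as pd

    # Hypothetical helper: write one scraped table (a pandas DataFrame)
    # into raw_data/<State>/<District>/<Year>/Output.csv.
    def save_raw(df, state, district, year):
        folder = os.path.join("raw_data", state, district, year)
        os.makedirs(folder, exist_ok=True)  # create the nested folders as needed
        df.to_csv(os.path.join(folder, "Output.csv"), index=False)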

You may use Python libraries such as requests, urllib3, Beautiful Soup, or Scrapy, together with pandas, to achieve this task. Automating the upload of the data to your cloud storage (Google Drive or OneDrive) is not mandatory, but it is great to have.
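
By way of illustration, a fetch step with requests and pandas might look like the sketch below. The URL and query parameters are hypothetical placeholders; the real FCRA pages use their own form fields, which you would need to inspect in a browser:

    import io
    import requests
    import pandas as pd

    # Hypothetical fetch: url and params are placeholders for whatever
    # the FCRA site actually expects for a given year/state/district.
    def fetch_table(url, params):
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()  # stop on HTTP errors instead of saving bad data
        tables = pd.read_html(io.StringIO(resp.text))  # parse every <table> on the page
        return tables[0]  # assumes the first table holds the contributions data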

The second Python script should read all the CSV files from the raw_data folder using pandas and clean the columns. You should add the state name and district name to each file and create a consolidated data file covering all the years, states, and districts.
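
One way to sketch this, assuming the raw_data layout above and the final column names listed below (the output file name is illustrative):

    import glob
    import os
    import pandas as pd

    frames = []
    # Walk raw_data/<State>/<District>/<Year>/*.csv and tag each file with
    # the state, district, and year recovered from its folder path.
    for path in glob.glob(os.path.join("raw_data", "*", "*", "*", "*.csv")):
        state, district, year = path.split(os.sep)[1:4]
        df = pd.read_csv(path)
        df["state_name"] = state
        df["district_name"] = district
        df["year"] = year
        frames.append(df)

    # Stack everything into one consolidated CSV file.
    pd.concat(frames, ignore_index=True).to_csv("consolidated_data.csv", index=False)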

The final dataset should be a CSV file with the following columns:

year | state_name | district_name | registration_no | association_name | address | amount_of_FC_received

Submission:

1. You need to create a zip folder in your cloud storage (Google Drive/OneDrive) named Name_JDC_IDP
as the folder name and share it with us.
2. Under the assignment folder, we expect you to provide us with the following:
a. Python script to scrape the raw data
b. Draft on the steps followed during the data scraping process
c. Python script to clean and consolidate (data transformation) the data
d. Draft on the steps followed during the data transformation process
e. Exploratory analysis report on final dataset
f. Final Dataset

Note:

Please refrain from sending us IPython notebooks; you can always convert your Jupyter notebook to
a script.
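
For reference, running jupyter nbconvert --to script notebook.ipynb converts a notebook into a plain Python script (notebook.ipynb here stands for your own file).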

Please adhere to good coding standards, such as naming conventions and script comments, in all of your scripts, with the aim that anyone can run the scripts, get the raw data, and transform the data.

Timeline:

You should complete these exercises within 2 working days of receiving the email. If you fail to
send the assignment on time, the interview process will be voided. In case of an emergency, please write to
us.
