Assignment Web Scraping
This is a dataset of foreign contributions to Indian organizations and of the organizations’ returns;
it falls under the Ministry of Home Affairs.
The objective is to scrape this data for all years, states, and districts, and then to write a second
script that cleans and consolidates all the data into a single file.
The first Python script should visit each year, state, and district on the website, extract the
tabular data shown there, and store it as a CSV file under raw_data in the following folder
structure:
1. raw_data
   a. State
      i. District
         1. Year
            a. Output.csv
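The folder structure above can be built directly from the state, district, and year values while
scraping. The following is a minimal sketch using only the standard library. The function names
output_path and save_table are hypothetical, and the sketch assumes that state and district names
are safe to use as folder names.

```python
import csv
from pathlib import Path


def output_path(root, state, district, year):
    """Return <root>/raw_data/<State>/<District>/<Year>/Output.csv."""
    return Path(root) / "raw_data" / state / district / str(year) / "Output.csv"


def save_table(rows, root, state, district, year):
    """Write one scraped table (a list of rows) to its Output.csv and return the path."""
    path = output_path(root, state, district, year)
    # Create the State/District/Year folders if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return path
```

Calling save_table once per year, state, and district combination keeps every table in its own
Output.csv, so a failed run can be resumed without overwriting unrelated files.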
You may use Python libraries such as requests, urllib3, Beautiful Soup, or Scrapy, together with
pandas, to achieve this task. Automating the data upload to your cloud storage (Google Drive or
OneDrive) is not mandatory but is great to have.
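As an illustration of the table-extraction step, here is a minimal sketch that uses Beautiful Soup
to turn an HTML table into a list of rows. The sample markup, its column names, and the function
name table_to_rows are assumptions made for the example; the real page may lay out its table
differently, and the HTML would normally come from an HTTP response rather than a string literal.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's table layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Organisation</th><th>Amount</th></tr>
  <tr><td>Org A</td><td>100</td></tr>
  <tr><td>Org B</td><td>250</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first <table> in html as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []  # no tabular data on this page
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both kept, in order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_to_rows(SAMPLE_HTML)
```

Returning an empty list when no table is found lets the calling loop skip year, state, and district
combinations that have no data instead of stopping with an error.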
The second Python script should use pandas to read all the CSV files from the raw_data folder and
clean the columns. It should add the state name and district name to the data from each file and
create a consolidated data file for all the years, states, and districts.
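The consolidation step can be sketched with pandas as follows. This is one possible approach, not a
required design: the function name consolidate and the specific column-cleaning rule (trim,
lower-case, spaces to underscores) are assumptions, and the sketch assumes every Output.csv is
non-empty and sits exactly three folders below raw_data. The year is taken from the folder name as
well, since the consolidated file spans all years.

```python
from pathlib import Path

import pandas as pd


def consolidate(raw_root):
    """Combine every <State>/<District>/<Year>/Output.csv under raw_root into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(raw_root).glob("*/*/*/Output.csv")):
        # The three folders above the file are State, District, and Year.
        state, district, year = csv_path.parts[-4:-1]
        frame = pd.read_csv(csv_path)
        # Basic column cleaning: trim, lower-case, and replace spaces with underscores.
        frame.columns = [str(c).strip().lower().replace(" ", "_") for c in frame.columns]
        # Record where each row came from; year is kept as the folder name (a string).
        frame = frame.assign(state=state, district=district, year=year)
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

The result can then be written out in one call, for example
consolidate("raw_data").to_csv("consolidated.csv", index=False), where the output file name is
only a placeholder.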
Submission:
1. You need to create a zip folder in your cloud storage (Google Drive/OneDrive) with Name_JDC_IDP
as the folder name and share it with us.
2. Under Assignment, we expect you to provide us with the following:
   a. Python script to scrape the raw data
   b. Draft on the steps followed during the data scraping process
   c. Python script to clean and consolidate (data transformation) the data
   d. Draft on the steps followed during the data transformation process
   e. Exploratory analysis report on the final dataset
   f. Final dataset
Note:
Please refrain from sending us IPython notebooks; you can always convert your Jupyter notebook to
a script.
Please adhere to good coding standards, such as naming conventions and script comments, in all of
your scripts, so that anyone can run the scripts, get the raw data, and transform the data.
Timeline:
You should complete these exercises within 2 working days of receiving the email. If you fail to
send the assignments on time, the interview process will be void. In case of emergency, please
write to us.