
Bda Survey Assignment: Parta - Rollnumbers - Ipynb Parta - Rollnumbers - Ipynb Part A

This document provides instructions for a survey data analysis assignment to be completed in Python notebooks. It involves reading survey response data from a CSV file, cleaning and preparing the data by converting variable types and creating new derived variables. It then asks students to perform exploratory data analysis on the data, including generating frequency tables, time series plots, pair plots, and correlations. For text variables, it involves analyzing word frequencies, identifying top terms, and building word clouds. It emphasizes practical data skills and working with others to analyze open-ended questions.

Uploaded by

Sankeerth Goud

BDA Survey Assignment

Thank you for taking part in the survey. You generated 61 responses, and this is precisely the
starting point for our explorations. Crack this assignment in groups of 4, with the hope that
you will later on develop into a project team. Upload your answers on Moodle.

The purpose of this assignment is to get you going on all of the Python you have picked up.
By now, you should have gone through the MOOC, and learned more ways to handle data.

• The point of this assignment is NOT to go down the rabbit hole of identifying
responses with individuals and feel super about your investigative skills. Want to
play Sherlock? Why not show how good you are at cracking the tougher questions!
• Don’t be daunted by the assignment. The point is NOT to hint at the kind of questions
that you’ll have to solve on an exam (keep that obsession aside for a few weeks
more).

An easy way to administer a survey is through Google Forms. Not only does it summarise
the responses, but it also provides a ready spreadsheet (see BDAResponses.csv). Create two
Python notebooks in your work folder, and name them PartA_RollNumbers.ipynb
and PartB_RollNumbers.ipynb.

Part A

1. Read the data from the spreadsheet into a dataframe called Responses.

2. Figure out how to obtain the following:

a. Header
b. The number of rows and columns
c. Types of the variables – how many of these are correctly identified?
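A minimal sketch of Questions 1 and 2, assuming pandas is the dataframe library in use. The inline CSV text is only a stand-in so that the snippet runs on its own: its header and two rows are borrowed from the table and illustration in this handout, and in the notebook you would pass the path to BDAResponses.csv instead.

```python
import io
import pandas as pd

# Stand-in for BDAResponses.csv: three columns and two rows only (assumption).
sample_csv = io.StringIO(
    "Timestamp,Your educational background,Work experience in months - NOT years\n"
    "02/07/20 9:52,Science,0\n"
    "02/07/20 9:56,Engineering,47\n"
)

# In the notebook: Responses = pd.read_csv("BDAResponses.csv")
Responses = pd.read_csv(sample_csv)

header = list(Responses.columns)   # 2a: the column names
n_rows, n_cols = Responses.shape   # 2b: number of rows and columns
var_types = Responses.dtypes       # 2c: the type pandas inferred for each column

print(header)
print(n_rows, n_cols)
print(var_types)
```

Compare the inferred types against what each question actually measures; a timestamp that was read in as text, for instance, counts as incorrectly identified.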

3. Rename the variables in the dataframe as follows:

Variable Name        Question

response_date        Timestamp
education            Your educational background
work_ex              Work experience in months - NOT years
code_ability         How do you rate your ability to program?
languages            What languages have you coded in?
lines_of_code        How many lines of code have you written?
mba_code_reasons     Why MBAs need to code?
worked_with_db       Have you worked with a database?
which_db             What kind of database was that?
why_bda              Why on earth a course in big data analysis?
soc_med_accounts     What social media accounts do you have?
soc_med_challenges   What would your dream career be?
dream_career         What would your dream career be?
life_mission         What did you say your mission was in life?
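One way to do the renaming is with a dictionary that maps each question to its variable name. This sketch assumes the CSV column headers are exactly the question texts, which you should verify against the header you obtained in Question 2. Only three of the pairs are shown, and the one-row frame is a stand-in for Responses.

```python
import pandas as pd

# Question text -> variable name, for the first three rows of the table above.
new_names = {
    "Timestamp": "response_date",
    "Your educational background": "education",
    "Work experience in months - NOT years": "work_ex",
}

# Stand-in for Responses with the original (question) headers.
Responses = pd.DataFrame([["02/07/20 9:52", "Science", 0]], columns=list(new_names))

# rename() leaves any column that is missing from the dictionary untouched.
Responses = Responses.rename(columns=new_names)
print(list(Responses.columns))
```

A dictionary is safer than assigning a list to Responses.columns, because it does not depend on the column order in the spreadsheet.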
As a first step, let’s convert all the variables into the appropriate types. Refer to this link by
Chris Albon to understand how to deal with datetime variables.

4. Convert the remaining variables to categorical/numerical depending on their scale.
Convert all the elaborate answers (e.g. why_bda) into string type variables.
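The conversions could look like the sketch below, which covers one variable of each kind. Two things here are assumptions: the timestamps are read as day/month/year (the handout does not say which way round they are), and the why_bda texts are placeholders, not real answers.

```python
import pandas as pd

# Small stand-in for Responses after renaming; why_bda values are placeholders.
Responses = pd.DataFrame({
    "response_date": ["02/07/20 9:52", "02/07/20 9:56", "02/07/20 10:00"],
    "education": ["Science", "Engineering", "Engineering"],
    "work_ex": ["0", "47", "55"],
    "why_bda": ["placeholder answer one", "placeholder answer two", "placeholder answer three"],
})

# Assumed day/month/year; use "%m/%d/%y %H:%M" if the export is month-first.
Responses["response_date"] = pd.to_datetime(Responses["response_date"], format="%d/%m/%y %H:%M")
Responses["education"] = Responses["education"].astype("category")   # nominal scale
Responses["work_ex"] = pd.to_numeric(Responses["work_ex"])           # ratio scale, in months
Responses["why_bda"] = Responses["why_bda"].astype("string")         # free-form text

print(Responses.dtypes)
```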

5. Examine the response_date variable. Bucket the values into consecutive hourly
interval slots, using a new or “derived” variable called hour_slot. To illustrate, the
top three responses will be slotted as shown below.

response_date     education     work_ex   hour_slot

02/07/20 9:52     Science       0         1
02/07/20 9:56     Engineering   47        1
02/07/20 10:00    Engineering   55        2
…                 …             …         …
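The illustration suggests that slots follow clock hours: 9:52 and 9:56 share slot 1, while 10:00 opens slot 2. Under that reading, a hedged sketch is to floor every timestamp to its hour and count whole hours elapsed since the first one. The day/month/year format is again an assumption.

```python
import pandas as pd

# The three timestamps from the illustration above.
response_date = pd.to_datetime(
    pd.Series(["02/07/20 9:52", "02/07/20 9:56", "02/07/20 10:00"]),
    format="%d/%m/%y %H:%M",  # assumed day/month/year
)

hour_start = response_date.dt.floor("h")   # 9:52 -> 9:00, 10:00 -> 10:00
# Whole hours between each response's hour and the first response's hour.
hours_since_first = (hour_start - hour_start.min()) // pd.Timedelta(hours=1)
hour_slot = hours_since_first + 1          # slots are numbered from 1

print(hour_slot.tolist())
```

Because the difference is taken on full timestamps, the numbering keeps increasing across midnight instead of restarting each day.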

6. Create a table to roll up the hour slots by frequencies as shown:

hour_slot   count
1           3
2           1
3           1
…           …

How would you interpret the zero count values in the above table?
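A point worth knowing before you interpret the zeros: value_counts() lists only the slots that actually occur, so slots with no responses show up only if you reindex over the full range. The slot numbers in this sketch are made up for illustration.

```python
import pandas as pd

# Hypothetical slot numbers; nobody responded during slot 3.
hour_slot = pd.Series([1, 1, 2, 4, 4, 4], name="hour_slot")

observed = hour_slot.value_counts().sort_index()   # has no row for slot 3
all_slots = range(1, hour_slot.max() + 1)          # every slot from first to last
slot_counts = observed.reindex(all_slots, fill_value=0)
slot_counts.index.name = "hour_slot"

print(slot_counts)
```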

7. Create a variable called inter_arrival_time to capture successive differences between
responses. Note that the spreadsheet is conveniently sorted in the first column, so you
don’t have to worry about the order of arrivals. Express the values in minutes.
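Successive differences can be sketched with diff(), which subtracts each row from the one after it. The timestamps are the three from the Question 5 illustration, parsed as day/month/year by assumption.

```python
import pandas as pd

response_date = pd.to_datetime(
    pd.Series(["02/07/20 9:52", "02/07/20 9:56", "02/07/20 10:00"]),
    format="%d/%m/%y %H:%M",  # assumed day/month/year
)

# diff() gives timedeltas; convert them to minutes as floats.
inter_arrival_time = response_date.diff().dt.total_seconds() / 60

print(inter_arrival_time.tolist())
```

The first value is missing (NaN) because the first response has no predecessor.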

8. In addition to the hour_slot from Question 5, create a truncated_hour, which snips off
the minutes field from the response_date value. Next, create a frequency table just
like the one shown.
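One possible reading of "snips off the minutes" is to floor each timestamp to the start of its hour, which keeps the date and the hour together. The sketch below does that and tabulates the result, again on the three illustrated timestamps with an assumed day/month/year format.

```python
import pandas as pd

response_date = pd.to_datetime(
    pd.Series(["02/07/20 9:52", "02/07/20 9:56", "02/07/20 10:00"]),
    format="%d/%m/%y %H:%M",  # assumed day/month/year
)

truncated_hour = response_date.dt.floor("h")   # minutes set to :00
hour_counts = truncated_hour.value_counts().sort_index()

print(hour_counts)
```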

9. We’re dealing with times here. Maybe it’s time for a time series question? Get a fancy
hourly time series depicting the response arrival counts across the survey period.
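For the series itself, a sketch using resample() is shown here; unlike a plain frequency table, it emits every hour in the survey period, including the empty ones. The fourth timestamp is invented to create a gap, and the date format is assumed to be day/month/year.

```python
import pandas as pd

# Three illustrated timestamps plus one hypothetical later response.
response_date = pd.to_datetime(
    ["02/07/20 9:52", "02/07/20 9:56", "02/07/20 10:00", "02/07/20 12:10"],
    format="%d/%m/%y %H:%M",  # assumed day/month/year
)

# One "arrival" per response, indexed by its timestamp.
arrivals = pd.Series(1, index=response_date)

# Sum the arrivals within each clock hour; hours with no responses come out as 0.
hourly_counts = arrivals.resample("h").sum()

print(hourly_counts)
```

Calling hourly_counts.plot() in the notebook draws the series, provided matplotlib is installed; the styling is up to you.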

Much ado about one puny variable? Now that you have developed a taste for what data
preparation and exploration entail, let’s focus our attention on other variables.

10. Obtain a “pairs plot” of all pairs of numerical variables, old and new. Make it pretty.
What observations can you make?

11. Examine the correlations between all pairs of numerical variables, old and new –
figure out how to do this efficiently (Hint: Not by choosing variables one pair at a
time!). What conclusions can you make?
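The hint points towards computing the whole matrix in one call. A sketch, assuming the numerical variables already have numeric types: select them by dtype and call corr(). The fourth row of this stand-in frame is hypothetical.

```python
import pandas as pd

# Stand-in for Responses; the last row is invented for illustration.
Responses = pd.DataFrame({
    "education": ["Science", "Engineering", "Engineering", "Commerce"],
    "work_ex": [0, 47, 55, 12],
    "hour_slot": [1, 1, 2, 3],
    "inter_arrival_time": [float("nan"), 4.0, 4.0, 75.0],
})

numeric = Responses.select_dtypes(include="number")   # drops education
corr_matrix = numeric.corr()                          # pairwise Pearson correlations

print(corr_matrix.round(2))
```

corr() skips missing values pair by pair, so the NaN in inter_arrival_time does not blank out the whole matrix.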

12. Check out the survey result link, and reproduce the bar charts and pie charts as
faithfully as possible, with the colour schemes and legends. Obtain a more meaningful
histogram for work_ex – use the seaborn library. Don’t know how to carry this out?
All you have to do is ask (Google).

Part B

For this part, you will use the second Python notebook you have created.

How would you analyse the textual variable values? Discuss this among your friends. This
is clearly the toughest question, one that has no clear-cut answer. Just like the ones you
encounter at the workplace…

13. Use the following link to figure out word frequencies in an answer using Approach
3. Figure out how to eliminate punctuation marks.
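The linked "Approach 3" is not reproduced in this handout, so the sketch below uses a different, standard-library route: strip punctuation with str.translate and count words with collections.Counter. The sample answer is made up.

```python
import string
from collections import Counter

answer = "Data, data everywhere: MBAs need data-driven decisions!"  # hypothetical answer

# Map every ASCII punctuation mark to a space so "data-driven" splits into two words.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))
words = answer.lower().translate(to_spaces).split()

word_freq = Counter(words)
print(word_freq.most_common(3))
```

string.punctuation covers ASCII marks only; curly quotes and dashes that come in from the form would need to be added to the table by hand.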

14. For each textual question, identify the top 10 terms across the base of 61 responses.
What do you notice? How would you remedy the problem of “trivial” terms?
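To see the "trivial" term problem and one common remedy side by side, the sketch below counts terms across several answers with and without a stop-word filter. The answers and the tiny stop-word set are both invented; a real analysis would use a fuller list from a text-processing library.

```python
import string
from collections import Counter

# Hypothetical answers to one textual question.
answers = [
    "The data is the new oil",
    "To make sense of the data in the business",
    "Analytics is the future of business",
]

# Tiny hand-picked stop-word set (assumption, far from complete).
stop_words = {"the", "is", "to", "of", "in", "a", "an", "and"}

def tokens(text):
    """Lower-case the text, replace ASCII punctuation with spaces, split into words."""
    to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(to_spaces).split()

all_terms = Counter(w for a in answers for w in tokens(a))
content_terms = Counter(w for a in answers for w in tokens(a) if w not in stop_words)

top_raw = all_terms.most_common(10)        # dominated by words like "the"
top_clean = content_terms.most_common(10)  # after removing stop words

print(top_raw)
print(top_clean)
```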

15. After applying the remedy, correlate the frequencies for the top 10 terms in each
answer with work_ex and code_ability. What conclusions do you make?
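One way to set this up, offered as a sketch: build one column of per-response counts for each top term, then correlate every column with work_ex in a single corrwith() call. The answers, the fourth work_ex value, and the two-term list are hypothetical; in the notebook the list would hold the top 10 terms from Question 14.

```python
import pandas as pd

# Stand-in for Responses; answers and the last work_ex value are invented.
Responses = pd.DataFrame({
    "why_bda": ["data and more data", "career in analytics", "data for my career", "curious"],
    "work_ex": [0, 47, 55, 12],
})

top_terms = ["data", "career"]   # hypothetical; would be the top 10 after the remedy

# Each answer becomes a list of lower-case words.
word_lists = Responses["why_bda"].str.lower().str.split()

# One column per term: how often the term appears in each response.
term_counts = pd.DataFrame({
    term: word_lists.apply(lambda ws, t=term: ws.count(t)) for term in top_terms
})

# Pearson correlation of every term column with work_ex.
term_corr = term_counts.corrwith(Responses["work_ex"])
print(term_corr)
```

The same call with Responses["code_ability"] gives the second set of correlations, provided code_ability has been converted to a numeric scale.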

16. Figure out how to build word clouds by going through this link. Carry this out for all
questions for which the answers are free form.

17. If you use Docker, you realise that the code for the answer to Q.16 does NOT run the
next time you fire it up. How would you avert this?
