BDA Survey Assignment
Thank you for taking part in the survey. You generated 61 responses, and this is precisely the
starting point for our explorations. Crack this assignment in groups of 4, with the hope that
you will later on develop into a project team. Upload your answers on Moodle.
The purpose of this assignment is to get you going on all of the Python you have picked up.
By now, you should have gone through the MOOC, and learned more ways to handle data.
The point of this assignment is NOT to go down the rabbit hole of matching
responses to individuals and feeling super about your investigative skills. Want to
play Sherlock? Why not show how good you are at cracking the tougher questions!
Don’t be daunted by the assignment. The point is NOT to hint at the kind of questions
that you’ll have to solve on an exam (keep that obsession aside for a few weeks
more).
An easy way to administer a survey is through Google Forms. Not only does it summarise
the responses, but it also provides a ready spreadsheet (see BDAResponses.csv). Create two
Python notebooks in your work folder, and name them PartA_RollNumbers.ipynb
and PartB_RollNumbers.ipynb.
Part A
1. Read the data from the spreadsheet into a dataframe called Responses.
5. Examine the response_date variable. Bucket the values into consecutive hourly
interval slots, using a new or “derived” variable called hour_slot. To illustrate, the
top three responses will be slotted as shown below.
hour_slot   count
1           3
2           1
3           1
…           …
How would you interpret the zero count values in the above table?
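One possible way to approach questions 1 and 5 is sketched below. It is only a sketch: it assumes response_date parses cleanly as a timestamp, and that slot 1 begins at the top of the hour in which the earliest response arrived. The inline rows are invented for illustration; in your notebook you would pass "BDAResponses.csv" to pd.read_csv instead.

```python
import io
import pandas as pd

# Invented sample rows (assumption); in the notebook use
# pd.read_csv("BDAResponses.csv", parse_dates=["response_date"]).
sample_csv = io.StringIO(
    "response_date\n"
    "2020-01-06 09:05:00\n"
    "2020-01-06 09:20:00\n"
    "2020-01-06 09:55:00\n"
    "2020-01-06 10:10:00\n"
    "2020-01-06 13:40:00\n"
)
Responses = pd.read_csv(sample_csv, parse_dates=["response_date"])

# Assumption: slot 1 starts at the earliest response, floored to the hour.
start = Responses["response_date"].min().floor("h")
elapsed = Responses["response_date"] - start
Responses["hour_slot"] = (elapsed // pd.Timedelta(hours=1)).astype(int) + 1

# Count responses per slot, keeping empty slots so zero counts stay visible.
all_slots = range(1, Responses["hour_slot"].max() + 1)
slot_counts = Responses["hour_slot"].value_counts().reindex(all_slots, fill_value=0)
slot_counts.index.name = "hour_slot"
print(slot_counts)
```

The reindex step is what makes empty hours show up with a count of zero; without it, value_counts silently skips slots in which nobody responded.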
9. We’re dealing with times here. Maybe it’s time for a time series question? Get a fancy
hourly time series depicting the response arrival counts across the survey period.
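For the hourly time series, one option (among several) is to let pandas do the bucketing with resample. The sketch below uses invented timestamps standing in for Responses["response_date"], and the names stamps, arrivals and hourly are illustrative only.

```python
import pandas as pd

# Invented timestamps (assumption) standing in for Responses["response_date"].
stamps = pd.to_datetime(
    [
        "2020-01-06 09:05:00",
        "2020-01-06 09:20:00",
        "2020-01-06 09:55:00",
        "2020-01-06 10:10:00",
        "2020-01-06 13:40:00",
    ]
)

# One "arrival" per response, indexed by its timestamp.
arrivals = pd.Series(1, index=stamps)

# Sum the arrivals in each clock hour; hours with no responses come out as 0.
hourly = arrivals.resample("h").sum()
print(hourly)
```

Calling hourly.plot() would then draw the series, provided matplotlib is installed. Making it fancy is up to you.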
Much ado about one puny variable? Now that you have developed a taste for what data
preparation and exploration entail, let’s focus our attention on other variables.
10. Obtain a “pairs plot” of all pairs of numerical variables, old and new. Make it pretty.
What observations can you make?
11. Examine the correlations between all pairs of numerical variables, old and new –
figure out how to do this efficiently (Hint: Not by choosing variables one pair at a
time!). What conclusions can you make?
12. Check out the survey result link, and reproduce the bar charts and pie charts as
faithfully as possible, with the colour schemes and legends. Obtain a more meaningful
histogram for work_ex – use the seaborn library. Don’t know how to carry this out?
All you have to do is ask (Google).
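As a hint for question 11, the correlations for every pair of numerical variables can be had in a single call. The sketch below is built on invented values. The column names work_ex, code_ability and hour_slot come from this assignment, while comment_text is a hypothetical text column added only to show that non-numeric columns get dropped.

```python
import pandas as pd

# Invented values (assumption); in the notebook, use your Responses dataframe.
frame = pd.DataFrame(
    {
        "work_ex": [0, 12, 24, 36, 60],
        "code_ability": [2, 3, 3, 4, 5],
        "hour_slot": [1, 1, 2, 5, 1],
        "comment_text": ["a", "b", "c", "d", "e"],  # hypothetical text column
    }
)

# Keep only the numerical columns, old and new.
numeric = frame.select_dtypes(include="number")

# Pearson correlation for every pair of columns at once.
corr = numeric.corr()
print(corr.round(2))
```

The same numeric frame can be handed to seaborn.pairplot for question 10, and seaborn.histplot is one place to start for the work_ex histogram in question 12.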
Part B
For this part, you will use the second Python notebook you have created.
How would you analyse the textual variable values? Discuss this among your friends. This
is clearly the toughest question, one that has no clear-cut answer. Just like the ones you
encounter at the workplace…
13. Use the following link to figure out word frequencies in an answer using Approach
3. Figure out how to eliminate punctuation marks.
14. For each textual question, identify the top 10 terms across the base of 61 responses.
What do you notice? How would you remedy the problem of “trivial” terms?
15. After applying the remedy, correlate the frequencies for the top 10 terms in each
answer with work_ex and code_ability. What conclusions do you make?
16. Figure out how to build word clouds by going through this link. Carry this out for all
questions for which the answers are free form.
17. If you use Docker, you realise that the code for the answer to Q.16 does NOT run the
next time you fire it up. How would you avert this?
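To get you started on questions 13 and 14, here is a minimal sketch of counting words after stripping punctuation and then filtering out "trivial" terms. It is not the approach from the linked page, just a plain standard-library version. The sample answers, the tiny stop-word list, and the helper names tokenize and top_terms are all invented for illustration.

```python
import string
from collections import Counter

# Invented answers (assumption) standing in for one free-text survey column.
answers = [
    "I want to learn Python, and to handle data!",
    "To learn how data is analysed.",
]

# A tiny hand-picked stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"i", "to", "and", "is", "how", "the", "a"}


def tokenize(text):
    """Lower-case the text, delete ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def top_terms(texts, n=10, stop_words=frozenset()):
    """Return the n most frequent terms across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(word for word in tokenize(text) if word not in stop_words)
    return counts.most_common(n)


raw_top = top_terms(answers)
filtered_top = top_terms(answers, stop_words=STOP_WORDS)
print(raw_top)
print(filtered_top)
```

Note that string.punctuation covers ASCII punctuation only, so curly quotes and similar characters in real responses would need separate handling.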