Data Science - Test Module

Uploaded by

Mukarram Ali

Welcome to our Introduction to Data Science Test Module.

Greetings, future data scientists and professionals eager to dive into the world of data science. Whether
you're here to upskill, gain insights into potential career pathways, or develop foundational expertise,
we're thrilled to have you join us.
Please note: Before you begin to fill out this document, kindly make a copy and rename it to ‘Your
Name_DS Testing’, e.g., “Ainee_DS Testing”.

About the Course


At LUMSx, we believe in empowering you, our learners, to navigate the ever-evolving field of Data Science.
We will lay the foundation with descriptive statistics and work through the intricacies of data biases. We’ll
then dive into statistical inference and machine learning, where hypotheses are tested.
Together, we'll study models and algorithms, exploring concepts like regression and classification,
as we tackle the fundamental challenges of learning. As our journey nears its end, we'll equip
ourselves with the tools and ethics needed to navigate scalable data collection and processing.

Learning Outcomes
By the end of this course, learners will be able to:
- Conduct sound data analysis.
- Describe a given data set and assess its quality.
- Understand issues in data collection.
- Build data pipelines (collection, cleaning, EDA, modeling, evaluation, results) for “repeatable”
work.
- Become well-versed in tools and technologies for data analysis (e.g., Pandas, scikit-learn).
- Understand the theory behind drawing inferences from data.
- Communicate results effectively.

For testing purposes: We will be sharing videos for one lesson of module 1, one quiz, and one data
assignment with you.
As you go through the content, here are some friendly reminders:
- Follow the sequence of the course material as outlined in this document. Try not to skip any
sections.

- Remember, you do not have to go through all the material in one sitting. You have 2 weeks to go
through the material. The data assignment will take around 5-6 hours, so you can spread it
over the two weeks and complete it in chunks.

- Feel free to rewind, pause, and replay the video if needed for repetition.
- Attempt the quiz at the end and try to answer it independently. While answer keys will be
provided, these questions are designed for practice purposes and will not be graded.
- If you have any comments or feedback on your working document that you would like the
LUMSx team to view, don’t hesitate to send it in an email to us.

Thank you!

Module-1: Descriptive Statistics, Data Acquisition, and Tools


What are some possible sources of bias when collecting data? How do I sift through my data using data science tools (e.g.,
Pandas)?

Lesson- 4: Data Manipulation Using Pandas II

M1_L4_V1_Pandas Str Methods.mp4
M1_L4_V2_Pandas Sorting.mp4
M1_L4_V3_Pandas Groupby.mp4
M1_L4_V4_Groupby Other Features.mp4
M1_L4_V5_Pivot Tables.mp4

Data Assignment:
Click here to attempt the Data assignment.

Quiz:
Select ONE correct answer for each multiple choice question.

1. A restaurant hygiene inspector for a chain with multiple locations randomly selects some of the
locations for a cleanliness check of their kitchens. The inspector checks every kitchen in the
locations that were chosen. What type of sample is this?
a. Cluster sampling
b. Stratified sampling
c. Convenience sampling
2. You have a DataFrame called quizScores with column names “1”, “2”, and “3”. The DataFrame
contains 10 rows. What will be the result of the following line of code?

quizScores[["1"]][1]

a. A DataFrame with the second element of the column “1”
b. A Series with the second element of the column “1”
c. Neither, since this will throw a KeyError
3. What is the primary drawback of quota sampling?
a. It is time consuming
b. It can introduce bias
c. It requires a large sample size
4. In a study where you want to find the top career choices for university students in Pakistan, you
visit the top 3 most expensive universities in the country to gather your data. What kind of bias
could most likely be present in your data?
a. Selection Bias
b. Non-response Bias
c. None of the above
5. Consider the following DataFrame named menu:

Which of the following will return the names (“Menu Item”), prices (“Price”), and calories
(“Calories”) of all items with price below 400 and calories below 500?

a. menu.loc[(menu["Price"]<400) & (menu["Calories"]<500), "Menu Item":"Review"]
b. menu.loc[(menu["Price"]<400) & (menu["Calories"]<500), "Menu Item":"Calories"]
c. menu.iloc[[1,2,5,6,7], 0:2]

Answers and Explanations:


MCQ | Option | Correct/Incorrect | Explanation

1 | a | Correct | The inspector randomly selects some locations (clusters) and checks every kitchen within those selected locations, making it a cluster sampling method.
1 | b | Incorrect | Stratified sampling involves dividing the population into homogeneous groups (strata) and then randomly selecting samples from each group. This scenario does not involve dividing the population into strata or any random selection of samples from within them.
1 | c | Incorrect | Convenience sampling involves selecting the most readily available individuals or units as samples, rather than using random selection. This scenario does not involve convenience sampling as the selection is random.

2 | a | Incorrect | The line of code will return an error and not a DataFrame (see the explanation of the error under option c).
2 | b | Incorrect | The line of code will return an error and not a Series (see the explanation of the error under option c).
2 | c | Correct | quizScores[["1"]] will return a DataFrame with only one column named “1”. Then, quizScores[["1"]][1] tries to create a Series using values in a column named 1 from the returned DataFrame. However, no such column exists, as the one in the previously returned DataFrame has the name “1” (string) and not 1 (int). Thus, we get a KeyError.

3 | a | Incorrect | While sampling methods vary in time requirements, this is not the primary drawback of quota sampling.
3 | b | Correct | Quota sampling may lead to biased results because individuals are not randomly selected, but rather chosen based on predetermined characteristics.
3 | c | Incorrect | Quota sampling does not necessarily require a large sample size; it depends on the specific quotas set for the sample.

4 | a | Correct | By only sampling from the top 3 most expensive universities, you may not capture the perspectives and career aspirations of students from other socioeconomic backgrounds, leading to a biased representation of career choices.
4 | b | Incorrect | Non-response bias occurs when certain groups within the sample are more likely to respond to the survey than others, but it doesn't directly relate to the method of selecting the sample, as in this scenario.
4 | c | Incorrect | Selection bias is likely present due to the limited and specific selection of universities, which doesn't represent the entire population of university students in Pakistan.

5 | a | Incorrect | Since loc is inclusive of the end label when selecting a range, the given code selects columns "Menu Item" through "Review" for items meeting the criteria, so it includes "Review", which wasn't asked for.
5 | b | Correct | Since loc is inclusive of the end label when selecting a range, the given code selects columns "Menu Item" through "Calories" for items meeting the criteria, fulfilling the requirements.
5 | c | Incorrect | Since iloc is exclusive of the end position when selecting a range, the given code selects only columns "Menu Item" through "Price" for the listed rows, so it does not include "Calories", which was required.
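
The explanation for question 2 can be checked with a minimal sketch. The scores below are made-up placeholder values, since the quiz does not give the actual data; only the string column labels "1", "2", "3" and the 10-row shape come from the question. The last two lines are one possible way to fetch the second element and are not part of the quiz.

```python
import pandas as pd

# Hypothetical data: the quiz only specifies 10 rows and string labels "1", "2", "3".
quizScores = pd.DataFrame({
    "1": range(10),
    "2": range(10, 20),
    "3": range(20, 30),
})

# A list of labels inside [] returns a DataFrame with just those columns.
one_col = quizScores[["1"]]
print(type(one_col).__name__)

# A bare integer inside [] is treated as a column label, not a row position.
try:
    one_col[1]
    raised = False
except KeyError:
    raised = True  # there is no column labelled 1 (int), only "1" (str)
print(raised)

# Position-based access to the second element of column "1":
second_as_scalar = quizScores["1"].iloc[1]     # a single value
second_as_frame = quizScores[["1"]].iloc[[1]]  # a 1x1 DataFrame
```

Using .iloc makes the intent explicit: it always selects by position, regardless of how the columns or rows are labelled.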
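
The difference between loc and iloc slicing in question 5 can be sketched as follows. The menu table itself is not reproduced in this document, so the rows below are invented stand-ins; the column order ("Menu Item", "Price", "Calories", "Review") is an assumption inferred from the answer explanations.

```python
import pandas as pd

# Hypothetical stand-in for the quiz's menu table (the original values are not shown).
menu = pd.DataFrame({
    "Menu Item": ["Soup", "Salad", "Burger", "Wrap"],
    "Price": [250, 350, 600, 380],
    "Calories": [200, 450, 900, 550],
    "Review": [4.1, 3.8, 4.5, 4.0],
})

# Row filter: price below 400 AND calories below 500.
mask = (menu["Price"] < 400) & (menu["Calories"] < 500)

# .loc slices by label and includes the end label, so "Calories" is kept.
by_label = menu.loc[mask, "Menu Item":"Calories"]
print(list(by_label.columns))

# .iloc slices by position and excludes the end position, so 0:2 stops before "Calories".
by_position = menu.iloc[:, 0:2]
print(list(by_position.columns))
```

With this stand-in data, the label-based slice keeps three columns while the position-based slice keeps two, which mirrors why option b satisfies the question and option c does not.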
