0% found this document useful (0 votes)
15 views

Data-Science-Assignments

The document outlines an assignment on the fundamentals of data science, covering key concepts such as the interdisciplinary nature of data science, the distinction between data science and machine learning, and the importance of soft skills. It also details the data science process through a book recommendation system, emphasizing stages like problem definition, data collection, and exploratory data analysis. Additionally, it classifies data attributes related to students and discusses the implications of misclassification in binary data.

Uploaded by

fazalabbas
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Data-Science-Assignments

The document outlines an assignment on the fundamentals of data science, covering key concepts such as the interdisciplinary nature of data science, the distinction between data science and machine learning, and the importance of soft skills. It also details the data science process through a book recommendation system, emphasizing stages like problem definition, data collection, and exploratory data analysis. Additionally, it classifies data attributes related to students and discusses the implications of misclassification in binary data.

Uploaded by

fazalabbas
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Assignment 01

Fundamentals of Data Science

Total Marks: 30 (10 marks per question)

Question 1: The Big Picture of Data Science

Imagine you are explaining Data Science to a friend who thinks it’s just about
"coding and numbers."
● How would you describe Data Science as an interdisciplinary field?
Mention at least three core areas it combines and explain why each is
important.
● How is Data Science different from Machine Learning (ML)? Provide a
real-world example (e.g., predicting weather vs. analyzing climate trends) to
highlight the difference.
● Why are soft skills like storytelling and communication critical for a
Data Scientist? Give an example of how poor communication could lead to a
failed project.
Answer:

The Big Picture of Data Science

Data Science is not just "coding and numbers." Here's how you might break it down:

1. Data Science as an Interdisciplinary Field: Data Science is much more than just writing
code or crunching numbers. It's a mix of various disciplines working together to solve real-world
problems using data. Three core areas it combines are:

 Mathematics and Statistics: This is essential because data science relies on


mathematical models and statistical analysis to extract meaningful insights from data.
Whether it's finding patterns or making predictions, math and stats are at the heart of
every analysis.
 Computer Science: This is the backbone of data science. It's where coding comes in. A
Data Scientist needs to know how to handle, manipulate, and store large datasets, write
algorithms, and build efficient systems for processing data.
 Domain Knowledge: Data science is often applied to specific fields, such as healthcare,
finance, or marketing. Domain knowledge helps a Data Scientist understand the context
of the data and ask the right questions, ensuring that insights are relevant and actionable.
Each of these areas is crucial because they allow Data Scientists to analyze data properly, extract
insights, and apply them in meaningful ways.

2. Data Science vs. Machine Learning (ML): While Data Science and Machine Learning are
related, they are not the same thing.

 Data Science involves the entire process of collecting, analyzing, and interpreting data to
make decisions or gain insights. It includes tasks like data cleaning, statistical analysis,
and data visualization.
 Machine Learning is a subset of Data Science that focuses on using algorithms to learn
from data and make predictions or decisions without being explicitly programmed. ML is
one of the tools a Data Scientist uses.

For example:

 Predicting the weather (Data Science) involves gathering vast amounts of weather data,
analyzing it with statistical models, and providing forecasts.
 Analyzing climate trends (Machine Learning) involves using algorithms to study long-
term climate patterns, identifying trends, and making predictions based on past data. The
focus here is more on learning from data to make forecasts about the future.

3. The Importance of Soft Skills (Storytelling and Communication): Soft skills are critical for
Data Scientists because they help transform complex data insights into clear, actionable
messages that non-experts can understand.

For example, poor communication could lead to a failed project if a Data Scientist identifies
important trends in a dataset but fails to communicate the findings effectively. If the insights are
presented in a confusing or overly technical way, stakeholders might not understand their
significance and fail to act on them. In contrast, a well-told story backed by data can inspire
action and drive business decisions.

In short, Data Science is a collaborative field that combines technical expertise and
communication skills to turn data into valuable insights for decision-making.

Question 2: The Data Science Process in Action

You are tasked with building a system to recommend books to users based on their
preferences.
● List and briefly explain the key stages of the Data Science process you
would follow for this project.
● Why is Exploratory Data Analysis (EDA) important before building the
model? Mention two specific tasks you’d perform during EDA (e.g.,
detecting outliers, checking data types).
● How would you evaluate the final model? Name one metric to assess its
performance.

Answer:
Question 2: The Data Science Process in Action

Key Stages of the Data Science Process:

1. Problem Definition:
Start by clearly understanding the problem at hand. In this case, the goal is to build
a system that recommends books to users according to their preferences. You need
to identify the type of recommendations (e.g., content-based, collaborative
filtering) and the required data to build the model.
2. Data Collection:
Gather relevant data, such as user ratings, book details (genres, authors, etc.), and
user profiles. This can come from databases, APIs, or publicly available datasets.
3. Data Cleaning and Preprocessing:
Clean the collected data to handle missing values, duplicates, and irrelevant
features. Normalize or scale the data if necessary, and convert categorical data
(like book genres) into numerical values using techniques like one-hot encoding.
4. Exploratory Data Analysis (EDA):
EDA is done to understand the data better. This step helps in identifying patterns,
correlations, and any anomalies. Visualizations like histograms and scatter plots
are useful for this.
5. Feature Engineering:
Create new features from the raw data that may enhance the model’s performance.
For example, combining user preferences or extracting metadata features from
book descriptions can add value.
6. Model Selection and Training:
Choose an appropriate model based on the problem type (e.g., collaborative
filtering, content-based filtering, or hybrid methods). Split the data into training
and testing sets and train the model using the training data.
7. Model Evaluation:
Assess the performance of the model using suitable metrics like precision, recall,
or RMSE (Root Mean Squared Error). Adjust model parameters as needed and
retest.
8. Deployment and Monitoring:
Once the model performs well, deploy it to a production environment where it can
recommend books to real users. Monitor its performance continuously and update
it as new data comes in.

Why is Exploratory Data Analysis (EDA) Important Before Building the Model?

Exploratory Data Analysis (EDA) is critical because it helps you understand the structure
of your data, the relationships between different features, and any issues that could affect
the model's performance.

 Detecting Outliers: Outliers can distort model predictions. Identifying and


handling outliers helps ensure that the model doesn't place too much weight on
extreme values.
 Checking Data Types: Ensuring that the data types (numerical, categorical) are
correctly assigned allows proper processing and avoids errors during modeling.
For example, if a categorical feature is mistakenly treated as a numerical feature, it
could lead to inaccurate results.
Question 3: Understanding Data Attributes
A dataset contains information about students in a school, including:
● Height (in cm)
● Favorite Subject (Math, Science, Arts)
● Exam Pass/Fail Status (Yes/No)
● Student ID (e.g., S001, S002)
For each attribute above:
● Classify its type (Nominal, Binary, or Other) and justify your answer.
● Which attribute is asymmetric binary? Explain why it’s asymmetric with a
real-world consequence (e.g., how misclassifying a "Fail" as "Pass" could
impact students).
● Why can’t we calculate the "average" of Student ID? Relate your answer
to the properties of nominal attributes.

Answer 3: Understanding Data Attributes

A dataset contains information about students in a school, including:

 Height (in cm)


 Favorite Subject (Math, Science, Arts)
 Exam Pass/Fail Status (Yes/No)
 Student ID (e.g., S001, S002)

For each attribute above:

 Classify its type (Nominal, Binary, or Other) and justify your answer.
 Which attribute is asymmetric binary? Explain why it’s asymmetric with
a real-world consequence (e.g., how misclassifying a "Fail" as "Pass" could
impact students).
 Why can’t we calculate the "average" of Student ID? Relate your answer
to the properties of nominal attributes.

Classification and Justification:

1. Height (in cm):


o Type: Other (Continuous or Quantitative)
o Justification: Height is a numerical value that can take any real
number within a range, making it a continuous variable. It’s used to
measure a quantity (the student's height).
2. Favorite Subject (Math, Science, Arts):
o Type: Nominal
oJustification: The favorite subject is a categorical variable with no
inherent order or ranking. Math, Science, and Arts are different
categories that represent preferences, but there’s no natural order
among them.
3. Exam Pass/Fail Status (Yes/No):
o Type: Binary
o Justification: This attribute has two possible values (Yes or No),
representing a binary outcome. It’s a categorical variable but with
only two categories, making it binary.
4. Student ID (e.g., S001, S002):
o Type: Nominal
o Justification: Student IDs are labels used to uniquely identify each
student. These IDs don’t have any mathematical significance and don't
follow a specific order. Each ID represents a unique individual, but
the numbers or characters are arbitrary, making them nominal.

Asymmetric Binary Attribute:

 Asymmetric Binary Attribute: Exam Pass/Fail Status (Yes/No)


 Explanation: The exam pass/fail status is an asymmetric binary attribute
because the consequences of a misclassification can have serious real-world
impacts. For example, if a student who has failed is incorrectly classified as
having passed, they might not receive the necessary support or intervention
to improve, which could affect their academic progress and future
opportunities. On the other hand, misclassifying a "Pass" as "Fail" may lead
to unnecessary intervention but won’t jeopardize the student’s future as
much as the reverse error.

Why Can’t We Calculate the "Average" of Student ID?

 Reason: We cannot calculate the "average" of Student IDs because Student


ID is a nominal attribute.
 Explanation: Nominal attributes are categorical and do not have a
meaningful order or numerical relationship. An ID like "S001" represents a
unique student, but it doesn't have any inherent numerical meaning.
Averaging nominal data doesn't make sense because the IDs are simply
labels used for identification, not quantities that can be averaged or
calculated in any meaningful way.

You might also like