Crispy Notes: Introduction to Data Science

1. Explain the four levels of data with an example.


Ans: The four levels of data: a specific characteristic (feature/column) of structured data can be classified into one of four levels of data. The levels are:

1. Nominal Data: Categories with no inherent order.
o Example: Gender (Male, Female), types of fruit (Apple, Banana, Orange).
2. Ordinal Data: Categories with a specific order but no consistent difference between them.
o Example: Customer satisfaction (Satisfied, Neutral, Dissatisfied), grades (A, B, C).
3. Interval Data and 4. Ratio Data: both are quantitative and are compared in detail below.
Interval vs. Ratio Data
Both interval and ratio data are quantitative, meaning they deal with numerical
values. However, there are key differences between the two in terms of operations
we can perform and how we interpret the numbers. Let’s break them down:
Interval Data
• Definition: Interval data allows you to measure the difference between
data points, but it lacks a true zero point.
• Characteristics:
o Order and spacing: The distance between two values is meaningful.
For instance, the difference between 30°C and 40°C is 10°C, just like
between 20°C and 30°C.
o No true zero: Zero is arbitrary. Zero degrees Celsius doesn’t mean
the absence of temperature, just that it's the freezing point of
water.
o Mathematical operations: You can add and subtract values, but
multiplication and division don’t make sense.
• Example:
o Temperature in Celsius or Fahrenheit:
▪ 20°C is hotter than 10°C by 10°C, but 40°C is not “twice as
hot” as 20°C because the zero point is arbitrary.
Ratio Data
• Definition: Ratio data is the highest level of measurement and allows all
mathematical operations, including multiplication and division.
• Characteristics:
o Order and spacing: Like interval data, ratio data also has meaningful
distances between values.
o True zero: The key difference is that ratio data has a true zero point,
meaning zero represents the complete absence of the quantity.
o Mathematical operations: You can perform all mathematical
operations—addition, subtraction, multiplication, and division.
• Example:
o Weight:
▪ A weight of 0 kg means no weight at all. Also, 40 kg is twice
as heavy as 20 kg.
(Note: a feature may also be called an attribute, a variable, or a column of a dataset.)

________________________________________________________
2. Explain structured vs. unstructured data with an example. Explain the Venn diagram of data science. What domains are involved in data science?
Ans: Structured vs. Unstructured Data

• Structured Data: This is organized data, often found in tables with rows and columns. It’s easy to analyze using traditional statistical methods and machine learning algorithms.
o Example: A spreadsheet with columns like "Name," "Age," "Salary." This is structured because it follows a specific format.
• Unstructured Data: This is disorganized data that doesn’t fit into neat rows and columns. It could be text, videos, social media posts, or images.
o Example: A tweet like “Had a great coffee today!” This doesn’t follow a structured format but still holds valuable information.
• Why It Matters: While structured data is easier to analyze, unstructured data makes up around 80-90% of all the data in the world (e.g., emails, social media posts, server logs). Data scientists need techniques to turn unstructured data into structured data for analysis, often through preprocessing (a small sketch follows below).
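
As a small, hedged illustration of such preprocessing (not from the notes; the tweets and feature names are made up for the example), here is a minimal Python sketch that turns raw text into a few structured columns:

```python
# Minimal sketch: turning unstructured text into structured features.
# The chosen features (length, word count, keyword flag) are illustrative only.
tweets = ["Had a great coffee today!", "Where is the export button?"]

rows = []
for text in tweets:
    words = text.split()
    rows.append({
        "text": text,
        "char_count": len(text),                       # length of the raw text
        "word_count": len(words),                      # whitespace-separated tokens
        "mentions_coffee": "coffee" in text.lower(),   # simple keyword flag
    })

for row in rows:
    print(row)
```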

Understanding the Venn Diagram of Data Science


• Data Science combines three key areas (Domains):
Math/Statistics, Computer Programming (Hacking Skills), and
Domain Expertise. Mastering all three allows someone to truly
perform data science, but let’s break down each area and how
they intersect.

Intersections of These Skills:


• Hacking Skills + Math/Statistics:
o Intersection: This combination allows for machine learning (ML), where algorithms are developed and tuned.
o Example: A developer can create and optimize a model to recommend movies to users based on their past preferences, but they may lack an understanding of the movie industry itself.
• Math/Statistics + Domain Expertise:
o Intersection: This leads to traditional research, where mathematical tools are applied to real-world problems, but without programming automation.
o Example: A researcher might use statistical analysis to study heart attack patterns, but they need help coding an algorithm to predict outcomes.
• Hacking Skills + Domain Expertise:
o Intersection: This can be a "Danger Zone," where someone can automate processes but might lack the mathematical understanding to ensure accuracy.
o Example: A day trader creates an automated trading system but doesn’t evaluate its long-term performance mathematically, which could lead to losing money.
Data Science: The Sweet Spot
• Where All Three Meet:
o True data science happens at the intersection of all three skills: coding, math/statistics, and domain knowledge. This enables someone to:
▪ Write code to access and process data,
▪ Understand the math behind the models they create,
▪ Apply the results to their specific field for meaningful insights.
---------------------------------------------------------------------------------------------------
3. A) Define data science. What does it do (and why do we need data science)?
• Ans: Data Science Definition:
o Data science is the process of using data to gain knowledge and insights. It’s about making decisions, predicting the future, and understanding the past or present by analyzing data.
B) Define qualitative and quantitative data with examples.
Ans: Qualitative data describes categories or qualities that cannot be meaningfully measured as numbers, while quantitative data is numerical and can be counted or measured.
Examples: 1) Name of a coffee shop (Qualitative) 2) Revenue (Quantitative) 3) Zip code (Qualitative, since it is a label rather than a measurable quantity) 4) Avg. monthly income (Quantitative) 5) Country of coffee origin (Qualitative).
C) Explain the case study of Sigma Technologies and how data science adds value to their operations.
Ans: Importance of data science with a business example (Company: Sigma Technologies)
• Problem: Ben Runkle, CEO of Sigma Technologies, noticed the company was losing customers, but didn't know why.
o His gut feeling was to create new products and features to fix the problem.
• Solution by the Data Scientist: Dr. Jessie Hughan, the company's chief data scientist, took a different approach. She looked at customer service transcripts instead of going with gut feeling.
o She found that customers were not leaving because of a lack of features, but because of confusing UI/UX (User Interface/User Experience).
Key Insights from Customer Data:
• Dr. Hughan noticed customers were saying things like:
o "Not sure how to export this; are you?"
o "Where is the button that makes a new list?"
o "Wait, do you even know where the slider is?"
• Conclusion: The issue was that customers found the product difficult to use. They weren’t frustrated due to missing features but because of bad UI/UX.
• Action Taken: The company did a UI/UX overhaul (improving how the software looks and functions), and sales skyrocketed.
Analytical vs. Gut-Driven Thinking
• Gut-Driven Approach (Runkle): The CEO, like many leaders, wanted to make quick decisions based on his instincts.
o Example: Runkle believed creating new features would fix the issue, without looking at the data first.
• Analytical Approach (Hughan): The data scientist wanted to use data to find the real problem.
o Example: Dr. Hughan used customer service data to uncover that customers were struggling with the existing design, not the lack of features.
--------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
1. Define the following mathematics terms from a data science perspective.
a) Vector
• Ans: Definition: A vector is a list of numbers that represents both magnitude (size) and direction. However, for simplicity, think of it as a 1-dimensional array.
o Ex: ratings (0-100) given by employees for three departments of a company: 57 for HR, 89 for engineering, and 94 for management.
b) Matrix
• Ans: Definition: A matrix is a 2-dimensional array of numbers arranged in rows and columns.
o Ex: Suppose the company has three offices in different locations, each with the same three departments (HR, engineering, and management). The ratings can then be arranged in a 3×3 matrix, with one row per department and one column per office.
c) Proportional
• Ans: Definition: Two quantities are proportional when a change in one is associated with a predictable change in the other (directly or inversely).
o Scenario: When oil availability decreases, gas prices increase (an inverse relationship).
d) Graph
• Ans: Definition: A graph is a set of nodes (vertices) connected by edges, used to model relationships, such as users connected to one another in a social network.
e) Logarithms/Exponents
• Ans: Definition: An exponent repeatedly multiplies a base (e.g., 2^3 = 8), and a logarithm is the inverse operation (log2 8 = 3). Logarithms are often used to rescale quantities that grow multiplicatively.
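
The vector and matrix definitions above can be written down directly in code; here is a minimal NumPy sketch (the office ratings beyond the first column are made-up values, used only to illustrate the 3×3 shape):

```python
import numpy as np

# Vector: employee ratings (0-100) for HR, engineering, and management
ratings = np.array([57, 89, 94])

# Matrix: the same three departments (rows) across three offices (columns);
# the second and third columns are invented values for illustration only
office_ratings = np.array([
    [57, 67, 71],   # HR
    [89, 85, 90],   # engineering
    [94, 92, 88],   # management
])

print(ratings.shape)         # (3,)   -> 1-dimensional array (vector)
print(office_ratings.shape)  # (3, 3) -> 2-dimensional array (matrix)
```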
--------------------------------------------------------------------------------------------------------------------
2. Explain the Dot product, set theory (Jaccard index), matrix
multiplication concept with the movie recommendation
example.
Set Theory in Data Science:
Set Theory is a branch of mathematical logic that deals with sets, which
are collections of distinct objects.
1. Set:
o A set is a collection of distinct objects.

o Example: A set of numbers: {1, 2, 3}

Real-time Example: Movie Recommendation System

Imagine you are running a movie recommendation platform like


Netflix. You want to recommend movies to users based on what they
and others have watched. This is where set theory comes into play.

Step-by-Step Breakdown:

1. Users as Sets:

Each user on your platform can be represented as a set of movies


they have watched. Let's say you have two users:

• User1: {"Inception", "Titanic", "Avengers"}


• User2: {"Avengers", "The Matrix", "Titanic"}

In this case, the sets contain the movies each user has watched.

2. Intersection (Common Movies):

You can use the intersection of sets to find out which movies both
users have watched. The intersection of two sets is the set of movies
that both users have in common:

User1 ∩ User2={"Avengers","Titanic"}

So, both User1 and User2 have watched "Avengers" and "Titanic".

3. Union (All Movies Watched):


The union of the two sets gives you all the movies that either of the
users has watched, without duplicates:

User1 ∪ User2 = {"Inception", "Titanic", "Avengers", "The Matrix"}

This tells you the complete list of movies watched by either of the
two users.

4. Jaccard Similarity (How Similar are the Users?):

In data science, we often want to know how similar two users are.
One way to measure similarity is to use the Jaccard similarity. It’s calculated as the size of the intersection divided by the size of the union:

Jaccard(User1, User2) = |User1 ∩ User2| / |User1 ∪ User2|
In this case:

• Intersection = 2 (they both watched 2 movies: "Avengers" and


"Titanic")
• Union = 4 (total unique movies watched by both: 4)

So, the Jaccard similarity is:

Jaccard(User1, User2) = 2 / 4 = 0.5
This means User1 and User2 are 50% similar based on their movie-
watching habits.

5. How It Helps with Recommendations:

If a new user (let's call them User3) has watched "Titanic" and
"Avengers", you can use set theory to find similar users (like User1
and User2) and recommend movies they haven't watched yet. For
example:
• User3 hasn't watched "Inception" and "The Matrix".
• Since User1 and User2 both like similar movies, it's likely that
User3 might enjoy "Inception" or "The Matrix."

This process of using similarity between sets of users to recommend new


items (in this case, movies) is called collaborative filtering, a common
technique in recommendation systems.
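
The set operations described above translate directly into Python's built-in set type; here is a minimal sketch using the same movies from the example:

```python
# Minimal sketch of the set operations used in the example above
user1 = {"Inception", "Titanic", "Avengers"}
user2 = {"Avengers", "The Matrix", "Titanic"}

common = user1 & user2        # intersection: {"Avengers", "Titanic"}
all_watched = user1 | user2   # union: 4 unique movies

jaccard = len(common) / len(all_watched)
print(common, all_watched, jaccard)   # ... 0.5 (50% similar)

# Collaborative-filtering idea: recommend to a similar user the movies
# they have not watched yet
user3 = {"Titanic", "Avengers"}
print(all_watched - user3)    # {"Inception", "The Matrix"}
```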
Q: Given a student dataset of test scores, calculate the measures of
centre (mean, median, mode), variance, standard deviation, Z-score,
correlation. Interpret the Empirical Rule for this dataset.
Ans:

Statistical Analysis of a Simple Student Dataset:


Dataset Overview: We have a simple dataset of students with the
following attributes:
1. Student ID
2. Age (in years)
3. Marks (in percentage)
4. Study Hours (per week)
Mean Calculation:

Median Calculation:
Mode Calculation:

Standard Deviation Calculation:

Variance Calculation:
Coefficient of Variation (CV)Calculation:

Z-Scores Calculation:
Correlation Calculation:
Empirical Rule (68-95-99.7 Rule): for roughly bell-shaped (normal) data, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
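
The original worked numbers are not reproduced here, but the following hedged Python sketch (using a small made-up dataset in place of the original one) shows how each of these measures would be computed:

```python
import pandas as pd

# Made-up stand-in for the student dataset (ID, age, marks %, study hours/week)
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "age": [18, 19, 18, 20, 19],
    "marks": [65, 70, 70, 85, 90],
    "study_hours": [5, 7, 6, 10, 12],
})

marks = df["marks"]
mean = marks.mean()
std = marks.std(ddof=1)                        # sample standard deviation

print("mean:", mean)
print("median:", marks.median())
print("mode:", marks.mode().tolist())
print("variance:", marks.var(ddof=1))          # sample variance
print("std dev:", std)
print("coefficient of variation:", std / mean)
print("z-scores:", ((marks - mean) / std).round(2).tolist())
print("correlation (marks vs study hours):", marks.corr(df["study_hours"]))

# Empirical Rule check: share of values within 1, 2, 3 standard deviations
for k in (1, 2, 3):
    within = ((marks - mean).abs() <= k * std).mean()
    print(f"within {k} sd: {within:.0%}")
```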

------------------------------------------------------------------------------------------------------------
Q:Explain the sampling methods (random, unequal probability) with example.

Why Use Samples?

• Sometimes it’s impossible or too expensive to measure an entire population. So, we take a sample and use statistics to make estimates about the whole population. For example, you can’t ask every teenager in the world if they drink alcohol, but you can ask a sample of them and estimate the rate.
The Problem with Biased Sampling
• Suppose you are splitting users between two website designs (A and B) for a test. You could group them by location (option 1) or by time of visit (option 2). These first two options seem okay, but they can introduce bias. Bias means that the way we pick our sample might favour one outcome over another, leading to inaccurate results.
• Example of Bias:
• If you group people by location (option 1), it might turn out that people on the west coast don’t like the design of Website A as much as people on the east coast. Now, your results are skewed because location influences the outcome.
• If you group people based on time of visit (option 2), maybe people at night prefer a different website style compared to daytime users. Again, the results could be biased.
Confounding Factors
• In both of these cases, there's something called a confounding
factor. A confounding factor is a variable that affects the
outcome but is not being measured directly.
Example: When grouping users based on location, a confounding
factor could be regional preferences. Maybe people on the west
coast prefer different design aesthetics than people on the east
coast, and that difference impacts the test results.
The Solution: Random Sampling
• To avoid bias and confounding factors, the best approach is
random sampling. In random sampling, everyone has an equal
chance of being selected for the sample, which means there’s no
influence from factors like location or time of visit.
• Example of Random Sampling:
• Imagine you have a hat with all your users’ names. You randomly pull out names to assign people to groups A and B. Every user has the same chance of being selected, so no bias is introduced.
Unequal Probability Sampling
• But what if we want to measure a group where some members are underrepresented? For example, let’s say you want to measure the happiness of your employees, and your company is made up of 75% men and 25% women.
• If you use random sampling, your sample will likely reflect that split: mostly men and fewer women. This could introduce bias because men’s opinions will dominate the results, leaving women’s voices underrepresented.
• To fix this, we use unequal probability sampling, where we purposely include more women in the sample to make sure their opinions are heard.
• Example of Unequal Sampling:
• If your company is 75% men and 25% women, instead of selecting employees randomly, you might choose 50% men and 50% women to balance the opinions and avoid bias.
Why Unequal Sampling is Okay
At first, unequal sampling might seem unfair, but it’s used to ensure that
minority groups aren’t drowned out in the data. This helps us get a more
accurate picture of the entire population and ensures that everyone’s voice
is represented.
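
A hedged Python sketch of the two sampling schemes, using the standard-library random module on a made-up employee list (the 75%/25% split mirrors the example above):

```python
import random

random.seed(0)

# Made-up employee list: 75 men and 25 women, mirroring the 75%/25% example
employees = [("man", i) for i in range(75)] + [("woman", i) for i in range(25)]

# Simple random sampling: every employee has the same chance of selection,
# so a sample of 20 tends to mirror the 75/25 split
simple_sample = random.sample(employees, 20)
print(sum(1 for g, _ in simple_sample if g == "woman"), "women in the random sample")

# Unequal probability (stratified) sampling: deliberately take 10 from each
# group so the minority group is not drowned out
men = [e for e in employees if e[0] == "man"]
women = [e for e in employees if e[0] == "woman"]
balanced_sample = random.sample(men, 10) + random.sample(women, 10)
print(sum(1 for g, _ in balanced_sample if g == "woman"), "women in the balanced sample")
```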
----------------------------------------------------------------------------------------------------
Q: Define the below terms:
a) Population parameter
Ans: A numerical summary (such as a mean or proportion) that describes an entire population; it is usually unknown and must be estimated from a sample.
b) Sample mean statistic
Ans: The average calculated from a sample, used as a point estimate of the (unknown) population mean.
c) Observational study
Ans: A study in which the researcher only observes and records data without controlling or assigning any treatment.
d) Experimental study
Ans: A study in which the researcher deliberately assigns treatments (for example, randomly splitting users into groups A and B) and then measures the effect.
e) Empirical Rule
Ans: For roughly normal data, about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
---------------------------------------------------------------------------------------------------------------
1. Explain the point estimates and confidence intervals with
example (Employees long break, short break):
Ans:
Introduction to Confidence Intervals
What is a Confidence Interval?
• A confidence interval is a range of values that is likely to contain the true
population parameter (like the mean).
• It helps us estimate the population parameter based on sample data,
while accounting for uncertainty.
Key Idea:
• A point estimate (like sample mean) can be inaccurate. Confidence
intervals give us a range to be more certain about our estimate.
• Example: We can be 95% confident that the true average break time of employees falls between 39.02 and 40.98 minutes (see the calculation below).
Components of Confidence Intervals
To calculate a confidence interval, you need:
1. Point Estimate: In our case, the sample mean.
o E.g., the average break time from a sample of employees.
2. Margin of Error: Reflects how certain we are that the sample mean is close
to the population mean.
o The margin of error increases with more variability or smaller
samples.
3. Confidence Level: The probability that the interval contains the true
population parameter.
o Common levels: 90%, 95%, 99%. (e.g., a 95% confidence level
means we're 95% sure the interval contains the true mean.)
Calculating Confidence Intervals (Example)
Steps to Calculate Confidence Interval (for Mean):
1. Sample Mean: Calculate the average of a sample.
o E.g., Average break time = 40 minutes.
2. Sample Standard Deviation: Measure the spread or variability of data in
the sample.
o E.g., Sample Standard Deviation = 5 minutes.
3. Standard Error of the Mean:
o This is the standard deviation divided by the square root of the sample size (SE = s / √n).
4. Confidence Interval:
o Point estimate ± (critical value × standard error); for a 95% level the critical z-value is about 1.96.

Confidence Interval Output Example


• Example Calculation:
o Sample Mean = 40, Standard Deviation = 5, Sample Size = 100
o Confidence Level = 95%
o Standard Error = 5 / √100 = 0.5; Margin of Error = 1.96 × 0.5 ≈ 0.98
o Calculated Confidence Interval: (39.02, 40.98)
Interpretation:
• We are 95% confident that the true mean break time for all employees is between 39.02 and 40.98 minutes.
The Importance of Confidence Levels
• Higher Confidence Level = Wider Interval:
o The more confident we want to be, the larger the range must be.
o Example: To be 99% confident, the interval is wider than for 95%
confidence.
Example of Confidence Intervals (same sample as above):
• 50% Confidence: (39.66, 40.34) [Smaller range]
• 95% Confidence: (39.02, 40.98) [Larger range]
• 99% Confidence: (38.71, 41.29) [Even larger]
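
A hedged Python sketch of the calculation above using the z-based formula (scipy is assumed to be available for the critical values):

```python
import math
from scipy import stats

sample_mean = 40    # average break time (minutes)
sample_sd = 5       # sample standard deviation
n = 100             # sample size

standard_error = sample_sd / math.sqrt(n)

for level in (0.50, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - level) / 2)   # two-sided critical value
    margin = z * standard_error
    print(f"{level:.0%} CI: ({sample_mean - margin:.2f}, {sample_mean + margin:.2f})")
```

Wider intervals at higher confidence levels fall straight out of the larger critical value.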
--------------------------------------------------------------------------------------------------------

2. Define below terms


A) Central Limit Theorem:
Ans: The central limit theorem states that the sampling distribution (the distribution of point estimates, such as sample means) will approach a normal distribution as we increase the number of samples taken.
• What's more, as we take more and more samples, the mean of the sampling distribution will approach the true population mean.
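
A brief simulation sketch of this idea (the exponential population is an arbitrary choice, used only because it is clearly non-normal):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # clearly non-normal population

# Draw many samples and keep each sample's mean (the point estimates)
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]

print("population mean:     ", population.mean())
print("mean of sample means:", np.mean(sample_means))  # close to the population mean
# A histogram of sample_means would look approximately bell-shaped (normal)
```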
Define a hypothesis test (null and alternative hypotheses).
List the 5 steps used to decide whether to reject or fail to reject a hypothesis.
Ans: Hypothesis testing is a critical concept in data science, used to make
decisions or inferences about population parameters based on sample data. It
helps us determine whether there is enough evidence to support a specific claim
or belief. There are two key hypotheses in this process: the null hypothesis (H₀)
and the alternative hypothesis (H₁ or Ha).
Key Terms:
1. Null Hypothesis (H₀):
o It represents the default assumption or status quo.
o The null hypothesis suggests that there is no effect, no difference,
or no relationship between variables.
o We either reject or fail to reject H₀ based on the data.
2. Alternative Hypothesis (H₁ or Ha):
o This is what you want to test or prove.
o The alternative hypothesis suggests that there is an effect, a
difference, or a relationship between variables.
3. Significance Level (α):
o It’s the probability threshold below which we reject the null
hypothesis.
o Common levels are 0.05 (5%) or 0.01 (1%).
4. P-Value:
o It tells us the probability of observing the data if the null hypothesis
were true.
o If the p-value is less than the significance level (α), we reject the null
hypothesis.
General Steps in Hypothesis Testing:
1. State the Hypotheses (H₀ and H₁):
o Clearly define the null and alternative hypotheses.
2. Collect and Analyze Data:
o Gather relevant sample data to test the hypothesis.
3. Perform the Test:
o Use the appropriate statistical test (Z-test, t-test, Chi-square test,
etc.) to calculate the p-value.
4. Make a Decision:
o Compare the p-value with the chosen significance level (α).
o If p-value < α: Reject the null hypothesis (evidence supports the
alternative hypothesis).
o If p-value ≥ α: Fail to reject the null hypothesis (not enough
evidence to support the alternative hypothesis).
5. Conclude:
o Based on the results, conclude whether the data supports the claim
in the alternative hypothesis.
Three types of hypothesis tests
• One-sample t-tests
• Chi-square goodness of fit
• Chi-square test for association/independence
Q: Define type-1 and type-2 errors.
Ans:
• Type I Error: Rejecting the null hypothesis when it is actually true (a false positive).
• Type II Error: Failing to reject the null hypothesis when it is actually false (a false negative).
3. Explain the one-sample t-test with example.
Ans:

Assumptions of the One-Sample T-Test

• Sample Size: The sample size should ideally be n ≥ 30 (so the Central Limit Theorem applies).
o In the earlier engineering-break example, the sample of 400 (n = 400) is more than sufficient.
• Independence: The sample should be randomly selected.
o The engineering break data is randomly selected and independent from the population.


Real-Time Example: Testing Mean Sleep Duration
Scenario:
• A fitness tracker company claims that the average sleep duration

of adults is 7 hours per night.


• You collect data from 25 adults using a new fitness tracker to see

if their mean sleep duration is different from 7 hours.


Sample Data:
• Sample mean = 6.5 hours.

• Population mean = 7 hours.

• Sample standard deviation = 0.8 hours.

• Sample size n=25

Steps to Perform the One-Sample T-Test

Step 1: State the Hypotheses
• H₀: μ = 7 hours (the mean sleep duration is 7 hours).
• H₁: μ ≠ 7 hours (the mean sleep duration is different from 7 hours).

Step 2: Choose the Significance Level
• α = 0.05, two-tailed test.

Step 3: Calculation of t-value
• t = (sample mean − population mean) / (s / √n) = (6.5 − 7) / (0.8 / √25) = −0.5 / 0.16 = −3.125

Step 4: Find the Critical Value


• For a two-tailed test at α = 0.05, we look up the critical t-

value in the t-distribution table for df = n - 1 = 24.


• Critical t-value ≈ ± 2.064.

Step 5: Compare t-value and p-value


• The calculated t-value (-3.125) is more extreme than the

critical value (±2.064).


• p-value corresponding to the t-value is 0.0049 (calculated

or found using statistical software).


Step 6: Make a Decision
• p-value (0.0049) < α (0.05): Reject the null hypothesis.

• There is strong evidence that the mean sleep duration is

different from 7 hours.
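
A hedged Python sketch of the same t-test computed from the summary statistics above (scipy is assumed to be available for the t-distribution):

```python
import math
from scipy import stats

sample_mean = 6.5   # observed mean sleep duration (hours)
pop_mean = 7.0      # claimed population mean
sample_sd = 0.8
n = 25
df = n - 1

t_stat = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))
p_value = 2 * stats.t.cdf(-abs(t_stat), df)      # two-tailed p-value
t_critical = stats.t.ppf(1 - 0.05 / 2, df)       # critical value at alpha = 0.05

print(f"t = {t_stat:.3f}, critical = ±{t_critical:.3f}, p = {p_value:.4f}")
# t = -3.125 is more extreme than ±2.064 and p < 0.05, so reject H0
```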



Q: Explain the Chi-Square test for Goodness of Fit with an example.
Ans: Chi-Squared Goodness of Fit Test

What is a Chi-Squared Goodness of Fit Test?


• The Chi-Squared Goodness of Fit Test is a statistical test used to
determine if a sample data set fits a population with a specific distribution.
• It checks if the observed data matches the expected distribution (i.e.,
whether the sample follows a theoretical distribution like uniform,
normal, etc.).
When to Use a Chi-Squared Goodness of Fit Test?
• Use when you want to test if the observed frequencies of categories
match expected frequencies.
• Data should be categorical (divided into distinct categories).
Examples:
• Testing if a die is fair (each face has an equal chance).
• Checking if a population follows a specific genetic distribution.

Key Assumptions of the Chi-Squared Test


1. Random Sampling: Data should be randomly collected.
2. Expected Frequency: Expected frequency in each category should be at
least 5 for the test to be valid.
3. Independence: Observations must be independent of each other.

Hypotheses for Chi-Squared Goodness of Fit:
• H₀: The observed frequencies match the expected distribution.
• H₁: The observed frequencies do not match the expected distribution.

Real-Time Example: Testing a Fair Die


Scenario:
• A die is rolled 60 times, and the following frequencies are observed for
each face of the die.
Observed Data:
• Face 1: 8 times
• Face 2: 10 times
• Face 3: 12 times
• Face 4: 14 times
• Face 5: 8 times
• Face 6: 8 times
We want to test if the die is fair, meaning that each face has an equal probability
of occurring (expected to appear 10 times each).
Step 1: Define the Hypotheses
• H₀: The die is fair (each face has probability 1/6).
• H₁: The die is not fair.

Step 2: Calculate Expected Frequencies
• With 60 rolls and 6 equally likely faces, each face is expected 60 × (1/6) = 10 times.

Step 3: Calculate the Test Statistic (Chi-Squared Value)
• χ² = Σ (Observed − Expected)² / Expected
• χ² = (8−10)²/10 + (10−10)²/10 + (12−10)²/10 + (14−10)²/10 + (8−10)²/10 + (8−10)²/10
• χ² = (4 + 0 + 4 + 16 + 4 + 4) / 10 = 3.2

Step 4: Find the Critical Value
• Degrees of freedom = number of categories − 1 = 6 − 1 = 5.
• At α = 0.05 and df = 5, the critical χ² value is approximately 11.07.

Step 5: Compare Test Statistic with Critical Value
• The test statistic 3.2 is less than the critical value 11.07, so we fail to reject the null hypothesis.

Conclusion
• Based on the results of the chi-squared goodness of fit test, we conclude
that there is no significant evidence to suggest the die is unfair.
• The die's outcomes are consistent with a fair die's expected distribution.
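
A hedged Python sketch of the same test using scipy, with the observed counts from the example:

```python
from scipy import stats

observed = [8, 10, 12, 14, 8, 8]   # counts for faces 1-6 over 60 rolls
expected = [10] * 6                # a fair die: 60 * (1/6) = 10 per face

chi2, p_value = stats.chisquare(observed, f_exp=expected)
critical = stats.chi2.ppf(1 - 0.05, df=len(observed) - 1)

print(f"chi2 = {chi2:.2f}, critical = {critical:.2f}, p = {p_value:.3f}")
# chi2 = 3.20 is below the critical value 11.07 and p > 0.05, so there is
# no significant evidence that the die is unfair
```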
