IDS_Crispy_Notes
________________________________________________________
2. Explain structured vs. unstructured data with an example. Explain the Venn
diagram of data science. What are the domains involved in data science?
Ans: Structured vs. Unstructured Data
insights.
---------------------------------------------------------------------------------------------------
3. A) Define data science. What does it do (or why data science)?
• Ans: Data Science Definition:
• Ans: Definition:
A vector is a list of numbers that represents both
magnitude (size) and direction. However, for simplicity, think of it
as a 1-dimensional array.
have three offices in different locations, each with the same three
departments: HR, engineering, and management.
C) Proportional:
Scenario: When oil availability decreases, gas prices increase.
D) Graph:
E) Logarithms/Exponents:
--------------------------------------------------------------------------------------------------------------------
2. Explain the Dot product, set theory (Jaccard index), matrix
multiplication concept with the movie recommendation
example.
Set Theory in Data Science:
Set Theory is a branch of mathematical logic that deals with sets, which
are collections of distinct objects.
1. Set:
o A set is a collection of distinct objects.
Step-by-Step Breakdown:
1. Users as Sets:
In this case, the sets contain the movies each user has watched.
You can use the intersection of sets to find out which movies both
users have watched. The intersection of two sets is the set of movies
that both users have in common:
User1 ∩ User2={"Avengers","Titanic"}
So, both User1 and User2 have watched "Avengers" and "Titanic".
The union of the two sets is:
User1 ∪ User2 = {"Inception", "Titanic", "Avengers", "The Matrix"}
This tells you the complete list of movies watched by either of the
two users.
In data science, we often want to know how similar two users are.
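One way to measure similarity is the Jaccard similarity (Jaccard index). It is
the size of the intersection divided by the size of the union:
J(User1, User2) = |User1 ∩ User2| / |User1 ∪ User2|
In this case:
J(User1, User2) = 2 / 4 = 0.5
This means User1 and User2 are 50% similar based on their movie-watching
habits.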
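The set operations and the Jaccard calculation above can be sketched in Python. The notes give only the intersection and the union, so the individual watch histories below are an assumption chosen to match them, and jaccard_similarity is a hypothetical helper name.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Assumed watch histories (hypothetical), consistent with the
# intersection and union stated in the notes.
user1 = {"Inception", "Titanic", "Avengers"}
user2 = {"Titanic", "Avengers", "The Matrix"}

common = user1 & user2        # movies both users have watched
all_movies = user1 | user2    # movies watched by either user
similarity = jaccard_similarity(user1, user2)

print(sorted(common))
print(sorted(all_movies))
print(similarity)  # 0.5
```

Python's built-in set type provides & for intersection and | for union, so no extra library is needed.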
If a new user (let's call them User3) has watched "Titanic" and
"Avengers", you can use set theory to find similar users (like User1
and User2) and recommend movies they haven't watched yet. For
example:
• User3 hasn't watched "Inception" or "The Matrix".
• Since User1 and User2 have both watched the same movies as User3,
User3 is likely to enjoy "Inception" or "The Matrix."
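A minimal sketch of this recommendation step is shown below. The watch histories for User1 and User2 are assumed (the notes list only their intersection and union), and recommend is a hypothetical helper, not a standard library function. It suggests every movie that an overlapping user has watched and the target user has not.

```python
def recommend(target: set, others: list[set]) -> list[str]:
    """Suggest movies seen by overlapping users but not by the target user."""
    suggestions = set()
    for watched in others:
        if target & watched:                 # shares at least one movie
            suggestions |= watched - target  # keep only unseen movies
    return sorted(suggestions)


# Assumed watch histories (hypothetical) for the existing users.
user1 = {"Inception", "Titanic", "Avengers"}
user2 = {"Titanic", "Avengers", "The Matrix"}
user3 = {"Titanic", "Avengers"}  # the new user from the notes

recommendations = recommend(user3, [user1, user2])
print(recommendations)  # ['Inception', 'The Matrix']
```

A fuller version might rank the other users by Jaccard similarity first and recommend only from the closest ones; this sketch treats any overlap as "similar".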
Median Calculation:
Mode Calculation:
Variance Calculation:
Coefficient of Variation (CV) Calculation:
Z-Scores Calculation:
Correlation Calculation:
Empirical Rule (68-95-99.7 Rule):
------------------------------------------------------------------------------------------------------------
Q: Explain the sampling methods (random, unequal probability) with examples.
Use Samples?
c) Observational study
d) Experimental study
Ans:
e) Empirical Rule:
---------------------------------------------------------------------------------------------------------------
1. Explain point estimates and confidence intervals with an
example (Employees long break, short break):
Ans:
Introduction to Confidence Intervals
What is a Confidence Interval?
• A confidence interval is a range of values that is likely to contain the true
population parameter (like the mean).
• It helps us estimate the population parameter based on sample data,
while accounting for uncertainty.
Key Idea:
• A point estimate (like the sample mean) is a single value and will rarely
equal the population parameter exactly. A confidence interval gives a
range around it that reflects this uncertainty.
• Example: We can be 95% confident that the true average break time of
employees falls between 36.36 and 45.44 minutes.
Components of Confidence Intervals
To calculate a confidence interval, you need:
1. Point Estimate: In our case, the sample mean.
o E.g., the average break time from a sample of employees.
2. Margin of Error: The amount added to and subtracted from the point
estimate to form the interval; it reflects how close the sample mean is
expected to be to the population mean.
o The margin of error increases with more variability or smaller
samples.
3. Confidence Level: The proportion of intervals, built this way from
repeated samples, that would contain the true population parameter.
o Common levels: 90%, 95%, 99%. (e.g., at a 95% confidence level,
about 95 out of 100 such intervals would contain the true mean.)
Calculating Confidence Intervals (Example)
Steps to Calculate Confidence Interval (for Mean):
1. Sample Mean: Calculate the average of a sample.
o E.g., Average break time = 40 minutes.
2. Sample Standard Deviation: Measure the spread or variability of data in
the sample.
o E.g., Sample Standard Deviation = 5 minutes.
3. Standard Error of the Mean:
o This is the standard deviation divided by the square root of the
sample size.
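The steps so far can be sketched in Python as follows. The sample mean (40 minutes) and standard deviation (5 minutes) come from the example above; the sample size n = 100 is an assumption made only for illustration, as is the use of a normal (z-based) critical value rather than a t-based one. The function name mean_confidence_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation (z-based) confidence interval for a mean."""
    standard_error = sample_sd / sqrt(n)       # step 3: SD / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z * standard_error                # margin of error
    return sample_mean - margin, sample_mean + margin


# Mean and SD from the example; n = 100 is an assumed sample size.
low, high = mean_confidence_interval(40, 5, n=100)
print(f"{low:.2f} to {high:.2f}")
```

With these assumed inputs the standard error is 0.5 minutes, so the interval is roughly 40 ± 0.98 minutes. A smaller sample or a larger standard deviation would widen it.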
• Type II Error: Failing to reject the null hypothesis when it is actually false.
3. Explain the one-sample t-test with an example.
Ans:
Theorem).
o We have a sample of 400 (n = 400), which is sufficient.
Conclusion
• Based on the results of the chi-squared goodness of fit test, we conclude
that there is no significant evidence to suggest the die is unfair.
• The die's outcomes are consistent with a fair die's expected distribution.
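To illustrate how a conclusion like this is reached, here is a minimal sketch of the chi-squared goodness of fit calculation for a six-sided die. The observed counts are hypothetical (they are not the data behind the conclusion above), and chi_squared_statistic is an illustrative helper name. The critical value 11.070 is the standard chi-squared table value for 5 degrees of freedom at a 0.05 significance level.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for faces 1 to 6 over 60 rolls (illustration only).
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / 6] * 6  # a fair die: equal counts per face

statistic = chi_squared_statistic(observed, expected)

# Critical value for df = 6 - 1 = 5 at alpha = 0.05 (from a chi-squared table).
CRITICAL_VALUE = 11.070
reject_fairness = statistic > CRITICAL_VALUE

print(statistic, reject_fairness)
```

Because the statistic for these hypothetical counts falls well below the critical value, the null hypothesis of a fair die is not rejected, which mirrors the wording of the conclusion.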