Module 1
Applications
What is Data Science?
• Data Science is the process of extracting knowledge from data in
various forms.
• It is the process of using data to find solutions to, or to predict outcomes for, a
problem statement.
• It involves data cleaning, integration, visualization, and statistical
analysis of data sets to uncover patterns and trends.
• “Data science, also known as data-driven science, is an interdisciplinary
field of scientific methods, processes, algorithms, and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured, similar to data mining.”
• 1. Descriptive analysis
• 2. Diagnostic analysis
• 3. Predictive analysis
• 4. Prescriptive analysis
1. Descriptive analysis
• It summarizes historical data to describe what has happened.
4. Prescriptive analysis
• It not only predicts what is likely to happen but also suggests an optimum
response to that outcome.
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
And then we populate the lists using the friendships data (here we assume a dict mapping each user id to a list of friend ids):
friends = {user_id: [] for user_id in range(10)}   # one empty friend list per user id
for i, j in friendships:
    friends[i].append(j)   # add j as a friend of user i
    friends[j].append(i)   # add i as a friend of user j
• Common data visualizations include line charts, bar charts, pie charts, and scatter
plots.
Types of data visualizations
• Matplotlib
• Bar charts
• Line charts
• Scatter plots
Matplotlib:
• This library lets you draw appealing and informative graphics such as
line plots, scatter plots, histograms, and bar charts.
• With our dataset, a line chart could be used to show the trend of
layoffs over the past year or two.
• The choice depends on what you are trying to communicate, but we'll work
with a one-year analysis.
Line Chart
import matplotlib.pyplot as plt

variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

# multiple calls to plt.plot show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

# because we've assigned labels to each series, we get a legend (loc=9 is "top center")
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatter plots
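A scatter plot is the right choice for visualizing the relationship between two paired sets of values. A minimal sketch with made-up data (the friend counts, minutes, and labels below are illustrative, not from a real dataset):

import matplotlib.pyplot as plt

# hypothetical example data: friend counts vs. minutes spent on a site
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point with its user label
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label at the point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()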
Vectors
• In data science, vectors are ordered sets of numbers that represent
quantities with direction, often used to describe features or data points.
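In Python, the simplest representation of a vector is a list of numbers; a minimal sketch, assuming we define our own add helper (the names here are illustrative):

from typing import List

Vector = List[float]

height_weight_age = [70,   # inches
                     170,  # pounds
                     40]   # years

def add(v: Vector, w: Vector) -> Vector:
    # adds corresponding elements of two vectors of the same length
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]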
Matrices
• Matrices are rectangular arrays of numbers, where every row can
represent an observation and each column a feature.
• A matrix can be represented as a list of lists (a list of row vectors),
with each inner list having the same size and representing a row of the
matrix.
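A minimal sketch of this list-of-lists representation (the shape helper is illustrative):

from typing import List, Tuple

Matrix = List[List[float]]

A = [[1, 2, 3],   # A has 2 rows and 3 columns
     [4, 5, 6]]

def shape(M: Matrix) -> Tuple[int, int]:
    # returns (# of rows, # of columns) of M
    num_rows = len(M)
    num_cols = len(M[0]) if M else 0
    return num_rows, num_cols

assert shape(A) == (2, 3)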
Mean: The average of all the numbers in the data set. It is calculated by adding all
the values and dividing by the number of values.
Median: The middle number when the data is arranged in order from least to
greatest. If there are two middle numbers, the median is the mean of those two
numbers.
Mode: The most frequent value in the data set. A data set can have one mode,
multiple modes (bimodal or multimodal), or no mode.
from typing import List

def _median_odd(xs: List[float]) -> float:
    return sorted(xs)[len(xs) // 2]   # middle element of the sorted list (odd length)
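For completeness, hedged sketches of the remaining central-tendency measures defined above (the helper names are illustrative; _median_even pairs with _median_odd):

from collections import Counter
from typing import List

def mean(xs: List[float]) -> float:
    # sum of the values divided by how many there are
    return sum(xs) / len(xs)

def _median_even(xs: List[float]) -> float:
    # when len(xs) is even, the median is the average of the two middle elements
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

def median(xs: List[float]) -> float:
    return _median_even(xs) if len(xs) % 2 == 0 else _median_odd(xs)

def mode(xs: List[float]) -> List[float]:
    # returns a list, since a data set may have more than one mode
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]

assert mean([1, 2, 3, 4]) == 2.5
assert median([1, 9, 2, 10]) == (2 + 9) / 2
assert set(mode([1, 2, 2, 3, 3])) == {2, 3}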
• Typically, measures of dispersion are statistics for which values near zero signify data that is not
spread out at all and for which large values (whatever that means)
signify very spread-out data.
• For instance, a very simple measure is the range, which is just the
difference between the largest and smallest elements.
from typing import List

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

assert data_range(num_friends) == 99   # num_friends: the friend-count dataset used in these notes
• The range is zero precisely when the max and min are equal, which
can only happen if the elements of x are all the same.
• Conversely, if the range is large, then the max is much larger than the min and the
data is more spread out.
Measures of dispersion: These statistics describe how spread out the
data is from the center. There are two main types:
Range: The difference between the largest and smallest values in the
data set.
Variance: A measure of how spread out the data is from the mean. It is
calculated by finding the average of the squared deviations from the
mean.
• The variance, on the other hand, has units that are the square of the
original units, which is why its square root, the standard deviation, is often reported instead.
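A hedged sketch of how variance and the standard deviation (which brings the units back to the original scale) could be computed from these definitions:

import math
from typing import List

def de_mean(xs: List[float]) -> List[float]:
    # translate xs by subtracting its mean, so the result has mean 0
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    # average squared deviation from the mean (dividing by n - 1: the sample variance)
    assert len(xs) >= 2, "variance requires at least two elements"
    deviations = de_mean(xs)
    return sum(d ** 2 for d in deviations) / (len(xs) - 1)

def standard_deviation(xs: List[float]) -> float:
    # the standard deviation is the square root of the variance
    return math.sqrt(variance(xs))

assert variance([1, 2, 3, 4, 5]) == 2.5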
• Covariance describes the extent to which two variables change together, but it's
important to remember that it doesn't necessarily imply cause and
effect.
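A hedged sketch of covariance (and correlation, which rescales covariance to lie between -1 and 1), reusing the de_mean and standard_deviation helpers sketched above:

from typing import List

def dot(v: List[float], w: List[float]) -> float:
    # sum of the componentwise products
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def covariance(xs: List[float], ys: List[float]) -> float:
    # how much xs and ys vary in tandem around their respective means
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    # covariance rescaled by both standard deviations; always between -1 and 1
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0  # if either variable has no variation, correlation is taken to be zero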
Scatter Plots: These are visual representations of the data where each
point shows the values of two variables for a single observation. The
pattern of the points can reveal the direction and strength of the
correlation.
Important Cautions: Correlation does not imply causation; establishing a causal relationship requires additional evidence.
Example: Studies have shown that smoking (cause) leads to lung cancer (effect).
This is a causal relationship supported by significant scientific evidence.
Probability
• Dependence and Independence
• Conditional Probability
• Bayes’s Theorem
• Random Variables
• Continuous Distributions
• The Normal Distribution
• The Central Limit Theorem
Probability
• Probability is an estimation of how likely a certain event or outcome will
occur. It is typically expressed as a number between 0 and 1, reflecting the
likelihood that an event or outcome will take place.
• For example, “the die rolls a 1” or “the die rolls an even number”
(e.g., 4).
Dependence and Independence
Independent Events: events that are not affected by the occurrence of other events.
The formula for independent events is P(A and B) = P(A) × P(B), also written P(E, F) = P(E)P(F).
Dependent Events: events that are affected by the occurrence of other events.
The formula for dependent events is P(B and A) = P(A) × P(B after A).
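As a quick sanity check of the independence formula, a hedged simulation sketch comparing P(A and B) with P(A) × P(B) for two fair coin flips (all names and numbers are illustrative):

import random

random.seed(0)
trials = 100_000

first_heads = second_heads = both_heads = 0
for _ in range(trials):
    first = random.random() < 0.5    # event A: first flip is heads
    second = random.random() < 0.5   # event B: second flip is heads
    first_heads += first
    second_heads += second
    both_heads += first and second

p_a = first_heads / trials
p_b = second_heads / trials
p_both = both_heads / trials

# for independent events, P(A and B) should be close to P(A) * P(B), both near 0.25
print(p_both, p_a * p_b)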
Understanding Dependence and Independence Matters:
It's crucial for data analysis, where you need to assess if variables are
influencing each other.
• If the probabilities of events A and B are P(A) and P(B) respectively, then the
conditional probability of B given that A has already occurred is denoted
P(B|A) and defined as P(B|A) = P(A and B) / P(A).
• If P(A) = 0, then A is an impossible event, and in this case P(B|A) is not
defined.
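A small hedged sketch of this definition, using a made-up dice example: the probability that a fair die shows a 6 (event B) given that it shows an even number (event A):

outcomes = [1, 2, 3, 4, 5, 6]                 # sample space of a fair die

p_a = sum(1 for o in outcomes if o % 2 == 0) / len(outcomes)                    # P(A) = 1/2
p_a_and_b = sum(1 for o in outcomes if o % 2 == 0 and o == 6) / len(outcomes)   # P(A and B) = 1/6

p_b_given_a = p_a_and_b / p_a                 # P(B|A) = P(A and B) / P(A) = 1/3
assert abs(p_b_given_a - 1 / 3) < 1e-12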
Bayes’s Theorem
• The Bayes Theorem is used to describe the probability of an event
based on the prior knowledge of the other related conditions or events.
The event F can be split into the two mutually exclusive events “F and E” and “F
and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:
P(F) = P(F,E) + P(F,¬E)
Since P(E|F) = P(E,F)/P(F) = P(F|E)P(E)/P(F), it follows that:
P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]
Bayes' theorem
• Bayes' theorem is a fundamental concept in probability and statistics, and it plays a crucial role in
data science tasks involving conditional probabilities and updating beliefs based on new evidence.
• It's a formula that allows you to calculate the posterior probability (the probability of an event A
occurring given that you already know event B has happened) based on the following:
Prior probability: The initial probability of event A happening before considering any new
evidence (represented by P(A)).
Likelihood: The probability of observing event B given that event A has already occurred
(represented by P(B | A)).
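A hedged numeric sketch of Bayes' theorem on a made-up disease-testing scenario (the numbers are illustrative, not from the text): the disease affects 1 in 10,000 people and the test has a 1% false-positive rate.

p_disease = 1 / 10_000         # prior P(D): prevalence of the disease
p_pos_given_disease = 0.99     # likelihood P(T | D): probability of a positive test if diseased
p_pos_given_healthy = 0.01     # P(T | not D): false-positive rate

# Bayes' theorem: P(D | T) = P(T | D) P(D) / [P(T | D) P(D) + P(T | not D) P(not D)]
p_disease_given_pos = (p_pos_given_disease * p_disease) / (
    p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease))

print(p_disease_given_pos)     # about 0.0098: fewer than 1% of positive tests indicate the disease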
• The expected value of a random variable is the average of its values
weighted by their probabilities.
• For example, a coin-flip variable (equal to 0 or 1 with probability 1/2 each) has an expected value of 1/2 (= 0 × 1/2 + 1 ×
1/2), and a variable chosen uniformly from range(10) has an expected value of 4.5.
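A minimal sketch of that calculation (the expected_value helper is illustrative):

from typing import Dict

def expected_value(pmf: Dict[float, float]) -> float:
    # average of the outcomes weighted by their probabilities
    return sum(outcome * probability for outcome, probability in pmf.items())

assert expected_value({0: 0.5, 1: 0.5}) == 0.5                            # coin flip paying 0 or 1
assert abs(expected_value({k: 1 / 10 for k in range(10)}) - 4.5) < 1e-9   # uniform over range(10)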
• For example, the uniform distribution puts equal weight on all the numbers
between 0 and 1.
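A hedged sketch of that uniform distribution's density and cumulative distribution function:

def uniform_pdf(x: float) -> float:
    # equal density everywhere on [0, 1), zero elsewhere
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x: float) -> float:
    # probability that a uniform random variable is <= x
    if x < 0:
        return 0      # it is never below 0
    elif x < 1:
        return x      # e.g. P(X <= 0.4) = 0.4
    else:
        return 1      # it is always below 1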
Continuous Distribution…
• The mean indicates where the bell is centered, and the standard
deviation how “wide” it is.
The Normal Distribution…
• When μ = 0 and σ = 1, it’s called the standard normal distribution.
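A hedged sketch of the normal density, following the standard formula with mean mu and standard deviation sigma:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    # density of the normal distribution with mean mu and standard deviation sigma
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

# the standard normal (mu = 0, sigma = 1) peaks at x = 0 with height 1 / sqrt(2 pi)
assert abs(normal_pdf(0) - 1 / SQRT_TWO_PI) < 1e-12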
• The central limit theorem says (in essence) that a random variable
defined as the average of a large number of independent and
identically distributed random variables is itself approximately
normally distributed.
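A hedged simulation sketch of the central limit theorem in action: averages of many independent Uniform(0, 1) draws cluster tightly and roughly normally around 0.5 (the sample sizes are illustrative):

import random

random.seed(0)

def sample_mean(n: int) -> float:
    # average of n independent Uniform(0, 1) draws
    return sum(random.random() for _ in range(n)) / n

# each entry is the average of 100 draws; by the central limit theorem these averages
# are approximately normal with mean 0.5 and standard deviation sqrt(1/12) / sqrt(100)
means = [sample_mean(100) for _ in range(10_000)]

print(sum(means) / len(means))                               # close to 0.5
print(sum(abs(m - 0.5) < 0.06 for m in means) / len(means))  # nearly all averages within 0.06 of 0.5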