Module 1
Applications
What is Data Science?
• Data Science is the process of extracting knowledge from data in
various forms.
• It is the process of using data to find solutions to, or to predict outcomes for, a
problem statement.
• It involves data cleaning, integration, visualization, and statistical
analysis of data sets to uncover patterns and trends.
• “Data science, also known as data-driven science, is an interdisciplinary
field of scientific methods, processes, algorithms, and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured, similar to data mining.”
• 1. Descriptive analysis
• 2. Diagnostic analysis
• 3. Predictive analysis
• 4. Prescriptive analysis
1. Descriptive analysis
• It summarizes historical data to describe what has happened.
4. Prescriptive analysis
• It not only predicts what is likely to happen but also suggests an optimum
response to that outcome.
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
And then we populate the lists using the friendships data (here we assume a dict mapping each user id to a list of friend ids):
friends = {user_id: [] for user_id in range(10)}   # one empty friend list per user id
for i, j in friendships:
    friends[i].append(j)   # add j as a friend of user i
    friends[j].append(i)   # add i as a friend of user j
• Common data visualizations include line charts, bar charts, pie charts, and scatter
plots.
Types of data visualizations
• Matplotlib
• Bar charts
• Line charts
• Scatter plots
Matplotlib:
• This library lets you draw appealing and informative graphics such as
line plots, scatter plots, histograms, and bar charts.
• With our dataset, a line chart could be used to show the trend of
layoffs over the past year or two.
• The choice depends on what you are trying to communicate, but we'll work
with a one-year analysis.
Line Chart
import matplotlib.pyplot as plt

variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

# multiple calls to plt.plot show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

# because we've assigned labels to each series, we get a legend (loc=9 is "top center")
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatter plots
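A scatter plot is the right choice for visualizing the relationship between two paired sets of values. A minimal sketch with made-up data (the friend counts, minutes, and labels below are illustrative, not from a real dataset):

import matplotlib.pyplot as plt

# hypothetical example data: friend counts vs. minutes spent on a site
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point with its user label
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label at the point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()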
Vectors
• In data science, vectors are ordered sets of numbers that represent
quantities with direction, often used to describe features or data points.
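In Python, the simplest representation of a vector is a list of numbers; a minimal sketch, assuming we define our own add helper (the names here are illustrative):

from typing import List

Vector = List[float]

height_weight_age = [70,   # inches
                     170,  # pounds
                     40]   # years

def add(v: Vector, w: Vector) -> Vector:
    # adds corresponding elements of two vectors of the same length
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]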
Matrices
• Matrices are rectangular arrays of numbers, where every row can
represent an observation and each column a feature.
• A matrix can be represented as a list of lists (a list of row vectors),
with each inner list having the same size and representing a row of the
matrix.
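A minimal sketch of this list-of-lists representation (the shape helper is illustrative):

from typing import List, Tuple

Matrix = List[List[float]]

A = [[1, 2, 3],   # A has 2 rows and 3 columns
     [4, 5, 6]]

def shape(M: Matrix) -> Tuple[int, int]:
    # returns (# of rows, # of columns) of M
    num_rows = len(M)
    num_cols = len(M[0]) if M else 0
    return num_rows, num_cols

assert shape(A) == (2, 3)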
Mean: The average of all the numbers in the data set. It is calculated by adding all
the values and dividing by the number of values.
Median: The middle number when the data is arranged in order from least to
greatest. If there are two middle numbers, the median is the mean of those two
numbers.
Mode: The most frequent value in the data set. A data set can have one mode,
multiple modes (bimodal or multimodal), or no mode.
from typing import List

def _median_odd(xs: List[float]) -> float:
    return sorted(xs)[len(xs) // 2]   # middle element of the sorted list (odd length)
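For completeness, hedged sketches of the remaining central-tendency measures defined above (the helper names are illustrative; _median_even pairs with _median_odd):

from collections import Counter
from typing import List

def mean(xs: List[float]) -> float:
    # sum of the values divided by how many there are
    return sum(xs) / len(xs)

def _median_even(xs: List[float]) -> float:
    # when len(xs) is even, the median is the average of the two middle elements
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

def median(xs: List[float]) -> float:
    return _median_even(xs) if len(xs) % 2 == 0 else _median_odd(xs)

def mode(xs: List[float]) -> List[float]:
    # returns a list, since a data set may have more than one mode
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]

assert mean([1, 2, 3, 4]) == 2.5
assert median([1, 9, 2, 10]) == (2 + 9) / 2
assert set(mode([1, 2, 2, 3, 3])) == {2, 3}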
• Typically, measures of dispersion are statistics for which values near zero signify data that is not
spread out at all and for which large values (whatever that means)
signify very spread-out data.
• For instance, a very simple measure is the range, which is just the
difference between the largest and smallest elements.
from typing import List

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

assert data_range(num_friends) == 99   # num_friends: the friend-count dataset used in these notes
• The range is zero precisely when the max and min are equal, which
can only happen if the elements of x are all the same.
• Conversely, if the range is large, then the max is much larger than the min and the
data is more spread out.
Measures of dispersion: These statistics describe how spread out the
data is from the center. There are two main types:
Range: The difference between the largest and smallest values in the
data set.
Variance: A measure of how spread out the data is from the mean. It is
calculated by finding the average of the squared deviations from the
mean.
• The variance, on the other hand, has units that are the square of the
original units, which is why its square root, the standard deviation, is often reported instead.
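A hedged sketch of how variance and the standard deviation (which brings the units back to the original scale) could be computed from these definitions:

import math
from typing import List

def de_mean(xs: List[float]) -> List[float]:
    # translate xs by subtracting its mean, so the result has mean 0
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    # average squared deviation from the mean (dividing by n - 1: the sample variance)
    assert len(xs) >= 2, "variance requires at least two elements"
    deviations = de_mean(xs)
    return sum(d ** 2 for d in deviations) / (len(xs) - 1)

def standard_deviation(xs: List[float]) -> float:
    # the standard deviation is the square root of the variance
    return math.sqrt(variance(xs))

assert variance([1, 2, 3, 4, 5]) == 2.5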
• Covariance describes the extent to which two variables change together, but it's
important to remember that it doesn't necessarily imply cause and
effect.
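A hedged sketch of covariance (and correlation, which rescales covariance to lie between -1 and 1), reusing the de_mean and standard_deviation helpers sketched above:

from typing import List

def dot(v: List[float], w: List[float]) -> float:
    # sum of the componentwise products
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def covariance(xs: List[float], ys: List[float]) -> float:
    # how much xs and ys vary in tandem around their respective means
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    # covariance rescaled by both standard deviations; always between -1 and 1
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    return 0  # if either variable has no variation, correlation is taken to be zero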
Scatter Plots: These are visual representations of the data where each
point shows the values of two variables for a single observation. The
pattern of the points can reveal the direction and strength of the
correlation.
Important Cautions: Correlation does not imply causation; establishing a causal relationship requires additional evidence.
Example: Studies have shown that smoking (cause) leads to lung cancer (effect).
This is a causal relationship supported by significant scientific evidence.
Probability
• Dependence and Independence
• Conditional Probability
• Bayes’s Theorem
• Random Variables
• Continuous Distributions
• The Normal Distribution
• The Central Limit Theorem
Probability
• Probability is an estimation of how likely a certain event or outcome will
occur. It is typically expressed as a number between 0 and 1, reflecting the
likelihood that an event or outcome will take place.
• For example, “the die rolls a 1” or “the die rolls an even number”
(e.g., 4).
Dependence and Independence
Independent Events: events that are not affected by the occurrence of other events.
The formula for independent events is P(A and B) = P(A) × P(B), also written P(E, F) = P(E)P(F).
Dependent Events: events that are affected by the occurrence of other events.
The formula for dependent events is P(B and A) = P(A) × P(B after A).
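As a quick sanity check of the independence formula, a hedged simulation sketch comparing P(A and B) with P(A) × P(B) for two fair coin flips (all names and numbers are illustrative):

import random

random.seed(0)
trials = 100_000

first_heads = second_heads = both_heads = 0
for _ in range(trials):
    first = random.random() < 0.5    # event A: first flip is heads
    second = random.random() < 0.5   # event B: second flip is heads
    first_heads += first
    second_heads += second
    both_heads += first and second

p_a = first_heads / trials
p_b = second_heads / trials
p_both = both_heads / trials

# for independent events, P(A and B) should be close to P(A) * P(B), both near 0.25
print(p_both, p_a * p_b)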
Understanding Dependence and Independence Matters:
It's crucial for data analysis, where you need to assess if variables are
influencing each other.
• If the probabilities of events A and B are P(A) and P(B) respectively, then the
conditional probability of B given that A has already occurred is denoted
P(B|A) and defined as P(B|A) = P(A and B) / P(A).
• If P(A) = 0, then A is an impossible event, and in this case P(B|A) is not
defined.
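A small hedged sketch of this definition, using a made-up dice example: the probability that a fair die shows a 6 (event B) given that it shows an even number (event A):

outcomes = [1, 2, 3, 4, 5, 6]                 # sample space of a fair die

p_a = sum(1 for o in outcomes if o % 2 == 0) / len(outcomes)                    # P(A) = 1/2
p_a_and_b = sum(1 for o in outcomes if o % 2 == 0 and o == 6) / len(outcomes)   # P(A and B) = 1/6

p_b_given_a = p_a_and_b / p_a                 # P(B|A) = P(A and B) / P(A) = 1/3
assert abs(p_b_given_a - 1 / 3) < 1e-12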
Bayes’s Theorem
• The Bayes Theorem is used to describe the probability of an event
based on the prior knowledge of the other related conditions or events.
The event F can be split into the two mutually exclusive events “F and E” and “F
and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:
P(F) = P(F,E) + P(F,¬E)
Since P(E|F) = P(E,F)/P(F) = P(F|E)P(E)/P(F), it follows that:
P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]
Bayes' theorem
• Bayes' theorem is a fundamental concept in probability and statistics, and it plays a crucial role in
data science tasks involving conditional probabilities and updating beliefs based on new evidence.
• It's a formula that allows you to calculate the posterior probability (the probability of an event A
occurring given that you already know event B has happened) based on the following:
Prior probability: The initial probability of event A happening before considering any new
evidence (represented by P(A)).
Likelihood: The probability of observing event B given that event A has already occurred
(represented by P(B | A)).
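A hedged numeric sketch of Bayes' theorem on a made-up disease-testing scenario (the numbers are illustrative, not from the text): the disease affects 1 in 10,000 people and the test has a 1% false-positive rate.

p_disease = 1 / 10_000         # prior P(D): prevalence of the disease
p_pos_given_disease = 0.99     # likelihood P(T | D): probability of a positive test if diseased
p_pos_given_healthy = 0.01     # P(T | not D): false-positive rate

# Bayes' theorem: P(D | T) = P(T | D) P(D) / [P(T | D) P(D) + P(T | not D) P(not D)]
p_disease_given_pos = (p_pos_given_disease * p_disease) / (
    p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease))

print(p_disease_given_pos)     # about 0.0098: fewer than 1% of positive tests indicate the disease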
• The expected value of a random variable is the average of its values
weighted by their probabilities.
• For example, a coin-flip variable (equal to 0 or 1 with probability 1/2 each) has an expected value of 1/2 (= 0 × 1/2 + 1 ×
1/2), and a variable chosen uniformly from range(10) has an expected value of 4.5.
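A minimal sketch of that calculation (the expected_value helper is illustrative):

from typing import Dict

def expected_value(pmf: Dict[float, float]) -> float:
    # average of the outcomes weighted by their probabilities
    return sum(outcome * probability for outcome, probability in pmf.items())

assert expected_value({0: 0.5, 1: 0.5}) == 0.5                            # coin flip paying 0 or 1
assert abs(expected_value({k: 1 / 10 for k in range(10)}) - 4.5) < 1e-9   # uniform over range(10)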
• For example, the uniform distribution puts equal weight on all the numbers
between 0 and 1.
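A hedged sketch of that uniform distribution's density and cumulative distribution function:

def uniform_pdf(x: float) -> float:
    # equal density everywhere on [0, 1), zero elsewhere
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x: float) -> float:
    # probability that a uniform random variable is <= x
    if x < 0:
        return 0      # it is never below 0
    elif x < 1:
        return x      # e.g. P(X <= 0.4) = 0.4
    else:
        return 1      # it is always below 1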
Continuous Distribution…
• The mean indicates where the bell is centered, and the standard
deviation how “wide” it is.
The Normal Distribution…
• When μ = 0 and σ = 1, it’s called the standard normal distribution.
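A hedged sketch of the normal density, following the standard formula with mean mu and standard deviation sigma:

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    # density of the normal distribution with mean mu and standard deviation sigma
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

# the standard normal (mu = 0, sigma = 1) peaks at x = 0 with height 1 / sqrt(2 pi)
assert abs(normal_pdf(0) - 1 / SQRT_TWO_PI) < 1e-12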
• The central limit theorem says (in essence) that a random variable
defined as the average of a large number of independent and
identically distributed random variables is itself approximately
normally distributed.
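A hedged simulation sketch of the central limit theorem in action: averages of many independent Uniform(0, 1) draws cluster tightly and roughly normally around 0.5 (the sample sizes are illustrative):

import random

random.seed(0)

def sample_mean(n: int) -> float:
    # average of n independent Uniform(0, 1) draws
    return sum(random.random() for _ in range(n)) / n

# each entry is the average of 100 draws; by the central limit theorem these averages
# are approximately normal with mean 0.5 and standard deviation sqrt(1/12) / sqrt(100)
means = [sample_mean(100) for _ in range(10_000)]

print(sum(means) / len(means))                               # close to 0.5
print(sum(abs(m - 0.5) < 0.06 for m in means) / len(means))  # nearly all averages within 0.06 of 0.5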