Module 1


Data Science and its Applications
What is Data Science?
• Data Science is the process of extracting knowledge from data in
various forms.
• It is the process of using data to find solutions or to predict outcomes for a
problem statement.
• It involves data cleaning, integration, visualization, and statistical
analysis of data sets to uncover patterns and trends.
• “Data science, also known as data-driven science, is an interdisciplinary
field of scientific methods, processes, algorithms, and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured, similar to data mining.”

• “Data science intends to analyze and understand actual phenomena with
‘data’. In other words, the aim of data science is to reveal the features or
the hidden structure of complicated natural, human, and social phenomena
with data, from a point of view different from the established or traditional
theory and method.”
Importance of Data Science
• Data science is important because it combines tools, methods, and
technology to generate meaning from data.
• Modern organizations are inundated with data; there is a
proliferation of devices that can automatically collect and store
information.
• Online systems and payment portals capture more data in the fields of
e-commerce, medicine, finance, and every other aspect of human life.
• We have text, audio, video, and image data available in vast
quantities.
Data science is used to study data in four main ways:

• 1. Descriptive analysis

• 2. Diagnostic analysis

• 3. Predictive analysis

• 4. Prescriptive analysis
1. Descriptive analysis

• Descriptive analysis examines data to gain insights into what
happened or what is happening in the data environment.

• It is characterized by data visualizations such as pie charts, bar charts,
line graphs, tables, or generated narratives.

2. Diagnostic analysis

• Diagnostic analysis is a deep-dive or detailed data examination to
understand why something happened.

• It is characterized by techniques such as drill-down, data discovery,
data mining, and correlations.
3. Predictive analysis

• Predictive analysis uses historical data to make accurate forecasts
about data patterns that may occur in the future.

• It is characterized by techniques such as machine learning, forecasting,
pattern matching, and predictive modeling.
4. Prescriptive analysis

• Prescriptive analytics takes predictive data to the next level.

• It not only predicts what is likely to happen but also suggests an optimum
response to that outcome.

• It uses techniques such as graph analysis, simulation, complex event
processing, and neural networks.

A sample data set

users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]
“friendship” data, represented as a list of pairs of IDs:

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
First we give each user an empty friends list, and then we populate the lists using the friendships data:

for user in users:
    user["friends"] = []

for i, j in friendships:
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j
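With the friends lists populated, simple questions become easy to answer. A minimal sketch (the helper name number_of_friends is illustrative, not from the notes):

def number_of_friends(user):
    """How many friends does this user have?"""
    return len(user["friends"])

total_connections = sum(number_of_friends(user) for user in users)  # 24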


Visualizing Data
What is data visualization?

• Data visualization is the representation of data through the
use of common graphics, such as charts, plots,
infographics, and even animations.

• These visual displays of information communicate
complex data relationships and data-driven insights in
a way that is easy to understand.

• Data visualization is a critical step in the data science
process, helping teams and individuals convey data more
effectively to colleagues and decision makers.

• Teams that manage reporting systems typically leverage
defined template views to monitor performance.
There are two primary uses for data visualization:
• To explore data
• To communicate data

Everyday dataviz

• Everyday dataviz sits in the quadrant between data-driven and
declarative.

• These are simple charts and graphs used in presentations and
reports to communicate key findings.

• They consist of line charts, bar charts, pie charts, and scatter
plots.
Types of data visualizations
• Matplotlib

• Bar charts

• Line charts

• Scatter plots
Matplotlib:

• It is a popular Python library for displaying data and creating static,
animated, and interactive plots.

• It lets you draw appealing and informative graphics such as
line plots, scatter plots, histograms, and bar charts.

• Matplotlib is highly customizable and flexible, which makes it a
preferred choice for data analysts and scientists working in fields such
as finance, science, engineering, and social sciences.
Creating a graph using Matplotlib

from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart: years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()
Bar Charts
• A bar chart is a good choice when you want to show how some
quantity varies among some discrete set of items.

• A bar chart can also be a good choice for plotting histograms of
bucketed numeric values, in order to visually explore how the values
are distributed (see the sketch after the next example).
Bar Chart Creation

import matplotlib.pyplot as plt

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]

# plot bars with x-coordinates [0, 1, 2, 3, 4], heights [num_oscars]
# (modern Matplotlib centers each bar on its x-coordinate by default)
plt.bar(range(len(movies)), num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")

# label the x-axis with movie names at bar centers
plt.xticks(range(len(movies)), movies)
plt.show()
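As mentioned above, a bar chart can also plot a histogram of bucketed numeric values. A minimal sketch (the grades data is illustrative, not from the notes):

from collections import Counter
import matplotlib.pyplot as plt

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# bucket grades by decile, but put 100 in with the 90s
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

plt.bar([x + 5 for x in histogram.keys()],  # shift bars right to center each decile
        histogram.values(),                 # give each bar its correct height
        10,                                 # give each bar a width of 10
        edgecolor=(0, 0, 0))                # black edges for each bar
plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105, y-axis from 0 to 5
plt.xticks([10 * i for i in range(11)])     # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()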
Line charts:

• Line charts show changes over time for an entity.

• With our dataset, a line chart could be used to show the trend of
layoffs over the past year or two.

• This depends on what you are trying to communicate, but we'll work
with a one-year analysis.
Line Chart

import matplotlib.pyplot as plt

variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]

# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

# because we've assigned labels to each series,
# we can get a legend for free (loc=9 means "top center")
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatter plots

• A scatterplot is the right choice for visualizing the relationship
between two paired sets of data.

• These visuals are beneficial in revealing the relationship between two
variables, and they are commonly used within regression data analysis.

• However, these can sometimes be confused with bubble charts, which
are used to visualize three variables via the x-axis, the y-axis, and the
size of the bubble.
Scatter plot graph

import matplotlib.pyplot as plt

friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# annotate each point with its corresponding label
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Linear Algebra

• Linear Algebra is a branch of mathematics that is extremely useful in
data science and machine learning.

• Linear algebra is the most important math skill in machine learning.
Most machine learning models can be expressed in matrix form.

• Linear algebra is used in data preprocessing, data transformation, and
model evaluation.

 It forms the backbone of machine learning algorithms, enabling
operations like matrix multiplication, which are essential to model
training and prediction.

 Focuses on vectors, matrices, and their operations.

 Essential for representing data, manipulating datasets, and
implementing machine learning algorithms.

• Linear algebra techniques facilitate dimensionality reduction,
enhancing the performance of data processing and interpretation.
The two central objects are:

Vectors
• In data science, vectors are ordered sets of numbers that represent
quantities with direction, often used to describe features or data points.

Matrices
• Matrices are rectangular arrays of numbers, where every row can
represent an observation and each column a feature.
Vector

• An array of numbers (data) is a vector.

• Vectors are ordered collections of numbers representing data points or directions.

• A vector is conventionally denoted by a lowercase, bold, italic
variable (feature) name.
A vector is just a list of floats:

from typing import List

Vector = List[float]

height_weight_age = [70,   # inches
                     170,  # pounds
                     40]   # years

grades = [95,  # exam1
          80,  # exam2
          75,  # exam3
          62]  # exam4
Python lists aren’t vectors, so we’ll need to build these arithmetic tools ourselves.

Adding two vectors

def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]
Subtracting two vectors

def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i - w_i for v_i, w_i in zip(v, w)]

assert subtract([5, 7, 9], [4, 5, 6]) == [1, 2, 3]
Creating a new vector: the componentwise sum of a list of vectors

def vector_sum(vectors: List[Vector]) -> Vector:
    """Sums all corresponding elements"""
    assert vectors, "no vectors provided!"

    # check the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"

    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

assert vector_sum([[1, 2], [3, 4], [5, 6], [7, 8]]) == [16, 20]
Multiplying a vector by a scalar

def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]
Mean of a list of vectors:

def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

assert vector_mean([[1, 2], [3, 4], [5, 6]]) == [3, 4]
Dot product of two vectors

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

assert dot([1, 2, 3], [4, 5, 6]) == 32  # 1*4 + 2*5 + 3*6

• Using dot, we can easily compute a vector's sum of squares, as sketched below.
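A minimal sketch (this helper matches the sum_of_squares imported later in the variance example; treat it as one possible implementation):

def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

assert sum_of_squares([1, 2, 3]) == 14  # 1*1 + 2*2 + 3*3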


Matrix

• A matrix is a two-dimensional collection of numbers.

• We will represent matrices as lists of lists, with each inner list having
the same size and representing a row of the matrix.

• A dataset itself is often represented as a matrix:
we can use a matrix to represent a dataset consisting of multiple vectors.
Finding the size of a matrix

from typing import List, Tuple

Matrix = List[List[float]]

def shape(A: Matrix) -> Tuple[int, int]:
    """Returns (# of rows of A, # of columns of A)"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0  # number of elements in first row
    return num_rows, num_cols

assert shape([[1, 2, 3], [4, 5, 6]]) == (2, 3)  # 2 rows, 3 columns
• If a matrix has n rows and k columns, we will refer to it as an n × k
matrix.

• We can think of each row of an n × k matrix as a vector of length k, and
each column as a vector of length n:

def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]

def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j]          # j-th element of row A_i
            for A_i in A]   # for each row A_i
Creating a matrix of a given shape and generating its elements

from typing import Callable

def make_matrix(num_rows: int,
                num_cols: int,
                entry_fn: Callable[[int, int], float]) -> Matrix:
    """Returns a num_rows x num_cols matrix
       whose (i, j)-th entry is entry_fn(i, j)"""
    return [[entry_fn(i, j)             # given i, create a list
             for j in range(num_cols)]  # [entry_fn(i, 0), ...]
            for i in range(num_rows)]   # create one list for each i
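For example, make_matrix can build an n × n identity matrix (a sketch following the same conventions; the helper name identity_matrix is illustrative):

def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix"""
    return make_matrix(n, n, lambda i, j: 1 if i == j else 0)

assert identity_matrix(3) == [[1, 0, 0],
                              [0, 1, 0],
                              [0, 0, 1]]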
STATISTICS

• Statistics is the science of analyzing data.

• When we have created a model for prediction, we must assess the
prediction's reliability.

• Statistics describing a single set of data are called descriptive statistics.

• They provide a summary of the important characteristics of the data,
without making any inferences about other data sets or populations.
Describing a Single Set of Data

Histogram of friend counts

from collections import Counter
import matplotlib.pyplot as plt

num_friends = [100, 49, 41, 40, 25, 50, 65, 90, 85, ...]  # remaining values elided

friend_counts = Counter(num_friends)
xs = range(101)                      # largest value is 100
ys = [friend_counts[x] for x in xs]  # height is # of people with that many friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, 25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
• From this data set we can generate some statistics. Probably the simplest statistic is the number
of data points:

num_points = len(num_friends)  # 204

• Largest and smallest values:

largest_value = max(num_friends)   # 100
smallest_value = min(num_friends)  # 1

• The values in specific positions:

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]
second_smallest_value = sorted_values[1]
second_largest_value = sorted_values[-2]
Central Tendencies
1. Measures of central tendency:
These statistics describe the centre or middle point of the data set.
There are three main types:

 Mean: The average of all the numbers in the data set. It is calculated by adding all
the values and dividing by the number of values (a sketch follows this list).

 Median: The middle number when the data is arranged in order from least to
greatest. If there are two middle numbers, the median is the mean of those two
numbers.

 Mode: The most frequent value in the data set. A data set can have one mode,
multiple modes (bimodal), or no mode.
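A minimal sketch of the mean (this matches the mean used later by de_mean in the variance example):

from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

assert mean([1, 2, 3, 4]) == 2.5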
# the underscore indicates that these are "private" helper functions
def _median_odd(xs: List[float]) -> float:
    """If len(xs) is odd, the median is the middle element"""
    return sorted(xs)[len(xs) // 2]

def _median_even(xs: List[float]) -> float:
    """If len(xs) is even, it's the average of the middle two elements"""
    sorted_xs = sorted(xs)
    hi_midpoint = len(xs) // 2  # e.g. length 4 => hi_midpoint 2
    return (sorted_xs[hi_midpoint - 1] + sorted_xs[hi_midpoint]) / 2

def median(v: List[float]) -> float:
    """Finds the 'middle-most' value of v"""
    return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)

assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2

print(median(num_friends))
A generalization of the median is the quantile, which represents the
value under which a certain percentile of the data lies.

def quantile(xs: List[float], p: float) -> float:
    """Returns the p-th percentile value in xs"""
    p_index = int(p * len(xs))
    return sorted(xs)[p_index]

assert quantile(num_friends, 0.10) == 1
assert quantile(num_friends, 0.25) == 3
assert quantile(num_friends, 0.75) == 9
assert quantile(num_friends, 0.90) == 13
Less commonly you might want to look at the mode, or most common
value(s):

def mode(x: List[float]) -> List[float]:
    """Returns a list, since there might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]

assert set(mode(num_friends)) == {1, 6}
Dispersion

• Dispersion refers to measures of how spread out our data is.

• Typically these are statistics for which values near zero signify not
spread out at all and for which large values (whatever that means)
signify very spread out.

• For instance, a very simple measure is the range, which is just the
difference between the largest and smallest elements:

def data_range(xs: List[float]) -> float:
    return max(xs) - min(xs)

assert data_range(num_friends) == 99

• The range is zero precisely when the max and min are equal, which
can only happen if the elements of xs are all the same.

• If the range is large, then the max is much larger than the min and the
data is more spread out.
 Measures of dispersion: These statistics describe how spread out the
data is from the center. There are three main types:

 Range: The difference between the largest and smallest values in the
data set.

 Variance: A measure of how spread out the data is from the mean. It is
calculated by finding the average of the squared deviations from the
mean.

 Standard deviation: The square root of the variance. It is measured in
the same units as the original data.
• A more complex measure of dispersion is the variance, which is computed
as:

from scratch.linear_algebra import sum_of_squares

def de_mean(xs: List[float]) -> List[float]:
    """Translate xs by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs: List[float]) -> float:
    """Almost the average squared deviation from the mean"""
    assert len(xs) >= 2, "variance requires at least two elements"

    n = len(xs)
    deviations = de_mean(xs)
    return sum_of_squares(deviations) / (n - 1)

assert 81.54 < variance(num_friends) < 81.55
 Measures of shape: These statistics describe the shape of the
distribution of the data. There are two main types:

 Skewness: A measure of the asymmetry of a distribution. A positive
skew indicates a long tail to the right, while a negative skew indicates
a long tail to the left.

 Kurtosis: A measure of how peaked or flat a distribution is relative to
a normal distribution. A positive kurtosis indicates a peaked
distribution, while a negative kurtosis indicates a flat distribution.
• Whatever units our data is in, all of our measures of central tendency
are in those same units.

• The range will similarly be in those same units.

• The variance, on the other hand, has units that are the square of the
original units.

• As it can be hard to make sense of these, we often look instead at the
standard deviation:
import math

def standard_deviation(xs: List[float]) -> float:
    """The standard deviation is the square root of the variance"""
    return math.sqrt(variance(xs))

assert 9.02 < standard_deviation(num_friends) < 9.04

A more robust alternative computes the difference between the 75th
percentile value and the 25th percentile value:

def interquartile_range(xs: List[float]) -> float:
    """Returns the difference between the 75%-ile and the 25%-ile"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

assert interquartile_range(num_friends) == 6
Correlation

• In statistics, correlation refers to the relationship between two variables.

• It describes the extent to which two variables change together, but it's
important to remember that it doesn't necessarily imply cause and
effect.

• DataSciencester’s VP of Growth has a theory that the amount of time
people spend on the site is related to the number of friends they have
on the site.
What it Measures:

 Strength: Correlation quantifies how strong the linear relationship is
between two variables. A strong correlation means the variables tend
to move together in a predictable way.

 Direction: It can be positive (both variables increase or decrease
together) or negative (one increases while the other decreases).

How it's Measured:

 Correlation Coefficient: This is a numerical value typically denoted
by "r" and ranges from -1 to +1.
 +1 indicates a perfect positive correlation (variables always
increase/decrease in tandem).
 -1 indicates a perfect negative correlation (as one increases, the other
perfectly decreases).
 0 indicates no linear relationship.

 Scatter Plots: These are visual representations of the data where each
point shows the values of two variables for a single observation. The
pattern of the points can reveal the direction and strength of the
correlation.
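A minimal sketch of computing r with the helpers defined earlier (dot, de_mean, standard_deviation); treat it as one possible implementation:

def covariance(xs: List[float], ys: List[float]) -> float:
    """Measures how xs and ys vary in tandem about their means"""
    assert len(xs) == len(ys), "xs and ys must have the same number of elements"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance scaled by both standard deviations,
       so the result always lies between -1 and +1"""
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / stdev_x / stdev_y
    else:
        return 0  # if there's no variation, correlation is zero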
Important Cautions:

 Correlation ≠ Causation: Just because two variables are correlated
doesn't mean one causes the other. There might be a third unknown
factor influencing both.

 Non-linear Relationships: Correlation coefficients only measure
linear relationships. If the relationship is curved or more complex,
correlation might not be a good measure.

 Outliers: Extreme values in the data can significantly impact the
correlation coefficient.
Simpson's paradox
• Simpson's paradox is a fascinating phenomenon in statistics that
highlights how aggregating data can be misleading.

• It occurs when a trend appears in several groups of data but disappears
or even reverses when the groups are combined.

• This can lead to misinterpretations and wrong conclusions if not
carefully considered.

• Understanding Simpson's paradox is crucial for data analysis in
various fields, including medicine, social sciences, and business.

• By recognizing its potential to mislead, you can draw more
reliable conclusions from your data.
• Reverse Causation: Sometimes, the seemingly dependent variable might
actually be influencing the independent variable. For instance, ice cream sales
might correlate with drowning rates, but it's unlikely ice cream causes drowning.
More likely, hot weather (a lurking variable) drives up both ice cream sales and
swimming activity, potentially leading to more drownings.

• Third Variable Influence (Lurking Variable): As mentioned with Simpson's
paradox, a hidden variable can be affecting both variables you're analyzing,
creating a correlation even though there's no direct link between them. For
example, a correlation between coffee consumption and heart disease might be
influenced by stress levels (a lurking variable), which can drive both coffee
drinking and heart problems.

Causation:

 Definition: Causation refers to a cause-and-effect relationship between two
variables. One variable (the cause) directly influences the other variable (the
effect).

 Establishing Causation: It's generally more difficult to establish causation than
correlation. Ideally, you need well-designed experiments or strong evidence to
show that changes in one variable truly cause changes in the other.

 Example: Studies have shown that smoking (cause) leads to lung cancer (effect).
This is a causal relationship supported by significant scientific evidence.
Probability
Probability
• Dependence and Independence
• Conditional Probability
• Bayes’s Theorem
• Random Variables
• Continuous Distributions
• The Normal Distribution
• The Central Limit Theorem
Probability
• Probability is an estimation of how likely a certain event or outcome will
occur. It is typically expressed as a number between 0 and 1, reflecting the
likelihood that an event or outcome will take place.

• Probability can be calculated by dividing the number of favorable outcomes
by the total number of outcomes of an event.

• It is hard to do data science without some sort of understanding of
probability and its mathematics.

• Probability is a measure of the chance of occurrence of an event E, and it is
denoted by P(E).
Probability…

• For our purposes you should think of probability as a way of
quantifying the uncertainty associated with events chosen from some
universe of events.

• Example: rolling a die.

• The universe consists of all possible outcomes, and any subset of
these outcomes is an event.

• For example, “the die rolls a 1” or “the die rolls an even number”
(e.g., 4).
Dependence and Independence

Independent Events:
• Independent events are events that are not affected by the occurrence of other events.
• Formula: P(A and B) = P(A) × P(B)
• Examples: Tossing one coin is not affected by the tossing of other coins; raining for
  a day and getting a six on a die are independent events.

Dependent Events:
• Dependent events are events that are affected by the occurrence of other events.
• Formula: P(B and A) = P(A) × P(B after A)
• Example: The probability of finding a red ball from a box of 4 red balls and 3 green
  balls changes if we take out two balls from the box.
Dependence and Independence…

• The two events E and F are dependent if knowing something about
whether E happens gives us information about whether F happens (and
vice versa). Otherwise, they are independent.

• Mathematically, we say that two events E and F are independent if the
probability that they both happen is the product of the probabilities
that each one happens:

P(E, F) = P(E)P(F)
Why Understanding Dependence and Independence Matters:

 It helps determine how to calculate probabilities of compound events
(where multiple events happen together).

 It's crucial for data analysis, where you need to assess if variables are
influencing each other.

 It plays a role in various fields like finance (analyzing stock market
fluctuations) and reliability engineering (assessing component
failures).
Conditional Probability
• The conditional probability formula gives the probability of an event
when another event has already occurred.

• If the probabilities of events A and B are P(A) and P(B) respectively, then the
conditional probability of B given that A has already occurred is denoted as
P(B|A).

• If P(A) > 0, then P(B|A) is calculated by using the formula:

• P(B|A) = P(A ∩ B)/P(A)

• If P(A) = 0, A is an impossible event; in this case, P(B|A) does
not exist.
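As an illustrative check (a simulation sketch using the die example above): estimating P(even | roll > 3) for a fair six-sided die.

import random

random.seed(0)  # make the run reproducible

trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]

rolls_above_3 = [r for r in rolls if r > 3]                # event A: roll > 3
even_and_above = [r for r in rolls_above_3 if r % 2 == 0]  # A and B: even and > 3

p_even_given_above_3 = len(even_and_above) / len(rolls_above_3)
print(p_even_given_above_3)  # close to 2/3, since {4, 5, 6} contains two even numbers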
Bayes’s Theorem
• Bayes’s Theorem is used to describe the probability of an event
based on prior knowledge of the other related conditions or events.

• Bayes’s theorem is a way of “reversing” conditional
probabilities.

• If we know the conditional probability P(B|A) and the prior probabilities
P(A) and P(B), then we can calculate P(A|B) using Bayes’s
Theorem:

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)
Bayes’s Theorem…

• Bayes’s Theorem enables us to calculate the probability of an event
given prior knowledge of other events.

• For example, predicting the probability of a customer purchasing a
product given their purchase history,
• or predicting whether an email is spam or not given that it contains certain
words, etc.

The Formula for Bayes’s Theorem:

P(A | B) = ( P(B | A) * P(A) ) / P(B)

The event F can be split into the two mutually exclusive events “F and E” and “F
and not E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:

P(F) = P(F, E) + P(F, ¬E)

so that:

P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]
Bayes' theorem

• Bayes' theorem is a fundamental concept in probability and statistics, and it plays a crucial role in
data science tasks involving conditional probabilities and updating beliefs based on new evidence.

• It's a formula that allows you to calculate the posterior probability (the probability of an event A
occurring given that you already know event B has happened) based on the following:

 Prior probability: The initial probability of event A happening before considering any new
evidence (represented by P(A)).

 Likelihood: The probability of observing event B given that event A has already occurred
(represented by P(B | A)).

 Marginal probability of event B: The probability of observing event B regardless of whether A
happened (represented by P(B)).
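A worked numeric sketch (the disease-testing numbers below are illustrative assumptions, not from the notes): suppose a disease affects 1 in 10,000 people and a test is 99% accurate in both directions.

p_disease = 1 / 10_000          # prior P(A)
p_pos_given_disease = 0.99      # likelihood P(B | A)
p_pos_given_healthy = 0.01      # false-positive rate P(B | ¬A)

# marginal P(B), via P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# posterior P(A | B), via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # roughly 0.0098: under 1%, despite the "99% accurate" test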
Random Variables

• A random variable is a variable whose possible values have an
associated probability distribution.

• A very simple random variable equals 1 if a coin flip turns up heads
and 0 if the flip turns up tails.

• A more complicated one might measure the number of heads you
observe when flipping a coin 10 times, or a value picked from
range(10) where each number is equally likely.
Random Variables…

• The expected value of a random variable is the average of its values
weighted by their probabilities.

• Example: the coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1 *
1/2), and the range(10) variable has an expected value of 4.5.

• Random variables can be conditioned on events just as other events can.

• We will be using random variables implicitly in what we do without calling
special attention to them.
Continuous Distributions

• A continuous distribution describes the probabilities of the possible
values of a continuous random variable.

• A coin flip corresponds to a discrete distribution—one that associates
positive probability with discrete outcomes.

• Often we’ll want to model distributions across a continuum of outcomes.

• For example, the uniform distribution puts equal weight on all the numbers
between 0 and 1.
Continuous Distributions…

• We represent a continuous distribution with a probability density
function (PDF), such that the probability of seeing a value in a certain
interval equals the integral of the density function over the interval.

• The cumulative distribution function (CDF) gives the
probability that a random variable is less than or equal to a certain
value.
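A minimal sketch of the PDF and CDF for the uniform distribution on [0, 1) (one conventional way to write them in Python):

def uniform_pdf(x: float) -> float:
    """Density of the uniform distribution on [0, 1)"""
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x: float) -> float:
    """Returns the probability that a uniform random variable is <= x"""
    if x < 0:
        return 0    # uniform random is never less than 0
    elif x < 1:
        return x    # e.g. P(X <= 0.4) = 0.4
    else:
        return 1    # uniform random is always less than 1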
The Normal Distribution

• The normal distribution is the classic bell curve–shaped distribution.

• It is completely determined by two parameters:
its mean μ (mu) and its standard deviation σ (sigma).

• The mean indicates where the bell is centered, and the standard
deviation how “wide” it is.
The Normal Distribution…
• When μ = 0 and σ = 1, it’s called the standard normal distribution.

• If Z is a standard normal random variable, then it turns out that:

X = σZ + μ

is also normal, but with mean μ and standard deviation σ.

• Conversely, if X is a normal random variable with mean μ and
standard deviation σ, then:

Z = (X − μ)/σ

is a standard normal variable.
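A minimal sketch of the normal PDF and CDF in Python (the CDF is written using math.erf; both are standard formulations):

import math

SQRT_TWO_PI = math.sqrt(2 * math.pi)

def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Density of the normal distribution with mean mu and std dev sigma"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (SQRT_TWO_PI * sigma)

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for a normal random variable, via the error function"""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2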
The Central Limit Theorem

• The central limit theorem says (in essence) that a random variable
defined as the average of a large number of independent and
identically distributed random variables is itself approximately
normally distributed.

• In particular, if x_1, ..., x_n are random variables with mean μ and
standard deviation σ, and if n is large, then:

(x_1 + ... + x_n)/n

is approximately normally distributed with mean μ and standard
deviation σ/√n.
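An easy way to see this is with binomial random variables: a Binomial(n, p) variable is the sum of n independent Bernoulli(p) variables, so for large n it looks approximately normal. A minimal simulation sketch:

import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1 - p"""
    return 1 if random.random() < p else 0

def binomial(n: int, p: float) -> int:
    """Returns the sum of n Bernoulli(p) trials"""
    return sum(bernoulli_trial(p) for _ in range(n))

# by the CLT, binomial(100, 0.5) should be roughly normal
# with mean n*p = 50 and standard deviation sqrt(n*p*(1-p)) = 5
samples = [binomial(100, 0.5) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 50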
