
DS Mod 1 To 2 Complete Notes


Course Code: CS602 Subject: Data Science

MODULE-I
INTRODUCTION: -
Introduction to data science, Different sectors of using data science, Purpose and components of Python,
Data Analytics processes, Exploratory data analytics, Quantitative technique and graphical technique,
Data types for plotting.

Introduction to data science

What Is Data Science?

An automated way to analyze enormous amounts of data and extract information.

or
A new discipline that combines aspects of statistics, mathematics, programming,
and visualization to turn data into information.
or
Data Science is about extraction, preparation, analysis, visualization, and maintenance
of information. It is a cross-disciplinary field which uses scientific methods and
processes to draw insight.

Life cycle of data science


Components of Data Science

The main components of Data Science are given below:

1. Domain Expertise and Scientific Methods Technology-

• Domain expertise means specialized knowledge or skills in a particular area. In
data science, there are various areas for which we need domain experts.
• Domain experts such as scientists and statisticians collect and analyze data in a
laboratory setup or controlled environment (scientific methods and
tools). The data is then subjected to relevant laws and to mathematical
and statistical models to analyze the data set and derive relevant
information from it.

Data Scientists collect, explore, analyze, and visualize data. They apply mathematical and
statistical models to find patterns and solutions in the data.
• Modern tools and technologies have made data processing and analytics faster and
more efficient. Examples of such technology: data processing tools, the Python language,
application design, and libraries.
➢ These technologies help Data Scientists to:
• Build and train machine learning models
• Manipulate data with technology
• Build data tools, applications, and services
• Extract information from data

Role of a Data Scientist:

• Ask the right questions


• Understand data structure
• Interpret and wrangle data
• Apply statistical and mathematical methods
• Visualize data and communicate with stakeholders
• Work as a team player
Different sectors of using data science

i. Data Science in Healthcare


Data Science has been playing a pivotal role in the Healthcare Industry. With
the help of classification algorithms, doctors are able to detect cancer and
tumors at an early stage using Image Recognition software.
Genetic Industries use Data Science for analyzing and classifying patterns
of genomic sequences. Various virtual assistants are also helping patients to
resolve their physical and mental ailments.
ii. Data Science in E-commerce
Amazon uses a recommendation system that recommends various products to
users based on their purchase history. Data Scientists have developed
recommendation systems that predict user preferences using Machine Learning.
iii. Data Science in Manufacturing
Industrial robots have taken over mundane and repetitive roles
required in the manufacturing unit. These industrial robots are autonomous
in nature and use Data Science technologies such as Reinforcement Learning
and Image Recognition.
iv. Data Science as Conversational Agents
Amazon’s Alexa and Siri by Apple use Speech Recognition to understand
users. Data Scientists develop this speech recognition system, which converts
human speech into textual data. It also uses various Machine Learning
algorithms to classify user queries and provide an appropriate response.
v. Data Science in Transport
Self-Driving Cars use autonomous agents that utilize Reinforcement
Learning and Detection algorithms. Self-Driving Cars are no longer fiction
due to advancements in Data Science.
Data Analytics processes
Data analytics is the process of examining data sets to find trends and
draw conclusions about the information they contain. Increasingly, data
analytics is done with the aid of specialized systems and software.
Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions.
Scientists and researchers also use analytics tools to verify or disprove
scientific models, theories and hypotheses.
Types of Data Analytics
There are four major types of data analytics:
1. Descriptive (business intelligence and data mining)
2. Predictive (forecasting)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

1. Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events.
It examines past performance by mining historical data to understand the
causes of past success or failure. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
Common examples of descriptive analytics are company reports that provide historic reviews, like:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
Example: At D-Mart, we can look at the sales history to find out which products have sold the most
or are in high demand. Based on these sales trends, we can decide to stock those items in larger
quantity for the coming year.
2. Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to
determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics
draws on a variety of statistical techniques from modeling, machine learning, data mining, and game
theory that analyze current and historical facts to make predictions about future events.
Example: Amazon and Netflix recommendation systems
3. Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and
machine learning to make a prediction and then suggests decision options to take advantage of the
prediction.
Example: Google Self-Driving Car
4. Diagnostic Analytics
In this analysis, we generally use historical data in preference to other data to answer a question or
solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.

Data Analysis Process consists of the following phases that are iterative in
nature −

1. Business Problem- The process of analytics begins with the questions or business problems of
stakeholders.
Examples of such questions are:
• Who are the customers?
• Why are sales going down?
• How to manage the inventory?
• Why is the system not scaling up with increasing traffic volume?
Such business problems trigger the need to analyze data and find answers.

2. Data Acquisition- The process of collecting data from various sources such as RDBMS, NoSQL
databases, and web server logs, and also of scraping the web or pulling data through web APIs.

3. Data Wrangling
Data Wrangling: Challenges
Causes of challenges in the data wrangling phase:
• Unexpected data format
• Erroneous data
• Voluminous data to be manipulated
• Classifying data into linear or clustered
• Determining relationship between observation, feature, and response
Data Wrangling includes:
1. Data cleansing
2. Data manipulation
Data cleansing
Even processed and organized data may be incomplete, contain duplicates, or
contain errors. Data cleansing is the process of preventing and correcting these
errors.
Data manipulation
Data manipulation techniques such as transform, aggregate, groupby, reshape, and merge
transform the data and make it available for exploratory data analysis.
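
The cleansing and manipulation steps described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the sales records, the field names and the decision to drop exact duplicates are assumptions made up for the example and are not part of these notes.

```python
from collections import defaultdict

# Hypothetical sales records (made-up data for illustration only).
records = [
    {"region": "North", "product": "pen", "units": 10},
    {"region": "South", "product": "pen", "units": 4},
    {"region": "North", "product": "book", "units": 7},
    {"region": "South", "product": "book", "units": None},  # missing value
    {"region": "North", "product": "pen", "units": 10},     # exact duplicate of the first record
]

# Data cleansing: drop records with missing values, then drop exact duplicates.
complete = [r for r in records if r["units"] is not None]
seen = set()
cleaned = []
for r in complete:
    key = (r["region"], r["product"], r["units"])
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# Data manipulation: group by region and aggregate (sum) the units sold.
units_by_region = defaultdict(int)
for r in cleaned:
    units_by_region[r["region"]] += r["units"]

print(dict(units_by_region))
```

In practice a library such as pandas is commonly used for these operations; the plain loops here just make each step visible.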

4. Data Exploration
Data Exploration includes:
1. Data discovery
2. Data pattern

➢ Data exploration uses all the available data and presents it in either numerical
or graphical format.
➢ This helps to identify the right patterns in the data. The data and its underlying
patterns are fed into an appropriate Machine Learning model, leading directly to the
conclusion or prediction phase.
Exploratory Data Analysis (EDA)-
• APPROACH-EDA approach studies the data to recommend suitable models that best fit
the data.
• FOCUS-The focus is on data; its structure, outliers, and models suggested by the data.
• ASSUMPTIONS-EDA techniques make minimal or no assumptions.
They present and show all the underlying data without any data loss.
• EDA TECHNIQUES-
Quantitative: Provides numeric outputs for the input data.
Graphical: Uses statistical functions for graphical output.

EDA: Quantitative Technique


EDA: Quantitative technique has two goals, measurement of central tendency and spread of
data.

EDA: Graphical Technique


Types of Plot
1. HISTOGRAM
2. HEAT MAP
3. SCATTER PLOT
4. BOX PLOT

Histograms and scatter plots are two popular graphical techniques to depict data.
Histogram graphically summarizes the distribution of a univariate data set.
It shows:
• the center or location of data (mean, median, or mode)
• the spread of data
• the skewness of data
• the presence of outliers
• the presence of multiple modes in the data

Q.Draw Histogram of following data:

Marks            0-10   10-20   20-30   30-40   40-50   50-60
No. of students  5      12      15      22      14      4
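
As a rough check before drawing the histogram by hand, the frequency table can be turned into a text histogram. The sketch below is an illustration only: it prints one `#` per student so that bar length stands in for bar height, and the variable names are arbitrary.

```python
# Class intervals and frequencies taken from the question above.
classes = ["0-10", "10-20", "20-30", "30-40", "40-50", "50-60"]
frequencies = [5, 12, 15, 22, 14, 4]

total_students = sum(frequencies)
# The modal class is the interval with the highest frequency.
modal_class = classes[frequencies.index(max(frequencies))]

# One '#' per student: bar length plays the role of bar height.
for label, freq in zip(classes, frequencies):
    print(f"{label:>6} | {'#' * freq} ({freq})")

print("Total students:", total_students)
print("Modal class:", modal_class)
```

The bars rise to a single peak in the 30-40 class and fall away on both sides, which is the kind of shape information (center, spread, modes) a histogram is meant to show.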

Scatter plot represents relationships between two variables. It can answer these questions
visually:
• Are variables X and Y related?
• Are variables X and Y linearly related?
• Are variables X and Y non-linearly related?
• Does change in variation of Y depend on X?
• Are there outliers?
Q.Draw scatter plot:

Age:[5,7,11,12,14,16,18,6,4]

Weight:[15,17,18,20,22,24,20,13,9]
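
The question "Are X and Y linearly related?" can also be checked numerically once the points are plotted. The sketch below computes the Pearson correlation coefficient for the Age and Weight data; using this coefficient here is an assumption added for illustration (correlation is covered in Module 2), and the function name is arbitrary.

```python
import math

age = [5, 7, 11, 12, 14, 16, 18, 6, 4]
weight = [15, 17, 18, 20, 22, 24, 20, 13, 9]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(age, weight)
print(round(r, 2))
```

A value close to +1 suggests that weight tends to increase with age in roughly a straight-line fashion, which is what the scatter plot should show.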
Conclusion or Prediction

• This step involves reaching a conclusion and making predictions based on the data
analysis
• Involves heavy use of mathematical and statistical functions
• Requires model selection, training, and testing to help in forecasting
• Is called machine learning as data analysis is fully or semi-automated with minimal or
no human intervention
➢ Hypothesis
Hypothesis building begins in the data exploration stage, but becomes more
mature in the conclusion or prediction phase. Hypothesis building uses feature
engineering and models.
Hypothesis testing- A hypothesis is a premise or claim that we want to test.

Communication

The last step of data analysis is communication, where the analyzed
data is formally presented to stakeholders. Forms of data analysis
presentations:
• Visual graphs
• Plotting maps
• Reports
• Whitepaper reports
• PowerPoint presentations
Data Visualization
Benefits of data visualization:
• Simplifies quantitative information through visuals
• Shows the relationship between data points and variables
• Identifies patterns
• Establishes trends
Examples of data visualization
• Presenting information about new and existing customers on the website and their
behavior when they access the website.
• Representing web traffic pattern for the website, for example, more activity on the
website in the morning than in the evening
Plotting-
Plotting is a data visualization technique used to represent underlying data through graphics.

Features of plotting:
• Plotting is like telling a story about data using different colors, shapes, and sizes.
• Plotting shows the relationship between variables.
• Example:
o Change in value of Y results in change in value of X
o X is independent of Y

Types of Plot
Different data types can be visualized using various plotting techniques.

1. histogram
2. line chart
3. regression plot
4. heat map
5. cluster map
6. scatter plot
Data Types for Plotting
1. Numerical Data or Quantitative Data
Numerical data refers to data that is in the form of numbers. Often referred to as quantitative
data, numerical data is collected in number form and stands apart from other data types
because statistical and arithmetic operations can be performed on it.
There are two types of numerical data:
Discrete Data: Distinct or counted values.
Example: Number of employees in a company or number of students in a class.
Continuous Data: Values within a range that can be measured.
Example: Height can be measured in feet or inches and weight can be measured in pounds or
kilograms.
2. Categorical Data or Qualitative Data
Categorical data is a type of data that can be sorted into groups or categories with the aid of
names or labels.
Also known as qualitative data, each element of a categorical dataset can be placed in only one
category according to its qualities, and the categories are mutually exclusive.
For example, gender is categorical data because it can be categorized into male and female
according to some unique qualities possessed by each gender.
There are two types of categorical data:
Nominal data: This type of categorical data consists of names or labels. Sometimes called
naming data, it has characteristics similar to that of a noun.
Eg: Gender: male, female; Blood Type: A, B, O, AB
Ordinal data: This type of categorical data includes elements that are ranked, ordered or have
a rating scale attached. One can count and order ordinal data, but it cannot be measured.
For example, suppose a group of customers were asked to taste the varieties of a restaurant’s new menu on
a rating scale of 1 to 5—with each level on the rating scale representing strongly dislike, dislike, neutral, like,
strongly like. In this case, a rating of 5 indicates more enjoyment than a rating of 4, making such data
ordinal.
Questions:
Q. Differentiate between Data Science and Big Data. Briefly explain the 5 Vs of Big Data.
Big Data: Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques. It consists of very large amounts of data of many varieties. It is the
concept of gathering useful insights from such voluminous amounts of structured,
semi-structured and unstructured data that can be used for effective decision-making in the business
environment.
Sources of Big Data

• Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on
a day-to-day basis as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which
users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and
manipulated to forecast weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and accordingly
publish their plans, and for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily
transactions.

The 5 Vs of big data:

1. Volume: Volume refers to the vast growth in the amount of data. This is evident as more than 90% of the
data we encounter was produced recently. In fact, more than 2.5 quintillion (10^18) bytes have been created daily
since as early as 2013 from every post, share, search, click, stream, and many more data producers.
Although this huge volume poses a challenge to storage capacity, that challenge is eased by
advanced storage technologies and the decreasing cost of computer storage.
The actual challenge is the analysis of such vast data islands, considering the heterogeneous nature
of the data.
2. Velocity: Velocity represents the accumulation of data at high speed, in near real-time and real-time, from
dissimilar data sources. In Big Data, data flows in from sources like machines, networks, social media,
mobile phones, etc. There is a massive and continuous flow of data. Velocity determines the potential of data:
how fast the data is generated and processed to meet the demands.
Example: There are more than 3.5 billion searches per day on Google. Also, Facebook users are increasing
by 22% (approx.) year by year.
3. Variety: Variety involves collecting data from various sources and in fuzzy and heterogeneous types.
This includes importing data in dissimilar formats, namely structured (tables residing in relational databases
such as an RDBMS), semi-structured (e-mail, XML, JSON, and other markup languages, etc.) and unstructured (text,
pictures, audio files, video, sensor data, etc.).
4. Veracity: Veracity refers to the provenance, accuracy and correctness of data. Factors that help ensure the
veracity of Big Data are as follows:
• trustworthiness of data origin
• reliability and security of the data store
• data availability
• correctness
• consistency
5. Value: Value represents the outcome of Big Data analysis and is an essential characteristic of
Big Data. What matters is not simply the data that we process or store, but valuable and reliable data that we
store, process, and analyse. A bulk of data with no value is of no good to the company unless it is turned into
something useful.
Q.What is EDA technique?
Most EDA techniques are graphical in nature, with a few quantitative
techniques, and they also suggest models that best fit the data. They use the entire
data with minimal or no assumptions.
Q.Differentiate between Univariate, Bivariate, and Multivariate analysis?

• Univariate – When we analyze one variable at a time, it is called
univariate data analysis. Example: height of students
• Bivariate – Bivariate data involves two different variables. The
analysis of this type of data deals with causes and relationships.
Example: temperature and ice cream sales in the summer season.
• Multivariate – Analyzing three or more variables together is
categorized under multivariate data analysis.
Example: data for house price prediction
Module-2
STATISTICAL ANALYSIS: -

Introduction to statistics, statistical and non-statistical analysis, major categories of statistics, population
and sample, Measure of central tendency and dispersion, Moments, Skewness and kurtosis, Correlation
and regression, Theoretical distributions – Binomial, Poisson, Normal

Introduction to Statistics
Statistics deals with the methods for collection, classification and analysis of numerical
data for drawing valid conclusions and making reasonable decisions.

➢ The field of Statistics has an influence over all domains of life: the stock market,
life sciences, weather, retail, insurance, and education.

Types of Analysis
1. Quantitative Analysis and Qualitative Analysis (on the basis of data)
2. Descriptive Analysis and Inferential Analysis (on the basis of tool)
3. Univariate, Bivariate and Multivariate Analysis (on the basis of variable)


1. Quantitative Analysis: Quantitative Analysis or Statistical Analysis is the
science of collecting and interpreting data with numbers and graphs to identify
patterns and trends.
2. Qualitative Analysis: Qualitative or Non-Statistical Analysis involves collection
and analysis of qualitative data to understand concepts, opinions or experiences.

Example: When you purchase a coffee from a coffee shop, it is available in Short, Tall and
Grande. Describing the sizes in this way is Qualitative Analysis. But counting that a store
sells 70 regular coffees a week is Quantitative Analysis.
Terminologies in Statistics –
• The population is the set of sources from which data has to be collected.
• A Sample is a subset of the Population.
• A Variable is any characteristic, number, or quantity that can be measured or
counted.
• Statistics are quantitative values calculated from the sample.
• Parameters are the characteristics of the population.
Categories in Statistics
There are two main categories in Statistics, namely:
1. Descriptive Statistics
2. Inferential Statistics

Descriptive Statistics
Descriptive Statistics uses the data to provide descriptions of the population,
either through numerical calculations or graphs or tables.
➢ Descriptive Statistics helps organize data and focuses on the characteristics of
the data, providing parameters.
Example: To study the average height of students in a classroom using descriptive
statistics, record the heights of all students in the class and then find the
maximum, minimum and average height of the class.
Inferential Statistics
Inferential Statistics makes inferences and predictions about a population based
on a sample of data taken from the population in question.
➢ Inferential statistics generalizes from a sample to a large data set and applies probability to arrive at a
conclusion. It allows us to infer parameters of the population based on sample statistics
and build models on them.
Example: Consider the same example of finding the average height of students
in a class. In inferential statistics, take a sample of the class, which is
basically a few people from the entire class, with the class grouped into tall, average
and short. From this sample, build a statistical model and extend it to the
entire population of the class.

Measures of Descriptive Statistics


1. Measures of central tendency
2. Measures of dispersion.
Measures of central tendency- A measure of central tendency is a
single value that attempts to describe a set of data by identifying
the central position within that set of data.

1. MEAN- The mean is equal to the sum of all the values in
the data set divided by the number of values in the data set, i.e.
the calculated average.
• It is susceptible to outliers.
• When unusual values are added, it gets skewed.
2. MEDIAN- The median is the middle value of a dataset that has
been arranged in order of magnitude.
➢ Median is a better alternative to mean as it is less affected by
outliers and skewness of the data.
If the total number of values n is odd, then
Median = the ((n+1)/2)th value

If the total number of values n is even, then
Median = average of the (n/2)th and (n/2 + 1)th values

3. MODE- The mode is the most commonly occurring value in the
dataset.
➢ We can therefore sometimes consider the mode as
being the most popular option.
Eg: In a dataset containing the values {13,35,54,54,55,56,57,67,85,89,96}, Mean is 60.09,
Median is 56 and Mode is 54.
Q. Find the mean, median and mode.
23,29,20,32,23,21,33,25
Mean = (23+29+20+32+23+21+33+25)/8 = 206/8 = 25.75
Arranged in order: 20,21,23,23,25,29,32,33

Median = (23+25)/2 = 24

Mode = 23
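
The hand calculation above can be cross-checked with Python's built-in `statistics` module. This is just a verification sketch; the variable names are arbitrary.

```python
import statistics

data = [23, 29, 20, 32, 23, 21, 33, 25]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # average of the two middle values (n is even)
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```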

2. Measures of dispersion- A measure of dispersion helps us to study the
variability of the items, i.e. the spread of data.
1. Range: The difference between the largest and the smallest value of a
data set is termed the range of the distribution.
{13,33,45,67,70}
Range = (70 − 13) = 57
2. Variance: Variance measures how far the data points are spread
from the mean, i.e. the dispersion around the mean.
➢ Variance is the average of all squared deviations from the mean.

σ² = ∑(xi − x̅)² / n
3. Standard Deviation: The square root of the variance is the standard
deviation. It tells about the concentration of the data around the mean of the
data set.
Q. {3,5,6,9,10} are the values in a dataset.
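
The question does not state which measures are wanted, so the sketch below assumes it asks for the range, variance and standard deviation. It follows the population formula for variance given above (dividing by n, not n − 1); the variable names are arbitrary.

```python
import math

values = [3, 5, 6, 9, 10]
n = len(values)

value_range = max(values) - min(values)

mean_value = sum(values) / n
# Squared deviation of each value from the mean
squared_devs = [(x - mean_value) ** 2 for x in values]

variance = sum(squared_devs) / n   # population variance (divide by n)
std_dev = math.sqrt(variance)      # standard deviation

print(value_range, mean_value, round(variance, 2), round(std_dev, 4))
```

If the five values were treated as a sample rather than the whole population, the divisor would be n − 1 and both results would be slightly larger.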
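Q. For this dataset, which measures of central tendency and dispersion are better?
35,50,50,50,56,60,60,75,250
Mean = 76.2
SD = 62.3

Median = 56
IQR = 17.5
The single extreme value 250 pulls the mean and SD far above the bulk of the data, so the median and IQR describe this dataset better.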
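
The effect of the extreme value 250 on each measure can be seen by computing the four measures with and without it. This is a sketch using the standard library; it assumes the population standard deviation and the default ("exclusive") quartile method of `statistics.quantiles`, and other quartile conventions can give a slightly different IQR.

```python
import statistics

data = [35, 50, 50, 50, 56, 60, 60, 75, 250]

mean_value = statistics.mean(data)
sd_value = statistics.pstdev(data)           # population standard deviation
median_value = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default 'exclusive' method
iqr = q3 - q1

# The same centre measures after removing the extreme value 250
trimmed = [x for x in data if x != 250]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(round(mean_value, 1), round(sd_value, 1), median_value, iqr)
print(trimmed_mean, trimmed_median)
```

Removing one value moves the mean by more than 20 marks while the median moves only slightly, which is why the median and IQR are usually preferred when outliers are present.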
Moments
Moments are used to describe the various characteristics of a frequency
distribution, i.e. central tendency, dispersion, skewness and kurtosis.

Moments about mean
Let x1, x2, ..., xn be the values of a variable x with corresponding
frequencies f1, f2, ..., fn, and let N = ∑f and x̅ = ∑fx / N be the mean.
The r-th moment about the mean is denoted by μr and defined as
μr = (1/N) ∑f (x − x̅)^r
In particular, μ0 = 1, μ1 = 0 and μ2 = (1/N) ∑f (x − x̅)² = σ² (variance).

Moments about any point (arbitrary value)
Let x1, x2, ..., xn be the values of a variable x with corresponding
frequencies f1, f2, ..., fn respectively. The r-th moment about any value A is
denoted by μr′ and defined as
μr′ = (1/N) ∑f (x − A)^r = (1/N) ∑f d^r, where d = x − A
so that
μ1′ = (1/N) ∑f (x − A)
μ2′ = (1/N) ∑f (x − A)²
μ3′ = (1/N) ∑f (x − A)³

Relation between moments about mean and moments about any point
Since x − x̅ = (x − A) − (x̅ − A) = d − μ1′,
μr = (1/N) ∑f (d − μ1′)^r
Expanding gives
μ1 = 0
μ2 = μ2′ − (μ1′)²
μ3 = μ3′ − 3μ2′μ1′ + 2(μ1′)³
μ4 = μ4′ − 4μ3′μ1′ + 6μ2′(μ1′)² − 3(μ1′)⁴

Example 7.21. The first three moments of a distribution about the value 67
of the variable are 0.45, 8.73 and 8.91. Calculate the second and third central moments, and the
moment coefficient of skewness. Indicate the nature of the distribution. (Delhi Univ. B.A. (Econ. Hons. I), 2...)
Solution. In the usual notation we are given:
A = 67, μ1′ = 0.45, μ2′ = 8.73 and μ3′ = 8.91
The second and third central moments are given by:
μ2 = μ2′ − (μ1′)² = 8.73 − (0.45)² = 8.73 − 0.2025 = 8.5275
μ3 = μ3′ − 3μ2′μ1′ + 2(μ1′)³ = 8.91 − 3 × 8.73 × 0.45 + 2 × (0.45)³
= 8.91 − 11.7855 + 0.18225 = −2.6933
Hence, the variance of the distribution is
σ² = μ2 = 8.5275 ⇒ σ (s.d.) = √8.5275 = 2.9202
Since μ3 is negative, the given distribution is negatively skewed. In other words, the frequency curve has
a longer tail towards the left. Karl Pearson's moment coefficient of skewness is given by:
γ1 = μ3 / μ2^(3/2) = μ3 / (μ2 √μ2) = −2.6933 / (8.5275 × 2.9202) = −2.6933 / 24.9020 = −0.1082

⇒ β1 = μ3² / μ2³ = γ1² = (−0.1082)² = 0.0117

Moments about origin
The r-th moment about the origin is denoted by νr and defined as
νr = (1/N) ∑f x^r
so that
ν1 = (1/N) ∑f x
ν2 = (1/N) ∑f x²
ν3 = (1/N) ∑f x³
ν4 = (1/N) ∑f x⁴

Skewness and Kurtosis

Skewness
Skewness means lack of symmetry in a frequency distribution. A frequency
distribution which is not symmetrical is called skewed.
It is of two types:
a) Positively skewed distribution: a frequency distribution is said to be
positively skewed if it has a longer tail on the right-hand side.
Here Mean > Median > Mode.
b) Negatively skewed distribution: a frequency distribution is said to be
negatively skewed if it has a longer tail on the left-hand side.
Here Mean < Median < Mode.
For a symmetrical distribution, Mean = Median = Mode.

Measures of skewness (Sk)
a) Karl Pearson's coefficient of skewness
i) Based on mean and mode: Sk = (Mean − Mode) / σ
ii) Based on mean and median: Sk = 3(Mean − Median) / σ
Sk > 0 when Mean > Mode (positively skewed), Sk = 0 when Mean = Median = Mode
(symmetrical), and Sk < 0 when Mean < Mode (negatively skewed).
b) Bowley's coefficient of skewness (based on quartiles)
Sk = (Q3 + Q1 − 2Q2) / (Q3 − Q1)
c) Kelly's coefficient of skewness (based on deciles)
Sk = (D9 + D1 − 2D5) / (D9 − D1)
d) Coefficient of skewness based on moments
β1 = μ3² / μ2³, γ1 = √β1 (taken with the sign of μ3)

Kurtosis
Kurtosis describes the peakedness or flatness of a frequency curve
compared with the normal curve. It is measured by
β2 = μ4 / μ2², γ2 = β2 − 3
• Mesokurtic: the curve is normal, β2 = 3 (γ2 = 0).
• Leptokurtic: the curve is more peaked than the normal curve, β2 > 3 (γ2 > 0).
• Platykurtic: the curve is flatter than the normal curve, β2 < 3 (γ2 < 0).

Example 1: For a distribution, Karl Pearson's coefficient of skewness is 0.64,
standard deviation is 13 and mean is 59.2. Find mode and median.
Solution: We are given
Sk = 0.64, σ = 13 and Mean = 59.2
Therefore, by using the formula
Sk = (Mean − Mode) / σ
0.64 = (59.2 − Mode) / 13
Mode = 59.20 − 8.32 = 50.88
Mode = 3 Median − 2 Mean
50.88 = 3 Median − 2 (59.2)
Median = (50.88 + 118.4) / 3 = 169.28 / 3 = 56.43
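
The two rearrangements used in this example can be written out directly. The sketch below assumes the mode-based form of Karl Pearson's coefficient and the empirical relation Mode = 3 Median − 2 Mean, both taken from the worked solution; the variable names are arbitrary.

```python
sk = 0.64      # Karl Pearson's coefficient of skewness
sigma = 13     # standard deviation
mean = 59.2

# From Sk = (Mean - Mode) / sigma
mode = mean - sk * sigma

# From the empirical relation Mode = 3 * Median - 2 * Mean
median = (mode + 2 * mean) / 3

print(round(mode, 2), round(median, 2))
```

Since Sk is positive, the result should satisfy Mean > Median > Mode, which gives a quick sanity check on the answer.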
Therefore, Skewness, β1 = μ3² / μ2³ = (0.7)² / (2.5)³ = 0.031
Kurtosis, β2 = μ4 / μ2² = 18.75 / (2.5)² = 18.75 / 6.25 = 3
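
The moment-based coefficients in this calculation can be computed and interpreted as follows. The central moments μ2 = 2.5, μ3 = 0.7 and μ4 = 18.75 are read off the working above; the helper function, its name and the tolerance used to decide that β2 equals 3 are assumptions made for illustration.

```python
mu2, mu3, mu4 = 2.5, 0.7, 18.75   # central moments read from the working above

beta1 = mu3 ** 2 / mu2 ** 3       # moment coefficient of skewness
beta2 = mu4 / mu2 ** 2            # moment coefficient of kurtosis
gamma2 = beta2 - 3                # excess kurtosis relative to the normal curve

def kurtosis_type(b2, tol=1e-9):
    """Classify a curve by beta2 (tol is an arbitrary tolerance for 'equal to 3')."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"

print(round(beta1, 3), beta2, kurtosis_type(beta2))
```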
E2) For a frequency distribution, Bowley's coefficient of skewness is
1.2. If the sum of the 1st and 3rd quartiles is 200 and the median is 76,
find the value of the third quartile.
Pre-requisites topics
➢ PROBABILITY
Probability of an event, P(E) = (No. of favorable occurrences) / (No. of possible occurrences)
e.g.- if the experiment is to flip one fair coin, event A might be getting
a head. The probability of an event A is written P(A).
P(A) = 1/2 = 0.5
➢ Random Variable
A random variable describes the outcomes of a statistical experiment in
words. The values of a random variable can vary with each repetition of an
experiment.

If X is a random variable, then X is written in words, and x is given as a
number. For a discrete random variable, the x values are countable outcomes.
e.g.- X = the number of heads obtained when a fair coin is tossed once
x = 0, 1
➢ Discrete Random Variables- take distinct values
X = number of heads after flipping 3 fair coins
X = number of days that a student attends class
➢ Continuous Random Variables- take any value in an interval
X = mass of animals in a zoo
Y = winning time for men in a race
A discrete probability distribution function has two characteristics:

1. Each probability is between zero and one.


2. The sum of the probabilities is one.

Q. Suppose Nancy has classes three days a week. She attends classes three days a
week 80% of the time, two days 15% of the time, one day 4% of the time, and no days 1% of
the time. Suppose one week is randomly selected.
a. Let X = the number of days Nancy ____________________.
b. X takes on what values?
c. Construct a probability distribution
table (called a PDF table). What does the P(x) column sum to?

Solution: a. Let X = the number of days Nancy attends class per week.
b. X takes on the values 0, 1, 2, 3.
c. PDF table
X P(x)
0 0.01
1 0.04
2 0.15
3 0.80
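
The two characteristics of a discrete probability distribution can be checked against this table in a few lines. The expected number of days computed at the end is an extra step that is not asked for in the question; it is included only as an illustration, and the variable names are arbitrary.

```python
import math

# PDF table from the solution: x -> P(x)
pdf = {0: 0.01, 1: 0.04, 2: 0.15, 3: 0.80}

# Characteristic 1: each probability is between zero and one.
all_in_range = all(0 <= p <= 1 for p in pdf.values())

# Characteristic 2: the probabilities sum to one (compared with a float tolerance).
total = sum(pdf.values())
sums_to_one = math.isclose(total, 1.0)

# Extra: expected number of days attended per week, sum of x * P(x).
expected_days = sum(x * p for x, p in pdf.items())

print(all_in_range, sums_to_one, round(expected_days, 2))
```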

Note: The above topic is not included in the syllabus.


Theoretical distributions – Binomial, Poisson, Normal

1.Binomial Distribution

A distribution in which each trial has only two possible outcomes, such as success or failure,
gain or loss, win or lose, and in which the probability of success is the same
for all the trials, is called a Binomial Distribution.

The properties of a Binomial Distribution are

1. Each trial is independent.


2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials.

Notation: X ∼ B(n, p)
B = Binomial Probability Distribution Function
X is a random variable with a binomial distribution.
The parameters are n and p; n = number of trials, p = probability of a success
on each trial.

The mathematical representation of the binomial distribution is given by:

P(X = x) = [n! / (x! (n − x)!)] × p^x × q^(n − x),   x = 0, 1, 2, …, n

Where, x = number of successes
n = number of trials, p = probability of a success on each trial.
q = probability of a failure on each trial = 1 − p

For the binomial probability distribution

The mean or expected value of X is E(X) = μ = np and the variance is σ^2 = npq.
The standard deviation is σ = √(npq).

GRAPH OF BINOMIAL DISTRIBUTION


Q. Sixty-five percent of people pass the state driver’s exam on the first try. A group of
50 individuals who have taken the driver’s exam is randomly selected. Give two reasons
why this is a binomial problem.
Solution:
This is a binomial problem because each trial has only a success or a failure, and there is a definite
number of trials. The probability of a success stays the same for each trial.

Q.Suppose you play a game that you can only either win or lose. The probability that
you win any game is 55%, and the probability that you lose is 45%. Each game you play
is independent. If you play the game 20 times, write the function that describes the
probability that you win 15 of the 20 times.
Solution:
Here, if you define X as the number of wins, then X takes on the values 0, 1, 2, 3, …, 20.
The probability of a success is p=0.55.
The probability of a failure is q=0.45.
The number of trials n=20.
The probability question can be stated mathematically as P(x=15).
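
To evaluate P(x = 15) numerically, the binomial formula can be coded directly. This is a minimal sketch using only the Python standard library; the function name binomial_pmf is an assumption made for this example, and n, p are the values stated in the question.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p); binomial_pmf is a hypothetical helper name."""
    q = 1 - p  # probability of a failure on each trial
    return comb(n, x) * p**x * q**(n - x)


n, p = 20, 0.55                      # 20 games, 55% chance of winning each
p_15_wins = binomial_pmf(15, n, p)   # probability of exactly 15 wins

mean = n * p                         # E(X) = np
std_dev = sqrt(n * p * (1 - p))      # sigma = sqrt(npq)

print(f"P(X = 15) = {p_15_wins:.4f}")
print(f"mean = {mean:.2f}, standard deviation = {std_dev:.3f}")
```

Summing binomial_pmf(x, n, p) over x = 0, 1, …, n gives 1, which is a quick sanity check on the function.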

Q. If you flip a regular coin three times, what is the probability of getting exactly one
head, and is this a binomial experiment?

1. Each trial is independent.
2. There are only two possible outcomes in a trial: either a success or a failure.

Success = heads
Failure = tails
3. A total number of n identical trials are conducted. n = 3
4. The probability of success and failure is the same for all trials. p = 0.5

➢ This is a binomial experiment.

P(X = 1) = [3! / ((3 − 1)! × 1!)] × (0.5)^1 × (0.5)^2 = 0.375


➢ The graph of X ∼ B(3, 0.5)
PRACTICE QUESTION
Q.An agent sells life insurance policies to five equally aged, healthy people. According
to recent data, the probability of a person living in these conditions for 30 years or more
is 2/3. Calculate the probability that after 30 years:

1. All five people are still living.

2. At least three people are still living.

3. Exactly two people are still living.

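Solution: Let X = the number of people still living after 30 years. Then X ∼ B(5, 2/3),
with p = 2/3 and q = 1/3.

1. All five people are still living.
P(X = 5) = (2/3)^5 = 32/243 ≈ 0.132

2. At least three people are still living.
P(X ≥ 3) = P(X = 3) + P(X = 4) + P(X = 5) = 80/243 + 80/243 + 32/243 = 192/243 ≈ 0.790

3. Exactly two people are still living.
P(X = 2) = [5! / (2! × 3!)] × (2/3)^2 × (1/3)^3 = 40/243 ≈ 0.165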
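
The three probabilities asked for in this practice question can be checked with exact fractions. The sketch below assumes X ∼ B(5, 2/3), as stated in the question; the helper name pmf is made up for this example.

```python
from fractions import Fraction
from math import comb

n = 5                  # five insured people
p = Fraction(2, 3)     # probability that one person lives 30 years or more


def pmf(k):
    """Exact P(X = k) for X ~ B(5, 2/3); pmf is a hypothetical helper name."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


all_five = pmf(5)                                  # part 1
at_least_three = sum(pmf(k) for k in (3, 4, 5))    # part 2
exactly_two = pmf(2)                               # part 3

for name, value in [("all five", all_five),
                    ("at least three", at_least_three),
                    ("exactly two", exactly_two)]:
    print(f"{name}: {value} = {float(value):.3f}")
```

Using Fraction keeps the results as exact ratios; note that it prints them in lowest terms.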


MODULE-2

Theoretical distributions – Binomial, Poisson, Normal

1.Normal distributions (Bell Curve or Gaussian distribution)


The normal distribution has two parameters, the mean (μ) and the standard deviation
(σ). If X is a quantity to be measured that has a normal distribution with mean (μ) and
standard deviation (σ), we write:

Notation: X ∼ N(μ, σ)

Any distribution is known as Normal distribution if it has the following characteristics:

1. The mean, median and mode of the distribution coincide.


2. The curve of the distribution is bell-shaped and symmetrical about the line
x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the center and the other half to
the right.

• A normal distribution is continuous, whereas a binomial distribution is discrete. However, as the number of
trials becomes very large, the shape of the binomial distribution becomes quite similar to the normal curve.

Many quantities closely follow a normal distribution:

• Heights of people
• Blood pressure
• Marks on a test
The PDF of a random variable X following a normal distribution is given by:

f(x) = [1 / (σ √(2π))] × e^(−(x − μ)^2 / (2σ^2)),   −∞ < x < ∞

The mean and variance of a random variable X which is said to be normally distributed
are given by:

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

• A standard normal distribution is defined as the normal distribution with
mean 0 and standard deviation 1.

z-score: The z-score tells how many standard deviations the value x is
above (to the right of) or below (to the left of) the mean, μ.

z = (x − μ) / σ

E.g. X ~ N(5, 6).
mean μ = 5
standard deviation σ = 6. Suppose x = 17.
z = (x − μ) / σ = (17 − 5) / 6 = 2
This means that x = 17 is two standard deviations (2σ) above or to the right of the
mean μ = 5.
Q. What is the z-score of x, when x = 1 and X ~ N(12, 3)?
Hint: z = (1 − 12) / 3 ≈ −3.67

Q.The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170
cm with a standard deviation of 6.28 cm. Male heights are known to follow a normal
distribution. Let X = the height of a 15 to 18-year-old male from Chile in 2009 to 2010.
Then X ~ N(170, 6.28).
a. Suppose a 15 to 18-year-old male from Chile was 168 cm tall from 2009 to 2010.
The z-score when x = 168 cm is z = _______. This z-score tells you that x = 168 is
________ standard deviations to the ________ (right or left) of the mean _____ (What
is the mean?).
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010
has a z-score of z = 1.27. What is the male’s height? The z-score (z = 1.27) tells you
that the male’s height is ________ standard deviations to the __________ (right or left)
of the mean.
Solution: a. –0.32, 0.32, left, 170, b. 177.98, 1.27, right
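
The z-score calculations above go in both directions: from a value x to its z-score, and from a z-score back to x. A minimal sketch follows; the function names z_score and value_from_z are assumptions made for this example, and the numbers are the ones used in the worked questions.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def value_from_z(z, mu, sigma):
    """Invert the z-score formula: x = mu + z * sigma."""
    return mu + z * sigma


# X ~ N(5, 6), x = 17
print(z_score(17, 5, 6))

# X ~ N(12, 3), x = 1
print(round(z_score(1, 12, 3), 2))

# Chile heights, X ~ N(170, 6.28)
print(round(z_score(168, 170, 6.28), 2))        # part a
print(round(value_from_z(1.27, 170, 6.28), 2))  # part b, height in cm
```

A negative z-score means the value lies to the left of the mean, and a positive one means it lies to the right.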
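EMPIRICAL RULE (68%-95%-99.7%)
For a normal distribution, about 68% of the values lie within 1 standard deviation of the mean,
about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
The intervals below are consistent with X ~ N(75, 7), that is, μ = 75 and σ = 7.
a) P(68 <= X <= 82) = 68.268%
b) P(61 <= X <= 89) = ?
c) P(54 <= X <= 75) = ?
d) P(X >= 96) = 0.135%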
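
The empirical-rule probabilities can be computed from the normal CDF. The sketch below uses statistics.NormalDist from the Python standard library (Python 3.8+). The mean 75 and standard deviation 7 are an assumption inferred from the given answers to parts (a) and (d), not values stated in the notes, and prob_between is a helper name made up for this example.

```python
from statistics import NormalDist

# Assumption: mu = 75 and sigma = 7, inferred from parts (a) and (d).
X = NormalDist(mu=75, sigma=7)


def prob_between(dist, low, high):
    """P(low <= X <= high) for a continuous distribution with a cdf method."""
    return dist.cdf(high) - dist.cdf(low)


p_a = prob_between(X, 68, 82)   # within 1 standard deviation of the mean
p_b = prob_between(X, 61, 89)   # within 2 standard deviations of the mean
p_c = prob_between(X, 54, 75)   # from 3 standard deviations below up to the mean
p_d = 1 - X.cdf(96)             # more than 3 standard deviations above the mean

for label, prob in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(f"{label}) {prob:.3%}")
```

Because the curve is symmetrical about x = μ, part (c) is half of the probability of lying within 3 standard deviations of the mean.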
A skewed data distribution is one in which the data are more spread out on one side
than on the other.
➢ Right Skewed

• The data is right skewed
• The distribution is positively skewed
• Mean > Median
• The right tail is longer and contains the extreme values

➢ Left Skewed

• The data is left skewed
• The distribution is negatively skewed
• Mean < Median
• The left tail is longer and contains the extreme values

Q. Give examples of a left skewed distribution and a right skewed distribution. What is the relation between
the mean, median and mode in these distributions?

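Kurtosis
Kurtosis measures whether the data are concentrated toward the center or toward the tails,
compared with a normal distribution.
Kurtosis describes the shape of a probability distribution.
Platykurtic: negative excess kurtosis (flatter peak and lighter tails than the normal curve).
Mesokurtic: the kurtosis of a normal distribution curve (excess kurtosis 0).
Leptokurtic: positive excess kurtosis (sharper peak and heavier tails than the normal curve).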
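
Skewness and kurtosis can be estimated from data using central moments. The notes do not give formulas, so the moment-based definitions below (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 − 3) are an assumption chosen for this sketch, and both data sets are made up purely for illustration.

```python
from statistics import mean, median


def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k over the data."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: positive = right skewed, negative = left skewed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Kurtosis minus 3, so that a normal (mesokurtic) curve scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


# Hypothetical data sets, made up for this example.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print(mean(right_skewed) > median(right_skewed))  # Mean > Median when right skewed
print(skewness(right_skewed) > 0)                 # positively skewed
print(skewness(symmetric))                        # no skew in symmetric data
print(excess_kurtosis(symmetric) < 0)             # flat data is platykurtic
```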
Poisson Probability Distribution is a discrete probability distribution that expresses
the probability of a given number of events occurring in a fixed interval of time or space
if these events occur with a known constant mean rate and independently of the time
since the last event.

Examples are:

1. The number of emergency calls recorded at a hospital in a day.


2. The number of thefts reported in an area on a day.
3. The number of customers arriving at a salon in an hour.

Notation for the Poisson: P=Poisson Probability Distribution Function

X∼P(μ)

X is a random variable with a Poisson distribution. The parameter is μ (or λ)= the mean
for the interval of interest.

The formula of the Poisson distribution is:

P(X = x) = (μ^x × e^(−μ)) / x!,   x = 0, 1, 2, …

where,

• e is Euler's number, e ≈ 2.718
• x is the number of occurrences
• x! is the factorial of x
• μ is equal to the expected value or mean.

The variance is σ^2 = μ, and the standard deviation is σ = √μ


Practice question
Q1. A hospital switchboard receives an average of 4 emergency calls in a 10-minute
interval. What is the probability that there are at most 2 emergency calls in a 10-minute interval?
μ = 4
P[X ≤ 2] = P[X = 0] + P[X = 1] + P[X = 2] = e^(−4) × (1 + 4 + 8) = 13 e^(−4) ≈ 0.2381

Q2. A life insurance salesman sells on average 3 life insurance policies per
week. Use Poisson's law to calculate the probability that in a given week he
will sell
a. Some policies
b. 2 or more policies but less than 5 policies.
c. Assuming that there are 5 working days per week, what is the
probability that in a given day he will sell one policy?

Here, μ = 3

(a) "Some policies" means "1 or more policies".

P(X > 0) = 1 − P(X = 0) = 1 − e^(−3) = 0.95021

(b) The probability of selling 2 or more, but less than 5 policies is:
P(2 ≤ X < 5) = P(X = 2) + P(X = 3) + P(X = 4) ≈ 0.61611
(c) Average number of policies sold per day: μ = 3/5 = 0.6
So on a given day, P(X = 1) = 0.6 × e^(−0.6) = 0.32929

Q3. A transmission channel has a per-digit error probability p = 0.01. Calculate the
probability of more than 1 error in 10 received digits.
μ = np = 10 × 0.01 = 0.1
P[X > 1] = 1 − (P[X = 0] + P[X = 1]) = 1 − e^(−0.1) × (1 + 0.1) ≈ 0.00468

• When P(μ) is used to approximate a binomial distribution, μ = np, where n represents
the number of independent trials and p represents the probability of success in a
single trial.
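
The Poisson practice questions above can be checked numerically. The sketch below uses only the Python standard library; the function names poisson_pmf and binomial_pmf are assumptions made for this example. The last lines compare the Poisson answer for Q3 with the exact binomial calculation, to illustrate the approximation μ = np.

```python
from math import comb, exp, factorial


def poisson_pmf(x, mu):
    """P(X = x) for X ~ P(mu); poisson_pmf is a hypothetical helper name."""
    return mu**x * exp(-mu) / factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p), used only for the comparison in Q3."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Q1: at most 2 emergency calls, mu = 4
q1 = sum(poisson_pmf(x, 4) for x in range(3))

# Q2: mu = 3 policies per week
q2a = 1 - poisson_pmf(0, 3)                        # some (1 or more) policies
q2b = sum(poisson_pmf(x, 3) for x in (2, 3, 4))    # 2 or more but less than 5
q2c = poisson_pmf(1, 3 / 5)                        # exactly one policy in a day

# Q3: more than 1 error in 10 digits, mu = np = 10 * 0.01
n, p = 10, 0.01
q3_poisson = 1 - (poisson_pmf(0, n * p) + poisson_pmf(1, n * p))
q3_binomial = 1 - (binomial_pmf(0, n, p) + binomial_pmf(1, n, p))

for name, value in [("Q1", q1), ("Q2a", q2a), ("Q2b", q2b), ("Q2c", q2c),
                    ("Q3 Poisson", q3_poisson), ("Q3 binomial", q3_binomial)]:
    print(f"{name}: {value:.5f}")
```

The two Q3 values are close but not identical, since the Poisson distribution is only an approximation to the binomial here.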
