DS Mod 1 To 2 Complete Notes
Introduction to data science, Different sectors of using data science, Purpose and components of Python,
Data Analytics processes, Exploratory data analytics, Quantitative technique and graphical technique,
Data types for plotting.
Data science is a new discipline that combines aspects of statistics, mathematics, programming,
and visualization to turn data into information.
Data science covers the extraction, preparation, analysis, visualization, and maintenance
of information. It is a cross-disciplinary field that uses scientific methods and
processes to draw insights from data.
Data Scientists collect, explore, analyze, and visualize data. They apply mathematical and
statistical models to find patterns and solutions in the data.
• Modern tools and technologies have made data processing and analytics faster and
more efficient. Examples include data processing tools, the Python language, and application design.
➢ These technologies help Data Scientists to:
• Build and train machine learning models
• Manipulate data with technology
• Build data tools, applications, and services
• Extract information from data
1. Descriptive Analytics
Descriptive analytics analyzes data on past events for insight into how to approach future events.
It examines past performance by mining historical data to understand the causes of past success or
failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses
this type of analysis.
Common examples of descriptive analytics are company reports that provide historical reviews, such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
Example: At a retailer such as D-mart, we can look at the sales history of products to find which
products have sold more or are in high demand. Based on this analysis of sales trends, we can
decide to stock those items in larger quantities for the coming period.
2. Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to
determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics
draws on a variety of statistical techniques from modeling, machine learning, data mining, and game
theory that analyze current and historical facts to make predictions about future events.
Example: Amazon and Netflix recommendation systems
3. Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and
machine learning to make a prediction and then suggests decision options to take advantage of the prediction.
Example: Google's self-driving car
4. Diagnostic Analytics
Diagnostic analytics relies mainly on historical data to answer a question or solve a problem.
We look for dependencies and patterns in the historical data of the particular problem.
The data analysis process consists of the following phases, which are iterative in nature:
1. Business Problem: The process of analytics begins with questions or business problems.
Examples of questions are:
• Who are the customers?
• Why are sales going down?
• How should the inventory be managed?
• Why is the system not scaling up with increasing traffic volume?
Business problems of this kind trigger the need to analyze data and find answers.
2. Data Acquisition: The process of collecting data from various data sources such as RDBMS,
NoSQL databases, and web server logs, as well as from the web through scraping and web APIs.
3. Data Wrangling
Data Wrangling: Challenges
Causes of challenges in the data wrangling phase:
• Unexpected data format
• Erroneous data
• Voluminous data to be manipulated
• Classifying data into linear or clustered
• Determining relationship between observation, feature, and response
Data Wrangling includes:
1. Data cleansing
2. Data manipulation
Data cleansing
The processed and organized data may be incomplete, contain duplicates, or
contain errors. Data cleansing is the process of preventing and correcting these errors.
Data manipulation
Data manipulation techniques such as transform, aggregate, groupby, reshape, and merge
transform the data and make it available for exploratory data analysis.
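The two wrangling steps can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the sales records, the column names, and the helper names `cleanse` and `total_by` are all hypothetical, and in practice a library such as pandas would usually do this work.

```python
from collections import defaultdict

# Hypothetical raw sales records: one exact duplicate and one missing amount.
raw_rows = [
    {"store": "A", "product": "rice", "amount": 120.0},
    {"store": "A", "product": "rice", "amount": 120.0},   # exact duplicate
    {"store": "B", "product": "rice", "amount": None},    # missing value
    {"store": "B", "product": "soap", "amount": 40.0},
    {"store": "A", "product": "soap", "amount": 60.0},
]

def cleanse(rows):
    """Data cleansing: drop exact duplicates and rows with a missing amount."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["store"], row["product"], row["amount"])
        if row["amount"] is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

def total_by(rows, column):
    """Data manipulation: group rows by one column and sum the amounts."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row["amount"]
    return dict(totals)

clean_rows = cleanse(raw_rows)
print(total_by(clean_rows, "store"))
print(total_by(clean_rows, "product"))
```

Dropping incomplete rows is only one possible cleansing policy; filling in a default or an average is another, and the right choice depends on the data.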
4. Data Exploration
Data Exploration includes:
1. Data discovery
2. Data pattern
➢ Data exploration uses all the available data and presents it in either numerical
or graphical format.
➢ This helps to identify the right patterns in the data. The data and the underlying
patterns are fed into an appropriate machine learning model, leading directly to
the conclusion or prediction phase.
Exploratory Data Analysis (EDA)
• APPROACH: EDA studies the data to recommend suitable models that best fit
the data.
• FOCUS: The focus is on the data: its structure, outliers, and the models suggested by the data.
• ASSUMPTIONS: EDA techniques make minimal or no assumptions.
They present and show all the underlying data without any data loss.
Quantitative: Provides numeric output for the input data.
Graphical: Uses statistical functions to produce graphical output.
Histograms and scatter plots are two popular graphical techniques for depicting data.
A histogram graphically summarizes the distribution of a univariate data set.
It shows:
• the center or location of data (mean, median, or mode)
• the spread of data
• the skewness of data
• the presence of outliers
• the presence of multiple modes in the data
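What a histogram computes can be sketched without a plotting library: count the values that fall in each bin, then read the shape from the counts. The scores, the bin width, and the helper name `bin_counts` below are hypothetical choices made for illustration.

```python
import statistics

# Hypothetical exam scores; 95 sits far from the rest of the data.
scores = [52, 55, 58, 60, 61, 63, 63, 65, 67, 70, 72, 95]

def bin_counts(values, start, width, n_bins):
    """Count the values in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

counts = bin_counts(scores, start=50, width=10, n_bins=5)

# Text rendering of the histogram: one row of '#' marks per bin.
for i, count in enumerate(counts):
    low = 50 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")

# Center of the data; a mean above the median hints at a longer right tail.
print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
```

The empty bin followed by a single count in the last bin is how an outlier shows up in the counts.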
A scatter plot represents the relationship between two variables. It can answer these questions:
• Are variables X and Y related?
• Are variables X and Y linearly related?
• Are variables X and Y non-linearly related?
• Does the variation in Y change depending on X?
• Are there outliers?
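The question of whether X and Y are linearly related can also be checked numerically with the Pearson correlation coefficient. This is a sketch on made-up data; `pearson_r`, `y_linear`, and `y_curved` are illustrative names, and a coefficient near zero rules out only a linear relationship, not a non-linear one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5, 6]
y_linear = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]     # roughly y = 2x
y_curved = [(v - 3.5) ** 2 for v in x]         # U-shaped, clearly related to x

print(round(pearson_r(x, y_linear), 3))   # close to 1: strong linear relation
print(round(pearson_r(x, y_curved), 3))   # close to 0 despite the clear pattern
```

This is why the scatter plot itself still matters: the U-shaped data is strongly related to X, yet the coefficient alone would suggest no relationship.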
Q.Draw scatter plot:
Conclusion or Prediction
• This step involves reaching a conclusion and making predictions based on the data
• Involves heavy use of mathematical and statistical functions
• Requires model selection, training, and testing to help in forecasting
• Is called machine learning as data analysis is fully or semi-automated with minimal or
no human intervention
➢ Hypothesis
Hypothesis building begins in the data exploration stage but matures in the
conclusion or prediction phase. Hypothesis building uses feature
engineering and models.
Hypothesis testing evaluates a hypothesis: a premise or claim that we want to test.
Features of plotting:
• Plotting is like telling a story about data using different colors, shapes, and sizes.
• Plotting shows the relationship between variables.
• Example:
o A change in the value of Y results in a change in the value of X
o X is independent of Y
Types of Plot
Different data types can be visualized using various plotting techniques.
• Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on
a day-to-day basis because they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which
users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and
processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly; for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
1. Volume: Volume refers to the vast growth in the amount of data. This is evident from the fact that more
than 90% of the data we encounter was produced recently. In fact, more than 2.5 quintillion (2.5 × 10^18) bytes
have been created daily since as early as 2013, from every post, share, search, click, stream, and many other
data producers. This huge volume poses a challenge to storage capacity, but the challenge is eased by
advanced storage technologies and the falling cost of computer storage.
The real challenge is analyzing such vast data islands, given the heterogeneous nature
of the data.
2. Velocity: Velocity represents the accumulation of data at high speed, in real time or near real time, from
dissimilar data sources. In Big Data, data flows in from sources like machines, networks, social media,
mobile phones, etc. There is a massive and continuous flow of data. Velocity determines the potential of the data:
how fast the data is generated and processed to meet demand.
Example: There are more than 3.5 billion searches per day on Google. Also, the number of Facebook users is
increasing by approximately 22% year over year.
3. Variety: Variety involves collecting data from various sources and in fuzzy and heterogeneous types.
This includes importing data in dissimilar formats, namely structured (tables residing in relational databases
such as an RDBMS), semi-structured (e-mail, XML, JSON, and other markup languages), and unstructured (text,
pictures, audio files, video, sensor data, etc.).
4. Veracity: Veracity refers to the provenance, accuracy, and correctness of data. The factors that ensure the
veracity of Big Data are as follows:
• trustworthiness of the data origin
• reliability and security of the data store
• data availability
• correctness
• consistency
5. Value: Value represents the outcome of Big Data analysis. Value is an essential characteristic of
Big Data. What matters is not merely the data that we process or store, but the valuable and reliable data
that we store, process, and analyze. A bulk of data with no value is of no use to the company unless it is
turned into something useful.
Q. What is an EDA technique?
Most EDA techniques are graphical in nature, with a few quantitative
techniques, and they also suggest models that best fit the data. They use the entire
data set with minimal or no assumptions.
Q.Differentiate between Univariate, Bivariate, and Multivariate analysis?
Introduction to statistics, statistical and non-statistical analysis, major categories of statistics, population
and sample, Measure of central tendency and dispersion, Moments, Skewness and kurtosis, Correlation
and regression, Theoretical distributions – Binomial, Poisson, Normal
Introduction to Statistics
Statistics deals with methods for the collection, classification, and analysis of numerical
data for drawing valid conclusions and making reasonable decisions.
➢ The field of statistics has an influence over all domains of life: the stock market,
life sciences, weather, retail, insurance, and education.
Types of Analysis
1. Quantitative analysis and qualitative analysis (on the basis of data)
2. Descriptive analysis and inferential analysis (on the basis of tool)
3. Univariate, bivariate, and multivariate analysis (on the basis of variable)
Example: When you purchase a coffee from a coffee shop, it is available in Short, Tall, and
Grande sizes. This is an example of qualitative analysis. But if a store sells 70 regular
coffees a week, it is quantitative analysis.
Terminologies in Statistics
• The population is the set of sources from which data has to be collected.
• A sample is a subset of the population.
• A variable is any characteristic, number, or quantity that can be measured or counted.
• Statistics are quantitative values calculated from the sample.
• Parameters are the characteristics of the population.
Categories in Statistics
There are two main categories in Statistics, namely:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
Descriptive statistics uses the data to describe the population,
either through numerical calculations or through graphs and tables.
➢ Descriptive statistics helps organize data and focuses on the characteristics of
the data, providing parameters.
Example: To study the average height of students in a classroom using descriptive
statistics, record the heights of all students in the class and then find the
maximum, minimum, and average height of the class.
Inferential Statistics
Inferential statistics makes inferences and predictions about a population based
on a sample of data taken from that population.
➢ Inferential statistics generalizes to a large data set and applies probability to arrive at a
conclusion. It allows us to infer population parameters from sample statistics
and to build models on them.
Example: Consider the same example of finding the average height of students
in a class. In inferential statistics, take a sample of the class, that is,
a few students from the entire class, and group them into tall, average,
and short. In this method, we build a statistical model and extend it to the
entire population of the class.
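The classroom-height example can be sketched in Python to contrast the two categories. The heights, the sample size of 5, and the seed are hypothetical values chosen only so that the run is repeatable.

```python
import random
import statistics

# Hypothetical heights (cm) of all 12 students in a class: the population.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 168, 170, 172, 176]

# Descriptive statistics: summarize every value that was recorded.
print("max:", max(heights), "min:", min(heights))
print("class mean:", statistics.mean(heights))

# Inferential statistics: estimate the class mean from a sample of 5 students.
rng = random.Random(0)            # fixed seed so the run is repeatable
sample = rng.sample(heights, 5)   # sampling without replacement
sample_mean = statistics.mean(sample)
print("sample:", sample, "estimated mean:", sample_mean)
```

The sample mean is a statistic and the class mean is a parameter; the estimate will generally differ from the true value, and the gap tends to shrink as the sample grows.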
$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{N}$
3. Standard Deviation: The square root of the variance is the standard
deviation. It describes how concentrated the data are around the mean of the
data set.
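As a quick check of these definitions, the sketch below computes the measures of central tendency and dispersion for the values 3, 5, 6, 9, 10 used in the exercise that follows. It assumes the population form of the variance (division by N); `statistics.variance` would give the sample form (division by N - 1) instead.

```python
import statistics

values = [3, 5, 6, 9, 10]

mean = statistics.mean(values)
median = statistics.median(values)
value_range = max(values) - min(values)
variance = statistics.pvariance(values)   # population variance: divides by N
std_dev = statistics.pstdev(values)       # square root of the variance

print("mean:", mean)
print("median:", median)
print("range:", value_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 3))
```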
Q. {3,5,6,9,10} are the values in a dataset.
Q. For this dataset, which measures of central tendency and dispersion are better?
Descriptive statistics summarize a distribution through its central tendency, dispersion, skewness, and kurtosis.
Moments
Let $x_1, x_2, \dots, x_n$ be the values of a variable $x$ with corresponding frequencies $f_1, f_2, \dots, f_n$, and let $N = \sum f_i$.
Moments about the mean: the r-th moment about the mean $\bar{x}$ is
$\mu_r = \frac{1}{N} \sum f_i (x_i - \bar{x})^r$
so that $\mu_0 = 1$, $\mu_1 = 0$, and $\mu_2 = \sigma^2$.
Moments about any point A: the r-th moment about an arbitrary point $A$ is
$\mu'_r = \frac{1}{N} \sum f_i (x_i - A)^r$
Relation between moments about the mean and moments about any point:
$\mu_2 = \mu'_2 - (\mu'_1)^2$
$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3$
$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4$
Moments about the origin: the r-th moment about the origin is denoted by $\nu_r$ and defined as
$\nu_r = \frac{1}{N} \sum f_i x_i^r$
so that $\nu_1 = \frac{1}{N} \sum f_i x_i$, $\nu_2 = \frac{1}{N} \sum f_i x_i^2$, and $\nu_3 = \frac{1}{N} \sum f_i x_i^3$.
Skewness and Kurtosis
Skewness means lack of symmetry, or lopsidedness, in a distribution. A distribution is said to be skewed when its mean, median, and mode do not coincide.
Measures of skewness:
1. Karl Pearson's coefficient: $S_k = \frac{\text{Mean} - \text{Mode}}{\sigma}$, or equivalently $S_k = \frac{3(\text{Mean} - \text{Median})}{\sigma}$
2. Bowley's coefficient (based on quartiles): $S_k = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
3. Kelly's coefficient (based on deciles): $S_k = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1}$
4. Moment coefficient of skewness: $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$
Kurtosis is the degree of peakedness of a distribution relative to the normal curve. It is measured by
$\beta_2 = \frac{\mu_4}{\mu_2^2}$
• If $\beta_2 = 3$, the curve is normal and is called mesokurtic.
• If $\beta_2$ is greater than 3, the curve is more peaked than the normal curve and is said to be leptokurtic.
• If $\beta_2$ is less than 3, the curve is flatter than the normal curve and is said to be platykurtic.
Q. Suppose Nancy has classes three days a week. She attends classes three days a
week 80% of the time, two days 15% of the time, one day 4% of the time, and no days 1% of
the time. Suppose one week is randomly selected.
a. Let X = the number of days Nancy ____________________.
b. X takes on what values?
c. Suppose one week is randomly chosen. Construct a probability distribution
table (called a PDF table). What does the P(x) column sum to?
Solution: a. Let X = the number of days Nancy attends class per week.
b. X takes on the values 0, 1, 2, 3.
c. PDF table:
x    P(x)
0    0.01
1    0.04
2    0.15
3    0.80
The P(x) column sums to 1.
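The table can be checked with a few lines of Python. The expected number of days is an extra quantity not asked for in the question; it is included only to show what the table makes possible.

```python
import math

# Probability distribution of X = number of days Nancy attends class in a week.
pdf = {0: 0.01, 1: 0.04, 2: 0.15, 3: 0.80}

total = sum(pdf.values())                             # must equal 1 for a valid PDF
expected_days = sum(x * p for x, p in pdf.items())    # mean of X

# Compare with a tolerance because the probabilities are floating-point values.
print("valid distribution:", math.isclose(total, 1.0))
print("expected days per week:", round(expected_days, 2))
```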
1. Binomial Distribution
A distribution in which only two outcomes are possible, such as success or failure,
gain or loss, win or lose, and in which the probabilities of success and failure are the same
for all trials, is called a binomial distribution.
Notation: X ∼ B(n, p)
B = binomial probability distribution function
X is a random variable with a binomial distribution.
The parameters are n and p: n = number of trials, p = probability of a success
on each trial.
$P(X = x) = \binom{n}{x} p^x q^{n-x}$
where x = number of successes,
n = number of trials, p = probability of a success on each trial,
q = probability of a failure on each trial = 1 - p.
Q. Suppose you play a game that you can only either win or lose. The probability that
you win any game is 55%, and the probability that you lose is 45%. Each game you play
is independent. If you play the game 20 times, write the function that describes the
probability that you win 15 of the 20 times.
Solution:
Here, if you define X as the number of wins, then X takes on the values 0, 1, 2, 3, …, 20.
The probability of a success is p = 0.55.
The probability of a failure is q = 0.45.
The number of trials is n = 20.
The probability question can be stated mathematically as P(X = 15).
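P(X = 15) can be evaluated with the standard binomial probability formula, $\binom{n}{x} p^x (1-p)^{n-x}$. The function name `binomial_pmf` is an illustrative choice, not something defined in these notes.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): comb(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Probability of winning exactly 15 of 20 independent games with p = 0.55.
p_15_wins = binomial_pmf(15, 20, 0.55)
print(round(p_15_wins, 4))

# Sanity check: the probabilities over x = 0..20 should add up to 1.
print(round(sum(binomial_pmf(x, 20, 0.55) for x in range(21)), 10))
```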
Q. If you flip a fair coin three times, what is the probability of getting exactly one
head, and is this a binomial experiment?
3. A total of n identical trials are conducted: n = 3.
4. The probability of success and failure is the same for all trials: p = 0.5.
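For so few trials the answer can also be found by listing every outcome, which is a useful check on the binomial formula. This sketch assumes a fair coin, so all eight sequences are equally likely.

```python
from itertools import product

# All sequences of three flips: HHH, HHT, ..., TTT.
outcomes = list(product("HT", repeat=3))

# Keep the sequences with exactly one head.
one_head = [o for o in outcomes if o.count("H") == 1]

probability = len(one_head) / len(outcomes)
print(one_head)
print(probability)
```

The three favourable sequences out of eight correspond to the binomial count of ways to place one head among three flips.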
Notation: X ∼ N(μ, σ)
A normal distribution is very different from a binomial distribution. However, as the number of
trials approaches infinity, the shapes become quite similar.
The mean and variance of a normally distributed random variable X
are given by:
$E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$
The z-score of a value x is:
$z = \frac{x - \mu}{\sigma}$
E.g. X ~ N(5, 6):
mean μ = 5
standard deviation σ = 6. Suppose x = 17. Then $z = \frac{17 - 5}{6} = 2$.
This means that x = 17 is two standard deviations (2σ) above, or to the right of, the
mean μ = 5.
Q. What is the z-score of x when x = 1 and X ~ N(12, 3)?
Q.The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170
cm with a standard deviation of 6.28 cm. Male heights are known to follow a normal
distribution. Let X = the height of a 15 to 18-year-old male from Chile in 2009 to 2010.
Then X ~ N(170, 6.28).
a. Suppose a 15 to 18-year-old male from Chile was 168 cm tall from 2009 to 2010.
The z-score when x = 168 cm is z = _______. This z-score tells you that x = 168 is
________ standard deviations to the ________ (right or left) of the mean _____ (What
is the mean?).
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010
has a z-score of z = 1.27. What is the male’s height? The z-score (z = 1.27) tells you
that the male’s height is ________ standard deviations to the __________ (right or left)
of the mean.
Solution:
a. -0.32, 0.32, left, 170
b. 177.98, 1.27, right
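The z-score calculations in this section can be reproduced with two small helper functions. The names `z_score` and `value_from_z` are illustrative; the second parameter of N(μ, σ) is taken to be the standard deviation, as in the examples.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Invert the z-score: recover x from z."""
    return mean + z * std_dev

print(z_score(17, 5, 6))                          # X ~ N(5, 6), x = 17
print(round(z_score(1, 12, 3), 2))                # X ~ N(12, 3), x = 1
print(round(z_score(168, 170, 6.28), 2))          # Chile heights, part a
print(round(value_from_z(1.27, 170, 6.28), 2))    # Chile heights, part b
```

A negative result means the value lies to the left of the mean; a positive result means it lies to the right.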
EMPIRICAL RULE(68%-95%-99.7%)
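The three percentages of the empirical rule are the areas under a normal curve within 1, 2, and 3 standard deviations of the mean. A minimal check using `statistics.NormalDist` from the Python standard library (Python 3.8 or later) is shown below; the helper name `within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k) * 100:.1f}%")
```

The same approach answers interval questions for any normal variable: build `NormalDist(mu, sigma)` with the given mean and standard deviation and subtract the two `cdf` values at the interval's endpoints.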
c)P( 54<=X<=75)=?
A skewed data distribution is one that tends to be more
spread out on one side.
Right Skewed
Q. Give examples of left-skewed and right-skewed distributions. What is the relation between the
mean, median, and mode in these distributions?
Kurtosis measures the tendency of the data toward the center or toward the tail.
Kurtosis describes the shape of a probability distribution.
Platykurtic is negative kurtosis.
Mesokurtic represents a normal distribution curve.
Leptokurtic is positive kurtosis.
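Skewness and kurtosis can both be computed from the central moments of the data. The sketch below uses the signed moment coefficient of skewness, $\mu_3 / \mu_2^{3/2}$, and the kurtosis measure $\beta_2 = \mu_4 / \mu_2^2$, which equals 3 for a normal curve. The data sets, the function names, and the `tolerance` used for labelling are hypothetical choices.

```python
def central_moment(values, r):
    """r-th moment about the mean: the average of (x - mean)^r."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** r for v in values) / len(values)

def moment_skewness(values):
    """Signed skewness mu3 / mu2^(3/2): positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis_beta2(values):
    """beta2 = mu4 / mu2^2; subtracting 3 gives the excess kurtosis."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def classify(beta2, tolerance=0.1):
    """Label a curve by comparing beta2 with 3 (tolerance is an arbitrary choice)."""
    if beta2 > 3 + tolerance:
        return "leptokurtic"
    if beta2 < 3 - tolerance:
        return "platykurtic"
    return "mesokurtic"

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

for data in (symmetric, right_tailed):
    beta2 = kurtosis_beta2(data)
    print(data, round(moment_skewness(data), 3), round(beta2, 3), classify(beta2))
```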
Poisson Probability Distribution is a discrete probability distribution that expresses
the probability of a given number of events occurring in a fixed interval of time or space
if these events occur with a known constant mean rate and independently of the time
since the last event.
examples are:
X is a random variable with a Poisson distribution. The parameter is μ (or λ) = the mean
for the interval of interest.
$P(X = x) = \frac{e^{-\mu} \mu^x}{x!}$
where x = 0, 1, 2, ... is the number of events in the interval.
Q2. A life insurance salesman sells on average 3 life insurance policies per
week. Use Poisson's law to calculate the probability that in a given week he
will sell
a. Some policies
b. 2 or more policies but fewer than 5 policies.
c. Assuming that there are 5 working days per week, what is the
probability that on a given day he will sell one policy?
Here, μ = 3.
(a) The probability of selling some policies is: $P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-3} \approx 0.95021$
(b) The probability of selling 2 or more, but fewer than 5, policies is: $P(2 \le X \le 4) = P(X = 2) + P(X = 3) + P(X = 4) \approx 0.61611$
(c) Average number of policies sold per day: μ = 3/5 = 0.6
So on a given day, $P(X = 1) = 0.6\,e^{-0.6} \approx 0.32929$
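The three parts of this question can be reproduced with the standard Poisson probability formula, $e^{-\mu} \mu^x / x!$. The function and variable names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson variable with mean mu."""
    return exp(-mu) * mu**x / factorial(x)

weekly_mean = 3

# (a) "Some policies" means at least one: the complement of selling none.
p_some = 1 - poisson_pmf(0, weekly_mean)

# (b) Two or more but fewer than five: x = 2, 3, or 4.
p_2_to_4 = sum(poisson_pmf(x, weekly_mean) for x in (2, 3, 4))

# (c) The mean scales with the interval: 3 per week over 5 days is 0.6 per day.
daily_mean = weekly_mean / 5
p_one_in_a_day = poisson_pmf(1, daily_mean)

print(round(p_some, 5), round(p_2_to_4, 5), round(p_one_in_a_day, 5))
```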
Q3. A transmission channel has a per-digit error probability p = 0.01. Calculate the
probability of more than 1 error in 10 received digits.
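One way to approach this question, assuming the 10 received digits are in error independently of one another, is to treat the number of errors as binomial with n = 10 and p = 0.01 and take the complement of "0 or 1 error". Since the question sits in the Poisson section, the Poisson approximation with μ = np = 0.1 is shown alongside for comparison. The variable names are illustrative.

```python
from math import comb, exp

n, p = 10, 0.01

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Exact binomial answer: P(X > 1) = 1 - P(X = 0) - P(X = 1).
p_more_than_one = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)

# Poisson approximation with mean mu = n * p.
mu = n * p
poisson_estimate = 1 - exp(-mu) * (1 + mu)

print(round(p_more_than_one, 6))
print(round(poisson_estimate, 6))
```

The two results differ slightly because the Poisson form is only an approximation to the binomial, one that improves as n grows and p shrinks.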