Data Science
and Big Data Analytics
Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering,
Pune.
TECHNICAL
PUBLICATIONS
SINCE 1993 An Up-Thrust for Knowledge
Published by :
TECHNICAL PUBLICATIONS
Amit Residency, Office No.1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA, Ph.: +91-020-24495496/97
Email : [email protected] Website : www.technicalpublications.org
Printer :
Yogiraj Printers & Binders
Sr.No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.
ISBN 978-93-91567-92-7
SPPU 19
[Figure: Characteristics and sources of big data. Data volume: machine data, sensor data, mobile networks, geographical information systems and geo-spatial data. Data velocity: social media; Amazon, Facebook, Yahoo, Google (Web based companies). Data variety: structured, unstructured, semi-structured.]
[Figure: Data science and its components: business understanding, data mining, data exploration, data visualization, feature engineering, predictive modeling, data cleaning.]
[Figure: Printed books, videos, photos and drawings converted into digital (binary) data.]
[Figure: Types of data: structured and unstructured.]
[Figure: Attributes: qualitative/categorical and quantitative/numeric.]
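Min-max normalization of a value $V$ of attribute $A$:

$$V' = \frac{V - \text{Min}_A}{\text{Max}_A - \text{Min}_A}\,(\text{New-Max}_A - \text{New-Min}_A) + \text{New-Min}_A$$

For marks:

$$V' = \frac{V - \text{Min marks}}{\text{Max marks} - \text{Min marks}}\,(\text{New max} - \text{New min}) + \text{New min}$$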
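For marks 8, 10, 15 and 20 (Min marks = 8, Max marks = 20) and the new range [0, 1]:

$$V'_{8} = \frac{8 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{0}{12} \times 1 = 0$$

$$V'_{10} = \frac{10 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{2}{12} \times 1 \approx 0.167$$

$$V'_{15} = \frac{15 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{7}{12} \times 1 \approx 0.583$$

$$V'_{20} = \frac{20 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{12}{12} \times 1 = 1$$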
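The worked example above (marks 8, 10, 15 and 20 rescaled to the range 0 to 1) can be checked with a few lines of Python. This is a minimal sketch, not the book's own code; the function name `min_max_normalize` and its parameter names are chosen here for illustration only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Assumption of this sketch: refuse constant input rather than divide by zero.
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


marks = [8, 10, 15, 20]
print([round(v, 3) for v in min_max_normalize(marks)])  # [0.0, 0.167, 0.583, 1.0]
```

The same function reproduces the second kind of example in the text: a value of 600 in the range 200 to 1000 maps to 0.5.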
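Decimal scaling:

$$V_i' = \frac{V_i}{10^{j}}$$

Min-max normalization:

$$V_i' = \left[\frac{V_i - \min}{\max - \min}\right](\text{new\_max} - \text{new\_min}) + \text{new\_min}$$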
$$z = \frac{76{,}300 - \text{Mean}}{\text{Standard deviation}} = \frac{76{,}300 - 54000}{16000} \approx 1.394$$
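As a quick check of the z-score arithmetic above, here is a small Python sketch. The value 76,300, the mean 54,000 and the standard deviation 16,000 are taken as they appear in the text; the function name `z_score` is an illustrative choice, not something the book defines.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


print(round(z_score(76_300, 54_000, 16_000), 2))  # 1.39
```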
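Min-max normalization maps a value $v$ of $A$, with range $[\min_A, \max_A]$, to $v'$ in the new range $[\text{new\_min}_A, \text{new\_max}_A]$:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

$$v' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 \approx 0.716$$

For $v = 600$:

$$v' = \frac{600 - 200}{1000 - 200}\,(1.0 - 0.0) + 0.0 = 0.5$$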
$$10^{j} = 10^{3}, \quad j = 3$$
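Decimal scaling divides every value by a power of ten, 10^j. The sketch below assumes the usual convention that j is the smallest non-negative integer for which every scaled value has absolute value below 1; that convention, the function name `decimal_scale` and the two sample values are assumptions made here for illustration, not data from the text.

```python
def decimal_scale(values):
    """Return (j, scaled_values) where each value is divided by 10**j.

    j is the smallest non-negative integer such that max(|v|) / 10**j < 1.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Hypothetical values: the largest magnitude is 986, so j = 3 and we divide by 1000.
j, scaled = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```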
[Figure: Hierarchy diagram with the labels State, University, Branch and College.]
[Figure: Descriptive statistics describe the sample; inferential statistics are used to make inferences about the population.]
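Mean: $\bar{x}$

Interquartile range: $Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles.

Deviation from the mean: $x_i - \bar{x}$

Sample variance:

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Population variance:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$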
$$\frac{\sum (x - M)^2}{n}, \qquad \sqrt{\frac{\sum (x - M)^2}{n}}$$
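$$\bar{x} = \frac{\sum x}{n} = \frac{5 + 10 + 15 + 20}{4} = \frac{50}{4} = 12.5$$

x      x – x̄     (x – x̄)²
5      – 7.5     (– 7.5)² = 56.25
10     – 2.5     (– 2.5)² = 6.25
15     2.5       (2.5)² = 6.25
20     7.5       (7.5)² = 56.25
                 Σ(x – x̄)² = 125

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{125}{4 - 1}} \approx 6.455$$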
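The standard deviation computed above for the values 5, 10, 15 and 20 can be reproduced step by step in Python. This is a minimal sketch that follows the n – 1 (sample) formula used in the example; the function name `sample_std_dev` is an illustrative choice.

```python
import math


def sample_std_dev(xs):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    squared_devs = [(x - mean) ** 2 for x in xs]
    return math.sqrt(sum(squared_devs) / (n - 1))


data = [5, 10, 15, 20]
print(sum(data) / len(data))             # 12.5
print(round(sample_std_dev(data), 3))    # 6.455
```

Python's standard library function `statistics.stdev` uses the same n – 1 denominator and can serve as a cross-check.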
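$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} = \frac{0.05 \times 0.25}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.0125}{0.0125 + 0.014 + 0.008} \approx 0.362$$

$$P(B \mid D) = \frac{P(D \mid B)\,P(B)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} = \frac{0.04 \times 0.35}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.014}{0.0125 + 0.014 + 0.008} = \frac{0.014}{0.0345} \approx 0.406$$

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)} = \frac{0.02 \times 0.4}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.008}{0.0125 + 0.014 + 0.008} = \frac{0.008}{0.0345} \approx 0.232$$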
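The three posterior probabilities above all share the same denominator, which makes the calculation easy to express as a short function. The sketch below uses the priors 0.25, 0.35, 0.4 and the conditional probabilities 0.05, 0.04, 0.02 from the example; the function name `posteriors` and the dictionary layout are assumptions made for illustration.

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem over mutually exclusive, exhaustive events.

    priors[k]      = P(k)
    likelihoods[k] = P(D | k)
    returns        P(k | D) for every k
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}   # P(D | k) P(k)
    total = sum(joint.values())                               # P(D)
    return {k: joint[k] / total for k in joint}


priors = {"A": 0.25, "B": 0.35, "C": 0.40}
likelihoods = {"A": 0.05, "B": 0.04, "C": 0.02}

for event, p in posteriors(priors, likelihoods).items():
    print(event, round(p, 3))
# A 0.362
# B 0.406
# C 0.232
```

Because the posteriors are normalised by the same total, they always sum to 1, which is a useful sanity check on hand calculations.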
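$$P(F \mid T) = \frac{P(T \mid F)\,P(F)}{P(T \mid F)\,P(F) + P(T \mid M)\,P(M)} = \frac{\frac{1}{100} \times \frac{3}{5}}{\frac{1}{100} \times \frac{3}{5} + \frac{4}{100} \times \frac{2}{5}} = \frac{\frac{3}{500}}{\frac{3}{500} + \frac{8}{500}} = \frac{3}{11}$$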
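$$n(S) = 6^2 = 36$$

$$\frac{1}{9}$$

$$P(A \cap B) = \frac{2}{36} = \frac{1}{18}$$

$$\frac{1/18}{1/9} = \frac{1}{18} \times \frac{9}{1} = \frac{1}{2}$$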
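$$P(F \mid T) = \frac{P(T \mid F)\,P(F)}{P(T \mid F)\,P(F) + P(T \mid M)\,P(M)} = \frac{\frac{1}{100} \times \frac{3}{5}}{\frac{1}{100} \times \frac{3}{5} + \frac{4}{100} \times \frac{2}{5}} = \frac{3}{11}$$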
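[Figure: Sampling distribution centred at 80, the value from H0. The middle 95 % (between Z = – 1.96 and Z = 1.96) contains the high-probability values if H0 is true. The extreme 5 % in the two tails is the critical region, where H0 is rejected.]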
[Figure: Hypothesis testing flowchart: formulate H0 and H1; compare P with the level of significance (α); determine if TSCR (Z cal) falls into the (non) rejection region.]
[Figure: Density f(x, θ) with a central acceptance region of area (1 – α) between θ1 and θ2, and a rejection region of area α/2 in each tail.]
Null hypothesis (H0)
Alternative hypothesis (H1)
$$1 - (1 - 0.05)^{20} \approx 0.64$$
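The expression 1 – (1 – 0.05)^20 can be evaluated directly. The sketch below reads it as the probability of at least one false rejection when 20 independent tests are each run at significance level 0.05; that reading, and the function name `familywise_error_rate`, are assumptions made here, since the surrounding explanation is not available.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one Type I error) across n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


print(round(familywise_error_rate(0.05, 20), 4))  # 0.6415
```

Under this reading, a single test has a 5 % chance of a false rejection, while twenty such tests together have roughly a 64 % chance of producing at least one.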
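$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the individual sample values.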
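$$\bar{X} = \frac{\sum X}{n} = \frac{6500}{10} = 650, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{660}{10} = 66$$

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{\sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2}\;\sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2}}$$

$$r = \frac{\frac{456040}{10} - 650 \times 66}{\sqrt{\frac{4764800}{10} - 650^2}\;\sqrt{\frac{45784}{10} - 66^2}} = \frac{2704}{\sqrt{53980}\,\sqrt{222.4}} \approx 0.78$$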
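Working table columns: X, Y, X², Y², XY.

$$\bar{X} = \frac{\sum X}{n} = \frac{420}{7} = 60, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{476}{7} = 68$$

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{\sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2}\;\sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2}}$$

$$r = \frac{\frac{29464}{7} - 60 \times 68}{\sqrt{\frac{28096}{7} - 60^2}\;\sqrt{\frac{34376}{7} - 68^2}} \approx \frac{129.14}{\sqrt{413.71}\,\sqrt{286.86}} \approx 0.375$$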
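Both correlation examples above start from column totals (ΣX, ΣY, ΣXY, ΣX², ΣY²) rather than raw data, so the computation reduces to plugging six numbers into one formula. The following sketch does exactly that; the function name `pearson_from_sums` and its parameter names are illustrative, and the totals are the ones shown in the two examples.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r = cov(X, Y) / (sd_x * sd_y), computed from column totals."""
    mean_x, mean_y = sum_x / n, sum_y / n
    cov_xy = sum_xy / n - mean_x * mean_y
    sd_x = math.sqrt(sum_x2 / n - mean_x ** 2)
    sd_y = math.sqrt(sum_y2 / n - mean_y ** 2)
    return cov_xy / (sd_x * sd_y)


# First example: n = 10, sum X = 6500, sum Y = 660.
r1 = pearson_from_sums(10, 6500, 660, 456040, 4764800, 45784)
# Second example: n = 7, sum X = 420, sum Y = 476.
r2 = pearson_from_sums(7, 420, 476, 29464, 28096, 34376)

print(f"{r1:.3f}")  # 0.780
print(f"{r2:.3f}")  # 0.375
```

The same divisor n is used for the covariance and for both variances, so the result is identical to what an n – 1 divisor throughout would give.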
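With $d_1$, $d_2$, $d_3$ the rank differences for the pairs (x, y), (y, z), (z, x), and $n = 10$:

$$\rho(x, y) = 1 - \frac{6 \sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6 \times 74}{10\,(100 - 1)} \approx 0.552$$

$$\rho(y, z) = 1 - \frac{6 \sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6 \times 44}{10\,(100 - 1)} \approx 0.733$$

$$\rho(z, x) = 1 - \frac{6 \sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6 \times 156}{10\,(100 - 1)} \approx 0.055$$

The largest coefficient is $\rho(y, z)$, which corresponds to 2 and 3.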
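The three rank correlation coefficients above differ only in the sum of squared rank differences, so they can be computed with one helper. This is a minimal sketch using the sums 74, 44 and 156 with n = 10 from the example; the function name `spearman_rho` is an illustrative choice, and the formula assumes there are no tied ranks.

```python
def spearman_rho(sum_d_squared, n):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties assumed."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


n = 10
sums_of_squares = {"(x, y)": 74, "(y, z)": 44, "(z, x)": 156}

for pair, sum_d2 in sums_of_squares.items():
    print(pair, f"{spearman_rho(sum_d2, n):.3f}")
# (x, y) 0.552
# (y, z) 0.733
# (z, x) 0.055
```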
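$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E}$$

For standardised variates $\dfrac{x_i - \mu_i}{\sigma_i}$:

$$\chi^2 = \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} + \dots + \frac{(x_\nu - \mu_\nu)^2}{\sigma_\nu^2} = \sum_{i=1}^{\nu} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$$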
[Figure: Chi-square density curves for d.f. = 2, 5, 10, 15 and 30, plotted for values from 0 to 50.]
[Figure: Chi-square distribution with the critical region of area α to the right of the critical value.]
$x_i$: 2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93

$$\bar{x} = \frac{\sum x_i}{n} = \frac{20.68}{10} = 2.068$$

$$S^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{0.09096}{10} = 0.009096$$
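Null hypothesis $(H_0): \sigma^2 = \sigma_0^2$

Alternative hypothesis $(H_1): \sigma^2 \neq \sigma_0^2$

$$\chi^2 = \frac{n s^2}{\sigma_0^2} = \frac{10 \times 0.009096}{(0.145)^2} \approx 4.326$$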
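The chi-square statistic for the variance test above can be recomputed from the ten sample values. The sketch below follows the example's convention of dividing by n (not n – 1) when computing s², and uses 0.145 as the hypothesised standard deviation; the function name `chi_square_variance_statistic` is an illustrative choice. Comparing the result with a tabulated critical value is left out because the table value is not recoverable from the text.

```python
def chi_square_variance_statistic(sample, sigma0):
    """Test statistic n * s^2 / sigma0^2, with s^2 = sum((x - mean)^2) / n."""
    n = len(sample)
    mean = sum(sample) / n
    s_squared = sum((x - mean) ** 2 for x in sample) / n
    return n * s_squared / sigma0 ** 2


x = [2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93]
print(round(sum(x) / len(x), 3))                          # 2.068
print(round(chi_square_variance_statistic(x, 0.145), 3))  # 4.326
```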
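$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

For a 2 × 2 contingency table:

         B1     B2     Total
A1       a      b      r1
A2       c      d      r2
Total    c1     c2     n

$$\chi^2 = \frac{n\,(ad - bc)^2}{c_1\, c_2\, r_1\, r_2}$$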
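The shortcut formula n(ad – bc)² / (c1 c2 r1 r2) gives the same value as summing (observed – expected)² / expected over the four cells. The sketch below computes both so the equivalence can be seen numerically. The cell counts 10, 20, 30, 40 are hypothetical values chosen for illustration, not data from the text, and the function names are likewise assumptions.

```python
def chi_square_2x2_shortcut(a, b, c, d):
    """Chi-square for a 2 x 2 table via n (ad - bc)^2 / (c1 c2 r1 r2)."""
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    n = r1 + r2
    return n * (a * d - b * c) ** 2 / (c1 * c2 * r1 * r2)


def chi_square_2x2_general(a, b, c, d):
    """Same statistic via sum((O - E)^2 / E), with E = row total * column total / n."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    observed = [a, b, c, d]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts, for illustration only.
print(round(chi_square_2x2_shortcut(10, 20, 30, 40), 4))  # 0.7937
print(round(chi_square_2x2_general(10, 20, 30, 40), 4))   # 0.7937
```

Neither function applies a continuity correction; both assume every row and column total is non-zero.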
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}, \qquad Z^2 = \frac{(\bar{X} - \mu_0)^2}{\sigma^2 / n}$$
$$\chi^2 = \sum \frac{\left[(O_f)_{ij} - (E_f)_{ij}\right]^2}{(E_f)_{ij}}$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_P \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

$$S_P^2 = \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2}$$
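The pooled-variance statistic above combines the two sample variances, weighted by their degrees of freedom, before standardising the difference in means. The sketch below implements those two formulas with Python's standard `statistics` module. The two samples are hypothetical, and the function name `pooled_t_statistic` is an illustrative choice; the sketch assumes both samples have at least two observations.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Return (t, degrees_of_freedom) for the equal-variance two-sample t statistic."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)   # sample variance, n - 1 divisor
    s2_sq = statistics.variance(sample2)
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    mean_diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = mean_diff / (math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical samples, for illustration only.
t, df = pooled_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), df)  # -1.897 8
```

Here the sample variances are 2.5 and 10, so the pooled variance is 6.25 and the pooled standard deviation is 2.5.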
With $N = n_1 + n_2$:

$$\mu_w = \frac{n_1\,(N + 1)}{2}$$

$$\sigma_w^2 = \frac{n_1\, n_2\,(N + 1)}{12}$$