Data Science and Big Data Analytics

Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering, Pune.

Technical Publications (since 1993) - An Up-Thrust for Knowledge
Published by: Technical Publications, Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S., India. Ph.: +91-020-24495496/97
Email: [email protected]   Website: www.technicalpublications.org

Printer: Yogiraj Printers & Binders, Sr. No. 10/1A, Ghule Industrial Estate, Nanded Village Road, Tal. Haveli, Dist. Pune - 411041.

ISBN 978-93-91567-92-7 (SPPU)
[Figure: The three Vs of big data. Data volume arises from machine data, sensor networks, mobile devices, and geographical information systems (geo-spatial data). Data velocity is driven by social media and web-based companies such as Amazon, Facebook, Yahoo, and Google. Data variety spans structured, semi-structured, and unstructured data.]
[Figure: The components of data science - business understanding, data exploration, data mining, data cleaning, feature engineering, predictive modeling, and data visualization.]
[Figure: Sources of data - printed books, videos, photos and drawings - all ultimately represented as binary digits.]

Types of data: structured and unstructured. Attributes are either qualitative (categorical) or quantitative (numeric).
Min-max normalization maps a value V from the range [Min_A, Max_A] to a new range:

V' = ((V - Min_A) / (Max_A - Min_A)) * (New_Max_A - New_Min_A) + New_Min_A

For marks: New marks = ((V - Min marks) / (Max marks - Min marks)) * (New max - New min) + New min. With Min marks = 8, Max marks = 20 and the new range [0, 1]:

V = 8:  ((8 - 8) / (20 - 8)) * (1 - 0) + 0 = 0/12 = 0
V = 10: ((10 - 8) / (20 - 8)) * (1 - 0) + 0 = 2/12 = 0.167
V = 15: ((15 - 8) / (20 - 8)) * (1 - 0) + 0 = 7/12 = 0.583
V = 20: ((20 - 8) / (20 - 8)) * (1 - 0) + 0 = 12/12 = 1
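The marks calculation above can be sketched in Python (a minimal illustration; the function name is our own):

```python
# Min-max normalization: rescale v from [min_a, max_a] to [new_min, new_max].
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

marks = [8, 10, 15, 20]
scaled = [round(min_max_normalize(m, 8, 20), 3) for m in marks]
print(scaled)  # [0.0, 0.167, 0.583, 1.0]
```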
Decimal scaling divides each value by a power of ten: V_i' = V_i / 10^j, where j is the smallest integer such that max(|V_i'|) < 1.

Min-max normalization can be restated as V_i' = ((V_i - min) / (max - min)) * (new_max - new_min) + new_min.
Z-score normalization uses the mean and standard deviation: v' = (v - Mean) / (Standard deviation). With Mean = 54,000 and Standard deviation = 16,000, the value v = 73,600 transforms to (73,600 - 54,000) / 16,000 = 1.225.

Min-max normalization maps v from [min_A, max_A] to [new_min_A, new_max_A]:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

With min_A = 12,000 and max_A = 98,000, the value v = 73,600 maps to

v' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716
Here min_A = 200, max_A = 1000, new_min_A = 0.0 and new_max_A = 1.0. For v = 600:

v' = ((600 - 200) / (1000 - 200)) * (1.0 - 0.0) + 0.0 = 0.5

For decimal scaling, the maximum absolute value determines j; with j = 3, v' = v / 10^3.
[Figure: A concept hierarchy - State, University, Branch, College.]
[Figure: Descriptive statistics summarize a sample; inferential statistics use the sample to make inferences about the population.]
Descriptive measures include the sample mean x_bar, the quartiles Q1 and Q3, and the interquartile range IQR = Q3 - Q1.

Sample variance (deviations x_i - x_bar about the sample mean):

S^2 = sum((x_i - x_bar)^2) / (n - 1)

Population variance (deviations about the population mean mu):

sigma^2 = sum((x_i - mu)^2) / N

The mean squared deviation about an arbitrary value M is sum((x - M)^2) / n.
Example: for the data x = 5, 10, 15, 20, sum(x) = 50 and x_bar = 50/4 = 12.5.

x     x - x_bar   (x - x_bar)^2
5     -7.5        56.25
10    -2.5        6.25
15    2.5         6.25
20    7.5         56.25

sum((x - x_bar)^2) = 125

S = sqrt(sum((x - x_bar)^2) / (n - 1)) = sqrt(125 / (4 - 1)) = 6.455
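A short Python check of the sample standard deviation computed above (the function name is ours):

```python
import math

# Sample standard deviation: S = sqrt(sum((x - mean)^2) / (n - 1)).
def sample_std(xs):
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(ss / (n - 1))

data = [5, 10, 15, 20]
print(round(sample_std(data), 3))  # 6.455
```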
P(A|D) = P(D|A) P(A) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.05 * 0.25) / (0.05 * 0.25 + 0.04 * 0.35 + 0.02 * 0.4)
       = 0.0125 / (0.0125 + 0.014 + 0.008) = 0.0125 / 0.0345 = 0.362

P(B|D) = P(D|B) P(B) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.04 * 0.35) / 0.0345 = 0.014 / 0.0345 = 0.406

P(C|D) = P(D|C) P(C) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.02 * 0.4) / 0.0345 = 0.008 / 0.0345 = 0.232
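The same posterior computation can be sketched in Python (the dictionaries of priors and defect rates mirror the numbers in the example):

```python
# Posterior P(machine | defective) via Bayes' theorem for the three machines.
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}

evidence = sum(defect_rates[m] * priors[m] for m in priors)  # P(D) = 0.0345
posteriors = {m: defect_rates[m] * priors[m] / evidence for m in priors}
print({m: round(p, 3) for m, p in posteriors.items()})
# {'A': 0.362, 'B': 0.406, 'C': 0.232}
```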
Bayes' theorem gives

P(F|T) = P(T|F) P(F) / [P(T|F) P(F) + P(T|M) P(M)]
       = ((1/100) * (3/5)) / ((1/100) * (3/5) + (4/100) * (2/5))
       = (3/500) / (3/500 + 8/500)
       = 3/11

For two dice there are 6^2 = 36 equally likely outcomes. With the events A and B as defined in the problem, P(A and B) = 2/36 = 1/18, and the required conditional probability follows from P(A|B) = P(A and B) / P(B).
[Figure: Two-tailed test on the standard normal curve. The middle 95% contains the high-probability values of Z if H0 is true; the extreme 5% beyond Z = -1.96 and Z = +1.96 forms the critical region, where H0 is rejected.]

The p-value of the calculated statistic Z_cal is compared with the level of significance to decide whether Z_cal falls into the rejection or the non-rejection region.

Steps of a hypothesis test:
1. Formulate H0 and H1.
2. Compute the test statistic.
3. Compare it with the level of significance and determine whether it falls into the rejection or the acceptance (non-rejection) region.

[Figure: Acceptance region of probability (1 - alpha) between the critical values theta_1 and theta_2, with rejection regions of area alpha/2 in each tail.]
[Figure: Sampling distribution f(x, theta) with acceptance region of probability (1 - alpha) between x_1 and x_2 and rejection regions of alpha/2 in each tail.]

The null hypothesis (H0) is assumed true unless the data provide strong evidence against it; rejecting H0 means accepting the alternative hypothesis (H1).

Significance levels compound across tests: running 20 independent tests at alpha = 0.05 gives a probability 1 - (1 - 0.05)^20 = 0.64 of at least one false rejection of H0 in favour of H1.
r = sum((x_i - x_bar) * (y_i - y_bar)) / sqrt(sum((x_i - x_bar)^2) * sum((y_i - y_bar)^2))

where x_i and y_i are the paired observations.
x_bar = sum(X) / n = 6500 / 10 = 650
y_bar = sum(Y) / n = 660 / 10 = 66

cov(X, Y) = (1/n) * sum(XY) - x_bar * y_bar = 456040/10 - 650 * 66 = 45604 - 42900 = 2704

sigma_x = sqrt(sum(X^2)/n - x_bar^2) = sqrt(4764800/10 - 650^2) = sqrt(53980) = 232.33
sigma_y = sqrt(sum(Y^2)/n - y_bar^2) = sqrt(45784/10 - 66^2) = sqrt(222.4) = 14.91

r = cov(X, Y) / (sigma_x * sigma_y) = 2704 / (232.33 * 14.91) = 0.78
x_bar = sum(X) / n = 420 / 7 = 60
y_bar = sum(Y) / n = 476 / 7 = 68

cov(X, Y) = (1/n) * sum(XY) - x_bar * y_bar = 29464/7 - 60 * 68 = 4209.14 - 4080 = 129.14

sigma_x = sqrt(sum(X^2)/n - x_bar^2) = sqrt(28096/7 - 60^2) = sqrt(413.71) = 20.34
sigma_y = sqrt(sum(Y^2)/n - y_bar^2) = sqrt(34376/7 - 68^2) = sqrt(286.86) = 16.94

r = cov(X, Y) / (sigma_x * sigma_y) = 129.14 / (20.34 * 16.94) = 0.37
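A small Python helper (our own naming) reproduces the correlation from the summary sums used above:

```python
import math

# Correlation from summary statistics: r = cov / (sigma_x * sigma_y),
# with cov = sum_xy/n - mean_x*mean_y and sigma = sqrt(sum_sq/n - mean^2).
def corr_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    mean_x, mean_y = sum_x / n, sum_y / n
    cov = sum_xy / n - mean_x * mean_y
    sd_x = math.sqrt(sum_x2 / n - mean_x ** 2)
    sd_y = math.sqrt(sum_y2 / n - mean_y ** 2)
    return cov / (sd_x * sd_y)

# Summary values from the first worked example.
print(round(corr_from_sums(10, 6500, 660, 4764800, 45784, 456040), 2))  # 0.78
```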
Ranks are computed for each of the three variables; d1, d2, d3 denote the pairwise rank differences, with sum(d1^2) = 74, sum(d2^2) = 44, and sum(d3^2) = 156.

rho(x, y) = 1 - 6 * sum(d1^2) / (n * (n^2 - 1)) = 1 - (6 * 74) / (10 * (100 - 1)) = 0.552
rho(y, z) = 1 - 6 * sum(d2^2) / (n * (n^2 - 1)) = 1 - (6 * 44) / (10 * (100 - 1)) = 0.733
rho(z, x) = 1 - 6 * sum(d3^2) / (n * (n^2 - 1)) = 1 - (6 * 156) / (10 * (100 - 1)) = 0.055

The pair (y, z) has the strongest rank correlation.
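A one-line Python version of the Spearman formula, applied to the three sums of squared rank differences:

```python
# Spearman rank correlation from the sum of squared rank differences:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
def spearman_from_d2(sum_d2, n):
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

n = 10
print(round(spearman_from_d2(74, n), 3))   # rho(x, y) = 0.552
print(round(spearman_from_d2(44, n), 3))   # rho(y, z) = 0.733
print(round(spearman_from_d2(156, n), 3))  # rho(z, x) = 0.055
```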
The chi-square statistic compares observed with expected frequencies:

chi^2 = sum((Observed - Expected)^2 / Expected)

summed over the k categories:

chi^2 = sum over i = 1..k of (O_i - E_i)^2 / E_i

For independent normal variables x_i with means mu_i and standard deviations sigma_i:

chi^2 = (x_1 - mu_1)^2 / sigma_1^2 + (x_2 - mu_2)^2 / sigma_2^2 + ... + (x_v - mu_v)^2 / sigma_v^2
      = sum over i = 1..v of (x_i - mu_i)^2 / sigma_i^2
[Figure: Chi-square density curves for d.f. = 2, 5, 10, 15, and 30 over the range 0-50; the critical region of area alpha lies to the right of the critical value.]
Example: x_i = 2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93 (n = 10).

x_bar = sum(x) / n = 20.68 / 10 = 2.068

S^2 = sum((x - x_bar)^2) / n = 0.09096 / 10 = 0.009096

Hypotheses about the population variance:

H0: sigma^2 = sigma_0^2
H1: sigma^2 != sigma_0^2

chi^2 = n * S^2 / sigma_0^2 = (10 * 0.009096) / (0.145)^2 = 4.33

The computed chi^2 is compared with the tabulated critical value to decide whether H0 is rejected.
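A Python sketch of the variance test statistic above (sigma_0 = 0.145 as in the example):

```python
# Chi-square statistic for a variance test: chi2 = n * S^2 / sigma0^2,
# with S^2 computed as sum((x - mean)^2) / n as in the example.
xs = [2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93]
n = len(xs)
mean = sum(xs) / n
s2 = sum((x - mean) ** 2 for x in xs) / n
chi2 = n * s2 / 0.145 ** 2  # sigma0 = 0.145 (hypothesized std)
print(round(mean, 3), round(s2, 6), round(chi2, 2))  # 2.068 0.009096 4.33
```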
chi^2 = sum((observed - expected)^2 / expected)

For a 2 x 2 contingency table

        B1    B2    Total
A1      a     b     r1
A2      c     d     r2
Total   c1    c2    n

the statistic simplifies to

chi^2 = n * (ad - bc)^2 / (c1 * c2 * r1 * r2)

With one degree of freedom, chi^2 is the square of a standard normal statistic, chi^2 = Z^2.

In general, chi^2 = sum over cells (i, j) of ((Of)_ij - (Ef)_ij)^2 / (Ef)_ij, where (Of)_ij and (Ef)_ij are the observed and expected frequencies.
Two-sample pooled t statistic:

t = (X_bar_1 - X_bar_2) / (S_p * sqrt(1/n_1 + 1/n_2))

S_p^2 = ((n_1 - 1) * S_1^2 + (n_2 - 1) * S_2^2) / (n_1 + n_2 - 2)

where n_1 and n_2 are the two sample sizes.

For the rank-sum statistic W with N = n_1 + n_2, the mean and variance under the null hypothesis are

mu_W = n_1 * (N + 1) / 2
sigma_W^2 = n_1 * n_2 * (N + 1) / 12
[Figure: An enterprise data warehouse (EDW) feeding departmental warehouses, dashboards, reports, and alerts for data science users.]
[Figure: The big data ecosystem. Data devices (MP3 players, mobile phones, GPS units, ATMs, RFID readers, video games, medical imaging) generate data; data collectors and data aggregators process and resell it; data users and buyers include marketers, retail, mobile phone/TV and media delivery, and financial services such as banks, credit bureaus, and brokers.]
[Figure: Phases of the data analytics lifecycle, including 1. Discovery and 4. Model building.]
[Figure: Gartner's analytics ascendancy model, plotting value against difficulty. Descriptive analytics answers "What happened?" (hindsight); diagnostic analytics answers "Why did it happen?" (insight); predictive analytics answers "What will happen?" (foresight); prescriptive analytics answers "How can we make it happen?" (optimization).]
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

The same transactions in binary form (1 = the item appears in the transaction):

TID   Bread   Milk   Diaper   Beer   Eggs   Coke
T1    1       1      0        0      0      0
T2    1       0      1        1      1      0
T3    0       1      1        1      0      1
T4    1       1      1        1      0      0
T5    1       1      1        0      0      1
For an association rule A => B:

Support = (transactions containing both A and B) / (total transactions)

Confidence = (transactions containing both A and B) / (transactions containing A)

Lift = Confidence / (B / Total) = ((A and B) / A) / (B / Total)

Conviction: conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))
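These measures can be sketched in Python over the five-transaction market-basket dataset given earlier (function names are ours):

```python
# Support, confidence, and conviction for a rule X -> Y over the
# five-transaction dataset.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def conviction(x, y):
    conf = confidence(x, y)
    return float("inf") if conf == 1 else (1 - support(y)) / (1 - conf)

rule_x, rule_y = {"Milk", "Diaper"}, {"Beer"}
print(support(rule_x | rule_y))    # 0.4
print(confidence(rule_x, rule_y))  # 2/3
```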
For clusters with centroids K1 and K2, a point with height X_H and weight X_W is assigned by its Euclidean distance from each centroid, e.g.

d = sqrt((X_H - H1)^2 + (X_W - W1)^2)

where (H1, W1) is the height-weight centroid of cluster K1.
[Figure: Itemset lattice with the maximal frequent itemsets marked along the border between frequent and infrequent itemsets.]
Let i1, i2, ..., in denote the items and t1, t2, ..., tn the transactions. Apriori first finds L1, the frequent 1-itemsets. Each candidate set Ck is generated by joining L(k-1) with itself, Ck = L(k-1) * L(k-1); Lk is then the subset of candidates in Ck that meet the minimum support. The process repeats for increasing k until no Lk is found.
[Apriori example: the database D with transactions T1-T6 is scanned repeatedly.]

First scan (L1):        Second scan (C2):
Itemset   Supp          Itemset   Supp
A         4             {A,B}     3
B         5             {A,C}     3
C         3             {A,D}     2
D         4             {B,C}     2
                        {B,D}     4
                        {C,D}     1

L2 (itemsets meeting minimum support):
Itemset   Supp
A,B       3
A,C       3
B,D       4

C3: the candidate 3-itemset {A,B,D} has support 2, below the minimum support, so no frequent 3-itemset results.
Scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1:

C1                          L1
Itemset   Sup. count        Itemset   Sup. count
{I1}      6                 {I1}      6
{I2}      7                 {I2}      7
{I3}      6                 {I3}      6
{I4}      2                 {I4}      2
{I5}      2                 {I5}      2

C2 is then generated and pruned to L2, and C3 to L3, in the same way.
[Figure: An FP-tree rooted at null, with item nodes carrying counts (e.g. b:5, c:3, c:1, d:1, e:1) along its branches.]
[Figure: Simple linear regression. The errors are the vertical distances between the observed points and the fitted line of Y on X.]
In logistic regression the log-odds of Y is modeled as a linear function of the predictors:

ln(P(Y) / (1 - P(Y))) = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + beta_k*X_k

The right-hand side, beta_0 + beta_1*X_1 + ... + beta_k*X_k, is the same linear predictor used in linear regression; only the left-hand side changes.
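A minimal Python sketch of the logit/sigmoid relationship; the coefficients b0 and b1 are purely illustrative:

```python
import math

# The logit maps a probability to the linear predictor's scale;
# the sigmoid inverts it. Coefficients here are hypothetical.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 2.0  # illustrative beta_0, beta_1
x = 2.5
z = b0 + b1 * x               # linear predictor
p = sigmoid(z)                # P(Y = 1 | x)
log_odds = math.log(p / (1 - p))  # recovers z
print(round(p, 3))  # 0.731
```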
[Figure: A straight linear fit versus the S-shaped logistic curve, which stays between 0.0 and 1.0 on the Y axis.]

Linear regression:   Y = a_0 + a_1*X_1 + a_2*X_2 + ... + a_k*X_k
Logistic regression: ln[p / (1 - p)] = b_0 + b_1*X_1 + b_2*X_2 + ... + b_k*X_k

[Figure: (left) House size (2BHK-6BHK) against price (1Cr-6Cr) on the X axis; (right) the sigmoid g(z) rising from 0 to 1 as z runs from -20 to 20, crossing 0.5 at z = 0.]
[Figure: Supervised classification. Labeled training examples feed a machine learning algorithm, which learns rules for classification; applying the rules to a new example yields a predicted classification.]
The conditional probability of A given B is

P(A|B) = P(A and B) / P(B)

where P(A and B) is the joint probability of events A and B.
P(F|T) = P(T|F) P(F) / [P(T|F) P(F) + P(T|M) P(M)]
       = ((1/100) * (3/5)) / ((1/100) * (3/5) + (4/100) * (2/5))
       = (3/500) / (3/500 + 8/500)
       = 3/11
[Figure: Probability tree for two successive draws. The first draw gives red (R1) or white (W1) with probabilities P(R1) and P(W1); the second-draw branches carry the conditional probabilities P(R2|R1), P(W2|R1), P(R2|W1), and P(W2|W1).]
P(R|F) = P(F|R) * P(R) / [P(F|R) * P(R) + P(F|E) * P(E) + P(F|P) * P(P)]
[Figure: Decision tree for the contact-lens data rooted at astigmatism (yes/no), with further tests on age and spectacle prescription and leaves labeled hard, soft, or none.]
Gini index of a node t over classes j:

Gini(t) = 1 - sum over j of p(j|t)^2

where p(j|t) is the relative frequency of class j at node t.

When a node of n records is split into k partitions with n_i records at child i, the quality of the split is the weighted Gini:

GINI_split = sum over i = 1..k of (n_i / n) * GINI(i)
           = (n_1/n) * Gini(1) + (n_2/n) * Gini(2) + ... + (n_k/n) * Gini(k)

The information gain of an attribute is the parent entropy minus the weighted child entropies:

Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)
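A small Python sketch of the Gini computations (function names and the example counts are ours):

```python
# Gini index of a node from its class counts, and the weighted
# Gini of a split: sum over children of (n_i / n) * gini(child).
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# A pure node has Gini 0; a 50/50 node has Gini 0.5.
print(gini([10, 0]), gini([5, 5]))  # 0.0 0.5
print(round(gini_split([[4, 1], [1, 4]]), 3))  # 0.32
```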
[Figure: Decision trees for the contact-lens data rooted at tear production rate (yes/no branches), with further tests on age and spectacle prescription (myope vs. hypermetrope, presbyopic vs. not presbyopic) and leaves labeled soft, hard, or none.]
Entropy = - sum over classes I of p(I) * log2(p(I))

For a set with 9 examples of one class and 5 of the other out of 14:

Entropy = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940
[Figure: Pruning example. A subtree splitting on Color (red/blue) makes 2 correct and 4 incorrect predictions. On the validation data, if we had simply predicted the majority class (negative), we would make 2 errors instead of 4, so the subtree is pruned to a single leaf.]
[Figure: Growing a decision tree. Attribute A scores highest for Gain(S, A), so it is tested at the root. On the branch where A = v, attribute B scores highest for Gain(S_w, B) and is tested next. A leaf node is labeled with category c when S_v contains only examples in category c. A default leaf labeled d is added where S_w has no examples taking value x for attribute B; d is the category containing the most members of S_w.]
Entropy(S) = -P_cinema * log2(P_cinema) - P_tennis * log2(P_tennis) - P_shopping * log2(P_shopping) - P_stay_in * log2(P_stay_in)
           = -(6/10) * log2(6/10) - (2/10) * log2(2/10) - (1/10) * log2(1/10) - (1/10) * log2(1/10)
           = -(6/10) * (-0.737) - (2/10) * (-2.322) - (1/10) * (-3.322) - (1/10) * (-3.322)
           = 1.571

Gain(S, weather) = 1.571 - (|S_sun|/10) * Entropy(S_sun) - (|S_wind|/10) * Entropy(S_wind) - (|S_rain|/10) * Entropy(S_rain)
                 = 1.571 - (0.3) * Entropy(S_sun) - (0.4) * Entropy(S_wind) - (0.3) * Entropy(S_rain)
                 = 1.571 - (0.3) * (0.918) - (0.4) * (0.81125) - (0.3) * (0.918)
                 = 0.696

Gain(S, parents) = 1.571 - (|S_yes|/10) * Entropy(S_yes) - (|S_no|/10) * Entropy(S_no)
[Figure: Decision tree rooted at Weather with branches Sunny, Windy, and Rainy; the sunny subset S_sunny is expanded further.]

Gain(S_sunny, parents) = 0.918 - (|S_yes|/|S|) * Entropy(S_yes) - (|S_no|/|S|) * Entropy(S_no)
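The entropy arithmetic can be checked with a short Python sketch (function names are ours; the 6/2/1/1 counts are from the example):

```python
import math

# Entropy of a label distribution and the information gain of a split.
def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent_counts, child_counts):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(c) / n * entropy(c) for c in child_counts
    )

s = [6, 2, 1, 1]  # cinema, tennis, shopping, stay-in
print(round(entropy(s), 3))  # 1.571
```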
[Figure: Scatter plot of sample data points, x axis 0-5, y axis 0-25.]
The Euclidean distance between points (p_1, p_2, ...) and (q_1, q_2, ...) is

d(p, q) = sqrt(sum over i = 1..k of (p_i - q_i)^2)

Given observations x_1, ..., x_N partitioned into K clusters by an assignment C, the within-cluster point scatter is

W(C) = (1/2) * sum over k = 1..K of sum over C(i) = k, C(j) = k of ||x_i - x_j||^2
     = sum over k = 1..K of N_k * sum over C(i) = k of ||x_i - m_k||^2

where m_k is the mean (centroid) of the k-th cluster and N_k the number of points assigned to it.
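A Python sketch of the distance and within-cluster scatter (the three sample points are illustrative):

```python
import math

# Euclidean distance between two points, and the within-cluster sum of
# squared distances to the centroid (the quantity K-means minimizes).
def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def within_ss(points):
    m = centroid(points)
    return sum(euclidean(x, m) ** 2 for x in points)

cluster = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
print(euclidean((0, 0), (3, 4)))  # 5.0
print(round(within_ss(cluster), 3))  # 4.0
```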
First-order differencing: y_t = Y_t - Y_(t-1)

Second-order differencing: y_t = (Y_t - Y_(t-1)) - (Y_(t-1) - Y_(t-2)) = Y_t - 2*Y_(t-1) + Y_(t-2)

Autoregressive models regress the series on its own lagged values:

y_t = phi_0 + phi_1 * Y_(t-1)                       (first order)
y_t = phi_0 + phi_1 * Y_(t-1) + phi_2 * Y_(t-2)     (second order)

A model on the differenced series takes the form y_t - y_(t-1) = phi_1 * (Y_(t-1) - Y_(t-2)).

A series y_i, i = 1, ..., n, can be decomposed into seasonal, trend, and residual components s_i, t_i, r_i:

y_i = s_i + t_i + r_i
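Differencing is easy to sketch in Python (the series Y is illustrative):

```python
# First and second differences of a series Y_t.
def diff(ys):
    return [b - a for a, b in zip(ys, ys[1:])]

Y = [10, 12, 15, 19, 24]
print(diff(Y))        # [2, 3, 4, 5]
print(diff(diff(Y)))  # [1, 1, 1]
```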
"INFORMATION RETRIEVAL by Technical Publication"

Tokenization converts text into counts over a fixed vocabulary. For the sentence "It is a puppy and it is extremely cute":

Word        Count
They        0
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
...         ...
With N documents in the collection and df_t the number of documents containing term t, the tf-idf weight of term t in document d is

W_(t,d) = (1 + log(tf_(t,d))) * log(N / df_t)
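A Python sketch of the weight formula, using base-10 logarithms (an assumption; the base only rescales the weights):

```python
import math

# tf-idf weight: W = (1 + log10(tf)) * log10(N / df).
def tfidf(tf, df, n_docs):
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term appearing 3 times in the document and in 100 of 10,000 documents.
print(round(tfidf(3, 100, 10000), 3))  # 2.954
```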
[Figure: A small example graph on nodes A, B, D, E.]

The degree of node i is k_i = sum over j = 1..N of A_ij, where A is the adjacency matrix.

In a random graph with N nodes and link probability p, the degree distribution is binomial:

p_k = C(N - 1, k) * p^k * (1 - p)^(N - 1 - k)
[Figure: Hierarchical clustering dendrogram over items A through K, with horizontal cuts at levels 1-4 revealing the emerging clusters.]
Accuracy = (|True positives| + |True negatives|) / (|True positives| + |True negatives| + |False positives| + |False negatives|)

Recall (sensitivity) = |True positives| / (|True positives| + |False negatives|)

Specificity = |True negatives| / (|False positives| + |True negatives|)
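These metrics in Python (the confusion-matrix counts are hypothetical):

```python
# Accuracy, recall, and specificity from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return accuracy, recall, specificity

# Hypothetical counts for illustration.
acc, rec, spec = metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, rec, spec)  # 0.85 0.8 0.9
```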
Let E_i denote the error on the i-th fold. The overall cross-validation error is the average over the K folds:

E = (1/K) * sum over i = 1..K of E_i
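A Python sketch of fold construction and error averaging (function names and values are ours):

```python
# K-fold cross validation: partition indices into folds, then average
# the per-fold errors E_i.
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k contiguous folds.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cv_error(fold_errors):
    return sum(fold_errors) / len(fold_errors)

print(kfold_indices(10, 3))        # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(cv_error([0.1, 0.2, 0.15]))  # ~0.15
```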
[Figure: Cross validation. Data permitting, the dataset is split into training, validation, and testing portions. In K-fold cross validation, the total set of examples is partitioned and each of experiments 1-4 holds out a different fold as the test examples.]
[Figure: The elbow method. (a) SSE plotted against K, the number of clusters, falling steeply at first; (b) the same curve with the elbow point marked, beyond which additional clusters reduce SSE only slightly.]
[Figure: Scatter plot of data points, x axis 1.0-3.0, y axis 4.0-6.5.]
[Figure: Bar chart of Sales (USD) by year for 2008-2012, with bars ranging roughly from 15,000 to 27,500.]
[Figure: Pie chart of sports participation - Swimming 27%, Soccer 30%, Track 20%, Tennis 12%, Gymnastics 11%.]
[Figure: Venn diagram of readers of hard copy books and Kindle books; 23 read hard copy only, 12 Kindle only, and 4 both formats.]
[Figure: Scatter plot of Height (m), 2.4-3.2, against Diameter (cm), 4-8.]
Visual analytics

[Figure: JasperReports Server architecture. Browser users reach JasperReports Server instances (deployed as WAR files) through a load balancer. The server persists metadata to a repository database, reads from data sources, uses email services, and integrates with authentication providers such as SiteMinder, CAS, JAAS, and LDAP.]
[Figure: HDFS architecture with management and monitoring by Ambari. Clients read blocks from Datanodes, which replicate blocks among themselves; the Namenode handles block operations and updates its fsimage with editlogs.]
[Figure: MapReduce data flow. Input data is divided among parallel Map() tasks whose intermediate outputs are shuffled to Reduce() tasks, which produce the output data. In the detailed view, input splits 1-9 feed Map tasks 1-9, and the reduce stage (e.g. Reduce 1) writes outputs such as Output 1.]
[Figure: Apache Pig architecture. Scripts enter through the Grunt shell or Pig server and pass through the parser, optimizer, and compiler to the execution engine, which runs MapReduce jobs over Hadoop HDFS.]
[Figure: HBase architecture. Clients coordinate via ZooKeeper with the HMaster, which manages several Region Servers; each Region Server hosts regions and stores its data on HDFS.]