Data Science and Big Data Analytics

Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering, Pune.

Technical Publications (since 1993) - An Up-Thrust for Knowledge
Published by: Technical Publications, Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S., India. Ph.: +91-020-24495496/97
Email: [email protected]   Website: www.technicalpublications.org

Printer: Yogiraj Printers & Binders, Sr. No. 10/1A, Ghule Industrial Estate, Nanded Village Road, Tal. Haveli, Dist. Pune - 411041.

ISBN 978-93-91567-92-7 (SPPU)
[Figure: The three Vs of big data. Data volume arises from machine data, sensor networks, mobile devices, and geographical information systems (geo-spatial data). Data velocity is driven by social media and web-based companies such as Amazon, Facebook, Yahoo, and Google. Data variety spans structured, semi-structured, and unstructured data.]
[Figure: The components of data science - business understanding, data exploration, data mining, data cleaning, feature engineering, predictive modeling, and data visualization.]
[Figure: Sources of data - printed books, videos, photos and drawings - all ultimately represented as binary digits.]

Types of data: structured and unstructured. Attributes are either qualitative (categorical) or quantitative (numeric).
Min-max normalization maps a value V from the range [Min_A, Max_A] to a new range:

V' = ((V - Min_A) / (Max_A - Min_A)) * (New_Max_A - New_Min_A) + New_Min_A

For marks: New marks = ((V - Min marks) / (Max marks - Min marks)) * (New max - New min) + New min. With Min marks = 8, Max marks = 20 and the new range [0, 1]:

V = 8:  ((8 - 8) / (20 - 8)) * (1 - 0) + 0 = 0/12 = 0
V = 10: ((10 - 8) / (20 - 8)) * (1 - 0) + 0 = 2/12 = 0.167
V = 15: ((15 - 8) / (20 - 8)) * (1 - 0) + 0 = 7/12 = 0.583
V = 20: ((20 - 8) / (20 - 8)) * (1 - 0) + 0 = 12/12 = 1
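The marks calculation above can be sketched in Python (a minimal illustration; the function name is our own):

```python
# Min-max normalization: rescale v from [min_a, max_a] to [new_min, new_max].
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

marks = [8, 10, 15, 20]
scaled = [round(min_max_normalize(m, 8, 20), 3) for m in marks]
print(scaled)  # [0.0, 0.167, 0.583, 1.0]
```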
Decimal scaling divides each value by a power of ten: V_i' = V_i / 10^j, where j is the smallest integer such that max(|V_i'|) < 1.

Min-max normalization can be restated as V_i' = ((V_i - min) / (max - min)) * (new_max - new_min) + new_min.
Z-score normalization uses the mean and standard deviation: v' = (v - Mean) / (Standard deviation). With Mean = 54,000 and Standard deviation = 16,000, the value v = 73,600 transforms to (73,600 - 54,000) / 16,000 = 1.225.

Min-max normalization maps v from [min_A, max_A] to [new_min_A, new_max_A]:

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

With min_A = 12,000 and max_A = 98,000, the value v = 73,600 maps to

v' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716
Here min_A = 200, max_A = 1000, new_min_A = 0.0 and new_max_A = 1.0. For v = 600:

v' = ((600 - 200) / (1000 - 200)) * (1.0 - 0.0) + 0.0 = 0.5

For decimal scaling, the maximum absolute value determines j; with j = 3, v' = v / 10^3.
[Figure: A concept hierarchy - State, University, Branch, College.]
[Figure: Descriptive statistics summarize a sample; inferential statistics use the sample to make inferences about the population.]
Descriptive measures include the sample mean x_bar, the quartiles Q1 and Q3, and the interquartile range IQR = Q3 - Q1.

Sample variance (deviations x_i - x_bar about the sample mean):

S^2 = sum((x_i - x_bar)^2) / (n - 1)

Population variance (deviations about the population mean mu):

sigma^2 = sum((x_i - mu)^2) / N

The mean squared deviation about an arbitrary value M is sum((x - M)^2) / n.
Example: for the data x = 5, 10, 15, 20, sum(x) = 50 and x_bar = 50/4 = 12.5.

x     x - x_bar   (x - x_bar)^2
5     -7.5        56.25
10    -2.5        6.25
15    2.5         6.25
20    7.5         56.25

sum((x - x_bar)^2) = 125

S = sqrt(sum((x - x_bar)^2) / (n - 1)) = sqrt(125 / (4 - 1)) = 6.455
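A short Python check of the sample standard deviation computed above (the function name is ours):

```python
import math

# Sample standard deviation: S = sqrt(sum((x - mean)^2) / (n - 1)).
def sample_std(xs):
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(ss / (n - 1))

data = [5, 10, 15, 20]
print(round(sample_std(data), 3))  # 6.455
```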
P(A|D) = P(D|A) P(A) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.05 * 0.25) / (0.05 * 0.25 + 0.04 * 0.35 + 0.02 * 0.4)
       = 0.0125 / (0.0125 + 0.014 + 0.008) = 0.0125 / 0.0345 = 0.362

P(B|D) = P(D|B) P(B) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.04 * 0.35) / 0.0345 = 0.014 / 0.0345 = 0.406

P(C|D) = P(D|C) P(C) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
       = (0.02 * 0.4) / 0.0345 = 0.008 / 0.0345 = 0.232
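The same posterior computation can be sketched in Python (the dictionaries of priors and defect rates mirror the numbers in the example):

```python
# Posterior P(machine | defective) via Bayes' theorem for the three machines.
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}

evidence = sum(defect_rates[m] * priors[m] for m in priors)  # P(D) = 0.0345
posteriors = {m: defect_rates[m] * priors[m] / evidence for m in priors}
print({m: round(p, 3) for m, p in posteriors.items()})
# {'A': 0.362, 'B': 0.406, 'C': 0.232}
```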
Bayes' theorem gives

P(F|T) = P(T|F) P(F) / [P(T|F) P(F) + P(T|M) P(M)]
       = ((1/100) * (3/5)) / ((1/100) * (3/5) + (4/100) * (2/5))
       = (3/500) / (3/500 + 8/500)
       = 3/11

For two dice there are 6^2 = 36 equally likely outcomes. With the events A and B as defined in the problem, P(A and B) = 2/36 = 1/18, and the required conditional probability follows from P(A|B) = P(A and B) / P(B).
[Figure: Two-tailed test on the standard normal curve. The middle 95% contains the high-probability values of Z if H0 is true; the extreme 5% beyond Z = -1.96 and Z = +1.96 forms the critical region, where H0 is rejected.]

The p-value of the calculated statistic Z_cal is compared with the level of significance to decide whether Z_cal falls into the rejection or the non-rejection region.

Steps of a hypothesis test:
1. Formulate H0 and H1.
2. Compute the test statistic.
3. Compare it with the level of significance and determine whether it falls into the rejection or the acceptance (non-rejection) region.

[Figure: Acceptance region of probability (1 - alpha) between the critical values theta_1 and theta_2, with rejection regions of area alpha/2 in each tail.]
[Figure: Sampling distribution f(x, theta) with acceptance region of probability (1 - alpha) between x_1 and x_2 and rejection regions of alpha/2 in each tail.]

The null hypothesis (H0) is assumed true unless the data provide strong evidence against it; rejecting H0 means accepting the alternative hypothesis (H1).

Significance levels compound across tests: running 20 independent tests at alpha = 0.05 gives a probability 1 - (1 - 0.05)^20 = 0.64 of at least one false rejection of H0 in favour of H1.
r = sum((x_i - x_bar) * (y_i - y_bar)) / sqrt(sum((x_i - x_bar)^2) * sum((y_i - y_bar)^2))

where x_i and y_i are the paired observations.
x_bar = sum(X) / n = 6500 / 10 = 650
y_bar = sum(Y) / n = 660 / 10 = 66

cov(X, Y) = (1/n) * sum(XY) - x_bar * y_bar = 456040/10 - 650 * 66 = 45604 - 42900 = 2704

sigma_x = sqrt(sum(X^2)/n - x_bar^2) = sqrt(4764800/10 - 650^2) = sqrt(53980) = 232.33
sigma_y = sqrt(sum(Y^2)/n - y_bar^2) = sqrt(45784/10 - 66^2) = sqrt(222.4) = 14.91

r = cov(X, Y) / (sigma_x * sigma_y) = 2704 / (232.33 * 14.91) = 0.78
x_bar = sum(X) / n = 420 / 7 = 60
y_bar = sum(Y) / n = 476 / 7 = 68

cov(X, Y) = (1/n) * sum(XY) - x_bar * y_bar = 29464/7 - 60 * 68 = 4209.14 - 4080 = 129.14

sigma_x = sqrt(sum(X^2)/n - x_bar^2) = sqrt(28096/7 - 60^2) = sqrt(413.71) = 20.34
sigma_y = sqrt(sum(Y^2)/n - y_bar^2) = sqrt(34376/7 - 68^2) = sqrt(286.86) = 16.94

r = cov(X, Y) / (sigma_x * sigma_y) = 129.14 / (20.34 * 16.94) = 0.37
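A small Python helper (our own naming) reproduces the correlation from the summary sums used above:

```python
import math

# Correlation from summary statistics: r = cov / (sigma_x * sigma_y),
# with cov = sum_xy/n - mean_x*mean_y and sigma = sqrt(sum_sq/n - mean^2).
def corr_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    mean_x, mean_y = sum_x / n, sum_y / n
    cov = sum_xy / n - mean_x * mean_y
    sd_x = math.sqrt(sum_x2 / n - mean_x ** 2)
    sd_y = math.sqrt(sum_y2 / n - mean_y ** 2)
    return cov / (sd_x * sd_y)

# Summary values from the first worked example.
print(round(corr_from_sums(10, 6500, 660, 4764800, 45784, 456040), 2))  # 0.78
```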
Ranks are computed for each of the three variables; d1, d2, d3 denote the pairwise rank differences, with sum(d1^2) = 74, sum(d2^2) = 44, and sum(d3^2) = 156.

rho(x, y) = 1 - 6 * sum(d1^2) / (n * (n^2 - 1)) = 1 - (6 * 74) / (10 * (100 - 1)) = 0.552
rho(y, z) = 1 - 6 * sum(d2^2) / (n * (n^2 - 1)) = 1 - (6 * 44) / (10 * (100 - 1)) = 0.733
rho(z, x) = 1 - 6 * sum(d3^2) / (n * (n^2 - 1)) = 1 - (6 * 156) / (10 * (100 - 1)) = 0.055

The pair (y, z) has the strongest rank correlation.
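A one-line Python version of the Spearman formula, applied to the three sums of squared rank differences:

```python
# Spearman rank correlation from the sum of squared rank differences:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
def spearman_from_d2(sum_d2, n):
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

n = 10
print(round(spearman_from_d2(74, n), 3))   # rho(x, y) = 0.552
print(round(spearman_from_d2(44, n), 3))   # rho(y, z) = 0.733
print(round(spearman_from_d2(156, n), 3))  # rho(z, x) = 0.055
```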
The chi-square statistic compares observed with expected frequencies:

chi^2 = sum((Observed - Expected)^2 / Expected)

summed over the k categories:

chi^2 = sum over i = 1..k of (O_i - E_i)^2 / E_i

For independent normal variables x_i with means mu_i and standard deviations sigma_i:

chi^2 = (x_1 - mu_1)^2 / sigma_1^2 + (x_2 - mu_2)^2 / sigma_2^2 + ... + (x_v - mu_v)^2 / sigma_v^2
      = sum over i = 1..v of (x_i - mu_i)^2 / sigma_i^2
[Figure: Chi-square density curves for d.f. = 2, 5, 10, 15, and 30 over the range 0-50; the critical region of area alpha lies to the right of the critical value.]
Example: x_i = 2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93 (n = 10).

x_bar = sum(x) / n = 20.68 / 10 = 2.068

S^2 = sum((x - x_bar)^2) / n = 0.09096 / 10 = 0.009096

Hypotheses about the population variance:

H0: sigma^2 = sigma_0^2
H1: sigma^2 != sigma_0^2

chi^2 = n * S^2 / sigma_0^2 = (10 * 0.009096) / (0.145)^2 = 4.33

The computed chi^2 is compared with the tabulated critical value to decide whether H0 is rejected.
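A Python sketch of the variance test statistic above (sigma_0 = 0.145 as in the example):

```python
# Chi-square statistic for a variance test: chi2 = n * S^2 / sigma0^2,
# with S^2 computed as sum((x - mean)^2) / n as in the example.
xs = [2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93]
n = len(xs)
mean = sum(xs) / n
s2 = sum((x - mean) ** 2 for x in xs) / n
chi2 = n * s2 / 0.145 ** 2  # sigma0 = 0.145 (hypothesized std)
print(round(mean, 3), round(s2, 6), round(chi2, 2))  # 2.068 0.009096 4.33
```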
chi^2 = sum((observed - expected)^2 / expected)

For a 2 x 2 contingency table

        B1    B2    Total
A1      a     b     r1
A2      c     d     r2
Total   c1    c2    n

the statistic simplifies to

chi^2 = n * (ad - bc)^2 / (c1 * c2 * r1 * r2)

With one degree of freedom, chi^2 is the square of a standard normal statistic, chi^2 = Z^2.

In general, chi^2 = sum over cells (i, j) of ((Of)_ij - (Ef)_ij)^2 / (Ef)_ij, where (Of)_ij and (Ef)_ij are the observed and expected frequencies.
Two-sample pooled t statistic:

t = (X_bar_1 - X_bar_2) / (S_p * sqrt(1/n_1 + 1/n_2))

S_p^2 = ((n_1 - 1) * S_1^2 + (n_2 - 1) * S_2^2) / (n_1 + n_2 - 2)

where n_1 and n_2 are the two sample sizes.

For the rank-sum statistic W with N = n_1 + n_2, the mean and variance under the null hypothesis are

mu_W = n_1 * (N + 1) / 2
sigma_W^2 = n_1 * n_2 * (N + 1) / 12
[Figure: An enterprise data warehouse (EDW) feeding departmental warehouses, dashboards, reports, and alerts for data science users.]
[Figure: The big data ecosystem. Data devices (MP3 players, mobile phones, GPS units, ATMs, RFID readers, video games, medical imaging) generate data; data collectors and data aggregators process and resell it; data users and buyers include marketers, retail, mobile phone/TV and media delivery, and financial services such as banks, credit bureaus, and brokers.]
[Figure: Phases of the data analytics lifecycle, including 1. Discovery and 4. Model building.]
[Figure: Gartner's analytics ascendancy model, plotting value against difficulty. Descriptive analytics answers "What happened?" (hindsight); diagnostic analytics answers "Why did it happen?" (insight); predictive analytics answers "What will happen?" (foresight); prescriptive analytics answers "How can we make it happen?" (optimization).]
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

The same transactions in binary form (1 = the item appears in the transaction):

TID   Bread   Milk   Diaper   Beer   Eggs   Coke
T1    1       1      0        0      0      0
T2    1       0      1        1      1      0
T3    0       1      1        1      0      1
T4    1       1      1        1      0      0
T5    1       1      1        0      0      1
For an association rule A => B:

Support = (transactions containing both A and B) / (total transactions)

Confidence = (transactions containing both A and B) / (transactions containing A)

Lift = Confidence / (B / Total) = ((A and B) / A) / (B / Total)

Conviction: conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))
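These measures can be sketched in Python over the five-transaction market-basket dataset given earlier (function names are ours):

```python
# Support, confidence, and conviction for a rule X -> Y over the
# five-transaction dataset.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def conviction(x, y):
    conf = confidence(x, y)
    return float("inf") if conf == 1 else (1 - support(y)) / (1 - conf)

rule_x, rule_y = {"Milk", "Diaper"}, {"Beer"}
print(support(rule_x | rule_y))    # 0.4
print(confidence(rule_x, rule_y))  # 2/3
```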
For clusters with centroids K1 and K2, a point with height X_H and weight X_W is assigned by its Euclidean distance from each centroid, e.g.

d = sqrt((X_H - H1)^2 + (X_W - W1)^2)

where (H1, W1) is the height-weight centroid of cluster K1.
[Figure: Itemset lattice with the maximal frequent itemsets marked along the border between frequent and infrequent itemsets.]
Let i1, i2, ..., in denote the items and t1, t2, ..., tn the transactions. Apriori first finds L1, the frequent 1-itemsets. Each candidate set Ck is generated by joining L(k-1) with itself, Ck = L(k-1) * L(k-1); Lk is then the subset of candidates in Ck that meet the minimum support. The process repeats for increasing k until no Lk is found.
[Apriori example: the database D with transactions T1-T6 is scanned repeatedly.]

First scan (L1):        Second scan (C2):
Itemset   Supp          Itemset   Supp
A         4             {A,B}     3
B         5             {A,C}     3
C         3             {A,D}     2
D         4             {B,C}     2
                        {B,D}     4
                        {C,D}     1

L2 (itemsets meeting minimum support):
Itemset   Supp
A,B       3
A,C       3
B,D       4

C3: the candidate 3-itemset {A,B,D} has support 2, below the minimum support, so no frequent 3-itemset results.
Scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1:

C1                          L1
Itemset   Sup. count        Itemset   Sup. count
{I1}      6                 {I1}      6
{I2}      7                 {I2}      7
{I3}      6                 {I3}      6
{I4}      2                 {I4}      2
{I5}      2                 {I5}      2

C2 is then generated and pruned to L2, and C3 to L3, in the same way.
[Figure: An FP-tree rooted at null, with item nodes carrying counts (e.g. b:5, c:3, c:1, d:1, e:1) along its branches.]
[Figure: Simple linear regression. The errors are the vertical distances between the observed points and the fitted line of Y on X.]
In logistic regression the log-odds of Y is modeled as a linear function of the predictors:

ln(P(Y) / (1 - P(Y))) = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + beta_k*X_k

The right-hand side, beta_0 + beta_1*X_1 + ... + beta_k*X_k, is the same linear predictor used in linear regression; only the left-hand side changes.
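A minimal Python sketch of the logit/sigmoid relationship; the coefficients b0 and b1 are purely illustrative:

```python
import math

# The logit maps a probability to the linear predictor's scale;
# the sigmoid inverts it. Coefficients here are hypothetical.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 2.0  # illustrative beta_0, beta_1
x = 2.5
z = b0 + b1 * x               # linear predictor
p = sigmoid(z)                # P(Y = 1 | x)
log_odds = math.log(p / (1 - p))  # recovers z
print(round(p, 3))  # 0.731
```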
[Figure: A straight linear fit versus the S-shaped logistic curve, which stays between 0.0 and 1.0 on the Y axis.]

Linear regression:   Y = a_0 + a_1*X_1 + a_2*X_2 + ... + a_k*X_k
Logistic regression: ln[p / (1 - p)] = b_0 + b_1*X_1 + b_2*X_2 + ... + b_k*X_k

[Figure: (left) House size (2BHK-6BHK) against price (1Cr-6Cr) on the X axis; (right) the sigmoid g(z) rising from 0 to 1 as z runs from -20 to 20, crossing 0.5 at z = 0.]
[Figure: Supervised classification. Labeled training examples feed a machine learning algorithm, which learns rules for classification; applying the rules to a new example yields a predicted classification.]
The conditional probability of A given B is

P(A|B) = P(A and B) / P(B)

where P(A and B) is the joint probability of events A and B.
P(F|T) = P(T|F) P(F) / [P(T|F) P(F) + P(T|M) P(M)]
       = ((1/100) * (3/5)) / ((1/100) * (3/5) + (4/100) * (2/5))
       = (3/500) / (3/500 + 8/500)
       = 3/11
[Figure: Probability tree for two successive draws. The first draw gives red (R1) or white (W1) with probabilities P(R1) and P(W1); the second-draw branches carry the conditional probabilities P(R2|R1), P(W2|R1), P(R2|W1), and P(W2|W1).]
P(R|F) = P(F|R) * P(R) / [P(F|R) * P(R) + P(F|E) * P(E) + P(F|P) * P(P)]
[Figure: Decision tree for the contact-lens data rooted at astigmatism (yes/no), with further tests on age and spectacle prescription and leaves labeled hard, soft, or none.]
Gini index of a node t over classes j:

Gini(t) = 1 - sum over j of p(j|t)^2

where p(j|t) is the relative frequency of class j at node t.

When a node of n records is split into k partitions with n_i records at child i, the quality of the split is the weighted Gini:

GINI_split = sum over i = 1..k of (n_i / n) * GINI(i)
           = (n_1/n) * Gini(1) + (n_2/n) * Gini(2) + ... + (n_k/n) * Gini(k)

The information gain of an attribute is the parent entropy minus the weighted child entropies:

Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)
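A small Python sketch of the Gini computations (function names and the example counts are ours):

```python
# Gini index of a node from its class counts, and the weighted
# Gini of a split: sum over children of (n_i / n) * gini(child).
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# A pure node has Gini 0; a 50/50 node has Gini 0.5.
print(gini([10, 0]), gini([5, 5]))  # 0.0 0.5
print(round(gini_split([[4, 1], [1, 4]]), 3))  # 0.32
```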
[Figure: Decision trees for the contact-lens data rooted at tear production rate (yes/no branches), with further tests on age and spectacle prescription (myope vs. hypermetrope, presbyopic vs. not presbyopic) and leaves labeled soft, hard, or none.]
Entropy = - sum over classes I of p(I) * log2(p(I))

For a set with 9 examples of one class and 5 of the other out of 14:

Entropy = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940
[Figure: Pruning example. A subtree splitting on Color (red/blue) makes 2 correct and 4 incorrect predictions. On the validation data, if we had simply predicted the majority class (negative), we would make 2 errors instead of 4, so the subtree is pruned to a single leaf.]
[Figure: Growing a decision tree. Attribute A scores highest for Gain(S, A), so it is tested at the root. On the branch where A = v, attribute B scores highest for Gain(S_w, B) and is tested next. A leaf node is labeled with category c when S_v contains only examples in category c. A default leaf labeled d is added where S_w has no examples taking value x for attribute B; d is the category containing the most members of S_w.]
Entropy(S) = -P_cinema * log2(P_cinema) - P_tennis * log2(P_tennis) - P_shopping * log2(P_shopping) - P_stay_in * log2(P_stay_in)
           = -(6/10) * log2(6/10) - (2/10) * log2(2/10) - (1/10) * log2(1/10) - (1/10) * log2(1/10)
           = -(6/10) * (-0.737) - (2/10) * (-2.322) - (1/10) * (-3.322) - (1/10) * (-3.322)
           = 1.571

Gain(S, weather) = 1.571 - (|S_sun|/10) * Entropy(S_sun) - (|S_wind|/10) * Entropy(S_wind) - (|S_rain|/10) * Entropy(S_rain)
                 = 1.571 - (0.3) * Entropy(S_sun) - (0.4) * Entropy(S_wind) - (0.3) * Entropy(S_rain)
                 = 1.571 - (0.3) * (0.918) - (0.4) * (0.81125) - (0.3) * (0.918)
                 = 0.696

Gain(S, parents) = 1.571 - (|S_yes|/10) * Entropy(S_yes) - (|S_no|/10) * Entropy(S_no)
[Figure: Decision tree rooted at Weather with branches Sunny, Windy, and Rainy; the sunny subset S_sunny is expanded further.]

Gain(S_sunny, parents) = 0.918 - (|S_yes|/|S|) * Entropy(S_yes) - (|S_no|/|S|) * Entropy(S_no)
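The entropy arithmetic can be checked with a short Python sketch (function names are ours; the 6/2/1/1 counts are from the example):

```python
import math

# Entropy of a label distribution and the information gain of a split.
def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent_counts, child_counts):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(c) / n * entropy(c) for c in child_counts
    )

s = [6, 2, 1, 1]  # cinema, tennis, shopping, stay-in
print(round(entropy(s), 3))  # 1.571
```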
[Figure: Scatter plot of sample data points, x axis 0-5, y axis 0-25.]
The Euclidean distance between points (p_1, p_2, ...) and (q_1, q_2, ...) is

d(p, q) = sqrt(sum over i = 1..k of (p_i - q_i)^2)

Given observations x_1, ..., x_N partitioned into K clusters by an assignment C, the within-cluster point scatter is

W(C) = (1/2) * sum over k = 1..K of sum over C(i) = k, C(j) = k of ||x_i - x_j||^2
     = sum over k = 1..K of N_k * sum over C(i) = k of ||x_i - m_k||^2

where m_k is the mean (centroid) of the k-th cluster and N_k the number of points assigned to it.
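A Python sketch of the distance and within-cluster scatter (the three sample points are illustrative):

```python
import math

# Euclidean distance between two points, and the within-cluster sum of
# squared distances to the centroid (the quantity K-means minimizes).
def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def within_ss(points):
    m = centroid(points)
    return sum(euclidean(x, m) ** 2 for x in points)

cluster = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
print(euclidean((0, 0), (3, 4)))  # 5.0
print(round(within_ss(cluster), 3))  # 4.0
```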
First-order differencing: y_t = Y_t - Y_(t-1)

Second-order differencing: y_t = (Y_t - Y_(t-1)) - (Y_(t-1) - Y_(t-2)) = Y_t - 2*Y_(t-1) + Y_(t-2)

Autoregressive models regress the series on its own lagged values:

y_t = phi_0 + phi_1 * Y_(t-1)                       (first order)
y_t = phi_0 + phi_1 * Y_(t-1) + phi_2 * Y_(t-2)     (second order)

A model on the differenced series takes the form y_t - y_(t-1) = phi_1 * (Y_(t-1) - Y_(t-2)).

A series y_i, i = 1, ..., n, can be decomposed into seasonal, trend, and residual components s_i, t_i, r_i:

y_i = s_i + t_i + r_i
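Differencing is easy to sketch in Python (the series Y is illustrative):

```python
# First and second differences of a series Y_t.
def diff(ys):
    return [b - a for a, b in zip(ys, ys[1:])]

Y = [10, 12, 15, 19, 24]
print(diff(Y))        # [2, 3, 4, 5]
print(diff(diff(Y)))  # [1, 1, 1]
```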
"INFORMATION RETRIEVAL by Technical Publication"

Tokenization converts text into counts over a fixed vocabulary. For the sentence "It is a puppy and it is extremely cute":

Word        Count
They        0
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
...         ...
With N documents in the collection and df_t the number of documents containing term t, the tf-idf weight of term t in document d is

W_(t,d) = (1 + log(tf_(t,d))) * log(N / df_t)
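A Python sketch of the weight formula, using base-10 logarithms (an assumption; the base only rescales the weights):

```python
import math

# tf-idf weight: W = (1 + log10(tf)) * log10(N / df).
def tfidf(tf, df, n_docs):
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term appearing 3 times in the document and in 100 of 10,000 documents.
print(round(tfidf(3, 100, 10000), 3))  # 2.954
```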
[Figure: A small example graph on nodes A, B, D, E.]

The degree of node i is k_i = sum over j = 1..N of A_ij, where A is the adjacency matrix.

In a random graph with N nodes and link probability p, the degree distribution is binomial:

p_k = C(N - 1, k) * p^k * (1 - p)^(N - 1 - k)
[Figure: Hierarchical clustering dendrogram over items A through K, with horizontal cuts at levels 1-4 revealing the emerging clusters.]
Accuracy = (|True positives| + |True negatives|) / (|True positives| + |True negatives| + |False positives| + |False negatives|)

Recall (sensitivity) = |True positives| / (|True positives| + |False negatives|)

Specificity = |True negatives| / (|False positives| + |True negatives|)
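These metrics in Python (the confusion-matrix counts are hypothetical):

```python
# Accuracy, recall, and specificity from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return accuracy, recall, specificity

# Hypothetical counts for illustration.
acc, rec, spec = metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, rec, spec)  # 0.85 0.8 0.9
```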
Let E_i denote the error on the i-th fold. The overall cross-validation error is the average over the K folds:

E = (1/K) * sum over i = 1..K of E_i
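A Python sketch of fold construction and error averaging (function names and values are ours):

```python
# K-fold cross validation: partition indices into folds, then average
# the per-fold errors E_i.
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k contiguous folds.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cv_error(fold_errors):
    return sum(fold_errors) / len(fold_errors)

print(kfold_indices(10, 3))        # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(cv_error([0.1, 0.2, 0.15]))  # ~0.15
```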
[Figure: Cross validation. Data permitting, the dataset is split into training, validation, and testing portions. In K-fold cross validation, the total set of examples is partitioned and each of experiments 1-4 holds out a different fold as the test examples.]
[Figure: The elbow method. (a) SSE plotted against K, the number of clusters, falling steeply at first; (b) the same curve with the elbow point marked, beyond which additional clusters reduce SSE only slightly.]
[Figure: Scatter plot of data points, x axis 1.0-3.0, y axis 4.0-6.5.]
[Figure: Bar chart of Sales (USD) by year for 2008-2012, with bars ranging roughly from 15,000 to 27,500.]
[Figure: Pie chart of sports participation - Swimming 27%, Soccer 30%, Track 20%, Tennis 12%, Gymnastics 11%.]
[Figure: Venn diagram of readers of hard copy books and Kindle books; 23 read hard copy only, 12 Kindle only, and 4 both formats.]
[Figure: Scatter plot of Height (m), 2.4-3.2, against Diameter (cm), 4-8.]
Visual analytics

[Figure: JasperReports Server architecture. Browser users reach JasperReports Server instances (deployed as WAR files) through a load balancer. The server persists metadata to a repository database, reads from data sources, uses email services, and integrates with authentication providers such as SiteMinder, CAS, JAAS, and LDAP.]
[Figure: HDFS architecture with management and monitoring by Ambari. Clients read blocks from Datanodes, which replicate blocks among themselves; the Namenode handles block operations and updates its fsimage with editlogs.]
[Figure: MapReduce data flow. Input data is divided among parallel Map() tasks whose intermediate outputs are shuffled to Reduce() tasks, which produce the output data. In the detailed view, input splits 1-9 feed Map tasks 1-9, and the reduce stage (e.g. Reduce 1) writes outputs such as Output 1.]
[Figure: Apache Pig architecture. Scripts enter through the Grunt shell or Pig server and pass through the parser, optimizer, and compiler to the execution engine, which runs MapReduce jobs over Hadoop HDFS.]
[Figure: HBase architecture. Clients coordinate via ZooKeeper with the HMaster, which manages several Region Servers; each Region Server hosts regions and stores its data on HDFS.]