
Data Science and Big Data Analytics-1-82

The document outlines the syllabus for the Data Science and Big Data Analytics course under the Choice Based Credit System at Savitribai Phule Pune University for the T.E. (Computer Engineering) Semester - VI. It includes information on various data types, data processing techniques, and statistical methods relevant to data science. The publication is authored by Iresh A. Dhotre and Dr. Kalpana V. Metre and is published by Technical Publications.

Uploaded by

Mahindra Kumar

SUBJECT CODE : 310251

As per Revised Syllabus of

Savitribai Phule Pune University


Choice Based Credit System (CBCS)
T.E. (Computer) Semester - VI

Data Science
and Big Data Analytics
Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering,
Pune.

Dr. Kalpana V. Metre


(Ph.D.),
Associate Professor,
MET’s Institute of Engineering,
Nashik.

TECHNICAL
PUBLICATIONS
SINCE 1993 An Up-Thrust for Knowledge

Data Science and Big Data Analytics

Subject Code : 310251

T.E. (Computer Engineering) Semester - VI

© Copyright with I.A.Dhotre


All publishing rights (printed and ebook version) reserved with Technical Publications. No part of this book
should be reproduced in any form, Electronic, Mechanical, Photocopy or any information storage and
retrieval system without prior permission in writing, from Technical Publications, Pune.

Published by :
TECHNICAL PUBLICATIONS, SINCE 1993, An Up-Thrust for Knowledge
Amit Residency, Office No.1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph.: +91-020-24495496/97
Email : [email protected] Website : www.technicalpublications.org

Printer :
Yogiraj Printers & Binders
Sr.No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.

ISBN 978-93-91567-92-7

SPPU 19


[Figure: Characteristics of big data.
Data volume: clickstream logs, emails, application logs, contracts, machine data, geographical information systems and geo-spatial data.
Data velocity: mobile networks, sensor data, social media, web-based companies (Amazon, Facebook, Yahoo, Google).
Data variety: structured, unstructured, semi-structured.]
[Figure: Components of data science: business understanding, data exploration, data visualization, predictive modeling, data cleaning, feature engineering, data mining.]
[Figure: Printed books, videos, photos and drawings converted to binary (digital) data.]
Types of data: Structured, Unstructured

Attributes:
  Qualitative/Categorical: Nominal, Ordinal
  Quantitative/Numeric: Interval, Ratio


[Figure: Data wrangling workflow: raw data, evaluate usability, cleanse, usable data, analyze, visualize, findings.]


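Min-max normalization maps a value $V$ of attribute $A$ to a new value $V'$ in the range $[\text{New-Min}_A, \text{New-Max}_A]$:

$$V' = \frac{V - \text{Min}_A}{\text{Max}_A - \text{Min}_A}\,(\text{New-Max}_A - \text{New-Min}_A) + \text{New-Min}_A$$

For marks:

$$V' = \frac{V - \text{Min marks}}{\text{Max marks} - \text{Min marks}}\,(\text{New max} - \text{New min}) + \text{New min}$$

With minimum marks 8, maximum marks 20 and the new range $[0, 1]$, the four values $V_1 = 8$, $V_2 = 10$, $V_3 = 15$, $V_4 = 20$ become:

$$V_1' = \frac{8 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{0}{12} \times 1 = 0$$

$$V_2' = \frac{10 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{2}{12} \times 1 \approx 0.167$$

$$V_3' = \frac{15 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{7}{12} \times 1 \approx 0.583$$

$$V_4' = \frac{20 - 8}{20 - 8}\,(1 - 0) + 0 = \frac{12}{12} \times 1 = 1$$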
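The min-max calculation for the marks 8, 10, 15 and 20 can be checked with a short Python sketch. The function name `min_max_normalize` is a hypothetical helper introduced here for illustration; it is not defined in the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), so constant data cannot be scaled.
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


marks = [8, 10, 15, 20]  # marks from the worked example
normalized = min_max_normalize(marks)
print([round(x, 4) for x in normalized])  # [0.0, 0.1667, 0.5833, 1.0]
```

The smallest mark always maps to the new minimum and the largest to the new maximum, so one extreme outlier compresses all other values into a narrow band.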
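Decimal scaling:

$$V_i' = \frac{V_i}{10^{j}}$$

Min-max normalization:

$$V_i' = \left[\frac{V_i - \min}{\max - \min}\right](\text{new\_max} - \text{new\_min}) + \text{new\_min}$$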
Z-score normalization:

$$v' = \frac{76{,}300 - \text{Mean}}{\text{Standard deviation}} = \frac{76{,}300 - 54{,}000}{16{,}000} \approx 1.394$$
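As a quick check of the z-score arithmetic, the following Python sketch uses the value 76,300, mean 54,000 and standard deviation 16,000 from the example. The function name `z_score` is an illustrative assumption.

```python
def z_score(value, mean, std_dev):
    """Express a value as the number of standard deviations it lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


z = z_score(76_300, 54_000, 16_000)
print(round(z, 3))  # 1.394
```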

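Min-max normalization with $\min_A = 12000$, $\max_A = 98000$, $\text{new\_min}_A = 0.0$, $\text{new\_max}_A = 1.0$ and $v = 73600$:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

$$v' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 \approx 0.716$$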

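Min-max normalization with $\min_A = 200$, $\max_A = 1000$, $\text{new\_min}_A = 0.0$, $\text{new\_max}_A = 1.0$ and $v = 600$:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

$$v' = \frac{600 - 200}{1000 - 200}\,(1.0 - 0.0) + 0.0 = 0.5$$

Decimal scaling of the same value, with $j = 3$:

$$v' = \frac{v}{10^{j}} = \frac{600}{10^{3}} = 0.6$$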
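Decimal scaling divides every value by a power of ten. The Python sketch below assumes the usual convention that j is the smallest integer for which the largest absolute value, once scaled, is below 1; with a maximum of 1000 this gives j = 4, so the example's j = 3 is passed in explicitly for the value 600. The helper names `smallest_j` and `decimal_scale` are illustrative.

```python
def smallest_j(values):
    """Return the smallest integer j such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest >= 10 ** j:
        j += 1
    return j


def decimal_scale(value, j):
    """Scale a value by moving its decimal point j places to the left."""
    return value / 10 ** j


scaled = decimal_scale(600, 3)
print(scaled)             # 0.6
print(smallest_j([600]))  # 3
```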
[Figure: Concept hierarchy with the levels State, University, Branch, College.]


[Figure: Relationship between population and sample. Descriptive statistics are used to summarize data; inferential statistics are used to make inferences about the population; probability links the population to the sample.]
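Sample mean:

$$\bar{x} = \frac{\text{Sum of the values of the } n \text{ observations}}{\text{Number of observations in the sample}} = \frac{\sum x_i}{n}$$

Population mean:

$$\mu = \frac{\text{Sum of the values of the } N \text{ observations}}{\text{Number of observations in the population}} = \frac{\sum x_i}{N}$$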
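$$\frac{19 + 26}{2} = 22.5$$

$Q_1$ is the first quartile and $Q_3$ is the third quartile. The interquartile range is

$$\text{IQR} = Q_3 - Q_1$$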

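Deviation of an observation from the mean: $x_i - \bar{x}$

Sample variance:

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Population variance:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

With $M$ denoting the mean of $n$ values, the variance and standard deviation are

$$\sigma^2 = \frac{\sum (x - M)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum (x - M)^2}{n}}$$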

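$$\bar{x} = \frac{\sum x}{n} = \frac{5 + 10 + 15 + 20}{4} = \frac{50}{4} = 12.5$$

x      x - x̄     (x - x̄)²
5      -7.5      56.25
10     -2.5      6.25
15     2.5       6.25
20     7.5       56.25
                 Σ(x - x̄)² = 125

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{125}{4 - 1}} \approx 6.455$$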
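The sample standard deviation of 5, 10, 15 and 20 can be reproduced with a minimal Python sketch. The function name `sample_std` is an illustrative assumption; the standard library's `statistics.stdev` computes the same quantity.

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations summed and divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - 1))


data = [5, 10, 15, 20]
s = sample_std(data)
print(round(s, 3))  # 6.455
```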

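$$\mu = \frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$$

$$\sigma^2 = \frac{(206)^2 + (76)^2 + (-224)^2 + (36)^2 + (-94)^2}{5}$$

$$\sigma^2 = \frac{42436 + 5776 + 50176 + 1296 + 8836}{5} = \frac{108520}{5} = 21704$$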
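The variance that divides by the number of values (5 here) rather than by n - 1 can be verified for 600, 470, 170, 430 and 300 with the Python sketch below. The function name `population_variance` is an illustrative assumption.

```python
def population_variance(values):
    """Mean of the squared deviations from the mean (divides by N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


heights = [600, 470, 170, 430, 300]
mean = sum(heights) / len(heights)
variance = population_variance(heights)
print(mean, variance)  # 394.0 21704.0
```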
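Bayes' theorem: for mutually exclusive and exhaustive events $B_1, B_2, B_3, \ldots, B_n$ and any $1 \le k \le n$,

$$P(B_k \mid A) = \frac{P(A \mid B_k)\,P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$$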

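$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)}$$

$$= \frac{0.05 \times 0.25}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.0125}{0.0125 + 0.014 + 0.008} = \frac{0.0125}{0.0345} \approx 0.362$$

$$P(B \mid D) = \frac{P(D \mid B)\,P(B)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)}$$

$$= \frac{0.04 \times 0.35}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.014}{0.0125 + 0.014 + 0.008} = \frac{0.014}{0.0345} \approx 0.406$$

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D \mid A)\,P(A) + P(D \mid B)\,P(B) + P(D \mid C)\,P(C)}$$

$$= \frac{0.02 \times 0.4}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.008}{0.0125 + 0.014 + 0.008} = \frac{0.008}{0.0345} \approx 0.232$$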
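The three posterior probabilities share one denominator, so they can be computed together. The Python sketch below assumes the priors 0.25, 0.35, 0.4 and the conditional probabilities 0.05, 0.04, 0.02 belong to three sources labelled A, B and C, with D the observed event; the function name `posterior` is illustrative.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem: P(B_k | D) = P(D | B_k) P(B_k) / sum_i P(D | B_i) P(B_i)."""
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    total = sum(joint.values())  # total probability of the evidence D
    return {k: joint[k] / total for k in joint}


priors = {"A": 0.25, "B": 0.35, "C": 0.40}       # P(A), P(B), P(C)
likelihoods = {"A": 0.05, "B": 0.04, "C": 0.02}  # P(D|A), P(D|B), P(D|C)
post = posterior(priors, likelihoods)
for source, p in post.items():
    print(source, round(p, 4))
# A 0.3623
# B 0.4058
# C 0.2319
```

Source B has the highest posterior even though A has the highest conditional probability, because B's larger prior outweighs the difference.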

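$$P(F \mid T) = \frac{P(T \mid F)\,P(F)}{P(T \mid F)\,P(F) + P(T \mid M)\,P(M)}$$

$$= \frac{\frac{1}{100} \times \frac{3}{5}}{\frac{1}{100} \times \frac{3}{5} + \frac{4}{100} \times \frac{2}{5}} = \frac{\frac{3}{500}}{\frac{3}{500} + \frac{8}{500}} = \frac{3}{11}$$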
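When the inputs are simple fractions, exact rational arithmetic avoids rounding and reproduces the result 3/11 directly. This Python sketch uses the standard-library `fractions.Fraction` type; the variable names are illustrative and take P(F) = 3/5, P(M) = 2/5, P(T|F) = 1/100 and P(T|M) = 4/100 from the calculation.

```python
from fractions import Fraction

p_f = Fraction(3, 5)            # P(F)
p_m = Fraction(2, 5)            # P(M)
p_t_given_f = Fraction(1, 100)  # P(T | F)
p_t_given_m = Fraction(4, 100)  # P(T | M)

# Bayes' theorem with two alternatives, F and M.
numerator = p_t_given_f * p_f
p_f_given_t = numerator / (numerator + p_t_given_m * p_m)
print(p_f_given_t)  # 3/11
```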
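Total number of outcomes: $6^2 = 36$

$$P(A) = \frac{1}{9}$$

$$P(A \cap B) = \frac{2}{36} = \frac{1}{18}$$

$$P(A \cap B) = P(A)\,P(B \mid A)$$

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{1/18}{1/9} = \frac{1}{18} \times \frac{9}{1} = \frac{1}{2}$$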
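$$P(F \mid T) = \frac{P(T \mid F)\,P(F)}{P(T \mid F)\,P(F) + P(T \mid M)\,P(M)} = \frac{\frac{1}{100} \times \frac{3}{5}}{\frac{1}{100} \times \frac{3}{5} + \frac{4}{100} \times \frac{2}{5}} = \frac{3}{11}$$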

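[Figure: Sampling distribution under $H_0$, centred at 80 (the value from $H_0$). The middle 95 % contains the high-probability values if $H_0$ is true, between $Z = -1.96$ and $Z = 1.96$. The extreme 5 % in the two tails forms the critical region, where $H_0$ is rejected. The alternative hypothesis is denoted $H_1$.]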
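Symbols: level of significance ($\alpha$), probability value ($P$), calculated test statistic ($Z_{cal}$), critical value ($z_\alpha$).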

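[Figure: General procedure for hypothesis testing]
1. Formulate $H_0$ and $H_1$.
2. Select appropriate test.
3. Choose level of significance.
4. Calculate test statistic $TS_{CAL}$.
5. Either determine the probability associated with the test statistic, or determine the critical value of the test statistic, $TS_{CR}$.
6. Either compare the probability with the level of significance, or determine whether $TS_{CAL}$ falls into the (non) rejection region.
7. Reject / do not reject $H_0$.
8. Draw marketing research conclusion.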


[Figure: Two-tailed test. The density curve has an acceptance region of area $(1 - \alpha)$ between two critical values and a rejection region of area $\alpha/2$ in each tail.]
Null hypothesis ($H_0$) and alternative hypothesis ($H_1$).

$$1 - (1 - 0.05)^{20} \approx 0.64$$
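The expression 1 - (1 - 0.05)^20 is the probability of at least one false rejection when 20 independent tests are each run at the 5 % level. Reading the formula this way is an assumption, since only the expression itself survives in the text. A minimal Python sketch, with the hypothetical helper name `familywise_error_rate`:

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one Type I error across independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_tests


fwer = familywise_error_rate(0.05, 20)
print(round(fwer, 4))  # 0.6415
```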
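Pearson's correlation coefficient:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

[Figure: Scatter plots showing (a) positive correlation, (b) negative correlation, (c) no correlation.]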


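Table columns: $X$, $Y$, $X^2$, $Y^2$, $XY$

$$\bar{X} = \frac{\sum X}{n} = \frac{6500}{10} = 650, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{660}{10} = 66$$

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum XY - \bar{X}\bar{Y}}{\sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2}\,\sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2}}$$

$$r = \frac{\frac{456040}{10} - 650 \times 66}{\sqrt{\frac{4764800}{10} - 650^2}\,\sqrt{\frac{45784}{10} - 66^2}} \approx 0.78$$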

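Table columns: $X$, $Y$, $X^2$, $Y^2$, $XY$

$$\bar{X} = \frac{\sum X}{n} = \frac{420}{7} = 60, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{476}{7} = 68$$

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum XY - \bar{X}\bar{Y}}{\sqrt{\frac{1}{n}\sum X^2 - \bar{X}^2}\,\sqrt{\frac{1}{n}\sum Y^2 - \bar{Y}^2}}$$

$$r = \frac{\frac{29464}{7} - 60 \times 68}{\sqrt{\frac{28096}{7} - 60^2}\,\sqrt{\frac{34376}{7} - 68^2}} \approx 0.375$$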
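Both correlation examples use only the column totals, so the coefficient can be computed from those sums alone. The Python sketch below takes the totals from the two worked examples (n = 10 with ΣX = 6500, ΣY = 660, ΣXY = 456040, ΣX² = 4764800, ΣY² = 45784, and n = 7 with ΣX = 420, ΣY = 476, ΣXY = 29464, ΣX² = 28096, ΣY² = 34376). The function name `pearson_from_sums` is an illustrative assumption.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from column totals: cov(X, Y) / (sigma_x * sigma_y)."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    cov = sum_xy / n - mean_x * mean_y
    sigma_x = math.sqrt(sum_x2 / n - mean_x ** 2)
    sigma_y = math.sqrt(sum_y2 / n - mean_y ** 2)
    return cov / (sigma_x * sigma_y)


r_first = pearson_from_sums(10, 6500, 660, 456040, 4764800, 45784)
r_second = pearson_from_sums(7, 420, 476, 29464, 28096, 34376)
print(round(r_first, 2), round(r_second, 3))  # 0.78 0.375
```

Working from totals matches the hand calculation, but subtracting two large, nearly equal numbers can lose precision on big datasets; the deviation form of the formula is numerically safer there.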

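Spearman's rank correlation coefficient for three sets of ranks (1, 2, 3), with rank differences $d_1$, $d_2$, $d_3$ and their squares $d_1^2$, $d_2^2$, $d_3^2$:

$$\rho(x, y) = 1 - \frac{6 \sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6 \times 74}{10\,(100 - 1)} \approx 0.552$$

$$\rho(y, z) = 1 - \frac{6 \sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6 \times 44}{10\,(100 - 1)} \approx 0.733$$

$$\rho(z, x) = 1 - \frac{6 \sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6 \times 156}{10\,(100 - 1)} \approx 0.055$$

Since $\rho(y, z)$ is the largest of the three, rankings 2 and 3 are in closest agreement.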
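The three rank correlations differ only in the sum of squared rank differences, so one small function covers all of them. The Python sketch below uses n = 10 and the sums 74, 44 and 156 from the calculation; the function name `spearman_from_d2` is illustrative, and the formula assumes there are no tied ranks.

```python
def spearman_from_d2(sum_d_squared, n):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when ranks have no ties."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


n = 10
rho = {
    "x,y": spearman_from_d2(74, n),
    "y,z": spearman_from_d2(44, n),
    "z,x": spearman_from_d2(156, n),
}
for pair, value in rho.items():
    print(pair, round(value, 4))
# x,y 0.5515
# y,z 0.7333
# z,x 0.0545

closest = max(rho, key=rho.get)
print(closest)  # y,z
```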

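Chi-square ($\chi^2$) statistic:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E}$$

For independent normal variates $x_i$ with means $\mu_i$ and variances $\sigma_i^2$:

$$\chi^2 = \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} + \ldots + \frac{(x_v - \mu_v)^2}{\sigma_v^2} = \sum_{i=1}^{v} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$$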

[Figure: $\chi^2$ density curves for d.f. = 2, 5, 10, 15 and 30.]

[Figure: Critical region of the $\chi^2$ distribution: the area $\alpha$ to the right of the critical value.]

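$$\bar{x} = \frac{\sum x_i}{n} = \frac{2.15 + 1.99 + 2.05 + 2.12 + 2.17 + 2.01 + 1.98 + 2.03 + 2.25 + 1.93}{10} = \frac{20.68}{10} = 2.068$$

$$S^2 = \frac{\sum (x - \bar{x})^2}{n}$$

Table columns: $x$, $x - \bar{x}$, $(x - \bar{x})^2$

$$S^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{0.09096}{10} = 0.009096$$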
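Null hypothesis ($H_0$): $\sigma^2 = \sigma_0^2$

Alternative hypothesis ($H_1$): $\sigma^2 \ne \sigma_0^2$

$$\chi^2 = \frac{n s^2}{\sigma_0^2} = \frac{10 \times 0.009096}{(0.145)^2} \approx 4.326$$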
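The mean, variance and χ² statistic for the ten observations can be recomputed from the raw data. This Python sketch assumes the hypothesized standard deviation is 0.145, as in the calculation, and divides the sum of squares by n; the variable names are illustrative.

```python
observations = [2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93]
sigma_0 = 0.145  # standard deviation under the null hypothesis

n = len(observations)
mean = sum(observations) / n
# Variance with divisor n, matching the hand calculation.
s_squared = sum((x - mean) ** 2 for x in observations) / n
# Test statistic for a hypothesis about a population variance.
chi_square = n * s_squared / sigma_0 ** 2

print(round(mean, 3), round(s_squared, 6), round(chi_square, 3))
# 2.068 0.009096 4.326
```

The statistic would then be compared with the tabulated χ² value for n - 1 = 9 degrees of freedom at the chosen significance level.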

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
          B1      B2      Total
A1        a       b       r1
A2        c       d       r2
Total     c1      c2      n = a + b + c + d

$$\chi^2 = \frac{n\,(ad - bc)^2}{c_1 c_2 r_1 r_2}$$
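The shortcut formula for a 2 × 2 table gives the same value as summing (O - E)²/E over the four cells. The Python sketch below illustrates this with hypothetical cell counts a = 10, b = 20, c = 30, d = 40, which are not from the text; the function name `chi_square_2x2` is also illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2x2 table via n * (ad - bc)^2 / (c1 * c2 * r1 * r2)."""
    n = a + b + c + d
    r1, r2 = a + b, c + d  # row totals
    c1, c2 = a + c, b + d  # column totals
    return n * (a * d - b * c) ** 2 / (c1 * c2 * r1 * r2)


# Hypothetical counts, chosen only to illustrate the formula.
value = chi_square_2x2(10, 20, 30, 40)
print(round(value, 4))  # 0.7937
```

No continuity correction is applied here, so the result is the uncorrected statistic with one degree of freedom.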
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}, \qquad Z^2 = \frac{(\bar{X} - \mu_0)^2}{\sigma^2 / n}$$
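$$\chi^2 = \sum \frac{\left[(O_f)_{ij} - (E_f)_{ij}\right]^2}{(E_f)_{ij}}$$

$$\chi^2 = \frac{(31 - 30)^2}{30} + \frac{(42 - 30)^2}{30} + \frac{(22 - 30)^2}{30} + \frac{(25 - 30)^2}{30} + \frac{(19 - 20)^2}{20} + \frac{(8 - 20)^2}{20} + \frac{(28 - 20)^2}{20} + \frac{(25 - 20)^2}{20}$$

$$\chi^2 = \frac{234}{30} + \frac{234}{20} = 7.8 + 11.7 = 19.5$$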
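The expected frequencies 30 and 20 are consistent with a 2 × 4 table whose rows hold the observed counts 31, 42, 22, 25 and 19, 8, 28, 25. That layout is an assumption, since only the observed and expected pairs survive in the text. Under it, the following Python sketch derives the expected counts from the row and column totals and sums the χ² contributions; the function name `chi_square_statistic` is illustrative.

```python
def chi_square_statistic(observed):
    """Chi-square test of independence: sum of (O - E)^2 / E over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    return chi2


observed = [[31, 42, 22, 25], [19, 8, 28, 25]]  # assumed table layout
chi2 = chi_square_statistic(observed)
print(round(chi2, 2))  # 19.5
```

For a table with r rows and c columns the degrees of freedom are (r - 1)(c - 1), which is 3 under this assumed layout.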
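Two-sample t-test with sample sizes $n_1$ and $n_2$:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_P \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$S_P^2 = \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2}$$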
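A minimal Python sketch of the pooled two-sample t statistic follows. The sample summaries in the usage lines (means 10 and 8, both variances 4, both sample sizes 8) are hypothetical values chosen so the arithmetic is easy to follow, and the function name `pooled_t` is illustrative.

```python
import math


def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-sample t statistic assuming equal population variances (pooled variance)."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


# Hypothetical summaries: pooled variance 4, standard error 1, so t = 2.
t_value = pooled_t(10.0, 8.0, 4.0, 4.0, 8, 8)
degrees_of_freedom = 8 + 8 - 2
print(t_value, degrees_of_freedom)  # 2.0 14
```

Here `var1` and `var2` are sample variances computed with divisor n - 1, which is what the pooled formula expects.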

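For the rank sum $w$ of the first sample, with $N = n_1 + n_2$:

$$\mu_w = \frac{n_1 (N + 1)}{2}$$

$$\sigma_w^2 = \frac{n_1 n_2 (N + 1)}{12}$$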
