Data Analysis

Course Code  Course Name     Category  L  T  P  Credit
ECT332       DATA ANALYSIS   PEC       2  1  0  3
Preamble: This course aims to set the foundation for students to develop new-age skills pertaining
to analysis of large-scale data using modern tools.
Prerequisite: None
Course Outcomes: After the completion of the course the student will be able to

Mapping of course outcomes with program outcomes:
CO 1  3 3 3 2
CO 2  3 3 2 3 3
CO 3  3 3 2 3 3 2 2
CO 4  3 3 2 3 3 2 2
CO 5  3 3 2 3 3 2 2
CO 6  3 3 2 3 3 2 2
Assessment Pattern

Bloom's Category   Continuous Assessment Tests   End Semester Examination
                   Test 1        Test 2
Remember           10            10            20
Understand         30            30            60
Apply              10            10            20
Analyse            -             -             -
Evaluate           -             -             -
Create             -             -             -
Mark distribution

Total Marks   CIE   ESE   ESE Duration
150           50    100   3 hours
End Semester Examination Pattern: There will be two parts: Part A and Part B. Part A contains 10
questions with 2 questions from each module, each carrying 3 marks. Students should answer all
questions. Part B contains 2 questions from each module, of which the student should answer any
one. Each question can have a maximum of 2 sub-divisions and carries 14 marks.
SYLLABUS
No Topic No. of
Lectures
1 Overview of Data Analysis and Python
1.1 Numpy and Scipy Python modules for data analysis. 2
1.2 Reading and processing spreadsheets and csv files with Python using 2
xlrd, xlwt and openpyxl.
1.3 Data visualization with Matplotlib. Two dimensional charts and plots. 2
Scatter plots with matplotlib. Three dimensional visualization using
Mayavi module.
1.4 Reading data from SQL and MongoDB databases with Python 2
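A minimal sketch of the module-1 workflow: read a CSV file into a NumPy array and plot the second column against the first with Matplotlib. The column names and data values here are invented for illustration; a real exercise would read a file from disk.

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display needed
import matplotlib.pyplot as plt

# Stand-in for a CSV file on disk; np.genfromtxt accepts any file-like object.
csv_text = io.StringIO("time,voltage\n0.0,1.2\n0.1,1.5\n0.2,1.1\n")
data = np.genfromtxt(csv_text, delimiter=",", skip_header=1)

plt.plot(data[:, 0], data[:, 1], "o-")   # column 2 against column 1
plt.xlabel("time")
plt.ylabel("voltage")
plt.savefig("plot.png")        # write the chart instead of showing it
```

The same `data` array could equally be fed from xlrd or openpyxl after reading a spreadsheet.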
2 Big Data Arrays with Pandas
2.1 Introduction to Python pandas 1
2.2 Reading and writing of data as pandas dataframes. Separating header, 3
columns, rows etc. and other manipulations
2.3 Reading data from different kinds of files. Merging, concatenating and 3
grouping of data frames. Use of pivot tables. Pickling
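The module-2 operations above can be sketched in a few lines of pandas: build two data frames, concatenate them, form a pivot table, and pickle it. The column names and values are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"species": ["setosa", "setosa"], "petal_len": [1.4, 1.3]})
b = pd.DataFrame({"species": ["virginica", "virginica"], "petal_len": [5.1, 5.9]})

df = pd.concat([a, b], ignore_index=True)          # stack rows of two frames

# Pivot table: mean petal length per species
pivot = df.pivot_table(values="petal_len", index="species", aggfunc="mean")

pivot.to_pickle("pivot.pkl")                       # pickle to disk
restored = pd.read_pickle("pivot.pkl")             # and read it back
```

`pd.merge` and `df.groupby` follow the same pattern for the merging and grouping topics.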
3 PCA and Cluster Analysis
3.1 Singular value decomposition of a matrix/array. Eigenvalues and 1
eigenvectors.
3.2 PCA, scree plot. Dimensionality reduction with PCA. Loadings for 3
principal components. Case study with Python. Cluster analysis.
3.3 Cluster analysis, dendrograms 2
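A minimal PCA sketch via SVD, covering topics 3.1–3.2: compute the scree values (explained variance ratios), project onto the first principal component, and read off its loadings. The data matrix here is synthetic; a real case study would load a data set instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples in 3-D that really lie near a line -> one dominant component
t = rng.normal(size=(100, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.01 * rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)                  # centre the data before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)          # scree-plot values, largest first
X_reduced = Xc @ Vt[:1].T                # project onto first principal component
loadings = Vt[0]                         # loadings of PC1 on original variables
```

With almost all variance in the first component, the scree plot would show one tall bar, which is the visual cue for choosing the reduced dimension.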
4 Statistical Data Analysis
4.1 Hypothesis testing. Bayesian analysis. Meaning of prior, posterior and 3
likelihood functions. Use of pymc3 module to compute the posterior
probability.
4.2 MAP estimation. Credible interval, conjugate distributions. 3
4.3 Contingency table and chi square test. Kernel density estimation. 3
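Two module-4 ideas can be sketched without pymc3: a conjugate Beta-Bernoulli posterior computed in closed form (prior, likelihood and posterior in one step), and a chi-square test of independence on a contingency table with scipy. The coin-flip counts and the table entries are invented.

```python
import numpy as np
from scipy import stats

# Beta(1, 1) prior (uniform) + 7 heads / 3 tails -> Beta(8, 4) posterior
heads, tails = 7, 3
posterior = stats.beta(1 + heads, 1 + tails)
map_estimate = heads / (heads + tails)      # mode of Beta(8, 4): (8-1)/(8+4-2) = 0.7
ci_low, ci_high = posterior.interval(0.95)  # a 95% credible interval

# Chi-square test of independence on a 2x2 contingency table
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
```

With a non-uniform prior the MAP estimate would shift away from the sample proportion; that is the point of the conjugate-distribution topic.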
5 Machine Learning
5.1 Supervised and unsupervised learning. Use of scikit-learn. Regression 2
using scikit-learn.
5.2 Deep learning with convolutional neural networks. Structure of CNN. 2
5.3 Use of Keras and Tensorflow. Machine learning with pytorch. Case 3
study of character recognition with MNIST dataset.
5.4 High performance computing for machine learning. Use of numba (jit) 2
and numexpr for faster Python code. Use of IPython-parallel.
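The supervised-learning side of topic 5.1 can be sketched with scikit-learn: fit a linear regression on labelled data and predict for a new sample. The synthetic data and its true coefficients (slope 3, intercept 2) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))                 # one feature
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=50)  # noisy labels

model = LinearRegression()
model.fit(X, y)              # supervised: features plus known labels

pred = model.predict([[5.0]])  # predict for an unseen sample
```

An unsupervised method such as `sklearn.cluster.KMeans` would instead be given `X` alone, with no `y`, which is the contrast the syllabus draws.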
Simulation Assignments
1. Download the iris data set and read it into a pandas data frame. Extract the header and replace
it with a new header. Extract columns and rows. Extract pivot tables. Filter the data based on
the labels. Store a pivot table as a pickle and retrieve it.
2. For the same data set, perform principal component analysis. Observe the scree plot.
Identify the principal components. Obtain a low dimensional data, with only the principal
components and compute the mean square error between the original data and the
approximated one. Compute the loadings for the principal components.
3. For the same data, perform hierarchical and K-means clustering with Python codes. Obtain
dendrograms in each case and appreciate the clusters.
4. Download the MNIST letter data set. Construct a CNN network with appropriate layers
using Keras and Tensorflow. Train the CNN with the MNIST data set. Appreciate the
selection and use of training, test and cross-validation data sets. Save the model and weights
and use the model to identify letter images. You may use OpenCV for reading images.
5. Write a Python script to generate alphanumeric images (26 upper case, 26 lower case and 10
numerals, each 12 point in size) of say 16×16 dimension out of Windows .ttf files. Create 62
folders, each containing a data set of one alphanumeric character. Create a new CNN with
Keras and Tensorflow. Create a cross-validation data set by taking 10 images out of each of
the 62 folders. Use 80% of the total data for training and 20% for testing the CNN. Use an
HPCC-like system to train the model and save the model and weights. Test this model to
recognize letter images. You may use OpenCV for reading images.
6. Repeat assignment 4 using pytorch instead of Keras.
7. Repeat assignment 5 using pytorch instead of Keras.
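The first assignment's pandas steps can be sketched as below, on a tiny inline stand-in for the iris data; the real exercise would use pd.read_csv on the downloaded file, and the replacement column names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    {"sepal_len": [5.1, 4.9, 6.3, 5.8],
     "petal_len": [1.4, 1.5, 6.0, 5.1],
     "species":   ["setosa", "setosa", "virginica", "virginica"]})

df.columns = ["SepalLength", "PetalLength", "Species"]   # replace the header

virginica = df[df["Species"] == "virginica"]             # filter on a label

# Pivot table (mean petal length per species), pickled and retrieved
pivot = df.pivot_table(values="PetalLength", index="Species", aggfunc="mean")
pivot.to_pickle("iris_pivot.pkl")                        # store ...
back = pd.read_pickle("iris_pivot.pkl")                  # ... and retrieve
```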
PART A
Answer All Questions
1 Create a two dimensional array of real numbers using numpy. (3) K3
Write Python code to pickle this data.
2 Write Python code to import mayavi module and perform 3-D (3) K3
visualization of x² + y² + z² = 1.
3 Write Python code to generate a 5 × 5 pandas data frame of random (3) K3
numbers. Add a header to this dataframe.
4 Write Python code to concatenate two dataframes with the same (3) K3
number of columns.
5 Write the expression for the singular value decomposition of a (3) K3
matrix A
6 Explain how principal components are isolated using a scree plot. (3) K1
7 State Bayes theorem and explain the significance of the terms prior, (3) K1
likelihood and posterior.
8 Write Python code with pymc3 to realize a Bernoulli trial with (3) K3
p(head) = 0.2
9 Give the structure of a convolutional neural network. (3) K1
10 Compare supervised and unsupervised learning. (3) K1
PART B
Answer one question from each module. Each question carries 14 marks.
Module I
11(A) Write Python code to read a spreadsheet in .xls format and a (8) K3
text file in .csv format and put these data into numpy arrays.
In both cases, plot the second column against the first column
using matplotlib.
11(B) Write Python code to read tables from SQL and MongoDB (6) K3
databases.
OR
Module II
13(A) Write Python code to import a table in .xls format into a (6) K3
data frame. Remove all NaN values.
13(B) Write Python code to generate 10 data frames of size 5 × 5 (8) K3
of random numbers and use a for loop to concatenate them.
Pickle the concatenated dataframe and store it. Write another
code to retrieve the dataframe from the pickle.
OR
14(A) Write Python code to read in a table from a pdf file into (8) K3
a pandas dataframe. Write code to remove the first two
columns and write the rest of the dataframe as a json file.
14(B) Explain the term pivot table. Create a pivot table from the (6) K3
above dataframe
Module III
OR
Module IV
17(A) Assume that you have a dataset with 57 data points of Gaussian (8) K3
distribution with a mean of 4 and standard deviation of 0.5.
Using PyMC3, write Python code to compute:
• The posterior distribution
17(B) Write Python code to find the Bayesian credible interval (6) K3
in the above question. How is it different from a confidence
interval?
OR
19(A) Explain the use of numba and numexpr in faster Python (8) K3
execution with examples.
19(B) Explain the use of Keras as a frontend for Tensorflow with (6) K3
Python code.
OR