Guidelines DAVP
Week 1 to 3, Unit 1: Introduction to basic statistics and analysis
Topics: Fundamentals of Data Analysis, Statistical foundations for Data Analysis, Types of data, Descriptive Statistics, Correlation and covariance, Linear Regression, Statistical Hypothesis Generation and Testing, Python Libraries: NumPy, Pandas, Matplotlib
References: Ch 1: pg 11-24, pg 29-35, pg 37-38 [2]; Ch 1: 1.3 (pg 4-6) [1]
Week 4 to 6, Unit 2: Array manipulation using NumPy
Topics: NumPy array: Creating NumPy arrays, various data types of NumPy arrays, Indexing and slicing, swapping axes, transposing arrays, data processing using NumPy arrays
References: Ch 4: 4.1. Usage of rand(), randn() and randint() functions of NumPy [1]
Week 7 to 10, Unit 3: Data Manipulation using Pandas
Topics: Data Structures in Pandas: Series, Data Frame, Index objects, loading data into a Pandas data frame, Working with Data Frames: Arithmetics, Statistics, Binning, Indexing, Reindexing, Filtering, Handling missing data, Hierarchical indexing, Data wrangling: Data cleaning, transforming, merging and reshaping
References: Ch 5: 5.1, 5.2 excluding Arithmetic and data alignment, axis indexes with duplicate labels, 5.3; Ch 6: 6.1 (pg 177-181, 184); Ch 7: 7.1, 7.2 till binning (pg 203-217); Ch 8: 8.1 (pg 247-253), 8.2 (pg 253-258), 8.3 (pg 270-273) [1]
Week 11 to 13, Unit 4: Plotting and Visualization
Topics: Using Matplotlib to plot data: figures, subplots, markings, color and line styles, labels and legends, Plotting functions in Pandas: Lines, bar, Scatter plots, histograms, stacked bars, Heatmap, 3D Plotting, interactive plotting using Bokeh and Plotly
References: Ch 9: 9.1 (pg 281-296), 9.2 (pg 298-313), 9.3 [1]; Ch 5: pg 281-282 [2]
Week 14 to 15: Data Aggregation and Group operations
Topics: Group by mechanics, Data aggregation, General split-apply-combine, Pivot tables and cross tabulation
References: Chapter 10: 10.1, 10.2, 10.3 (till pg 337), 10.5 [1]
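As an illustration of the group operations listed for Week 14 to 15, the following is a minimal sketch in Pandas. The data frame and its column names (region, product, units) are hypothetical and invented only for this example; they do not come from the prescribed readings.

```python
import pandas as pd

# Hypothetical sales records, invented only for illustration.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "units": [10, 4, 7, 3, 8],
})

# Group by mechanics and aggregation: total and mean units per region.
per_region = df.groupby("region")["units"].agg(["sum", "mean"])

# General split-apply-combine: each row's share of its region's total.
df["share"] = df.groupby("region")["units"].transform(lambda s: s / s.sum())

# Pivot table: regions as rows, products as columns, summed units in cells.
pivot = df.pivot_table(index="region", columns="product",
                       values="units", aggfunc="sum")

# Cross tabulation: number of records in each region/product cell.
counts = pd.crosstab(df["region"], df["product"])

print(per_region)
print(pivot)
print(counts)
```

The pivot table and the cross tabulation have the same layout here; they differ in that pivot_table aggregates a value column, while crosstab counts occurrences by default.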
Essential/recommended readings
1. McKinney W. Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython. 3rd edition. O’Reilly
Media, 2022
2. Molin S. Hands-On Data Analysis with Pandas, Packt Publishing, 2019.
3. Gupta S.C., Kapoor V.K., Fundamentals of Mathematical Statistics, Sultan Chand & Sons, 2020.
1. Load a selected dataset into a Pandas data frame. Identify and count the missing values
in the data frame. Clean the data by removing noise as follows:
a) Drop duplicate rows.
b) Detect the outliers and remove the rows having more than two outliers identified using boxplot.
c) Identify the most positively correlated pair of attributes and the most negatively
correlated pair of attributes.
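One possible way to approach the steps of this exercise is sketched below. The dataset is hypothetical and kept tiny so the result can be checked by hand. The outlier rule is the usual boxplot convention (values beyond 1.5 times the interquartile range from the quartiles), which is an assumption since the exercise does not fix the rule, and "more than two outliers" is read here as more than two outlying values within a single row.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, invented for illustration only.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0, 2.0],
    "b": [2.0, 4.1, 6.2, np.nan, 10.0, 12.1, 300.0, 4.1],
    "c": [9.0, 8.0, 7.5, 6.0, 5.0, 4.0, -90.0, 8.0],
})

# Identify and count the missing values in each column.
missing_per_column = df.isna().sum()

# a) Drop duplicate rows (the last row repeats the second one).
deduped = df.drop_duplicates()

# b) Boxplot rule: a value is an outlier if it lies more than
#    1.5 * IQR below the first quartile or above the third quartile.
q1 = deduped.quantile(0.25)
q3 = deduped.quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped < q1 - 1.5 * iqr) | (deduped > q3 + 1.5 * iqr)

# Keep only the rows that contain at most two outlying values.
cleaned = deduped[is_outlier.sum(axis=1) <= 2]

# c) Correlation matrix; mask the diagonal and lower triangle so that
#    every attribute pair appears exactly once.
corr = cleaned.corr()
upper_tri = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper_tri.unstack().dropna()   # index entries are (column, row) labels
most_positive = pairs.idxmax()
most_negative = pairs.idxmin()

print(missing_per_column)
print(cleaned)
print("Most positively correlated:", most_positive)
print("Most negatively correlated:", most_negative)
```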
2. Import the Iris data from sklearn.datasets, or download it from
https://archive.ics.uci.edu/ml/datasets/iris
a. Compute mean, mode, median, standard deviation, confidence interval and standard
error for each feature
b. Compute correlation coefficients between each pair of features and plot heatmap
c. Find covariance between length of sepal and petal
d. Build contingency table for class feature
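A minimal sketch of these computations is given below, assuming scikit-learn is installed and the data is loaded with load_iris(as_frame=True). The 95% confidence level and the normal approximation (z = 1.96) are assumptions, since the exercise does not state them. The heatmap itself is left out; the corr matrix can be passed to a plotting function such as seaborn.heatmap.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data      # four numeric feature columns
target = iris.target      # class labels encoded as 0, 1, 2

# a. Descriptive statistics for each feature.
n = len(features)
summary = pd.DataFrame({
    "mean": features.mean(),
    "median": features.median(),
    "mode": features.mode().iloc[0],   # first mode when several values tie
    "std": features.std(),             # sample standard deviation (ddof=1)
})
summary["sem"] = summary["std"] / np.sqrt(n)   # standard error of the mean
# Assumption: 95% interval using the normal approximation (z = 1.96).
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]

# b. Pairwise Pearson correlation coefficients (input for a heatmap).
corr = features.corr()

# c. Covariance between sepal length and petal length.
cov_len = features["sepal length (cm)"].cov(features["petal length (cm)"])

# d. Contingency (frequency) table for the class feature.
class_names = target.map(dict(enumerate(iris.target_names)))
class_counts = pd.crosstab(index=class_names, columns="count")

print(summary)
print(corr)
print(cov_len)
print(class_counts)
```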
3. Load Titanic data from sklearn library and plot the following with proper legend and axis
labels:
a. Plot bar chart to show the frequency of survivors and non-survivors for male
and female passengers separately
b. Draw a scatter plot for any two selected features
c. Compare density distribution for features age and passenger fare
d. Use a pair plot to show pairwise bivariate distribution
Project: Students are encouraged to work on a good dataset in consultation with their faculty
and apply the concepts learned in the course.
Additional Practice Exercises:
2. Consider two data files (in CSV format) containing the attendance of two workshops. Each file has three fields,
‘Name’, ‘Date’ and ‘Duration’ (in minutes), where names are unique within a file. Note that duration may take one of
only three values (30, 40, 50). Import the data into two data frames and do the following:
a. Perform merging of the two data frames to find the names of students who had attended both
workshops.
b. Find names of all students who have attended a single workshop only.
c. Merge two data frames row-wise and find the total number of records in the data frame.
d. Merge two data frames row-wise and use the two columns, names and dates, as a hierarchical (multi-level) row index.
Generate descriptive statistics for this hierarchical data frame.
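The four parts of this exercise could be approached along the following lines. The file contents, names and dates below are hypothetical stand-ins for the two CSV files, read from in-memory strings so that the sketch runs without any external files.

```python
import io
import pandas as pd

# Hypothetical contents standing in for the two attendance CSV files.
csv_1 = io.StringIO(
    "Name,Date,Duration\n"
    "Asha,2024-01-10,30\n"
    "Bilal,2024-01-10,40\n"
    "Chitra,2024-01-10,50\n"
)
csv_2 = io.StringIO(
    "Name,Date,Duration\n"
    "Bilal,2024-01-17,30\n"
    "Chitra,2024-01-17,40\n"
    "Dev,2024-01-17,50\n"
)
w1 = pd.read_csv(csv_1)
w2 = pd.read_csv(csv_2)

# a. Inner merge on Name keeps only students present in both files.
both = pd.merge(w1, w2, on="Name", suffixes=("_w1", "_w2"))

# b. Outer merge with an indicator column; rows not marked "both"
#    belong to students who attended a single workshop only.
outer = pd.merge(w1, w2, on="Name", how="outer", indicator=True)
single = outer.loc[outer["_merge"] != "both", "Name"]

# c. Row-wise merge (concatenation) and the total number of records.
stacked = pd.concat([w1, w2], ignore_index=True)
total_records = len(stacked)

# d. Use Name and Date as a hierarchical row index, then describe.
indexed = stacked.set_index(["Name", "Date"]).sort_index()
stats = indexed.describe()

print(both["Name"].tolist())
print(single.tolist())
print(total_records)
print(stats)
```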
3. Consider the following data frame containing a family name, gender of the family member and her/his monthly
income in each record.
Name     Gender    MonthlyIncome (Rs.)
Shah     Male      114000.00
Vats     Male       65000.00
Vats     Female     43150.00
Kumar    Female     69500.00
Vats     Female    155000.00
Kumar    Male      103000.00
Shah     Male       55000.00
Shah     Female    112400.00
Kumar    Female     81030.00
Vats     Male       71900.00
Write a program in Python using Pandas to perform the following:
a. Calculate and display family-wise gross monthly income.
b. Display the highest and lowest monthly income for each family name.
c. Calculate and display monthly income of all members earning income less than Rs. 80000.00.
d. Display total number of females along with their average monthly income.
e. Delete rows with monthly income less than the average income of all members.
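A minimal sketch of one possible solution follows, using the ten records from the table above. The variable names are illustrative, and the column name MonthlyIncome (without the unit) is a simplification made here.

```python
import pandas as pd

# Records copied from the table in the exercise.
df = pd.DataFrame({
    "Name": ["Shah", "Vats", "Vats", "Kumar", "Vats",
             "Kumar", "Shah", "Shah", "Kumar", "Vats"],
    "Gender": ["Male", "Male", "Female", "Female", "Female",
               "Male", "Male", "Female", "Female", "Male"],
    "MonthlyIncome": [114000.00, 65000.00, 43150.00, 69500.00, 155000.00,
                      103000.00, 55000.00, 112400.00, 81030.00, 71900.00],
})

# a. Family-wise gross monthly income.
family_income = df.groupby("Name")["MonthlyIncome"].sum()

# b. Highest and lowest monthly income for each family name.
income_range = df.groupby("Name")["MonthlyIncome"].agg(["max", "min"])

# c. Members earning less than Rs. 80000.00.
below_80k = df[df["MonthlyIncome"] < 80000]

# d. Number of females and their average monthly income.
females = df[df["Gender"] == "Female"]
female_count = len(females)
female_mean = females["MonthlyIncome"].mean()

# e. Delete rows with income below the overall average by keeping
#    only the rows at or above it.
overall_mean = df["MonthlyIncome"].mean()
at_or_above_mean = df[df["MonthlyIncome"] >= overall_mean]

print(family_income)
print(income_range)
print(below_80k)
print(female_count, female_mean)
print(at_or_above_mean)
```

Part e filters into a new data frame rather than calling drop on the original, which leaves df intact for the other parts.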