Class Notes
Practical File
Umama Yahya
BA (HONS) Psychology
24528/55
INDEX:
1) Write programs in Python using the NumPy library to do the following:
a. Compute the mean, standard deviation, and variance of a two-dimensional random integer
array along the second axis.
b. Create a 2-dimensional array of size m x n integer elements; also print the shape, type and
data type of the array and then reshape it into an n x m array, where n and m are user inputs
given at run time.
c. Test whether the elements of a given 1D array are zero, non-zero and NaN. Record the
indices of these elements in three separate arrays.
d. Create three random arrays of the same size: Array1, Array2 and Array3. Subtract Array2
from Array3 and store in Array4. Create another array Array5 having two times the
values in Array1. Find Co-variance and Correlation of Array1 with Array4 and
Array5 respectively.
e. Create two random arrays of the same size, 10: Array1, and Array2. Find the sum of the
first half of both the arrays and product of the second half of both the arrays.
2] Do the following using PANDAS Series:
a. Create a series with 5 elements. Display the series sorted on index and also sorted on values
separately
b. Create a series with N elements with some duplicate values. Find the minimum and maximum
ranks assigned to the values using ‘first’ and ‘max’ methods
c. Display the index value of the minimum and maximum element of a Series
3. Create a data frame having at least 3 columns and 50 rows to store numeric data generated
using a random function. Replace 10% of the values by null values whose index positions are
generated using random function. Do the following:
a. Identify and count missing values in a data frame.
c. Identify the row label having maximum of the sum of all values in a row and drop that row.
f. Find the correlation between first and second column and covariance between second and third
column.
g. Discretize the second column and create 5 bins.
4. Consider two Excel files having attendance of two workshops. Each file has three fields: ‘Name’,
‘Date’, ‘Duration’ (in minutes), where names are unique within a file. Note that duration may
take one of three values (30, 40, 50) only. Import the data into two data frames and do the
following:
a. Perform merging of the two data frames to find the names of students who had attended both
workshops.
b. Find names of all students who have attended a single workshop only.
c. Merge two data frames row-wise and find the total number of records in the data frame.
d. Merge two data frames row-wise and use two columns viz. names and dates as multi-row
indexes. Generate descriptive statistics for this hierarchical data frame.
5. Using Iris data, plot the following with proper legend and axis labels: (Download IRIS data
from: https://archive.ics.uci.edu/ml/datasets/iris or import it from sklearn datasets)
a. Plot bar chart to show the frequency of each class label in the data.
b. Draw a scatter plot for Petal width vs sepal width and fit a regression line
d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.
f. Compute mean, mode, median, standard deviation, confidence interval and standard error for
each feature
g. Compute correlation coefficients between each pair of features and plot heatmap
6. Consider the following data frame containing a family name, gender of the family member and
her/his monthly income in each record.
Name     Gender    Monthly Income (Rs.)
Shah     Male      114000.00
Vats     Male       65000.00
Vats     Female     43150.00
Kumar    Female     69500.00
Vats     Female    155000.00
Kumar    Male      103000.00
Shah     Male       55000.00
Shah     Female    112400.00
Kumar    Female     81030.00
Vats     Male       71900.00
Write a program in Python using Pandas to perform the following:
b. Calculate and display the member with the highest monthly income.
c. Calculate and display monthly income of all members with income greater than Rs. 60000.00.
d. Calculate and display the average monthly income of the female members
7. Using the Titanic dataset, do the following:
a. Find total number of passengers with age less than 30
b. Find total fare paid by passengers of first class
c. Compare number of survivors of each passenger class
d. Compute descriptive statistics for any numeric attribute genderwise
1] Write programs in Python using the NumPy library to do the following:
INPUT:
(a) Compute the mean, standard deviation, and variance of a two dimensional random integer
array along the second axis.
import numpy as np
array_a = np.random.randint(1, 100, size=(4, 5))
print("Array A:\n", array_a)
print("Mean along axis 1:", np.mean(array_a, axis=1))
print("Standard Deviation along axis 1:", np.std(array_a, axis=1))
print("Variance along axis 1:", np.var(array_a, axis=1))
OUTPUT:
INPUT:
(b) Create a 2-dimensional array of size m x n integer elements, also print the shape, type and
data type of the array and then reshape it into an n x m array, where n and m are user inputs
given at the run time.
import numpy as np
m = int(input("Enter number of rows (m): "))
n = int(input("Enter number of columns (n): "))
array_b = np.random.randint(0, 100, size=(m, n))
print("Array B:\n", array_b)
print("Shape:", array_b.shape)
print("Type:", type(array_b))
print("Data Type:", array_b.dtype)
array_b_reshaped = array_b.reshape(n, m)
print("Reshaped Array (n x m):\n", array_b_reshaped)
OUTPUT:
INPUT:
(c) Test whether the elements of a given 1D array are zero, non-zero and NaN. Record the indices
of these elements in three separate arrays.
import numpy as np
array_c = np.array([0, 1, np.nan, 2, 0, 3, np.nan])
zero_indices = np.where(array_c == 0)[0]
# NaN != 0 evaluates to True, so NaN positions are excluded explicitly
nonzero_indices = np.where((array_c != 0) & ~np.isnan(array_c))[0]
nan_indices = np.where(np.isnan(array_c))[0]
print("Zero indices:", zero_indices)
print("Non-zero indices:", nonzero_indices)
print("NaN indices:", nan_indices)
OUTPUT:
INPUT: (d) Create three random arrays of the same size: Array1, Array2, and Array3. Subtract
Array2 from Array3 and store in Array4. Create another array, Array5, having twice the values
in Array1. Find the covariance and correlation of Array1 with Array4 and Array5, respectively.
import numpy as np
a1 = np.random.rand(5)
a2 = np.random.rand(5)
a3 = np.random.rand(5)
a4 = a3 - a2
a5 = 2 * a1
cov = np.cov(a1, a4)[0, 1]
corr = np.corrcoef(a1, a5)[0, 1]
print("Covariance (a1 & a4):", cov)
print("Correlation (a1 & a5):", corr)
OUTPUT:
INPUT: (e) Create two random arrays of the same size, 10: Array1, and Array2. Find the sum of the
first half of both the arrays and product of the second half of both arrays.
import numpy as np
A1 = np.random.randint(1, 10, size=10)
A2 = np.random.randint(1, 10, size=10)
half = len(A1) // 2
sum_first_half = np.sum(A1[:half] + A2[:half])
product_second_half = np.prod(A1[half:] * A2[half:])
print("A1:", A1)
print("A2:", A2)
print("Sum of first half of both arrays:", sum_first_half)
print("Product of second half of both arrays:", product_second_half)
OUTPUT:
2] Do the following using PANDAS Series:
a. Create a series with 5 elements. Display the series sorted on index and also sorted on values
separately
INPUT:
import pandas as pd
data = pd.Series([50, 10, 40, 20, 30], index=['e', 'b', 'd', 'a', 'c'])
print("Original Series:")
print(data)
print("\nSeries sorted by index:")
print(data.sort_index())
print("\nSeries sorted by values:")
print(data.sort_values())
OUTPUT:
b. Create a series with N elements with some duplicate values. Find the minimum and maximum
ranks assigned to the values using ‘first’ and ‘max’ methods
INPUT:
import pandas as pd
data = pd.Series([50, 30, 50, 20, 30, 50])
print("Rank using method='first':")
print(data.rank(method='first'))
print("\nRank using method='max':")
print(data.rank(method='max'))
OUTPUT:
c. Display the index value of the minimum and maximum element of a Series
INPUT:
import pandas as pd
data = pd.Series([100, 20, 30, 90, 10], index=['a', 'b', 'c', 'd', 'e'])
min_index = data.idxmin()
max_index = data.idxmax()
print(f"Index of minimum value: {min_index}")
print(f"Index of maximum value: {max_index}")
OUTPUT:
3] Create a data frame having at least 3 columns and 50 rows to store numeric data generated
using a random function. Replace 10% of the values by null values whose index positions are
generated using random function.
INPUT:
import pandas as pd
import numpy as np
np.random.seed(0)
# Store the values as floats so that NaN can be assigned without changing the column dtype
df = pd.DataFrame(
    np.random.randint(1, 100, size=(50, 3)).astype(float),
    columns=['A', 'B', 'C']
)
# Choose 10% of the cells as flat positions and set each one to NaN
num_nulls = df.size // 10
random_indices = np.random.choice(df.size, num_nulls, replace=False)
for idx in random_indices:
    row, col = divmod(idx, df.shape[1])
    df.iat[row, col] = np.nan
print("Initial DataFrame with random null values:")
print(df.head())
OUTPUT:
INPUT:
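The code for part (a), identifying and counting missing values, does not survive in this file. The following is a minimal sketch of one way to do it, not the original program. The names `demo_df`, `missing_mask`, `missing_per_column` and `total_missing` are hypothetical, and the frame is rebuilt here only so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the practical's frame: 50 rows, 3 numeric columns.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.integers(1, 100, size=(50, 3)).astype(float),
    columns=['A', 'B', 'C']
)

# Replace 10% of the cells with NaN at randomly chosen flat positions.
flat_positions = rng.choice(demo_df.size, demo_df.size // 10, replace=False)
for pos in flat_positions:
    r, c = divmod(int(pos), demo_df.shape[1])
    demo_df.iat[r, c] = np.nan

# isna() marks every missing cell with True; summing the mask counts them.
missing_mask = demo_df.isna()
missing_per_column = missing_mask.sum()
total_missing = int(missing_per_column.sum())

print("(a) Missing values per column:")
print(missing_per_column)
print("Total missing values:", total_missing)
```

Summing the boolean mask column by column gives the per-column counts, and summing those counts gives the total for the whole frame.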
INPUT:
# Keep only columns with at most 5 nulls, i.e. at least len(df) - 5 non-null values
df = df.dropna(axis=1, thresh=len(df) - 5)
print("\n(b) DataFrame after dropping columns with > 5 nulls:")
print(df.head())
OUTPUT:
c. Identify the row label having maximum of the sum of all values in a row and drop that row.
INPUT:
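The code for part (c) does not survive in this file either. A minimal sketch under the same assumptions follows: `demo_df`, `row_sums`, `max_label` and `trimmed_df` are hypothetical names, and the small frame is built here only to keep the block self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with a few missing values.
rng = np.random.default_rng(1)
demo_df = pd.DataFrame(
    rng.integers(1, 100, size=(50, 3)).astype(float),
    columns=['A', 'B', 'C']
)
demo_df.iat[0, 0] = np.nan
demo_df.iat[7, 2] = np.nan

# Row-wise sum; pandas skips NaN by default when summing.
row_sums = demo_df.sum(axis=1)

# idxmax() returns the row label (not the position) of the largest sum.
max_label = row_sums.idxmax()
print("(c) Row label with the maximum row sum:", max_label)

# Drop that row by label and keep the result in a new frame.
trimmed_df = demo_df.drop(index=max_label)
print("Rows before:", len(demo_df), "Rows after:", len(trimmed_df))
```

Because `idxmax()` returns a label, the same code works whether the frame keeps its default integer index or uses custom row labels.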
INPUT:
first_col = df.columns[0]
df = df.sort_values(by=first_col)
print(df.head())
OUTPUT:
INPUT:
df = df.drop_duplicates(subset=first_col)
print(f"\n(e) DataFrame after removing duplicates in column '{first_col}':")
print(df.head())
OUTPUT:
f. Find the correlation between first and second column and covariance between second and third
column.
INPUT:
cols = df.columns
if len(cols) >= 2:
    corr = df[cols[0]].corr(df[cols[1]])
    print(f"\n(f) Correlation between '{cols[0]}' and '{cols[1]}': {corr}")
if len(cols) >= 3:
    cov = df[cols[1]].cov(df[cols[2]])
    print(f"Covariance between '{cols[1]}' and '{cols[2]}': {cov}")
OUTPUT:
INPUT:
if len(cols) >= 2:
    df['Binned_' + cols[1]] = pd.cut(df[cols[1]], bins=5)
    print(f"\n(g) Discretized column '{cols[1]}' into 5 bins:")
    print(df[['Binned_' + cols[1]]].head())
OUTPUT:
4] Consider two Excel files having attendance of two workshops. Each file has three fields: ‘Name’,
‘Date’, ‘Duration’ (in minutes), where names are unique within a file. Note that duration may
take one of three values (30, 40, 50) only. Import the data into two data frames and do the
following:
a. Perform merging of the two data frames to find the names of students who had attended both
workshops.
INPUT:
import pandas as pd
# Attendance sheet for the first workshop
data1 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Date': pd.to_datetime(['2025-05-01'] * 8),
    'Duration': [40, 30, 40, 50, 50, 30, 50, 30]
}
df1 = pd.DataFrame(data1)
df1.to_excel('workshop1.xlsx', index=False)
# Attendance sheet for the second workshop
data2 = {
    'Name': ['Charlie', 'David', 'Eve', 'Ivan', 'Judy', 'Mallory', 'Niaj', 'Olivia'],
    'Date': pd.to_datetime(['2025-05-02'] * 8),
    'Duration': [50, 40, 30, 30, 40, 50, 40, 50]
}
df2 = pd.DataFrame(data2)
df2.to_excel('workshop2.xlsx', index=False)
OUTPUT:
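The listing above builds the two attendance sheets but does not show the merge that part (a) asks for. A minimal sketch of that step follows, assuming an inner join on 'Name' is what is wanted. The frames are rebuilt in memory rather than read back from the Excel files, and `w1`, `w2`, `both` and `attended_both` are hypothetical names.

```python
import pandas as pd

# Same sample attendance data as in the practical, kept in memory.
w1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Date': pd.to_datetime(['2025-05-01'] * 8),
    'Duration': [40, 30, 40, 50, 50, 30, 50, 30]
})
w2 = pd.DataFrame({
    'Name': ['Charlie', 'David', 'Eve', 'Ivan', 'Judy', 'Mallory', 'Niaj', 'Olivia'],
    'Date': pd.to_datetime(['2025-05-02'] * 8),
    'Duration': [50, 40, 30, 30, 40, 50, 40, 50]
})

# An inner join keeps only the names present in both frames.
# The suffixes tell the two 'Date' and 'Duration' columns apart.
both = pd.merge(w1, w2, on='Name', how='inner', suffixes=('_w1', '_w2'))
attended_both = sorted(both['Name'])

print("(a) Students who attended both workshops:")
print(attended_both)
print(both)
```

Since names are unique within each file, the inner join yields exactly one row per student who appears in both sheets.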
b. Find names of all students who have attended a single workshop only.
INPUT:
names_1 = set(df1['Name'])
names_2 = set(df2['Name'])
only_one_workshop = names_1.symmetric_difference(names_2)
print("\n(b) Students who attended only one workshop:")
print(only_one_workshop)
OUTPUT:
c. Merge two data frames row-wise and find the total number of records in the data frame.
INPUT:
merged_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print("\n(c) Total number of records after row-wise merge:")
print(len(merged_df))
d. Merge two data frames row-wise and use two columns viz. names and dates as multi-row
indexes. Generate descriptive statistics for this hierarchical data frame.
INPUT:
hierarchical_df = pd.concat([df1, df2], axis=0)
hierarchical_df.set_index(['Name', 'Date'], inplace=True)
print("\n(d) Descriptive statistics using hierarchical index:")
print(hierarchical_df.describe())
OUTPUT:
5] Using Iris data, plot the following with proper legend and axis labels: (Download IRIS data
from: https://archive.ics.uci.edu/ml/datasets/iris or import it from sklearn datasets)
INPUT:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df
OUTPUT:
a. Plot bar chart to show the frequency of each class label in the data.
INPUT:
#(A)
sns.barplot(x=df['target'].value_counts().index,y=df['target'].value_counts())
plt.xlabel('class label')
plt.ylabel('frequency')
plt.title('frequency of each class label in iris dataset')
plt.show()
OUTPUT:
b. Draw a scatter plot for Petal width vs sepal width and fit a regression line
INPUT:
plt.figure(figsize=(6, 4))
sns.regplot(x='sepal width (cm)', y='petal width (cm)', data=df)
plt.title("Petal Width vs Sepal Width with Regression Line")
plt.xlabel("Sepal Width (cm)")
plt.ylabel("Petal Width (cm)")
plt.show()
OUTPUT:
INPUT:
plt.figure(figsize=(6, 4))
plt.show()
OUTPUT:
d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.
INPUT:
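#(D)
# df has a numeric 'target' column but no 'species' column, so build one for the hue
plot_df = df.drop(columns='target').assign(species=iris.target_names[iris.target])
sns.pairplot(plot_df, hue='species')
plt.suptitle("Pairplot of Iris Features", y=1.02)
plt.show()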
OUTPUT:
OUTPUT:
f. Compute mean, mode, median, standard deviation, confidence interval and standard error for
each feature
INPUT:
from scipy import stats
features = df.iloc[:, :-1]  # every column except 'target'
summary_stats = pd.DataFrame({
    'mean': features.mean(),
    'median': features.median(),
    'mode': features.mode().iloc[0],
    'std': features.std(),
    'std_err': features.sem()
})
# 95% confidence interval for each feature mean, using Student's t distribution
ci_lower, ci_upper = stats.t.interval(
    0.95, len(features) - 1,
    loc=features.mean().to_numpy(), scale=features.sem().to_numpy()
)
summary_stats['conf_int_lower'] = ci_lower
summary_stats['conf_int_upper'] = ci_upper
print(summary_stats)
OUTPUT:
g. Compute correlation coefficients between each pair of features and plot heatmap
INPUT:
plt.figure(figsize=(6, 5))
sns.heatmap(df.iloc[:, :-1].corr(), annot=True, cmap='viridis')
plt.title("Correlation Heatmap of Iris Features")
plt.show()
OUTPUT:
6] Consider the following data frame containing a family name, gender of the family member and
her/his monthly income in each record.
Name     Gender    Monthly Income (Rs.)
Shah     Male      114000.00
Vats     Male       65000.00
Vats     Female     43150.00
Kumar    Female     69500.00
Vats     Female    155000.00
Kumar    Male      103000.00
Shah     Male       55000.00
Shah     Female    112400.00
Kumar    Female     81030.00
Vats     Male       71900.00
Write a program in Python using Pandas to perform the following:
a. Calculate and display familywise gross monthly income.
b. Calculate and display the member with the highest monthly income.
c. Calculate and display monthly income of all members with income greater than Rs. 60000.00.
d. Calculate and display the average monthly income of the female members
INPUT:
import pandas as pd
data = {
'Name': ['Shah', 'Vats', 'Vats', 'Kumar', 'Vats', 'Kumar', 'Shah', 'Shah', 'Kumar', 'Vats'],
'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
'Monthly Income (Rs.)': [114000.00, 65000.00, 43150.00, 69500.00, 155000.00, 103000.00,
55000.00, 112400.00, 81030.00, 71900.00]
}
df = pd.DataFrame(data)
family_income = df.groupby('Name')['Monthly Income (Rs.)'].sum()
print("a. Familywise Gross Monthly Income:\n", family_income, "\n")
max_income_member = df.loc[df['Monthly Income (Rs.)'].idxmax()]
print("b. Member with the Highest Monthly Income:\n", max_income_member, "\n")
high_income_members = df[df['Monthly Income (Rs.)'] > 60000.00]
print("c. Members with Income > Rs. 60000.00:\n", high_income_members, "\n")
female_avg_income = df[df['Gender'] == 'Female']['Monthly Income (Rs.)'].mean()
print("d. Average Monthly Income of Female Members: Rs.", round(female_avg_income, 2))
OUTPUT:
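7] Using the Titanic dataset, do the following:
a. Find total number of passengers with age less than 30
b. Find total fare paid by passengers of first class
c. Compare number of survivors of each passenger class
d. Compute descriptive statistics for any numeric attribute genderwise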
INPUT:
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
under_30_count = titanic[titanic['age'] < 30].shape[0]
print("a. Total passengers with age < 30:", under_30_count)
total_fare_first_class = titanic[titanic['pclass'] == 1]['fare'].sum()
print("b. Total fare paid by 1st class passengers:", round(total_fare_first_class, 2))
survivors_by_class = titanic.groupby(['pclass', 'survived']).size().unstack(fill_value=0)
print("\nc. Survivors per class:\n", survivors_by_class)
stats_genderwise = titanic.groupby('sex')['fare'].describe()
print("\nd. Descriptive statistics for 'fare' by gender:\n", stats_genderwise)
OUTPUT: