
Data Analysis and Visualization using Python

Practical File
Umama Yahya

BA (HONS) Psychology

24528/55

INDEX:
1] Write programs in Python using the NumPy library to do the following:

a. Compute the mean, standard deviation, and variance of a two-dimensional random integer
array along the second axis.
b. Create a 2-dimensional array of size m x n integer elements; also print the shape, type and
data type of the array and then reshape it into an n x m array, where n and m are user inputs
given at run time.
c. Test whether the elements of a given 1D array are zero, non-zero and NaN. Record the
indices of these elements in three separate arrays.
d. Create three random arrays of the same size: Array1, Array2 and Array3. Subtract Array2
from Array3 and store in Array4. Create another array Array5 having two times the
values in Array1. Find Co-variance and Correlation of Array1 with Array4 and
Array5 respectively.
e. Create two random arrays of the same size, 10: Array1, and Array2. Find the sum of the
first half of both the arrays and product of the second half of both the arrays.
2] Do the following using PANDAS Series:

a. Create a series with 5 elements. Display the series sorted on index and also sorted on values
separately
b. Create a series with N elements with some duplicate values. Find the minimum and maximum
ranks assigned to the values using ‘first’ and ‘max’ methods
c. Display the index value of the minimum and maximum element of a Series

3] Create a data frame having at least 3 columns and 50 rows to store numeric data generated
using a random function. Replace 10% of the values by null values whose index positions are
generated using random function. Do the following:
a. Identify and count missing values in a data frame.

b. Drop the column having more than 5 null values.

c. Identify the row label having maximum of the sum of all values in a row and drop that row.

d. Sort the data frame on the basis of the first column.

e. Remove all duplicates from the first column.

f. Find the correlation between first and second column and covariance between second and third
column.
g. Discretize the second column and create 5 bins.

4] Consider two Excel files having attendance of two workshops. Each file has three fields: ‘Name’,
‘Date’ and ‘Duration’ (in minutes), where names are unique within a file. Note that duration may
take one of three values (30, 40, 50) only. Import the data into two data frames and do the
following:
a. Perform merging of the two data frames to find the names of students who had attended both
workshops.
b. Find names of all students who have attended a single workshop only.

c. Merge two data frames row-wise and find the total number of records in the data frame.

d. Merge two data frames row-wise and use two columns viz. names and dates as multi-row
indexes. Generate descriptive statistics for this hierarchical data frame.
5] Using Iris data, plot the following with proper legend and axis labels: (Download IRIS data
from: https://archive.ics.uci.edu/ml/datasets/iris or import it from sklearn datasets)

a. Plot bar chart to show the frequency of each class label in the data.

b. Draw a scatter plot for Petal width vs sepal width and fit a regression line

c. Plot density distribution for feature petal length.

d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.

e. Draw heatmap for the four numeric attributes

f. Compute mean, mode, median, standard deviation, confidence interval and standard error for
each feature
g. Compute correlation coefficients between each pair of features and plot heatmap

6] Consider the following data frame containing a family name, gender of the family member and
her/his monthly income in each record.
Name     Gender    Monthly Income (Rs.)
Shah     Male      114000.00
Vats     Male      65000.00
Vats     Female    43150.00
Kumar    Female    69500.00
Vats     Female    155000.00
Kumar    Male      103000.00
Shah     Male      55000.00
Shah     Female    112400.00
Kumar    Female    81030.00
Vats     Male      71900.00
Write a program in Python using Pandas to perform the following:

a. Calculate and display familywise gross monthly income.

b. Calculate and display the member with the highest monthly income.

c. Calculate and display monthly income of all members with income greater than Rs. 60000.00.

d. Calculate and display the average monthly income of the female members

7] Using the Titanic dataset, do the following:
a. Find total number of passengers with age less than 30
b. Find total fare paid by passengers of first class
c. Compare number of survivors of each passenger class
d. Compute descriptive statistics for any numeric attribute genderwise

1] Write programs in Python using the NumPy library to do the following:

INPUT:

(a) Compute the mean, standard deviation, and variance of a two dimensional random integer
array along the second axis.
import numpy as np
array_a = np.random.randint(1, 100, size=(4, 5))
print("Array A:\n", array_a)
print("Mean along axis 1:", np.mean(array_a, axis=1))
print("Standard Deviation along axis 1:", np.std(array_a, axis=1))
print("Variance along axis 1:", np.var(array_a, axis=1))
OUTPUT:
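
Note: as a quick check of what "along the second axis" means, the sketch below uses a small fixed array (an illustrative assumption, not part of the original file) so the row-wise results can be verified by hand. It also shows that np.var uses the population formula (ddof=0) unless ddof=1 is passed for the sample version.

```python
import numpy as np

# Fixed 2 x 3 array, chosen only for illustration
demo = np.array([[1, 2, 3],
                 [4, 6, 8]])

row_mean = np.mean(demo, axis=1)                # one value per row
row_var_pop = np.var(demo, axis=1)              # population variance (ddof=0)
row_var_sample = np.var(demo, axis=1, ddof=1)   # sample variance (ddof=1)

print("Row means:", row_mean)
print("Population variance per row:", row_var_pop)
print("Sample variance per row:", row_var_sample)
```

With axis=1 each statistic collapses the columns, so a 2 x 3 array yields two values, one per row.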

INPUT:

(b) Create a 2-dimensional array of size m x n integer elements, also print the shape, type and
data type of the array and then reshape it into an n x m array, where n and m are user inputs
given at the run time.
import numpy as np
m = int(input("Enter number of rows (m): "))
n = int(input("Enter number of columns (n): "))
array_b = np.random.randint(0, 100, size=(m, n))
print("Array B:\n", array_b)
print("Shape:", array_b.shape)

print("Type:", type(array_b))
print("Data Type:", array_b.dtype)
array_b_reshaped = array_b.reshape(n, m)
print("Reshaped Array (n x m):\n", array_b_reshaped)

OUTPUT:
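
Note: reshaping an m x n array to n x m is not the same as transposing it. The sketch below, using a fixed 2 x 3 array as a stand-in for the user-supplied one (an assumption for illustration), puts the two results side by side.

```python
import numpy as np

# Fixed 2 x 3 array standing in for the user-supplied m x n array
demo = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

reshaped = demo.reshape(3, 2)       # keeps the row-major element order
transposed = demo.T                 # swaps rows and columns

print("Reshaped:\n", reshaped)
print("Transposed:\n", transposed)
```

reshape only regroups the elements in their existing order, while the transpose moves each element from position (i, j) to (j, i).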

INPUT:

(c) Test whether the elements of a given 1D array are zero, non-zero and NaN. Record the indices
of these elements in three separate arrays.
import numpy as np
array_c = np.array([0, 1, np.nan, 2, 0, 3, np.nan])
zero_indices = np.where(array_c == 0)[0]
nonzero_indices = np.where((array_c != 0) & ~np.isnan(array_c))[0]  # NaN != 0 is True, so exclude NaN explicitly
nan_indices = np.where(np.isnan(array_c))[0]
print("Zero indices:", zero_indices)
print("Non-zero indices:", nonzero_indices)

print("NaN indices:", nan_indices)
OUTPUT:

INPUT: (d) Create three random arrays of the same size: Array1, Array2, and Array3. Subtract
Array2 from Array3 and store in Array4. Create another array, Array5, having twice the values
in Array1. Find the covariance and correlation of Array1 with Array4 and Array5, respectively.
import numpy as np
a1 = np.random.rand(5)
a2 = np.random.rand(5)
a3 = np.random.rand(5)
a4 = a3 - a2
a5 = 2 * a1
cov = np.cov(a1, a4)[0, 1]
corr = np.corrcoef(a1, a5)[0, 1]
print("Covariance (a1 & a4):", cov)
print("Correlation (a1 & a5):", corr)
OUTPUT:
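
Note: the task statement can also be read as asking for both statistics for both pairs. The sketch below is one way to compute all four values; the seeded generator and the variable names are assumptions made for repeatability. Because Array5 is exactly twice Array1, their correlation should come out as 1 and their covariance as twice the sample variance of Array1.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily for repeatability
a1 = rng.random(5)
a2 = rng.random(5)
a3 = rng.random(5)
a4 = a3 - a2
a5 = 2 * a1

# Entry [0, 1] of each 2 x 2 matrix is the pairwise statistic
cov_a1_a4 = np.cov(a1, a4)[0, 1]
corr_a1_a4 = np.corrcoef(a1, a4)[0, 1]
cov_a1_a5 = np.cov(a1, a5)[0, 1]
corr_a1_a5 = np.corrcoef(a1, a5)[0, 1]

print("a1 vs a4: covariance =", cov_a1_a4, "correlation =", corr_a1_a4)
print("a1 vs a5: covariance =", cov_a1_a5, "correlation =", corr_a1_a5)
```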

INPUT: (e) Create two random arrays of the same size, 10: Array1, and Array2. Find the sum of the
first half of both the arrays and product of the second half of both arrays.
import numpy as np
A1 = np.random.randint(1, 10, size=10)
A2 = np.random.randint(1, 10, size=10)
half = len(A1) // 2
sum_first_half = np.sum(A1[:half] + A2[:half])
product_second_half = np.prod(A1[half:] * A2[half:])
print("A1:", A1)
print("A2:", A2)
print("Sum of first half of both arrays:", sum_first_half)
print("Product of second half of both arrays:", product_second_half)
OUTPUT:

2] Do the following using PANDAS Series:

a. Create a series with 5 elements. Display the series sorted on index and also sorted on values
separately
INPUT:

import pandas as pd

data = pd.Series([50, 10, 40, 20, 30], index=['e', 'b', 'd', 'a', 'c'])

print("Original Series:")

print(data)

print("\nSeries sorted by index:")
print(data.sort_index())
print("\nSeries sorted by values:")
print(data.sort_values())
OUTPUT:

b. Create a series with N elements with some duplicate values. Find the minimum and maximum
ranks assigned to the values using ‘first’ and ‘max’ methods
INPUT:

import pandas as pd
data = pd.Series([50, 30, 50, 20, 30, 50])
print("Rank using method='first':")

print(data.rank(method='first'))
print("\nRank using method='max':")
print(data.rank(method='max'))
OUTPUT:
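
Note: the task asks for the minimum and maximum ranks, which the listing above leaves to be read off the printed series. A minimal sketch that extracts them directly, reusing the same sample values (the variable names are illustrative):

```python
import pandas as pd

data = pd.Series([50, 30, 50, 20, 30, 50])

rank_first = data.rank(method='first')  # ties ranked in order of appearance
rank_max = data.rank(method='max')      # tied values all get the group's highest rank

print("method='first': min rank =", rank_first.min(), ", max rank =", rank_first.max())
print("method='max':   min rank =", rank_max.min(), ", max rank =", rank_max.max())
```

With method='first' every element gets a distinct rank from 1 to N, whereas method='max' repeats the top rank of each tie group, so some ranks are never used.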

c. Display the index value of the minimum and maximum element of a Series

INPUT:

import pandas as pd
data = pd.Series([100, 20, 30, 90, 10], index=['a', 'b', 'c', 'd', 'e'])
min_index = data.idxmin()
max_index = data.idxmax()
print(f"Index of minimum value: {min_index}")

print(f"Index of maximum value: {max_index}")
OUTPUT:

3] Create a data frame having at least 3 columns and 50 rows to store numeric data generated
using a random function. Replace 10% of the values by null values whose index positions are
generated using random function.
INPUT:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(
    np.random.randint(1, 100, size=(50, 3)),
    columns=['A', 'B', 'C']
).astype(float)  # float columns, so NaN can be stored without a dtype change
num_nulls = df.size // 10  # 10% of all cells
random_indices = np.random.choice(df.size, num_nulls, replace=False)
for idx in random_indices:
    row, col = divmod(idx, df.shape[1])  # flat position -> (row, column)
    df.iat[row, col] = np.nan
print("Initial DataFrame with random null values:")

print(df.head())
OUTPUT:
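
Note: the cell-by-cell loop can also be written without a loop. The sketch below is one possible vectorized variant, assuming a seeded NumPy generator and the helper name demo_df (both illustrative, not from the original file). It picks 10% of the flat positions and converts them to row and column indices in one step with np.unravel_index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed for repeatability
values = rng.integers(1, 100, size=(50, 3)).astype(float)  # float so NaN fits

num_nulls = values.size // 10                       # 10% of 150 cells = 15
flat_positions = rng.choice(values.size, num_nulls, replace=False)
rows, cols = np.unravel_index(flat_positions, values.shape)
values[rows, cols] = np.nan                         # set all chosen cells at once

demo_df = pd.DataFrame(values, columns=['A', 'B', 'C'])
print("Total nulls:", int(demo_df.isnull().sum().sum()))
```

Sampling with replace=False guarantees that exactly 15 distinct cells are blanked.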

a. Identify and count missing values in a data frame.

INPUT:

print("\n(a) Count of missing values per column:")


print(df.isnull().sum())
OUTPUT:

b. Drop the column having more than 5 null values.

INPUT:

df = df.dropna(axis=1, thresh=len(df) - 5)  # keep columns with at least len(df) - 5 non-null values, i.e. at most 5 nulls
print("\n(b) DataFrame after dropping columns with > 5 nulls:")
print(df.head())

OUTPUT:
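
Note: a more explicit way to express "more than 5 null values" is to count the nulls per column and keep only the columns at or below the limit. The sketch below uses a small hand-made frame (hypothetical data) in which column 'A' has exactly 5 nulls and column 'B' has 6, so only 'B' should be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'A' has exactly 5 nulls, 'B' has 6, 'C' has none
demo_df = pd.DataFrame({
    'A': [np.nan] * 5 + [1.0] * 5,
    'B': [np.nan] * 6 + [2.0] * 4,
    'C': [3.0] * 10,
})

null_counts = demo_df.isnull().sum()
kept = demo_df.loc[:, null_counts <= 5]   # keep columns with at most 5 nulls

print("Null counts:\n", null_counts)
print("Columns kept:", list(kept.columns))
```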

c. Identify the row label having maximum of the sum of all values in a row and drop that row.

INPUT:

row_sums = df.sum(axis=1) # NaNs ignored by default


max_row_index = row_sums.idxmax()
df = df.drop(index=max_row_index)
print(f"\n(c) Dropped row with max sum at index: {max_row_index}")
OUTPUT:

d. Sort the data frame on the basis of the first column.

INPUT:

first_col = df.columns[0]

df = df.sort_values(by=first_col)

print(f"\n(d) DataFrame sorted by column '{first_col}':")

print(df.head())

OUTPUT:

e. Remove all duplicates from the first column.

INPUT:

df = df.drop_duplicates(subset=first_col)
print(f"\n(e) DataFrame after removing duplicates in column '{first_col}':")
print(df.head())
OUTPUT:

f. Find the correlation between first and second column and covariance between second and third
column.
INPUT:

cols = df.columns
if len(cols) >= 2:
    corr = df[cols[0]].corr(df[cols[1]])
    print(f"\n(f) Correlation between '{cols[0]}' and '{cols[1]}': {corr}")
if len(cols) >= 3:
    cov = df[cols[1]].cov(df[cols[2]])
    print(f"Covariance between '{cols[1]}' and '{cols[2]}': {cov}")
OUTPUT:

g. Discretize the second column and create 5 bins.

INPUT:

if len(cols) >= 2:
    df['Binned_' + cols[1]] = pd.cut(df[cols[1]], bins=5)
    print(f"\n(g) Discretized column '{cols[1]}' into 5 bins:")
    print(df[['Binned_' + cols[1]]].head())
OUTPUT:
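
Note: pd.cut with bins=5 splits the value range into five equal-width intervals. The sketch below, on hypothetical values from 0 to 100 with made-up bin labels, shows how values land in the bins; the edges fall at 20, 40, 60 and 80, and each interval includes its right edge.

```python
import pandas as pd

# Hypothetical values spanning 0..100 so the bin edges are easy to read
values = pd.Series([0, 10, 25, 40, 55, 60, 75, 90, 100])

# Labels are illustrative; without them pd.cut shows the interval ranges
binned = pd.cut(values, bins=5,
                labels=['very low', 'low', 'mid', 'high', 'very high'])

print(pd.DataFrame({'value': values, 'bin': binned}))
```

If bins with roughly equal counts are wanted instead of equal widths, pd.qcut is the usual alternative.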

4] Consider two Excel files having attendance of two workshops. Each file has three fields: ‘Name’,
‘Date’ and ‘Duration’ (in minutes), where names are unique within a file. Note that duration may
take one of three values (30, 40, 50) only. Import the data into two data frames and do the
following:
a. Perform merging of the two data frames to find the names of students who had attended both
workshops.
INPUT:
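import pandas as pd
# Build sample attendance data and save it, so there are two Excel files to import
data1 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Date': pd.to_datetime(['2025-05-01'] * 8),
    'Duration': [40, 30, 40, 50, 50, 30, 50, 30]
}
df1 = pd.DataFrame(data1)
df1.to_excel('workshop1.xlsx', index=False)
data2 = {
    'Name': ['Charlie', 'David', 'Eve', 'Ivan', 'Judy', 'Mallory', 'Niaj', 'Olivia'],
    'Date': pd.to_datetime(['2025-05-02'] * 8),
    'Duration': [50, 40, 30, 30, 40, 50, 40, 50]
}
df2 = pd.DataFrame(data2)
df2.to_excel('workshop2.xlsx', index=False)
# Import the two files and inner-merge on 'Name' to keep students present in both
df1 = pd.read_excel('workshop1.xlsx')
df2 = pd.read_excel('workshop2.xlsx')
both_workshops = pd.merge(df1, df2, on='Name', suffixes=('_w1', '_w2'))
print("(a) Students who attended both workshops:")
print(both_workshops['Name'].tolist())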
OUTPUT:

b. Find names of all students who have attended a single workshop only.
INPUT:
names_1 = set(df1['Name'])
names_2 = set(df2['Name'])
only_one_workshop = names_1.symmetric_difference(names_2)
print("\n(b) Students who attended only one workshop:")
print(only_one_workshop)
OUTPUT:
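
Note: the same result can be obtained with a merge instead of Python sets, which keeps the answer in a data frame. The sketch below uses two small stand-in frames (hypothetical names and values) and an outer merge with indicator=True, which marks each row as 'left_only', 'right_only' or 'both'.

```python
import pandas as pd

# Small stand-in frames (hypothetical data) with the same three fields
w1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Date': pd.to_datetime(['2025-05-01'] * 3),
                   'Duration': [40, 30, 50]})
w2 = pd.DataFrame({'Name': ['Charlie', 'David'],
                   'Date': pd.to_datetime(['2025-05-02'] * 2),
                   'Duration': [50, 40]})

# Outer merge on the names only; '_merge' records where each name was found
merged = pd.merge(w1[['Name']], w2[['Name']], on='Name', how='outer', indicator=True)
single = merged.loc[merged['_merge'] != 'both', 'Name']

print("Attended exactly one workshop:", sorted(single))
```

The '_merge' column also tells which workshop each of these students attended, which the set-based approach does not record.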

c. Merge two data frames row-wise and find the total number of records in the data frame.
INPUT:
merged_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print("\n(c) Total number of records after row-wise merge:")
print(len(merged_df))

d. Merge two data frames row-wise and use two columns viz. names and dates as multi-row
indexes. Generate descriptive statistics for this hierarchical data frame.
INPUT:

hierarchical_df = pd.concat([df1, df2], axis=0)
hierarchical_df.set_index(['Name', 'Date'], inplace=True)
print("\n(d) Descriptive statistics using hierarchical index:")
print(hierarchical_df.describe())
OUTPUT:
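
Note: describe() on the whole frame summarizes all rows together and does not use the index levels. If statistics per level are wanted, grouping on one level of the hierarchical index is one option. A minimal sketch on a small hypothetical frame with the same (Name, Date) index:

```python
import pandas as pd

# Hypothetical attendance records indexed by (Name, Date)
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Date': pd.to_datetime(['2025-05-01', '2025-05-01', '2025-05-02', '2025-05-02']),
    'Duration': [30, 50, 40, 40],
}).set_index(['Name', 'Date'])

# One row of descriptive statistics per workshop date
per_date = demo.groupby(level='Date')['Duration'].describe()
print(per_date)
```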

5] Using Iris data, plot the following with proper legend and axis labels: (Download IRIS data
from: https://archive.ics.uci.edu/ml/datasets/iris or import it from sklearn datasets)
INPUT:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris=load_iris()
df=pd.DataFrame(iris.data,columns=iris.feature_names)
df['target']=iris.target
df

OUTPUT:

a. Plot bar chart to show the frequency of each class label in the data.

INPUT:
#(A)
sns.barplot(x=df['target'].value_counts().index,y=df['target'].value_counts())
plt.xlabel('class label')
plt.ylabel('frequency')
plt.title('frequency of each class label in iris dataset')
plt.show()

OUTPUT:

b. Draw a scatter plot for Petal width vs sepal width and fit a regression line

INPUT:

plt.figure(figsize=(6, 4))
sns.regplot(x='sepal width (cm)', y='petal width (cm)', data=df)
plt.title("Petal Width vs Sepal Width with Regression Line")
plt.xlabel("Sepal Width (cm)")
plt.ylabel("Petal Width (cm)")
plt.show()

OUTPUT:

c. Plot density distribution for feature petal length.

INPUT:

plt.figure(figsize=(6, 4))

sns.kdeplot(x=df['petal length (cm)'])

plt.title("Density Plot for Petal Length")

plt.xlabel("Petal Length (cm)")

plt.show()

OUTPUT:

d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.

INPUT:
#(D)
sns.pairplot(df, hue='target')  # class labels are stored in 'target'; the frame has no 'species' column
plt.suptitle("Pairplot of Iris Features", y=1.02)
plt.show()

OUTPUT:

e. Draw heatmap for the four numeric attributes

INPUT:


plt.figure(figsize=(6, 5))
corr_matrix = df.iloc[:, :-1].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Feature Correlations")
plt.show()

OUTPUT:

f. Compute mean, mode, median, standard deviation, confidence interval and standard error for
each feature
INPUT:
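from scipy import stats  # needed for the t-based confidence interval
features = df.iloc[:, :-1]  # the four numeric columns (drops 'target')
summary_stats = pd.DataFrame({
    'mean': features.mean(),
    'median': features.median(),
    'mode': features.mode().iloc[0],  # first mode if several values tie
    'std': features.std(),
    'std_err': features.sem()
})
# 95% t-interval for each feature's mean, keyed by column name
confidence_intervals = {
    col: stats.t.interval(0.95, len(features[col]) - 1,
                          loc=features[col].mean(), scale=stats.sem(features[col]))
    for col in features.columns
}
summary_stats['conf_int_lower'] = [confidence_intervals[col][0] for col in summary_stats.index]
summary_stats['conf_int_upper'] = [confidence_intervals[col][1] for col in summary_stats.index]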

print("\nDescriptive Statistics with Confidence Intervals:\n")


print(summary_stats)

OUTPUT:
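
Note: the 95% confidence interval for a mean is mean ± t × standard error, where t is the critical value of Student's t distribution with n − 1 degrees of freedom. The sketch below works this out by hand on a hypothetical sample (the values and variable names are assumptions for illustration), so the formula behind the interval is visible.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; any 1-D numeric feature column works the same way
sample = np.array([4.9, 5.1, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)     # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% critical value

lower = mean - t_crit * sem
upper = mean + t_crit * sem
print(f"mean = {mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval narrows as the sample grows, because the standard error shrinks with the square root of n.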

g. Compute correlation coefficients between each pair of features and plot heatmap

INPUT:
plt.figure(figsize=(6, 5))
sns.heatmap(df.iloc[:, :-1].corr(), annot=True, cmap='viridis')
plt.title("Correlation Heatmap of Iris Features")
plt.show()

OUTPUT:

6] Consider the following data frame containing a family name, gender of the family member and
her/his monthly income in each record.
Name     Gender    Monthly Income (Rs.)
Shah     Male      114000.00
Vats     Male      65000.00
Vats     Female    43150.00
Kumar    Female    69500.00
Vats     Female    155000.00
Kumar    Male      103000.00
Shah     Male      55000.00
Shah     Female    112400.00
Kumar    Female    81030.00
Vats     Male      71900.00
Write a program in Python using Pandas to perform the following:

a. Calculate and display familywise gross monthly income.

b. Calculate and display the member with the highest monthly income.

c. Calculate and display monthly income of all members with income greater than Rs. 60000.00.

d. Calculate and display the average monthly income of the female members
INPUT:
import pandas as pd
data = {
'Name': ['Shah', 'Vats', 'Vats', 'Kumar', 'Vats', 'Kumar', 'Shah', 'Shah', 'Kumar', 'Vats'],
'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
'Monthly Income (Rs.)': [114000.00, 65000.00, 43150.00, 69500.00, 155000.00, 103000.00,
55000.00, 112400.00, 81030.00, 71900.00]
}
df = pd.DataFrame(data)
family_income = df.groupby('Name')['Monthly Income (Rs.)'].sum()
print("a. Familywise Gross Monthly Income:\n", family_income, "\n")

max_income_member = df.loc[df['Monthly Income (Rs.)'].idxmax()]
print("b. Member with the Highest Monthly Income:\n", max_income_member, "\n")
high_income_members = df[df['Monthly Income (Rs.)'] > 60000.00]
print("c. Members with Income > Rs. 60000.00:\n", high_income_members, "\n")
female_avg_income = df[df['Gender'] == 'Female']['Monthly Income (Rs.)'].mean()
print("d. Average Monthly Income of Female Members: Rs.", round(female_avg_income, 2))
OUTPUT:

7] Using the Titanic dataset, do the following:

a. Find total number of passengers with age less than 30

b. Find total fare paid by passengers of first class

c. Compare number of survivors of each passenger class

d. Compute descriptive statistics for any numeric attribute genderwise

INPUT:
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
under_30_count = titanic[titanic['age'] < 30].shape[0]
print("a. Total passengers with age < 30:", under_30_count)
total_fare_first_class = titanic[titanic['pclass'] == 1]['fare'].sum()
print("b. Total fare paid by 1st class passengers:", round(total_fare_first_class, 2))
survivors_by_class = titanic.groupby(['pclass', 'survived']).size().unstack(fill_value=0)
print("\nc. Survivors per class:\n", survivors_by_class)
stats_genderwise = titanic.groupby('sex')['fare'].describe()
print("\nd. Descriptive statistics for 'fare' by gender:\n", stats_genderwise)

OUTPUT:
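
Note: raw survivor counts are hard to compare when the classes differ in size, so a survival rate per class may be a useful addition. The sketch below is a minimal example on a tiny hand-made frame (hypothetical rows, using the same 'pclass' and 'survived' column names as the seaborn dataset) so that it runs without downloading anything.

```python
import pandas as pd

# Tiny stand-in for the two Titanic columns used here (hypothetical rows)
demo = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# 'survived' is 0/1, so its sum is the survivor count and count is the class size
summary = demo.groupby('pclass')['survived'].agg(survivors='sum', passengers='count')
summary['survival_rate'] = summary['survivors'] / summary['passengers']

print(summary)
```

The same three lines should work on the full dataset by replacing demo with the loaded titanic frame.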
