DAV Practicals
BHATTACHARJEE
ROLL NO.- 21HCS4116
EXAMINATION ROLL NO.-
SEMESTER: Vth
PAPER: DATA ANALYSIS & VISUALIZATION PRACTICALS
Q1. Given below is a dictionary with two keys, 'Boys' and 'Girls', whose values are lists of
the heights of five boys and five girls respectively:
{'Boys': [72, 68, 70, 69, 74], 'Girls': [63, 65, 69, 62, 61]}
From the given dictionary of lists, create the following list of dictionaries:
[{'Boys': 72, 'Girls': 63}, {'Boys': 68, 'Girls': 65}, {'Boys': 70, 'Girls': 69}, {'Boys': 69, 'Girls': 62},
{'Boys': 74, 'Girls': 61}]
Answer:-
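def list_of_dict(heights):
    # Fix the key order once so every dictionary uses the same ordering.
    keys = list(heights.keys())
    # zip(*lists) transposes the per-key lists into one tuple per person.
    values = zip(*[heights[k] for k in keys])
    # Pair each tuple with the keys to build one dictionary per person.
    result = [dict(zip(keys, v)) for v in values]
    return result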
OUTPUT :
ORIGINAL DICTIONARY OF LISTS : {'Boys': [72, 68, 70, 69, 74], 'Girls': [63, 65, 69, 62, 61]}
a. Compute the mean, standard deviation, and variance of a two-dimensional random
integer array along the second axis.
c. Create a 2-dimensional array of m x n integer elements; print the shape, type,
and data type of the array; then reshape it into an n x m array, where n and m are user inputs
given at run time.
d. Test whether the elements of a given array are zero, non-zero, or NaN. Record the
indices of these elements in three separate arrays.
Answer:- (a)
import numpy as np
arr = np.random.randint(1,50,(4,6))
arr
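The snippet above only creates the array, while the recorded output reports three statistics per row. A minimal sketch of the missing step, assuming the population formulas (ddof=0, NumPy's default) were used; the seeded generator and the names row_mean, row_std and row_var are illustrative choices, not part of the original answer:

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
arr = rng.integers(1, 50, size=(4, 6))  # 4 rows, 6 columns, values in [1, 50)

# axis=1 is the second axis, so each statistic has one value per row.
row_mean = arr.mean(axis=1)
row_std = arr.std(axis=1)   # population standard deviation (ddof=0)
row_var = arr.var(axis=1)   # population variance (ddof=0)

print("Mean of the array:", row_mean)
print("Standard Deviation of the array:", row_std)
print("Variance of the array:", row_var)
```

With a 4 x 6 array, each result has four entries, which matches the shape of the values listed under OUTPUT (a).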
(b)
B = [56, 48, 22, 41, 78, 91, 24, 46, 8, 33]
arr1 = np.array(B)
#arr1
print("Sorted array: ",np.sort(arr1))
print("Indices of the sorted elements of a given array: ", np.argsort(arr1))
(c)
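No code is recorded for this part. A minimal sketch that would produce output of the same form as OUTPUT (c), assuming random integers as the array contents; the helper make_and_reshape and the fixed values of m and n are hypothetical stand-ins for the run-time input:

```python
import numpy as np

def make_and_reshape(m, n, seed=None):
    """Build an m x n random integer array and return it with its n x m reshape."""
    rng = np.random.default_rng(seed)
    arr = rng.integers(1, 100, size=(m, n))
    return arr, arr.reshape(n, m)

# In the practical, m and n come from the user at run time, for example:
#   m = int(input("Enter m: "))
#   n = int(input("Enter n: "))
# Fixed values are used here so the sketch runs unattended.
m, n = 2, 3
arr, reshaped = make_and_reshape(m, n, seed=0)

print(arr)
print("Shape:", arr.shape)
print("Type:", type(arr))
print("Data Type:", arr.dtype)
print("After reshaping:")
print(reshaped)
print("New Shape:", reshaped.shape)
```

reshape keeps the elements in row-major order, so this is a re-layout of the same values and not a transpose.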
(d)
x = np.array([1, 0, 3, 4])
print("ORIGINAL ARRAY ::-> ", x)
# np.all(x) is True only when every element is non-zero.
print("\nTest if none of the elements of the said array is zero ::-> ", np.all(x))
x = np.array([1, 0, 0, 3, 2, 0])
print("\n")
print("\nORIGINAL ARRAY ::-> ", x)
# np.any(x) is True when at least one element is non-zero.
print("\nTest whether any of the elements of a given array is non-zero ::", np.any(x))
res = np.where(x != 0)[0]
print("The index of the non-zero elements is :: ", res)
x = np.array([0, 0, 0, 0])
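The listing stops before the NaN test and never stores the three index arrays the question asks for. A minimal sketch of that step, using a made-up sample array (the values and the names zero_idx, nonzero_idx and nan_idx are illustrative assumptions):

```python
import numpy as np

# Hypothetical sample; the presence of NaN makes this a float array.
x = np.array([1.0, 0.0, np.nan, 3.0, 0.0, np.nan, 2.0])

nan_mask = np.isnan(x)

zero_idx = np.where(x == 0)[0]
nan_idx = np.where(nan_mask)[0]
# NaN != 0 evaluates to True, so NaN positions are excluded explicitly.
nonzero_idx = np.where((x != 0) & ~nan_mask)[0]

print("Indices of zero elements     :", zero_idx)
print("Indices of non-zero elements :", nonzero_idx)
print("Indices of NaN elements      :", nan_idx)
```

Whether NaN should count as non-zero is a matter of interpretation; here the three index sets are kept disjoint.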
OUTPUT :
(a)
Mean of the array: [17.66666667 29.33333333 25.16666667 29.        ]
Standard Deviation of the array: [ 6.79869268 12.78888406 18.46994556  8.02080628]
Variance of the array: [ 46.22222222 163.55555556 341.13888889  64.33333333]
(b)
Sorted array: [ 8 22 24 33 41 46 48 56 78 91]
Indices of the sorted elements of a given array: [8 2 6 9 3 7 1 0 4 5]
(c)
[[ 6 77 89]
 [55 43 24]]
Shape: (2, 3)
Type: <class 'numpy.ndarray'>
Data Type: int32
After reshaping:
[[ 6 77]
 [89 55]
 [43 24]]
New Shape: (3, 2)
(d)
ORIGINAL ARRAY ::-> [1 0 3 4]
Q3. Create a dataframe having at least 3 columns and 50 rows to store numeric data
generated using a random function. Replace 10% of the values with null values whose
index positions are generated using a random function. Do the following:
Answer:-
import pandas as pd
import numpy as np
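df = pd.DataFrame(np.random.randint(0, 100, size=(50, 3)),
                  columns=list('123'))
df.head()
# Sample 10% of the cell count (15 of 150) as distinct rows,
# then blank one randomly chosen column in each sampled row.
for c in df.sample(int(df.shape[0] * df.shape[1] * 0.10)).index:
    df.loc[c, str(np.random.randint(1, 4))] = np.nan
df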
(a)
print(df.isnull().sum().sum())
(b)
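for col in df.columns:
    print(col, df[col].isnull().sum())
# Keep only columns with at least (rows - 5) non-null values,
# i.e. drop every column that has more than 5 nulls.
df.dropna(axis=1, thresh=(df.shape[0] - 5)).head()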
(c)
row_sums = df.sum(axis=1)  # named row_sums to avoid shadowing the built-in sum()
print("SUM IS :\n", row_sums)
print("\nMAXIMUM SUM IS :", row_sums.max())
max_sum_row = row_sums.idxmax()
print("\nRow index having maximum sum is :", max_sum_row)
(d)
sortdf=df.sort_values('1')
sortdf.head()
(e)
df =df.drop_duplicates(subset='1',keep = "first")
print(df)
(f)
correlation = df['1'].corr(df['2'])
print("CORRELATION between column 1 and 2 : ", correlation)
covariance = df['2'].cov(df['3'])
print("COVARIANCE between column 2 and 3 :",covariance)
(g)
df.plot.box()
(h)
df1 = pd.cut(df['2'],bins=5).head()
df1
OUTPUT :
(a)
Q4. Consider two Excel files containing the attendance of a workshop’s participants for two
days. Each file has three fields, ‘Name’, ‘Time of joining’ and ‘Duration’ (in minutes), where
names are unique within a file. Note that duration may take only one of three values (30, 40,
50). Import the data into two dataframes and do the following:
a. Merge the two dataframes to find the names of students who
attended the workshop on both days.
b. Find the names of all students who attended the workshop on either day.
c. Merge the two dataframes row-wise and find the total number of records in the
resulting dataframe.
d. Merge the two dataframes and use the two columns Name and Duration as a multi-level row
index. Generate descriptive statistics for this multi-index.
Answer:-
import numpy as np
import pandas as pd
dfDay1 = pd.read_excel('Day1_anirbn.xlsx')
dfDay2 = pd.read_excel('Day2_anirbn.xlsx')
print(dfDay1.head(),"\n")
print(dfDay2.head())
(a)
pd.merge(dfDay1,dfDay2,how='inner',on='Name')
(b)
either_day = pd.merge(dfDay1,dfDay2,how='outer',on='Name')
either_day
(c)
either_day['Name'].count()
(d)
both_days = pd.merge(dfDay1, dfDay2, how='outer', on=['Name', 'Duration']).copy()
# .copy() returns an independent copy of the merged dataframe
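Parts (c) and (d) can also be read as a row-wise concatenation followed by a two-level row index. A minimal sketch under that reading, with two small made-up dataframes standing in for the Excel files (all names, times and durations below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the two attendance sheets.
day1 = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "Time of joining": ["10:00", "10:05", "10:10"],
    "Duration": [30, 40, 50],
})
day2 = pd.DataFrame({
    "Name": ["Ravi", "Meena", "Kabir"],
    "Time of joining": ["10:02", "10:00", "10:15"],
    "Duration": [40, 30, 50],
})

# (c) Row-wise merge: stack day 2 under day 1 and count the records.
stacked = pd.concat([day1, day2], ignore_index=True)
total_records = len(stacked)
print("Total records:", total_records)

# (d) Use Name and Duration as a two-level row index, then summarise.
indexed = stacked.set_index(["Name", "Duration"]).sort_index()
print(indexed)
print(indexed.describe())

# Number of attendance records for each duration value.
per_duration = stacked.groupby("Duration")["Name"].count()
print(per_duration)
```

pd.concat keeps every row from both days, so a participant present on both days is counted twice, unlike the outer merge on Name.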
OUTPUT :
Q5. Taking Iris data, plot the following with proper legend and axis labels: (Download
IRIS data from: https://archive.ics.uci.edu/ml/datasets/iris or import it from
sklearn.datasets)
a. Plot bar chart to show the frequency of each class label in the data.
b. Draw a scatter plot for Petal width vs sepal width.
c. Plot density distribution for feature petal length.
d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.
Answer:-
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
(a)
sns.countplot(x='species',data=iris,palette='Set2')
plt.xlabel('Species')
plt.ylabel('Frequency')
plt.title('Frequency of Each class label')
(b)
plt.scatter(x='petal_width',y='sepal_width',data=iris)
plt.xlabel('Petal Width')
plt.ylabel('Sepal Width')
plt.title("Scatter plot: Petal Width vs Sepal Width")
(c)
# kde=True overlays the kernel density estimate that the question asks for
sns.histplot(iris['petal_length'], kde=True, bins=30)
(d)
sns.pairplot(iris,hue='species',palette='coolwarm')
OUTPUT :
Q6. Consider any sales training/ weather forecasting dataset
b. Fill an intermittent time series to replace all missing dates with values of previous
non-missing date.
d. Split a dataset to group by two columns and then sort the aggregated results within
the groups.
import numpy as np
import pandas as pd

data = {
    'Date': pd.date_range(start='2022-01-01', end='2022-01-10'),
    'Sales': [100, 120, np.nan, 150, 200, 180, np.nan, 220, 250, 300],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\n")
OUTPUT:-
Original Dataset:
        Date  Sales Product
0 2022-01-01  100.0       A
1 2022-01-02  120.0       B
2 2022-01-03    NaN       A
3 2022-01-04  150.0       B
4 2022-01-05  200.0       A
5 2022-01-06  180.0       B
6 2022-01-07    NaN       A
7 2022-01-08  220.0       B
8 2022-01-09  250.0       A
9 2022-01-10  300.0       B
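Only the dataset construction is recorded for this question. A minimal sketch of parts (b) and (d) on the same data: forward fill for the missing sales values, then a two-column grouping. The second grouping column, Weekday, is derived here purely for illustration and is an assumption, as are the names filled and totals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range(start="2022-01-01", end="2022-01-10"),
    "Sales": [100, 120, np.nan, 150, 200, 180, np.nan, 220, 250, 300],
    "Product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
})

# (b) Replace each missing value with the value of the previous non-missing date.
filled = df.sort_values("Date").copy()
filled["Sales"] = filled["Sales"].ffill()
print(filled)

# (d) Group by two columns, aggregate, then sort the totals within each product.
filled["Weekday"] = filled["Date"].dt.day_name()  # illustrative second key
totals = (
    filled.groupby(["Product", "Weekday"], as_index=False)["Sales"].sum()
    .sort_values(["Product", "Sales"], ascending=[True, False])
    .reset_index(drop=True)
)
print(totals)
```

Forward fill copies the most recent earlier value regardless of product, so the filled value for 2022-01-03 (product A) comes from 2022-01-02 (product B). Grouping by Product before filling would be the alternative if that is not wanted.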
Q7. Consider a data frame containing data about students i.e. name, gender and
passing division:
a. Perform one hot encoding of the last two columns of categorical data using the
get_dummies() function.
b. Sort this data frame on the “Birth Month” column (i.e. January to December). Hint:
Convert Month to Categorical.
Answer:-
import pandas as pd
df = pd.DataFrame(data)
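The dictionary data is not shown in the listing above. A minimal sketch of both parts, assuming a small made-up set of records with the columns Name, Birth Month, Gender and Pass Division (every value below is hypothetical):

```python
import pandas as pd

# Hypothetical student records; the real data is not given in the listing.
data = {
    "Name": ["Mudit", "Seema", "Karan", "Riya"],
    "Birth Month": ["December", "January", "March", "October"],
    "Gender": ["M", "F", "M", "F"],
    "Pass Division": ["III", "II", "I", "I"],
}
df = pd.DataFrame(data)

# (a) One-hot encode the last two (categorical) columns.
encoded = pd.get_dummies(df, columns=["Gender", "Pass Division"])
print(encoded)

# (b) Make Birth Month an ordered categorical so sorting follows the calendar,
# not the alphabet.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
df["Birth Month"] = pd.Categorical(df["Birth Month"], categories=months, ordered=True)
by_month = df.sort_values("Birth Month").reset_index(drop=True)
print(by_month)
```

Without the ordered categorical, sort_values would place December before January because plain strings sort alphabetically.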
OUTPUT:-
Q8. Consider the following data frame containing a family name, gender of the family
member and her/his monthly income in each record.
c. Calculate and display monthly income of all members with income greater than Rs.
60000.00.
d. Calculate and display the average monthly income of the female members in the
Shah family.
Answer:-
import pandas as pd
df = pd.DataFrame(data)
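The data frame contents are not included in the listing above. A minimal sketch of parts (c) and (d), assuming the columns Name (family name), Gender and MonthlyIncome; the records are made up for illustration:

```python
import pandas as pd

# Hypothetical family records; the real table is not given in the listing.
data = {
    "Name": ["Shah", "Vats", "Vats", "Kumar", "Shah", "Kumar", "Shah"],
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male", "Female"],
    "MonthlyIncome": [44000.0, 65000.0, 43150.0, 66500.0, 155000.0, 103000.0, 55000.0],
}
df = pd.DataFrame(data)

# (c) Members whose monthly income is greater than Rs. 60000.00.
high_income = df[df["MonthlyIncome"] > 60000]
print(high_income)

# (d) Average monthly income of the female members of the Shah family.
shah_female = df[(df["Name"] == "Shah") & (df["Gender"] == "Female")]
shah_female_avg = shah_female["MonthlyIncome"].mean()
print("Average income of female members in the Shah family:", shah_female_avg)
```

Each condition in the combined boolean mask needs its own parentheses because & binds more tightly than ==.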
OUTPUT:-