HW 1
PID: <A17054351>
Instructions
Submit your solutions online on Gradescope
Look at the detailed instructions here
I certify that the following write-up is my own work, and have abided by the UCSD Academic Integrity Guidelines.
Yes
No
Question 1
For this question you will use the class data from HW-0 to generate insights with the help of pandas
The dataset student_data_189.csv is available on Github here or on Canvas in the Files tab.
In [ ]: import numpy as np
import pandas as pd
import seaborn as sns  # used for the plots in Question 1
file:///Users/dylanoquendo/Downloads/materials-main/notebooks/hw-1.html 1/11
1/25/24, 12:02 AM hw-1
a. Read the dataset as a pandas dataframe and print the first 5 rows of the dataframe.
In [ ]: df = pd.read_csv('student_data_189.csv')
df.head()  # first 5 rows
b. Print the number of variables and the number of observations in the dataset.
In [ ]: # df.shape is (rows, columns); df.size would count every cell instead of the rows
print('There are', df.shape[1], 'variables, and', df.shape[0], 'observations in the dataset.')
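The three counts are easy to mix up: `shape` gives rows and columns, while `size` gives the total number of cells. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not the class dataset):

```python
import pandas as pd

# Hypothetical toy frame: 3 observations (rows) and 2 variables (columns).
toy = pd.DataFrame({"year": [1, 2, 3], "major": ["math", "cs", "bio"]})

n_obs, n_vars = toy.shape   # (number of rows, number of columns)
n_cells = toy.size          # rows * columns

print(n_vars, "variables,", n_obs, "observations,", n_cells, "cells")
```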
c. Describe the type for each variable you answered in your survey.
In [ ]: possible_types = ['categorical', 'ordinal', 'discrete quantitative', 'continuous quantitative']
df.columns
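print('Based on names of the columns, name would be categorical, fav_color would be categorical, math183_excited would be ordinal,\n'
      'seat_comfort would be ordinal, year would be discrete quantitative, major would be categorical, wi24_credits would be discrete quantitative,\n'
      'time_reading would be continuous quantitative, time_physical would be continuous quantitative, time_online would be continuous quantitative, and sex would be categorical.')
Based on names of the columns, name would be categorical, fav_color would be categorical, math183_excited would be ordinal,
seat_comfort would be ordinal, year would be discrete quantitative, major would be categorical, wi24_credits would be discrete quantitative,
time_reading would be continuous quantitative, time_physical would be continuous quantitative, time_online would be continuous quantitative, and sex would be categorical.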
d. Create a boxplot of the number of hours of physical activity by sex. Do you see any differences?
In [ ]: sns.boxplot(df, x = 'sex', y = 'time_physical')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/seaborn/categorical.py:640:
FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
positions = grouped.grouper.result_index.to_numpy(dtype=float)
Out[ ]: <Axes: xlabel='sex', ylabel='time_physical'>
The two boxplots look very similar. The female plot appears to have a more extreme outlier, and its middle 50% (the box) sits lower than the male one, while the male plot has more outliers.
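Visual impressions like these can be checked with numbers. The sketch below computes the quartiles and counts points beyond 1.5 × IQR from the box, which matches seaborn's default whisker setting (`whis=1.5`). The `toy` frame and the `box_summary` helper are hypothetical stand-ins, not the class data:

```python
import pandas as pd

# Hypothetical stand-in for the 'sex' and 'time_physical' columns.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "time_physical": [1.0, 2.0, 3.0, 4.0, 20.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

def box_summary(s):
    """Quartiles plus the number of points beyond 1.5 * IQR from the box."""
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    n_outliers = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
    return {"q1": q1, "median": med, "q3": q3, "n_outliers": n_outliers}

# One summary row per group.
rows = {sex: box_summary(s) for sex, s in toy.groupby("sex")["time_physical"]}
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

On the real data, replacing `toy` with `df` would give the numbers behind each box.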
e. Create a boxplot of the number of credits taken by sex. Do you see any differences?
In [ ]: sns.boxplot(df, x = 'sex', y = 'wi24_credits')
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/seaborn/categorical.py:640:
FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
positions = grouped.grouper.result_index.to_numpy(dtype=float)
Out[ ]: <Axes: xlabel='sex', ylabel='wi24_credits'>
The bulk of the female plot sits higher than the bulk of the male plot. The males have fewer outliers, but also a higher maximum.
f. Create a scatterplot of the number of hours of physical activity vs. the number of hours online. Do you see any patterns?
In [ ]: sns.scatterplot(df, y = 'time_physical', x = 'time_online')
There appears to be a positive correlation between the two on average, with most points clustered near the smaller values and some outliers.
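A correlation coefficient can back up what the scatterplot suggests. The sketch below uses made-up values (the `toy` frame is hypothetical) and computes both the Pearson coefficient and a rank-based (Spearman) coefficient, which is less sensitive to a few extreme points:

```python
import pandas as pd

# Hypothetical stand-in for the two time columns, with one extreme point.
toy = pd.DataFrame({
    "time_online": [1.0, 2.0, 3.0, 4.0, 5.0, 30.0],
    "time_physical": [1.0, 1.5, 2.5, 2.0, 3.5, 4.0],
})

# Pearson: strength of the linear association.
pearson = toy["time_online"].corr(toy["time_physical"])

# Spearman: Pearson correlation of the ranks.
spearman = toy["time_online"].rank().corr(toy["time_physical"].rank())

print("Pearson:", round(pearson, 3), "Spearman:", round(spearman, 3))
```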
g. Create a bar chart for the overall comfort in the classroom's seating.
In [ ]: # countplot draws one bar per seat_comfort level, with bar height equal to its count
sns.countplot(data=df, x='seat_comfort')
h. Create another column called fav_color_simplified which keeps the five most popular fav_color as is, but changes every other color to other. Create a bar chart of the new column fav_color_simplified.
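In [ ]: # determine the five most popular colors here
# Hint: you can use the .value_counts() for this
popular_colors = df['fav_color'].value_counts().head(5).index.tolist()
# keep the five most popular colors and relabel every other color as 'other'
df['fav_color_simplified'] = df['fav_color'].where(df['fav_color'].isin(popular_colors), 'other')
In [ ]: # one bar per simplified color, with bar height equal to its count
sns.countplot(data=df, x='fav_color_simplified')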
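The "keep the most frequent values, relabel the rest" step can be tried on a small made-up Series first. This sketch keeps the top two values instead of five for brevity, and the `colors` data is hypothetical:

```python
import pandas as pd

# Hypothetical favorite colors.
colors = pd.Series(["blue", "blue", "red", "red", "green", "teal", "pink"])

# The two most frequent values (value_counts sorts by count, descending).
top = colors.value_counts().head(2).index

# Keep values in `top`, replace everything else with 'other'.
simplified = colors.where(colors.isin(top), "other")

print(simplified.value_counts())
```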
Question 2
Consider the following list:
In [ ]: my_list = [
"+0.07",
"-0.07",
"+0.25",
"-0.84",
"+0.32",
"-0.24",
"-0.97",
"-0.36",
"+1.76",
"-0.36"
]
Out[ ]: str
The output above confirms that the list contains data of type str (string).
b. Create three new objects called my_list_float, my_vec_int, and my_array, which convert my_list to float, integer, and NumPy array types, respectively.
In [ ]: my_list_float = [float(x) for x in my_list]
my_list_int = [int(float(x)) for x in my_list]
my_array = np.array(my_list)
c. What is the difference between my_list_float and my_array? E.g., what happens when you multiply them by 2?
In [ ]: floattimestwo = my_list_float * 2
arraytimestwo = my_array.astype(float) * 2
print(floattimestwo)
print(arraytimestwo)
[0.07, -0.07, 0.25, -0.84, 0.32, -0.24, -0.97, -0.36, 1.76, -0.36, 0.07, -0.07, 0.25, -0.84, 0.32, -0.24, -0.97, -0.36, 1.76, -0.36]
[ 0.14 -0.14 0.5 -1.68 0.64 -0.48 -1.94 -0.72 3.52 -0.72]
Multiplying the float list by 2 repeats the list: it is concatenated with itself, so it ends up with twice as many elements and none of the values change. The array holds strings, so it must first be converted to a numeric type. Multiplying by 2 then scales each element by two, like mathematical scalar multiplication.
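To scale the values of a plain list element-wise, one option is a list comprehension. A minimal sketch with hypothetical values, showing the three behaviors side by side:

```python
import numpy as np

values = [0.5, -1.0, 2.0]               # hypothetical floats

repeated = values * 2                   # list repetition: the list is concatenated with itself
doubled_list = [2 * v for v in values]  # element-wise scaling of a plain list
doubled_arr = np.array(values) * 2      # element-wise scaling of a NumPy array

print(repeated)
print(doubled_list)
print(doubled_arr)
```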
d. Let's call my_array as x. Compute the $\ell_2$ and $\ell_1$ norm of x, and compute the dot product of x with itself.
In [ ]: x = my_array.astype(float)  # my_array holds strings, so convert before doing arithmetic
l2_norm = np.linalg.norm(x, ord=2)
l1_norm = np.linalg.norm(x, ord=1)
x_dot_x = np.dot(x, x)
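The three quantities are closely related: the $\ell_1$ norm is the sum of absolute values, the $\ell_2$ norm is the square root of the sum of squares, and the dot product of x with itself equals the squared $\ell_2$ norm. A small sketch that recomputes them by hand, using the ten values from the list above (the variable names are illustrative):

```python
import numpy as np

x = np.array([0.07, -0.07, 0.25, -0.84, 0.32, -0.24, -0.97, -0.36, 1.76, -0.36])

l1_manual = np.sum(np.abs(x))         # sum of absolute values
l2_manual = np.sqrt(np.sum(x ** 2))   # square root of the sum of squares
dot = x @ x                           # dot product of x with itself

print(np.isclose(l1_manual, np.linalg.norm(x, ord=1)))   # manual l1 agrees with NumPy
print(np.isclose(l2_manual, np.linalg.norm(x, ord=2)))   # manual l2 agrees with NumPy
print(np.isclose(dot, l2_manual ** 2))                   # x . x equals the squared l2 norm
```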
In [ ]: np.random.seed(42)
A = np.random.randn(1000, 10)
In [ ]: AtA = np.dot(A.T, A)
eigenvalues, eigenvectors = np.linalg.eigh(AtA)
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]
top2_eigenvalues = eigenvalues[:2]
top2_eigenvectors = eigenvectors[:, :2]
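One way to sanity-check the decomposition is through the singular values of A: the eigenvalues of AᵀA should equal the squared singular values of A. A sketch of that check, assuming a freshly generated matrix (the seed 0 here is arbitrary and differs from the cell above):

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, for illustration only
A = rng.standard_normal((1000, 10))

w, _ = np.linalg.eigh(A.T @ A)             # eigh returns eigenvalues in ascending order
eigenvalues = w[::-1]                      # reorder to descending
singular_values = np.linalg.svd(A, compute_uv=False)  # already in descending order

print(np.allclose(eigenvalues, singular_values ** 2))
```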
g. Let $v$ be the vector obtained by summing the squares of the rows of $A$. Plot the histogram of $v$ with the Y-axis to show
h. Using the same fig, ax objects from part (g), overlay the probability density function of the $\chi^2(10)$ distribution, i.e., the chi-squared distribution with 10 degrees of freedom.
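A minimal sketch of parts (g) and (h) together. It assumes matplotlib is installed and that the Y-axis should show density (the prompt in (g) is cut off, so this is a guess). The helper `chi2_pdf` is a hypothetical name that writes out the chi-squared density formula by hand:

```python
import math

import matplotlib.pyplot as plt
import numpy as np

def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom."""
    return x ** (k / 2 - 1) * np.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

np.random.seed(42)
A = np.random.randn(1000, 10)
v = np.sum(A ** 2, axis=1)                 # sum of squared entries in each row

fig, ax = plt.subplots()
ax.hist(v, bins=30, density=True, alpha=0.5, label="v")   # density=True: bar areas sum to 1

# Overlay the chi2(10) density on the same axes.
grid = np.linspace(0, v.max(), 200)
ax.plot(grid, chi2_pdf(grid, k=10), label="chi2(10) pdf")
ax.set_xlabel("v")
ax.set_ylabel("density")
ax.legend()
```

In a notebook the figure normally renders when the cell finishes. In a plain script, a final `plt.show()` is needed. If SciPy is available, `scipy.stats.chi2.pdf(grid, df=10)` could replace the hand-written density.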
i. What do you observe in the previous plot? Why do you think this is the case?
I cannot get the plot to show.
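Independently of the plot, the expected relationship can be checked numerically. Each row of A holds 10 independent standard normal draws, and a sum of k squared independent standard normals follows a chi-squared distribution with k degrees of freedom, which has mean k and variance 2k. A sketch of that check, reusing the seed from the cell that defines A:

```python
import numpy as np

np.random.seed(42)
A = np.random.randn(1000, 10)
v = np.sum(A ** 2, axis=1)        # one sum of squares per row

k = A.shape[1]                    # degrees of freedom = entries per row
sample_mean = v.mean()
sample_var = v.var(ddof=1)

print("sample mean:", sample_mean, "theory:", k)
print("sample var: ", sample_var, "theory:", 2 * k)
```

With 1000 rows, the sample values should land near the theoretical ones, though not exactly on them.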