
HW 1

The document outlines the homework assignment HW-1 for Math 189, due on January 24, 2024, and includes instructions for submission and academic integrity certification. It consists of various questions requiring data analysis using pandas and visualization with seaborn and matplotlib, focusing on a dataset of student responses. The tasks include generating insights, creating plots, and performing calculations related to data types, statistics, and matrix operations.

Uploaded by

dande.t.lion
1/25/24, 12:02 AM hw-1

HW-1 • Math 189 • Wi 2024


Due Date: Wed, Jan 24
NAME: <Dylan Oquendo>

PID: <A17054351>

Instructions
Submit your solutions online on Gradescope
Look at the detailed instructions here
I certify that the following write-up is my own work, and have abided by the UCSD Academic Integrity Guidelines.
Yes
No

Question 1
For this question you will use the class data from HW-0 to generate insights with the help of pandas
The dataset student_data_189.csv is available on Github here or on Canvas in the Files tab.

In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

a. Read the dataset as a pandas dataframe and print the first 5 rows of the dataframe.
In [ ]: df = pd.read_csv('student_data_189.csv')
df.head()  # first 5 rows

b. Print the number of variables and the number of observations in the dataset.
In [ ]: # df.size counts cells (rows x columns); the observations are the rows, df.shape[0]
print('There are', df.shape[1], 'variables, and', df.shape[0], 'observations in the dataset.')

There are 11 variables, and 275 observations in the dataset.
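As a side note on counting, here is a minimal sketch of how pandas reports rows, columns, and cells. The frame toy_df and its values are made up for illustration and are not the class data. DataFrame.size is rows × columns, so for this dataset a size of 3025 with 11 columns corresponds to 3025 / 11 = 275 rows.

```python
import pandas as pd

# Hypothetical toy frame: 3 observations (rows) of 2 variables (columns).
toy_df = pd.DataFrame({"sex": ["F", "M", "F"], "wi24_credits": [16, 12, 20]})

n_obs, n_vars = toy_df.shape   # (rows, columns)
n_cells = toy_df.size          # rows * columns, not the number of observations

print(n_vars, "variables,", n_obs, "observations,", n_cells, "cells")
```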

c. Describe the type for each variable you answered in your survey.
In [ ]: possible_types = ['categorical', 'ordinal', 'discrete quantitative', 'continuous quantitative']
df.columns
print('Based on names of the columns, name would be categorical, fav_color would be categorical, math183_ex

Based on names of the columns, name would be categorical, fav_color would be categorical, math183_excited would be ordinal,
seat_comfort would be ordinal, year would be discrete quantitative, major would be categorical, wi24_credits would be discrete quantitative,
time_reading would be continuous quantitative, time_physical would be continuous quantitative, time_online would be continuous quantitative, and sex would be categorical.

d. Create a boxplot of the number of hours of physical activity by sex. Do you see any differences?
In [ ]: sns.boxplot(df, x = 'sex', y = 'time_physical')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/seaborn/categorical.py:640:
FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
positions = grouped.grouper.result_index.to_numpy(dtype=float)
Out[ ]: <Axes: xlabel='sex', ylabel='time_physical'>


The two boxplots look very similar, although the female group appears to have a more extreme outlier and a lower middle 50% (the box) than the male group, while the male group has more outliers.
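The visual comparison can be backed by numbers. The sketch below computes, per group, the quartiles and the count of points beyond the usual 1.5 × IQR whiskers (the default rule seaborn's boxplot uses). The data and the helper box_summary are hypothetical, chosen only to illustrate the calculation.

```python
import pandas as pd

# Hypothetical hours of physical activity; not the class data.
toy = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "time_physical": [2.0, 3.0, 4.0, 5.0, 30.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

def box_summary(s: pd.Series) -> pd.Series:
    """Quartiles plus the number of points beyond the 1.5 * IQR whiskers."""
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    outliers = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
    return pd.Series({"q1": q1, "median": med, "q3": q3, "n_outliers": outliers})

# One row per group, one column per statistic.
summary = toy.groupby("sex")["time_physical"].apply(box_summary).unstack()
print(summary)
```

On the real data, replacing toy with df would give the same table for the two groups in the plot.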
e. Create a boxplot of the number of credits taken by sex. Do you see any differences?
In [ ]: sns.boxplot(df, x = 'sex', y = 'wi24_credits')

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/seaborn/categorical.py:640:
FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
positions = grouped.grouper.result_index.to_numpy(dtype=float)
Out[ ]: <Axes: xlabel='sex', ylabel='wi24_credits'>


I see that the bulk of the female plot sits higher than the male plot. The male group has fewer outliers, but also a higher maximum.
f. Create a scatterplot of the number of hours of physical activity vs. the number of hours online. Do you see any patterns?
In [ ]: sns.scatterplot(df, y = 'time_physical', x = 'time_online')

Out[ ]: <Axes: xlabel='time_online', ylabel='time_physical'>


There seems to be a roughly positive association between the two, with a cluster near the smaller values and some outliers.
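A correlation coefficient is one way to put a number on the pattern in a scatterplot. This is a minimal sketch on made-up values (not the class data). It reports the Pearson correlation and a rank-based (Spearman-style) version, computed here as the Pearson correlation of the ranks, which is less sensitive to outliers.

```python
import pandas as pd

# Hypothetical hours per week; not the class data.
toy = pd.DataFrame({
    "time_online":   [10.0, 20.0, 30.0, 40.0, 50.0],
    "time_physical": [2.0, 3.0, 3.0, 5.0, 6.0],
})

# Pearson: strength of the linear association.
pearson = toy["time_online"].corr(toy["time_physical"])

# Rank correlation: Pearson applied to the ranks of each column.
spearman = toy["time_online"].rank().corr(toy["time_physical"].rank())

print(round(pearson, 3), round(spearman, 3))
```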
g. Create a bar chart for the overall comfort in the classroom's seating.
In [ ]: #sns.barplot(df, y = 'seat_comfort')
sns.histplot(df, x = 'seat_comfort')

Out[ ]: <Axes: xlabel='seat_comfort', ylabel='Count'>
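For comparison, a bar chart of a discrete rating can also be built directly from the counts. The sketch below assumes, purely for illustration, that the ratings are integers from 1 to 5. The values and the names ratings, fig_bar, and ax_bar are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 1-5 comfort ratings; not the class data.
ratings = pd.Series([3, 4, 4, 5, 2, 4, 3, 1, 4, 3], name="seat_comfort")

# One count per distinct rating, ordered by rating rather than by frequency.
counts = ratings.value_counts().sort_index()

fig_bar, ax_bar = plt.subplots(figsize=(4, 3))
ax_bar.bar(counts.index.astype(str), counts.to_numpy())  # one bar per rating
ax_bar.set_xlabel("seat_comfort")
ax_bar.set_ylabel("Count")
```

Treating the ratings as string labels gives one bar per category with no gaps or bin edges to tune.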


h. Create another column called fav_color_simplified which keeps the five most popular fav_color as is, but changes every other color to other. Create a bar chart of the new column fav_color_simplified.
In [ ]: # determine the five most popular colors here
# Hint: you can use the .value_counts() for this
popular_colors = df['fav_color'].value_counts()[:5].index.tolist()

Changed to histplot for better visualization


In [ ]: df['fav_color_simplified'] = df['fav_color'].apply(lambda x: x if x in popular_colors else 'other')
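The same relabeling can be written without a lambda by combining isin with where. This is only an alternative sketch on made-up answers, keeping the top 2 colors instead of the top 5 so the toy example stays small.

```python
import pandas as pd

# Hypothetical answers; not the class data.
colors = pd.Series(["blue", "red", "blue", "green", "teal", "red", "blue", "pink"])

top = colors.value_counts().nlargest(2).index          # top 2 here; the homework uses 5
simplified = colors.where(colors.isin(top), "other")   # keep popular colors, relabel the rest

print(simplified.value_counts())
```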

In [ ]: #sns.barplot(df, y = 'fav_color_simplified')
sns.histplot(df, x = 'fav_color_simplified')


Out[ ]: <Axes: xlabel='fav_color_simplified', ylabel='Count'>

Question 2
Consider the following list:
In [ ]: my_list = [
    "+0.07",
    "-0.07",
    "+0.25",
    "-0.84",
    "+0.32",
    "-0.24",
    "-0.97",
    "-0.36",
    "+1.76",
    "-0.36"
]

a. What data type does the list contain?


In [ ]: type(my_list[0])

Out[ ]: str

The code above checks the first element, which is of type str (string). Every entry in the list is written as a quoted literal, so the whole list contains strings.
b. Create three new objects called my_list_float, my_vec_int and my_array which convert my_list to float, integer and NumPy array types, respectively.
In [ ]: my_list_float = [float(x) for x in my_list]
my_list_int = [int(float(x)) for x in my_list]
my_array = np.array(my_list)
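Two details of these conversions are easy to miss, so here is a small sketch on three entries taken from my_list. The names sample, as_float, as_int, as_str_array, and as_num_array are introduced only for this illustration. Note that int() truncates toward zero, and that np.array on a list of strings keeps them as strings unless a numeric dtype is requested.

```python
import numpy as np

sample = ["+0.07", "-0.97", "+1.76"]           # a few entries from my_list

as_float = [float(s) for s in sample]          # parses the sign and the decimal part
as_int = [int(float(s)) for s in sample]       # int() truncates toward zero
as_str_array = np.array(sample)                # still strings (a Unicode dtype)
as_num_array = np.array(sample, dtype=float)   # requesting float converts on construction

print(as_int, as_str_array.dtype, as_num_array.dtype)
```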

c. What is the difference between my_list_float and my_array? E.g., what happens when you multiply them by 2?
In [ ]: floattimestwo = my_list_float * 2
arraytimestwo = my_array.astype(float) * 2
print(floattimestwo)
print(arraytimestwo)

[0.07, -0.07, 0.25, -0.84, 0.32, -0.24, -0.97, -0.36, 1.76, -0.36, 0.07, -0.07, 0.25, -0.84, 0.32, -0.24, -0.97, -0.36, 1.76, -0.36]
[ 0.14 -0.14 0.5 -1.68 0.64 -0.48 -1.94 -0.72 3.52 -0.72]

Multiplying the float list by 2 repeats the list: the result has twice as many elements, and the values themselves are unchanged. The array holds strings, so it has to be converted to a numeric type first; multiplying it by 2 then scales each element by two, like scalar multiplication of a vector.
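The contrast between the two behaviors can be reproduced on a shorter list. This is a minimal sketch with illustrative names (values, repeated, scaled_list, scaled_array) that are not part of the assignment.

```python
import numpy as np

values = [0.07, -0.07, 0.25]

repeated = values * 2                    # list repetition: six elements, values unchanged
scaled_list = [2 * v for v in values]    # element-wise doubling of a list needs an explicit loop
scaled_array = np.array(values) * 2      # NumPy broadcasts the scalar over every element

print(repeated)
print(scaled_array)
```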
d. Let's call my_array as x. Compute the $\ell_2$ and $\ell_1$ norm of x, and compute the dot product of x with itself.


In [ ]: x = my_array.astype(float)  # my_array holds strings, so convert once before doing arithmetic
l2_norm = np.linalg.norm(x, ord=2)
l1_norm = np.linalg.norm(x, ord=1)
x_dot_x = np.dot(x, x)
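As a cross-check, the norms can be written out from their definitions: the $\ell_1$ norm is the sum of absolute values, the $\ell_2$ norm is the square root of the sum of squares, and the dot product of x with itself equals the squared $\ell_2$ norm. The sketch below uses the values from my_list. The names l1, l2, and dot are introduced only for this check.

```python
import numpy as np

x = np.array(["+0.07", "-0.07", "+0.25", "-0.84", "+0.32",
              "-0.24", "-0.97", "-0.36", "+1.76", "-0.36"]).astype(float)

l1 = np.sum(np.abs(x))         # sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))   # square root of the sum of squares
dot = x @ x                    # same as np.dot(x, x); equals l2 ** 2

print(l1, l2, dot)
```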

e. Let $A$ be the following matrix:

In [ ]: np.random.seed(42)
A = np.random.randn(1000, 10)

Find the row-wise and column-wise mean of $A$.

In [ ]: row_mean = np.mean(A, axis=1)
col_mean = np.mean(A, axis=0)
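The axis argument names the dimension that gets collapsed, which is easy to mix up. A small sketch on a hypothetical 2 × 3 matrix M shows the two cases. For the 1000 × 10 matrix in this part, axis=1 therefore gives 1000 row means and axis=0 gives 10 column means.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # hypothetical 2 x 3 matrix

row_means = M.mean(axis=1)   # collapse the columns: one mean per row
col_means = M.mean(axis=0)   # collapse the rows: one mean per column

print(row_means, col_means)
```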

f. Find the top 2 eigenvalues and eigenvectors of $A^\top A$.

In [ ]: AtA = np.dot(A.T, A)
eigenvalues, eigenvectors = np.linalg.eigh(AtA)
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]
top2_eigenvalues = eigenvalues[:2]
top2_eigenvectors = eigenvectors[:, :2]
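One way to sanity-check the result is to verify the defining relation $Gv = \lambda v$ for $G = A^\top A$, and to compare the eigenvalues with the squared singular values of $A$, which they should match. The sketch below rebuilds the same matrix from the seed. The names G, w, V, residual, and s are introduced only for this check.

```python
import numpy as np

np.random.seed(42)
A = np.random.randn(1000, 10)
G = A.T @ A                               # 10 x 10 symmetric Gram matrix

w, V = np.linalg.eigh(G)                  # eigh returns eigenvalues in ascending order
order = np.argsort(w)[::-1]               # reorder so the largest comes first
w, V = w[order], V[:, order]

# Each column V[:, i] should satisfy G v = w[i] v, so this should be near zero.
residual = G @ V[:, :2] - V[:, :2] * w[:2]

# Singular values of A come back in descending order; their squares are the eigenvalues of G.
s = np.linalg.svd(A, compute_uv=False)

print(w[:2], np.abs(residual).max())
```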

g. Let $v$ be the vector obtained by summing the squares of the rows of $A$. Plot the histogram of $v$ with the Y-axis to show the normalized frequency of each bin.


In [ ]: v = np.sum(A**2, axis=1)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.hist(v, bins=30, density=True, color='skyblue', edgecolor='black')


Out[ ]: (array([0.00543086, 0.00868938, 0.03801604, 0.04670542, 0.06191184,
0.08906615, 0.0912385 , 0.11296195, 0.1107896 , 0.08037677,
0.07060122, 0.07711825, 0.04779159, 0.04561925, 0.04453308,
0.0358437 , 0.02498197, 0.02280962, 0.01629259, 0.0119479 ,
0.01086173, 0.00868938, 0.00760321, 0.00325852, 0.00543086,
0. , 0.00217235, 0.00325852, 0. , 0.00217235]),
array([ 1.09471143, 2.01537543, 2.93603943, 3.85670343, 4.77736743,
5.69803143, 6.61869543, 7.53935943, 8.46002343, 9.38068743,
10.30135143, 11.22201543, 12.14267943, 13.06334343, 13.98400743,
14.90467143, 15.82533543, 16.74599943, 17.66666343, 18.58732743,
19.50799143, 20.42865543, 21.34931943, 22.26998343, 23.19064743,
24.11131143, 25.03197543, 25.95263943, 26.87330342, 27.79396742,
28.71463142]),
<BarContainer object of 30 artists>)
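With density=True the bar heights are densities rather than proportions: each height times its bin width is the fraction of rows in that bin, and these fractions sum to 1. In the output above, for instance, the first height 0.00543086 times the bin width of about 0.9207 is 0.005, i.e. 5 of the 1000 rows. The sketch below checks the total area with np.histogram on the same seeded matrix. The names heights, edges, and area are introduced only for this check.

```python
import numpy as np

np.random.seed(42)
A = np.random.randn(1000, 10)
v = np.sum(A ** 2, axis=1)

heights, edges = np.histogram(v, bins=30, density=True)
area = np.sum(heights * np.diff(edges))   # bar height times bin width, summed over bins

print(area)
```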


h. Using the same fig, ax objects from part (g), overlay the probability density function of the $\chi^2(10)$ distribution, i.e. the chi-squared distribution with 10 degrees of freedom.

In [ ]: !pip install scipy
%matplotlib inline
import scipy.stats as stats

Requirement already satisfied: scipy in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (1.12.0)
Requirement already satisfied: numpy<1.29.0,>=1.22.4 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from scipy) (1.26.3)

In [ ]: x_range = np.linspace(0, 30, 1000)
y = stats.chi2.pdf(x_range, df=10)
ax.plot(x_range, y)
fig  # re-display the figure from part (g); with the inline backend, plt.show() in a later cell draws nothing

i. What do you observe in the previous plot? Why do you think this is the case?
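I could not get the plot to show. Most likely the figure from part (g) had already been rendered and closed by the inline backend, so calling plt.show() in a later cell had nothing left to draw.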
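One way to sidestep the display problem is to build the histogram and the overlay in a single cell, so the figure is complete when it is rendered. The following is a sketch under that assumption, not the required solution. It rebuilds the same seeded matrix, and the red line color and the legend label are illustrative choices. Each row of the matrix holds 10 independent standard normal draws, so the sum of their squares follows a $\chi^2(10)$ distribution by definition, and the histogram would be expected to track the density curve closely.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
A = np.random.randn(1000, 10)
v = np.sum(A ** 2, axis=1)            # each entry is a sum of 10 squared N(0, 1) draws

# Histogram and overlay are drawn on the same axes within one cell.
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.hist(v, bins=30, density=True, color="skyblue", edgecolor="black")

x_range = np.linspace(0, 30, 1000)
pdf = stats.chi2.pdf(x_range, df=10)  # chi-squared density with 10 degrees of freedom
ax.plot(x_range, pdf, color="red", label="chi2(10) pdf")
ax.legend()
```

In a notebook, ending the cell with fig (or simply leaving the plotting calls as the last statements) displays the combined plot.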
