Homework
Homework
Instructions
• Submission format: (Jupyter Notebook or MATLAB Livescript) + HTML
– Your notebook should be a comprehensive report, not just a code snippet. Mark-ups are
mandatory to answer the homework questions. You need to use LaTeX equations in the
markup if you’re asked.
– Google Colab is the best place to begin with if this is the first time using iPython
notebook. No need to use GPUs.
– Download your notebook as an .html version and submit it as well, so that the AIs can
check out the plots and audio. Here is how to convert to html in Google Colab.
– Meaning you need to embed an audio player in there if you’re asked to submit an audio
file
• Avoid using toolboxes.
1
5. First off, create a DFT matrix F using the equation shown in M02-L01-S11 and S12. You’ll
of course create a N × N complex matrix, but if you see its real and imaginary versions
separately, you’ll see something like the ones in M02-L01-S14 (the ones in the slide are 20×20,
i.e. N = 20). For this problem let’s fix N = 1024.
6. Prepare your data matrix X. You extract the first frame of N samples from the input signal,
and apply a Hann window1 . What that means is that from the definition of Hann window, you
create a window of size N and element-wise multiply the window and your N audio samples.
Place it as your first column vector of the data matrix X. Move by N/2 samples. Extract
another frame of N samples and apply the window. This goes to the second column vector of
X. Do it for your third frame (which should start from (N + 1)’th sample, and so on. Since
you moved just by the half of the frame size, your frames are overlapping each other by 50%.
(Note: this time it’s okay to use the toolbox to calculate Hann windows.)
7. Apply the DFT matrix to your data matrix, i.e. Y = F X. This is your spectrogram with
complex values. See how it looks like (by taking magnitudes and plotting). For example, you
can use imshow in matplotlib.
8. In this spectrogram, identify frames that are only with noise2 . For example the ones at the
end of signal would be a good choice. Take a sample mean of the chosen Pcolumn vectors (the
1
original magnitudes, not the exponentiated ones), e.g. M = |Cnoise | i∈Cnoise |Y :,i |, where
Cnoise is the set of chosen frames and |Cnoise | is the number of frames. This is your noise
model.
9. Subtract M out of all the magnitude spectra, |Y |. This will give you some residual mag-
nitudes with suppressed noise. Be careful with negative values: you don’t want them in
your “magnitude” spectra. One quick method to remove them is to turn them into zeros.
Get the original phase from the input spectrogram, i.e. Y /|Y | (element-wise division), and
multiply each of the phase values by the corresponding cleaned-up magnitude to recover the
complex-valued spectra of the estimated clean speech.
10. Multiply the inverse DFT matrix, which you can also create by using the equation in S12.
Let’s call this F ∗ . Since it’s the inverse transform, F ∗ F ≈ I (you can check it, although the
off diagonals might be a very small number rather than zero). You multipy this matrix to
your spectrogram, which is with suppressed white noise, to get back to the recovered version
of your data matrix, X̂. In theory this should give you a real-valued matrix, but you’ll still
see some imaginary parts with a very small value. Ignore them by just taking the real part.
Reverse the procedure in 1.6 to get the time domain signal. Basically it must be a procedure
that transpose every column vector of X̂ and overlap-and-add the right half of t-th row vector
with the left half of the (t + 1)-th row vector and so on. Listen to the signal to check if the
white noise is suppressed.
1 https://en.wikipedia.org/wiki/Hann_function
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.get_window.html
2 Depending on the plotting function you use, it’s possible that you can’t really “see” the white noise. It’s because
your white noise is not loud enough. What you can do to better visualize this spectrogram is to exaggerate the small
magnitudes while suppress the large ones. For example, I can visualize |Y |0.5 instead of |Y |, where exponentiation
is element-wise. Don’t worry about this visualization issue if you can see the white noise-only frames from your
spectrogram.
2
11. Submit your code and the denoised audio file. Do NOT use any STFT functions you can find
in toolboxes.
4. You just saw that PCA might be able to replace the pre-fixed DCT basis vectors. But, as
you can see in your matrices, they are not the same. Discuss the pros and cons of PCA and
DCT in your report.
2. You are going to use a technique called “parallax” to solve this problem. It’s actually very
similar to the computer vision algorithm called “stereo matching” that stereophonic cameras
are using to find out the 3D depth information from the scene. That’s actually why we humans
can recognize the distance of a visual object (we have two eyes). See Figure 1 for an example.
3. Let’s get back to your remote planet. In your planet, parallax works by taking a picture of
the deep sky in June and another one in December (yes, you have 12 months there, too). If
you take a picture of the deep sky, you see the stars nearby (i.e. the ones in your galaxy)
changes its position much more in the two pictures, while the starts far way (i.e. the ones in
the neighboring galaxy) change their position less. See Figure 2 and 3.
4. june.png and december.png are the two pictures you took for this parallax project. In
theory, you need to apply a computer vision technique, called “non-maximum suppression,”
with which you can identify the position of all the stars in the two pictures. In theory, for
each of the stars in june.png, you look for its position in december.png by scanning a star
in the same row (because the stars always move to right). But we will not do it in here.
5. Instead, you get the (x, y) coordinates of all the stars in the two pictures. june.mat and
december.mat contain the positions of the stars in the two pictures. Each row has two
3
Left image Right image
Figure 1: The tree is closer than the mountain. So, from the left camera, the tree is located on
the right hand side, while the right camera captures it on the left hand side. On the contrary, the
mountain in the back does not have this disparity.
Your sun
A star in your galaxy
Figure 2: You take two pictures of the same area of the sky in June and December, respectively.
Because of the pretty big movement of your planet due to its revolution, you can see that the
close-by stars oscillate more in the two pictures than the far-away ones, just like the tree and the
mountain in Figure 1.
4
June Dec
Figure 3: The oscillation of the close star (green) in the two pictures. Note that the other stars
(blue) don’t move much.
coordinates, x-coordinate and y-coordinate, and there are 2,700 such rows in each matrix,
each of which is for a particular star. You can load the .mat files in python with the
scipy.io.loadmat(filepath) function3 4 .
6. If you take the first column vectors of the two matrices (i.e. the x-coordinates of the stars),
you can subtract the ones in June from the ones in December to find out their disparity, or
the amount of oscillation, which is a vector of 2,700 elements. Draw a histogram out of this
disparity values to gauge if you can see two clusters there (pay attention to the bin size).
Submit your histogram and answer in the report.
7. Write your own k-means clustering code, and apply on this disparity dataset. Find out the
cluster means and report the values. Which one do you think corresponds to the stars in your
galaxy and which is for the other galaxy? Why do you think so? Justify your answer in the
report.
3 https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html
4 Also, you might have to cast the value to int16 in python to avoid overflowing.