The Glowing Python: audio

Showing posts with label audio. Show all posts

Saturday, November 3, 2012

Text to Speech with correct intonation

Google has an unoffciale text to speech API. It can be accessed by http requests but it is limited to strings with less than 100 characters. In this post we will see how to split a text longer than 100 characters in order to obtain a correct voice intonation with this service. The approach is straighforward, we split the text in sentences with less than 100 characters according to the punctuation. Let's see how:

def parseText(text):
 """ returns a list of sentences with less than 100 caracters """
 toSay = []
 punct = [',',':',';','.','?','!'] # punctuation
 words = text.split(' ')
 sentence = ''
 for w in words:
  if w[len(w)-1] in punct: # encountered a punctuation mark
   if (len(sentence)+len(w)+1 < 100): # is there enough space?
    sentence += ' '+w # add the word
    toSay.append(sentence.strip()) # save the sentence
   else:
    toSay.append(sentence.strip()) # save the sentence
    toSay.append(w.strip()) # save the word as a sentence
   sentence = '' # start another sentence
  else:
   if (len(sentence)+len(w)+1 < 100):   
    sentence += ' '+w # add the word
   else:
    toSay.append(sentence.strip()) # save the sentence
    sentence = w # start a new sentence
 if len(sentence) > 0:
  toSay.append(sentence.strip())
 return toSay

Now, we can obtain the speech with a http request for each setence:

text = 'Think of color, pitch, loudness, heaviness, and hotness. Each is the topic of a branch of physics.'

print text
toSay = parseText(text)

google_translate_url = 'http://translate.google.com/translate_tts'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)')]

for i,sentence in enumerate(toSay):
 print i,len(sentence), sentence
 response = opener.open(google_translate_url+'?q='+sentence.replace(' ','%20')+'&tl=en')
 ofp = open(str(i)+'speech_google.mp3','wb')
 ofp.write(response.read())
 ofp.close()
 os.system('cvlc --play-and-exit -q '+str(i)+'speech_google.mp3')

The API returns the speech using the mp3 format. The code above saves the result of the query and plays it using vlc.

Wednesday, September 14, 2011

Uncertainty principle and spectrogram with pylab

The Fourier transform does not give any information on the time at which a frequency component occurs. One approach which can give information on the time resolution of the spectrum is the Short Time Fourier Transform (STFT). Here a moving window is applied to the signal and the Fourier transform is applied to the signal within the window as the window is moved [Ref]. The choice of window is very important with respect to the performance of the STFT in practice. Since the STFT is simply applying the Fourier transform to pieces of the time series of interest, a drawback of the STFT is that it will not be able to resolve events if they happen to appear within the width of the window. In this case, the lack of time resolution of the Fourier transform is present. In general, one cannot achieve simultaneous time and frequency resolution because of the Heisenberg uncertain principle. In the field of particle physics, an elementary particle does not have precise position and momentum. The better one determines the position of the particle, the less precisely is know at that time, and vice versa. For signal processing, this rule translates into the fact that a signal does not simultaneously have a precise location in time and precise frequency [Ref].

The library pylab provides the function specgram(...) to compute the spectrogram of a signal using the STFT. The following script uses that function to show the spectrogram of a signal with different windows size:

from scipy.io.wavfile import read,write
from pylab import plot,show,subplot,specgram

# Open the Homer Simpson voice: "Ummm, Crumbled up cookie things."
# from http://www.thesoundarchive.com/simpsons/homer/mcrumble.wav
rate,data = read('mcrumble.wav') # reading

subplot(411)
plot(range(len(data)),data)
subplot(412)
# NFFT is the number of data points used in each block for the FFT
# and noverlap is the number of points of overlap between blocks
specgram(data, NFFT=128, noverlap=0) # small window
subplot(413)
specgram(data, NFFT=512, noverlap=0) 
subplot(414)
specgram(data, NFFT=1024, noverlap=0) # big window

show()

This image is the result:

The pictures shows that changing the number of data points used for each Fourier transform block, the spectrogram loses definition in frequency or in the time.

Thursday, September 8, 2011

Sound Synthesis

Physically, sound is an oscillation of a mechanical medium that makes the surrounding air also oscillate and transport the sound as a compression wave. Mathematically, the oscillations can be described as

where t is the time, and f the frequency of the oscillation. Sound on a computer is a sequence of numbers and in this post we will see how to generate a musical tone with numpy and how to write it to a file wav file. Each musical note vibrates at a particular frequency and the following script contains a function to generate a note (tone(...)), we'll use this function to generate the A tone creating an oscillation with f = 440 Hz.

from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16
from pylab import plot,show,axis

# tone synthesis
def note(freq, len, amp=1, rate=44100):
 t = linspace(0,len,len*rate)
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

# A tone, 2 seconds, 44100 samples per second
tone = note(440,2,amp=10000)

write('440hzAtone.wav',44100,tone) # writing the sound to a file

plot(linspace(0,2,2*44100),tone)
axis([0,0.4,15000,-15000])
show()

Now we can play the file 440hzAtone.wav with an external player. This plot shows a part of the signal generated by the script:

And using the plotSpectrum function defined in a previous post we can verify that 440 Hz is the fundamental frequency of the tone.