
Published in Better Programming

Andrew Hershy

Sep 29, 2022 · 4 min read

Build a Python Web Application That Turns Voice Into Text Into Image

Speaking images into existence using DALL-E mini and AssemblyAI

DALL-E 2 image. Source prompt: "steampunk iPhone 12"

Introduction

Speech, text, and images are the three ways humanity has transmitted information throughout history. In this project, we are going to build an application that listens to speech, turns that speech into text, then turns that text into images. All of this can be done in an afternoon. We live in a remarkable time!

[Diagram: speech → text → image]

Background knowledge needed:


DALL-E was created by OpenAI. It introduced the world to AI-generated images and took off in popularity about a year ago. OpenAI also has a free API that does all sorts of other fun AI-related functions.

DALL-E mini is an open-source alternative to DALL-E that tinkerers, like you and me, can play around with for free. This is the engine we'll be leveraging in this tutorial.

DALL-E Playground is an open-source application that does two things:

1. Uses Google Colab to create and run a backend DALL-E mini server, which provides the GPU processing needed to generate images.

2. Provides a front-end web interface, written in JavaScript, that users can interact with and view their images on. This interface is linked to the Google Colab server.
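
Concretely, the Playground backend exposes a single /dalle endpoint that the front end posts a prompt to, and it answers with a JSON list of base64-encoded images; this is exactly the exchange the dalle.py file later in this article relies on. Stripped of the Streamlit layer, a raw call looks roughly like the sketch below. The tunnel URL is a placeholder you'd swap for the one your own Colab session prints:

#minimal sketch of calling a DALL-E Playground backend directly
import base64
import requests

#placeholder: each Colab session generates its own trycloudflare.com URL
BACKEND_URL = "https://your-tunnel-subdomain.trycloudflare.com"

resp = requests.post(
    BACKEND_URL + "/dalle",
    headers={"Bypass-Tunnel-Reminder": "go"},
    json={"text": "steampunk iPhone 12", "num_images": 2},
)

#the backend responds with a JSON list of base64-encoded images
for i, encoded in enumerate(resp.json()):
    with open(f"image_{i}.png", "wb") as f:
        f.write(base64.b64decode(encoded))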

What this application does

1. Reengineers DALL-E Playground's front-end interface from JavaScript into Streamlit Python (because 1. the UI looks better, 2. it works more seamlessly with the speech-to-text API, and 3. Python is cooler).

2. Leverages AssemblyAI's transcription models to transcribe speech into text input that the DALL-E mini engine can work with.

3. Listens to speech and displays creative and interesting images.

Design

This project is broken up into two primary files: main.py and dalle.py.

If the summaries of the files below sound like gibberish to you, hang in there! Within the code itself there are many comments that break these concepts down more thoroughly.

The main script handles both the Streamlit web application and the voice-to-text API connection. It involves configuring the Streamlit session state, creating visual features such as buttons and sliders on the web app interface, opening a WebSocket connection, filling in all the parameters required for pyaudio, and creating asynchronous functions for sending and receiving the speech data concurrently between our application and AssemblyAI's server.

The dalle.py file connects the Streamlit web application to the Google Colab server running the DALL-E mini engine. This file has a few functions that serve the following purposes:

1. Establishes a connection to the backend server and verifies it's valid

2. Initiates a call to the server by sending text input for processing

3. Retrieves the image JSON data and decodes it using base64.b64decode()


Code
Please reference my GitHub here to see the full application. I tried to
include comments and a breakdown of what each chunk of code is doing as
I went along, so hopefully, it’s fairly intuitive. And please reference the
original project’s repository here for additional context.
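
If you're following along locally, the third-party packages used below are all on PyPI and can be installed with pip install streamlit pyaudio websockets requests. Note that PyAudio wraps the PortAudio library, which may need to be installed separately through your system's package manager.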

main file:

#create web apps in python using streamlit
import streamlit as st
#PyAudio provides Python bindings for PortAudio v19, the cross-platform audio I/O library
import pyaudio
#websockets is a python library for two-way interactive communication over the WebSocket protocol
import websockets
#asyncio is a library to write concurrent code using the async/await syntax
import asyncio
#this module provides functions for encoding binary data to printable ASCII characters
import base64
#Python has a built-in package called json, which can be used to work with JSON data
import json
#pulling in the api key for AssemblyAI
from configure import api_key
#pulling in function from other file
from dalle import create_and_show_images

#configuring session_state
if 'text' not in st.session_state:
    st.session_state['text'] = ''
    st.session_state['run'] = False

#creating webapp title
st.title("DALL-E Mini")

#function to begin session_state
def start_listening():
    st.session_state["run"] = True

#button to activate session_state function
st.button("Say something", on_click=start_listening)

#text box on the application
text = st.text_input("What should I create?", value=st.session_state["text"])

#slider visualization
num_images = st.slider("How many images?", 1, 6)

#variable for button
ok = st.button("GO!")

#if statement to determine when to call and retrieve from the dalle file
if ok and text:
    create_and_show_images(text, num_images)

#the AssemblyAI endpoint we're going to hit
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"

#setting up microphone parameters
#how many bytes of data per chunk of audio processed
FRAMES_PER_BUFFER = 3200
#PortAudio 16-bit integer input/output format; this is the default
FORMAT = pyaudio.paInt16
#mono, meaning we only need audio input from a single channel
CHANNELS = 1
#desired rate in Hz of incoming audio
RATE = 16000
p = pyaudio.PyAudio()

#starts recording, creating the stream variable and assigning parameters
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER
)

#asynchronous function, so the app can keep sending and receiving the stream of speech data concurrently
async def send_receive():
    print(f'Connecting websocket to url {URL}')

    async with websockets.connect(
        URL,
        extra_headers=(("Authorization", api_key),),
        ping_interval=5,
        ping_timeout=20
    ) as _ws:

        r = await asyncio.sleep(0.1)
        print("Receiving Session begins ...")

        session_begins = await _ws.recv()

        async def send():
            while st.session_state['run']:
                try:
                    data = stream.read(FRAMES_PER_BUFFER)
                    data = base64.b64encode(data).decode("utf-8")
                    json_data = json.dumps({"audio_data": str(data)})
                    r = await _ws.send(json_data)
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    assert e.code == 4008
                    break
                except Exception as e:
                    print(e)
                    assert False, "Not a websocket 4008 error"

                r = await asyncio.sleep(0.01)

        async def receive():
            while st.session_state['run']:
                try:
                    result_str = await _ws.recv()
                    result = json.loads(result_str)['text']

                    #only act on finalized transcripts, stripping sentence punctuation
                    if json.loads(result_str)['message_type'] == 'FinalTranscript':
                        result = result.replace('.', '')
                        result = result.replace('!', '')
                        st.session_state['text'] = result
                        st.session_state['run'] = False
                        st.experimental_rerun()
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    assert e.code == 4008
                    break
                except Exception as e:
                    print(e)
                    assert False, "Not a websocket 4008 error"

        send_result, receive_result = await asyncio.gather(send(), receive())

asyncio.run(send_receive())

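One note on the configure import: main.py pulls the AssemblyAI key from a configure.py file that isn't shown above. Assuming the key is stored as a simple module-level string (which is how the import reads), a minimal version would be:

#configure.py
#holds the AssemblyAI API key imported by main.py; keep this file out of version control
api_key = "your-assemblyai-api-key-here"

Swap the placeholder for the key from your AssemblyAI dashboard.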


dalle file:

#Requests allows you to send HTTP/1.1 requests easily
import requests
#this module provides functions for encoding binary data to printable ASCII characters and back
import base64
#create web apps in python using streamlit
import streamlit as st

#this is the unique URL obtained by running the Google Colab notebook linked in the DALL-E Playground repo
URL = "https://sky-reservoir-fighting-sacrifice.trycloudflare.com"
headers = {'Bypass-Tunnel-Reminder': "go",
           'mode': 'no-cors'}

#establishes a connection to the backend server and verifies it's valid
def check_if_valid_backend(url):
    try:
        resp = requests.get(url, timeout=5, headers=headers)
        return resp.status_code == 200
    except requests.exceptions.Timeout:
        return False

#initiates a call to the server by sending text input for processing
def call_dalle(url, text, num_images=1):
    data = {"text": text, "num_images": num_images}
    resp = requests.post(url + "/dalle", headers=headers, json=data)
    if resp.status_code == 200:
        return resp

#retrieves the image JSON data and decodes it using base64.b64decode()
def create_and_show_images(text, num_images):
    valid = check_if_valid_backend(URL)
    if not valid:
        st.write("Backend service is not running")
    else:
        resp = call_dalle(URL, text, num_images)
        if resp is not None:
            for data in resp.json():
                img_data = base64.b64decode(data)
                st.image(img_data)

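A caveat on the URL constant: the trycloudflare.com address is a temporary tunnel generated by the DALL-E Playground Colab notebook, and a fresh one is produced every time the backend notebook starts, so paste your own session's URL into dalle.py before running. With the backend up and configure.py in place, launch the app with streamlit run main.py. As a quick sanity check that your tunnel is reachable, a small hypothetical helper script reusing the functions above could look like:

#check_backend.py, assuming dalle.py sits in the same directory
from dalle import check_if_valid_backend, URL

if check_if_valid_backend(URL):
    print("Backend is up and ready to generate images")
else:
    print("Backend not reachable; restart the Colab notebook and update URL in dalle.py")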


Conclusion
This project is a proof of concept for something I’d like to have in my house
one day. I’d like to have a screen on my wall in the middle of a decorative
frame. Let’s call it a smart picture frame. This screen will have a built-in
microphone that listens to all conversations spoken in proximity. Using
speech-to-text transcription and natural language processing, the frame will
filter and choose the most interesting assortment of words spoken every 30
seconds or so. From there, the text will be continually visualized to
dynamically add more depth to the atmosphere.

Imagine visual representations and themes of conversation being displayed on the wall during hangouts and family gatherings in real time. How many creative ideas can emerge from something similar to this? How can the mood of the house change and morph depending on the mood of the participants? The house will feel less like an inorganic structure and more like a participant itself. Very interesting to think about.

Alas, this project was a fun way to get our hands dirty and play around with these concepts. It's somewhat disappointing that DALL-E mini doesn't produce the same extremely high-quality images that engines like OpenAI's DALL-E 2 do. Nevertheless, I still enjoyed learning the process and principles behind the technology on this project. Most likely, in a few years, APIs for these high-resolution image-generating services will be easier to access and play around with anyway. Thanks to anyone who made it all the way through, and good luck on your journey of learning every day.

This project was influenced by a YouTube tutorial, so please check that out,
as I found it helpful and they deserve credit.

Check out some of my other articles if you found this one helpful/interesting:
Build an Alexa- or Siri-Equivalent Bot in Python Using OpenAI
How to find land when you’re at sea using python
I wrote a python script to play the lottery for me

