Chatbot Report
Chatbot Report
This project integrates OpenAI's ChatGPT and Google's Text-to-Speech (TTS) API with
ESP32 and MAX98357A I2S audio hardware to create an interactive voice assistant. The
system allows users to input text prompts via a serial interface, which are processed by
ChatGPT to generate intelligent, conversational responses. These responses are then
converted into audible speech using Google TTS, decoded from Base64 format, and played
through an I2S-enabled audio output. The project leverages Wi-Fi connectivity for
seamless API interactions, efficient I2S configuration for high-quality audio playback, and
the Base64 library for data decoding. By combining advanced AI capabilities with
embedded system hardware, the project demonstrates a practical and innovative
implementation of AI-driven voice assistance for real-world applications.
INTRODUCTION
.
DESIGN SELECTION
The design selection for this project was carefully planned to ensure the system operates efficiently and
meets the requirements of a high-performance voice assistant. The selected components, both hardware
and software, were evaluated based on functionality, reliability, and compatibility to achieve an optimal
balance of performance and ease of implementation. The following components were chosen:
1. ESP32 Microcontroller
The ESP32 microcontroller serves as the central processing unit (CPU) of the system, offering robust
computational power, built-in Wi-Fi, and Bluetooth connectivity. Its versatile GPIOs and support for
I2S (Inter-IC Sound) communication make it an ideal choice for interfacing with the MAX98357A
audio module and other peripherals.
Features:
The MAX98357A module was selected for its ability to produce high-fidelity audio output from digital
I2S data. This makes it essential for playing speech synthesized by the Google Text-to-Speech (TTS)
API.
Features:
3. Speaker (4 Ohms)
A compact 4-ohm speaker was used to convert the audio signals into clear and audible speech output.
Its compatibility with the MAX98357A ensures distortion-free sound playback.
Features:
To ensure stable power delivery to all components, a combination of voltage regulators and rechargeable
batteries was used. The TP4056 charging module was chosen for its simplicity in managing lithium-ion
batteries, while the HT7333 regulator ensures consistent 3.3V output to the ESP32.
Features:
The Base64 library by Densaugeo was integrated into the software design to decode audio data received
from the Google TTS API. This decoding process is crucial for converting base64-encoded audio into a
playable format.
Features:
6. Software APIs
OpenAI ChatGPT API: Processes user inputs and generates intelligent text responses.
Google Text-to-Speech API: Converts textual responses into natural-sounding speech for
playback.
These components and technologies work in synergy to form a fully functional and efficient voice
assistant, capable of handling user queries, processing responses, and delivering audio output seamlessly.
The thoughtful selection and integration of these elements ensure a robust and reliable system.
WORKING PRINCIPLE OF SMART GLOVE
The working principles of this project focus on the seamless integration of hardware and software
components to create a voice assistant that processes user queries and provides audio responses.
The system leverages advanced technologies, including APIs and embedded hardware, to achieve
accurate and efficient functionality. The following principles outline the system’s operation:
The system begins by capturing the user’s text input, which is entered through the serial interface
of the ESP32 microcontroller. This text is sent to the OpenAI ChatGPT API to generate an
intelligent and contextually appropriate response.
Key Steps:
The ChatGPT API processes the user input to generate a relevant response. The ESP32 handles
this interaction by constructing an HTTP POST request with the input data.
Key Steps:
The API processes the input using a specified language model (e.g., GPT-4).
The server responds with a JSON object containing the response text.
The ESP32 parses the JSON to extract the text message.
3. Text-to-Speech Conversion
Once the response is received from the ChatGPT API, the text is sent to the Google Text-to-
Speech (TTS) API. The API converts the text into a base64-encoded audio stream.
Key Steps:
The ESP32 constructs an HTTP POST request containing the response text and desired
voice parameters.
The TTS API returns an audio stream encoded in base64 format.
The audio data is decoded on the ESP32 using the Base64 library.
4. Audio Playback
The decoded audio data is transferred to the MAX98357A I2S audio module, which converts the
digital signal into an analog output. This output is then played through the connected speaker.
Key Steps:
4. Audio Playback
The decoded audio data is transferred to the MAX98357A I2S audio module, which converts the digital
signal into an analog output. This output is then played through the connected speaker.
Key Steps:
The ESP32 streams decoded audio data to the MAX98357A module using the I2S protocol.
The MAX98357A converts digital data into analog signals.
The speaker produces clear, audible responses corresponding to the API-generated text.
By combining these principles, the system efficiently processes user inputs, generates intelligent
responses, and delivers audio output. The integration of APIs, decoding processes, and audio playback
mechanisms ensures a smooth and user-friendly operation.
CHATGPT RESPONSE ON TERMINAL
RESULT ANALYSIS
The voice assistant successfully integrates ChatGPT and Google TTS to provide
intelligent responses in real-time. User inputs are processed effectively, and the audio
output is clear, demonstrating the accurate implementation of the I2S interface for
high-quality sound playback. The system reliably establishes Wi-Fi connectivity to
interact with APIs, ensuring seamless communication.
Accuracy of AI Responses
ChatGPT produces contextually relevant and coherent responses to user inputs. The
AI's ability to understand and generate meaningful replies ensures the voice assistant
performs well in various conversational scenarios. However, response accuracy can
vary depending on the complexity of user queries or the limitations of the AI model
used.
The MAX98357A I2S module delivers clear and distortion-free audio output. The
system effectively decodes Base64-encoded TTS responses into PCM data for
playback, maintaining audio fidelity. Any minor latency observed in processing is
within acceptable limits for practical use.
While the project achieves its core objectives, certain limitations were identified, such
as reliance on internet connectivity for API calls and occasional latency in response
generation. Future improvements could include adding offline TTS capabilities,
enhancing processing speed, and optimizing power consumption for portable
applications. Additionally, integrating more natural-sounding voices and multilingual
support could broaden its usability.
COST ANALYSIS OF THE CHATBOT
Enhanced Accessibility:
The voice assistant project contributes to making technology more accessible for people
with visual impairments, literacy challenges, or physical disabilities by enabling hands-free
interaction through speech.
Improved Communication:
By leveraging AI for intelligent responses, the project facilitates seamless communication,
aiding individuals in learning, accessing information, and resolving queries in real-time.
Technological Inclusion:
The project demonstrates how low-cost hardware and open-source software can bring
advanced AI technologies to underserved communities, reducing the digital divide.
Encouragement of Innovation:
By showcasing the integration of embedded systems with AI and APIs, this project inspires
students, hobbyists, and professionals to explore and innovate in the field of IoT and AI-
driven solutions.
This voice assistant project demonstrates the seamless integration of hardware and software to
create a functional, intelligent system capable of generating human-like interactions. By
leveraging advanced AI technologies such as OpenAI's ChatGPT API and integrating them with
efficient hardware components like the ESP32 and I2S interface, the project showcases the
potential of modern embedded systems. The design's adaptability, scalability, and user-centered
approach highlight its practical applications across various domains, including accessibility,
education, and automation. This project not only addresses current technological needs but also
lays a strong foundation for future developments in AI-powered assistants, contributing to a
more connected and intelligent world.
FUTURE SCOPE
The voice assistant project holds immense potential for future advancements and applications. By
integrating more sophisticated AI models and expanding its compatibility with additional APIs, the
system can evolve into a multifunctional assistant capable of performing complex tasks such as smart
home automation, real-time language translation, and advanced data analysis. Its scalability allows for
the inclusion of features like emotional recognition, voice biometrics, and personalized user profiles,
making it a powerful tool for both individual and enterprise-level use. Furthermore, with advancements
in hardware miniaturization and energy efficiency, the project could transition into wearable devices or
portable assistants, catering to on-the-go users. Its adaptability also positions it as a key player in
educational technology, healthcare assistance, and accessibility solutions, ensuring its relevance and
impact in a technology-driven future.
Code Detail
#include <WiFi.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>
#include <driver/i2s.h>
// Wi-Fi Credentials
const char* ssid = "moizbarlas";
const char* password = "moizbarlas123";
void setupI2S() {
i2s_config_t i2s_config = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_TX),
.sample_rate = 16000,
.bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
.channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,
.communication_format = I2S_COMM_FORMAT_I2S_MSB,
.intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
.dma_buf_count = 8,
.dma_buf_len = 256,
.use_apll = false,
.tx_desc_auto_clear = true,
.fixed_mclk = 0
};
i2s_pin_config_t pin_config = {
.bck_io_num = I2S_BCK_IO,
.ws_io_num = I2S_WS_IO,
.data_out_num = I2S_DO_IO,
.data_in_num = I2S_PIN_NO_CHANGE
};
void setup() {
Serial.begin(115200);
// Connect to Wi-Fi
WiFi.begin(ssid, password);
Serial.print("Connecting to Wi-Fi");
while (WiFi.status() != WL_CONNECTED) {
delay(500);
Serial.print(".");
}
Serial.println("\nWi-Fi Connected!");
setupI2S();
}
void loop() {
Serial.println("\nEnter your prompt:");
while (!Serial.available()) {
delay(10);
}
if (userPrompt.isEmpty()) {
Serial.println("No input provided. Try again.");
return;
}
HTTPClient http;
http.begin(api_endpoint);
http.addHeader("Content-Type", "application/json");
http.addHeader("Authorization", String("Bearer ") + api_key);
String requestBody;
serializeJson(doc, requestBody);
if (httpResponseCode == 200) {
response = http.getString();
DynamicJsonDocument responseDoc(4096);
DeserializationError error = deserializeJson(responseDoc, response);
if (!error) {
response = responseDoc["choices"][0]["message"]["content"].as<String>();
} else {
response = "Error parsing ChatGPT response";
}
} else {
response = "HTTP Error: " + String(httpResponseCode);
}
http.end();
return response;
}
size_t bytesWritten;
i2s_write(I2S_NUM, audioBuffer, sizeof(audioBuffer), &bytesWritten, portMAX_DELAY);
}
References
[1] Available: Voice-Enabled ChatGPT Terminal with ESP32 and Google TTS. [Accessed: Dec. 15,
2024].
[2] [Moiz, Ahmed, Yahya], "Voice Assistant System using ESP32 and OpenAI ChatGPT API for Speech
Synthesis and Recognition," self-implemented project utilizing AI-driven conversation processing,
embedded system design, and speech-to-text conversion, Jan. 2025