
Project Report

Text to Speech Chatbot

Supervised By: Dr. Junaid Ahmed
               Engr. Asif Ali

Presented By:  Muhammad Ahmed
               Abdul Moiz Barlas
               Muhammad Yahya
ABSTRACT

This project integrates OpenAI's ChatGPT and Google's Text-to-Speech (TTS) API with
ESP32 and MAX98357A I2S audio hardware to create an interactive voice assistant. The
system allows users to input text prompts via a serial interface, which are processed by
ChatGPT to generate intelligent, conversational responses. These responses are then
converted into audible speech using Google TTS, decoded from Base64 format, and played
through an I2S-enabled audio output. The project leverages Wi-Fi connectivity for
seamless API interactions, efficient I2S configuration for high-quality audio playback, and
the Base64 library for data decoding. By combining advanced AI capabilities with
embedded system hardware, the project demonstrates a practical and innovative
implementation of AI-driven voice assistance for real-world applications.
INTRODUCTION

This project aims to develop a fully functional voice assistant by integrating
advanced AI capabilities with embedded system hardware. The system is built
using the ESP32 microcontroller and MAX98357A I2S audio module, enabling
high-quality audio playback. It leverages OpenAI's ChatGPT for intelligent
conversational responses and Google's Text-to-Speech (TTS) API for converting
textual responses into audible speech. The system allows users to input queries via
a serial interface, processes these inputs through ChatGPT via Wi-Fi connectivity,
and decodes the TTS audio response from Base64 format for playback. By
combining cutting-edge AI technologies with efficient embedded hardware, this
project showcases a practical application of AI-powered voice assistance,
demonstrating potential uses in smart devices, IoT applications, and accessibility
tools. The project highlights innovative solutions for bridging AI software and
embedded hardware to create an interactive and user-friendly system.
TECHNOLOGIES USED

The project utilizes a combination of advanced technologies spanning both
hardware and software domains. The ESP32 microcontroller serves as the core
hardware platform, chosen for its robust processing capabilities and built-in Wi-Fi
support, enabling seamless communication with APIs. For audio output, the
MAX98357A I2S audio module is employed, allowing high-fidelity sound
playback. On the software side, OpenAI's ChatGPT API is integrated to provide
intelligent conversational capabilities, enabling the system to generate context-
aware and meaningful responses to user queries. Additionally, Google Text-to-
Speech (TTS) API is used to convert text-based responses into natural-sounding
audio, enhancing interactivity. The system also leverages the Base64 decoding
library by Densaugeo for efficient handling of encoded audio data. Together, these
technologies create a sophisticated voice assistant that bridges cutting-edge AI
functionality with embedded system design

.
DESIGN SELECTION
The design selection for this project was carefully planned to ensure the system operates efficiently and
meets the requirements of a high-performance voice assistant. The selected components, both hardware
and software, were evaluated based on functionality, reliability, and compatibility to achieve an optimal
balance of performance and ease of implementation. The following components were chosen:

1. ESP32 Microcontroller

The ESP32 microcontroller serves as the central controller of the system, offering robust
computational power, built-in Wi-Fi, and Bluetooth connectivity. Its versatile GPIOs and support for
I2S (Inter-IC Sound) communication make it an ideal choice for interfacing with the MAX98357A
audio module and other peripherals.

Features:

 Dual-core processing for efficient task handling.
 Integrated Wi-Fi and Bluetooth for seamless API communication.
 Low-power operation for energy-efficient performance.

2. MAX98357A I2S Audio Module

The MAX98357A module was selected for its ability to produce high-fidelity audio output from digital
I2S data. This makes it essential for playing speech synthesized by the Google Text-to-Speech (TTS)
API.

Features:

 Direct digital-to-analog audio conversion.
 Minimal external components required for integration.
 Compact design, suitable for embedded systems.

3. Speaker (4 Ohms)

A compact 4-ohm speaker was used to convert the audio signals into clear and audible speech output.
Its compatibility with the MAX98357A ensures distortion-free sound playback.

Features:

 High sound clarity and durability.
 Optimized for use with the MAX98357A audio module.
4. Power Management Circuit

To ensure stable power delivery to all components, a combination of voltage regulators and rechargeable
batteries was used. The TP4056 charging module was chosen for its simplicity in managing lithium-ion
batteries, while the HT7333 regulator ensures consistent 3.3V output to the ESP32.

Features:

 Reliable power supply for uninterrupted operation.
 Compact design, suitable for portable systems.

5. Base64 Library (Densaugeo)

The Base64 library by Densaugeo was integrated into the software design to decode audio data received
from the Google TTS API. This decoding process is crucial for converting base64-encoded audio into a
playable format.

Features:

 High-speed base64 decoding.
 Lightweight and efficient, suitable for embedded systems.

6. Software APIs

The software side includes the integration of two key APIs:

 OpenAI ChatGPT API: Processes user inputs and generates intelligent text responses.
 Google Text-to-Speech API: Converts textual responses into natural-sounding speech for
playback.

These components and technologies work in synergy to form a fully functional and efficient voice
assistant, capable of handling user queries, processing responses, and delivering audio output seamlessly.
The thoughtful selection and integration of these elements ensure a robust and reliable system.
WORKING PRINCIPLE OF THE VOICE ASSISTANT

The working principles of this project focus on the seamless integration of hardware and software
components to create a voice assistant that processes user queries and provides audio responses.
The system leverages advanced technologies, including APIs and embedded hardware, to achieve
accurate and efficient functionality. The following principles outline the system’s operation:

1. User Input Processing

The system begins by capturing the user’s text input, which is entered through the serial interface
of the ESP32 microcontroller. This text is sent to the OpenAI ChatGPT API to generate an
intelligent and contextually appropriate response.

Key Steps:

 User inputs are read from the serial monitor.
 The input is trimmed and formatted for API compatibility.
 The ESP32 sends the input data to the ChatGPT API over an HTTPS connection.

2. ChatGPT API Interaction

The ChatGPT API processes the user input to generate a relevant response. The ESP32 handles
this interaction by constructing an HTTP POST request with the input data.

Key Steps:

 The API processes the input using a specified language model (gpt-4o-mini in this implementation).
 The server responds with a JSON object containing the response text.
 The ESP32 parses the JSON to extract the text message.
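
For reference, the request body the ESP32 constructs (see sendToChatGPT() in the Code Detail section) has this general shape; the prompt shown here is illustrative:

{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "What is I2S audio?" }
  ],
  "temperature": 0.7
}

The API's JSON reply carries the generated text at choices[0].message.content, which the ESP32 extracts with ArduinoJson.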

3. Text-to-Speech Conversion

Once the response is received from the ChatGPT API, the text is sent to the Google Text-to-
Speech (TTS) API. The API converts the text into a base64-encoded audio stream.

Key Steps:

 The ESP32 constructs an HTTP POST request containing the response text and desired
voice parameters.
 The TTS API returns an audio stream encoded in base64 format.
 The audio data is decoded on the ESP32 using the Base64 library.
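
The code listing at the end of this report substitutes a placeholder tone generator for this step. A hedged sketch of a real implementation, using Google's public text:synthesize REST endpoint and the same WiFi/HTTPClient/ArduinoJson libraries as the main listing (the key variable and function name are assumptions), might look like:

// Sketch only: request LINEAR16 audio from Google TTS. LINEAR16 responses are
// WAV data, so skip the 44-byte header before streaming the PCM to I2S.
// GOOGLE_TTS_KEY is a placeholder for your own API key.
const char* GOOGLE_TTS_KEY = "YOUR_GOOGLE_TTS_KEY";

String requestTtsAudio(const String& text) {
  HTTPClient http;
  http.begin(String("https://texttospeech.googleapis.com/v1/text:synthesize?key=") + GOOGLE_TTS_KEY);
  http.addHeader("Content-Type", "application/json");

  DynamicJsonDocument doc(1024);
  doc["input"]["text"] = text;
  doc["voice"]["languageCode"] = "en-US";
  doc["audioConfig"]["audioEncoding"] = "LINEAR16"; // raw PCM in a WAV wrapper
  doc["audioConfig"]["sampleRateHertz"] = 16000;    // matches the I2S sample rate

  String body;
  serializeJson(doc, body);

  String audioBase64 = "";
  if (http.POST(body) == 200) {
    DynamicJsonDocument resp(16384); // sized generously; long replies may need streaming
    if (deserializeJson(resp, http.getString()) == DeserializationError::Ok) {
      audioBase64 = resp["audioContent"].as<String>(); // base64-encoded audio
    }
  }
  http.end();
  return audioBase64;
}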

4. Audio Playback

The decoded audio data is transferred to the MAX98357A I2S audio module, which converts the
digital signal into an analog output. This output is then played through the connected speaker.

Key Steps:

 The ESP32 streams decoded audio data to the MAX98357A module using the I2S protocol.
 The MAX98357A converts digital data into analog signals.
 The speaker produces clear, audible responses corresponding to the API-generated text.
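
As an illustrative sketch of this streaming step (streamPcm() and its arguments are hypothetical names; i2s_write() is the same ESP-IDF call used in playTone() in the code listing):

// Sketch: stream decoded 16-bit PCM to the MAX98357A in DMA-sized chunks.
void streamPcm(const uint8_t* pcmData, size_t pcmLen) {
  size_t offset = 0;
  while (offset < pcmLen) {
    size_t chunk = pcmLen - offset;
    if (chunk > 512) chunk = 512; // keep each write near the DMA buffer length
    size_t bytesWritten = 0;
    i2s_write(I2S_NUM, pcmData + offset, chunk, &bytesWritten, portMAX_DELAY);
    offset += bytesWritten;
  }
}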

By combining these principles, the system efficiently processes user inputs, generates intelligent
responses, and delivers audio output. The integration of APIs, decoding processes, and audio playback
mechanisms ensures a smooth and user-friendly operation.
CHATGPT RESPONSE ON TERMINAL

[Figure: sample ChatGPT response printed on the serial monitor.]
RESULT ANALYSIS

Functionality and Performance

The voice assistant successfully integrates ChatGPT and Google TTS to provide
intelligent responses in real-time. User inputs are processed effectively, and the audio
output is clear, demonstrating the accurate implementation of the I2S interface for
high-quality sound playback. The system reliably establishes Wi-Fi connectivity to
interact with APIs, ensuring seamless communication.

Accuracy of AI Responses

ChatGPT produces contextually relevant and coherent responses to user inputs. The
AI's ability to understand and generate meaningful replies ensures the voice assistant
performs well in various conversational scenarios. However, response accuracy can
vary depending on the complexity of user queries or the limitations of the AI model
used.

Audio Quality and Playback

The MAX98357A I2S module delivers clear and distortion-free audio output. The
system effectively decodes Base64-encoded TTS responses into PCM data for
playback, maintaining audio fidelity. Any minor latency observed in processing is
within acceptable limits for practical use.

Limitations and Improvements

While the project achieves its core objectives, certain limitations were identified, such
as reliance on internet connectivity for API calls and occasional latency in response
generation. Future improvements could include adding offline TTS capabilities,
enhancing processing speed, and optimizing power consumption for portable
applications. Additionally, integrating more natural-sounding voices and multilingual
support could broaden its usability.
COST ANALYSIS OF THE CHATBOT

Component                   Quantity   Cost per Unit (PKR)   Total Cost (PKR)
ESP32                       1          1700                  1700
MAX98357A I2S amplifier     1          900                   900
Speakers (4 Ω)              2          250                   500
Google Cloud Platform API   1          1250                  1250
ChatGPT 4o-mini API         1          2000                  2000
Wires                       10         10                    100
Miscellaneous               -          -                     3000
Total                       -          -                     9,450
IMPACT OF CHATBOT ON SOCIETY

Enhanced Accessibility:
The voice assistant project contributes to making technology more accessible for people
with visual impairments, literacy challenges, or physical disabilities by enabling hands-free
interaction through speech.

Improved Communication:
By leveraging AI for intelligent responses, the project facilitates seamless communication,
aiding individuals in learning, accessing information, and resolving queries in real-time.

Technological Inclusion:
The project demonstrates how low-cost hardware and open-source software can bring
advanced AI technologies to underserved communities, reducing the digital divide.

Education and Learning Support:
This system can serve as an educational tool, offering interactive learning experiences,
personalized tutoring, and access to knowledge repositories in an auditory format.

Encouragement of Innovation:
By showcasing the integration of embedded systems with AI and APIs, this project inspires
students, hobbyists, and professionals to explore and innovate in the field of IoT and AI-
driven solutions.

Support for Multilingual Communities:
The inclusion of text-to-speech capabilities, customizable voice settings, and the potential
for multilingual support broadens the reach of this technology to diverse linguistic and
cultural groups.
CONCLUSION

This voice assistant project demonstrates the seamless integration of hardware and software to
create a functional, intelligent system capable of generating human-like interactions. By
leveraging advanced AI technologies such as OpenAI's ChatGPT API and integrating them with
efficient hardware components like the ESP32 and I2S interface, the project showcases the
potential of modern embedded systems. The design's adaptability, scalability, and user-centered
approach highlight its practical applications across various domains, including accessibility,
education, and automation. This project not only addresses current technological needs but also
lays a strong foundation for future developments in AI-powered assistants, contributing to a
more connected and intelligent world.

FUTURE SCOPE

The voice assistant project holds immense potential for future advancements and applications. By
integrating more sophisticated AI models and expanding its compatibility with additional APIs, the
system can evolve into a multifunctional assistant capable of performing complex tasks such as smart
home automation, real-time language translation, and advanced data analysis. Its scalability allows for
the inclusion of features like emotional recognition, voice biometrics, and personalized user profiles,
making it a powerful tool for both individual and enterprise-level use. Furthermore, with advancements
in hardware miniaturization and energy efficiency, the project could transition into wearable devices or
portable assistants, catering to on-the-go users. Its adaptability also positions it as a key player in
educational technology, healthcare assistance, and accessibility solutions, ensuring its relevance and
impact in a technology-driven future.
Code Detail
#include <WiFi.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>
#include <driver/i2s.h>

// Wi-Fi Credentials
const char* ssid = "moizbarlas";
const char* password = "moizbarlas123";

// OpenAI API Configuration
const char* api_endpoint = "https://api.openai.com/v1/chat/completions";
const char* api_key = "YOUR_OPENAI_API_KEY"; // Replace with your own key; never commit real keys

// I2S Configuration for MAX98357A
#define I2S_NUM I2S_NUM_0
#define I2S_BCK_IO 26 // I2S bit clock pin
#define I2S_WS_IO 25  // I2S word select (LRC) pin
#define I2S_DO_IO 22  // I2S data out pin

void setupI2S() {
  i2s_config_t i2s_config = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_TX),
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,
    .communication_format = I2S_COMM_FORMAT_I2S_MSB,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 256,
    .use_apll = false,
    .tx_desc_auto_clear = true,
    .fixed_mclk = 0
  };

  i2s_pin_config_t pin_config = {
    .bck_io_num = I2S_BCK_IO,
    .ws_io_num = I2S_WS_IO,
    .data_out_num = I2S_DO_IO,
    .data_in_num = I2S_PIN_NO_CHANGE
  };

  // Install the I2S driver
  esp_err_t err = i2s_driver_install(I2S_NUM, &i2s_config, 0, NULL);
  if (err != ESP_OK) {
    Serial.printf("I2S driver install failed! Error: %d\n", err);
    while (true); // Halt: audio output is unusable without the driver
  }

  // Route the I2S signals to the configured pins
  err = i2s_set_pin(I2S_NUM, &pin_config);
  if (err != ESP_OK) {
    Serial.printf("I2S pin config failed! Error: %d\n", err);
    while (true);
  }

  // Clear the DMA buffer so playback starts from silence
  i2s_zero_dma_buffer(I2S_NUM);
}

void setup() {
  Serial.begin(115200);

  // Connect to Wi-Fi
  WiFi.begin(ssid, password);
  Serial.print("Connecting to Wi-Fi");
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("\nWi-Fi Connected!");

  setupI2S();
}

void loop() {
  Serial.println("\nEnter your prompt:");
  while (!Serial.available()) {
    delay(10);
  }

  // Read user input from the serial monitor
  String userPrompt = Serial.readStringUntil('\n');
  userPrompt.trim();

  if (userPrompt.isEmpty()) {
    Serial.println("No input provided. Try again.");
    return;
  }

  Serial.println("Sending prompt to ChatGPT...");

  // Send the request to the ChatGPT API
  String chatResponse = sendToChatGPT(userPrompt);

  // Output the response to the serial monitor
  Serial.println("\nChatGPT Response:");
  Serial.println(chatResponse);

  // Generate and play audio (placeholder tone output; see playTextAsSpeech)
  playTextAsSpeech(chatResponse);
}
String sendToChatGPT(String prompt) {
  if (WiFi.status() != WL_CONNECTED) {
    return "Wi-Fi not connected";
  }

  HTTPClient http;
  http.begin(api_endpoint);
  http.addHeader("Content-Type", "application/json");
  http.addHeader("Authorization", String("Bearer ") + api_key);

  // Construct the JSON payload
  DynamicJsonDocument doc(1024);
  doc["model"] = "gpt-4o-mini";
  JsonArray messages = doc.createNestedArray("messages");
  JsonObject message = messages.createNestedObject();
  message["role"] = "user";
  message["content"] = prompt;
  doc["temperature"] = 0.7;

  String requestBody;
  serializeJson(doc, requestBody);

  int httpResponseCode = http.POST(requestBody);
  String response = "";

  if (httpResponseCode == 200) {
    response = http.getString();

    DynamicJsonDocument responseDoc(4096);
    DeserializationError error = deserializeJson(responseDoc, response);
    if (!error) {
      // Extract the assistant's reply text from the JSON response
      response = responseDoc["choices"][0]["message"]["content"].as<String>();
    } else {
      response = "Error parsing ChatGPT response";
    }
  } else {
    response = "HTTP Error: " + String(httpResponseCode);
  }

  http.end();
  return response;
}

void playTextAsSpeech(String text) {
  // Placeholder: maps each character to a short tone instead of real speech.
  // Replace with a Google TTS request plus Base64 decoding for actual TTS output.
  Serial.println("Playing audio...");
  for (size_t i = 0; i < text.length(); i++) {
    // Map the character to a frequency; the double modulo keeps the result
    // non-negative for characters outside 'a'..'z'
    int toneFreq = 300 + (((text[i] - 'a') % 26 + 26) % 26) * 20;
    playTone(toneFreq, 100);
  }
}

void playTone(int frequency, int duration) {
  const int sampleRate = 16000;
  const int numSamples = (sampleRate * duration) / 1000;
  int16_t audioBuffer[numSamples]; // ~3.2 KB on the stack for a 100 ms tone

  // Synthesize one 16-bit sine-wave tone
  for (int i = 0; i < numSamples; i++) {
    float sample = sin(2 * PI * frequency * i / sampleRate);
    audioBuffer[i] = (int16_t)(sample * 32767);
  }

  size_t bytesWritten;
  i2s_write(I2S_NUM, audioBuffer, sizeof(audioBuffer), &bytesWritten, portMAX_DELAY);
}