Building Business-Ready Generative AI Systems

Defining a Business-Ready Generative AI System

Implementing a generative AI system (GenAISys) in an organization doesn’t stop at integrating a standalone model such as GPT, Grok, Llama, or Gemini via an API. While this is often the starting point, it is too easily mistaken for the finish line. The rising demand for AI across all domains calls for advanced AI systems that go well beyond a prebuilt model.

A business-ready GenAISys should provide ChatGPT-grade functionality in an organization, but also go well beyond it. Its capabilities and features must include natural language understanding (NLU), contextual awareness through memory retention across dialogues in a chat session, and agentic functions such as autonomous image, audio, and document analysis and generation. Think of a generative AI model as an entity with a wide range of functions, including AI agents that serve as agentic co-workers.

We will begin the chapter by defining what a business-ready GenAISys is. From there, we’ll focus on the central role of a generative AI model, such as GPT-4o, that can both orchestrate and execute tasks. Building on that, we will lay the groundwork for contextual awareness and memory retention, discussing four types of generative AI memory: memoryless, short-term, long-term, and multiple sessions. We will also define a new approach to retrieval-augmented generation (RAG) that introduces an additional dimension to data retrieval: instruction and agentic reasoning scenarios. Adding instructions stored in a vector store takes RAG to another level by retrieving instructions that we can add to a prompt. In parallel, we will examine a critical component of a GenAISys: human roles. We will see how, throughout its life cycle, an AI system requires human expertise. Additionally, we will define several levels of implementation to adapt the scope and scale of a GenAISys, not only to business requirements but also to available budgets and resources.

Finally, we’ll illustrate how contextual awareness and memory retention can be implemented using OpenAI’s LLM and multimodal API. A GenAISys cannot work without solid memory retention functionality—without memory, there’s no context, and without context, there’s no sustainable generation. Throughout this book, we will create modules for memoryless, short-term, long-term, and multisession types depending on the task at hand. By the end of this chapter, you will have acquired a clear conceptual framework for what makes an AI system business-ready and practical experience in building the first bricks of an AI controller.

In a nutshell, this chapter covers the following topics:

  • Components of a business-ready GenAISys
  • AI controllers and agentic functionality (model-agnostic)
  • Hybrid human roles and collaboration with AI
  • Business opportunities and scope
  • Contextual awareness through memory retention

Let’s begin by defining what a business-ready GenAISys is.

Components of a business-ready GenAISys

A business-ready GenAISys is a modular orchestrator that seamlessly integrates standard AI models with multifunctional frameworks to deliver hybrid intelligence. By combining generative AI with agentic functionality, RAG, machine learning (ML), web search, non-AI operations, and multiple-session memory systems, we are able to deliver scalable and adaptive solutions for diverse and complex tasks. Take ChatGPT, for example; people use the name “ChatGPT” interchangeably for the generative AI model as well as for the application itself. However, behind the chat interface, tools such as ChatGPT and Gemini are part of larger systems—online copilots—that are fully integrated and managed by intelligent AI controllers to provide a smooth user experience.

It was Tomczak (2024) who took us from thinking of generative AI models as a collective entity to considering complex GenAISys architectures. His paper uses the term “GenAISys” to describe these more complex platforms. Our approach in this book will be to expand the horizon of a GenAISys to include advanced AI controller functionality and human roles in a business-ready ecosystem. There is no single silver-bullet architecture for a GenAISys. However, in this section, we’ll define the main components necessary to attain ChatGPT-level functionality. These include a generative AI model, memory retention functions, modular RAG, and multifunctional capabilities. How each component contributes to the GenAISys framework is illustrated in Figure 1.1:

Figure 1.1: GenAISys, the AI controller, and human roles

Let’s now define the architecture of the AI controllers and human roles that make up a GenAISys.

AI controllers

At the heart of a business-ready GenAISys is an AI controller that activates custom ChatGPT-level features based on the context of the input. Unlike traditional pipelines with predetermined task sequences, the AI controller operates without a fixed order, dynamically adapting tasks—such as web search, image analysis, and text generation—based on the specific context of each input. This agentic context-driven approach enables the AI controller to orchestrate various components seamlessly, ensuring effective and coherent performance of the generative AI model.

A lot of work is required to achieve effective results with a custom ChatGPT-grade AI controller. However, the payoff is a new class of AI systems that can withstand real-world pressure and produce tangible business results. A solid AI controller ecosystem can support use cases across multiple domains: customer support automation, sales lead generation, production optimization (services and manufacturing), healthcare response support, supply chain optimization, and wherever else the market takes you! A GenAISys, thus, requires an AI controller to orchestrate multiple pipelines, such as contextual awareness to understand the intent of the prompt and memory retention to support continuity across sessions.

The GenAISys must also define human roles, which determine which functions and data can be accessed. Before we move on to human roles, however, let’s first break down the key components that power the AI controller. As shown in Figure 1.1, the generative AI model, memory, modular RAG, and multifunctional capabilities each play vital roles in enabling flexible, context-driven orchestration. Let’s explore how these elements work together to build a business-ready GenAISys. We will first define the role of the generative AI model.

Model-agnostic approach to generative AI

When we build a sustainable GenAISys, we need model interchangeability—the flexibility to swap out the underlying model as needed. A generative AI model should serve as a component within the system, not as the core that the system is built around. That way, if our model is deprecated or requires updating, or we simply find a better-performing one, we can simply replace it with another that better fits our project.

As such, the generative AI model can be OpenAI’s GPT, Google’s Gemini, Meta’s Llama, xAI’s Grok, or any Hugging Face model, as long as it supports the required tasks. Ideally, we should choose a multipurpose, multimodal model that encompasses text, vision, and reasoning abilities. Bommasani et al. (2021) provide a comprehensive analysis of such foundation models, whose scope reaches beyond LLMs.

A generative AI model has two main functions, as shown in Figure 1.2:

  • Orchestrates by determining which tasks need to be triggered based on the input. This input can be a user prompt or a system request from another function in the pipeline. The orchestration function agent can trigger web search, document parsing, image generation, RAG, ML functions, non-AI functions, and any other function integrated into the GenAISys.
  • Executes the tasks requested by the orchestration layer or executes a task directly based on the input. For example, a simple query such as requesting the capital of the US will not necessarily require complex functionality. However, a request for document analysis might require several functions (chunking, embedding, storing, and retrieving).
Figure 1.2: A generative AI model to orchestrate or execute tasks

Notice that Figure 1.2 has a unique feature: there are no arrows connecting the input, orchestration, and execution components. Unlike traditional hardcoded linear pipelines, a flexible GenAISys leaves its components unordered. We build the components and then let automated scenarios selected by the orchestration function order the tasks dynamically.

This flexibility ensures the system’s adaptability to a wide range of tasks. We will not be able to build a system that solves every task, but we can build one that satisfies a wide range of tasks within a company. Here are two example workflows that illustrate how a GenAISys can dynamically sequence tasks based on the roles involved:

  • Human roles can be configured so that, in some cases, the user input executes a simple API call to provide a straightforward response, such as requesting the capital of a country. In this case, the generative AI model executes a request directly.
  • System roles can be configured dynamically to orchestrate a set of instructions, such as searching the web first and then summarizing the web page. In this case, the system goes through an orchestration process to produce an output.
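
To make this distinction concrete, here is a minimal, self-contained Python sketch of how a controller might route an input either to direct execution or to an orchestration scenario. The keyword trigger and the function names are illustrative assumptions only; the real AI controller we begin building in Chapter 2 is far more capable:

def needs_orchestration(user_input: str) -> bool:
    # Toy trigger: multi-step phrasing suggests an orchestration scenario
    return any(k in user_input.lower() for k in ("search", "then", "analyze"))

def handle_input(user_input: str) -> str:
    if needs_orchestration(user_input):
        # Orchestration path: a scenario sequences tasks dynamically
        return f"[orchestrate] scenario selected for: {user_input}"
    # Direct-execution path: a single generative model call is enough
    return f"[execute] direct model call for: {user_input}"

print(handle_input("What is the capital of the US?"))
print(handle_input("Search the web and then summarize the page."))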

The possibilities are unlimited; however, all the scenarios will rely on the memory to ensure consistent, context-aware behavior. Let’s look at memory next.

Building the memory of a GenAISys

Advanced generative AI models such as OpenAI’s GPT, Meta’s Llama, xAI’s Grok, Google’s Gemini, and many Hugging Face variants are context-driven regardless of their specific version or performance level. You will choose the model based on your project, but the basic rule remains simple:

No-context => No meaningful generation

When we use ChatGPT or any other copilot, we have nothing to worry about as contextual memory is handled for us. We just start a dialogue, and things run smoothly as we adapt our prompt to the level of responses we are obtaining. However, when we develop a system with a generative AI API from scratch, we have to explicitly build contextual awareness and memory retention.

Four approaches stand out among the wide range of possible memory retention strategies with an API:

  • Stateless and memoryless session: A request is sent to the API, and a response is returned with no memory retention functionality.
  • Short-term memory session: The exchanges between the requests and responses are stored in memory during the session but not beyond.
  • Long-term memory of multiple sessions: The exchanges between the requests and responses are stored in memory and memorized even after the session ends.
  • Long-term memory of multiple cross-topic sessions: This feature links the long-term memory of multiple sessions to other sessions. Each session is assigned a role: a system or multiple users. This feature is not standard in platforms such as ChatGPT but is essential for workflow management within organizations.

Figure 1.3 sums up these four memory architectures. We’ll demonstrate each configuration in Python using GPT-4o in the upcoming section, Contextual awareness and memory retention.

Figure 1.3: Four different GenAISys memory configurations

These four memory types serve as a starting point that can be expanded as necessary when developing a GenAISys. However, practical implementations often require additional functionality, including the following:

  • Human roles to define users or groups of users that can access session history or sets of sessions on multiple topics. This will take us beyond ChatGPT-level platforms. We will introduce this aspect in Chapter 2, Building the Generative AI Controller.
  • Storage strategies to define what we need to store and what we need to discard. We will introduce storage strategies and take this concept further with a Pinecone vector store in Chapter 3, Integrating Dynamic RAG into the GenAISys.

There are native distinctions between two key categories of memorization in generative models:

  • Semantic memory, which contains facts, such as scientific knowledge
  • Episodic memory, which contains timestamped personal memories, such as life events and business meetings

We can see that building a GenAISys’s memory requires careful design and deliberate development to implement ChatGPT-grade memory and additional memory configurations, such as long-term, cross-topic sessions. The ultimate goal of this advanced memory system, however, is to enhance the model’s contextual awareness. While generative AI models such as GPT-4o have inbuilt contextual awareness, to expand the scope of a context-driven system such as the GenAISys we’re building, we need to integrate advanced RAG functionality.

RAG as an agentic multifunction co-orchestrator

In this section, we explain the motivations for using RAG for three core functions within a GenAISys:

  • Knowledge retrieval: Retrieving targeted, nuanced information
  • Context window optimization: Engineering optimized prompts
  • Agentic orchestration of multifunctional capabilities: Triggering functions dynamically

Let’s begin with knowledge retrieval.

1. Knowledge retrieval

Generative AI models excel at surfacing the parametric knowledge they have learned, which is embedded in their weights during training in models such as GPT, Llama, Grok, and Gemini. However, that knowledge stops at the cutoff date, after which no additional data is fed to the model. At that point, to update or supplement it, we have two options:

  • Implicit knowledge: Fine-tune the model so that more trained knowledge is added to its weights (parametric). This process can be challenging if you are working with dynamic data that changes daily, such as weather forecasts, newsfeeds, or social media messages. It also comes with costs and risks if the fine-tuning process doesn’t work that well for your data.
  • Explicit knowledge: Store the data in files or embed data in vector stores. The knowledge will then be structured, accessible, traceable, and updated. We can then retrieve the information with advanced queries.

It’s important to note here that static implicit knowledge cannot scale effectively without dynamic explicit knowledge. More on that in the upcoming chapters.

2. Context window optimization

Generative AI models are expanding the boundaries of context windows. For example, at the time of writing, the following are the supported context lengths:

  • Llama 4 Scout: 10 million tokens
  • Gemini 2.0 Pro Experimental: 2 million tokens
  • Claude 3.7 Sonnet: 200,000 tokens
  • GPT-4o: 128,000 tokens

While impressive, these large context windows can be expensive in terms of token costs and compute. More importantly, precision diminishes when the context becomes too large. In any case, we don’t need the largest context window, only the one that best fits our project. This is where RAG can help optimize a project.

The chunking process of RAG splits large content into more nuanced groups of tokens. When we embed these chunks, they become vectors that can be stored and efficiently retrieved from vector stores. This approach ensures we use only the most relevant context per task, minimizing token usage and maximizing response quality. Thus, we can rely on generative AI capabilities for parametric implicit knowledge and RAG for large volumes of explicit non-parametric data in vector stores. We can take RAG further and use the method as an orchestrator.
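
Before moving on, here is a rough illustration of the chunking step described above: a minimal sketch that splits text into overlapping word-based chunks. It is an illustrative stand-in only; production systems typically chunk by tokens with a proper tokenizer before embedding:

def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
    # Split text into overlapping word-based chunks (a stand-in for token-based chunking)
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks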

3. Agentic orchestrator of multifunctional capabilities

The AI controller bridges with RAG through the generative AI model. RAG is used to augment the model’s input with a flexible range of instructions. Now, using RAG to retrieve instructions might seem counterintuitive at first—but think about it. If we store instructions as vectors and retrieve the best set for a task, we get a fast, adaptable way to enable agentic functionality, generate effective results, and avoid the need to fine-tune the model every time we change our instruction strategies for how we want it to behave.

These instructions act as optimized prompts, tailored to the task at hand. In this sense, RAG becomes part of the orchestration layer of the AI system. A vector store such as Pinecone can store and return this functional information, as illustrated in Figure 1.4:

Figure 1.4: RAG orchestration functionality

The orchestration of these scenarios is performed through the following:

  • Scenario retrieval: The AI controller will receive structured instructions (scenarios) from a vector database, such as Pinecone, adapted to the user’s query
  • Dynamic task activation: Each scenario specifies a series of tasks, such as web search, ML algorithms, standard SQL queries, or any function we need
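
To illustrate scenario retrieval, here is a minimal sketch using plain cosine similarity in NumPy in place of a managed vector store such as Pinecone. The scenario names and the toy embeddings are assumptions for illustration; Chapter 3 implements the real mechanism:

import numpy as np

# Hypothetical scenario store: instruction sets paired with precomputed embeddings
scenarios = {
    "web_search_then_summarize": np.array([0.9, 0.1, 0.0]),
    "document_analysis": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_scenario(query_embedding):
    # Return the instruction scenario closest to the query embedding
    return max(scenarios, key=lambda name: cosine(scenarios[name], query_embedding))

# The retrieved scenario's instructions would then be added to the prompt
print(retrieve_scenario(np.array([0.8, 0.2, 0.1])))  # web_search_then_summarize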

Adding classical functions and ML functionality to the GenAISys enhances its capabilities dramatically. The modular architecture of a GenAISys makes this multifunctional approach effective, as in the following use cases:

  • Web search to perform real-time searches to augment inputs
  • Document analysis to process documents and populate the vector store
  • Document search to retrieve parts of the processed documents from the vector store
  • ML such as K-means clustering (KMC) to group data and k-nearest neighbors (KNN) for similarity searches
  • SQL queries to execute rule-based retrieval on structured datasets
  • Any other function required for your project or workflow

RAG remains a critical component of a GenAISys. We will build it into our system in Chapter 3, Integrating Dynamic RAG into the GenAISys, where we will also enhance the system with multifunctional features.

We’ll now move on to the human roles, which form the backbone of any GenAISys.

Human roles

Contrary to popular belief, the successful deployment and operation of a GenAISys—such as the ChatGPT platform—relies heavily on human involvement throughout its entire life cycle. While these tools may seem to handle complex tasks effortlessly, behind the scenes are multiple layers of human expertise, oversight, and coordination that make their smooth operation possible.

Software professionals must first design the architecture, process massive datasets, and fine-tune the system on million-dollar servers equipped with cutting-edge compute resources. After deployment, large teams are required to monitor, validate, and interpret system outputs—continuously adapting them in response to errors, emerging technologies, and regulatory changes. On top of that, when it comes to deploying these systems within organizations—whether inside corporate intranets, public-facing websites, research environments, or learning management systems—it takes cross-functional coordination efforts across multiple domains.

These tasks require high levels of expertise and qualified teams. Humans are, therefore, not just irreplaceable; they are critical! They are architects, supervisors, curators, and guardians of the AI systems they create and maintain.

GenAISys implementation and governance teams

Implementing a GenAISys requires technical skills and teamwork to gain the support of end users. It’s a collaborative challenge between AI controller design, user roles, and expectations. To anyone who thinks that deploying a real-world AI system is just about getting access to a model—such as the latest GPT, Llama, or Gemini—a close look at the resources required will reveal the true challenges. A massive number of human resources might be involved in the development, deployment, and maintenance of an AI system. Of course, not every organization will need all of these roles, but we must recognize the range of skills involved, such as the following:

  • Project manager (PM)
  • Product manager
  • Program manager
  • ML engineer (MLE)/data scientist
  • Software developer/backend engineer (BE)
  • Cloud engineer (CE)
  • Data engineer (DE) and privacy manager
  • UI/UX designer
  • Compliance and regulatory officer
  • Legal counsel
  • Security engineer (SE) and security officer
  • Subject-matter experts for each domain-specific deployment
  • Quality assurance engineer (QAE) and tester
  • Technical documentation writer
  • System maintenance and support technician
  • User support
  • Trainer

These are just examples—just enough to show how many different roles are involved in building and operating a full-scale GenAISys. Figure 1.5 shows that designing and implementing a GenAISys is a continual process, where human resources are needed at every stage.

Figure 1.5: A GenAISys life cycle

We can see that a GenAISys life cycle is a never-ending process:

  • Business requirements will continually evolve with market constraints
  • GenAISys design will have to adapt with each business shift
  • AI controller specifications must adapt to technological progress
  • Implementation must adapt to ever-changing business specifications
  • User feedback will drive continual improvement

Real-world AI relies heavily on human abilities—the kind of contextual and technical understanding that AI alone cannot replicate. AI can automate a wide range of tasks effectively. But it’s humans who bring the deep insight needed to align those systems with real business goals.

Let’s take this further and look at a RACI heatmap to show why humans are a critical component of a GenAISys.

GenAISys RACI

Organizing a GenAISys project requires human resources that go far beyond what AI automation alone can provide. RACI is a responsibility assignment matrix that helps define roles and responsibilities for each task or decision by identifying who is Responsible, Accountable, Consulted, and Informed. RACI is ideal for managing the complexity of building a GenAISys. It adds structure to the growing list of human roles required during the system’s life cycle and provides a pragmatic framework for coordinating their involvement.

As in any complex project, teams working on a GenAISys need to collaborate across disciplines, and RACI helps define who does what. Each letter in RACI stands for a specific type of role:

  • R (Responsible): The person(s) who works actively on the task. They are responsible for the proper completion of the work. For example, an MLE may be responsible for processing datasets with ML algorithms.
  • A (Accountable): The person(s) answerable for the success or failure of a task. They oversee the task that somebody else is responsible for carrying out. For example, the product owner (PO) will have to make sure that the MLE’s task is done on time and in compliance with the specifications. If not, the PO will be accountable for the failure.
  • C (Consulted): The person(s) providing input, advice, and feedback to help the others in a team. They are not responsible for executing the work. For example, a subject-matter expert in retail may help the MLE understand the goal of an ML algorithm.
  • I (Informed): The person(s) kept in the loop about the progress or outcome of a task. They don’t participate in the task but want to be simply informed or need to make decisions. For example, a data privacy officer (DPO) would like to be informed about a system’s security functionality.

A RACI heatmap typically contains legends for each human role in a project. Let’s build a heatmap with the following roles:

  • The MLE develops and integrates AI models
  • The DE designs data management pipelines
  • The BE builds API interactions
  • The frontend engineer (FE) develops end user features
  • The UI/UX designer designs user interfaces
  • The CE/DevOps engineer manages cloud infrastructure
  • The prompt engineer (PE) designs optimal prompts
  • The SE handles secure data and access
  • The DPO manages data governance and regulation compliance
  • The legal/compliance officer (LC) reviews the legal scope of a project
  • The QAE tests the GenAISys
  • The PO defines the scope and scale of a product
  • The PM coordinates resources and timelines
  • The technical writer (TW) produces documentation
  • The vendor manager (VM) communicates with external vendors and service providers

Not every GenAISys project will include all of these roles, but depending on the scope and scale of the project, many of them will be critical. Now, let’s list the key task areas these roles cover in a typical generative AI project:

  • Model: AI model development
  • Controller: Orchestration of APIs and multimodal components
  • Pipelines: Data processing and integration workflows
  • UI/UX: User interface and experience design
  • Security: Data protection and access control
  • DevOps: Infrastructure, scaling, and monitoring
  • Prompts: Designing and optimizing model interactions
  • QA: Testing and quality assurance

We’ve defined the roles and the tasks. Now, we can show how they can be mapped to a real-world scenario. Figure 1.6 illustrates an example RACI heatmap for a GenAISys.

Figure 1.6: Example of a RACI heatmap

For example, in this heatmap, the MLE has the following responsibilities:

  • (R)esponsible and (A)ccountable for the model, which could be GPT-4o
  • (R)esponsible and (A)ccountable for the prompts for the model
  • (C)onsulted as an expert for the controller, the pipeline, and testing (QA)
  • (I)nformed about the UI/UX, security, and DevOps

We can sum it up with one simple rule for a GenAISys:

No humans -> no system!

We can see that humans are necessary during the whole life cycle of a GenAISys, from design to maintenance and support, including the continual evolution needed to keep up with user feedback. Humans have been and will be here for a long time! Next, let’s explore the business opportunities that a GenAISys can unlock.

Business opportunities and scope

More often than not, we will not have access to the incredible billion-dollar resources of OpenAI, Meta, xAI, or Microsoft Azure to build ChatGPT-like platforms. The previous section showed that beneath a ChatGPT-like, seemingly simple, seamless interface, there is a complex layer of expensive infrastructure, rare talent, and continuous improvement and evolution that absorb resources only large corporations can afford. Therefore, a smarter path from the start is to determine which project category we are in and leverage the power of existing modules and libraries to build our GenAISys. Whatever the use case, such as marketing, finance, production, or support, we need to find the right scope and scale to implement a realistic GenAISys.

The first step of any GenAISys project is to define its goal (opportunity), including its scope and scale, as mentioned above. During this step, you will assess the risks, such as costs, confidentiality, and resource availability (risk management).

We can classify GenAISys projects into three main business implementation types depending on our resources, our objectives, the complexity of our use case, and our budget. These are illustrated in Figure 1.7:

  • Hybrid approach: Leveraging existing AI platforms
  • Small scope and scale: A focused GenAISys
  • Full-scale generative multi-agent AI system: A complete ChatGPT-level generative AI platform
Figure 1.7: The three main GenAISys business implementations

Let’s begin with a hybrid approach, a practical way to deliver business results without overbuilding.

Hybrid approach

A hybrid framework enables you to minimize development costs and time by combining ready-to-use SaaS platforms with custom-built components developed only when necessary, such as web search and data cleansing. This way, you can leverage the power of generative AI without developing everything from scratch. Let’s go through the key characteristics and a few example use cases.

Key characteristics

  • Relying on proven web services such as OpenAI’s GPT API, AWS, Google AI, or Microsoft Azure. These platforms provide the core generative functionality.
  • Customizing your project by integrating domain-specific vector stores and your organization’s proprietary datasets.
  • Focusing development on targeted functionality, such as customer support automation or marketing campaign generation.

Use case examples

  • Implementing a domain-specific vector store to handle legal, medical, or product-related customer queries
  • Building customer support on a social media platform with real-time capabilities

This category offers the ability to do more with less—in terms of both cost and development effort. A hybrid system can be a standalone GenAISys or a subsystem within a larger generative AI platform where full-scale development isn’t necessary. Let’s now look at how a small-scope, small-scale GenAISys can take us even further.

Small scope and scale

A small-scale GenAISys might include an intelligent, GenAI-driven AI controller connected to a vector store. This setup allows the system to retrieve data, trigger instructions, and call additional functionality such as web search or ML—without needing full-scale infrastructure.

Key characteristics

  • A clearly defined profitable system designed to achieve reasonable objectives with optimal development time and cost
  • The AI controller orchestrates instruction scenarios that, in turn, trigger RAG, web search, image analysis, and additional custom tasks that fit your needs
  • The focus is on high-priority, productive features

Use case examples

  • A GenAISys for document retrieval and summarization for any type of document with nuanced analysis through chunked and embedded content
  • Augmenting a model such as GPT or Llama with real-time web search to bypass its data cutoff date—ideal for applications such as weather forecasting or news monitoring that don’t need continual fine-tuning

This category takes us a step beyond the hybrid approach, while still staying realistic and manageable for small to mid-sized businesses or even individual departments within large organizations.

Full-scale GenAISys

If you’re working in a team of experts within an organization that has a large budget and advanced infrastructure, this category is for you. Your team can build a full-scale GenAISys that begins to approach the capabilities of ChatGPT-grade platforms.

Key characteristics

  • A full-blown AI controller that manages and orchestrates complex automated workflows, including RAG, instruction scenarios, multimodal functionality, and real-time data
  • Requires significant computing resources and highly skilled development teams

Think of the GenAISys we’re building in this book as an alpha version—a template that can be cloned, configured, and deployed anywhere in the organization as often as needed.

Use case examples

  • GenAISys is already present in healthcare to assist with patient diagnosis and disease prevention. The Institut Curie in Paris, for example, has a very advanced AI research team: https://institut-curie.org/.
  • Many large organizations have begun implementing GenAISys for fraud detection, weather predictions, and legal expertise.

You can join one of these large organizations that have the resources to build a sustainable GenAISys, whether it be on a cloud platform, local servers, or both.

The three categories—hybrid, small scale, and full scale—offer distinct paths for building a GenAISys, depending on your organization’s goals, budget, and technical capabilities. In this book, we’ll explore the critical components that make up a GenAISys. By the end, you’ll be equipped to contribute to any of these categories and offer realistic, technically grounded recommendations for the projects you work on.

Let’s now lift the hood and begin building contextual awareness and memory retention in code.

Contextual awareness and memory retention

In this section, we’ll begin implementing simulations of contextual awareness and memory retention in Python to illustrate the concepts introduced in the Building the memory of a GenAISys section. The goal is to demonstrate practical ways to manage context and memory—two features that are becoming increasingly critical as generative AI platforms evolve.

Open the Contextual_Awareness_and_Memory_Retention.ipynb file located in the chapter01 folder of the GitHub repository (https://github.com/Denis2054/Building-Business-Ready-Generative-AI-Systems/tree/main). You’ll see that the notebook is divided into five main sections:

  • Setting up the environment, building reusable functions, and storing them in the commons directory of the repository, so we can reuse them when necessary throughout the book
  • Stateless and memoryless session with semantic and episodic memory
  • Short-term memory session for context awareness during a session
  • Long-term memory across multiple sessions for context retention across different sessions
  • Long-term memory of multiple cross-topic sessions, expanding long-term memory over formerly separate sessions

The goal is to illustrate each type of memory in an explicit process. These examples are intentionally kept manual for now, but they will be automated and managed by the AI controller we will begin to build in the next chapter.

Due to the probabilistic nature of generative models, you may observe different outputs for the same prompt across runs. Make sure to run the entire notebook in a single session, as memory retention in this notebook is handled explicitly across different cells. In Chapter 2, this functionality will become persistent and fully managed by the AI controller.

The first step is to install the environment.

Setting up the environment

We will need a commons directory for our GenAISys project. This directory will contain the main modules and libraries needed across all notebooks in this book’s GitHub repository. The motivation is to focus on designing the system for maintenance and support. As such, by grouping the main modules and libraries in one directory, we can zero in on a resource that requires our attention instead of repeating the setup steps in every notebook. Furthermore, this section will serve as a reference point for all the notebooks in this book’s GitHub repository. We’ll only describe the downloading of each resource once and then reuse them throughout the book to build our educational GenAISys.

Thus, we can download the notebook resources from the commons directory and install the requirements.

The first step is to download grequests.py, a utility script we will use throughout the book. It contains a function to download the files we need directly from GitHub:

!curl -L https://raw.githubusercontent.com/Denis2054/Building-Business-Ready-Generative-AI-Systems/master/commons/grequests.py --output grequests.py


The goal of this script is to download a file from any directory of the repository by calling the download function from grequests:

import sys
import subprocess
from grequests import download
download(directory, filename)  # for example: download("commons", "requirements01.py")

This function uses a curl command to download files from a specified directory and filename. It also includes basic error handling in case of command execution failures.

The code begins by importing subprocess to execute shell commands. The download function contains two parameters:

def download(directory, filename):
  • directory: The subdirectory of the GitHub repository where the file is stored
  • filename: The name of the file to download

The base URL for the GitHub repository is then defined, pointing to the raw files we will need:

base_url = 'https://raw.githubusercontent.com/Denis2054/Building-Business-Ready-Generative-AI-Systems/main/'

We now need to define the file’s full URL with the directory and filename parameters:

file_url = f"{base_url}{directory}/{filename}"

The function now defines the curl command:

curl_command = f'curl -o {filename} {file_url}'

Finally, the download command is executed:

subprocess.run(curl_command, check=True, shell=True)
  • check=True activates an exception if the curl command fails
  • shell=True runs the command through the shell

The try-except block is used to handle errors:

try:
    # Prepare the curl command with the Authorization header
    curl_command = f'curl -o {filename} {file_url}'
    # Execute the curl command
    subprocess.run(curl_command, check=True, shell=True)
    print(f"Downloaded '{filename}' successfully.")
except subprocess.CalledProcessError:
    print(f"Failed to download '{filename}'. Check the URL and your internet connection")

We now have a standalone download script that we’ll use throughout the book. Let’s go ahead and download the resources we need for this program.

Downloading OpenAI resources

We need three resources for this notebook:

  • requirements01.py to install the precise OpenAI version we want
  • openai_setup.py to initialize the OpenAI API key
  • openai_api.py contains a reusable function for calling the GPT-4o model, so you don’t need to rewrite the same code across multiple cells or notebooks

We will be reusing the same functions throughout the book for standard OpenAI API calls. You can come back to this section any time you want to revisit the installation process. Other scenarios will be added to the commons directory when necessary.

We can download these files with the download() function:

from grequests import download
download("commons","requirements01.py")
download("commons","openai_setup.py")
download("commons","openai_api.py")

The first resource is requirements01.py.

Installing OpenAI

requirements01.py makes sure that a specific version of the OpenAI library is installed to avoid conflicts with other installed libraries. The code thus uninstalls existing versions, force-installs the specified version requested, and verifies the result. The function executes the installation with error handling:

def run_command(command):
    try:
        subprocess.check_call(command)
    except subprocess.CalledProcessError as e:
        print(f"Command failed: {' '.join(command)}\nError: {e}")
        sys.exit(1)

The function then force-reinstalls the specified version of the OpenAI library, which removes any existing installation and replaces it with the requested release:

print("Installing 'openai' version 1.57.1...")
run_command(
    [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall", "openai==1.57.1"
    ]
)

Finally, the function verifies that OpenAI is properly installed:

try:
    import openai
    print(f"'openai' version {openai.__version__} is installed.")
except ImportError:
    print("Failed to import the 'openai' library after installation.")
    sys.exit(1)

The output at the end of the function should be as follows:

'openai' version 1.57.1 is installed.

We can now initialize the OpenAI API key.

OpenAI API key initialization

There are two methods to initialize the OpenAI API key in the notebook:

  1. Using Google Colab secrets: Click on the key icon in the left pane in Google Colab, as shown in Figure 1.8, then click on Add new secret and add your key with the name of the key variable you will use in the notebook:
Figure 1.8: Add a new Google secret key

Then, we can use Google’s userdata function to initialize the key inside the initialize_openai_api function in openai_setup.py:

# Import libraries
import openai
import os
from google.colab import userdata

# Function to initialize the OpenAI API key
def initialize_openai_api():
    # Access the secret by its name
    API_KEY = userdata.get('API_KEY')

    if not API_KEY:
        raise ValueError("API_KEY is not set in userdata!")

    # Set the API key in the environment and OpenAI
    os.environ['OPENAI_API_KEY'] = API_KEY
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OpenAI API key initialized successfully.")

This method is activated if google_secrets is set to True:

google_secrets=True
if google_secrets==True:
    import openai_setup
    openai_setup.initialize_openai_api()

  2. Custom secure method: You can also set google_secrets to False, uncomment the following code, and enter your API key directly, or use any other method of your choice:

    if google_secrets==False: # Uncomment the code and choose any method you wish to initialize the API_KEY
        import os
        #API_KEY=[YOUR API_KEY]
        #os.environ['OPENAI_API_KEY'] = API_KEY
        #openai.api_key = os.getenv("OPENAI_API_KEY")
        #print("OpenAI API key initialized successfully.")

In both cases, the code will create an environment variable:

os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

The OpenAI API key is initialized. We will now import a custom OpenAI API call.

OpenAI API call

The goal next is to create an OpenAI API call function in openai_api.py that we can import in two lines:

#Import the function from the custom OpenAI API file
import openai_api
from openai_api import make_openai_api_call

The function is thus built to receive four variables when making the call and to return the response seamlessly:

# API function call
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role)
print(response)

The parameters in this function are the following:

  • input: Contains the input (user or system), for example, Where is Hawaii?
  • mrole: Defines the message’s role, for example, system
  • mcontent: Defines what we expect the system to be, for example, You are an expert in geology.
  • user_role: Defines the role of the user, for example, user

The first part of the code in the function defines the model we will be using in this notebook and creates a message object for the API call with the parameters we sent:

def make_openai_api_call(input, mrole,mcontent,user_role):
    # Define parameters
    gmodel = "gpt-4o"
    # Create the messages object
    messages_obj = [
        {
            "role": mrole,
            "content": mcontent
        },
        {
            "role": user_role,
            "content": input
        }
    ]

We then define the API call parameters in a dictionary for this notebook:

# Define all parameters in a dictionary named params:
    params = {
        "temperature": 0,
        "max_tokens": 256,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0
    }

The dictionary parameters are the following:

  • temperature: Controls the randomness of a response. 0 will produce deterministic responses. Higher values (e.g., 0.7) will produce more creative responses.
  • max_tokens: Limits the maximum number of tokens in a response.
  • top_p: Enables nucleus sampling. It controls the diversity of a response by sampling only from the smallest set of top tokens whose cumulative probability reaches top_p; a value of 1 considers all tokens.
  • frequency_penalty: Reduces the repetition of tokens to avoid redundancy. 0 applies no penalty; 2 applies a strong penalty. Here, 0 is sufficient given the performance of the OpenAI model.
  • presence_penalty: Encourages new content by penalizing tokens that have already appeared. It takes the same range of values as the frequency penalty and, for the same reason, is set to 0 here.

We then initialize the OpenAI client (imported at the top of openai_api.py with from openai import OpenAI) to create an instance for the API calls:

    client = OpenAI()

Finally, we make the API call by sending the model, the message object, and the unpacked parameters:

    # Make the API call
    response = client.chat.completions.create(
        model=gmodel,
        messages=messages_obj,
        **params  # Unpack the parameters dictionary
    )

The function ends by returning the content of the API’s response that we need:

    #Return the response
    return response.choices[0].message.content

This function will help us focus on the GenAISys architecture without having to overload the notebook with repetitive libraries and functions.
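
For reference, the snippets above assemble into a single file along these lines. This is a sketch based on the fragments shown in this section; the actual openai_api.py in the repository may differ slightly:

from openai import OpenAI

def make_openai_api_call(input, mrole, mcontent, user_role):
    # Define the model used throughout this notebook
    gmodel = "gpt-4o"
    # Create the messages object: system instruction plus user input
    messages_obj = [
        {"role": mrole, "content": mcontent},
        {"role": user_role, "content": input}
    ]
    # Define the API behavior parameters
    params = {
        "temperature": 0,
        "max_tokens": 256,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0
    }
    # Initialize the OpenAI client and make the API call
    client = OpenAI()
    response = client.chat.completions.create(
        model=gmodel,
        messages=messages_obj,
        **params
    )
    # Return only the content of the response
    return response.choices[0].message.content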

In the notebook, we have the following:

  • The program provides the input, roles, and message content to the function
  • messages_obj contains the conversation history
  • The parameters for the API’s behavior are defined in the params dictionary
  • An API call is made to the OpenAI model using the OpenAI client
  • The function returns only the AI’s response content

A GenAISys will contain many components—including a generative model. You can choose the one that fits your project. In this book, the models are used for educational purposes only, not as endorsements or recommendations.

Let’s now build and run a stateless and memoryless session.

1. Stateless and memoryless session

A stateless and memoryless session is useful if we only want a single and temporary exchange with no stored information between requests. The examples in this section are both stateless and memoryless:

  • Stateless indicates that each request will be processed independently
  • Memoryless means that there is no mechanism to remember past exchanges

Let’s begin with a semantic query.

Semantic query

This request expects a purely semantic, factual response:

uinput = "Hawai is on a geological volcano system. Explain:"
mrole = "system"
mcontent = "You are an expert in geology."
user_role = "user"

Now, we call the OpenAI API function:

# Function call
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role)
print(response)

As you can see, the response is purely semantic:

Hawaii is located on a volcanic hotspot in the central Pacific Ocean, which is responsible for the formation of the Hawaiian Islands. This hotspot is a region where magma from deep within the Earth's mantle rises to the surface, creating volcanic activity…

The next query is episodic.

Episodic query with a semantic undertone

The query in this example is episodic and draws on personal experience. However, there is a semantic undertone because of the description of Hawaii. Here’s the message, which is rather poetic:

# API message
uinput = "I vividly remember my family's move to Hawaii in the 1970s, how they embraced the warmth of its gentle breezes, the joy of finding a steady job, and the serene beauty that surrounded them. Sum this up in one nice sentence from a personal perspective:"
mrole = "system"
mcontent = "You are an expert in geology."
user_role = "user"

mcontent is reused from the semantic query example (“You are an expert in geology”), but in this case, it doesn’t significantly influence the response. Since the user input is highly personal and narrative-driven, the system prompt plays a minimal role.

We could insert external information before the function call if necessary. For example, we could prepend some information from another source, such as a text message received that day from a family member:

text_message="Hi, I agree, we had a wonderful time there."
uinput=text_message+uinput

Now, we call the function:

# Call the function
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role)
print(response)

We see that the response is mostly episodic with some semantic information:

Moving to Hawaii in the 1970s was a transformative experience for my family, as they found joy in the island's gentle breezes, the security of steady employment, and the serene beauty that enveloped their new home.

Stateless and memoryless verification

We added no memory retention functionality earlier, making the dialogue stateless. Let’s check:

# API message
uinput = "What question did I just ask you?"
mrole = "system"
mcontent = "You already have this information"
user_role = "user"

When we call the function, our dialogue will be forgotten:

# API function call
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role
)
print(response)

The output confirms that the session is memoryless:

I'm sorry, but I can't recall previous interactions or questions. Could you please repeat your question?

The API call is stateless because the OpenAI API does not retain memory between requests. If we were using ChatGPT directly, the exchanges would be memorized within that session. This has a critical impact on implementation. It means we have to build our own memory mechanisms to give GenAISys stateful behavior. Let’s start with the first layer: short-term memory.

2. Short-term memory session

The goal of this section is to emulate a short-term memory session using a two-step process:

  1. First, we initiate a session that goes from user input to a response:

User input => Generative model API call => Response

To achieve this first step, we run the session up to the response:

uinput = "Hawai is on a geological volcano system. Explain:"
mrole = "system"
mcontent = "You are an expert in geology."
user_role = "user"
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role)
print(response)

The response’s output is stored in response:

"Hawaii is part of a volcanic system known as a hotspot, which is a region of the Earth's mantle where heat rises as a thermal plume from deep within the Earth. This hotspot is responsible for the formation of the Hawaiian Islands. Here's how the process works:…"
  2. The next step is to feed the previous interaction into the next prompt, along with a follow-up question:
    • Explain the situation: The current dialog session is:
    • Add the user’s initial input: Hawai is on a geological volcano system. Explain:
    • Add the response we obtained in the previous call
    • Add the user’s new input: Sum up your previous response in a short sentence in a maximum of 20 words.

The goal here is to compress the session log. We won’t always need to compress dialogues, but in longer sessions, large context windows can pile up quickly. This technique helps keep the token count low, which matters for both cost and performance. In this particular case, we’re only managing one response, so we could keep the entire interaction in memory if we wanted to. Still, this example introduces a useful habit for scaling up.

Once the prompt is assembled:

  • Call the API function
  • Display the response

The scenario is illustrated in the code:

ninput = "Sum up your previous response in a short sentence in a maximum of 20 words."
uinput = (
    "The current dialog session is: " +
    uinput +
    response +
    ninput
)
response = openai_api.make_openai_api_call(
    uinput, mrole, mcontent, user_role
)
print("New response:", "\n\n", uinput, "\n", response)

The output provides a nice, short summary of the dialogue:

New response: Hawaii's islands form from volcanic activity over a stationary hotspot beneath the moving Pacific Plate.

This functionality wasn’t strictly necessary here, but it sets us up for the longer dialogues we’ll encounter later in the book. Next, let’s build a long-term simulation of multiple sessions.

Keep in mind: Since the session is still in-memory only, the conversation would be lost if the notebook disconnects. Nothing is stored on disk or in a database yet.
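
If you did need persistence at this stage, one minimal option (an assumption for illustration, not what the notebook does) would be to write the session log to a JSON file and reload it in a later run:

import json

# Hypothetical helpers: persist and reload a session log between runs
def save_session(session_log, path="session01.json"):
    with open(path, "w") as f:
        json.dump(session_log, f)

def load_session(path="session01.json"):
    with open(path) as f:
        return json.load(f)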

3. Long-term memory of multiple sessions

In this section, we’re simulating long-term memory by continuing a conversation from an earlier session. The difference here is that we’re not just remembering a dialogue from a single session—we’re reusing content from a past session to extend the conversation. At this point, the term “session” takes on a broader meaning. In a traditional copilot scenario, one user interacts with one model in one self-contained session. Here, we’re blending sessions and supporting multiple sub-sessions. Multiple users can interact with the model in a shared environment, effectively creating a single global session with branching memory threads. Think of the model as a guest in an ongoing Zoom or Teams meeting. You can ask the AI guest to participate or stay quiet—and when it joins, it may need a recap.

To avoid repeating the first steps of the past conversation, we’re reusing the content from the short-term memory session we just ran. Let’s assume the previous session is over, but we still want to continue from where we left off:

session01=response
print(session01)

The output contains the final response from our short-term memory session:

Hawaii's islands form from volcanic activity over a stationary hotspot beneath the moving Pacific Plate.

The process in this section will build on the previous session, similar to how you’d revisit a conversation with an online copilot after some time away:

Save previous session => Load previous session => Add it to the new session’s scenario

Let’s first test whether the API remembers anything on its own:

uinput="Is it safe to go there on vacation"
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role
)
print(response)

The output shows that it forgot the conversation we were in:

I'm sorry, but I need more information to provide a helpful response. Could you specify the location you're considering for your vacation? …

The API forgot the previous call because stateless APIs don’t retain past dialogue. It’s up to us to decide what to include in the prompt. We have a few choices:

  • Do we want to remember everything with a large consumption of tokens?
  • Do we want to summarize parts or all of the previous conversations?

In a real GenAISys, when an input triggers a request, the AI controller decides which is the best strategy to apply to a task. The code now associates the previous session’s context and memory with a new request:

ninput = "Let's continue our dialog."
uinput=ninput + session01 + "Would it be safe to go there on vacation?"
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role
)
print("Dialog:", uinput,"\n")
print("Response:", response)

The response shows that the system now remembers the past session and has enough information to provide an acceptable output:

Response: Hawaii is generally considered a safe destination for vacation, despite its volcanic activity. The Hawaiian Islands are formed by a hotspot beneath the Pacific Plate, which creates volcanoes as the plate moves over it. While volcanic activity is a natural and ongoing process in Hawaii, it is closely monitored by the United States Geological Survey (USGS) and other agencies…

Let’s now build a long-term simulation of multiple sessions across different topics.

4. Long-term memory of multiple cross-topic sessions

This section illustrates how to merge two separate sessions into one. This isn’t something standard ChatGPT-like platforms offer. Typically, when we start a new topic, the copilot only remembers what’s happened in the current session. But in a corporate environment, we may need more flexibility—especially when multiple users are collaborating. In such cases, the AI controller can be configured to allow groups of users to view and merge sessions generated by others in the same group.
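To make the idea concrete, here is one possible shape for such a shared store. The registry layout and the group name are illustrative assumptions, not the book's design:

# Hypothetical in-memory registry: group name -> {session ID: transcript}
group_sessions = {"geology_team": {}}

def save_group_session(group, session_id, transcript):
    # Record a transcript under the group that is allowed to view it
    group_sessions[group][session_id] = transcript

def merge_group_sessions(group, session_ids):
    # Concatenate the transcripts the group's members want to merge
    return " ".join(group_sessions[group][sid] for sid in session_ids)

In this chapter, the same effect is achieved manually by concatenating session01 and session02, as the next steps show.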

Let’s say we want to sum up two separate conversations—one about Hawaii’s volcanic systems, and another about organizing a geological field trip to Arizona. We begin by saving the previous long-term memory session:

session02 = uinput + response  # full transcript of the previous session; overwritten later with the Arizona response
print(session02)

Then we can start a separate multi-user sub-session about another location, Arizona:

ninput ="I would like to organize a geological visit in Arizona."
uinput=ninput+"Where should I start?"
response = openai_api.make_openai_api_call(
    uinput,mrole,mcontent,user_role
)
#print("Dialog:", uinput,"\n")

We now expect a response about Arizona, leaving Hawaii out:

Response: Organizing a geological visit in Arizona is a fantastic idea, as the state is rich in diverse geological features. Here's a step-by-step guide to help you plan your trip:…

The response is acceptable. Now, let’s simulate long-term memory across multiple topics by combining both sessions and prompting the system to summarize them:

session02 = response  # overwrite with the Arizona response, which becomes the second session
ninput = "Sum up this dialog in a short paragraph: "
uinput = ninput + session01 + " " + session02
response = openai_api.make_openai_api_call(
    uinput, mrole, mcontent, user_role
)
# print("Dialog:", uinput, "\n")  # optional
print("Response:", response)

The output shows that the system's simulated long-term memory is effective. We see that the first part is about Hawaii:

Response: The dialog begins by explaining the formation of Hawaii's volcanic islands as the Pacific Plate moves over a stationary hotspot, leading to active volcanoes like Kilauea….

Then the response continues to the part about Arizona:

It then transitions to planning a geological visit to Arizona, emphasizing the state's diverse geological features. The guide recommends researching key sites such as the Grand Canyon…

We’ve now covered the core memory modes of GenAISys—from stateless and short-term memory to multi-user, multi-topic long-term memory. Let’s now summarize the chapter’s journey and move to the next level!

Summary

A business-ready GenAISys offers functionality on par with ChatGPT-like platforms. It brings together generative AI models, agentic features, RAG, memory retention, and a range of ML and non-AI functions—all coordinated by an AI controller. Unlike traditional pipelines, the controller doesn’t follow a fixed sequence of steps. Instead, it orchestrates tasks dynamically, adapting to the context.

A GenAISys typically runs on a model such as GPT-4o—or whichever model best fits your use case. But as we’ve seen, just having access to an API isn’t enough. Contextual awareness and memory retention are essential. While ChatGPT-like tools offer these features by default, we have to build them ourselves when creating custom systems.

We explored four types of memory: memoryless, short-term, long-term, and cross-topic. We also distinguished semantic memory (facts) from episodic memory (personal, time-stamped information). Context awareness depends heavily on memory—but context windows have limits. Even if we increase the window size, models can still miss the nuance in complex tasks. That’s where advanced RAG comes in—breaking down content into smaller chunks, embedding them, and storing them in vector stores such as Pinecone. This expands what the system can “remember” and use for reasoning.

We also saw that no matter how advanced GenAISys becomes, it can’t function without human expertise. From design to deployment, maintenance, and iteration, people remain critical throughout the system’s life cycle. We then outlined three real-world implementation models based on available resources and goals: hybrid systems that leverage existing AI platforms, small-scale systems for targeted business needs, and full-scale systems built for ChatGPT-grade performance.

Finally, we got hands-on—building a series of memory simulation modules in Python using GPT-4o. These examples laid the groundwork for what comes next: the AI controller that will manage memory, context, and orchestration across your GenAISys. We are now ready to build a GenAISys AI controller!

Questions

  1. Is an API generative AI model such as GPT an AI controller? (Yes or No)
  2. Does a memoryless session remember the last exchange(s)? (Yes or No)
  3. Is RAG used to optimize context windows? (Yes or No)
  4. Are human roles important for the entire life cycle of a GenAISys? (Yes or No)
  5. Can an AI controller run tasks dynamically? (Yes or No)
  6. Is a small-scale GenAISys built with a limited number of key features? (Yes or No)
  7. Does a full-scale ChatGPT-like system require huge resources? (Yes or No)
  8. Is long-term memory necessary across multiple sessions? (Yes or No)
  9. Do vector stores such as Pinecone support knowledge and AI controller functions? (Yes or No)
  10. Can a GenAISys function without contextual awareness? (Yes or No)
