How to Build a Dual-LLM Chat Application with Next.js, Python, and WebSocket Streaming

Originally published at johnowolabiidogun.dev · 11 min read

Introduction

In the age of rapid AI advancement, software engineers are frequently tasked with building sophisticated chatbots. This article provides a practical guide to developing a real-time, response-streaming chatbot from scratch using transformer models of both the autoencoding (BERT family) and autoregressive (GPT family) varieties. We'll learn how to build a chatbot capable of contextual conversations, processing user text and image/PDF uploads for enhanced understanding. We'll utilize Python (aiohttp) and React (Next.js) to deliver a seamless, interactive experience. Let's dive in!

Prerequisite

No prior expertise is strictly necessary, but some foundation in AI, software engineering, Python coding, and frontend development with a framework like React, Svelte, Vue, or Angular will be beneficial. If you're new to aiohttp or Svelte, I highly suggest exploring my previous article series as a helpful primer. Don't worry if you're not an expert (I'm not one either).

To set up your project, first create a root folder genai-chatbot and then navigate into the backend subdirectory:

```sh
mkdir genai-chatbot && cd genai-chatbot && mkdir backend && cd backend
```

In the backend directory, create a Python virtual environment and install dependencies from requirements.txt:

```sh
python3 -m venv virtualenv && source virtualenv/bin/activate && pip install -r requirements.txt
```

Ensure you create a requirements.txt file in the backend directory with the following content:

```txt
aiohttp
aiodns
transformers
torch
torchvision
pymupdf
pytesseract
```

For the frontend, navigate to the root directory (genai-chatbot) and use create-next-app to generate a React frontend with TypeScript:

```sh
cd ..
npx create-next-app@latest react-frontend --typescript --eslint --app
```

Follow the prompts from create-next-app.

Source code

Due to restrictions here, refer to the code repository for each file discussed.

github/Sirneij/genai-chatbot

Implementation

Step 1: Theory - Autoregressive vs Autoencoding models

Two primary approaches emerge within the transformer model architecture: autoregressive and autoencoding. Autoregressive models, like those using the decoder component of a transformer, predict the next word (token) in a sequence based solely on the preceding words (a strictly left-to-right approach). This sequential generation method is ideal for tasks like text generation and response streaming. In contrast, autoencoding models (also known as masked language models) leverage the encoder part of the transformer. They operate bidirectionally, considering both preceding and subsequent words in a sentence to understand the context and produce representations. Refer to Hugging Face's model summary for a detailed comparison. Pretraining methodology is a key differentiator: autoregressive models are pretrained to predict the next token, making them naturally suited for text generation, while autoencoding models excel at contextual understanding, making them strong for tasks like question answering based on documents.
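
To make the distinction concrete, here is a deliberately tiny toy illustration (a bigram count table, not a real transformer; all names are invented for this sketch): the autoregressive predictor may only look left, while the masked predictor uses context from both sides.

```python
# Toy illustration (not a real transformer): the two prediction styles
# differ in what context they are allowed to look at.
counts = {
    ('the', 'cat'): 3, ('the', 'dog'): 1,   # bigram counts from a tiny corpus
    ('cat', 'sat'): 2, ('sat', 'down'): 2,
}

def autoregressive_next(prefix):
    """Predict the next word from the preceding word only (left-to-right)."""
    last = prefix[-1]
    candidates = {b: n for (a, b), n in counts.items() if a == last}
    return max(candidates, key=candidates.get) if candidates else None

def masked_fill(left, right):
    """Fill a masked slot using context on both sides (bidirectional)."""
    candidates = {
        b: n for (a, b), n in counts.items()
        if a == left and (b, right) in counts
    }
    return max(candidates, key=candidates.get) if candidates else None

print(autoregressive_next(['the']))   # 'cat' — only looks left
print(masked_fill('the', 'sat'))      # 'cat' — uses both neighbours
```

Real autoregressive models do the left-to-right step with a neural network over the whole prefix, and masked models attend over the whole sentence, but the directionality contrast is exactly this.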

In this project, we will practically explore the distinct text generation capabilities of both model types in real-time. We will use microsoft/Phi-3-mini-4k-instruct (autoregressive) and deepset/roberta-base-squad2 (autoencoding) for demonstration, acknowledging resource limitations and focusing on showcasing the core differences.

Step 2: Python asynchronous backend with aiohttp

We'll start with the project structure from Building and Rigorously Testing a WebSocket and HTTP Server. Populate src/app/__init__.py with the following code, noting the key adaptations for our chatbot:

```python
import json
from weakref import WeakSet

from aiohttp import WSCloseCode, web
from aiohttp.web import Request, Response, WebSocketResponse

from src.utils.auto_chat_engine import (
    cleanup_auto_model,
    gpt_question_and_answer,
    prepare_auto_tokenizer_and_model,
)
from src.utils.chat_engine import (
    prepare_qa_tokenizer_and_model,
    squad_question_answering,
)
from src.utils.extract import extract_text_from_file
from src.utils.settings import base_settings

WEBSOCKETS = web.AppKey('websockets', WeakSet[WebSocketResponse])
...

async def extract_text(request: Request) -> Response:
    """Extract text from PDF and image files."""
    data = await request.post()
    files = data.getall('file')
    if not files:
        return web.json_response({'error': 'No files provided'}, status=400)

    extracted_text = []
    for file in files:
        if file.content_type not in ['application/pdf', 'image/jpeg', 'image/png']:
            return web.json_response({'error': 'Invalid file type'}, status=400)

        text = await extract_text_from_file(file.file, file.content_type)
        extracted_text.append(text)

    base_settings.context = '\n'.join(extracted_text)

    return web.json_response({'success': 'Text extracted successfully'})


async def chat_handler(request: Request) -> WebSocketResponse:
    """Handle WebSocket connections."""
    ws = WebSocketResponse()
    await ws.prepare(request)

    request.app[WEBSOCKETS].add(ws)

    async for msg in ws:
        if msg.type == web.WSMsgType.TEXT:
            try:
                data = json.loads(msg.data)
                question_type = data.get('type')
                question = data.get('question', '').strip()
                if not question:
                    await ws.send_str('Error: No question provided.')
                    continue

                if question_type == 'auto':
                    # Stream response token by token.
                    async for token in gpt_question_and_answer(question):
                        await ws.send_json({'answer': token})
                elif question_type == 'masked':
                    # Use squad question answering (non-streamed).
                    answer = await squad_question_answering(question)
                    await ws.send_json({'answer': answer})
                else:
                    await ws.send_str('Error: Unknown question type.')
            except Exception as e:
                await ws.send_str(f'Error processing message: {str(e)}')
        elif msg.type == web.WSMsgType.ERROR:
            break

    # discard (not remove) avoids a KeyError if the socket is already gone
    request.app[WEBSOCKETS].discard(ws)

    return ws


def init_app() -> web.Application:
    """Initialize the application."""
    app = web.Application()

    # Add routes
    app.router.add_post('/api/extract', extract_text)
    app.router.add_get('/chat', chat_handler)

    # Add startup/cleanup handlers
    app.on_startup.append(start_background_tasks)
    app.on_shutdown.append(cleanup_app)

    return app
```

This __init__.py file builds upon the structure from the recommended series. The key differences for our chatbot are:

  1. start_background_tasks (on startup) loads AI models, and cleanup_app (on shutdown) unloads them, improving resource handling.
  2. The extract_text handler accepts and processes multiple uploaded files to create a combined context.
  3. We use direct aiohttp WebSocket APIs in chat_handler. We support the auto and masked message types, representing the autoregressive and masked (autoencoding) language models respectively.
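
For reference, the message shapes chat_handler expects over the WebSocket can be sketched in a few lines. The dispatch helper below is illustrative — it mirrors the handler's branching, it is not the article's exact code:

```python
import json

# The two message shapes the WebSocket handler accepts.
auto_msg = json.dumps({'type': 'auto', 'question': 'Write Python code for Fibonacci'})
masked_msg = json.dumps({'type': 'masked', 'question': 'Who wrote the uploaded PDF?'})

def dispatch(raw: str) -> str:
    """Mirror of chat_handler's branching on the incoming JSON payload."""
    data = json.loads(raw)
    question_type = data.get('type')
    if not data.get('question', '').strip():
        return 'Error: No question provided.'
    if question_type == 'auto':
        return 'stream tokens'        # gpt_question_and_answer, streamed
    if question_type == 'masked':
        return 'single answer'        # squad_question_answering, one message
    return 'Error: Unknown question type.'
```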

Next, let's delve into the files within the src/utils directory, starting with settings.py.

The settings.py file primarily houses our logger configuration and defines several important constants used in the application, including the system prompt, the default language model (MODEL_NAME), and the question answering model (QA_MODEL_NAME).

Let's examine base.py next. get_device intelligently detects the available hardware and returns the appropriate PyTorch device (CUDA GPU, Apple Metal (MPS), or CPU) along with a descriptive string.

get_stopping_strings plays a crucial role in prompt construction for the autoregressive model. It dynamically generates a prompt and a list of stopping strings based on the question type and optional context. The prompt instructs the model to provide a concise answer using Markdown formatting and KaTeX for mathematical expressions. A key challenge with autoregressive models is their tendency to generate repetitive text. To mitigate this, the prompt explicitly instructs the model to conclude its response with the "§" symbol, an uncommon character chosen to signal the end of the answer. The application then programmatically halts text generation upon encountering this symbol. While using the model's native end-of-sequence token would be ideal, I had no luck with it.
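
The stop-marker idea can be sketched in isolation (the helper name truncate_at_stop is invented for this illustration; the real code checks the marker inside the generation loop):

```python
STOP_STRINGS = ['§']

def truncate_at_stop(generated: str, stop_strings=STOP_STRINGS) -> tuple:
    """Return (visible_text, finished): cut everything from the first stop
    marker onward, and report whether a marker was actually seen."""
    for stop in stop_strings:
        idx = generated.find(stop)
        if idx != -1:
            return generated[:idx], True   # marker found — generation is done
    return generated, False                # no marker yet — keep streaming

print(truncate_at_stop('The answer is 42.§ repetitive tail...'))
# ('The answer is 42.', True)
```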

The auto_chat_engine.py file is the central component responsible for generating text using the configured language model. It orchestrates the model's loading, manages the text generation process, and handles device-specific optimizations.

prepare_auto_tokenizer_and_model initializes the tokenizer and model. To avoid redundant or repetitive loading, we used global variables and adapted the loading process based on the available device (CPU, MPS, CUDA).

torch.quantization.quantize_dynamic is a handy way to quantize the model so that it runs faster and requires less memory. If your machine supports CUDA, you will need to add your own branch for it in the code.

cleanup_auto_model primarily releases the model and tokenizer from memory. This is particularly important on memory-constrained devices like those with Apple Silicon (MPS), where memory leaks can quickly degrade performance.

top_k_top_p_filtering implements two common techniques for controlling the diversity and quality of generated text. Top-k filtering limits the model's choices to the k most likely tokens, while top-p (nucleus) sampling considers the smallest set of tokens whose cumulative probability exceeds p. In simple terms, they help to strike a balance between generating creative and coherent text by preventing the model from generating nonsensical or irrelevant tokens.
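
Here is a pure-Python sketch of the combined filter (the real implementation operates on PyTorch logit tensors; this function is illustrative and returns the surviving token indices instead of masked logits):

```python
import math

def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Indices of tokens that survive top-k, then top-p (nucleus) filtering."""
    # sort token indices by logit, highest first
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]              # keep only the k most likely tokens
    # softmax over the survivors
    exps = [math.exp(logits[i]) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # nucleus: smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept
```

With logits [2.0, 1.0, 0.1], top_k=2 and top_p=0.5 keep only the single most likely token, while a looser top_p lets the second token through — that is the creativity/coherence dial in action.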

The stream_chat_response function is the heart of our real-time chatbot's text generation. It takes a prompt, the model, and configurations to generate streaming text output. For performance, within the generation loop, we:

  • Optimize Device Usage: Move model parameters to the active device for faster access.
  • Disable Gradients: Use torch.no_grad() to skip unnecessary gradient calculations during inference.
  • Enable Auto-Precision: Utilize torch.autocast for faster, mixed-precision operations.

The generation process automatically stops when it encounters our stopping character (code lines 151-152). To ensure smooth streaming, the function also briefly releases memory between tokens. Finally, after generating the complete response, stream_chat_response yields [END] to signal to the frontend that the stream is complete.
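
The streaming contract — yield tokens, halt at the stop marker, then emit [END] — can be sketched with a plain async generator (names invented for this illustration; the real function generates tokens from the model instead of a list):

```python
import asyncio

async def fake_stream(tokens):
    """Mimic stream_chat_response's contract over a canned token list."""
    for token in tokens:
        if '§' in token:
            break                      # stop marker reached — halt generation
        yield token
        await asyncio.sleep(0)         # give the event loop room between tokens
    yield '[END]'                      # tell the frontend the stream is complete

async def collect():
    return [t async for t in fake_stream(['Hello', ' world', '§', ' ignored'])]

print(asyncio.run(collect()))          # ['Hello', ' world', '[END]']
```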

gpt_question_and_answer serves as a high-level interface for generating answers to questions. It retrieves the appropriate prompt and stopping strings using get_stopping_strings and then calls stream_chat_response to generate the text. You can modify the parameters here to your taste.

Finally, let's examine the chat_engine.py file. This module leverages pre-trained BERT-based models to extract answers from a given context. prepare_qa_tokenizer_and_model initializes the question-answering pipeline; as previously stated, the primary motivation for using a global QA_PIPELINE is to avoid repeatedly loading the model, a resource-intensive operation. A key design decision in squad_question_answering is the enforcement of a context: BERT-based models, while powerful, generally perform best when given a specific context to ground their answers. To prevent blocking the main thread, the actual question-answering call is offloaded to a worker thread using asyncio.to_thread.
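
The offloading pattern can be sketched as follows; the blocking function here is a stand-in for the real QA pipeline call, and the names are illustrative:

```python
import asyncio
import time

def blocking_qa(question: str, context: str) -> str:
    """Stand-in for the synchronous Hugging Face QA pipeline."""
    time.sleep(0.05)                     # simulate model inference
    return context.split('.')[0]         # pretend the first sentence is the answer

async def answer(question: str, context: str) -> str:
    # asyncio.to_thread (Python 3.9+) runs the callable in a worker thread,
    # so the event loop keeps serving other WebSocket clients meanwhile.
    return await asyncio.to_thread(blocking_qa, question, context)

print(asyncio.run(answer('Who built it?', 'John built it. More text.')))
# 'John built it'
```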

Step 3: Building a React Frontend with Next.js and Tailwind CSS

Home page of the app

While our server backend is functional and can be interacted with via command-line tools like wscat (using a command such as wscat -c ws://localhost:PORT/chat followed by a prompt like {"type": "auto", "question": "Write Python code for Fibonacci sequence"}), this approach is not user-friendly. To make our AI chatbot accessible to a wider audience, including those unfamiliar with the terminal, we need a graphical user interface. We'll leverage Next.js, a React framework known for its performance and developer experience, along with Tailwind CSS for styling.

Brief conversation with the app

Our frontend application consists of a single page with a few key components. Let's start by examining react-frontend/src/app/layout.tsx

This layout.tsx file defines the root layout of our Next.js application. It's largely based on the default code generated by create-next-app and enhanced with Tailwind CSS integration. We've also integrated KaTeX for rendering mathematical expressions (remember to install the katex package and its TypeScript definitions, as well as marked.js for Markdown processing). The ThemeSwitcher component, responsible for toggling between light and dark themes, is also included.

For Svelte developers, React's useEffect hook serves a similar purpose to Svelte 5's $effect block. Both let you run code in response to changes (in useEffect you list the dependencies explicitly; Svelte tracks them automatically), and critically, both provide a safe context for interacting with HTML elements.

Now it's the turn of react-frontend/src/app/page.tsx.

This file represents the main page component of our chat application, housing the core logic for message handling and UI rendering. The Home component manages the chat interface, including displaying messages, handling user input, and communicating with the backend. The handleSend function adds a user message and a loading message to the state, while handleBotMessage updates the UI with the bot's responses as they stream in.

The custom useWebSocket hook encapsulates all the WebSocket logic, including connection management, message handling, and error handling.

The handleSend function adds a temporary "loading" message to the chat interface. This is a simple but effective UI trick to give the user immediate feedback while waiting for the AI bot's response; without it, the app might feel slow or unresponsive. The handleBotMessage function handles the streaming responses from the bot, efficiently updating the chat interface by appending new content to the last bot message. The isComplete flag, triggered by the "[END]" signal from the backend, ensures the loading indicator is removed once the bot finishes generating its response. Why useCallback in handleBotMessage? It memoizes the function so that it only changes when its dependencies change, preventing unnecessary re-renders of components that depend on it.

To wrap up, let's look at the react-frontend/src/app/ui/chat/ChatContainer.tsx and react-frontend/src/app/ui/chat/ChatInput.tsx.

The ChatContainer component renders the list of ChatMessage components. It's a straightforward component that iterates over the messages and displays each one.

The ChatMessage component is responsible for rendering individual chat messages. The key feature of this component is its use of marked.js to render Markdown content, with extensions for syntax highlighting (using highlight.js) and math rendering (using KaTeX). The component also includes logic to display a "thinking" animation while the bot is generating a response.

Lastly, let's look at react-frontend/src/app/ui/chat/ChatInput.tsx.

It provides the user interface for entering and sending messages. It includes a text input, controls for toggling "Masked" and "Auto" modes, and a file upload button (you must select Masked mode to upload files). The component was inspired by X's Grok 3.

An important point to note is that in handleFileUpload, I used a Next.js server action (uploadFiles) to communicate with the server for file uploads.

The key reason for using a server action is to avoid CORS errors such as Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... By executing the upload logic on the server, we bypass the browser's same-origin policy and can communicate with the backend API without requiring CORS configuration.

Apologies for the long article; I didn't want to split it up, preferring to finish it in one go. There are still a couple of cleanups to do, but I'll leave those to you. Bye for now.

Outro

Enjoyed this article? I'm a Software Engineer, Technical Writer, and Technical Support Engineer actively seeking new opportunities, particularly in areas related to web security, finance, healthcare, and education. If you think my expertise aligns with your team's needs, let's chat! You can find me on LinkedIn and X. I am also an email away.

If you found this article valuable, consider sharing it with your network to help spread the knowledge!

