Build a Telegram bot with Phi-3 and Qdrant and chat with your PDFs!


1. Introduction

In my previous articles, we saw how to create a simple Telegram bot and how to build a responsive assistant: in both cases, we built something that relied on pre-defined responses, however natural its replies may have seemed.

In this tutorial we will build an AI-powered, context-aware Telegram bot that reads information from PDF files and responds in multiple languages. Before we start, however, we need to define some important terms that we will be using throughout this tutorial and that are not so common in everyday programming.

2. Definitions


2a. LLM


A Large Language Model (LLM) is an Artificial Intelligence model able to understand, process and produce natural language, and to perform complex tasks as well. For this tutorial, we will be using Phi-3-mini-128k-instruct, one of the most recent and powerful models released by Microsoft.

I wrote about LLM architecture in a post on my personal blog

2b. Vector Databases


A vector database is a non-traditional data store that represents complex, feature-rich data as sets of multi-dimensional numerical objects (vectors), making similarity search possible. For this example, we will be using Qdrant as our vector database provider.

I wrote an educational article on vector databases on my personal blog
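
To get an intuition for how vectors capture meaning, here is a minimal sketch (it uses the sentence_transformers package that we will install in the next section; the two sentences are made up for illustration):

from sentence_transformers import SentenceTransformer, util

# Encode two sentences into vectors and measure how close they are
encoder = SentenceTransformer("all-MiniLM-L6-v2")
vec_a = encoder.encode("Penguins are flightless birds")
vec_b = encoder.encode("Some birds cannot fly")
# Cosine similarity ranges from -1 (opposite) to 1 (same meaning)
print(util.cos_sim(vec_a, vec_b))

The closer the score is to 1, the more similar the two sentences are in meaning: this is exactly the kind of comparison a vector database performs at scale.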

2c. Docker


You can think of Docker as a big registry where lots of applications (images) are stored: you can download (pull) them and run them in a virtualized environment (a container). A Docker image contains everything the application needs to run inside its container. A Docker container may also come with some data storage space (a volume), which can be mounted on your local file system.

3. Setup


3a. Folder Structure


As usual, we start by setting up our local folder for the tutorial: you can refer to the structure found on my GitHub repo, represented here:
.
|___ bot.py
|___ qdrant_storage
|___ requirements.txt
|___ utils.py

Let's break down the files:

  • bot.py will be our main script, containing the Telegram bot
  • requirements.txt will be the file where we write all the needed dependencies, in order to install them
  • utils.py will be the script where we define useful functions and classes for our bot
  • qdrant_storage/ will be the local folder where the databases will be stored.

3b. Install Dependencies

Open requirements.txt and paste the following text:

gradio_client == 0.15.0
pypdf == 3.17.4
sentence_transformers == 2.2.2
transformers == 4.39.3
langdetect == 1.0.9
deep-translator == 1.11.4
qdrant_client == 1.9.0
langchain-community == 0.0.13
langchain == 0.1.1
python-telegram-bot >= 20.0

These are all the Python packages that we need to build the bot and its back-end architecture:

  • gradio_client lets us interact with the Gradio API that serves the LLM
  • pypdf manages PDF files
  • sentence_transformers turns text data into vectors
  • transformers is a Hugging Face library commonly used for LLMs
  • langdetect and deep-translator handle language detection and translation, adding multilingual support to our bot
  • qdrant_client lets us interact with Qdrant running in the background
  • langchain and langchain-community help us preprocess PDFs and turn them into plain-text data
  • python-telegram-bot provides the telegram.ext framework that powers the bot itself (the code in this tutorial needs the asynchronous API introduced in v20).

Now we save requirements.txt, head over to our terminal and run:

python3 -m pip install -r requirements.txt

3c. Install and Run Qdrant with Docker

We will need Qdrant running in a Docker container on our machine in order to build the vector database.
To do so, first of all we need to download it from Docker Hub:

docker pull qdrant/qdrant

And then we can run it, mounting our local qdrant_storage folder as a volume (the -v option):

docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant

Now, if you want to see the Qdrant Web UI, just type localhost:6333/dashboard in your browser and press Enter.
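
You can also quickly check from Python that Qdrant is reachable (a minimal sketch, assuming the container from the previous step is running):

from qdrant_client import QdrantClient

# Connect to the local Qdrant instance and list the existing collections
client = QdrantClient("localhost:6333")
print(client.get_collections())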

4. Define Useful Functions and Classes


In this section, we will be editing utils.py

4a. Import Necessary Dependencies


To make our script work, we need to import the following packages, classes and functions:

from langdetect import detect
from deep_translator import GoogleTranslator
from pypdf import PdfMerger
from qdrant_client import models
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import os

4b. Handle PDFs

We first define a function to merge multiple PDFs (the Telegram bot will not take multiple documents as input, but this can always come in handy):

def merge_pdfs(pdfs: list):
    # Merge all the input PDFs into a single file,
    # named after the last PDF in the list
    merger = PdfMerger()
    for pdf in pdfs:
        merger.append(pdf)
    merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf")
    merger.close()
    return f"{pdfs[-1].split('.')[0]}_results.pdf"
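
For example, calling merge_pdfs(["intro.pdf", "chapter1.pdf"]) (hypothetical file names) writes the merged document to chapter1_results.pdf and returns that path.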

And another function that removes blank lines from a list (we will use it when subdividing our PDFs into smaller chunks of plain text):

def remove_items(test_list: list, item):
    # Return a copy of the list without any occurrence of `item`
    res = [i for i in test_list if i != item]
    return res

Now we can initialize a class that is able to turn PDFs into Qdrant collections (i.e. vector databases):

class PDFdatabase:
    def __init__(self, pdfs, encoder, client):
        self.finalpdf = merge_pdfs(pdfs)
        self.collection_name = os.path.basename(self.finalpdf).split(".")[0].lower()
        self.encoder = encoder
        self.client = client

The client is the interface between our script and Qdrant, while the encoder is the sentence-transformers model that turns our text data into vectors.
Now we define, inside the class, three methods to preprocess (turn into plain text and split into chunks), organize and vectorize our PDF files:

    def preprocess(self):
        loader = PyPDFLoader(self.finalpdf)
        documents = loader.load()
        # Split the documents into smaller chunks for processing
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        self.pages = text_splitter.split_documents(documents)
    def collect_data(self):
        # Turn each chunk into a payload with its text, source file and page number
        self.documents = []
        for text in self.pages:
            contents = text.page_content.split("\n")
            contents = remove_items(contents, "")
            for content in contents:
                self.documents.append({"text": content, "source": text.metadata["source"], "page": str(text.metadata["page"])})
        return self.collection_name
    def qdrant_collection_and_upload(self):
        # (Re)create the collection, sized to match the encoder's embeddings
        self.client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=models.VectorParams(
                size=self.encoder.get_sentence_embedding_dimension(),  # vector size is defined by the model in use
                distance=models.Distance.COSINE,
            ),
        )
        # Encode every chunk and upload it as a point, with its metadata as payload
        self.client.upload_points(
            collection_name=self.collection_name,
            points=[
                models.PointStruct(
                    id=idx, vector=self.encoder.encode(doc["text"]).tolist(), payload=doc
                )
                for idx, doc in enumerate(self.documents)
            ],
        )
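
To see how these pieces fit together, here is a minimal usage sketch (the file name sample.pdf is hypothetical; the client and encoder anticipate the ones we will configure in bot.py):

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

client = QdrantClient("localhost:6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Turn a local PDF into a searchable Qdrant collection
pdfdb = PDFdatabase(["sample.pdf"], encoder, client)
pdfdb.preprocess()
collection_name = pdfdb.collect_data()
pdfdb.qdrant_collection_and_upload()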

4c. Search the Vector Database

We now have to build a class that searches the vector database, taking the Qdrant client and the name of the collection to search as initialization inputs, and the query as search term:

class NeuralSearcher:
    def __init__(self, collection_name, client, model):
        self.collection_name = collection_name
        self.model = model
        self.qdrant_client = client
    def search(self, text: str):
        vector = self.model.encode(text).tolist()
        # Use `vector` to search for the closest vectors in the collection
        search_result = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=vector,
            query_filter=None, 
            limit=1, 
        )
        payloads = [hit.payload for hit in search_result]
        return payloads
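
Continuing the sketch above, querying the freshly created collection would look like this (the question is just an example):

searcher = NeuralSearcher(collection_name, client, encoder)
payloads = searcher.search("What do penguins eat?")
# Each payload carries the matched text plus its source file and page
print(payloads[0]["text"], payloads[0]["page"])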

4d. Translation

We now build a class that detects the language of a text (langdetect covers 55 languages) and translates it from the original (automatically-detected) language to the target one through Google Translator:

class Translation:
    def __init__(self, text, destination):
        self.text = text
        self.destination = destination
        try:
            self.original = detect(self.text)
        except Exception:
            # If langdetect fails, let Google Translator auto-detect the source language
            self.original = "auto"
    def translatef(self):
        translator = GoogleTranslator(source=self.original, target=self.destination)
        translation = translator.translate(self.text)
        return translation
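
For instance, a quick hypothetical usage (the exact translation output may vary):

t = Translation("Ciao, come stai?", "en")
print(t.original)      # detected source language: "it"
print(t.translatef())  # something like "Hi, how are you?"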

5. The Bot

We already know how to get our Telegram API Token: after obtaining it, we open bot.py and paste it as follows:

TOKEN = ""

We can then import everything we need:

from gradio_client import Client
from utils import *
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from telegram.ext import *

And set some pre-defined variables:

client = QdrantClient("localhost:6333")  # interface to the local Qdrant instance
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # model that turns text into vectors
collection_name = ""  # set once the user uploads a PDF
api_client = Client("eswardivi/Phi-3-mini-128k-instruct")  # Gradio Space serving Phi-3
lan = "en"  # default language

5a. Start the Conversation

First of all, we need to start the conversation with the user and find out what language their PDFs will be written in:

LAN = 1 # conversation handler state
async def start_command(update, context):
    user = update.message.from_user
    await update.message.reply_text(f"Hi {user.first_name} {user.last_name}, and thank you so much for having chosen RAGBOT as your assistant today!\nI'm here to help you chatting with your pdfs, so let's start with an ice-breaking question: what language is your pdf written in? Reply with the command '/lan' followed by the ISO code of the language, that you can find here: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes. For example, if your pdfs are in Italian, reply: '/lan it' (without quotation marks)")
    return LAN
async def handle_lan(update, context):
    global lan
    message = update.message.text
    message = message.replace("/lan ", "")
    lan = message
    txt = Translation(f"Now the language has been changed to {lan}", lan)
    await update.message.reply_text(txt.translatef())
    return ConversationHandler.END
NOTE: we need to provide the language as an ISO code (English becomes "en", Italian becomes "it", and so on...)

5b. Upload the PDF Document

Now that we have started the conversation and set the language, we can send our first PDF document, which will be handled by our handle_pdfs function:

async def handle_pdfs(update, context):
    global collection_name
    global lan
    if update.message.document:
        doc = update.message.document
        inf = "downloaded_from_user.pdf"
        fid = doc.file_id
        print(fid)
        # Download the file from Telegram to the local directory
        await (await context.bot.get_file(fid)).download_to_drive(custom_path=inf)
        pdfdb = PDFdatabase([inf], encoder, client)
        pdfdb.preprocess()
        collection_name = pdfdb.collect_data()
        pdfdb.qdrant_collection_and_upload()
        txt = Translation("Your document has been succesfully uploaded to a Qdrant collection!", lan)
        await update.message.reply_text(txt.translatef())

This function downloads the PDF document from Telegram to our local folder, preprocesses it and turns it into a Qdrant collection in a short time: when it is done, we receive a message (in our favorite language) telling us that the operation has been completed successfully.

5c. Talk to the LLM about Your Document

Now we get to the last function we need before actually building the application: reply takes the user's query, adapts it to the language of the PDF, searches the PDF for context information and builds a prompt that combines the retrieved context with the user's question, translating the pieces when needed. This prompt is then fed to Phi-3, which generates the response:

async def reply(update, context):
    global collection_name
    global client
    global encoder
    global api_client
    global lan
    message = update.message.text
    txt = Translation(message, "en")
    print(txt.original, lan)
    # Case 1: the user's message and the PDF are both in English
    if txt.original == "en" and lan == "en":
        txt2txt = NeuralSearcher(collection_name, client, encoder)
        results = txt2txt.search(message)
        response = api_client.predict(
            f"Context: {results[0]['text']}; Prompt: {message}",  # str, the 'Message' Textbox component
            0.4,  # float (between 0 and 1), the 'Temperature' Slider component
            True,  # bool, the 'Sampling' Checkbox component
            512,  # float (between 128 and 4096), the 'Max new tokens' Slider component
            api_name="/chat"
        )
        await update.message.reply_text(response)
    elif txt.original == "en" and lan != "en":
        txt2txt = NeuralSearcher(collection_name, client, encoder)
        transl = Translation(message, lan)
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = api_client.predict(
            f"Context: {res}; Prompt: {message}",
            0.4,
            True,
            512,
            api_name="/chat"
        )
        response = Translation(response, txt.original)
        await update.message.reply_text(response.translatef())
    elif txt.original != "en" and lan == "en":
        txt2txt = NeuralSearcher(collection_name, client, encoder)
        results = txt2txt.search(message)
        transl = Translation(results[0]["text"], "en")
        translation = transl.translatef()
        response = api_client.predict(
            f"Context: {translation}; Prompt: {message}",
            0.4,
            True,
            512,
            api_name="/chat"
        )
        t = Translation(response, txt.original)
        res = t.translatef()
        await update.message.reply_text(res)
    # Case 4: neither the message nor the PDF is in English
    else:
        txt2txt = NeuralSearcher(collection_name, client, encoder)
        transl = Translation(message, lan)
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = api_client.predict(
            f"Context: {res}; Prompt: {message}",
            0.4,
            True,
            512,
            api_name="/chat"
        )
        tr = Translation(response, txt.original)
        ress = tr.translatef()
        await update.message.reply_text(ress)

5d. Build the Application and Make It Run

We now have all the utilities and functions we need to build our Telegram bot.
To do so, we first create the application:

if __name__ == "__main__":
    print("Bot is up and running")
    application = Application.builder().token(TOKEN).build()

And the conversation handler:

    conv_handler = ConversationHandler(
        entry_points=[CommandHandler('start', start_command)],
        states={
            LAN: [CommandHandler('lan', handle_lan)]
        },
        fallbacks=[]
    )

Then we add all the handlers to our application and set it up to run:

    application.add_handler(conv_handler)
    application.add_handler(MessageHandler(filters.Document.PDF, handle_pdfs))
    application.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, reply))
    application.run_polling(1.0)
NOTE: we use a particular filter to recognize messages with uploaded PDFs: can you spot it?

Now we can finally give our bot a try!

python3 bot.py

And here's an example chat:
[Image: an example chat with the bot]
The uploaded PDF file contains information about penguins (generated with Llama-3 70B).

ATTENTION: please note that, for the sake of simplicity, we did not include any error handling in this tutorial, but it is good practice to implement it.

6. Conclusion

We built our first AI-powered bot, which is inclusive from a language standpoint (English is not required) and can expand its knowledge based on our PDF files, fully implementing the concept of a context-aware assistant.

Our bot is useful to anyone who wants to learn, query their documents rapidly and access large amounts of information: the most interesting thing is that what we built in this tutorial is accessible to everyone, completely open-source, and can be built without a great deal of prior knowledge. In the end, it doesn't take that much to become part of the AI revolution!
