What is ReadmeReady?
Automated documentation of programming source code is a challenging task with significant practical and scientific implications for the developer community. ReadmeReady is a large language model (LLM)-based application that developers can use as a support tool to generate basic documentation for any publicly available or custom repository. Over the last decade, considerable research has gone into generating documentation for source code using neural network architectures, and with recent advancements in LLM technology, some open-source applications have been developed to address this problem. However, these applications typically rely on the OpenAI APIs, which incur substantial financial costs, particularly for large repositories. Moreover, none of them offers a fine-tuned model or features that let users fine-tune custom LLMs, and suitable data for fine-tuning is often hard to find. Our application addresses these issues.
Why do we need it?
Here, we introduce an LLM-based application that developers can use as a support tool to generate basic documentation for any code repository. Several open-source applications have already been developed to address this problem, to name a few:
- AutoDoc-ChatGPT
- AutoDoc
- Auto-GitHub-Docs-Generator
However, these applications suffer from two major issues. First, all of them are built on top of the OpenAI APIs, requiring users to have an OpenAI API key and incurring a cost with each API request; generating documentation for a large repository can easily run to hundreds of dollars. Our application instead lets users choose among Meta's Llama 2 and Google's Gemma model families. These models are open source and incur no API charges, so documentation can be generated for free. Second, none of the existing open-source applications lets users fine-tune the underlying model. Our application offers a fine-tuning option using QLoRA, which can be trained on the user's own dataset. Some existing applications do provide a command-line tool for interacting with an entire repository, allowing users to ask specific questions about it, but they do not generate a documentation file.
How does ReadmeReady work?
The application prompts the user to enter the project’s name, and GitHub URL, and select the desired model from the following options:
- TheBloke/Llama-2-7B-Chat-GPTQ (quantized)
- TheBloke/CodeLlama-7B-Instruct-GPTQ (quantized)
- meta-llama/Llama-2-7b-chat-hf
- meta-llama/CodeLlama-7b-Instruct-hf
- google/gemma-2b-it
- google/codegemma-2b-it
Document Retrieval: Our application indexes the codebase through a depth-first traversal of all repository contents and uses an LLM to generate documentation. All files are converted to text, tokenized, and then chunked, with each chunk containing 1,000 tokens. The application employs the sentence-transformers/all-mpnet-base-v2 sentence encoder to convert each chunk into a 768-dimensional embedding vector, which is stored in an in-memory vector store. When a query is provided, it is converted into an embedding with the same sentence encoder, and the nearest neighbors of the query embedding are retrieved from the vector store using KNN (k = 4) with cosine similarity as the distance metric. For the KNN search, we use the HNSWLib library, which implements approximate nearest-neighbor search based on hierarchical navigable small-world graphs. The retrieved chunks provide the relevant sections of the source code, helping answer the prompted question. The overall methodology for retrieval-augmented generation (RAG) and fine-tuning is outlined below.
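To make the retrieval step concrete, here is a minimal sketch of chunk embedding and KNN lookup using the sentence-transformers and hnswlib packages directly. It is not ReadmeReady's internal code (the tool works through LangChain wrappers), and the chunks and query below are placeholders:

from sentence_transformers import SentenceTransformer
import hnswlib
import numpy as np

# Placeholder chunks; in practice these are ~1,000-token pieces of repository text
chunks = [
    "def index(repo_config): ...",
    "def generate_readme(repo_config, user_config, readme_config): ...",
]

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = encoder.encode(chunks)  # shape: (num_chunks, 768)

# Build an HNSW index over the chunk embeddings with cosine distance
index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(chunks), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(chunks)))

# Embed the query and retrieve its nearest chunks (k capped at the corpus size)
query = "How do I generate a README for my repository?"
query_vec = encoder.encode([query])
labels, distances = index.knn_query(query_vec, k=min(4, len(chunks)))
context = "\n\n".join(chunks[i] for i in labels[0])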

Prompt Configuration: Prompt engineering is handled with the LangChain API. For our purpose, the prompt template shown below is used. This template includes placeholders for questions, which users can edit and modify as needed, so the README can be generated according to the user's specific requirements. Our default README structure includes sections on description, requirements, installation, usage, contributing methods, and licensing, which align with standard documentation practices. The temperature for text generation is kept at the default value of 0.2. The current prompts are developer-focused and assume that the repository is code-centric.
Prompt Template:
Instruction:
You are an AI assistant for a software project called {project_name}. You are trained on all the {content_type} that make up this project. The {content_type} for the project is located at {repository_url}. You are given a repository which might contain several modules and each module will contain a set of files. Look at the source code in the repository and you have to generate content for the section of a README.md file following the heading given below. If you use any hyperlinks, they should link back to the GitHub repository shared with you. You should only use hyperlinks that are explicitly listed in the context. Do NOT make up a hyperlink that is not listed. Assume the reader is a {target_audience} but is not deeply familiar with {project_name}. Assume the reader does not know anything about how the project is structured or which folders/files do what and what functions are written in which files and what these functions do. If you don’t know how to fill up the README.md file in one of its sections, leave that part blank. Don’t try to make up any content. Do not include information that is not directly relevant to the repository, even though the names of the functions might be common or frequently used in several other places. Provide the answer in the correct markdown format.
{additional_instructions}
Question:
Provide the README content for the section with the heading “{{input}}” starting with ##{{input}}.
Context:
{{context}}
Answer in Markdown:
{{answer}}
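As an illustration of how such a template is wired up, the sketch below uses LangChain's PromptTemplate to fill in the placeholders before the prompt is sent to the model. It is a simplified, hypothetical version of the template, not the exact template or chain that ships with ReadmeReady, and all values are placeholders:

from langchain.prompts import PromptTemplate

# A trimmed-down, hypothetical version of the template above
template = (
    "You are an AI assistant for a software project called {project_name}. "
    "The {content_type} for the project is located at {repository_url}.\n"
    "Provide the README content for the section with the heading "
    '"{input}" starting with ##{input}.\n'
    "Context:\n{context}\n"
    "Answer in Markdown:\n"
)

prompt = PromptTemplate.from_template(template)
print(prompt.format(
    project_name="<PROJECT_NAME>",
    content_type="docs",
    repository_url="<REPOSITORY_URL>",
    input="Installation",
    context="<retrieved source-code chunks>",
))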
Example Usage: Using the tool is as simple as shown below.
from readme_ready.query import query
from readme_ready.index import index
from readme_ready.types import (
    AutodocReadmeConfig,
    AutodocRepoConfig,
    AutodocUserConfig,
    LLMModels,
)

model = LLMModels.LLAMA2_7B_CHAT_GPTQ  # Choose model from supported models

repo_config = AutodocRepoConfig(
    name="<REPOSITORY_NAME>",  # Replace <REPOSITORY_NAME>
    root="<REPOSITORY_ROOT_DIR_PATH>",  # Replace <REPOSITORY_ROOT_DIR_PATH>
    repository_url="<REPOSITORY_URL>",  # Replace <REPOSITORY_URL>
    output="<OUTPUT_DIR_PATH>",  # Replace <OUTPUT_DIR_PATH>
    llms=[model],
    peft_model_path="<PEFT_MODEL_NAME_OR_PATH>",  # Replace <PEFT_MODEL_NAME_OR_PATH>
    ignore=[
        ".*",
        "*package-lock.json",
        "*package.json",
        "node_modules",
        "*dist*",
        "*build*",
        "*test*",
        "*.svg",
        "*.md",
        "*.mdx",
        "*.toml",
    ],
    file_prompt="",
    folder_prompt="",
    chat_prompt="",
    content_type="docs",
    target_audience="smart developer",
    link_hosted=True,
    priority=None,
    max_concurrent_calls=50,
    add_questions=False,
    device="auto",  # Select device "cpu" or "auto"
)

user_config = AutodocUserConfig(
    llms=[model]
)

readme_config = AutodocReadmeConfig(
    # Set comma-separated list of README headings
    headings="Description,Requirements,Installation,Usage,Contributing,License"
)

index.index(repo_config)
query.generate_readme(repo_config, user_config, readme_config)
Fine-Tuning on your dataset
Parameter-efficient fine-tuning (PEFT) is a technique in natural language processing that adapts pre-trained language models to specific tasks by fine-tuning only a subset of their parameters: most of the model's weights are frozen and only a small set of additional or final parameters is adjusted, conserving computational resources and time. Several PEFT methods exist, such as Adapters and LoRA. You can choose to fine-tune with QLoRA because it significantly reduces the number of trainable parameters while maintaining performance; given limited resources, it adapts models to specific tasks with minimal computational overhead. The PEFT library from Hugging Face supports several such methods, which can be utilized here.
Example Fine-Tuning Code:
import torch
import transformers
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=2,
    lora_alpha=32,
    target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    learning_rate=1e-4,
    fp16=True,
    output_dir="outputs_llama2-7b-chat-gptq",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    report_to="none",
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,  # `data` is your tokenized question/context/answer dataset (see the next section)
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
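Once training finishes, only the LoRA adapter weights need to be saved; the resulting directory can then be supplied as peft_model_path in AutodocRepoConfig so that ReadmeReady loads your adapter on top of the chosen base model. A minimal sketch, with an illustrative output path:

# Save just the LoRA adapter (and tokenizer) -- a few megabytes, not the full base model
adapter_dir = "outputs_llama2-7b-chat-gptq/adapter"
model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)

# Later: AutodocRepoConfig(..., peft_model_path=adapter_dir, ...)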
Build your own documentation dataset
For instance, one way to build a dataset is by scraping several repositories using the GitHub APIs, selected by popularity and star count. You can limit your scope to Python-based repositories; however, this approach is easily adaptable to multiple programming languages, and in scenarios involving various languages, distinct datasets can be created for fine-tuning. A CSV file should be created with three features: questions, context, and answers. Questions should be derived from README file headings and subheadings, identified by the markdown markers # and ##, and answers should correspond to the text under these headings.
Next, the entire source code from each repository should be concatenated into a single string and split into document chunks of a fixed number of tokens (say, 1,000) using LangChain's text splitter. Using the sentence-transformers/all-mpnet-base-v2 sentence encoder, these chunks should be converted into 768-dimensional vectors. Each question should then be converted into a 768-dimensional vector and subjected to a KNN (k = 4) search using HNSW to find the closest matches among the document embeddings, which are stored as the context.
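A rough sketch of assembling such a CSV is shown below. The heading parser and the retrieve_context helper are hypothetical; in practice the context would come from a KNN lookup over the chunk embeddings, as in the retrieval example earlier:

import csv
import re

def readme_to_qa_pairs(readme_text):
    """Split a README into (heading, body) pairs using # / ## markers."""
    parts = re.split(r"^(#{1,2} .+)$", readme_text, flags=re.MULTILINE)
    # re.split keeps the matched headings; pair each heading with the text that follows it
    return [
        (heading.lstrip("#").strip(), body.strip())
        for heading, body in zip(parts[1::2], parts[2::2])
        if body.strip()
    ]

def retrieve_context(question):
    # Hypothetical placeholder: embed the question and run the KNN search
    # from the retrieval sketch above to fetch the nearest source-code chunks.
    return ""

with open("README.md", encoding="utf-8") as f:
    readme_text = f.read()

with open("fine_tuning_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "context", "answer"])
    for question, answer in readme_to_qa_pairs(readme_text):
        writer.writerow([question, retrieve_context(question), answer])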
Feel free to use well-known pre-processing techniques to filter out hashtags, email addresses, usernames, image URLs, and other personally identifiable information before using the data for fine-tuning. Carefully designed prompt templates can also help ensure that the model does not generate personally identifiable data or offensive content.
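As an illustration, a simple regex-based scrubbing pass might look like the sketch below; the patterns are rough approximations, not an exhaustive PII filter, and should be tuned to your data:

import re

# Rough, illustrative patterns -- adjust for your own data before fine-tuning
PATTERNS = [
    r"[\w.+-]+@[\w-]+\.[\w.-]+",                  # email addresses
    r"https?://\S+\.(?:png|jpe?g|gif|svg)\b\S*",  # image URLs
    r"#\w+",                                      # hashtags
    r"@\w+",                                      # usernames / mentions
]

def scrub(text):
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return text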
Conclusion
This application addresses the critical need for generating documentation for code repositories by utilizing multiple LLMs and allowing users to fine-tune these models using LoRA on their own GPUs. While our approach is not designed to surpass state-of-the-art benchmarks, its significance lies in applying NLP techniques to solve a pressing issue faced by the developer community. The tool provides initial documentation suggestions based on the source code, helping developers start the documentation process and letting them modify the generated README files to meet their specific requirements, thereby reducing manual effort. Additionally, the generated README files can be seamlessly converted into standard documentation websites using tools such as MkDocs or Sphinx.
Resources
Curious to see how this works in practice? Check out our linked software and the full paper below. Join us as we explore the next frontier of AI-driven code documentation!