How to Optimize Files for RAG
Structuring Data for RAG
In this lesson

When preparing data for RAG, every detail in document formatting and structure matters. Let’s start with the basics: the file types you’re using.

First, make sure your files are in supported formats. This includes commonly used types like PDFs, Word documents, HTML files, Markdown, and plain text. Botpress Studio supports all of these file formats. In general, avoid file types that can't be easily parsed, such as image-based documents with complex formatting. Without proper text extraction, an LLM can't read these files, which limits your agent’s ability to understand them or respond accurately.
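If you're preparing a whole folder of files, a quick pre-upload check can catch unsupported formats early. The sketch below is only an illustration: the extension list is an assumption mapped from the formats named above, not an official Botpress list, and `split_by_support` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed mapping from the formats named in this lesson to common extensions.
# This is not an official list; check the Botpress documentation before relying on it.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".html", ".md", ".txt"}


def split_by_support(paths):
    """Return (supported, unsupported) lists based on each file's extension."""
    supported, unsupported = [], []
    for path in paths:
        extension = Path(path).suffix.lower()  # case-insensitive comparison
        if extension in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

Anything that lands in the unsupported list, such as a scanned image, would need text extraction or conversion before it's useful to your agent.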

When you upload a file to be used as an agent’s knowledge base in Botpress, we automatically convert the file to Markdown. If you want your agent to give consistently reliable answers, you can upload a raw Markdown file yourself, or use the Rich Text knowledge base type, which is also just Markdown.

Now, beyond file type, the way you organize your document’s content is just as important. Breaking your files into a clear and logical structure—with distinct sections, titles, headings, and subheadings—can greatly enhance your agent’s ability to understand and retrieve information. Pay particular attention to your document’s headings: with a clear information hierarchy designated through headings, an LLM can better categorize information, improving its ability to retrieve relevant knowledge based on user queries.
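One way to sanity-check a document's heading hierarchy before uploading is a small script like the one below. It's a minimal sketch that assumes your file uses Markdown `#` headings; `heading_issues` is a hypothetical helper, and the two rules it checks (start with a level-1 title, don't skip levels) are common conventions rather than Botpress requirements.

```python
import re

# Matches Markdown ATX headings such as "## Pricing".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_issues(markdown):
    """Return a list of human-readable problems with the heading hierarchy."""
    issues = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        if previous_level == 0 and level != 1:
            issues.append(
                f"line {number}: first heading '{title}' is level {level}, "
                "expected a level-1 title"
            )
        elif previous_level != 0 and level > previous_level + 1:
            issues.append(
                f"line {number}: '{title}' jumps from level "
                f"{previous_level} to level {level}"
            )
        previous_level = level
    if previous_level == 0:
        issues.append("no headings found")
    return issues
```

An empty list means the outline reads cleanly from title to subheadings. A reported jump usually points to a section that's missing its parent heading.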

The overarching principle here is to make your document easy to parse. In other words, if you were to hand this document to someone with no context whatsoever about your industry or service, they should still be able to understand the information it contains.

Botpress uses a semantic approach to headings and subheadings, which means that during the vectorization step we pay attention to logical segments of your files that should be grouped together for retrieval. But we rely on your document’s structure to do this accurately: if your title is parsed as part of the main body of your text, that’ll cause problems with your agent’s ability to consistently retrieve information from that section.
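To see why structure matters here, it can help to picture heading-based grouping in code. The sketch below is not Botpress's actual implementation; it's a simplified illustration, with a hypothetical `chunk_by_headings` helper, of how a Markdown file could be split into sections that each carry their heading path. For brevity it doesn't handle `#` characters inside fenced code blocks.

```python
import re

# Matches Markdown ATX headings such as "## Rates".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def chunk_by_headings(markdown):
    """Split Markdown into chunks, each tagged with the headings above it."""
    chunks = []
    path = []  # stack of (level, title) for the current position in the outline
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            # Drop headings at the same or a deeper level than the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

With a document like `"# Shipping\nWe ship worldwide.\n## Rates\nFlat fee of 5 USD."`, the second chunk carries the path `["Shipping", "Rates"]`, so the text stays tied to its context. If the title is written as plain text instead of a heading, it ends up inside the chunk body with an empty path, which is the situation described above.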

In short, a little time spent organizing and standardizing your files goes a long way toward improving your agent’s ability to process and retrieve accurate information.
