LLMs are transforming how we build AI solutions. Newer and better off-the-shelf models are being released all the time.
A question I’m often asked is why should someone opt for a custom LLM instead of a ready-made solution?
If you’re working on an AI project, like building an AI agent or AI chatbot, you might opt to use a customized large language model (LLM).
There are plenty of reasons to use a custom LLM, and plenty of options at your disposal. In this article, I’ll walk you through the different ways to customize an LLM for AI projects.
Why use a custom LLM?
There are several reasons to use a custom LLM:
- You want to reduce costs by focusing on a particular task that is important for your business use case, or minimize latency.
- You might want to keep all the data private, or use your company's in-house LLM.
- You might want to improve the quality of answers for a particular task.
Whatever the reason, customizing your LLM allows you to optimize performance, balancing accuracy, speed, and cost to fit your business needs.
Picking an LLM
LLMs have two qualities that impact AI projects: their size (measured by number of parameters), and quality of responses.
You can think of parameters like neurons in a brain. A bigger brain is often correlated with being smart, but that's not always true. And parts of the brain can be highly optimized for certain tasks like vision.
For AI projects, the size usually affects speed of response, and it greatly affects the cost responses. Projects that require low latency often use smaller models, but at the expense of quality of responses.
What to ask when picking a model
Here's a good list of questions to be able to answer when picking a model:
- Can I use a cloud-based LLM or do I need to host one myself?
- How fast do I need the responses to be?
- How accurate do I need the responses to be?
- How much $$ will my project save and/or generate? Then, what price should it fall below?
- How long do I need my responses to be?
Generally speaking, it's difficult to speed up a powerful model or reduce its costs, and it’s easier to improve a less accurate model.
However, it's much faster to get started with a powerful model, and if it fulfills your project's needs, you may not need as much engineering effort (plus, it’s easier to maintain).
Choosing Between RAG, Fine-Tuning, N-Shot Learning, and Prompt Engineering
There are five general concepts that improve the quality of LLM responses:
- Starting from a pre-trained model
- RAG
- Fine tuning
- N-shot prompting
- Prompt engineering
These aren’t specific to using custom models, but you should consider them regardless, as they work hand-in-hand with each other.
Starting from a model
The first thing you should do is pick a starting model. There are plenty of leaderboards online that compare the different models.
For example:
- Hugging Face maintains a leaderboard for open source models.
- Vellum has an excellent one for the more popular models.
If your company has an in-house model, consider using it to work with your budget and keep data private. If you need to host the model yourself, consider an open-source model.
Fine-tuning
Fine-tuning involves providing examples to your model so that it learns how to do a certain task well. If you want it to excel in speaking about your product, you might provide a swath of examples of your company’s best sales calls.
If the model is open source, ask yourself if your team has enough engineering capacity to fine-tune a model.
If the model is closed source and provided as a service – GPT-4 or Claude – then you can usually have your engineers fine-tune custom models using APIs. The price usually increases substantially through this method, but there is little to no maintenance.
But for many use cases, fine-tuning is not the first step towards optimizing your model.
A great case for fine-tuning is building a knowledge bot for static knowledge. By giving examples of questions and answers, it should be able to answer them in the future without looking up the answer. But it’s not a practical solution for real-time information.
Retrieval-augmented generation
RAG is a fancy name for a simple thing that we've all done in ChatGPT: pasting some text into ChatGPT and asking a question about it.
A typical example is asking if a certain product is in stock on an e-commerce site, and a chatbot looking up the information in a product catalog (instead of the wider internet).
In terms of speed of development, and getting real-time information, RAG is a must-have.
It doesn't usually affect which model you will pick, however nothing stops you from creating an LLM API endpoint that queries information and answers and using this endpoint as though it were its own LLM.
Using RAG for a knowledge-based chatbot is often easier to maintain, as you don't need to fine-tune a model and keep it up to date – which can also reduce costs.
N-shot learning
The fastest way to get started in improving the quality of responses is to provide examples in a single LLM API call.
Zero-shot – giving zero examples of what you're looking for in an answer – is how most of us use ChatGPT. Adding one example (or one-shot) is usually enough to see a substantial improvement in the response quality.
More than one example is considered n-shot. N-shot does not change the model, unlike fine-tuning. You're simply giving examples just before asking for a response, every time you ask a question.
But this strategy can’t be overused: LLM models have a maximum context size, and are priced according to the size of the message. Fine-tuning can remove the need for n-shot examples, but takes more time to get right.
Other prompt engineering techniques
There are other prompt engineering techniques, like chain-of-thought, which force models to think out loud before coming up with an answer.
This increases the quality of response, but at the cost of response length, cost and speed.
My recommendation
While every project will have its own unique needs, I’ll give my two cents on a strong approach.
A good place to start is using an off-the-shelf model that balances speed and quality, like GPT-4o Mini. Start by looking at the quality of the responses, response speed, cost, context window needs, and decide what needs to be improved from there.
Then, with a narrow use case, you can try some simple prompt engineering, followed by RAG, and finally fine-tuning. Every model that goes through these will have performance gains, so it can be tricky to figure out what to use.
Privacy Considerations
In an ideal world, every LLM would be 100% under your own control, and nothing would be exposed anywhere.
Unfortunately, this isn't what we observe in practice – and for very good reasons.
The first is simple: it requires engineering to host and maintain a custom model, which is very costly. When the hosted model experiences down-time, business metrics are affected, so the deployment should be very sturdy.
Another reason is that the industry leaders – like OpenAI, Google and Anthropic – are constantly releasing newer, more capable and cheaper models that render any work on fine-tuning redundant. This has been the case since the release of ChatGPT 3.5 and shows no sign of changing.
If your use case has extremely sensitive data, it makes sense to use a model and optimize it for your use case. If GDPR is top-of-mind, there are plenty of off-the-shelf models that are GDPR compliant.
Building after selecting your LLM
Once you’ve selected an LLM, you can start figuring out how you’ll build and maintain your AI project. As an example, I’ll take the type of project I’m most familiar with: an AI agent or AI chatbot.
You can answer the following questions to scope your project:
- Where would I like my AI agent to live? (Slack, WhatsApp, a website widget, etc.)
- What knowledge should it have, where is that knowledge?
- What capabilities should it have other than knowledge answering, if any?
- Should it activate when something happens somewhere in the business?
Offload engineering to save $
Keeping a lean budget is critical in making your project a reality. One of the ways you can do that is reducing engineering time by decoupling requirements.
Nowadays we have access to low-code solutions like Flutterflow, Shopify, which can be used by traditionally non-technical roles like Product Managers. Chatbots are no exception, and some AI automation platforms even allow you to use your own LLM.
You can instruct engineers to focus on hosting the LLM and setting up with the automation platform. That frees up the business analysts, product managers, and other related roles to build AI agents that satisfy business requirements.
When something additional is required, these platforms generally have a way for engineers to add some code. This way, you keep the advantages of a custom model, and gain flexibility, speed and affordability.
Provide engineering freedom to solve business problems
On the other hand, sometimes business problems are just very hard to solve.
We're talking about fully network-gapped LLM applications, on-device apps, or projects requiring giving chatbots extremely advanced capabilities that are more than syncing data between two platforms.
In those cases, allowing engineers the freedom to use whatever tools they are most comfortable makes sense. This is usually just writing code, and stakeholders simply act as project managers.
Strategic considerations for customizing an LLM
Choosing a custom LLM for your AI project isn't just about picking the best model – it's about making strategic decisions that align with your goals.
Custom models offer flexibility, control, and the potential to optimize for specific tasks, but they also come with added complexity. Begin with an off-the-shelf model, experiment with prompt engineering, and gradually refine from there.
Remember, the right model should fit your business needs, not just your tech stack.
Customizing with powerful platforms
Ready to take your AI project up a notch?
Botpress is a fully extensible and flexible AI agent platform. Our stack allows developers to build chatbots and AI agents for any possible use case.
We feature a robust education platform, Botpress Academy, as well as a detailed YouTube channel. Our Discord hosts over 20,000+ bot builders, so you can always get the support you need.
Start building today. It’s free.
Table of Contents
Stay up to date with the latest on AI agents
Share this on: