Diving deeper into the RAG stack

March 17, 2024

In this article I will share some of my learnings from diving deeper into the RAG stack. At the end of the article I’ll also do my best to compare the storage pricing of Pinecone vs. Azure AI Search in terms of how many 100 page PDFs of a given size each would fit. I chose PDFs because they are a common data format used as the source of information in RAG solutions. We’ll draw some very interesting conclusions from the comparison. You can also skip right to that part if you are not interested in the RAG stack on a deeper technical level.

If you don’t know why RAG could be of value for your business, see how Klarna got its solution to do the work of 700 people by leveraging it.

I was recently in a situation where we proposed a solution for an organization that wanted to reduce the amount of manual work its customer service agents spend answering customers’ questions via email. We proposed building an AI-based chatbot for the company’s website. The chatbot would retrieve information from the company’s own knowledge base and answer customers’ questions with the retrieved information. Leveraging AI, and generative AI in particular, was quite self-evident in this case.

The proposed solution is a very common GenAI use case. It is based on the RAG architecture. Basically, a RAG-based solution combines a large language model (LLM), a text embedding model and a vector database in order to answer users’ questions with information retrieved from the vector database. The beauty of using vectors and an LLM is that it does not matter whether users phrase their questions with the same wording or word order as the knowledge base uses for the topic. With the help of the text embedding model and the LLM, the solution is able to retrieve the relevant information and answer with it, because the retrieval is based on semantic similarity between the user prompt and the knowledge base contents. I had already done a few small demos with RAG, but now I dove deeper into the world of retrieval-augmented generation. Here are my key learnings from it.
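
To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. It is only an illustration of the idea, not the production solution: I’m assuming the sentence-transformers library with the intfloat/multilingual-e5-base checkpoint (which I believe is the model discussed below), a tiny in-memory list in place of a real vector database, and GPT-3.5-turbo called through the OpenAI Python SDK.

```python
# A minimal RAG sketch: embed the question, find the most similar knowledge
# base chunks and let the LLM answer using only the retrieved context.
# Assumes: pip install sentence-transformers openai numpy, and OPENAI_API_KEY set.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("intfloat/multilingual-e5-base")
llm = OpenAI()

# In a real solution these chunks would live in a vector database.
knowledge_base = [
    "Our customer service is open Monday to Friday from 9 to 17.",
    "Returns are free of charge within 30 days of purchase.",
    "Shipping to Finland and Sweden takes 2-4 business days.",
]
# The E5 model card recommends "passage: " and "query: " prefixes.
doc_vectors = embedder.encode(
    [f"passage: {c}" for c in knowledge_base], normalize_embeddings=True
)

def answer(question: str, top_k: int = 2) -> str:
    query_vector = embedder.encode(f"query: {question}", normalize_embeddings=True)
    # Cosine similarity reduces to a dot product on normalized vectors.
    scores = doc_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(knowledge_base[i] for i in best)
    response = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("Can I return a product I bought two weeks ago?"))
```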

The text embedding model and the LLM - Consider the language!

As this solution’s text embedding model we would use Microsoft’s Multilingual-E5-base. The embedding model converts text into floating point vectors. These multidimensional vectors are capable of capturing the context and the semantics of the text input in an efficient manner. One of the reasons for choosing Multilingual-E5-base as the embedding model was that it is multilingual and open-source. Multilingual, so it supports Finnish and Swedish (which were requirements for the solution). Open-source, so that it wouldn’t lock the solution into any specific vendor’s cloud platform. In the world of natural language processing it is not at all self-evident that AI models support smaller languages like Finnish and Swedish. An important point in choosing Multilingual-E5-base was of course also that it performs well, which we determined from Hugging Face’s embedding model leaderboard.
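
As a small sketch of what the embedding model actually produces (assuming the intfloat/multilingual-e5-base checkpoint on Hugging Face and the sentence-transformers library; the example sentences are my own):

```python
# Sketch: embed a Finnish and a Swedish sentence and inspect the result.
# Assumes: pip install sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

texts = [
    "passage: Asiakaspalvelumme palvelee arkisin kello 9-17.",  # Finnish
    "passage: Vår kundtjänst är öppen vardagar klockan 9-17.",  # Swedish
]
embeddings = model.encode(texts)

print(embeddings.shape)  # (2, 768) -> two vectors of 768 floating point dimensions
print(embeddings.dtype)  # float32
```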

Because of the language requirements we basically also had to choose the LLM from OpenAI’s GPT models or Google’s Gemini (Claude wasn’t yet available in the EU but is now). Unfortunately the other closed and open-source LLMs aren’t capable enough in Finnish to be used in customer-facing solutions. In our case we chose GPT-3.5-turbo, mainly because we were more familiar with the GPT models than with Gemini.

Pinecone vs. Azure AI Search

When planning the solution architecture we needed to decide which vector database it would use. The vector database is an essential part of a RAG-based solution, so we didn’t want to pick just any. There are plenty of vector databases on the market; in this article we will concentrate on just two of them.

Pinecone is the most popular vector database on the market. There is a large community that shares information about using it, as well as plenty of tutorials and reference architectures that build on it. The older pod-based Pinecone is available on all three major cloud platforms, and the new (cheaper) serverless offering should soon be generally available as well. In our case we nonetheless had to rely on the older pod-based version in order to comply with GDPR. When the serverless version becomes generally available, the migration should be relatively simple.

The other vector database we considered was Microsoft Azure’s AI Search (yes, the name is a bit confusing, but it is a vector database nonetheless). Microsoft has invested heavily in GenAI and, as said, the RAG architecture is a very common use case for leveraging an LLM. Thus Microsoft has created plenty of reference architectures, code repositories and learning material for Azure AI Search. The example code repositories are extremely easy to kickstart on Azure: you can have a functional RAG solution with a chatbot interface up and running in your own Azure cloud within an hour’s work. The solution will then use Azure AI Search and your own private Azure OpenAI LLM instance. You should think twice, though, before using a ready-made code base from a repository as the starting point for your own production solution; it is better to understand thoroughly what you are building on. In any case, using Azure AI Search means that you are very tied to Microsoft. If you would like to switch cloud service provider some day in the future, it will be very difficult. On the other hand, Azure AI Search is very easy to use in Azure because it is an Azure-native solution. That might save you some headaches.

How to compare?

So here we have two vector databases. One is the market leader and won’t lock you into any specific cloud platform. The other is from probably the biggest player in the GenAI market right now; it should be easy to use in the Azure cloud, but it will also tie us very strongly to Azure as the cloud platform. So how do we choose the right one for us from these two? Since they are both licensed solutions, one thing to consider is of course pricing. Then you might ask: price in relation to what? How many queries per second the database can handle is important, at least if you are expecting high traffic, but in our case we weren’t expecting that much traffic. There are also a lot of additional features related to vectors and vector databases, but since I am fairly new to the concept of vector databases I wanted to compare them on a more conventional level: price in relation to storage size.

I tried to find an easy way, on the Pinecone and Azure websites as well as in other sources, to compare and understand the price in terms of offered storage size. I also wanted to understand on a more concrete level, for example, how many PDF files of a given size each vector database would fit. One problem was that Azure AI Search and Pinecone present the offered storage size on their main pricing pages in very different ways, so making the comparison is not straightforward at all. I decided to give it a try. (After doing the calculations I did, of course, find that Pinecone’s more in-depth documentation does state the storage size of a pod in gigabytes. But at least now I have a more in-depth understanding of the PDF count that would fit in both databases, as well as how the consumed disk space is actually formed. I chose PDF files because they are very often one of the source data formats.)

Let’s first have a look at the technical features of text embeddings.

Dimensions in vector embeddings - More is more

Vector embeddings are arrays of floating point numbers. With the help of vector embeddings the AI model is capable of capturing the semantic meaning and wider context of words. In simple terms, the more dimensions the vectors have, the larger the dataset from which they can identify semantically similar or otherwise relevant information while keeping the results relevant and high quality. On the other hand, the more dimensions the vectors have, the more space they take in the vector database and the slower the queries become. Embedding quality is also affected by other properties of the embedding model, so the dimension count cannot be used as the main metric for the quality of the model. It can, however, be used at least for comparing the suitability of different dimension-size versions of the same model for a particular use case, and for roughly estimating how much storage the embeddings will consume. In our case we use it when estimating the storage size. At any given time a vector store uses only one model and has the same dimension count for all of its vectors.
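
A rough way to see the storage side of that trade-off is a back-of-the-envelope sketch, assuming each dimension is stored as a 4-byte float32 and ignoring indexes and metadata:

```python
# Sketch: raw storage consumed by one million vectors at different dimension counts,
# assuming 4 bytes (float32) per dimension and no index or metadata overhead.
BYTES_PER_FLOAT32 = 4
VECTOR_COUNT = 1_000_000

for dims in (384, 768, 1536):
    gigabytes = dims * BYTES_PER_FLOAT32 * VECTOR_COUNT / 1024**3
    print(f"{dims:>5} dims -> {gigabytes:.2f} GiB for one million vectors")

# 384 dims -> 1.43 GiB, 768 dims -> 2.86 GiB, 1536 dims -> 5.72 GiB
```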

Metadata

Metadata is stored alongside each vector and can contain, for example, the original text that the vector embedding was created from, so it can be returned with the search results.
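
As a sketch of what that means in practice, a single record in the vector database could look roughly like this (the field names are illustrative, and the exact shape depends on the database):

```python
# Sketch: one vector database record. The original text chunk is stored in the
# metadata so it can be handed to the LLM at query time.
record = {
    "id": "faq-pdf-page-12-chunk-0",
    "values": [0.0132, -0.0871, 0.0455],  # truncated for readability; really 768 floats
    "metadata": {
        "text": "Returns are free of charge within 30 days of purchase.",
        "source": "faq.pdf",
        "page": 12,
        "language": "en",
    },
}
```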

Token input limit

The input limit of the Multilingual-E5-base text embedding model is 512 tokens. In our case the input limit had to be large enough to capture the context when working with larger text documents. If the input limit is very small and you are dealing with longer texts, it is probable that your retrieval results will be missing important information.
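
A quick way to check whether a piece of text fits the limit is to run it through the embedding model’s own tokenizer. A sketch, assuming the intfloat/multilingual-e5-base checkpoint and the transformers library:

```python
# Sketch: count how many tokens a text consumes with the embedding model's own
# tokenizer and check it against the 512 token input limit.
# Assumes: pip install transformers sentencepiece.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
MAX_TOKENS = 512

text = "passage: " + "A long paragraph from a customer service PDF... " * 100
token_count = len(tokenizer.encode(text))

print(f"{token_count} tokens, fits within the limit: {token_count <= MAX_TOKENS}")
# Anything beyond the limit would simply be truncated away by the model.
```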

Chunking the text input - Strategies

Because the embedding model has a token input limit, you need to chunk the text dataset into smaller pieces. Choosing and implementing a text chunking strategy is very important for the quality of your RAG solution’s answers. Often a simple and workable chunking strategy is to use a fixed chunk size with a fixed amount of overlap between consecutive chunks. On the other hand, it might make sense to chunk the PDFs based on more semantically defined criteria. To make, for example, a Q&A solution work more accurately, you could generate questions from each chunk with an LLM and include those questions in the vector embeddings.
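
Below is a minimal sketch of the fixed-size, overlapping strategy, splitting on the embedding model’s own tokens. The chunk size and overlap values are illustrative, and the chunk size is kept a bit under 512 to leave room for the tokenizer’s special tokens:

```python
# Sketch: split a long document into fixed-size token chunks with a fixed overlap,
# using the embedding model's own tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")

def chunk_text(text: str, chunk_size: int = 480, overlap: int = 60) -> list[str]:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break  # the last window already reached the end of the document
    return chunks

document = "Text extracted from a 100 page PDF about customer service. " * 500
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(tokenizer.encode(chunk, add_special_tokens=False)), "tokens")
```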

Another reason to consider the chunk size is the LLM’s context window. If your solution feeds the LLM, for example, the top 5 retrieved results from the vector database, the LLM’s context window needs to be large enough to digest them together with the rest of the prompt. Choosing the right chunking strategy can depend on many things, like the contents of your data or how long you expect your users’ prompts to be.
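
That check can be done with a token counter before calling the model. A sketch, assuming the tiktoken library and a 4,096 token context window (the smallest GPT-3.5-turbo variant; larger variants exist):

```python
# Sketch: verify that the top-k retrieved chunks plus the prompt template and the
# user's question fit into the LLM's context window, leaving room for the answer.
# Assumes: pip install tiktoken.
import tiktoken

CONTEXT_WINDOW = 4096      # assumed smallest GPT-3.5-turbo context window
RESERVED_FOR_ANSWER = 512  # leave head-room for the generated answer

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_in_context(chunks: list[str], question: str, template: str) -> bool:
    prompt = template.format(context="\n".join(chunks), question=question)
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + RESERVED_FOR_ANSWER <= CONTEXT_WINDOW

template = "Answer using only this context:\n{context}\n\nQuestion: {question}"
top_chunks = ["Returns are free of charge within 30 days of purchase."] * 5
print(fits_in_context(top_chunks, "Can I still return my order?", template))  # True
```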

Price vs. Storage Size: A Comparison between Pinecone and Azure AI Search

On its main pricing page Pinecone presents the pricing related to offered storage size in a different manner from Azure. The smallest storage-optimized Pinecone pod has storage for 5 million vectors if no metadata is included.

Azure, on the other hand, presents its price in relation to storage size in the more traditional way, in gigabytes. For a vector database, pricing per vector count can make the pricing easier for the developer to grasp, though that assumes the developer is already familiar with the RAG stack. Expressing Azure AI Search’s storage in terms of vectors, however, requires working through extra layers of abstraction and conversion first. Thus comparing the storage sizes of the two vector databases is difficult.

Below we aim to compare the pricing of the two database plans in terms of how many 100 page PDF files each would fit. The estimated vector count is of course the same for all the theoretical PDFs, so the accuracy of the estimated number of vectors per PDF is not that important.

  • OpenAI says that there are approximately 800 tokens on a page.
  • The token input limit for the multilingual-e5-base model is 512 tokens.
  • To keep things simple let’s say there are exactly 512 tokens per PDF page.
  • Using a very naive chunking strategy, each PDF could then be split into non-overlapping chunks of the maximum input size of multilingual-e5-base, which is 512 tokens.
  • Each 100 page PDF would convert to 100 pieces of 512 token chunks.
  • So using the multilingual-e5-base model a 100 page PDF would convert to 100 vector embeddings.
  • The calculation in your case is likely not this straightforward. For example, we are not including metadata and the chunking strategy is oversimplified. But in this example we will keep it that way; the aim is rather to get a concrete understanding of just how much space the PDF files would actually consume and how the embeddings are formed.
Azure AI Search 100 page PDF count:
  • A dimension is a float value.
  • A float is 32 bits long.
  • 32 bits take 4 bytes of memory.
  • The multilingual-e5-base model creates vectors of 768 dimensions (which happens to be the same dimension count as in Pinecone’s pod example pricing).
  • So each vector embedding would consume 768 * 4B = 3KB of disk space.
  • If one 768 dimension vector is 3KB, then the disk space a 100 page PDF would consume can be calculated from its vector embedding count above: 100 * 3KB = 300KB
  • The Azure AI Search Basic tier storage size is 2GB.
  • 2GB / 300KB = ~6990

Without metadata and with a naive chunking strategy, Azure AI Search Basic could store roughly 6990 of these 100 page PDFs.

Pinecone Storage Optimized Pod 100 page PDF count:
  • Pinecone says the pod has space for 5 million 768 dimension vectors.
  • 5 000 000 / 100 = 50 000

Without metadata and with a naive chunking strategy, the smallest Pinecone storage-optimized pod could store roughly 50 000 of these 100 page PDFs.

Vector database plan              Offered storage      ~100 page PDF count
Azure AI Search Basic             2 GB                 ~6990
Pinecone storage-optimized pod    5 million vectors    ~50 000

This is a huge, over sevenfold difference in the cheapest-tier storage size between the two compared vector database plans, while the monthly cost is almost the same. In this light, if you choose your vector database purely by the offered storage size, the Pinecone pod will certainly be the right option.

It is surprising that Pinecone doesn’t also state the offered storage size in gigabytes on its main pricing page to make the storage size easier to understand. Later on I did find it deeper in their documentation: the offered storage should be 15GB. That is quite in line with my calculations, which convert to 768 * 4B * 5M = ~14.3GB.
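
The whole back-of-the-envelope estimate above also fits into a few lines of arithmetic; a sketch using the same assumptions (512 tokens per page, one chunk per page, no metadata, float32 values):

```python
# Sketch: reproduce the PDF capacity estimates from the comparison above.
DIMENSIONS = 768               # multilingual-e5-base output size
BYTES_PER_FLOAT32 = 4
CHUNKS_PER_PDF = 100           # one 512 token chunk per page of a 100 page PDF

bytes_per_vector = DIMENSIONS * BYTES_PER_FLOAT32    # 3072 B = 3 KB
bytes_per_pdf = CHUNKS_PER_PDF * bytes_per_vector    # ~300 KB

azure_basic_bytes = 2 * 1024**3                      # 2 GB
print("Azure AI Search Basic:", azure_basic_bytes // bytes_per_pdf, "PDFs")            # 6990

pinecone_pod_vectors = 5_000_000
print("Pinecone pod:", pinecone_pod_vectors // CHUNKS_PER_PDF, "PDFs")                 # 50000
print("Pod size:", round(pinecone_pod_vectors * bytes_per_vector / 1024**3, 1), "GB")  # 14.3
```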

Things will however get more complicated again if you need more storage than the smallest instances examined in this article provide. Storage size should also not be the only criterion for choosing one vector database over another. As an additional advantage for Pinecone, you might be interested to know that the Pinecone serverless version should soonish be generally available (it is currently available in some AWS US regions). It should be much cheaper than the pods.

Besides the ones mentioned in this article, there are many additional ways to improve the quality of your RAG solution. For example, fine-tuning the text embedding model can improve quality, and the same applies to the LLM. After fine-tuning a smaller LLM you might even reach better quality for your context than with a larger, non-fine-tuned LLM.

Thanks for reading! I hope you learned something with me today. Please get in touch if you need help building a RAG solution.

