▮ Using Vector Stores
The combination of vector databases and LLMs, as in retrieval-augmented generation (RAG), has had a massive impact on the AI industry.
When adopting these technologies, how you develop and maintain your own vector databases becomes critically important.
So for this post, I'd like to share three best practices for developing your own vector databases.
▮ Select Appropriate Embedding Models
Whether you are vectorizing your private data or the user's natural language queries, you'll need an embedding model. The following image shows a workflow for deciding which embedding model is appropriate for your use case.
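Whichever embedding model you end up choosing, the surrounding workflow is the same: embed your documents, store the vectors, embed the query with the same model, and retrieve by similarity. Here is a minimal sketch of that flow using a toy keyword-count "embedding" as a stand-in for a real model (the function `toy_embed` and its keyword list are purely illustrative, not a real embedding API):

```python
import math

def toy_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: counts a few hand-picked
    # keywords. A real system would call an actual model here.
    keywords = ["cat", "dog", "car", "engine"]
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Minimal in-memory "vector store": embed each document once up front.
docs = ["the cat chased the dog", "the car engine stalled"]
index = [(d, toy_embed(d)) for d in docs]

# Embed the query with the same model, then retrieve by cosine similarity.
query = "my dog and cat"
best = max(index, key=lambda item: cosine(item[1], toy_embed(query)))
print(best[0])  # → "the cat chased the dog"
```

In a real deployment the embedding call and the store would be replaced by your chosen model and vector database, but the shape of the pipeline stays the same.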
▮ Ensure Embedding Spaces Are The Same
You should, in most cases, use the same embedding model for both your private data and the user's natural language queries. If you use different models, the same words may be embedded into different locations, making semantic search nearly impossible.
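The failure mode is easy to demonstrate. In the sketch below, a toy deterministic "embedding" hashes each word with a model-specific seed (the seed stands in for a real model's weights; nothing here is a real embedding API). Embedding identical text with the same "model" gives a cosine similarity of 1.0, while mixing two "models" scatters the same words to unrelated locations:

```python
import hashlib
import math

def embed(text: str, seed: str, dim: int = 8) -> list[float]:
    # Toy deterministic "embedding": hash each word with a model-specific
    # seed and accumulate into a fixed-size vector. The seed plays the
    # role of a real model's weights; purely illustrative.
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256((seed + word).encode()).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0 - 0.5
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

doc = "the quick brown fox"
query = "the quick brown fox"

# Same "model" on both sides: identical text → similarity 1.0.
same = cosine(embed(doc, seed="model-A"), embed(query, seed="model-A"))

# Different "models": the very same words land in different locations,
# so the similarity signal is lost.
mixed = cosine(embed(doc, seed="model-A"), embed(query, seed="model-B"))

print(same, mixed)  # same ≈ 1.0; mixed is far from 1.0
```

Real embedding models behave the same way: their vector spaces are not aligned with each other, so documents and queries must go through the same model.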
▮ Determine Fitting Chunking Strategy
When you embed your private data into a vector, you’ll often need to “chunk” the data into smaller pieces and embed each chunk individually. The size of each chunk depends on the specific task you’re trying to achieve.
The chunk size can be a single sentence, a single paragraph, or even multiple paragraphs. The smaller the chunk gets, the more the embeddings focus on specific meanings. The larger it gets, the broader the themes the embeddings can capture.
How you chunk and vectorize your data will have a significant effect on the retrieval process, so it is important to experiment with different chunk sizes to find what fits your use case.
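To make the trade-off concrete, here is a minimal sketch of two chunking strategies applied to the same document. The splitters are deliberately naive (splitting on periods and on blank lines); a real pipeline would use a proper sentence splitter or a library text splitter, but the contrast in granularity is the same:

```python
def chunk_sentences(text: str) -> list[str]:
    # Naive sentence chunker: split on periods. Fine-grained chunks that
    # each carry one specific meaning.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def chunk_paragraphs(text: str) -> list[str]:
    # Paragraph chunker: split on blank lines. Coarser chunks that each
    # carry a broader theme.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = (
    "Vector stores index embeddings. They support similarity search.\n\n"
    "Chunk size is a tuning knob. Smaller chunks capture narrow meanings."
)

print(chunk_sentences(doc))   # 4 sentence-level chunks
print(chunk_paragraphs(doc))  # 2 paragraph-level chunks
```

Each strategy produces a different set of retrieval units from the same text, which is exactly why it is worth embedding a sample of your data at several granularities and measuring which one retrieves best for your queries.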