For those of you who have studied journalism or expository writing, there’s a familiar expression: “Tell them what you’re going to tell them, tell it to them, and tell them what you’ve told them.” This axiom has stood the test of time because it prepares the reader, or more specifically your audience, to best understand what you’re saying. Well-written documents follow this model.

Oddly enough, a new technique called Retrieval Augmented Generation (RAG) does nearly the same thing when conveying information to ChatGPT, so that it understands and can best respond to the questions you ask it.
Today, most applications use content and context prompting to elicit a response from ChatGPT. Depending upon which variant of ChatGPT is used, there’s a limit of roughly 1,000 to 16,000 tokens of content and context that can be included in a prompt. For those unfamiliar, 1,000 tokens is approximately 750 words, or about three pages of double-spaced text. However, ChatGPT can on occasion hallucinate: it makes up information or is simply incorrect.
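To make those limits concrete, here is a minimal token-counting sketch. It assumes the open-source tiktoken package is installed; the cl100k_base encoding is the one used by recent ChatGPT models, and the exact count varies by tokenizer.

```python
# pip install tiktoken   (assumed for this sketch)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent ChatGPT models

text = ("Tell them what you're going to tell them, tell it to them, "
        "and tell them what you've told them.")
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# Everything sent to the model (instructions, retrieved context, and the question)
# must fit within the model's token limit.
```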
RAG offers a promising alternative to the complexity of fine-tuning and textual pre-training, and it addresses issues such as hallucinations and inaccuracies by automatically including relevant information in the conversation with ChatGPT.
How Does RAG Work?
RAG combines retrieval and generation of content to make ChatGPT more accurate. It works by encoding information into embeddings (numeric representations of data that language models understand), which are stored in a vector database (as opposed to a SQL database). When a question is asked, the question is also encoded into an embedding and used to search the vector database. This is called a semantic, or similarity, search: the database finds the content that is most semantically relevant to the question. The resulting data (the relevant portions of documents) is returned by the query and included in the conversation with ChatGPT (or another language model).
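As a rough sketch of that flow (not RowBotAI’s actual implementation), the example below assumes the open-source sentence-transformers and numpy packages and keeps the “database” as a simple in-memory array for clarity:

```python
# pip install sentence-transformers numpy   (assumed for this sketch)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

# 1. Encode reference documents into embeddings (normally stored in a vector database).
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by phone Monday through Friday, 9am to 5pm.",
    "Shipping to Canada takes five to seven business days.",
]
doc_embeddings = model.encode(documents)

# 2. Encode the user's question into the same embedding space.
question = "How long do I have to return an item?"
query_embedding = model.encode(question)

# 3. Similarity search: cosine similarity between the question and every document.
scores = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
best = int(np.argmax(scores))

# 4. Include the most relevant passage in the prompt sent to the chat model.
prompt = (
    f"Answer using only the context below.\n\n"
    f"Context: {documents[best]}\n\n"
    f"Question: {question}"
)
print(prompt)
```

A production system would store the document embeddings in a dedicated vector database, but the sequence of steps (encode, search, retrieve, prompt) is the same.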
Understanding Vector Databases
Let’s start with an explanation of vector databases and vector search. Most people understand a traditional SQL database: data is stored in two-dimensional tables (rows and columns), like the rows and columns of a spreadsheet.
Vector databases, however, are completely different. Unlike the flat, two-dimensional tables of a traditional SQL database, a vector database organizes information across many more dimensions, which makes it far more flexible. Think for a moment about how ChatGPT works. The user provides information to the Large Language Model in the form of text, and it responds with a text answer that sounds knowledgeable and empathetic. But the Large Language Model is not thinking, nor is it sentient; it is predicting the letters, words, phrases, sentences, and paragraphs that follow from the context it was provided.
Each word (more precisely, each token) that ChatGPT works with is associated with a set of numbers: a vector of hundreds or even thousands of floating-point values. When a request for information occurs, the text is searched not by the syntax of language, but through the association of all the numbers relevant to the context of the prompt. The floating-point numbers derived from the prompt are compared with the floating-point numbers representing the stored content. These vectors of floating-point numbers are referred to as embeddings.
Although this abstract idea may be cumbersome to conceptualize, the embeddings that represent the content are compared and contrasted with the embeddings generated from the prompt. Importantly, this comparison does not occur in a linear fashion, like scanning thousands of records from A to Z in a SQL database. Instead, the associations occur multidimensionally, more akin to three-dimensional chess.
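As a toy illustration of that comparison, similarity is typically measured as the cosine of the angle between the prompt’s embedding and each stored embedding. The three-element vectors below are made-up values for readability; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
prompt_embedding = np.array([0.12, -0.45, 0.88])
doc_a = np.array([0.10, -0.40, 0.90])   # semantically close to the prompt
doc_b = np.array([-0.70, 0.52, -0.05])  # unrelated content

print(cosine_similarity(prompt_embedding, doc_a))  # high score: retrieved
print(cosine_similarity(prompt_embedding, doc_b))  # low score: ignored
```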
Why Vector Databases?
Vector databases are very good at performing similarity searches across data encoded as embeddings (vectors). These searches are very fast, and they allow the encoded information to be retrieved as text and included in the conversation with the bot.
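As one hedged example of how such a search might look in practice (using the open-source FAISS index library, not necessarily what RowBotAI uses), a large set of embeddings can be indexed and the closest matches returned almost instantly:

```python
# pip install faiss-cpu numpy   (assumed for this sketch)
import faiss
import numpy as np

dim = 384                          # dimensionality of the embedding model
rng = np.random.default_rng(0)

# Stand-in for a corpus of 100,000 document embeddings (FAISS expects float32).
doc_embeddings = rng.random((100_000, dim), dtype=np.float32)
index = faiss.IndexFlatL2(dim)     # exact nearest-neighbor index on L2 distance
index.add(doc_embeddings)

# Stand-in for one encoded question; retrieve the 5 closest documents.
query = rng.random((1, dim), dtype=np.float32)
distances, ids = index.search(query, 5)
print(ids[0])   # positions of the most relevant documents in the corpus
```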
Furthermore, since the vector database does not have ChatGPT’s token limits, the amount of project-specific reference data held in the vector database is nearly unlimited, allowing powerful applications to be built on extensive corpora of custom data sources.
In many ways, the vector database tells the audience what it is about to tell them. It sits in front of the content that is delivered to ChatGPT’s Large Language Model and narrows the scope of the response so that ChatGPT can answer questions more and more accurately. It knows in advance what the subject matter pertains to (i.e., the associated embeddings) and can respond with increasingly accurate answers. Done iteratively, it is a powerful response mechanism.
How Does RAG Help?
So now, let’s get back to our original axiom for expository writing. If we told an audience a story without telling them what we’re going to tell them, telling it to them, and telling them what we told them, many people would not be able to tell you what the story was about. It’s fair to say that any respondent’s accuracy would be lower than when the proper model is used. Similarly, the proper use of RAG is a technique RowBotAI employs to dramatically reduce hallucinations. RAG is utilized iteratively as a low-cost fine-tuning mechanism, increasing the accuracy of responses, that is, the rate at which ChatGPT predicts the next word or sentence correctly.