Building blocks of an LLM Application

Retrieval augmented generation (RAG)

Retrieval augmented generation, or RAG, allows large language models to handle a broader range of queries without the need for exponentially large training data sets.

RAG can be compared to the collaborative effort between an architect and an interior designer when building a house. The architect's role is to research and select appropriate materials, understand the landscape, and study various architectural designs and building regulations. They lay the foundations and create a structural blueprint.
Once the architect has established this framework, the interior designer steps in to add colors, textures, and decorations, arranging them in a way that is both beautiful and functional. This ensures that the house reflects the unique style and preferences of the homeowner.

In the context of AI, RAG operates in a similar manner, with the retrieval component gathering accurate and relevant information, like the architect.

And the generation component creatively weaves this information into coherent and engaging responses, similar to the interior designer's role in home design.

Why do we need RAG?

So one of the limitations of standard LLMs is their reliance on the knowledge they were trained on. That knowledge might be outdated, and it falls short in tasks requiring specific information.

RAG addresses this by retrieving up-to-date information from external sources, thus enhancing the accuracy and relevance of the information it provides.

RAG can significantly improve performance by retrieving documents or data that contain the exact information needed, something that pure generative models might struggle with. RAG models retrieve documents and pass them to a sequence-to-sequence model, such as an encoder-decoder architecture.

So let's understand the RAG components in more detail. The first one is the retriever. This component is responsible for sourcing relevant information from a large corpus or database. It acts like a search engine, scanning through vast amounts of data to find content that is pertinent to the query at hand.

It uses retrieval techniques such as vector similarity search, keyword-based search, document retrieval, or structured database queries to fetch data. We will learn about a few of them in our upcoming lessons. The goal is to provide the generation system with contextually relevant, accurate, and up-to-date information that might not be present in the model's pre-trained knowledge.

The second is the ranker. The ranker's primary role is to evaluate and prioritize the information retrieved by the retrieval system. It sifts through the various pieces of data or documents that the retrieval system has gathered and ranks them based on their relevance and usefulness in answering the given query.

By effectively ranking the retrieved information, the ranker ensures that the generation system receives the most pertinent and high-quality input. This step is crucial for maintaining the accuracy and relevance of the responses generated by the model. The third component is the generator. This is a language model whose job is to generate human-like text based on the input it receives.

It employs generative models to craft human-like text that is contextually relevant, coherent, and informative. It ensures that the final response is not only factually accurate and relevant but also coherent, fluent, and styled in a way that is typical of human language.
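To make the division of labor concrete, here is a minimal Python sketch of how the three components could be wired together. The function names and the crude word-overlap scoring are illustrative assumptions, not any particular framework's API.

```python
from typing import List

def retrieve(query: str, corpus: List[str], k: int = 10) -> List[str]:
    """Retriever: source candidate passages from the corpus.
    Crude word-overlap scoring stands in for vector or keyword search."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def rank(query: str, candidates: List[str], k: int = 3) -> List[str]:
    """Ranker: reorder the candidates and keep only the best few.
    A real ranker would use a learned relevance model."""
    q_terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda doc: len(q_terms & set(doc.lower().split())),
                  reverse=True)[:k]

def generate(query: str, context: List[str]) -> str:
    """Generator: an LLM would turn the query plus context into prose.
    Here we only assemble the prompt that would be sent to it."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

corpus = ["The cheetah is the fastest land animal.",
          "The peregrine falcon is the fastest bird in a dive."]
question = "What is the fastest animal?"
print(generate(question, rank(question, retrieve(question, corpus))))
```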

Moving on, there are two ways to implement RAG. The first one is the RAG sequence model.

So think of the RAG sequence model like a novelist who writes a book chapter by chapter. Similarly, for each input query, like a chapter topic, the model retrieves a set of relevant documents or information. It then considers all these documents together to generate a single cohesive response, that is, the entire chapter that reflects the combined information.

Imagine you are writing a chapter about the French Revolution. You gather several books and articles on the topic, read them all, and then write the entire chapter. This is how the RAG sequence model works.

The next one is the RAG token model. The RAG token model is like a journalist who writes an article, considering each piece of information or quote individually as they construct the story. For each part of the response, like each sentence or even each word, the model retrieves relevant documents.

The response is constructed incrementally with each part reflecting the information from the documents retrieved for that specific part. Imagine you are writing an article about a complex topic, such as climate change. For each point you make or each sentence you write, you look up specific articles or data to ensure that part of your story is accurate and relevant.

Your final article is a combination of these individual pieces of researched information.

What is the scope of retrieval in those two cases?

So the RAG sequence model considers the entire input query at once for retrieval, while the RAG token model does this at a more granular level, potentially leading to more varied and specific information integration.

On the topic of response generation, RAG sequence is more about synthesizing a holistic response from a batch of information, whereas RAG token constructs the response in a more piecemeal fashion, considering different sources for different parts of the response.

When to use RAG token and RAG Sequence?

Again, use RAG token when the task requires integrating highly specific and varied information from multiple sources into different parts of the response, but opt for RAG sequence when the task demands a more holistic and unified approach, with responses that need to maintain thematic or contextual consistency across the entire text.

RAG Pipeline

We will now see a basic retrieval augmented generation pipeline, which is used in natural language processing.

The RAG architecture combines a retrieval-based component with a generative model to enhance the generation of text.

Let us see a step by step explanation of this process. So the first phase is ingestion, where documents are ingested into the system. Documents are nothing but the original data corpus, which can consist of a vast collection of text.

The documents are then broken down into smaller, more manageable pieces, often referred to as chunks. This is typically done to improve the efficiency of processing and to focus on relevant sections of the text. Next, each chunk is transformed into a mathematical representation called an embedding.

These embeddings capture the semantic information of the text and allow comparisons to be made in a numerical space. The embeddings are then indexed in a database that facilitates quick retrieval. The index is a data structure that allows the system to find and retrieve embeddings efficiently when a query is made.
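As a concrete illustration of the ingestion phase, here is a hedged Python sketch. The fixed-size character chunker and the seeded-random `embed` stub are placeholders; a real system would use a trained embedding model and a purpose-built vector index.

```python
import numpy as np
from typing import List

def chunk(text: str, size: int = 200) -> List[str]:
    """Split a document into fixed-size character chunks (a simple chunking strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a real pipeline would call an embedding model.
    A seeded random projection keeps the example self-contained and runnable."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Illustrative placeholder documents
documents = ["...full text of one policy document...",
             "...full text of another document..."]

index = []  # the "index": embedding vectors paired with their source chunks
for doc in documents:
    for piece in chunk(doc):
        index.append((embed(piece), piece))
```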

In the next phase, the system uses the indexed data to find relevant information. A query is a user's input or question that needs to be answered. The system uses this query to search through the indexed embeddings from the ingestion phase to find the most relevant chunks.

From the retrieval process, the system selects the top-k most relevant results, which are the chunks that are most likely to contain information relevant to the query.
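Continuing the ingestion sketch above (it reuses the placeholder `embed` function and the `index` list defined there), top-k retrieval can be sketched as a cosine-similarity search; because the placeholder vectors are normalised, a dot product is enough.

```python
import numpy as np

def top_k(query: str, index, k: int = 3):
    """Return the k chunks whose embeddings are closest to the query embedding."""
    q = embed(query)                                  # same placeholder embed() as above
    scored = [(float(np.dot(q, vec)), piece) for vec, piece in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

results = top_k("What is the leave policy?", index)   # list of (score, chunk) pairs
```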

The third and final phase is generation. This is where the system generates a response based on the information retrieved.

The selected chunks from the retrieval phase are fed into the generative model. The generative model, often a neural network like a transformer, uses the context provided by these top-k chunks to generate a coherent and contextually relevant response to the query. So the RAG architecture effectively leverages both the retrieval of relevant information from a large corpus and the generative capabilities of modern neural networks to provide answers that are informed by a wide array of information.

This approach is particularly useful in scenarios where generative models need to be supplemented with specific information that may not be present in their training data.
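A hedged sketch of the generation step: the retrieved chunks are assembled into a prompt and handed to the LLM. The chunk texts and the `call_llm` function below are hypothetical stand-ins; only the prompt assembly is shown concretely.

```python
from typing import List

def build_prompt(query: str, chunks: List[str]) -> str:
    """Place the top-k retrieved chunks in front of the question as grounding context."""
    context = "\n\n".join(chunks)
    return ("Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API is actually used."""
    raise NotImplementedError

prompt = build_prompt("What is the leave policy?",
                      ["Employees accrue 20 days of paid leave per year.",   # illustrative chunks
                       "Unused leave may be carried over for one year."])
# answer = call_llm(prompt)
```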

Real life example of RAG applications

So now, we are going to explore a practical application of the retrieval augmented generation model, or RAG. In this real-world scenario, we will walk through a question-and-answer-based chatbot session.

The RAG model enhances the capabilities of language models by integrating external knowledge. So let us now delve into how this process unfolds.

So imagine an employee asking a simple question: "What is the policy?"

The question is clear.

But for a language model to provide a relevant answer, it needs context. And this is where the RAG model starts to shine.

So in addition to this initial prompt, the model considers the entire chat history.

This allows the model to understand the conversation's context, tailoring its response to be more accurate and specific. The prompt and chat history are then combined to form what we call an enhanced prompt. Think of it as a more informed question that provides a clearer picture for the language model to understand.

The enhanced prompt is then passed through an embedding model, which transforms the text into a mathematical vector. These vectors are representations that capture the semantic meaning of the prompt in a high-dimensional space.

Next, we perform an embedding similarity search. Here, the vector of our enhanced prompt is compared against a database of other vectors, and this is how the model finds the most relevant information for the query.

The search results in vector ID matches, which are essentially references to documents that have the closest semantic similarity to our query.

In many cases, these documents contain private content, such as corporate policies or specific employee benefits information stored securely in a relational database. The system then retrieves the document associated with the matching vector ID. This process ensures that the response is both accurate and customized to the user's needs.

So once we have the relevant documents, they are used to augment the initial prompt. This step ensures that the model's response is not only relevant but also contains validated information. This augmented prompt is then fed into a large language model, which generates the response.
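Below is a hedged sketch of how the enhanced and augmented prompts in this flow might be assembled; the formatting, variable names, and example chat history are illustrative, not a specific product's implementation.

```python
from typing import List

def enhance_prompt(question: str, chat_history: List[str]) -> str:
    """Combine the new question with prior turns so the model sees the conversation's context."""
    history = "\n".join(chat_history)
    return f"Conversation so far:\n{history}\n\nCurrent question: {question}"

def augment_prompt(enhanced: str, retrieved_docs: List[str]) -> str:
    """Attach the documents returned by the embedding similarity search."""
    context = "\n\n".join(retrieved_docs)
    return (f"{enhanced}\n\nRelevant excerpts:\n{context}\n\n"
            "Answer using only the excerpts above.")

enhanced = enhance_prompt("What is the policy?",
                          ["User: I have a question about parental leave.",
                           "Bot: Sure, what would you like to know?"])
# retrieved_docs would come from the vector ID matches described above
```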

Even though RAG has access to the most up-to-date database and gives accurate answers, it does not fully eliminate the risk of hallucinations, for various reasons. First, the retrieval step could simply fail to retrieve sufficient or relevant context. Second, the response generated by a RAG application may not be supported by the retrieved context, but instead be mostly influenced by the LLM and its training data.

And finally, a RAG application may retrieve relevant pieces of context and leverage them to produce a grounded response, yet still fail to address the user's query. For these cases, the RAG triad comes in handy. So the first one in the RAG triad is context relevance.

So context relevance refers to how well the RAG responses are aligned with the context of the conversation. This includes understanding the ongoing dialogue, the user's intent, and any background information or conversational history. A RAG application with high context relevance can maintain a coherent and meaningful conversation over multiple turns, accurately interpreting the user's queries.

Context relevance is often assessed through user feedback or by analyzing conversation logs. The second one is groundedness. It indicates the chatbot's ability to provide responses that are not only plausible but also accurate and based on reliable information. It can be evaluated by comparing the RAG responses to trusted sources of information or through expert assessment, especially for domain-specific applications.

The third part of the RAG triad is answer relevance. Answer relevance refers to the degree to which the RAG responses directly address the user's queries. It's about providing responses that are not only contextually appropriate but also specifically answer or relate to the user's questions or statements. Overall, by leveraging these three parameters over the query, context, and response, one can minimize hallucinations and make RAG-based applications more reliable.
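As a rough illustration only, the sketch below scores each leg of the triad with a placeholder `similarity` function based on word overlap; in practice these checks are typically done with embeddings or an LLM-as-judge rather than anything this simple.

```python
def similarity(a: str, b: str) -> float:
    """Placeholder semantic similarity in [0, 1]; a real evaluator would use
    embeddings or an LLM judge instead of word overlap."""
    a_terms, b_terms = set(a.lower().split()), set(b.lower().split())
    return len(a_terms & b_terms) / max(len(a_terms | b_terms), 1)

def rag_triad(query: str, context: str, response: str) -> dict:
    return {
        # Context relevance: is the retrieved context on-topic for the query?
        "context_relevance": similarity(query, context),
        # Groundedness: is the response supported by the retrieved context?
        "groundedness": similarity(response, context),
        # Answer relevance: does the response actually address the query?
        "answer_relevance": similarity(query, response),
    }
```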

LLMs vs LLMs with RAG

LLMs without RAG typically do not use a vector database in their standard operation. Instead, they rely on internal knowledge learned during pre-training on a large corpus of text. Once these models are trained, they don't automatically know about the latest happenings or the information locked away in private documents that weren't part of their training materials.

The model's parameters encode information and patterns from the data it was trained on, which it then uses to generate responses to queries based on its understanding acquired during this pre-training. This can be the case with or without fine-tuning.

LLMs with RAG augment this process by using an external database, which is a vector database. We have seen RAG in our earlier lesson. Vector databases are a core component of RAG-based LLMs. They are a type of database optimized for storing and querying vectors rather than the rows and tables of a traditional relational database.

What is a vector database?

Vector databases, and the concept of indexing and searching vectors, which are high-dimensional data, have indeed been used before LLMs.

Search engines have used vector space models for decades to represent text documents for similarity searches and ranking. E-commerce and content platforms have used vector representations of items and user preferences to make recommendations by searching for nearest neighbors in the vector space. They have also been used in computer vision, bioinformatics, and anomaly detection.

Let's start with understanding vectors. So data is represented as vectors in a multi-dimensional space. Each vector can be a point in this space, representing a complex data item like a video, image, text, or a set of features extracted by a machine learning model. In the context of machine learning, and particularly with large ML models, vectors are often used to represent embeddings, which are essentially high-dimensional vectors.

Vector embeddings are generated by deep learning models to capture semantic meaning of words, sentences, documents, images, or other data types. And they can be used to compute similarity between items efficiently.
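For example, similarity between two embeddings is commonly computed as cosine similarity; the three-dimensional vectors below are made up purely for illustration, since real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: close to 1.0 means very similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

king  = np.array([0.8, 0.6, 0.1])   # toy 3-dimensional "embeddings"
queen = np.array([0.7, 0.7, 0.1])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: semantically close
print(cosine_similarity(king, apple))  # low: semantically distant
```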

How are these vectors represented?

So the structure of a vector database is fundamentally different from that of traditional relational databases. Vector databases are optimized for multi-dimensional spaces, where the relationship between data points is not linear or tabular but is instead based on distances and similarities in a high-dimensional vector space.

Vector databases are particularly adept at handling operations that involve searching by meaning and nearest-neighbor queries in high-dimensional spaces. Many vector databases use a distributed architecture to handle the storage and computational demands of large-scale, high-dimensional data.

Keyword search evaluates documents based on the presence and frequency of the query terms. BM25 is one such commonly used keyword search algorithm. So let's look at an example here.
Given the query "What is the fastest animal?", this is how a keyword search would count the presence and frequency and rank the results. A simple keyword search would rate Response1 as the most relevant to the user's query because of the two common words, 'the' and 'fastest.'
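A minimal sketch of that counting behaviour is shown below; the two candidate responses are made-up stand-ins for the ones in the example.

```python
def keyword_score(query: str, doc: str) -> int:
    """Count how often the query's terms appear in the document."""
    doc_terms = doc.lower().split()
    return sum(doc_terms.count(term) for term in query.lower().split())

query = "What is the fastest animal?"
responses = {
    "Response1": "The fastest car in the world is the SSC Tuatara.",
    "Response2": "Cheetahs can sprint at about 110 km per hour.",
}
for name, text in responses.items():
    print(name, keyword_score(query, text))
# Response1 scores higher on shared words like "the" and "fastest",
# even though Response2 is the semantically better answer.
```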

Semantic search is search by meaning. It is the name given to search algorithms in which retrieval is done by understanding the semantics of the text rather than by matching keywords.

The two ways you can leverage semantic search are dense retrieval and reranking.

Dense retrieval uses a text embedding in order to search for documents that are similar to a query. It uses a text embedding model to turn words into vectors, which are lists of numbers. We have seen embeddings in our earlier lessons.

Reranking takes a set of items, like search results, documents, or images, and reorders them to improve the relevance or quality of the results.

Let's revisit embeddings quickly. So in this image, you can see different words represented in two dimensions. You can locate each word by its coordinates.

For example, 'car' here is at 1 and 3 on the x- and y-axes respectively. Embeddings are a fundamental concept in machine learning and natural language processing that involves mapping words, phrases, sentences, or even entire documents, as well as other types of data like users, products, and graphs, to vectors of real numbers. Embeddings can encompass thousands of numbers.

Each number represents a piece of information about the meaning contained in the piece of text. Text embeddings give you the ability to turn unstructured text data into a structured form. With embeddings, you can compare two or more pieces of text, be it a single word, sentences, paragraphs, or even longer documents.

And since these are sets of numbers, the ways you can process and extract insights from them are limited only by your imagination. The main purpose of embeddings is to capture the essence of the data in a lower-dimensional space that a computer can process, while maintaining the semantic relationships and meaning. So how does dense retrieval use these embeddings?

So let's say you have a user query and a database of Wikipedia articles. Dense retrieval consists of the following. It will find the embedding vector corresponding to the query. Then it will find the embedding vectors corresponding to each of the responses.

After that, it retrieves the response vectors that are closest to the query vector in the embedding space. Here, documents one, three, and five are closest to the query vector. So dense retrieval relies on embeddings of both the queries and the documents.

This enables the retrieval system to understand and match based on the contextual similarities between queries and documents.
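In code, those steps might look like the hedged sketch below, where `embed` stands in for whatever text-embedding model is actually used (with a real model, the semantically closest documents would rank first).

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder for a real text-embedding model (seeded so the example runs)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

query = "What is the fastest animal?"
documents = ["Cheetahs are the fastest land animals.",
             "Paris is the capital of France.",
             "Peregrine falcons dive faster than any other bird."]

q_vec = embed(query)                       # step 1: embed the query
doc_vecs = [embed(d) for d in documents]   # step 2: embed each document
# step 3: rank documents by similarity of their vectors to the query vector
ranked = sorted(zip(documents, doc_vecs),
                key=lambda pair: float(np.dot(q_vec, pair[1])),
                reverse=True)
for doc, _ in ranked:
    print(doc)
```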

Now, we will see reranking. Reranking, in the context of information retrieval and machine learning, is a process that takes an initially retrieved set of items, like search results, documents, or images, and reorders them to improve the relevance or quality of the results.

This is often done after an initial ranking has been performed by a more basic or faster retrieval method, such as dense retrieval. So how is it done? Reranking assigns a relevance score to query-response pairs.

The relevance score is high when the response is likely to be a correct response to the query, and low otherwise. The way to train this model is to give it positive pairs, that is, a query and its correct response, along with several negative pairs, a query and a wrong response, and train the model to score positive pairs highly and negative pairs low.
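A hedged sketch of reranking at inference time is shown below; the `relevance_score` function is a crude word-overlap placeholder for the trained relevance model described above.

```python
from typing import List, Tuple

def relevance_score(query: str, response: str) -> float:
    """Placeholder relevance model. A trained reranker would produce this score;
    here we fall back to crude term overlap purely for illustration."""
    q, r = set(query.lower().split()), set(response.lower().split())
    return len(q & r) / max(len(q), 1)

def rerank(query: str, candidates: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Score every (query, candidate) pair and reorder candidates by that score."""
    scored = [(relevance_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Training data for the real scoring model consists of positive pairs
# (query, correct response) and negative pairs (query, wrong response);
# the model learns to score the former high and the latter low.
```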