Summarize private documents using RAG, LangChain, and LLMs

Published
12 min read

Imagine it's your first day at an exciting new job at a fast-growing tech company, Innovatech. You're filled with a mix of anticipation and nerves, eager to make a great first impression and contribute to your team. As you find your way to your desk, decorated with a welcoming note and some company swag, you can't help but feel a surge of pride. This is the moment you've been working towards, and it's finally here.

Your manager, Alex, greets you with a warm smile. "Welcome aboard! We're thrilled to have you with us. I have sent you a folder. Inside this folder, you'll find everything you need to get up to speed on our company policies, culture, and the projects your team is working on. Please keep them private."

You thank Alex and open the folder, only to be greeted by a mountain of documents. Manuals, guidelines, technical documents, project summaries, and more await you. It's overwhelming. You think to yourself, "How am I supposed to absorb all of this information in a short time? And the documents are private, so I cannot just upload them to GPT to get a summary."

"Why not create an agent to read and summarize them for you, and then you can just ask it?" your colleague, Jordan, suggests with an encouraging grin. You're intrigued, but uncertain; the world of large language models (LLMs) is one that you've only scratched the surface of.

Sensing your hesitation, Jordan elaborates, "Imagine having a personal assistant who's not only exceptionally fast at reading, but can also understand and condense the information into easy-to-digest summaries. That's what an LLM can do for you, especially when enhanced with LangChain and Retrieval-Augmented Generation (RAG) technology."

"But, how do I get started? And how long will it take to set up something like that?" you ask. Jordan says, "Let's dive into a project that will not only help you tackle this immediate challenge but also equip you with a skillset that's becoming indispensable in this field."

This project walks you through the world of LLMs and RAG, from the basics of what these technologies are to building a practical application that can read and summarize documents for you. By the end of this tutorial, you will have a working tool capable of processing the pile of documents on your desk, allowing you to focus on making meaningful contributions to your projects sooner.

What is RAG

One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications that can answer questions about specific source information. These applications use a technique known as Retrieval-Augmented Generation (RAG). RAG is a technique for augmenting LLM knowledge with additional data, which can be your own data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to public data up to the specific point in time that they were trained. If you want to build AI applications that can reason about private data or data introduced after a model’s cut-off date, you must augment the knowledge of the model with the specific information that it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as RAG.

LangChain has a number of components that are designed to help build Q&A applications and, more generally, RAG applications.

RAG Architecture

A typical RAG application has two main components:

  • Indexing: A pipeline for ingesting data from a source and indexing it. This usually happens offline.

  • Retrieval and generation: The actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like this:

  • Indexing

  1. Load: First, you must load your data. This is done with DocumentLoaders.

  2. Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, because large chunks are harder to search and won’t fit in a model’s finite context window.

  3. Store: You need somewhere to store and index your splits so that they can later be searched. This is often done using a VectorStore and Embeddings model.

  • Retrieval and generation

  1. Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

  2. Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.
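The two stages above can be sketched without any libraries. The following toy example is only an illustration: it replaces the embedding model with simple word-overlap scoring and stops at building the prompt instead of calling an LLM. The helper names (`tokenize`, `build_index`, `retrieve`, `build_prompt`) and the two sample sentences are made up for this sketch and are not LangChain APIs or text from the lab document.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(chunks):
    """Indexing: store each chunk next to its token set."""
    return [(chunk, tokenize(chunk)) for chunk in chunks]

def retrieve(index, query, k=1):
    """Retrieval: return the k chunks that share the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_tokens), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(context_chunks, question):
    """Generation input: place the retrieved text in front of the question."""
    context = "\n".join(context_chunks)
    return f"{context}\n\nQuestion: {question}"

# Invented sample chunks, used only to exercise the helpers
chunks = [
    "The mobile phone policy requires responsible use during work hours.",
    "Smoking is only permitted in designated areas.",
]
question = "What is the mobile phone policy?"
index = build_index(chunks)
top = retrieve(index, question)
prompt = build_prompt(top, question)
print(prompt)
```

In the real application, the index is a vector store, the scoring is done on embeddings, and the finished prompt is sent to the model.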

Setup

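For this lab, you use the following libraries:

  • ibm-watsonx-ai and langchain-ibm, for calling LLMs hosted on watsonx.ai.

  • langchain, for loading and splitting documents and for building the retrieval chains.

  • huggingface, huggingface-hub, and sentence-transformers, for the embedding model.

  • chromadb, for the vector store.

  • wget, for downloading the sample document.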

Installing required libraries

The following required libraries are not preinstalled in the Labs environment. You must run the following cell to install them:

Note: The library versions are pinned here so that the lab keeps working even after newer releases of these libraries are published. It's recommended that you pin the versions as well.

This might take approximately 3-5 minutes.

Because %%capture suppresses the output of the cell, you won't see the installation progress. Once the installation is done, a number appears beside the cell.

%%capture
!pip install "ibm-watsonx-ai==0.2.6"
!pip install "langchain==0.1.16"
!pip install "langchain-ibm==0.1.4"
!pip install "huggingface==0.0.1"
!pip install "huggingface-hub==0.21.4"
!pip install "sentence-transformers==2.5.1"
!pip install "chromadb==0.4.24"
!pip install "wget==3.2"

After the installation of libraries is completed, restart your kernel. You can do that by clicking the Restart the kernel icon.
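If you want to confirm that the pinned versions are the ones actually installed after the restart, a small check like the following can help. This is a sketch rather than part of the original lab: `installed_version` and `check_pins` are hypothetical helper names, and the version numbers are copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the install cell
PINNED = {
    "ibm-watsonx-ai": "0.2.6",
    "langchain": "0.1.16",
    "langchain-ibm": "0.1.4",
    "huggingface": "0.0.1",
    "huggingface-hub": "0.21.4",
    "sentence-transformers": "2.5.1",
    "chromadb": "0.4.24",
    "wget": "3.2",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def check_pins(pins):
    """Map each package whose installed version differs to (expected, installed)."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

# Print one line per package that is missing or has a different version
for package, (expected, found) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {found}")
```

If nothing is printed, every pinned package is installed at the expected version.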

Import libraries

You can use this section to suppress warnings generated by your code:

# Replace warnings.warn with a function that does nothing
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from langchain.document_loaders import TextLoader

from langchain.text_splitter import CharacterTextSplitter

from langchain.vectorstores import Chroma

from langchain.embeddings import HuggingFaceEmbeddings

from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

from langchain.chains import ConversationalRetrievalChain

from langchain.memory import ConversationBufferMemory

from ibm_watsonx_ai.foundation_models import Model

from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams

from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes, DecodingMethods

from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM

import wget

Preprocessing

Load the document

The document, which is provided in a TXT format, outlines some company policies and serves as an example data set for the project.

This is the load step in Indexing.

Here you download the companyPolicies.txt file so that it can be loaded in the next step.

filename = 'companyPolicies.txt'

url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/6JDbUb_L3egv_eOkouY71A.txt'

#Use wget to download the file

wget.download(url, out=filename)

print('file downloaded')

After the file is downloaded and imported into this lab environment, you can use the following code to look at the document.

with open(filename, 'r') as file:
    # Read the contents of the file
    contents = file.read()
    print(contents)

Splitting the document into chunks

In this step, you are splitting the document into chunks, which is basically the split process in Indexing.

LangChain is used to split the document and create chunks. It helps you divide a long story (document) into smaller parts, which are called chunks, so that it's easier to handle.

In the splitting process, the text is first cut at every occurrence of the split separator, and the resulting pieces are then merged back together for as long as the merged chunk stays within a set number of characters. This number is called the chunk size, and it is set to 1000 in this project. Because cuts happen only at separators, the chunks are not all exactly 1000 characters long: most are shorter, and a passage that contains no separator stays in one piece even if it is longer than the chunk size. CharacterTextSplitter uses \n\n as the default split separator. You can change it by passing the separator parameter to CharacterTextSplitter; for example, separator="\n".
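To see why the chunk lengths vary, the following simplified stand-in mimics separator-based splitting. It is a sketch written for this explanation, not LangChain's implementation: `split_text` is a hypothetical helper, and it ignores chunk_overlap entirely.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Cut on the separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece
            # that is kept whole because it contains no separator.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
chunks = split_text(sample, chunk_size=10)
print([len(chunk) for chunk in chunks])  # → [10, 12, 2]
```

The first two pieces fit together in one chunk of exactly 10 characters, the 12-character piece exceeds the chunk size but cannot be cut, and the last piece ends up alone.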

loader = TextLoader(filename)

documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

texts = text_splitter.split_documents(documents)

print(len(texts))

From the output of print, you see that the document has been split into 16 chunks.

Embedding and storing

This step is the embed and store processes in Indexing.

In this step, you're taking the pieces of the story, your "chunks," converting the text into numbers, and making them easier for your computer to understand and remember by using a process called "embedding."

Think of embedding like giving each chunk its own special code. This code helps the computer quickly find and recognize each chunk later on.

You do this embedding process during a phase called "Indexing." The reason is to make sure that when you need to find specific information or details within your larger document, the computer can do so swiftly and accurately.
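The idea of turning text into numbers and then comparing those numbers can be shown with a toy example. Real embedding models produce dense, learned vectors; the word-count vectors below are only an illustration, and `embed`, `cosine_similarity`, and the tiny vocabulary are assumptions made up for this sketch.

```python
import math
import re
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: one word count per vocabulary entry."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocabulary = ["mobile", "phone", "smoking", "policy"]

# "Index" two short chunks by storing their vectors
chunk_vectors = {
    "mobile phone policy": embed("mobile phone policy", vocabulary),
    "smoking policy": embed("smoking policy", vocabulary),
}

# Embed the query the same way and pick the closest chunk
query_vector = embed("Can I use my phone?", vocabulary)
best = max(chunk_vectors, key=lambda chunk: cosine_similarity(chunk_vectors[chunk], query_vector))
print(best)  # → mobile phone policy
```

A vector store such as Chroma does the same kind of nearest-vector lookup, but over the vectors produced by the embedding model.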

The following code creates a default embedding model from Hugging Face, embeds the chunks, and ingests them into Chroma DB.

When ingestion is complete, the code prints "document ingested".

embeddings = HuggingFaceEmbeddings()

docsearch = Chroma.from_documents(texts, embeddings) # store the embedding in docsearch using Chromadb

print('document ingested')

LLM model construction

In this section, you build an LLM model from IBM watsonx.ai.

First, define a model ID to choose the model that you want to use. There are many other model options. Refer to Foundation Models for other model options. This tutorial uses the FLAN_UL2 model as an example.

model_id = 'google/flan-ul2'

Define parameters for the model.

The decoding method is set to greedy to get a deterministic output.

For other commonly used parameters, you can refer to Foundation model parameters: decoding and stopping criteria.
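The difference between greedy decoding and temperature-controlled sampling can be illustrated with a few lines of plain Python. This is a conceptual sketch, not watsonx.ai code: the function names and the three scores are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy_pick(logits):
    """Greedy decoding: always choose the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

print(greedy_pick(logits))                    # same index on every run
print(softmax_with_temperature(logits, 0.5))  # sharper: the top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety when sampling
```

Greedy decoding never draws from these probabilities, which is why it gives the same output for the same prompt.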

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 130,  # this controls the minimum number of tokens in the generated output
    GenParams.MAX_NEW_TOKENS: 256,  # this controls the maximum number of tokens in the generated output
    GenParams.TEMPERATURE: 0.5  # this controls the randomness or creativity of the model's responses when sampling is used
}

Define credentials and project_id, which are necessary parameters to successfully run LLMs from watsonx.ai.

Keep credentials and project_id as they are now, so that you do not need to create your own keys to run models. This lets you run the model inside this lab environment. However, if you want to run the model locally, refer to this tutorial for creating your own keys.

credentials = { "url": "https://us-south.ml.cloud.ibm.com" }

project_id = "skills-network"

Pass the parameters, credentials, and project ID to the model.

model = Model(

model_id=model_id,

params=parameters,

credentials=credentials,

project_id=project_id

)

Build a model called flan_ul2_llm from watsonx.ai.

flan_ul2_llm = WatsonxLLM(model=model)

This completes the LLM part of the Retrieval task.

Integrating LangChain

LangChain has a number of components that are designed to help retrieve information from the document and build question-answering applications, which helps you complete the retrieve part of the Retrieval task.

In the following steps, you create a simple Q&A application over the document source using LangChain's RetrievalQA.

Then, you ask the query "what is mobile policy?"

qa = RetrievalQA.from_chain_type(llm=flan_ul2_llm, chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=False)

query = "what is mobile policy?"

qa.invoke(query)

Now, try to ask a more high-level question.

qa = RetrievalQA.from_chain_type(llm=flan_ul2_llm, chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=False)

query = "Can you summarize the document for me?"

qa.invoke(query)

If FLAN_UL2 struggles with this kind of high-level request, try another model, LLAMA_3_70B_INSTRUCT. To do so, you need to repeat the model construction.

model_id = 'meta-llama/llama-3-70b-instruct'

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MAX_NEW_TOKENS: 256,  # this controls the maximum number of tokens in the generated output
    GenParams.TEMPERATURE: 0.5  # this controls the randomness or creativity of the model's responses
}

credentials = { "url": "https://us-south.ml.cloud.ibm.com" }

project_id = "skills-network"

model = Model( model_id=model_id, params=parameters, credentials=credentials, project_id=project_id )

llama_3_llm = WatsonxLLM(model=model)

Try the same query again on this model.

qa = RetrievalQA.from_chain_type(llm=llama_3_llm, chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=False)

query = "Can you summarize the document for me?"

qa.invoke(query)

Dive deeper

This section dives deeper into how you can improve this application. You might want to ask "How to add the prompt in retrieval using LangChain?"

You use prompts to guide the responses from an LLM the way that you want. For instance, if the LLM is uncertain about an answer, you instruct it to simply state "I do not know" instead of attempting to generate a speculative response.

Let's see an example.

qa = RetrievalQA.from_chain_type(llm=flan_ul2_llm, chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=False)

query = "Can I eat in company vehicles?"

qa.invoke(query)

Using prompt template

In the following code, you create a prompt template using PromptTemplate.

context and question are keywords in the RetrievalQA, so LangChain can automatically recognize them as document content and query.

prompt_template = """Use the information from the document to answer the question at the end. If you don't know the answer, just say that you don't know, definitely do not try to make up an answer.

{context}

Question: {question} """

PROMPT = PromptTemplate( template=prompt_template, input_variables=["context", "question"] )

chain_type_kwargs = {"prompt": PROMPT}
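To make the role of the two placeholders concrete, the following sketch fills a similar template by hand with plain string formatting. It only approximates what the "stuff" chain type does, which is to place all retrieved chunks into {context}; the exact way LangChain joins and formats the chunks may differ. `fill_prompt` is a hypothetical helper, and the chunk texts are placeholders rather than content from the document.

```python
template = """Use the information from the document to answer the question at the end. If you don't know the answer, just say that you don't know.

{context}

Question: {question} """

# Placeholder chunks standing in for what the retriever would return
retrieved_chunks = [
    "Placeholder text of the first retrieved chunk.",
    "Placeholder text of the second retrieved chunk.",
]

def fill_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

prompt = fill_prompt(template, retrieved_chunks, "Can I eat in company vehicles?")
print(prompt)
```

The printed text is roughly what the model receives: the instruction, the retrieved chunks, and then the question.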

You can now ask the same question again, which has no answer in the document.

qa = RetrievalQA.from_chain_type(llm=llama_3_llm, chain_type="stuff", retriever=docsearch.as_retriever(), chain_type_kwargs=chain_type_kwargs, return_source_documents=False)

query = "Can I eat in company vehicles?"

qa.invoke(query)

Make the conversation have memory

Do you want your conversations with an LLM to be more like a dialogue with a friend who remembers what you talked about last time? An LLM that retains the memory of your previous exchanges builds a more coherent and contextually rich conversation.

Take a look at a situation in which an LLM does not have memory.

You start a new query, "What I cannot do in it?". You do not specify what "it" is. In this case, "it" means "company vehicles" if you refer to the last query.

query = "What I cannot do in it?"

qa.invoke(query)

To give the LLM memory, you introduce the ConversationBufferMemory class from LangChain.

memory = ConversationBufferMemory(memory_key="chat_history")

Create a ConversationalRetrievalChain to retrieve information and talk with the LLM.

qa = ConversationalRetrievalChain.from_llm(llm=llama_3_llm, chain_type="stuff", retriever=docsearch.as_retriever(), memory = memory, get_chat_history=lambda h : h, return_source_documents=False)

Create a history list to store the chat history.

history = []

query = "What is mobile policy?"

result = qa.invoke({"question": query, "chat_history": history})

print(result["answer"])

Append the previous query and answer to the history.

history.append((query, result["answer"]))

query = "List points in it?"

result = qa.invoke({"question": query, "chat_history": history})

print(result["answer"])

Append the previous query and answer to the chat history again.

history.append((query, result["answer"]))

query = "What is the aim of it?"

result = qa.invoke({"question": query, "chat_history": history})

print(result["answer"])
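Conceptually, a conversation buffer stores each question and answer and feeds the transcript back into the next prompt, so that a word like "it" can be resolved. The sketch below shows that idea in plain Python. `BufferMemory` and `build_prompt` are hypothetical stand-ins, not LangChain classes, and the answer text is a placeholder. LangChain's ConversationalRetrievalChain does more than this; for example, it first rewrites a follow-up question into a standalone question before retrieving.

```python
class BufferMemory:
    """Hypothetical stand-in for a conversation buffer: stores every turn in order."""

    def __init__(self):
        self.turns = []  # list of (question, answer) tuples

    def save(self, question, answer):
        """Record one finished exchange."""
        self.turns.append((question, answer))

    def as_text(self):
        """Render the stored turns as a transcript for the next prompt."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

def build_prompt(memory, question):
    """Prepend the transcript so that the model can resolve words like 'it'."""
    history = memory.as_text()
    if not history:
        return f"Question: {question}"
    return f"Chat history:\n{history}\n\nQuestion: {question}"

memory = BufferMemory()
first = build_prompt(memory, "What is mobile policy?")
memory.save("What is mobile policy?", "<answer from the model>")
second = build_prompt(memory, "List points in it?")
print(second)
```

The second prompt carries the first exchange along, which is the information the model needs to understand the follow-up.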

Wrap up and make it an agent

The following code defines a function to make an agent, which can retrieve information from the document and has the conversation memory.

def qa():
    # Build a fresh memory and chain for each chat session
    memory = ConversationBufferMemory(memory_key="chat_history")
    chain = ConversationalRetrievalChain.from_llm(llm=llama_3_llm, chain_type="stuff", retriever=docsearch.as_retriever(), memory=memory, get_chat_history=lambda h: h, return_source_documents=False)
    history = []
    while True:
        query = input("Question: ")
        # Stop the loop when the user types an exit word
        if query.lower() in ["quit", "exit", "bye"]:
            print("Answer: Goodbye!")
            break
        result = chain.invoke({"question": query, "chat_history": history})
        history.append((query, result["answer"]))
        print("Answer: ", result["answer"])
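Because the loop above reads from the keyboard, it is awkward to try out without typing. One option, shown here as a sketch under assumptions, is to pass the input, output, and answering functions in as parameters so that the loop can be driven by a script. `chat_loop` and the stub answer function are hypothetical; in the real application, the answer function would call the retrieval chain.

```python
def chat_loop(answer_fn, read_input=input, write=print):
    """Run the question loop; answer_fn(query, history) returns the answer text."""
    history = []
    while True:
        query = read_input("Question: ")
        if query.lower() in ["quit", "exit", "bye"]:
            write("Answer: Goodbye!")
            break
        answer = answer_fn(query, history)
        history.append((query, answer))
        write("Answer: ", answer)
    return history

# Scripted run with a stub answer function instead of a real chain
scripted = iter(["What is mobile policy?", "List points in it?", "quit"])
printed = []
history = chat_loop(
    answer_fn=lambda query, history: f"(stub answer #{len(history) + 1})",
    read_input=lambda prompt: next(scripted),
    write=lambda *parts: printed.append(" ".join(parts)),
)
print(history)
```

With the default arguments, the same function behaves like the interactive loop and reads questions from the keyboard.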

I am Shahriyar, currently a student and working on Quantum Computing, AI, ML, Cloud, and DevOps.