Note: This post was co-authored by Bertrand Thomas.
To help users find relevant technical documentation quickly and efficiently, our teams have been exploring a range of technologies and approaches. One of these is embedding-based context retrieval, which draws on multiple data sources to return more accurate, relevant results. To optimize our implementation, we investigated the relationship between the length of user queries and how documents are segmented, or “chunked.” Our findings offer insights for anyone looking to improve information retrieval over large text databases. In this post, we briefly introduce the approaches we’ve employed and detail the key takeaways from the experiment we conducted.
When it comes to software documentation, it can be difficult to ensure users readily find the answers they need, and the problem only grows as documentation repositories become more expansive. To address it, many organizations are turning to AI and retrieval-augmented generation (RAG) systems, which enhance the accuracy and reliability of generative AI models by retrieving information from different data sources.
Within these environments, approaches like embedding-based context retrieval have become increasingly important. In these systems, document content is parsed into smaller chunks, and those chunks are transformed into vectors so that the chunks most relevant to a user's query can be retrieved by similarity comparison. In our experience, query length and the chunking strategy employed can have a significant impact on this retrieval process. In general, short chunking strategies tend to be sub-optimal for longer queries, while long chunking strategies tend to be sub-optimal for short, keyword-based queries.
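As a rough illustration of this idea (and not our production implementation), the sketch below embeds a query and a handful of document chunks with a toy bag-of-words "embedding" and ranks the chunks by cosine similarity. The embed function and the example chunks are simplified, made-up stand-ins for a real embedding model and vector store, not quotes from the Rally documentation.

import numpy as np

def embed(text, vocab):
    # Toy bag-of-words "embedding" used purely for illustration; a real system
    # would call an embedding model instead of counting words.
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

def top_k_chunks(query, chunks, k=2):
    # Build a shared vocabulary so query and chunk vectors are comparable.
    vocab = sorted({w for text in chunks + [query] for w in text.lower().split()})
    chunk_vecs = np.array([embed(c, vocab) for c in chunks])
    query_vec = embed(query, vocab)
    # Cosine similarity between the query vector and every chunk vector.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    # Return the k most similar chunks, most relevant first.
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

chunks = [
    "To create a user story, click New Story on the backlog page.",
    "Portfolio items roll up estimates from their child stories.",
    "Capacity planning compares team load against the iteration plan.",
]
print(top_k_chunks("how do I create a user story", chunks, k=1))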
We wanted to look at this relationship more quantitatively, so we set up a study. Using Broadcom's comprehensive knowledge base, available at techdocs.broadcom.com, we explored the effectiveness of embedding-based context retrieval for answering user queries. A key aspect of this study involved determining the optimal chunking strategy for different types of documents, since the chunking strategy is a key hyperparameter in RAG systems.
Below is a histogram illustrating the word count of articles in the Rally-specific section of our technical documentation. This graphic provides an illustration of the challenges at hand. The range and frequency of document lengths underscore the variability of the documents we are working with. This uneven distribution of document lengths informed our decision-making process, particularly in designing our document chunking strategies for this experiment. These strategies are intended to handle the spectrum of document sizes, from concise articles to expansive manuals, in order to enhance the accuracy and relevance of retrieved information.
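For readers who want to run a similar distribution check over their own corpus, a minimal sketch with pandas and matplotlib might look like the following; the articles series is a small placeholder, not our actual dataset.

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder corpus; in practice this would hold the parsed documentation articles.
articles = pd.Series([
    "Short release note about a bug fix.",
    "A longer how-to article walking through a workflow step by step ...",
    "An extensive administration guide covering permissions, integrations, and reporting ...",
])

# Count words per article and plot the distribution.
word_counts = articles.str.split().str.len()
word_counts.plot(kind="hist", bins=20, title="Article word counts")
plt.xlabel("Words per article")
plt.ylabel("Number of articles")
plt.show()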
To empirically test these chunking strategies, we leveraged our RAG system, Rally software's helpbot. To answer a user's query, the helpbot parses the Rally technical documentation using a predefined chunking strategy, retrieves relevant context with Google's Vertex AI vector search service, and then augments the user's original prompt with the (hopefully) relevant content.
The helpbot served as a practical platform for simulating real-world user queries, allowing us to evaluate the performance of each chunking strategy in a dynamic, interactive environment.
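At a high level, the helpbot's request path follows the standard RAG pattern: retrieve, augment, generate. The sketch below is a simplified outline rather than the production code; retrieve_chunks and generate_answer are hypothetical stand-ins for the Vertex AI vector search lookup and the downstream language model call.

def answer_query(query, retrieve_chunks, generate_answer, k=4):
    # 1. Retrieve the k document chunks most similar to the query
    #    (in our system this step is backed by Vertex AI vector search).
    context_chunks = retrieve_chunks(query, k=k)

    # 2. Augment the user's original prompt with the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_chunks) + "\n\n"
        "Question: " + query
    )

    # 3. Ask the language model to answer from the augmented prompt.
    return generate_answer(prompt)

# Toy stand-ins so the flow can be exercised end to end.
fake_retrieve = lambda query, k: ["A made-up documentation chunk."][:k]
fake_generate = lambda prompt: "A made-up answer based on the retrieved context."
print(answer_query("how do I create a story?", fake_retrieve, fake_generate))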
By analyzing the performance of these chunking strategies in relation to query length, we hoped to gain insights into the optimal structuring of document data for embedding-based retrieval systems. Plus, we wanted to help guide the practical application of these systems in order to respond effectively to user queries.
Utilizing Broadcom's technical documentation for the Rally product, we assessed the efficacy of four chunking strategies:
- chunk-small: documents split into small, fact-oriented chunks
- chunk-medium: documents split into medium-sized chunks
- chunk-none: documents left whole, with no chunking applied
- chunk-medium-none: a hybrid of the chunk-medium and chunk-none approaches
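To make these strategies concrete, here is a minimal sketch of a word-window chunker; the word-count thresholds below are arbitrary placeholders chosen for illustration, not the exact values we used.

def chunk_document(text, chunk_words=None, overlap=0):
    # chunk_words=None corresponds to "chunk-none": return the whole document as a single chunk.
    if chunk_words is None:
        return [text]
    words = text.split()
    step = max(chunk_words - overlap, 1)
    # Slide a window of chunk_words words across the document.
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), step)
    ]

doc = "word " * 500                                   # a 500-word dummy document
small_chunks = chunk_document(doc, chunk_words=50)    # "chunk-small" style
medium_chunks = chunk_document(doc, chunk_words=200)  # "chunk-medium" style
whole_doc = chunk_document(doc)                       # "chunk-none" style
print(len(small_chunks), len(medium_chunks), len(whole_doc))  # 10 3 1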
To evaluate strategy performance, we used a set of “golden questions” and corresponding “golden answers.” These golden questions range from concise, single-word queries to more complex ones comprising up to 30 words. Each question is paired with a golden answer, which we, as the human experts, consider the ideal or correct response to the query.
To compare the effectiveness of different chunking strategies, we focus on the alignment between the answers generated by each strategy and these golden answers. Specifically, we employ a metric known as the F1 overlap score, which measures the accuracy and relevance of the answers provided by our system in response to the golden questions.
The F1 score is a widely recognized metric in information retrieval and natural language processing. This score balances precision and recall. Precision assesses the proportion of relevant answers among all answers retrieved, while recall evaluates the proportion of relevant answers that were successfully retrieved.
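As a quick worked example with made-up token counts: if a generated answer contains five unique tokens, the golden answer contains four, and four tokens overlap, the score works out as follows.

# Hypothetical token counts, used only to illustrate the formula.
precision = 4 / 5    # overlapping tokens / tokens in the generated answer
recall = 4 / 4       # overlapping tokens / tokens in the golden answer
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.889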
Our approach to calculating the F1 overlap score involves tokenizing both the golden answers and the answers generated by our system under the different chunking strategies. For this tokenization, we used the “bert-base-uncased” tokenizer from Hugging Face's Transformers library. This process converts each answer into a set of tokens representing its essential elements. Here's a snippet of this code:
# install the transformers library if needed (e.g., in a notebook)
# !pip install transformers

import pandas as pd
from transformers import BertTokenizer


def calculate_f1_overlap(df, ground_truth_col, llm_columns):
    """
    Calculate the F1 overlap score between ground truth answers and answers from language model(s).

    This function computes F1 scores to assess the token overlap between answers in a ground truth
    column and one or more language model columns in a DataFrame. It uses the BERT tokenizer to
    tokenize the answers before calculating precision, recall, and F1 scores.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the columns with answers.
        ground_truth_col (str): The name of the column that contains the ground truth answers.
        llm_columns (list of str): Column names for answers generated by language models.

    Returns:
        dict: A dictionary where keys are language model column names and values are the
        corresponding average F1 scores.

    The F1 score is calculated as follows:
        - Tokenize the answers in the ground truth and language model columns with the BERT tokenizer.
        - Precision is the ratio of overlapping tokens to the total tokens in the language model answer.
        - Recall is the ratio of overlapping tokens to the total tokens in the ground truth answer.
        - F1 is the harmonic mean of precision and recall (zero when there is no overlap).
        - The per-row F1 scores are averaged for each language model column.

    Note:
        - The function assumes that the DataFrame and columns exist and are correctly formatted.
        - Tokenization requires the 'bert-base-uncased' model to be available.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Tokenize every ground truth answer once, up front
    correct_answers = df[ground_truth_col].apply(lambda x: tokenizer.tokenize(x))
    f1_scores = {}
    for col in llm_columns:
        # Tokenize the answers produced under this chunking strategy
        llm_answers = df[col].apply(lambda x: tokenizer.tokenize(x))
        f1_scores_list = []
        for correct, llm in zip(correct_answers, llm_answers):
            correct_set = set(correct)
            llm_set = set(llm)
            precision = recall = 0
            if len(llm_set) > 0:
                precision = len(correct_set & llm_set) / len(llm_set)
            if len(correct_set) > 0:
                recall = len(correct_set & llm_set) / len(correct_set)
            # Harmonic mean of precision and recall; zero when there is no overlap
            f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
            f1_scores_list.append(f1)
        # Average the per-row F1 scores for this column (zero if the DataFrame is empty)
        f1_scores[col] = sum(f1_scores_list) / len(f1_scores_list) if f1_scores_list else 0
    return f1_scores
# example usage
good = {
    'actual': ['The quick brown fox'],
    'ideal': ['The quick brown fox jumps'],
}
bad = {
    'actual': ['I love programming in Python'],
    'ideal': ['I love skiing in winter'],
}
df_good = pd.DataFrame(good)
df_bad = pd.DataFrame(bad)

f1_results_good = calculate_f1_overlap(df_good, 'actual', ['ideal'])
print(f1_results_good)  # {'ideal': 0.888888888888889}

f1_results_bad = calculate_f1_overlap(df_bad, 'actual', ['ideal'])
print(f1_results_bad)  # {'ideal': 0.6}
For each pair of golden answer and system-generated answer, we compute three measurements: precision (the fraction of tokens in the generated answer that also appear in the golden answer), recall (the fraction of tokens in the golden answer that also appear in the generated answer), and the F1 score (the harmonic mean of the two).
The final F1 score for each chunking strategy is calculated as the average of these individual F1 scores across all the golden questions. This method offers a comprehensive and quantitative way to assess the performance of each chunking strategy in providing accurate and relevant answers to a diverse range of queries.
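Putting the pieces together, comparing strategies amounts to calling the scoring function above with one answer column per strategy. The DataFrame below is purely illustrative; the column names and answers are placeholders rather than rows from our actual evaluation set.

import pandas as pd

eval_df = pd.DataFrame({
    "golden_answer": ["Click New Story on the backlog page."],
    "chunk_small":   ["Click New Story on the backlog page to add a story."],
    "chunk_medium":  ["Use the backlog page and click New Story."],
    "chunk_none":    ["Stories can be created from several different pages."],
})

# Higher scores indicate closer token overlap with the golden answers.
strategy_scores = calculate_f1_overlap(
    eval_df, "golden_answer", ["chunk_small", "chunk_medium", "chunk_none"]
)
print(strategy_scores)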
The chart below depicts the results of our experiment.
Here’s our summary of the takeaways from this experiment:
The effectiveness of chunk-small across all query sizes may be attributed to its ability to provide granular and precise matching, suggesting this approach offers a high degree of adaptability and robustness in diverse query contexts. By segmenting documents into smaller, fact-oriented units, this strategy may enable the system to match queries with very specific parts of our technical documentation.
In contrast, chunk-medium’s optimal performance for small, small-medium, and medium-long queries indicates a potential “sweet spot” in chunk size, helping strike a balance between capturing enough context and maintaining focus.
Interestingly, chunk-none appears to be effective for both small and long queries, possibly because it preserves full context for shorter queries and offers a broader overview for longer ones. However, its poor performance on medium-sized queries is a concern.
Finally, the general effectiveness of chunk-medium-none across all query sizes points to its flexibility and adaptability. These findings suggest that this strategy might be capable of dynamically adjusting to the varying demands of different query lengths.
The findings from this study underscore the complex interplay between chunking strategies and query lengths, illuminating key considerations for the design and optimization of information retrieval systems. These insights are relevant to the team at Broadcom as well as to any other group trying to help users search large-scale technical documentation effectively and efficiently.
The demonstrated relationship between chunking strategies and query lengths provides a framework that can guide teams in optimizing their information retrieval systems. These insights can help teams build systems that let users efficiently access accurate, relevant information, leading to better user experiences and greater productivity.
If you are a Rally customer, the Rally helpbot is now available as part of your subscription. To learn more about the solution, be sure to visit the Rally Software page.