Note: This post was co-authored by Bertrand Thomas.
To help users find relevant technical documentation quickly and efficiently, our teams have been exploring a range of technologies and approaches. One of these is embedding-based context retrieval, which draws on multiple data sources to return more accurate, relevant results. To optimize our implementation, we investigated the relationship between the length of user queries and how documents are segmented, or “chunked.” Our findings offer insights for anyone looking to improve information retrieval over large text databases. In this post, we briefly introduce the approaches we’ve employed and detail the key takeaways from the experiment we conducted.
When it comes to software documentation, it can be difficult to ensure users readily find the answers they need, and the problem only grows as documentation repositories become more expansive. To address it, many organizations are turning to AI and retrieval-augmented generation (RAG) systems, which enhance the accuracy and reliability of generative AI models by retrieving information from different data sources.
Within these environments, approaches like embedding-based context retrieval have become increasingly important. In these systems, document content is parsed into smaller chunks, and those chunks are transformed into vectors so that the chunks most relevant to a user's query can be retrieved by similarity comparison. In our experience, query length and the chunking strategy employed can have a significant impact on this retrieval process. In general, short chunking strategies tend to be sub-optimal for longer queries, while long chunking strategies tend to be sub-optimal for short, keyword-based queries.
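As a rough illustration of this idea (and not our production implementation), the sketch below embeds a query and a handful of document chunks with a toy bag-of-words "embedding" and ranks the chunks by cosine similarity. The embed function and the example chunks are simplified, made-up stand-ins for a real embedding model and vector store, not quotes from the Rally documentation.

import numpy as np

def embed(text, vocab):
    # Toy bag-of-words "embedding" used purely for illustration; a real system
    # would call an embedding model instead of counting words.
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

def top_k_chunks(query, chunks, k=2):
    # Build a shared vocabulary so query and chunk vectors are comparable.
    vocab = sorted({w for text in chunks + [query] for w in text.lower().split()})
    chunk_vecs = np.array([embed(c, vocab) for c in chunks])
    query_vec = embed(query, vocab)
    # Cosine similarity between the query vector and every chunk vector.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    # Return the k most similar chunks, most relevant first.
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

chunks = [
    "To create a user story, click New Story on the backlog page.",
    "Portfolio items roll up estimates from their child stories.",
    "Capacity planning compares team load against the iteration plan.",
]
print(top_k_chunks("how do I create a user story", chunks, k=1))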
We wanted to look at this relationship more quantitatively, so we set up a study. Using Broadcom's comprehensive knowledge base, available at techdocs.broadcom.com, we explored the effectiveness of embedding-based context retrieval for answering user queries. A key aspect of this study involved determining the optimal chunking strategy for different types of documents, since the chunking strategy is a key hyperparameter in RAG systems.
Below is a histogram illustrating the word count of articles in the Rally-specific section of our technical documentation. This graphic provides an illustration of the challenges at hand. The range and frequency of document lengths underscore the variability of the documents we are working with. This uneven distribution of document lengths informed our decision-making process, particularly in designing our document chunking strategies for this experiment. These strategies are intended to handle the spectrum of document sizes, from concise articles to expansive manuals, in order to enhance the accuracy and relevance of retrieved information.
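For readers who want to run a similar distribution check over their own corpus, a minimal sketch with pandas and matplotlib might look like the following; the articles series is a small placeholder, not our actual dataset.

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder corpus; in practice this would hold the parsed documentation articles.
articles = pd.Series([
    "Short release note about a bug fix.",
    "A longer how-to article walking through a workflow step by step ...",
    "An extensive administration guide covering permissions, integrations, and reporting ...",
])

# Count words per article and plot the distribution.
word_counts = articles.str.split().str.len()
word_counts.plot(kind="hist", bins=20, title="Article word counts")
plt.xlabel("Words per article")
plt.ylabel("Number of articles")
plt.show()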
To empirically test these chunking strategies, we leveraged our RAG system, Rally software's helpbot. To answer a user's query, the helpbot parses the Rally technical documentation using a predefined chunking strategy, retrieves relevant context with Google's Vertex AI vector search service, and then augments the user's original prompt with the (hopefully) relevant content.
The helpbot served as a practical platform for simulating real-world user queries, allowing us to evaluate the performance of each chunking strategy in a dynamic, interactive environment.
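At a high level, the helpbot's request path follows the standard RAG pattern: retrieve, augment, generate. The sketch below is a simplified outline rather than the production code; retrieve_chunks and generate_answer are hypothetical stand-ins for the Vertex AI vector search lookup and the downstream language model call.

def answer_query(query, retrieve_chunks, generate_answer, k=4):
    # 1. Retrieve the k document chunks most similar to the query
    #    (in our system this step is backed by Vertex AI vector search).
    context_chunks = retrieve_chunks(query, k=k)

    # 2. Augment the user's original prompt with the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_chunks) + "\n\n"
        "Question: " + query
    )

    # 3. Ask the language model to answer from the augmented prompt.
    return generate_answer(prompt)

# Toy stand-ins so the flow can be exercised end to end.
fake_retrieve = lambda query, k: ["A made-up documentation chunk."][:k]
fake_generate = lambda prompt: "A made-up answer based on the retrieved context."
print(answer_query("how do I create a story?", fake_retrieve, fake_generate))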
By analyzing the performance of these chunking strategies in relation to query length, we hoped to gain insights into the optimal structuring of document data for embedding-based retrieval systems. Plus, we wanted to help guide the practical application of these systems in order to respond effectively to user queries.
Utilizing Broadcom's technical documentation for the Rally product, we assessed the efficacy of four chunking strategies:
- chunk-small: documents split into small, fact-oriented chunks
- chunk-medium: documents split into medium-sized chunks
- chunk-none: documents left whole, with no chunking applied
- chunk-medium-none: a hybrid of the chunk-medium and chunk-none approaches
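To make these strategies concrete, here is a minimal sketch of a word-window chunker; the word-count thresholds below are arbitrary placeholders chosen for illustration, not the exact values we used.

def chunk_document(text, chunk_words=None, overlap=0):
    # chunk_words=None corresponds to "chunk-none": return the whole document as a single chunk.
    if chunk_words is None:
        return [text]
    words = text.split()
    step = max(chunk_words - overlap, 1)
    # Slide a window of chunk_words words across the document.
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), step)
    ]

doc = "word " * 500                                   # a 500-word dummy document
small_chunks = chunk_document(doc, chunk_words=50)    # "chunk-small" style
medium_chunks = chunk_document(doc, chunk_words=200)  # "chunk-medium" style
whole_doc = chunk_document(doc)                       # "chunk-none" style
print(len(small_chunks), len(medium_chunks), len(whole_doc))  # 10 3 1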
To evaluate strategy performance, we used a set of “golden questions” and corresponding “golden answers.” These golden questions range from concise, single-word queries to more complex ones comprising up to 30 words. Each question is paired with a golden answer, which we, as the human experts, consider the ideal or correct response to the query.
To compare the effectiveness of different chunking strategies, we focus on the alignment between the answers generated by each strategy and these golden answers. Specifically, we employ a metric known as the F1 overlap score, which measures the accuracy and relevance of the answers provided by our system in response to the golden questions.
The F1 score is a widely recognized metric in information retrieval and natural language processing. This score balances precision and recall. Precision assesses the proportion of relevant answers among all answers retrieved, while recall evaluates the proportion of relevant answers that were successfully retrieved.
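As a quick worked example with made-up token counts: if a generated answer contains five unique tokens, the golden answer contains four, and four tokens overlap, the score works out as follows.

# Hypothetical token counts, used only to illustrate the formula.
precision = 4 / 5    # overlapping tokens / tokens in the generated answer
recall = 4 / 4       # overlapping tokens / tokens in the golden answer
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.889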
Our approach to calculating the F1 overlap score involves tokenizing both the golden answers and the answers generated by our system under the different chunking strategies. For this tokenization, we used the “bert-base-uncased” tokenizer from Hugging Face's Transformers library. This process converts each answer into a set of tokens representing its essential elements. Here's a snippet of this code:
# install the transformers library if needed (e.g., in a notebook)
# !pip install transformers

import pandas as pd
from transformers import BertTokenizer


def calculate_f1_overlap(df, ground_truth_col, llm_columns):
    """
    Calculate the F1 overlap score between ground truth answers and answers from language model(s).

    This function computes F1 scores to assess the token overlap between answers in a ground truth
    column and one or more language model columns in a DataFrame. It uses the BERT tokenizer to
    tokenize the answers before calculating precision, recall, and F1 scores.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the columns with answers.
        ground_truth_col (str): The name of the column that contains the ground truth answers.
        llm_columns (list of str): Column names for answers generated by language models.

    Returns:
        dict: A dictionary where keys are language model column names and values are the
        corresponding average F1 scores.

    The F1 score is calculated as follows:
        - Tokenize the answers in the ground truth and language model columns with the BERT tokenizer.
        - Precision is the ratio of overlapping tokens to the total tokens in the language model answer.
        - Recall is the ratio of overlapping tokens to the total tokens in the ground truth answer.
        - F1 is the harmonic mean of precision and recall (zero when there is no overlap).
        - The per-row F1 scores are averaged for each language model column.

    Note:
        - The function assumes that the DataFrame and columns exist and are correctly formatted.
        - Tokenization requires the 'bert-base-uncased' model to be available.
    """
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Tokenize every ground truth answer once, up front
    correct_answers = df[ground_truth_col].apply(lambda x: tokenizer.tokenize(x))
    f1_scores = {}
    for col in llm_columns:
        # Tokenize the answers produced under this chunking strategy
        llm_answers = df[col].apply(lambda x: tokenizer.tokenize(x))
        f1_scores_list = []
        for correct, llm in zip(correct_answers, llm_answers):
            correct_set = set(correct)
            llm_set = set(llm)
            precision = recall = 0
            if len(llm_set) > 0:
                precision = len(correct_set & llm_set) / len(llm_set)
            if len(correct_set) > 0:
                recall = len(correct_set & llm_set) / len(correct_set)
            # Harmonic mean of precision and recall; zero when there is no overlap
            f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
            f1_scores_list.append(f1)
        # Average the per-row F1 scores for this column (zero if the DataFrame is empty)
        f1_scores[col] = sum(f1_scores_list) / len(f1_scores_list) if f1_scores_list else 0
    return f1_scores
# example usage
good = {
    'actual': ['The quick brown fox'],
    'ideal': ['The quick brown fox jumps'],
}
bad = {
    'actual': ['I love programming in Python'],
    'ideal': ['I love skiing in winter'],
}
df_good = pd.DataFrame(good)
df_bad = pd.DataFrame(bad)

f1_results_good = calculate_f1_overlap(df_good, 'actual', ['ideal'])
print(f1_results_good)  # {'ideal': 0.888888888888889}

f1_results_bad = calculate_f1_overlap(df_bad, 'actual', ['ideal'])
print(f1_results_bad)  # {'ideal': 0.6}
For each pair of golden answer and system-generated answer, we compute three measurements: precision (the fraction of tokens in the generated answer that also appear in the golden answer), recall (the fraction of tokens in the golden answer that also appear in the generated answer), and the F1 score (the harmonic mean of the two).
The final F1 score for each chunking strategy is calculated as the average of these individual F1 scores across all the golden questions. This method offers a comprehensive and quantitative way to assess the performance of each chunking strategy in providing accurate and relevant answers to a diverse range of queries.
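Putting the pieces together, comparing strategies amounts to calling the scoring function above with one answer column per strategy. The DataFrame below is purely illustrative; the column names and answers are placeholders rather than rows from our actual evaluation set.

import pandas as pd

eval_df = pd.DataFrame({
    "golden_answer": ["Click New Story on the backlog page."],
    "chunk_small":   ["Click New Story on the backlog page to add a story."],
    "chunk_medium":  ["Use the backlog page and click New Story."],
    "chunk_none":    ["Stories can be created from several different pages."],
})

# Higher scores indicate closer token overlap with the golden answers.
strategy_scores = calculate_f1_overlap(
    eval_df, "golden_answer", ["chunk_small", "chunk_medium", "chunk_none"]
)
print(strategy_scores)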
The chart below depicts the results of our experiment.
Here’s our summary of the takeaways from this experiment:
The effectiveness of chunk-small across all query sizes may be attributed to its ability to provide granular and precise matching, suggesting this approach offers a high degree of adaptability and robustness in diverse query contexts. By segmenting documents into smaller, fact-oriented units, this strategy may enable the system to match queries with very specific parts of our technical documentation.
In contrast, chunk-medium’s optimal performance for small, small-medium, and medium-long queries indicates a potential “sweet spot” in chunk size, helping strike a balance between capturing enough context and maintaining focus.
Interestingly, chunk-none appears to be effective for both small and long queries, possibly because it preserves full context for shorter queries and offers a broader overview for longer ones. However, its poor performance on medium-sized queries is a concern.
Finally, the general effectiveness of chunk-medium-none across all query sizes points to its flexibility and adaptability. These findings suggest that this strategy might be capable of dynamically adjusting to the varying demands of different query lengths.
The findings from this study underscore the complex interplay between chunking strategies and query lengths, illuminating key considerations for the design and optimization of information retrieval systems. These insights are relevant to the team at Broadcom as well as to any other group trying to help users search large-scale technical documentation effectively and efficiently.
The demonstrated relationship between chunking strategies and query lengths provides a framework that can guide teams in optimizing their information retrieval systems. These insights can help teams build systems that let users efficiently access accurate, relevant information, leading to better user experiences and greater productivity.
If you are a Rally customer, the Rally helpbot is now available as part of your subscription. To learn more about the solution, be sure to visit the Rally Software page.