Manual querying of building codes is tedious, error-prone, and time-consuming, and researchers have therefore focused on question-answering (QA) systems that can answer user queries directly from these documents. A promising way to build a robust QA system is Retrieval-Augmented Generation (RAG), a framework that integrates two core components: a retriever, which identifies and extracts relevant information from documents, and a Large Language Model (LLM), which generates precise answers by combining the retrieved content with the query. Because both components shape the overall quality of a RAG system, recent work has examined several retrieval methods alongside the efficiency of Low-Rank Adaptation (LoRA) for fine-tuning LLMs.
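The two-step retrieve-then-generate flow described above can be sketched in a few lines. The toy corpus, the term-overlap scoring, and the placeholder generator below are illustrative assumptions, not the authors' system, which used production retrievers such as Elasticsearch and fine-tuned LLMs:

```python
import re

def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, k=3):
    """Rank documents by word overlap with the query and keep the top-k."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokens(d)]

def answer(query, corpus):
    """Combine retrieved context with the query for the generation step."""
    context = " ".join(retrieve(query, corpus))
    # Placeholder for the LLM call: a real RAG system would send this
    # prompt to a language model rather than returning it directly.
    return f"Context: {context}\nQuestion: {query}"

corpus = [
    "Stair risers shall not exceed 180 mm in height.",
    "Handrails are required on both sides of stairs wider than 1100 mm.",
    "Smoke alarms shall be installed in every sleeping room.",
]
print(answer("What is the maximum riser height for stairs?", corpus))
```

Only documents sharing at least one term with the query are retained, so off-topic passages (here, the smoke-alarm clause) never reach the generator.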
While RAG holds strong promise, both components present inherent challenges. Retrievers vary widely in performance, each with its own advantages and limitations. At the same time, language models are susceptible to hallucinations and typically require fine-tuning to adapt effectively to specialized domains. Recognizing this, researchers at the University of Alberta (Mr. Aqib, Dr. Qipei, Mr. Hamza, and Professor Chui) explored the performance of several retrievers and investigated the impact of fine-tuning Large Language Models (LLMs) for building code applications, demonstrating that such adaptation significantly enhances generation accuracy and domain alignment.
Their study systematically evaluated multiple retrievers, with Elasticsearch (ES) emerging as the most effective. Experiments also showed that retrieving the top-3 to top-5 documents was sufficient to capture query-relevant context, achieving consistently high BERT F1 scores. In parallel, the researchers fine-tuned a range of LLMs spanning 1B to 24B parameters to better capture the nuances of building code language. Among these, Llama-3.1-8B delivered the strongest results, achieving a 6.83% relative improvement in BERT F1 score over its pre-trained baseline.
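A relative improvement like the one reported is computed against the pre-trained baseline score rather than as an absolute difference. The sketch below shows the arithmetic; the two BERT F1 values are hypothetical, not taken from the paper, and are chosen only to illustrate a gain of roughly 6.83%:

```python
def relative_improvement(baseline, fine_tuned):
    """Relative gain of a fine-tuned score over its baseline, in percent."""
    return 100 * (fine_tuned - baseline) / baseline

# Hypothetical BERT F1 scores (not from the paper), illustrating how a
# baseline of 0.820 rising to 0.876 yields about a 6.83% relative gain.
print(round(relative_improvement(0.820, 0.876), 2))  # → 6.83
```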
Together, these findings underscore the value of combining robust retrieval strategies with fine-tuned language models for building code compliance and query answering. For future work, Aqib noted that "there is a need to develop a fully integrated end-to-end RAG framework, validated against manually curated datasets. Moreover, continued domain-specific fine-tuning could bring performance closer to that of state-of-the-art commercial models such as GPT-4."
This paper, "Fine-tuning large language models and evaluating retrieval methods for improved question answering on building codes," was published in Smart Construction (ISSN: 2960-2033), a peer-reviewed open access journal dedicated to original research articles, communications, reviews, perspectives, reports, and commentaries across all areas of intelligent construction, operation, and maintenance, covering both fundamental research and engineering applications. The journal is now indexed in Scopus, and article submission is completely free of charge until 2026.
Citation:
Aqib M, Hamza M, Mei Q, Chui Y. Fine-tuning large language models and evaluating retrieval methods for improved question answering on building codes. Smart Constr. 2025(3):0021, https://doi.org/10.55092/sc20250021.