ETRI Unveils Automated Benchmark for Task Planners

National Research Council of Science & Technology

An ETRI research team has developed a technology that automatically evaluates the performance of task plans generated by large language models (LLMs)1), paving the way for fast and objective assessment of task-planning AIs.

1) Large language models: language models built from artificial neural networks that contain a vast number of parameters.

Electronics and Telecommunications Research Institute (ETRI) has announced the development of LoTa-Bench2), which enables the automatic evaluation of language-based task planners. A language-based task planner understands a verbal instruction from a human user, plans a sequence of operations, and autonomously executes those operations to fulfill the goal of the instruction.

2) LoTa-Bench: a benchmark technology for plan-generating artificial intelligence developed by ETRI, abbreviated from Language-oriented Task planning Benchmark.

The research team published a paper at one of the leading international AI conferences, the International Conference on Learning Representations (ICLR)3), and shared the evaluation results for a total of 33 large language models through GitHub.

3) ICLR (International Conference on Learning Representations) paper: https://openreview.net/forum?id=ADSxCpCu9s

Recently, large language models have demonstrated remarkable performance not only in language processing, conversation, mathematical problem solving, and logical proof, but also in understanding human commands, autonomously selecting sub-tasks, and executing them sequentially to achieve goals. Consequently, there has been a widespread effort to apply large language models to robotics applications and services.

Previously, the absence of benchmark4) technology capable of automatically evaluating task-planning performance meant that assessments had to be done manually, which was labor-intensive. In existing research, including Google's SayCan5), multiple individuals directly observed the results of tasks being executed and then voted on their success or failure. This approach not only required significant time and effort but also allowed subjective judgment to influence the results.

4) Benchmark: a system that uses programs to compare and evaluate the performance of computer components and the like, assigning a score based on their efficiency.

5) https://say-can.github.io/

The LoTa-Bench technology developed by ETRI automates the evaluation process: it executes the task plans generated by a large language model from user commands and automatically compares the outcomes against the intended results to determine whether each plan succeeded. This significantly reduces evaluation time and cost while ensuring that the results are objective.
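The execute-and-compare loop described above can be sketched as follows. This is an illustrative simplification with hypothetical names (`execute_plan`, `is_success`, `success_rate`) and a toy predicate-set world state; the actual benchmark executes plans in full simulators such as AI2-THOR and VirtualHome.

```python
# Toy sketch of LoTa-Bench-style automatic evaluation: run a plan,
# then check whether the goal conditions hold in the final state.
# All names and the mini action set here are hypothetical.

def execute_plan(initial_state, plan):
    """Apply each (action, obj, target) step to a predicate-set world state."""
    state = set(initial_state)
    for action, obj, target in plan:
        if action == "pickup":
            state.add(("holding", obj))
        elif action == "put":
            state.discard(("holding", obj))
            state.add(("in", obj, target))
        elif action == "cool":
            state.add(("cold", obj))
    return state

def is_success(final_state, goal_conditions):
    """A plan succeeds iff every goal predicate holds in the final state."""
    return goal_conditions <= final_state

def success_rate(episodes):
    """episodes: list of (initial_state, plan, goal_conditions) tuples."""
    wins = sum(is_success(execute_plan(s, p), g) for s, p, g in episodes)
    return 100.0 * wins / len(episodes)

# Example: "Place a cooled apple in the microwave"
goal = {("cold", "apple"), ("in", "apple", "microwave")}
plan = [("pickup", "apple", None), ("cool", "apple", None),
        ("put", "apple", "microwave")]
```

Because success is decided by checking goal predicates against the simulated world state, no human voting is needed, which is the key to the objectivity claim above.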

ETRI revealed benchmark results for different large language models: OpenAI's GPT-3 achieved a success rate of 21.36%, GPT-4 reached 40.38%, Meta's LLaMA 2-70B scored 18.27%, and MosaicML's MPT-30B recorded 18.75%. A success rate of 20% means that, out of 100 instructions, 20 plans fulfilled the goal of the instruction. The results indicate that larger models tend to have superior task-planning capabilities.

In LoTa-Bench, performance evaluation is conducted in virtual simulation environments built for research and development of robotics and embodied agent intelligence: the Allen Institute for AI's AI2-THOR6) and the Massachusetts Institute of Technology's VirtualHome7). The evaluation uses the ALFRED dataset8), which includes everyday household task instructions such as "Place a cooled apple in the microwave."

6) AI2-THOR: A robotic home service simulator.

7) VirtualHome: A simulation of household activities through programming.

8) ALFRED: a benchmark for testing and evaluating the performance of everyday household task execution. Watch-and-Help: a benchmark for testing and evaluating an AI's ability to recognize human task intentions and collaborate accordingly.

Leveraging LoTa-Bench's easy and rapid verification of new task-planning methods, the research team identified two strategies for improving task-planning performance: in-context example selection and feedback-based replanning. They also confirmed that fine-tuning on task-planning data effectively enhances the performance of language-based task planning.
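One of the two strategies, in-context example selection, can be sketched as below: pick the stored demonstrations whose instructions most resemble the new command and prepend them to the planner's prompt. The function names, the word-overlap similarity metric, and the prompt format are all assumptions for illustration; the paper's actual selection method and prompt template may differ.

```python
# Hypothetical sketch of in-context example selection for a task planner.

def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def select_examples(command, demos, k=2):
    """demos: list of (instruction, plan) pairs; return the k most similar."""
    return sorted(demos, key=lambda d: similarity(command, d[0]),
                  reverse=True)[:k]

def build_prompt(command, demos, k=2):
    """Assemble an LLM prompt from the selected demonstrations."""
    lines = [f"Instruction: {instr}\nPlan: {plan}"
             for instr, plan in select_examples(command, demos, k)]
    lines.append(f"Instruction: {command}\nPlan:")
    return "\n\n".join(lines)
```

Feedback-based replanning would extend this loop: when the executed plan fails the benchmark's success check, the failure signal is fed back to the model to generate a revised plan.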

Minsu Jang, a principal researcher at ETRI's Social Robotics Lab, stated, "LoTa-Bench marks the first step in the development of task planning AI. We plan to research and develop technologies that can predict task failures in uncertain situations or improve task generation intelligence by asking for and receiving help from humans. This technology is essential for realizing the era of one robot per household."

Jaehong Kim, the director of ETRI's Social Robotics Research Section, announced, "ETRI is dedicated to advancing robotic intelligence using foundation models to realize robots capable of generating and executing various mission plans in the real world."

By releasing the software9) as open source, the ETRI researchers anticipate that companies and educational institutions will be able to freely utilize this technology, thereby accelerating the advancement of related technologies.

9) https://github.com/lbaa2022/LLMTaskPlanning
