New Technique Guides AI Model Output Monitoring

American Association for the Advancement of Science (AAAS)

AI models have their own internal representations of knowledge and concepts that are often difficult to discern, even though they are critical to the models' outputs. Knowing more about a model's representation of a concept would, for instance, help explain why an AI model might "hallucinate" information, or why certain prompts can trick it into responses that dodge its built-in safeguards. Daniel Beaglehole and colleagues now introduce a robust method for extracting these concept representations that works across several large-scale language, reasoning, and vision AI models. Their technique uses a feature extraction algorithm called the Recursive Feature Machine. By extracting concept representations with this technique, Beaglehole et al. were able to monitor the models in ways that exposed some of their vulnerabilities to behaviors such as hallucination, and to steer them toward improved responses. Surprisingly, the researchers noted, the concept representations were transferable between languages and could be combined with other concept representations for multi-concept steering. "Together, these results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements," the authors write.
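
To make the general idea of reading and nudging internal representations concrete, the sketch below shows one simple form of concept extraction and steering. It is a minimal, generic illustration using a difference-of-means direction on a toy network, not the authors' Recursive Feature Machine, and the model, data, and parameter names here are hypothetical.

```python
# Generic sketch: extract a "concept direction" from hidden activations and
# use it to steer a model's behavior. This is NOT the Recursive Feature
# Machine from the paper; it is a simple difference-of-means illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model: a single hidden layer standing in for a transformer block.
hidden_dim = 64
model = nn.Sequential(
    nn.Linear(32, hidden_dim),  # input projection
    nn.ReLU(),
    nn.Linear(hidden_dim, 8),   # output head
)

# Hypothetical inputs representing prompts that do / do not express a concept.
with_concept = torch.randn(100, 32) + 0.5
without_concept = torch.randn(100, 32) - 0.5

# Capture hidden-layer activations with a forward hook.
captured = []
def save_hidden(module, inputs, output):
    captured.append(output.detach())

hook = model[1].register_forward_hook(save_hidden)
model(with_concept)
model(without_concept)
hook.remove()

# Concept direction = difference of mean hidden activations, normalized
# (a crude stand-in for a learned concept representation).
concept_direction = captured[0].mean(0) - captured[1].mean(0)
concept_direction = concept_direction / concept_direction.norm()

# "Steering": add a scaled copy of the direction to the hidden state at
# inference time; the returned tensor replaces the layer's output.
def steer(module, inputs, output, strength=2.0):
    return output + strength * concept_direction

steer_hook = model[1].register_forward_hook(steer)
steered_logits = model(torch.randn(4, 32))
steer_hook.remove()
print(steered_logits.shape)  # torch.Size([4, 8])
```

In the same spirit, multi-concept steering can be pictured as adding several such directions at once, each with its own strength, though the paper's method for learning and combining the directions differs from this toy example.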
