Penn Engineers have developed SmartDJ, an AI-powered editor that lets users modify immersive audio environments with simple instructions in everyday language, with potential applications in virtual reality, augmented reality, gaming and sound design. Instead of requiring users to specify individual edits, SmartDJ can respond to high-level requests like "make this sound like a busy office," then plan and carry out the steps needed to achieve that result.
The system addresses two major limitations of earlier AI audio-editing tools. First, most prior systems worked best with rigid, template-like commands, requiring users to identify sounds to add or remove. Second, those tools generally operated on single-channel or "mono" audio, losing the spatial cues that are necessary for an immersive audio experience.
SmartDJ, by contrast, can interpret high-level instructions and is designed for stereo audio, allowing it to make edits that better preserve or reshape the spatial structure of a scene.
What's more, the system is interpretable: users can see each step SmartDJ takes. For example, a prompt like "make this sound like a busy office" might lead SmartDJ to generate an instruction like "Add the sound of phone ringing at right by 3dB." Users can then revise, remove or add individual steps, providing more control over the final result.
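The step list described above can be pictured as ordinary, editable data. The sketch below is purely illustrative: the `EditStep` class and its field names are assumptions for this example, not SmartDJ's actual interface.

```python
# Hypothetical sketch: an interpretable edit plan as user-editable data.
# The EditStep class and field names are illustrative, not SmartDJ's API.
from dataclasses import dataclass
from typing import List

@dataclass
class EditStep:
    action: str      # e.g. "add", "remove", "move"
    sound: str       # e.g. "phone ringing"
    position: str    # e.g. "right"
    gain_db: float   # level change in decibels

# A plan the system might produce for "make this sound like a busy office":
plan: List[EditStep] = [
    EditStep("add", "phone ringing", "right", 3.0),
    EditStep("add", "keyboard typing", "left", 0.0),
    EditStep("remove", "birdsong", "center", 0.0),
]

# Interpretability in practice: users can revise, remove or add steps.
plan[0].gain_db = 1.5   # soften the phone
del plan[2]             # keep the birdsong after all
for step in plan:
    print(step)
```

Because each step is a discrete, inspectable object rather than a hidden model state, a user can veto or adjust any single change before the final mix is rendered.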
"With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen," says Mingmin Zhao, Assistant Professor in Computer and Information Science (CIS) and senior author of a study presented at the 2026 International Conference on Learning Representations (ICLR). "We show that AI can help people edit audio in intuitive ways using simple language."
Combining Language and Diffusion Models
One of the central challenges of AI audio editing is that understanding a user's request and generating sounds are usually handled by different kinds of AI systems. "We use language models to deal with text," says Zitong Lan, a doctoral student in Electrical and Systems Engineering (ESE) and the study's first author. "We further use diffusion models to edit sounds."
The difference comes down to what each system has been trained to do. Language models — the same technology that powers chatbots — learn patterns in words, helping them to interpret what users mean and to generate text in response. Diffusion models, by contrast, are designed to create media by gradually shaping noise into a coherent signal.
To bridge the gap, the team introduced an audio language model, or ALM, into the editing loop. Trained on both sound and text, the ALM analyzes the original audio together with the user's prompt, then breaks that prompt into a sequence of smaller editing actions, such as adding, removing or repositioning a sound. A diffusion model then carries out those actions step by step, allowing SmartDJ to both interpret language and edit audio.
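The plan-then-execute loop described above can be sketched in a few lines. Here `plan_steps` stands in for the audio language model and `apply_step` for the diffusion editor; both are toy stand-ins written for this article, not SmartDJ's actual components.

```python
# Hypothetical sketch of the two-model editing loop. plan_steps stands in
# for the audio language model (ALM) and apply_step for the diffusion
# editor; neither reflects the real SmartDJ implementation.
from typing import List, Tuple

def plan_steps(prompt: str, audio_summary: str) -> List[Tuple[str, str]]:
    """ALM stand-in: break a high-level prompt into (action, sound) steps."""
    # A real ALM would condition on the raw stereo audio and the prompt.
    if "busy office" in prompt:
        return [("add", "phone ringing"), ("add", "keyboard typing")]
    return []

def apply_step(audio: list, step: Tuple[str, str]) -> list:
    """Diffusion stand-in: here we just record the edit on the 'audio'."""
    return audio + [step]  # placeholder for a denoising-based stereo edit

def smart_edit(audio: list, prompt: str) -> list:
    # The ALM plans once; the diffusion model executes one edit per step.
    for step in plan_steps(prompt, audio_summary="stereo scene"):
        audio = apply_step(audio, step)
    return audio

edited = smart_edit([], "make this sound like a busy office")
print(edited)
```

The key design point is the separation of concerns: the planner only ever emits small, well-defined actions, so the generative model never has to interpret open-ended language itself.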
In essence, the language model acts as a producer, deciding how the soundscape should change, while the diffusion model acts like a studio musician, carrying out those directions in audio. "The language model gives the system direction," says Yiduo Hao, a doctoral student in CIS and the study's other co-author. "The diffusion model performs those directions."
Training SmartDJ
To learn how to turn broad user requests into step-by-step audio edits, SmartDJ needed examples that brought together three things at once: a high-level instruction, the sequence of editing actions needed to carry it out, and the audio before and after each change.
Unfortunately, that kind of training data did not exist. "This problem needed a very unusual kind of data set," says Lan. "It had to capture the goal, the steps and the result all at once."
So the team built it themselves. Relying on publicly available sound libraries, the researchers created a pipeline that used a large language model to generate high-level editing prompts and the intermediate steps needed to carry them out, while audio signal processing produced the corresponding edited outputs. "For this to work, we couldn't just show the model an input and output," says Hao. "We had to show it the chain of reasoning in between."
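A single synthetic training example of the kind described above pairs a goal, its intermediate steps, and programmatically rendered before/after audio. The sketch below uses a toy constant-power panner to stand in for the audio signal processing; the function names and mixing math are assumptions for illustration, not the paper's pipeline.

```python
# Hypothetical sketch of the training-data pipeline: pair a high-level
# prompt with intermediate steps and programmatically rendered stereo
# audio. Function names and the toy mixing math are illustrative.
import math

def pan_gains(position: str) -> tuple:
    """Constant-power pan: map 'left'/'center'/'right' to (L, R) gains."""
    angle = {"left": 0.0, "center": math.pi / 4, "right": math.pi / 2}[position]
    return math.cos(angle), math.sin(angle)

def add_source(stereo, source, position: str, gain_db: float = 0.0):
    """Mix a mono source into a stereo buffer at a given pan and level."""
    g = 10 ** (gain_db / 20)          # decibels -> linear gain
    gl, gr = pan_gains(position)
    left, right = list(stereo[0]), list(stereo[1])
    for i, s in enumerate(source):
        left[i] += g * gl * s
        right[i] += g * gr * s
    return left, right

# One synthetic example: goal, steps, and before/after audio together.
before = ([0.0] * 4, [0.0] * 4)       # silent 4-sample stereo buffer
phone = [0.1, -0.1, 0.1, -0.1]        # stand-in waveform for the added sound
after = add_source(before, phone, "right", 3.0)
example = {
    "prompt": "make this sound like a busy office",
    "steps": [("add", "phone ringing", "right", 3.0)],
    "before": before,
    "after": after,
}
print(example["after"])
```

Generating examples this way gives the model exactly the three-part supervision the researchers describe: the goal, the chain of steps, and the audio result of each step.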
Toward More Accessible Audio Editing
To test SmartDJ, the researchers compared it with earlier audio-editing systems and found that it produced more realistic, better-aligned results. In both quantitative evaluations and human studies, SmartDJ outperformed prior methods on measures including audio quality, how well the results matched the user's instructions, and how realistically it placed sounds in space.
The researchers see potential applications in virtual reality, augmented reality, gaming, sound design, virtual conferencing and other forms of interactive media, where users may want to reshape an audio environment without manually specifying every individual change.
Ultimately, the researchers' goal is to make audio editing more accessible, allowing anyone with a creative vision to edit soundscapes. "For other media, like text and images, users can already use AI to make high-level editing requests," says Zhao. "SmartDJ unlocks similar possibilities for audio, making it easier for more people to bring their ideas to life."
This study was conducted at the University of Pennsylvania School of Engineering and Applied Science.