Yilun Du, a graduate student in MIT's Department of Electrical Engineering and Computer Science, has helped develop a machine-learning technique that accurately captures and models the underlying acoustics of a scene from only a limited number of sound recordings. The system can then simulate what a listener would hear at any point in the room.
Joining Du on the work are lead author Andrew Luo, a graduate student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Science at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT's Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science at MIT; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.
The researchers also investigated how spatial acoustic information could help robots better understand their environments. They created a machine-learning model that captures how any sound in a room propagates through the space, enabling it to simulate what a listener would hear at different positions.
A type of machine-learning model known as an implicit neural representation has been used in computer vision research to produce continuous, smooth reconstructions of 3D scenes from photographs. These models use neural networks, which are composed of layers of interconnected nodes, or neurons, that process data to perform a task.
The MIT researchers employed the same kind of model to represent, continuously, how sound propagates through a scene.
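The core idea of an implicit neural representation is that a small network learns a continuous function from a coordinate (a pixel location, or here, a position in a room) to a signal value, so it can be queried at positions it was never trained on. The following is a toy sketch of that idea, not the authors' model: a tiny MLP is fit to a handful of samples of a 1-D signal and then evaluated at an unseen position.

```python
# Toy implicit neural representation (illustrative only, not the paper's
# architecture): a small MLP maps a 1-D coordinate to a signal value.
import numpy as np

rng = np.random.default_rng(0)

# Sparse observations: the signal sampled at only 8 positions.
xs = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
ys = np.sin(np.pi * xs)              # the underlying "scene" to reconstruct

# Two-layer MLP: position -> 32 tanh units -> value.
W1 = rng.normal(0.0, 1.0, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.1, (32, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(10000):
    h = np.tanh(xs @ W1 + b1)        # hidden activations, shape (8, 32)
    pred = h @ W2 + b2               # predicted values, shape (8, 1)
    err = pred - ys
    # Backpropagate the mean-squared error by hand.
    gW2 = h.T @ err / len(xs); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h**2)
    gW1 = xs.T @ dh / len(xs); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The trained network is a continuous function of position: it can be
# queried anywhere, including positions that were never observed.
x_new = np.array([[0.3]])
y_new = np.tanh(x_new @ W1 + b1) @ W2 + b2
print(float(y_new[0, 0]))
```

Because the representation is a function rather than a grid of stored samples, it interpolates smoothly between the sparse observations, which is what lets such models reconstruct an entire scene from limited recordings.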
However, they discovered that sound models lack a property, known as photometric consistency, that makes vision models tractable: the same object looks roughly the same when viewed from two different locations. With sound, by contrast, two different locations can produce completely different signals because of obstructions, distance, and other factors, which makes audio prediction far more challenging.
To address this, the researchers built two acoustic properties into their model: the reciprocal nature of sound and the influence of local geometric features.
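Acoustic reciprocity means that swapping the positions of the sound source and the listener leaves the received sound unchanged. One simple way to bake such a symmetry into a model — a hypothetical illustration, not necessarily the authors' exact formulation — is to encode the two positions with features that are invariant under the swap, so the network cannot distinguish emitter from listener:

```python
import numpy as np

def reciprocal_features(emitter, listener):
    """Encode an emitter/listener pair so that swapping the two points
    yields exactly the same feature vector (acoustic reciprocity).
    Hypothetical helper for illustration only."""
    p = np.asarray(emitter, dtype=float)
    q = np.asarray(listener, dtype=float)
    # Elementwise sum and absolute difference are both symmetric in (p, q).
    return np.concatenate([p + q, np.abs(p - q)])

a = [1.0, 2.0, 0.5]   # emitter position
b = [4.0, 0.0, 1.5]   # listener position
print(np.allclose(reciprocal_features(a, b), reciprocal_features(b, a)))  # True
```

A network fed only such symmetric features produces the same prediction for (emitter, listener) and (listener, emitter), so reciprocity holds by construction rather than having to be learned from data.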
They also found that feeding the acoustic information their model captures into a computer vision model can improve the visual reconstruction of the scene.
“When you only have a sparse set of views, using these acoustic features enables you to capture boundaries more sharply, for instance. And maybe this is because to accurately render the acoustics of a scene, you have to capture the underlying 3D geometry of that scene,” Du says.
The researchers plan to keep improving the model so that it generalizes to brand-new scenes. They also aim to apply the technique to more complex impulse responses and to larger scenes, such as entire buildings or even a town or city.
“This new technique might open up new opportunities to create a multimodal immersive experience in the metaverse application,” says Chuang Gan.