Researchers have proposed a unifying mathematical framework that helps explain why many successful multimodal AI systems work.
Artificial intelligence is increasingly relied on to combine and interpret different kinds of data, including text, images, audio, and video. One obstacle that continues to slow progress in multimodal AI is deciding which algorithmic approach best fits the specific task an AI system is meant to solve.
Physicists at Emory University have now introduced a unified way to organize and guide that decision process: a framework that brings structure to how algorithms for multimodal AI are derived. Their work was published in the Journal of Machine Learning Research.
“We found that many of today’s most successful AI methods boil down to a single, simple idea — compress multiple kinds of data just enough to keep the pieces that truly predict what you need,” says Ilya Nemenman, Emory professor of physics and senior author of the paper. “This gives us a kind of ‘periodic table’ of AI methods. Different methods fall into different cells, based on which information a method’s loss function retains or discards.”
A loss function is the mathematical rule an AI system uses to measure how wrong its predictions are. During training, the model repeatedly adjusts its internal parameters to reduce this error, using the loss function as a guide.
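As a concrete illustration (a minimal sketch, not code from the paper), here is how a squared-error loss guides training of a one-parameter model; the data, model, and learning rate are invented for the example:

```python
# Minimal sketch: gradient descent on a squared-error loss
# for a one-parameter linear model y ~ w * x (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # synthetic data; true w = 3

w = 0.0    # initial parameter guess
lr = 0.1   # learning rate
for step in range(100):
    pred = w * x
    loss = np.mean((pred - y) ** 2)      # the loss: how wrong are we?
    grad = np.mean(2 * (pred - y) * x)   # gradient of the loss w.r.t. w
    w -= lr * grad                       # adjust parameter to reduce the loss

print(f"learned w = {w:.3f}, final loss = {loss:.4f}")
```

Multimodal loss functions play the same guiding role, but score predictions built from several kinds of data at once.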
“People have devised hundreds of different loss functions for multimodal AI systems, and some may be better than others, depending on context,” Nemenman says. “We wondered if there was a simpler way than starting from scratch each time you confront a problem in multimodal AI.”
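The paper's own loss functions aren't reproduced in this article, but the compress-yet-predict idea Nemenman describes echoes the classic information bottleneck objective of Tishby, Pereira, and Bialek, which offers one concrete template for what such a loss can look like (the framework's exact formulations may differ):

$$\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)$$

Here $X$ is the raw (possibly multimodal) input, $Z$ is its compressed representation, $Y$ is the quantity to predict, $I(\cdot\,;\cdot)$ denotes mutual information, and $\beta$ trades compression against predictive power. In the “periodic table” picture, different methods occupy different cells according to which such information terms their loss functions retain or discard.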