Submitted by Anonymous (not verified) on Wed, 03/13/2024 - 18:45
Abstract
Our cognitive abilities enable us not only to perceive and identify speech and non-speech sounds but also to comprehend their meaning as a whole. While significant advances have been made in audio recognition in recent years, models trained with only sound labels possess limited reasoning and understanding capabilities; e.g., a model may recognize that a clock has chimed 6 times, but not know that this indicates a time of 6 o'clock. Can we build an AI model that has both audio perception and reasoning ability?