Abstract: The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to the practicality in producing Foley sound. Semantic and temporal conditions are fed to the ...