YouTube’s automatic captioning just got a whole lot more colourful with the introduction of sound effect captions.
Google said in a blog post last week that it has been providing captions for videos since 2009, but that until now it has focused on speech transcription to improve accessibility.
Now the company has turned its attention to sound effects, with the aim of providing better access to ‘the richness of all the audio content’.
Three teams, from accessibility, sound understanding and YouTube, collaborated on the project, which used machine learning to turn sound effects into words.
But the process came with challenges, including getting enough labeled data for its neural network. As you can imagine, labeled ambient sound information is hard to come by, but the teams were still able to assemble a big enough dataset, the company says.
By focusing on ‘applause’, ‘music’, and ‘laughter’ in initial testing, the teams were able to build a framework around some of the more common sounds.
They then turned to a wider range of sounds such as ‘ring’, ‘knock’, and ‘bark’, before moving on to sounds such as ‘piano music’ and ‘raucous applause’.
It wasn’t all smooth sailing: sometimes a segment labeled ‘laugh’ contained both speech and laughter, and it was difficult to tell them apart in test data. After a bit of experimenting with sound effect caption localisation, the question became how to put them together.
Google collaborated with user experience research teams to explore design options and, eventually, to run a pilot study.
They even tested how users reacted to the captions when watching with the sound off. The response was largely positive, even when the captions got it wrong. According to YouTube, the errors ‘did not detract from the participant’s experience in roughly 50% of the cases’.
This was because those who could hear the audio were able to ignore the mistakes, while those who couldn’t hear it took the error for a sound event that didn’t affect critical speech information.
The study’s conclusion? Users were fine with the occasional mistake as long as the captions provided useful information most of the time.
Google says the new sound effect captions will make content richer for viewers who use captions, although the feature is still a work in progress. The company hopes it will spur further discussion and research into automatic captioning, and into how to make creator-generated and community-generated caption tracks richer.