My good deed this week was to download a video from the Internet Archive and upload it to YouTube with some corrected closed captioning.
The video I chose is an interview with a man from Hazard, Kentucky about what life was like in the early 1900s and what it’s like for him today. The man has a pronounced regional accent, and I was curious about how well YouTube would be able to automatically generate captions for it.
I appreciate and use closed captioning, especially as subtitles when watching foreign-language movies or when my very noisy dishwasher is running. However, when I actually tried to create captions myself, I was surprised by how little I could remember about their common conventions.
I had a lot of questions
I wasn’t sure about punctuation, whether I could leave it off when transcribing more conversational speech.
I wasn’t sure if I needed to “set the scene”, accounting for any visible or audible change, with something like stage directions.
I didn’t know if speakers’ names should be used when multiple people are speaking.
Recommendations to affix the speaker’s name to transcribed text felt awkward when I tried them out.
I didn’t know where to draw the line when making corrections. While I did end up making some, I was concerned about removing mannerisms of speech which typify regional Appalachian dialects.
While researching these questions, I couldn’t find easy answers
The captioning process
The automatic captions provided by YouTube are better than nothing, and it took less than an hour for the closed-captioning file to be created. The primary speaker in the video I uploaded has a fairly heavy accent, but the speech-to-text generation required less editing than I’d anticipated.
The closed-captioning editor built into YouTube felt intuitive to me as someone who had never used it, while still seeming like it probably rewards more advanced users. Even so, I found the editing process pretty tedious, and it was difficult for me to stay consistent. Getting super focused so you can listen to the same clip over and over and agonize over having a caption appear at the right millisecond can be fun, but there’s no getting around the fact that creating captions is labor-intensive, which of course makes it expensive.