jesse_the_k | A Better Way to Measure Automatic Captioning

Many academic libraries/databases have been made world-readable in the past few months while students lack campus library access. Curiosity led me to the Association for Computing Machinery’s digital library (open access until 30 June 2020), where I was delighted to learn of the journal called ACM Transactions on Accessible Computing

Use the advanced search interface if you’re ready to go diving.

I found research explaining why automatic captioning is so unsatisfactory. "Word Error Rate" is the metric YouTube and other automatic speech recognition systems use as they trumpet their production of "automatic captions." Deaf & HoH users often call them "craptions." Total number of incorrect words divided by total number of words displayed doesn't map on to the info we need to understand spoken language visually. Some words we can easily infer; when names, locations, and crucial verbs go missing, comprehension plummets. This article explains in great detail, and proposes alternative metrics which could measure whether automatic speech recognition is good enough.

Predicting the Understandability of Imperfect English Captions for People Who Are Deaf or Hard of Hearing

SUSHANT KAFLE and MATT HUENERFAUTH, Rochester Institute of Technology

ACM Trans. Access. Comput., Vol. 12, No. 2, Article 7, Publication date: June 2019. https://dl.acm.org/doi/10.1145/3325862

abstract: Automatic Speech Recognition (ASR) technology has seen major advancements in its accuracy and speed in recent years, making it a possible mechanism for supporting communication between people who are Deaf or Hard-of-Hearing (DHH) and their hearing peers. However, state-of-the-art ASR technology is still imperfect in many realistic settings. Researchers who evaluate ASR performance often focus on improving the Word Error Rate (WER) metric, but it has been found to have little correlation with human-subject performance for many applications. This article describes and evaluates several new captioning-focused evaluation metrics for predicting the impact of ASR errors on the understandability of automatically generated captions for people who are DHH. Through experimental studies with DHH users, we have found that our new metric (based on word-importance and semantic-difference scoring) is more closely correlated with DHH user's judgements of caption quality—as compared to pre-existing metrics for ASR evaluation.

And isn’t it weird that academia is still using obscure abbreviations like ‌ACM Trans. Access. Comput. when nothing’s printed so there’s no space to save?

Flat | Top-Level Comments Only

From:

muccamukk

Is there any discussion of adding tags to identify speakers? All speech run together is the biggest comprehension gap I've run into.

jesse_the_k

...as this is a problem I see too often in "published" contexts (streamed/network media).

The researchers were optimizing to test their improved metric, so mostly the source was a single person speaking. They did research a pair of co-workers talking, but body language sufficed in that case.

My go to example right now is the Good Omens: Lockdown thing, which no one bothered to sub: https://www.youtube.com/watch?v=quSXoj8Kob0

But also movie trailers almost always have voices without context, and are rarely subbed.

For me, that's the biggest gap between "I can pretty well tell what's happening" and "I'm not even going to bother trying to show Nenya this."

And movie trailers are all about encouraging people to attend!

Some positives -- I'm seeing people posting shorter, silly things (to Twitter/Insta/blah) with on-screen text -- since they recognize their audience may need to mute the phone.

yourlibrarian

Given how rarely my voice to text mode gets two whole sentences correct, I can't imagine that the true ratio is very good (especially since I often have to turn on captions on TV to understand what should, by definition, be highly clear speech).

My attention was also drawn by your sidenote about abbreviations and my longstanding loathing for APA style, which not only abbreviates journal titles (as you note, quite pointless in the digital age) but also initializes the first names of authors making searches a hundred times more difficult since how many people have the same last name!

I wish I had an icon of a character shaking a fist.

Too damn true re: dictation errors. And those theoretically learn your speech patterns.

My head-canon is initials for authors was an attempt at hiding gender markers, but I'm surely wrong.

I searched for that icon and didn't find one to hand ... do you care about the identity of the fist-shaker?

In this case I imagine it was just space/cost saving as with the journals. (How often is money not the reason for something?)

Maybe not? If I'm not mistaken at one time the Daily Show did an "Old Man Stewart Shakes His Fist At" segment (which did not recur as often as it should have). Icons have so little space though that it's always a toss up if they can do a scene clearly enough to get a message across.

Feel free to take this one :,)

I have an .... icon problem. Well, no I have 310 slots so I have no problem whatsoever.

Oh I love that icon! Thanks very much, I'll upload that soon.

sanguinity

I only skimmed the article, because I was curious to see if information theory came into the proposed new measures. I was right, it does!

In information theory, "information" is a quantity that measures how "surprising" an occurreence is: one bit of information is the amount of information contained in the flip of a fair coin. That is, one bit is the amount in formation contained in being told "heads" vs. "tails" or "left" vs "right", when both options are equally probable. Obviously, there's not much information contained in the work "the" -- "the" is pretty predictable, and often a word we all saw coming. But there's a lot of information contained in "Monday", especially when it's equally probable that they might have said some other day of the week.

Anyway, information theory is a whole method of quantifying that 'surprise' factor, which would make it very handy for exactly this sort of problem. :-)

That's so cool! Thanks for the explanation.

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Inside the Little Plastic Castle

adventures with Jesse the K

A Better Way to Measure Automatic Captioning

A Better Way to Measure Automatic Captioning

Predicting the Understandability of Imperfect English Captions for People Who Are Deaf or Hard of Hearing

(no subject)

Good point

Re: Good point

Re: Good point

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

(no subject)

Profile

Follow blog by email

Adaptive Tech Resources

Disability Culture

Dreamwidth Tools

Popular Tags

March 2026

Style Credit