As the live streaming and video sharing of events such as lectures and meetings becomes commonplace, the demand for captioning is ever increasing. In this context, our group in the Kyoto University School of Informatics has investigated automatic speech recognition (ASR) technology for captioning lectures. While re-speaking to commercial dictation software is often adopted, it still requires much skill and training. Therefore, we focus on the systems that automatically transcribe the speech of lecturers. There are two scenarios: video captioning and live captioning. Live captioning requires low latency or real-time processing, while video captioning deals with already recorded material without time constraint to achieve high quality. Note that there are still differences between video captioning and verbatim records: video captioning places more weight on faithfulness as it is presented in sync with video, while verbatim records are distributed by themselves and thus prioritize readability. These two factors, faithfulness and readability, are both important but there is a trade-off between the two because post-editing to improve readability often results in many changes to the original utterance.
First, we have looked at the captioning of video lectures by the Open University of Japan, which provides 300 courses via TV and radio programs, the majority also broadcast via the internet. However, only 30% to 50% of the lectures were captioned due to high costs. ASR can be a solution to this. The lectures are recorded in a studio, which provides good acoustic conditions, but topics and vocabulary are technical and not covered by most commercial ASR engines. Thus, we adapt the ASR system to each course with related textual material. Typically, when a textbook on the course is used, the accuracy is 90%. Furthermore, when there is a prepared script of the lecture, which is often not the case, accuracy reaches 95%.
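The article does not say how these accuracy figures are computed. As an illustration only, one common definition for Japanese text is character accuracy, i.e. one minus the character-level edit distance divided by the length of the reference transcript. The sketch below assumes that definition; the function names are hypothetical and are not taken from any ASR toolkit.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference character
                            curr[j - 1] + 1,      # insertion of a hypothesis character
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "speech recognition"
    hypothesis = "speech recogmition"
    print(f"{character_accuracy(reference, hypothesis):.1%}")
```

Under this definition, insertions count as errors too, so a hypothesis much longer than the reference can push the value below zero.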
We have conducted evaluations of some video courses in terms of accuracy and the time that has to be spent on post-editing. The result shows a clear correlation (R=0.61) between the ASR accuracy and post-editing time: more editing time is needed when there are more ASR errors. For reference, the time taken to type from scratch is about four hours for a 45-minute lecture by ordinary people who are not professionals but are skilled in captioning. This is about 5.3 times real-time. When we plot it in the correlation graph, it corresponds to ASR accuracy of 87%. If the accuracy is below this threshold, typing from scratch becomes faster. We can therefore conclude that ASR accuracy of 87% is the threshold for a usable level. Moreover, with 93% accuracy, the post-editing time is reduced by more than an hour, a reduction of one-third.
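The arithmetic behind these figures can be checked with a short sketch. It assumes that the one-third reduction is measured against the four-hour typing baseline, which the text implies but does not state outright; the function and constant names are hypothetical.

```python
LECTURE_MINUTES = 45
TYPING_FROM_SCRATCH_MINUTES = 4 * 60   # about four hours, as reported
BREAK_EVEN_ACCURACY = 0.87             # threshold read off the correlation graph


def real_time_factor(work_minutes: float, media_minutes: float) -> float:
    """How many minutes of work are needed per minute of lecture."""
    return work_minutes / media_minutes


def recommended_workflow(asr_accuracy: float,
                         break_even: float = BREAK_EVEN_ACCURACY) -> str:
    """Below the break-even accuracy, typing from scratch is the faster option."""
    if asr_accuracy >= break_even:
        return "post-edit ASR output"
    return "type from scratch"


typing_rtf = real_time_factor(TYPING_FROM_SCRATCH_MINUTES, LECTURE_MINUTES)

# Assumption: the one-third saving at 93% accuracy is relative to the typing baseline.
saved_minutes = TYPING_FROM_SCRATCH_MINUTES / 3
post_edit_minutes_at_93 = TYPING_FROM_SCRATCH_MINUTES - saved_minutes

if __name__ == "__main__":
    print(f"typing from scratch: {typing_rtf:.1f} x real-time")
    print(f"saving at 93% accuracy: {saved_minutes:.0f} min")
    print(f"remaining post-editing time: {post_edit_minutes_at_93:.0f} min")
    print(recommended_workflow(0.80), "|", recommended_workflow(0.93))
```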
Next, we have developed live captioning software. While professional reporters in parliaments, courts and broadcasts often use stenographic keyboards, lay volunteers at school and town events use normal PCs and keyboards with software that allows for collaborative PC captioning. This typically requires three or four people for one lecture. Similarly, dedicated ASR systems have been developed for parliaments, courts and broadcasts, while lay volunteers use commercial or free software. Since ASR output inevitably includes errors, we need a human editor to make corrections. In this scenario, the ASR system and a human work jointly to generate a caption, and the same collaborative captioning software can be used.
In Japan, free software named IPtalk is most widely used for live captioning by volunteers. We have developed an ASR plug-in for this software, so ASR results can be post-edited easily. There are two ASR options: one is a Google cloud-based server, and the other is our own software Julius, which can be run on a PC without an internet connection and allows for adaptation of the lexicon and language model for technical terms in lectures.
We have conducted many trials for live captioning in academic meetings. When one person, who is not a professional but is skilled in captioning, used this software, he could produce a caption of 270 characters per minute on average. Within this, the number of edits needed was 42 characters per minute, and the majority involved the deletion of redundant and erroneous words. In other words, about 84% of the caption is the direct output of ASR. This is an encouraging result. Note that Japanese characters are two bytes, roughly corresponding to two characters in western languages.
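The relationship between these rates can be checked with a few lines of arithmetic. This is a simplified reading, assuming that every edited character counts against the 270 characters per minute and that one Japanese character corresponds to two western characters as noted above; the names are hypothetical.

```python
CAPTION_CHARS_PER_MIN = 270   # captioning rate reported for one editor
EDITED_CHARS_PER_MIN = 42     # characters per minute touched by the editor
WESTERN_CHARS_PER_JAPANESE_CHAR = 2  # rough equivalence stated in the text


def unedited_share(caption_rate: float, edit_rate: float) -> float:
    """Fraction of caption characters that pass through without editing."""
    if caption_rate <= 0:
        raise ValueError("caption rate must be positive")
    return (caption_rate - edit_rate) / caption_rate


share = unedited_share(CAPTION_CHARS_PER_MIN, EDITED_CHARS_PER_MIN)
western_equivalent = CAPTION_CHARS_PER_MIN * WESTERN_CHARS_PER_JAPANESE_CHAR

if __name__ == "__main__":
    print(f"direct ASR output: {share:.1%} of the caption")
    print(f"rough western-language equivalent: {western_equivalent} characters per minute")
```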
Success scenario during the pandemic
ASR-based systems are not yet dominant in captioning. For a usable level, accuracy of 85% to 90% is desired; if the accuracy is below 80%, the system is not usable because corrections cannot be made in real time. To achieve usable accuracy, it is necessary to ensure that the following conditions are met:
(1) fluent speaking;
(2) clean recording by tuning a microphone and an amplifier;
(3) coverage of technical terms by customising the lexicon.
These conditions can be easily met in live online teaching, which has recently become the norm during the Covid-19 pandemic. Lecturers usually prepare well and speak slowly into a microphone in this setting, and technical terms can be extracted from teaching material which is usually posted online beforehand. We are currently investigating the feasibility of using the ASR system for the captioning and translation of lectures in our university. We found that 90% accuracy can be achieved for most classes by fine-tuning the system.
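Condition (3) can be illustrated with a small sketch that lists frequent terms in course material that are missing from a recogniser's vocabulary. It is not the procedure used with Julius or any other engine: the tokeniser is a plain regular expression for English text, the vocabulary and sample material are invented for illustration, and a Japanese pipeline would additionally need morphological analysis and pronunciations for each new entry.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very rough English tokeniser (assumption: whitespace-delimited text)."""
    return re.findall(r"[a-z][a-z\-]*", text.lower())


def missing_terms(material: str, vocabulary: set[str], min_count: int = 2) -> list[str]:
    """Terms that occur at least min_count times but are not in the vocabulary."""
    counts = Counter(tokenize(material))
    candidates = [w for w, c in counts.items() if c >= min_count and w not in vocabulary]
    # Most frequent first, ties broken alphabetically.
    return sorted(candidates, key=lambda w: (-counts[w], w))


# Hypothetical baseline vocabulary and teaching material.
base_vocabulary = {"the", "and", "are", "to", "model", "speech", "words"}
material = (
    "The acoustic model scores speech. The lexicon maps words to phonemes. "
    "The acoustic model and the lexicon are adapted."
)

if __name__ == "__main__":
    for term in missing_terms(material, base_vocabulary):
        print(term)
```

Raising `min_count` keeps one-off words out of the candidate list, which matters when the material is long and the list has to be reviewed by hand.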
Quality of captioning
Almost once a year, we organise a symposium on captioning technologies at Kyoto University. In this event, steno-typing, collaborative PC typing and the ASR-based system are demonstrated for captioning real lectures, sometimes in parallel. ASR generally returns the output faster, but in a large text block at once. It is more verbatim and does not have punctuation. At the end of the symposium, we ask the audience for feedback and comments. There are several issues with captions. First, in terms of the amount of text, too many captions are hard to read. Second, in terms of faithfulness, text that is too verbatim is not easy to read because of many redundant words. Third, with regard to timing, perfect real-time captioning is not user-friendly because it is tiring to the eyes. Moreover, all of these depend on user preference.
The most controversial topic in the quality of captioning is the question of verbatim captioning vs summarised captioning. There are many hearing-impaired people in the audience of the symposium. The verbatim output keeps speakers’ speaking style and personal character intact, which some hard-of-hearing people like. ASR is suitable for this purpose. On the other hand, summarised captioning, like movie subtitles, preserves only the content of speech but makes the text shorter. This is liked by many deaf people. So there is no clear consensus about which method of captioning produces the best quality for all viewers. It is also important to note that summarised captioning is possible only when produced by human editors who have adequate summarising skills.
With the improvement of speech recognition technology, automatic captioning has become a realistic option for live captioning. However, ASR output is not yet free from errors, which means that human involvement is needed to prevent the output of embarrassing text that includes mistakes. Meanwhile, it might be necessary for society to allow many options for captions, such as summarised, verbatim, and automatically generated, because there is a variety of needs.
Dr. Tatsuya Kawahara is a professor and the dean of the School of Informatics, Kyoto University.