Transcription System using Automatic Speech Recognition in the Japanese Parliament

Posted May 13, 2020

In Issue 1/2020

Introduction

The Japanese Parliament (Diet) was founded in 1890. From the very first session, verbatim records were produced by manual shorthand, a practice that continued for over 100 years. Early in this century, however, the Diet ceased recruiting stenographers and investigated alternative methods. The House of Representatives adopted Automatic Speech Recognition (ASR) technology, which automatically converts the speech of MPs and ministers into text. The system has been in use since April 2011 for all plenary sessions and committee meetings. Speech is captured by the stand microphones in the meeting rooms, with separate channels for interpellators and ministers. The speaker-independent ASR system generates an initial draft, which is then corrected and edited by official parliamentary reporters. This is the first, and still the only, ASR-based system in operation in a national-level Parliament.

How does it work?

The biggest challenge in building the system was ASR accuracy. An accuracy of over 90% is needed for efficient transcript generation. This is easily achieved in plenary sessions, but it was difficult in committee meetings, which are interactive, spontaneous, and often heated. A specialised system was therefore developed using a large amount of meeting speech and text data, so that the ASR matches the target material. The major technical problem was that the official meeting records differ from the actual utterances because of the editing done by the reporters. There are several reasons for this difference: the distance between spoken and written language, disfluency-related phenomena such as fillers (e.g. “well”) and repairs, expressions that are redundant from the point of view of written text (e.g. discourse markers and repetitions without rhetorical weight), and grammatical corrections. We conducted linguistic and statistical analyses and found that 13% of the words differ, but that 93% of these differences are simple edits such as the deletion of a filler or the correction of a single word. The transformation can therefore be modeled computationally in a statistical framework.
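To make the kind of analysis described above concrete, the following is a minimal sketch of how word-level differences between a spoken utterance and its edited record might be counted and sorted into simple edit types. It is not the method used for the Diet system: the filler list, the category names, and the English example sentence are all assumptions chosen for illustration, and the alignment comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical filler list, for illustration only.
FILLERS = {"well", "um", "uh"}

def classify_edits(spoken, written):
    """Count word-level differences between a spoken and an edited word list."""
    counts = {"filler_deletion": 0, "other_deletion": 0,
              "substitution": 0, "insertion": 0}
    matcher = SequenceMatcher(a=spoken, b=written, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Words that were spoken but removed from the record.
            for word in spoken[i1:i2]:
                key = "filler_deletion" if word in FILLERS else "other_deletion"
                counts[key] += 1
        elif tag == "replace":
            # Words that were corrected by the reporter.
            counts["substitution"] += max(i2 - i1, j2 - j1)
        elif tag == "insert":
            # Words added to the record that were not spoken.
            counts["insertion"] += j2 - j1
    return counts

# Toy example: a filler, a repetition, and a grammatical correction.
spoken = "well we we think the budget are sufficient".split()
written = "we think the budget is sufficient".split()
counts = classify_edits(spoken, written)
edited_fraction = sum(counts.values()) / len(spoken)
```

Run over a large set of aligned utterances and records, counts of this kind would give the overall share of edited words and the share of edits that fall into each simple category.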

This leads to an innovative approach to semi-automated corpus generation and ASR model training. With the statistical model of the difference, we can infer possible actual utterances from the official records. By computing probabilities for all possible word sequences, including filler insertions, we can build a language model. Moreover, by referring to the audio data of each utterance, we can reconstruct what was actually uttered. By aggregating the sound patterns for each phoneme, we can train an acoustic model. As a result, we can automatically build precise models of spontaneous speech in the parliament on a large scale. Specifically, the acoustic model is trained with approximately 2,000 hours of meeting speech, and the language model is trained with the texts of the meeting records of the past twenty years. Moreover, these models evolve over time, reflecting changes in the MPs and in the topics discussed. The language model and the lexicon are updated every year to incorporate new topics and vocabulary, and the acoustic model is updated using meeting speech data after general elections. Additionally, new words can be added temporarily to the system at any time.
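The idea of computing probabilities over possible word sequences with filler insertions can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a single filler token, a fixed insertion probability of 0.1 before each word, and exhaustive enumeration of candidates, which is only practical for very short sentences. The article does not describe the actual formulation at this level of detail.

```python
from collections import Counter
from itertools import product

P_FILLER = 0.1       # assumed probability that a filler precedes a word
FILLER = "<filler>"  # placeholder token, not a symbol from the real system

def candidate_utterances(record_words):
    """Yield (word_sequence, probability) for every filler-insertion pattern."""
    for pattern in product([False, True], repeat=len(record_words)):
        words, prob = [], 1.0
        for word, has_filler in zip(record_words, pattern):
            if has_filler:
                words.append(FILLER)
                prob *= P_FILLER
            else:
                prob *= 1.0 - P_FILLER
            words.append(word)
        yield words, prob

def expected_bigram_counts(record_words):
    """Accumulate bigram counts weighted by each candidate's probability."""
    counts = Counter()
    for words, prob in candidate_utterances(record_words):
        for bigram in zip(words, words[1:]):
            counts[bigram] += prob
    return counts

record = "the budget is sufficient".split()
candidates = list(candidate_utterances(record))
total_probability = sum(prob for _, prob in candidates)
bigram_counts = expected_bigram_counts(record)
```

The weighted bigram counts stand in for the statistics a language model would be estimated from: word pairs from the official record keep most of their weight, while pairs involving the filler token receive the remainder.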

ASR performance, measured as Japanese character accuracy, is monitored for most of the meetings. Initially, in 2011, the average accuracy was 89.7%. When limited to plenary sessions, it was over 95%. Only a few meetings had an accuracy of less than 85%. The processing speed is 0.5 in real-time factor, which means that it takes about 2.5 minutes to transcribe a 5-minute segment. ASR accuracy improved after deployment and has plateaued at around 91% since 2012. Most recently, the introduction of deep learning technology has improved the accuracy by 3-4% absolute. The system can also automatically annotate and remove fillers.
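The two figures quoted above can be reproduced with a short sketch. The article does not define character accuracy precisely, so the definition used here (one minus the character-level edit distance divided by the reference length) is an assumption, as are the function names and the example sentence.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def character_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (edit distance / reference length)."""
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)

def processing_minutes(audio_minutes, real_time_factor):
    """Processing time implied by a real-time factor."""
    return audio_minutes * real_time_factor

# Toy example: one wrong character in a seven-character sentence.
accuracy = character_accuracy("予算は十分です", "予算は充分です")
minutes = processing_minutes(5, 0.5)
```

With a real-time factor of 0.5, the second function gives the 2.5 minutes of processing for a 5-minute segment stated in the text.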

The post-editing software used by reporters is vital for efficiently correcting ASR errors and cleaning up transcripts. We adopted a screen editor, similar to a word-processor interface, so that reporters can concentrate on producing correct sentences. The software provides easy reference to the original speech and video, by time, by utterance, and by character. The alignment of speech and text is a byproduct of ASR.
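Reference by time and by character relies on each character carrying the time span in which it was spoken. A minimal sketch of such a lookup follows. The data layout, the class and function names, and the timings are hypothetical and are not taken from the actual editor.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class AlignedChar:
    """One transcript character with its assumed start and end time in seconds."""
    char: str
    start: float
    end: float

def time_for_char(alignment, index):
    """Start time of the audio for the character at a given position."""
    return alignment[index].start

def char_at_time(alignment, seconds):
    """Index of the character being spoken at a given time."""
    starts = [item.start for item in alignment]
    position = bisect_right(starts, seconds) - 1
    return max(position, 0)

# Toy alignment with made-up timings.
alignment = [
    AlignedChar("予", 0.0, 0.2),
    AlignedChar("算", 0.2, 0.5),
    AlignedChar("は", 0.5, 0.6),
]
playback_start = time_for_char(alignment, 1)
cursor_index = char_at_time(alignment, 0.3)
```

The first lookup would let an editor start playback from the character under the cursor, and the second would let it move the cursor to follow the audio.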

What do the reporters say?

Five years after the new system was deployed, we conducted a survey among reporters in the House of Representatives to find out how they feel about working with it. The reporter’s job is to edit 5-minute-long texts produced by the ASR system. A team of reporters now consists of stenographers and non-stenographers. In this survey, we found that a majority of the reporters felt that it took less time and labor to complete a transcript with the ASR system, and more than 80% said that they were satisfied with its performance. Some also expressed the positive opinion that, with proper training, the system would make it possible for those who have not been trained in stenography to produce an edited transcript.

Currently, the training of reporters is conducted as follows. Those who have joined the Record Department without stenography training go through six months of basic training, followed by 1.5 years of practical training under the supervision and guidance of an experienced stenographer. To produce high-quality transcripts, reporters must acquire the knowledge and skills to listen to and understand the speech correctly. Transcripts submitted by non-stenographer reporters are becoming as good as those by trained stenographers. This suggests that the training system is functioning well.

Recently, more emphasis has been placed on fidelity to actual speech than on readability of the text. In the House of Representatives, the meeting records have been produced in a way that is more faithful to actual speech. We have observed that edits to actual utterances decreased significantly, by 40%, over the ten years from 2007 to 2017. This probably has to do with the increase in real-time streaming and video archiving of Parliamentary meetings. The discrepancy between speech and text may be further reduced in the future.

As the ASR system tries to transcribe speech faithfully, it is well suited to this trend. However, ASR is not infallible, and intervention by skilled human reporters is still essential for producing high-quality transcripts.

Tatsuya Kawahara is a professor and the dean of the School of Informatics at Kyoto University, Japan. Shoko Ueno and Masaya Morikawa are stenographers in the Records Department of the House of Representatives, Japan.
