In Issue 1/2025

Introduction

Thanks to the technological advancement of broadband communications and computing devices, including smartphones, videos have become popular in public and social media. While official records of parliamentary meetings in national and local assemblies are transcripts edited by professional reporters, many meetings are now widely broadcast online and video-archived for later access. However, captioning the videos is still challenging and thus not provided in many cases. Automatic speech recognition (ASR) technology is approaching a practical level but is still error-prone. This article addresses whether and how high-quality official meeting transcripts can be used for captioning videos of meetings.

Current status of video captioning of parliamentary meetings

According to a survey conducted in 21 EU countries by Voutilainen and Kuronen (2025, in this issue), videos of plenary meetings are made public in all the responding countries, and videos of committee meetings are also provided in a majority of the countries. While official meeting reports or transcripts are provided in all cases, captions or subtitles are provided in approximately half of the countries. Here, captions and subtitles are time-synchronised with speech at the phrase or utterance level for improved accessibility. Several methods, including ASR technology, are adopted, and they are sometimes managed by sections other than the reporting unit.

In the Japanese Parliament (Diet), all meetings, including committee meetings, are broadcast online and video-archived in both the House of Representatives and the House of Councillors. No accompanying text was provided until the House of Councillors began live automatic captioning in August 2024; these captions are generated by ASR and thus contain some errors. In Japan, there are around 1,800 local autonomous bodies, such as prefectures, cities, towns, and villages. Of these, around 1,000 provide video streaming or archiving of their assembly meetings, and approximately one third of those use their own YouTube channels to reduce costs. Only a few provide captions generated by ASR.

Manual captioning is very costly, while ASR-based captioning is not fully reliable. As a result, automatic captioning may be acceptable for live use, but it is not regarded as appropriate for permanent video archives. Official meeting transcripts, on the other hand, are provided in many instances but are not time-aligned with the speech.

Can we use official meeting transcripts for video captions?

It has long been held that official meeting transcripts are not precisely the same as what is spoken in the meetings and are thus inappropriate for captions. Parliamentary reporters refine the transcripts to improve their readability without changing their content. This process is referred to as diamesic editing, which transfers spoken language into written text. The edits include the removal of fillers and redundant words, correction of grammatical errors and colloquial expressions, and re-ordering of some phrases. Kawahara (2024) provides a detailed analysis of the editing processes in the Japanese and European Parliaments.

Kawahara’s study (ibid.) reveals that the ratio of edits has decreased significantly (from 16.3% to 9.9% on average) over the past decades. This demonstrates that the transcripts have become more verbatim and that colloquial expressions are becoming more acceptable (see also Korhonen, Kotze & Tyrkkö 2023). This trend may be influenced by the prevalence of video archives and references to them in social media. The data also suggest that using official meeting transcripts for video captions is plausible. In fact, captions on TV programmes involve minimal edits, such as the removal of fillers and correction of grammatical errors, though the edit ratio there has not been investigated.

How can we use official meeting transcripts for video captions?

It is reasonable to use high-quality official transcripts for video captions, although it takes some time before they become available, so they are not usable for live captioning. Even for archived video, it is laborious to time-align a very long text with a long audio/video recording at the phrase or utterance level, as accessible captions require. In our study, doing so manually for a one-hour video took almost one working day (approximately eight hours).

ASR technology provides a simple solution that consists of the following steps:

  1. Conduct ASR on the meeting audio to obtain a transcript with time stamps for each phrase or utterance. Note that the availability of time stamps depends on the ASR system, and the system can be fine-tuned on official meeting transcripts for enhanced performance.
  2. Conduct alignment between the ASR-generated transcript and the official meeting transcript.
  3. Copy the time stamps of the ASR-generated transcript to the corresponding part of the official meeting transcript.
  4. Segment the official meeting transcript into proper units for captions in terms of length and linguistic boundaries.
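Steps (2) and (3) above can be sketched as follows. This is a minimal illustration using generic approximate string matching from the Python standard library, not the authors' actual implementation; the word and time-stamp data are invented for the example.

```python
# Sketch of steps (2)-(3): align a (possibly erroneous) ASR transcript that
# carries time stamps to the official transcript, then copy the stamps over.
import difflib

# ASR output: (word, start_time_in_seconds) pairs, with a recognition error.
asr = [("the", 0.0), ("meating", 0.4), ("is", 0.9), ("now", 1.1), ("open", 1.4)]
# Official transcript: the corrected, edited wording (no time information).
official = ["the", "meeting", "is", "now", "open"]

# Align the two word sequences and copy time stamps for matching words.
matcher = difflib.SequenceMatcher(a=[w for w, _ in asr], b=official)
timestamps = [None] * len(official)
for block in matcher.get_matching_blocks():
    for k in range(block.size):
        timestamps[block.b + k] = asr[block.a + k][1]

# Fill gaps left by misrecognised or edited words with the preceding stamp.
for i, t in enumerate(timestamps):
    if t is None:
        prev = next((timestamps[j] for j in range(i - 1, -1, -1)
                     if timestamps[j] is not None), 0.0)
        timestamps[i] = prev

aligned = list(zip(official, timestamps))
```

Because alignment is done over whole word sequences, occasional mismatches (ASR errors or diamesic edits) only leave local gaps, which can be filled from neighbouring time stamps.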

It has been experimentally confirmed that automatic alignment between the two texts in step (2) is possible even with an ASR error rate of 20%, and that the resulting time gap between captions and audio is acceptable to some extent. However, alignment fails if considerable segments of the meeting are left untranscribed due to irregularities.
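Step (4), segmenting the transcript into caption units, can be sketched as follows. The length threshold and the punctuation-based boundary rule are illustrative assumptions, not the authors' actual settings.

```python
# Sketch of step (4): break the official transcript into caption units,
# splitting at linguistic boundaries (punctuation) and capping unit length.
import re

MAX_CHARS = 40  # assumed upper bound for one caption unit

def segment(text, max_chars=MAX_CHARS):
    """Split text into caption units at punctuation, each under max_chars."""
    units, current = [], ""
    # First split at sentence-internal and sentence-final punctuation.
    for chunk in re.split(r"(?<=[.,;])\s+", text):
        if current and len(current) + len(chunk) + 1 > max_chars:
            units.append(current)
            current = chunk
        else:
            current = (current + " " + chunk).strip()
    if current:
        units.append(current)
    return units

lines = segment("The committee will come to order. We begin with questions, "
                "starting with the member from Kyoto.")
```

Each resulting unit then inherits the time stamp of its first word from the alignment in step (3).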

Multi-media portal of meetings of Japanese national and local assemblies

The authors have worked on a project for a multi-media portal of the Japanese Diet (https://gclip1.grips.ac.jp/video/; see Masuyama, Kawahara & Matsuda, 2024). The system includes the ASR system developed for the transcription system of the House of Representatives (Kawahara, Ueno & Morikawa, 2020), which now achieves correctness of over 95% after the introduction of deep learning models. In this portal, the ASR system generates automatic captions for every meeting on the same day; these are replaced by the official meeting transcript (using the procedure described in the previous section) when it becomes available. The names of speakers are also annotated based on the meeting record. The system allows for quick searches for relevant video segments by keyword or speaker.

We are now building a similar portal for meetings of local assemblies in Japan (https://localassembly-video.jp/; an overview is shown in Figure 1).

Figure 1. Overview of a meeting video time-aligned with the official transcript

By feeding in the URLs of local assemblies' YouTube channels, the system prepares a collection of searchable videos with captions generated by the ASR system. ASR is more challenging here because the recording conditions in local assemblies are not as good as those in the national Diet, and there are many proper nouns and dialects unique to the local regions. Thus, ASR correctness degrades to 70–90%. However, we can still time-align the video with the caption text to allow for keyword searching. This enhances the accessibility of parliamentary meetings for everyone.
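The keyword search enabled by time-aligned captions can be sketched as follows; the caption segments and their time stamps here are invented for illustration, and the portal's actual retrieval machinery is more elaborate.

```python
# Sketch of keyword search over time-aligned caption segments:
# each segment pairs a start time (seconds) with its caption text.
captions = [
    (12.5, "The budget committee will now come to order."),
    (45.0, "We begin with questions on the education budget."),
    (98.2, "The next item concerns disaster preparedness."),
]

def search(keyword, segments):
    """Return (start_time, text) for every caption containing the keyword."""
    return [(t, text) for t, text in segments if keyword.lower() in text.lower()]

hits = search("budget", captions)
# A video player can then seek directly to each hit's start time.
```

Even with imperfect ASR, a hit in an approximately aligned caption still lands the viewer within a few seconds of the relevant passage.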

Conclusion

Video streaming and archiving of parliamentary meetings have become common even in local assemblies, but captioning is not yet provided in many cases. For recorded video archives, captioning using official meeting transcripts is a straightforward solution, and ASR technology can be used to align the video and the text of the transcript.

Tatsuya Kawahara is a Professor at the School of Informatics, Kyoto University, Japan.

Yuya Akita is a Professor at the School of Economics, Kyoto University, Japan.

Mikitaka Masuyama is a Professor at the National Graduate Institute for Policy Studies, Japan.

References

Kawahara, T., S. Ueno & M. Morikawa (2020). Transcription system using automatic speech recognition in the Japanese Parliament. – Tiro 1/2020. URL: https://tiro.intersteno.org/2020/05/transcription-system-using-automatic-speech-recognition-in-the-japanese-parliament/

Kawahara, T. (2024). Quantitative analysis of editing in transcription process in Japanese and European Parliaments and its diachronic changes. – ParlaCLARIN IV Workshop, pp.66–69. URL: https://aclanthology.org/2024.parlaclarin-1.10/

Korhonen, T., H. Kotze & J. Tyrkkö (eds.) (2023). Exploring language and society with big data: Parliamentary discourse across time and space. John Benjamins. URL: https://doi.org/10.1075/scl.111

Masuyama, M., T. Kawahara, & K. Matsuda (2024). Video retrieval system using automatic speech recognition for the Japanese Diet. – ParlaCLARIN IV Workshop, pp.145–148. URL: https://aclanthology.org/2024.parlaclarin-1.21/

Voutilainen, E. & R. Kuronen (2025). Text alternatives for video recordings in the parliaments of Europe. – Tiro 1/2025. URL: https://tiro.intersteno.org/2025/06/the-production-of-text-alternatives-for-video-recordings-in-european-parliaments/
