Hansard Services in British Columbia is marking 50 years of reporting the debates of the Legislative Assembly. The anniversary has been a time for us to reflect on the changes that the organisation, our predecessors and the current team have undertaken. While changes usually seem dizzying and frustrating at the time, they are part of a history of change: from typewriters and shorthand to tape players, to computers, to digital audio, to many versions of Microsoft Word. We have benefited from the willingness of staff to embrace these changes and find the ways that the new tools could support their work, rather than be swamped by change imposed from the outside.
Why Implement Automated Speech Recognition?
Long hours of transcription work can be physically and mentally taxing. In years past, we had periods of intense activity followed by months of less work when staff could get a break. Now, the work is nearly constant.
For many years, thanks to a partnership with the broadcasting team, we offered the closed captioning from our broadcast to transcription staff as a supplement to ease typing. Although the captioning text was provided by a dedicated contractor, we found that its quality had become quite variable recently. Because of cost, the captioning also did not cover parliamentary committee proceedings.
So, continuing our history of embracing new technologies, at the beginning of this year we began a pilot project to implement automated speech recognition (ASR). We approached the project anticipating both professional and technical hurdles but confident that it would be worth reviewing the technology. Perhaps the results would help us to meet some of the challenges we faced.
We anticipated that staff responsible for transcription might be somewhat apprehensive about the introduction of an ASR system, so we tailored our communication about the project to emphasise that it was a supplement to their work. Our aim was to reduce the mechanical task of typing and so allow more time for discerning meaning, applying their knowledge of topics and procedure, and editing extemporaneous speech.
When a previously manual process is partially automated, it is often the most routine and repetitive portions which are taken over by the computer. But working on the repetitive task can represent a bit of a mental break, a variation in tasks for the staff member. Unfortunately, what is left after automation is the most mentally taxing aspect.
An ASR system also presents a set of novel editorial challenges. We were guided in our thinking, in part, by the experience of our colleagues at the Oireachtas, the Irish national parliament, as described in Tiro earlier this year in the excellent article by Carl Lombard. When all the spoken words are recognised, the transcription process changes. Rather than being additive, it becomes a subtractive process where much that is recognised but not reported is removed. This is particularly true for our “substantially verbatim” reporting style, where false starts, self-corrections, routine acknowledgements and other artifacts of speech are eliminated.
The system that we implemented essentially presents transcription staff with a wall of text, offering no paragraphing or speaker diarisation and little punctuation. Some transcription staff found that to be somewhat intimidating. In response, we were able to work with our vendor to include annotating information about the current speaker and item of parliamentary business to break up the text a little more.
We found that staff had to guard against complacency; the recognised text is very good but not perfect. While listening and following the recognised text, it is easy to hear what is written and so miss an ASR error. One of our editors described approaching the task “ears first”—that is, closing her eyes or looking away from the screen while listening to a segment in order to avoid this confirmation bias.
A survey of the technology showed us that the standard measures of word error rates were becoming less important as deep-learning-based models helped many offerings reach similarly high levels of recognition. Instead, the main differentiating factor we found was integration: the amount of friction in the process from recording, through recognition, to the words in the preferred transcription and editing program. Some tools required the complete audio from the meeting before processing could begin; others required signing into a web application, then selecting and copying recognised content into a separate editor. Still others required most of the transcription to happen within their own browser-based editor. Every click or separate action required to retrieve the recognised text represents a barrier to use of the tool.
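For readers unfamiliar with the measure, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the recognised text into a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the tooling used in our pilot; the function name and the sample sentences are hypothetical, and the text normalisation (lower-casing and splitting on whitespace) is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: normalisation here is just lower-casing and
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a seven-word reference.
reference = "the member for surrey has the floor"
hypothesis = "the member for sorry has floor"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A measurement of this kind needs a verbatim reference. Comparing ASR output with a "substantially verbatim" edited transcript would count deliberate editorial removals as recognition errors.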
The implementation we selected was provided by our existing transcription system vendor, Sliq Media Technologies. Its existing understanding of our systems and programmatic access to our audio archive meant that it could train and integrate the new ASR features directly into our existing tools. That work was ably supported by our internal technical team, who were familiar with the transcription system and could assist with integration and testing. The result is that each five-minute audio segment is recorded and sent to a remote server for recognition, and the recognised text is available to the transcriber about one minute later. A single click in Microsoft Word inserts the recognised text and annotation information into the document for editing.
After a few months of use, we have found the ASR system to be an accurate and reliable improvement over the previous closed captioning supplement. Even long-serving staff who are very capable typists have welcomed it. Staff have reported that the system handles even difficult speakers well, which reduces their stress and strain.
We can say that we have embraced and incorporated this new technology into our processes, rather than been overtaken by it. As the field of ASR continues to evolve rapidly, we will evaluate the technology frequently to stay on top of these changes, both by experimenting with new tools and by measuring the performance of our system.
Dan Kerr is the Manager of Publishing Systems at the Legislative Assembly of British Columbia in Canada.
Lombard, Carl 2022: Experimenting with Automatic Speech Recognition in the Houses of the Oireachtas (Irish Parliament). – Tiro 1/2021. URL: https://tiro.intersteno.org/2022/07/experimenting-with-automatic-speech-recognition-in-the-houses-of-the-oireachtas-irish-parliament/