In Issue 1/2025

From Amazon to Parrot

In 2022, I wrote in Tiro about our vendor-supplied ASR system at the Legislative Assembly of British Columbia, built on Amazon's recognition technology (Kerr, 2022). At the time, our approach was to evaluate and adopt emerging tools, using ASR as a supplement to support transcription staff. But technology moves quickly. By late 2022, OpenAI had released Whisper as an open-source model, demonstrating a marked improvement over existing solutions, including our own. Recognising its potential, we began investigating how Whisper could offer greater accuracy, flexibility, and control.

Trained on 680,000 hours of multilingual recordings — including, we suspect, many parliamentary broadcasts and transcripts — Whisper was particularly well suited to our use case. OpenAI's benchmarking placed it close to professional transcription accuracy, with a word error rate (WER) of 8.81%. However, the best results came from a "human-in-the-loop" approach, where ASR served as an assistive tool rather than a replacement for professional editors: that approach yielded a WER of just 7.61%, according to OpenAI's testing. Because Whisper is open source, we also gained access to a growing community of developers, tools, and improvements on the initial system, smoothing the development of our implementation, which we dubbed Parrot.
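For readers unfamiliar with the metric, the WER figures quoted above are word-level edit distances: the minimum number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the reference length. A minimal sketch (the example sentences are our own, not from OpenAI's benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four: WER = 0.25
print(wer("the member for victoria rises", "the member for victoria rises"))  # 0.0
```

A WER of 8.81% thus means roughly one word in eleven differs from the reference — which is why the figures for speech recogniser output and human-edited text are compared on the same scale.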

Human editors

Once Parrot was introduced, our focus shifted to ensuring that it integrated smoothly into our editorial workflow. One difference between ASR-based transcription and previous workflows was the absence of speaker diarisation. Previously, when editors worked with broadcast closed captioning, speaker names were included, making attribution straightforward. However, ASR systems, including our previous vendor-supplied model, do not easily identify individual speakers. To close that gap, we integrate microphone activation data from our sound system, synchronised by timestamp with the recording, so that editors can accurately attribute each contribution. While Parrot improved the accuracy of the transcription supplement, its implementation reinforced a critical point: ASR is a tool to support editorial judgment, not replace it. Our professional editorial staff quickly recognised its strengths and limitations. In a post-implementation survey, one editor put it succinctly: "Sometimes it won't be able to pick up words that are spoken unclearly. That's why we (human editors) are here."

Shifting approach in editing

By handling routine transcription work, Parrot freed up time for editors to focus on clarity, formatting, local and Indigenous terms and phrases, verification of terms and other tasks. However, this also required a shift in how they interacted with the text. As another editor noted:

“The only thing to watch out for is not letting the ASR dull your sharp editorial eye. It’s easy to be swayed by the words it chooses to render to text, so it’s extra important to remember to still use your ears to carefully listen to the audio and make your own judgments about the words spoken by the members.”

This approach has proved important to maintaining the accuracy and reliability of our products. Parrot provides an efficient initial pass, but editorial oversight remains essential. Ensuring that staff remain in control of the process was a guiding principle in how we implemented the system. We anticipated that integrating ASR might raise concerns among staff, so we took a proactive approach. Senior editors were involved early in the development, and their insights guided the design of the system. This also shaped how we communicated about Parrot, both internally and with leadership: we consistently positioned it as a supplement to editorial work, reinforcing that final control remained with staff. As a result, the system was well received by the editorial team, with feedback focused not on resistance, but on practical guidance for fellow editors, ensuring best practices for its use and highlighting potential pitfalls to avoid.

With a focus on a seamless editorial workflow, we integrated Parrot's functionality directly into Microsoft Word. By enabling editors to insert the recognised text at their discretion with the push of a button, we reinforced their authority in controlling and validating the output. We worked closely with staff to develop and implement a search-and-replace function, allowing supervisors to apply live refinements that improve the text while keeping control of those refinements with the editorial team. The intent was to make using Parrot as frictionless as possible. We designed the system to be automated, requiring minimal maintenance while remaining adaptable to future improvements.

Local implementation and fine-tuning accuracy

An additional technical consideration in implementing Parrot was deciding whether to deploy it as a cloud-based or local system. A cloud implementation offered powerful computing resources for quick recognition but required uploading sensitive recordings to external servers beyond our direct control. A local implementation, though requiring greater involvement from our IT team, ensured full control over data security and cost management. These considerations led us to choose a local deployment, and with the hardware resources provided by our IT team (several NVIDIA A2 GPUs), Parrot requires only about 20 seconds to process a five-minute segment. By balancing technical, operational, and editorial considerations, we built a system that has required little ongoing maintenance or additional cost during its two years of operation.
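Those figures imply a comfortable real-time margin, which is worth making explicit when sizing local hardware. A quick back-of-the-envelope calculation, using only the numbers quoted above:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds

# A five-minute (300 s) segment processed in about 20 s:
rtf = real_time_factor(5 * 60, 20)  # 15.0

# At that rate, an hour of proceedings needs about four minutes of GPU time.
minutes_per_hour_of_audio = 60 / rtf  # 4.0
```

A real-time factor well above 1.0 is what lets a modest local GPU keep pace with live proceedings, with headroom left for reprocessing or model upgrades.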

Looking ahead, we are exploring ways to further improve Parrot's accuracy. Whisper is a robust model, but it does not always recognise regional or domain-specific vocabulary. Drawing inspiration from Professor Tatsuya Kawahara's work with the Japanese House of Representatives (Kawahara, 2012), where a custom ASR model achieved 95% accuracy, we plan to fine-tune the Whisper medium English model we are using with selected segments of our own recordings and transcripts. This process will adapt the model to better recognise specialised terminology, speaker patterns — including non-native speakers — and other parliamentary nuances. Importantly, we will not train the model on edited transcripts, as this could encourage it to apply editorial interventions that we believe should remain the responsibility of our professional editors. We plan to run the fine-tuning process periodically, with the frequency depending on the complexity of preparing training data and evaluating results.
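The data-selection rule described above — verbatim text only, no edited transcripts — can be sketched as a filter over candidate training pairs. The 30-second limit reflects Whisper's fixed audio window; the structure below is illustrative, not our production pipeline:

```python
from dataclasses import dataclass

# Whisper processes audio in windows of at most 30 seconds, so each
# training pair must fit that budget.
MAX_SEGMENT_SECONDS = 30.0

@dataclass
class Segment:
    start: float      # seconds into the recording
    end: float
    raw_text: str     # verbatim (unedited) transcript text only

def select_training_pairs(segments: list[Segment]) -> list[Segment]:
    """Keep segments that fit Whisper's audio window and carry non-empty
    verbatim text. Edited transcripts are excluded upstream, so the model
    cannot learn the editorial interventions we reserve for human editors."""
    return [s for s in segments
            if 0 < (s.end - s.start) <= MAX_SEGMENT_SECONDS
            and s.raw_text.strip()]
```

Keeping the selection logic this explicit also makes each periodic fine-tuning run reproducible: the same filter over the same archive yields the same training set.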

Conclusion

Implementing ASR in a parliamentary setting is not just a technical challenge — it requires deep expertise in parliamentary reporting. While we are not AI researchers, we are experts in our domain, and that expertise has guided every decision in how we implemented and use ASR. We designed Parrot to reinforce editorial and technical control, in a user-driven approach. Our role was not to develop ASR technology from the ground up but to apply our expertise in how it should be used in our context, upholding editorial standards and data security. This is a specialisation in its own right — one that does not require advanced technical expertise but demands a clear understanding of reporting standards, editorial style, and the realities of transcribing extemporaneous speech. As ASR continues to evolve, professionals in our field should feel empowered to lead its application in their areas, shaping these tools to meet the unique needs of their institutions rather than feeling dictated to by technological developments.

Dan Kerr is the Manager of Publishing Systems at the Legislative Assembly of British Columbia in Canada.

References

Radford, A., J. W. Kim, T. Xu, G. Brockman, C. McLeavey & I. Sutskever (2022). Robust Speech Recognition via Large-Scale Weak Supervision. – OpenAI. URL: https://cdn.openai.com/papers/whisper.pdf

Kawahara, T. (2012). Transcription System Using Automatic Speech Recognition for the Japanese Parliament (Diet). – Proceedings of the AAAI Conference on Artificial Intelligence, 26(2), 2224–2228. URL: https://doi.org/10.1609/aaai.v26i2.18962

Kerr, D. (2022). Automated Speech Recognition: Ears First! Embracing Technological Change at the Legislative Assembly of British Columbia. – Tiro 2/2022. URL: https://tiro.intersteno.org/2022/12/automated-speech-recognition-ears-first-embracing-technological-change-at-the-legislative-assembly-of-british-columbia/
