
In this article, we discuss the Official Report of what is said in plenary and committee meetings of the Scottish Parliament. We have been piloting the use of OpenAI’s Whisper automatic speech recognition (ASR) model, which we have found to be remarkably attuned to our styles, spellings and even editing conventions.

Introduction and Early Trials of ASR

The Official Report is a substantially verbatim report on the Hansard model. It is lightly edited to remove redundancies and obvious errors that would make the report less easy to read. Acronyms are expanded the first time they appear; the names of people and organisations are verified; the titles of bills are set out in full; and a house style is applied to capitalisation and spelling for consistency.

In 2023, the office of the Official Report ran a proof-of-concept exercise using two different ASR models. Although a majority of reporters said that they either would, or possibly would, use ASR in the future, a substantial number said that they would not, typically because the amount of typing involved in editing an ASR transcript to our substantially verbatim standard was too great, even when a low word error rate was achieved.

Whisper ASR Model

Before taking the next steps, we found out about OpenAI’s Whisper model and learned that British Columbia Hansard was trying it out. We set up an account and wrote a simple script to send sound files to OpenAI to be transcribed at negligible cost.
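
For illustration, such a script can be very short. The following is a minimal sketch using OpenAI's Python client; the file name is illustrative rather than taken from our pilot, and the API key is assumed to be set in the OPENAI_API_KEY environment variable.

    # Send one sound file to OpenAI's hosted Whisper model for transcription.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("committee_meeting.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",  # OpenAI's hosted Whisper model
            file=audio_file,
        )

    print(transcript.text)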

The results were impressive. Whisper was trained on an enormous amount of audio matched with text, which helps to make it very accurate: it can predict what word should come next from the context, regardless of the quality of the audio recording or the accent of the speaker. It is particularly good at recognising where sentences start and finish, which makes its transcripts a lot less frustrating for reporters to read initially.

What was most intriguing for us was that Whisper often lightly edited transcripts in the way that we would expect to do ourselves. Rather magically, it seemed to recognise that the sound files were from the Scottish Parliament and applied our style and editorial approach. Sometimes it would add the correct surname when only a first name was spoken—less helpfully, it would also frequently add the wrong surname.

To give just a few examples:

What was said: “there has been improvements in terms of the whistleblowing processes”

Whisper: “there have been improvements in the whistleblowing processes”

What was said: “There was a lot of young people who were for for them online learning was not an option.”

Whisper: “For a lot of young people, online learning was not an option.”

What was said: “So I wonder if the cabinet secretary’s had an opportunity to reflect on the evidence we’ve heard?”

Whisper: “Has the cabinet secretary had an opportunity to reflect on the evidence that we have heard?”

What was said: “motion 11654”

Whisper: “motion S6M-11654”

We don’t know for sure what is happening, but our hypothesis is that among the 680,000 hours of training material used by Whisper were several hundred hours from the Scottish Parliament. For comparison, most languages other than English provided fewer than 1,000 hours. Videos of Scottish Parliament meetings are available on YouTube and a Scottish Parliament microsite. Since 2016, our Broadcasting department has subtitled some of those videos with the text of the Official Report, rather than with automatically generated subtitles.

On the OpenAI website, Whisper’s developers have described how they filtered the training data:

“In order to avoid learning ‘transcript-ese’, we developed many heuristics to detect and remove machine-generated transcripts from the training dataset.”

The Official Report subtitles, complete with punctuation, will have been detectable as human generated and suitable for inclusion in the Whisper training dataset. The recurring elements specific to the Official Report that we see in transcripts lead us to believe that Whisper was trained on our reports. We also note that our colleagues in legislatures whose videos are not subtitled with their Hansard text do not see the same effect in their ASR transcripts.

Prompting to Improve Transcripts

We found that, with committee meetings, which often involve external witnesses, Whisper would often take a similar approach but would sometimes revert to a more fully verbatim style. For some reason, it no longer recognised the audio as being from a Scottish Parliament meeting.

After a lot of trial and error, we discovered a prompt that reliably resulted in Whisper recognising the audio as coming from the Scottish Parliament. The prompt is this short piece of text, which we include with the request that we send to Whisper with the audio:

“The minister will now take questions on the issues that were raised in her statement. I intend to allow about 20 minutes for questions, after which we will move on to the next item of business. I would be grateful if members who wish to ask a question pressed their request-to-speak buttons.”

Although it looks unremarkable, that is something that is only said by our Presiding Officer in the chamber, so it identifies the committee audio, which may not otherwise have any identifying features, as belonging to the Scottish Parliament. Whisper then treats it in the same way as plenary.
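
In practical terms, the prompt is just one extra field on the transcription request. A minimal sketch, again assuming OpenAI's Python client, with illustrative file and variable names:

    from openai import OpenAI

    client = OpenAI()

    # The Presiding Officer's words quoted above, sent alongside the audio.
    PRESIDING_OFFICER_PROMPT = (
        "The minister will now take questions on the issues that were raised "
        "in her statement. I intend to allow about 20 minutes for questions, "
        "after which we will move on to the next item of business. I would be "
        "grateful if members who wish to ask a question pressed their "
        "request-to-speak buttons."
    )

    with open("committee_meeting.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=PRESIDING_OFFICER_PROMPT,  # identifies the audio's setting
        )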

Effects on Staff

Of course, Whisper is not perfect. It does the easy things well, but it often misinterprets, it frequently edits in the wrong way, it omits words that should be left in and it adds words that are unnecessary.

Furthermore, Whisper can’t make the subject matter any easier to work with. If a speech is about changes to agricultural support schemes, just as much attention is required with or without Whisper. Perhaps even more care is required when using Whisper, given its propensity to come up with a plausible misinterpretation. The interesting and challenging part of the job remains, with much of the typing or respeaking removed.

Given the reservations many reporters had about using ASR to report to a substantially verbatim standard, we took a cautious approach to introducing Whisper. We formed a working group of early adopters, who supported their colleagues, tested prompts and other settings, wrote training materials and led training sessions. Positive as that engagement has been, our caution was perhaps unnecessary. Now, even though not required to do so, most reporters use Whisper at least some of the time and Whisper is used to generate more than 80 per cent of our copy.

We haven’t yet achieved an increase in productivity, although that hasn’t been our focus during the pilot. We hope to see benefits in improved staff wellbeing and reduced fatigue; we will explore that through a questionnaire at the end of the pilot. Further work is required specifically on whether productivity can be increased using Whisper; for example, the reduction in typing may allow reporters to work on longer segments of audio.

The Future

What does the future hold? We have several unanswered questions.

Whereas using ASR is often linked to taking a more verbatim approach, Whisper raises the possibility of getting the benefits of ASR while retaining a substantially verbatim approach. Is that a realistic expectation?

The model is trained on historical data, so we believe that it will become progressively more out of date. How can we update and fine-tune it? Would we be better to run our own iteration of Whisper locally?
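
For reference, the open-source release of Whisper can be run locally along the following lines. This is a sketch, not our pilot setup; the model size and file name are illustrative, and fine-tuning would be a separate exercise on top of it.

    import whisper  # open-source release: pip install openai-whisper

    # The same Presiding Officer text quoted earlier, abbreviated here.
    PROMPT = ("The minister will now take questions on the issues "
              "that were raised in her statement.")

    model = whisper.load_model("large")  # downloads the weights once, then runs locally
    result = model.transcribe("committee_meeting.mp3", initial_prompt=PROMPT)
    print(result["text"])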

How will we train new reporters? Should we introduce them to the job by requiring them to type or respeak from scratch, or from now on should we train people only on how to edit ASR transcripts?

Cameron Smith is a sub-editor and Kenneth Reid is an official reporter in the Scottish Parliament Official Report.
