In Issue 2/2021

Innovations most often come in waves, and the third industrial revolution, promising the digitalisation of manufacturing, happens to coincide with the third wave of automatic speech recognition (ASR) technology. ASR is now available for larger audiences through accessible cloud platforms which fully automate the task of transcribing the spoken word into written text. What does this imply for professional reporters, and what potential do these cloud solutions have in a parliamentary context?

Three waves

Initially, ASR was a niche technique and a token of obsession for language and computer scientists, but a frustrating tool for the end user because the results were of poor quality: correcting the ASR output could cost more energy than manual transcription. Expectations were high for decades, but the first ASR wave was tiny – maybe not more than a ripple on the surface.

The second wave followed in the 2010s as ASR became mainstream. Reporters and other language experts incorporated ASR into their daily workflow with respeaking software like Dragon NaturallySpeaking. Ordinary people were no longer ashamed to talk to their smartphones to ask Siri for a nearby restaurant. The ASR technology behind these innovations was reliable enough, and with the help of artificial intelligence and backed by investments from large companies, it was no longer frustrating to use.

The 2020s seem to mark the true breakthrough of ASR for the masses. In 2021, IBM, Microsoft and Google all offer ASR solutions on user-friendly platforms that do not require any downloads or coding knowledge, accommodating English and up to 100 other languages with the help of state-of-the-art speech recognition models. Several start-ups have tried to brand their own ASR products, in search of their share of an $8.3 billion market which, according to recent figures, could grow to as much as $22 billion by 2026 (Markets and Markets 2021).

Shiny promises

All ASR providers have one thing in common: they do not ask the user for much more than to upload an audio file containing the speech that needs to be transcribed, they take care of this transcription within minutes or hours, and they promise to do so with up to 99% accuracy if the audio quality is decent. Prices are fair – though that is not to say low – and users have the flexibility to pick a package that suits their needs, from a fixed rate of around $10 per hour to a credit system or monthly subscription plan.

Sounds attractive, doesn’t it? Indeed, any professional reporter should feel encouraged to experiment with ASR on these cloud platforms, and luckily most providers offer a free trial. But whether you are a freelance reporter or affiliated with an institution like a parliament, it is important to look beyond the shiny promises and take at least two factors into account when using these cloud platforms. They should be considered when deciding, first, whether ASR is a solution for you and, secondly, whether a cheaper cloud platform or a tailor-made and more expensive ASR tool is the wiser investment.

Data ownership

In The Five Senses: A Philosophy of Mingled Bodies (2008), the French philosopher Michel Serres wrote, “We no longer live addicted to speech; having lost our senses, now we are going to lose language, too. We will be addicted to data … Information is becoming our primary and universal addiction”. True or not, it is always important to ask who owns the data. Anyone who uploads a file to an ASR platform enters into an agreement with a tech company: you give us data, we give you text. The tech companies need data on a huge scale, not only to tweak the performance of their ASR solutions but also for other innovations they are working on. It is the main reason they keep their rates low: the initial investments they made would justify a higher price, but data sourcing is a form of investment for them.

Target audience

Whom are the tech giants targeting? Basically, anyone who has to process huge amounts of the spoken and written word, such as lawyers, doctors, journalists and call-centre executives. For them, manually transcribing or editing speech into text is painstaking or costly. They do not regard transcribing as a skill that comes with its own quality standards and joys, but as a task to get out of the way so they can do what they are good at. To offer technology to the masses means to compromise. Professional reporters, by contrast, know the specific contexts in which they operate and the lingo these require. They know all the tricks, in the same way that a professional translator comes up with a readable text where an automated translation fails.

Experiments with ASR in the Dutch Parliament

In an experiment we set up in the Dutch Parliament, we found the ASR results from several cloud providers nearly convincing. We compared two modes of work. In the first test, we used the ASR output as a base for transcribing and editing a text according to our standards. In the other test, we performed this task manually. We witnessed a (sometimes impressive) leap in quality from the first ASR products we started testing around 2015, but still found them to underperform structurally when handling parliamentary jargon and insider language, of which, frankly, we deal with a lot. Searching for the correct spelling of a name, institution or Bill is an important part of our reporting job, and ASR-generated text often fails there. Also, the ASR output still requires a great deal of editing, to the point that editing it takes more time than transcribing manually and then editing from scratch. Furthermore, we want to keep control over our data. So, overall, we are still searching for a more specialised ASR tool that has the potential to make our job easier.

Deru Schelhaas has been a parliamentary reporter in the Dutch House of Representatives since 2014. He is a member of a working group that investigates and tests relevant technical developments for parliamentary reporting and broadcasting.

References

Markets and Markets 2021: “Speech and Voice Recognition Market Worth $22.0 Billion by 2026”. Markets and Markets. URL: https://www.marketsandmarkets.com/PressReleases/speech-voice-recognition.asp

Serres, Michel 2008: The Five Senses: A Philosophy of Mingled Bodies. Continuum.
