Introduction
The field of interpreting has evolved continuously since its emergence and professionalisation, and it remains dynamic today. New tasks and phenomena keep arising within the profession, particularly in relation to barrier-free communication, which aims to make communication accessible to all individuals regardless of their physical or sensory abilities. One of the most recent developments in this context is speech-to-text interpreting.
Speech-to-text interpreting refers to the process in which spoken language is converted into written text in real time, primarily for audiences who are deaf or hard of hearing, or for anyone who requires textual support during live events. As in simultaneous spoken-language interpreting, many tasks need to be co-ordinated at the same time. A key distinction, however, is the additional responsibility of monitoring the textual output. This is essential because, although speech recognition technology has advanced significantly, it still lacks complete accuracy. Interpreters must therefore monitor the output produced by the speech recognition software, reading and, where necessary, correcting the text, all while continuing to respeak the content, that is, to re-dictate the source speech to the software. This combination of tasks imposes a significant cognitive load, making speech-to-text interpreting a topic of scientific interest. I explored this aspect of speech-to-text interpreting in depth in my research (see Matzenberger, 2023). In this article, I present the key findings from that study.
Explanation and Experiment
When people first observe the creation of a speech-to-text interpretation, they are often struck by the speed and complexity of the task. Many wonder how interpreters manage to co-ordinate respeaking, monitoring and correcting text output simultaneously. This intricate process demands high levels of focus, attention and cognitive flexibility, as interpreters must keep pace with spoken communication while ensuring that the output is accurate and readable. In my research, I sought to investigate the cognitive demands of speech-to-text interpreting, particularly the role of visual attention in managing these tasks. One of the most common methods for measuring visual attention is eye-tracking technology (see Szarkowska et al., 2018, 187), which provides precise data on where a person’s gaze is directed and how long they focus on specific areas. This tool enabled me to examine the gaze behaviour of interpreters during speech-to-text interpreting sessions.
At the Centre for Translation Studies at the University of Vienna, eye-tracking technology was employed to measure both the number and average duration of fixations during the speech-to-text interpreting process. Fixations refer to phases of eye movement in which the gaze remains steadily focused on a single point. These moments are essential for processing visual information, especially in tasks such as reading or monitoring text on a screen. During reading, the typical duration of a fixation is about 200-300 milliseconds (see Holmqvist et al., 2011; Rayner, 2009). By analysing fixations, I could gain insights into how interpreters allocate their visual attention across different elements of the task—such as monitoring the textual output, watching the speaker and following any accompanying visuals like slides or presentations.
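To give a concrete idea of how fixations are derived from raw gaze data, the sketch below implements a simple dispersion-threshold (I-DT) fixation-detection routine in Python. It is an illustrative simplification under assumed thresholds; the helper name and settings are my own, not the algorithm or configuration of the eye-tracking software used in the study.

```python
# Minimal dispersion-threshold (I-DT) fixation detection: a sketch.
# Assumption: gaze samples are (timestamp_ms, x, y) tuples; the
# thresholds are illustrative defaults, not the study's settings.

def detect_fixations(samples, max_dispersion=30.0, min_duration=100.0):
    """Group consecutive gaze samples into fixations.

    A fixation is a run of samples whose combined x/y dispersion stays
    below max_dispersion (pixels) for at least min_duration (ms).
    """
    fixations = []
    start = 0
    while start < len(samples):
        end = start + 1
        # Grow the window for as long as the points stay close together.
        while end < len(samples):
            window = samples[start:end + 1]
            xs = [p[1] for p in window]
            ys = [p[2] for p in window]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            end += 1
        duration = samples[end - 1][0] - samples[start][0]
        if duration >= min_duration:
            xs = [p[1] for p in samples[start:end]]
            ys = [p[2] for p in samples[start:end]]
            fixations.append({
                "start_ms": samples[start][0],
                "duration_ms": duration,
                "x": sum(xs) / len(xs),  # fixation centroid
                "y": sum(ys) / len(ys),
            })
            start = end  # continue after this fixation
        else:
            start += 1   # too short to count: slide the window forward
    return fixations
```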
To replicate a realistic, multimodal interpreting environment in the eye-tracking lab, the interpreted lecture was designed to simulate an on-site setting. This setup mirrors real-world interpreting situations in which interpreters must not only listen to a speaker but also engage with additional media, such as PowerPoint presentations or visual cues. In keeping with the exploratory nature of the experiment, only one speech-to-text interpreter was involved. A video was selected in which the speaker was visible alongside the corresponding PowerPoint presentation; the speaker and the slides were displayed in separate frames on the screen (see figure 1).

To analyse gaze behaviour during speech-to-text interpreting, specific areas of interest (AOIs) were predefined on the screen. In line with the simulated on-site setting, three AOIs were established: the speaker on the left side of the screen, the PowerPoint presentation in the centre and the speech-to-text interpretation on the right. These areas correspond to the key elements that interpreters typically focus on during a multimodal interpreting session. By tracking the gaze across these AOIs, it was possible to measure how much time the interpreter spent on each aspect of the task.
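To illustrate how gaze can be totalled per AOI, the following sketch (continuing the Python example above) assigns each fixation centroid to a rectangular screen region and sums the dwell time per region. The pixel coordinates are invented placeholders for a three-panel layout, not the actual geometry used in the experiment.

```python
# Sketch: map fixations (as returned by detect_fixations above) onto
# rectangular AOIs and total the fixation time spent in each one.
# The coordinates are invented placeholders for a 1920x1080 screen
# split into three panels, not the layout used in the study.

AOIS = {
    "speaker":        (0,    0, 640,  1080),  # left panel
    "slides":         (640,  0, 1280, 1080),  # centre panel
    "interpretation": (1280, 0, 1920, 1080),  # right panel
}

def dwell_times(fixations):
    """Sum fixation durations (ms) per AOI; unmatched ones count as 'other'."""
    totals = {name: 0.0 for name in AOIS}
    totals["other"] = 0.0
    for fix in fixations:
        for name, (x0, y0, x1, y1) in AOIS.items():
            if x0 <= fix["x"] < x1 and y0 <= fix["y"] < y1:
                totals[name] += fix["duration_ms"]
                break
        else:  # no AOI matched, e.g. a glance at the keyboard
            totals["other"] += fix["duration_ms"]
    return totals
```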
Findings
The eye-tracking technology allowed for precise measurement of which areas of the screen were viewed during the 20-minute experiment. The results (see figure 2) revealed that throughout the entire experiment, the speech-to-text interpretation was the primary AOI, receiving 47.2% of the total fixations (476.75 seconds). This finding suggests that the written output required the most attention from the interpreter, likely because it needed constant monitoring and correction for potential errors. The second-largest AOI was the speaker, with 30.6% of the total gaze distribution (309.56 seconds), indicating that the interpreter also needed to focus on the speaker’s verbal and non-verbal cues to ensure accurate delivery of the message. In third place was the PowerPoint presentation, which received 17.7% of the total fixations (178.75 seconds). Lastly, the smallest portion of the pie chart, labelled ‘Other’, represented areas outside the predefined AOIs, such as glances at the keyboard or parts of the screen that were not of primary focus. These accounted for 4.5% of the total visual attention during the interpretation (45.09 seconds). In total, fixation time amounted to 16 minutes and 50 seconds (1,010.15 seconds), while the remaining 3 minutes and 22 seconds consisted of saccades—rapid eye movements that occur between fixations.
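For readers who want to retrace the arithmetic, each percentage above is simply that AOI's fixation time divided by the total fixation time; a few lines of Python reproduce the reported distribution from the figures given in the text:

```python
# Recomputing the reported gaze distribution from the per-AOI
# fixation times given in the text (values in seconds).
seconds = {"interpretation": 476.75, "speaker": 309.56,
           "slides": 178.75, "other": 45.09}
total = sum(seconds.values())           # 1010.15 s = 16 min 50 s
for name, s in seconds.items():
    print(f"{name}: {s / total:.1%}")   # 47.2%, 30.6%, 17.7%, 4.5%
```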

The analysis of fixations demonstrates a clear prioritisation of the speech-to-text interpretation, with nearly 50% of the fixation duration focused on the text output. This emphasises the critical role of monitoring the interpretation for accuracy. The data also indicate that the interpreter’s gaze on the presentation slides is flexible, depending on the content and function of what is being displayed. For example, if a slide is purely decorative, the interpreter may devote less attention to it. Further analysis in the study also focused on gaze direction during the monitoring phase. In this context, Moser’s (2002) hypothesis was confirmed: during interpreting, the gaze tends to shift toward the location where relevant information is expected to appear.
The experiment also considered the potential effects of the unusually long 20-minute duration on the interpreter’s gaze behaviour. Typically, speech-to-text interpreting sessions are shorter, as the cognitive demands of the task can lead to fatigue over time. However, the extended session in this study provided valuable insights into how visual attention is managed over longer periods, offering a basis for further research on the sustainability of speech-to-text interpreting in real-world scenarios.
Conclusion
In conclusion, my experiment provided initial insights into the visual attention patterns of interpreters during speech-to-text interpreting. The findings contribute to a growing body of research aimed at understanding and optimising this process, particularly with respect to managing the cognitive demands on interpreters. If these results have piqued your interest, I warmly encourage you to explore the full research. By expanding the sample size and conducting further investigations, we can gain deeper insights into the complex interrelations involved in speech-to-text interpreting. It remains exciting to follow the developments and new discoveries in this ever-evolving field.
Julia Matzenberger is a certified speech-to-text interpreter, live subtitler and freelance interpreter for German, English and French.
References
Holmqvist, Kenneth; Nyström, Marcus; Andersson, Richard; Dewhurst, Richard; Jarodzka, Halszka & Van De Weijer, Joost (2011). Eye Tracking: A Comprehensive Guide to Methods and Measures. Oxford, United Kingdom: Oxford University Press.
Matzenberger, Julia (2023). Visuelle Aufmerksamkeit beim Schriftdolmetschen: Eine Eyetracking-Studie [Visual attention in speech-to-text interpreting: An eye-tracking study]. Master's thesis, University of Vienna. Available at https://utheses.univie.ac.at/detail/69464/
Moser, Barbara (2002). Situation Models: The Cognitive Relation Between Interpreter, Speaker and Audience. In: Israël, Fortunato (ed.) Identité, altérité, équivalence? La traduction comme relation. Paris: Lettres Modernes Minard, 163-187.
Rayner, Keith (2009). The 35th Sir Frederick Bartlett Lecture: Eye Movements and Attention in Reading, Scene Perception, and Visual Search. Quarterly Journal of Experimental Psychology 62 (8), 1457-1506.
Szarkowska, Agnieszka; Dutka, Łukasz; Szychowska, Anna & Pilipczuk, Olga (2018). Visual Attention Distribution in Intralingual Respeaking: An Eye-Tracking Study. In: Walker, Callum & Federici, Federico M. (eds.) Eye Tracking and Multidisciplinary Studies on Translation. Amsterdam: John Benjamins, 192-201.