Many professional and scientific fields that involve representing talk as written text use more or less edited versions of the original, also known as verbatim transcriptions (Poland 1995). Verbatim transcriptions vary in accuracy, but several features of speech, such as “little words” (e.g. discourse particles such as “well”, “like”, “you know”), variations in volume, pace and pitch of talk as well as simultaneous talk, or embodied conduct, can be left unmarked. In this paper, we introduce conversation analytic (CA) transcription (Jefferson 1985, 2004; Hepburn & Bolden 2017), which rejects this type of editing and invests in capturing faithfully the details of how talk unfolds moment to moment. Here we demonstrate why researchers investigating talk-in-interaction find this way of transcribing invaluable for analysing audiovisual research data and for examining the resources that participants use to produce and interpret social action.

Conversation analysis is a qualitative research method and theoretical framework used in sociology, linguistics, educational sciences, logopedics and related fields interested in social interaction (see Sidnell & Stivers 2013). From the CA perspective, typical versions of verbatim transcription are not sufficient to represent a conversation. The CA transcription system (see Annexe I) covers not only the smallest verbal elements (e.g. “mm-m”, “uh”, cut-off words) – which can also be included in the verbatim style – but also a range of further properties of talk such as intonation, voice quality, overlapping talk, laughter and pauses. The notations used in CA and the structuring of the transcripts show that turns-at-talk consist of smaller building blocks, turn-constructional units (e.g. Selting 1996), delineated by grammatical, prosodic and pragmatic features. Embodied interactional conduct can also be transcribed (Mondada 2018; see Annexe II), as shown in our example case.

Example case

We illustrate CA transcription with a data excerpt from a video-recorded interaction, in Finnish, between a doctor and patient. The doctor (DOC) is investigating the patient’s (PAT) fatigue. He asks her to evaluate the reduction in her physical capacity in percentages.

The lines in bold contain transcription of talk; embodied conduct is in grey. Each numbered line beginning with a speaker label is a new turn-at-talk. Pauses between stretches of talk are in parentheses, their length indicated in tenths of seconds. The transcription conventions are available in full in Annexes I and II.

Physical capacity (2017.08.17_p1_00:42:44)

The excerpt shows that transcribed features as minimal as pauses and audible breath are significant for understanding how the sequence of action unfolds between the doctor and patient. In particular, the doctor orients to those features as indicating the patient’s difficulty in providing a response.

In lines 1–3, the doctor comes to the end of an extended turn in which he asks the patient to estimate in percentages how much her physical capacity has decreased. Instead of answering immediately, the patient stays silent for 0.5 seconds (line 4). Then she takes a long inbreath (.mhhhhhhh), opens her mouth, making a click sound and continuing the inbreath (.mtHHhhh), followed by another pause and a heavy outbreath, hearable as a sigh (HHHhhh) (line 5).

Such gaps and hesitation markers before talking, which delay the production of the next relevant action, are typical indications that the upcoming response will not be the one expected (“dispreferred action”, Sacks 1987), and possibly show some trouble in the ongoing action. A sigh, specifically, can foreshadow negative valence of the upcoming talk (Hoey 2014). Here, instead of giving the percentage, the patient first assesses in a quiet voice (indicated by degree marks in line 6) “it is a lot” – the decrease is serious and lamentable.

The patient’s quiet talk in line 6 and the doctor’s talk in line 7 are marked with aligned square brackets, which indicate the beginning of overlapping talk. Since the speakers start talking at the same time, we see that the doctor orients to the patient’s problem by already responding before she has produced any words. Moreover, the doctor’s turn at lines 7–9 displays an interpretation of what the problem is. He acknowledges it is “difficult” to make the estimate, and mitigates the need for accuracy by saying that the patient’s personal view is enough. Without going into detail, we note that the later discussion shows the patient’s more precise problem to be claiming such a major decrease of physical capacity (80%), which might not sound credible.

In addition to the features already discussed, the patient’s bodily conduct implies a non-immediate response. During the doctor’s request, in line 3, the patient has gazed upward with a “thinking face” (Goodwin & Goodwin 1986), thus withdrawing from mutual gaze contact with the doctor. The doctor reciprocates this by turning his gaze away (lines 6–7), thus also bodily reducing the pressure for an immediate answer.

By capturing details of vocal and embodied conduct beyond actual talk, the transcript helps to identify resources that the doctor demonstrably uses to interpret the patient’s difficulty in responding. If this conduct were not transcribed between the verbal turns, the doctor’s talk at lines 7–9 could be erroneously represented as part of the original request, instead of being responsive to what the patient does. This would misrepresent the institutional practice of making such requests (see Simonen 2012).

The features shown in the transcripts matter for the local organisation of action in a multitude of ways. Audible breath is a common resource for participants to co-ordinate interaction. Inhaling is necessary for producing talk, and can therefore index that the participant is about to take a turn. Furthermore, participants make use of changes in loudness (e.g. line 6), tempo (lines 7–8 >[fast talk]<), pitch (lines 8, 9 ↑[step-up in pitch]), overall intonation and voice quality as resources to co-ordinate turn-taking and mutual action (Couper-Kuhlen 2001; Ogden 2004; Selting 1996). Several studies also show how speakers use “little words” to indicate how their upcoming talk relates to the ongoing conversation, such as the turn-initial discourse particle “siis” (line 7) prefacing the doctor’s specification and explanation of his action (Laakso & Sorjonen 2010), and “no” (line 12) as the patient’s resource in engaging with the requested action (Sorjonen & Vepsäläinen 2016).


The CA transcription system renders visible the conversational organisation of participants’ turns and actions. Talk does not simply represent the speaker’s individual ideas and behaviour; talk is shaped by, and shapes further, the unfolding of the interaction, as a result of how the participants orient to each other and respond to each other’s conduct, making use of shared interactional practices (Heritage 1984). These practices deploy a range of verbal as well as bodily interactional resources, which can be made visible – and analysable – in transcription. A transcription system that accounts for such features is invaluable for showing how participants arrive at saying the particular things they say in settings such as medical interactions, police interrogations, court sessions, journalistic interviews or ordinary conversations. It also shows that even small and often ignored features of talk-in-interaction such as hesitation sounds, changes in tempo, audible aspiration or gaze direction are genuinely meaningful for interaction.

Katariina Harjunpää is a postdoctoral researcher in Finnish linguistics at the University of Helsinki. Suvi Kaikkonen is a PhD candidate in Finnish linguistics at the University of Helsinki.

Annexe I

CA transcription conventions (cf. Jefferson 1985; Hepburn & Bolden 2017)

Temporal and sequential unfolding of talk:

[                  Beginning of the simultaneous, overlapping talk

]                  End of the simultaneous, overlapping talk

=                 Latching, immediate continuation with a new turn or segment

(.)                Micropause, up to 0.2 sec

(0.5)            Longer pause and its duration in tenths of a second


.                  Unit-final falling intonation

;                  Unit-final slightly falling intonation

,                  Unit-final level intonation

?                 Unit-final rising intonation

↑                 Pitch upstep in next syllable or word

↓                 Pitch downstep in next syllable or word

I did           Accent

Speech delivery:

°  °              Quiet voice

YEAH        Loud voice

£ £              Segment, produced with a smile

@  @          Change in voice quality

>  <             Fast tempo

<  >             Slow tempo

:                  Lengthening of the sound

wen-           Cut-off word

we<            Glottal closure

Features accompanying talk:

.hh              Inbreath (each h corresponds to one tenth of a second)

hh               Outbreath (each h corresponds to one tenth of a second)

w(h)ent       Laughter accompanying speech

he heh         Laughter particles (transcribed as heard)

Transcriber’s metacommentary:

(went)         Assumed wording

(–)              Unintelligible passage

((cries))      Transcriber’s comments
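
Because the timing notations above are regular, they can be read mechanically. The following is a minimal Python sketch, not part of the CA conventions themselves, that collects pauses and audible breaths from one line of transcribed talk. The function name timing_events and the two patterns are hypothetical. The sketch assumes that tokens are separated by spaces, that a timed pause gives its length in seconds, and that a breath token may carry the letters "m" or "t" before the run of h's, as in the inbreaths of the example case.

```python
import re
from typing import List, Optional, Tuple

# Timed pause such as "(0.5)"; the number is read as seconds.
TIMED_PAUSE = re.compile(r"^\((\d+(?:\.\d+)?)\)$")

# Audible breath such as ".hh", "hh" or ".mtHHhhh".
# A leading "." marks an inbreath. Allowing "m"/"t" before the h's is an
# assumption based on the example case, not a rule of the conventions.
BREATH = re.compile(r"^(\.?)[mt]*([hH]{2,})$")


def timing_events(line: str) -> List[Tuple[str, Optional[float]]]:
    """Return (kind, seconds) pairs for pauses and breaths in one line."""
    events: List[Tuple[str, Optional[float]]] = []
    for token in line.split():
        if token == "(.)":
            # A micropause is "up to 0.2 sec", so no exact value is assigned.
            events.append(("micropause", None))
            continue
        pause = TIMED_PAUSE.match(token)
        if pause:
            events.append(("pause", float(pause.group(1))))
            continue
        breath = BREATH.match(token)
        if breath:
            kind = "inbreath" if breath.group(1) else "outbreath"
            # Each h stands for one tenth of a second.
            events.append((kind, len(breath.group(2)) / 10))
    return events


if __name__ == "__main__":
    print(timing_events("(0.5) .hhh mt hh (.) no"))
```

Run on the invented line "(0.5) .hhh mt hh (.) no", the sketch reports a 0.5-second pause, a 0.3-second inbreath, a 0.2-second outbreath and one micropause. Tokens that are fused with brackets, degree marks or other symbols are skipped, so the result should be treated as a rough aid for checking a transcript rather than as an analysis.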

Annexe II

Conventions for multimodal transcription (Mondada 2018; for a commentary on the list of conventions see

*  *             Descriptions of embodied actions are delimited between

+   +            two identical symbols (one symbol per participant and per type of action)

∆∆              that are synchronized with corresponding stretches of talk or time indications

*--->           The action described continues across subsequent lines

--->*           until the same symbol is reached

>>               The action described begins before the excerpt’s beginning

--->>           The action described continues after the excerpt’s end

.....             Action’s preparation

----              Action’s apex is reached and maintained

,,,,,              Action’s retraction

ric               Participant doing the embodied action is identified in small caps in the margin

fig               The exact moment at which a screenshot has been taken

#                 is indicated with a sign (#) showing its position within the turn/a time measure

Explanation of the transcript

On the translation line, we gloss the particle “siis” as PTCL (particle) due to the lack of a suitable translation in English. The particle is used in self-repair for projecting specification or explanation, and in other contexts for marking inference (“so”, “then”, “therefore”, “since”) (Laakso & Sorjonen 2010).


