With the increasing use of video recording in social research, methodological questions about multimodal transcription are more timely than ever before. How do researchers transcribe gesture, for instance, or gaze, and how can they show to readers of their transcripts how such modes operate in social interaction alongside speech? Should researchers bother transcribing these modes of communication at all? How do they define a ‘good’ transcript? In this paper we begin to develop a social semiotic framework to account for transcripts as artefacts, treating them as empirical material through which transcription as a social, meaning-making practice can be reconstructed. We look at some multimodal transcripts produced in conversation analysis, discourse analysis, social semiotics and micro-ethnography, drawing attention to the meaning-making principles applied by the transcribers. We argue that there are significant representational differences between multimodal transcripts, reflecting differences in the professional practices and the rhetorical and analytical purposes of their makers.
Read the full article: Bezemer, J. & D. Mavers (2011). Multimodal Transcription as Academic Practice: A Social Semiotic Perspective. International Journal of Social Research Methodology 14, 3, 191-207.
Transcription is a common academic practice. In social research investigating language and communication, a ‘transcript’ usually refers to a distinctive genre associated with turning a strip of ‘naturally’ occurring talk – e.g. a job interview, a conversation at the dinner table – into writing. This genre has analytical as well as rhetorical purposes: to develop insights into the moment-by-moment and in situ construction of social reality and to provide evidence in developing an argument for an academic audience. With the increasing use of video recording in social research, methodological questions about multimodal transcription are more timely than ever before. How do researchers transcribe gesture, for instance, or gaze, and how can they show to readers of their transcripts how such modes operate alongside speech? Should researchers bother transcribing these modes of communication at all? What are the epistemological implications of choices of inclusion and exclusion? What does one gain from inclusion of modes other than speech if the aim of transcription is to focus on a selection of the vast amount of data collected? In this paper we investigate how some researchers, including ourselves, have dealt with these issues. We discuss the emergence of multimodal transcripts in social research, review the theoretical perspectives on transcription adopted in conversation analysis and linguistics and develop our own, social semiotic take. We then use this framework to analyse and compare a selection of multimodal transcripts which have appeared in recent academic publications.
In this paper we have analysed and discussed a small number of ‘multimodal’ transcripts from a social semiotic perspective with the aim of gaining a better understanding of the methodological implications of this changing academic genre. We have suggested directions along which to investigate transcripts: how common principles (framing, selecting and highlighting) and modes of transcription (writing, image and layout) are differently used in video-based social research. Whilst these principles operate in all modes, each mode provides distinctive potential for re-constructing video data, and these choices shape the account of social interaction in significant ways. Such reconstructions are inevitable and essential outcomes of any video analysis, and it is through reconfiguring video data that researchers and their audiences can see the observed interaction in the categories appropriate to their discipline(s) and position themselves in relation to those discipline(s).
We have identified some significant methodological differences between the various transcripts. We can see, for instance, that their makers have chosen to represent strips of interaction ranging from a few seconds (Heath et al.) to a couple of minutes (Erickson). Their transcripts also point to different units of analysis, such as ‘turns’ (Erickson), ‘actions’ (Heath et al.), ‘higher level actions’ (Norris), and ‘modes’ (Mavers). These differences reflect the professional interests of the makers, and their analytical and rhetorical concerns. As conversation analysts, Heath et al. are particularly concerned with the temporal unfolding of action, and they tend to look for more detail in shorter clips than for less detail in longer clips. They want to show that interaction is sequentially organized, and that this interaction unfolds in different forms of action. As a social semiotician, Mavers calls these forms of action ‘modes’, and highlights the modal organization of interaction. Norris wants to discern ‘simultaneously performed higher level actions’ (2004: 101) in a 30-second clip, and so frames the moments where new ‘higher level actions’ are added to what is going on. Bezemer identifies ‘types’ of bodily configurations in a 5-minute clip; thus, like Norris, he moves beyond the ‘micro’ actions which are in focus in Heath et al.’s transcript. Another difference between the transcripts is their treatment as a ‘source of evidence’. In conversation analysis, evidence for an argument ought to reside in the transcript, and other, ethnographic sources of evidence may be considered as ‘supplementary’. In the ‘micro-ethnographic’ approach adopted by Erickson, such sources of evidence are given more equal weight. Thus the transcripts have different positions in the originating analysis and rhetoric.
In this paper we have looked at transcripts as finished products, as professional artefacts. We treated these artefacts as mediating social interaction between the ‘makers’, the represented materials, and the (imagined) ‘readers’. The framework set out here is not only useful from the perspective of the sociology of science, but can also be applied by individual researchers undertaking video-based social research to reflect on the methodological and theoretical implications of choices around transcription. Such reflection, we believe, should not focus on representational ‘accuracy’. Rather, transcripts should be judged in terms of the ‘gains and losses’ involved in re-making video data. It is crucial to make those gains and losses transparent: for example, which modes of communication used in the observed activity have been excluded from the transcript and why, and what effect that exclusion has on the analysis and on subsequent reader interpretation. It also promotes reflection on the effects of transduction: how use of the mode of transcription shapes what is re-presented. Transcription conventions accommodate such transparency and consistency, but are currently utilized only in transcribing speech to writing. Contemporary practices in multimodal transcription may require information, for example, on how images were constructed. Such conventions cannot and need not be standardized beyond the study/project/publication for which they are used, but they need to be made transparent to readers.
We have discussed only a small number of transcripts in this paper, and so our conclusions are provisional. We are in the process of expanding our data set, so that we will be able to systematically analyse a broader range of multimodal transcripts. This corpus will increase the variety of transcripts we have worked with so far, to include, for instance, musical notation as a mode of transcription showing the relative temporal locations of vocally stressed syllables in talk (e.g. Erickson, 2004), ‘Laban’ script as a mode of transcription for movement (Duranti, 1997, p.149) and geographical maps detailing the direction of gaze (e.g. Haviland, 2003). Further research will also need to involve interviewing the makers of transcripts, as well as observations of transcription activities (cf. Vigouroux, 2007) to gain more insight into transcription as a situated practice (Mondada, 2007) and its ‘effects’ on readers. Transcripts, like any form of representation, are not only socially and culturally shaped, they are also situated (Mondada, 2007), that is, they are produced in a local, social, physical context in which certain representational resources are available and others not. Increasingly, computer software technologies are part of those resources, and they shape ‘transcription’ (Plowman & Stephen, 2008). Transcription may also be a collaborative activity, involving a number of participants who are engaged in a ‘data session’. It would be particularly interesting to compare different versions of a transcript, made in different situations, involving different participants, and the processes of review and reformatting. The interactions between transcribers, and between transcriber and computer, and the local ecologies within which these interactions unfold need to be researched ethnographically. So does the ‘reading’ of transcripts: to our knowledge, no one has observed how people engage with transcripts. We do not know what they attend to, in what order, or indeed if they read the transcript at all. In short, our paper is only a very modest theoretical and empirical contribution to important methodological questions.