XTranscript and the CASE XML Conversion Tool

The Research Development Unit for English Studies have developed a number of tools to assist quantitative language research.

Researchers

CASE XML Conversion Tool: Convert CASE transcripts into XML for quantitative study

We worked on an exciting project to record English spoken by students in academic institutions around the world. The Corpus of Academic Spoken English (CASE) has been compiled by a team of researchers at Trier University of Applied Sciences. Birmingham City University is one partner providing students for the project, and we also developed software to support the analysis of the transcribed spoken data.

Our CASE XML Conversion Tool converts the project's default mark-up (based on discourse analysis notation) to a bespoke XML schema, encapsulating all of the original information in a machine-readable form. The XML versions of the transcripts enable additional levels of computational analysis. For example, XPath searches can locate features of the texts with relative ease, and frequency information about those features can then be extracted.
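As a rough illustration of this kind of search, the sketch below runs XPath-style queries over a small transcript and counts the features it finds. The element and attribute names (transcript, u, who, w, pause, overlap) are invented for the example and are not the actual CASE schema. It uses Python's standard library, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical transcript fragment: the element and attribute names
# (transcript, u, who, w, pause, overlap) are invented for illustration
# and do not reflect the real CASE schema.
SAMPLE = """
<transcript>
  <u who="A"><w>so</w><pause/><w>what</w><w>do</w><w>you</w><w>study</w></u>
  <u who="B"><w>I</w><w>study</w><pause/><w>English</w><pause/><w>linguistics</w></u>
  <u who="A"><overlap><w>oh</w><w>nice</w></overlap></u>
</transcript>
"""

root = ET.fromstring(SAMPLE)

# XPath-style search: every pause that is a direct child of an utterance.
all_pauses = root.findall(".//u/pause")

# Frequency of pauses per speaker.
pauses_by_speaker = Counter()
for utterance in root.findall(".//u"):
    pauses_by_speaker[utterance.get("who")] += len(utterance.findall("pause"))

# Frequency of word forms across the whole transcript (case-insensitive).
word_counts = Counter(w.text.lower() for w in root.iter("w"))

print(len(all_pauses))
print(dict(pauses_by_speaker))
print(word_counts.most_common(1))
```

A full XPath or XQuery processor would allow richer queries than the subset used here, but the principle is the same: once the notation is stored in elements, features can be selected and counted directly.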

Find out more about the CASE Project on the CASE Project website.

The CASE XML Conversion Tool has been developed into the XTranscript system.

XTranscript: Convert Conversation Analysis style transcripts into XML for quantitative study

XTranscript is an online tool for converting transcripts saved in mainstream document formats (such as Microsoft Word, Open Document, PDF or TXT) into a lightweight XML format.

It has been developed in the field of linguistics to enable combined qualitative and quantitative studies of spoken language. Converting transcripts into XML allows powerful, mature XML processing tools, such as XPath and XQuery, to be used to search or summarise features of the transcripts.

XTranscript currently offers two configurations for the conversion process:

Basic: Utterances will be detected and the text will be tokenised (split into words).
Conversation Analysis notation: In addition to the utterances, Jefferson notation (and a few known extensions) will be identified and recorded in XML elements.
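To give a feel for what such a conversion involves, here is a minimal sketch, not XTranscript's actual implementation. It assumes one utterance per line in the form "SPEAKER: text", recognises only two Jefferson-style pause markers, (0.5) and (.), and writes to invented element names (transcript, u, w, pause). All other notation is simply ignored.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input convention: one utterance per line, "SPEAKER: text".
UTTERANCE = re.compile(r"^(?P<speaker>\w+):\s*(?P<text>.+)$")

# Assumed notation subset: timed pauses such as (0.5), micropauses (.),
# and plain words. Anything else in the line is skipped.
TOKEN = re.compile(r"\((?P<pause>\d+\.\d+|\.)\)|(?P<word>[\w'-]+)")


def to_xml(transcript: str) -> ET.Element:
    """Convert a plain-text transcript into a hypothetical XML structure."""
    root = ET.Element("transcript")
    for line in transcript.splitlines():
        match = UTTERANCE.match(line.strip())
        if match is None:
            continue  # skip blank or non-utterance lines
        u = ET.SubElement(root, "u", who=match["speaker"])
        for token in TOKEN.finditer(match["text"]):
            if token["pause"] is not None:
                # Record the pause length, labelling (.) as a micropause.
                length = "micro" if token["pause"] == "." else token["pause"]
                ET.SubElement(u, "pause", length=length)
            else:
                # Tokenisation: each word becomes its own element.
                ET.SubElement(u, "w").text = token["word"]
    return root


root = to_xml("A: so (0.5) what do you study\nB: English (.) linguistics")
print(ET.tostring(root, encoding="unicode"))
```

The first step (utterance detection plus tokenisation) corresponds loosely to the Basic configuration, and the pause handling hints at how notation could be recorded in XML elements. Real Jefferson notation covers far more than this sketch handles.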

Part-of-speech (grammatical) tagging can also be performed for English texts using the Stanford CoreNLP library. The Stanford tagger uses the Penn Treebank tag set.

Find out more about XTranscript and convert your transcripts into XML.

If you are interested in converting your own notation, please get in touch for a consultation.