
112: When language become-s(3SG) linguistic example-s(PL)
January 15, 2026 · 49 min · 8,737 words
Show notes
Language is all around us. This sentence right here, is language! But between the raw experience of someone saying something and a linguistic analysis of what they've said, there are certain steps that make it easier for that analysis to happen, or to be understood or reproduced by others later. In this episode, your hosts Lauren Gawne and Gretchen McCulloch get enthusiastic about how language becomes linguistic data. We talk about making recordings of language, transcribing real-life or recorded language, annotating recordings or transcriptions, archiving all those materials for future generations, restoring archival materials from decaying formats, and presenting this information in useful ways when writing up an analysis. Along the way, we touch on playing 100+ year old songs from cracked wax cylinders, the multi-line glossing format used so readers can understand examples in a language they're not already fluent in, analyzing spontaneous conversation using tapes from the Watergate Scandal, recognizing everyone who's contributed (including your own intuitions!), and Lauren's role on a big committee of linguists and archivists formalizing principles for data citation in linguistics. Click here for a link to this episode in your podcast player of choice: https://pod.link/1186056137/episode/dGFnOnNvdW5kY2xvdWQsMjAxMDp0cmFja3MvMjI0ODMzMjkyMA Read the transcript here: Announcements: If you wish there were more Lingthusiasm episodes to listen to or you just want to help us keep making this show, we have over a hundred bonus episodes available for you to listen to on Patreon. Not sure about committing to a monthly subscription? You can now sign up for a free trial and start listening to bonus episodes for free right away: https://www.patreon.com/lingthusiasm In this month’s bonus episode we get enthusiastic about some of our favourite deleted bits from recent interviews that we didn't quite have space to share with you! 
First, an excerpt from our interview with Adam Aleksic about TikTok and how different online platforms give rise to different kinds of communication styles. Second, a return to our interview with Miguel Sánchez Ibáñez for a bit about Spanish internet slang, -och, and why "McCulloch" looks like a perfect name for an author of a book about internet linguistics. Finally, deleted scenes from our advice episode, in which we reveal some Lingthusiasm lore about pronouncing "Melbourne" and imitating each other's accents and answer questions about linguistics degrees and switching languages with people. Join us on Patreon now to get access to this and 100+ other bonus episodes. You’ll also get access to the Lingthusiasm Discord server where you can chat with other language nerds: https://www.patreon.com/posts/147181832 For links to things mentioned in this episode: https://lingthusiasm.com/post/805852742418661376/lingthusiasm-episode-112-when-language
Highlighted moments
“We don't speak in punctuation. Putting some level of punctuation in, even though it makes it more readable, is still making a level of decision.”
“there isn't a clear boundary between data and theory. Writing data down involves making some theoretical decisions, some that the researchers may be aware of and some that they may not even be aware of.”
“They'll sometimes only break down the one word that's of interest and not any of the other words, which means if you want to reuse that data to do a different analysis about a different part of the sentence that the author didn't care that much about, you don't have enough information to do that.”
Transcript
Introduction
0:00Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics. I'm Gretchen McCulloch. And I'm Lauren Gawne. Today, we're getting enthusiastic about the data people use to do linguistics. But first, if you wish there were more Lingthusiasm episodes to
0:33listen to, or you just want to help us keep making the show, we have over 100 bonus episodes available for you to listen to on Patreon. If you're not sure about committing to a monthly subscription, you can now sign up for a free trial and start listening to bonus episodes for free right away. Our most recent bonus episode was a whole collection of extra great material from interviews we've done over the past year that was too good not to share. You can hear more from Adam Aleksic about how the differences between platforms shape how slang evolves on them, and from Miguel Sánchez Ibáñez about Spanish internet memes.
1:03We have some bonus linguistics advice questions that we answer in this episode as well. For this and over 100 other bonus episodes, go to patreon.com/lingthusiasm.
Linguistic Data
1:26Lauren, what is linguistic data? I'm speaking a language right now, so does that mean I am linguistic data right now? Absolutely. In fact, we have used recordings of this show with Bethany Gardner to make vowel plots of the two of us, so extremely yes. That is true. Maybe this episode someday will be part of another analysis. This is one of the things that I find so exciting about linguistics. There's always language to analyse, and there's language going on right inside my head that I could analyse at any time.
1:59Indeed. Even with a recording of a conversation, there are so many different things that you could do with the same single recording. You could look at, as we've done, the way both of us pronounce different words, but you could also look at the choices of words that we make, or the way our sentences are structured, or the way we do back and forth. Language is so many different things, and linguistic data can be so many different things as well. One of the reasons I love linguistics is because of this
2:31wide-ranging approach to data. Linguistics really is a science. You can do linguistic experiments and get that kind of experimental scientific data. Linguistics is also a humanity in that you can do this detailed textual analysis or very detailed analysis on one particular piece of a story or a conversation and analyse that one thing in its own terms. All of these fall within linguistics. They're all different ways of relating to language and to linguistic data.
Data Replicability
3:03Yeah. It could be signed language or spoken language, or you could look at written language. You could look at those things across time for a single person or a single group, or you could look across different people right now. You can do experiments, or you can observe naturalistic data. One of the things that we want out of linguistics as an academic discipline, as a scientific discipline, is the idea that its data is replicable. Sometimes that can be replicable in the scientific sense. If you've got 100 Australian English speakers and you have them read a list of words,
3:37and then you extract their vowels and you analyse the vowels, the idea is that you could get a different group of another 100 Australian English speakers to read the same word list, and you should get the same results. If you get a different set of results, there should be some reason why this group is different from that group. Maybe 50 years later, the vowels have shifted because you're doing these at different times. Maybe you're looking at Melbourne and Sydney English speakers. Nothing like a bit of intercity variation to get people excited about comparing data.
Data Analysis
4:07But sometimes you can learn a lot about language by just studying a story or a conversation in a lot of detail. The real challenge with this data is that even if you ask the same person to tell the same story again, or even if you have those two people have another conversation on the same topic, it's always going to be different because you're really trying to capture something about that particular moment. In some ways, it makes them feel weirder if you say that. Can you just have
4:38the same conversation that you were having before I turned the tape recorder on? Make sure you laugh in all of the same places that you were laughing before because you're going to find it just as funny the second time around, right? This is maybe a good point to confess that once or twice we have lost a recording of this show. Doing it again, I fully sympathise why you can't just replicate that exact moment. Yeah. Even us who are relatively practiced at this point, 112 episodes later, but it's not to
5:08say that I want to do every single recording twice. In a situation like this where it's all about analysing that particular non-replicatable moment, what you want to do is share the recording or share the transcript or share the data in such a way that people can follow your analysis and decide if they can reproduce it and whether they agree with you or disagree with you about it. That's a very different but still very important way of doing linguistics.
Data Conventions
5:38Right. Because there are so many areas of linguistics, there are a lot of different conventions that various areas of linguistics have come up with as far as sharing that data, what format it's in, how other people can experience it, and just what other bits of information get added on to, you might think, the core audio files or video files or text files that we then subsequently do stuff with to make it more data-ish. Yeah. Every decision we make about how we present data is making a decision that is influenced by the traditions of the types of linguistics you work
6:14in. Something like a transcript sounds straightforward, but how are you presenting it? Are you presenting it in the International Phonetic Alphabet? Exactly what symbols you're using is a conventional choice that is standardised. Even adding punctuation. We don't speak in punctuation. Putting some level of punctuation in, even though it makes it more readable, is still making a level of decision. Yeah. Once you get to know the conventions of a particular part of linguistics, you can look at
6:44something and go, oh, that's a phonetic analysis, or the diagrams of the structures of sentences that people draw. Without even looking in detail at that particular diagram, you know a lot about someone's theory of syntax and their theory of sentence structure from the way they've diagrammed and
Interlinear Glossing
7:03presented that sentence. And this begins to touch on something that we're going to come back to a lot, which is that there isn't a clear boundary between data and theory. Writing data down involves making some theoretical decisions, some that the researchers may be aware of and some that they may not even be aware of. So there's always this interface between theory and data. One of my favourite linguistic data formats – because everyone gets to have a favourite linguistic data format – is the way that if you have a sentence in a language that isn't the language that the paper's being written in, and you want to make sure
7:40that people who are reading the paper, who don't necessarily speak that other language, can still understand exactly what's happening or what you're proposing is happening in that sentence. You can write that sentence on several different lines with several different bits of information per line. And we have cited examples that are in this format on the podcast. Because we're an audio podcast, you can't necessarily tell that we're going line by line. – Yeah, that's a good point. We use and we read these interlinear glossing conventions for the structure of data all the time. We've just never
8:14broken it down that way before. – Right. We're gonna give an example in English because you can do this in English as well. It's just that if you're reading a paper that's in English, it's more likely that someone will do this for another language because the assumption is that the people who are reading an English paper already know what's going on in English. So if we have a sentence like the English sentence, I am feeding the guinea pig. Perfectly normal English sentence. – Yeah. – We can gloss this sentence by taking each of the important words or parts of words and splitting them up or grouping
8:47them together so that we know what our important units are. So in this case, we have I, we have am, we have feed and ing, which we need to split apart. – Yep. – We have the, and we have guinea pig, which we need to group together because a guinea pig is not really a kind of pig. – No, it really is – – It is a different animal. – Just one word. We're gonna treat it as one word. – Yeah. So we need to split some things that in the language are written together like feed plus ing, and we need to join some things that in this language are written apart like guinea and pig. And then we also need to gloss this
9:22in a way that is comparable with other languages. So we can gloss I as this is the first-person singular subject, and then this lets us compare it with other languages that also have a form – maybe it's the ending on a verb, maybe it's a separate word – that also corresponds to first-person singular. – And because languages have different word orders, that first-person singular subject, I in English is at the start of a sentence, but in other English sentences might not be, or in other languages might be in a completely different place all the time.
9:55– Exactly. We do this sort of breaking things down level in the language in question, and then we do breaking down what its meaning is using particular meaning symbols, and then we have an additional line often that's a sort of idiomatic translation, which in this case doesn't make a lot of sense because our matrix language is still English. But if we were translating it, let's say, into French because we're writing a paper in French, you could have something like je donne à manger au – I don't know how to say guinea pig in French – oh well, au chat – to the cat,
10:26or whatever animal we're talking about here. We wanted to pick guinea pig because it's two words in English and we need to illustrate that. – Only boring pets for French Gretchen.
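The three-line layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not anything the hosts or the Leipzig Glossing Rules provide: the function name `align_gloss` is hypothetical, and the labels `be.1SG.PRS` and `DEF` are assumed glosses chosen for the example (the episode itself only spells out first-person singular subject for "I" and progressive for "-ing"). The sketch checks that each word and its gloss have the same number of hyphens, then pads the columns so each word sits above its gloss.

```python
def align_gloss(words, glosses):
    """Pad each word/gloss pair to a shared column width so the two lines align."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    for word, gloss in zip(words, glosses):
        # A split like feed-ing must be matched by a split gloss like feed-PROG.
        if word.count("-") != gloss.count("-"):
            raise ValueError(f"hyphen mismatch: {word!r} vs {gloss!r}")
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    top = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    bottom = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return top, bottom


# "guinea pig" is kept as one unit; the period in its gloss joins the two
# English words that stand for that single unit.
words = ["I", "am", "feed-ing", "the", "guinea pig"]
glosses = ["1SG.SBJ", "be.1SG.PRS", "feed-PROG", "DEF", "guinea.pig"]  # assumed labels

top, bottom = align_gloss(words, glosses)
print(top)
print(bottom)
print("'I am feeding the guinea pig.'")  # free translation line
```

Padding with spaces rather than tabs is a design choice for plain-text display; a word processor or typesetting tool would normally handle the alignment instead.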
Transcription Systems
10:36– Sorry. – What I love about the particular form of this interlinear glossing – that code-y bit between the sentence in the language you're working with and the sentence as it's translated for the reader – is that there is a very specific set of conventions for how you mark that something is first-person and singular. This set of conventions is known as the Leipzig Glossing Rules. – I learned about the Leipzig glossing conventions in grad school. I encountered them
11:08in papers. They'd have a little footnote that said, you know, this paper is using the Leipzig glossing conventions. I'd be like, yeah, I'm using them too. This is clearly what I'm doing as well. But it wasn't until a while later that I paused and thought, sort of, why Leipzig? – Why a city in Germany? – Yeah. Why the eighth largest city in Germany and the most populous city in the German state of Saxony – neither of which I think is relevant to why they're used for linguistics? – Good to rule that out. – But, you know, this is sort of a medium-large city in Germany. Why did they decide how we
11:43abbreviate things to gloss them in linguistics? – It's because it's where they decided. The group of people who published the Leipzig Glossing Rules did so while working in Leipzig, in the linguistics program there. – So, this is sort of like the Kyoto Protocol type of naming things after cities because you had a big convention or a small convention. You came up with a set of rules for how we're going to do something and then you say, well, let's name this treaty after the city that it was signed in. – Yeah. I'm so grateful for the
12:15Leipzig Glossing Rules because when I started presenting data in this way, it was so great to just open up this three- or four-page document that very clearly lays out how things should be structured and being like, I don't have to figure out how I'm going to describe things like first person singular subject. It's all here for me. – Sometimes I read older papers that were written before the Leipzig Glossing Rules and their glosses are a lot more inconsistent and they don't do as
12:46much about consistently breaking down, okay, we're going to break down feed plus ing into the root, which is feed, and the ing part, which is the progressive. They don't do that sort of breaking down. They'll sometimes only break down the one word that's of interest and not any of the other words, which means if you want to reuse that data to do a different analysis about a different part of the sentence that the author didn't care that much about, you don't have enough information to do that. Or sometimes – this one really gets me – they'll just assume for a certain subset of languages that of course the reader speaks French. – Of course the reader in this paper will be
13:20totally fine with me throwing French, Latin, Greek at them. – German. Yeah. This assumption that of course the reader speaks some language that the paper isn't written in can be unnecessarily hostile to new readers when we could just provide a translation. – And one of the great things about not only having the translation but this breakdown segment by segment, word and morpheme level interlinear gloss, the line between the language and the translated language, is that it means you can come back and revisit someone's analysis if things have moved on. In fact, once you start
13:54getting used to reading the glossing, it's pretty quick to pick up and pay attention. Also, shout out to people who bold the bit that you're meant to be focusing on. – Oh, my god. – Love that. – I love that. Yeah. Please put in bold the bit that we're trying to focus on. – This is what happens when you spend so much time with a particular data convention, you get really used to it. In fact, when I teach my third-year linguistics class on syntax and sentence structure, we usually get a couple of students that take it as an outside elective. My linguistics majors
14:25are always shocked by how used to Leipzig glossing conventions they've become when you get these students that are just like, what is happening here? Then we go back and reread the three to four pages of the Leipzig Glossing Rules and help everyone get up to speed. – We will put a link to some examples of data that's annotated in the Leipzig Glossing Rules in the show notes if you want to see what that actually looks like, particularly from a Facebook page called Kittens and Linguistic Diversity, which finds a thematically appropriate cat photo for every sentence that they put up. They had one
15:03sentence translated as, and then the wind blew, and you have this picture of this very wind-blown cat to match up with this sentence. You can see a few examples of Leipzig Glossing Rules and what the spacing looks like because there's a bit of conventions around tab spacing to align the words with their glosses to make it easier to read that are very hard to display on an audio-only podcast but are pretty easy to see if you could look at them. – Yeah. Leipzig Glossing Rules, I think also definitely in my top five data presentation formats. Not that anyone has to pick a favourite,
15:36but up there for me. – It's like trying to pick a favourite animal or something. Like, oh, but this one's so cool. – One of my other favourite data presentation structures is a transcription system used – I've seen it a lot in a database of child language data and children and parents interacting. It's a bit retro. I get a bit nostalgic because it's designed in an era where you had a very limited number of keyboard characters. They've thought very carefully about how to structure the data for their presentation. – This particular format, which I also have first encountered for
16:12child language data, it cares a lot about the back-and-forth aspects of conversation, whether someone is interrupting someone, whether there's a pause between turns or things like that. Because it's often used for child data, it's not necessarily reflecting real adult-shaped words that the child is saying. It might be representing the children's words with as close as possible to the sequence of sounds as we can get. It's really fun, and it's got this vintage retro text-based format.
16:44I also find the name of this format very charming. It's CHAT, which is not a chat platform. It is an acronym for Codes for the Human Analysis of Transcripts. – Impeccable acronym game here. – Nicely pronounceable and also very pleasing because it works with the acronym CHILDES, which is the name of this corpus. Researching child language data is so much effort. If you get a child and you record them for two hours every week and then you transcribe all of those recordings,
17:17there's going to be so much data there. You're following them for a year, two years, three years, watching how their language develops. You can spend your entire career analyzing this one set of recordings that you did over three years, and there's still more stuff that other researchers can find in those recordings. The child language researchers have very collaboratively put together this whole corpus where they can share all of these transcripts that they've done in a standardized format so that other researchers can just draw on those transcripts and come up with their own
17:47theories around the same sets of data. – And again, impeccable acronym game. This is the Child Language Data Exchange System or CHILDES. – Oh, it's recursive. It's got CHILD in the original. It's got CHILD in the later one. Great acronym. – And just to really drive home that you can have multiple ways of approaching the same type of data, another form of transcribing conversation is known as conversation analysis, which shares a lot with CHAT but is its own style
18:20and its own convention as well. – I believe a popular type of conversation analysis system is known
Conversation Analysis
18:26as the Jefferson transcription system, which is named after a person named Gail Jefferson who's notable for being the person who transcribed the Watergate tapes. – Ah, yeah. What a throwback. – Yeah. This was a 1973 American political scandal with Richard Nixon and some people saying scandalous things in the White House. – But also saying it on tape, which is just a great indicator of how much linguists really will treat any language as a potential data source. – Yeah. They didn't know that they were being taped and so they were having these naturalistic
19:00conversations. – Perfect. – How often do you get to analyse naturalistic conversations, people who didn't know they were being taped, but because it's in the public interest, now this data is public. Gail Jefferson did these very detailed transcripts in this particular format, which are all available online if you would like to read them as transcripts. Linguists have analysed how these people were communicating with each other about this political scandal that is perhaps most famous now for being the source of the -gate suffix for scandals and controversies. Watergate itself has
19:33nothing to do with the water. It was named after a complex known as the Watergate Complex, a building where this thing happened. Since then, -gate has become a suffix that can get added to a relevant word related to a scandal. There was Bingate, which happened in the Great British Bake Off when one of the contestants threw his baked Alaska in the bin and so the judges couldn't judge it. – No, outrage. – Yeah, outrage. Scandal. Terrible. – Kategate, which is also known as Photogate, but who could pass up Kategate when there was all these
20:07accusations that the Princess of Wales had been photoshopped? Kategate. – Right, exactly. Kategate, how can you resist the wordplay? – The Watergate tapes are also a good reminder that linguistic data is collected in many different
Data Preservation
20:22types of media formats as well. It's hard to overstate just how important the era of recording has been. We wouldn't have modern gesture studies without affordable video recording. – It would be so hard to do the kind of detailed phonetic work that people are doing without recording as well. – Yeah. Some of these types of recording media are more stable and reliable than others. Tapes will, especially if they're left somewhere humid or not kept in a stable environment,
20:55tapes – the magnetic tape will actually crumble off them. CDs can get scratched and the actual data bit can come off as well. – I went to a talk at a conference once where they were talking about these really old wax cylinder recordings. Before, we had vinyl records that had flat grooves and you put the needle in the groove and it goes around and it produces the sound. – Yeah. – They had those grooves but in a tall cylinder instead and the needle could keep going down. The problem with this, of course, like the problem with the record, is if you get a crack
21:31in this physical medium, then the needle goes around and then whoops, it jumps over the crack and then it digs the crack in deeper so it becomes unplayable. You can have these wax cylinders sitting in an archive that are in some cases over 100 years old. Some of our earliest recordings of languages, including some languages where their speakers aren't alive today. It's like, we have this data, sort of, but how do we play them? This conference talk was talking about how they figured out a way to play these grooves with lasers so that they wouldn't damage the cylinder like you would if you
22:04were running a physical record needle around it. You can play it all with beams of light and then recover most of what was there. Obviously, not the cracked bit, but you can get all of the stuff in between the cracks. They have some recordings of really, really old songs and people talking in some indigenous languages of the US. You can now hear what these people were saying. – Amazing and so important because we have these really stable but still cracking records. We're hitting this data apocalypse point where people who've done fieldwork, cassette tapes last about 50 years,
22:39CDs last around 20 to 30 years. We're hitting this point where all these physical media just through coincidence are all coming to the end of their physical lives at the same time. We potentially have this really important data that's being lost. There's this amazing blog post by the PARADISEC archive about how they've built this tape restorer that's trying to solve the same problem as the cylinder records but for tape where it simultaneously restores mouldy and damaged tape as it's coming
23:15out of the reel and records it. Often, they only get one shot at this. As they are recording, the tape is physically falling away. It really is this attempt to collect and digitise all of this. Of course, digital data also has its weaknesses if it's not stored somewhere safe. It's always worth thinking about the physical media of data as well. Right. Just because something has gotten digitised, that hard drive can get corrupted. It can stop working. Websites certainly go down. There are so many aspects of this that can happen. I actually ran into this question when I was writing
23:51Because Internet, which is a book about the internet that I wrote. Some of the links that I had collected to cite vanished in between the time that I had collected them and then when I went back to write about them and make sure that they were in the citations, in the footnotes. What I decided to do was copy-paste every single link that I cited in Because Internet into archive.org, which archives websites, and make sure that either I had saved a copy or that a saved copy existed so that even if that site went down,
24:22estimates are that 10% to 20% of sites from even 5-10 years ago, you can't get them anymore. Oof. If I was going to tell people to look at them in a book, that at the very least people could pop that archive.org URL in front of any link from Because Internet and find some archived version where they could see it. In some ways, a book is a very stable format. You can have a book that's 100 years old, even 1,000 years old, and they're still pretty readable. They're still in pretty good condition.
24:52Even if the binding cracks, you can repair them. We know how to repair books. With audio media, with digital media, we don't have as much archivability yet.
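The trick mentioned above, putting an archive.org address in front of a cited link, can be sketched as a tiny Python helper. This is a hedged illustration rather than the workflow actually used for the book: the function name `wayback_url` and the sample link list are hypothetical, and the sketch assumes the Wayback Machine's `https://web.archive.org/web/` prefix form. It only builds the address; it does not check that a saved copy actually exists.

```python
# Assumed prefix form for looking up archived copies on the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str) -> str:
    """Return an address that asks the Wayback Machine for a saved copy of a page."""
    return WAYBACK_PREFIX + original_url.strip()


# Hypothetical stand-in for the list of links cited in a book or paper.
cited_links = ["https://lingthusiasm.com/"]
for link in cited_links:
    print(wayback_url(link))
```

Keeping the original link intact inside the archived address means a reader can still see what was cited even if both the live page and the snapshot move later.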
Language Documentation
25:02Yeah. There are lots of amazing books. The tradition of descriptive grammars takes everything that someone has encountered in their experience doing fieldwork on a language and attempts to distill it into a description for the audience to read. Sometimes there feels like this real disconnect between the data that was collected and the people involved in that and what you read in the final document. I remember reading one descriptive grammar when I was in grad school. I don't remember the language anymore, but it was clearly written during some particular phase of
25:36grammar writing where the trend was to not write any full sentences in the language, even though I'm sure that this linguist had witnessed people uttering sentences in the language. But to write it all in terms of mathematical codes of like, well, you could put this thing with that thing and you could end up with a word or you'd end up with a sentence, but none of it was actually just like, can you just put down some sentences in this language? If someone wanted to try to say something, they'd have to solve your equations in order to do that. I'm sure at the time it was really cutting edge. Then that theory
26:09was not the one that caught on. You're left with this grammar that's written in an outdated paradigm, thinking if you'd had more complete sentences, someone else could have reanalyzed this in a more current theory, but as it is, it's a lot harder to do that. I think even when you have complete sentences – and a really good grammar will even include some entire stories or conversations or songs as appendixed information. Even then, with a lot of grammars, you can read the whole grammar and you
26:40don't know who said any of this, what their names are, how much they contributed their time and their knowledge and their words to this work. In fact, I did this big survey with colleagues looking at a hundred different grammars. For some of them, in fact, for a lot of them, the only way we knew who had been involved in sharing that knowledge was because they were thanked in the acknowledgements. Which is something, but – Yeah. I think it's a real indicator of this disconnect between
27:13the data and the underlying knowledge and then how it gets represented in this particular written genre. The data is ultimately people and stuff that people have said, right? Especially when we're talking about a language that may not have very many speakers at the moment or doesn't have this kind of – people who use this language aren't necessarily represented in academia as academic practitioners. There's this real responsibility of like, what are you going to do with this once you've collected it? There's also a sense where if you don't know where the sentences and the words
27:48come from, you don't know – there might be a whole story in this data about these two villages actually have slightly different varieties or actually every single sentence in this grammar is elicited and someone asked for it and not actually how people speak in spontaneous conversation, which is not a problem in itself. It's just a problem that we're left really uncertain about these things. Sometimes you get speakers who are very helpful and agreeable and when you say, you know, could I say this thing? They're like, yeah, you could say that because I recognise that
28:21you're not a very fluent speaker and I want to be really encouraging. Like, yeah, you can say that, that's okay, I understand you. Then if you say, would you say that? It's like, well, I would probably say it a bit differently. I would say this thing, but it's okay if you say it this way, I understand you. That's really nice but not necessarily the thing you want to be basing a whole linguistic theory on is this version of their language that they're tolerating from an outsider trying their best to speak. Yeah. I mean, there are times where people don't wish to be identified. Maybe if you're working with a particular minority within a language, they might not feel comfortable
28:56being identified and transparently represented in the grammar. You get into weird legal minefields around – especially in terms of if you work at a university while doing this work or in particular countries, the person who shares their story, if you're the person who's hit the record button or you're the person getting them to sign an ethics form, it completely changes who owns in a legal sense those recordings and that data. I think there's still a lot to be reckoned with in
29:30linguistics around that. Right. One of the weird things about how copyright works in Western systems is the copyright of, say, a photograph belongs to the person who clicked the shutter button rather than the person who's the subject of the photograph, who the photo is being taken of. Which, of course, if you take a selfie, those are the same person. But when you're doing this sort of language work, you can end up with this very weird situation where the linguist has the copyright of the language that should really, by any logic, belong to the community. But the linguist is the one who came in and wrote a grammar or made a dictionary, did these sort of
30:02things. When I write a description of how people are using language on the internet, I have copyright over my description. That doesn't mean that other people who are using language on the internet think that they have to do it my way just because I wrote a book about it. In that sense, that's fair. I'm very grateful that I was able to sell copies of my book about language on the internet because that enables me to make a living. Also, there's a big difference between – when you're working on a major language like English, it would actually be weird for me to be like, well, Gretchen says
30:33– This is the thing. – this particular construction. I have sometimes found – I speak French, but I'm not a native speaker of French. Sometimes I'll check my intuitions with someone who's a more fluent speaker than me and I'll say, I could cite you about this, and they'll be like, oh, no, that would be weird. I can't represent all of French. I mean, I speak it, obviously, but just don't cite me. That's just what French does. – right? If someone came to me and was like, okay, I want to check that when English speakers greet someone, they might say, how are you? Can I cite you, Gretchen McCulloch, as saying that
31:05English speakers say, how are you? I'd be like, no, that's just what people say. That's not me. Why would you cite me for that? – Yeah. Part of citing, though, is not because I necessarily need to acknowledge that this person says the thing that everyone says, but because I want to be able to point back to the original recordings that are archived that other people could look at, or even myself. There are times where, especially over the course of writing something really big like a PhD thesis, I actually went back and went, oh, I misunderstood what was happening in that
31:38recording. I wouldn't have been able to do that if I didn't have those citations back to the original data in the analysis that I was doing. I think it's really important that people be able to also access that underlying data while they're accessing my analysis, which is just one potential analysis of those recordings. – There has been an increasing trend, I would say, in recent years or recent decades that oftentimes these days when you're reading a paper that relies on a
32:08particular corpus of data, they'll have a little code by each example that indicates like, okay, this recording is from this file at the following minute or with a code about who the speaker is so that if somebody is like, actually, these speakers are from two different places, maybe we can find out if there are any dialect differences between them. They can go back and reverse engineer that from the information that you've put in the paper. I also sometimes find this a bit funny because sometimes when I've been trying to analyse a particular sentence or a particular construction,
32:38I'll go back and check that with the speakers several times. It's like, which recording am I indexing if I've gone back and checked that sentence with speakers three or four times just to make sure that they still like it the next day? – Yeah. – But I think overall this is a positive trend.
Data Citation
32:52– I think it's a real positive trend. I think it's also great when people do share those archives. I know there's a lot of anxiety people have around wanting to get all their analysis done and not be scooped, or a lot of anxiety around people feeling like their data is too messy sometimes. I think we have a lot of work to do as a field reckoning with normalising that it's okay to share your data even if it's messy. People generally have a respect for checking in with people before they go and analyse
33:24data they haven't collected themselves. – I also think that just having a plan and realising that your language data is part of something bigger than you and outlives you and has a bigger life beyond your career or beyond your academic life or even your physical life. I sort of think about it, maybe that's a weird analogy, but you know when people will get a turtle as a pet and baby turtles? – Sure, okay. – They're really small, they're really cute. You're like,
33:57how much of a problem could this be? Then it turns out a turtle can grow quite large. – Not only that, but don't they live for over 100 years? – Right. You have to have a succession plan for this turtle. You've taken on responsibility for this animal and it's gonna outlive you. Who is gonna inherit your turtle and do they have enough space for a massive terrarium? Really, what's gonna happen? – Yeah, I guess the tiny turtle is the first time you do a bit of language data collection and you're
34:29like, look at my cute little collection. I'll definitely be totally fine with looking after this and keeping it organised and then that collection grows. – Exactly. Then you can put that collection on a little tiny skateboard. It can go zoom, zoom, zoom all around. Have you seen these videos of the turtles on little skateboards? – The ones where all four of their legs are hanging over the edges so they can scoot around. – And suddenly they can go really fast. – I guess that's like keeping your data well structured and organised and shareable. I'm maybe stretching this turtle.
35:02– I'm not sure if the analogy needs to go that far. I just want to make sure everyone has seen the tiniest turtles who go zoom. – Right. Yes. – But also, languages are organic and spontaneous and they're also part of something bigger than just an individual speaker. They're also part of the linguistic community that speaker belongs to. And sometimes people will just say things that are in the moment. – Absolutely. – And the recorder's not on. – I think we're definitely in a bit of an overcorrection era where it's a bit like if
35:34someone didn't say something into a recorder or into a video camera, then it didn't happen. But that's not true and it's just as important to share as long as you make clear the difference between when you're sharing something that is an elicitation or something that is naturally recorded. But also, sharing something that you heard or experienced or saw is still a valid type of data as long as you make clear the difference between them. I have observational data in my work. In fact, a lot of the observational data for Yolmo that I have is people roasting me.
36:10– Okay. You didn't have the recorder on when they were making fun of you? – No. When we were at public events, my friends liked to make fun of me for not eating meat and not drinking and would often put words in my mouth. They'd be like, yeah, yeah, yeah, she'll have a drink, she said. Yeah, definitely. I'm like, what a great example that people can use a reported evidential when that definitely wasn't an original speech act because they were making fun of me. – And everyone knew it was a joke and they weren't confused about
36:41like, oh, she said she was gonna have some meat. – Yeah. And I got a really great example for my work. It was a win-win. – I also think that everybody who works on a language or who does linguistic research on a language is also someone who knows at least one language. By virtue of being able to write this paper or to do this analysis, we are all users, speakers, or signers of at least one language and potentially several. Traditionally, in the field, doing research on your own
37:16linguistic intuitions – which is possible to do. I can say, yeah, this is a sentence of English. This one is not a sentence, at least of my English. Traditionally, this has been known as armchair linguistics because you don't have to venture out anywhere. You don't have to go to a lab. You don't have to go down to the street. You don't have to go out to another country or another place and interview people. You can just stay in your own comfy armchair and say, ah, I know that English speakers sometimes say, how are you, when they're greeting people. I know this based on my own experience as an English speaker. I don't have to go out and ask 100 people. I'll get the same answer.
37:47– Yeah. – That's been known as armchair linguistics. One of the things that happens when people do this armchair linguistic intuitions is they'll report that data in the paper, but they won't cite myself and my own brain as the source of that data. They'll just say, this is a sentence of English. I'm writing a paper, but English, I can just tell you this. – I think it's important to be transparent about that for a couple of reasons. The first is there can be a big difference between whether it's armchair linguistics where it's just you and your thoughts and the article. Here's my intuitions, but I also checked with people in my corridor or I took
38:23this with me on a conference presentation to a couple of conferences because that can change things. It's also worth thinking about who gets access to the armchair. – Exactly. Sometimes people will say, oh, the following construction is not grammatical or is not attested. What they mean is it's not grammatical or attested in their English or in their version of whatever language. There might actually be some people for whom it is grammatical who aren't as well represented in academia. – I think it's totally fine to work with intuition data. I just
38:57like it when people are really transparent about that. I think that's part of this larger approach that I have of needing to show more respect for the data that we work with. – Even if that data is just in your own head, you can have respect for, okay, I'm citing myself as a speaker of this language. – Yeah, because academia in particular is very obsessed with counting the number of books you write and the number of research articles you write. Even though the work of doing recordings, doing transcripts – Gail Jefferson took so long
39:31transcribing those Watergate tapes because it's such a laborious process but creates such rich data. – One estimate for how long it takes to transcribe, let's say, a minute of recorded data, which is not that long, is about one hour of transcription time per one minute of recorded data. This can vary depending on how detailed a transcription you're trying to do, but the time investment is significant. – It's been really good to see some linguistics organizations, some universities are beginning to
40:06respect that work as good and important work in its own right. – You've worked on some of the efforts to expand how that work is recognized, Lauren. – Indeed. It's not a thing that happens by accident. It's a thing that happens because a lot of people over a long time have built this consensus to respect data as a thing in its own right and to make it more standardized how we cite and refer to the underlying data in linguistic research. – What have some of your goals or successes
40:39been when it comes to encouraging people to cite data? – It's been about normalizing this kind of practice. That big survey of grammars that we did, we found there were people who were doing this work already. You can improve the norms in the field by giving people standards to follow in the same way that I got to download the Leipzig glossing rules and be like, great, someone has done the thinking for me, I'll just copy this template. A very large group of people over a long time have come together
41:11to work on two documents. The first is the Austin Principles of Data Citation in Linguistics. We have another location-based document. – Ah. This is Austin, Texas, where a meeting happened about this? – At this meeting, they wanted to find a way to articulate that linguistic data is important even before you get to the analysis. But then, it's how do you give people a format for that? These recommendations for citation of research data in linguistics were formalized. It looks a little
41:42bit like how you cite other research. If you're used to using referencing or bibliography or citation for other research publications, you can actually do the same for your data and other people's data. This is known as the Tromsø Recommendations because this was initially crystallized in Tromsø in Norway. – But with the very charming abbreviation of the Tromsø Recommendations or the T-Rex… – Yes. – It's a fun little code there. What was it like to be in the room where some of this was
42:19being discussed? How many people were in a room like this? – For these two relatively short and to the point documents? I wasn't even at these two meetings because there were at least half a dozen to a dozen other times where people met and moved forward on these documents. Overall, there were over 40 people involved, I would say. If you think about – as a very lower-end estimate, assuming around a decade of expertise on average in this area, you are looking at so many people at the top of their game who
42:52have already been thinking about this so deeply for so long. – That's like 400 person years, probably much more than that because some of these people would have 20, 30, 40 years of expertise of being the editor of a journal or of being a person who runs an archive. They've encountered a whole bunch of different ways that people have tried to cite data or not cite data or archive data. They're saying, okay, what can we put together that would be a standard way of saying, here's what you should do when you're trying to acknowledge where data comes from. Whether that's data that's coming from
43:25inside your own head, which you can still acknowledge, or whether that's data that's coming from archive work that someone else has done or work that you yourself have archived or work that you should have archived and haven't necessarily gotten around to yet. – Alongside the Austin Principles and the Tromsø Recommendations, all of these people have also been doing a lot of work in their own areas. So, the Australian Linguistic Society now has a statement arguing for the importance of data as a research output. An international body that's a network of linguistic archives called DELAMAN has
43:57their award for the best archive so people can receive formal recognition. The Association for Linguistic Typology's Panini Award now includes the quality of data citation and archiving as part of the assessment for their award. – And Panini is famously the first known grammarian. The first grammarian who we know by name? – Yeah, he had terrible data citation, but we forgive him for writing 2000 years ago. – He came up with the concept of doing grammars in Sanskrit like 2000 years ago. So, really cool to have an award named after the first
44:30named grammarian that we have a historical record of. – And all of these things are helping to change norms in the field, hopefully. There's a really great handbook of linguistic data management that brings a lot of this together as well. I have a chapter in that, thinking about where linguistics sits in social sciences when it comes to thinking about data and how we can move forward. – That's one of the interesting things about this movement to cite linguistic data more clearly, and to have principles around that, is it brings linguistics into conversation with other social
45:04sciences and with other academic disciplines who are also trying to figure this out, especially in the digital age, like how to do data citation? – Yeah, a lot of the time it wasn't necessarily important for us to invent something completely new, but to see what were the best practices that we could take from psychology and anthropology and other social sciences and other humanities. I think that when it comes to linguistic data, it's really important to be constantly reminded that
45:34it's not just about, okay, you see some language on the page or you see some language in a video or in an audio recording, and that's just what it is. So many people had to be doing so many things in order for that language to exist in that particular moment, whether it's the community that supported that person learning that language in the first place, the person themselves who is saying whatever they're saying and being there and being willing to participate in linguistics, and the technology that
46:07exists to record them, the linguist that is there trying to figure out what's going on, the future generation of linguists, and potentially also community members who want to be involved in that language and have access to that information – all of these come together when we're talking about linguistic data, and all of these are important reasons why making those contributions visible is so important to keep as part of the record for how humans do language.
46:37For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on all the podcast platforms or at lingthusiasm.com. You can get transcripts of every episode on lingthusiasm.com slash transcripts, and you can follow at Lingthusiasm on all the social media sites. You can get scarves with lots of linguistics patterns on them, including the International Phonetic Alphabet, Branching Tree Diagrams, Bouba and Kiki, and our favourite esoteric Unicode symbols,
47:07plus other Lingthusiasm merch at lingthusiasm.com slash merch. Links to my social media can be found at gretchenmcculloch.com. My blog is allthingslinguistic.com, and my book about internet language is called Because Internet. My social media and blog is Superlinguo. Lingthusiasm is able to keep existing thanks to the support of our patrons. If you want to get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just want to help keep the show running ad-free, go to patreon.com slash Lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chat room to
47:40talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include World Linguistics Day back in November, a discussion about the Voynich Manuscript with Claire Bowern, and a whole episode of bonus content from recent interviews. Can't afford to pledge? That's okay too. We also really appreciate if you can recommend Lingthusiasm to anyone in your life who's curious about language, or leave a nice review, like this one from Avian Soap who said, Lingthusiasm is a gem. It's an educational podcast hosted by two linguists who are very thoughtful about the pedagogical design of the podcast to make it both an entertaining and educational listening experience, and they're so good at that. I always come away
48:14from an episode going, wow, that's so cool about the things I learned from them. This is also a podcast with really well-done transcripts. They pay someone with a linguistics background to write careful transcripts that read clearly and have all the specialized vocabulary accurate and laid out in a way that's coherent in text form, even if the original was intended to be understood through listening to sounds in your ears. I followed this show solely via transcripts for years before I finally transitioned into being able to listen to it regularly. Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne. Our editorial producer is Sarah Dopierala. Our production assistant is Martha
48:46Tsutsui-Billins. Our editorial assistant is Jon Kruk, and our technical editor is Leah Velleman. Our music is Ancient City by The Triangles. Stay Lingthusiastic!