
Corpus linguistics - an introduction

 by Friederike Müller and Birgit Waibel
updated by Julia Müller
 

 

1. Basics

 

What is a corpus?

A corpus (plural corpora, German “das Korpus”, not “der”) is a collection of texts used for linguistic analyses, usually stored in an electronic database so that the data can be accessed easily by means of a computer. Such corpora generally comprise hundreds of thousands to billions of words. They do not consist of examples invented by a linguist or a native speaker, but of authentic, naturally occurring spoken or written usage.

The majority of present-day corpora are “balanced” or “systematic”. This means that the texts are collected (“compiled”) according to specific principles, such as different genres, registers, or styles of English (e.g. written or spoken English, newspaper editorials or technical writing); these sampling principles follow language-external rather than language-internal criteria. For example, the texts for a corpus are not selected because of their high number of relative clauses, but because they are instances of a predefined text type, say broadcast English, magazine or newspaper texts. Examples of balanced corpora are the International Corpus of English (ICE), the British National Corpus (BNC), and the Brown and Lancaster-Oslo/Bergen (LOB) corpora with their Freiburg updates (Frown and F-LOB).

 

A corpus is a systematic, computerised collection of authentic language used for linguistic analysis.

 

 

What is corpus linguistics and why is it useful?

Based on the above definition of a corpus, corpus linguistics is the study of language by means of naturally occurring language samples; analyses are usually carried out with specialised software on a computer. Corpus linguistics is thus a method for obtaining and analysing data quantitatively and qualitatively rather than a theory of language or even a separate branch of linguistics on a par with e.g. sociolinguistics or applied linguistics. The corpus-linguistic approach can be used to describe language features and to test hypotheses formulated in various linguistic frameworks. To name but a few examples: corpora recording different stages of learner language (beginners, intermediate, and advanced learners) can provide information for foreign language acquisition research; historical corpora make it possible to track the development of specific features in the history of English, such as changes in the use of the modal verb must and the emergence of alternatives such as have to or have got to, or the emergence of the modal verbs gonna and wanna; and sociolinguistic markers of specific age groups, such as the use of like as a discourse marker, can be investigated for purposes of sociolinguistic or discourse-analytical research.

The great advantage of the corpus-linguistic method is that language researchers do not have to rely on their own or other native speakers’ intuition or even on made-up examples. Rather, they can draw on large amounts of authentic, naturally occurring language data produced by a variety of speakers or writers, and so confirm or refute their hypotheses about specific language features on a solid empirical foundation.

 
 

What types of corpora are there?

The following list covers some of the most common types of corpora.

  • General corpora, such as the British National Corpus, contain a large variety of both written and spoken language, as well as different text types, by speakers of different ages, from different regions and from different social classes.
  • Synchronic corpora, such as F-LOB and Frown, record language data collected for one specific point in time, e.g. written British and American English of the early 1990s.
  • Historical (or diachronic) corpora, such as ARCHER and the Helsinki corpus, consist of texts from earlier periods of time. They usually span several decades or centuries, thus providing diachronic coverage of earlier stages of language.
  • Learner corpora, such as the International Corpus of Learner English and the Cambridge Learner Corpus, are collections of data produced by foreign language learners, such as essays or written exams.
  • Corpora for the study of varieties, such as the International Corpus of English and the Freiburg English Dialect Corpus, represent different regional varieties of a language.
  • Specialised corpora, e.g. the Michigan Corpus of Academic Spoken English (MICASE), are restricted to a particular domain, genre or register and are useful for research focused on it (cf. e.g. http://www.helsinki.fi/varieng/CoRD/corpora/index.html).

 

Note that the types of corpora listed above are not necessarily mutually exclusive: F-LOB and Frown, for example, are both synchronic and regional corpora, and even “become” historical when paired with their 1960s counterparts LOB and Brown.

 

A further distinction is between fixed corpora, which are generally not expanded after their release (e.g. the Brown Corpus), and monitor corpora, which are updated and expanded regularly, such as the Bank of English (BoE) or the News on the Web (NoW) corpora. Frequencies obtained from the Brown Corpus are stable regardless of the time of the search; frequencies obtained from monitor corpora, by contrast, depend on the time the search was carried out.
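The difference between fixed and monitor corpora can be illustrated with a small Python sketch. The three sentences below are invented toy data standing in for corpus texts, and count_word is a hypothetical helper written for this illustration, not part of any corpus tool.

```python
import re

def count_word(texts, word):
    """Count occurrences of `word` across all texts (simple lowercase word matching)."""
    return sum(re.findall(r"[a-z']+", text.lower()).count(word) for text in texts)

# Fixed corpus (toy data): a tuple, never expanded after its "release".
fixed_corpus = ("He must go.", "You must stay.")
# Monitor corpus (toy data): a list that keeps growing over time.
monitor_corpus = ["He must go.", "You must stay."]

fixed_count = count_word(fixed_corpus, "must")
before = count_word(monitor_corpus, "must")

# A later update adds new material to the monitor corpus.
monitor_corpus.append("They must decide, and they must do so soon.")
after = count_word(monitor_corpus, "must")

print(fixed_count, before, after)
```

The count for the fixed corpus stays the same whenever the search is run, whereas the count for the monitor corpus changes once new texts have been added. For this reason it is good practice to note the date of a search in a monitor corpus.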





2. List of corpora available in Freiburg


You can download a list of corpora here. Please note that, with new corpora constantly being compiled, this list is not exhaustive but constitutes a selection of well-known and widely used corpora of the English language.
 

 

3. Using corpora

 

 

To analyse a corpus and search it for particular words or phrases (strings), you can either access the data via an online user interface or, where none is provided, use special software: so-called concordancers such as AntConc. The corpora hosted at Brigham Young University (BYU) are a good place to start because of their user-friendly interface and the number of very large corpora it gives you access to. As a student at the University of Freiburg, you can get premium access to them for free (see here for details).
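To give an idea of what a concordancer does at its core, here is a minimal Python sketch of a keyword-in-context (KWIC) search. It is not how AntConc or the BYU interface is implemented; the function names tokenize and kwic, the window size, and the sample sentences are all assumptions made for this illustration.

```python
import re

def tokenize(text):
    """Split a text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Invented sample text; a real corpus would be read from files instead.
sample = (
    "You must leave now. I have to finish this first. "
    "We must not forget that they have to wait."
)
tokens = tokenize(sample)

# Print one concordance line per hit, with the keyword aligned in the middle.
for left, word, right in kwic(tokens, "must"):
    print(f"{left:>20} | {word} | {right}")
```

Each hit is shown with a few words of context on either side, which is the standard display for inspecting how a word is actually used. Note that this sketch ignores sentence boundaries and punctuation; dedicated concordancers offer far more options, such as wildcard and part-of-speech searches.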

The Baden-Württemberg Digital English Studies Community website (BW-DESC) has been designed to guide new users of corpora. Here, you can: 

  • get information on the size, focus and specification of each corpus, as well as on how to cite it correctly
  • learn about the search strings you need to enter to get the results you want
  • get to know the BYU display modes (or ways in which these results can be presented)
  • test your knowledge with exercises (solutions are provided at the bottom of the page)
  • look up corpus jargon expressions in the wiki
  • discuss more complex search queries or problems in the forum (or, if you’re an experienced user, share your best practice examples with the community)
  • learn how to make your own corpus using concordancers (programs such as AntConc that help you to carry out the standard types of corpus searches in data which you have collected yourself or obtained from the web) and find help for resources on text annotation (adding a part of speech tag to each word, for instance)
  • read advice on how to statistically analyse your data for your next term paper or thesis
  • browse additional resources on (corpus) linguistics

 

