Feb 12, 2020 · In linguistics, a corpus is a collection of linguistic data (usually contained in a computer database) used for research, scholarship, and teaching. Also called a text corpus.
Sep 19, 2022 · How can BAVL help you build the perfect corpus for your NLP project? In short, a corpus is a large set of language training data for statistical NLP applications.
Mar 22, 2024 · Introduction to Corpora. A corpus is a collection of texts or text extracts that have been put together to be used as a sample of a language or language variety. It consists of texts that have been produced in 'natural contexts' (published books, ordinary conversation, letters, newspapers, lectures etc), which means it mirrors natural language ...
- Who Builds A Corpus?
- What Is A Corpus for?
- How Do We Sample A Language For A Corpus?
- Representativeness
- Balance
- Topic
- Size
- Specialised Corpora
- Homogeneity
- Character of Corpus Research
Experts in corpus analysis are not necessarily good at building the corpora they analyse — in fact there is a danger of a vicious circle arising if they construct a corpus to reflect what they already know or can guess about its linguistic detail. Ideally a corpus should be designed and built by an expert in the communicative patterns of the commun...
A corpus is made for the study of language; other collections of language are made for other purposes. So a well-designed corpus will reflect this purpose. The contents of the corpus should be chosen to support the purpose, and therefore in some sense represent the language from which they are chosen. Since electronic corpora became possible, lingu...
There are three considerations that we must attend to in deciding a sampling policy: 1. The orientation to the language or variety to be sampled. 2. The criteria on which we will choose samples. 3. The nature and dimensions of the samples.
It is now possible to approach the notion of representativeness, and to discuss this concept we return to the first principle, and consider the users of the language we wish to represent. What sort of documents do they write and read, and what sort of spoken encounters do they have? How can we allow for the relative popularity of some publications ...
The notion of balance is even more vague than representativeness, but the word is frequently used, and clearly for many people it is meaningful and useful. Roughly, for a corpus to be pronounced balanced, the proportions of different kinds of text it contains should correspond with informed and intuitive judgements. Most general corpora of today ar...
The point above, concerning a text type where most of the exemplars are highly specialised, raises the matter of topic, which most corpus builders have a strong urge to control. Many corpus projects are so determined about this that they conduct a semantic analysis of the language on abstract principles like those of Dewey or Roget, and then search ...
The minimum size of a corpus depends on two main factors: 1. the kind of query that is anticipated from users, 2. the methodology they use to study the data. There is no maximum size. We will begin with the kind of figures found in general reference corpora, but the principles are the same, no matter how large or small the corpus happens to be. To ...
The proportions suggested above relate to the characteristics of general reference corpora, and they do not necessarily hold good for other kinds of corpus. For example, it is reasonable to suppose that a corpus that is specialised within a certain subject area will have a greater concentration of vocabulary than a broad-ranging corpus, and that is...
The underlying factor is homogeneity. Two general corpora may differ in their frequency profile if one is more homogeneous than the other, while specialised corpora, by reducing the variables, offer a substantial gain in homogeneity. Homogeneity is a useful practical notion in corpus building, but since it is superficially like a bundle of internal ...
It is necessary to say something here about the "typical studies" mentioned above, because at many points in this chapter there are assumptions made about the nature of the research enquiries that engage a corpus. This section is not intended in any way to limit or circumscribe any use of corpora in research, and we must expect fast development of ...
One example is dysfluency annotation: those working on spoken data may wish to annotate a corpus of spontaneous speech for dysfluencies such as false starts, repeats, hesitations, etc. (see Lickley, no date).
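To make the idea of dysfluency annotation concrete, here is a minimal Python sketch that tags two of the categories mentioned above: hesitations (as filled pauses) and immediate word repeats. The tag names (`HES`, `REP`, `O`), the filled-pause list, and the function name are all assumptions made for illustration and do not come from any published annotation scheme. False starts are not handled, since they generally need human judgement or a richer model.

```python
# Hypothetical filled-pause inventory; a real scheme would define its own.
FILLED_PAUSES = {"uh", "um", "er", "erm"}

def annotate_dysfluencies(utterance):
    """Return (word, tag) pairs for a whitespace-tokenised utterance.

    Tags (illustrative only):
      HES - filled pause / hesitation
      REP - word immediately repeated by the next word
      O   - no dysfluency detected
    """
    words = [tok.strip(",.?!") for tok in utterance.lower().split()]
    annotated = []
    for i, word in enumerate(words):
        if word in FILLED_PAUSES:
            tag = "HES"
        elif i + 1 < len(words) and word == words[i + 1]:
            tag = "REP"
        else:
            tag = "O"
        annotated.append((word, tag))
    return annotated

example = annotate_dysfluencies("I I want um to go")
for word, tag in example:
    print(word, tag)
```

A rule-based pass like this would at most serve as a first draft for human annotators to correct.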
A corpus is a collection of texts. More specifically, in the words of Sinclair, it is "a collection of naturally-occurring language text, chosen to characterize a state or variety of a language" (1991, p. 171).
Intro to Corpus Analysis. Robert Poole. Situating questions for discussion and reflection. What things would you describe with one word as opposed to its synonym? What differences are there in meaning? For example, ‘little’ as opposed to ‘small’ when describing ‘boy’ or ‘man’? ‘weird’ as opposed to ‘odd’ when describing ‘trick’ or ‘person’?
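Questions like these can be explored by counting how often each adjective directly precedes each noun in a corpus. The following is a minimal Python sketch under stated assumptions: the three-sentence `toy_corpus` is invented purely for illustration, the function name is hypothetical, and the tokenisation is a simple regular expression instead of a proper tokeniser. Counts from a real corpus would be needed before drawing any conclusion about usage.

```python
import re
from collections import Counter

def adjective_noun_counts(texts, adjectives, nouns):
    """Count how often each adjective immediately precedes each noun."""
    adjectives, nouns = set(adjectives), set(nouns)
    counts = Counter()
    for text in texts:
        # Crude tokenisation: lowercase alphabetic strings and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if left in adjectives and right in nouns:
                counts[(left, right)] += 1
    return counts

# Invented sentences, for illustration only.
toy_corpus = [
    "The little boy waved at the small man.",
    "A little boy and a small boy shared a toy.",
    "That was a weird trick from an odd person.",
]

pair_counts = adjective_noun_counts(
    toy_corpus,
    adjectives=["little", "small", "weird", "odd"],
    nouns=["boy", "man", "trick", "person"],
)
for (adjective, noun), n in sorted(pair_counts.items()):
    print(f"{adjective} {noun}: {n}")
```

Because `Counter` returns 0 for unseen keys, a pair that never occurs (such as "little man" in the toy data) can be looked up directly, which makes it straightforward to compare near-synonyms side by side.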