In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
A corpus is a collection of texts. More specifically, in the words of Sinclair, it is "a collection of naturally-occurring language text, chosen to characterize a state or variety of a language" (1991, p. 171).
Sep 19, 2022 · In the field of linguistics, a corpus is a large and structured set of texts (nowadays, usually electronically stored and processed). The texts in a corpus have been selected to represent a...
Feb 12, 2020 · In linguistics, a corpus is a collection of linguistic data (usually contained in a computer database) used for research, scholarship, and teaching. Also called a text corpus.
- Richard Nordquist
- Who Builds A Corpus?
- What Is A Corpus for?
- How Do We Sample A Language For A Corpus?
- Representativeness
- Balance
- Topic
- Size
- Specialised Corpora
- Homogeneity
- Character of Corpus Research
Experts in corpus analysis are not necessarily good at building the corpora they analyse — in fact there is a danger of a vicious circle arising if they construct a corpus to reflect what they already know or can guess about its linguistic detail. Ideally a corpus should be designed and built by an expert in the communicative patterns of the commun...
A corpus is made for the study of language; other collections of language are made for other purposes. So a well-designed corpus will reflect this purpose. The contents of the corpus should be chosen to support the purpose, and therefore in some sense represent the language from which they are chosen. Since electronic corpora became possible, lingu...
There are three considerations that we must attend to in deciding a sampling policy: 1. The orientation to the language or variety to be sampled. 2. The criteria on which we will choose samples. 3. The nature and dimensions of the samples.
It is now possible to approach the notion of representativeness, and to discuss this concept we return to the first principle, and consider the users of the language we wish to represent. What sort of documents do they write and read, and what sort of spoken encounters do they have? How can we allow for the relative popularity of some publications ...
The notion of balance is even more vague than representativeness, but the word is frequently used, and clearly for many people it is meaningful and useful. Roughly, for a corpus to be pronounced balanced, the proportions of different kinds of text it contains should correspond with informed and intuitive judgements. Most general corpora of today ar...
The point above concerning a text type where most of the exemplars are highly specialised raises the matter of topic, which most corpus builders have a strong urge to control. Many corpus projects are so determined about this that they conduct a semantic analysis of the language on abstract principles like those of Dewey or Roget, and then search ...
The minimum size of a corpus depends on two main factors: 1. the kind of query that is anticipated from users, 2. the methodology they use to study the data. There is no maximum size. We will begin with the kind of figures found in general reference corpora, but the principles are the same, no matter how large or small the corpus happens to be. To ...
The proportions suggested above relate to the characteristics of general reference corpora, and they do not necessarily hold good for other kinds of corpus. For example, it is reasonable to suppose that a corpus that is specialised within a certain subject area will have a greater concentration of vocabulary than a broad-ranging corpus, and that is...
The underlying factor is homogeneity. Two general corpora may differ in their frequency profile if one is more homogeneous than the other, while specialised corpora, by reducing the variables, offer a substantial gain in homogeneity. Homogeneity is a useful practical notion in corpus building, but since it is superficially like a bundle of internal ...
It is necessary to say something here about the "typical studies" mentioned above, because at many points in this chapter there are assumptions made about the nature of the research enquiries that engage a corpus. This section is not intended in any way to limit or circumscribe any use of corpora in research, and we must expect fast development of ...
Oct 28, 2019 · What are the different types of text corpora for NLP? A plain text corpus is suitable for unsupervised training. Machine learning models learn from the data in an unsupervised manner. However, a corpus that has the raw text plus annotations can be used for supervised training.
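As a rough illustration of the distinction above, the sketch below contrasts a plain text corpus with an annotated one. The sample sentences, the labels ("animals", "finance"), and the helper names (word_frequencies, train_label_counts, predict) are all hypothetical and chosen only for this example; the word-overlap classifier is a deliberately naive stand-in for a real supervised model.

```python
from collections import Counter, defaultdict

# Hypothetical plain text corpus: raw text only, no annotations.
plain_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Hypothetical annotated corpus: raw text plus one label per text.
annotated_corpus = [
    ("the cat sat on the mat", "animals"),
    ("the dog chased the cat", "animals"),
    ("stocks fell on the news", "finance"),
    ("the bank raised rates", "finance"),
]


def word_frequencies(texts):
    """Unsupervised use: count word frequencies from raw text alone."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def train_label_counts(pairs):
    """Supervised use: count words separately for each annotated label."""
    per_label = defaultdict(Counter)
    for text, label in pairs:
        per_label[label].update(text.split())
    return per_label


def predict(per_label, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.split()
    return max(per_label, key=lambda label: sum(per_label[label][w] for w in words))


freqs = word_frequencies(plain_corpus)
model = train_label_counts(annotated_corpus)
print(freqs.most_common(1))
print(predict(model, "the dog sat"))
```

The first function needs nothing beyond the text itself, whereas the second cannot run without the labels, which is the practical difference between the two kinds of corpus.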
Oct 16, 2024 · What is a Corpus? A corpus is, simply put, a text under study or a set of texts to study (the plural is corpora). For linguists, a corpus is specifically a collection of written or spoken material upon which a linguistic analysis is based. You may draw your corpora from many different sources.