2/17/2024
Ran deep synonym

In Theory, the Number of Word Types in a Language is Infinite

Kornai (2002) argued that the number of word types in a language is boundless because language users constantly coin new words. This is linked to the observation that the number of word types increases as a function of the corpus size. All else equal, the number of word types will be smaller in a small corpus than in a large corpus, as new types add up the more words a person (or machine) processes. When the very first words of a corpus are processed, each word is a new type. Very rapidly, however, word types start to repeat (e.g., the word "the" occurs in nearly every sentence) and the increase in word types slows down. The more words processed already (i.e., the larger the corpus size), the less likely the next word will be a new type, because most word types have already been encountered. Herdan (1964) and Heaps (1978) argued that the function linking the number of word types to the corpus size has the shape of a power function with an exponent less than 1 (i.e., it will be a concave function). This function is shown in the upper part of Figure 1. It is known as Herdan's law or Heaps' law (Herdan described the function first, but Heaps' book had more impact). We will call the function Herdan's law in the remainder of the text.

Figure 1. Figures illustrating Herdan's or Heaps' law. The top figure shows how the number of word types would increase if the law were a power law with exponent 0.5 (i.e., square root). The bottom figure shows the same information when the axes are log10 transformed. Then the concave function becomes a linear function, which is easier to work with.

Mathematicians prefer to present power functions in coordinates with logarithmically transformed axes, because this changes the concave function into a linear function, which is easier to work with, as shown in the bottom part of Figure 1. Kornai's (2002) insight was that if the number of word types is limited, then at a certain point Herdan's law will break down, because the pool of possible word types has been exhausted. This will be visible in the curve becoming flat from a certain corpus size on. Kornai (2002) verified Herdan's law for corpora up to 50 million word tokens and failed to find any flattening of the predicted linear curve, indicating that the pool of possible word types was still far from exhausted. Since Kornai's (2002) analysis, corpora of vastly greater size have been released. When Brants and Franz (2006) made the first 1.025 trillion word corpus available, based on the English internet webpages at that time, they verified that Herdan's law still applied for a corpus of this size. Brants and Franz (2006) counted 13.6 million word types in their corpus, with no indication of a stop to the growth.

A look at the types in Brants and Franz's (2006) corpus reveals that a great deal of them consist of alphabetical characters combined with non-letter signs, most of which no native speaker would accept as constituting a word in English (similar to the word "mee-ee-ooow" in the example above). In order to confirm that the growth in types is not the result of these arbitrary combinations of characters, some cleaning is required. One such cleaning is to lowercase the corpus types and limit them to sequences involving the letters a–z only (Gerlach and Altmann, 2013).
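The a–z cleaning step and the log-log view of Herdan's law described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by any of the cited authors: the function names (`clean_type`, `type_growth`, `fit_loglog_slope`) are hypothetical, tokens that fail the a–z filter are assumed to be dropped from the token count altogether, and the exponent is estimated with a plain least-squares line through the log10-transformed points.

```python
import math
import re

# Assumption: a "clean" type is a lowercased token made of the letters a-z only.
AZ_ONLY = re.compile(r"[a-z]+")


def clean_type(token):
    """Lowercase a token; return it if it contains only a-z, else None."""
    lowered = token.lower()
    return lowered if AZ_ONLY.fullmatch(lowered) else None


def type_growth(tokens):
    """Return (tokens_processed, types_seen) after each token that survives cleaning."""
    seen = set()
    curve = []
    processed = 0
    for token in tokens:
        cleaned = clean_type(token)
        if cleaned is None:
            continue  # assumption: rejected tokens do not count toward corpus size
        processed += 1
        seen.add(cleaned)
        curve.append((processed, len(seen)))
    return curve


def fit_loglog_slope(points):
    """Least-squares slope and intercept of log10(types) against log10(tokens).

    Under a power law, the slope is the exponent and the line is straight.
    """
    xs = [math.log10(n) for n, _ in points]
    ys = [math.log10(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy input only; a real check would need a corpus of many millions of tokens.
    text = "The cat sat on the mat and the cat said mee-ee-ooow to the other cat"
    curve = type_growth(text.split())
    slope, intercept = fit_loglog_slope(curve)
    print(curve[-1], slope, intercept)
```

On a real corpus, a slope below 1 that stays constant as the corpus grows would be consistent with Herdan's law, whereas a curve that flattens beyond some corpus size would suggest that the pool of word types is being exhausted.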