Olesya A. Valger
Applying Internet Search Engines to Neological Research
In this article the author outlines the principles she has developed for the high-quality selection, analysis and classification of material for English neological research by means of Internet search engines and platforms. The suggested algorithm can also be applied to lexicographic research.
Over the last decade we have come across a new methodological problem connected with the use of Internet technologies in individual and group research. The international network has existed for only two decades, but the amount of information stored in it has surpassed all former database resources and is increasing every day. Its peculiarity lies in the fact that it is multifunctional: it not only gives users access to ready, up-to-date information, but also provides them with the opportunity to acquire the material necessary for creating entirely new pieces of information. For a linguist the Internet represents a constant stream of speech in all possible kinds of discourse; whereas five years ago this speech was mostly written, nowadays it also includes oral speech transmitted by means of video. The existence of convenient search engines provides scholars with easily obtainable material for investigation.
Still, methods of working with the Internet have not been sufficiently investigated in applied linguistics. This may lead to obtaining low-quality or misinterpreted material and undermine the authority of the Internet as a source of information. The problem is especially evident in neology, where one of the common mistakes students make is misinterpreting the nature of a new formation: occasionalisms are often referred to as neologisms, and neologisms of different degrees of stability are treated as a single class.
That is why I consider it necessary to develop an algorithm of online work applicable to neological research. The algorithm includes several levels of investigating a word. At each level the researcher has to analyze the material according to several criteria. An attempt is also made to establish parameters for each level.
Suppose that in our research we have come across a word which is likely to be a neologism. The pre-analysis consists in consulting printed dictionaries, both general ones and those specializing in neology, such as the Barnhart New Words Concordance, so as to make sure that the word has not already been included in the word stock of the language and is not a neologism of some earlier period.
The analysis proper includes three levels.
At the first level we perform a general analysis of frequency and usage with a standard search engine (google.com, yandex.com, yahoo.com, mail.com). The top-level domain “.com” is very important here; a common mistake is to search for the word with the setting “Russian sites preferable”, which prevents one from getting objective search results. Our primary goal at this level is to differentiate between a neologism and a nonce-word, and to eliminate misinterpretation we apply several criteria.
a) Quantitative criterion. Any search engine reports the number of entries found that contain the word entered. As the experimental results show, certain parameters distinguish a neologism from an occasionalism. Taking February 2011 as the reference point, neutral words of the central word stock give no fewer than approximately 100,000,000 results; bookish modern words, no fewer than 5,000,000; colloquialisms, no fewer than 100,000; special scientific terms, more than 10,000. Standard neologisms give from 10,000 up to 1,000,000. That is why we assume that a number of entries exceeding 10,000 is sufficient to continue the study. With fewer entries we can state that the word is not used extensively enough by any particular group of speakers.
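The quantitative check can be sketched in a few lines of Python. This is only an illustration: the function name, the constant and the inclusive boundary are assumptions made here, the hit count is supposed to be read off the results page by the researcher, and the 10,000 threshold is the February 2011 figure quoted above.

```python
# Threshold taken from the text (valid for February 2011 only).
MIN_HITS_FOR_NEOLOGISM = 10_000


def passes_quantitative_criterion(hit_count: int,
                                  min_hits: int = MIN_HITS_FOR_NEOLOGISM) -> bool:
    """Return True if the word has enough search results to continue the study.

    hit_count is the number of results reported by the search engine,
    entered manually by the researcher (no search API is called here).
    Treating exactly min_hits as sufficient is an assumption of this sketch.
    """
    if hit_count < 0:
        raise ValueError("hit_count must be non-negative")
    return hit_count >= min_hits


# Hypothetical counts for two candidate words.
print(passes_quantitative_criterion(48_200))  # True
print(passes_quantitative_criterion(730))     # False
```

Since the parameters are tied to a date, the threshold is passed as an argument so that it can be raised in later years.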
b) Qualitative criterion. Having found the range of texts in which the word is used, we are to evaluate the quality of the material. Care should be taken with so-called repeated entries, when one text that has attracted the community’s interest is copied onto many webpages with little or no difference, producing a high number of entries which are in fact one and the same entry in numerous variants. Taking into account the physical abilities of a person, we set the sufficient level at twenty entries with varied content. Our statistics show that if there are twenty, there are more; if there are fewer than twenty entries, and the word is mostly cited from its original context, it can be referred to the class of occasionalisms. The status of “internet memes” is debatable: these are witty non-standard nominations of existing phenomena that were posted on a popular website and then enjoyed wide distribution in social networks and blogs. At the quantitative and qualitative levels of analysis a search for them can give multiple entries with varied context, yet very few of them have penetrated printed media over the last ten years. Analysis according to the temporal and discourse criteria eliminates them from neological research.
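Filtering out repeated entries can be partly automated. The sketch below is one possible approach and not part of the procedure described above: it compares snippets by the overlap of their three-word sequences (Jaccard similarity) and keeps only those that differ enough from the ones already kept. The function names, the 0.8 similarity cut-off and the sample snippets are assumptions; only the figure of twenty distinct entries comes from the text.

```python
import re


def shingles(text, size=3):
    """Return the set of consecutive word triples found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of common elements in two sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def distinct_contexts(snippets, max_similarity=0.8):
    """Keep the first copy of each text and drop its near-duplicates."""
    kept, kept_shingles = [], []
    for snippet in snippets:
        current = shingles(snippet)
        if all(jaccard(current, seen) < max_similarity for seen in kept_shingles):
            kept.append(snippet)
            kept_shingles.append(current)
    return kept


def passes_qualitative_criterion(snippets, required=20):
    """True if at least `required` entries with varied content remain."""
    return len(distinct_contexts(snippets)) >= required


# Hypothetical snippets: the second is a copy of the first with new punctuation.
sample = [
    "The summit triggered a wave of undollarization across the markets today",
    "The summit triggered a wave of undollarization across the markets today!",
    "Analysts argue that undollarization is a slow and uneven process",
]
print(len(distinct_contexts(sample)))  # 2
```

The researcher still has to read the remaining entries; the filter only reduces the amount of manual comparison.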
c) Temporal criterion. It concerns the time limits of the word’s use. The high speed with which new words are transmitted on the Internet sometimes gives high usage figures to a word which cannot be considered a neologism proper, as it has already gone out of use. We may even speak about “words in fashion”, that is, words which are used on some occasion on TV or on the Internet, are then broadcast internationally, and spread all over the world in two or three days. Then they die out completely. This was the story of the word “undollarization”, first used by Tony Blair to describe the economic situation in which international banks started actively transferring equity to the safer euro stock. The word travelled from one news item to another and went out of use as soon as the situation ceased to be topical. Most of the entries containing the word date from 2006, and only three separate citation entries are found for the latest year.
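The temporal check amounts to looking at how the dated entries are distributed over the years. A minimal sketch follows, assuming that the researcher has noted the year of each entry by hand; the function names, the `min_recent` cut-off and the sample figures are illustrative assumptions, chosen to resemble the “undollarization” case.

```python
from collections import Counter


def temporal_profile(entry_years, since_year):
    """Return (peak_year, recent_count) for a list of entry years.

    peak_year is the year with the most entries; recent_count is the
    number of entries dated since_year or later.
    """
    counts = Counter(entry_years)
    if not counts:
        return None, 0
    peak_year = counts.most_common(1)[0][0]
    recent_count = sum(n for year, n in counts.items() if year >= since_year)
    return peak_year, recent_count


def is_still_in_use(entry_years, since_year, min_recent=20):
    """True if enough recent entries exist (min_recent is an assumed cut-off)."""
    _, recent_count = temporal_profile(entry_years, since_year)
    return recent_count >= min_recent


# Hypothetical data: a burst of entries in 2006 and three recent citations.
years = [2006] * 60 + [2007] * 5 + [2010] * 3
print(temporal_profile(years, since_year=2010))  # (2006, 3)
print(is_still_in_use(years, since_year=2010))   # False
```

A peak lying several years back combined with a handful of recent entries is the pattern of a “word in fashion” rather than of a neologism proper.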
d) Discourse criterion. Here we conduct a primary discourse analysis and obtain the information needed for the next level. We return to the twenty entries with varied context that we have selected and study their discourse characteristics in order to define the scope of the following level. If all the entries are advertisements, the word should be refused the status of a neologism. Words occurring in scientific contexts are more likely to become generally accepted, whereas those referring to politics, the economy or mass media need more careful observation. If a word is found in a number of different contexts, it is likely to enter the colloquial word stock. There are interesting cases when a word has been created deliberately by the author of a book or a film, and its success then depends on the success of the author. For instance, the word “hobbit”, coined by J. R. R. Tolkien in the 1930s, is now widely used in colloquial speech as a metaphor (a man of no great height), and the word “muggle” from J. K. Rowling’s series is now following the same path, starting to denote a common person of no elevated thoughts. In such cases we are to observe the author’s occasionalism for a long time before giving it neological status.
To sum up, at the first level of analysis we draw the borderline between a neologism and an occasionalism and obtain the basic information for further study.
At the second level we perform a detailed analysis of the contexts in which the word is used and experiment with particular search locations, such as the search tools of different specialized websites. The more resources are consulted, the more thorough the results. However, a minimum set of spheres can be defined:
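The rules of the discourse criterion can be written down as a small decision function. This is a rough sketch under several assumptions: each of the selected entries has already been labelled with its sphere by the researcher, the label names and the verdict strings are invented for the example, and “a number of different contexts” is interpreted as three or more spheres.

```python
from collections import Counter

# Spheres which, according to the text, call for more careful observation.
CAUTION_SPHERES = {"politics", "economy", "mass media"}


def discourse_verdict(labels, min_spheres=3):
    """Give a preliminary verdict from manually assigned sphere labels."""
    if not labels:
        raise ValueError("at least one labelled entry is required")
    counts = Counter(labels)
    if set(counts) == {"advertisement"}:
        return "refuse neologism status"
    if len(counts) >= min_spheres:
        return "likely to enter the colloquial word stock"
    dominant = counts.most_common(1)[0][0]
    if dominant == "science":
        return "likely to become generally accepted"
    if dominant in CAUTION_SPHERES:
        return "needs further observation"
    return "undecided"


# Hypothetical labelling of twenty entries.
print(discourse_verdict(["science"] * 15 + ["politics"] * 5))
# likely to become generally accepted
```

The verdict is only a starting point for the second level; author-created words such as those discussed above would still require long-term observation whatever the function returns.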
a) online newspaper publications (such as timesonline.co.uk, etc.);
b) scientific resources. Here you may choose general (such as scienceonline.org, sciencedaily.com) or more specific resources in medicine, linguistics, chemistry, etc;
c) chats, forums and blogs. In this case you may define the audience which uses the word by systematically checking the word with the search tools of forums of certain communities: religious, subcultural, age-range, hobby-oriented, etc.;
d) amateur fiction resources. These provide an opportunity to observe the word in up-to-date narrative and descriptive texts.
No fixed parameters can be established in this case, as much depends on the word, but it should be widely represented in at least one type of context and occur at least occasionally in all the others. Practice shows that on the Internet even the rarest technical terms penetrate all the other spheres.
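The second-level requirement (wide representation in one type of context, at least some occurrences in all of them) can be checked as follows. Since no fixed parameters are established for this level, the `wide` value below is purely a placeholder to be chosen for each word, and the type names and counts are assumptions made for the example.

```python
# The four minimum spheres listed above, under assumed short names.
CONTEXT_TYPES = ("newspapers", "science", "forums_blogs", "amateur_fiction")


def passes_second_level(counts, wide=50):
    """Check coverage of the word across the four context types.

    counts maps a context type to the number of entries found there;
    missing types are treated as zero. `wide` is a placeholder value.
    """
    per_type = [counts.get(name, 0) for name in CONTEXT_TYPES]
    occurs_everywhere = all(n > 0 for n in per_type)
    wide_somewhere = max(per_type) >= wide
    return occurs_everywhere and wide_somewhere


# Hypothetical counts collected with the search tools of individual sites.
found = {"newspapers": 4, "science": 120, "forums_blogs": 17, "amateur_fiction": 2}
print(passes_second_level(found))  # True
```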
At the third level of analysis we study the lexical meaning of the word and give its definition. The optimal number of contextual examples (sentences or texts) is ten, as it allows one not only to trace the primary lexical meaning, but also to take into account the combinability of the word. At this stage we can usually find the figurative meanings of a word, if there are any. In the case of abbreviations it is necessary to make sure that all the sentences contain the word under investigation and not its homonyms, which often arise when the initial letters of two or more words are taken; this requires searching for sentences in which the abbreviation is explained in brackets or in a footnote.
One more opportunity provided by the web is the active presence of a great many cultural informants, so it is advisable to ask a native speaker (e.g. by means of answers.com) about the meaning of the word and then interpret the answer.
This is the basic algorithm of neological analysis with the use of Internet technologies, which allows us to minimize possible misinterpretation of a word’s status. It should be kept in mind that the parameters introduced at the first level of the algorithm are not fixed and are optimal only at the current date. They will grow every year, as the international network is actively developing, the storage of information is far from being limited, and more and more texts are created in the process of communication. Whereas in the past most of these texts had no chance of surviving in written form, the Internet gives us the opportunity to preserve everything and lose no information at all, which greatly sharpens the problem of material search and selection.
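For abbreviations, sentences that explain the abbreviation in brackets can be picked out with a regular expression. The sketch below is an illustration only: the function name, the two bracket patterns it recognizes and the sample abbreviation “WSD” are assumptions, and footnote explanations are not covered.

```python
import re


def expansion_in_brackets(sentence, abbr):
    """Return the expansion given for abbr in the sentence, or None.

    Two patterns are recognized: "ABBR (full form)" and
    "full form (ABBR)", where the initials of the preceding words
    must spell the abbreviation.
    """
    # Pattern 1: the abbreviation is followed by its explanation in brackets.
    match = re.search(rf"\b{re.escape(abbr)}\s*\(([^)]+)\)", sentence)
    if match:
        return match.group(1).strip()
    # Pattern 2: the full form precedes the abbreviation given in brackets.
    n = len(abbr)
    match = re.search(rf"((?:[\w-]+\s+){{{n}}})\(\s*{re.escape(abbr)}\s*\)", sentence)
    if match:
        words = match.group(1).split()
        if "".join(word[0] for word in words).lower() == abbr.lower():
            return " ".join(words)
    return None


# Hypothetical sentences containing two homonymous abbreviations.
print(expansion_in_brackets("We apply word sense disambiguation (WSD) here.", "WSD"))
# word sense disambiguation
print(expansion_in_brackets("The WSD (Western School District) met today.", "WSD"))
# Western School District
```

Comparing the expansions returned for the collected sentences shows at once which examples belong to the word under investigation and which to its homonyms.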