Some information on how to search the StimulStat database

  1. For lemma search, our main source is the Frequency dictionary of modern Russian language (Lyashevskaya, Sharov, 2009). For word form search, the main source is the list of word forms generated for these lemmas using the OpenCorpora dictionary (version 05.10.2015) and the pymorphy2 analyzer (see About the project). Therefore, if these sources do not contain the lemma or form of interest and you try to get word info, nf (not found) will be shown for all parameters.

    At the moment we are synchronizing all sources used in the database, and we hope that it will soon be possible to search for any word that is included in at least one of the sources.

    In the Find words mode, it is possible to search separately in Zalizniak’s dictionary (words with a particular index), Efremova’s dictionary, and the list of word forms with accents. Searching these sources can return words for which the Get word info tab shows nf (not found).
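The behaviour described in this section can be pictured with a small sketch. It is only an illustration: the word lists, values, and function names below are hypothetical and are not the database's actual data or code. It assumes that Get word info looks words up in the main source only, while Find words can draw on an additional source.

```python
# Toy stand-ins for the real sources. All entries and values are hypothetical.
MAIN_SOURCE = {
    "koshka": {"frequency": 100.0, "length": 6},
}
EXTRA_SOURCES = {
    "zalizniak": {"koshka", "redkoslovo"},
}
NOT_FOUND = "nf"


def get_word_info(lemma, parameters):
    """Return the requested parameters, or 'nf' for each one
    when the lemma is missing from the main source."""
    entry = MAIN_SOURCE.get(lemma)
    if entry is None:
        return {name: NOT_FOUND for name in parameters}
    return {name: entry.get(name, NOT_FOUND) for name in parameters}


def find_words(source_name):
    """List the words available in one of the additional sources."""
    return sorted(EXTRA_SOURCES[source_name])


# A word can be found through an additional source and still yield nf.
for word in find_words("zalizniak"):
    print(word, get_word_info(word, ["frequency", "length"]))
```

In this toy setup the second word is listed by find_words, but every parameter requested for it comes back as nf.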

  2. On frequency

    • When working with frequency information, take homonymy into account

      • In the Frequency dictionary of modern Russian language (Lyashevskaya, Sharov, 2009), frequency (ipm, instances per million words) and part of speech are specified for each lemma. For homonyms that belong to different parts of speech (for example, o: preposition 'about' and interjection 'oh'), a separate frequency value is given for each lemma (preposition: 3407.1 ipm, interjection: 71.2 ipm). If homonyms belong to the same part of speech (for example, bor 'forest', '(dental) drill', 'boron'), their frequencies are summed, even if they differ in some grammatical features (for example, operator as an animate and an inanimate noun). Word form frequencies are taken from a different source, the "Frequency Grammar of Russian" (Lyashevskaya 2013), where frequencies of homonymous forms of one lemma are not summed.

        Output in the Get word info mode.

        If you do not check "Full grammatical analysis", frequencies of homonyms are summed up even if they belong to different parts of speech.

        Lemma o. Parameters "Frequency" and "Full grammatical analysis" are checked

        [Screenshot: freq_with_gramms]

        Lemma o. Parameter "Frequency" is checked, parameter "Full grammatical analysis" is unchecked

        [Screenshot: freq_no_gramms]

        The same applies to word forms:

        Word form koške ('cat', Dative/Locative Singular). Parameters "Frequency" and "Full grammatical analysis" are checked

        [Screenshot: form_freq_with_gramms]

        Word form koške ('cat', Dative/Locative Singular). Parameter "Frequency" is checked, parameter "Full grammatical analysis" is unchecked

        [Screenshot: form_freq_no_gramms]

        Output in the Find words mode.

        As explained above for the Get word info mode, frequencies of homonyms that belong to the same part of speech are summed in the database. Frequencies of homonyms that belong to different parts of speech are also summed unless you check the "Full grammatical analysis" or "Part of speech" parameter.

    • Some word forms included in the database are not represented in the "Frequency Grammar of Russian" (Lyashevskaya 2013), which is the source of word form frequencies. So when you search by form neighborhood parameters (e.g. the summed frequency of all neighbors), the frequencies of such forms are not taken into account.
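The two points in this section can be sketched in a few lines of Python. This is a minimal illustration, not the database's own code. The two frequencies for o are the values quoted above. The function names and the idea of a `split_by_pos` switch standing in for the "Full grammatical analysis" checkbox are assumptions made for the example.

```python
from collections import defaultdict

# (lemma, part of speech, frequency in ipm); values for "o" as quoted above.
ENTRIES = [
    ("o", "preposition", 3407.1),
    ("o", "interjection", 71.2),
]


def lemma_frequencies(entries, split_by_pos):
    """Sum frequencies per lemma. With split_by_pos=True, homonyms from
    different parts of speech keep separate totals."""
    totals = defaultdict(float)
    for lemma, pos, ipm in entries:
        key = (lemma, pos) if split_by_pos else (lemma,)
        totals[key] += ipm
    return dict(totals)


def summed_neighbor_frequency(neighbors, form_frequencies):
    """Sum the frequencies of the given neighbors. Forms that are absent
    from the frequency source contribute nothing to the total."""
    return sum(form_frequencies[n] for n in neighbors if n in form_frequencies)


print(lemma_frequencies(ENTRIES, split_by_pos=True))
print(lemma_frequencies(ENTRIES, split_by_pos=False))
```

With the split enabled, the preposition and the interjection keep their own values. Without it, they collapse into a single total of about 3478.3 ipm. The second function shows why a summed neighbor frequency can be lower than expected: a neighbor with no recorded frequency is silently skipped.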

  3. Inflectional characteristics in the lemma search

    • For every part of speech, a particular form represents the lemma; for nouns, for example, it is the Nominative Singular form. Number and case are inflectional categories and are not grammatical characteristics of the lemma itself. However, the full grammatical analysis in the Get word info mode allows searching for them. Inflectional features are separated from the true grammatical characteristics of the lemma by a space.

      [Screenshot: special_gram_for_lem]

    • Two systems of notation are currently used in the database: one from the Russian National Corpus (also adopted in the Frequency dictionary of modern Russian language (Lyashevskaya, Sharov, 2009) and the "Frequency Grammar of Russian") and another from the OpenCorpora project. OpenCorpora tags always consist of four letters (this is why Nominative is abbreviated as NOMN, not NOM).
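To make the difference between the two notations concrete, here is a small sketch that converts case tags between them. It covers only the six main cases and is meant as an illustration. The helper name is hypothetical, and the exact tag inventory and letter case shown by the database are assumptions.

```python
# Main case tags in Russian National Corpus style and OpenCorpora style.
# Only six cases are listed; this is an illustrative subset.
RNC_TO_OPENCORPORA_CASE = {
    "nom": "nomn",  # Nominative
    "gen": "gent",  # Genitive
    "dat": "datv",  # Dative
    "acc": "accs",  # Accusative
    "ins": "ablt",  # Instrumental
    "loc": "loct",  # Locative (Prepositional)
}
OPENCORPORA_TO_RNC_CASE = {oc: rnc for rnc, oc in RNC_TO_OPENCORPORA_CASE.items()}


def convert_case_tag(tag, to):
    """Convert a case tag to the 'opencorpora' or 'rnc' notation,
    keeping upper case if the input was upper case."""
    if to == "opencorpora":
        table = RNC_TO_OPENCORPORA_CASE
    elif to == "rnc":
        table = OPENCORPORA_TO_RNC_CASE
    else:
        raise ValueError(f"unknown target notation: {to!r}")
    key = tag.lower()
    if key not in table:
        raise ValueError(f"unknown case tag: {tag!r}")
    converted = table[key]
    return converted.upper() if tag.isupper() else converted


print(convert_case_tag("NOM", to="opencorpora"))
print(convert_case_tag("datv", to="rnc"))
```

Every OpenCorpora tag in the table has exactly four letters, which is the regularity mentioned above.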

  4. About neighbors. The same letter string can be both a lemma and a word form (for example, the Nominative Singular form of a noun). Consequently, neighbors of such a string can be identified both among lemmas and among word forms. The same is true for the word uniqueness point parameter. Therefore, be careful: depending on which type of lexical unit you search for (lemma or word form), you will get different results.
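The effect can be illustrated with a short sketch. It assumes one common definition of a neighbor (a string of the same length that differs in exactly one letter); the database may also offer other neighbor types. The word lists are tiny, transliterated, hypothetical samples, not the database's contents.

```python
def substitution_neighbors(target, vocabulary):
    """Return words of the same length that differ from target
    in exactly one letter position."""
    return sorted(
        word
        for word in vocabulary
        if len(word) == len(target)
        and sum(a != b for a, b in zip(word, target)) == 1
    )


# Hypothetical toy samples. "kot" is both a lemma and a word form.
LEMMAS = {"kot", "kit", "rot", "dom"}
FORMS = {"kot", "kit", "rot", "koz", "kos", "not", "kota"}

print(substitution_neighbors("kot", LEMMAS))  # ['kit', 'rot']
print(substitution_neighbors("kot", FORMS))   # ['kit', 'kos', 'koz', 'not', 'rot']
```

The same string kot has two neighbors when treated as a lemma and five when treated as a word form, because inflected forms such as koz or not exist only in the form list.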