StimulStat Project

StimulStat is a lexical database for the Russian language. It allows users to select words and word forms by a range of parameters and to look up the values of those parameters for a given list of words or word forms. The resource was created primarily for experimental studies of Russian, which often require stimuli that vary in one target property (e.g. stress) while many other properties (e.g. length, frequency, part of speech) are either identical or carefully balanced. The database can also serve other purposes, such as selecting words with particular characteristics for teaching Russian, for tests and assignments of all kinds, and for research papers.

The database includes parameters associated with lemma and form frequency, orthographic representation, ideal and real phonemic representation, prosodic features, grammatical features of word forms and lemmas, polysemy, homonymy, homography, orthographic and phonological neighbourhoods (groups of words with similar spelling or pronunciation), as well as subjective age of acquisition and imageability. All parameters can be found on the search pages, and a complete list with the necessary comments is included in the instructions. The values of some parameters were taken from various sources (see the list below); for these, the advantage of the database is that they can all be taken into account simultaneously. Other parameters were calculated specifically for the database.

On the "Additional Resources" page, we present a separate project in which the frequencies of various grammatical characteristics and inflectional affixes of Russian nouns were obtained from the Russian National Corpus.

The main part of the project is now complete, but we continue to correct errors and optimize the interface. We would be grateful for comments and suggestions. We also plan to add a few more parameters to the database.

The project is being developed at the Laboratory for Cognitive Studies, St. Petersburg State University.

Authors

Svetlana Alexeeva mail@s-alexeeva.ru
Natalia Slioussar
Daria Chernova

Database sources:

Lyashevskaya, O., Sharov, S. (2009). Chastotnyj slovar' sovremennogo russkogo jazyka (na materialakh Nacional'nogo korpusa russkogo jazyka) [The frequency dictionary of modern Russian language based on Russian National Corpus]. Moscow: Azbukovnik. [In Russian]. (An electronic version is available.)
The project Frequency grammar of Russian.
Lyashevskaya, O. (2013). Chastotnyj leksiko-grammaticheskij slovar': prospekt proekta [Lexico-grammatical frequency dictionary: A preliminary design]. In V. P. Selegey (Ed.), Computational Linguistics and Intellectual Technologies. Vol. 12. 2013. Pp. 478-489.
Zaliznjak, A. A. (1987). Grammaticheskij slovar' russkogo jazyka [Grammatical dictionary of Russian Language]. 3rd ed. Moscow: Russkij jazyk. [In Russian].
Efremova, T. (2000). Novyj slovar' russkogo jazyka. Tolkovo-slovoobrazovatel'nyj [The new explanatory dictionary of Russian language]. Moscow: Russkij jazyk. [In Russian].
OpenCorpora (version 05.10.2015) via the morphological parser pymorphy2.
Bocharov V. V., Alexeeva S. V., Granovsky D. V., Protopopova E. V., Stepanova M. E., Surikov A. V. Crowdsourcing morphological annotation // Computational Linguistics and Intellectual Technologies. Vol. 12. 2013. Pp. 109–114.

Korobov M. Morphological analyzer and generator for Russian and Ukrainian languages // Analysis of Images, Social Networks and Texts. Berlin: Springer, 2015. Pp. 320-332.
The list of word forms annotated with the stress, created by A. Usachev based on the Grammatical Dictionary of the Russian Language.
Database "Verb and action".
Akinina, Y., Malyutina, S., Ivanova, M., Iskra, E., Mannova, E., Dragoy, O. Russian normative data for 375 action pictures and verbs // Behavior Research Methods. Vol. 47. 2015. Pp. 691-707.
The CORPRES dictionary of phonological variants created at the Laboratory for Experimental Phonetics of St. Petersburg State University.
Skrelin, Pavel A., Nina B. Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, and Vera Evdokimova. "A Fully Annotated Corpus of Russian Speech." In LREC. 2010.

Research report "Variability of Phonetic Units in view of Interaction between the Levels of the Language System in Standard Russian", P. A. Skrelin, Saint Petersburg State University, 2010-2014.

Frequency values are drawn from Lyashevskaya and Sharov's dictionary (for lemmas), the project "Frequency grammar of Russian" (for word forms), and the CORPRES database (for phonemic representations of forms).

The following parameters were calculated by the authors based on the sources mentioned above:

natural-log (ln) and base-10 log (lg) transformed frequency;
various parameters related to the orthographic representation: word length in letters, the first and last letter, reversed spelling (e.g., okolom <- moloko 'milk'), sorted representation with and without letter repetition (e.g., klmooo and klmo <- moloko), uniqueness point;
various parameters associated with syllable boundaries and stress: the number of syllables (based on the number of vowels), syllable boundaries according to the model of L. V. Bondarko, CV notation, the position of primary and secondary stress (in letters and in syllables), the presence of a stress shift in the inflectional paradigm;
homonyms and homographs of different types;
orthographic and phonological neighbors of different types.
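
As a rough illustration of the letter-based parameters listed above, the following Python sketch computes a few of them for a single word. It is not the code used to build the database. The function names, the toy lexicon, and the frequency value are hypothetical; the CV string is derived from letters rather than from the phonemic representation; and the neighbor definition used here (same length, exactly one letter substituted) is only one common type, which may differ from the types available in StimulStat.

```python
import math

VOWELS = set("аеёиоуыэюя")  # Russian vowel letters
SIGNS = set("ьъ")           # soft and hard signs, skipped in the CV string


def orthographic_features(word, ipm=None):
    """Letter-based features for one lowercase Cyrillic word (illustrative only).

    `ipm` is an optional frequency value (assumed here to be instances per
    million); when given, its ln and lg transforms are added.
    """
    features = {
        "length": len(word),
        "first_letter": word[0],
        "last_letter": word[-1],
        "reversed": word[::-1],
        # Sorted by Unicode code point, so "ё" sorts after "я",
        # which differs from Russian alphabetical order.
        "sorted": "".join(sorted(word)),
        "sorted_unique": "".join(sorted(set(word))),
        # Number of syllables approximated by the number of vowel letters.
        "syllables": sum(ch in VOWELS for ch in word),
        # Simplified CV notation: every non-vowel letter counts as C.
        "cv": "".join(
            "V" if ch in VOWELS else "C" for ch in word if ch not in SIGNS
        ),
    }
    if ipm is not None:
        features["ln_freq"] = math.log(ipm)     # natural logarithm
        features["lg_freq"] = math.log10(ipm)   # base-10 logarithm
    return features


def substitution_neighbors(word, lexicon):
    """Words of the same length that differ from `word` in exactly one letter."""
    return sorted(
        w for w in lexicon
        if len(w) == len(word) and sum(a != b for a, b in zip(w, word)) == 1
    )


# The frequency value and the lexicon below are made up for the example.
example = orthographic_features("молоко", ipm=100.0)
neighbors = substitution_neighbors("кот", ["кот", "кит", "рот", "ток", "коты"])
```

For молоко (moloko 'milk') the sketch gives the reversed spelling околом and the sorted representations клмооо and клмо, which correspond to the transliterated examples in the list above.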

How to refer to the project:

So far, publications describe only the initial stage of the project. If you use the database, please cite this website and the following article:

Alexeeva, S., Slioussar, N. & Chernova, D. (2017). StimulStat: a lexical database for Russian. Behavior Research Methods, doi: 10.3758/s13428-017-0994-3, URL: https://link.springer.com/article/10.3758/s13428-017-0994-3

Alexeeva S.V., Slioussar N.A., Chernova D.A. (2016). Stimulstat: a database for linguistic and psychological studies on Russian language. Abstracts of The Seventh International Conference on Cognitive Science (June 20–24, 2016, Svetlogorsk, Russia), pp. 23-24. [In English]

Alexeeva S.V., Slioussar N.A., Chernova D.A. (2015). StimulStat: baza dannykh, okhvatyvajushchaja razlichnye kharakteristiki slov russkogo jazyka, vazhnye dlja lingvisticheskikh i psikhologicheskikh issledovanij [StimulStat: A lexical database for linguistic and psychological research on Russian language]. Materialy 21-oj Mezhdunarodnoj konferencii po komp'juternoj lingvistike «Dialog» [Proceedings of the 21st International Conference on Computational Linguistics “Dialogue”]. 2015, URL: http://www.dialog-21.ru/digests/dialog2015/materials/pdf/AlexeevaSVSlioussarNAChernovaDA.pdf [In Russian]

If you use the frequency information from the "Additional Resources" page, please cite this website and the following article:

Slioussar N. A., Samoilova M. V. Chastotnosti razlichnykh grammaticheskikh kharakteristik i okonchanij u sushchestvitel'nykh russkogo jazyka [Frequencies of different grammatical features and inflectional affixes in Russian nouns]. Materialy 21-oj Mezhdunarodnoj konferencii po komp'juternoj lingvistike «Dialog» [Proceedings of the 21st International Conference on Computational Linguistics “Dialogue”]. 2015, URL: http://www.dialog-21.ru/digests/dialog2015/materials/pdf/SlioussarNASamoilovaMV.pdf [In Russian]

We would be grateful if you wrote to us about studies in which these resources are used. This information is important and interesting to us.

Please note that in certain cases it is also necessary to cite the original source of the information found with the help of our database (e.g. the "Frequency dictionary of modern Russian language" for lemma frequencies).

Students who helped us

Pavel Prokopiev (State University of Aerospace Instrumentation)
Vladislav Meletyagin (State University of Aerospace Instrumentation)
Nikita Narchuk (State University of Aerospace Instrumentation)