Khalkha Mongolian Corpus

One reason no psycholinguistic studies had been conducted with Khalkha speakers prior to my dissertation work is the lack of a digitized, searchable corpus. To conduct the psycholinguistic studies presented in my dissertation, I therefore created a Khalkha Mongolian corpus. The corpus is currently about half a million tokens in size and was gathered from several online Khalkha-language sources. In order of the size of their contribution to the corpus, these are: Onoodor Sonin (Өнөөдөр Сонин: http://www.mongolnews.mn), Khalkha Wikipedia (http://mn.wikipedia.org), and Tsahii Murtuu (http://www.tsahimurtuu.mn). Onoodor Sonin, the newspaper from which the majority of the corpus is derived, is one of Mongolia’s large daily papers. Khalkha Wikipedia was taken in its entirety, as was Tsahii Murtuu, another news site. With this corpus, it is now possible to derive lexical statistics about the language, such as word frequency, phonological neighborhood density, vowel pattern frequency, and harmonic class size.
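As a rough illustration of two of the statistics mentioned above, here is a minimal Python sketch of word frequency and neighborhood density computed over plain text. It is not the code used to build or query the corpus. The function names, the tokenization rule (runs of Cyrillic letters), and the toy sentence are all assumptions made for the example, and density is counted over spellings, as an orthographic stand-in for phonological neighbors.

```python
import re
from collections import Counter

# Assumption: a token is a run of Cyrillic letters, including Mongolian ө and ү.
TOKEN_RE = re.compile(r"[а-яёөү]+")


def tokenize(text):
    """Lowercase the text and return its Cyrillic word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_frequencies(text):
    """Map each word type to its token count."""
    return Counter(tokenize(text))


def neighborhood_density(word, lexicon, alphabet):
    """Count lexicon entries that differ from `word` by exactly one
    letter substitution, insertion, or deletion (spelling-based)."""
    candidates = set()
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidates.add(word[:i] + ch + word[i:])  # insertion
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])  # deletion
            for ch in alphabet:
                candidates.add(word[:i] + ch + word[i + 1:])  # substitution
    candidates.discard(word)  # a word is not its own neighbor
    return len(candidates & lexicon)


# Hypothetical toy input, not taken from the corpus.
sample = "гар гал гар нар"
freqs = word_frequencies(sample)
lexicon = set(freqs)
alphabet = set("".join(lexicon))

print(freqs["гар"])                                    # 2
print(neighborhood_density("гар", lexicon, alphabet))  # 2 (гал, нар)
```

On real corpus text, the alphabet would more sensibly be the full Mongolian Cyrillic letter set, and a phonological version would first need a grapheme-to-phoneme conversion step, which this sketch leaves out.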

I’ve made this corpus publicly available on the PsyCol Lab’s website at the University of Arizona. It is currently housed there because the site makes it easy to access through a user interface. If you would like further information, or would like to access the corpus without the user interface, please don’t hesitate to contact me.