TAJIK-RUSSIAN PARALLEL CORPUS: DEVELOPMENT AND DESCRIPTION

Authors

            Khudoyberdiev H.A. – Candidate of Physical and Mathematical Sciences, Associate Professor, Department of Programming and Information Technologies, Polytechnic Institute of Tajik Technical University.

              Nazarov A.A. – Senior Lecturer, Department of Programming and Information Technologies, Polytechnic Institute of Tajik Technical University.

Annotation

             The first stage of development of a Tajik-Russian parallel corpus for machine translation of text from Tajik into Russian is presented. The general structure of the corpus, the structure of the text data, the algorithms, and the automated management of the corpus using the authors' program Taj-Rus-Corp are considered. An analysis of the tasks involved in developing a parallel corpus has been completed: selection of suitable texts; preprocessing; source text analysis; text comparison; creation of data processing algorithms; creation of the Taj-Rus-Corp program with text search capabilities; entry of prepared texts into the parallel corpus; statistical data analysis; and creation of experimental machine translation modules. The authors conclude that the parallel corpus will facilitate future machine translation of text from Tajik into Russian.
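
             As an illustration of the corpus structure and text search mentioned above, the following is a minimal Python sketch. It is not the Taj-Rus-Corp program; the names SentencePair, ParallelCorpus, add, and search are hypothetical, and the two example sentences are invented for illustration rather than taken from the corpus. It assumes the texts have already been preprocessed and paired sentence by sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned unit: a Tajik sentence and its Russian counterpart."""
    tajik: str
    russian: str


class ParallelCorpus:
    """Hypothetical in-memory store of aligned sentence pairs."""

    def __init__(self):
        self.pairs = []

    def add(self, tajik, russian):
        # Store the pair with surrounding whitespace removed.
        self.pairs.append(SentencePair(tajik.strip(), russian.strip()))

    def search(self, query, side="tajik"):
        # Case-insensitive substring search on one side of the corpus;
        # returns the full pairs so the translation is visible too.
        q = query.casefold()
        return [p for p in self.pairs if q in getattr(p, side).casefold()]


corpus = ParallelCorpus()
# Invented example sentences (assumption, not corpus data).
corpus.add("Ман китоб мехонам.", "Я читаю книгу.")
corpus.add("Ӯ ба мактаб меравад.", "Он идёт в школу.")

hits = corpus.search("китоб")
for pair in hits:
    print(pair.tajik, "|", pair.russian)
```

             Returning whole pairs from the search, rather than only the matching side, is a design choice made here so that each hit can be read together with its translation.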

Key words

        Tajik language, Russian language, parallel corpus, text analysis, software, database, machine translation.

 

Language 

English

Type

technical

Year

2019

Page

12

References

  1. Rastorgueva V.S. Essays on Tajik dialectology. – Stalinabad: Publishing House of the Academy of Sciences of the Tajik SSR, 1956. – 80 p.
  2. Zakharov V.P. Corpus linguistics. – SPb: SPbU. – 2005.
  3. Usmanov Z.D. On the ordered alphabetical coding of words of natural languages, Reports of the Academy of Sciences of the Republic of Tajikistan, 2012. v. 55, № 7, P. 545 – 548.
  4. Khudoyberdiev Kh.A. On automatic conversion of Tajik text to standard graphics. Reports of the Academy of Sciences of the Republic of Tajikistan, 2014. v. 57, № 3. P. 210 – 214.
  5. Usmanov Z.D., Dovudov G.M. Morphological analysis of word forms of the Tajik language (monograph). Dushanbe, “Donish”, 2015. – 130 p.
  6. Khudoyberdiev Kh.A., Soliev O.M. Linguistic Thesaurus of Tajik Language. New information technologies in automated systems. MIEM HSE. Moscow, 2017. – P. 103 – 106.
  7. Khudoyberdiev Kh.A., Rakhmonov Z.A. Logical structure and analysis of machine translation artifacts. Herald KPITTU M.S. Osimi, № 2 (7), Khujand, 2018. – P. 7 – 11.

Publication date

09/21/2023