DEVELOPMENT OF THE TAJIK SPEECH CORPORA FOR SOLVING SOME PROBLEMS OF COMPUTER LINGUISTICS

Authors

Khudoiberdiev H.A. – Candidate of Physics and Mathematics, Associate Professor, Department of Programming and Information Systems, Polytechnic Institute of Tajik Technical University, Khujand, Republic of Tajikistan, tajlingvo@gmail.com

Muzafarov D.Z. – Candidate of Physics and Mathematics, Associate Professor, Department of Programming, Khujand State University, Khujand, Republic of Tajikistan, muzafarov.dilshod@gmail.com

Ashurova Sh.N. – Senior Lecturer, Department of Programming and Information Systems, Polytechnic Institute of Tajik Technical University, Khujand, Republic of Tajikistan, shnurulloevna@gmail.com

Annotation

The article proposes a scientific concept for the development of a Tajik speech corpus and outlines the stages of planning it. The purpose of the corpus is to support important tasks of computational linguistics: voice control, speech synthesis and speech recognition. The authors note that these issues have been studied far less for Tajik than for English and Russian. The main proposed methods are automatic processing of text elements, preliminary analysis of audio data and formation of a corpus database. The planned corpus will contain 1000 hours of speech recordings obtained from different speakers, with speaker age and gender taken into account. Software modules for processing the corpus will then be developed, including modules for voice control of computer tools and for automatic speech synthesis and recognition. The proposed approaches are based on modern methods of mathematical modeling, data analysis and artificial intelligence. The authors note that implementing the proposed approach will make it possible to solve these speech processing tasks for Tajik. The research results can find wide application in scientific research, education and industry in the Republic of Tajikistan, and the developed corpus can serve as a foundation for research and development in computational linguistics for the Tajik language.
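As an illustration of what forming a corpus database with age and gender metadata could look like, the following is a minimal sketch and not the authors' design. The record fields, the age bands and the helper names (`Recording`, `age_band`, `hours_by_group`, `remaining_hours`) are all assumptions introduced here; only the 1000-hour target comes from the annotation.

```python
from collections import defaultdict
from dataclasses import dataclass

TARGET_HOURS = 1000  # corpus size stated in the annotation


@dataclass(frozen=True)
class Recording:
    """One corpus entry; all field names are hypothetical."""
    speaker_id: str
    gender: str        # e.g. "f" or "m"
    age: int           # speaker age in years
    duration_s: float  # length of the audio in seconds
    transcript: str    # Tajik text that was read or spoken


def age_band(age: int) -> str:
    """Map an age to an illustrative band; the cut-offs are assumptions."""
    if age < 18:
        return "under 18"
    if age < 40:
        return "18-39"
    if age < 60:
        return "40-59"
    return "60+"


def hours_by_group(recordings):
    """Sum recorded hours per (gender, age band) pair."""
    totals = defaultdict(float)
    for rec in recordings:
        totals[(rec.gender, age_band(rec.age))] += rec.duration_s / 3600
    return dict(totals)


def remaining_hours(recordings) -> float:
    """Hours still needed to reach TARGET_HOURS (never negative)."""
    collected = sum(rec.duration_s for rec in recordings) / 3600
    return max(TARGET_HOURS - collected, 0.0)
```

A tally of this kind would let a collection team see which speaker groups are under-represented while recording is still in progress.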

Key words

Tajik language, text corpus, speech corpus, speech data analysis, speech technologies, speech recognition.

Language

English

Type

technical

Year

2023

Page

14

Publication date

2023-10-11