Mongolian, Tibetan, and Uyghur Speech Data from Chinese Minority Region in 2015

Mongolian, Tibetan, and Uyghur speech data from Chinese minority regions in 2015

Wei Xiangfeng1*, Yuan Yi1, Zhang Quan1, Chi Zhejie1,2 

1.Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, P. R. China

2.University of Chinese Academy of Sciences, Beijing 100049, P. R. China

* Email: wxf@mail.ioa.ac.cn

Abstract: This paper introduces a data set of Mongolian, Tibetan and Uyghur speech collected in 2015 with a remote speech acquisition software system based on a Client/Server architecture. The system reduced the cost and improved the efficiency of collecting Mongolian, Tibetan and Uyghur speech data. The data set contains nearly 800 sentences, with a total size of 136 MB. The speech data are of theoretical and practical value for speech analysis and teaching, and for speech recognition and synthesis of the minority languages of China. With slight modification, the system can be used to acquire speech in other languages or dialects; it is easy to operate and inexpensive to install.

Keywords: speech data; Chinese minorities; Mongolian Tibetan and Uyghur; recording; remote collection

Dataset Profile

Chinese title: 2015年中国少数民族地区蒙藏维言语录音数据集

English title: Mongolian, Tibetan, and Uyghur speech data from Chinese minority regions in 2015

Corresponding author: Wei Xiangfeng (wxf@mail.ioa.ac.cn)

Data author(s): Yuan Yi, Zhang Quan, Chi Zhejie, Wei Xiangfeng

Time range: 2015

Geographical scope: Inner Mongolia, Qinghai, Tibet and Xinjiang in China

Data format: MP3

Data service system:

Source(s) of funding: The “Integration and Application of Basic Science Data in Minority Information Processing” Project – a Key Database Project of the S&T Data and Resource Integration and Sharing Program under the Informatization Special Program of the Chinese Academy of Sciences (CAS)

Dataset composition: The data set consists of three parts, namely, Mongolian speech data, Tibetan speech data and Uyghur speech data.

1. Introduction

China is a multi-ethnic country, and the speech of each Chinese minority has its own characteristics. Collecting and preserving minority speech data is important for studying the speech characteristics of Chinese minority languages, spreading Chinese minority culture, and promoting informatization in the minority regions. Three methods are commonly used to collect minority speech data. The first is field-survey linguistics, in which speech is collected on site by audio or video recording of, for instance, oral stories and poetry. The second is studio recording, in which professional recording equipment is used to record selected speakers reading selected texts. The third is recording in a natural environment or in daily life, such as at conferences or over telephone channels. In speech recognition, synthesis, and detection, the second method is more popular than the other two.

To meet the needs of a speech recognition system, Fei Long recorded speech from 200 speakers with a standard Mongolian accent and built a 69,136-sentence speech corpus, sampled at 16 kHz with 16-bit PCM quantization [1]. Shan Dan recorded broadcasters’ high-quality, standard pronunciation of isolated Mongolian phonetic symbols and words for machine evaluation, using a 44.1 kHz sampling rate, 16-bit sampling precision and a single channel [2]. To build the corpus for a Lhasa Tibetan speech synthesis system, Chen Xiaoying recorded 3,000 sentences of Tibetan speech in a professional recording studio with an external sound card, microphone, mixer and Audition software, keeping the recording quality, speaking rate, and vocal style of the speakers consistent [3]. Reyiman Tursun et al. recorded representative Uyghur sentences (each containing 10 to 20 syllables and as many Uyghur triphone units as possible) drawn from radio, television, literature, art works and dictionaries, using an IBM notebook computer, a full-duplex card, a high-sensitivity built-in microphone, a GaoBao vertical-type microphone (impedance 250 Ω, sensitivity -56 ± 3 dB, frequency response range 100 Hz – 16 kHz), and WavRecode recording software in a natural recording environment [4]. Yang Yating et al. collected Uyghur speech data over one microphone channel and two telephone channels, at a sampling rate of 8 kHz, to establish a spoken Uyghur speech corpus. The two telephone channels were sampled with sound-collecting software and hardware, and the microphone channel was recorded with CoolEdit software [5].

To avoid the cost of a professional studio, we focused on remote collection of Mongolian, Tibetan, and Uyghur speech data. The work was supported by the “Integration and Application of Basic Science Data in Minority Information Processing” Project, a Key Database Project of the S&T Data and Resource Integration and Sharing Program under the Informatization Special Program of the Chinese Academy of Sciences (CAS). The project was led by Hefei Institutes of Physical Science, CAS, in collaboration with Institute of Software, CAS, and Institute of Acoustics, CAS, both of which have rich resources and experience in minority language information processing and natural language processing. The project has two aims: one is to integrate the language data resources (Chinese/Mongolian/Tibetan/Uyghur) for ethnic information processing held by CAS institutes, and the other is to support CAS’s work on informatization and the popularization of science and to provide public data and statistical services for informatization in the minority regions of China. Hefei Institutes of Physical Science holds a Chinese-Mongolian dictionary, a Chinese-Mongolian sentence-aligned corpus, a Chinese-Uyghur dictionary, and a Chinese-Uyghur sentence-aligned corpus [6]. Institute of Software holds a Chinese-Tibetan dictionary and a Chinese-Tibetan sentence-aligned corpus [7].

Based on texts from the Chinese-Mongolian, Chinese-Tibetan, and Chinese-Uyghur bilingual parallel corpora provided by the two institutes, this paper describes how we recorded speech data from remote minority language speakers in Inner Mongolia, Qinghai, Tibet and Xinjiang. We implemented a multi-user, multi-client software system for collecting remote speech data using a Client/Server architecture. On the Client side, a user (who speaks Mongolian, Tibetan or Uyghur) records his or her speech with a standard professional microphone and a laptop computer. After launching our dedicated Client software, the user reads aloud the sentences displayed on the Client screen. Once recording is completed, the user uploads the speech data directly to the Server through the Client software over the Internet. The Server receives the data (including speech data) from the Client software and stores the associated information (including the user ID, task ID, and corresponding text) in a MySQL database.
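The Server-side storage step can be sketched as follows. This is only an illustration under stated assumptions: the table layout, column names and function names are hypothetical, and Python's built-in sqlite3 module stands in for the MySQL database so that the sketch runs without a server.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not the authors'.
SCHEMA = """
CREATE TABLE recording (
    id        INTEGER PRIMARY KEY,
    user_id   TEXT NOT NULL,
    task_id   TEXT NOT NULL,
    sentence  TEXT NOT NULL,   -- text the speaker read aloud
    audio     BLOB NOT NULL,   -- uploaded speech data
    audited   INTEGER          -- NULL = not yet audited, 1 = pass, 0 = fail
)
"""


def open_store():
    """Create an in-memory store (sqlite3 stands in for the MySQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def store_upload(conn, user_id, task_id, sentence, audio):
    """Record one uploaded utterance together with its metadata."""
    cur = conn.execute(
        "INSERT INTO recording (user_id, task_id, sentence, audio) "
        "VALUES (?, ?, ?, ?)",
        (user_id, task_id, sentence, audio),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping the user ID, task ID and text alongside each recording is what later allows an auditor to compare the audio with the sentence that was supposed to be read.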

2. Data collection and processing

Our remote collection system consists of two parts: a Client (or Clients) and a Server. The Server uses a MySQL database for data management, while the Client uses Microsoft SQL Server Compact to manage its local data, which is much smaller than the data at the Server side. Data transmission and information exchange are required between a Client and the Server, in order to achieve the Client’s functions, including “log in / log out”, “download text corpus”, “upload recorded speech data”, and so on (Figure 1).

Figure 1  Architecture of the remote collection system for speech data

One of the main purposes of our project is to acquire Mongolian, Tibetan, and Uyghur speech data. The reading material was provided by Hefei Institutes of Physical Science, CAS, and Institute of Software, CAS, and was selected from the Chinese-Mongolian, Chinese-Tibetan, and Chinese-Uyghur bilingual parallel corpora [6, 7]. A remote speech data collection task includes the following steps: designing, assigning, recording, uploading, and auditing. Speech data that pass the audit are stored in the database. Each task is designed to contain 100 – 1,000 sentences, randomly selected from the bilingual parallel corpus. One user can perform multiple recording tasks, and one task can be assigned to multiple users at a time.
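The task design step (random selection of 100 – 1,000 sentences) can be sketched as follows. The function name, the list-of-strings corpus representation and the optional seed are assumptions made for illustration; the paper does not describe how the selection is implemented.

```python
import random

MIN_SENTENCES, MAX_SENTENCES = 100, 1000  # task size range stated in the text


def design_task(corpus, n_sentences, seed=None):
    """Randomly pick n_sentences distinct entries from the corpus (a list)."""
    if not MIN_SENTENCES <= n_sentences <= MAX_SENTENCES:
        raise ValueError("a task holds between 100 and 1,000 sentences")
    if n_sentences > len(corpus):
        raise ValueError("corpus is smaller than the requested task")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(corpus, n_sentences)
```

Sampling without replacement ensures that no sentence appears twice within a task, while the same task can still be assigned to several speakers.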

The following steps describe a successful remote collection task: (i) Staff design a task in the Server software (to determine the language, number of sentences or range of the task); (ii) Staff assign the task to a user/speaker in a remote minority region, and notify the user/speaker to download the text corpus of the task; (iii) The user/speaker downloads the text corpus using the Client software; (iv) After successful download, the speaker reads aloud the text sentence by sentence and records his or her speech with the Client software; (v) After reading all the sentences in the task, the speaker uploads his or her speech data to the Server through the Client software; (vi) Staff or language experts audit the recorded speech data through a Web auditing system to judge whether the uploaded data are qualified; (vii) Staff store all qualified speech data into the minority language (Mongolian, Tibetan, Uyghur) speech database.
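Steps (i) to (vii) form a linear workflow, which can be sketched as a simple state machine. The state names and the `advance` function are hypothetical labels chosen for this illustration, not identifiers from the actual system.

```python
from enum import Enum


class TaskState(Enum):
    DESIGNED = "designed"      # (i) staff design the task
    ASSIGNED = "assigned"      # (ii) task assigned to a speaker
    DOWNLOADED = "downloaded"  # (iii) speaker downloads the text corpus
    RECORDED = "recorded"      # (iv) speaker records every sentence
    UPLOADED = "uploaded"      # (v) speaker uploads the speech data
    AUDITED = "audited"        # (vi) experts audit the recordings
    STORED = "stored"          # (vii) qualified data stored in the database


ORDER = list(TaskState)  # Enum members iterate in definition order


def advance(state):
    """Return the state that follows `state`; STORED is terminal."""
    idx = ORDER.index(state)
    if idx == len(ORDER) - 1:
        raise ValueError("task is already stored")
    return ORDER[idx + 1]
```

Modelling the steps as ordered states makes it impossible, for example, to audit a task before the recordings have been uploaded.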

After the text corpus of a new task has been downloaded to the Client, the user can read the text aloud for recording. The Client has a relatively independent recording module, which provides: (i) display of text in the minority languages; (ii) functions for starting, stopping, and canceling a recording, for playing back recordings, and for re-recording; (iii) easy browsing to the previous or next sentence and its recording; (iv) ways to increase or decrease the font size, to change the font, and so on. Figure 2 shows the user interface of the recording module at the Client while a Tibetan sentence is being recorded.

Figure 2  User interface of the recording module at the Client

3. Sample description

A data sample typically consists of three files: the first is a TXT file which contains a sentence written in Chinese; the second is a TXT file which contains the sentence written in a minority language (Mongolian, Tibetan or Uyghur); and the third is an MP3 file which records the speech of the sentence (read aloud in a minority language).
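The three-file structure of a sample can be represented as follows. This is a minimal sketch: the class name, field names and the extension check are assumptions for illustration, and the paper does not specify a file naming scheme.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Sample:
    chinese_text: Path   # TXT file: the sentence in Chinese
    minority_text: Path  # TXT file: the sentence in Mongolian, Tibetan or Uyghur
    audio: Path          # MP3 file: the sentence read aloud

    def has_expected_extensions(self):
        """Check that the two text files are .txt and the audio file is .mp3."""
        return (
            self.chinese_text.suffix.lower() == ".txt"
            and self.minority_text.suffix.lower() == ".txt"
            and self.audio.suffix.lower() == ".mp3"
        )
```

A check of this kind is a cheap way to catch incomplete samples before any audio processing is attempted.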

Figure 3 shows a Chinese sentence, Figure 4 shows the sentence in Mongolian, and Figure 5 shows the MP3 file that records the sentence read aloud by a Mongolian speaker.

Figure 3  A sample sentence written in Chinese

Figure 4  The sentence in Mongolian

Figure 5  Speech record of the sentence in MP3 format

Figure 6 shows a Chinese sentence, Figure 7 shows the sentence in Tibetan, and Figure 8 shows the MP3 file that records the sentence read aloud by a Tibetan speaker.

Figure 6  A sample sentence written in Chinese

Figure 7  The sentence in Tibetan

Figure 8  Speech record of the sentence in MP3 format

Figure 9 shows a Chinese sentence, Figure 10 shows the sentence in Uyghur, and Figure 11 shows the MP3 file that records the sentence read aloud by a Uyghur speaker.

Figure 9  A sample sentence written in Chinese

Figure 10  The sentence in Uyghur

Figure 11  Speech record of the sentence in MP3 format

4. Quality control and assessment

To ensure the quality of the speech data on the Server, all speech data uploaded by users through the Client software must be audited. We developed a Web auditing system for the minority language speech data, in which an auditor can review the speech data remotely over the Internet. Each recording is marked as pass or fail. Recordings that fail the audit are not stored in the minority language speech database.

Auditors are experts in one of the minority languages or in speech data analysis. To ensure the quality of the data, each recording is audited at least once by each of no fewer than two experts. Recordings exhibiting any of the following features are marked as unqualified: (i) obvious background noise; (ii) unclear, low-volume or silent voice; (iii) excessive noise from microphone vibration; (iv) long pauses or silence; (v) many words misread or inconsistent with the original text; (vi) unnatural speech; (vii) other features that the auditor deems disqualifying. If only some of the sentences in a task are qualified, the unqualified sentences are rejected and only the qualified speech data are stored in the database. Because the speakers were selected, tested and trained before formal recording started, the pass rate of the collected speech data, calculated by number of sentences, is greater than 97.1%.
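A sentence-level pass rate of the kind reported above can be computed as in the following sketch. The rule for combining verdicts (a sentence passes only if at least two auditors reviewed it and none of them failed it) is an assumption made for illustration; the paper does not state how conflicting verdicts are resolved, and the function names are hypothetical.

```python
def sentence_passes(verdicts, min_auditors=2):
    """Assumed rule: enough auditors reviewed the sentence and none failed it."""
    return len(verdicts) >= min_auditors and all(verdicts)


def pass_rate(audits, min_auditors=2):
    """Fraction of sentences that pass, counted per sentence.

    `audits` maps a sentence ID to the list of boolean verdicts it received.
    """
    if not audits:
        raise ValueError("no audited sentences")
    passed = sum(sentence_passes(v, min_auditors) for v in audits.values())
    return passed / len(audits)
```

For example, with four sentences of which two receive unanimous passes from at least two auditors, `pass_rate` returns 0.5.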

5. Usage notes

This data set shares nearly 800 sentence-level speech recordings in the Mongolian, Tibetan and Uyghur languages. To the best of our knowledge, no similar open speech data exist in China, so the data set fills a gap by providing an open Mongolian, Tibetan, and Uyghur speech corpus. It can be used for studies of speech parameters and for minority language teaching, as well as for the development and application of speech recognition and synthesis systems. It therefore has broad academic and social value.

The Client/Server architecture of our acquisition system allows several speakers to record remotely at the same time. It greatly reduces the cost and difficulty of data collection and improves efficiency. The system is suitable for collecting Mongolian, Tibetan, and Uyghur speech data, and, after minor changes to the character encoding, data in other minority languages or dialects. It can not only promote the collection and preservation of Chinese minority languages, but also help develop minority speech applications.

Acknowledgments

Thanks go to Chen Lei from Hefei Institutes of Physical Science, CAS, Ma Longlong and Liu Huidan from Institute of Software, CAS, who provided some constructive advice.

References

1. Fei L. Study and improvement on the Mongolian speech recognition system. Master’s Thesis, University of Inner Mongolia, 2009.

2. Shan D. Designing and building speech database for machine testing of Mongolian standard pronunciation. Journal of the Western Mongolian Studies, (2010): 58 – 62.

3. Chen X. Studying and building the speech synthesis corpus of Tibetan Lhasa dialect. Science & Technology Information, (2013): 13 – 14.

4. Tursun R & Muhammat I. Research and implementation of the Uyghur speech corpus MIS. Journal of Xinjiang University (Natural Science Edition), 28 (2011): 242 – 247.

5. Yang Y, Ma B, Wang L et al. Research on the Uyghur spoken language speech corpus. The Fifth Youth Workshop of Computational Linguistics (YWCL2010), (2010): 208 – 214.

6. Zhu Z, Li M, Chen L et al. Building comparable corpus based on bilingual LDA model. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), (2013): 278 – 282.

7. Liu H, Nuo M, Ma L et al. Mining Tibetan Web text resources and its application. Journal of Chinese Information Processing, 29 (2015): 170 – 177.

Data citation

1. Wei X, Yuan Y, Zhang Q et al. Mongolian, Tibetan, and Uyghur speech data from Chinese minority regions in 2015. Science Data Bank. DOI: 10.11922/sciencedb.120.30

Authors and contributions

Wei Xiangfeng, PhD, Associate Professor; research field: natural language processing, speech recognition and speech synthesis. Contribution: overall technical design, project organizing and implementation.

Yuan Yi, BS, Senior Engineer; research field: natural language processing, speech recognition and speech synthesis. Contribution: implementation of the speech data processing system in Server.

Zhang Quan, PhD, Professor; research field: natural language processing, speech recognition and speech synthesis. Contribution: implementation of the auditing speech data system.

Chi Zhejie, PhD Candidate; research field: natural language processing, speech recognition and speech synthesis. Contribution: implementation of the speech data collecting system in Client.

 

 

How to cite this article: Wei X, Yuan Y, Zhang Q et al. Mongolian, Tibetan, and Uyghur speech data from Chinese minority regions in 2015. China Scientific Data 2 (2016), DOI: 10.11922/csdata.120.2015.0024
