Arabic Diacritic-Aware Text-Audio Segmentation and Alignment Model (DASAM)

Adel Sabour, Abdeltawab Hendawi, Mohamed Ali

Abstract


This paper introduces the Diacritic-Aware Segmentation and Alignment Model for Arabic (DASAM). Diacritics are vital for pronunciation and meaning in Arabic, yet current speech recognition systems often ignore them. DASAM segments unseen audio at the word level and aligns each segment with diacritic-marked Arabic text. It first applies linguistic analysis based on intonation rules, then uses Dynamic Time Warping (DTW) to match the reference audio of each word to its position in the unseen sentence audio. The model outputs a list of words with their start and end times in the recording. Tested on the Qur’an dataset, DASAM outperforms Google Speech-to-Text (STT) in predicting word timings: its text-audio alignment accuracy is 0.959 for word start times and 0.957 for word end times, compared with 0.870 and 0.849 for Google STT. DASAM also employs signal processing techniques and remains robust across a range of audio variations. These results indicate that DASAM can serve as a fundamental building block for speech-to-text conversion and linguistic research in Arabic, particularly for applications involving diacritics.
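The DTW matching step can be illustrated with a minimal sketch of subsequence DTW, a common variant in which a short reference sequence may start and end anywhere inside a longer one. The abstract does not specify DASAM's features, distance measure, or step pattern, so everything here is an assumption: the function name `subsequence_dtw` is hypothetical, the inputs are generic per-frame feature arrays (for example MFCCs), and the local cost is Euclidean distance.

```python
import numpy as np


def subsequence_dtw(query, sentence):
    """Locate `query` (n x d frames) inside `sentence` (m x d frames).

    Illustrative sketch only, not the DASAM implementation.
    Returns (start_frame, end_frame, cost); both frame indices are inclusive.
    """
    query = np.asarray(query, dtype=float)
    sentence = np.asarray(sentence, dtype=float)
    n, m = len(query), len(sentence)

    # Euclidean distance between every query frame and every sentence frame.
    dist = np.linalg.norm(query[:, None, :] - sentence[None, :, :], axis=2)

    # acc[i, j]: cheapest alignment of query[0..i] that ends at sentence frame j.
    acc = np.full((n, m), np.inf)
    # start[i, j]: sentence frame at which that cheapest alignment began.
    start = np.zeros((n, m), dtype=int)

    # The first query frame may match any sentence frame (free starting point).
    acc[0, :] = dist[0, :]
    start[0, :] = np.arange(m)

    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        start[i, 0] = 0
        for j in range(1, m):
            candidates = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            origins = (start[i - 1, j - 1], start[i - 1, j], start[i, j - 1])
            k = int(np.argmin(candidates))
            acc[i, j] = candidates[k] + dist[i, j]
            start[i, j] = origins[k]

    # The match may end at any sentence frame (free end point).
    end = int(np.argmin(acc[n - 1]))
    return int(start[n - 1, end]), end, float(acc[n - 1, end])


# Synthetic example: a 10-frame "word" embedded in a 45-frame "sentence".
word = np.arange(1, 11, dtype=float)[:, None]
recording = np.concatenate([np.zeros(20), np.arange(1, 11), np.zeros(15)])[:, None]
first, last, cost = subsequence_dtw(word, recording)
```

Frame indices convert to times by multiplying by the frame hop duration, which depends on how the features were extracted and is not stated in the abstract.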


Keywords


Quran; Speech Recognition; Signal Processing; Phoneme Detection; NLP Techniques; Position Detection



DOI: http://dx.doi.org/10.22373/ekw.v10i1.23637



Copyright (c) 2024 Adel Sabour, Abdeltawab Hendawi, Mohamed Ali

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

P-ISSN : 2460-8912
E-ISSN : 2460-8920

ELKAWNIE


Elkawnie: Journal of Islamic Science and Technology in 2022. Published by Faculty of Science and Technology in cooperation with Center for Research and Community Service (LP2M), UIN Ar-Raniry Banda Aceh, Aceh, Indonesia.
