Analyzing the Impact of Synthetic Speech on Spoken Language Identification
Abstract
This research examines how synthetic speech affects the performance of spoken language identification systems by evaluating acoustic, temporal, and rhythmic feature types across multiple machine learning and deep learning architectures. Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), and Long Short-Term Memory (LSTM) models were tested in three scenarios: identifying languages from natural speech, from synthetic speech, and from a mix of both. The study also investigated whether combining all feature types could enhance system performance. The results showed that the Mel spectrogram was consistently the most effective feature across all tested models, reaching 100% accuracy, with MLP and LSTM achieving the best overall results. MFCC also reached 100% accuracy in the synthetic speech scenario, making it the second most effective feature. Notably, combining all features did not always improve performance, underscoring the importance of strategic feature selection. The study also addressed challenges such as variability in natural speech recordings and imbalanced dataset distributions, emphasizing the need for robust data augmentation. By clarifying the interactions between feature types, model architectures, and speech data sources, this research supports the development of more accurate and resilient spoken language identification systems.
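To make the feature pipeline concrete, the following is a minimal numpy-only sketch of how a Mel spectrogram, the study's strongest feature, can be computed from a waveform. The paper itself would typically use a toolkit such as librosa for this step; the parameters here (16 kHz sampling rate, 512-point FFT, 256-sample hop, 40 mel bands) are illustrative assumptions, not the study's reported configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale between 0 Hz and sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power spectrum per frame
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.array(frames).T            # shape: (n_fft//2 + 1, n_frames)
    return mel_filterbank(sr, n_fft, n_mels) @ power

# Usage: a 1-second synthetic tone stands in for a speech recording
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
M = mel_spectrogram(tone, sr=sr)
print(M.shape)                            # (n_mels, n_frames)
```

MFCCs, the second-strongest feature reported, are conventionally derived from this representation by taking the logarithm of the mel energies and applying a discrete cosine transform. The resulting matrices (or statistics over them) are then fed to classifiers such as the MLP, SVM, and LSTM models evaluated in the study.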