Combined CNN-LSTM for Enhancing Clean and Noisy Speech Recognition
Abstract
This paper presents a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) approach for Automatic Speech Recognition (ASR) using deep learning techniques on the Aurora-2 dataset. The dataset provides both clean and multi-condition modes, with the noisy data covering four scenarios (subway, babble, car, and exhibition hall), each evaluated at several signal-to-noise ratios (SNRs); results are also compared with those obtained on the ASC-10 and ESC-10 datasets. The problem addressed is the need for robust ASR models that perform well in both clean and noisy environments. The CNN-LSTM architecture is used to enhance recognition performance by combining the strengths of CNNs and LSTMs, rather than relying on either model alone. Experimental results show that the combined CNN-LSTM model achieves superior classification performance: in clean conditions on Aurora-2 it attains an accuracy of 97.96%, surpassing the individual CNN and LSTM models, which achieve 97.21% and 96.06%, respectively. In noisy conditions, the hybrid model also outperforms the standalone models, reaching an accuracy of 90.72%, compared with 90.12% for CNN and 86.12% for LSTM. These findings indicate that the CNN-LSTM model is more effective at handling diverse noise conditions and improving overall ASR accuracy.
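To illustrate the general idea behind the hybrid architecture described above, a minimal sketch of a CNN-LSTM classifier for speech feature maps is given below, written in Python with the Keras API of TensorFlow. It is not the exact configuration reported in the paper: the input shape (99 frames of 39 MFCC-style coefficients), the filter and unit counts, and the 11-class output are illustrative assumptions. A convolutional front end extracts local time-frequency patterns, the resulting feature maps are reshaped into a sequence, and an LSTM layer models the temporal dynamics before a softmax classifier.

# Minimal sketch of a hybrid CNN-LSTM classifier for speech feature maps.
# All shapes and hyperparameters below are illustrative assumptions, not the
# configuration reported in the paper.

from tensorflow.keras import layers, models

NUM_FRAMES = 99      # assumed number of feature frames per utterance
NUM_FEATURES = 39    # assumed number of MFCC-style coefficients per frame
NUM_CLASSES = 11     # assumed number of output classes

def build_cnn_lstm(num_frames=NUM_FRAMES, num_features=NUM_FEATURES,
                   num_classes=NUM_CLASSES):
    inputs = layers.Input(shape=(num_frames, num_features, 1))

    # CNN front end: learns local time-frequency patterns in the feature map.
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Collapse the frequency and channel axes so each time step becomes a vector.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

    # LSTM back end: models the temporal dynamics of the CNN features.
    x = layers.LSTM(128)(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_cnn_lstm().summary()

Under these assumptions, the same network can be trained on clean, noisy, or multi-condition feature sets simply by changing the data passed to model.fit, which is how the clean and multi-condition evaluations described above would differ.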