Combined CNN-LSTM for Improving Clean and Noisy Speech Recognition


Noussaiba Djeffal
Djamel Addou
Hamza Kheddar
Sid Ahmed Selouani

Abstract

This article presents a hybrid convolutional neural network and long short-term memory (CNN-LSTM) approach to automatic speech recognition (ASR) using deep learning techniques on the Aurora-2 database. This database includes clean and multi-condition modes, covering four noise scenarios: subway, babble, car, and exhibition hall, each evaluated at several signal-to-noise ratios (SNRs) as well as in a clean condition, and the results are compared with those obtained on the ASC-10 and ESC-10 datasets. The problem addressed is the need for robust ASR models that perform well in both noisy and clean environments. The motivation for the CNN-LSTM architecture is to improve recognition performance by combining the strengths of CNNs and LSTMs, rather than relying on either model in isolation. Experimental results show that the combined CNN-LSTM model achieves high classification performance in clean conditions on the Aurora-2 dataset, reaching an accuracy of 97.96% and outperforming the individual CNN and LSTM models, which achieved 97.21% and 96.06%, respectively. Under noisy conditions, the hybrid model again outperforms both baselines, with an accuracy of 90.72%, compared to 90.12% for the CNN and 86.12% for the LSTM. These results indicate that the hybrid CNN-LSTM model is more effective at handling varied noise conditions and at improving overall speech recognition accuracy.
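
For illustration, the sketch below shows one common way such a hybrid can be wired up in Keras: a small CNN front end extracts local time-frequency patterns from spectrogram-like features, and an LSTM layer then models the temporal structure of the resulting frame sequence. The input shape, layer sizes, and 10-class output are assumptions made for the example, not the authors' exact configuration.

```python
# Minimal sketch of a hybrid CNN-LSTM classifier for speech features.
# Assumes log-mel spectrogram inputs of shape (time, mels) and a
# 10-class output (hypothetical, e.g. one class per spoken digit);
# all layer sizes are illustrative, not the paper's configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10              # hypothetical number of target classes
TIME_STEPS, N_MELS = 98, 40   # hypothetical feature dimensions

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, N_MELS, 1)),
    # CNN front end: learns local time-frequency patterns
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    # Collapse the frequency and channel axes so each remaining
    # time step becomes a single feature vector
    layers.Reshape((TIME_STEPS // 4, (N_MELS // 4) * 64)),
    # LSTM back end: models temporal dependencies across frames
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The Reshape layer is the glue between the two halves: it flattens the pooled frequency and channel axes so that the LSTM receives one feature vector per time frame, which is what lets the recurrent back end model the sequence that the convolutional front end has summarized.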


How to Cite
Djeffal, N., Addou, D., Kheddar, H., & Selouani, S. A. (2024). Combined CNN-LSTM for improving clean and noisy speech recognition. AL-Lisaniyyat, 30(2), 5-26. Retrieved from https://crstdla.dz/ojs/index.php/allj/article/view/732
Section
Articles
