Neural Speech Assessment and Its Application
Moderator:

Speakers:

Dr. Erica Cooper, National Institute of Information and Communications Technology, Japan
Title: Progress and Challenges in DNN-based Objective Quality Assessment of Synthesized Speech
Abstract: The field of speech synthesis has advanced rapidly in recent years, and evaluation methodologies for synthesized speech have evolved as well. While listening tests are the gold standard for evaluating synthesized speech, they are costly and time-consuming, leading researchers to consider more automatic and objective evaluation metrics. In this talk, we give an overview of machine-learning-based approaches to predicting the quality of synthesized speech, with a focus on modern deep neural network (DNN) based approaches for MOS prediction: supervised task-specific training, approaches built on pretrained self-supervised speech models, unsupervised approaches, and more recent approaches based on large language models. We will also discuss the current state of objective evaluation of synthesized speech, including open research challenges and future directions.
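For readers unfamiliar with the self-supervised (SSL) family of predictors mentioned above, the following is a minimal sketch, not Dr. Cooper's system: a pretrained wav2vec 2.0 encoder with a small regression head, trained against human MOS labels. The specific encoder, mean pooling, single linear head, and L1 loss are all illustrative assumptions.

import torch
import torch.nn as nn
import torchaudio

class SSLMOSPredictor(nn.Module):
    """Illustrative SSL-based MOS predictor: pretrained encoder fine-tuned
    jointly with a small regression head."""
    def __init__(self):
        super().__init__()
        # Pretrained wav2vec 2.0 base encoder (expects 16 kHz mono input).
        self.ssl = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        self.head = nn.Linear(768, 1)  # frame features -> scalar MOS

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples); use the last SSL layer's frame features.
        feats, _ = self.ssl.extract_features(wav)
        pooled = feats[-1].mean(dim=1)        # average over time frames
        return self.head(pooled).squeeze(-1)  # (batch,) predicted MOS

model = SSLMOSPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One dummy training step against human ratings on a 1-5 scale.
wav = torch.randn(2, 16000)            # two 1-second utterances
target_mos = torch.tensor([3.5, 4.2])  # dummy human MOS labels
loss = nn.functional.l1_loss(model(wav), target_mos)
optimizer.zero_grad()
loss.backward()
optimizer.step()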
Bio: Erica Cooper completed the Ph.D. degree at Columbia University in the City of New York in 2019, with a research focus on text-to-speech synthesis for low-resource languages. She worked as a postdoctoral researcher at the National Institute of Informatics from February 2019 to March 2024, contributing to the JST-ANR CREST VoicePersonae project. She joined NICT as a senior researcher in April 2024. She is one of the founding organizers of the VoiceMOS Challenge series, which began in 2022.

Prof. Wen-Chin Huang, Graduate School of Informatics, Nagoya University, Japan
Title: Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities
Abstract: Speech quality assessment (SQA) is the evaluation of speech quality, and developing accurate automatic SQA methods that reflect human perception has become increasingly important in order to keep up with the generative AI boom. In recent years, SQA has progressed to the point that researchers have begun to use automatic SQA in research papers as a rigorous measure of the quality of speech generation systems. We believe that recent scientific challenges and open-source activities have stimulated growth in this field. In this talk, we review recent challenges as well as open-source implementations and toolkits for SQA, and we highlight the importance of maintaining such activities to facilitate the development not only of SQA itself but also of generative AI for speech.
Bio: Wen-Chin Huang is currently an assistant professor at the Graduate School of Informatics, Nagoya University, Japan. He received the B.S. degree from National Taiwan University, Taiwan, in 2018, and the M.S. and Ph.D. degrees from Nagoya University, Japan, in 2021 and 2024, respectively. He was a co-organizer of the Voice Conversion Challenge 2020, the Singing Voice Conversion Challenge 2023 and 2025, the VoiceMOS Challenge 2022, 2023, and 2024, and the AudioMOS Challenge 2025. His main research interest is speech processing, with a focus on speech generation, including voice conversion and speech quality assessment. He received the Best Student Paper Award at ISCSLP 2018, the Best Paper Award at APSIPA ASC 2021, and the 16th IEEE Signal Processing Society Japan Best Student Journal Paper Award.

Dr. Ryandhimas E. Zezario, Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Title: Non-Intrusive Intelligibility Prediction for Hearing Aids: Recent Advances, Trends, and Challenges
Abstract: Improving speech understanding in noisy environments is an important objective in the development of hearing aid (HA) devices. To support this objective, it is essential to have a reliable metric that can accurately predict speech intelligibility for HA users. While subjective listening tests remain the gold standard for intelligibility evaluation, they are costly and time-consuming. As a result, a series of deep learning–based approaches have been proposed to perform automatic evaluation. With the growing interest in deploying reliable neural speech assessment models, this talk aims to highlight recent advances in non-intrusive intelligibility prediction for HAs, where the goal is to estimate speech intelligibility without requiring clean reference signals. We discuss emerging trends, including the use of acoustic representations, the design of suitable loss functions, and integration with hearing aid signal processing pipelines. In addition, we examine challenges such as generalization and robustness across conditions, as well as the gap between predicted and ground-truth intelligibility. The talk concludes with perspectives on future directions for non-intrusive intelligibility prediction in hearing aid applications.
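As a concrete illustration of the non-intrusive setting described above, here is a minimal sketch of a generic design, not Dr. Zezario's model: a predictor that maps the degraded signal alone, with no clean reference, to a score in [0, 1]. The log-mel front end, BLSTM, and sigmoid output range are assumed, illustrative choices.

import torch
import torch.nn as nn
import torchaudio

class NonIntrusiveIntelligibility(nn.Module):
    """Illustrative non-intrusive predictor: degraded waveform in, score out."""
    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Log-mel front end; no clean reference signal is ever needed.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        self.rnn = nn.LSTM(n_mels, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> log-mel frames: (batch, time, n_mels)
        mel = torch.log(self.melspec(wav) + 1e-8).transpose(1, 2)
        out, _ = self.rnn(mel)
        # Sigmoid keeps predictions in [0, 1], matching scores such as
        # word correctness rates from listening tests.
        return torch.sigmoid(self.head(out.mean(dim=1))).squeeze(-1)

model = NonIntrusiveIntelligibility()
score = model(torch.randn(1, 16000))  # score for a 1-second dummy signal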
Bio: Ryandhimas E. Zezario received the Ph.D. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2023. He is currently a Postdoctoral Researcher with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His research interests include speech enhancement, non-intrusive quality assessment, speech processing, speech and speaker recognition, and deep learning. He received the Gold Prize for the best non-intrusive system (first place in the Hearing Industry Research Consortium student prizes) at the Clarity Prediction Challenge 2022, and the Best Reviewer Award at IEEE ASRU 2023.

Dr. Yu Tsao, Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Title: Learning to Evaluate: Neural Speech Assessment for Downstream Speech Applications
Abstract: Neural speech assessment employs deep learning models to predict key speech properties, including intelligibility, perceptual quality, background noise level, and distortion. Unlike traditional metrics, these models are trained on large datasets to closely align with human perceptual judgments, making them effective across various acoustic conditions. They serve as objective tools for evaluating speech systems and datasets and can operate in real time to support online quality monitoring and adaptive processing. Additionally, neural speech assessment guides training for generative tasks such as text-to-speech, voice conversion, and speech enhancement by providing perceptually meaningful objectives. It also contributes to spatial audio tasks like beamforming by offering real-time direction and quality cues, making it essential for modern speech technology development and deployment.
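One way to read the point above about perceptually meaningful objectives is the following minimal sketch, a generic recipe with assumed stand-in models rather than a system from the talk: a frozen, differentiable quality predictor is added to the training loss of a speech enhancement model, so the enhancer is pushed toward outputs the assessor rates highly. Both networks and the 0.1 weight are illustrative.

import torch
import torch.nn as nn

# Stand-in models; in practice the predictor would be a trained MOS or
# intelligibility model whose weights are frozen.
enhancer = nn.Sequential(
    nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 1, 9, padding=4))
quality_predictor = nn.Sequential(
    nn.Conv1d(1, 8, 9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 1))
for p in quality_predictor.parameters():
    p.requires_grad_(False)  # freeze the assessor; only the enhancer learns

noisy = torch.randn(4, 1, 16000)  # dummy noisy waveforms
clean = torch.randn(4, 1, 16000)  # paired clean references (dummy data)

enhanced = enhancer(noisy)
# Signal-level loss plus a perceptual term: maximize the predicted quality
# of the enhanced output (hence the negative sign and small weight).
loss = (nn.functional.l1_loss(enhanced, clean)
        - 0.1 * quality_predictor(enhanced).mean())
loss.backward()  # gradients flow through the frozen assessor to the enhancer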
Bio: Yu Tsao received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher at the National Institute of Information and Communications Technology (NICT), Tokyo, Japan, where he conducted research and product development in multilingual speech-to-speech translation systems, focusing on automatic speech recognition. He is currently a Research Fellow (Professor) and the Deputy Director at the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. He also holds a joint appointment as a Professor in the Department of Electrical Engineering at Chung Yuan Christian University, Taoyuan, Taiwan. His research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He serves as an Associate Editor for IEEE Transactions on Consumer Electronics and IEEE Signal Processing Letters. He received the Outstanding Research Award from Taiwan’s National Science and Technology Council (NSTC) and the 2025 IEEE Chester W. Sall Memorial Award, and he was the corresponding author of a paper that won the 2021 IEEE Signal Processing Society Young Author Best Paper Award.