Speaker/Style Adaptation for Digital Voice Assistants Based on Image Processing Methods

The project S-ADAPT will be carried out from 1 September 2020 to 31 August 2022 by the Faculty of Technical Sciences (FTS), University of Novi Sad, together with the Mathematical Institute (MI) of the Serbian Academy of Sciences and Arts, as a project funded by the Science Fund of the Republic of Serbia. It was accepted at the first public call for scientific and research project proposals under the Programme for the development of projects in the domain of Artificial Intelligence (AI). S-ADAPT obtained the highest grades among the 6 projects accepted within the subprogramme PRVI_P, which is oriented towards the application of AI in different spheres of life and work and aims to accelerate the social, technological, cultural and economic development of the Republic of Serbia, and whose total budget is 200,000 EUR.

About

The project S-ADAPT will investigate methods used in deep-learning based image processing and apply them to speech in order to increase the functionality of digital voice assistants, which rely on the technologies of automatic speech recognition (ASR) and text-to-speech synthesis (TTS). The project will specifically aim at achieving full flexibility of the only existing digital voice assistant application in Serbian. In terms of ASR, this means the ability of the application to adapt to the voices of different speakers, to different speaking styles, and to different recording conditions (microphone, ambient noise); in terms of TTS, it means the ability to produce synthetic speech in an arbitrary voice and an arbitrary speaking style. To this end, the project will exploit cutting-edge image style transfer methods based on domain adaptation, which are known for their modest requirements for style-specific data and are thus efficient in real-life scenarios.
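The premise that image-processing methods can be applied to speech rests on representing speech as a two-dimensional time-frequency "image". The following minimal numpy sketch illustrates this; the frame length and hop size are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np

def spectrogram(wave, n_fft=512, hop=128):
    """Short-time Fourier transform magnitude: turns a 1-D waveform into a
    2-D array (frequency bins x time frames) that image-domain methods
    such as style transfer networks can process like pixels."""
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window
              for i in range(0, len(wave) - n_fft, hop)]
    # rfft of each frame gives one spectral column per time step.
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

# A 1-second 440 Hz tone sampled at 16 kHz becomes a (257, 121) "image",
# with the energy concentrated in the row nearest 440 Hz.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = spectrogram(tone)
print(spec.shape)
```

In practice, mel-scaled or otherwise perceptually warped spectrograms are commonly used as the image-like representation, and a separate vocoder is needed to turn a modified spectrogram back into audio.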

Objectives

  • Collection and processing of a multi-speaker/multi-style speech database in Serbian, needed to establish a densely populated speaker/style space for acoustic modelling.
  • Implementation of language-independent machine learning algorithms for speaker/style adaptation based on a cycle-consistent generative adversarial network (CycleGAN) and a fixed, pre-trained autoencoder neural network, enabling the system to be trained with only a limited amount of style- or speaker-specific speech data.
  • Extension and enhancement of existing style transfer methods in image processing and their application to speech-to-speech style transfer and speech enhancement, keeping in mind the differences between image and speech as two principal channels of human-to-human communication.
  • Implementation of style neutralization and speech denoising (ASR), as well as style adaptation (TTS) modules based on the abovementioned technologies.
  • Integration of the abovementioned improvements into a digital voice assistant application for the Serbian language in order to increase its flexibility and robustness, and to evaluate the achieved improvements.
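The key property of the CycleGAN approach mentioned in the second objective is cycle consistency: a sample mapped to the other domain and back should be reproduced, which is what allows training on unpaired (and therefore limited) style-specific data. The toy sketch below illustrates only the cycle-consistency loss term; the linear "generators" stand in for the actual networks and are assumptions made purely for illustration, not the project's models.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, the usual choice for the cycle term."""
    return np.abs(a - b).mean()

def cycle_consistency_loss(G, F, x, y):
    """CycleGAN training term: F(G(x)) should recover x and G(F(y))
    should recover y, so no paired (x, y) examples are required."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# Toy "generators": G maps domain X -> Y, F maps Y -> X.
# Because they are exact inverses here, the cycle loss is near zero.
G = lambda s: 2.0 * s + 1.0
F = lambda s: (s - 1.0) / 2.0

x = np.random.randn(4, 80, 100)  # batch of source-style spectrograms
y = np.random.randn(4, 80, 100)  # batch of target-style spectrograms
loss = cycle_consistency_loss(G, F, x, y)
print(loss)  # ~0, up to floating-point rounding
```

In the full CycleGAN objective this term is combined with adversarial losses for both generators; the fixed, pre-trained autoencoder mentioned in the objective would additionally constrain the representations the generators operate on.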