AI Speech Generator Reaches Human Parity but Deemed Too Dangerous by Scientists
Microsoft has developed VALL-E 2, an artificial intelligence (AI) speech generator that can reproduce a human speaker's voice from just a few seconds of audio. According to Microsoft researchers, the speech it generates is accurate and natural enough to be comparable to human performance, and convincing enough to be mistaken for a real person. The researchers describe this as a milestone in zero-shot text-to-speech (TTS) synthesis, meaning the synthesis of a voice the model never encountered during training: the first time such a system has achieved human parity.
Two features underpin this performance: “Repetition Aware Sampling” and “Grouped Code Modeling.” Repetition Aware Sampling takes into account which sounds have already been repeated during decoding, which keeps the model from getting stuck in infinite loops of sounds or phrases and makes the speech more fluid and natural. Grouped Code Modeling, meanwhile, bundles the audio codes into groups, which shortens the sequence the model has to process, speeds up speech generation, and improves efficiency.
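The two ideas can be illustrated with a short Python sketch. This is a simplified illustration and not Microsoft's implementation: the function names, the window size, the repetition threshold, the top-p value, and the padding token are all assumptions chosen for readability. The sketch assumes the sampler first draws a token with nucleus (top-p) sampling and, if that token already dominates the recent decoding history, falls back to drawing from the full distribution; grouping is shown as simply chunking a code sequence into fixed-size tuples.

```python
import random
from typing import List, Sequence, Tuple


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value, for illustration only
    window: int = 10,        # assumed history window
    max_ratio: float = 0.5,  # assumed repetition threshold
) -> int:
    """Nucleus-sample a token; if it repeats too often in recent history,
    resample from the full distribution to break out of the loop."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = rng.choices(list(range(len(probs))), weights=list(probs), k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int, pad: int = 0) -> List[Tuple[int, ...]]:
    """Chunk a code sequence into fixed-size groups, padding the last group.
    The model would then take one step per group instead of one per code."""
    padded = list(codes) + [pad] * (-len(codes) % group_size)
    return [tuple(padded[i:i + group_size]) for i in range(0, len(padded), group_size)]
```

For example, `group_codes([1, 2, 3, 4, 5], 2)` returns `[(1, 2), (3, 4), (5, 0)]`, so a sequence of five codes is handled in three steps rather than five. In the sampler, the fallback to the full distribution is what lets a less likely token interrupt a run of identical ones.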
In their experiments, the researchers used audio samples from the speech libraries LibriSpeech and VCTK to assess how closely VALL-E 2's output matched recordings of human speakers. They found that VALL-E 2 surpassed previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity, making it the first of its kind to reach human parity on these benchmarks.
Despite these capabilities, Microsoft has decided not to release VALL-E 2 to the public, citing the risk of misuse through voice cloning and deepfakes. The researchers acknowledge that AI speech technology has practical applications in fields such as education, entertainment, journalism, and accessibility, but they emphasize the need for protocols that ensure speakers approve the use of their voice, along with methods for detecting synthesized speech.
In conclusion, VALL-E 2 represents a significant advance in AI speech generation, but its potential risks have led Microsoft to treat it purely as a research project for now. As the debate over the ethical use of AI continues, it is essential to consider what such powerful speech synthesis technology means for privacy, security, and authenticity.