
Amazon’s AI creates synthesized singers

Image for illustration purposes only. Source: Pablo Buffer

Summary: Researchers from Amazon and Cambridge put their collective minds to the challenge in a recent paper in which they propose an AI system that requires “considerably” less modeling of features like vibratos and note durations than previous work.

Original author and publication date: Kyle Wiggers – December 19, 2019

Futurizonte Editor’s Note: Is music created by AI still music? In other words, can AI feel the music it creates? Is that music inspired by the Muses?

From the article:

AI and machine learning algorithms are quite skilled at generating works of art — and highly realistic images of apartments, people, and pets to boot. But relatively few have been tuned to singing synthesis, or the task of cloning musicians’ voices.

Researchers from Amazon and Cambridge put their collective minds to the challenge in a recent paper in which they propose an AI system that requires “considerably” less modeling of features like vibratos and note durations than previous work. It taps a Google-designed algorithm, WaveNet, to synthesize audio from mel-spectrograms, or representations of the power spectrum of sounds, which another model produces using a combination of speech and singing data.

The system comprises three parts, the first of which is a frontend that takes a musical score as input and produces note embeddings (i.e., numerical representations of notes) to be sent to an encoder. The second is a model that is modified to accept the aforementioned embeddings and whose decoder produces mel-spectrograms. The third and final component, a WaveNet vocoder that mimics things like stress and intonation in speech, synthesizes the spectrograms into song.
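The article does not reproduce the paper’s architecture in detail, but the three-stage flow it describes can be sketched roughly as follows. The class names, interfaces, and dummy tensor shapes here are illustrative assumptions, not code from the Amazon/Cambridge paper.

```python
# Illustrative sketch of the three-stage pipeline described above.
# Class names, interfaces, and shapes are assumptions for clarity only.
import numpy as np

class ScoreFrontend:
    """Stage 1: turn score notes and lyrics into note embeddings."""
    def encode(self, notes):
        # each note becomes a fixed-size vector of phoneme, pitch, and duration features
        return np.stack([np.random.rand(32) for _ in notes])

class Seq2SeqAcousticModel:
    """Stage 2: encoder-decoder mapping note embeddings to mel-spectrogram frames."""
    def predict_mel(self, note_embeddings, frames_per_note=20, n_mels=80):
        n_frames = len(note_embeddings) * frames_per_note
        return np.random.rand(n_frames, n_mels)   # placeholder for decoder output

class WaveNetVocoder:
    """Stage 3: neural vocoder rendering mel-spectrograms as a waveform."""
    def synthesize(self, mel, hop_length=256):
        return np.random.rand(mel.shape[0] * hop_length)  # placeholder audio samples

def sing(notes):
    embeddings = ScoreFrontend().encode(notes)
    mel = Seq2SeqAcousticModel().predict_mel(embeddings)
    return WaveNetVocoder().synthesize(mel)

audio = sing(notes=["C4:la", "D4:la", "E4:la"])   # toy score
print(audio.shape)
```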

The frontend performs linguistic analysis on the score lyrics, allowing for three possible levels of vowel stress and ignoring punctuation. It then determines which phonemes (perceptually distinct units of sound) correspond to each note of the score, using syllabification information specified in the score itself. It also computes the expected duration in seconds of each note, as well as the tempo and time signature of the score, which it combines into embeddings.
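The duration arithmetic the frontend performs follows directly from the tempo: a quarter note at 120 beats per minute lasts half a second. The sketch below illustrates that calculation and the kinds of per-note features the article mentions; the function names and feature dictionary are hypothetical, not taken from the paper.

```python
# Minimal sketch of converting score timing into seconds, plus a bundle of
# the per-note features the article mentions. Names are illustrative only.

def note_duration_seconds(note_beats: float, tempo_bpm: float) -> float:
    """Duration in seconds of a note spanning `note_beats` beats at `tempo_bpm`."""
    return note_beats * 60.0 / tempo_bpm

def note_features(phonemes, pitch_midi, note_beats, tempo_bpm, stress_level):
    """Collect the kinds of per-note features described in the article."""
    return {
        "phonemes": phonemes,          # aligned via syllabification in the score
        "pitch_midi": pitch_midi,
        "duration_sec": note_duration_seconds(note_beats, tempo_bpm),
        "vowel_stress": stress_level,  # one of three stress levels, per the article
    }

# A quarter note (1 beat) at 120 BPM lasts 0.5 seconds.
print(note_features(["s", "ih", "ng"], pitch_midi=64, note_beats=1.0,
                    tempo_bpm=120.0, stress_level=1))
```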

The researchers compiled a data set of 96 songs in English, sung a cappella by a single female voice for a total of two hours and 15 seconds of music. (An additional 40 hours of recordings was used to train the WaveNet model and the baseline systems.) It covered several genres, including pop, blues, rock, and some children’s songs, and the songs were split into segments 20-30 seconds in length, corresponding to about 200 phonemes each. This splitting reduced the amount of computation required to train the system, the researchers say, and made it easier to transform the samples (by shifting the pitch and changing the tempo) to augment the corpus.
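A rough sketch of that segmentation and augmentation, assuming off-the-shelf audio tooling, might look like the following. librosa’s pitch_shift and time_stretch are stand-ins here; the article does not say which tools the researchers actually used, and the file name is hypothetical.

```python
# Hedged sketch of the segmentation and augmentation described above.
# librosa is a stand-in; the paper's actual tooling is not specified.
import librosa

def split_into_segments(y, sr, seg_seconds=25.0):
    """Split a recording into roughly 20-30 second chunks (25 s used here)."""
    seg_len = int(seg_seconds * sr)
    return [y[i:i + seg_len] for i in range(0, len(y), seg_len)]

def augment(segment, sr):
    """Create pitch-shifted and tempo-altered variants of one segment."""
    variants = [segment]
    for steps in (-1, 1):          # shift pitch up/down by one semitone
        variants.append(librosa.effects.pitch_shift(segment, sr=sr, n_steps=steps))
    for rate in (0.95, 1.05):      # slow down / speed up slightly
        variants.append(librosa.effects.time_stretch(segment, rate=rate))
    return variants

y, sr = librosa.load("song.wav", sr=None, mono=True)  # hypothetical recording
corpus = [v for seg in split_into_segments(y, sr) for v in augment(seg, sr)]
print(f"{len(corpus)} training segments after augmentation")
```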

The research team recruited around 22 human listeners to evaluate the quality of synthesized songs, principally by listening to segments three to five seconds in length and rating their naturalness on a scale of 0 to 100. The results show that the proposed model achieved an average naturalness rating of 58.9 out of 100, with most segments in the lower quartile containing either a vocoder glitch or mumbled words.
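A minimal sketch of how such ratings might be aggregated is shown below, assuming one averaged 0-100 naturalness score per synthesized segment; the scores are made-up values and this is not the authors’ evaluation protocol.

```python
# Minimal sketch of aggregating 0-100 naturalness ratings per segment,
# mirroring the statistics quoted above (mean score, lower-quartile segments).
# The score values are invented for illustration.
import numpy as np

segment_scores = np.array([61.0, 72.5, 40.0, 58.0, 66.5, 35.0, 70.0, 55.5])

mean_score = segment_scores.mean()
lower_quartile_cutoff = np.percentile(segment_scores, 25)
flagged = segment_scores[segment_scores <= lower_quartile_cutoff]

print(f"mean naturalness: {mean_score:.1f} / 100")
print(f"lower-quartile segments (candidates for glitches or mumbled words): {flagged}")
```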

READ the complete article here.