Voice AI that actually converts: New TTS model boosts sales 15% for major brands


Generating voices that are not only human-sounding and nuanced but also diverse remains a struggle in conversational AI.

Ultimately, people want to hear voices that sound like them, or at least sound natural, not just the nineteenth-century American broadcast standard.

Startup Rime is taking on this challenge with Arcana text-to-speech (TTS), a new spoken language model that can quickly generate “infinite” new voices of varying genders, ages, demographics and languages based solely on a simple text description of the intended characteristics.

The model has helped boost sales for customers including Domino’s and Wingstop by 15%.

“It is one thing to have a high-quality, realistic model of a real person that is faithful to that person,” said Lily Clifford, CEO and co-founder of Rime, in an interview with VentureBeat. “It is another to have a model that can not only create a single voice, but infinite variability of voices along demographic lines.”

A voice model that “acts human”

Rime’s multimodal, autoregressive TTS model was trained on natural conversations with real people (as opposed to voice actors). Users simply type a text prompt describing a voice with the desired demographic characteristics and language.

For example: “I want a 30-year-old woman who lives in California and works in software” or “give me an Australian voice.”

“Whenever you do this, you are going to get a different voice,” said Clifford.
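As a rough illustration of that workflow, here is a minimal sketch that sends a plain-text voice description to a hypothetical TTS endpoint and saves the returned audio. The URL, field names and response format are assumptions for illustration, not Rime’s documented API.

import requests

# Hypothetical endpoint and payload shape -- illustrative only, not Rime's documented API.
API_URL = "https://api.example.com/v1/tts"
API_KEY = "YOUR_API_KEY"

payload = {
    # Free-text description of the desired speaker, per the examples above.
    "voice_description": "a 30-year-old woman who lives in California and works in software",
    # The line the generated voice should speak.
    "text": "Thanks for calling! What can I get started for you today?",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

# Assume the service returns raw audio bytes (e.g. WAV) in the response body.
with open("generated_voice.wav", "wb") as f:
    f.write(response.content)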

Rime’s Mist v2 TTS model was designed for high-volume, business-critical applications, allowing companies to develop unique voices for their business needs. “The customer hears a voice that enables a natural, dynamic conversation without the need for a human agent,” said Clifford.

For those looking for ready-to-use options, Rime offers eight flagship speakers with unique characteristics:

  • Luna (female, chill but excitable, Gen-Z optimist)
  • Celeste (female, warm, laid-back, loving life)
  • Orion (male, older, African-American, happy)
  • Ursa (male, in his 20s, encyclopedic knowledge of 2000s emo music)
  • Astra (female, young, wide-eyed)
  • Esther (female, older, Chinese-American, loving)
  • Estelle (female, middle-aged, African-American, sounds so sweet)
  • Andromeda (female, young, breathy, yoga vibes)

The model can switch between languages and can whisper, be sarcastic and even mock. Arcana can also insert laughter into speech when given a dedicated laughter token. This can return varied and realistic outputs, from a small chuckle to a big laugh, Rime says. The model can also interpret other such non-verbal tokens correctly, even though it has not been explicitly trained to do so.

“It draws emotion from context,” Rime writes in a technical article. “It laughs, sighs, hums, breathes audibly and makes subtle noises. It says ‘um’ and pauses naturally. It has emergent behaviors we are still discovering. In short, it acts human.”
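To make the inline-token idea concrete, here is a self-contained variation of the sketch above. The <laugh> tag and the request shape are illustrative placeholders, not necessarily Rime’s actual token syntax.

import requests

# Assumed endpoint and key, as in the earlier sketch -- illustrative only.
API_URL = "https://api.example.com/v1/tts"
API_KEY = "YOUR_API_KEY"

payload = {
    "voice_description": "female, young, breathy, yoga vibes",
    # Non-verbal cues are written inline with the text to be spoken.
    # "<laugh>" is a hypothetical placeholder for a laughter token.
    "text": "Oh wow, that is a big order <laugh> okay, let's get it started.",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

with open("with_laughter.wav", "wb") as f:
    f.write(response.content)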

Capturing natural conversations

Rime’s model generates audio tokens that are decoded into speech using a codec-based approach, which, according to Rime, delivers “faster-than-real-time” synthesis. At launch, time to first audio was 250 milliseconds, and public cloud latency was around 400 milliseconds.
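To see what “time to first audio” means in practice, the sketch below times how long a hypothetical streaming TTS endpoint takes to return its first audio chunk. The streaming URL and behavior are assumptions for illustration.

import time
import requests

# Hypothetical streaming endpoint -- illustrative only.
STREAM_URL = "https://api.example.com/v1/tts/stream"
API_KEY = "YOUR_API_KEY"

start = time.monotonic()
with requests.post(
    STREAM_URL,
    json={
        "voice_description": "give me an Australian voice",
        "text": "Your order will be ready in about fifteen minutes.",
    },
    headers={"Authorization": f"Bearer {API_KEY}"},
    stream=True,
    timeout=30,
) as response:
    response.raise_for_status()
    # Time to first audio: elapsed time until the first chunk of audio arrives.
    first_chunk = next(response.iter_content(chunk_size=4096))
    ttfa_ms = (time.monotonic() - start) * 1000
    print(f"Time to first audio: {ttfa_ms:.0f} ms ({len(first_chunk)} bytes)")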

Arcana was trained in three stages:

  • Pre-training: Rime used open-source large language models (LLMs) as a backbone and pre-trained on a large corpus of audio-text pairs to help Arcana learn general linguistic and acoustic patterns.
  • Fine-tuning with a “massive” proprietary dataset.
  • Speaker-specific fine-tuning: Rime identified the speakers it found “most exemplary” in its dataset for conversational quality and reliability.

Rime’s data incorporates sociolinguistic conversation techniques (accounting for social context such as class, gender and location), idiolect (individual speech habits) and paralinguistic nuances (the non-verbal aspects of communication that accompany speech).

The model was also trained on accent subtleties, filler words (those subconscious “uhs” and “ums”) and pauses, prosodic stress patterns (intonation, timing, emphasis on certain syllables) and multilingual code-switching (when multilingual speakers alternate between languages).

The company took a unique approach to collecting all this data. Clifford explained that, typically, model makers gather clips from voice actors, then build a model to reproduce that person’s voice characteristics from text input. Or they scrape audio data.

“Our approach was very different,” she said. “It was: How do we create the world’s largest proprietary dataset of conversational speech?”

To do this, Rime built its own recording studio in a San Francisco basement and spent several months recruiting people off Craigslist, by word of mouth, or simply by casually rounding up friends and family. Rather than scripted conversations, they recorded natural conversations and chitchat.

They then annotated the voices with detailed metadata, coding for gender, age, dialect, speech affect and language. This allowed Rime to reach 98 to 100% accuracy.
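To picture what that labeling might look like, here is a minimal annotation record. The field names and values are illustrative assumptions, not Rime’s actual metadata schema.

from dataclasses import dataclass

@dataclass
class SpeakerAnnotation:
    """Illustrative schema only -- not Rime's actual metadata format."""
    clip_id: str
    gender: str      # e.g. "female"
    age_range: str   # e.g. "25-34"
    dialect: str     # e.g. "California English"
    affect: str      # e.g. "warm, relaxed"
    language: str    # e.g. "en-US"

example = SpeakerAnnotation(
    clip_id="studio_session_0042",
    gender="female",
    age_range="25-34",
    dialect="California English",
    affect="warm, relaxed",
    language="en-US",
)
print(example)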

Clifford noted that they are constantly growing this dataset.

“How do we make it sound personable? You will never get there if you just use voice actors,” she said. “We did the incredibly difficult thing of collecting really naturalistic data. Rime’s huge secret sauce is that these are not actors. They are real people.”

A “personalization harness” that creates tailor-made voices

Rime aims to give customers the ability to find the voices that work best for their application. It built a “personalization harness” tool that lets users run A/B tests with different voices. After a given interaction, the API posts back to Rime, which provides an analytics dashboard identifying the best-performing voices based on success metrics.

Of course, customers have different definitions of what constitutes a successful call. In food service, that could be upselling an order of extra fries or wings.
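On the customer side, the A/B loop could look something like the sketch below: randomly assign one of several candidate voices to each call, then report whether the call hit the success metric (here, an accepted upsell). The endpoint, payload and voice IDs are assumptions for illustration, not Rime’s actual harness API.

import random
import requests

API_KEY = "YOUR_API_KEY"
# Hypothetical reporting endpoint -- illustrative only.
REPORT_URL = "https://api.example.com/v1/experiments/report"

# Candidate voices under test (flagship speakers named above).
VOICE_VARIANTS = ["luna", "celeste", "andromeda"]

def assign_voice() -> str:
    """Randomly pick a voice variant for this call (the A/B split)."""
    return random.choice(VOICE_VARIANTS)

def report_outcome(call_id: str, voice: str, upsell_accepted: bool) -> None:
    """Post the call outcome back so the dashboard can rank voices."""
    requests.post(
        REPORT_URL,
        json={
            "call_id": call_id,
            "voice": voice,
            "success": upsell_accepted,  # e.g. caller added fries or wings
        },
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )

# Example usage for a single call:
voice = assign_voice()
# ... run the conversation with the chosen voice ...
report_outcome("call-001", voice, upsell_accepted=True)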

“The goal for us is: how do we create an application that allows our customers to easily run these experiments themselves?” said Clifford. “Our customers are not voice casting directors, and neither are we. The challenge becomes how to make that personalization analytics layer really intuitive.”

Another KPI customers look to maximize is callers’ willingness to speak to an AI. They have found that, with Rime, callers are 4x more likely to speak to the bot.

“For the first time, people are saying: ‘No, you don’t need to transfer me. I’m perfectly ready to talk to you,’” said Clifford. “Or when they are transferred, they say ‘thank you.’” (In fact, 20% are cordial when their conversations with a bot end.)

Powering 100 million calls per month

Rime counts Domino’s, Wingstop, ConverseNow and Ylopo among its customers. The company does a lot of work with major contact centers, enterprise developers building interactive voice response (IVR) systems, and telecoms, Clifford noted.

“When we switched to Rime, we saw an immediate double-digit improvement in the likelihood of our calls converting,” said Akshay Kayastha, director of engineering at ConverseNow. “Working with Rime means we are solving a ton of the last-mile problems that come up in shipping a high-impact application.”

Ylopo CPO Ge Juefeng noted that, for his company’s outbound calls, they have to establish immediate trust with the consumer. “We tested all the models on the market and found that Rime’s voices convert customers at the highest rate,” he reported.

Rime is already helping power nearly 100 million phone calls per month, said Clifford. “If you call Domino’s or Wingstop, there is an 80 to 90% chance you are hearing a Rime voice,” she said.

Looking ahead, Rime will expand further into on-premises offerings to support low latency. In fact, the company anticipates that, by the end of 2025, 90% of its volume will be on-premises. “The reason is that you will never be as fast if you are running these models in the cloud,” said Clifford.

In addition, Rime keeps tuning its models to meet other linguistic challenges, for example, phrases the model has never encountered, such as Domino’s topping-loaded “ExtravaganZZa” pizza. As Clifford noted, even if a voice is personalized, natural-sounding and responsive in real time, it will fail if it cannot handle a business’s unique needs.

“There are still a lot of problems that our competitors consider last-mile problems, but that our customers consider first-mile problems,” said Clifford.


