Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

ABSTRACT

Conversational Speech Synthesis (CSS) aims to use the multimodal dialogue history (MDH) effectively to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. The text and speech modalities in the MDH each have their own influence, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed I³-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations: Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. We then design two intra-modal and two inter-modal interaction modules, all based on contrastive learning, to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take the MDH and use the trained interaction modules to infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that I³-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/Coder-jzq/ICASSP2025-IIICSS.
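
The four modal combinations and the intra/inter split described in the abstract can be illustrated with a minimal sketch. The pair enumeration follows the abstract directly. The loss is a generic InfoNCE-style contrastive objective chosen only for illustration: the abstract says the modules are contrastive learning-based but does not specify the loss, so `info_nce`, `cosine`, the `temperature` value, and the toy embeddings are all assumptions, not the authors' implementation.

```python
import math
from itertools import product

# Modalities of the dialogue history and of the target ("next") utterance.
HISTORY = ("HT", "HS")  # Historical Text, Historical Speech
TARGET = ("NT", "NS")   # Next Text, Next Speech


def modal_pairs():
    """Return the four history/target combinations, grouped as intra- or inter-modal."""
    pairs = {"intra": [], "inter": []}
    for h, t in product(HISTORY, TARGET):
        # Same modality letter (T/T or S/S) means an intra-modal pair.
        kind = "intra" if h[1] == t[1] else "inter"
        pairs[kind].append(f"{h}-{t}")
    return pairs


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(history_embs, target_embs, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's stated objective).

    history_embs[i] and target_embs[i] form the positive pair; every other
    target in the batch acts as a negative for history_embs[i].
    """
    losses = []
    for i, h in enumerate(history_embs):
        logits = [cosine(h, t) / temperature for t in target_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(modal_pairs())
    # Toy embeddings: matched history/target pairs should yield a lower loss.
    history = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(history, [[1.0, 0.0], [0.0, 1.0]]))
    print(info_nce(history, [[0.0, 1.0], [1.0, 0.0]]))
```

In this sketch one such loss would be computed per modal combination (for example, historical text embeddings against next-speech embeddings for HT-NS); how the four terms are weighted or combined is not stated on this page.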

MODEL ARCHITECTURE

Figure: Overview of I³-CSS, which consists of Intra-modal Interaction Modules, Inter-modal Interaction Modules, a Text Encoder, and a Speech Synthesizer.

EXPERIMENTS

Comparative Experiment:
1) DailyTalk models the independent dialogue history through a coarse-grained context encoder to enhance speech expressiveness. [PAPER]
2) M²-CTTS designs a multimodal, multiscale context encoder to model independent MDH to generate speech with appropriate prosody. [PAPER]
3) CONCSS increases the discriminability of independent MDH context through contrastive learning. [PAPER]
4) Homogeneous Graph-based CSS uses a homogeneous graph to model independent MDH, inferring speaking styles of the target utterance. [PAPER]
5) ECSS constructs a heterogeneous graph of the target utterance and MDH's multi-source knowledge to predict emotions and synthesize emotionally expressive speech. [PAPER]

Ablation Experiment:
Abl.Exp.1 represents the removal of all intra-modal and inter-modal interaction modules.
(Abl.Exp.2 to Abl.Exp.9 represent different combinations of HT-NT, HS-NS, HT-NS, and HS-NT.)
Abl.Exp.2 represents I³-CSS with only HT-NT.
Abl.Exp.3 represents I³-CSS with only HS-NS.
Abl.Exp.4 represents I³-CSS with only HT-NS.
Abl.Exp.5 represents I³-CSS with only HS-NT.
Abl.Exp.6 represents I³-CSS with both HT-NT and HS-NS.
Abl.Exp.7 represents I³-CSS with both HT-NS and HS-NT.
Abl.Exp.8 represents I³-CSS with both HT-NT and HT-NS.
Abl.Exp.9 represents I³-CSS with both HS-NS and HS-NT.
Abl.Exp.10 represents I³-CSS without the interaction enhancement (RE) mechanism.
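
For readers comparing the ablation samples below, the configurations listed above can be written down as a small lookup table. This is only a sketch for reference: the names `ALL_MODULES`, `ABLATION_MODULES`, `ENHANCEMENT_REMOVED`, and `describe` are hypothetical and do not come from the released code. Abl.Exp.10 is assumed to keep all four modules, since it is described as I³-CSS minus the interaction enhancement mechanism.

```python
# The four interaction modules, named as in the ablation list.
ALL_MODULES = frozenset({"HT-NT", "HS-NS", "HT-NS", "HS-NT"})

# Modules kept in each ablation, transcribed from the list above.
ABLATION_MODULES = {
    "Abl.Exp.1": frozenset(),                    # all interaction modules removed
    "Abl.Exp.2": frozenset({"HT-NT"}),
    "Abl.Exp.3": frozenset({"HS-NS"}),
    "Abl.Exp.4": frozenset({"HT-NS"}),
    "Abl.Exp.5": frozenset({"HS-NT"}),
    "Abl.Exp.6": frozenset({"HT-NT", "HS-NS"}),  # both intra-modal modules
    "Abl.Exp.7": frozenset({"HT-NS", "HS-NT"}),  # both inter-modal modules
    "Abl.Exp.8": frozenset({"HT-NT", "HT-NS"}),  # both historical-text modules
    "Abl.Exp.9": frozenset({"HS-NS", "HS-NT"}),  # both historical-speech modules
    "Abl.Exp.10": ALL_MODULES,                   # assumed: full set, enhancement removed
}

# Experiments in which the source says interaction enhancement is removed.
ENHANCEMENT_REMOVED = {"Abl.Exp.10"}


def describe(name):
    """Return a one-line summary of the modules kept in one ablation."""
    modules = sorted(ABLATION_MODULES[name])
    kept = ", ".join(modules) if modules else "none"
    suffix = " (interaction enhancement removed)" if name in ENHANCEMENT_REMOVED else ""
    return f"{name}: {kept}{suffix}"


if __name__ == "__main__":
    for experiment in ABLATION_MODULES:
        print(describe(experiment))
```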

In my book, all a good movie needs is a chase scene and lots of things that blow up.

Conversation history	text	speech
1st	I suppose you like cinematography and costumes and that sort of stuff?
2nd	Yes, I do. The look of a picture is very important.
3rd	Umm I think sound is even more important! Guns, bombs, sirens--that's what makes a movie exciting!
4th	You wouldn't know a good movie even if it bit you on the nose.

Comparative

current	text	DailyTalk	M²-CTTS	CONCSS	Homogeneous Graph-based CSS	ECSS	I³-CSS
5th	In my book, all a good movie needs is a chase scene and lots of things that blow up.

Ablation

Abl.Exp.1	Abl.Exp.2	Abl.Exp.3	Abl.Exp.4	Abl.Exp.5	Abl.Exp.6	Abl.Exp.7	Abl.Exp.8	Abl.Exp.9	Abl.Exp.10

No, I don't often dance. Isn't this a wonderful party?

Conversation history	text	speech
1st	Umm excuse me, miss. I'm Bob.
2nd	I'm Amy. How do you do?
3rd	I'm very glad to meet you. May I have this dance with you?
4th	Certainly! I suppose you dance often.

Comparative

current	text	DailyTalk	M²-CTTS	CONCSS	Homogeneous Graph-based CSS	ECSS	I³-CSS
5th	No, I don't often dance. Isn't this a wonderful party?

Ablation

Abl.Exp.1	Abl.Exp.2	Abl.Exp.3	Abl.Exp.4	Abl.Exp.5	Abl.Exp.6	Abl.Exp.7	Abl.Exp.8	Abl.Exp.9	Abl.Exp.10

I'm going to take more pictures today.

Conversation history	text	speech
1st	Lucy, come here! I can see the lake which is in the center of park.
2nd	It is beautiful! Look, there are so many birds around it.
3rd	It is a great place for a relaxing vacation.
4th	Listen to the sound of nature! It's like music.
5th	Yeah, I agree. It makes you feel really good.
6th	What are those?
7th	Do you mean the red things? They are roses.

Comparative

current	text	DailyTalk	M²-CTTS	CONCSS	Homogeneous Graph-based CSS	ECSS	I³-CSS
8th	I'm going to take more pictures today.

Ablation

Abl.Exp.1	Abl.Exp.2	Abl.Exp.3	Abl.Exp.4	Abl.Exp.5	Abl.Exp.6	Abl.Exp.7	Abl.Exp.8	Abl.Exp.9	Abl.Exp.10