Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
ABSTRACT
Conversational Speech Synthesis (CSS) aims to use the multimodal dialogue history (MDH) effectively to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. The text and speech modalities in the MDH each have their own influence, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed I³-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations: Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interaction. In the inference phase, we take the MDH and use the trained interaction modules to infer the speech prosody of the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that I³-CSS outperforms the advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/Coder-jzq/ICASSP2025-IIICSS.
MODEL ARCHITECTURE

Figure: Overview of I³-CSS, which consists of Intra-modal Interaction Modules, Inter-modal Interaction Modules, a Text Encoder, and a Speech Synthesizer.
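
The abstract describes the interaction modules only as "contrastive learning-based", so the following is a minimal sketch rather than the authors' implementation. It assumes an InfoNCE-style loss (our choice, not confirmed by the source) in which the history embedding of dialogue i should be closer to the next-utterance embedding of dialogue i than to those of other dialogues in the batch. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(history, target, temperature=0.1):
    """InfoNCE-style loss (assumed, not taken from the paper).

    history[i] and target[i] form the positive pair; every other
    target[j] in the batch serves as a negative for history[i].
    """
    h = [normalize(v) for v in history]
    t = [normalize(v) for v in target]
    losses = []
    for i, hi in enumerate(h):
        logits = [dot(hi, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings for a batch of two dialogues (made-up numbers).
hist_text = [[1.0, 0.0], [0.0, 1.0]]
hist_speech = [[0.9, 0.1], [0.2, 0.8]]
next_text = [[0.8, 0.2], [0.1, 0.9]]
next_speech = [[1.0, 0.1], [0.0, 1.0]]

# The four modal combinations named in the abstract.
PAIRINGS = {
    "HT-NT": (hist_text, next_text),      # intra-modal
    "HS-NS": (hist_speech, next_speech),  # intra-modal
    "HT-NS": (hist_text, next_speech),    # inter-modal
    "HS-NT": (hist_speech, next_text),    # inter-modal
}

losses = {name: info_nce(h, t) for name, (h, t) in PAIRINGS.items()}
for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

In this toy batch the paired embeddings are nearly aligned, so each loss is small; swapping the order of the targets would make the loss grow, which is the behaviour a contrastive objective is meant to produce.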

EXPERIMENTS
Comparative Experiment:
1) DailyTalk models the independent dialogue history through a coarse-grained context encoder to enhance speech expressiveness. [PAPER]
2) M²-CTTS designs a multimodal, multiscale context encoder to model independent MDH to generate speech with appropriate prosody. [PAPER]
3) CONCSS increases the discriminability of independent MDH context through contrastive learning. [PAPER]
4) Homogeneous Graph-based CSS uses a homogeneous graph to model independent MDH, inferring speaking styles of the target utterance. [PAPER]
5) ECSS constructs a heterogeneous graph of the target utterance and MDH's multi-source knowledge to predict emotions and synthesize emotionally expressive speech. [PAPER]

Ablation Experiment:
Abl.Exp.1 represents the removal of all intra-modal and inter-modal interaction modules.
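(Abl.Exp.2 to Abl.Exp.9 represent different combinations of the HT-NT (Historical Text-Next Text), HS-NS (Historical Speech-Next Speech), HT-NS (Historical Text-Next Speech), and HS-NT (Historical Speech-Next Text) modules.)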
Abl.Exp.2 represents I³-CSS with only HT-NT.
Abl.Exp.3 represents I³-CSS with only HS-NS.
Abl.Exp.4 represents I³-CSS with only HT-NS.
Abl.Exp.5 represents I³-CSS with only HS-NT.
Abl.Exp.6 represents I³-CSS with both HT-NT and HS-NS.
Abl.Exp.7 represents I³-CSS with both HT-NS and HS-NT.
Abl.Exp.8 represents I³-CSS with both HT-NT and HT-NS.
Abl.Exp.9 represents I³-CSS with both HS-NS and HS-NT.
Abl.Exp.10 represents I³-CSS without the interaction enhancement (RE) mechanism.
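
The ablation settings above can be summarized as a small lookup table. This is only an illustrative sketch: the names `ABLATIONS` and `kind` are hypothetical and do not come from the released code. Abl.Exp.10 is left out because it toggles a different mechanism rather than a set of interaction modules.

```python
# Same-modality pairings are intra-modal; cross-modality pairings are inter-modal.
INTRA = frozenset({"HT-NT", "HS-NS"})
INTER = frozenset({"HT-NS", "HS-NT"})

# Interaction modules kept in each ablation setting (Abl.Exp.1 to Abl.Exp.9).
ABLATIONS = {
    "Abl.Exp.1": frozenset(),
    "Abl.Exp.2": frozenset({"HT-NT"}),
    "Abl.Exp.3": frozenset({"HS-NS"}),
    "Abl.Exp.4": frozenset({"HT-NS"}),
    "Abl.Exp.5": frozenset({"HS-NT"}),
    "Abl.Exp.6": frozenset({"HT-NT", "HS-NS"}),
    "Abl.Exp.7": frozenset({"HT-NS", "HS-NT"}),
    "Abl.Exp.8": frozenset({"HT-NT", "HT-NS"}),
    "Abl.Exp.9": frozenset({"HS-NS", "HS-NT"}),
}

def kind(modules):
    """Classify a setting by the type of interaction modules it keeps."""
    if not modules:
        return "none"
    if modules <= INTRA:
        return "intra-modal only"
    if modules <= INTER:
        return "inter-modal only"
    return "mixed"

for name, modules in ABLATIONS.items():
    print(name, sorted(modules), kind(modules))
```

Read this way, Abl.Exp.6 keeps both intra-modal modules, Abl.Exp.7 keeps both inter-modal modules, and Abl.Exp.8 and Abl.Exp.9 each keep one module of each type that shares the same history modality (text and speech, respectively).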

In my book, all a good movie needs is a chase scene and lots of things that blow up.
Conversation history text speech
1st I suppose you like cinematography and costumes and that sort of stuff?
2nd Yes, I do. The look of a picture is very important.
3rd Umm I think sound is even more important! Guns, bombs, sirens--that's what makes a movie exciting!
4th You wouldn't know a good movie even if it bit you on the nose.
Comparative
current text DailyTalk M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS
5th In my book, all a good movie needs is a chase scene and lots of things that blow up.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3 Abl.Exp.4 Abl.Exp.5 Abl.Exp.6 Abl.Exp.7 Abl.Exp.8 Abl.Exp.9 Abl.Exp.10
No, I don't often dance. Isn't this a wonderful party?
Conversation history text speech
1st Umm excuse me, miss. I'm Bob.
2nd I'm Amy. How do you do?
3rd I'm very glad to meet you. May I have this dance with you?
4th Certainly! I suppose you dance often.
Comparative
current text DailyTalk M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS
5th No, I don't often dance. Isn't this a wonderful party?
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3 Abl.Exp.4 Abl.Exp.5 Abl.Exp.6 Abl.Exp.7 Abl.Exp.8 Abl.Exp.9 Abl.Exp.10
I'm going to take more pictures today.
Conversation history text speech
1st Lucy, come here! I can see the lake which is in the center of park.
2nd It is beautiful! Look, there are so many birds around it.
3rd It is a great place for a relaxing vacation.
4th Listen to the sound of nature! It's like music.
5th Yeah, I agree. It makes you feel really good.
6th What are those?
7th Do you mean the red things? They are roses.
Comparative
current text DailyTalk M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS
8th I'm going to take more pictures today.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3 Abl.Exp.4 Abl.Exp.5 Abl.Exp.6 Abl.Exp.7 Abl.Exp.8 Abl.Exp.9 Abl.Exp.10