Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
ABSTRACT
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). Recent work predicts the prosody of the target utterance by modeling utterance-level interactions between the MDH and the target utterance. However, the MDH also contains fine-grained semantic and prosodic knowledge at the word level, and existing methods overlook the modeling of these fine-grained semantic and prosodic interactions. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two graphs encode the interactions among word-level semantics and prosody, as well as their influence on subsequent utterances in the MDH. The encoded interaction features are then used to synthesize speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosody expressiveness. Code and speech samples are available at https://github.com/coder-speech/MFCIG-CSS.
MODEL ARCHITECTURE

Figure: Overview of MFCIG-CSS, which consists of the Multimodal Fine-grained Dialogue Semantic Interaction Graph, the Multimodal Fine-grained Dialogue Prosody Interaction Graph, and the Speech Synthesizer.
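
This page does not spell out the node and edge definitions of the two interaction graphs, so the following is only a toy sketch of the general idea of a word-level dialogue graph, not the MFCIG-CSS implementation. Every name in it (DialogueGraph, build_toy_interaction_graph, the relation labels) is hypothetical, and whitespace tokenization stands in for whatever word-level features the actual system uses.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueGraph:
    """Hypothetical container: nodes and typed edges of a toy dialogue graph."""
    nodes: list = field(default_factory=list)  # (node_id, kind, turn, label)
    edges: list = field(default_factory=list)  # (src_id, dst_id, relation)

    def add_node(self, kind, turn, label):
        node_id = len(self.nodes)
        self.nodes.append((node_id, kind, turn, label))
        return node_id


def build_toy_interaction_graph(history):
    """Build one utterance node per turn and one word node per token.

    Assumed edge types (illustrative only):
      word_to_utterance: a word contributes to its own utterance
      influences_next:   an utterance influences the following utterance
    """
    graph = DialogueGraph()
    prev_utt = None
    for turn, text in enumerate(history, start=1):
        utt = graph.add_node("utterance", turn, text)
        for word in text.split():
            w = graph.add_node("word", turn, word)
            graph.edges.append((w, utt, "word_to_utterance"))
        if prev_utt is not None:
            graph.edges.append((prev_utt, utt, "influences_next"))
        prev_utt = utt
    return graph


history = [
    "Why is that?",
    "I like the different instruments that they use.",
]
g = build_toy_interaction_graph(history)
print(len(g.nodes), "nodes,", len(g.edges), "edges")
```

A prosody-side graph could be sketched the same way by attaching word-level acoustic features to the word nodes instead of text labels; that, too, is an assumption rather than something stated on this page.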

EXPERIMENTS
Comparative Experiment:
1) BaseCTTS incorporates a coarse-grained context encoder to model sentence-level textual features of dialogue history, aiming to enhance the quality of the synthesized speech. [PAPER]
2) FCTalker designs a coarse-grained and fine-grained text context encoder to enhance the prosodic expressiveness of synthesized speech. [PAPER]
3) M²-CTTS incorporates a multi-scale, multi-modal context encoder that simultaneously models the textual and speech features of dialogue history to enhance the prosody of the synthesized speech. [PAPER]
4) CONCSS incorporates a negative-sample-enhanced sampling strategy when modeling multi-modal dialogue history to improve the discriminability of context vectors, aiming to enhance the prosodic sensitivity of the synthesized speech. [PAPER]
5) MSRGCN-CSS incorporates a context modeling scheme based on multi-scale relational graph convolution networks to enhance the speaking style of the synthesized speech. [PAPER]
6) ECSS incorporates a context modeling scheme based on multi-source knowledge heterogeneous graphs to enhance the emotional expressiveness of the synthesized speech. [PAPER]
7) I³-CSS incorporates a context interaction modeling scheme that handles both inter-modal and intra-modal interactions, aiming to improve the prosodic performance of the synthesized speech. [PAPER]

Ablation Experiment:
Abl.Exp.1: w/o SIG removes SIG to validate the impact of the multimodal fine-grained dialogue semantic interaction graph on model performance.
Abl.Exp.2: w/o PIG removes PIG to validate the impact of the multimodal fine-grained dialogue prosody interaction graph on model performance.
Abl.Exp.3: w/o SIG and PIG removes both SIG and PIG to assess their joint impact on model performance.
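
The three ablation settings can be read as switching each graph's features on or off before they reach the synthesizer. The sketch below illustrates that reading only; the function name, the flag names, and the use of simple list concatenation to combine features are assumptions, not details taken from the MFCIG-CSS code.

```python
def select_context_features(sig_features, pig_features, use_sig=True, use_pig=True):
    """Return the context features passed on, given which graphs are enabled.

    Concatenating the two feature lists is an illustrative assumption.
    """
    selected = []
    if use_sig:
        selected.extend(sig_features)
    if use_pig:
        selected.extend(pig_features)
    return selected


# Hypothetical flag settings mirroring the ablation names on this page.
ABLATIONS = {
    "Ours": {"use_sig": True, "use_pig": True},
    "Abl.Exp.1 (w/o SIG)": {"use_sig": False, "use_pig": True},
    "Abl.Exp.2 (w/o PIG)": {"use_sig": True, "use_pig": False},
    "Abl.Exp.3 (w/o SIG and PIG)": {"use_sig": False, "use_pig": False},
}

# Placeholder feature labels; real features would be vectors.
sig = ["sem_0", "sem_1"]
pig = ["pro_0", "pro_1"]
results = {
    name: select_context_features(sig, pig, **flags)
    for name, flags in ABLATIONS.items()
}
```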

Sample 1: That's a good reason to like something.
Conversation history text speech
1st What type of music do you like to listen to?
2nd I like listening to different kinds of music.
3rd Like what, for instance?
4th I enjoy Rock and R&B.
5th Why is that?
6th I like the different instruments that they use.
Comparative
current text BaseCTTS FCTalker M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS Ours
7th That's a good reason to like something.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3
Sample 2: All right. This suits my taste best. I'll take it.
Conversation history text speech
1st Good morning, Madam! Can I help you?
2nd Well, I'd like to buy a watch.
3rd Oh, look at these two watches, aren't they lovely?
4th Yeah. But I think I'd prefer...
5th Umm how about this one? It's graceful in style.
6th Mm, yes, but I think I like that one better. It's made of gold, isn't it?
7th Sure.
8th How much is it?
9th Five hundred dollars, Madam.
10th I wonder if it keeps good time.
11th Surely. As this is the latest model, and you can also set the alarm.
12th How do I set it?
13th Just do like this. Very simple.
Comparative
current text BaseCTTS FCTalker M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS Ours
14th All right. This suits my taste best. I'll take it.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3
Sample 3: OK. I close my mouth.
Conversation history text speech
1st Hi, Mrs. Henderson.
2nd Hi, Steven. Do you have time and chat with me?
3rd Of course I have plenty of time. What's new?
4th The new couple next door divorced. Have you heard about it?
5th Umm no. The Hills? Who filed for divorce first?
6th I guess it is Mrs. Hill. She sued for divorce on the grounds of her husband's misconduct with his secretary.
7th Oh, maybe not. It's just your guess. Do not give currency to idle gossip.
Comparative
current text BaseCTTS FCTalker M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS Ours
8th OK. I close my mouth.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3
Sample 4: I am sorry. There is a speed limit.
Conversation history text speech
1st Driver, bring me to the station.
2nd OK.
3rd Uh, can you please speed up? I am catching the train.
Comparative
current text BaseCTTS FCTalker M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS Ours
4th I am sorry. There is a speed limit.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3
Sample 5: It's there by the window.
Conversation history text speech
1st It's very dark in here. Umm, will you turn on the light?
2nd Okay. But our baby has fallen sleep.
3rd Then, turn on the lamp, please.
4th But where's the switch?
Comparative
current text BaseCTTS FCTalker M²-CTTS CONCSS Homogeneous Graph-based CSS ECSS I³-CSS Ours
5th It's there by the window.
Ablation
Abl.Exp.1 Abl.Exp.2 Abl.Exp.3