ABSTRACT
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the prosodic expression of the target utterance by modeling utterance-level interactions between the MDH and the target utterance. However, the MDH also contains fine-grained semantic and prosodic knowledge at the word level, and existing methods overlook interaction modeling at this granularity. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These graphs encode the interactions between word-level semantics and prosody, as well as their influence on subsequent utterances in the MDH. The encoded interaction features are then used to enhance the synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/coder-speech/MFCIG-CSS.
MODEL ARCHITECTURE

Figure: Overview of MFCIG-CSS, which consists of the Multimodal Fine-grained Dialogue Semantic Interaction Graph, the Multimodal Fine-grained Dialogue Prosody Interaction Graph, and the Speech Synthesizer.
EXPERIMENTS
Comparative Experiment:
1) BaseCTTS incorporates a coarse-grained context encoder to model sentence-level textual features of dialogue history, aiming to enhance the quality of the synthesized speech. [PAPER]
2) FCTalker designs a coarse-grained and fine-grained text context encoder to enhance the prosodic expressiveness of synthesized speech. [PAPER]
3) M²-CTTS incorporates a multi-scale, multi-modal context encoder that simultaneously models the textual and speech features of dialogue history to enhance the prosody of the synthesized speech. [PAPER]
4) CONCSS incorporates a negative-sample-enhanced sampling strategy when modeling multi-modal dialogue history to improve the discriminability of context vectors, aiming to enhance the prosodic sensitivity of the synthesized speech. [PAPER]
5) MSRGCN-CSS incorporates a context modeling scheme based on multi-scale relational graph convolution networks to enhance the speaking style of the synthesized speech. [PAPER]
6) ECSS incorporates a context modeling scheme based on multi-source knowledge heterogeneous graphs to enhance the emotional expressiveness of the synthesized speech. [PAPER]
7) I³-CSS incorporates a context interaction modeling scheme that handles both inter-modal and intra-modal interactions, aiming to improve the prosodic performance of the synthesized speech. [PAPER]
Ablation Experiment:
Abl.Exp.1: w/o SIG removes SIG to validate the impact of the multimodal fine-grained dialogue semantic interaction graph on model performance.
Abl.Exp.2: w/o PIG removes PIG to validate the impact of the multimodal fine-grained dialogue prosody interaction graph on model performance.
Abl.Exp.3: w/o SIG and PIG removes both SIG and PIG to assess their joint impact on model performance.
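The three ablation settings above can be read as two on/off switches, one per interaction graph. The following is a minimal sketch of that reading, not the released implementation: the names `AblationConfig`, `use_sig`, `use_pig`, `ABLATIONS`, and `active_context_features` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches for the two interaction graphs (illustrative names)."""
    use_sig: bool = True  # semantic interaction graph (SIG)
    use_pig: bool = True  # prosody interaction graph (PIG)


# One entry per system in the ablation study, plus the full model.
ABLATIONS = {
    "Ours": AblationConfig(),
    "Abl.Exp.1": AblationConfig(use_sig=False),                  # w/o SIG
    "Abl.Exp.2": AblationConfig(use_pig=False),                  # w/o PIG
    "Abl.Exp.3": AblationConfig(use_sig=False, use_pig=False),   # w/o SIG and PIG
}


def active_context_features(cfg: AblationConfig) -> list[str]:
    """Return the (assumed) context feature streams that stay enabled."""
    features = []
    if cfg.use_sig:
        features.append("semantic_interaction")
    if cfg.use_pig:
        features.append("prosody_interaction")
    return features


if __name__ == "__main__":
    for name, cfg in ABLATIONS.items():
        print(name, active_context_features(cfg))
```

Under this assumption, Abl.Exp.3 leaves the synthesizer with no graph-derived context features, which is why it measures the joint contribution of SIG and PIG.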
Sample 1: That's a good reason to like something.
Conversation history | text | speech |
---|---|---|
1st | What type of music do you like to listen to? | |
2nd | I like listening to different kinds of music. | |
3rd | Like what, for instance? | |
4th | I enjoy Rock and R&B. | |
5th | Why is that? | |
6th | I like the different instruments that they use. |
Comparative |
---|
current | text | BaseCTTS | FCTalker | M²-CTTS | CONCSS | Homogeneous Graph-based CSS | ECSS | I³-CSS | Ours |
---|---|---|---|---|---|---|---|---|---|
7th | That's a good reason to like something. |
Ablation |
---|
Abl.Exp.1 | Abl.Exp.2 | Abl.Exp.3 |
---|---|---|
Sample 2: All right. This suits my taste best. I'll take it.
Conversation history | text | speech |
---|---|---|
1st | Good morning, Madam! Can I help you? | |
2nd | Well, I'd like to buy a watch. | |
3rd | Oh, look at these two watches, aren't they lovely? | |
4th | Yeah. But I think I'd prefer... | |
5th | Umm how about this one? It's graceful in style. | |
6th | Mm, yes, but I think I like that one better. It's made of gold, isn't it? | |
7th | Sure. | |
8th | How much is it? | |
9th | Five hundred dollars, Madam. | |
10th | I wonder if it keeps good time. | |
11th | Surely. As this is the latest model, and you can also set the alarm. | |
12th | How do I set it? | |
13th | Just do like this. Very simple. |
Comparative |
---|
current | text | BaseCTTS | FCTalker | M²-CTTS | CONCSS | Homogeneous Graph-based CSS | ECSS | I³-CSS | Ours |
---|---|---|---|---|---|---|---|---|---|
14th | All right. This suits my taste best. I'll take it. |
Ablation |
---|
Abl.Exp.1 | Abl.Exp.2 | Abl.Exp.3 |
---|---|---|
Sample 3: OK. I close my mouth.
Conversation history | text | speech |
---|---|---|
1st | Hi, Mrs. Henderson. | |
2nd | Hi, Steven. Do you have time and chat with me? | |
3rd | Of course I have plenty of time. What's new? | |
4th | The new couple next door divorced. Have you heard about it? | |
5th | Umm no. The Hills? Who filed for divorce first? | |
6th | I guess it is Mrs. Hill. She sued for divorce on the grounds of her husband's misconduct with his secretary. | |
7th | Oh, maybe not. It's just your guess. Do not give currency to idle gossip. |
Comparative |
---|
current | text | BaseCTTS | FCTalker | M²-CTTS | CONCSS | Homogeneous Graph-based CSS | ECSS | I³-CSS | Ours |
---|---|---|---|---|---|---|---|---|---|
8th | OK. I close my mouth. |
Ablation |
---|
Abl.Exp.1 | Abl.Exp.2 | Abl.Exp.3 |
---|---|---|
Sample 4: I am sorry. There is a speed limit.
Conversation history | text | speech |
---|---|---|
1st | Driver, bring me to the station. | |
2nd | OK. | |
3rd | Uh, can you please speed up? I am catching the train. |
Comparative |
---|
current | text | BaseCTTS | FCTalker | M²-CTTS | CONCSS | Homogeneous Graph-based CSS | ECSS | I³-CSS | Ours |
---|---|---|---|---|---|---|---|---|---|
4th | I am sorry. There is a speed limit. |
Ablation |
---|
Abl.Exp.1 | Abl.Exp.2 | Abl.Exp.3 |
---|---|---|
Sample 5: It's there by the window.
Conversation history | text | speech |
---|---|---|
1st | It's very dark in here. Umm, will you turn on the light? | |
2nd | Okay. But our baby has fallen sleep. | |
3rd | Then, turn on the lamp, please. | |
4th | But where's the switch? |
Comparative |
---|
current | text | BaseCTTS | FCTalker | M²-CTTS | CONCSS | Homogeneous Graph-based CSS | ECSS | I³-CSS | Ours |
---|---|---|---|---|---|---|---|---|---|
5th | It's there by the window. |
Ablation |
---|
Abl.Exp.1 | Abl.Exp.2 | Abl.Exp.3 |
---|---|---|