Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis
Abstract
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that conveys empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in terms of both semantics and style, we first build a stored dialogue semantic-style database (SDSSD) that includes the text and audio samples. We then design a multi-attribute retrieval scheme that matches the dialogue semantic and style vectors of the CD against the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we adopt a multi-granularity graph structure to encode the dialogues and introduce a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted comprehensive experiments on the DailyTalk dataset, a benchmark dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.
Model Architecture

Figure: Overview of our proposed RADKA-CSS model.

Experiments
Comparative Experiment:
1) DailyTalk incorporates a dialogue context encoder into FastSpeech2 to model sentence-level text dialogue history.
2) M²-CTTS designs a text context module and an acoustic context module, using both coarse-grained and fine-grained modeling. This approach aims to make full use of multimodal history to enhance prosodic expression in synthesized speech.
3) Homogeneous Graph-based CSS proposes a context modeling method based on a Multi-Scale Relational Graph Convolutional Network (MSRGCN), which models dependencies among multimodal information in the context. This allows the model to learn dependencies at both global and local scales within dialogues, enhancing its ability to synthesize speaking style. The learned multi-scale, multimodal contextual information is then used to infer the global and local speaking style of the current utterance for speech synthesis.
4) CONCSS introduces a CSS framework based on contrastive learning, which incorporates a negative-sample-enhanced sampling strategy to improve the discriminability of context vectors. This enables the model to perform self-supervised learning on unlabeled dialogue datasets, enhancing the model's understanding of context.
5) ECSS presents a new emotion CSS model based on heterogeneous graph-based emotion context modeling and an emotion rendering mechanism, ensuring accurate generation of emotional conversational speech in terms of both emotion understanding and expression.
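The contrastive-learning idea mentioned for CONCSS can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch of the general technique, not the loss used by CONCSS or by RADKA-CSS: the function names, the toy vectors, and the temperature value are all assumptions made for illustration. The loss is small when the anchor context vector scores higher with its positive than with the negatives.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper).

    Computes -log( exp(s_pos) / (exp(s_pos) + sum_k exp(s_neg_k)) ),
    where each score s is a dot product divided by the temperature.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtracted for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_sum_exp - logits[0]


# Toy 2-dimensional context vectors, made up for illustration only.
anchor = [1.0, 0.0]
aligned_loss = info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the loss is close to zero when the positive matches the anchor and much larger when a negative matches it instead, which is the behaviour that makes context vectors more discriminable.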
Ablation Experiment:
1) Ablation 1: w/o Style Knowledge (SD) indicates the removal of similar dialogue knowledge retrieved from SD, aiming to verify whether referencing the style knowledge of similar dialogues in SD enhances the agent's ability to understand the style of the CD.
2) Ablation 2: w/o Style Knowledge (CD: Text) indicates that encoded textual knowledge from the CD is not aggregated, aiming to validate the extent to which the semantic information contained in the scenarios of CD contributes to the agent's ability to generate speech that aligns with the conversational style.
3) Ablation 3: w/o Style Knowledge (CD: Audio) indicates that encoded audio knowledge from the CD is not aggregated, which assesses the effectiveness of the conversational style knowledge in CD and its role in enhancing the agent's understanding of the CD style.
4) Ablation 4: w/o Style Knowledge (a_N Style Vector) indicates that the a_N style knowledge predicted by the a_N vector predictor is not aggregated, aiming to evaluate whether the predicted a_N style vector contains style information.
5) Ablation 5: w/o Heterogeneous Graph replaces the heterogeneous graph in RADKA-CSS with a homogeneous graph to validate whether the proposed heterogeneous structure can better capture and model dialogue style and semantic representations.
6) Ablation 6: w/o Multi-granularity removes the dialogue-level and word-level nodes, using only sentence-level nodes for training, to evaluate whether the multi-granularity approach can more comprehensively represent style and semantic features.
7) Ablation 7: w/o Knowledge Aggregation Method replaces the knowledge aggregation method with simple direct addition, which helps validate the effectiveness of our proposed style knowledge aggregation method and its impact on performance.
8) Ablation 8: w/o Contrastive Learning removes the retrieval-based dialogue contrastive learning to validate its effectiveness and its impact on RADKA-CSS's performance.
9) Ablation 9: w/ GT (Retrieved) uses the ground truth Top-K dialogue set instead of the Top-K dialogue set retrieved during inference.
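To make Ablation 7 concrete, the sketch below contrasts simple direct addition with a weighted aggregation of style knowledge vectors. The direct-addition function mirrors the baseline named in the ablation. The attention-style function is only a hypothetical stand-in, since this page does not specify how the proposed aggregation method is computed. All names and toy vectors are assumptions.

```python
import math


def direct_addition(sources):
    """Ablation-style baseline: element-wise sum of all source vectors."""
    return [sum(values) for values in zip(*sources)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregation(query, sources):
    """Hypothetical stand-in for a learned aggregation.

    Each source vector is weighted by softmax(dot(query, source)),
    so sources that agree with the query contribute more.
    """
    scores = [sum(q * s for q, s in zip(query, src)) for src in sources]
    weights = softmax(scores)
    return [
        sum(w * src[i] for w, src in zip(weights, sources))
        for i in range(len(query))
    ]


# Toy style knowledge vectors, made up for illustration only.
style_sources = [[1.0, 0.0], [0.0, 1.0]]
query_vector = [1.0, 0.0]

added = direct_addition(style_sources)
aggregated = attention_aggregation(query_vector, style_sources)
```

Direct addition treats every source equally, whereas the weighted variant lets the query emphasize the more relevant source. The ablation tests whether that kind of selectivity matters.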

Sample 1: Umm, maybe you are right, so I will try my best to find a suitable job.
Conversation History 1 Text Speech
1st I plan to immigrate to Canada.
2nd Have you found a suitable job?
3rd No. But it is said the welfare in Canada is very good.
4th But as to me finding a good job is the most important thing.
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
5th Umm, maybe you are right, so I will try my best to find a suitable job.
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9
Sample 2: Umm…. You're making me a believer.
Conversation History 2 Text Speech
1st Have you ever done your shopping at Whole Foods market?
2nd I haven't shopped there. How is the food?
3rd The food there is wonderful.
4th I go to Sons for my groceries.
5th I prefer the food at Whole Foods.
6th Is there something wrong with Sons?
7th Sons doesn't offer a lot of organic foods.
8th Do they offer organic foods at Whole Foods?
9th Yes, that's the place to go to get healthier food.
10th Maybe I'll try that store out.
11th If you like Sons, then I'm sure you'll love Whole Foods.
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
12th Umm…. You're making me a believer.
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9
Sample 3: I'm sure you'll enjoy your workout, sir. Everyone seems to like the swim stations.
Conversation History 3 Text Speech
1st Do you have a swimming pool in this hotel?
2nd We don't have a swimming pool, sir, umm but we do have swim stations in the gym.
3rd I never heard of a swim station. Is that like a train or bus station?
4th It's just a deep bathtub with a current of water that you swim against.
5th Holy cow! I never heard of such a thing. How much do they cost?
6th As a guest, sir, you pay nothing.
7th This sounds better every second. Now, when can I use the stations?
8th If you want to swim, you can visit the gym any day between seven a.m. and ten p.m.
9th Oh, boy! This is going to be great. I'm going to the gym right now!
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
10th I'm sure you'll enjoy your workout, sir. Everyone seems to like the swim stations.
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9
Sample 4: It must be interesting.
Conversation History 4 Text Speech
1st It's so boring.
2nd Don't you like it?
3rd I don't. Is there anything worth watching on the other channel?
4th Umm I think it's a basketball match on channel five.
5th Do you mind if we switch over?
6th Well, I'd rather see a movie.
7th What's the movie?
8th Star war.
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
9th It must be interesting.
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9
Sample 5: In my book, all a good movie needs is a chase scene and lots of things that blow up.
Conversation History 5 Text Speech
1st I suppose you like cinematography and costumes and that sort of stuff?
2nd Yes, I do. The look of a picture is very important.
3rd Umm I think sound is even more important! Guns, bombs, sirens--that's what makes a movie exciting!
4th You wouldn't know a good movie even if it bit you on the nose.
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
5th In my book, all a good movie needs is a chase scene and lots of things that blow up.
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9
Sample 6: No, I don't often dance. Isn't this a wonderful party?
Conversation History 6 Text Speech
1st Umm excuse me, miss. I'm Bob.
2nd I'm Amy. How do you do?
3rd I'm very glad to meet you. May I have this dance with you?
4th Certainly! I suppose you dance often.
Comparative
Current Text Ground Truth DailyTalk M²-CTTS Homogeneous Graph-based CSS CONCSS ECSS RADKA-CSS
5th No, I don't often dance. Isn't this a wonderful party?
Ablation
Ablation 1 Ablation 2 Ablation 3 Ablation 4 Ablation 5 Ablation 6 Ablation 7 Ablation 8 Ablation 9

Case Study

We present the current dialogue along with the Top 1, Top 2, Top 3, and Top 10 dialogues retrieved by RADKA-CSS that are most similar in conversational style to the current dialogue, as well as the Bottom 1, Bottom 2, Bottom 3, and Bottom 10 dialogues that are the least similar in conversational style. The similarity is calculated using cosine similarity.
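The ranking described above can be sketched as follows. This is a minimal illustration, assuming each dialogue is already represented by a single style vector. The function names, dialogue identifiers, and 3-dimensional toy vectors are hypothetical and are not taken from the RADKA-CSS code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_style_similarity(query_vec, stored_vecs):
    """Return (dialogue_id, similarity) pairs, most similar first."""
    scored = [
        (dialogue_id, cosine_similarity(query_vec, vec))
        for dialogue_id, vec in stored_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy style vectors, made up for illustration only.
current_style = [1.0, 0.0, 0.0]
stored_styles = {
    "hotel_booking": [0.9, 0.1, 0.0],
    "plane_tickets": [0.5, 0.5, 0.0],
    "lost_wallet": [-1.0, 0.2, 0.0],
}

ranking = rank_by_style_similarity(current_style, stored_styles)
top_1 = ranking[0]      # most similar stored dialogue
bottom_1 = ranking[-1]  # least similar stored dialogue
```

Because cosine similarity ranges from -1 to 1, the least similar dialogues can receive negative scores, as with the Bottom-ranked dialogues listed in this case study.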

The CD belongs to a hotel room reservation scenario. The dialogues ranked Top 1 and Top 10 are likewise hotel room reservations, while those ranked Bottom 1 and Bottom 10 belong to unrelated scenarios. Based on audio analysis of these dialogues, the styles of Top 1 and Top 10 are very similar to that of the CD, whereas the styles of Bottom 1 and Bottom 10 differ significantly.

Current Dialogue

t1: Good afternoon, San Felice Hotel. May I help you?
t2: Yes. I'd like to book a room, please.
t3: Certainly. When for, madam?
t4: March the twenty third.
t5: Umm how long will you be staying?
t6: Three nights.
t7: What kind of room would you like, madam?
t8: Er... double with bath. I'd appreciate it if you could give me a room with a view over the lake.
t9: Certainly, madam. I'll just check what we have available... Yes, we have a room on the fourth floor with a really splendid view.
t10: Fine. How much is the charge per night?
t11: Would you like breakfast?
t12: No, thanks.
t13: It's eighty four euro per night excluding VAT.
t14: That's fine.
t15: Who's the booking for, please, madam?
t16: Mr. and Mrs. Ryefield, that's R-Y-E-F-I-E-L-D.
t17: OK, let me make sure I got that.
t18: Yes it is. Thank you.
t19: Let me give you your confirmation number. It's seven five seven six three eight five. Thank you for choosing San Felice Hotel and have a nice day. Goodbye.
t20: Goodbye.

Top 1 in Conversational style similarity ranking (Similarity: 0.794)

t1: Royal Hotel, can I help you?
t2: Yes. I urgently need a room for tomorrow night, and do you have any vacancies?
t3: Yes, we have. What kind of room would you like?
t4: I'd like a suite with an ocean view, please.
t5: No problem, sir.
t6: Umm what is the price of the suite?
t7: It is two hundred dollars per night.
t8: It is a little high. I'm told that your hotel is offering discount now.
t9: Yes, but the offer ended yesterday. I'm sorry.
t10: Oh, I see. Then do you have anything less expensive?
t11: No, sir. So far it is the least expensive suite for tomorrow night.
t12: OK, I will take it. By the way, does the price include breakfast?
t13: Yes, it does. Now could I have your name, please?
t14: My name is David White.
t15: Would you kindly spell it for me?
t16: That's D-A-V-I-D, W-H-I-T-E.
t17: Thank you, I got it. And how long do you expect to stay?
t18: About three days.
t19: OK. Our check-in time is after twelve. And see you tomorrow.
t20: Thank you. See you.

Top 2 in Conversational style similarity ranking (Similarity: 0.783)

t1: Good morning, sir. How may I help you?
t2: Good morning! Do you have any rooms available at the moment?
t3: Yes, we do. What kind of room would you like?
t4: I'd like a suite for four nights.
t5: Please wait a moment while I check availability. Ah, I'm sorry, sir. We only have a double room available now.
t6: That's all right. Umm how much do they cost?
t7: Each night costs three twenty RIB, but for a four night stay, we can offer a discount of fifteen percent.
t8: How much in total?
t9: Ten thousand eighty eight RIB.
t10: Is breakfast included?
t11: Yes, it is. You also have free use of the leisure facilities here.
t12: That's fine. I'll get it.
t13: OK. Please fill out this form with your details.
t14: I would like to pay by cash. Do I need to pay a deposit?
t15: Yes, you do. There is a three hundred RIB deposit, which we will refund when you check out. So, in total, you need to pay thirteen eighty eight RIB.
t16: Fine. Here you are.
t17: Here's your key and receipt. Your room number is four o eight. A porter will take your luggage to your room. The elevator is just around the corner.
t18: Thank you very much.
t19: It is my pleasure, sir. I wish you a pleasant stay here. Goodbye!
t20: Bye-bye!

Top 3 in Conversational style similarity ranking (Similarity: 0.770)

t1: Hello, who is speaking, please?
t2: Hello, Mr. Stern. This is Hao Bo from the International Travel Agency. I have made the plane reservations for you.
t3: Oh, good. Let me get a pencil and take down the information. Well, go ahead, please.
t4: OK. You'll be travelling on Northwest Airlines, flight number two two two.
t5: Umm what time does it leave?
t6: It departs Guangzhou at ten thirty on the morning of July tenth.
t7: That is good.
t8: You want to fly first class. Is that correct, Mr. Stern?
t9: That's right.
t10: Well, I have got you three first class tickets and I have reserved your seats. Your seat numbers are eight A, eight B and eight C.
t11: Those are in the non-smoking section, aren't they?
t12: Yes, they are. I have charged the tickets to your credit card. They are six thirty dollars each, so it is eighteen ninety dollars for all three.
t13: Fine, thank you very much.
t14: One more thing. Could you give me the names of the people you'll be travelling with?
t15: Sure. They are my kids, Alex and Kathy Stern.
t16: All right. You're all set. Have a nice flight.
t17: Thanks.

Top 10 in Conversational style similarity ranking (Similarity: 0.743)

t1: Good morning, Madam. This is room service, may I help you?
t2: Good morning. I'd like to reserve some rooms for a tourist party.
t3: All right. Umm what kind of room would you like?
t4: You see, we are tourists whose requests are different, so please tell me more about it, will you?
t5: It's my pleasure. We have single rooms, double rooms, suites and luxury suites, et cetera. Well, here is an introduction to our hotel.
t6: That's great. I'd like to book four single rooms, five double rooms and three suites.
t7: All right, madam. For which dates do you want to book the rooms?
t8: From tomorrow till January eighth. That's five days in all.
t9: I see. Now please fill out the form.
t10: Here you are. Is everything OK?
t11: Just a minute, madam. You should pay a deposit of five hundred yuan beforehand.
t12: OK. Here you are.
t13: Thank you. Please keep this receipt.
t14: Thank you. By the way, is there any preferential rate for the party?
t15: Yes, there is a fifteen percent discount.
t16: That's wonderful. Thank you.
t17: You're welcome. I hope all of you will have a good time here.

Bottom 1 in Conversational style similarity ranking (Similarity: -0.380)

t1: Hey, How's it going?
t2: Not good. I lost my wallet.
t3: Oh, that's too bad. Was it stolen?
t4: No, I think it came out of my pocket when I was in the taxi.
t5: Is there anything I can do?
t6: Can I borrow some money?
t7: Sure, how much do you need?
t8: About fifty dollars.
t9: That's no problem.
t10: Thanks. I'll pay you back on Friday.
t11: That'll be fine. Here you are.
t12: What are you going to do now?
t13: I'm going to buy some books and then I'm going to the gas station.
t14: If you wait a minute I can go with you.
t15: OK. I'll wait for you.

Bottom 2 in Conversational style similarity ranking (Similarity: -0.343)

t1: Where's Mrs. Johnson?
t2: Just call her Lisa, Mary. She's cooking dinner.
t3: I see. Can I sit down?
t4: Of course! Make yourself at home.
t5: Thank you, Mr. Johnson.
t6: Please, just call me Tom.
t7: OK, Tom.
t8: Where's Cindy?
t9: She's upstairs in my room.
t10: Umm.. Can you tell her to come downstairs? We're about to have dinner.

Bottom 3 in Conversational style similarity ranking (Similarity: -0.331)

t1: Every year, the South has the floods. It is an act of God.
t2: Do you really think so?
t3: Yeah, umm, you have some other ideas?
t4: Think, in some way it is an act of God, but in another way, it is just caused by us.
t5: For example?
t6: We didn't pay attention to the environment, cut down trees and polluted the air.
t7: Oh, I see. Fortunately government has taken some action to prevent such things.

Bottom 10 in Conversational style similarity ranking (Similarity: -0.303)

t1: Umm I am not certain, but I think I might ask to be considered for the new job.
t2: Why are you considering trying for it?
t3: I think that I might like it, but I am still thinking about it.
t4: What is it about this job that appeals to you?
t5: I think that I would enjoy the position but there isn't a lot of creativity involved.
t6: Yes, you could be right. There is a lot to consider.
t7: I am also wondering about the pay.
t8: Would a slight decrease in pay be worth it for a new opportunity for growth?
t9: I am thinking that might be the case.
t10: I think you should give it a shot. What do you have to lose? You can always change your mind.