DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Jia Li, Wei Wang, Min Chen
Institute of Artificial Intelligence, Global Tech University

Abstract

This paper introduces DialectGen, a benchmark dataset and evaluation suite for assessing and improving the robustness of multimodal generation models across diverse linguistic dialects. We propose a dialect-aware fine-tuning strategy that integrates explicit dialect features, improving models' ability to generate dialectally accurate and natural outputs. Comprehensive experiments show that models trained with DialectGen achieve substantial gains in dialectal coherence and overall output quality, and our analysis highlights open problems for future research on robust multimodal AI.

Keywords

Dialect robustness, Multimodal generation, Benchmarking, Large language models, Speech synthesis


1. Introduction

Multimodal generation models struggle to maintain consistent performance across linguistic dialects, often exhibiting a bias towards standard language varieties. This work addresses the problem of dialect robustness, which remains underexplored despite the global deployment of such systems. Our objectives are to establish a standardized benchmark for evaluating dialectal proficiency and to develop methods that improve model generalization. The models studied in this work include transformer-based text-to-speech architectures (e.g., VITS and Bark-style models), image-to-text systems, and general-purpose large language models (e.g., LLaMA and GPT variants).

2. Related Work

Existing literature on multimodal generation focuses primarily on standard language varieties, paying limited attention to the nuances of regional and social dialects. Previous benchmarks for speech synthesis and natural language generation often lack comprehensive dialectal coverage, so performance degradation on dialectal inputs goes unmeasured until models reach real-world use. Work on cross-lingual transfer learning offers some insights but does not fully address variation within a single language. This work builds on foundational research in multimodal learning while explicitly tackling the challenges posed by intra-language linguistic diversity.

3. Methodology

Our methodology centers on the construction of the DialectGen benchmark, a large-scale dataset of paired audio, text, and image examples annotated with specific dialectal features. We detail the data collection, cleaning, and annotation process, which is designed to ensure balanced representation across several major dialects. We further introduce a dialect-aware fine-tuning pipeline that combines explicit dialect embeddings with adversarial training. The pipeline regularizes model behavior so that generated content accurately reflects the target dialect's characteristics without sacrificing general output quality.
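The paper does not spell out the fine-tuning pipeline in code, but the two ingredients it names, explicit dialect embeddings and adversarial training, can be illustrated together. Below is a minimal PyTorch sketch under our own assumptions: the class names (DialectAwareModel, GradReverse), the gradient-reversal formulation of the adversarial objective, and the mean-pooling choice are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass,
    negated (scaled) gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DialectAwareModel(nn.Module):
    """Hypothetical generator conditioned on a learned dialect embedding,
    with an adversarial dialect classifier on the shared encoder states."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 hidden_dim: int, num_dialects: int, lam: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.dialect_emb = nn.Embedding(num_dialects, hidden_dim)
        self.dialect_clf = nn.Linear(hidden_dim, num_dialects)
        self.lam = lam

    def forward(self, inputs, dialect_ids):
        h = self.encoder(inputs)  # assumed shape (B, T, H)
        # Condition the decoder on an explicit dialect embedding.
        h_cond = h + self.dialect_emb(dialect_ids).unsqueeze(1)
        out = self.decoder(h_cond)
        # Adversarial head: gradient reversal trains the encoder to
        # *hide* dialect identity, regularizing shared representations.
        pooled = h.mean(dim=1)
        logits = self.dialect_clf(GradReverse.apply(pooled, self.lam))
        return out, logits
```

During training, the total loss would be the generation loss plus a cross-entropy term on the adversarial logits; because of the gradient reversal, minimizing that term pushes the encoder towards dialect-invariant representations while the explicit embedding carries dialect identity into the decoder.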

4. Experimental Results

Experiments on the DialectGen benchmark show that baseline multimodal generation models suffer significant performance drops on dialectal inputs, particularly in linguistic accuracy and naturalness. Our dialect-aware fine-tuning strategy consistently outperforms these baselines, improving robustness across all evaluated dialects. Both the objective Word Error Rate (WER) and the subjective Mean Opinion Score (MOS) improve substantially, confirming that the approach mitigates dialectal bias: training with DialectGen reduces errors and raises perceived quality for dialectal content generation.
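For reference, WER is an edit-distance-based metric, whereas MOS is a subjective listening score with no closed-form implementation. The following is a standard textbook implementation of WER in Python, not the paper's evaluation code; the dialectal example sentence is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word in a five-word reference -> WER = 0.2
print(word_error_rate("the wee bairn is sleeping",
                      "the wee bairn is sleepin"))
```

Note that because WER is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many insertions.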

5. Discussion

The experimental results confirm that current multimodal generation models struggle with dialectal variations, underscoring the urgent need for dialect-aware training paradigms. Our DialectGen benchmark and proposed methodology effectively address these shortcomings, offering a pathway towards more inclusive and robust AI systems. The significant improvements in both objective and subjective metrics highlight the importance of dedicated dialectal data and specialized training techniques. Future work will explore expanding DialectGen to more dialects and integrating our methods into larger, pre-trained models to achieve broader applicability and impact.