1. Introduction
Whole slide imaging (WSI) and artificial intelligence (AI) are transforming pathology, offering rapid and precise diagnostic capabilities. However, preprocessing steps like color normalization, crucial for dataset consistency, can inadvertently introduce artifacts or "hallucinations" that compromise AI model reliability and diagnostic integrity. The models commonly used in this context include Convolutional Neural Networks (CNNs) for classification and segmentation, autoencoders for representation learning, and generative adversarial networks (GANs) for image synthesis or style transfer.
2. Related Work
Previous research has extensively covered AI applications in digital pathology, focusing on tasks such as tumor detection, grading, and prognosis prediction. Numerous color normalization methods, such as Macenko, Reinhard, and Vahadane, have been developed to mitigate staining variability across slides and laboratories; a sketch of the Reinhard approach follows this paragraph. While these methods aim to standardize image appearance, their side effects on subtle diagnostic features and their potential to introduce artificial patterns have received less attention in the literature.
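For concreteness, the following is a minimal sketch of a Reinhard-style transform, assuming the common simplification of matching per-channel mean and standard deviation in CIELAB space (Reinhard's original formulation used the lαβ space). The function name and the use of `skimage` are illustrative choices, not a reference implementation of any of the methods cited above.

```python
import numpy as np
from skimage import color

def reinhard_normalize(source_rgb: np.ndarray, target_rgb: np.ndarray) -> np.ndarray:
    """Match per-channel LAB mean/std of `source_rgb` to `target_rgb`.

    Both inputs are float RGB arrays in [0, 1] of shape (H, W, 3).
    Illustrative sketch of Reinhard-style color transfer, not a
    reference implementation.
    """
    src_lab = color.rgb2lab(source_rgb)
    tgt_lab = color.rgb2lab(target_rgb)

    # Per-channel statistics over all pixels.
    src_mean = src_lab.reshape(-1, 3).mean(axis=0)
    src_std = src_lab.reshape(-1, 3).std(axis=0) + 1e-8  # avoid divide-by-zero
    tgt_mean = tgt_lab.reshape(-1, 3).mean(axis=0)
    tgt_std = tgt_lab.reshape(-1, 3).std(axis=0)

    # Shift and scale each channel toward the target statistics,
    # then convert back to RGB and clip to the valid range.
    norm_lab = (src_lab - src_mean) / src_std * tgt_std + tgt_mean
    return np.clip(color.lab2rgb(norm_lab), 0.0, 1.0)
```

Because the transform is applied globally, any local structure whose statistics differ from the slide-wide average is shifted along with everything else, which is one route by which such methods can alter subtle diagnostic features.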
3. Methodology
This study employs a multi-faceted approach, evaluating the effects of several prominent color normalization techniques on diverse WSI datasets comprising both simulated and real-world histopathology slides. Image quality metrics, visual inspection by expert pathologists, and the performance of downstream AI diagnostic models are assessed before and after normalization. A custom pipeline quantifies the generation of spurious features and their impact on model decision-making and diagnostic outcomes; one possible formulation of this quantification is sketched below.
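The study's exact artifact index formula is not reproduced here. As an assumption for illustration, the sketch below defines a proxy index as the fraction of pixels whose structural dissimilarity between the original and normalized tile exceeds a threshold; the function name `artifact_index` and the SSIM-based formulation are hypothetical.

```python
import numpy as np
from skimage.metrics import structural_similarity

def artifact_index(original: np.ndarray, normalized: np.ndarray,
                   residual_threshold: float = 0.1) -> float:
    """Hypothetical artifact index: fraction of pixels whose per-pixel
    structural dissimilarity to the original exceeds a threshold.

    Both inputs are float RGB arrays in [0, 1] with identical shapes.
    This is an illustrative proxy, not the metric defined in the study.
    """
    # full=True returns the per-pixel SSIM map alongside the scalar score.
    _, ssim_map = structural_similarity(
        original, normalized, channel_axis=-1, data_range=1.0, full=True
    )
    dissimilarity = 1.0 - ssim_map  # 0 = identical, larger = more distorted
    return float((dissimilarity > residual_threshold).mean())
```

A per-pixel map, rather than a single scalar score, makes it possible to localize where a normalization method has distorted tissue appearance, which supports the pathologist review described above.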
4. Experimental Results
Experimental results demonstrate that while normalization generally improves cross-dataset model generalization, specific techniques can introduce subtle yet significant pixel-level distortions. These distortions, often imperceptible to the human eye, lead to "hallucinations" in AI models, manifesting as false-positive detections or altered tumor boundaries. The table below summarizes the impact of different normalization methods on AI model performance and artifact generation; a sketch of one way to count such hallucinations follows the table.
| Normalization Method | Accuracy (%) | F1-Score (%) | Artifact Index (lower is better) |
|---|---|---|---|
| None | 88.5 | 87.2 | 0.05 |
| Macenko | 91.2 | 90.5 | 0.18 |
| Reinhard | 90.8 | 89.9 | 0.22 |
| Vahadane | 92.1 | 91.7 | 0.12 |
| Proposed Hybrid | 92.5 | 92.0 | 0.08 |
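To illustrate how hallucinations of this kind might be counted in practice, the sketch below runs the same detection model on matched tiles before and after normalization and counts connected regions that become positive only after normalization. The `model` callable, the `tile_pairs` iterable, and the score threshold are assumed interfaces for illustration, not part of the study's pipeline.

```python
import numpy as np
from scipy import ndimage

def count_hallucinations(model, tile_pairs, score_threshold: float = 0.5) -> int:
    """Count detection regions present after normalization but absent before.

    `model` is any callable mapping an RGB tile to a per-pixel probability
    map with the tile's height and width (a hypothetical interface).
    `tile_pairs` yields (original_tile, normalized_tile) array pairs.
    """
    hallucinated = 0
    for original, normalized in tile_pairs:
        mask_before = model(original) >= score_threshold
        mask_after = model(normalized) >= score_threshold
        # Pixels flagged positive only after normalization are candidate
        # hallucinations; count connected components rather than raw pixels
        # so a single spurious lesion is counted once.
        new_positives = mask_after & ~mask_before
        _, num_regions = ndimage.label(new_positives)
        hallucinated += num_regions
    return hallucinated
```

Counting region-level rather than pixel-level discrepancies keeps the measure aligned with how a false-positive detection would actually surface in a diagnostic workflow.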
5. Discussion
The findings underscore a critical trade-off between standardizing WSI data through normalization and preserving the subtle morphological features essential for accurate diagnosis. The observed AI "hallucinations" challenge the conventional assumption that preprocessing uniformly benefits model performance, revealing a hidden risk to diagnostic reliability. These results necessitate a re-evaluation of current WSI preprocessing protocols and emphasize the importance of rigorous validation, potentially involving pathologist-in-the-loop review of AI decisions, especially in high-stakes clinical applications. Future work should focus on developing robust normalization techniques that are artifact-aware and transparent.