Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure

Abstract

This paper investigates the emergent capability of Vision Foundation Models (VFMs) for analyzing Photoplethysmography (PPG) signals to derive vital signs, specifically blood pressure. We introduce Vision4PPG, a novel framework that transforms 1D PPG signals into a 2D representation suitable for processing by pre-trained VFMs. The methodology involves fine-tuning these robust models on a comprehensive PPG dataset to accurately predict systolic and diastolic blood pressure. Experimental results demonstrate that Vision4PPG significantly outperforms traditional machine learning approaches, showcasing the potential of VFMs in non-invasive vital sign monitoring.

1. Introduction

Accurate and continuous monitoring of vital signs like blood pressure is crucial for preventive healthcare and disease management, yet current methods often lack non-invasiveness or continuous capabilities. The challenge lies in developing robust, accessible technologies that can provide reliable physiological data from readily available sources like PPG signals. This work explores how powerful Vision Foundation Models (VFMs), originally designed for complex image tasks, can be adapted to overcome these limitations in physiological signal analysis. The primary models used in this article include various Vision Foundation Models (VFMs), specifically pre-trained Vision Transformers (ViT) and Swin Transformers, which are fine-tuned for PPG signal processing.

2. Related Work

Existing literature on PPG-based vital sign estimation typically employs traditional machine learning algorithms or simpler convolutional neural networks (CNNs) on extracted features. While some deep learning approaches have shown promise, they often require extensive domain-specific architecture design and large, labeled datasets for effective training from scratch. Recent advances in vision foundation models have revolutionized image and video understanding, demonstrating remarkable generalization capabilities across diverse tasks. Our work builds upon this foundation by bridging the gap between powerful vision models and the unique challenges of physiological signal processing.

3. Methodology

The Vision4PPG methodology begins by converting raw 1D PPG signals into a structured 2D representation, effectively creating 'image-like' inputs that leverage the spatial and temporal correlations within the waveform. This transformation allows for direct application and fine-tuning of pre-trained Vision Foundation Models, such as Vision Transformers, without extensive architectural modifications. We collected a diverse dataset of PPG signals synchronized with reference blood pressure measurements, which was then split into training, validation, and testing sets. The fine-tuning process involved optimizing the model's parameters using a regression objective function, aiming to minimize the error between predicted and actual blood pressure values.

4. Experimental Results

The Vision4PPG framework demonstrated significant improvements in blood pressure estimation accuracy compared to established baseline methods. Key performance metrics, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), consistently showed lower values for our VFM-based approach. The results confirm that the emergent feature extraction capabilities of Vision Foundation Models are highly effective for discerning subtle patterns in PPG signals indicative of blood pressure changes. Below is a summary of the performance on the test set, highlighting Vision4PPG's superior accuracy.

Model	Systolic BP MAE (mmHg)	Diastolic BP MAE (mmHg)	Systolic BP RMSE (mmHg)	Diastolic BP RMSE (mmHg)
Traditional ML (e.g., SVM)	8.5	6.2	11.3	8.1
Basic CNN	6.1	4.8	8.2	6.5
Vision4PPG (ViT-based)	3.4	2.6	4.5	3.5

5. Discussion

The superior performance of Vision4PPG underscores the immense potential of repurposing pre-trained Vision Foundation Models for complex physiological signal analysis. This paradigm shift can accelerate research in non-invasive health monitoring by reducing the need for extensive task-specific model development. The ability of VFMs to learn robust, generalized features from complex data directly translates to more accurate and reliable vital sign estimations. Future work will focus on deploying Vision4PPG in real-time environments and exploring its applicability for other vital signs or multi-modal data integration, further advancing personalized healthcare solutions.