ACL 2026

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

¹The Ohio State University  ·  ²University of Southern California  ·  ³University of Chicago

9.15% AUROC gain on supervised tasks
20.98% AUROC gain on zero-shot tasks
5 respiratory diseases covered
7 real-world datasets
RespiraMFM Overview

RespiraMFM is a two-stage multimodal framework that first contrastively aligns respiratory audio embeddings with clinical text, then instruction-tunes a large language model for accurate respiratory disease identification.

Abstract

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.

Key Findings

State-of-the-art performance. RespiraMFM achieves a 9.15% improvement in AUROC on supervised tasks and 20.98% on zero-shot tasks over the strongest multimodal baseline.

Superior zero-shot generalization. RespiraMFM effectively detects unseen respiratory diseases — including asthma and pneumonia — with no disease-specific training samples.

Data-efficient training. RespiraMFM achieves comparable performance with an order of magnitude less training data than baselines, making it ideal for data-scarce clinical settings.

Effective contrastive alignment. The audio-text alignment module consistently improves AUROC across all tasks, confirmed by t-SNE visualization showing cleaner embedding clusters.

Method

Stage 1: Modality Alignment via Contrastive Learning

A lightweight MLP projection head is contrastively trained to map 768-dimensional OPERA-CT audio embeddings into the semantic space of the LLM's text encoder. This anchors non-linguistic acoustic biomarkers (coughs, wheezes, crackles) to the correct clinical symptom concepts — providing a semantically aligned initialization before fine-tuning.
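The alignment objective in Stage 1 can be illustrated with a small sketch. The page does not state the exact loss, temperature, or projector architecture, so the code below assumes a symmetric InfoNCE (CLIP-style) loss with a temperature of 0.07, and it uses tiny toy vectors in place of the projected 768-dimensional OPERA-CT embeddings and the LLM-side text embeddings. The MLP projector that would produce `audio_proj` is omitted, and all function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(audio_proj, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text) vectors.

    Assumption: this loss form and temperature are illustrative, not taken
    from the paper.
    """
    audio = [l2_normalize(v) for v in audio_proj]
    text = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of audio i and text j, divided by temperature
    logits = [[sum(a * t for a, t in zip(av, tv)) / temperature for tv in text]
              for av in audio]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: 3 paired samples in a 4-dimensional shared space.
text_emb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
aligned_audio = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0]]
# Rotate the audio batch so that no audio vector sits next to its own text.
shuffled_audio = [aligned_audio[1], aligned_audio[2], aligned_audio[0]]

loss_aligned = symmetric_info_nce(aligned_audio, text_emb)
loss_shuffled = symmetric_info_nce(shuffled_audio, text_emb)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is low when each audio vector is closest to its own paired text and high when the pairing is broken, which is the signal that would drive a projector toward the text embedding space.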

Stage 2: Instruction Tuning with Frozen Aligner

The frozen, aligned projector is incorporated into a full multimodal pipeline. Audio embeddings, patient symptom context, and task-specific prompts are concatenated and fed to a 2.7B-parameter Phi-2 LLM fine-tuned with LoRA (rank 16, α = 32) for disease classification via a linear classification head.
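To make the LoRA setting concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer with rank 16 and α = 32, which gives an update scale of α / rank = 2. The layer size (64 × 64), the initialization scale, and the class name `LoRALinear` are assumptions for illustration. The page does not say which Phi-2 modules are adapted, and a real setup would rely on a deep learning framework rather than Python lists.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, d_in, d_out, rank=16, alpha=32, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight (random here, purely for illustration).
        self.weight = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is (rank x d_in) with Gaussian init,
        # B is (d_out x rank) with zero init, so the update starts at zero.
        self.lora_a = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scaling * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count only the low-rank factors; the base weight stays frozen."""
        return (len(self.lora_a) * len(self.lora_a[0])
                + len(self.lora_b) * len(self.lora_b[0]))

layer = LoRALinear(d_in=64, d_out=64)
x = [1.0] * 64
frozen_count = 64 * 64
print("scaling:", layer.scaling)
print("trainable:", layer.trainable_parameters(), "frozen:", frozen_count)
print("matches base output at init:", layer.forward(x) == matvec(layer.weight, x))
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly at the start of fine-tuning. Only the two small factors are then updated.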

Contrastive Alignment

Contrastive learning-based audio-text alignment maps audio embeddings into the LLM's semantic space before instruction tuning.

Dataset Overview

Class Distribution Across Datasets (figure). Datasets shown: UK COVID-19, Coughvid, Coswara, TBscreen, CodaTB, ICBHI, KAUH.

Results

Supervised: in-domain performance, Tasks T1–T4 (AUROC, mean ± std over 3 runs)

| Task | Dataset | Disease | Qwen-2 Audio | BTS | RespLLM | RespiraMFM (Ours) |
|---|---|---|---|---|---|---|
| T1 | UK COVID-19 | COVID-19 | 0.855 ± 0.018 | 0.898 ± 0.010 | 0.881 ± 0.005 | 0.910 ± 0.002 (↑ 1.41%) |
| T2 | Coughvid | COVID-19 | 0.561 ± 0.009 | 0.595 ± 0.014 | 0.613 ± 0.011 | 0.673 ± 0.011 (↑ 9.79%) |
| T3 | TBscreen | TB | 0.334 ± 0.043 | 0.568 ± 0.019 | 0.687 ± 0.016 | 0.709 ± 0.014 (↑ 3.20%) |
| T4 | ICBHI | COPD | 0.614 ± 0.005 | 0.880 ± 0.004 | 0.833 ± 0.007 | 0.999 ± 0.000 (↑ 13.64%) |

Zero-shot: out-of-distribution generalization, Tasks T5–T9 (AUROC, mean ± std over 3 runs)

| Task | Dataset | Disease | Qwen-2 Audio | BTS | RespLLM | RespiraMFM (Ours) |
|---|---|---|---|---|---|---|
| T5 | Coswara | COVID-19 | 0.813 ± 0.035 | 0.901 ± 0.008 | 0.900 ± 0.006 | 0.908 ± 0.005 (↑ 0.77%) |
| T6 | CodaTB | TB | 0.527 ± 0.012 | 0.645 ± 0.016 | 0.669 ± 0.019 | 0.689 ± 0.012 (↑ 2.99%) |
| T7 | KAUH | COPD | 0.581 ± 0.013 | 0.491 ± 0.014 | 0.425 ± 0.011 | 0.829 ± 0.005 (↑ 42.74%) |
| T8 | KAUH | Asthma ★ | 0.458 ± 0.010 | 0.418 ± 0.016 | 0.399 ± 0.010 | 0.552 ± 0.014 (↑ 20.55%) |
| T9 | KAUH | Pneumonia ★ | 0.301 ± 0.041 | 0.595 ± 0.020 | 0.400 ± 0.021 | 0.709 ± 0.013 (↑ 19.29%) |

★ Diseases not seen during training; fully zero-shot evaluation.
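The ↑ percentages in the results tables appear to be relative AUROC improvements over the strongest baseline in each row. The page does not state the formula, so this is an assumption. The sketch below applies that reading to the rounded means listed for T2, T3, and T6. The function name `relative_gain_percent` is hypothetical.

```python
def relative_gain_percent(ours, baselines):
    """Relative improvement (in percent) over the best-performing baseline."""
    best = max(baselines)
    return (ours - best) / best * 100.0

# (RespiraMFM mean AUROC, [Qwen-2 Audio, BTS, RespLLM]) taken from the tables.
rows = {
    "T2": (0.673, [0.561, 0.595, 0.613]),
    "T3": (0.709, [0.334, 0.568, 0.687]),
    "T6": (0.689, [0.527, 0.645, 0.669]),
}

for task, (ours, baselines) in rows.items():
    print(f"{task}: +{relative_gain_percent(ours, baselines):.2f}%")
# T2: +9.79%
# T3: +3.20%
# T6: +2.99%
```

These three rows match the listed gains exactly. For other rows the same calculation on the rounded means lands within about 0.1 percentage points of the listed value, presumably because the listed gains were computed from unrounded AUROCs.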

Ablation Studies

Uni-Modal vs. Multi-Modal

On the Coswara dataset (T5), combining audio + text consistently outperforms either modality alone across all patient severity groups.

| Input | Mild/None | Moderate | Healthy | Total |
|---|---|---|---|---|
| Audio only | 0.3576 | 0.3571 | 0.7266 | 0.6102 |
| Text only | 0.3294 | 0.6190 | 0.9766 | 0.7934 |
| Audio + Text | 0.4047 | 0.6587 | 0.9849 | 0.8203 |

Effect of Alignment Module

The contrastive alignment module consistently improves AUROC across all zero-shot tasks (T5–T9). The largest gain, +42.7%, is on the unseen COPD dataset (KAUH, T7).

Alignment ablation

AUC of zero-shot detection with vs. without contrastive alignment.

BibTeX

@inproceedings{siam2026respiramfm,
  title     = {RespiraMFM: A Multimodal Foundation Model with Contrastive
               Audio-Language Alignment for Respiratory Disease Identification},
  author    = {Siam, Shakhrul Iman and Feng, Tiantian and Zhang, Jiankun
               and Narayanan, Shrikanth and Zhang, Mi},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association
               for Computational Linguistics (ACL)},
  year      = {2026}
}