ACL 2026

RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

Shakhrul Iman Siam¹, Tiantian Feng², Jiankun Zhang³, Shrikanth Narayanan², Mi Zhang¹

¹The Ohio State University · ²University of Southern California · ³University of Chicago

RespiraMFM is a two-stage multimodal framework that first contrastively aligns respiratory audio embeddings with clinical text, then instruction-tunes a large language model for accurate respiratory disease identification.

Abstract

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.

Key Findings

State-of-the-art performance. RespiraMFM achieves a 9.15% improvement in AUROC on supervised tasks and 20.98% on zero-shot tasks over the strongest multimodal baseline.

Superior zero-shot generalization. RespiraMFM effectively detects unseen respiratory diseases — including asthma and pneumonia — with no disease-specific training samples.

Data-efficient training. RespiraMFM achieves comparable performance with an order of magnitude less training data than baselines, making it ideal for data-scarce clinical settings.

Effective contrastive alignment. The audio-text alignment module consistently improves AUROC across all tasks, confirmed by t-SNE visualization showing cleaner embedding clusters.

Method

Stage 1

Modality Alignment via Contrastive Learning

A lightweight MLP projection head is contrastively trained to map 768-dimensional OPERA-CT audio embeddings into the semantic space of the LLM's text encoder. This anchors non-linguistic acoustic biomarkers (coughs, wheezes, crackles) to the correct clinical symptom concepts — providing a semantically aligned initialization before fine-tuning.

Stage 2

Instruction Tuning with Frozen Aligner

The frozen, aligned projector is incorporated into a full multimodal pipeline. Audio embeddings, patient symptom context, and task-specific prompts are concatenated and fed to a 2.7B-parameter Phi-2 LLM fine-tuned with LoRA (rank 16, α = 32) for disease classification via a linear classification head.

Contrastive learning-based audio-text alignment maps audio embeddings into the LLM's semantic space before instruction tuning.

Dataset Overview

Class Distribution Across Datasets

UK COVID-19

Coughvid

Coswara

TBscreen

CodaTB

ICBHI

KAUH

Results

Supervised In-domain performance — Tasks T1–T4 (AUROC, mean ± std over 3 runs)

Task	Dataset	Disease	Qwen-2 Audio	BTS	RespLLM	RespiraMFM (Ours)
T1	UK COVID-19	COVID-19	0.855 ± 0.018	0.898 ± 0.010	0.881 ± 0.005	0.910 ± 0.002 (↑ 1.41%)
T2	Coughvid	COVID-19	0.561 ± 0.009	0.595 ± 0.014	0.613 ± 0.011	0.673 ± 0.011 (↑ 9.79%)
T3	TBscreen	TB	0.334 ± 0.043	0.568 ± 0.019	0.687 ± 0.016	0.709 ± 0.014 (↑ 3.20%)
T4	ICBHI	COPD	0.614 ± 0.005	0.880 ± 0.004	0.833 ± 0.007	0.999 ± 0.000 (↑ 13.64%)

Zero-Shot Out-of-distribution generalization — Tasks T5–T9 (AUROC, mean ± std over 3 runs)

Task	Dataset	Disease	Qwen-2 Audio	BTS	RespLLM	RespiraMFM (Ours)
T5	Coswara	COVID-19	0.813 ± 0.035	0.901 ± 0.008	0.900 ± 0.006	0.908 ± 0.005 (↑ 0.77%)
T6	CodaTB	TB	0.527 ± 0.012	0.645 ± 0.016	0.669 ± 0.019	0.689 ± 0.012 (↑ 2.99%)
T7	KAUH	COPD	0.581 ± 0.013	0.491 ± 0.014	0.425 ± 0.011	0.829 ± 0.005 (↑ 42.74%)
T8	KAUH	Asthma ★	0.458 ± 0.010	0.418 ± 0.016	0.399 ± 0.010	0.552 ± 0.014 (↑ 20.55%)
T9	KAUH	Pneumonia ★	0.301 ± 0.041	0.595 ± 0.020	0.400 ± 0.021	0.709 ± 0.013 (↑ 19.29%)
★ Diseases not seen during training — fully zero-shot evaluation.

Ablation Studies

Uni-Modal vs. Multi-Modal

On the Coswara dataset (T5), combining audio + text consistently outperforms either modality alone across all patient severity groups.

Input	Mild/None	Moderate	Healthy	Total
Audio only	0.3576	0.3571	0.7266	0.6102
Text only	0.3294	0.6190	0.9766	0.7934
Audio + Text	0.4047	0.6587	0.9849	0.8203

Effect of Alignment Module

The contrastive alignment module consistently improves AUC across all zero-shot tasks (T5–T9). Largest gain on unseen COPD dataset: +42.7%.

AUC of zero-shot detection with vs. without contrastive alignment.

BibTeX

@inproceedings{siam2026respiramfm,
  title     = {RespiraMFM: A Multimodal Foundation Model with Contrastive
               Audio-Language Alignment for Respiratory Disease Identification},
  author    = {Siam, Shakhrul Iman and Feng, Tiantian and Zhang, Jiankun
               and Narayanan, Shrikanth and Zhang, Mi},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association
               for Computational Linguistics (ACL)},
  year      = {2026}
}