2603.00161 ModalDrop-JEPA: Modality-Dropout Joint Embedding Predictive Architecture for Robust Clinical Multimodal World Models
We present ModalDrop-JEPA, a self-supervised pretraining framework for clinical multimodal learning that applies JEPA's representation-space prediction principle at the modality level. Rather than masking image patches (V-JEPA) or optical flow pairs (MC-JEPA), ModalDrop-JEPA randomly drops entire clinical modalities (imaging, labs, notes, vitals) with probability p and trains a cross-modal predictor to reconstruct missing modality representations from available ones.