MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology


Chaoyi Wu1,2
Xiaoman Zhang1,2
Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2

1CMIC, Shanghai Jiao Tong University
2Shanghai AI Laboratory

Accepted by ICCV 2023

Code [GitHub]

Paper [arXiv]

Cite [BibTeX]


Abstract

In this paper, we consider the problem of enhancing self-supervised visual-language pre-training (VLP) with medical-specific knowledge, by exploiting the paired image-text reports from daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel report pre-processing mechanism that simply extracts the useful medical entities, avoiding unnecessary complexity from understanding the language grammar; Second, we propose a novel entity embedding module that queries an external knowledge description base, to exploit the rich context of additional information that the medical domain affords, and to implicitly build relationships between entities in the language embedding space; Third, we propose a novel Transformer-based fusion model for spatially aligning the entity descriptions with visual signals at the image patch level using only self-supervised learning, thus enabling spatial grounding; Fourth, we conduct thorough experiments to validate the effectiveness of our proposed architecture and evaluate it on numerous public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model demonstrates strong performance on disease classification and grounding compared with prior methods.



Architecture

The overall framework of our method. The figure contains three modules: the Visual Encoder, Knowledge-enhanced Language Encoding, and the Fusion Module. Knowledge-enhanced Language Encoding consists of a Report Filter and a Text Encoder: the Report Filter extracts entities from the raw reports, and the Text Encoder further embeds them. The Visual Encoder encodes the visual input, and the Fusion Module performs cross-modality interaction. The details of the Report Filter are shown in the right sub-figure: a report is first processed by a pre-trained filter and viewed as a set of triplets; the ``Position'' part is mixed with negative positions for the contrastive loss, and the ``Exist'' part is used for the CE loss.
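As a rough illustration of how these modules interact, the following PyTorch-style sketch wires entity queries, patch features, and the fusion Transformer together. All names, dimensions, and layer choices here are our own simplifying assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedKLIPSketch(nn.Module):
    """Minimal sketch of the three modules: Visual Encoder,
    Knowledge-enhanced Language Encoding, and Fusion Module.
    Hyper-parameters and layer choices are illustrative only."""

    def __init__(self, num_entities=75, dim=256):
        super().__init__()
        # Visual Encoder: any backbone producing patch-level features;
        # a strided conv stands in for it here.
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Entity embeddings: in the paper these come from a text encoder
        # applied to external knowledge descriptions; a learnable table
        # of entity query vectors stands in for them here.
        self.entity_queries = nn.Embedding(num_entities, dim)
        # Fusion Module: Transformer decoder in which entity queries attend
        # to image patches; the cross-attention maps enable spatial grounding.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerDecoder(layer, num_layers=4)
        # "Exist" head: per-entity binary prediction, trained with CE loss.
        self.exist_head = nn.Linear(dim, 2)

    def forward(self, images):
        # images: (B, 3, H, W) -> patch features: (B, N_patches, dim)
        feats = self.visual_encoder(images).flatten(2).transpose(1, 2)
        # one query per medical entity, shared across the batch
        queries = self.entity_queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        fused = self.fusion(queries, feats)            # (B, num_entities, dim)
        exist_logits = self.exist_head(fused)          # (B, num_entities, 2)
        return fused, exist_logits

def exist_ce_loss(exist_logits, exist_labels):
    # CE loss on the "Exist" part of the filtered triplets; the "Position"
    # contrastive loss over positive/negative position texts is omitted here.
    return F.cross_entropy(exist_logits.flatten(0, 1), exist_labels.flatten())
```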



Quantitative Results

R1: Zero-shot Classification

Comparison with other state-of-the-art methods on the zero-shot classification task. AUC, F1, and ACC scores are reported. All diseases mentioned in the table are seen during pre-training.

Comparison with other state-of-the-art methods on the zero-shot unseen disease ``Covid-19'' classification task. AUC, F1, and ACC scores are reported. ``Direct Covid-19'' refers to directly using ``Covid-19'' to construct the prompt sentence, while ``Covid-19 Description'' refers to replacing the name ``Covid-19'' with its medical description.
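To make the two prompting strategies concrete, here is a small, hypothetical helper; the actual templates and the knowledge description base live in the paper and repository, and the Covid-19 description below is only a placeholder.

```python
def build_prompt(disease: str, use_description: bool, description_db: dict) -> str:
    """Build the query sentence for zero-shot classification.

    'Direct' prompting inserts the disease name itself; 'Description'
    prompting replaces an unseen name (e.g. Covid-19) with a medical
    description, so the text encoder can relate it to entities seen
    during pre-training.
    """
    query = description_db.get(disease, disease) if use_description else disease
    return f"There is {query} in the image."

# Illustrative description, not the exact text used by the method.
description_db = {
    "Covid-19": "patchy ground-glass opacity or consolidation in the "
                "peripheral and lower lung zones"
}
print(build_prompt("Covid-19", use_description=True, description_db=description_db))
```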

Radar figure of our method and other methods on the 14 diseases of the ChestX-ray14 dataset. AUC scores are reported and, as shown, our method exceeds the previous state-of-the-art on most diseases.

R2: Zero-shot Grounding

Comparison with other state-of-the-art methods on zero-shot region grounding tasks. (a) shows the results on the RSNA Pneumonia dataset. (b) shows the results on the SIIM-ACR Pneumothorax dataset. The pneumothorax region tends to be thin and narrow, and is much more challenging to ground, so we only consider the pointing game, recall, and precision. Our method achieves better performance on different metrics, especially on the pointing game score. ``ConVIRT'', the earliest of the baseline methods, cannot perform grounding.
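For reference, the pointing game metric used above can be computed roughly as follows; this is a minimal sketch, and the exact evaluation protocol (e.g. heatmap post-processing) may differ from the released code.

```python
import numpy as np

def pointing_game_hit(heatmap: np.ndarray, gt_mask: np.ndarray) -> bool:
    """A prediction counts as a hit if the heatmap maximum falls inside
    the ground-truth abnormal region.

    heatmap: (H, W) predicted grounding map (e.g. upsampled attention).
    gt_mask: (H, W) binary ground-truth mask.
    """
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[y, x] > 0)

def pointing_game_score(heatmaps, gt_masks) -> float:
    """Fraction of cases whose heatmap peak hits the lesion."""
    return float(np.mean([pointing_game_hit(h, m)
                          for h, m in zip(heatmaps, gt_masks)]))
```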

Comparison with other state-of-the-art methods on the zero-shot unseen disease Covid-19 grounding task. ``Direct Covid-19'' refers to directly using ``Covid-19'' to construct the prompt sentence, while ``Covid-19 Description'' refers to replacing the name ``Covid-19'' with its medical description. Our method achieves better performance on different metrics.

R3: Fine-tuning Classification and Segmentation

Comparison of AUC scores with other state-of-the-art methods on the fine-tuning classification task. The macro average of AUC scores over the 14 diseases is reported for the ChestX-ray14 dataset.
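Here, the macro average weights every disease equally: compute the AUC per disease, then take the unweighted mean. A minimal scikit-learn sketch (array shapes are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged AUC over diseases.

    y_true:  (N, 14) binary labels, one column per ChestX-ray14 disease.
    y_score: (N, 14) predicted probabilities.
    """
    per_disease = [roc_auc_score(y_true[:, d], y_score[:, d])
                   for d in range(y_true.shape[1])]
    return float(np.mean(per_disease))
```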

Comparison of Dice scores with other state-of-the-art methods on fine-tuning segmentation tasks. Three diseases are reported; for each disease, three data portions (1%, 10%, and 100%) are adopted to show how performance changes.

R4: Fine-tuning Grading (Fine-grained Classification)

Comparison with other state-of-the-art methods on the fine-tuning edema severity grading multi-class classification task. AUC scores are reported in the table. ``0, 1, 2, 3'' in the table represent the severity levels, and the final macro average scores are reported.

R5: Ablation Study

Ablation study on the zero-shot classification task. ``PosCL'' refers to the position contrastive loss and ``DE'' refers to the description encoder. AUC, F1, and ACC scores are reported. For ChestX-ray14, the metrics all refer to the macro average over the 14 diseases.

Ablation study on zero-shot grounding tasks. (a) shows the results on the RSNA Pneumonia dataset. (b) shows the results on the SIIM-ACR Pneumothorax dataset.



Visualizations of Zero-shot Grounding

Visualization of the zero-shot grounding results of our method. Each column shows the results for one disease: the left image is the ground truth and the right is the heatmap prediction of our model. The brighter the red in the figure, the more likely the model considers the region to be associated with an abnormality.
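Overlays of this kind can be reproduced in spirit with a few lines of matplotlib; the sketch below is not the exact plotting code, and the colormap and blending used for the figures may differ.

```python
import matplotlib.pyplot as plt
import numpy as np

def show_grounding(image: np.ndarray, heatmap: np.ndarray, title: str = "") -> None:
    """Overlay a predicted grounding heatmap on a chest X-ray.

    image:   (H, W) grayscale X-ray, values in [0, 1].
    heatmap: (H, W) predicted map; higher values = more likely abnormal.
    """
    plt.imshow(image, cmap="gray")
    # Warm colors (bright red) mark regions the model associates
    # with the queried abnormality.
    plt.imshow(heatmap, cmap="jet", alpha=0.4)
    plt.title(title)
    plt.axis("off")
    plt.show()
```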



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.