Deep Neural Networks to Detect and Quantify Lymphoma Lesions: Related Work

12 Jun 2024


(1) Shadab Ahamed, University of British Columbia, Vancouver, BC, Canada, BC Cancer Research Institute, Vancouver, BC, Canada. He was also a Mitacs Accelerate Fellow (May 2022 - April 2023) with Microsoft AI for Good Lab, Redmond, WA, USA;

(2) Yixi Xu, Microsoft AI for Good Lab, Redmond, WA, USA;

(3) Claire Gowdy, BC Children’s Hospital, Vancouver, BC, Canada;

(4) Joo H. O, St. Mary’s Hospital, Seoul, Republic of Korea;

(5) Ingrid Bloise, BC Cancer, Vancouver, BC, Canada;

(6) Don Wilson, BC Cancer, Vancouver, BC, Canada;

(7) Patrick Martineau, BC Cancer, Vancouver, BC, Canada;

(8) François Bénard, BC Cancer, Vancouver, BC, Canada;

(9) Fereshteh Yousefirizi, BC Cancer Research Institute, Vancouver, BC, Canada;

(10) Rahul Dodhia, Microsoft AI for Good Lab, Redmond, WA, USA;

(11) Juan M. Lavista, Microsoft AI for Good Lab, Redmond, WA, USA;

(12) William B. Weeks, Microsoft AI for Good Lab, Redmond, WA, USA;

(13) Carlos F. Uribe, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada;

(14) Arman Rahmim, BC Cancer Research Institute, Vancouver, BC, Canada, and University of British Columbia, Vancouver, BC, Canada.

Numerous works have explored deep learning methods for segmenting lymphoma in PET/CT images. Yuan et al. [4] developed a feature-fusion technique to exploit the complementary information in multi-modality data. Hu et al. [5] proposed fusing a 3D ResUNet trained on volumetric data with three 2D ResUNets trained on slices from three orthogonal directions to enhance segmentation performance. Li et al. [6] proposed DenseX-Net, trained end-to-end, integrating supervised and unsupervised learning for lymphoma detection and segmentation. Liu et al. [7] introduced techniques such as patch-based negative-sample augmentation and label guidance to train a 3D Residual-UNet for lymphoma segmentation. A major limitation of all these works is that they were developed on relatively small datasets (fewer than 100 images). Moreover, most did not compare their proposed methods against other baselines or against the performance of physicians.

Constantino et al. [8] compared the performances of 7 semi-automated and 2 deep learning segmentation methods, while Weisman et al. [9] compared 11 automated segmentation techniques, although both studies used small datasets of 65 and 90 images, respectively. Weisman et al. [10] compared the segmentation performance of the automated 3D DeepMedic method with that of physicians, although this study too included just 90 lymphoma cases. Except for [10], none of these studies reported model generalization on out-of-distribution data (such as data collected from different centers), limiting their robustness quantification and external validity. Jiang et al. [11] used a larger dataset than the above studies, training a 3D UNet on 297 images; they also performed out-of-distribution testing on 117 images collected from a different center. To the best of our knowledge, the largest lymphoma PET/CT dataset reported for deep learning-based lesion segmentation is that of Blanc-Durand et al. [12], who used 639 images for model development and 94 for external testing; however, this study used only standard segmentation evaluation metrics and assessed the model's ability to predict accurate total metabolic tumor volume (TMTV). Both [11] and [12] are limited by the fact that their datasets consisted exclusively of patients diagnosed with diffuse large B-cell lymphoma (DLBCL), a single subtype of lymphoma.

Most existing studies on deep learning-based lymphoma segmentation report performance via generic segmentation metrics such as the Dice similarity coefficient (DSC), intersection-over-union (IoU), and sensitivity. In the presence of large segmented lesions, very small missed lesions or small false positives contribute little to the DSC value. Hence, there is a need to also report the volumes of false positives and false negatives. It is likewise beneficial to evaluate detection performance on a per-lesion basis (number of connected components detected vs. missed), since automated detection of even a few voxels of every lesion can help physicians quickly locate regions of interest, even when the DSC is low. Moreover, the difficulty of the segmentation/detection task is often not assessed via inter- or intra-observer agreement analysis.
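The insensitivity of DSC to small missed lesions, and the per-lesion counting described above, can be illustrated with a minimal sketch. The function names and the toy 1D masks below are our own illustrative assumptions, not taken from any of the cited works:

```python
import numpy as np
from scipy import ndimage


def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0


def per_lesion_detection(pred, gt):
    """Count ground-truth lesions (connected components) that overlap
    the prediction by at least one voxel."""
    labels, n_lesions = ndimage.label(gt)
    detected = sum(
        1 for i in range(1, n_lesions + 1)
        if np.logical_and(labels == i, pred).any()
    )
    return detected, n_lesions


# Toy example: one large lesion segmented perfectly, one tiny lesion missed.
gt = np.zeros(100, dtype=bool)
gt[10:60] = True    # large lesion (50 voxels)
gt[90:92] = True    # tiny lesion (2 voxels)
pred = np.zeros(100, dtype=bool)
pred[10:60] = True  # only the large lesion is predicted

score = dice(pred, gt)                      # ~0.980: DSC stays high
found, total = per_lesion_detection(pred, gt)  # (1, 2): half the lesions missed
```

Despite a whole lesion being missed, the DSC remains near-perfect, whereas the per-lesion count (1 of 2 detected) exposes the failure, which motivates reporting both.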

Our study aims to address these limitations. We trained and validated four deep neural networks on lymphoma PET/CT datasets from three cohorts, encompassing two distinct subtypes of lymphoma: DLBCL and primary mediastinal large B-cell lymphoma (PMBCL). (i) We performed both internal (images from the same cohorts as the training/validation set) and external, out-of-distribution (images from a fourth cohort not used for training/validation) testing to evaluate the robustness of our models. (ii) We reported performance using DSC and the volumes of false positives and false negatives, and evaluated how performance depends on six different types of lesion measures. (iii) We also evaluated the ability of our networks to reproduce these ground-truth lesion measures and computed the networks' errors in predicting them. (iv) We proposed three types of detection criteria for our use case and evaluated model performance on these metrics. (v) Finally, we evaluated intra- and inter-observer agreement to quantify the difficulty of the lesion segmentation task on our datasets.
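The false-positive and false-negative volumes mentioned in point (ii) can be sketched as follows; the function name, toy masks, and voxel size are hypothetical assumptions for illustration, not details from the paper:

```python
import numpy as np


def fp_fn_volumes(pred, gt, voxel_volume_ml):
    """False-positive and false-negative volumes (in ml) between a
    binary predicted mask and a binary ground-truth mask."""
    fpv = np.logical_and(pred, np.logical_not(gt)).sum() * voxel_volume_ml
    fnv = np.logical_and(gt, np.logical_not(pred)).sum() * voxel_volume_ml
    return fpv, fnv


# Toy example on a 4x4x4 grid with 2 mm isotropic voxels (0.008 ml each).
gt = np.zeros((4, 4, 4), dtype=bool)
gt[:2, :2, :2] = True     # ground truth: 8 voxels
pred = np.zeros((4, 4, 4), dtype=bool)
pred[:2, :2, 1:3] = True  # prediction shifted by one voxel: 8 voxels

fpv, fnv = fp_fn_volumes(pred, gt, voxel_volume_ml=0.008)
# 4 voxels are false positives and 4 are false negatives (0.032 ml each),
# even though DSC on this pair would be a respectable 0.5.
```

Reporting these volumes alongside DSC separates over-segmentation from under-segmentation, which a single overlap score cannot do.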

This paper is available on arxiv under CC 4.0 license.