Take these results with a grain of salt. There's a large class imbalance in this dataset, and ROC curves can be misleading in this case. The test set contains 269 positive examples and 8482 negative examples.<p>From [1]:<p>> Class imbalance can cause ROC curves to be poor visualizations of classifier performance. For instance, if only 5 out of 100 individuals have the disease, then we would expect the five positive cases to have scores close to the top of our list. If our classifier generates scores that rank these 5 cases as uniformly distributed in the top 15, the ROC graph will look good (Fig. 4a). However, if we had used a threshold such that the top 15 were predicted to be true, 10 of them would be FPs, which is not reflected in the ROC curve. This poor performance is reflected in the PR curve, however.
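A quick way to see that effect with scikit-learn (toy numbers for the quoted 5-in-100 scenario, not the paper's data; the exact ranks I give the five positives are just one way to spread them through the top 15):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    # Toy version of the quoted scenario: 100 individuals, 5 positives,
    # and a classifier that spreads the 5 positives uniformly through the top 15 ranks.
    n = 100
    y = np.zeros(n, dtype=int)
    y[[2, 5, 8, 11, 14]] = 1              # positives at ranks 3, 6, 9, 12, 15
    scores = np.linspace(1.0, 0.0, n)     # strictly decreasing, so index == rank

    print(roc_auc_score(y, scores))            # ~0.94 -- the ROC view looks great
    print(average_precision_score(y, scores))  # ~0.33 -- the PR view does not
    # Calling the top 15 positive means precision 5/15; only the PR curve surfaces that.

And 269 positives against 8482 negatives (~3%) is even more skewed than the 5-in-100 example.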
<p>The authors seem to be aware of this in the supplement and also evaluate performance by a hazard ratio they define (sketched in code at the end of this comment):<p>> We calculated the ratio of the observed cancer incidence in the top 10% of patients over the incidence in the middle 80% and referred to this metric as the top decile hazard ratio. We calculated the ratio of the observed cancer incidence in the bottom 10% of patients over the incidence in the middle 80% and referred to this metric as the bottom decile hazard ratio.<p>However, binning is a form of p-hacking [2]. And I'm still wondering why they don't just post the Precision-Recall curves.<p>[1] <a href="https://doi.org/10.1038/nmeth.3945" rel="nofollow">https://doi.org/10.1038/nmeth.3945</a><p>[2] <a href="https://doi.org/10.1080/09332480.2006.10722771" rel="nofollow">https://doi.org/10.1080/09332480.2006.10722771</a><p>[Edit] to add link to [2]
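For what it's worth, the decile metric they describe boils down to something like this (a rough sketch, not the authors' code; the function name, the exact 10%/80%/10% cut, and ignoring ties are my assumptions):

    import numpy as np

    def decile_ratios(risk, outcome):
        """Observed cancer incidence in the top and bottom 10% of predicted
        risk, each divided by the incidence in the middle 80%."""
        order = np.argsort(risk)                   # ascending predicted risk
        outcome = np.asarray(outcome, dtype=float)[order]
        k = len(outcome) // 10                     # decile size (assumes >= 10 patients)
        bottom, middle, top = outcome[:k], outcome[k:-k], outcome[-k:]
        mid = middle.mean()
        return top.mean() / mid, bottom.mean() / mid

The 10/80/10 cut points are free parameters, which is exactly the binning concern above.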