It seems intuitive, since ReLU acts as a kind of implicit regularization. Why would subsequent gradient descent help once you've already gained the benefit of throwing away the "outliers", i.e. the data beyond the threshold you want?
Don't confuse this with universal approximation - yes, shallow ReLU networks are dense in function space, so in the limit you can approximate any continuous function as closely as you want - but they are talking about exact representation with finitely many neurons here.
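To make the distinction concrete, here's a minimal numpy sketch (my own illustration, not a construction from the paper being discussed): a piecewise linear function like |x| is represented *exactly* by a shallow ReLU net with two hidden units, whereas a smooth function like x^2 can only be *approximated*, with the error shrinking as you add more hidden units. The breakpoint placement and least-squares fit of the output weights are arbitrary choices made just for the demo.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(-3.0, 3.0, 601)

# Exact representation with finitely many neurons:
# |x| = relu(x) + relu(-x), i.e. a shallow ReLU net with 2 hidden units.
abs_net = relu(x) + relu(-x)
print("max error for |x|:", np.max(np.abs(abs_net - np.abs(x))))  # exactly 0

# Universal approximation (density): x**2 is smooth, so a finite shallow ReLU
# net (always piecewise linear) can only approximate it. Here we least-squares
# fit the output weights of k hidden units with hand-picked breakpoints plus an
# output bias; the max error decreases as k grows but never hits zero.
for k in (5, 20, 80):
    breaks = np.linspace(-3.0, 3.0, k)
    H = np.column_stack([np.ones_like(x),                 # output bias
                         relu(x[:, None] - breaks[None, :])])  # hidden units
    w, *_ = np.linalg.lstsq(H, x**2, rcond=None)
    err = np.max(np.abs(H @ w - x**2))
    print(f"k={k:3d} hidden units, max error for x^2: {err:.4f}")
```

Running it, the |x| error is 0 with just 2 units, while the x^2 error only decays with width - which is the difference between exact finite representation and density in the limit.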