AI interpretability tools fail to predict inner misalignment

1 point by philbert101 over 3 years ago

1 comment

Links to articles: https://distill.pub/2020/understanding-rl-vision/ and https://arxiv.org/pdf/2105.14111.pdf