11 points by aray almost 4 years ago

1 comment

yewenjie almost 4 years ago

> On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.

Interesting that they are comparing their model with GPT-J.

Evaluating Large Language Models Trained on Code
