Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence has no slot left for an "if", that reads wrong.<p>Does this force the model to encode a high-level answering strategy up front? (AFAIU, there's no reordering or insertion during sampling; the canvas length is fixed.) Or does it mean a masking model of a given size is more prone to making things up just to fit the blank space?
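A toy sketch of the constraint I mean (`toy_denoiser` is a made-up stand-in for a real masked diffusion model, not anyone's actual API): the canvas length is fixed before sampling starts, slots can be filled in any order, but no slot can ever be inserted or deleted, so every blank has to end up holding *something*:

```python
import random

random.seed(0)

MASK = "<mask>"

def toy_denoiser(tokens):
    # Hypothetical stand-in for a trained masked diffusion model:
    # picks one remaining masked slot and proposes a filler token.
    vocab = ["if", "it", "rains", "we", "stay", "home"]
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    i = random.choice(masked)
    return i, random.choice(vocab)

def sample(length):
    # Fixed-length canvas: fill order is free, length is not.
    tokens = [MASK] * length
    while MASK in tokens:
        i, tok = toy_denoiser(tokens)
        tokens[i] = tok
    return tokens

out = sample(6)
assert len(out) == 6    # length was committed before any token existed
assert MASK not in out  # every slot got filled, sensible or not
```

If the model never budgeted a slot for "if", the only way out is to fill that space with some other token, which is the confabulation failure mode I'm asking about.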