I hope this is only a slight tangent: since the authors talk about their model-serving throughput, I'm hoping to get a gut check on my understanding of the state of the art in model serving.<p>The success of ChatGPT and my current work have had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface that uses retrieval-augmented generation to generate Pulumi programs, and user experience is top of mind for me.<p>(Fingers crossed this doesn't hug our site to death for the reasons I'm about to explain.)<p>To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model-serving APIs have really poor time-to-first-token, or lack streaming entirely. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.<p>---<p>For folks in MLOps deploying models behind streaming APIs:<p>1. Is it mostly accurate that none of the model-serving tools created before ChatGPT are great for streaming, interactive use cases?<p>2. How are you currently serving these models as an API, and what upcoming tools are you exploring?<p>For the authors: how does your inference optimization compare to vLLM or other tools that use techniques such as continuous batching and PagedAttention?
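For concreteness, this is roughly the out-of-the-box serving behavior I'm after, a minimal sketch assuming a vLLM OpenAI-compatible server running locally (the model name and endpoint are placeholders, and the client code assumes the pre-1.0 openai Python library):

    # Sketch: start a vLLM OpenAI-compatible server, e.g.
    #   python -m vllm.entrypoints.openai.api_server --model my-finetuned-model
    # then stream completions so the user sees tokens as they are decoded.
    import openai  # pre-1.0 client style

    openai.api_key = "EMPTY"                      # vLLM's server doesn't check keys
    openai.api_base = "http://localhost:8000/v1"  # assumed local endpoint

    response = openai.Completion.create(
        model="my-finetuned-model",   # hypothetical model name
        prompt="import pulumi\n",
        max_tokens=256,
        stream=True,                  # tokens arrive as they are generated
    )
    for chunk in response:
        # each chunk carries newly decoded text; print it immediately
        print(chunk.choices[0].text, end="", flush=True)

Anything that only returns the full completion after a minute of silence is a non-starter for an interactive product, no matter how good the aggregate throughput is.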
Two important takeaways on the base model:<p>* scored 18.9 on HumanEval (coding), where Llama 2 7B scored 12.2<p>* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are simply fine-tuned with RoPE scaling to gain a longer context window after the base model has been trained at 4k.<p>Can anyone share ideas on how important the second point is? Do LLMs benefit from large RoPE context windows during pretraining rather than having them bolted on afterward?
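To make the distinction concrete (this is not Adept's actual modification, which the post doesn't spell out; it's just vanilla RoPE next to the position-interpolation style of post-hoc extension):

    import torch

    def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
        # Standard RoPE inverse frequencies for a head dimension `dim`.
        return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    def rope_angles(seq_len: int, dim: int, base: float = 10000.0,
                    scale: float = 1.0) -> torch.Tensor:
        # Rotation angles for each (position, frequency) pair.
        # scale < 1.0 is the position-interpolation trick used to stretch a
        # 4k-trained model to a longer window at fine-tune time; a model
        # pretrained natively at 16k just uses scale = 1.0 throughout.
        positions = torch.arange(seq_len).float() * scale
        return torch.outer(positions, rope_inv_freq(dim, base))

    # Fine-tuned extension: squeeze 16k positions into the 4k range the model saw.
    angles_interp = rope_angles(seq_len=16_384, dim=128, scale=4_096 / 16_384)
    # Native 16k pretraining: positions are used as-is from the start.
    angles_native = rope_angles(seq_len=16_384, dim=128)

The question is essentially whether seeing the full, unscaled position range during pretraining gives better long-context behavior than interpolating after the fact.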
Awesome! I applaud everyone training new models and attempting different techniques!<p>I'm concerned about the current download's availability - it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (the files get moved accidentally, bandwidth limits kick in, the bucket gets deleted later, etc.).<p>I'm curious whether there's a reason it's not also hosted on Hugging Face? I'm not saying they're the best place, but redundancy is good: most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.
I applaud you guys for not including any nauseating gibberish in this press release or, seemingly, anywhere else on your website. It's like a breath of fresh air compared to every other AI-related resource I've seen recently. Please, keep it up.
Congrats on the release! Two questions.<p>1) In the results table, the Llama 2 base model is compared against both Persimmon base and Persimmon fine-tuned, and only the latter performs better. Would a comparison to Llama-2-chat be possible/fair?<p>2) The Llama 2 MMLU numbers in that table seem different from those on the HF leaderboard and in the Llama 2 webpage presentation. Is it the 1-shot variant that differs, or are these measurements not 100% standard and reproducible?
The Docker container fails to install flash-attn… but honestly, a giant API container on top of a custom model-generation framework loses all the benefits of Torch's standard interfaces. It doesn't really matter how optimized your model runtime is if it's cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how humans reading the output perceive speed.
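That metric is also trivial to measure once a streaming endpoint exists; a rough sketch against any OpenAI-compatible streaming server (the endpoint and model name are assumptions, pre-1.0 openai client):

    import time
    import openai

    openai.api_key = "EMPTY"
    openai.api_base = "http://localhost:8000/v1"   # assumed local streaming endpoint

    def time_to_first_token(prompt: str, model: str = "my-model") -> float:
        # Seconds until the first decoded token arrives, which tracks perceived
        # latency far better than end-to-end throughput does.
        start = time.perf_counter()
        stream = openai.Completion.create(
            model=model, prompt=prompt, max_tokens=128, stream=True
        )
        for _ in stream:          # the first chunk is the first decoded token
            return time.perf_counter() - start
        return float("inf")       # stream ended without producing anything

    print(f"TTFT: {time_to_first_token('Hello') * 1000:.0f} ms")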
> The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.<p>I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error-prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.
>The model has 70k unused embeddings for multimodal extensions,<p>Could someone briefly explain what this means? Multimodal as in pictures? But if they're unused, then presumably that part is untrained... so the model wouldn't know what to do with a picture?
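My understanding (a guess, not from the post) is that the token-embedding matrix is simply allocated larger than the text vocabulary, so new modalities can be attached later without resizing the checkpoint. Roughly, with made-up sizes:

    import torch
    import torch.nn as nn

    TEXT_VOCAB = 192_000          # hypothetical text vocabulary size
    RESERVED_MULTIMODAL = 70_000  # rows allocated now, trained later

    # One embedding table sized for both; the reserved rows stay at their random
    # initialization during text-only training because no training token ever
    # maps to those ids.
    tok_embeddings = nn.Embedding(TEXT_VOCAB + RESERVED_MULTIMODAL, 4096)

    # Later, a vision encoder could map image patches onto ids in the reserved
    # range, and only then would those rows get trained.
    image_token_ids = torch.arange(TEXT_VOCAB, TEXT_VOCAB + 256)
    image_embs = tok_embeddings(image_token_ids)   # currently just noise

So yes: as released, those slots carry no learned meaning; they're headroom for future multimodal fine-tuning, not a working image pathway.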
Appreciate the release! Since you're hosting the downloads directly, I'd recommend publishing an integrity hash (e.g., SHA-256) for each file alongside the download links so users can verify there wasn't any corruption in transit.
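Even a single SHA-256 per file would do; something like this is enough for users to check against (the filename below is a placeholder):

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        # Stream the file so multi-GB checkpoints don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare the printed digest against the published value by eye or in CI.
    print(sha256_of("persimmon-8b-base.tar"))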
Good. Keep it going. Let's have more free, $0 AI models released, since we all know this is the future and you can't compete with free.<p>The AI race to zero must be accelerated with free models and less control from gatekeepers such as ClosedAI.
Since this is coming from Adept, maybe they are building 8B models for UI automation, where the inputs are usually large and the required latency is low. It's basically a task of information extraction and UI action generation.