Persimmon-8B

175 points by jgershen · over 1 year ago

14 comments

AaronFriel · over 1 year ago
I hope this is only a slight tangent: the authors talk about their model-serving throughput, and I'm hoping to get a gut check on my understanding of the state of the art in model serving.

The success of ChatGPT and my current work have had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface using retrieval-augmented generation to generate Pulumi programs, and user experience is top of mind for me.

(Fingers crossed this doesn't hug our site to death here, for the reasons I'm about to explain.)

To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model-serving APIs have really poor time-to-first-token, or even completely lack streaming. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest but haven't found much else.

---

For folks in MLOps deploying models with streaming APIs:

1. Is it mostly accurate that none of the model-serving tools created prior to ChatGPT are great for streaming, interactive use cases?

2. How are you currently serving these models as an API, and what upcoming tools are you exploring?

For the authors: how does your inference optimization compare to vLLM, or to other tools using techniques such as continuous batching and paged attention?
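To make the streaming requirement concrete, here is a minimal sketch (not production code) of the UX I mean: tokens rendered the moment they arrive from an OpenAI-compatible endpoint, which vLLM can expose. The base URL, API key, and model name are placeholders.

```python
# Consume a streaming chat completion and print tokens as they arrive.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="my-finetuned-8b",  # hypothetical model name
    messages=[{"role": "user", "content": "Write a Pulumi program for an S3 bucket."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # None for role-only chunks
    if delta:
        print(delta, end="", flush=True)    # the user sees output immediately
```

Anything that can only return the full completion at once forces the spinner-for-a-minute experience, no matter how high its aggregate throughput is.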
gardnr · over 1 year ago
Two important takeaways on the base model:

* It scored 18.9 on HumanEval (coding), where Llama2 7B scored 12.2.

* It was trained from the beginning with a 16k context using a modified RoPE, whereas many models are trained at 4k and only fine-tuned with RoPE afterward to gain longer context windows.

Can anyone share ideas on how important the second point is? Do LLMs benefit from large context windows using RoPE during pretraining?
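For context, here is a toy rotate-half-style RoPE sketch (illustrative only; the `base`, shapes, and conventions are assumptions, not Persimmon's actual implementation). The per-channel rotation frequencies below are exactly what post-hoc long-context tricks rescale after 4k training, whereas Persimmon used its 16k scheme from the start.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Each channel pair is rotated by angle position * base**(-i/half);
    changing `base` (or interpolating positions) is how fine-tuning
    stretches a short-context model to longer windows.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```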
thewataccount · over 1 year ago
Awesome! I applaud everyone training new models and attempting different techniques!

I'm concerned about the current download's availability: it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (accidentally moving the files, bandwidth limits, deleting them later, etc.).

I'm curious whether there's a reason it's not also hosted on Hugging Face? I'm not saying they're the best place, but redundancy is good, most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.
123yawaworht456 · over 1 year ago
I applaud you guys for not including any nauseating gibberish in this press release or, seemingly, anywhere else on your website. It's like a breath of fresh air compared to every other AI-related resource I've seen recently. Please keep it up.
imjonse · over 1 year ago
Congrats on the release! Two questions:

1) In the results table, Llama2 base is compared to Persimmon base and fine-tuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?

2) The Llama2 numbers for MMLU in that table seem different from those on the HF leaderboard and the Llama2 webpage. Is it the 1-shot variant that is different, or are these measurements not 100% standard and reproducible?
elietoubi · over 1 year ago
Really cool! Honestly, I wish these releases would come with a demo (like on Replicate or Hugging Face).
deckar01 · over 1 year ago
The Docker container fails installing flash-attn… but honestly, a giant API container on top of a custom model-generation framework loses all the benefits of Torch's standard interfaces. It doesn't really matter how optimized your model runtime is if it's cemented into a synchronous monolith. The metric that should be optimized is time to first decoded token, because that is how speed is perceived by humans reading the output.
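A minimal sketch of measuring that metric, assuming a hypothetical streaming generator of decoded tokens:

```python
import time
from typing import Iterator, Tuple

def time_to_first_token(stream: Iterator[str]) -> Tuple[float, str]:
    """Return (seconds until the first decoded token, the token itself).

    Works against any streaming generator. A synchronous monolith that only
    returns the full completion effectively scores its *total* latency here,
    which is exactly why it feels slow to a human reading along.
    """
    start = time.perf_counter()
    first = next(stream)  # blocks until the server emits something
    return time.perf_counter() - start, first
```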
automatistist · over 1 year ago
> The standard practice for achieving fast inference is to rewrite the entire model inference loop in C++, as in FasterTransformer, and call out to special fused kernels in CUDA. But this means that any changes to the model require painfully reimplementing every feature twice: once in Python / PyTorch in the training code and again in C++ in the inference codebase. We found this process too cumbersome and error prone to iterate quickly on the model.

I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.
Havoc · over 1 year ago
> The model has 70k unused embeddings for multimodal extensions

Could someone briefly explain what this means? Multimodal as in pictures? But if unused, then presumably that part is somehow untrained... so it wouldn't know what to do with a picture?
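Roughly, the idea looks like the sketch below (hedged; the sizes are hypothetical, not Persimmon's actual configuration): the embedding table is allocated larger than the text vocabulary, and since no text token ever indexes the reserved rows during pretraining, they receive no gradient and stay at their random initialization until a later multimodal fine-tune assigns them (e.g., to image-patch tokens).

```python
import torch.nn as nn

# Hypothetical sizes, just to illustrate the layout.
text_vocab, reserved, dim = 192_000, 70_000, 4_096

# Rows [0, text_vocab) are trained by text tokens; rows [text_vocab, end)
# are "unused": nothing maps to them during text-only pretraining, so their
# weights remain untrained until another modality claims them.
tok_embed = nn.Embedding(text_vocab + reserved, dim)
```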
TrueDuality · over 1 year ago
Appreciate the release! Since you're hosting the downloads directly, I'd recommend throwing an integrity hash for each of the files alongside the download links so users can verify there wasn't any corruption in transfer.
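For example, the verification step could look like this minimal sketch (the file name is a placeholder, and the digest would come from the download page):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB checkpoints never sit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage against a published digest:
# assert sha256_of("persimmon-8b-base.tar") == "<digest from the download page>"
```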
rvz · over 1 year ago
Good. Keep it going. Let's have more free, $0 AI models released, since we all know this is the future and you can't compete with free.

The AI race to zero must be accelerated with $0 free models and less control from gatekeepers such as ClosedAI.
sunshadow · over 1 year ago
Do you have any explanation for why this performed better than Llama 2?
visarga · over 1 year ago
Since it is coming from Adept, maybe they are building 8B models for UI automation, where inputs are usually large and low latency is required. It's basically a task of information extraction and UI action generation.
theLiminator · over 1 year ago
What kinds of use cases do these sub-10B-parameter models serve? Are they mostly useful for code completion?