
Devstral

692 points | by mfiguiere | 4 days ago

23 comments

simonw · 4 days ago
The first number I look at these days is the file size via Ollama, which for this model is 14GB: https://ollama.com/library/devstral/tags

I find that on my M2 Mac that number is a rough approximation of how much memory the model needs (usually plus about 10%), which matters because I want to know how much RAM I will have left for running other applications.

Anything below 20GB tends not to interfere too much with the other stuff I'm running. This model looks promising!
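(A minimal sketch of that rule of thumb, assuming `ollama list` is on PATH and prints its usual NAME/ID/SIZE/MODIFIED columns with sizes in GB; the 10% overhead is the commenter's estimate, not anything Ollama guarantees.)

    # Minimal sketch: estimate how much RAM an Ollama model will need from its
    # on-disk size, using the ~10% overhead rule of thumb from the comment above.
    import subprocess

    def estimated_ram_gb(model: str, overhead: float = 0.10) -> float:
        """Parse `ollama list` output and add a rough overhead factor."""
        out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True).stdout
        for line in out.splitlines()[1:]:           # skip the header row
            cols = line.split()
            if cols and cols[0].startswith(model):  # e.g. "devstral:latest  ...  14 GB  ..."
                size_gb = float(cols[2])            # SIZE column, assumed to be in GB
                return size_gb * (1 + overhead)
        raise ValueError(f"model {model!r} not found; run `ollama pull {model}` first")

    print(f"~{estimated_ram_gb('devstral'):.1f} GB RAM needed")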
oofbaroomf · 4 days ago
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run this for almost free, this is an extraordinary model.
dismalaf · 4 days ago
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company.

Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, and it's really good. Mistral Small is really good too. I'm also building a startup with Mistral integration.
solomatov · 4 days ago
It's very nice that it has the Apache 2.0 license, i.e. a well-understood license, instead of some "open weight" license with a lot of conditions.
johnQdeveloper · 3 days ago
*For people without a 24GB RAM video card: I've got an 8GB one, and running this model on ollama performs OK for simple tasks, but you'd probably want to pay for an API for anything time-sensitive that uses a large context window:*

    total duration:       35.016288581s
    load duration:        21.790458ms
    prompt eval count:    1244 token(s)
    prompt eval duration: 1.042544115s
    prompt eval rate:     1193.23 tokens/s
    eval count:           213 token(s)
    eval duration:        33.94778571s
    eval rate:            6.27 tokens/s

    total duration:       4m44.951335984s
    load duration:        20.528603ms
    prompt eval count:    1502 token(s)
    prompt eval duration: 773.712908ms
    prompt eval rate:     1941.29 tokens/s
    eval count:           1644 token(s)
    eval duration:        4m44.137923862s
    eval rate:            5.79 tokens/s

Compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and whatnot, is all I'm saying.

In terms of benchmarks, it seems unusually well tuned for the model size, but I suspect that's just a case of gaming the measurement by testing against it as part of the model's development. That's not bad in and of itself, since I suspect every LLM vendor marketing to IT folks in this space does the same thing, so it's objective enough as a rough gauge of "is this usable?" without a heavy time expense testing it.
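(A minimal sketch of how to pull the same figures programmatically, assuming Ollama's local REST API on its default port and that `ollama pull devstral` has already been run; `ollama run --verbose` prints a similar summary interactively.)

    # Minimal sketch: query Ollama's local /api/generate endpoint and derive the
    # token throughput numbers shown above. Durations come back in nanoseconds.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "devstral",
            "prompt": "Write a Python function that parses an ISO 8601 timestamp.",
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    stats = resp.json()

    NS = 1e9  # nanoseconds per second
    print(f"total duration:   {stats['total_duration'] / NS:.2f}s")
    print(f"prompt eval rate: {stats['prompt_eval_count'] / (stats['prompt_eval_duration'] / NS):.2f} tokens/s")
    print(f"eval rate:        {stats['eval_count'] / (stats['eval_duration'] / NS):.2f} tokens/s")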
CSMastermind · 4 days ago
I don't believe the benchmarks they're presenting.

I haven't tried it out yet, but every model I've tested from Mistral has been towards the bottom of my benchmarks, in a similar place to Llama.

Would be very surprised if the real life performance is anything like they're claiming.
christophilus · 3 days ago
What hardware are y'all using when you run these things locally? I was thinking of pre-ordering the Framework desktop[0] for this purpose, but I wouldn't mind having a decent laptop that could run it (ideally on Linux).

[0] https://frame.work/desktop
ddtaylor · 4 days ago
Wow. I was just grabbing some models and happened to see this one while I was messing with tool support in LlamaIndex. I have an agentic coding thing I threw together and have been trying different models on it; I was looking to throw ReAct at it to bring in some models that don't have tool support, and this just pops into existence!

I'm not able to get my agentic system to use this model, though; it just says "I don't have the tools to do this". I tried modifying various agent prompts to explicitly say "Use foo tool to do bar" without any luck yet. All of the ToolSpec objects that I use are annotated Pydantic objects, and every other model has figured out how to use these tools.
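(For reference, a minimal sketch of this kind of setup, assuming LlamaIndex's Ollama integration and FunctionTool-based tools; the `read_file` tool and the prompt are hypothetical, not the commenter's actual system.)

    # Minimal sketch: a ReAct agent in LlamaIndex backed by a local Ollama model.
    # Assumes `pip install llama-index llama-index-llms-ollama` and `ollama pull devstral`.
    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool
    from llama_index.llms.ollama import Ollama

    def read_file(path: str) -> str:
        """Return the contents of the file at `path`."""
        with open(path) as f:
            return f.read()

    # Wrap the plain function so the agent can expose it to the model as a tool.
    tools = [FunctionTool.from_defaults(fn=read_file)]

    llm = Ollama(model="devstral", request_timeout=300.0)
    agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)

    # The ReAct loop asks the model to pick a tool, runs it, and feeds the result back.
    print(agent.chat("Use the read_file tool to summarize README.md"))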
qwertox · 4 days ago
Maybe the EU should cover the cost of creating this agent/model, assuming it really delivers what it promises. It would allow Mistral to keep focusing on what they do, and for us it would mean that the EU spent money wisely.
jwr · 3 days ago
My experience with LLMs seems to indicate that the benchmark numbers are more and more detached from reality, at least my reality.

I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.

I don't know what to make of this. I don't trust benchmarks much anymore.
ics · 4 days ago
Maybe someone here can suggest tools, or at least where to look: what are the state-of-the-art models to run locally on relatively low-power machines like a MacBook Air? Is anyone tracking what is feasible for a given machine spec?

"Apple Intelligence" isn't it, but it would be nice to know, without churning through tests, whether I should bother keeping around 2-3 models for specific tasks in ollama, or whether their performance there is marginal and there's a more stable all-rounder model.
twotwotwo · 3 days ago
Any company in this space outside of the top few should be contributing to the open-source tools (Aider, OpenHands, etc.); that is a better bet than making your own tools from scratch to compete with ones from much bigger teams. A couple of folks making harnesses work better with your model might yield improvements faster than a lot of model-tuning work, and might also come out of the process with practical observations about what to work on in the next spin of the model.

Separately, deploying more autonomous agents that just look at an issue or such seems premature now. We've only just gotten assisted flows kind-of working, and they still get lost (get stuck on not-hard-for-a-human debugging tasks, implement Potemkin 'fixes', forget their tools, make unrelated changes that sometimes break stuff, etc.) in ways that imply *that* flow isn't fully baked yet.

Maybe the main appeal is asynchrony/potential parallelism? You could tackle that in different ways, though. And SWE-Bench might still be a good benchmark (focus on where you want to be, even if you aren't there yet), but that doesn't mean it represents the most practical way to use these tools day-to-day currently.
screye · 3 days ago
What's the play for smaller base-model training companies like Mistral?

Mistral's positioning as the European alternative doesn't seem to be sticking. Acquisition seems tricky given how Inflection, Character.ai and Stability got carved out. The big acquisition bucks are going to product companies (Windsurf).

They could pivot up the stack, but then they'd be starting from scratch with a team that's ill-suited for product development.

The base model offerings from pretraining companies have been surprisingly myopic. DeepMind seems to be the only one going past the obvious "content gen/coding automation" verticals. There's a whole world out there. LLM product companies are fast acquiring pieces of the real money pie, and smaller pretraining companies are getting left out.

______

edit: my comment rose to the top. It's early in the morning. Injecting a splash of optimism.

LLMs are hard, and giants like Meta are struggling to make steady progress. Mistral's models are cheap, competent, open-source-ish and don't come with AGI-is-imminent baggage. Good enough for me.

To my own question: they have a list of target industries at the top. https://mistral.ai/solutions#industry

Good luck to them.
bravura · 4 days ago
And how do the results compare to hosted LLMs like Claude 3.7?
thih9 · 3 days ago
> Devstral excels at using tools to explore codebases

As an AI and vibe coding newbie, how does that work? E.g. how would I use devstral and ollama and instruct it to use tools? Or would I need some other program as well?
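(Not the commenter's setup, just one possible shape of an answer sketched under assumptions: the ollama Python client with its tool-calling support, a model that emits tool calls, and a hypothetical `list_files` tool. Agent harnesses like OpenHands or Cline automate this prompt/call/execute loop for you.)

    # Minimal sketch: hand a tool to a local model via the ollama Python client
    # and execute whatever tool call it returns. Requires `pip install ollama`.
    import os
    import ollama

    def list_files(directory: str) -> str:
        """Hypothetical tool: list the entries in a directory."""
        return "\n".join(os.listdir(directory))

    response = ollama.chat(
        model="devstral",
        messages=[{"role": "user", "content": "Which files are in the current directory?"}],
        # The client converts the Python function into a tool schema for the model;
        # it does NOT run the tool for you.
        tools=[list_files],
    )

    # If the model chose to call the tool, run it ourselves and print the result.
    for call in response.message.tool_calls or []:
        if call.function.name == "list_files":
            print(list_files(**call.function.arguments))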
YetAnotherNick · 4 days ago
The SWE-Bench score is super impressive for a model of any size. However, providing just one benchmark result and having to partner with OpenHands for it makes it seem like they focused too much on optimizing that number.
gyudin · 4 days ago
Super weird benchmarks
abrowne2 · 4 days ago
Curious to check this out, since they say it can run on a 4090 / a Mac with >32 GB of RAM.
jadbox · 4 days ago
But how does it compare to deepcoder?
AnhTho_FR · 4 days ago
Impressive performance!
anonym29 · 3 days ago
I know it's not the recommended runner (OpenHands), but running this on Cline (ollama back-end), it seemed absolutely atrocious at file reading and tool calling.
ManlyBread · 4 days ago
> Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an ideal choice for local deployment and on-device use

This is still too much; a single 4090 costs $3k.
TZubiri · 4 days ago
I feel this is part of a larger and very old business trend.

But do we need 20 companies copying each other and doing the same thing?

Like, is that really competition? I'd say competition is when you do something slightly different, but I guess it's subjective based on your interpretation of what is a commodity and what is proprietary.

To my view, everyone is outright copying and creating commodity markets:

OpenAI: The OG, the Coke of Modern AI
Claude: The first copycat, the Pepsi of Modern AI
Mistral: Euro OpenAI
DeepSeek: Chinese OpenAI
Grok/xAI: Republican OpenAI
Google/MSFT: OpenAI clone as a SaaS or Office package
Meta's Llama: Open Source OpenAI

etc...