
Show HN: I built a free in-browser Llama 3 chatbot powered by WebGPU

547 points by abi, about 1 year ago
I spent the last few days building out a nicer ChatGPT-like interface to use Mistral 7B and Llama 3 fully within a browser (no deps and installs).

I've used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.

I've been using it as a therapist and coach. And it's wonderful knowing that my personal information never leaves my local computer.

Should work on desktop with Chrome or Edge. Other browsers are adding WebGPU support as well - see the GitHub repo for details on how you can get it to work on other browsers.

Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection. But on subsequent page loads, the model should be loaded from the IndexedDB cache, so it should be much faster.

The project is open source (Apache 2.0) on GitHub. If you like it, I'd love contributions, particularly around making the first load faster.

GitHub: https://github.com/abi/secret-llama
Demo: https://secretllama.com
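
For a sense of how small the glue code around WebLLM is, here is a minimal sketch in TypeScript, assuming the @mlc-ai/web-llm package's OpenAI-style API (the function and model names follow the package's published examples, but have shifted between versions):

    import { CreateMLCEngine } from "@mlc-ai/web-llm";

    // First run: downloads the weights and stores them in the browser cache.
    // Later runs: loads them straight from IndexedDB, as the post describes.
    const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC", {
      initProgressCallback: (report) => console.log(report.text),
    });

    // Inference happens entirely on the local GPU via WebGPU.
    const reply = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Hello!" }],
    });
    console.log(reply.choices[0].message.content);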

34 comments

NikhilVerma, about 1 year ago
This is absolutely wonderful. I am a HUGE fan of local-first apps. Running models locally is such a powerful thing; I wish more companies would leverage it to build smarter apps that can run offline.

I tried this on my M1 and ran Llama 3, I think the quantized 8B version. It ran at around 4-5 tokens per second, which was way faster than I expected in my browser.

dosinga, about 1 year ago
Phi 1.5 isn't quite there, is it? Caught it having some inner dialog.

What does that mean? Instruct: I'm sorry, but I'm not sure what you mean by "wutt". Output: Oh, I'm sorry. I meant "what does that mean?" Instruct: No problem, "wutt" means "that doesn't make any sense". Output: Ah, I see. Thank you for clarifying! Instruct: You're welcome! If you have any more questions, feel free to ask. Instruct: I'm glad I could help! Instruct: I hope you continue to enjoy learning about language arts. Instruct: You're doing a great job! Keep up the good work! Instruct: Have a fantastic day! Instruct: Goodbye! Instruct: Goodbye!

_akhe, about 1 year ago
Amazing work, feels like a step forward for LLM usability.

It would be interesting if there were a web browser that managed the download/install of models, so you could go to a site like this, or any other LLM site/app, and it detects whether or not you already have the models - similar to detecting whether you have a webcam or mic for a video call. The user could click "Allow" to permit use of the GPU and running of models in the background.

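Browsers do not yet expose a shared model registry, but the GPU half of this flow already exists; a minimal detection sketch, using only the standard WebGPU API:

    // Probe for WebGPU much like a site probes for a webcam or mic.
    if (!navigator.gpu) {
      console.log("WebGPU unavailable; fall back to a server-side model.");
    } else {
      const adapter = await navigator.gpu.requestAdapter();
      console.log(adapter ? "GPU adapter ready for in-browser inference."
                          : "WebGPU exposed, but no usable GPU adapter found.");
    }
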
low_tech_punk, about 1 year ago
It's a wrapper of https://github.com/mlc-ai/web-llm

joshstrange, about 1 year ago
Very cool! I wish there was chat history.

Also, if you click the "New Chat" button while an answer is generating, I think some of the output gets fed back into the model. It causes some weird output [0] but was kind of cool/fun. Here is a video of it as well [1]; I almost think this should be some kind of special mode you can run. I'd be interested to know what the bug causes: is it just the existing output sent as input, or a subset of it? It might be fun to watch a chat bot just randomly hallucinate, especially on a local model.

[0] https://cs.joshstrange.com/07kPLPPW

[1] https://cs.joshstrange.com/4sxvt1Mc

EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, but I'm not sure why it errors after a while (maybe it runs out of tokens for output? Not sure). It would be cool to have this run until you stop it, automatically changing every 10-30 seconds or so.

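For anyone who wants to reproduce this, a rough sketch of the race the EDIT describes, reusing the engine object from the snippet under the post (stream: true and resetChat() are both part of WebLLM's public API, but the timing here is guesswork):

    // Start a streaming completion...
    const chunks = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Tell me a very long story." }],
      stream: true,
    });

    // ...then reset the chat while tokens are still being produced.
    setTimeout(() => engine.resetChat(), 2000);

    for await (const chunk of chunks) {
      console.log(chunk.choices[0]?.delta?.content ?? "");
    }
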
manlobster, about 1 year ago
It's truly amazing how quickly my browser loads 0.6GB of data. I remember when downloading a 1MB file involved phoning up a sysop in advance and leaving the modem on all night. We've come so far.

threatofrain, about 1 year ago
IMO users should eventually be able to advertise what embedding models they have, so we don't redundantly re-download.

knowaveragejoe, about 1 year ago
Is this downloading a ~5GB model to my machine and storing it locally for subsequent use?

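It is, per the post: the weights land in the browser's origin storage after the first message. One way to see the footprint yourself, using only the standard Storage API (note this reports everything the origin stores, not just model weights):

    // Check how much origin storage (IndexedDB included) the site is using.
    const { usage, quota } = await navigator.storage.estimate();
    console.log(`Stored ${((usage ?? 0) / 1e9).toFixed(2)} GB of a ` +
                `${((quota ?? 0) / 1e9).toFixed(2)} GB quota.`);
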
manlobster, about 1 year ago
Looks like all the heavy lifting is being done by webllm [0]. What we have here is basically one of the demos from that.

[0] https://webllm.mlc.ai/

wg0, about 1 year ago
How do people use something like this as a coach or therapist? This is a genuine question.

Side note: impressive project. The future of AI is mostly offline, with maybe a few APIs in the cloud.

nojvek, about 1 year ago
Yasssssss! Thank you.

This is the future. I am predicting Apple will make progress on Groq-like chipsets built into their newer devices for hyper-fast inference.

LLMs leave a lot to be desired, but since they are trained on all publicly available human knowledge, they know something about everything.

My life has been better since I've been able to ask all sorts of ad hoc questions like "is this healthy? Why healthy?", and it gives me pointers on where to look.

andrewfromx, about 1 year ago
I asked it "what happens if you are bit by a radioactive spider?" and it told me all about radiation poisoning. Then I asked a follow-up question: "would you become spiderman?" and it told me it was unable to become anything but an AI assistant. I also asked if time machines are real and how to build one. It said yes and told me! (Duh, you use a flux capacitor, basic physics.)

mentos, about 1 year ago
This is awesome. I have been using ChatGPT-4 for almost a year and hadn't really experimented with locally running LLMs because I assumed the processing time per token would be too long. This demo has shown me that my RTX 2080 running Llama 3 can compete with ChatGPT-4 for a lot of my prompts.

This has sparked a curiosity in me to play with more LLMs locally. Thank you!

NayamAmarshe, about 1 year ago
This is amazing! I always wanted something like this, thank you so much!

raylad, about 1 year ago
After the model is supposedly fully downloaded (about 4GB) I get:

Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC/resolve/main/params_shard_3.bin

Also on Mistral 7B, again after a supposedly full download:

Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/resolve/main/params_shard_0.bin

Maybe memory? But if so, it would be good to say so. I'm on a 32GB system, btw.

r0fl, about 1 year ago
Could not load the model because Error: Cannot find WebGPU in the environment

littlestymaar, about 1 year ago
This is very cool; it's something I've wished existed since Llama came out. Having to install Ollama + CUDA to get a locally working LLM never felt right to me when everything that's needed is right there in the browser. Llamafile solves the first half of the problem, but you still need to install CUDA/ROCm for it to work with GPU acceleration. WebGPU is the way to go if we want to put AI on consumer hardware and break the oligopoly; I just wish it were more broadly available (on Linux, no browser supports it yet).

geor9e, about 1 year ago
I'm just seeing ERR_SSL_VERSION_OR_CIPHER_MISMATCH at https://secretllama.com/, and at http://secretllama.com/ I see "secretllama.com has been registered at Porkbun but the owner has not put up a site yet. Visit again soon to see what amazing website they decide to build."

hpeter, about 1 year ago
It's great, but I hope it doesn't catch on, because then every website will make me download models. My hard drive will be full; too much bloat. I think the web is not good for this.

I'd prefer it if web apps supported Ollama, or gave an option to either use that or store a model in the browser.

Or at least make it an extension.

simple10, about 1 year ago
Amazing! It's surprisingly fast to load and run given the size of the downloaded models.

Do you think it would be feasible to extend it to support web browsing?

I'd like to help if you could give some pointers on how to extend it.

When asked about web browsing, the bot said it could fetch web pages, but it obviously didn't work when asked to summarize a web page.

[EDIT] The Llama 3 model was able to summarize web pages!

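The naive version of browsing support is fetch-then-prompt; a minimal sketch reusing the engine object from the earlier snippet (the tag stripping is crude, and most pages will need a CORS-friendly proxy, which is left as an assumption here):

    // Fetch a page, strip markup, and ask the local model to summarize it.
    async function summarizePage(url: string): Promise<string> {
      const html = await (await fetch(url)).text();
      const text = html.replace(/<[^>]*>/g, " ").slice(0, 8000); // stay inside the context window
      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: `Summarize this page:\n\n${text}` }],
      });
      return reply.choices[0].message.content ?? "";
    }
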
indit, about 1 year ago
Could we use an already-downloaded .gguf file?

Its_Padar, about 1 year ago
Very interesting! I would be quite interested to see this implemented as some sort of API for browser chatbots, or possibly even local-AI-powered web games. If you don't know what Ollama is, I suggest checking it out. Also, I think adding the Phi-3 model to this would be a good idea.

koolala, about 1 year ago
On Firefox Nightly on my Steam Deck it "cannot find WebGPU in the environment".

Snoozus, about 1 year ago
Tried this in Chrome under Windows. It does work, but it does not seem to use the RTX 4060, only the integrated Iris Xe. Is this a bug or intentional?

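Quite possibly adapter selection rather than a bug: on dual-GPU machines, browsers often hand out the low-power adapter unless the page asks otherwise. The standard WebGPU hint, for reference (whether this app sets it is an assumption):

    // Ask for the discrete GPU; omitting the hint may yield the integrated one.
    const adapter = await navigator.gpu.requestAdapter({
      powerPreference: "high-performance",
    });
    console.log(adapter ? "Got an adapter." : "No adapter available.");
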
1f60c, about 1 year ago
It's sadly stuck on "Loading model from cache[24/24]: 0MB loaded. 0% completed, 0 secs elapsed." on my iPhone 13 Pro Max :(

gitinit, about 1 year ago
This works great on my Pixel 6a, surprisingly.

zerop, about 1 year ago
Question: do I compromise on answer quality if I use models via WebLLM (like this), compared to running them from a system console?

adontz, about 1 year ago
If anyone knows: is this about the best model one can run locally on an old consumer-grade GPU (GTX 1080 in my case)?

Dowwie, about 1 year ago
What therapy prompts have you found useful?

ngshiheng, about 1 year ago
Nice demo! I briefly tried it out, and it felt much better than the original WebLLM demo!

On a side note, I've been trying to do something similar, for similar reasons (privacy).

Based on my recent experience, I find that running an LLM directly in the browser with decent UX (e.g. sub-1-2-second response time, no lag, no crashes) is still somewhat impossible given the current state of things. Plus, I think that relying on users' own GPU hardware for UX improvements via WebGPU is not exactly practical at large scale (but it is still something!), since not everyone has access to GPU hardware.

But yeah, if there's anything to look forward to in this space, I personally hope to see improved feasibility of running LLMs in browsers.

Jackson_Fleck, about 1 year ago
...I think it would be a great idea to graft on a LlamaIndex module here, so we can use this local browser LLM to talk to our local documentation: https://docs.llamaindex.ai/en/stable/

Jackson_Fleck, about 1 year ago
This is amazing, but can we please set the .prose width to be dynamic? The text column is 3 inches wide on my monitor; it should take up a percentage of the browser window.

Bradd3rs, about 1 year ago
Pretty cool, nice work!

BargirPezza, about 1 year ago
Llama 3 seems to not be there yet at all. Can it read earlier prompts and what you have written before? It doesn't seem like it. I asked it to help come up with a song name for a song I had, but it couldn't give me a great answer at all.

Prompt: "Hi! Could you help me come up with a name for a song? The genre of the song is Techno, Punk and Jazz. The name should be two to three words. And each song name should be different from the other one. Please write me 4 to 6 names. Thank you!"

It gives me four famous song names that already exist and writes that I can use them as inspiration for my "next composition". I tell it that it should give me song names and not song ideas. It tells me sorry and that it will do better :p I ask again for song name ideas, and it gives me a whole bunch of famous song names and repeatedly writes "Can't Stop the Feeling" until it hits an error. I write that I should've added that it shouldn't suggest song names that already exist. It says sorry again, and I ask for new song names.

It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting"; "What a Time to Be Alive" - referencing the 1960s rock and roll era".

I write: "The song names should only be two to three words. Can you read what we have written before?"

Llama answers me: "Sure! Here's what we have written before: "Hey Jude", "Sgt Pepper's Lonely Hearts Club Band", "1984"", plus a lot of other song, album, or band names.

I tell it that this is not true and ask Llama to quote the first prompt I wrote.

Llama answers: "You asked me to help you with a math problem."

Copying and pasting the rest:

You: No, please quote me where I asked that

Llama 3: You asked me to write a song about the number system.

You: Same, please quote me where I said that

Llama 3: You asked me to write a song about the number system.

You: You did not quote me

Llama 3: You asked me to write a song about the alphabet, and I did not.

You: Okay.. I don't think you understand me here

Llama 3: You asked me to write a song about the alphabet.

You: Can you read what we have written?

Llama 3: You asked me to help you with a math problem.

You: Yes or no?

Llama 3: I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.

You: Alright, thanks for being honest. I will give you a break, sorry if it has been too much for you!

Llama 3: You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?