> Before I begin I would like to credit the thousands or millions of unknown artists, coders and writers upon whose work the Large Language Models (LLMs) are trained, often without due credit or compensation<p>I like this. If we insist on pushing forward with GenAI, we should probably at least make some digital or physical monument, like "The Tomb of the Unknown Creator".<p>'Cause they sure as sh*t ain't gettin' paid. RIP.
I’m surprised to see no mention of AnythingLLM (<a href="https://github.com/Mintplex-Labs/anything-llm">https://github.com/Mintplex-Labs/anything-llm</a>). I use it with an Anthropic API key, but am giving thought to extending it with a local LLM.
It’s a great app: good file management for RAG, agents with web search, a cross-platform desktop client, and it can also easily be run as a server using Docker Compose.<p>Nb: if you’re still paying $20/mo for a feature-poor chat experience that’s locked to a single provider, consider using any of the many wonderful chat clients that take a variety of API keys instead. You might find that your LLM utilization doesn’t quite fit a flat-rate model, and that the feature set of the third-party client is comparable to (or surpasses) that of the LLM provider’s.<p>edit: included repo link; note on API keys as alternative to subscription
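The flat-rate point above is easy to sanity-check with quick arithmetic. A minimal sketch, where the per-token prices are illustrative assumptions (not any provider's actual rates):

```python
# Does a $20/mo flat-rate plan beat pay-per-token for your actual usage?
# The prices below are assumed for illustration, not real quotes.
PRICE_PER_M_INPUT = 3.00    # $ per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per 1M output tokens (assumed)

def monthly_api_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly cost of pay-per-token API usage."""
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT + \
           (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# A light user: ~1M input / 0.2M output tokens per month.
light = monthly_api_cost(1_000_000, 200_000)
print(f"light user: ${light:.2f}/mo vs $20 flat")  # well under $20
```

At this assumed usage the API-key route comes out around $6/mo, which is why heavy-vs-light utilization matters more than the sticker price.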
If anyone is looking for a one-click solution without having to run Docker, try Msty, something I have been working on for almost a year. It has RAG and web search built in, among other features, and can connect to your Obsidian vaults as well.<p><a href="https://msty.app" rel="nofollow">https://msty.app</a>
I run a pretty similar setup on an M2 Max with 96 GB.<p>For AI image generation, though, I would recommend Krita with the <a href="https://github.com/Acly/krita-ai-diffusion">https://github.com/Acly/krita-ai-diffusion</a> plugin.
Open WebUI sure does pull in a lot of dependencies... Do I really need all of LangChain, PyTorch, and plenty of others for what is advertised as a _frontend_?<p>Does anyone know of a lighter/more minimalist version?
Super basic intro, but perhaps useful. It doesn't mention quant sizes, which are important when you're GPU poor. There are lots of other client-side things you can do too, like KoboldAI, TavernAI, Jan, LangFuse for observability, and CogVLM2 for a vision model.<p>One of the best places to get the latest info on what people are doing with local models is /lmg/ on 4chan's /g/
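To see why quant sizes matter when you're GPU poor, here's a rough back-of-the-envelope VRAM estimate. The overhead allowance is an assumption (it varies a lot with context length and runtime), so treat this as a sketch, not a precise rule:

```python
def approx_vram_gb(params_b: float, bits: int, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for model weights at a given quantization.

    params_b: parameter count in billions; bits: bits per weight
    (16 = fp16, 8 ~= Q8, 4 ~= Q4). overhead_gb is a crude assumed
    allowance for KV cache and runtime buffers.
    """
    weights_gb = params_b * bits / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb + overhead_gb

for bits in (16, 8, 4):
    print(f"8B model @ {bits}-bit: ~{approx_vram_gb(8, bits):.1f} GB")
```

On a 16 GB card, an 8B model won't fit at fp16 but fits comfortably at 4-bit, which is the whole point of picking the right quant.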
Anyone got a guide on setting up and running the business-class stuff (70B models over multiple A100s, etc.)? I'd be willing to spend the money, but only if I could get a good guide on how to set everything up: what hardware goes with what motherboard/RAM/CPU, and so on.
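For the sizing side of that question, a quick lower-bound estimate of how many A100s a 70B model needs at different quantizations. The KV-cache budget here is an assumption, and real deployments add tensor-parallel overhead on top:

```python
import math

A100_VRAM_GB = 80  # per card (80GB SXM/PCIe variant)

def cards_needed(params_b: float, bits: int, kv_cache_gb: float = 20) -> int:
    """Minimum A100 count for weights plus an assumed KV-cache budget.

    Ignores tensor-parallel duplication and fragmentation, so treat
    the result as a lower bound, not a provisioning target.
    """
    total_gb = params_b * bits / 8 + kv_cache_gb
    return math.ceil(total_gb / A100_VRAM_GB)

print(cards_needed(70, 16))  # fp16: 140 GB weights + 20 GB cache -> 2 cards
print(cards_needed(70, 4))   # ~Q4:   35 GB weights + 20 GB cache -> 1 card
```

In other words, fp16 70B is a two-card-minimum proposition, while a 4-bit quant can in principle run on a single 80GB card.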
There is a lot I want to do with LLMs locally, but it seems like we're still not quite there hardware-wise (well, within reasonable cost). For example, Llama's smaller models take upwards of 20 seconds to generate a brief response on a 4090; at that point I'd rather just use an API to a service that can generate it in a couple seconds.
There was a post a few weeks back (or a reply to a post) showing an app entirely made using an LLM. It was like a 3D globe made with three.js, and I believe the poster had created it locally on his M4 MacBook with 96 GB of RAM? I can't recall which model it was or what else the app did, but maybe someone knows what I'm talking about?
What GPU offers a good balance between cost and performance for running LLMs locally? I'd like to do more experimenting, and am due for a GPU upgrade from my 1080 anyway, but would like to spend less than $1600...
Still nothing better than oobabooga (<a href="https://github.com/oobabooga/text-generation-webui">https://github.com/oobabooga/text-generation-webui</a>) in terms of a maximalist/"Pro"/"Prosumer" LLM UI/UX, à la Blender, Photoshop, Final Cut Pro, etc.<p>It's embarrassing, and any VCs reading this can contact me to talk about how to fix that. LM Studio is today the closest competition (but not close enough), and Adobe or Microsoft could do it if they fired the current folks who prevent it from happening.<p>If you're not using oobabooga, you're likely not playing with the settings on models, and if you're not playing with your models' settings, you're hardly even scratching the surface of their total capabilities.
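To make the "playing with settings" point concrete: most sampler knobs (temperature, top-p, etc.) build on temperature-scaled softmax over the model's logits. A minimal sketch with made-up logits:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: the knob most sampler settings build on.

    Lower temperature sharpens the distribution toward the top token
    (near-greedy); higher temperature flattens it (more variety).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy logits for three candidate tokens
print([round(p, 2) for p in softmax(logits, 1.0)])  # moderate spread
print([round(p, 2) for p in softmax(logits, 0.2)])  # near-greedy
```

At temperature 0.2 the top token gets essentially all the probability mass, which is why low-temperature output feels deterministic and repetitive.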
You can try out <a href="https://wiz.chat" rel="nofollow">https://wiz.chat</a> (my project) if you want to run Llama in your web browser. It still needs a GPU and the latest version of Chrome, but it's fast enough for my usage.
At some point we will have a JS API to run a preliminary LLM locally to make local decisions, with the server as final arbiter. For example, a comment rage moderator could help an end user revise their proposed post while they write it, helping them avoid turning the comment into rage bait. This is best done locally in the user's browser. Then, when they are ready to post, the server would do one final check. This would be like today's React front ends doing all the state and UI computation, relieving servers from having to render HTML.
I have a similar PC, and I use text-generation-webui and mostly ExLlama-quantized models.<p>I also deploy text-generation-webui for clients on k8s with GPUs, for similar reasons.<p>Last I checked, llamafile/ollama are not as optimised for GPU use.<p>For image generation I moved from automatic webui to ComfyUI a few months ago. They're different beasts: for some workflows automatic is easier to use, but for most tasks you can create a better workflow with enough Comfy extensions.<p>FaceFusion warrants a mention for faceswapping.
As a piece of writing feedback, I would convert your citation links into normal links. Clicking on the citation doesn't jump to the link or the citation entry, and you are basically using hyperlinks anyway.
I just use MLC with WebGPU: <a href="https://codepen.io/mikestaub/pen/WNqpNGg" rel="nofollow">https://codepen.io/mikestaub/pen/WNqpNGg</a>
> I have a laptop running Linux with core i9 (32threads) CPU, 4090 GPU (16GB VRAM) and 96 GB of RAM.<p>Is there somewhere I can find a computer like this pre-built?
David Bombal interviews a mysterious man who shows how he uses AI/LLMs for his automated LinkedIn posts and other tasks. <a href="https://www.youtube.com/watch?v=vF-MQmVxnCs" rel="nofollow">https://www.youtube.com/watch?v=vF-MQmVxnCs</a>
My understanding is that local LLMs are mostly just toys that output basic responses, and simply can’t compete with full LLMs trained with $60 million+ worth of compute. And since larger companies will always have better hardware and resources no matter how good consumer hardware gets, running locally is basically pointless for anything competitive or serious. Is this accurate?