
Llama.cpp runs 65B on Asahi Linux at 4tk/s

5 points by ingenieroariel, about 2 years ago
    $ pledge -p 'stdio rpath' -- /home/shared/src/llama.cpp/main -m /home/shared/models/ggml-model-65b-q4_0.bin -t 16 -p "A proper AI assistant deployment should be very strict in the kind of resources it allows"
    main: seed = 1681759209
    llama.cpp: loading model from /home/shared/models/ggml-model-65b-q4_0.bin
    llama_model_load_internal: format     = ggjt v1 (latest)
    llama_model_load_internal: n_vocab    = 32000
    llama_model_load_internal: n_ctx      = 512
    llama_model_load_internal: n_embd     = 8192
    llama_model_load_internal: n_mult     = 256
    llama_model_load_internal: n_head     = 64
    llama_model_load_internal: n_layer    = 80
    llama_model_load_internal: n_rot      = 128
    llama_model_load_internal: ftype      = 2 (mostly Q4_0)
    llama_model_load_internal: n_ff       = 22016
    llama_model_load_internal: n_parts    = 1
    llama_model_load_internal: model size = 65B
    llama_model_load_internal: ggml ctx size = 146.86 KB
    llama_model_load_internal: mem required  = 41477.67 MB (+ 5120.00 MB per state)
    llama_init_from_file: kv self size  = 1280.00 MB

    system_info: n_threads = 16 / 20 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
    sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
    generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

    A proper AI assistant deployment should be very strict in the kind of resources it allows its agents to access. It can't just send them everywhere willy-nilly, and in fact, the more selective it is, the better for both itself and the user. So what are some good rules for allowing or blocking agent access? One rule is that an agent should only be given access to resources necessary for its job description. For example, a chatbot should only have access to the data required to answer customer questions (assuming it's not an ELIZA-type AI which just repeats what you say back at you). Another rule is that agents shouldn't be

    llama_print_timings: load time        =  2205.70 ms
    llama_print_timings: sample time      =    61.93 ms / 128 runs (0.48 ms per run)
    llama_print_timings: prompt eval time =  4190.82 ms / 18 tokens (232.82 ms per token)
    llama_print_timings: eval time        = 39469.33 ms / 127 runs (310.78 ms per run)
    llama_print_timings: total time       = 44072.21 ms

    pledge -p 'stdio rpath' -- /home/shared/src/llama.cpp/main -m -t 16 -p  11:27.47 user 0.658 system 1556% cpu (44.219 wasted time)
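For context on the headline figure: from the timings above, generation runs at roughly 1000 / 310.78 ≈ 3.2 tokens/s and prompt evaluation at 1000 / 232.82 ≈ 4.3 tokens/s, so the "4tk/s" in the title presumably rounds somewhere between the two.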
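The pledge wrapper confines llama.cpp to the "stdio" and "rpath" promise sets: basic I/O plus read-only filesystem access, and nothing else (no sockets, no file writes, no exec). pledge(2) is native to OpenBSD; the command-line `pledge` tool used here is a port of the same idea to Linux. As a rough sketch of what the sandboxed process is agreeing to, here is a minimal C program that pledges itself; it compiles as-is on OpenBSD, and the model path is just the one from the post:

    /* Minimal self-sandboxing sketch with pledge(2). "stdio" permits
     * basic I/O and "rpath" permits read-only filesystem access --
     * enough to read a model file and write tokens to stdout. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* After this call, any syscall outside the promised sets
         * (e.g. socket(2), or opening a file for writing) kills the
         * process. */
        if (pledge("stdio rpath", NULL) == -1) {
            perror("pledge");
            return EXIT_FAILURE;
        }

        /* Reading is still allowed under "rpath". */
        FILE *f = fopen("/home/shared/models/ggml-model-65b-q4_0.bin", "rb");
        if (f) {
            printf("model file opened read-only\n");
            fclose(f);
        }
        return 0;
    }

The appeal for LLM deployments is exactly what the generated text stumbles toward: the inference process only ever needs to read weights and stream tokens, so everything else can be denied up front.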
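The sampling line reports temp = 0.8 and top_k = 40 (plus top-p and a repeat penalty, omitted here). A minimal sketch of what temperature scaling plus top-k truncation means, in C; this is illustrative rather than llama.cpp's actual sampler, and the logits are made-up inputs (the seed just reuses the one from the log):

    /* Temperature + top-k sampling sketch: scale logits by 1/temp,
     * keep the k largest, softmax over the survivors, then draw one. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; float logit; } tok_t;

    static int cmp_desc(const void *a, const void *b) {
        float d = ((const tok_t *)b)->logit - ((const tok_t *)a)->logit;
        return (d > 0) - (d < 0);
    }

    static int sample_top_k(const float *logits, int n, float temp, int top_k) {
        tok_t *t = malloc(n * sizeof *t);
        for (int i = 0; i < n; i++) { t[i].id = i; t[i].logit = logits[i] / temp; }
        qsort(t, n, sizeof *t, cmp_desc);
        if (top_k < n) n = top_k;              /* keep only the k best */

        /* softmax over survivors (subtract the max for stability) */
        double *p = malloc(n * sizeof *p), sum = 0.0;
        for (int i = 0; i < n; i++) { p[i] = exp(t[i].logit - t[0].logit); sum += p[i]; }

        /* draw from the renormalized distribution */
        double r = (double)rand() / RAND_MAX * sum, acc = 0.0;
        int id = t[n - 1].id;
        for (int i = 0; i < n; i++) { acc += p[i]; if (r <= acc) { id = t[i].id; break; } }
        free(p); free(t);
        return id;
    }

    int main(void) {
        float logits[8] = {1.2f, 3.4f, 0.1f, 2.2f, -1.0f, 0.5f, 2.9f, 1.7f};
        srand(1681759209);                     /* seed from the log */
        printf("sampled token id: %d\n", sample_top_k(logits, 8, 0.8f, 4));
        return 0;
    }

Lower temperatures sharpen the distribution toward the top logits, while top-k bounds how far down the tail the sampler may reach; the run above combines both with top-p and a repeat penalty.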

no comments