Neat to see more folks writing blogs on their experiences. This, however, does seem like an over-complicated method of building llama.cpp.
Assuming you want to do this interactively (at least for the first time), you should only need to run:
ccmake .
and toggle the parameters your hardware supports or that you want (e.g. CUDA if you're using Nvidia, Metal if you're using Apple, etc.), then press 'c' (configure), then 'g' (generate), then:
cmake --build . -j $(expr $(nproc) / 2)
Done.
If you want to move the binaries into your PATH, you could then optionally run cmake --install . afterwards.
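For a non-interactive build, a rough equivalent might be the following (backend flag names are an assumption and vary by llama.cpp version; recent trees use GGML_CUDA / GGML_METAL, older ones used LLAMA_CUBLAS / LLAMA_METAL):
# configure into a separate build directory, enabling the backend for your hardware
cmake -B build -DGGML_CUDA=ON   # or -DGGML_METAL=ON on Apple hardware
# build with half the available cores, as above
cmake --build build -j $(expr $(nproc) / 2)
# optionally install the resulting binaries somewhere on your PATH
cmake --install build --prefix ~/.local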
The first time I heard about Llama.cpp, I got it to run on my computer. Now, my computer: a Dell laptop from 2013 with 8 GB RAM and an i5 processor, no dedicated graphics card. Since I wasn't using an MGLRU-enabled kernel, it took a looong time to start but wasn't OOM-killed. Considering my amount of RAM was just the minimum required, I tried one of the smallest available models.
Impressively, it worked. It was slow to spit out tokens, at a rate of around one word every 1 to 5 seconds, and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated, talking about moons that it called "Jupterians", while I expected it to talk about the Galilean moons.
Nevertheless, LLMs really impressed me, and as soon as I get my hands on better hardware I'll try to run bigger models locally, in the hope that I'll finally have a personal "oracle" able to quickly answer most questions I throw at it and help me write code and do other fun things. Of course, I'll have to check its answers before using them, but the current state seems impressive enough to me, especially QwQ.
Is anyone running smaller experiments and able to talk about their results? Is it already possible to have something like an open-source copilot running locally?
I'd say avoid pulling in all the Python and containers required, and just download the GGUF from the Hugging Face website directly in a browser rather than doing it programmatically. That sidesteps a lot of this project's complexity, since nothing about llama.cpp requires those heavy deps or abstractions.
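Once the GGUF is on disk, pointing llama.cpp at it is a one-liner. A sketch, assuming a recent build (newer versions ship llama-cli and llama-server; older ones called the binaries main and server) and a hypothetical model path:
# run a one-off prompt against a locally downloaded model
./llama-cli -m ~/models/model.gguf -p "What is the biggest planet in the solar system?"
# or serve it over HTTP (OpenAI-compatible API) on port 8080
./llama-server -m ~/models/model.gguf --port 8080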
I tried building and using llama.cpp multiple times, and after a while, I got so frustrated with the frequently broken build process that I switched to ollama with the following script:
#!/bin/sh
export OLLAMA_MODELS="/mnt/ai-models/ollama/"
printf 'Starting the server now.\n'
ollama serve >/dev/null 2>&1 &
serverPid="$!"
printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
ollama run llama3.2 2>/dev/null
printf 'Stopping the server now.\n'
kill "$serverPid"