Which Ryzen 7? How much RAM? Which Deepseek R1? How many tokens per second? I might be becoming curious enough to assemble an AMD system with RTX 3060 12GB.
This is the machine I bought: Ryzen 7 5825U, 32 GB DDR4 RAM, 1 TB HDD. The NucBox was about £350 when I bought it, but the price on Amazon has risen significantly. They do go up and down, maybe related to your cookies. Dynamic pricing, it is called. This is what I bought at Amazon:
There are options on that page. I don't recommend less than 32 GB of RAM. You might not require a terabyte of storage. I have used nowhere near that, but have it anyway. The models are pretty massive, though, so if you wanted a lot of them you might use all of that.
I have a DeepSeek model on it, but I don't use that one much, preferring the Gemma 3 Google models. I don't want to wait while the machine produces reams of reasoning, which DeepSeek seems to do. Maybe that can be turned off. The Gemma 3 model seems more capable and accurate for the queries I have asked of it; it does creative writing well, for a start. Other people have different requirements and prefer other models.
The mini PC will handle the mid-sized Gemma model (around 12 billion parameters) quite well. The 27 billion parameter one is too big to run at an acceptable speed. I've tried a few models, and they have different specialities in their training, so what I like may not suit you.
Ollama has loads of models of various sizes that you can download - all free. The free stuff is a genuine pro bono gift from the big tech companies. These are not monetised at all.
There are models from Meta, Microsoft, DeepSeek, and Google, and maybe more. Hugging Face is another massive repository of free models of all kinds.
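If you want to poke at a downloaded model programmatically rather than through the terminal, Ollama runs a small HTTP server on port 11434 that anything on the machine can talk to. Here is a minimal sketch in Python; the model name and prompt are just placeholder examples, not a recommendation.

```python
# Minimal sketch: send one prompt to a locally running Ollama server.
# Assumes Ollama is installed and a model (here gemma3:12b) has already
# been pulled; the model name and prompt are only examples.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:12b",   # whichever model you have pulled
        "prompt": "Explain what a token is, in two sentences.",
        "stream": False,         # wait for the whole answer at once
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])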
I DO NOT CLAIM that this size of PC is the best way to run AI models. It certainly isn't. One of my sons runs 27 billion parameter models on a games machine with a reasonable GPU. The price of the GPU alone would surpass the price of this mini PC.
That setup of his is fast and furious. It gives answers quicker than ChatGPT, even when accessed remotely. The reason, of course, is that ChatGPT is dealing with thousands of queries a second and his isn't. Also, the unpaid version of ChatGPT reportedly runs models of around 175 billion parameters, not 27 billion, and if you pay they are MUCH bigger and more capable.
With my setup, you need to wait a few seconds after sending a query while the machine works out what you want. The initial query in a session comes back quickly - probably around four or five seconds, depending on how you access it. Once it starts answering your query, the text comes back faster than my comfortable reading speed. It sort of streams onto your screen like an old teletype machine.
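That teletype effect is just how the Ollama server sends its answer: a stream of small JSON chunks that you print as they arrive. A rough sketch of doing the same thing yourself, with a placeholder model name and prompt:

```python
# Sketch of the "teletype" streaming effect: print text as Ollama
# generates it instead of waiting for the full answer.
# Assumes a local Ollama server and a pulled model; names are examples.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:4b", "prompt": "Write a limerick about mini PCs."},
    stream=True,   # Ollama's default: one JSON chunk per batch of tokens
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break
```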
If you run Ollama in a terminal window, everything is faster. I tend to use it from my Chromebook, via a front-end programme which gives an interface like the ChatGPT one. This adds overhead.
For networking the AI, as front ends go, Open WebUI feels faster than the Page Assist UI, because it seems to send your query to the model as you are typing it (that's what ChatGPT does), which gives the AI model a head start in working out what you want it to do, as opposed to a UI that waits until you have typed the whole query and then sends the whole thing.
I found that Windows Defender fought against Open WebUI, and the Page Assist UI was easier to get going on the network at home. A Linux environment is probably better in that respect.
In a particular session of question and answer, the response time gradually increases because what is known as the 'context' builds up. The context is the whole of the conversation so far, and it is submitted to the model again along with each new question; that is how it keeps hold of the whole discussion you are having with it. So every time you send a question, the model has to look back over the entire conversation and work out afresh what has been said - your latest question is only a part of the whole discussion. Before long you can be waiting a minute or so between adding a question and the answer starting to come back. You likely know all of this, but it might be an issue with a smaller machine like this if you need a long dialogue with the AI rather than a quick single question or instruction. It is much less apparent when I access my son's machine remotely; I don't notice the impact of the context building up there.
I found an easy way around the build-up of context and the slowdown: just ask a single question in a fresh instance of the model, unless you really need to build up context over a longer session of dialogue.
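To make the slowdown concrete: a chat front end keeps a growing list of every question and answer and re-sends the whole lot with each new question. A sketch of that, with a placeholder model and made-up questions:

```python
# Sketch of why long sessions slow down: with the /api/chat endpoint the
# whole message history is sent again with every new question, so the
# model has more and more text to read through each turn.
# Assumes a local Ollama server; the model name is an example.
import requests

history = []  # grows with every question and answer

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "gemma3:12b", "messages": history, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    answer = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Who wrote 'The Mill on the Floss'?"))
print(ask("Summarise its plot in one paragraph."))  # the first Q&A gets re-sent here too
```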
I got into this because I wanted to get a grip on how these models do what they do, and that is why I bought the Windows mini PC. I wasn't prepared to spend £1,500, so this was my compromise. It was a learning opportunity for me. This technology will revolutionise work and much of our employment economy in the next decade, so I thought I should understand what it is, how it works, and what it can do.
Windows caused me no end of trouble with networking access. Windows Defender obstructs attempts to come into the machine from outside. I think you are an IT professional, Guerney, and if so you won't be fazed, but I had to put some time into sorting this stuff out. My son runs his Ollama setup in a Linux environment; if you have a background in that, I think it would be easier to get going than mine was, but I have a lot of spare time which you don't. It was only networking the AI around my home network that was a problem in Windows, and it wasn't insurmountable. I easily installed Ollama, downloaded models and could run them in a terminal window on the PC. It was only getting it to work with my Chromebook that was a bit of trouble.
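For what it's worth, the networking part boils down to two things on the mini PC: telling Ollama to listen on the network rather than just on localhost (the OLLAMA_HOST environment variable) and letting port 11434 through the firewall. After that, any machine on the home network can reach it. The IP address in this sketch is a made-up example; substitute your own.

```python
# Sketch of reaching the mini PC's Ollama from another machine on the home
# network. On the mini PC, Ollama needs to listen on all interfaces (e.g.
# set OLLAMA_HOST=0.0.0.0) and the Windows firewall must allow inbound
# connections on port 11434. The address below is a made-up LAN example.
import requests

MINI_PC = "http://192.168.1.50:11434"   # replace with your mini PC's address

# /api/tags lists the models installed on that machine.
models = requests.get(f"{MINI_PC}/api/tags", timeout=10).json()["models"]
for m in models:
    print(m["name"])
```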
The number of tokens per second it handles varies with the size of the model: big models run slower than small ones, and Ollama has models of all sizes, so I had to find a compromise between quality and speed. The 4 billion parameter Gemma 3 runs at about 15 tokens per second, and the 12 billion parameter one drops to about 7 tokens per second. The full games machine with a dedicated GPU that my son uses is very quick. Bigger models are less likely to give hallucinated answers or factual errors; around 8 billion parameters is a good compromise.
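Those tokens-per-second figures are easy to check for yourself, because a non-streaming Ollama response includes its own timing fields (eval_count and eval_duration, the latter in nanoseconds, if I have the names right). A small sketch, again with a placeholder model and prompt:

```python
# Sketch of measuring tokens per second from Ollama's own timing fields.
# A non-streaming /api/generate response includes eval_count (tokens
# generated) and eval_duration (nanoseconds spent generating them).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:4b", "prompt": "Describe a NucBox in 100 words.", "stream": False},
    timeout=600,
).json()

tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} tokens/sec")
```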
The attraction of running models on your own machine is that once you download a model, you have it forever and can run it entirely away from the internet. Some commercial users have concerns about the privacy of the data and information in their queries and reports. My son sent me redacted samples of reports he runs on his system, which has an attached private client database and a library of commercially sensitive documents and client system data relating to his role as a tech consultant. For work like that, an offline system is essential because of client sensitivity. He reckons the system he has constructed using qwen3:14b and a Qdrant vector store can save him days of work per week in responding to client queries in his job. This makes him much more productive. He works for a big tech consultancy firm and deals with very large companies. He gets paid a lot; they expect a lot in return.
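I don't know the details of his system, but the general idea of that kind of offline setup is: embed your documents, keep the vectors in a local store such as Qdrant, and hand the best matches to the model as context with each query. A very rough sketch, with made-up documents, an illustrative embedding model, and the qwen3:14b model he mentioned (requires `pip install qdrant-client requests`):

```python
# Very rough sketch of an offline "ask questions of your own documents"
# setup: embed documents, store the vectors locally, feed the best matches
# to the model as context. The documents, collection name and embedding
# model here are all illustrative, not my son's actual system.
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint; nomic-embed-text is one common choice.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=120)
    return r.json()["embedding"]

docs = ["Client A's backup window is 02:00-04:00 UTC.",
        "Client B migrated to Postgres 16 in March."]

store = QdrantClient(":memory:")   # a real setup would use a persistent Qdrant server
dim = len(embed(docs[0]))
store.create_collection("notes", vectors_config=VectorParams(size=dim, distance=Distance.COSINE))
store.upsert("notes", points=[PointStruct(id=i, vector=embed(d), payload={"text": d})
                              for i, d in enumerate(docs)])

question = "When can we run Client A's maintenance?"
hits = store.search("notes", query_vector=embed(question), limit=1)
context = hits[0].payload["text"]

answer = requests.post(f"{OLLAMA}/api/generate",
                       json={"model": "qwen3:14b", "stream": False,
                             "prompt": f"Context: {context}\n\nQuestion: {question}"},
                       timeout=600).json()["response"]
print(answer)
```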
The box I bought is quite small.
