I ran Gemma 26B on 4 GB VRAM + 16 GB RAM. 15 t/s on average

NAwT@lemmy.world · 2 months ago

I ran Gemma 26B on 4 GB VRAM + 16 GB RAM. 15 t/s on average

hendrik@palaver.p3x.de · 2 months ago

These MoE models are great when it comes to speed. Settle for half your 15 t/s and you can run it entirely without a graphics card on an old computer. At least mine, which is several generations older, manages 6-7 tokens a second entirely on CPU. I guess that’s a bit slow for some agent to waste 1M tokens on some very basic programming project… But it’s enough to chat and ask questions, I guess?

NAwT@lemmy.world · 2 months ago

Yeah, 6-7 is slow (for me personally, even for chat), but 15 feels great. Strangely, it can run even faster while generating. The KV cache is hitting, I guess.
I tried to create my own optimized version of a coding agent and it even performs relatively well, but for programming it is definitely slow. It would be OK if it got all the code right on the first try, but it doesn’t. That’s not the model’s fault: even cloud agents make mistakes, but thanks to their high speed they can fix them fast.

But for chat it’s great.

hendrik@palaver.p3x.de · 2 months ago

It took me until now to finally dabble in these coding agents. And I didn’t realize at all how many tokens they burn through. I let it write a basic HTML & JavaScript browser game with some free OpenRouter model. I’ve done this before, when I just told a model to one-shot it in a single file. This time I tried OpenCode, let it ask me a few questions, come up with a plan and do an entire project structure… And it hit one million tokens way faster than I thought. If my math is correct, that’d take my computer 2 days and nights straight at 6 t/s 👀

Guess it’s really a bit (too) slow.
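
For anyone who wants to check that math, here is a rough back-of-the-envelope sketch. It assumes all one million tokens are generated at the quoted rate and ignores prompt processing, which is usually faster, so treat it as an upper-bound estimate. The function name is made up for illustration.

```python
def generation_time_hours(total_tokens: int, tokens_per_second: float) -> float:
    """Hours needed to produce total_tokens at a constant generation rate."""
    return total_tokens / tokens_per_second / 3600


# Compare the two speeds mentioned in this thread for a 1M-token agent run.
for rate in (6, 15):
    hours = generation_time_hours(1_000_000, rate)
    print(f"{rate} t/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

At 6 t/s that comes out to a bit under two days; at 15 t/s it is still most of a day.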

NAwT@lemmy.world · 2 months ago

The problem with coding agents is simple: there are A LOT of system prompts. Prompts that correct the model’s behavior while it builds the project. That’s needed because even the largest models are dumb to some degree. They forget which tools they need to use and how to use them properly. So there’s a system prompt hidden from you (I tried Cline, for example: the system prompt alone is 11k tokens!) that eats context like crazy. I tried to build a similar agent with tools and system prompts that save on context. My custom “get_overview” tool replaces read_file, and combined with a “search_content” tool that returns the lines matching a search query it saves a lot, because the model doesn’t need to read the full file. I also mix a tiny cheat sheet into every user message so the model doesn’t forget. Results were very good. I don’t know why they need to spam the system prompt like that.

So I think this problem is kind of solvable on a local machine.
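
A minimal sketch of what such context-saving tools could look like. This is not the actual agent code: only the tool names “get_overview” and “search_content” come from the comment above, while the signatures, the regex for “skeleton” lines, the cheat sheet text, and the `with_cheatsheet` helper are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumption: an "overview" means imports plus class/function signatures (Python files).
SKELETON = re.compile(r"^\s*(import |from |class |def |async def )")

# Hypothetical reminder text appended to every user message.
CHEATSHEET = (
    "Tools: get_overview(path), search_content(path, query). "
    "Prefer them over reading whole files."
)


def get_overview(path: str, max_lines: int = 40) -> str:
    """Return numbered skeleton lines instead of the full file contents."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    hits = [f"{n}: {line.rstrip()}" for n, line in enumerate(lines, 1)
            if SKELETON.match(line)]
    return "\n".join(hits[:max_lines])


def search_content(path: str, query: str, context: int = 1) -> str:
    """Return numbered lines containing query, plus a few lines around each hit."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    out = []
    for i, line in enumerate(lines):
        if query in line:
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            out.extend(f"{n + 1}: {lines[n]}" for n in range(lo, hi))
            out.append("---")  # separator between hits
    return "\n".join(out)


def with_cheatsheet(user_msg: str) -> str:
    """Append the short tool reminder so the model does not forget its tools."""
    return f"{user_msg}\n\n[reminder] {CHEATSHEET}"
```

The idea is that the model first asks for an overview, then searches for the few lines it actually needs, so a large file costs a handful of lines of context instead of the whole thing. Nearby hits can repeat lines in this simple version.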