I ran Gemma 26B on 4 GB VRAM + 16 GB RAM. 15 t/s on average

NAwT@lemmy.world · edited 2 months ago

I ran Gemma 26B on 4 GB VRAM + 16 GB RAM. 15 t/s on average

hendrik@palaver.p3x.de · 2 months ago

It took me until now to finally dabble in these coding agents. And I didn’t realize at all how many tokens they burn through. I let it write a basic HTML & JavaScript browser game with some free OpenRouter model. I’ve done this before, just told a model to one-shot it in a single file. And now I tried OpenCode, let it ask me a few questions, come up with a plan and do an entire project structure… And it hit one million tokens way faster than I thought. If my math is correct, that’d take my computer 2 days and nights straight at 6 t/s 👀

Guess it’s really a bit (too) slow.
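For anyone who wants to check the math, here is a rough sketch. It assumes every one of the million tokens is processed at generation speed, which is probably pessimistic, since prompt tokens are usually processed faster than generated ones. The function name is made up for illustration.

```python
def generation_time_hours(total_tokens: int, tokens_per_second: float) -> float:
    """Hours needed to process total_tokens at a constant tokens_per_second."""
    return total_tokens / tokens_per_second / 3600


# One million tokens at the 6 t/s mentioned above
hours_at_6 = generation_time_hours(1_000_000, 6)
# Same workload at the 15 t/s from the original post
hours_at_15 = generation_time_hours(1_000_000, 15)

print(f"{hours_at_6:.1f} h ({hours_at_6 / 24:.1f} days) at 6 t/s")
print(f"{hours_at_15:.1f} h at 15 t/s")
```

So roughly 46 hours at 6 t/s, which matches the "2 days and nights" estimate, and about 18.5 hours at 15 t/s.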

NAwT@lemmy.world · 2 months ago

The problem with coding agents is simple: THERE ARE A LOT of system prompts. Prompts that correct the model’s behavior while it creates the project. That’s needed because even the largest models are dumb to some degree. They forget which tools they need to use and how to use them properly. So there’s a system prompt hidden from you (I tried Cline, for example: it’s 11k tokens on the system prompt alone!) that eats context like crazy. I tried to build a similar agent with tools and system prompts that save on context (my custom tool “get_overview” instead of read_file; combined with a “search_content” tool that returns the matching lines for a search query, it can save a lot, since the model doesn’t need to read the full file) and mixed just a tiny cheat sheet into every user msg so the model doesn’t forget. Results were very good. Don’t know why they need to spam the system prompt like that.

So I think this problem is kinda solvable on a local machine.
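To give an idea of what such context-saving tools could look like, here is a minimal sketch. Only the tool names “get_overview” and “search_content” come from the comment above; the bodies, the regex, the cheat sheet text and the `with_cheatsheet` helper are assumptions for illustration, not the actual implementation.

```python
import re
from pathlib import Path

# Assumed heuristic: lines that start a definition are "interesting".
# A real tool would likely be language-aware.
_DEF_PATTERN = re.compile(r"^\s*(async def |def |class |function )")

# Hypothetical cheat sheet text, appended to every user message.
CHEATSHEET = (
    "Tools: get_overview(path), search_content(path, query). "
    "Prefer them over read_file."
)


def get_overview(path: str, max_entries: int = 50) -> str:
    """Return the line count plus numbered definition lines, not the whole file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    entries = [
        f"{number}: {line.strip()}"
        for number, line in enumerate(lines, start=1)
        if _DEF_PATTERN.match(line)
    ]
    header = f"{path}: {len(lines)} lines"
    return "\n".join([header] + entries[:max_entries])


def search_content(path: str, query: str) -> str:
    """Return only the numbered lines that contain the query string."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    hits = [
        f"{number}: {line.strip()}"
        for number, line in enumerate(lines, start=1)
        if query in line
    ]
    return "\n".join(hits) or "no matches"


def with_cheatsheet(user_msg: str) -> str:
    """Append the short reminder so the model does not forget its tools."""
    return f"{user_msg}\n\n[{CHEATSHEET}]"
```

The idea is that the model first calls `get_overview` to see the file's shape, then `search_content` for the exact lines it needs, so a long file costs a handful of lines of context instead of its full length.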