

Absolutely. China’s models are very advanced - especially Qwen. I don’t code professionally - just for personal projects. If I used these tools as an employee in a company, I’d defer to that company’s wishes. Regardless of whether it’s U.S. or China, these models are using and probably storing my data for further training. As they should! I don’t care - they’re providing me with powerful tools. I use free tiers and jump between them all: ChatGPT, Perplexity (general search), DeepSeek, Qwen (advanced programming), Google AI Studio, and others. I’m grateful, not fearful.
VRAM vs RAM:
VRAM (Video RAM):
- Dedicated memory on your graphics card/GPU
- Used specifically for graphics processing and AI model computations
- Much faster for GPU operations
- Critical for running LLMs locally
RAM (System Memory):
- Main system memory used by the CPU for general operations
- Slower access for GPU computations
- Can be used as a fallback, but with a performance penalty
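If you’re not sure what your machine actually has, here’s a quick check in Python (assuming an NVIDIA GPU and a CUDA build of PyTorch; psutil for the system RAM side):

```python
# Quick check of available system RAM and VRAM.
# Assumes an NVIDIA GPU with a CUDA build of PyTorch installed; psutil for system RAM.
import psutil
import torch

ram_gb = psutil.virtual_memory().total / 1024**3
print(f"System RAM: {ram_gb:.1f} GB")

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)}, VRAM: {vram_gb:.1f} GB")
else:
    print("No CUDA GPU detected - local LLMs would fall back to CPU-only inference.")
```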
So, for a basic 7B-parameter LLM locally, you typically need (rough math in the sketch below):
Minimum: 8-12 GB VRAM
- Can run basic inference/tasks
- May require quantization (4-bit/8-bit)
Recommended: 16+ GB VRAM
- Smoother performance
- Handles larger context windows
- Runs without heavy quantization
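The rough math behind those numbers, counting weights only (activations, the KV cache, and framework overhead come on top of this):

```python
# Back-of-envelope VRAM estimate for model weights alone.
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1024**3

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit weights: ~{weight_memory_gb(7e9, bits):.1f} GB")

# 7B model at 32-bit weights: ~26.1 GB
# 7B model at 16-bit weights: ~13.0 GB
# 7B model at  8-bit weights: ~6.5 GB
# 7B model at  4-bit weights: ~3.3 GB
```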
Quantization means reducing the precision of the model’s weights and calculations to use less memory. For example, instead of storing numbers with full 32-bit precision, they’re compressed to 4-bit or 8-bit representations. This significantly reduces VRAM requirements but can slightly reduce model quality and accuracy.
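A toy sketch of the idea, using a simplified absmax scheme rather than what any particular quantization library actually does:

```python
# Toy illustration of quantization: compress float32 weights to 8-bit integers
# and back, then look at the rounding error that introduces.
# Simplified absmax scheme for illustration only.
import numpy as np

weights = np.random.randn(5).astype(np.float32)

scale = np.abs(weights).max() / 127                     # map the largest weight to the int8 range
quantized = np.round(weights / scale).astype(np.int8)   # 8-bit storage: 4x smaller than float32
dequantized = quantized.astype(np.float32) * scale      # what the model actually computes with

print("original:   ", weights)
print("dequantized:", dequantized)
print("max error:  ", np.abs(weights - dequantized).max())
```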
Options if you have less VRAM:
- CPU-only inference (very slow)
- Model offloading to system RAM
- Use smaller models (3B, 4B parameters)
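If you want to try the offloading route, here’s a minimal sketch using Hugging Face transformers with accelerate’s device_map="auto", which places as many layers as fit on the GPU and spills the rest to system RAM. The model id is just an example - any small model works:

```python
# Hedged sketch: load a small model in half precision and let device_map="auto"
# overflow layers that don't fit in VRAM into system RAM (requires accelerate).
# "Qwen/Qwen2.5-3B-Instruct" is only an example model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example; pick any 3B-4B model you like

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision: ~2 bytes per weight
    device_map="auto",           # GPU first, overflow to CPU RAM
)

inputs = tokenizer("Explain VRAM vs RAM in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```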