Key architectural details

Mixture of Experts (MoE): 128 experts, with 4 active per token, enabling efficient scaling and specialization.

119B total parameters, with 6B active parameters per token (8B including embedding and output layers).

256k context window, supporting long-form interactions and document analysis.

Configurable reasoning effort: Toggle between fast, low-latency responses and deep, reasoning-intensive outputs.

Native multimodality: Accepts both text and image inputs, unlocking use cases from document parsing to visual analysis.
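To make the "128 experts, 4 active per token" routing concrete, here is a minimal sketch of the top-k gating step a MoE layer performs: the router scores every expert, keeps the top 4, and softmax-normalizes their weights. This is an illustrative toy (the function name and random logits are hypothetical), not the model's actual router.

```python
import numpy as np

def topk_route(router_logits, k=4):
    """Pick the top-k experts for one token and normalize their gate weights."""
    idx = np.argsort(router_logits)[::-1][:k]          # indices of the k highest scores
    w = np.exp(router_logits[idx] - router_logits[idx].max())  # stable softmax over top-k
    return idx, w / w.sum()

rng = np.random.default_rng(0)
router_logits = rng.normal(size=128)   # one score per expert (128 experts total)
experts, weights = topk_route(router_logits, k=4)
print(experts)        # the 4 expert indices chosen for this token
print(weights.sum())  # gate weights sum to 1.0
```

Only the 4 selected experts run for that token, which is why the active parameter count (6B) is so much smaller than the total (119B).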

  • keepthepace@tarte.nuage-libre.fr

    Does this work in practice, though: keeping the non-active experts in RAM and only loading the active ones into VRAM? My understanding is that the active experts can change every token, and 6B parameters is still a lot of data to load from RAM for every token!
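The commenter's worry can be put in rough numbers. A worst-case back-of-envelope sketch, assuming FP16 weights, ~25 GB/s of effective host-to-device bandwidth (a hypothetical PCIe-class figure), and that all active parameters must be transferred for every token:

```python
# Worst-case transfer cost per token if every active expert misses VRAM.
# Assumptions (not from the source): FP16 weights, ~25 GB/s effective bandwidth.
active_params = 6e9            # active parameters per token
bytes_per_param = 2            # FP16
bandwidth = 25e9               # bytes/s, assumed effective host-to-device rate

bytes_per_token = active_params * bytes_per_param   # 12 GB per token
seconds_per_token = bytes_per_token / bandwidth     # 0.48 s per token
print(f"{bytes_per_token/1e9:.0f} GB per token, {seconds_per_token:.2f} s/token")
```

So a naive "swap every active expert in from RAM each token" scheme would indeed be far too slow. Practical offloading schemes rely on the same experts being reused across nearby tokens, on only the MoE layers (not attention or embeddings) living in RAM, and on quantized weights shrinking the transfer.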