If the generation temperature is non-zero (which it usually is), the output is inherently random. So even if the first number in a statistic should be 1, the model will sometimes just pick some other plausible number. Even if the network always assigns the highest probability to the correct token, sampling amounts to a weighted coin toss at every token; that's exactly what makes answers feel more varied and creative.
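To make the coin-toss point concrete, here's a minimal sketch of temperature sampling (the logit values are made up for illustration, but the softmax-then-sample mechanism is the standard one):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from logits after temperature scaling.

    temperature -> 0 approaches greedy argmax; higher values flatten
    the distribution so less likely tokens get picked more often.
    """
    if temperature == 0:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The weighted "coin toss" over the vocabulary.
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy vocabulary of 3 tokens where token 0 is clearly the "correct" one.
# At temperature 1.0 the other tokens still get sampled a fair fraction
# of the time, even though token 0 always has the highest probability.
logits = [3.0, 1.0, 0.5]
picks = [sample_with_temperature(logits, 1.0) for _ in range(10000)]
print(picks.count(0) / len(picks))  # well below 1.0
```

At temperature 0 the model would pick token 0 every single time; at temperature 1.0 it only picks it about 80% of the time, so over a long answer a wrong token somewhere is nearly guaranteed.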
That’s on top of hoping the LLM has even seen that data during training AND managed to memorize it AND that the network just happens to be able to reproduce it given your particular prompt (a slightly different prompt might fail).
If you want any reliability at all, you need RAG, AND you also have to double-check every reference it quotes yourself (assuming it can even cite its sources).
Even if it has all the necessary information to answer correctly in its context window, it can still answer incorrectly.
None of the current models are anywhere close to producing trustworthy output 100% of the time.
Vanilla Llama3 cannot generate images. It’s probably just being used to write the prompt for a text-to-image model.
But there are also some Llama3-based image/text-to-image/text models out there, I think.