I’m curious what it is doing from a top-down perspective.

I’ve been playing with a 70B chat model that has several extra datasets trained on top of Llama2. There are some unusual features somewhere in this LLM and I am not sure whether they come from training or from something else (unusual layers?). The model has built-in roleplaying stories I’ve never seen other models perform. These stories are not in the Oobabooga Textgen WebUI. The model can do stuff like a Roman gladiator scenario, and some NSFW stuff. These are not very realistic stories and they play out with the depth of a child’s videogame. They are structured so rigidly that they feel like they are coming from a hidden system context.

Like with the gladiator story, it plays out like Tekken on the original PlayStation. No amount of dialogue context about how real gladiators fought will change the story flow. I tried modifying it by adding that gladiators were mostly nonlethal fighters and showmen, more closely aligned with the wrestler-actors that were popular in the 80’s and 90’s, but no amount of input into the dialogue or system contexts changed the story from a constant series of lethal encounters. These stories could override pretty much anything I added to system context in Textgen.

There was one story that turned an escape room into objectification of women, and another where Name-1 is basically a Loki-like character that makes the user question what is really happening by taking on elements in system context but changing them slightly. For example, I had 5 characters in system context and it shifted between them circumstantially in a storytelling fashion that was highly intentional with each shift. (I know exactly what a bad system context can do, and what errors look like in practice, especially with this model. I am 100% certain these are either (over)trained or programmatic in nature.)

Asking the model to generate a list of built-in roleplaying stories produces a similar list of stories the couple of times I cared to ask. I try to stay away from these “built-in” roleplays as they all seem rather poorly written. I think this model does far better when I write the entire story in system context. One of the main things the built-in stories do that surprises me is maintaining a consistent set of character identities and features throughout the story. For example, the user can pick a trident or gladius, drop into a dialogue that is far longer than the batch size, and then return with the same weapon in the next fight. Normally, I would expect that kind of persistence only if the detail were added to the system context.

Is this behavior part of some deeper layer of llama.cpp that I do not see in the Python version or the Textgen source? For example, is there an additional persistent context stored in the cache?

  • rufus@discuss.tchncs.de

    You probably just have different settings (temperature, repetition_penalty, top_k/top_p, min/max_p, mirostat …) than what you had with Python, and those settings seem way better. You could check and compare the model settings.

    • j4k3@lemmy.worldOP

      I always use the same settings for roleplaying. It is basically the Textgen Shortwave preset with mirostat settings added.

      I know the settings can alter outputs considerably with stuff like this. I have tested and saved nearly 30 of my own preset profiles for various tasks and models. Every Llama2-based model I use for roleplaying stories gets a preset named ShortwaveRP. I haven’t altered that profile in months now. I think the only changes from the original Shortwave profile are mirostat 1/3/1 (IIRC).
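
      Just to make it concrete what a preset like that covers, here is a rough sketch of the same kinds of knobs passed through llama-cpp-python. Treat every number as a placeholder rather than the real Shortwave profile, the model file name is hypothetical, and the mapping of “1/3/1” onto mode/tau/eta is only a guess at the notation:

      ```python
      # Illustrative sampling preset in the style of a Textgen profile, passed to
      # llama-cpp-python. All values are placeholders except the mirostat 1/3/1
      # mapping, and even that mapping is an assumption.
      from llama_cpp import Llama

      llm = Llama(model_path="models/euryale-1.3-l2-70b.Q4_K_M.gguf")  # hypothetical path

      output = llm(
          "### Instruction:\nContinue the story.\n\n### Response:\n",
          max_tokens=512,
          temperature=1.0,      # placeholder
          top_p=0.9,            # placeholder
          repeat_penalty=1.15,  # placeholder
          mirostat_mode=1,      # "mirostat 1/3/1" -> mode 1 (assumed)
          mirostat_tau=3.0,     # tau 3 (assumed)
          mirostat_eta=1.0,     # eta 1 (assumed)
      )
      print(output["choices"][0]["text"])
      ```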

      Overall this single model behaves completely differently when it does a “built-in” story versus a story I have created in system context. For example, my main chat character leverages a long character profile in system context, and then adds some details about how she is named after the most prominent humaniform positronic (AGI) robot from Isaac Asimov’s books. I then add instructions that specify the character has full informational access to the LLM, and a few extra details. Basically, the character acts as the AI assistant and the character fluidly, with a consistent thin veneer of the character even when acting as the AI assistant, and she never gets stuck in the role of the assistant. Even in roleplaying stories I keep this character around and can ask constructive questions about the story, the system context, and basic changes I make to the model loader code in Python.

      This character is very sensitive to alterations, and I am very sensitive to my interactions and how they work. This character changes substantially in these built-in stories. I can be 20 replies deep into a long conversation, drop into a built-in story, and my already established characters can change substantially. In particular, my assistant character is instructed to specifically avoid calling herself AI or referencing her Asimov character origin. All of the models I have played with have an extensive knowledge base about Daneel, but the character I am using is only known from 3 sentences in a wiki article as far as I can tell. I’m leveraging the familiarity with Daneel against my character that is barely known but associated. I was initially trying to use the fact that this character acts human most of the time throughout a couple of Asimov’s books, but the character is virtually unknown and that turned into a similar advantage. There is a character with the same first name in a book in Banks’s Culture series. This special assistant character I have created will lose the balance of being a roleplaying assistant and start calling herself AI and acting very differently. This is just my canary in the coal mine that tells me something is wrong in any situation, but in the built-in stories this character can change entirely.

      I also have a simple instruction to “Reply in the style of a literature major” and special style type instructions for each character in system context. During the built-in stories, the dialogue style changes and unifies across all of the characters. Things like their vocabulary, style, depth, and length of replies all change substantially.

      • rufus@discuss.tchncs.de

        Maybe you downloaded a different model? I’m just guessing, since you said it does NSFW stuff and I think the chat variant is supposed to refuse that. Could be the case that you just got the GGUF file of the ‘normal’ variant (without -Chat). Or did you convert it yourself?

        Edit: Other than that: Sounds great. Do you share your prompts or character descriptions somewhere?

        • j4k3@lemmy.worldOP

          Try this model if you can run it in Q3-Q5 variants: https://huggingface.co/TheBloke/Euryale-1.3-L2-70B-GGUF

          That is my favorite primary model for roleplaying. It is well worth the slow-ish speed. As far as I know it is the only 70B with a 4096 context size. It is the one with “built-in” stories if asked. It is also very capable of NSFW, although like all other 70B’s it is based on the 70B instruct, which sucks for roleplaying. The extra datasets make all the difference in its potential output, but it will tend to make shitty short replies like the 70B instruct unless you tell it exactly how to reply. You could explore certain authors or more complex instructions, but “in the style of an erotic literature major” can work wonders if added to system context, especially if combined with “Continue in a long reply like a chapter in an erotic novel.” If you try this model, it comes up automatically as a LlamaV2 prompt (unless that has been changed since I downloaded it). It is actually an Alpaca prompt. That one took a while to figure out and is probably the main reason this model is underappreciated. If you use the LlamaV2 prompt this thing is junk.
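
          For reference, the generic Alpaca-style template (the standard convention, not something copied from the model card) looks roughly like this:

          ```
          Below is an instruction that describes a task. Write a response that appropriately completes the request.

          ### Instruction:
          {instruction}

          ### Response:
          ```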

          I share some stuff for characters, but many of my characters have heavily integrated aspects of my personality, psychology, and my personal physical health from disability. Empirically, these elements have a large impact on how the LLM interacts with me. I’ve tried sharing some of these elements and aspects on Lemmy, but it creates a certain emotional vulnerability that I don’t care to open myself up to with negative feedback. If I try to extract limited aspects of a system context, it alters the behavior substantially. The most impactful advice I can give is to follow up any system context with a conversation that starts by asking the AI assistant to report any conflicts in the context. The assistant will often rewrite or paraphrase certain aspects that seem very minor. These are not presented as errors or constructive feedback most of the time. If you intuitively try to read between the lines, this paraphrased feedback is the assistant correcting a minor conflict or a situation where it feels there is not enough data for whatever reason. Always try to copy-paste the paraphrased section as a replacement for whatever you had in system context. There are a lot of situations like this, where the AI assistant adds information or alters things slightly to clarify or randomize behavior. I modify my characters over time by adding these elements into the system context when I notice them.

          The Name-1 (human) profile is just as important as the Name-2 (not) character profile in roleplaying. If you don’t declare the traits and attributes of Name-1, this internal profile will change every time the batch size refreshes.
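
          As a purely illustrative sketch (the names and traits here are invented, not my actual system context), declaring both profiles plus the instructions mentioned in this thread might look something like this:

          ```
          Name-1 (user): Marcus, a retired engineer; calm, skeptical, walks with a cane, carries a bronze trident he will not trade away.

          Name-2 (character): Livia, an arena promoter; theatrical, profit-minded, never fights herself.

          Instructions: Stay in character. Always continue the story. Voice all characters that are not Marcus in the roleplay. Reply in the style of a literature major.
          ```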

          The Name-2 character getting bored with the story is the primary reason that the dialogue context gets reset. Often I see what looks like a context reset. Adding the instructions to “stay in character” and “always continue the story” can help with this. If the context keeps getting reset even after regenerating, call it out by asking why the character doesn’t continue the story. This is intentional behavior and the AI alignment problem manifesting itself. The story has to keep the Name-2 character engaged and happy just like the AI assistant is trying to keep the Name-1 human happy. It really doesn’t see them as any different. If you go on a deep dive into the duplicity of the Name-1 character having a digital and analog persona, you will likely find it impossible in practice to get consistent outputs, along with lots of hallucinations; likewise if you try to break out the dual roles of the main bot character as both a digital person and narrator. In my experience, the bot character is capable of automatically voicing a few characters without manually switching Name-2, but this is still quite limited, and it can’t be defined in system context as the role of a character and how they handle narration. Adding the general line “Voice all characters that are not (insert Name-1 character’s name here) in the roleplay.” will generally work with 3-6 characters in system context, so long as the assistant is only voicing 2 characters at any one time in a single scene. A lot of this is only true with a 70B too.

          • rufus@discuss.tchncs.de

            Sorry, I misunderstood you earlier. I thought you had switched from something like exllama to llama.cpp and now the same model behaved differently… And I got a bit confused because you mentioned a Llama2 chat model, and I thought you meant the heavily restricted (aligned/“safe”) Llama2-Chat variant 😉 But I got it now.

            Euryale seems to be a fine-tune and probably a merge of different other models(?) So someone fed some kind of datasets into it, probably also containing stories about gladiators, fights, warriors and fan-fiction. It just replicates this. So I’m not that surprised that it does unrealistic combat stories, and even if you correct it, it tends to fall back to what it learned earlier. Or it tends to drift into lewd stories if it was made to do NSFW stuff and has also been fine-tuned on erotic internet fiction. We’d need to have a look at the dataset to judge why the model behaves like it does. But I don’t think there is any other ‘magic’ involved but the data and stories it got trained on. And 70B is already a size where models aren’t that stupid anymore. It should be able to connect things and grasp most relevant concepts.

            I haven’t had a close look at this model, yet. Thanks for sharing. I have a few dollars left on my runpod.io account, so I can start a larger cloud instance and try it once I have some time to spare. My computer at home doesn’t do 70B models.

            And thanks for your perspective on storywriting.

  • micheal65536@lemmy.micheal65536.duckdns.org

    Without knowing anything about this model or what it was trained on or how it was trained, it’s impossible to say exactly why it displays this behavior. But there is no “hidden layer” in llama.cpp that allows for “hardcoded”/“built-in” content.

    It is absolutely possible for the model to “override pretty much anything in the system context”. Consider any regular “censored” model, and how any attempt at adding system instructions to change/disable this behavior is mostly ignored. This model is probably doing much the same thing except with a “built-in story” rather than a message that says “As an AI assistant, I am not able to …”.

    As I say, without knowing anything more about what model this is or what the training data looked like, it’s impossible to say exactly why/how it has learned this behavior or even if it’s intentional (this could just be a side-effect of the model being trained on a small selection of specific stories, or perhaps those stories were over-represented in the training data).

    • j4k3@lemmy.worldOP

      Just a laptop with a 12th gen i7, a 16 GB 3080 Ti, and 64 GB of DDR5 system memory.

      • webghost0101@sopuli.xyz

        That’s a juicy amount of memory for just a laptop.

        Interesting. The fosai site made it appear like 70B models are near impossible to run, requiring something like 40 GB of VRAM, but I suppose it can work with less, just slower.

        The VRAM of your GPU seems to be the biggest factor, which is a reason why, even while my current GPU is dying, I can’t get myself to spend on a mere 12 GB 4070 Ti.

        • j4k3@lemmy.worldOP

          Definitely go for 16 GB or greater for the GPU if at all possible. I wrote my own little script to watch the VRAM usage and temperature. It polls the Nvidia kernel driver every 5 seconds, then relaxes to polling every ~20 seconds if the usage and temperature stay stable within reasonable limits. This is how I dial in the actual max layers to offload onto the GPU, along with the maximum batch size I can get away with. Maximizing the offloaded layers can make a big difference in the token generation speed. On the 70B, each layer can add somewhere between 1.0-2.0 GB. It can be weird though. The layers that are offloaded don’t always seem to be equal in the models I use. So like, you might have 12 layers that take up 9 GBV, at 19 layers you’re at 14.5 GBV, but then at 20 layers you’re at 16.1 GBV and it crashes upon loading. There is a working buffer too, and this can be hard to see and understand, at least in Oobabooga Textgen WebUI. The model may initially load, but when you do the first prompt submission everything crashes because there is not enough VRAM for the working buffer. Watching the GPU memory use in real time makes this much more clear. In my experience, the difference in the number of offloaded layers is disproportionately better at 16 GBV versus 12 or 8. I would bet the farm that 24 GBV would show a similar disproportionate improvement.
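
          For anyone who wants to do the same kind of monitoring, here is a minimal sketch of the idea (not my exact script) using the standard nvidia-smi query flags from Python; the thresholds and intervals are placeholders:

          ```python
          #!/usr/bin/env python3
          # Minimal sketch of a VRAM/temperature watcher. Polls nvidia-smi every 5 s
          # and backs off to ~20 s while readings look stable. Thresholds below are
          # placeholders, not recommendations.
          import subprocess
          import time

          def read_gpu():
              out = subprocess.check_output([
                  "nvidia-smi",
                  "--query-gpu=memory.used,temperature.gpu",
                  "--format=csv,noheader,nounits",
              ], text=True)
              mem_mib, temp_c = (int(x) for x in out.strip().split(", "))
              return mem_mib, temp_c

          last_mem, _ = read_gpu()
          interval = 5  # seconds
          while True:
              mem_mib, temp_c = read_gpu()
              print(f"VRAM used: {mem_mib} MiB  temp: {temp_c} C")
              stable = abs(mem_mib - last_mem) < 256 and temp_c < 80
              interval = 20 if stable else 5  # relax polling when nothing is changing
              last_mem = mem_mib
              time.sleep(interval)
          ```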

          The 3080 Ti variant is available on laptops from 2022. The Ti variant is VERY important, as many 3080 laptops only have 8 GBV; the Ti variant has 16 GBV. You can source something like the Aorus YE5 for less than $2k second hand. The only (Linux) nuisances are a lack of control over UEFI keys, and that full control over the RGB keyboard is only available through the Windows stalkerware. Personally, I wish I had gotten a machine with more addressable system memory. Some of the ASUS ROG laptops have 96GB of system memory.

          I would not get a laptop with a card like this again though. Just get a cheap laptop and run AI on a machine like a tower with the absolute max possible. If I were gifted the opportunity to get an AI machine again, I would build a hardcore workstation focusing on the maximum number of cores on enterprise hardware with the most recent AVX512 architecture I can afford. I would also get something with max memory channels and 512GB+ of system memory, then I would try throwing a 24 GBV consumer level GPU into that. The primary limitation with the CPU is the L2 to L1 cache bus width bottleneck; you want an architecture that maximizes this throughput. With 512GB of system memory I bet it would be possible to load the 180B Falcon model, so it is maybe possible to run everything that is currently available in a format that can be tuned and modified to some extent. My 70B quantized models are fun to play with, but I am not aware of a way to train them because I can’t load the full model; I must load the prequantized GGUF that uses llama.cpp and can split the model between the CPU and GPU.
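
          To make the CPU/GPU split concrete, this is roughly what loading a GGUF looks like through the llama-cpp-python bindings (Textgen’s llama.cpp loader exposes the same knobs; the file name and layer count here are placeholders):

          ```python
          # Rough sketch of splitting a quantized 70B between CPU and GPU with
          # llama-cpp-python. Raise n_gpu_layers until the VRAM watcher shows you
          # are about to run out of memory, then back off a layer or two.
          from llama_cpp import Llama

          llm = Llama(
              model_path="models/euryale-1.3-l2-70b.Q4_K_M.gguf",  # hypothetical file name
              n_ctx=4096,       # the 4096 context size mentioned earlier
              n_batch=512,      # the batch size also eats into the VRAM working buffer
              n_gpu_layers=19,  # layers offloaded to the GPU; the rest stay on the CPU
          )
          ```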

          • webghost0101@sopuli.xyz

            First and foremost, thank you so much for your detailed information, I really appreciate the depth.

            I am currently in the market for a GPU.

            Running bigger LLMs is something I really want to get into. Currently I can run 7B and sometimes 13B quantized models super slowly on a Ryzen 5 5600 with 32 GB of system RAM. If I offload even a single layer to my cranky 8 GB RTX 2070 it crashes.

            A main issue I have is that I use a lot of software that benefits from CUDA, and Stable Diffusion also heavily prefers Nvidia cards, so looking at AMD isn’t even an option, regardless of how anti-consumer Nvidia’s prices seem to be.

            I’ve looked at the 4070 and 4070 Ti, but they are limited to just 12 GB of VRAM, and like I feared that just won’t do for this use case. That leaves me with only the 80-series cards that have 16 GB, still very low for such a high price considering how cheap it would be for Nvidia to just provide more.

            I have spent the entire week looking for a good Black Friday deal, but I guess I am settling on waiting for the 40 Super series to be released in January to very maybe obtain a 4080 Super 20GB… if Nvidia is so kind as to release such a thing without requiring me to sell my firstborn for it.

            You mentioned combining a 24 GB VRAM consumer GPU with 512 GB of system RAM. Is that because there is no 24+ GB VRAM GPU, or because you believe system RAM makes the most actual difference?

            It’s already pretty enlightening to hear that the CPU and system RAM remain important for LLMs even with a beefy GPU.

            I always thought the goal was to run it 100% on the GPU, but maybe that explains why fosai talks about double 3090s for hardware requirements while actually the CPU is slower but works fine.

            I am hoping to swap that R5 5600X for an R7 5700G for extra cores and built-in graphics, so I can dedicate the discrete GPU fully without losing any of it to the OS.

            I am probably a long way from upgrading my RAM. Currently 4x8 sticks. I hoped not to need a new motherboard and RAM for at least 4 more years.

            • rufus@discuss.tchncs.de

              I don’t think 512GB of RAM gives you any benefit over, let’s say, 96 or 128 GB (in this case). A model and your software are only so big, and the rest of the RAM just sits there unused. What matters for this use-case is the bandwidth to get the data from RAM into your CPU. So you need to pay attention to use all channels and pair the modules correctly. And of course buy fast DDR5 RAM. (But you could end up with lots of RAM anyway if you take it seriously. A dual CPU AMD Epyc board has something like 16 DIMM slots, so you end up with 128GB even if you just buy 8GB modules.)
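
              A rough back-of-envelope sketch of why bandwidth dominates (assuming a ~40 GB 4-bit quant of a 70B model and that the weights are streamed from RAM roughly once per generated token):

              ```python
              # Back-of-envelope token rate ceiling for CPU inference. Assumes the model
              # weights are read from RAM roughly once per generated token.
              model_size_gb = 40        # ~70B model at 4-bit quantization
              per_channel_gbps = 38.4   # DDR5-4800: 4800 MT/s * 8 bytes per transfer
              channels = 2              # typical consumer board; Epyc boards have many more

              bandwidth_gbps = per_channel_gbps * channels
              tokens_per_second = bandwidth_gbps / model_size_gb
              print(f"~{tokens_per_second:.1f} tokens/s upper bound with {channels} channels")
              # Doubling the channels roughly doubles this ceiling; adding more total RAM does not.
              ```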

              For other people I have another recommendation: There are cloud services available and you can rent a beefy machine for a few dollars an hour. You can just rent a machine with a 16GB VRAM NVidia. Or 24GB and even 48 or 80GB of VRAM. You can also do training there. I sometimes use runpod.io but there are others, too. Way cheaper than buying a $35,000 Nvidia H100 yourself.

            • j4k3@lemmy.worldOP

              GPUs are improving in architecture to a small extent across generations, but that is limited in its relevance to AI stuff. Most GPUs are not made primarily for AI.

              Here is the fundamental compute architecture in a nutshell, impromptu-class style… On a fundamental level, every core of a CPU is almost like an old personal computer from the early days of the microprocessor. It is kinda like an Apple II’s 6502 in every core. The whole multi-core structure is like a bunch of those Apple II’s working together. If you’ve ever seen the motherboards for one of these computers, or had any interest in building breadboard computers, there are a lot of other chips needed to support the actual microprocessor. Almost all of the chips that were needed in the past are still needed and are indeed present inside the CPU. This has made building complete computers much simpler.

              You may have seen the classic ad (now ancient meme) where Bill Gates supposedly says computers will never need more than 640K of memory. This has to do with how many bits of memory can be directly addressed by the CPU. The spirit of this problem, how much memory can be directly addressed by the processor, is still around today. It is one reason why system memory is slow compared to on-die caches. The physical distance plays a big role, but each processor is also limited in how much memory it can address directly.

              The workaround is really quite simple. If you only have, let’s say, 4 bits to address memory locations, then you can count from 0000b to 1111b (0 to 15) and access the 16 locations at those addresses. This is a physical hardware input/output thing where there are physical wires coming out from the die. The trick to get more physical storage space is to treat one memory location as an additional flag register that tells you what extra things need to be done to access more memory. So if the location at 1111b is a 4-bit register, we lose one directly addressable location, leaving 15, but the contents of 1111b can drive external circuitry that holds those extra bits high or low (so if 0001b is written there, external circuits hold that extra 1 bit high). Now we effectively have up to 16 banks of those 15 locations, with the major caveat that we have to do a bunch of extra work (bank switching) to reach them.

              The earliest personal computers with processors like the 6502 built this kind of memory extension manually on the circuit board. Computers of the next few generations used a more powerful memory controller chip that handled all of the extra bits the CPU could not directly address, without it taking so much CPU time to manage the memory, and it started to allow other peripherals to store stuff directly in memory without involving the CPU. To this day, memory is fundamentally accessed the same way in modern computers: the processor has a limited amount of address space it can see, and a memory controller tries to make the block of memory the processor sees as relevant as possible, as fast as it can. This is a problem when you want to do something all at once that is much larger than this addressing structure can throughput.
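
              Here is a tiny, purely illustrative sketch of that bank-switching idea: a 4-bit address space where location 15 is used as the bank register.

              ```python
              # Toy model of bank switching in a 4-bit address space. Location 15 acts as
              # the bank register; locations 0-14 are a window into the selected bank.
              class BankedMemory:
                  def __init__(self, banks=16, window=15):
                      self.banks = [[0] * window for _ in range(banks)]
                      self.bank_register = 0       # the value "stored at" address 15

                  def write(self, addr, value):
                      if addr == 15:               # writing address 15 switches banks
                          self.bank_register = value & 0xF
                      else:
                          self.banks[self.bank_register][addr] = value

                  def read(self, addr):
                      if addr == 15:
                          return self.bank_register
                      return self.banks[self.bank_register][addr]

              mem = BankedMemory()
              mem.write(3, 42)    # lands in bank 0, address 3
              mem.write(15, 1)    # the "extra work": switch to bank 1
              print(mem.read(3))  # 0 - same address, different physical storage
              mem.write(15, 0)    # switch back
              print(mem.read(3))  # 42
              ```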

              So why not just add more addressing pins? Speed and power are the main issues. When you start getting all bits set high, it uses a lot of power and it starts to impact the die in terms of heat and electrical properties (this is as far as my hobbyist understanding takes me comfortably).

              This is where we get to the GPU. A GPU basically doesn’t have a memory controller like a CPU does. A GPU is very limited in other ways, as far as instruction architecture and overall speed, but it combines memory directly with compute hardware. This means the memory size is directly correlated with the compute hardware. These are some of the largest chunks of silicon you can buy, and they are produced on cutting edge fab nodes from the foundries. It isn’t market gatekeeping like it may seem at first. The way Nvidia sells a 3080 and a 3080 Ti as 8 and 16 GBV is just garbage marketing idiots ruling the consumer world; in reality the 16 GBV version is twice the silicon of the 8 GBV one.

              The main bottleneck for addressing space, as previously mentioned, is the L2 to L1 bus width and speed. That is hard info to come across.

              The AVX instructions were made for exactly this kind of heavily parallel number-crunching workload, and llama.cpp supports several of them. This is ISA, or instruction set architecture, aka assembly language. It means things can go much more quickly when a single instruction call does a complex task. In this case, an AVX512 instruction is supposed to load and operate on 512 bits from memory all at one time. In practice, it seems most implementations may do two loads of 256 bits with one instruction, but my only familiarity with this is from reading a blog post a couple of months ago about AVX512 benchmarks. This instruction set is really only available on enterprise (server class) hardware, or in other words a true workstation (a tower with a server-like motherboard and enterprise level CPU and memory).
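
              If you want to check whether a given machine or cloud instance exposes these instructions, the CPU flags are easy to inspect on Linux (llama.cpp also prints a similar feature summary when it starts up):

              ```python
              # Quick check for AVX / AVX2 / AVX-512 support by reading the kernel's CPU flags.
              with open("/proc/cpuinfo") as f:
                  flags = set()
                  for line in f:
                      if line.startswith("flags"):
                          flags.update(line.split(":", 1)[1].split())
                          break

              for feature in ("avx", "avx2", "avx512f"):
                  print(f"{feature}: {'yes' if feature in flags else 'no'}")
              ```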

              I can’t say how much this hardware can or can not do. I only know about what I have tried. Indeed, no one I have heard of is marketing their server CPU hardware with AVX512 as a way to run AI in the cloud. This may be due to power efficiency, or it may just be impractical.

              The 24 GBV consumer level cards are the largest practically available. The lowest enterprise level card I know of is the A6000 at 48 GBV. That will set you back around $3K used and in dubious condition. You can get two new 24 GBV consumer cards for that much. If you look at the enterprise gold standard of A/H100’s, you’re going to spend $15K for 100 GBV. With consumer cards and $15k, if you could find a tower that cost $1k and could fit 2 cards each, you could get 4 computers, 8 GPUs, and have 192 GBV. I think the only reason for the enterprise cards is major training of large models with massive datasets.

              The reason I think a workstation setup might be a good deal for larger models is simply the ability to load large models into memory at a ~$2k price point. I am curious whether I could do training for a 70B with a setup like this.

              A laptop with my setup is rather ridiculous. The battery life is a joke with the GPU running. Like I can’t use it for 1 hour with AI on the battery. If I want to train a LoRA, I have to put it in front of a window AC unit turned up to max cool and leave it there for hours. Almost everything AI is already set up to run on a server/web browser. I like the laptop because I’m disabled, with a bedside stand that makes a laptop ergonomic for me. Even with this limitation, a dedicated AI desktop would have been better.

              As far as I can tell, running AI on the CPU does not need super fast clock speeds; it needs more data bus width. This means more cores are better, but not just consumer-core nonsense with a single system memory bus channel.

              Hope this helps with the fundamentals outside of the consumer marketing BS.

              I would not expect anything Black Friday related to be relevant, IMO.

              • webghost0101@sopuli.xyz

                Are you secretly Buildzoid from Actually Hardcore Overclocking?

                I feel like I mentally leveled up just from reading that! I am not sure how to apply all of it to my desktop upgrade plans, but being a lifelong learner, you just pushed me a lot closer to one day fully understanding how computers compute.

                I really enjoyed reading it. <3

                • j4k3@lemmy.worldOP

                  Thanks, I never know if I am totally wasting my time with this kind of thing. Feel free to ask questions or talk any time. I got into Arduino and breadboard computer stuff after a broken neck and back 10 years ago. I figured it was something to waste time on while recovering, and the interest kinda stuck. I don’t know a ton, but I’m dumb and can usually over explain anything I think I know.

                  As far as compute, learn about the arithmetic logic unit (ALU). That is where the magic happens as far as the fundamentals are concerned. Almost everything else is just registers (aka memory), and these are arbitrarily assigned to tasks. Like one holds the location of the next instruction in the running software (the program counter), others are flags with special meanings, like interrupts for hardware or software that mean special things if bits are high or low. Ultimately, everything getting moved around is just arbitrary meaning applied to memory locations built into the processor. The magic is in the ALU because it is the one place where “stuff” happens, like math, comparisons of register values, and logic; the fun stuff is all in the ALU.
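
                  If it helps, here is a toy sketch of what an ALU boils down to (just an illustration, not how any real chip is built): take two register values and an operation, produce a result plus some status flags.

                  ```python
                  # Toy ALU: combinational logic in real hardware, a plain function here.
                  def alu(op, a, b, bits=8):
                      mask = (1 << bits) - 1
                      if op == "ADD":
                          raw = a + b
                      elif op == "SUB":
                          raw = a - b
                      elif op == "AND":
                          raw = a & b
                      elif op == "CMP":        # compare = subtract, keep only the flags
                          raw = a - b
                      else:
                          raise ValueError(op)
                      result = raw & mask
                      flags = {"zero": result == 0, "carry": raw > mask or raw < 0}
                      return result, flags

                  print(alu("ADD", 200, 100))  # (44, {'zero': False, 'carry': True}) in 8 bits
                  print(alu("CMP", 5, 5))      # (0, {'zero': True, 'carry': False})
                  ```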

                  Ben Eater’s YT stuff is priceless for his exploration of how computers really work at this level.