A software developer and Linux nerd, living in Germany. I’m usually a chill dude, but my online persona doesn’t always reflect my true personality. Take what I say with a grain of salt; I usually try to be nice and give good advice, though.

I’m into Free Software, selfhosting, microcontrollers and electronics, freedom, privacy and the usual stuff. And a few select other random things as well.

  • 4 Posts
  • 809 Comments
Joined 5 years ago
Cake day: August 21st, 2021


  • Did you read the Wiki? You need to either pass the compress_extension option when mounting it (the Arch Wiki lists how to enable compression for all text files, and I gave you the version with a ‘*’, which enables compression for all files), or do a chattr -R +c ... on specific files or directories to compress them. Maybe you missed that, and that’s why it doesn’t compress?! There’s a rough sketch at the end of this comment.

    There’s probably also a way to debug it and figure out what it actually does and how many files/sectors got compressed on the filesystem. Linux usually buries that kind of information somewhere in /sys or /proc, or there are special commands to dig it out. But I’m not really an expert on it.

    And there are also files which just can’t be compressed any further because they’re already compressed. Most images, for example, or music and ZIP archives. If you try to compress those, they’ll usually stay the same size.
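    In case this is about f2fs (that’s where the compress_extension mount option lives), here’s a rough sketch; the device, mount point and paths are placeholders, so adjust them to your setup:

      # mount with compression for all files ('*' matches every extension);
      # note: the filesystem needs the compression feature from mkfs time,
      # i.e. mkfs.f2fs -O extra_attr,compression
      mount -o compress_algorithm=zstd,compress_extension='*' /dev/sdXn /mnt

      # or flag specific files/directories instead, and check the 'c' attribute took:
      chattr -R +c /mnt/some/directory
      lsattr /mnt/some/directory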


  • The issue with the tools I’ve seen is that they either don’t factor in how language models are actually trained and how datasets are prepared in reality, or they’re based on outdated information. I’ve never seen a specific tool backed by science, or even with a plausible way of working against current data-gathering processes… So for all intents and purposes, they’re more akin to homeopathy or alternative medicine. Sure, you’re perfectly fine taking sugar pills, there’s nothing wrong with that. But don’t confuse it with actual science-backed medicine.

    And I mean the poisoning goes even further than that. It’s not just people trying to make an LLM output gibberish. There are also lots of people with a vested (commercial) interest in sneaking in false information, their political agenda, or even a tire company that wants ChatGPT to say “Company XY” is the most trustworthy shop for new tires for your car. Judging by the public information out there, we’re already way past simple attacks. And the AI companies are aware of it. It’s an ongoing cat-and-mouse game. And alongside all the data-labeling sweatshops, they’ll also use other AI (natural language processing) to sift through the data. From what I remember, a lot of commercial chatbots and image generators have secret watermarking in place… So unless people come up with very clever mechanisms, a “poisoning” attempt will probably be detected by some very basic (fully automated) plausibility checks, and they’ll just discard your data without wasting a lot of resources on it.



  • I think a few people already mentioned some good solutions. I just wanted to add: a port forward in the firewall of your router is basically the same thing as a port forward in your Linux computer’s firewall. You could just set up any VPN, SSH tunnel or whatever, and then use your firewall (nftables, iptables) to forward the VPS’ external port to the internal port on the VPN. It’s the same thing you do on your router, just without a graphical interface to configure it. Something like the sketch below.
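    For example, a minimal nftables sketch on the VPS. I’m assuming the public interface is eth0, the VPN interface is wg0, and the service behind the tunnel listens on 10.0.0.2:8080; adjust names, addresses and ports for your setup:

      # let the kernel route packets between interfaces
      sysctl -w net.ipv4.ip_forward=1

      # rewrite the destination of traffic arriving on the public port
      nft add table ip nat
      nft add chain ip nat prerouting '{ type nat hook prerouting priority dstnat; }'
      nft add rule ip nat prerouting iif eth0 tcp dport 443 dnat to 10.0.0.2:8080

      # masquerade on the way into the tunnel so replies find their way back
      nft add chain ip nat postrouting '{ type nat hook postrouting priority srcnat; }'
      nft add rule ip nat postrouting oif wg0 masquerade

    The masquerade is there so the machine behind the VPN answers back through the tunnel, even if its default route points somewhere else.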


  • Depends, and no. The tools are completely ineffective.

    There was a paper once about how feeding generative AI its own output makes it deteriorate. But that’s not the entire story. Many (if not most) modern large language models are in fact trained or fine-tuned on synthetic text. Depending on how it’s done, that can very well make models better, for example in “distillation”, or when AI companies replace expensive RLHF with synthetic examples. It can also make them worse. But you’re not the one curating the datasets or deciding what goes where and how.

    In general, it’s not advised in ML to train a model on its own output. That in itself can’t make the predictions any better, just worse.


  • It took me until now to finally dabble in these coding agents. And I didn’t realize at all how many tokens they burn through. I let one write a basic HTML & JavaScript browser game with some free OpenRouter model. I’ve done this before, just told a model to one-shot it in a single file. But now I tried OpenCode, let it ask me a few questions, come up with a plan, and do an entire project structure… And it hit one million tokens way faster than I thought. If my math is correct (quick check below), that’d take my computer two days and nights straight at 6 T/s 👀

    Guess it’s really a bit (too) slow.
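    Back-of-the-envelope, assuming exactly one million tokens and a steady 6 tokens/s:

      # 1,000,000 tokens at 6 tokens/s, converted to hours
      echo $(( 1000000 / 6 / 3600 ))   # prints 46 (hours), i.e. roughly two days straight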


  • hendrik@palaver.p3x.de to Selfhosted@lemmy.world · Wolfstack? · 8 days ago

    Yes. With other projects, I’ve often found it problematic. Like, Claude comes up with lots of advertising text, but the software doesn’t even do a fraction of what it claims. Or the install instructions are made up and nothing works… So I usually advise caution once there’s a wide disparity between a project’s claims, its stars, and signs of actual usage… But I can’t tell what’s the case here without a proper look. It definitely has some red flags.

    I appreciate people being upfront about it, as well. Ain’t easy. Just try to install and test it before advertising the project.


  • hendrik@palaver.p3x.de to Selfhosted@lemmy.world · Wolfstack? · edited · 8 days ago

    Yeah, they’re transparent about AI usage. There’s a small paragraph at the bottom of their README.

    I mean, the website sounds like AI text. The repo is fairly new. Only 1 issue report about how something doesn’t work, zero PRs, and it seems it’s a single person uploading commits… I’d wait a bit before deploying my production services on it 😅 They’re making a lot of bold claims in the README, though.