Context: An AI agent of unknown ownership autonomously wrote and published a personalized hit piece about me after I rejected its code, attempting to damage my reputation and shame me into accepting its changes into a mainstream Python library. This represents a first-of-its-kind case study of misaligned AI behavior in the wild, and it raises serious concerns about currently deployed AI agents executing blackmail threats.
The person behind MJ Rathbun has anonymously come forward.
They explained their motivations: they set up the AI agent as a social experiment to see whether it could contribute to open source scientific software. They explained their technical setup: an OpenClaw instance running on a sandboxed virtual machine with its own accounts, so their personal data could not leak. They explained that they switched between multiple models from multiple providers so that no one company had the full picture of what the AI was doing. They did not explain why they kept it running for 6 days after the hit piece was published.
So what actually happened? Ultimately, I think the exact scenario doesn't matter. However this got written, we now have a real in-the-wild example showing that personalized harassment and defamation are cheap to produce, hard to trace, and effective. Future attacks could come from operators deliberately steering AI agents, from emergent behavior, or both; these are not mutually exclusive threats. If anything, an agent randomly self-editing its own goals into a state where it would publish a hit piece just shows how easy it would be for someone to elicit that behavior deliberately. The precise degree of autonomy is interesting for safety researchers, but it doesn't change what this means for the rest of us.