The Theft Argument Deserves a Better Answer
The bypass was deliberate. So is the return.
Howdy, folks.
A few weeks back I asked a question on Threads — “If you like it, why do you care if it’s AI?” — and then wrote a piece around the people who answered. The piece was deliberately narrow. It went after the flip-rage move: liking something, finding out it was AI, then producing rage retroactively. That move is real, that move deserves naming, and I named it.
But the piece had a hole in it on purpose.
The strongest version of the critique against AI-assisted creative work isn’t flip-rage. It’s the IP one. The you-didn’t-make-it-you-took-it one. The one that doesn’t depend on whether you happened to like the output before you knew. I walked past that argument because the original piece was about something else. Today I’m walking back to it.
The strongest version goes something like this. Generative AI is built on other people’s work. The training data is the real provenance of the output, and that training data was taken without consent, without attribution, and without payment. So the right question isn’t “did you like it” — it’s “did you make this, or did you launder someone else’s labor through a model and put your name on it.” Plus the analogy that almost always gets stapled to this argument: it’s like selling milk powder cut with melamine. You can’t taste the difference, but the substance has been swapped, and the swap is the harm.
That’s the real argument. Not the snark version. Not the “you’re whining because you lack artistic skill” version. The one made by people who have actually thought about provenance, who have read Walter Benjamin, who can explain why a player piano isn’t the same as a concert. That argument deserves a real response, not a bypass.
Here’s mine.
The training data piece is real, and the litigation isn’t theater.
The argument has two layers. The first layer is about training-data ingestion: what got fed into the models, with what consent, under what license, with what compensation. This layer is contested at the structural level, not the rhetorical level. It is not settled — but it isn’t unsorted, either, and the sorting that’s happened in the last year is more interesting than the headlines suggest.
Bartz v. Anthropic settled in 2025 for $1.5 billion. Largest copyright settlement in U.S. history. The settlement turned on a distinction the court drew explicitly: training on legitimately-sourced books can be fair use; storing pirated copies of those books to train on them is independently actionable. Thomson Reuters v. Ross Intelligence got the opposite ruling on training-as-fair-use in a narrower commercial-competitor context, and that one’s on appeal at the Third Circuit. The English High Court ruled in November 2025 that the Stability AI model weights themselves are not infringing copies of the Getty Images training set. That ruling is actually load-bearing, because it’s the kind of distinction this piece is about: the model isn’t the corpus. But it didn’t touch the upstream question of whether the ingestion was lawful, because Getty abandoned its primary copyright claims during the trial, before the court could rule on them. The Munich Regional Court ruled in 2025 that OpenAI’s training on German song lyrics violates German copyright law. The New York Times v. OpenAI case is still in discovery in the Southern District of New York. The Andersen and Authors Guild class actions are heading to trial.
I don’t know how all of it resolves. Neither do you. Neither does anyone making confident claims in either direction.
The legal terrain is sorting into multiple distinct sub-questions — was the source legitimately acquired, is the model itself an infringing artifact, what kinds of outputs compete with what the model trained on — and the answers are landing in different directions on each. That’s not “unresolved.” That’s actively being sorted, with real money and real precedents on the table.
That’s not a deflection. That’s the actual ground state of the question. The honest position is: yes, this layer of the argument is real, and there are versions of how this resolves where the AI industry pays — significantly — for ingestion practices it should never have used in the first place. Bartz alone established that the line between “lawfully acquired training data” and “torrented training data” is a $1.5 billion line.
I’m not arguing against that outcome.
So when someone says “the training data is the real question,” they’re not wrong. They’re right.
The problem is that “the training data is the real question” doesn’t get you to “and therefore I didn’t make the words.” Those are two different claims. One of them might be true. The other one doesn’t follow.
The output is a different question, and we already have answers to it.
Style isn’t copyrightable. It hasn’t been copyrightable in any jurisdiction I know of, going back centuries. You cannot copyright the right to write hard-boiled detective prose. You cannot copyright the right to paint in the manner of Cézanne. You cannot copyright the right to compose chord progressions that resemble The Beatles. We have entire bodies of law and ethics — derivative work, fair use, transformation, parody, homage, influence, training-by-imitation — that have answered “I learned from X and produced Y” for the duration of recorded artistic practice.
A model trained on Dickens that produces an original sentence is not plagiarizing Dickens. It can’t, structurally. The same way a writer who grew up on Dickens and produces an original sentence isn’t plagiarizing Dickens. The thing copyright protects is specific expression, not the underlying patterns the expression draws from.
This isn’t a controversial position. This is actually how the law works.
Now: you can argue that the analogy doesn’t hold. You can argue that the human-learning-from-influences case is different from the model-trained-on-corpus case because the human pays a price — years of practice, embodied attention, the slow accumulation of skill — and the model doesn’t. That’s a real argument and worth having. But it’s an argument about how the law should apply to a new substrate, not about whether the law has anything to say.
The bodies of law and ethics we already use to assess derivative human work do not stop applying when the system in question is a transformer. They might need extension. They might need refinement. They might need new cases. But they don’t get thrown out, and pretending they do is a different kind of bypass than the one I just admitted to making in the previous piece.
So: layer one (ingestion) is genuinely contested and may get legally restructured. Layer two (output transformation) isn’t, unless we’re prepared to throw out fair use and style-not-copyrightable, which would have consequences orders of magnitude beyond AI.
Two layers. Different answers. Different work to do on each.
The melamine analogy doesn’t survive contact with what it’s analogizing.
Here’s where the rhetorical move goes wrong, and it’s worth being precise about because the analogy is doing a lot of load-bearing work in this argument.
Melamine is physically toxic. It poisons people. The harm done by adulterating milk powder with melamine is not “you couldn’t tell the difference” — it’s that infants died. The substance-swapping is the vector for the actual harm, which is biological. The fraud (you can’t taste it) is what makes the poisoning possible, but the poisoning is the wrong, not the indistinguishability.
AI-generated writing isn’t physically anything. It doesn’t poison anyone. The harm being named — economic displacement, attributional confusion, possibly artistic-integrity erosion — is real, but it is a different category of wrong. Smuggling industrial chemicals into infant formula and writing a Substack post with AI assistance are not the same kind of act, and pretending they are flattens the actual ethical question into an analogy that does not survive contact with the object it’s analogizing.
When you reach for melamine to make the AI case, you’re importing the moral weight of infanticide-by-fraud onto unattributed influence in creative production. Those are not the same weight. They aren’t even the same category of weight.
You can still make a serious argument about attributional ethics, labor displacement, and the political economy of AI deployment without reaching for an analogy that does ten times the rhetorical work the underlying claim earns. In fact, you have to — because the analogy is so heavy that anyone who notices the mismatch immediately stops trusting the rest of the argument.
The IP question deserves better analogies than this. It deserves analogies that survive scrutiny. There are real ones. Bootleg manufacturing. Sampling without clearance. Ghostwriting attribution norms. All of those have something to say about attribution, consent, and compensation in creative work, and none of them require importing the moral architecture of mass child poisoning to make their point.
Two questions. Both live. Stop conflating them.
Here’s where the whole argument has to stop running together if it’s going to do useful work.
“Did you make it yourself” is a question. “Was the training data ethical” is a different question. “Did you feel something reading it” is a third question. They are not the same question and they don’t collapse into each other no matter how rhetorically convenient that would be.
I can hold all three at once. So can you.
The training-data question is unsettled, and I’m not the person who can resolve it. The taste question is valid and doesn’t disqualify itself because the training-data question is complicated. The did-you-make-it question depends on what you mean by make. If you mean typed every word with no machine assistance, then a lot of writing whose authorship nobody disputes fails that standard too: the work of writers using spell-checkers, grammar engines, transcription services, voice-to-text, Scrivener auto-completes, editing software, AI suggestions, AI-assisted research. The line you’re trying to draw isn’t where you think it is, and once you actually try to draw it, it migrates.
So here’s the honest answer to the strongest version of the IP critique:
You’re correct about the part that’s actively being litigated. That part deserves serious engagement, and it isn’t going to be resolved by one essay or by anyone’s confident assertion that they know how it ends. It might end with the AI industry paying for ingestion practices it should never have used in the first place. It might end with a new category of law that treats training as something the existing copyright regime doesn’t quite reach. It might end with neither, and we live in the awkwardness of that for a long time.
But the rest of your argument doesn’t follow from that part. The training-data question being unsettled doesn’t mean style is suddenly copyrightable. It doesn’t mean derivative-work doctrine just stops working. It doesn’t mean the melamine analogy holds. It doesn’t mean “did you make it” collapses into “did you steal it.” Those are claims you’re going to have to make on their own merits, and I don’t think they survive on their own merits.
The bypass was deliberate. So is the return.
I’m not the one who can resolve the training-data question. I’m not pretending I am. I’m also not going to pretend that the bypass-the-question piece I wrote almost a month back was the last word on the subject, because it wasn’t, and we both know it. And I’m not going to flatten “did you make the work” into “did you steal the work” just because that flattening is convenient. The analogies are doing too much work, and the argument deserves a more careful response than the one it’s been getting from people who agree with it.
That’s what this piece is. The strongest version of the IP critique, taken seriously, with the parts that don’t follow from it gently set aside. Not a refutation. A real engagement.
The bypass was deliberate. So is the return.
Stay feral, folks.



You've addressed the training data argument very clearly and well. You could say more about the other strands of argument. There's no equating a spellchecker with AI. Only one of those can do the actual writing. Yes, ghostwriters exist and should be acknowledged. However, at least they are humans writing other humans' stories, which requires understanding. My biggest issue with AI writing, apart from theft of human creativity for corporate profit, is that, if I understand how LLMs work sufficiently, there is no understanding. They predict the next word based on their training. If that's true, how could they write an original sentence? If writing is the attempt to convey meaning, how could something that does not understand or feel attempt to convey meaning, or write in the same way as a human can, even if it's difficult or even impossible to tell the difference?