
Against Copyright

Copyright considered paradoxical, incoherent, and harmful from an information theory and compression perspective as there is no natural kind corresponding to ‘works’, merely longer or shorter strings for stupider or smarter algorithms.

One of the most troublesome aspects of copyright law as applied to technology is how the latter makes it possible - and even encourages - doing things that expose the intellectual incoherence of the former. Copyright is merely an ad hoc set of rules and customs which evolved for bygone economic conditions to accomplish certain socially desirable ends (and can be criticized or abolished for its failures); if we cannot get the correct ontology of copyright, then the discussion is foredoomed1. Many people suffer from the delusion that copyright is something more than that - that it is somehow objective, or even some sort of actual moral human right (consider the French “droit d’auteur”, one of the “moral rights”)2 with the same properties as other rights, such as being perpetual3. This is quite wrong. Information has a history, but it carries with it no intrinsic copyright.

This has been articulated in ways both serious and humorous, but we can approach it in an interesting way from the direction of information theory.

Lossless Compression

One of the more elegant ideas in computer science is the proof that lossless compression cannot compress all files. That is, while an algorithm like ZIP will compress a great many files - perhaps to tiny fractions of the original file size - it will necessarily fail to compress many other files, and indeed for every file it shrinks, it must expand some other file. The general principle here is TANSTAAFL:

“There ain’t no such thing as a free lunch.”

There is no free lunch in compression. The standard proof invokes the Pigeonhole Principle: each file must map onto a single unique shorter file, and that shorter file must map uniquely back to the longer one (if it did not, you would have devised a singularly useless compression algorithm - one that did not admit of decompression).

But the problem is that a long string simply has more ‘room’ (possibilities) than a shorter string. Consider a simple case: we have a number between 0 and 1000, and we wish to compress it. Our compressed output is a number between 0 and 10 - shorter, yes? But suppose we compress 1000 into 10. Which numbers do 900-999 get compressed to? Do they all go to 9? But then, given a 9, we have absolutely no idea what it is supposed to expand into. Perhaps 999 goes to 9, 998 to 8, 997 to 7, and so on - but just a few numbers later we run out of single-digit numbers, and we face the problem again.
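To make the counting concrete, here is a minimal Python sketch of the same Pigeonhole argument applied to binary strings; the numbers, not the code, are the point.

```python
# Minimal sketch of the Pigeonhole counting argument for binary strings:
# there are more strings of length n than strings strictly shorter than n,
# so no injective (lossless) compressor can map every length-n input to a
# distinct shorter output.

def strings_shorter_than(n: int) -> int:
    """Count distinct binary strings of length 0 .. n-1 (inclusive)."""
    return 2 ** n - 1  # 1 + 2 + 4 + ... + 2^(n-1)

n = 20
inputs  = 2 ** n                   # strings of length exactly n
outputs = strings_shorter_than(n)  # every possible strictly shorter string

print(f"length-{n} inputs:        {inputs}")
print(f"strictly shorter outputs: {outputs}")
print(f"shortfall:                {inputs - outputs}")  # always at least 1
```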

Fundamentally, you cannot pack 10kg of stuff into a 5kg bag. TANSTAAFL.

You Keep Using That Word…

The foregoing may seem to have proven lossless compression impossible, but we know that we do it routinely; so how does that work? Well, we have proven that losslessly mapping a set of long strings onto a set of shorter strings is impossible. The answer is to relax the ‘shorter’ requirement: we allow our algorithm to ‘compress’ some strings into longer ones. Now the Pigeonhole Principle works for us - there is plenty of space in the longer strings for all our to-be-compressed strings. And as it happens, one can devise ‘pathological’ input to some compression algorithms in which a short input decompresses4 into a much larger output - there are such files available which empirically demonstrate the possibility.
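The expansion half of the bargain is easy to check empirically. Here is a small sketch using Python’s standard zlib module (any similar compressor would do): high-entropy input comes out slightly larger than it went in, yet decompression still recovers it exactly.

```python
# Feeding a compressor high-entropy input typically *expands* it slightly,
# because incompressible bytes must be stored verbatim, plus container overhead.
import os
import zlib

raw = os.urandom(100_000)            # high-entropy, "pathological" input
compressed = zlib.compress(raw, 9)   # best effort, level 9

print(f"raw:        {len(raw)} bytes")
print(f"compressed: {len(compressed)} bytes")   # slightly *longer* than raw
assert zlib.decompress(compressed) == raw       # and still perfectly lossless
```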

What makes lossless compression any more than a mathematical curiosity is that we can choose which sets of strings will wind up usefully smaller, and what sets will be relegated to the outer darkness of obesity.

Compress Them All—The User Will Know His Own

TANSTAAFL is king, but the universe will often accept payment in trash. We humans do not actually want to compress all possible strings, only the ones we actually make. This is analogous to static typing in programming languages: type checking may reject many correct programs, but we do not really care, as those are not programs we actually want to run. Or, high-level programming languages insulate us from the machine and make it impossible to do various tricks one could do if one were programming in assembler; but most of us do not actually want to do those tricks, and are happy to sell that ability for conveniences like portability. Or we happily barter away manual memory management (with all the power it confers) to gain convenience and correctness.

This is a powerful concept which is applicable to many tradeoffs in computer science and engineering, but we can view the matter in a different way - one which casts doubt on simplistic views of knowledge and creativity such as we see in copyright law. For starters, we can look at the space-time tradeoff: a fast algorithm simply treats the input data (eg. a WAV file) as essentially the output, while a smaller input with basic redundancy eliminated will take additional processing and require more time to run. This is a common phenomenon, with some extreme examples5, but we can look at a more meaningful tradeoff.
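As a rough sketch of that space-time tradeoff, the following uses Python’s zlib on made-up, highly repetitive sample data: level 0 barely transforms the input (fast, large output), while level 9 works much harder for a smaller result.

```python
# Space-time tradeoff in miniature: higher compression levels spend more
# time to produce smaller outputs; level 0 just stores the data.
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog\n" * 50_000

for level in (0, 1, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(out):>9} bytes in {elapsed:.4f} s")
```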

Cast your mind back to our lossless algorithm. It is somehow choosing to compress ‘interesting’ strings and letting uninteresting strings blow up. How does it do so? It cannot be doing it at random, and easy tricks like cutting down on repetition certainly won’t let you write a FLAC algorithm that can cut a 30-megabyte WAV file down to 5 megabytes.

Work Smarter, Not Harder

An efficient program is an exercise in logical brinkmanship.

Butler W. Lampson6

The answer is that the algorithms are smart about a particular kind of file. Someone put a lot of effort into thinking about the kinds of regularity that one might find only in a WAV file and how one could predict & exploit them. As we programmers like to say, the algorithms embody domain-specific knowledge. A GZIP algorithm, say, operates while implicitly making an entire constellation of assumptions about its input - that repetition in it comes in chunks, that it is low-entropy, that it is globally similar, that it is probably either text in an English-like language (which entails another batch of regularities GZIP can exploit) or binary, and so on. These assumptions, baked into the algorithm, collectively constitute a sort of rudimentary intelligence.
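A crude way to watch those assumptions pay off (or not) is to feed the same algorithm input it was and was not built for; a sketch using Python’s gzip module, with an arbitrary repetitive English sentence as the sample text:

```python
# The same algorithm collapses English-like text but gains nothing on random
# bytes, whose regularities it was never designed to exploit.
import gzip
import os

text = (b"It is a truth universally acknowledged, that a single man in "
        b"possession of a good fortune, must be in want of a wife. ") * 2_000
noise = os.urandom(len(text))

print("text: ", len(text), "->", len(gzip.compress(text)))
print("noise:", len(noise), "->", len(gzip.compress(noise)))
```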

The algorithm compresses smartly because it operates over a small domain of all possible strings. If the domain were shrunk even further, even more assumptions could be made; consider the compression ratios attained by the top scorers in Marcus Hutter’s compression challenge. The subject area is very narrow: highly stylized and regularly formatted, human-generated English text in MediaWiki markup on encyclopedic subjects. Here the algorithms quite literally exhibit intelligence in order to eke out more space savings, drawing on AI techniques such as neural nets.

If the foregoing is unconvincing, then go and consider a wider range of compression algorithms, lossy ones included. Video codecs think in terms of wavelets and frames; they will not even deign to look at arbitrary streams. Audio codecs strive to emulate the human ear. Image compression schemes rest on uniquely human assumptions like trichromaticity.

So then, we are agreed that a good compression algorithm embodies knowledge or information about the subject matter. This proposition seems indubitable. The foregoing has all been informal, but I believe there is nothing in it which could not be turned into a rigorous mathematical proof at need.

Compression Requires Interpretation

But this proposition should be quietly worrying copyright-minded persons. Content depends on interpretation? The latter is inextricably bound up with the former? This feels dangerous. But it gets worse.

Code Is Data, Data Code

Suppose we have a copyrighted movie, Titanic. We are watching it on our DVD player or computer, the MPEG4 file is being played, and everything is fine. Now, it seems clear that the player could not play the file if it did not interpret the file through a special MPEG4-only algorithm; after all, any other algorithm would just make a mess of things, so the algorithm contains necessary information. And it seems equally clear that the file itself contains necessary information, for we cannot simply tell the DVD player to play Titanic without having that quite specific multi-gigabyte file - feeding the MPEG4 algorithm a random file, or no file at all, will likewise just make a mess of things. Therefore, both algorithm and file contain information necessary to the final copyrighted experience of the actual audio-visuals.

There is nothing stopping us from fusing the algorithm and the file. It is a truism of computer science that ‘code is data, and data is code’; every computer relies on this truth. We could easily come up with a 4 or 5 gigabyte executable file which, when run all by its lonesome, yields Titanic. The general approach of storing required data within a program is a common programming technique, as it makes installation simpler.
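A toy sketch of the fusion, with a few hypothetical bytes standing in for the multi-gigabyte payload: the data lives inside the program itself, so running the executable alone reproduces the content.

```python
# A toy "self-contained player": the payload is embedded in the program.
# In the real case, EMBEDDED_PAYLOAD would be gigabytes of video data.
import base64

EMBEDDED_PAYLOAD = base64.b64decode(b"VGhpcyBpcyB0aGUgbW92aWUu")

def play() -> None:
    # Stand-in for decoding and rendering the embedded stream.
    print(EMBEDDED_PAYLOAD.decode("utf-8"))

if __name__ == "__main__":
    play()   # prints: This is the movie.
```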

The point of the foregoing is that the information which constitutes Titanic is mobile; it could at any point shift from being in the file to the algorithm - and vice versa.

We could squeeze most of the information from the algorithm into the file, if we wanted to; an example would be a WAV audio file, which is basically the raw input for the speaker - next to no interpretation is required (as opposed to the much smaller FLAC file, which requires much algorithmic work). A similar thing could be done for the Titanic MPEG4 files. A television is n pixels by n pixels, refreshed n times a second, so a completely raw and nearly uninterpreted file would simply be a stream of n × n × n = n³ pixel values per second of footage. That demonstrates that one end of the spectrum is possible, where the file contains almost all the information and the algorithm very little.
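To get a feel for the sizes involved, here is a back-of-the-envelope sketch with illustrative (not authoritative) numbers for a roughly three-hour film at common HD parameters:

```python
# How much raw, nearly uninterpreted pixel data a feature film amounts to.
width, height   = 1920, 1080   # pixels per frame
bytes_per_pixel = 3            # 8-bit RGB, no chroma subsampling
fps             = 24           # frames per second
minutes         = 180          # about a three-hour film

total_bytes = width * height * bytes_per_pixel * fps * minutes * 60
print(f"{total_bytes / 1e12:.1f} TB of raw pixels")   # roughly 1.6 TB
```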

Gedankenexperiment

The other end of the spectrum is where the algorithm possesses most of the information and the file very little. How do we construct such a situation?

Well, consider the most degenerate example: an algorithm which contains within it an array of the frames that make up Titanic. This executable would require 0 bits of input. A less extreme example: the executable holds each frame of Titanic, numbered from 1 to (say) 1 million. The input would then merely consist of [1,2,3,4..1000000]. And when we ran the algorithm on the appropriate input, lo and behold - Titanic!

Absurd, you say. That demonstrates nothing. Well, would you be satisfied if instead it held 4 million quarter-frames? No? Perhaps 16 million sixteenth-frames will be sufficiently un-Titanic-y. If that does not satisfy you, I am willing to go down to the pixel level. It is worth noting that even a pixel-level version of the algorithm can still encode all other movies! It may be awkward to have to express the other movies pixel by pixel (a frame would be a nearly arbitrary sequence like 456788,67,89,189999,1001..20000), but it can be done.
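A toy sketch of such a decoder, with short hypothetical strings standing in for real frame data: integer inputs are indices into the table of Titanic frames baked into the ‘algorithm’, while anything else is treated as a raw, explicitly spelled-out frame, so every other movie remains encodable - just with a much longer input.

```python
# Degenerate decoder: Titanic is built in, everything else is spelled out.
TITANIC_FRAMES = ["titanic-frame-1", "titanic-frame-2", "titanic-frame-3"]

def decode(symbols):
    frames = []
    for s in symbols:
        if isinstance(s, int):               # index into the built-in table
            frames.append(TITANIC_FRAMES[s - 1])
        else:                                # raw frame spelled out explicitly
            frames.append(s)
    return frames

print(decode([1, 2, 3]))                     # "Titanic": a tiny input
print(decode(["other-movie-frame-A",         # any other movie: longer input
              "other-movie-frame-B"]))
```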

The attentive reader will have already noticed this, but this hypothetical Titanic algorithm precisely demonstrates what I meant about TANSTAAFL and algorithms knowing things; this algorithm knows an incredible amount about Titanic, and so Titanic is favored with an extremely short compressed output; but it nevertheless lets us encode every other movie - just with longer output.

In Which All Is Made Clear

There is no meaningful & non-arbitrary line to draw. Copyright is an unanalyzable ghost in the machine, which has held up thus far only through fiat and the limits of human (and machine) capabilities. This rarely came up before because humans are not free to vary how much interpretation their eyes or ears do, and we lack the mental ability to choose freely between extremely spelled-out, detailed text and extremely crabbed, elliptical, allusive language; we may try, but the math outdoes us and can produce program/data pairs which differ by orders of magnitude - and our machines can handle the math.

But now that we have a better understanding of concepts such as information, it is apparent that copyright is no longer sustainable on its logical or moral merits. There are only practical economic reasons for maintaining copyright, and the current copyright regime clearly fails to achieve such aims.
