Copyright considered paradoxical, incoherent, and harmful from an information theory and compression perspective as there is no natural kind corresponding to "works", merely longer or shorter strings for stupider or smarter algorithms.
created: 26 Sep 2008; modified: 4 May 2014; status: finished; confidence: possible; importance: 2
One of the most troublesome aspects of copyright law as applied to technology is how the latter makes it possible - and even encourages - doing things that expose the intellectual incoherence of the former; copyright is merely an ad hoc set of rules and custom evolved for bygone economic conditions to accomplish certain socially-desirable ends (and can be criticized or abolished for its failures). If we cannot get the correct ontology of copyright, then the discussion is foredoomed1. Many people suffer from the delusion that it is something more than that, that copyright is somehow objective, or even some sort of actual moral human right (consider the French “droit d’auteur”, one of the “moral rights”)2 with the same properties as other rights such as being perpetual3. This is quite wrong. Information has a history, but it carries with it no intrinsic copyright.
One of the more elegant ideas in computer science is the proof that lossless compression does not compress all files. That is, while an algorithm like ZIP will compress a great many files - perhaps to tiny fractions of the original file size - it will necessarily fail to compress many other files, and indeed for every file it shrinks, it will expand some other file. The general principle here is TANSTAAFL:
“There ain’t no such thing as a free lunch.”
There is no free lunch in compression. The normal proof of this invokes the Pigeonhole Principle; the proof goes that each file must map onto a single unique shorter file, and that shorter file must uniquely map back to the longer file (if the shorter did not, you would have devised a singularly useless compression algorithm - one that did not admit of decompression).
But the problem is, a long string simply has more ‘room’ (possibilities) than a shorter string. Consider a simple case: we have a number between 0 and 1000, and we wish to compress it. Our compressed output is between 0 and 10 - shorter, yes? But suppose we compress 1000 into 10. Which numbers do 900-999 get compressed to? Do they all go to 9? But then given a 9, we have absolutely no idea what it is supposed to expand into. Perhaps 999 goes to 9, 998 to 8, 997 to 7 and so on - but just a few numbers later we run out of single-digit numbers, and we face the problem again.
Fundamentally, you cannot pack 10kg of stuff into a 5kg bag. TANSTAAFL.
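The counting behind the Pigeonhole argument can be checked directly for small cases. What follows is only a sketch: the helper names (`count_strings`, `count_shorter`, `find_collision`) and the toy `drop_last_bit` 'compressor' are made up for illustration. It shows that there is always one fewer strictly-shorter bit string than there are strings of a given length, so any scheme that shrinks every input must send two inputs to the same output.

```python
from itertools import product

def count_strings(length):
    """Number of distinct bit strings of exactly this length."""
    return 2 ** length

def count_shorter(length):
    """Number of distinct bit strings strictly shorter than `length`
    (the empty string included): 2**0 + 2**1 + ... = 2**length - 1."""
    return sum(2 ** k for k in range(length))

def find_collision(compress, length):
    """Return two distinct inputs of `length` bits that share an output, or None."""
    seen = {}
    for bits in product("01", repeat=length):
        original = "".join(bits)
        packed = compress(original)
        if packed in seen:
            return seen[packed], original
        seen[packed] = original
    return None

def drop_last_bit(s):
    """A toy 'compressor' that shrinks every input by one bit."""
    return s[:-1]

n = 8
print(count_strings(n), count_shorter(n))   # 256 255
print(find_collision(drop_last_bit, n))     # two inputs that can no longer be told apart
```

Any other always-shrinking function can be passed to `find_collision` in place of `drop_last_bit`; for inputs of a fixed length it will always find a pair.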
The foregoing may seem to have proven lossless compression impossible, but we know that we do it routinely; so how does that work? Well, we have proven that losslessly mapping a set of long strings onto a set of shorter strings is impossible. The answer is to relax the shorter requirement: we can have our algorithm ‘compress’ a string into a longer one. Now the Pigeonhole Principle works for us - there is plenty of space in the longer strings for all our to-be-compressed strings. And as it happens, one can devise ‘pathological’ input to some compression algorithms in which a short input decompresses4 into a much larger output - there are such files available which empirically demonstrate the possibility.
What makes lossless compression any more than a mathematical curiosity is that we can choose which sets of strings will wind up usefully smaller, and what sets will be relegated to the outer darkness of obesity.
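One way to watch this trade being made is to hand a general-purpose compressor one input it was designed for and one it was not. This is a small sketch using Python's standard `zlib` module; the two sample inputs and the fixed random seed are arbitrary choices made for illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

repetitive = b"to be or not to be, " * 50                 # 1000 bytes of obvious regularity
noise = bytes(rng.getrandbits(8) for _ in range(1000))    # 1000 bytes with no regularity to exploit

packed_repetitive = zlib.compress(repetitive)
packed_noise = zlib.compress(noise)

print(len(repetitive), "->", len(packed_repetitive))  # shrinks to a small fraction
print(len(noise), "->", len(packed_noise))            # comes out slightly larger than it went in

# Both are still lossless: decompression restores the exact input.
assert zlib.decompress(packed_repetitive) == repetitive
assert zlib.decompress(packed_noise) == noise
```

The noise pays a few bytes of overhead so that the repetitive text can shrink; that is the whole bargain.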
TANSTAAFL is king, but the universe will often accept payment in trash. We humans do not actually want to compress all possible strings but only ones we actually make. This is analogous to static typing in programming languages; type checking may result in rejecting many correct programs, but we do not really care as those are not programs we actually want to run. Or, high level programming languages insulate us from the machine and make it impossible to do various tricks one could do if one were programming in assembler; but most of us do not actually want to do those tricks, and are happy to sell that ability for conveniences like portability. Or we happily barter away manual memory management (with all the power indued) to gain convenience and correctness.
This is a powerful concept which is applicable to many tradeoffs in computer science and engineering, but we can view the matter in a different way - one which casts doubt on simplistic views of knowledge and creativity such as we see in copyright law. For starters, we can look at the space-time tradeoff: a fast algorithm simply treats the input data (eg. WAV) as essentially the output, while a smaller input with basic redundancy eliminated will take additional processing and require more time to run. A common phenomenon, with some extreme examples5, but we can look at a more meaningful tradeoff.
Cast your mind back to a lossless algorithm. It is somehow choosing to compress ‘interesting’ strings and letting uninteresting strings blow up. How is it doing so? It can’t be doing it at random, and doing easy things like cutting down on repetition certainly won’t let you write a FLAC algorithm that can cut 30 megabyte WAV files down to 5 megabytes.
The answer is that the algorithms are smart about a kind of file. Someone put a lot of effort into thinking about the kinds of regularity that one might find in only a WAV file and how one could predict & exploit them. As we programmers like to say, the algorithms embody domain-specific knowledge. A GZIP algorithm, say, operates while implicitly making an entire constellation of assumptions about its input - that repetition in it comes in chunks, that it is low-entropy, that it is globally similar, that it is probably text (which entails another batch of regularities gzip can exploit) in an English-like language or it’s binary, and so on. These assumptions baked into the algorithm collectively constitute a sort of rudimentary intelligence.
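A toy version of such domain knowledge is the assumption that neighboring audio samples are close to one another. The sketch below is not how FLAC works (FLAC uses more elaborate predictors); it is only an assumed stand-in. The 'audio' is a made-up random walk, and `delta_encode` simply stores each sample's difference from the previous one before handing the bytes to `zlib`.

```python
import random
import struct
import zlib

def random_walk(n, seed=0):
    """Hypothetical 'audio': each sample differs from the last by at most 3."""
    rng = random.Random(seed)
    samples, level = [], 0
    for _ in range(n):
        level += rng.randint(-3, 3)
        samples.append(level)
    return samples

def pack16(values):
    """Serialize integers as little-endian signed 16-bit samples."""
    return struct.pack("<%dh" % len(values), *values)

def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    previous, deltas = 0, []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def delta_decode(deltas):
    """Undo delta_encode by keeping a running sum."""
    total, samples = 0, []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples

samples = random_walk(20000)
generic = zlib.compress(pack16(samples))                  # knows nothing about audio
informed = zlib.compress(pack16(delta_encode(samples)))   # assumes neighbors are similar

assert delta_decode(delta_encode(samples)) == samples     # still lossless
print(len(pack16(samples)), len(generic), len(informed))
```

The second figure printed should come out well above the third: the same generic compressor does better once it is told one true fact about its input.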
The algorithm compresses smartly because it operates over a small domain of all possible strings. If the domain shrank even more, even more assumptions could be made; consider the compression ratios attained by top-scorers in the Marcus Hutter compression challenge. The subject area is very narrow: highly stylized and regularly formatted, human-generated English text in MediaWiki markup on encyclopedic subjects. Here the algorithms quite literally exhibit intelligence in order to eke out more space savings, drawing on AI techniques such as neural nets.
If the foregoing is unconvincing, then go and consider a wider range of lossless algorithms. Video codecs think in terms of wavelets and frames; they will not even deign to look at arbitrary streams. Audio codecs strive to emulate the human ear. Image compression schemes work on uniquely-human assumptions like trichromaticity.
So then, we are agreed that a good compression algorithm embodies knowledge or information about the subject matter. This proposition seems indubitable. The foregoing has all been informal, but I believe there is nothing in it which could not be turned into a rigorous mathematical proof at need.
But this proposition should be subliminally worrying copyright-minded persons. Content depends on interpretation? The latter is inextricably bound up with the former? This feels dangerous. But it gets worse.
Suppose we have a copyrighted movie, Titanic. And we are watching it on our DVD player or computer, and the MPEG4 file is being played, and everything is fine. Now, it seems clear that the player could not play the file if it did not interpret the file through a special MPEG4-only algorithm. After all, any other algorithm would just make a mess of things; so the algorithm contains necessary information. And it seems equally clear that the file itself contains necessary information, for we cannot simply tell the DVD player to play Titanic without having that quite specific multi-gigabyte file - feeding the MPEG4 algorithm a random file or no file at all will likewise just make a mess of things. So therefore both algorithm and file contain information necessary to the final copyrighted experience of actual audio-visuals.
There is nothing stopping us from fusing the algorithm and file. It is a truism of computer science that ‘code is data, and data is code’. Every computer relies on this truth. We could easily come up with a 4 or 5 gigabyte executable file which when run all by its lonesome, yields Titanic. The general approach of storing required data within a program is a common programming technique, as it makes installation simpler.
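In miniature, the fusion looks like this. The sketch is illustrative only: `fuse`, `play`, and the tiny `movie` byte string are hypothetical names and data, standing in for a real player and a multi-gigabyte file. `fuse` emits the source code of a program with the payload baked into it, so that running the program requires no input file at all.

```python
import base64
import zlib

def fuse(payload):
    """Return Python source for a program whose code already contains the payload."""
    blob = base64.b64encode(zlib.compress(payload)).decode("ascii")
    return (
        "import base64, zlib\n"
        f"BLOB = {blob!r}\n"
        "def play():\n"
        "    return zlib.decompress(base64.b64decode(BLOB))\n"
    )

movie = b"frame-0001 frame-0002 frame-0003"   # stand-in for the real file
program_source = fuse(movie)

# 'Run' the fused program: it takes zero bytes of input and yields the movie.
namespace = {}
exec(program_source, namespace)
assert namespace["play"]() == movie
print(program_source)
```

Whether one calls `BLOB` code or data is a matter of taste; the bytes are the same either way.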
The point of the foregoing is that the information which constitutes Titanic is mobile; it could at any point shift from being in the file to the algorithm - and vice versa.
We could squeeze most of the information from the algorithm to the file, if we wanted to; an example would be a WAV audio file, which is basically the raw input for the speaker - next to no interpretation is required (as opposed to the much smaller FLAC file, which requires much algorithmic work). A similar thing could be done for the Titanic MPEG4 files. A television is n pixels by n, rendering n times a second, so a completely raw and nearly uninterpreted file would be the finite sum of n³ pixels. That demonstrates one end of the spectrum is possible, where the file contains almost all the information and the algorithm very little.
The other end of the spectrum is where the algorithm possesses most of the information and the file very little. How do we construct such a situation?
Well, consider the most degenerate example: an algorithm which contains in it an array of frames which are Titanic. This executable would require 0 bits of input. A less extreme example: the executable holds each frame of Titanic, numbered from 1 to (say) 1 million. The input would then merely consist of 1,2,3,4..1000000. And when we ran the algorithm on appropriate input, lo and behold - Titanic!
Absurd, you say. That demonstrates nothing. Well, would you be satisfied if instead it had 4 million quarter-frames? No? Perhaps 16 million sixteenth-frames will be sufficiently un-Titanic-y. If that does not satisfy you, I am willing to go down to the pixel level. It is worth noting that even a pixel level version of the algorithm can still encode all other movies! It may be awkward to have to express the other movies pixel by pixel (a frame would be a nearly arbitrary sequence like 456788,67,89,189999,1001..20000), but it can be done.
The attentive reader will have already noticed this, but this hypothetical Titanic algorithm precisely demonstrates what I meant about TANSTAAFL and algorithms knowing things; this algorithm knows an incredible amount about Titanic, and so Titanic is favored with an extremely short compressed output; but it nevertheless lets us encode every other movie - just with longer output.
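Such a Titanic-favoring codec is easy to sketch at toy scale. Everything here is hypothetical: three 32-byte 'frames' stand in for the film, and the one-byte tag format is invented for the example. A frame the decoder already knows is sent as a tag plus an index; any other frame is sent literally, at a one-byte penalty.

```python
FRAME_SIZE = 32
# Frames the algorithm 'knows' (stand-ins for every frame of the film).
KNOWN_FRAMES = [b"iceberg!" * 4, b"lifeboat" * 4, b"violins." * 4]

def encode(frames):
    """Known frames cost 2 bytes (tag 1 + index); unknown frames cost 33 (tag 0 + literal)."""
    out = bytearray()
    for frame in frames:
        assert len(frame) == FRAME_SIZE
        if frame in KNOWN_FRAMES:
            out += bytes([1, KNOWN_FRAMES.index(frame)])
        else:
            out += bytes([0]) + frame
    return bytes(out)

def decode(blob):
    """Invert encode: look up indexed frames, copy literal ones."""
    frames, i = [], 0
    while i < len(blob):
        if blob[i] == 1:
            frames.append(KNOWN_FRAMES[blob[i + 1]])
            i += 2
        else:
            frames.append(blob[i + 1 : i + 1 + FRAME_SIZE])
            i += 1 + FRAME_SIZE
    return frames

titanic = [KNOWN_FRAMES[0], KNOWN_FRAMES[1], KNOWN_FRAMES[2], KNOWN_FRAMES[1]]
other_movie = [b"hornblower at sea, episode one!!"] * 4

assert decode(encode(titanic)) == titanic
assert decode(encode(other_movie)) == other_movie
print(len(encode(titanic)))      # 8 bytes for 128 bytes of favored movie
print(len(encode(other_movie)))  # 132 bytes for 128 bytes of any other movie
```

The favored movie shrinks sixteenfold, and every other movie is still expressible, merely four bytes fatter than its raw form.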
Now consider this spectrum of algorithms from a copyright perspective. It is clear to me that in the case of a standard MPEG4 algorithm and a movie file, copyright inheres solely in the movie file. No one would suggest otherwise (software patents are an orthogonal issue). That the algorithm knows a little about the movie, about human-made movies as opposed to the set of all possible movies - is but a dubious intellectual curiosity. It also seems clear that in the case of our Titanic algorithm and the file which is merely a list of integers from 1 to 1 million, all the copyright applies to the algorithm - one cannot copyright natural facts like 1 or 1 million, after all, and I am hard-pressed to see how trivial sequences like 1 to 1 million could be copyrighted either.
But here we run into a slippery slope! At each of the many possible steps between our two cases, a little bit more information slips out of the file and into the algorithm (or vice versa). It is untenable to maintain that the file always is the copyrighted thing, or that the algorithm is always the copyrighted thing - for in either case we can produce an algorithm or file which common sense tells us must be the copyrighted object. But we cannot put our finger on where they flip roles. Is it at the ¾s mark? Or half-way?
But at the half-way mark, neither item is Titanic, and both could still be used for something else. The half-way algorithm could plausibly serve to compress episodes of another nautical production like Horatio Hornblower - if we’re down to operating on the level of small pixel blocks, there may not be a single recognizable Titanic feature in the entire algorithm but plenty of usefully ‘blue’ or ‘black’ blocks. This slippery slope isn’t as bad as some, because we can sum up the number of bits of the movie (32 billion bytes for the file, perhaps, and a few hundred million for the executable) and know that the number of possible efficient intermediates must be equal to or less than that; but 1 billion or 100 billion intermediate steps still presents us the slippery slope problem of how we can sensibly specify that copyright changes only at the 667,503,001st bit, say, but not any previous bit.
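The slope can be walked step by step with nothing more exotic than the preset-dictionary feature of Python's `zlib` module (the `zdict` argument to `zlib.compressobj` and `zlib.decompressobj`). This is a sketch under stated assumptions: 20,000 random bytes stand in for the movie, and the 'algorithm' is simply a decompressor shipped with some prefix of the movie as its dictionary. The helpers `split` and `play` are names invented here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
movie = bytes(rng.getrandbits(8) for _ in range(20000))  # incompressible stand-in for the film

def split(movie, known):
    """Move the first `known` bytes of the movie into the algorithm's preset dictionary."""
    dictionary = movie[:known]
    if dictionary:
        packer = zlib.compressobj(zdict=dictionary)
    else:
        packer = zlib.compressobj()
    return dictionary, packer.compress(movie) + packer.flush()

def play(dictionary, file_bytes):
    """The 'player': a decompressor carrying whatever the algorithm already knows."""
    if dictionary:
        unpacker = zlib.decompressobj(zdict=dictionary)
    else:
        unpacker = zlib.decompressobj()
    return unpacker.decompress(file_bytes) + unpacker.flush()

sizes = []
for known in (0, 5000, 10000, 15000, 20000):
    dictionary, file_bytes = split(movie, known)
    assert play(dictionary, file_bytes) == movie   # every split reproduces the same movie
    sizes.append(len(file_bytes))
    print(f"algorithm holds {known:>5} bytes, file holds {len(file_bytes):>5} bytes")
```

Every row plays back the identical movie; only the bookkeeping of where the bytes live has changed, and the step size of 5,000 could as easily be 1.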
There is no meaningful & non-arbitrary line to draw. Copyright is an unanalyzable ghost in the machine, which has held up thus far based on fiat and limited human (machine) capabilities. This rarely came up before as humans are not free to vary how much interpretation our eyes or ears do, and we lack the mental ability to freely choose between extremely spelled out and detailed text, and extremely crabbed elliptical allusive language; we may try, but the math outdoes us and can produce program/data pairs which differ by orders of magnitude, and our machines can handle the math.
But now that we have a better understanding of concepts such as information, it is apparent that copyright is no longer sustainable on its logical or moral merits. There are only practical economic reasons for maintaining copyright, and the current copyright regime clearly fails to achieve such aims.
For example, if you grant that copyright exists as a moral right, then you have immediately accepted the abrogation of another moral right, to free speech; this is intrinsically built into the concept of copyright, which is nothing but imposing restrictions on other people’s speech.↩︎
Jefferson eventually came to agree with Madison, supporting a limited conferral of monopoly rights but only “as an encouragement to men to pursue ideas which may produce utility.” Letter from Thomas Jefferson to Isaac McPherson (Aug. 13, 1813), in 6 Papers of Thomas Jefferson, at 379, 383 (J. Looney ed. 2009) (emphasis added).
This utilitarian view of copyrights and patents, embraced by Jefferson and Madison, stands in contrast to the “natural rights” view underlying much of continental European copyright law - a view that the English booksellers promoted in an effort to limit their losses following the enactment of the Statute of Anne and that in part motivated the enactment of some of the colonial statutes. Patterson 158-179, 183-192. Premised on the idea that an author or inventor has an inherent right to the fruits of his labor, it mythically stems from a legendary 6th-century statement of King Diarmed “‘to every cow her calf, and accordingly to every book its copy.’” A. Birrell, Seven Lectures on the Law and History of Copyright in Books 42 (1899). That view, though perhaps reflected in the Court’s opinion, ante, at 30, runs contrary to the more utilitarian views that influenced the writing of our own Constitution’s Copyright Clause. See S. Ricketson, The Berne Convention for the Protection of Literary and Artistic Works: 1886-1986, pp. 5-6 (1987) (The first French copyright laws “placed authors’ rights on a more elevated basis than the Act of Anne had done,” on the understanding that they were “simply according formal recognition to what was already inherent in the ‘very nature of things’”); S. Stewart, International Copyright and Neighbouring Rights 6-7 (2d ed. 1989) (describing the European system of droit d’auteur).
Mark Helprin’s editorial “A Great Idea Lives Forever. Shouldn’t Its Copyright?” (Lessig’s review) is one of the standard examples of this sort of view. He begins by claiming that intellectual property is morally equivalent to other forms of property, and hence is as morally protected as the others:
What if, after you had paid the taxes on earnings with which you built a house, sales taxes on the materials, real estate taxes during your life, and inheritance taxes at your death, the government would eventually commandeer it entirely? This does not happen in our society … to houses. Or to businesses. Were you to have ushered through the many gates of taxation a flour mill, travel agency or newspaper, they would not suffer total confiscation.
Of course, an author can possess his work in perpetuity - the physical artifact. The government does not inflict ‘total confiscation’ on the author’s manuscript or computer. Rather, what is being confiscated is the ability to call upon the government to enforce, with physical coercion and violence, anywhere within its borders, some practices regarding information. This is a little different from property. Helprin then addresses the economic rationale of copyright, only to immediately appeal to ethical concerns transcending any mere laws or economic gains:
It is, then, for the public good. But it might also be for the public good were Congress to allow the enslavement of foreign captives and their descendants (this was tried); the seizure of Bill Gates’s bankbook; or the ruthless suppression of Alec Baldwin. You can always make a case for the public interest if you are willing to exclude from common equity those whose rights you seek to abridge. But we don’t operate that way, mostly…Congress is free to extend at will the term of copyright. It last did so in 1998, and should do so again, as far as it can throw. Would it not be just and fair for those who try to extract a living from the uncertain arts of writing and composing to be freed from a form of confiscation not visited upon anyone else? The answer is obvious, and transcends even justice. No good case exists for the inequality of real and intellectual property, because no good case can exist for treating with special disfavor the work of the spirit and the mind.
Copyright was the sometimes sparkling, controversial fuse I used as an armature for a much expanded argument in regard to the relationship of man and machine, in which I attacked directly into the assault of modernism, collectivism, militant atheism, utilitarianism, mass conformity, and like things that are poison to the natural pace and requirements of the soul, that reify what those who say there is no soul believe is left, and that, in a headlong rush to fashion man solely after his own conceptions, are succeeding. The greater the success of this tendency, however, the unhappier are its adherents and the more they seek after their unavailing addictions, which, like the explanation for nymphomania, is not surprising. It is especially true in regard to the belief that happiness and salvation can be found in gadgets: i.e., toy worship. I addressed this. I defended property as a moral necessity of liberty. I attempted to reclaim Jefferson from the presumptuous embrace of the copyleft. (Even the docents at Monticello misinterpret Jefferson, failing to recognize his deep and abiding love of the divine order.) And I advanced a proposition to which the critics of copyright have been made allergic by their collectivist education - that the great achievement of Western civilization is the evolution from corporate to individual right, from man defined, privileged, or oppressed by cast, clan, guild, ethnicity, race, sex, and creed, to the individual’s rights and privileges that once were the province only of kings
But this is its concluding passage:
’The new, digital barbarism is, in its language, comportment, thoughtlessness, and obeisance to force and power, very much like the old. And like the old, and every form of tyranny, hard or soft, it is most vulnerable to a bright light shone upon it. To call it for what it is, to examine it while paying no heed to its rich bribes and powerful coercions, to contrast it to what it presumes to replace, is to begin the long fight against it.
Very clearly, the choice is between the preeminence of the individual or of the collective, of improvisation or of routine, of the soul or of the machine. It is a choice that perhaps you have already made, without knowing it. Or perhaps it has been made for you. But it is always possible to opt in or out, because your affirmations are your own, the court of judgement your mind and heart. These are free, and you are the sovereign, always. Choose.’
This should be true only of decompression - only with an incompetent compression algorithm will one ever be able to compress a small input to a large output, because the algorithm’s format can start with a single bit indicating whether the following is compressed or not compressed, and if the output is larger than input, not bother compressing at all. That additional bit makes the file larger and so doesn’t raise any issues with the Pigeonhole Principle.↩︎
The tradeoff between conciseness of representation and ease of decoding is illustrated in an extreme form by the information required to solve the halting problem. One standard representation of this information is as an infinite binary sequence K0 (the characteristic sequence of the halting set) whose i’th bit is 0 or 1 according to whether the i’th program halts. This sequence is clearly redundant, because many instances of the halting problem are easily solvable or reducible to other instances. Indeed, K0 is far more redundant than this superficial evidence might indicate. Barzdin showed that this information can be compressed to the logarithm of its original bulk, but no concisely encoded representation of it can be decoded in recursively bounded time.
Complexity classes like NP-hard, which are fairly high up the complexity hierarchy, are bywords for intractability & futility among programmers. But this decoding of an extremely space-efficient representation of K0 is a process so slow that it doesn’t even fit in any normal complexity class!↩︎
The hope is that transformations from a modest library will provide a path from a naïve, inefficient but obviously correct program to a sophisticated efficient solution. I have seen how via program transformations striking gains in efficiency have been obtained by avoiding recomputations of the same intermediate results, even in situations in which this possibility —note that the intermediate results are never part of the original problem statement!— was, at first sight, surprising…I am afraid that great hopes of program transformations can only be based on what seems to me an underestimation of the logical brinkmanship that is required for the justification of really efficient algorithms. It is certainly true, that each program transformation embodies a theorem, but are these the theorems that could contribute significantly to the body of knowledge and understanding that would give us maturity? I doubt, for many of them are too trivial and too much tied to program notation.