Skip to main content

Subscripts For Citations

A typographic proposal: replace cumbersome inline citation formats like ‘Foo et al. (2010)’ with subscripted dates/​sources like ‘Foo⁠…⁠2020’. Intuitive, easily implemented, consistent, compact, and can be used for evidentials in general.

2020-01-082022-10-14 finished
certainty: certain importance: 2
backlinks similar bibliography

I propose reviving an old General Semantics notation: borrow from scientific writing and use subscripts like ‘Gwern2020’ for denoting sources (like citation, timing, or medium).

Using subscript indices is flexible, compact, universally technically supported, and intuitive. This convention can go beyond formal academic citation and be extended further to ‘evidentials’ in general, indicating the source & date of statements.

While (currently) unusual, subscripting might be a useful trick for clearer writing, compared to omitting such information or using standard cumbersome circumlocutions.

I don’t believe the Sapir-Whorf hypothesis so beloved of 20th century thinkers & SF, or that we can make ourselves much more rational by One Weird Linguistic Trick. There is no far transfer, and the benefits of improved vocabulary/​notation are inherently domain-specific. You think the same thoughts in English as you do in Chinese. But, like good typography⁠, good linguistic conventions may be worth all told, say, even as much as 5% of whatever one values—and that’s not nothing. In ‘rectifying names’, be realistic: aim low. (It’s definitely worthwhile to do things like spellcheck your writings, after all, even though no amount of spellcheck can rescue a bad idea.)

Good Writing Conventions

Checklist approach. I already use a few unusual conventions, like attempting to use the Kesselman Estimative words to be more systematic about the strength of my claims or always linking fulltext in citations (and improving using link annotations which do not just link fulltext but present the abstract/​excerpts/​summary as well), and I employ a few more domain-specific tricks like avoiding use of the word ‘significance’ in statistics contexts, automatically inflation-adjusting currencies (to avoid the trivial inconvenience of doing it by hand & so not doing it at all), or using research-specific checklists⁠. Without straying into conlang territory or attempting to do everything in formal logic or serious eccentricity, what else could be done?

Subscripts For Citations/Dates/Sources/Evidentials

One idea for more precise English writing which I think could be usefully revived is broader use of subscripts.

Distinguishing things named the same. The subscripting idea is derived from General Semantics (GS)1⁠, which itself borrows it from standard science or math typography, like physics/​statistics/​mathematics/​chemistry/​programming: a superscript / subscript is an index distinguishing multiple versions of something, such as quantity, location, or time, eg. xt vs xt+1. They’re typically not seen outside STEM contexts, aside from a few obscure uses like ruby⁠/​furigana glosses⁠.

Citations

However, there are many places we could use subscripting to be clearer & more compact about which version we are referring to, using them as evidentials⁠. And because it’s clearer & more compact, we can afford to use it more places without it wasting space/​effort/​patience. Citations are a good use case. Why write “Friedenbach (2012)” if we can write “Friedenbach2012”? The latter is shorter, easier to read, less ambiguous (especially if we use it in parentheticals, see Friedenbach (2012)), and doesn’t come in a dozen different slightly-varying house styles. Why not use it for other things, like software versions too?

Evidentials

But why restrict subscripting to formal publications or written documents? Apply it to any quote, statement, or opinion where indexing variables like time might be relevant. Refusing to allow easy references to anything not a book is but codex chauvinism.

One convention, arbitrary metadata. It is a unified notation: regardless of whether something was thought, spoken, or written by me in 2020, it gets the same notation—“Gwern2020”. The evidential can be expanded as necessary: if it’s a paper or essay, the ‘2020’ can be a hyperlink, or if it’s a ‘personal communication’, then there can be a bibliography entry stating as much, or if it’s the author about their own beliefs/​actions/​statements in 2020, further information neither necessary nor usually possible (and it avoids awkward custom phraseology like “As I thought back in 2020 or so…”). In contrast, normal citation style cumbersomely uses a different format for each, or provides no guidance: how do you gracefully cite a paper written one year but whose author changed their mind 5 years later based on new results and who told you so 10 years after that? (“Dr. Bach originally maintained A (Bach et al. (2000)) but gradually modified his position until 2005 when he recalls writing in his diary he had lost all confidence in A (personal communication, according to Frieden2015)…”)

Multiple Authors

Classicists & normal people’s reactions to pseudo-Latin (by Rachel Kowert).

‘et al’ = ‘…’ Single or double authorship is straightforward, just ‘Friedenbach2012’ or ‘Frieden & Bach2012’ But how should multi-author citations, currently denoted by ‘et al’ (or ‘et al.’ or even ‘et al.’), be handled? This is important because the frequency of multi-author papers has risen dramatically, and they are now the norm in many fields; notations should be optimized for the most common case.

But the existing ‘et al’ notation is ridiculous: not only does it take up 6 letters and is natural language which should be a symbol, it’s ambiguous & hard to machine-parse, and it’s not even English.2 Writing ‘Foo et al2010’ or ‘Fooet al 2010’ doesn’t look nice, and it makes the subscripting far less compact. My current suggestion is to do the expected thing: when you elide something, how do you write that? Why, with an ellipsis ‘…’, of course! So one would write ‘Foo⁠…⁠2010’ or possibly ‘Foo⁠…⁠2010’. (I think the former is probably better, since there is less risk of confusion over what is being elided.)

Unicode Ellipsis

Funky alternatives. Horizontal ellipsis aren’t the only kind: there are several others in Unicode⁠, including midline ‘⋯’ and vertical ‘⋮’ and even the “down right diagonal ellipsis” ‘⋱’, so one could do ‘Foo⋯2010’ or ‘Foo⋮2010’ or ‘Foo⁠⋱⁠2010’. (I’m not sure about support for these particular Unicode entities, but they show up without issue in my Firefox, Emacs⁠, and urxvt, so they shouldn’t be too rare.) The vertical ellipsis is nice but unfortunately it’s hard to see the first/​top dot because it almost overlaps with the final letter, making it look like a weird colon. The midline ellipsis is middling, and doesn’t really have any virtue.3 But I particularly like the last one, down-right-diagonal ellipsis, because it works visually so well—it leads the eye down and to the right and is clear about the omission being an entire phrase, so to speak.

Generalized Evidentials

Evidentials using authors or years are short enough that they can be laid out as simple subscripts. There is no new typographic issue with that. But as discussed above⁠, there is no need to limit it to formal publications; knowledge can be derived from many sources, and even in the most formal academic writing, there are the occasional pseudo-citations like “Foo2010 (personal communication)”. A complete evidential—like “Foo told me so on the second day of our Black Forest camping trip in 2010”—would be awkward to read if naively subscripted.

In a noninteractive format, such evidentials probably must be relegated to footnotes/​endnotes/​sidenotes⁠; in an interactive format like HTML, we can do better.

For HTML, CSS supports setting maximum widths & truncating with ellipsis overflows, while expanding width on hover, so one can do something roughly like this:

.evidential { display: inline-block; white-space: nowrap; max-width: 10ch; overflow: hidden; text-overflow: ellipsis; }
.evidential:hover, .evidential:focus { max-width: min-content; position: relative; bottom: -0.5em; font-size: 80%; }
CSS prototype of expanding subscripts for long evidentials.

Then the first 10 characters will be displayed, truncated by ‘…’, and if the reader hovers over it with their mouse, it expands to reveal the arbitrarily-long evidential. The CSS seems tricky to get right, so it might be easier to resort to Javascript-based popups like my existing link annotations/​definitions using popups.js⁠.

Technical Support

Subscripts: already in a theater near you. Because it’s already used so much in technical writing, subscripting is reasonably familiar to anyone who took highschool chemistry & can be quickly figured out from context for those who’ve forgotten, and it’s well-supported by fonts and markup languages and word processors: it’s written x~t~ in Pandoc Markdown & some Markdown extensions like markdown-it (but not Reddit), x<sub>t</sub> in HTML, x<subscript>t</subscript> in DocBook⁠, x_t in TeX⁠/​LaTeX⁠, x\ :sub: \t in reStructuredText⁠; and it has keybindings C-= in Microsoft Office, C-B in LibreOffice⁠, C-, in Google Docs etc. So subscripting can be used almost everywhere immediately.

Example Use

Example: here are 3 versions of a text; one stripped of citations and evidentials, one with them written out in long form, and one with subscripts:

  1. I went to Istanbul for a trip, and saw all the friendly street cats there, just as I’d read about in Abdul Bey; he quotes the local Hakim Abdul saying that the cats even look different from cats elsewhere (but after further thought, I’m not sure I agree with that there). I and my wife had a wonderful trip, although while she clearly enjoyed the trip to the city, she claimed the traffic was terribly oppressive and ruined the trip. (Oh really?)
  2. In 2010, I went to Istanbul for a trip, and saw all the friendly street cats there, just as I’d read about in Abdul Bey’s 2000 Street Cats of Istanbul; he quotes the local Hakim Abdul in 1970 saying that the cats even look different from cats elsewhere (but after further thought as I write this now in 2020, I’m not sure I agree with Bey (2000)). I and my wife had a wonderful trip, although while she clearly enjoyed the trip to the city, on Facebook she claimed the traffic was terribly oppressive and ruined the trip. (Oh really?)
  3. I2010 went to Istanbul for a trip, and saw all the friendly street cats there, just as I’d read about in Abdul Bey2000 (Street Cats of Istanbul); he quotes the local Hakim Abdul1970 saying that the cats even look different from cats elsewhere (but after further thought, I’m not sure I2020 agree with Bey2000). I and my wife had a wonderful trip, although while she clearly enjoyed the trip to the city, she claimedFB the traffic was terribly oppressive and ruined the trip. (Oh really?)

In the first version, suppressing the metadata leads to a confusing passage. What did Bey write? We don’t learn when Abdul expressed his opinion—which is important because Istanbul, as a large fast-growing metropolis, may have changed greatly over the 40 years from quote to visit. When did the speaker become skeptical of the claim Istanbul cats both act & look different? What might explain the wife’s inconsistency, and which version should we put more weight on?

The second version answers all these questions, but at the cost of prolixity, jamming in comma phrases to specify date or source. Few people would want to either write or read such a passage, and the fussiness has a distinctly fussy pseudo-academic air. Unsurprisingly, few people will bother with this—any more than they will bother providing inflation-adjusted dollar amounts of something from a decade ago (even though that’s misleading by a good 15% or so, and compounding), or they’d want to check a paywalled paper, or redo calculations in Roman numerals.

The third version may look a little alien because of the subscripts, but it provides all the information of the second version plus a little more (by making explicit the implicit ‘2020’), in considerably less space (as we can delete the circumlocutions in favor of a single consistent subscript), and reads more pleasantly (the metadata is literally out of the way until we decide we need it).

Possible Alternative Notation

I considered 3 alternatives:

  1. Superscripts: already overloaded as footnotes & powers

  2. Bang notation: another possible notation for disambiguating, is the “X!Y” notation (apparently derived from UUCP bang notation), which is associated with online fandoms & fanfiction, and gives notation like “2020!gwern”.

    This notation puts the metadata first, which is confusing yodaspeak (what does the ‘2020’ refer to? It dangles until you read on); it makes it inline & full-sized, and then tacks on an additional character just to take up even more space; it’s confusing and unusual to anyone who isn’t familiar with it from online fanfiction already, and to those who are familiar, it is low-status and has bad connotations.

  3. Ruby annotations: as mentioned above, there is standardized HTML support (but with spotty browser support & no support at all in most other formats) for ‘ruby’ annotations which are similar to superscripts and intended for interlinear glosses.

    Unfortunately, in a horizontal language like English (as opposed to Chinese/​Japanese), they require extremely high line-heights to be at all legible. Example:

    Example of HTML ‘ruby’ interlinear gloss (eg. <ruby>$200<rt>$375 in 2020</ruby>).
  4. New symbols: no font, editor, or word processor support kills any new symbol proposal, and can be rejected out of hand.

Disadvantages

Deal-breaker: low status? The major downside, of course, is that subscripting is novel and weird. It at least is not associated with anything bad (such as fanfics), and is associated with science & technology, but I’m sure it will deter readers anyway. Does it do enough good to be worth using despite the considerable hit to weirdness points? That I don’t know.

Date Ranges

Date slippage. With inflation, what was once an absolute unit slowly becomes a relative unit, with a hidden multiplier, leading to ever greater mental slippage: it is misleading by default, and one must effortfully correct, which one will often not do. A similar issue can happens with dates and durations, in both directions: a year like “1972” is absolute but it may be relevant for its distance from the present moment, which keeps changing.

Trapped in an eternal present. People often note that they feel like they are simply in a “Now”, and don’t feel like time or history is actually passing, that it is the present year; their mental default is somehow still frozen in 2000, and they have to remind themselves that it is in fact the (much later) present year. Somehow, to many, the year 2001 is still a sci-fi future from a Kubrick movie, and not an increasingly-distant past a generation or more ago. One might write in a 2022 blog post “what happened 50 years ago [in 1972]?”, which is entirely comprehensible… as long as one didn’t write it in early January 2022 when people are still adjusting to the new calendar year and wonder what happened in 1971, and someone reading it a decade later might be severely confused about the concern over the year 1982. This goes in the other direction: ‘1972’ is 50 years ago (as I write this), but does it feel an entire half-century in the past? In 2022, when the movie sequel Top Gun Maverick aired starring, of course, Tom Cruise⁠, more than one person was startled to actually do the arithmetic for the 1986 release date of Top Gun and realize it was ~36 years ago. Or when a live-action Scooby-Doo reboot sans Scooby-Doo was announced, I was bemused to note that the franchise was now older than almost everyone debating the omission, as it started in 1969 or >53 years ago—which means that Top Gun & Scooby-Doo are closer in time than you to Top Gun. (Does it feel that way?) Such “time ghost” comparisons can doubtless be generated indefinitely as our intuitive grasp of timelines.

Context collapse. Why are past dates getting squeezed like this? Many eras seem quite distinct: the 1920s are different from the 1930s, which are totally separate from the 1940s in my mind, and likewise the 1950s and 1960s slot into place, more or less; but come the 1970s, and everything gradually blends together. I have a good feel for technology & science over that time, and can with trouble reconstruct geopolitics, but in other areas like general culture, I feel you could fool me with sequences decades out of order. Did Paul McCartney have his greatest album hits in the 1960s, or the 1990s? I’d believe you if you told me either. It’s hard to maintain a chronology of movies or music albums when one watches them indiscriminately, pulling from a century ago as easily as tracks released today, and how can we take the passage of time too seriously if Paul McCartney is still, in 2022, performing in concerts with guests such as Bruce Springsteen⁠, while elsewhere Bob Dylan continues releasing albums while touring & writing books?

Recycling without updating. Aside from that sort of “context collapse” in time, fashion cycles recycle the same things in minor variations, like environmentalism or political-correctness, which produces awkward repetitions: a feminist comparison to the misogyny of college education “50 years ago” is rhetorically effective in the 1990s, where it harks back to the 1940s (pre-second-wave feminism) and an America with few female graduate students or tenured professors, but it is drastically less so when uttered in 2022 and referring to 1972, well into the tidal wave of female entry into higher ed which saw them reach a majority of undergrads just a few years later (a milestone they would never relinquish, and has since escalated to >60% of undergrads).

To err is human. So what can be done about this? If someone is seriously committed to making a case for 1972 being tantamount to the medieval dark ages, then fine, but perhaps they just hadn’t quite thought about the implications of quoting a spicy passage from something written in the 1990s—after all, the 1990s, that was not that long ago, was it? One can try to be mindful and calculate deltas as a routine matter of reading, similar to checking whether a number of meaningful by shifting it an order of magnitude up & down, and simply remember the decade-level differences: 2020s to 1990s is 3 decades, to 1980s is 4 decades etc. But this is something so mechanical & straightforward, and dates so pervasive, that this is a poor solution: we will still subtly slip into a vague timeless “Now”.

Denote years-since. It seems like something that a better notation could help with. Scientists, for example, do not trust to absolute dates and expect readers to casually add & subtract dates millennia apart, but explicitly use durations like “100kya” or “1mya”. This is not too exotic, and I think most readers would understand what “10ya” means, particularly in a context. So, this makes a natural fit with subscript citations: as we are already putting years in subscripts, why not put in date durations like “10ya”?

Tom Cruise has returned for Top Gun Maverick (2022), the sequel to Top Gun (198636ya).

Regexp-able. This is compact, intuitive, and can be automated—particularly if one uses decimal separators for numbers but not years, so most dates can be matched as 4-digit whole numbers starting 1–2 (something like [^0-9][12][0-9][0-9][0-9][^0-9]).

Hide duplicates & in cites. For particularly date-heavy writings where the same dates come up repeatedly, this might result in too much subscripting: unlike citations, which must be displayed at each claim to serve their role, date-range subscripting is a convenience, a reminder of date distances, so after the first instance of a date, the rest can be suppressed. (This is also true of inflation-adjusting currency, but amounts tend to be fairly unique so it is less of a problem.) It would not apply to subscripted citations themselves, as combinations like ‘Foo20202ya’ or ‘Foo2020 (2ya)’ are probably unworkable, and in any case, with citations, the year itself tends not to be too important given the realities of academic publishing, where papers can take decades to be formally published, or the work can take more decades to be applied or built on: if the year is important, it can be discussed normally. (“And then in 20202ya, Foo2020 revolutionized the field by reticulating the splines with unprecedented efficiency…”)


  1. I am considerably less impressed by other GS linguistic suggestions like E-Prime⁠, but subscripting seems like it may be worth rescuing.↩︎

  2. Actually, it’s not even Latin because it’s an abbreviation for the actual Latin phrase, et alii (to save you one character and also avoid the need to correctly conjugate the Latin—the absurdity is fractal, is what I’m saying), but as pseudo-Latin, that means that many will italicize it, as foreign words/​phrases usually are—but now that is even more work, even more visual clutter, and introduces ambiguity with other uses of italics like titles. A terrible notation, and what could be more pretentious?↩︎

  3. At least for subscripts. If we were only replacing the ‘et al’ notation, MIDLINE HORIZONTAL ELLIPSIS (⋯) is great: it’s intuitive, looks nice without any CSS styling, and takes a third the pixels while also looking visually much simpler:

    A screenshot demo of the standard “et al” vs “⋯” in denoting citations by >2 authors.
    ↩︎

Further Reading

backlinks