A Prediction Concerning Statistical Findings in
Volume III of the Corpus of Indus Seals and Inscriptions
"Once a singleton, always a singleton?" — Richard Sproat
One test of whether Indus inscriptions are linguistic has developed out of a collaboration with the computational linguist Richard Sproat, currently at the University of Illinois and the Beckman Institute [update 2009 on this 2003 Webpage: Now at Oregon Health and Science University]. The test revolves around the huge number of Indus 'singletons' and other very rare signs, which are not compatible with a true 'writing system.'
If the signs were linguistic, then as the number of known inscriptions grew you would expect the corpus to 'saturate', with apparent singletons showing up a second, third, and fourth time. On the other hand, if some Indus symbols were created 'on the fly' and never used again, the ratio of singletons to the total number of known signs (n1/N, where n1 is the number of signs attested only once and N is the number of distinct known signs) would increase with each new wave of discoveries. The same would be true of the ratio of very rare signs to N, if these signs were placed on a few objects before being dropped. As we look at the last 130 years of Indus research, we find that the ratio n1/N has in fact grown steadily larger as new inscriptions have turned up, which is exactly the reverse of what we'd expect from any genuine writing system.
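To make the quantity concrete, here is a minimal sketch in Python of how n1/N could be tracked as a corpus grows. It is only an illustration, not the procedure used in the research described here. It assumes each inscription has already been transcribed as a list of sign labels, and it reads N as the number of distinct signs. The function names and the toy data are hypothetical.

```python
from collections import Counter


def singleton_ratio(inscriptions):
    """Return (n1, N, n1/N) for a list of inscriptions.

    Each inscription is a list of sign labels. N is the number of
    distinct signs; n1 is the number of signs that occur exactly once.
    """
    counts = Counter(sign for inscription in inscriptions for sign in inscription)
    n_signs = len(counts)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1, n_signs, (n1 / n_signs if n_signs else 0.0)


def ratio_by_wave(waves):
    """Cumulative n1/N after each successive wave of discoveries."""
    corpus = []
    ratios = []
    for wave in waves:
        corpus.extend(wave)
        ratios.append(singleton_ratio(corpus)[2])
    return ratios


# Toy data only (not real Indus signs): two waves of two inscriptions each.
toy_waves = [
    [["A", "B"], ["A", "C"]],  # after wave 1: A=2, B=1, C=1
    [["A", "B"], ["D"]],       # after wave 2: A=3, B=2, C=1, D=1
]
print(ratio_by_wave(toy_waves))
```

In this toy corpus the ratio falls from 2/3 to 1/2 because the singleton B turns up again in the second wave, which is the 'saturation' pattern a linguistic corpus would be expected to show. A ratio that rises from wave to wave is the opposite pattern.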
An ideal opportunity to test these ideas further will come soon, when the long-delayed third volume of the Corpus of Indus Seals and Inscriptions is finally released. The volume reproduces early photos of many inscriptions that have been lost or stolen and were not shown in the first two volumes. (A shocking percentage of the most interesting pieces has disappeared.) Some of these are cataloged in the concordances, but not all have been published before in pictorial form. The volume will also contain high-resolution images of many newly found inscriptions, including 500 or so from the last two decades of excavations at Harappa. Others come from Mohenjo-daro and other sites.
One easily testable prediction is that the anomalously high ratio n1/N, and the even higher ratio of all low-frequency Indus signs taken together, will not drop with this new crop of inscriptions, so long as current definitions of symbols are held constant. That is, we will find few apparent 'singletons' or other rare signs reappearing in the new body of inscriptions, and we can expect more new singletons or very low-frequency signs to show up. Assuming this prediction holds, one might try to 'save' old linguistic models of the inscriptions by arguing that the Indus 'script' was a Chinese-type system that required a huge number of signs, some of which we still haven't seen. But this claim would clash with one critical piece of evidence discussed elsewhere in these notes: the fact that the vast majority of Indus inscriptions are made up of a very small number of high-frequency signs. Moreover, as noted earlier, it is difficult to imagine how any ancient language thought to have been spoken in South Asia could have been encoded in a Chinese-style system.
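The prediction above could be checked mechanically once the new inscriptions are transcribed under the same sign definitions as the old ones. The following Python sketch shows one hedged way to do it: compare n1/N before and after adding the new batch, and list which old singletons reappear and which new singletons turn up. The function names, the dictionary keys, and the toy data are all hypothetical; real input would come from a concordance.

```python
from collections import Counter


def sign_counts(inscriptions):
    """Count how often each sign label occurs across a list of inscriptions."""
    return Counter(sign for inscription in inscriptions for sign in inscription)


def n1_over_n(counts):
    """Singleton ratio n1/N for a Counter of sign frequencies (0.0 if empty)."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def compare_batches(old, new):
    """Summarize how a new batch of inscriptions changes the singleton picture."""
    old_counts = sign_counts(old)
    new_counts = sign_counts(new)
    combined = old_counts + new_counts
    old_singletons = {s for s, c in old_counts.items() if c == 1}
    return {
        "ratio_before": n1_over_n(old_counts),
        "ratio_after": n1_over_n(combined),
        # Old singletons that show up again in the new batch.
        "old_singletons_reappearing": sorted(
            s for s in old_singletons if s in new_counts
        ),
        # Signs unknown before that occur exactly once in the combined corpus.
        "new_singletons": sorted(
            s for s, c in combined.items() if c == 1 and s not in old_counts
        ),
    }


# Toy data only: no old singleton reappears, and two new ones turn up.
old_batch = [["A", "B"], ["A", "C"]]
new_batch = [["A", "D"], ["E"]]
summary = compare_batches(old_batch, new_batch)
print(summary)
```

With this toy input the ratio rises from 2/3 to 4/5, no old singleton reappears, and D and E are new singletons, which is the outcome the prediction expects. The comparison is only meaningful if sign definitions are held constant across both batches, as the prediction requires.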
A defender of traditional claims might also argue that 'singletons' were personal symbols — perhaps a little like Chinese taboo names minus their phonetic elements. But this solution could not account for inscriptions containing more than one 'singleton,' whose meaning could not possibly have been understood over a wide geographical area. The pictographic sense of most singletons is very obscure, moreover, making it unlikely that any human reader, at least, could guess their (assumed) sound values through visual-auditory punning.
I propose that we label the upcoming test 'Sproat's smoking gun.'
(Note added in November 2009: data from new archaeological digs (e.g., Farmana) fully confirm this 2003 prediction. Further confirmation has come from study of the 500-odd new unpublished inscriptions in the Harappa Project database at Harvard, to which I was given access after this prediction was first made in 2003.)
What is critical to this prediction is that no drop occurs in the ratio n1/N or in the proportion of very rare signs. How many new singletons we can expect depends on the typology of the new inscriptions and on how many are duplicates. There are 112 singletons among the 2905 inscriptions in Mahadevan's 1977 concordance, which omits many short duplicates Vats left out of his 1940 Harappan excavation report. Extrapolating from these numbers, you might expect an average of one new singleton for every 25 or so new inscriptions. But the number should be lower if, as expected, many of the new inscriptions are duplicates of the numerous 'tiny steatite tablets' unique to Harappa or are so-called graffiti, few of which contain novel signs. On the other hand, I predict that a large percentage of 'singletons' will show up on newly found amulet-seals of traditional design. Again, the critical point in the prediction is that the percentage of 'singletons' and very rare signs will not diminish, even as the number of known inscriptions continues to increase. [Note added November 2009: once again this prediction so far has been amply confirmed.]
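The extrapolation in the paragraph above is simple arithmetic, sketched here in Python for anyone who wants to vary the inputs. The figures 2905 and 112 come from the text, and 500 is the approximate count of new Harappa inscriptions mentioned earlier. The novel_fraction parameter and its value of 0.5 are purely hypothetical stand-ins for the unknown share of new inscriptions that are not duplicates or graffiti.

```python
def inscriptions_per_singleton(n_inscriptions, n_singletons):
    """Average number of inscriptions per singleton in a known corpus."""
    return n_inscriptions / n_singletons


def expected_new_singletons(n_new, per_singleton, novel_fraction=1.0):
    """Naive linear extrapolation of new singletons in n_new inscriptions.

    novel_fraction is a hypothetical discount for duplicates and graffiti,
    which are expected to contribute few novel signs.
    """
    return n_new * novel_fraction / per_singleton


# Mahadevan 1977: 112 singletons among 2905 inscriptions.
rate = inscriptions_per_singleton(2905, 112)

# About 500 new inscriptions, with no discount for duplicates.
baseline = expected_new_singletons(500, rate)

# Same batch, assuming (hypothetically) that half are duplicates or graffiti.
discounted = expected_new_singletons(500, rate, novel_fraction=0.5)

print(round(rate, 1), round(baseline, 1), round(discounted, 1))
```

The rate works out to about 25.9 inscriptions per singleton, which is the 'one for every 25 or so' in the text, and the undiscounted estimate for 500 new inscriptions is about 19 new singletons. These are rough orders of magnitude only. The prediction itself concerns the ratio n1/N, not the absolute count.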