It seems that most people's familiarity with statistics is limited to quoting an opinion that there is something inherently suspicious about the whole business. Statistical factoids lend an air of authority, plausibility, respectability to just about any marketing pitch. And we're all justifiably suspicious of pitches, which we figure are trying to make us believe something that just ain't so.

So the idea that the census takers would be using statistics... well, it just doesn't seem right. What could be simpler than counting, for heaven's sake? The least mathematical among us can handle counting; it's the first bit of number handling we do, after learning what numbers are. Surely, they must be trying to pull a fast one on us.

But consider that counting a couple hundred million things is difficult enough to do accurately. When you're trying to count people who have a tendency to not stand still, or even to avoid official-looking callers at the front door for various reasons, the task is even harder. Last go-around, the Bureau of the Census estimates they missed about 4.7 million people, which is better than all but one of the five previous counts.

This time, they came up with a plan to use statistical sampling after the overall enumeration, hoping to come up with a more accurate result. That's the goal, isn't it? To get as close as possible to the actual number of people in the country?

Some folks howled, as if the Bureau was cheating or had a hidden agenda of some sort. (Which would be...?) We don't want an estimate, we want a count! The stunningly obvious fact that there is simply no economic way to count hundreds of millions of people with complete accuracy is apparently not so stunningly obvious to everyone.

I can't quote all the news stories, but I got a pretty clear impression that this was a partisan issue: the Republicans thought statistics in the census was anathema, and the Democrats kind of went along with the idea. Well, that's kind of odd, isn't it? What's partisan about counting?

When it comes to the census, everything. How many people are in each precinct, and county, and legislative district, and state determines how representatives get apportioned, how pork gets distributed, and whose ox is gored. So, apparently, Republicans think that Republicans can be counted the old-fashioned way well enough, whereas Democrats think Democrats need to have statistics applied to them.

Well maybe I'm wrong about who's for and against. But it was the Supreme Court that got to decide how it's going to be carried out, and they came up with a hare-brained split decision: only the "plain old" count can be used for the House of Representatives' apportionment, but statistically adjusted figures will have to be used for all other purposes, when feasible. Yeah, I'm sure the Founding Fathers would've wanted it that way.

Anyway, this whole issue was fairly low on my radar, but what made me write about it was finding out how simple and elegant the statistical method that's proposed is. It's called capture-recapture, and here's how it works:

1. Take a number of simultaneous samples from a population such that there can't be any overlap, and tag them in some way. (That's the "capture.") Say you end up with 500 tagged individuals.

2. In a second round, sample from the population again, and observe how many of the samples taken are tagged. This is the "recapture."

That's about it! The statistical inference is that if you sample another 1000 individuals the second time, and find that, say, 25 of them are tagged, then you've recaptured 25/500, or 1/20th, of the tagged individuals. That is, your second sample caught 1/20th of the tagged part of the population, and we infer that the untagged part of the sample is likewise 1/20th of the untagged part of the population.

The population estimate is then:

(number found by the census) / (fraction of people in the sample also found in the census)

The denominator is the number of "tagged" individuals recaptured, divided by the size of the recapture sample. So in my example, 500 divided by (25/1000) = 500 * 1000 / 25 = 20,000. Notice that both counts add up to less than 10% of the population! Neat trick.
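For anyone who'd rather see the arithmetic run than take my word for it, here is a minimal sketch in Python. The function names are my own invention, and the "simulation" is a toy: it assumes every individual is equally likely to turn up in either sample, which is exactly the sort of assumption a real census can't count on. Statisticians usually call this simple two-sample formula the Lincoln-Petersen estimator, a name that doesn't appear in anything I've quoted above.

```python
import random


def lincoln_petersen(tagged, resample_size, recaptured):
    """Estimate population size from one capture-recapture round.

    tagged:        individuals marked in the first pass (the "capture")
    resample_size: individuals drawn in the second pass
    recaptured:    how many in the second pass were already tagged
    """
    if recaptured <= 0:
        raise ValueError("need at least one recaptured individual")
    # Same as tagged / (recaptured / resample_size), rearranged so the
    # division happens last.
    return tagged * resample_size / recaptured


def simulate(true_population, tagged, resample_size, seed=0):
    """Toy round on a population of known size (hypothetical helper).

    Assumes everyone is equally likely to be drawn in both passes.
    """
    rng = random.Random(seed)
    people = range(true_population)
    tagged_ids = set(rng.sample(people, tagged))
    second_pass = rng.sample(people, resample_size)
    recaptured = sum(1 for person in second_pass if person in tagged_ids)
    return lincoln_petersen(tagged, resample_size, recaptured)


# The worked example from the text: 500 tagged, 1000 resampled, 25 recaptured.
print(lincoln_petersen(500, 1000, 25))

# One simulated round on a made-up population of 20,000; the estimate
# will land somewhere around the true size, not exactly on it.
print(simulate(20_000, 500, 1000))
```

The first call reproduces the 20,000 figure. The second changes with the seed, which is a fair reminder that the answer is an estimate with some wobble in it, not a count.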

The power of statistics is not in prevarication - one hardly needs any mathematics to practice that. The power is in inference.

Of course, the Bureau of the Census isn't looking to go quite that far. They still figure on counting most everybody, but will use the statistics to get a best estimate of the 5% or so that will be missed by the conventional approach. And for some reason, this more accurate result is good for everything but the House of Representatives....

P.B. Stark of UC Berkeley offers up a critique of the proposed method, suggesting that while the concept may be clean and simple, it doesn't necessarily improve accuracy. It gets a little thick, though, I'm warning you... He (?) says "counting is extremely robust," but counting 250 or 300 million of anything doesn't seem robust to me.

Conditions he notes for the capture-recapture method to yield an accurate estimate are:

Most of which he says aren't going to happen in the census. Probably so. I guess we can count on mistakes being made, eh?

Tom von Alten      tva_∂t_fortboise_⋅_org

Tuesday, 17-Dec-2002 23:22:26 MST