It seems that most people's familiarity with statistics is limited to
quoting an opinion that there is something inherently
suspicious about the whole business. Statistical factoids lend an air of
authority, plausibility, respectability
to just about any marketing pitch. And we're all justifiably suspicious of
*pitches*,
which we figure are trying to make us believe something that just
ain't so.

So the idea that the census takers would be using statistics... well, it
just doesn't seem right. What could be simpler than
*counting*, for heaven's sake? The least mathematical among us can
handle counting; it's the first bit of number handling we do, after learning
what numbers are.
Surely, they must be trying to pull a fast one on us.

But consider that counting a couple hundred million *things*
is difficult enough to do accurately. When you're trying to count
*people*
who have a tendency to not stand still, or even to avoid official-looking
callers at the front door for various reasons, the task is even harder.
Last go around, the Bureau of the Census estimates they missed about 4.7
million people, which is better than all but one of the 5 previous counts.

This time, they came up with a plan to use statistical sampling after the overall enumeration, hoping to come up with a more accurate result. That's the goal, isn't it? To get as close as possible to the actual number of people in the country?

Some folks howled, as if the Bureau was cheating or had a hidden agenda of
some sort. (Which would be...?) We don't want an
*estimate*, we want a *count*! The stunningly obvious fact that
there is simply no economic way to count hundreds of millions of people with
complete accuracy is apparently not so stunningly obvious to everyone.

I can't quote all the news stories, but I got a pretty clear impression that this was a partisan issue: the Republicans thought statistics in the census was anathema, and the Democrats kind of went along with the idea. Well, that's kind of odd, isn't it? What's partisan about counting?

When it comes to the census, *everything.* How many people are in each
precinct, and county, and legislative district, and state determines how
representatives get apportioned, how pork gets distributed, and whose ox is
gored. So, apparently, Republicans think that Republicans can be counted the
old fashioned way well enough, whereas Democrats think Democrats need to
have statistics applied to them.

Well, maybe I'm wrong about who's for and against. But it was the Supreme
Court that got to decide how it's going to be carried out, and they came up
with a hare-brained split decision: only the "plain old" count can be used
for the House of Representatives' apportionment, but statistically
adjusted figures will have to be used for all other purposes, when feasible.
Yeah, I'm sure the Founding Fathers would've wanted it
*that* way.

Anyway, this whole issue was fairly low on my radar, but what
made me write about it was finding out how simple and elegant the
statistical method that's proposed is. It's called
**capture-recapture**, and here's how it works:

1. Take a number of simultaneous samples from a population such that there can't be any overlap, and tag them in some way. (That's the "capture.") Say you have 500 individuals from this.

2. In a second round, sample from the population again, and observe how many of the samples taken are tagged. This is the "recapture."

That's about it! The statistical inference is that if you sample another
1000 individuals the second time, and find that, say, 25 of them are tagged,
then you've recaptured 25/500 or 1/20th of the individuals you tagged. That
is, the second sample caught 1/20th of the *tagged*
part of the population, and we infer that the untagged part of the sample is
likewise 1/20th of the *untagged* part of the population.

The population estimate is then:

$$
\frac{\text{number found by the census}}{\text{fraction of people in the sample also found in the census}}
$$

The denominator is the number of "tagged" individuals recaptured, divided by the size of the recapture sample. So in my example, 500 divided by (25/1000) = 500 * 1000 / 25 = 20,000. Notice that both counts add up to less than 10% of the population! Neat trick.
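For anyone who'd rather see the arithmetic run than take my word for it, here's a minimal sketch in Python. The function name and its parameter names are my own invention for illustration; the calculation is just the ratio described above (statisticians usually call this simple form the Lincoln-Petersen estimate), not anything taken from the Bureau's actual procedure.

```python
def capture_recapture_estimate(first_capture: int,
                               second_sample: int,
                               recaptured: int) -> float:
    """Estimate a population size from two samples.

    first_capture -- individuals tagged in the first round
    second_sample -- individuals examined in the second round
    recaptured    -- how many of the second sample carried a tag
    """
    if recaptured <= 0:
        # With no tagged individuals in the second sample, the ratio is undefined.
        raise ValueError("need at least one recaptured individual")
    if recaptured > min(first_capture, second_sample):
        raise ValueError("recaptured cannot exceed either sample size")
    # first_capture / (recaptured / second_sample), rearranged so the
    # division happens last.
    return first_capture * second_sample / recaptured


# The example from the text: 500 tagged, 1000 in the second sample, 25 tagged among them.
estimate = capture_recapture_estimate(500, 1000, 25)
print(estimate)                    # 20000.0

# Share of the estimated population that was actually handled in the two samples.
print((500 + 1000) / estimate)
```

The second figure printed is the "less than 10%" mentioned above: the two samples together amount to 1,500 individuals out of an estimated 20,000.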

The power of statistics is not in prevarication - one hardly needs any
mathematics to practice that. The power is in *inference.*

Of course, the Bureau of the Census isn't looking to go quite that far. They still figure on counting most everybody, but will use the statistics to get a best estimate of the 5% or so that will be missed by the conventional approach. And for some reason, this more accurate result is good for everything but the House of Representatives....

P.B. Stark of UC Berkeley offers up a critique of the proposed method, suggesting that while the concept may be clean and simple, it doesn't necessarily improve accuracy. It gets a little thick, though, I'm warning you... He (?) says "counting is extremely robust," but counting 250 or 300 million of anything doesn't seem robust to me.

Conditions he notes for the capture-recapture method to yield an accurate estimate are:

- one must be able to count the first capture perfectly;
- the population needs to be constant between catches;
- one must be able to determine with certainty whether an individual was captured or not;
- the population has to mix randomly between captures (the second capture needs to be a random sample of the population); and
- all individuals need to have the same propensity to be captured.
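To get a feel for why the last condition in that list matters, here's a small simulation sketch. Everything in it is made up for illustration: the function name, the population of 20,000, and the capture probabilities are my assumptions, not figures from Stark or the Bureau. It compares a population where everyone is equally likely to be caught with one where half the people are ten times harder to catch than the other half.

```python
import random


def simulate_estimate(propensities, seed=0):
    """Run one capture and one recapture, and return the population estimate.

    propensities -- one capture probability per individual (hypothetical values);
                    the same probability applies in both rounds.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    first = [rng.random() < p for p in propensities]   # caught in round one?
    second = [rng.random() < p for p in propensities]  # caught in round two?
    tagged = sum(first)
    sampled = sum(second)
    recaptured = sum(a and b for a, b in zip(first, second))
    if recaptured == 0:
        raise ValueError("no recaptures; the estimate is undefined")
    return tagged * sampled / recaptured


true_size = 20_000

# Everyone equally likely to be captured (assumed probability 0.275).
even = simulate_estimate([0.275] * true_size)

# Same average probability, but half the population is easy to find (0.5)
# and half is hard to find (0.05).
uneven = simulate_estimate([0.5] * (true_size // 2) + [0.05] * (true_size // 2))

print(f"true size: {true_size}")
print(f"estimate, equal propensity:   {even:.0f}")
print(f"estimate, unequal propensity: {uneven:.0f}")
```

With equal propensities the estimate should land near the true 20,000. With unequal ones, the easy-to-find people dominate both samples, so too many of the second sample turn up tagged and the estimate comes out well short. That's the kind of failure the conditions are warning about, at least in this toy setup.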

Tom von Alten tva_∂t_fortboise_⋅_org