Thursday, October 05, 2006

Pulling the nth Degree

Is nth-selection random?

I was recently asked the above, and like all questions these days, it seems the answer is 'it depends'.

Picking every 4th, 10th, 100th etc. name as a way of sampling is often described as 'random'. Two practical scenarios make this unlikely.

Say I need 10,000 names for a test from a house-file of 10,000,000. If I pick every 10th or even 100th name I will only go through a small portion of the whole list: the quota fills after the first 100,000 or 1,000,000 records, and the rest of the file is never touched. So, if that list is sequenced by anything meaningful like age of account, last name, value, etc., I end up with a biased list. And most database administrators will order a file by some attribute to speed loading, linking, and indexing.
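To see how little of the file gets touched, here is a rough Python sketch. The function name `nth_pull` and the scaled-down file (1,000,000 records, quota of 1,000, the same 1000:1 ratio as above) are my own assumptions for illustration, not a real house-file.

```python
def nth_pull(records, n, quota, start=0):
    """Take every n-th record from index `start`, stopping once `quota` records are pulled."""
    return records[start::n][:quota]


# Hypothetical stand-in for a house-file: the integers are record positions
# in a file that is kept in some meaningful order (say, age of account).
house_file = list(range(1_000_000))
quota = 1_000

for n in (10, 100, 1_000):
    sample = nth_pull(house_file, n, quota)
    # Share of the file walked through before the quota filled.
    coverage = (sample[-1] + 1) / len(house_file)
    print(f"every {n}th: last record touched = {sample[-1]}, coverage = {coverage:.1%}")
```

Only the step that matches the ratio of file size to quota (every 1,000th here) walks the whole file. Anything smaller fills the quota from the front of the list, which is exactly where a sorted file is least representative.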

When doing repeated pulls over time, nth selections may result in the same list being pulled. Changing the starting point or randomly sorting before the pull minimizes this risk.
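Both fixes are easy to sketch in Python. The helper names `nth_pull_random_start` and `shuffled_pull` and the made-up `names` list are assumptions for illustration only; a real pull would more likely happen in the database itself.

```python
import random


def nth_pull_random_start(records, n, rng):
    """Every n-th record, beginning at a start drawn uniformly from 0..n-1."""
    start = rng.randrange(n)
    return records[start::n]


def shuffled_pull(records, quota, rng):
    """Randomly sort a copy of the list, then take the first `quota` records."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:quota]


rng = random.Random()  # pass a seed, e.g. random.Random(2006), for a reproducible pull
names = [f"name_{i:04d}" for i in range(1_000)]  # hypothetical list

# With a fixed start, two pulls made at different times are identical.
fixed_a = names[0::100]
fixed_b = names[0::100]
print(fixed_a == fixed_b)

# A random start or a shuffle gives a different list from pull to pull.
print(nth_pull_random_start(names, 100, rng))
print(shuffled_pull(names, 10, rng))
```

The random start is the cheaper fix, but it can still only produce one of 100 possible lists here. The shuffle costs a full pass over the file and removes the dependence on file order altogether.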

Random assumes that every record has an equal chance of being selected. While nth is a means of sampling, it is by no means random: once the first record from a list is flagged, whether any other record gets pulled is already determined.
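A toy enumeration shows how constrained an nth pull is. This is only a sketch on a made-up list of 10 records, pulling every 5th, and comparing against every possible pair a truly random draw of two records could produce.

```python
from itertools import combinations

records = list(range(10))  # hypothetical 10-record list
step = 5                   # every 5th record, so each sample has 2 records

# Every sample an nth pull can produce, one per possible starting point.
systematic_samples = {tuple(records[start::step]) for start in range(step)}

# Every sample of 2 records a simple random draw can produce.
simple_random_samples = set(combinations(records, 2))

print(len(systematic_samples))     # 5 possible samples
print(len(simple_random_samples))  # 45 possible samples

# Flag record 0 and record 5 is certain to follow; record 1 cannot.
print((0, 5) in systematic_samples)  # True
print((0, 1) in systematic_samples)  # False
```

Picking the start is the only choice the nth pull ever makes; everything after that is fixed by the order of the file.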

Whether this matters a hill of beans or not depends a lot on why one samples in the first place. (More on that soon.)
