How Accurate Is The Biometric?
Determining The Size Of The Test Set
At some point in selecting a biometric method, or in choosing between several vendors of a particular type of biometric system, the question arises: "How accurate is this biometric?" It is not really worthwhile to get accuracy figures from the vendors themselves, for several reasons: 1) the vendor may not actually know (i.e., very little or no testing has been done); 2) the vendor does know but wishes to suppress the data because the figure is below market expectations; or 3) the vendor has an inflated opinion of the system's accuracy because the system has never been subjected to a rigorous test, and whatever basic testing has been performed has revealed few or no flaws.
If you want to make an intelligent choice of a biometric system, you must manage and design the testing of the biometric system yourself.
When constructing an accuracy test, one of the first questions to consider is "How many test sets (samples) must be used in order to be sure that the final, overall test result represents the 'True' accuracy of the system (also referred to as the "true mean accuracy" of the system)?" On one hand, testing is expensive in terms of money, time, and resources. On the other hand, the test must be rigorous enough to yield a very close approximation of the inherent matching capabilities of the biometric system in question.
Take, for example, two extremes:
1.) A single test sample is used to determine accuracy. If the test is successful, you will judge the accuracy to be 100%; if the test fails, the accuracy is determined to be 0%;
2.) Several million test samples are used; the number of "Hits" is divided by the total number of searches made, and the result is multiplied by 100 to yield the accuracy level of the system expressed as a percentage.
Is the first test fair to you or to the vendor? Probably not, but it is very cheap and quick. What about the second test? The accuracy figure that results from the second test will be virtually identical to the actual true mean accuracy of the system; additional testing will have virtually no effect on the measured accuracy figure obtained in this very large test. The second test, however, is extremely expensive, and it could not reasonably be completed within acceptable time limits. Somewhere in between these two testing extremes is the correct tradeoff between the desire for an absolute answer and the practicality of performing and funding the search for a reasonable (and defensible) answer.
So, what is the response to the previous question, "How many samples must I use for the test?" The answer lies, not surprisingly, with you, the buyer of the biometric system. How much of an error in testing are you willing to accept? In this context, "error" is the (possible) difference between the actual, TRUE MEAN ACCURACY of the system and the accuracy measured by your test. An acceptable error limit of the test is measured in terms of the "Level Of Confidence" (LOC) you are willing to accept for the test and how precise you determine the accuracy estimate must be. The precision of the testing is determined by the “Confidence Interval” of the test (discussed below).
A single test will result in a measured accuracy score for that test; the TRUE MEAN ACCURACY of the system (i.e., the one that is obtained by performing millions of tests) lies within a band of values on either side of the single test's measured accuracy score. The range (or size) of this band of values is termed the "Confidence Interval" of the test. For a given LOC, the Confidence Interval becomes narrower as the number of samples (or test sets) used in each individual test increases, thus improving the precision of the test.
Usually, testing is done on systems without any a priori knowledge of the system's accuracy. In this case, for an LOC of 95%, millions of samples will produce a Confidence Interval so small that the TRUE accuracy value virtually equals the measured accuracy value. Using just 100 samples, and expecting the same LOC from the test, the TRUE accuracy value will lie (in 95 out of 100 tests) within a band of ±10 percentage points on either side of the measured accuracy value. [See Calculate the Confidence Interval.]
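The ±10% figure for 100 samples can be reproduced with a short calculation. The sketch below assumes the usual normal approximation for a proportion and, because nothing is known about the system in advance, the widest-case accuracy of 50%; the function name margin_of_error and its parameters are illustrative and do not come from any particular testing tool.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n_samples, loc=0.95, p=0.5):
    """Half-width of the Confidence Interval, as a fraction (0.10 = 10 points).

    Assumes the normal approximation for a proportion. p=0.5 is the
    widest-case assumption when the system's accuracy is unknown.
    """
    # Two-sided critical value for the requested Level Of Confidence.
    z = NormalDist().inv_cdf(0.5 + loc / 2.0)
    return z * sqrt(p * (1.0 - p) / n_samples)


for n in (100, 1_000, 1_000_000):
    points = 100 * margin_of_error(n)
    print(f"{n:>9} samples: +/- {points:.2f} percentage points")
```

Under these assumptions, 100 samples give a half-width just under 10 percentage points, and a million samples give roughly a tenth of a point.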
If a 100-sample test set is run against two competitors' systems and one system has a measured accuracy of 55% and the other has a measured accuracy of 60%, you cannot claim that the second system's accuracy is actually higher than that of the first. Why? Because the Confidence Interval bands for the two systems overlap; the accuracy values in the range 50% to 65% (60% - 10%, and 55% + 10%) are common to both systems' Confidence Interval bands. In this example, the test cannot rule out a TRUE accuracy of 51% or of 64% for either system; the band of values that the test cannot resolve runs from the highest measured accuracy score - 10% to the lowest measured accuracy score + 10%.
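The overlap argument can be checked mechanically. The following is a minimal sketch, again assuming the normal approximation with the widest-case accuracy of 50%; the helper names confidence_interval and intervals_overlap are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(measured, n_samples, loc=0.95, p=0.5):
    """Return (low, high) around a measured accuracy given as a fraction.

    Assumption: normal approximation, widest-case p=0.5 for the band width.
    """
    z = NormalDist().inv_cdf(0.5 + loc / 2.0)
    half = z * sqrt(p * (1.0 - p) / n_samples)
    return max(0.0, measured - half), min(1.0, measured + half)


def intervals_overlap(a, b):
    """True when two (low, high) bands share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


first = confidence_interval(0.55, 100)
second = confidence_interval(0.60, 100)
print(first, second, intervals_overlap(first, second))
```

With 100 samples the two bands overlap, so the 55% and 60% scores cannot be ranked. Rerunning the same comparison with a much larger sample count (for example 2,000) narrows both bands enough that they no longer touch.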
Is knowing that the actual accuracy of the system lies within a band of ±10% centered on the measured accuracy acceptable to you? If not -- if the band is too wide -- you must do more testing, that is, use more samples. More samples means that the Confidence Interval decreases so that there is an even narrower band centered on the measured accuracy score; within this band lies the true accuracy of the system.
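The question can also be turned around: given the band you are willing to accept, how many samples are needed? A rough sketch, assuming the normal approximation and the widest-case accuracy of 50% (the function name required_samples is illustrative only):

```python
from math import ceil
from statistics import NormalDist


def required_samples(half_width, loc=0.95, p=0.5):
    """Smallest sample count whose band is no wider than +/- half_width.

    half_width is a fraction (0.05 means +/- 5 percentage points).
    Assumption: normal approximation, widest-case p=0.5.
    """
    z = NormalDist().inv_cdf(0.5 + loc / 2.0)
    return ceil(z * z * p * (1.0 - p) / half_width ** 2)


for width in (0.10, 0.05, 0.01):
    print(f"+/- {100 * width:.0f} points needs {required_samples(width)} samples")
```

Because the sample count grows with the inverse square of the band width, halving the band roughly quadruples the number of samples required.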
For the following examples, the system accuracies for Vendors A, B, and C are 75%, 80% and 85%, respectively. The examples illustrate how increasing the number of test sets or "samples" in the test increases the "test resolution," or ability to differentiate one vendor's accuracy from the others'. All examples have a 99.9% Level Of Confidence.
100 samples -- No differentiation of accuracy scores.
500 samples -- Vendor A may possibly be eliminated.
1,000 samples -- Vendor A is eliminated.
Together, these examples illustrate the trend: test resolution increases as the number of test samples increases.
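The three-vendor examples can be approximated numerically. The sketch below is not necessarily the calculation used for the original examples: it assumes the normal approximation and uses each vendor's own accuracy (rather than the widest-case 50%) to size its band. All names in the code are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

LOC = 0.999  # Level Of Confidence used in the examples
VENDORS = {"A": 0.75, "B": 0.80, "C": 0.85}


def interval(measured, n_samples, loc=LOC):
    """(low, high) band; assumes the band is sized from the measured score."""
    z = NormalDist().inv_cdf(0.5 + loc / 2.0)
    half = z * sqrt(measured * (1.0 - measured) / n_samples)
    return measured - half, measured + half


def overlap(a, b):
    """True when two (low, high) bands share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def separated_pairs(n_samples):
    """Vendor pairs whose bands no longer overlap at this sample count."""
    bands = {name: interval(acc, n_samples) for name, acc in VENDORS.items()}
    names = sorted(bands)
    return [(x, y)
            for i, x in enumerate(names)
            for y in names[i + 1:]
            if not overlap(bands[x], bands[y])]


for n in (100, 500, 1000):
    print(f"{n:>5} samples: separated pairs = {separated_pairs(n)}")
```

Under these assumptions no pair is separated at 100 samples, the bands for Vendors A and C still overlap by a small margin at 500 samples, and only at 1,000 samples does Vendor A's band fall entirely below Vendor C's.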