How Accurate Is The Biometric?
Determining The Size Of The Test Set
At some point in selecting the right biometric method, or in choosing
between several vendors of a particular type of biometric system, the
question arises: "How accurate is this biometric?"
It is rarely worthwhile to rely on accuracy figures from the vendors
themselves, for several reasons: 1) the vendor may not actually know
(i.e., little or no testing has been done); 2) the vendor does
know but wishes to suppress the data because the figure falls below market
expectations; or 3) the vendor has an inflated opinion of the system's
accuracy because the system has never been subjected to rigorous testing, and
whatever basic testing has been performed has revealed few or no flaws.
If you want to make an intelligent choice of a biometric system, you must
design and manage the testing of the biometric system yourself.
When constructing an accuracy test, one of the first questions
to consider is: "How many test sets (samples) must be used in order to be
sure that the final, overall test result represents the TRUE accuracy of the
system?" (This value is also referred to as the "true mean accuracy" of the
system.)
On one hand, testing is expensive in terms of money, time, and resources.
On the other hand, the test must be rigorous enough to yield a very close
approximation of the inherent matching capabilities of the biometric system in
question.
Take, for example, two extremes:
1) A single test sample is used to determine accuracy. If the test
is successful, you will judge the accuracy to be 100%; if the test fails, the
accuracy is determined to be 0%.
2) Several million test samples are used; the number of "hits" is divided
by the total number of searches made, and the result is multiplied by 100 to
yield the accuracy level of the system expressed as a percentage.
Is the first test fair to you or to the vendor?
Probably not, but it is very cheap and quick. What about the second
test? The accuracy figure that results from the second test will be
virtually identical to the actual TRUE MEAN ACCURACY of the system; additional
testing will have virtually no effect on the measured accuracy figure obtained
in this very large test. The second test, however, is extremely expensive,
and it could not reasonably be completed within acceptable time limits.
Somewhere in between these two testing extremes lies the correct tradeoff
between the desire for an absolute answer and the practicality of performing
and funding the search for a reasonable (and defensible) answer.
So, what is the answer to the question posed above, "How
many samples must I use for the test?" The answer lies,
not surprisingly, with you, the buyer of the biometric system. How much
error in testing are you willing to accept? In this context,
"error" is the (possible) difference between the actual, TRUE MEAN ACCURACY of
the system and the accuracy measured by your test. An acceptable error
limit for the test is expressed in terms of the "Level of Confidence" (LOC)
you are willing to accept for the test and how precise you determine the
accuracy estimate must be. The precision of the test is determined by its
"Confidence Interval" (discussed below).
A single test will result in a measured accuracy score
for that test; the TRUE MEAN ACCURACY of the system (i.e., the one that is
obtained by performing millions of tests) lies within a band of values on
either side of the single test’s measured accuracy score. The
range (or size) of this band of values is termed the "Confidence
Interval" of the test. For a
given LOC, the Confidence Interval becomes narrower as the number of samples
used in each individual test increases. In other words, as the number of
samples (or test sets) is increased, the Confidence Interval narrows,
thus improving the precision of the test.
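The narrowing can be seen directly by computing the interval. As a sketch, using the standard normal approximation to the binomial proportion (an assumption; the document does not specify which interval formula underlies its figures):

```python
from statistics import NormalDist

def ci_half_width(p, n, loc=0.95):
    """Half-width of the Confidence Interval around a measured accuracy
    p (a fraction, e.g. 0.80) obtained from n samples, using the normal
    approximation to the binomial distribution."""
    z = NormalDist().inv_cdf(1 - (1 - loc) / 2)  # two-sided critical value
    return z * (p * (1 - p) / n) ** 0.5

# The interval narrows as the sample count grows (measured accuracy 80%):
for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}:  +/- {ci_half_width(0.80, n):.1%}")
```

At a 95% LOC and a measured accuracy of 80%, the half-width shrinks from about ±7.8% at 100 samples to under ±1% at 10,000.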
Usually, testing is done on systems without any a priori
knowledge of the system's accuracy. In this case, for an LOC of 95%,
millions of samples will produce a Confidence Interval so small that the TRUE
accuracy value virtually equals the measured accuracy value.
Using just 100 samples, and expecting the same LOC from the test, the TRUE
accuracy value will lie (in 95 out of 100 tests) within a band of ±10% on
either side of the measured accuracy value.
If a 100-sample test set is run against two different
competitors' systems, and one system has a measured accuracy of 55% while the
second has a measured accuracy of 60%, you cannot claim that the second
system's accuracy is absolutely higher than that of the first. Why?
Because the Confidence Interval bands for the two systems overlap; the
accuracy values in the range 50% to 65% (60% − 10%, and 55% + 10%) are common
to both systems' Confidence Interval bands. In this example, the TRUE
accuracy is just as likely to be 51% as it is to be 64%; the precision (or
Confidence Interval band) of the test equals the range defined by the highest
measured score − 10% to the lowest measured accuracy score + 10%.
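The overlap check in this example can be written out explicitly. A minimal sketch under the same normal-approximation assumption (the exact half-width at 100 samples comes out slightly under the rounded ±10% figure used above):

```python
from statistics import NormalDist

def ci(p, n, loc=0.95):
    """(lower, upper) Confidence Interval around measured accuracy p."""
    z = NormalDist().inv_cdf(1 - (1 - loc) / 2)
    hw = z * (p * (1 - p) / n) ** 0.5
    return p - hw, p + hw

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

sys1 = ci(0.55, 100)  # first system: 55% measured on 100 samples
sys2 = ci(0.60, 100)  # second system: 60% measured on 100 samples
# Overlapping bands mean neither system can be declared more accurate
print(intervals_overlap(sys1, sys2))
```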
Is knowing that the actual accuracy of the system lies within a
band of ±10% centered on the measured accuracy acceptable to you? If not,
that is, if the band is too wide, you must do more testing: use more
samples. More samples narrow the Confidence Interval, producing an even
narrower band centered on the measured accuracy score; within this band lies
the TRUE accuracy of the system.
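The question can also be run in reverse: fix the band width you are willing to accept, then solve for the number of samples. A sketch, again assuming the normal approximation, with p = 0.5 as the conservative worst case (the function name is mine):

```python
from math import ceil
from statistics import NormalDist

def samples_needed(half_width, p=0.5, loc=0.95):
    """Smallest sample count whose Confidence Interval half-width does
    not exceed half_width. p = 0.5 is the worst case (widest interval),
    the safe choice when the accuracy is unknown a priori."""
    z = NormalDist().inv_cdf(1 - (1 - loc) / 2)
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)

print(samples_needed(0.10))  # roughly 100 samples for a +/-10% band
print(samples_needed(0.05))  # halving the band roughly quadruples the work
```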
For the following examples, the true system accuracies for Vendors
A, B, and C are 75%, 80%, and 85%, respectively.
The examples illustrate how increasing the number of test sets, or
"samples," in the test increases the "test resolution," that is, the ability
to differentiate one vendor's accuracy capability from another's.
All examples have a 99.9% Level Of Confidence.
100 samples – no differentiation of the accuracy scores.

500 samples – Vendor A may possibly be eliminated.

1,000 samples – Vendor A is eliminated.

Together, these results illustrate the trend of increasing test resolution
as the number of test samples increases.
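The three-vendor comparison can be reproduced numerically. A sketch under the normal-approximation assumption, at the stated 99.9% LOC; at 500 samples Vendor A's band still barely overlaps Vendor C's, matching the "may possibly be eliminated" call:

```python
from statistics import NormalDist

# 99.9% two-sided critical value from the standard normal distribution
Z = NormalDist().inv_cdf(1 - (1 - 0.999) / 2)

def band(p, n):
    """Confidence Interval (lower, upper) around measured accuracy p
    for n samples at the 99.9% Level of Confidence."""
    hw = Z * (p * (1 - p) / n) ** 0.5
    return p - hw, p + hw

vendors = {"A": 0.75, "B": 0.80, "C": 0.85}
for n in (100, 500, 1000):
    bands = {v: band(p, n) for v, p in vendors.items()}
    # Vendor A is ruled out only when its band clears Vendor C's entirely
    separated = bands["A"][1] < bands["C"][0]
    cols = "  ".join(f"{v}: {lo:.1%}-{hi:.1%}" for v, (lo, hi) in bands.items())
    print(f"n = {n:>5}  {cols}  A eliminated vs C: {separated}")
```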