Level Of Confidence

 

The term "Level Of Confidence" (LOC) is used to describe the percentage of instances that a set of similarly constructed tests will capture the true mean (accuracy) of the system being tested within a specified range of values[1]  around the measured accuracy value of each test.

Put another way, common sense tells us that as you perform more and more tests on a system (in this case, tests of accuracy), you become increasingly confident in predicting the result of the next test.[2]

If a biometric-based matching system has reasonable levels of consistency and repeatability, successive accuracy test scores will tend to cluster within a progressively narrower range of values[3] as the number of tests increases.

There are (at least) two ways of establishing a reasonably accurate estimate of a system’s accuracy: (1) conduct one very large test, in terms of the number of “test sets” or “samples” used, and declare that the system’s accuracy is the measured accuracy[4] of that test; or (2) conduct many smaller tests and declare that the system’s accuracy lies somewhere within a range defined by the highest and lowest measured accuracy values obtained in these small tests.

There are some problems associated with the two methods described above:

·      What is a “very large test” in terms of the number of test sets used?

·      How many “smaller tests” should be performed and how many test sets should be in each small test?

Problems like these have always existed; with the advent of industrialization, however, they became extremely important.  Should every ball bearing be tested after it is machined?  Since it is obviously impractical to test every drug capsule or every ball bearing coming off the assembly line, how many items from each batch should be tested in order to verify that a certain level of production quality has been met?  The answer has a direct impact on the cost of manufacture and the producer’s profit, and the profit motive has, in turn, driven considerable investment in testing techniques that are both efficient and cheap.

Statisticians have studied these questions, and manufacturers have experimented with the resulting concepts, for many years.  The result is a method that determines the sample size of a test from the desired Level of Confidence (LOC) in the test and from the test Confidence Interval (CI) that is acceptable to the testing agent.  By using these concepts, and assuming once again that the system being tested has reasonable levels of consistency and repeatability, the range of accuracy values resulting from successive accuracy tests of the system can be predicted by performing a single test; this prediction is based on several factors (a short calculation sketch follows the list below):

·       The desired LOC of the test;

·       The magnitude of the CI that is acceptable to the testing agent;

·       The number of “test sets” or “samples” used in the test; and,

·       The estimated accuracy of the system to be tested.[5], [6]
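To make the relationship among these factors concrete, here is a minimal sketch (in Python) of the standard normal-approximation sample-size formula, n = z^2 * p(1-p) / E^2.  The function name and the z value of 1.96 (the value corresponding to a 95% LOC) are my own illustration:

    import math

    def sample_size(z, ci_half_width, est_accuracy=0.5):
        """Samples needed so that the test's Confidence Interval has the
        requested half-width at the requested Level of Confidence.
        Normal approximation to the binomial: n = z^2 * p(1-p) / E^2."""
        p = est_accuracy
        return math.ceil(z ** 2 * p * (1.0 - p) / ci_half_width ** 2)

    # 95% LOC, CI of +/-4%, no prior accuracy estimate (worst case, p = 0.5):
    print(sample_size(1.96, 0.04))        # 601 -- rounded to 600 in the text
    # With a prior estimate of 97% accuracy (see footnote [5]), far fewer:
    print(sample_size(1.96, 0.04, 0.97))  # 70

The second call illustrates footnote [6]: any estimated accuracy away from 50% shrinks the required number of test sets.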

So, what does this mean for biometric testing?  The following scenarios describe two approaches to biometric testing, which I will call “The Hard Way” and “The Easy Way” for reasons that I hope will be obvious.

The Hard Way

Suppose we wish to establish a reliable estimate of a system’s accuracy by performing many tests.

·      1,000 accuracy tests are performed on a system that does biometric-based one-to-many searches against a large database of biometric “templates.”

·      The same number of test sets[7] is used in each of the 1,000 tests.  In this example, each test is conducted with 100 test sets.

·      Each test set is constructed in the same manner but each test has a different set of test data.

·      When all of the 1,000 test results (i.e., measured accuracy scores) are compiled, it is observed that 950 of the 1,000 tests produced measured accuracy scores with values in the range 76% to 84%.

 

Given the conditions and results listed above, the test administrator can make this statement:

“If I perform yet another accuracy test[8] of the system (the 1001st), there is a 95% chance that the measured accuracy of the system for this test will be within the range of 76% to 84% [80% ± 4%].”
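For the curious, here is a small Monte Carlo sketch (Python) of this whole procedure.  The 80% true accuracy is a hypothetical value chosen to match the example; in a real test it would, of course, be unknown:

    import random

    random.seed(1)
    TRUE_ACCURACY = 0.80          # hypothetical; unknown to a real tester
    TESTS, SAMPLES = 1000, 100

    # Run 1,000 independent tests of 100 one-to-many searches each and
    # record each test's measured accuracy (successes / trials).
    scores = sorted(
        sum(random.random() < TRUE_ACCURACY for _ in range(SAMPLES)) / SAMPLES
        for _ in range(TESTS)
    )

    # The band holding the central 950 of the 1,000 scores; its width
    # depends directly on the per-test sample size.
    low, high = scores[25], scores[974]
    print(f"95% of measured accuracies fell between {low:.0%} and {high:.0%}")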

 

The Easy Way

Suppose we wish to establish a reliable estimate of a system’s accuracy by performing only one test.  The system being tested is the same system that was tested using “The Hard Way,” above.

·      Set the LOC to 95%.

·      Determine how many “test sets” or “samples” (N) are needed to produce a test with a Confidence Interval of ±4% given a LOC of 95%.  [Refer to Calculation of Confidence Interval; N=600.]

·      The administrator of the test has no prior knowledge of the approximate accuracy of the system being tested.

·      Perform the test with 600 samples and calculate the measured accuracy[4] of the test.

·      The measured accuracy of the test is 82%.

Given the conditions and results listed above, the test administrator can make this statement:

“If I perform yet another accuracy test[9] of the system (the 2nd), there is a 95% chance that the measured accuracy of the system for the 2nd (and any subsequent) test will be within the range of 78% to 86% [82% ± 4%].”
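The arithmetic behind this statement can be sketched as follows.  Note that the ±4% above is the conservative planning value computed at an assumed accuracy of 50% (the tester had no prior knowledge); the interval computed from the observed 82% comes out slightly tighter:

    import math

    def confidence_interval(measured_accuracy, n, z=1.96):
        """Confidence Interval around a single test's measured accuracy,
        using the normal approximation (z = 1.96 for a 95% LOC)."""
        half = z * math.sqrt(measured_accuracy * (1 - measured_accuracy) / n)
        return measured_accuracy - half, measured_accuracy + half

    low, high = confidence_interval(0.82, 600)
    print(f"Measured 82% with 600 samples: CI is {low:.1%} to {high:.1%}")
    # -> roughly 78.9% to 85.1%, about +/-3.1 points around the observed 82%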

 

 

So, what is the significance of “The Easy Way?”  Primarily, we only had to run one test with 600 samples instead of 1,000 tests with 100 test sets each; to the matcher subsystem, this means only 600 matcher operations vs. 100,000 matcher operations.  Additionally, we only had to construct one 600-pair test set (600 “target” templates to be inserted into the background database plus 600 corresponding “search” templates that will be “launched” against the matcher subsystem).  Using “The Hard Way,” we would have to construct one thousand 100-pair test sets.

 

What are the “holes” in the scenarios described above?  Conducting 1,000 tests, each having a different set of 100 test samples, will probably yield an observed accuracy range (for the 950 tests that form the tightest “cluster” of scores) that differs from the ±4% given in the example.  Even so, the average score of these 950 tests, being based on many more samples in total, represents a more precise accuracy estimate than could be obtained by “The Easy Way” described above.  Another possible discrepancy is in the construction of the test sets themselves (“search” template vs. “target” template); there are so many aspects to this topic that I cannot address them adequately in this section.  I hope to add another web page in the near future that discusses this topic thoroughly.

 

One last comment on “The Easy Way” of testing is important to consider.  Suppose we take the results of the two test methods described above, except that, in this instance, the test scenarios are applied to two different systems (systems made by different vendors that perform the same sort of matching function).  As shown in the diagram below, the Confidence Intervals for the two vendors overlap (in the range 78% to 84%); also, the measured accuracy of “The Easy Way” test (82%) and the average score of “The Hard Way” test (described above) both lie within the overlap range.  As a consequence, we would have to say that the two systems are (statistically) equal in accuracy performance.

 

If we were testing two different biometric systems (using “The Easy Way” method) and the test results of the two systems exhibited the same conditions, we could be confident (in this case, 95% confident) that the two systems have the same accuracy capabilities.
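A trivial sketch of this overlap check, using the two intervals from the examples above (the variable names are mine):

    def intervals_overlap(ci_a, ci_b):
        """True if two Confidence Intervals (low, high) share any values."""
        return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

    vendor_a = (0.76, 0.84)   # "The Hard Way" system: 80% +/- 4%
    vendor_b = (0.78, 0.86)   # "The Easy Way" system: 82% +/- 4%

    if intervals_overlap(vendor_a, vendor_b):
        print("CIs overlap: the systems are statistically equal in accuracy")
    else:
        print("No overlap: one system is measurably more accurate")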

 

[Note:  The terms “The Hard Way” and “The Easy Way” are clumsy, are my fault, and were only meant to help describe a concept that is not easy to explain – for me, anyway.  I apologize to any math-oriented individual whom I have offended in any way.  I would welcome any criticism of the above and/or suggestions for how I can make this material more understandable – or more accurate.]

 

Other one-line descriptions:

An LOC of 95% means that, if 100 tests are conducted, the true mean accuracy of the system will be located within the Confidence Interval band in at least 95 of the 100 tests.  Nothing can be said with any certainty about the remaining 5 tests; they could be just as meaningful as the other 95, or they could be completely misleading.  An LOC of 99% means that 99 of 100 tests are "good" and that only one may be meaningless.

An LOC of 99.9% means that, if 1,000 tests are conducted, the true mean accuracy of the system will be located within the Confidence Interval band in at least 999 out of the 1,000 tests.  

The Confidence Interval band for every (individual) test is centered on the measured accuracy of the test.  “Measured accuracy” is the computed result of the accuracy test that the customer performs on the system being evaluated.

The diagram below illustrates a series of tests – each unique in terms of test data – that are run on a system.  The LOC that the tester requires is 95%; that is, in 95 of 100 tests, the range of values defined by the CI (centered on the measured accuracy for each test) will include the true mean accuracy of the system (the dark, vertical line in the center of the diagram).  The value of the true mean accuracy of the system is not defined – all we are interested in here is whether or not the CI for each test “captures” the true mean accuracy value in at least 95 of 100 tests.
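The idea behind the diagram can also be checked numerically.  The simulation below (a sketch, assuming a hypothetical true accuracy of 80% and tests of 600 samples each) counts how often each test’s CI captures the true value; the count should land near 95%:

    import math, random

    random.seed(7)
    TRUE_ACCURACY = 0.80     # the "dark, vertical line"; unknown in practice
    N, Z, TESTS = 600, 1.96, 10_000

    captured = 0
    for _ in range(TESTS):
        measured = sum(random.random() < TRUE_ACCURACY for _ in range(N)) / N
        half = Z * math.sqrt(measured * (1 - measured) / N)
        # Does this test's CI, centered on its measured accuracy,
        # capture the true mean accuracy?
        captured += (measured - half) <= TRUE_ACCURACY <= (measured + half)

    print(f"CIs that captured the true accuracy: {captured / TESTS:.1%}")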



[1]  This “specified range of values” is also known as the Confidence Interval of the test(s).

[2]  For this assumption to be true, every test must be constructed in a similar manner yet have different test data used in each test.

[3]  In this sense, the “range of values” is the Confidence Interval of the test.

[4]  “Measured accuracy” is the number of successes divided by the total number of trials, multiplied by 100; this gives the accuracy score expressed as a percentage.

[5]  For instance, if you are testing the (n+1)th generation of a system and you know, from past operational use, that the nth generation system had an accuracy level close to 97%, you would use this information to size the accuracy test for the (n+1)th generation system (which is expected to have accuracy higher than, or at least similar to, the nth generation system).

[6]  As long as the estimated accuracy of the system is NOT close to 50%, using an estimated accuracy lowers the number of “test sets” or “samples” needed in the test.  [See the description of the Confidence Interval.]

[7]  One “test set” consists of a “search” biometric template and a “target” template.  All target templates are placed in a single database (sometimes called a “background database”).  Search templates are matched, one by one, against the background database.  This type of search is called a one-to-many or 1:N search.

[8]  Once again, the 1001st test set is constructed in the same manner as the preceding 1,000 tests but with a different set of test data.

[9]  Once again, the 2nd test set is constructed in the same manner as the 1st test but with a different set of test data.