Most people in the metrology community agree that a calibration laboratory's ability to reproduce measurement results belongs in an uncertainty budget. Several accreditation bodies require that reproducibility at least be considered as part of a calibration laboratory's Calibration and Measurement Capability (CMC). The question is whether reproducibility applies only to the equipment or whether it should also be required for the calibration process. If the answer is both, and for force-measuring devices it should be, then we need to debate why it is acceptable for labs to have items calibrated by methods that do not test for reproducibility. Reproducibility of equipment is part of two well-recognized force standards: ISO 376, Metallic materials — Calibration of force proving instruments used for the verification of uniaxial testing machines, and ASTM E74-18, Standard Practices for Calibration and Verification for Force-Measuring Instruments. ASTM E74 uses the term lower limit factor (LLF), which is really a Type A uncertainty calculation that quantifies the reproducibility of the equipment by calculating a pooled standard deviation over 10 to 11 force points. The deviations are found by applying a series of forces and rotating the instrument in the deadweight machine or calibration frame to positions such as 0°, 120°, and 240° or 0°, 60°, and 300°. If the force-measuring device is sensitive to bending, torsion, or unparallel surfaces, or the force machine introduces them, large deviations may occur when the device is rotated.
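To make the idea of a pooled standard deviation concrete, here is a minimal Python sketch. It is an illustration only and not the ASTM E74 procedure itself, which should be taken from the standard. The readings, the units, the five force points, and the name `pooled_std` are all hypothetical.

```python
from math import sqrt
from statistics import variance

# Hypothetical readings (mV/V) at three rotational positions (0, 120, 240 degrees)
# for five force points. Illustrative values only, not real calibration data.
rotation_readings = {
    1000: [0.40010, 0.40014, 0.40008],
    2000: [0.80022, 0.80030, 0.80019],
    3000: [1.20031, 1.20044, 1.20029],
    4000: [1.60045, 1.60060, 1.60041],
    5000: [2.00058, 2.00075, 2.00052],
}


def pooled_std(groups):
    """Pooled standard deviation: sqrt(sum((n_i - 1) * s_i^2) / sum(n_i - 1))."""
    groups = [list(g) for g in groups]
    # Weight each group's sample variance by its degrees of freedom.
    weighted = sum((len(g) - 1) * variance(g) for g in groups)
    dof = sum(len(g) - 1 for g in groups)
    return sqrt(weighted / dof)


print(f"pooled std dev: {pooled_std(rotation_readings.values()):.6f} mV/V")
```

A device that is sensitive to rotation shows a larger spread at each force point, which raises the pooled value.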

ASTM E74 and ISO 376 include rotational tests whose goal is to capture the reproducibility of the device when it is calibrated. This is an excellent first step, but a second step is needed for calculating the CMC: obtaining the repeatability and reproducibility of the process with different operators, different machines, and different locations. This blog cites definitions from various publications to help answer the question of what reproducibility is. We then provide an example of how we feel short-term repeatability and reproducibility can be calculated.

**VIM**: **International vocabulary of metrology**

*2.24 (3.7, Note 2) reproducibility condition of measurement; reproducibility condition*

*condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects*

*NOTE 1 The different measuring systems may use different measurement procedures.*

*NOTE 2 A specification should give the conditions changed and unchanged, to the extent practical.*

**ASTM E691**

*3.1.10 reproducibility, n—precision under reproducibility conditions. E177*

*3.1.11 reproducibility conditions, n—conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment. E177*

*3.1.12 reproducibility limit (R), n—the value below which the absolute difference between two test results obtained under reproducibility conditions may be expected to occur with a probability of approximately 0.95 (95 %). E177*

*3.1.13 reproducibility standard deviation (sR), n—the standard deviation of test results obtained under reproducibility conditions.*

**NASA-HDBK-8739.19-4**

*Reproducibility: The closeness of the agreement between the results of measurements of the value of an attribute carried out under different measurement conditions. The differences may include the principle of measurement, method of measurement, observer, measuring instrument(s), reference standard, location, conditions of use, time.*

The handbook then lists the following under error sources:

*• Operator Bias (Reproducibility) - Error due to quasi-persistent bias in operator perception and/or technique.*

**MSA 4th Edition**

*Reproducibility: This is traditionally referred to as the "between appraisers" variability. Reproducibility is typically defined as the variation in the average of the measurements made by different appraisers using the same measuring instrument when measuring the identical characteristic on the same part. This is often true for manual instruments influenced by the skill of the operator. It is not true, however, for measurement processes (i.e., automated systems) where the operator is not a major source of variation. For this reason, reproducibility is referred to as the average variation between systems or between-conditions of measurement.*

*The ASTM definition goes beyond this to potentially include not only different appraisers but also different: gages, labs and environment (temperature, humidity) as well as including repeatability in the calculation of reproducibility.*

*In order to better understand the effect of measurement system error on product decisions, consider the case where all of the variability in multiple readings of a single part is due to the gage repeatability and reproducibility. That is, the measurement process is in statistical control and has zero bias.*

*Between-appraisers (operators): average difference between appraisers A, B, C, etc., caused by training, technique, skill and experience. This is the recommended study for product and process qualification and a manual measuring instrument.*

*Gage R&R is an estimate of the combined variation of repeatability and reproducibility. Stated another way, GRR is the variance equal to the sum of within-system and between-system variances.*

*Guidelines for Determining Repeatability and Reproducibility Page 41*

*The Variable Gage Study can be performed using a number of differing techniques. Three acceptable methods will be discussed in detail in this section.*

*These are:*

*· Range method*

*· Average and Range method (including the Control Chart method)*

*· ANOVA method*

*Except for the Range method, the study data design is very similar for each of these methods.*

*The ANOVA method is preferred because it measures the operator to part interaction gauge error, whereas the Range and the Average and Range methods do not include this variation.*

Many of the definitions above refer to different operators, different laboratories, and different equipment. If the lab has only one location, then different laboratories can be removed from consideration. For some parameters, such as force measurement, one lab rarely has two machines of the same size, so the reproducibility of the measurement process is captured by comparing operators. The ideal solution is to set up statistical process control (SPC) procedures, which can capture long-term reproducibility (Morehouse offers a training course on SPC several times a year). However, ANOVA and other methods can capture the short-term reproducibility of a process, which is generally accepted.

ANOVA, or Analysis of Variance, tests for repeatability as well as reproducibility between operators. A repeatability and reproducibility study between technicians should be performed the first time a budget is established, whenever there is a change in personnel, when new equipment is purchased, or whenever any change may alter the measurement process. For example, upgrading a force-measuring system or load cells to ones provided by Morehouse may drastically improve repeatability and reproducibility between operators.

The example uses two technicians recording readings at the same measurement point on the same equipment. Repeatability between technicians is found by taking the square root of the average of the variances of each technician's readings. Reproducibility between technicians is found by taking the standard deviation of the technicians' average readings. The ANOVA tool in Microsoft Excel can do the same calculation with a little manipulation. Single-factor ANOVA is found in the Data Analysis section of Excel.
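The two direct calculations described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each technician records the same number of readings. The readings and the function names `repeatability` and `reproducibility` are hypothetical and are not taken from the original example.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical readings from two technicians at the same measurement point
# on the same equipment. Illustrative values only.
tech_1 = [10.001, 10.002, 10.000, 10.001, 10.002]
tech_2 = [10.002, 10.001, 10.001, 10.003, 10.002]


def repeatability(*groups):
    """Square root of the average of the per-technician sample variances."""
    return sqrt(mean([variance(g) for g in groups]))


def reproducibility(*groups):
    """Sample standard deviation of the per-technician averages."""
    return stdev([mean(g) for g in groups])


print(f"repeatability:   {repeatability(tech_1, tech_2):.6f}")
print(f"reproducibility: {reproducibility(tech_1, tech_2):.6f}")
```

Averaging the variances directly is only appropriate when every technician takes the same number of readings. With unequal counts, the variances would need to be weighted by their degrees of freedom.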

The results in each of these cases indicate that reproducibility may be insignificant because F calculated < F critical. The F value is found by dividing two mean squares (between groups by within groups), and it determines whether the test is statistically significant. A large F value generally means that the variation among group means is more than you would expect to see by chance, that is, that there is a significant difference between operators. In the example, the P-value (probability value) is 0.664251, which means that if there were truly no difference between operators, an F value at least this large would still occur about 66.4 % of the time by chance; it is not the probability that the operators produce the same results. We can use the ANOVA results to obtain reproducibility and repeatability.
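For readers who want to see what the single-factor ANOVA table contains, the following Python sketch computes the mean squares and the F value by hand. It is an illustration under stated assumptions and is not Excel's implementation. The readings and the name `one_way_anova` are hypothetical, and the critical value is the tabulated F for alpha = 0.05 with 1 and 8 degrees of freedom (two technicians, five readings each).

```python
from statistics import mean


def one_way_anova(groups):
    """Return mean squares, degrees of freedom, and F for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of readings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within  # must be nonzero for F to be defined
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": ms_between / ms_within,
    }


# Hypothetical readings, illustrative only.
tech_1 = [10.001, 10.002, 10.000, 10.001, 10.002]
tech_2 = [10.002, 10.001, 10.001, 10.003, 10.002]

result = one_way_anova([tech_1, tech_2])
F_CRITICAL = 5.32  # assumption: tabulated F(0.05; 1, 8)
print(f"F = {result['F']:.3f}, significant: {result['F'] > F_CRITICAL}")
```

The P-value requires the F distribution, which the Python standard library does not provide. A statistics package such as SciPy (`scipy.stats.f_oneway`) reports it directly.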

Reproducibility is found by taking the square root of the between-groups mean square and dividing it by the square root of the count (the number of observed values per technician). Repeatability is found by taking the square root of the within-groups mean square.
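As a compact restatement of these two rules, assuming a balanced study in which every technician records the same number of readings $n$ (the symbols are introduced here for illustration):

$$
s_{r} = \sqrt{MS_{\text{within}}}, \qquad s_{R} = \frac{\sqrt{MS_{\text{between}}}}{\sqrt{n}} = \sqrt{\frac{MS_{\text{between}}}{n}}
$$

Here $s_r$ is the repeatability and $s_R$ is the reproducibility. In a balanced design, the between-groups mean square equals $n$ times the variance of the technician averages, so $s_R$ reduces to the standard deviation of those averages. Likewise, the within-groups mean square equals the average of the technicians' variances, so $s_r$ matches the direct calculation described earlier.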

**Conclusion**:
This article has presented several definitions and described a valid method for calculating reproducibility and determining its significance using an F-test. There is a significant issue with force and, in many cases, torque measurements: the reproducibility of the equipment is often not captured by these methods unless the reference standards are repositioned in the machines, and often they are not. Therefore, there may be additional error sources for the reproducibility of reference standards such as load cells. If the reference load cell is calibrated in accordance with ASTM E74 or ISO 376, this issue becomes moot, as both standards capture reproducibility conditions at the time of calibration. That holds unless the end-user alters the calibration by not using the right equation, uses different adapters than those used for calibration, or makes physical changes to the load cell. If any of these happen, the system should be calibrated again. Companies not using these calibration standards will have additional error sources that may be very difficult to quantify. It is the belief of this author that companies should use legal metrological standards for calibration of their equipment and not rely on the 5- to 10-point calibrations, often called commercial calibrations, for their force-measuring devices.

It is recommended that the end-user then test their equipment for the additional error from the interactions of bending, torsion, and uneven surfaces by comparing two force-measuring devices against each other, both of which should have been calibrated by primary standards (deadweights). Comparing two deadweight-calibrated standards against one another will reveal any additional measurement errors in the machine from not being truly plumb, level, square, rigid, and free from torsion. This error is called a dissemination error, and hardly any labs test for it. It is a major problem for calibration laboratories making force measurements, as these errors can be very large.

If you have additional questions, please contact us at info@mhforce.com. We are here to help you improve your force and torque measurements.

Written by Henry Zumbrun