Discussion: “Factors of Safety for Richardson Extrapolation” (Xing, T., and Stern, F., 2010, ASME J. Fluids Eng., 132, p. 061403)

A recent paper by Xing and Stern [1] provides additional performance data of interest on Verification of Calculations, but is misleading in some important respects. Five methods, all starting from Richardson Extrapolation, are compared: three are based on variants of my Grid Convergence Index method [2–6], and two are based on the authors’ work: the earlier “CF method” and the latest “FS method.” (1) The method referred to as GCI2 is successful but its source is cited as a “private communication” from myself (their Ref. [16]). What they refer to as GCI2 is essentially the GCI method, as I have described and discussed and evaluated at length, Refs. [2–6]. This is the method that, as the authors say [1], “is widely used and recommended by ASME and AIAA” citing [6,7]. I shall refer to it herein as the “real GCI method” for reasons soon to be apparent. The equivalence is obscured because the authors [1] have written their Eq. (12) defining the method using their own ratio term CF, which obscures the simplicity of the original and makes it appear that the factor of safety varies continuously with observed order of convergence pRE; see Fig. 3 of Ref. [1]. This is merely an artifact of Eq. (12) in which their factor of safety FS is defined in terms of an error estimate based on observed pRE rather than on the p actually used. As a result, the “factor of safety” FS in Ref. [1] is not always the same concept as the “factor of safety” FS in Refs. [2–6]. This leaves the question of what are the methods called “GCI” and “GCI1” by the authors. The answer is that I consider these to be misapplications of the GCI formulas, and therefore misrepresentations of the GCI method. (2) For specificity, I will change notation and refer to the method of their Eq. (10) as GCI0, not as “GCI” since I claim it is not the real GCI method. This Eq. (10) is written in terms of an error estimate based on the “observed” [2–6] or “estimated” [1] order of convergence pRE. However, the authors correctly indicate in the text following that this is possible only if three grids have been used to calculate a pRE, in which case the recommended Factor of Safety FS1⁄4 1.25. If only two grids have been used, pRE in Eq. (10) is replaced by a theoretical value pth (e.g., pth1⁄4 2 for a nominally second-order method) and a more conservative value FS1⁄4 3 is recommended [2–6]. This would indeed agree with my GCI method [2–6] in good applications. The trouble is that the authors have applied this regardless of the reasonableness of the estimated pRE. This is contrary to the discussions in Ref. [2], which discussions I claim are part of the real “GCI method” as contrasted with simplistic application of the formula. Exactly what constitutes a reasonable value was not defined in my work [2–6] and I would not take issue with (say) a 5% difference between pRE and pth, e.g., using an observed pRE1⁄4 2.1 for a nominally second-order method. (Although it should be obvious that the conservative approach would be to use the minimum of pRE and pth, I would not claim that this was a misapplication of the GCI method.) However, the authors [1] have applied the GCI0 formula using observed pRE theoretical pth, which is imprudent and not recommended [2–6] no matter what the value of Fs. This point was conveyed to the authors of Ref. [1] in their Ref. [16], the private communication from myself. (I am glad to make this communication, cited here as Ref. [8], available to any interested party.) Using the authors’ notation P1⁄4 pRE/pth the authors have claimed rational estimates of uncertainty from studies with P as high as 6.01 for the new ship hydrodynamics problem (see Table 7, second line). Since the numerical method used is nominally second order, this indicates an observed order pRE1⁄4 12.02, the use of which would be ridiculous. Another instance is the third row entry in Table 6 [1] for the grid triplet (4,5,6) showing P1⁄4 0.08. These are pretty strong hints that something is wrong. There is no sense in performing a grid convergence study and then ignoring the results. No excuses of the computing difficulties in industrial applications can justify an attempt to obtain a rational estimate of uncertainty from such meaningless corrupted studies. To so attempt is to give merit to meritless results. I can only describe this as a misapplication of the GCI formula, not as a poor result for the real GCI method. (3) The second method considered in Ref. [1] is GCI1 based on Eq. (11). The appearance of the CF term in the equation serves to exchange the error estimate based on observed pRE to one based on theoretical pth for the condition P >1, i.e., for pRE> pth. This is exactly what is recommended above, and is good conservative practice. However, the Eq. (11) uses Fs1⁄4 1.25 in either case. This does not make sense. The GCI1 method as described uses three grid solutions to evaluate an observed pRE, but if this value is too large, the method reverts to using exactly the error estimator that would have been used for only two-grid studies but with the nonconservative value Fs1⁄4 1.25 instead of the recommended Fs1⁄4 3. Thus, the authors did use the correct factor of safety Fs1⁄4 3 as recommended [2 6] for 2 grid studies but they used only Fs1⁄4 1.25 for 3-grid studies in which the observed order of convergence pRE was not at least approximately consistent with theory. The authors then concluded from their tests that “These facts suggest that the use of the GCI1 method is closer to a 68% than a 95% confidence level.” This is a misleading statement since the conclusion follows from using Fs1⁄4 1.25 and applying it to a dataset from [9] designed with “intentional choice of grid studies with oscillations in both exponent p and output quantity” [9] (Ref. 3). The point of such studies as Ref. [1] should not be to tune large Fs until 95% coverage is obtained no matter how bad the data set is. If such large Fs are required, this should be a flag alerting the investigator that the grids used are inadequate for the problem [3,10,11]. (4) Therefore, the evaluation in Ref. [1] of the real GCI method (GCI2) is corrupted by confusion with two distorted methods. The authors’ evaluation also does not agree with my own evaluation from some of the same data, notably the wide-ranging study of Cadafalch et al. [12] (Ref. [33] of Ref. [1]) which I presented in Contributed by the Fluids Engineering Division of ASME for publication in the JOURNAL OF FLUIDS ENGINEERING. Manuscript received Ocotber 19, 2010; final manuscript received July 28, 2011; published online October 24, 2011. Assoc. Editor: Dimitris Drikakis. In the original GCI method [2], FS does depend on observed pRE implicitly, in the sense that if pRE is unreasonable one at least reverts to using pth and FS1⁄4 3 rather than blindly applying FS1⁄4 1.25. The preferred approach would be to continue the grid convergence study with additional grids until reasonable values of pRE were obtained. These rather obvious points have been made more explicit in Ref. [3] which was not available to the authors of Ref. [1]. There is further unfortunate confusion caused by the authors’ choice of a descriptor for their method and their use of the symbol FS both for their “Factor of Safety” method and for the factors of safety used in all five methods, including the “factor of safety used in the Factor of Safety method.” Likewise for their symbol CF, which denotes both their Correction Factor method and the “correction factor” used in the Correction Factor method as well as in their defining equations for other methods (Eqs. (11) and (12)).


Introduction
A recent paper by Xing and Stern [1] provides additional performance data of interest on Verification of Calculations, but is misleading in some important respects. Five methods, all starting from Richardson Extrapolation, are compared: three are based on variants of my Grid Convergence Index method [2][3][4][5][6], and two are based on the authors' work: the earlier "CF method" and the latest "FS method." (1) The method referred to as GCI 2 is successful but its source is cited as a "private communication" from myself (their Ref. [16]). What they refer to as GCI 2 is essentially the GCI method, as I have described and discussed and evaluated at length, Refs. [2][3][4][5][6]. This is the method that, as the authors say [1], "is widely used and recommended by ASME and AIAA" citing [6,7]. I shall refer to it herein as the "real GCI method" for reasons soon to be apparent. The equivalence is obscured because the authors [1] have written their Eq. (12) defining the method using their own ratio term CF, which obscures the simplicity of the original and makes it appear that the factor of safety varies continuously 1 with observed order of convergence p RE ; see Fig. 3 of Ref. [1]. This is merely an artifact of Eq. (12) in which their factor of safety FS is defined 2 in terms of an error estimate based on observed p RE rather than on the p actually used. As a result, the "factor of safety" FS in Ref. [1] is not always the same concept as the "factor of safety" F S in Refs. [2][3][4][5][6].
This leaves the question of what are the methods called "GCI" and "GCI 1 " by the authors. The answer is that I consider these to be misapplications of the GCI formulas, and therefore misrepresentations of the GCI method.
(2) For specificity, I will change notation and refer to the method of their Eq. (10) as GCI 0 , not as "GCI" since I claim it is not the real GCI method. This Eq. (10) is written in terms of an error estimate based on the "observed" [2][3][4][5][6] or "estimated" [1] order of convergence p RE .
However, the authors correctly indicate in the text following that this is possible only if three grids have been used to calculate a p RE , in which case the recommended Factor of Safety F S ¼ 1.25.
If only two grids have been used, p RE in Eq. (10) is replaced by a theoretical value p th (e.g., p th ¼ 2 for a nominally second-order method) and a more conservative value F S ¼ 3 is recommended [2][3][4][5][6]. This would indeed agree with my GCI method [2][3][4][5][6] in good applications. The trouble is that the authors have applied this regardless of the reasonableness of the estimated p RE . This is contrary to the discussions in Ref. [2], which discussions I claim are part of the real "GCI method" as contrasted with simplistic application of the formula.
Exactly what constitutes a reasonable value was not defined in my work [2][3][4][5][6] and I would not take issue with (say) a 5% difference between p RE and p th , e.g., using an observed p RE ¼ 2.1 for a nominally second-order method. (Although it should be obvious that the conservative approach would be to use the minimum of p RE and p th , I would not claim that this was a misapplication of the GCI method.) However, the authors [1] have applied the GCI 0 formula using observed p RE ) theoretical p th , which is imprudent and not recommended [2][3][4][5][6] no matter what the value of Fs. This point was conveyed to the authors of Ref. [1] in their Ref. [16], the private communication from myself. (I am glad to make this communication, cited here as Ref. [8], available to any interested party.) Using the authors' notation P ¼ p RE /p th the authors have claimed rational estimates of uncertainty from studies with P as high as 6.01 for the new ship hydrodynamics problem (see Table  7, second line). Since the numerical method used is nominally second order, this indicates an observed order p RE ¼ 12.02, the use of which would be ridiculous. Another instance is the third row entry in Table 6 [1] for the grid triplet (4,5,6) showing P ¼ 0.08. These are pretty strong hints that something is wrong. There is no sense in performing a grid convergence study and then ignoring the results. No excuses of the computing difficulties in industrial applications can justify an attempt to obtain a rational estimate of uncertainty from such meaningless corrupted studies. To so attempt is to give merit to meritless results. I can only describe this as a misapplication of the GCI formula, not as a poor result for the real GCI method.
(3) The second method considered in Ref. [1] is GCI 1 based on Eq. (11). The appearance of the CF term in the equation serves to exchange the error estimate based on observed p RE to one based on theoretical p th for the condition P >1, i.e., for p RE > p th . This is exactly what is recommended above, and is good conservative practice. However, the Eq. (11) uses F s ¼ 1.25 in either case. This does not make sense. The GCI 1 method as described uses three grid solutions to evaluate an observed p RE , but if this value is too large, the method reverts to using exactly the error estimator that would have been used for only two-grid studies but with the nonconservative value F s ¼ 1.25 instead of the recommended F s ¼ 3. Thus, the authors did use the correct factor of safety F s ¼ 3 as recommended [2À6] for 2 grid studies but they used only F s ¼ 1.25 for 3-grid studies in which the observed order of convergence p RE was not at least approximately consistent with theory.
The authors then concluded from their tests that "These facts suggest that the use of the GCI 1 method is closer to a 68% than a 95% confidence level." This is a misleading statement since the conclusion follows from using Fs ¼ 1.25 and applying it to a dataset from [9] designed with "intentional choice of grid studies with oscillations in both exponent p and output quantity" [9] The point of such studies as Ref. [1] should not be to tune large Fs until 95% coverage is obtained no matter how bad the data set is. If such large Fs are required, this should be a flag alerting the investigator that the grids used are inadequate for the problem [3,10,11].
(4) Therefore, the evaluation in Ref. [1] of the real GCI method (GCI 2 ) is corrupted by confusion with two distorted methods. The authors' evaluation also does not agree with my own evaluation from some of the same data, notably the wide-ranging study of Cadafalch et al. [12] 1 In the original GCI method [2], F S does depend on observed p RE implicitly, in the sense that if p RE is unreasonable one at least reverts to using p th and F S ¼ 3 rather than blindly applying F S ¼ 1.25. The preferred approach would be to continue the grid convergence study with additional grids until reasonable values of p RE were obtained. These rather obvious points have been made more explicit in Ref. [3] which was not available to the authors of Ref. [1].
Ref. [4] and to which I had alerted the authors by [8]. Cadafalch et al. used an original and reasonable variant of the real GCI in which a mesh-averaged p RE was used for local uncertainty estimates. The literature survey in Ref. [1] omitted my evaluation [4] which itself covered $ 200 cases. As I had pointed out in Ref. [8], my evaluation of these uncertainty calculations using the GCI with F s ¼ 1.25 showed 92% coverage overall, less for upwind differencing, as expected, and 97.7% for adaptive differencing. There was one case of 176 that was found worrisome; it had p RE ¼ 1.2 for an expected first-order method.
(5) When the confusion over the identity of the real GCI is cleared up, the overall evaluation from Ref. [1] of the real GCI method (GCI 2 ) is excellent. The reporting of the various subsets of data is too convoluted to repeat here completely (and the process is suspect in one aspect -see item (9) below). Briefly, for the largest subsets the real GCI method (GCI 2 ) produces reliability of about 93% (in some subsets, 94% and higher) and the FS method produces about 96%.
The authors claim that only their latest "FS method provides a reliability larger than 95%." The FS method certainly works, but I consider it virtually meaningless to distinguish between 93% coverage for the real GCI method (GCI 2 ) versus 96% for the FS method when the target is roughly 95% [2][3][4][5][6]. Note, we are not talking about the solution itself, or even its error, but about uncertainty. It is well recognized that the target of 95%, while traditional and useful, is itself somewhat arbitrary. There is an implicit precision level evidenced by the usual lack of distinction between 95%, or 20:1 odds, or 2-r for a Gaussian distribution [4] 4 , and which are only approximately equivalent; this is true also for experimental uncertainties. Note also that greater coverage (say 99%) would not necessarily be considered more successful since the expanded coverage would be considered unnecessarily conservative, giving an unrealistically pessimistic estimate of uncertainty. So the real GCI method (GCI 2 ) misses the target of roughly 95% by À2%, and the FS method misses it by þ1%, for this dataset. Their large dataset used is a good one 5 and is likely representative of problems of interest, but no one would be surprised if another dataset slightly changed the evaluations. (6) In this regard, the 93% (and higher) coverage by the real GCI method (GCI 2 ) represents a confirmation of the method (with a tolerance of À2%) but the results for the authors' FS method cannot strictly be considered a confirmation because their exercise is actually a calibration. (Here I distinguish confirmation versus calibration of an uncertainty estimator, analogous to the widely accepted distinction between validation and calibration of a model to experimental data [2,3,5].) In this study [1], the three free parameters (see Eq. (15)) of their FS method were tuned with the objective of attaining 95% coverage, so their claim of achieving > 95% coverage is somewhat circular. One would expect the tuned parameters to hold generally for other datasets, but strictly speaking their FS method could not be considered confirmed until it is applied successfully to another dataset not used in the calibration. By contrast, the real GCI method (GCI 2 ), by now applied to probably O(1000) cases by many independent modelers, has been stable for over 12 years. 6 It has one parameter Fs that takes on two possible values, depending on the thoroughness and results of the grid convergence study. Fs ¼ 1.25 works well for thorough, well-behaved studies (meaning a minimum of three grids to allow calculation of observed p RE , and p RE in adequate agreement with theory and/or adequately stable over multiple grid triplets, which evaluation is, of course, a judgment call). Presumably, if one tuned it to Fs ¼ 1.27, or something equally pointless, the real GCI method also could achieve 96% coverage for the subject dataset, accomplished using only one tuned parameter instead of three as in the latest FS method.
(7) The latest FS method is an improvement over the authors' previous CF method, in all its versions. 7 Whether or not the performance is worth the additional tuned empirical parameters (three instead of one) is a matter of opinion. However, it is also true that the authors own results, when examined closely, provide cautionary evidence. In an earlier version of the method [17] (their Ref. [18]) there were problems with performance on the finer grids. (This was communicated to the authors in Ref. [8], their Ref. [16]; see also Ref. [3], p. 229.) The latest version [1] is improved in this regard but is still problematical in the same way, as discussed next.
Less obvious than the previously noted results for values of P ¼ 6.01 and 0.08, but more problematical, is the result for the FS method on the ship hydrodynamics problem. (The finest grid admirably uses 8.1 Â 10 6 points.) Uncertainty estimates for well behaved problems should decrease monotonically as the grid is refined. The uncertainty estimates in Table 6  In Ref. [17], the authors had noted that the (then new version of the) CF method was "unfortunately invalid" for the finest grid triplet. The authors claimed this was "caused by the contamination of the iterative error on the fine grid." This is possible, but (a) we do not know how the iterative error was estimated, (b) Eça and Hoekstra [20] 9 have shown that common methods of estimating iteration error are grossly underpredictive, (c) most importantly, Why is this not a problem for the other methods? They all use the same data (i.e., results from the same computations). At the least, we must conclude that the new FS method is more sensitive to noise than the others. Eça [11] notes that the explanation is simple: the FS method determines the factor of safety from p RE so it is directly affected by the noise in the data. It is intuitively appealing to algorithmically adjust Fs on the basis of "distance from the asymptotic range" as calculated by some measure, but any such uncertainty estimator will be prone to erratic nonmonotone behavior in the same "industrial applications" which motivate its use. As Eça [11] notes, the noisy behavior is not necessarily due to distance from the asymptotic range but can be caused by other factors: lack of iterative convergence, accumulated computer roundoff errors, use of block-generated grids or other grid generation methods that compromise strict geometric similarity in a grid sequence, switches embedded in turbulence models, etc. [2,3]. (8) The statement in Ref. [1] that my previous studies did not use statistical techniques apparently refers to my choice to not use the Students t-test or other small sample correction to estimate statistical confidence. Instead, I had just reported straightforward "coverage" or counting of O(500) cases (incrementally accumulated over years) for which the GCI was conservative or not, 4 Note that no distribution of any kind has been assumed in the development or testing in Ref. [1] nor in any of the other work cited herein. 5 I applaud the authors examination on a large number of cases, but note that some of these are old studies, not dependable, with wildly varying results that are perhaps better discarded. On the other hand, the discard of outliers (4 discarded outliers in Sample 21, Table 5 of Ref. [1]) based on Peirce's criterion may be acceptable but is debatable without a justification based on quality of the studies. See discussion in Ref. [3], Sec. 5.16. 6 The real GCI method (GCI 2 ) was published in mature detail in Ref. [2] in 1998. The earlier publications of 1993-1994 [13,14] already considered a likely range of less conservative Fs but recommended Fs ¼ 3 pending further extensive studies. "But much experimentation would be required over an ensemble of problems to determine a near-optimum value and to establish the correspondence with statistical measures such as the 2-r band." [15] When those initial studies were completed in 1998, the result was Fs ¼ 1.25 for well-behaved problems, as described in Ref. [2]. The method has been stable since, through a wide range of new applications [3]. The important new development has been the extension by Eça and Hoekstra to a Least-Squares GCI (beginning with [16] in 2000 and continuing through the decade) suitable for cases of noisy data; see Ref. [3], Sec. 5.11 for additional references and some details. 7 See Ref. [3], p. 228 for a spotty history of the confusion of variations in the earlier versions and descriptors. 8 It is not indicated why the results for the CF method reported in Ref. [1] are different and improved from the results of the CF method in Ref. [17]. See also the critiques in Refs. [18,19]. 9 See also Ref. [3], Sec. 5.10.10.3 for additional references and some details.

115501-2 / Vol. 133, NOVEMBER 2011
Transactions of the ASME compared to the actual error. The evaluation of statistical confidence level is based on a conceptual model in which the examined data set of an individual study is taken to represent the entire population which is inaccessible. If this data set is small, then small sample correction techniques (like Students t-test) are applicable. The simpler approach just claims (say) "89.2% coverage" for an individual study (perhaps small, or not) with the idea that the results of many studies will eventually be aggregated (probably informally, as I did). If small sample corrections for "confidence interval" are made to each study, later aggregation is confused because the small sample corrections are nonlinear and nondistributive. See Ref. [3], p. 229. The real GCI (GCI 2 ) is indeed widely used and recommended, as noted by Ref. [1]. It is impossible to keep track of its many applications. It is surely well over 500, as we claimed in [5], and more likely of O(1000). In Ref. [3], Sections 6.23.3-4 there is a list of 14 types of problems solved by Dr. C. J. Freitas and his group at the Southwest Research Institute. They have applied the real GCI to many realistic problems with far from ideal convergence behaviors with good results. These include confined detonations, vortex cavitation, blast-structure interactions, fluidized beds, shock propagation, structural failure, and turbulent multiphase flows, which problems were solved using FVM, FDM, FEM, ALE, and Riemann solvers. "In all applications the GCI provided a very good estimate of uncertainty [21]." (9) Although the results in Ref. [1] are consistent with previous studies, their methodology appears to be flawed in one aspect. The authors' define the so-called "actual factor of safety" FS A,i for each case i (each grid triplet) and use average values (averaged over a study) to discriminate performance. The five uncertainty estimators considered are all in the form U est ¼ Fs Â jE est j where E est is some estimate of the discretization error. FS A,i was obtained formally by replacing E est,i by the actual error E i for each i case and inverting the algebra to solve for FS A,i ¼ (Uest, i /|E i |). But there is no "actual factor of safety" other than the value used. The ratio (E est,i /E i ) has been properly interpreted in Ref. [1] and elsewhere as an "effectivity index" and the "actual factor of safety" is similar to an effectivity index but applied to an uncertainty estimate rather than an error estimate. Values (E i /E est,i ) ) 1 or FS A,i ) 1 would indicate excessive conservatism for a single case. But this would say little or nothing about % coverage attained for an ensemble of cases, which is the agreed-upon target or principal performance metric. Of two uncertainty estimators with the same acceptable coverage, the one with the smaller average or maximum effectivity index would be preferred, but it would not make sense to actually use this "actual factor of safety" in any estimator of uncertainty for a new calculation.
Suspicion of the concept arises when we note that if one case happens to produce the correct solution, the result is an "actual factor of safety" FS A,i ¼ 1, which is difficult to interpret and suggests something is wrong. Indeed, Fig. 4 of Ref. [1] shows three methods producing maximum values $ 30 and two (including the latest FS method) going off-scale on log plots with FS A,i $ 60.
A single value of FS A,i < 1 indicates that the actual single error is outside the uncertainty band, which should happen $ 1 in 20 times. As the measure of coverage or reliability (Eq. (19) of Ref. [1]) this test is correct. A single value of FS A,i ) 1 would indicate large conservatism for that single case, which, combined with experience and intuition, may suggest an evaluation of excessive conservatism for the method. 10 However, comparing |E i | to U est,i in whatever formulation cannot be used to faithfully indicate conservatism; it cannot be relied upon as a substitute for examining a large number of cases. Nor can the "average actual FS A,i " of a large study, as used in Ref. [1]. That this can be mis-leading even for benign data is shown by the following synthetic example.
Consider a data set of 20 cases with the fine-grid solution variables of interest normalized to S 1 ¼ 1. For convenience, say all normalized uncertainty estimates are U est ¼ 0.1, i.e., an estimated 10% "error band." (This might have been estimated by the real GCI method for well behaved cases using Fs ¼ 1.25 from an error estimate of 0.08.) Suppose the true values for the first 19 of 20 cases are given by T i ¼ 1 þ i/200 or T i ¼ 1.005, 1.01, 1.015, … 1.095. Suppose T(20) ¼ 1.11 so this case is not likely to be discarded as an outlier. It is seen that the GCI uncertainty estimate performs exactly as intended for this synthetic case, with the interval [S 1 6 U est ] ¼ [0.9, 1.1] capturing the first 19 of 20 cases for 95% coverage. However, the "actual FS A,i " ranges from 0.91 to 20, and the "average actual FS A " ¼ 3.59, which would seem to suggest that the original method using Fs ¼ 1.25 is much more conservative than intended. It is not; it had performed perfectly for this example. 11 The actual errors are not weirdly distributed; they range from 0.005 to 0.11 with an average error of 0.053, not out of line with the 95% uncertainty estimate of 0.1. If one thinks that the ratio of the original Fs to the "average actual FS A " can be used to tune the Fs, one is in for a disappointment. The new Fs would be 1.25/3.59 ¼ 0.348, giving new U est ¼ 0.028, producing coverage for the first 5/20 cases or just 25% coverage.
These benign synthetic examples demonstrate that the concept used in Ref. [1] of relying on the so-called "average actual factor of safety" as an indicator of quality is misguided.
(11) Finally, some of the confusion of the authors [1] in not identifying the GCI 2 method as the real (original) GCI method [2][3][4][5][6] is understandable due to the fact that application of the method involves a judgment call as to what constitutes a reasonable value of observed p RE . This makes it difficult for the authors to compare performance of different methods without being subject to unfair criticism. The judgment calls implicit in the earlier presentation of the GCI method in Ref. [2] have been made more explicit in Ref. [3] (which of course was not available to the authors of Ref. [1]) but the method is still not free of judgment calls. It would be preferable if all methods were defined in a recipelike fashion with no room for judgment calls. However, there are other judgment issues possible in a grid convergence study, e.g., how many grid triplets are adequate to claim that p RE is reasonably constant, whether or not to use a least-squares evaluation, which detailed leastsquares version to use, etc. I do not expect that such judgment calls will be eliminated for poorly behaved convergence studies.
In their new V&V book, Oberkampf and Roy ( [22], p. 326) have proposed a procedure that standardizes the judgment call at least for the question of what constitutes acceptable p RE . Their procedure accepts 10% agreement between p RE and p th , using p ¼ p th and Fs ¼ 1.25 in the GCI formula. For disagreement >10%, the procedure uses F s ¼ 3 and limits the p used in the GCI formula to p ¼ min{max(0.5, p RE ), p th }. I consider this a reasonable recipe, although of course other cut-offs could be justified as well. The limits on 10% agreement and the lower limit on p ¼ 0.5 (provided that p th > 0.5) are obviously judgment calls but are reasonable and unambiguous, and therefore repeatable in a comparative study such as Ref. [1]. A more conservative (perhaps overly conservative) procedure would not set a lower limit on p other than p > 0 (allowing for difficult problems such as singularities as in Ref. [3], Sec. 5.10.4.1) and would use p ¼ min(p RE , p th ) when these agree within 10% instead of p th . 10 See discussion in Ref. [3], Sec. 5.15 on "Evaluation of Uncertainty Estimators from Small Sample Studies." 11 Other synthetic examples can be constructed to demonstrate that the concept of "average actual factor of safety" can mislead in the other direction, suggesting adequate conservatism when in fact the method is behaving terribly. Consider true values T i ¼ 1.1 þ a and a ( 1 for i ¼ 1 to 19, and T 20 ¼ 1.05, which is not an outlier. Then "average actual FS A " % 21/20 ¼ 1.05, falsely indicating mild conservatism when in fact the method has missed 19 of 20 cases, giving 5% coverage rather than the intended 95%.
These two prescriptive procedures would be additional candidates for evaluation by the authors of Ref. [1] using their extensive dataset.