A Stopping Rule for Simulation-Based Estimation of Inclusion Probabilities

Design-based estimation of finite population parameters, such as totals, usually relies on knowledge of the inclusion probabilities characterising the sampling design: they enter directly into sampling weights and estimators. However, for some useful sampling designs these probabilities remain unknown. In such cases they may often be estimated in a simulation experiment, carried out by repeatedly generating samples under the same sampling scheme and counting occurrences of individual units. By replacing the unknown inclusion probabilities with such estimates, design-based estimates of the population total may be computed. Determining the required number of sample replications remains an important challenge in this approach. In this paper, a new procedure is proposed that may reduce the computational cost of such simulations.


Introduction
In this paper, we shall represent a finite population as a set of unit indices U = {1, …, N}. Values of a fixed characteristic for the corresponding population units are represented by a vector y = [y_1, …, y_N]'. The parameter under study is the population total (Hedayat, Sinha, 1991: 2):

t = Σ_{i∈U} y_i.

The unordered sample space may be represented by a matrix A = [a_ij], in which the i-th row a_i represents one possible sample, with a_ij = 1 when this sample contains the j-th unit and a_ij = 0 otherwise. The matrix A has N columns and Z = 2^N rows representing all possible sequences of zeros and ones of length N, including an empty sample represented by a sequence of N zeros and a census represented by N ones. A vector of corresponding sample sizes may be calculated as n = A1_N, where 1_N denotes a column vector of N ones.

Let an unordered sample s ⊆ U be drawn from U. The sample composition may be characterised by a vector of sample membership indicators (Tillé, 2006: 8): I(s) = [I_1(s), …, I_N(s)]', where I_j(s) = 1 if j ∈ s and I_j(s) = 0 otherwise. The sampling is equivalent to choosing a certain (say i-th) row of A, so that I(s) = a_i. It may be done according to a sampling design p = [p_1, …, p_Z]', where p_i = Pr(I(s) = a_i). The vector of first-order inclusion probabilities π = [π_1, …, π_N]', with π_j = Pr(j ∈ s), then satisfies π = E(I(s)) = A'p. One may also define a matrix of second-order inclusion probabilities (Tillé, 2006: 17) Π = [π_ij], where π_ij = Pr(i ∈ s ∧ j ∈ s). This lets us express the covariance matrix of the vector I(s) as C = Π − ππ'. The size of the sample s may be expressed as n(s) = I(s)'1_N. Denote the vector of design weights by d = [1/π_1, …, 1/π_N]'. For any vector u = [u_1, …, u_N]', let u(s) denote the subvector obtained by omitting the elements corresponding to zeros in I(s). This lets us define sample vectors y(s), π(s), d(s), obtained respectively from y, π, d. For known π, the design-unbiased Horvitz-Thompson (HT) estimator of t may be expressed in the form (cf. Narain, 1951; Horvitz, Thompson, 1952):

t̂(s) = Σ_{j∈s} y_j / π_j,

or equivalently: t̂(s) = d(s)'y(s).
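As a minimal numerical illustration of the HT estimator (all population values and inclusion probabilities below are hypothetical):

```python
import numpy as np

def ht_estimator(y_s, pi_s):
    """Horvitz-Thompson estimate of the population total from the sample
    vector y(s) and the inclusion probabilities pi(s) of the sampled units."""
    return float(np.sum(np.asarray(y_s) / np.asarray(pi_s)))

# Hypothetical N = 3 population; sample s = {2, 3}, i.e. I(s) = [0, 1, 1].
y = np.array([10.0, 20.0, 30.0])
pi = np.array([1.0, 0.5, 0.5])         # assumed first-order inclusion probs
I_s = np.array([0, 1, 1], dtype=bool)  # sample membership indicators
print(ht_estimator(y[I_s], pi[I_s]))   # (20 + 30) / 0.5 = 100.0
```

Each sampled value is weighted by the reciprocal of its inclusion probability, which is what makes the estimator design-unbiased when π is known.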

Simulation-based estimation
To calculate the HT estimator, first-order inclusion probabilities are needed. However, for many sampling procedures they are too complicated to calculate. In particular, this is true for spatial sampling (Barabesi, Fattorini, Ridolfi, 1997), order sampling schemes, especially the Pareto scheme (Rosén, 1997), rejective sampling (Wywiał, 2003; Boistard, Lopuhaä, Ruiz-Gazen, 2012; Yu, 2012), and sequential sum-quota sampling schemes (Pathak, 1976; Kremers, 1985). A particular example is the greedy sampling scheme (Gamrot, 2014: 223), where the costs of sampling individual units vary but are known in advance, and the survey budget is restricted. Individual units are drawn to the sample sequentially, one by one, with equal probabilities, from a gradually shrinking pool of still-affordable units. In the most pessimistic case, the calculation of inclusion probabilities would require analysing all permutations of units, which is unfeasible. If inclusion probabilities do not depend on sample observations, then Fattorini (2006; 2009) proposes to estimate them empirically: R independent sample replications s̆_1, …, s̆_R are generated according to the same sampling scheme, the number m_j of replications containing the j-th unit is counted, and the estimates π̆_j = (m_j + 1)/(R + 1) are substituted for the unknown π_j, yielding the empirical HT estimator:

t̆(s) = Σ_{j∈s} y_j / π̆_j,

or equivalently: t̆(s) = d̆(s)'y(s).

Downloaded 31-10-2020, Wydawnictwo UŁ
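To make this concrete, the following sketch estimates first-order inclusion probabilities for a greedy, budget-constrained scheme of the kind described above by counting unit occurrences over R replications. The costs, budget, and replication number are hypothetical, and the (m_j + 1)/(R + 1) correction keeps all estimates strictly positive:

```python
import numpy as np

def greedy_sample(costs, budget, rng):
    """One replication of a greedy scheme (illustrative): draw units with
    equal probabilities from the pool of still-affordable units until the
    remaining budget covers none of them."""
    remaining = budget
    pool = list(range(len(costs)))
    sample = []
    while True:
        affordable = [j for j in pool if costs[j] <= remaining]
        if not affordable:
            return sample
        j = affordable[rng.integers(len(affordable))]
        sample.append(j)
        pool.remove(j)
        remaining -= costs[j]

def estimate_pi(costs, budget, R, rng):
    """Estimate first-order inclusion probabilities by counting the
    occurrences m_j of each unit over R independent replications."""
    m = np.zeros(len(costs))
    for _ in range(R):
        for j in greedy_sample(costs, budget, rng):
            m[j] += 1
    return (m + 1) / (R + 1)

rng = np.random.default_rng(0)
costs = np.array([1.0, 2.0, 3.0])     # hypothetical unit costs
pi_hat = estimate_pi(costs, budget=4.0, R=10000, rng=rng)
print(pi_hat)
```

For these particular costs and budget the cheapest unit is always affordable after any first draw, so its true inclusion probability is 1, while the other two units each have inclusion probability 1/2; the simulated estimates converge to these values as R grows.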

Setting up the stopping rule
In order to establish a sufficient value of the replication number R that guarantees a required precision of simulation-based estimates, Fattorini (2006; 2009) proposes the accuracy criterion

Q(R) = Pr(|t̆(s) − t̂(s)| / t̂(s) > δ)

for some tolerance δ > 0, and finds its upper bound on the basis of Bennett's inequality. On this basis, he proposes a formula for the sufficient value of R. Later, Gamrot (2013) attempted to improve on this using asymptotic approximations based on a normal distribution, the Chernoff-Hoeffding inequality, and pre-calculated tables of exact probabilities for the restricted maximum likelihood estimator. However, the relative deviation of the empirical HT estimator t̆(s) from its "true" value t̂(s), which would be calculated for known inclusion probabilities, has a complex distribution. The construction of an upper bound for it requires the pessimistic assumption of possibly high correlation among sample membership indicators. This leads to very conservative replication numbers, which results in long calculation times.
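As a hedged illustration of the Bennett-type approach (a simplified sketch, not Fattorini's exact formula), one may bound the replication number needed for a single estimated inclusion probability m_j/R to stay within relative tolerance ε of an assumed true value π with probability at least 1 − α, by inverting Bennett's inequality for sums of Bernoulli indicators:

```python
import math

def bennett_h(u):
    """The function h(u) = (1 + u) ln(1 + u) - u from Bennett's inequality."""
    return (1.0 + u) * math.log(1.0 + u) - u

def bennett_replications(pi, eps, alpha):
    """Smallest R with Pr(|m/R - pi| > eps * pi) <= alpha by Bennett's
    inequality: Pr <= 2 exp(-R sigma^2 h(eps * pi / sigma^2)), where
    sigma^2 = pi (1 - pi) and the indicators are bounded by 1.
    Here pi is an ASSUMED value of the inclusion probability."""
    sigma2 = pi * (1.0 - pi)
    return math.ceil(math.log(2.0 / alpha)
                     / (sigma2 * bennett_h(eps * pi / sigma2)))

print(bennett_replications(pi=0.5, eps=0.05, alpha=0.01))
```

Tightening the tolerance inflates R rapidly, which illustrates why worst-case rules of this kind tend to prescribe conservative replication numbers.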
In what follows, it is demonstrated that these pessimistic assumptions are often overly conservative and may be improved upon. The value of the empirical HT estimator depends on the count vector m. Let Ω be the set of values of this vector for which the accuracy condition is satisfied, so that:

Q(R) = Pr(m ∉ Ω). (24)

Hence, instead of examining the scalar distribution of t̆(s), one may investigate a much simpler, multivariate distribution of m. As an introductory example, let us consider a population of size N = 3 with a given sample space and sampling design. When a sample s corresponding to the sample indicator vector I(s) = [0, 1, 1] is drawn, all the columns of the matrix A which contain zeros and correspond to non-sampled units may be disregarded, because the HT estimator does not depend on these units and the corresponding inclusion probabilities. Hence, it is sufficient to consider a reduced sample space and sampling design restricted to the sampled units. In the following examples, we discuss in more detail some interesting special cases of the resulting bivariate distribution. The dependency of Q(R) on sample observations of the study variable is not the only important effect. The following example illustrates another one.

Example 2. Let us consider four sampling designs P_1, P_2, P_3, P_4 given by Table 1, together with the corresponding values of the correlation coefficient ρ between the sample membership indicators of the first and second unit. Despite its simplicity, the example shows that the probability Q(R) depends on the correlation ρ between sample membership indicators. It is high for the least favourable extreme case of positive correlation that is tacitly assumed in the derivation of known stopping rules. In reality, however, it is often much lower. Taking this effect into account, one might construct tighter bounds for Q(R) and obtain a stopping rule which yields lower required replication numbers.
This effect remains in force when more than two elements are drawn to the sample, as shown in the last example.
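The effect illustrated by Example 2 can be checked numerically. The following sketch (the design values of Table 1 are not reproduced here; the marginals, joint probability, study values and tolerance are hypothetical) estimates Q(R) for two sampled units under strong positive versus strong negative correlation of the membership indicators:

```python
import numpy as np

def q_hat(pi1, pi2, pi12, y1, y2, R, delta, runs, rng):
    """Monte Carlo estimate of Q(R) = Pr(|t_breve - t_hat| / t_hat > delta)
    for two sampled units with marginal inclusion probabilities pi1, pi2
    and joint inclusion probability pi12."""
    t_hat = y1 / pi1 + y2 / pi2
    # Probabilities of the four membership patterns (00, 01, 10, 11):
    p = [1 - pi1 - pi2 + pi12, pi2 - pi12, pi1 - pi12, pi12]
    exceed = 0
    for _ in range(runs):
        c00, c01, c10, c11 = rng.multinomial(R, p)
        m1, m2 = c10 + c11, c01 + c11      # occurrence counts over R reps
        pi1_b = (m1 + 1) / (R + 1)         # empirical inclusion probs
        pi2_b = (m2 + 1) / (R + 1)
        t_breve = y1 / pi1_b + y2 / pi2_b
        if abs(t_breve - t_hat) / t_hat > delta:
            exceed += 1
    return exceed / runs

rng = np.random.default_rng(3)
# rho = +1: units always enter together; rho = -1: exactly one enters.
q_pos = q_hat(0.5, 0.5, 0.5, 1.0, 1.0, R=1000, delta=0.02, runs=2000, rng=rng)
q_neg = q_hat(0.5, 0.5, 0.0, 1.0, 1.0, R=1000, delta=0.02, runs=2000, rng=rng)
print(q_pos, q_neg)
```

Under positive correlation the estimation errors of the two inclusion probabilities add up in the HT estimator, while under negative correlation they largely cancel, so the exceedance probability Q(R) is far smaller in the latter case for the same R.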

The proposed stopping rule
The accuracy criterion depends on correlations among sample membership indicators. However, it is not reasonable to expect these correlations to be known when even the inclusion probabilities, which are moments of the same indicator distribution, remain unknown.
To account for correlations, the simulation may be divided into two phases. In the first phase, R_1 sample replications are generated, and estimates of first- and second-order inclusion probabilities are obtained from the occurrence counts of individual units and pairs of units. In the second phase, a further R_2 replications are generated. Hence, in the whole simulation, a total of R = R_1 + R_2 sample replications s̆_1, …, s̆_R are generated, which leads to the calculation of the final count vector m and the empirical HT estimator according to expressions (17)-(21). The capabilities of contemporary computers make it possible to set R_1, R_2 and R quite large (in the order of millions) without much effort, so that the distribution of m tends to the multivariate normal N(Rπ, RC), as shown by Krzyśko (2000: 31). After the first phase of the simulation, it may be approximated by N(Rπ̆, RC̆), where π̆ and C̆ denote estimates based on the first R_1 replications. Realisations of this distribution may be generated easily and quickly in large quantities, for example using algorithms described by Zieliński and Wieczorkowski (1997). This enables the estimation of the probability Q(R) associated with any value of R by counting what percentage of these pseudo-random realisations falls outside the Ω region (with the unknown "true" statistic t̂(s) approximated by t̆(s) based on the R_1 replications). Such estimation is easily repeated for various candidate values of R, because the generated replications of the multivariate normal distribution may be reused. The re-calculation boils down to a few matrix operations (addition, multiplication, element-wise division) which are easily serialised and optimised. The well-known golden-section or Newton-Raphson algorithms may hence be applied to find the minimum sufficient number R before the second phase of the simulation is initiated.
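A minimal sketch of this nested simulation follows, under stated assumptions: the Ω region is taken to be {m : |t̆ − t_ref|/t_ref ≤ δ}, a simple bisection search stands in for the golden-section or Newton-Raphson step, and all numerical inputs (phase-one estimates, study values, tolerances) are hypothetical:

```python
import numpy as np

def q_of_R(R, pi_hat, C_hat, y, t_ref, delta, draws, rng):
    """Approximate Q(R): draw count vectors m ~ N(R*pi_hat, R*C_hat),
    form empirical HT estimates with (m + 1)/(R + 1), and count how often
    the relative deviation from t_ref exceeds delta (i.e. how often m
    falls outside the Omega region)."""
    m = rng.multivariate_normal(R * pi_hat, R * C_hat, size=draws)
    pi_sim = np.clip((m + 1.0) / (R + 1.0), 1e-12, 1.0)
    t_sim = (y / pi_sim).sum(axis=1)
    return float(np.mean(np.abs(t_sim - t_ref) / abs(t_ref) > delta))

def minimal_R(pi_hat, C_hat, y, t_ref, delta, alpha, rng, R_max=10**7):
    """Bisection for the smallest R with estimated Q(R) <= alpha; Q(R)
    is only approximately monotone because of Monte Carlo noise."""
    lo, hi = 1, R_max
    while lo < hi:
        mid = (lo + hi) // 2
        if q_of_R(mid, pi_hat, C_hat, y, t_ref, delta, 5000, rng) <= alpha:
            hi = mid
        else:
            lo = mid + 1
    return lo

rng = np.random.default_rng(7)
# Hypothetical phase-one estimates for two sampled units:
pi_hat = np.array([0.5, 0.5])
C_hat = np.array([[0.25, 0.0], [0.0, 0.25]])  # uncorrelated indicators
y = np.array([1.0, 1.0])
t_ref = float((y / pi_hat).sum())             # phase-one HT value
R_star = minimal_R(pi_hat, C_hat, y, t_ref, delta=0.02, alpha=0.05, rng=rng)
print(R_star)
```

Because the normal replications inside q_of_R are cheap to generate, many candidate values of R can be screened in this way before a single additional sample replication of the actual (expensive) sampling scheme is produced.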

Conclusions
According to the approach sketched in the last section, one relatively simple but very time-consuming simulation is replaced with a more complex but potentially faster procedure. Instead of calculating a conservative number of replications and then generating them all, a more subtle approach is proposed. After initially generating some R_1 sample replications, an auxiliary nested but fast simulation is executed to establish the required total number R of replications, accounting for correlations among sample membership indicators. Then the second, possibly quite small batch of R − R_1 replications is generated, and the empirical HT estimator may finally be evaluated. The nested fast simulation step may be repeated several times as more replications become available, to make the assessment of R more reliable. It may also be performed after all replications have been generated, to verify that their number is indeed sufficient. Further studies are needed to confirm whether the proposed procedure indeed produces a substantial speed-up of the whole simulation process.