Accuracy of Calibration Estimators Supported by Auxiliary Variables from Past Periods Based on Simulation Analyses

In sample surveys there is often a need to estimate not only population characteristics, but subpopulation characteristics as well. We consider the problem of estimating the total value in do‐ mains (subpopulations). In this case, the Horvitz‐Thompson estimator could be used. Nevertheless, it does not use any additional information about population units, which are usually known. To in‐ crease estimation accuracy we propose to use calibration estimators with auxiliary variables from the current and past periods. In the simulation studies based on real and generated data, we show the influence of using auxiliary information from past periods on the accuracy, and compare properties of two calibration estimators of domain totals in longitudinal surveys.


Introduction
Longitudinal data for periods t = 1, 2, …, M are considered. In the period t the population of size N t is denoted by Ω t . The population in the period t is divided into D disjoint subpopulations (domains) Ω dt of size N dt , where d = 1, 2, …, D. The domain of interest in the period of interest is denoted by Ω d*t* . Let the set of population elements for which observations are available in the period t be denoted by s t and its size by n t . The set of elements of d-th domain for which observations are available in the period t is denoted by s dt and its size by n dt . In the case of surveys conducted in one period (see section 2) the subscript t will be omitted. First and second order inclusion probabilities in the period t are denoted by π ti and π tij respectively. The value of the variable of interest and the vector of the p auxiliary variables for the i-th population element in the period t are denoted by y ti and x ti respectively.
Nowadays many surveys are conducted in more than one period which gives additional possibilities of increasing estimation accuracy. In the paper, we study properties of calibration estimators in the case of longitudinal surveys, where information on auxiliary variables is available not only for the current period but also for past periods. We consider the research hypothesis that the additional information provided by comparison with surveys conducted in another period will result in an increase in estimation accuracy. This must be verified in simulation studies because values of the same auxiliary variables in different periods are usually strongly correlated and hence similar information is included in the estimation process. The main purpose of this paper is to analyse the properties of certain calibration estimators of the domain total in longitudinal surveys in simulation studies, and to explore the influence of supporting estimations by auxiliary variables from past periods. Any results presented in the paper may be of direct interest to statistical offices, market research and opinion polling companies, and of indirect interest to decision-makers and all data users.

Calibration estimators -single wave
In this section we present calibration estimators used in surveys conducted over a particular period. The calibration estimator of the population total is given by (Deville, Särndal, 1992): where weights w si are solutions of: where f s (w si , d i , q i ) is some distance measure between weights of the calibration estimator w si and the design weights (inverses of the first order inclusion probabilities) , where, for more generality, additional known weights q i can be included. The minimization in (2) leads to the approximate design-unbiasedness of the calibration estimator. The equality in (2) is the condition of the model-unbiasedness of the estimator (1) under the linear model.
If in (2) we assume that: then the resulting calibration estimator called generalized regression estimator (GREG) is given by (Deville, Särndal, 1992;Särndal, Swensson, Wretman, 1992: 232;Rao, Molina, 2015: 13): Estimator (4) can be also written as (1) where What is more, Deville and Särndal (1992) prove under some conditions that the calibration estimator (1) is design-consistent and asymptotically equivalent to the generalized regression estimator (4) in the sense that Although (8) informs that the estimators are asymptotically equivalent, Singh and Mohl (1996) and Stukel, Hidiroglou and Särndal (1996) show that the values of the estimators are very similar even for small sample sizes.
To estimate the design-variance of (4) we can use one of the design-consistent estimators (Rao, Molina, 2003: 15) given by where where g si is given by (7). Estimator (9) may lead to some underestimation of the true variance, but including g si in the equation (11) reduces this underestimation. The formula (1) can be adapted to estimate the subpopulation (domain) total instead of the population total. The generalized regression estimator of the d*-th domain total is given by (Rao, Molina, 2015: 17-18): where This means that the estimator (12) is given by (1) where y i is replaced by a id* y i . Hence, the design-variance of the estimator (12) can be estimated based on (11) where y i is replaced by a id* y i , but it means that the residuals (10) in (11) for i ∉ s d* are given by The domain total can also be estimated using modified generalized regression estimator (MGREG) proposed by Särndal (1981) and given by: where B is given by (5) and e i by (10). Rao and Molina (2015: 23) propose to estimate the design variance of (14) by (9), where e i are replaced by a id* e i assuming that the overall sample size is large (even if the domain sample size is small).
Estimators (12) and (14) have a property called benchmarking -they sum up to the estimator of the population total (in this case GREG): It is also possible to estimate the domain total by generalized regression estimators obtained by conditional minimization (2) (assuming (3)) where the set s is replaced by s d* (see Rao, Molina, 2015: 19), but the resulting estimator is not approximately design-unbiased unless the domain size tends to infinity and it does not have the benchmarking property.

Calibration estimators -multiple waves
In the section we study the problem of estimation of the subpopulation total in longitudinal surveys. We assume, that we use not only information from the period of interest t * but also from some k past periods: t *k, t *k + 1, …, t * -1. Hence, estimators proposed in this chapter are simple generalizations of the estimators proposed in Żądło (2011) and Żądło (2015: 219), where it is assumed that the information from all of the periods of the longitudinal survey is used.
Firstly, we propose the following generalized regression estimator of the total in the domain of interest, in the period of interest Ω d*t* (compare Żądło, 2011): Hence, where i ∈ s t* , and It must be stressed that the estimator (16) cannot be used if s d*t* = ∅. We also assume that x ti are known for i ∈ s t* and t = {t *k, …, t * } (which is true e.g. when all of the values of the auxiliary variables are known or in the case of panel surveys) and that inclusion probabilities for the period of interest are known (in the case of complex sample designs it is possible to compute them based on simulation techniques -see e.g. Fattorini, 2006or Gamrot, 2014. To estimate the design-variance of (16) we propose (similarly to the estimator (12)) to use (11) for the period t * where g si are replaced by (19) and e i by: for i ∈ s t* , and ( ) Similarly to the estimation of the design-variance of the estimator (14), we propose to estimate the design variance of the estimator (22) using (9) for the period t * , where e i are replaced by: For s d*t* = ∅ estimator (22) simplifies to the synthetic estimator but because of (24) it is not possible to estimate its design-variance in this case.

Simulation study -introduction
In sections 5 and 6 we present the results of design-based simulation studies conducted in R (R Development Core Team, 2016). The simulation study presented in section 5 is based on real data, while in section 6, generated data are used.
In both simulation studies we analyze the accuracy of calibration estimators and the biases of estimators of their variances. We compute: 1) the relative bias (in %) as 11 11 00 ( ) 2) the ratio of mean square error of Horvitz-Thompson estimator (HT) given by d ii is dy ∈ ∑ to calibration estimators given by (16)

Simulation study -real data
We use real data about Polish counties (NUTS 4). The database is taken from sources of The Central Statistical Office of Poland (https://bdl.stat.gov.pl/BDL).
The study variable is value of investment outlays in enterprises and the auxiliary variable is the number of entities of the national economy. We consider data from M = 7 periods (years 2007-2013) describing 378 Polish counties. As a result, the data set contains NM = 2646 observations. Owing to missing data in some periods, Wałbrzych was omitted, and so was the capital city -Warsaw -as an outlier. We consider the problem of estimating the total value in D = 6 subpopulations (domains). Domains are regions in Poland under NTS nomenclature (NUTS 2). In the first period, the sample of size n = 38 which represents 10% of the population is drawn using simple random sampling without replacement and then the same sample is studied in other periods (the balanced panel sample).
In Table 1 we present the exact information for which estimators are calculated in the simulation study. We estimate domain totals in four periods from year 2010 up to 2013. For each year we use eight estimators. The purpose of using information on auxiliary variables from different numbers of periods (1 or 2 or 3 or 4 periods) is to check the influence of additional but autocorrelated variables on the accuracy of calibration estimators.
Our study covers properties of the following estimators: 1) GREG1 given by (12), 2) GREG2 given by (16) where k = 1, 3) GREG3 given by (16) where k = 2, 4) GREG4 given by (16) (14), 6) MGREG2 given by (22) where k = 1, 7) MGREG3 given by (22) where k = 2, 8) MGREG4 given by (22) where k = 3.  In Figure 1, the first column displays the relative biases (see (25)) of the generalized regression estimator of the total in each of D = 6 domains, and the second column displays the relative biases of the modified generalized regression estimator in each of D = 6 domains. For the GREG estimator there are substantially bigger biases than for the MGREG estimator. For the GREG estimator, relative biases are from -16% to 23%.
For the MGREG estimator, the relative biases are on average closer to zero than in the case of the GREG estimator, where they span from -10% to 3%. There are some outliers. The number of outliers increases with the number of past periods of the auxiliary variable used in the supporting estimation. In each case, the outlier is the bias for the north region in Poland.
In Figure 2 are presented values of the ratio of the mean square error (MSE) of the Horvitz-Thompson estimator to the MSE of the calibration estimators given by (16) and (22). The aim of this comparison is to show a decrease in the MSE of the calibration estimators in comparison to the HT estimator. In the case of GREG estimators, in a little less than a half of the cases these ratios take values greater Pobrany 10-10-2020 W y d a w n i c t w o U Ł than 1 (grey vertical line). It means that in nearly half of the cases the GREG estimator gets smaller values for its MSE than the Horvitz-Thompson estimator. In the case of the MGREG estimator, better results are obtained. In each period, regardless of the number of periods in which the auxiliary variable was taken into consideration, ratios are larger than 1. This means that in each case the MSE of the MGREG is lower than the MSE of the HT estimator. In best cases, an astounding fifteenfold decrease in MSE was actually obtained. Using information from more periods does not significantly improve the estimation accuracy. Figure 3 presents the relative biases of the estimators of design-variances of GREG and MGREG estimators. With regard to the biases of GREG estimators, it is visible that they are usually above zero. This means that the design-variance of GREG estimators is overestimated on average. When it comes to biases of the design-variance of MREG estimators, it turns out that they are usually below zero. This indicates that design-variance of MGREG estimators is underestimated on average.

Simulation study -generated data
Artificial data are generated for design-based simulation study purposes. Generated data which mimic or modify in some sense real data are widely used for simulation purposes in economic applications (see e.g. Białek, 2014;Krzciuk, 2014). We generate artificial data in order to get data without any outliers and also to obtain a stronger relationship between studied variables than can be observed in the real data. Values of the auxiliary variable for all of the periods are generated once using normal distribution with the mean and the standard deviation of the real auxiliary variable. Values of the variable of interest are generated once as follows: where e ti are generated once using N(0, 0.1S u ), a (r) and b (r) are estimates of regression parameters computed using the least squares methods based on real values of the variable of interest and real values of the auxiliary variable, S u is the residual standard error of the regression based on the real data. in Figure 1. The relative biases for GREG1, GREG2, GREG3 and GREG4 estimators are from -4% to 6%. For MGREG1, MGREG2, MGREG3 and MGREG4 they presented much lower biases. Their absolute values are less than 1%. Figure 4 shows ratios of the MSE of Horvitz-Thompson estimator to the MSE of calibration estimators for the generated data. Comparing results for the real and generated data it can be noticed that ratios for GREG estimators are close to 1, so it can be concluded that in the case of data with smaller diversity, the MSE for both GREG and HT estimators are quite similar. In regard to MGREG estimators, it can be observed that ratios are much higher in comparison to the real data. In this case, the use of MGREG instead of HT estimators is sometimes able to give a more than 600-fold decrease in the MSE. Similarly, as in Figure 2, the use of information from past periods results in is no significant improvement of estimation accuracy, even though in this case there is a stronger relationship between variables than in the real data.  Figure 5 shows the relative biases of the estimators of the design-variance of the considered calibration estimators which were calculated for the generated data. For GREG estimators the design variance was usually overestimated, since the obtained biases were between 3% and 25%. As regards MGREG estimators, the design vari-Pobrany 10-10-2020 W y d a w n i c t w o U Ł ance was underestimated on average, as the relative biases were in the range of -5% and 3%. The properties of design-variance estimators for both kinds of considered calibration estimators are better for the generated data than for the real data. Nevertheless, the true cause of this improvement is not the greater relationship between the variables, but a different distribution of the auxiliary variable. This is concluded from another simulation study for which the results are not discussed in this paper.