8 Pós-estratificação

Author

Felipe Lamarca

8.1 Wolf, C., Joye, D., Smith, T. W., & Fu, Y. (2016). The SAGE handbook of survey methodology. Sage. Chapter 31: Analysis of Data from Stratified and Clustered Surveys.

In the SRS case, we have a list of elements, which allows us to attribute to each element in the list a probability of being selected. In particular, “If the population is size \(N\) and the sample is size \(n\), then the probability of selection of each unit is \(\frac{n}{N}\).” (p. 478)

The mean is just:

\[ \bar{y} = \dfrac{1}{n} \sum^n_{i=1} y_i \]

The sample mean has itself a distribution: “We select just one sample, but there are many other samples that we could have selected, and each one would give an estimate of the mean.” The variance of the sample estimate of the mean is:

\[ \text{Var}(\bar{y}) = \dfrac{ \left( 1 - \frac{n}{N} \right) S^2 }{n} \]

where \(\frac{n}{N}\) is the sampling fraction and \(S^2\) is the variability of the variable \((Y)\) in the population. It gives the true variance of the estimate of the mean, but we usually cannot calculate it, because \(S^2\) is not known. We then estimate \(S^2\) by \(s^2\), in in-sample variability:

\[ s^2 = \dfrac{ \sum^n_1 (y_i - \bar{y})^2 }{n-1} \]

\(\left( 1 - \frac{n}{N} \right)\) is the finite population correction, which is often vary small. We can estimate the variance as:

\[ \text{var}(\bar{y}) = \dfrac{s^2}{n} \]

Now, in the stratified context, we estimate the variance as:

\[ \text{var}(\bar{y}) = \sum_H \dfrac{W_h^2 s^2_h}{n_h} \]

The first step in deriving weights is to take the inverse of the probability of selection for each unit. In an SRS, the probability of selection of each unit is simply \(n/N\) and thus the weight for each case is \(N/n\). In stratified samples using SRS within strata, but disproportionate allocation to strata, the weight is different for each stratum, but constant within strata. In PPS samples, each unit may have a different probability of selection and thus a different weight. Weights that adjust for the probability of selection are called design weights. (p. 482)

Most surveys suffer from nonresponse, and weights are often adjusted to account for it. There are several different methods of adjusting weights for nonresponse using information that is available for respondents and nonrespondents. […]. If we have several variables available for each selected case, we can estimate a response and use the predicted propensities based on the model to adjust the weights. (p. 482-483)

Raking, post-stratification, and calibration are in this class: all involve adjusting the sample so that it matches the population based on known population information. For example, in many contries, women are more likely to respond to surveys than men and thus the set of respondents has more women than men. These adjustment techniques increase the weight of the responding men, and decrease the weight of women, so that the weighted sample reflects the population. (p. 483)

If these adjustment techniques increase the variance of our estimates, why do we use them? Because we believe that the point estimates (that is, the estimates of means, quantiles, correlation coefficients, regression coefficients, and so on) that we get with the final adjusted weights are more accurate (less biased) than those that we get wehn we use only the design weights. The cost of this reduction in bias is an increase in variance. (p. 483)

In complex surveys, weights, stratification, and cluster sampling interact to make the estimation of variances particularly difficult. When a sample caintains these elements, the variance estimation formulas given above do not work, and we need to turn to alternative methods. There are two broad classes of techniques to estimate variances and standard errors from complex samples: Taylor-series linearization and replication, which includes several similar techniques. (p. 484)

Replication techniques work on a different principle. Replication methods estimate variance by choosing subsets from the sample itself, and using the variability across all these subsets to estimate the variance of the sampling distribution. How exactly these subsamples of the sample are selected depends on the exact replication method in use. (p. 484)

8.2 Lumley, T. (2011). Complex surveys: a guide to analysis using R. John Wiley & Sons. Chapter 7: Post-stratification, raking, and calibration.

This chapter deals with techniques for using known population totals for a set of variables (auxiliary variables) to adjust the sampling weights and improve estimation for another set of variables. All of these techniques have the same idea: adjustments are made to the sampling weights so that estimated population totals for the auxiliary variables match the known population totals, making the sample more representative of the population. A second benefit is that the estimates are forced to be consistent with the population data, improving their credibility with people who may not understand the sampling process. (p. 135-136)

Post-stratification adjusts the sampling weights so that the estimated population group sizes are correct, as they would be in stratified sampling. The sampling weights \(\frac{1}{\pi_i}\) are replaced by weights \(\frac{g_i}{\pi_i}\), where \(g_i = \frac{N_i}{\hat{N}_i}\) for the group containing individual \(i\).

Adjusting the weights to incorporate the auxiliary information is straightforward, but estimating the resulting standard errors is more difficult. For replicate-weight designs it is sufficient to post-stratify each set of replicate weights (Valliant [178]). Since the estimated group size will then match the known population group size for every set of replicates, between-group variation will not produce differences between replicates and so the replicate-weight standard errors will correctly omit the between-group differences. (p. 137)

The function postStratify() creates a post-stratified survey design object. In addition to adjusting the sampling weights, it adds information to allow the standard errors to be adjusted. For a replicate-weight design this involves computing a post-stratification on each set of replicate weights; for linearization estimators it involves storing the group identifier and weight information so that between-group contributions to variance can be removed.

The first step is to set up the information about population group sizes. This information can be in a data frame (as here) or in a table object as is produced by the table() function. In the data frame format, one or more columns give the values of the grouping variables and the final column, which must be named Freq, gives the population counts. The call to postStratify() specifies the grouping variables as a model formula in the strata argument, and gives the table of population counts as the population argument. The function returns a new, post-stratified survey design object.

Raking allows multiple grouping variables to be used without constructing a complete cross-classification. The process involves post-stratifying on each set of variables in turn, and repeating this process until the weights stop changing. The name arises from the image of raking a garden bed alternately in each direction to smooth out the soil. Raking can also be applied when partial cross-classifications of the variables are available.

[…]

Standard error computations after raking are based on the iterative post-stratification algorithm. For standard errors based on replicate weights, raking each set of replicate weights ensures that the auxiliary information is correctly incorporated into between-replicate differences. […].

The rake() function performs the computations for raking by repeatedly calling postStratify(). Each of these calls to postStratify() accumulates the necessary information for standard error computations. (p. 139)

As in section 7.2, post-stratification can be seen as making the smallest possible changes to the weights that result in the estimated population totals matching the known totals. This view of post-stratification leads to calibration estimators. An alternative is to note that the estimated population total after post-stratification is a regression estimator, for a working regression model that uses indicatorvariables for each post-stratum as predictors. This view leads to generalized regression or GREG estimators. (p. 141)

“Regression thinking” is useful in understanding why large increases in precision are possible – it is surprising to many biostatisticians that small changes in sampling weights can increase precision dramatically, but it is easier to understand that a good regression model can reduce the unexplained variation. “Calibration thinking” often leads to simpler formulas and simpler software implementation, since an explicit model is not needed. (p. 141)

\[ \hat{T}_\text{reg} = \sum^N_{i = 1} x_i \hat{\beta}_i = \left( \sum^N_{i = 1} X_i \right) \hat{\beta} = T_X \hat{\beta} \]

8.3 Anotações de aula

Em amostragem, cada unidade amostral tem um peso. Lemos isso como “quantas unidades na população cada unidade amostrada representa”. Se o desenho amostral tiver equiprobabilidade, cada unidade amostrada representa a mesma quantidade de unidade da população e pesos não são necessários.

O estimador ponderado é:

\[ \hat{Y} = \sum_{i \in s} w_i y_i \]

onde \(w_i\) é o peso para a unidade \(i\) e \(y_i\) é a medida extraída do survey.

Considere o seguinte exemplo:

Área	Idade	Quantas pessoas ela representa
ZS	25	349
ZS	30	482

Nesse caso, indicamos quantas pessoas cada unidade amostral representa. Esse valor é calculado a partir da probabilidade de inclusão da unidade amostral. O peso é o inverso da probabilidade de inclusão. A ponderação pelo desenho ajusta amostras quando temos probabilidades desiguais de inclusão: \(w_i = \frac{1}{\pi_i}\), onde \(\pi_i\) é a probabilidade de inclusão da unidade amostral \(i\). Supondo que a probabilidade de inclusão é de 0.25, o peso da unidade amostral é 4. Isso significa que ela representa 4 unidades na população.

Esse estimador também é chamado de estimador de expansão – o estimador de Horwitz-Thompson. O peso absoluto me diz quantas pessoas de perfil similar uma única pessoa da amostra representa.

\[ \hat{N} = \sum_{i = 1}^n \frac{y_i}{w_i} \]

Note que esse peso é útil para calcular totais populacionais, mas ele é diferente do peso usado para fazer ponderação. Por que ponderar?

Ponderar pela probabilidade de inclusão quando não temos equiprobabilidade é importante para não incorporar viés na amostra (claro, isso depende de uma amostragem probabilística);
Dar maior peso a subgrupos que responderam menos ao survey (não-resposta);
Ajustar estimativas com totais populacionais conhecidos (aka calibração)

8.3.1 Ponderação de ajuste

Problemas com quadro amostral (cobertura, e.g.) e não-resposta de unidade podem introduzir viés se respondentes diferem sistematicamente de não-respondentes:

Entrevistamos muito menos pessoas mais ricas do que deveríamos;
Usamos estratificação com alocação não-proporcional (i.e., regiões pequenas e grandes recebem o mesmo número de entrevistas);
Bolsonaristas ou lulistas se recusam mais a responder.

A ponderação duplica, triplica, às vezes quadruplica a informação que temos. Na prática, isso lembra a amostragem por conglomerados, na perspectiva de que estamos aumentando a margem de erro ao não incorporar informações efetivamente novas.

Ideia básica: ajustar os pesos dos respondentes para compensar as unidades ausentes ou sub-representadas. Pesquisa domiciliar probabilística com \(n=1000\) entrevistou \(380\%\) homens; na população, homens são \(50\%\):

\[ \frac{0.5}{0.38} \ \text{ou} \ \frac{500}{380} \approx 1.315789474 \]

Cada homem na amostra representa 1.315789474 homens na população. O mesmo vale para mulheres, mas o peso é menor.

8.3.2 Calibração vs. ponderação

Quando já temos pesos (geralmente do desenho), podemos forçá-los a corresponder a totais populacionais conhecidos em variáveis auxiliares.

Ponderar é pegar a amostra que eu tenho e tentar aproximá-la de algum benchmark (por exemplo, a população). Calibrar, por outro lado, é pegar um peso que eu já tenho (no mais das vezes, a probabilidade de inclusão) – supondo que a gente calcule o total populacional e, no entanto, a população brasileira deu \(205\) milhões, mas sabemos que a população brasileira é de \(203\) milhões. A calibragem ajusta o esquema de pesos para fazer com que o total bata com algum total conhecido da população.

Nesse caso, precisamos encontrar novos pesos \(w_i^{\text{Cal}}\) próximos aos pesos iniciais de forma que:

\[ \sum_{i \in s} w_i^{\text{Cal}} x_i = \sum_{i \in U} x_i = X, \]

onde \(X\) é o total populacional (\(U\) é a população) conhecido (e.g., total de eleitores) para a variável auxiliar \(x_i\). Esse ajuste pode ser feito de várias maneiras: algoritmos iterativos, regressão, etc.

8.3.3 Pesos e inflação da variância

Pesos de desenho muito grande ou desiguais inflam estimativas.

Intuição: uma unidade com peso relativo de \(2\) equivale a duplicar a sua entrevista na amostra; para isso, no entanto, precisamos retirar uma unidade de outras unidades da amostra. A solução geralmente é usar tetos ou ajustes para limitar pesos extremos.

8.3.4 Outros tipos de pós-ajustes em R

Rake e iterative proportional fitting, que ajustam pesos iterativamente para que totais populacionais conhecidos sejam respeitados (pesos relativos);
Generalized regression estimation (GREG), que usa modelos de regressão para calibrar pesos;
Multilevel regression and poststratification (MrP), que projeta resposta em subgrupos da população (e.g., regiões, idade, sexo) usando modelos de regressão

8.3.5 Usando o pacote `survey` no R

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(survey)

Warning: package 'survey' was built under R version 4.4.3

Loading required package: grid
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loading required package: survival

Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

set.seed(123)

macae <- readRDS("C:/Users/felip/OneDrive/Área de Trabalho/IESP/Survey-Research/listas/lista1/dados/macae.rds")

verdade <- mean(macae$alfabetizada)
print(verdade)

[1] 0.8720686

De maneira naive, podemos estimar a média apenas tirando a média simples da amostra. Devemos esperar algo razoavelmente distoante do parâmetro populacional.

amostra <- macae %>%
  group_by(subdistrito) %>%
  mutate(pi = 100 / n()) %>%
  slice_sample(n = 100)

mean(amostra$alfabetizada)

[1] 0.8485714

Usando o estimador de Horwitz-Thompson, podemos recuperar os pesos absolutos usando a probabilidade de inclusão (i.e., ponderação pela probabilidade de inclusão):

amostra <- amostra %>%
  mutate(peso = 1 / pi)

Vamos fazer a mesma coisa usando o pacote survey, que nos permite declarar o desenho amostral. Vamos usar as probabilidades de inclusão para “acertar” o cálculo, assim como no exemplo anterior:

# todas as observacoes sao intercambiaveis dentro de um estrato, entao ignoramos id (~1)
metodo1 <- svydesign(~1,
                     strata = ~subdistrito, # estrato
                     probs = ~pi, # probabilidades de inclusao para ponderar
                     data = amostra)

svytotal(~alfabetizada, metodo1)

              total     SE
alfabetizada 176920 3347.8

svymean(~alfabetizada, metodo1)

                mean     SE
alfabetizada 0.85581 0.0162

Agora, vejamos o método 2: não temos a probabilidade de inclusão, mas tenho os totais populacionais extraídos a partir de alguma outra fonte – como, por exemplo, o censo demográfico. Eu sei, por exemplo, quantas pessoas moram em cada distrito.

# MÉTODO 2: pós-estratificação

# vamos ver o caso naive
desenho <- svydesign(~id_pessoa,
                     data = amostra)

Warning in svydesign.default(~id_pessoa, data = amostra): No weights or
probabilities supplied, assuming equal probability

# estimamos a média de maneira naive sem ponderar
svymean(~alfabetizada, desenho)

                mean     SE
alfabetizada 0.84857 0.0136

# agora vamos estratificar
estratos <- macae %>%
  group_by(subdistrito) %>%
  summarise(Freq = n())

# criamos um objeto do tipo postStratify
metodo2 <- postStratify(desenho, ~subdistrito, estratos)

# calculamos a media usando esse meotod, que estratificou os resultados
svymean(~alfabetizada, metodo2)

                mean     SE
alfabetizada 0.85581 0.0161

Agora passemos ao método 3, que usa Rake:

desenho <- svydesign(~id_pessoa,
                     data = amostra)

Warning in svydesign.default(~id_pessoa, data = amostra): No weights or
probabilities supplied, assuming equal probability

# se fossem varias variaveis desbalanceadas, criaria um banco pra cada e 
# passaria cada um deles na lista
metodo3 <- rake(desenho, 
                list(~subdistrito),
                list(estratos))

svymean(~alfabetizada, metodo3)

                mean     SE
alfabetizada 0.85581 0.0161

Podemos extrair os pesos calculados por cada desenho da seguinte maneira:

amostra <- amostra %>%
  ungroup() %>%
  mutate(
    peso_strat = weights(metodo1),
    peso_pos_strat = weights(metodo2),
    peso_rake = weights(metodo3)
  )

head(amostra)

# A tibble: 6 × 11
  codigo_setor  id_pessoa distrito subdistrito bairro alfabetizada      pi  peso
  <chr>         <glue>    <chr>    <chr>       <chr>         <dbl>   <dbl> <dbl>
1 330240305020… 33024030… Macaé    Aeroporto   Parqu…            1 0.00265  378.
2 330240305020… 33024030… Macaé    Aeroporto   Ajuda             1 0.00265  378.
3 330240305020… 33024030… Macaé    Aeroporto   Ajuda             0 0.00265  378.
4 330240305020… 33024030… Macaé    Aeroporto   Ajuda             1 0.00265  378.
5 330240305020… 33024030… Macaé    Aeroporto   Parqu…            1 0.00265  378.
6 330240305020… 33024030… Macaé    Aeroporto   Parqu…            1 0.00265  378.
# ℹ 3 more variables: peso_strat <dbl>, peso_pos_strat <dbl>, peso_rake <dbl>

Observe que chegamos \(\pm\) aos mesmos pesos com cada approach. De fato, são diferentes maneiras de resolver o mesmo problema.