4 Amostragem I

Author

Felipe Lamarca

4.1 Anotações das leituras

4.1.1 Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2011). Survey methodology. John Wiley & Sons. Cap. 3-4.

4.1.1.1 Capítulo 3 - Target Populations, Sampling Frames, and Coverage Error

In short, statistics describing different populations can be collected in a single survey when the populations are linked to units from which measurements are taken.

Among common research tools, surveys are unique in their concern about a well-specified population. For example, when conducting a randomized biomedical experiments, the researcher often pays much more attention to the experimental stimulus and the conditions of the measurement than to the identification of the population under study. The implicit assumption in such research is that the chief purpose is identifying the conditions under which the stimulus produces the hypothesized effect. The demonstration that it does so for a variety of types of subjects is secondary. Because surveys evolved as tools to describe fixed, finite populations, survey researchers are specific and explicit about definitions of populations under study. (p. 69)

Target population: grupo de “elementos” sobre os quais a pesquisa de survey deseja fazer inferência usando estatísticas amostrais (sample statistics). A população-alvo é finita em tamanho (isto é, pode ser contada); existem durante um certo período de tempo; e são observáveis.

For example, the target population of many U.S. household surveys is persons 18 years of age or older, “adults” who reside in housing units within the United States. (p. 70)

There are many populations, though, for which lists of individual elements are not readily available. For example, in the United States lists are seldom avaliable in one place for all students attending school in a province or state, inmates in prisions, or even adults living in a specific county. (p. 71)

When available sampling frames miss the target population partially or entirely, the survey researcher faces two options:

Redefine the target population to fit the frame better

Admit the possibility of coverage error in statistics describing the original target population (p. 71)

A common example of redefining the target population is found in telephone household surveys, where the sample is based on a frame of telephone numbers. Although the desired target population might be all adults living in the U.S. households, the attraction of using the telephone frame may persuade the researcher to alter the target population to adults living in telephone households. Alternatively, the researcher can keep the full household target population and document that approximately 2% of U.S. adults are missed because they have no telephones. (p. 71)

The match of sampling frame to target population created three potential outcomes:

Coverage - when a target population element is included in the sampling frame
Undercoverage - when a target population element is not included in the sampling frame
Ineligible (or foreign) units - when a sampling frame element is not a target population element (e.g. business numbers in a frame of residential numbers)

Outros problemas incluem “Duplication” – isto é, vários elementos da framing são mapeados para um único elemento da população alvo e, nesse caso, há sobrerrepresentação de alguns elementos; além disso, há também o problema de “Clustering”, quando múltiplos elementos da população alvo a um único elemento da frame.

4.1.1.1.1 Undercoverage

Undercoverage is the weakness of sampling frames promptiong the greatest fears of coverage error. It threatens to produce errors of nonobservation in survey statistics from failure to include parts of the target population in any survey using the frame. (p. 72)

Another common concern about undercoverage in household surveys stems from the fact that sampling frames for households generally provide identification of the housing unit (through an address or telephone number) but not identifiers for persons within the household.

4.1.1.1.2 Ineligible Units

Sometimes, sampling frames contain elements that are not part of the target population. For example, in telephone number frames, many of the numbers are nonworking or nonresidential numbers, complicating the use of the frame for the target population of households. In area probability surveys, sometimes the map materials contain units outside the target geographical area. (p. 76)

When interviewers develop frames of household members withing a unit, they often use residence definitions that do not match the meaning of “household” held by the informant. Parents of the students living away from home often think of them as members of the household, yet many survey protocols would place them at college. The informants might tend to exclude persons unrelated to them who rent a room in the housing unit. Studies show that children in shared custody between their father and mother are likely to be duplicated by appearing in each parent’s household list. (p. 76)

Although undercoverage is a difficult problem, “ineligible” or “foreign” units in the frame can be a less difficult problem to deal with if the problem is not extensive. When foreign units are identified on the frame before selection begins, they can be purged at little cost. […]. For example, it is known that approximately 15% of entries in residential portions of national telephone directories are numbers that are no longer in service. To achieve a sample of 100 telephone households, one could select a sample of \(100(1 - 0.15) = 118\) entires from the directory, expecting that 18 are going to be out-of-service numbers. (p. 76)

4.1.1.1.3 Clustering

A telephone directory lists telephone households in order by surname, given name, and address. When sampling adults from this frame, an immediately obvious problem is the clustering of eligible persons that occurs. “Clusterng” means that multiple elements of the target population are represented by the same frame element. A telephone listing in the directory may have a single or two or more adults living there. (p. 77)

One way to react to clustering of target population elements is by simply selecting all eligible units in the selected telephone households (or all eligible units in a cluster). With this design, the probability of selection of the cluster applies to all elements in the cluster. (p. 77)

After sample selection, there is one other issue that needs to be addressed in this form of cluster sampling: unequal probabilities of selection. If all frame elements are given equal chances, but one eligible selection is made from each, then elements in large clusters have lower overall probabilities of selection than elements in small clusters. For example, an eligible person chosen in a telephone household containing two eligibles has a chance of one-half of being selected, given that the household was sampled, whereas those in a household with four eligibles have a one-in-four chance. (p. 78)

4.1.1.1.4 Duplication

“Duplication” means that a single target population element is associated with multiple frame elements. In the telephone survey example, this may arise when a single telephone household has more than one listing in a telephone directory. (p. 79)

The problem that arises with this kind of frame problem is similar to that encountered with clustering. Target population elements with multiple frame units have higher chances of selection and will be overrepresented in the sample, relative to the population. If there is a correlation between duplication and variables of interest, survey estimates will be biased. In survey estimation, the problem is that both the presence of duplication and the correlation between duplication and survey variables are often unknown. (p. 79)

4.1.1.1.5 Clustering and Duplication

It is also possible to have frame units mapped to multiple population elements. For example, in telephone household surveys of adults, one may encounter a household with several adults who have multiple entries in the directory. (p. 80)

4.1.1.1.6 Desenhos de amostragem e seus problemas

Há vários framing problems relacionados a pesquisas de survey. Há uma série de desenhos alternativos, que também possuem problemas atrelados:

Area frames (list of area units like census tracts or counties): primeiro, um subconjunto de áreas é selecionado; depois, listagens de endereços são geradas para a área selecionada.
Telephone number frames for households and persons: ao usar linhas fixas, falhamos em cobrir cerca de 20% das unidades residenciais dos Estados Unidos. Além disso, um percentual menor possui mais de uma linha fixa e acabam sendo duplicados. Além disso, com o uso de celulares, essa abordagem torna-se mais complicada – mesmo porque números de telefone celular são associados a uma pessoa, e não a um endereço.
Frames for Web Surveys of General Populations: “The e-mail frame, however, fails to cover large portions of the household population. It has duplication problems because one person can have many different e-mail addresses, and it has clustering problems because more than one person can share an e-mail address.” (p. 83).

“In short, without a universal frame of e-mail addresses with known links to individual population elements, some survey practices have begun to ignore the frame development step. Without a well-defined sampling frame, the coverage error of resulting estimates is completely unknowable.” (p. 84).

4.1.1.1.7 Coverage error

Undercoverage is a difficult problem, and may be an important source of coverage error in surveys. It is important to note, though, that coverage error is a property of sample statistics and estimates made from surveys.

No caso de uma estimativa de média, o viés de cobertura é dado por:

\[ \bar{Y}_C - \bar{Y} = \dfrac{U}{N} (\bar{Y}_C - \bar{Y}_U), \]

one \(\bar{Y}\) é a média da população total, \(\bar{Y}_C\) é a média da população coberta e \(\bar{Y}_U\) é a média da população não coberta. \(U\) é o número de unidades não cobertas e \(N\) é o número total de unidades. “Thus, the error due to not covering the \(N-C\) units left out of the frame is a function of the proportion ‘not covered’ and the difference between means for the covered and the not covered.” (p. 88).

4.1.1.1.8 Reduzindo o erro de cobertura

The Half-Open Interval: “Consider, for example, address or housing unit lists used in household surveys. These lists may become out of date and miss units quickly. They may also have missed housing units that upon closer inspection could be added to the list. Since address lists are typically in a particular geographic order, it is possible to add units to the frame only for selected frame elements, rather than updating the entire list.” (p. 88).
Multiplicy sampling: “Some frame supplementation methods add elements to a population through network sampling.” (p. 90).
Multiple frame designs: “For example, an out-of-date set of listings of housing units can be supplemented by a frame of newly constructed housing units obtained from planning departments in governmental units responsible for zoning where sample addressses are located” (p. 91).

“At times, the supplemental frame may cover a completely separate portion of the population. In most cases, though, supplemental frames overlap with the principal frame. In such cases, multiple framing sampling and estimation procedures are employed to correct for unequal probabilities of selection and possibly to yield improved precision for survey estimates.” (p. 91).

“There are several solutions to the overlap and overrepresentation problem. One is to screen the area household frame. At the doorstep, before an interview is conducted, the interviewer determines whether the household has a fixed-line telephone that would allow it to be reached by telephone. If so, the unit is not selected and no interview is attempted. With this procedure, the overlap is eliminated, and the dual frame sample design has complete coverage of households.

A second solution is to attempt interviews at all sample households in both frames, but to determine the chance of selection for each household. Households from the nonoverlap portion of the sample, the nontelephone households can be selected from the area frame and, thus, have one chance of selection. Telephone households have two chances, one from the telephone frame and the other from the area household frame. Thus, theis chance of selection is \(p_{RDD} + p_{area} - p_{RDD} \times p_{area}\), where \(p_{RDD}\) and \(p_{area}\) denote the chances of selection for the RDD and area sample households. A compensatory weight can be computed as the inverse of the probability of selection: \(1/p_{area}\) for nontelephone households and \(1/(p_{RDD} + p_{area} - p_{RDD} \times p_{area})\) for telephone households, regardless of which frame was used.

A third solution was proposed by Hartley (1962) and others. They suggested that the overlap of the frames be used to obtain a more efficient estimator. […]. The telephone and nontelephone domains are combined using a weight that is the proportion of the telefone households in the target population, say \(W_{tel}\). The dual frame estimator for this particular example is:

\[ \bar{y} = (1 - W_{tel}) p_{non-tel} + W_{tel} [ \theta p_{RDD-tel} + (1 - \theta) p_{area-tel}], \]

where \(\theta\) is the mixing parameter chosen to maximize precision.” (p. 92).

4.1.1.2 Capítulo 4 - Sample Design and Sampling Error

How the sample of a survey is obtained can make a difference. The self-selected nature of the NGS survey, coupled with its placement on the National Geographic Society’s website, probably yielded respondents more interested and active in cultural events (Couper, 2000). (p. 97)

The random sampling mechanism and geographic controls are designed to avoid the selection of a sample that has higher incomes, or fewer members of ethnic or racial minorities, or more females, or any of a number of other distortions when compared to the entire U.S. population. (p. 98)

Essentially, in its simplest form, the kind of sample selection used in carefully designed surveys has three basic features:

A list, or combinations of lists, of elements in the population (the sampling frame described in Chapter 3)

Chance or random selection of elements from the list(s)

Some mechanism that assures that key subgroups of the population are represented in the sample (p. 98)

When chance methods, such as tables of random numbers, are applied to all elements of the sampling frame, the samples are referred to as “probability samples”. Probability samples assign each element in the frame a known and nonzero chance to be chosen. These probabilities do not need to be equal. For example, the survey designers may need to overrepresent a small group of the population, such as persons age 70 years and older, in order to have enough of them in the sample to prepare separate estimates for the group. (p. 98)

Section 2.3 makes the distinction between fixed errors or biases and variable errors or variance. Both errors can arise through sampling. The basis of both sampling bias and sampling variance is the fact that not all elements of the sampling frame are measured. If there is a systematic failure to observe some elements because of the sample design, this can lead to “sampling bias” in the survey statistics. For example, if some persons in the sampling frame have a zero chance of selection and they have different characteristics on the survey variable, then all samples selected with that design can produce estimates that are too high or too low. Even when all frame elements have the same chance of selection, the same sample design can yield many different samples. They will produce estimates that vary; this variation is the basis of “sampling variance” of the sample statistics. We will use from time to time the word “precision” to denote the levels of variance. (p. 98)

In some sense, you should think of each survey as one realization of a probability sample design among many that could have occurred. It is conventional to describe attributes of the sample realizations using lower case letters. So each sample realization has a mean and a variance of the distribution of \(y\) in the sample realization. The sample mean is called \(\bar{y}\), which summarizes all of the different values of \(y\) on the sample elements, and the variance of the distribution of \(y_i\)’s in the sample realization os labeled as \(s^2\). We will describe sample designs in which values of \(\bar{y}\) from one sample realization will be used to estimate \(\bar{Y}\) in the frame population. We will describe sample designs in which values of \(s^2\) from one sample realization will be used to estimate \(S^2\) in the frame population. (p. 100-101)

Do not confuse the different “variances” discussed. Each frame population has its own distribution of \(Y\) values, \(S^2\), the population element variance, estimated by \(s^2\), the sample element variance from any given sample realization. These are each different from the variance of the sample mean, labeled as \(V(\bar{y})\), estimated by using sample realization data, using \(v(\bar{y})\). Capital letters denote frame population quantities; lower case letters generally denote sample quantities.

[…] the extent of the error due to sampling is a function of four basic principles of the design:

How large a sample of elements is selected

What chances of selection into the sample different frame population elements have

Whether individual elements are drawn directly and independently or in groups (called “element” or “cluster” samples)

Whether the sample is designed to control the representation of key sub-populations in the sample (called “stratification”) (p. 102)

4.1.1.2.1 Simple Random Sampling (SRS)

Simple random samples assign an equal probability of selection to each frame element, equal probability to all pairs of frame elements, equal probability to all triplets of frame elements, and so son. (p. 103)

To select an SRS, random numbers can be applied directly to the elements in the list. SRS uses list frames numbered from 1 to \(N\). Random numbers from 1 to \(N\) are selected from a table of random numbers, and the corresponding population elements are chosen from the list. If by chance the same population element is chosen more than once, we do not select it into the sample more than once. Instead, we keep selecting until we have \(n\) distinct elements from the population. (This is called sampling “without replacement”.)

From and SRS, we compute the mean of the variable \(y\) as

\[ \bar{y} = \dfrac{1}{n} \sum_{i=1}^{n} y_i, \]

[…]. The sampling variance of the mean can be estimated directly from one sample realization as

\[ v(\bar{y}) = \dfrac{(1 - f)}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \dfrac{(1 - f)}{n} s^2, \]

[…]. The term \((1-f)\) is referred to as the finite population correction, or fpc. THe finite population correction equals the proportion of frame elements not sample or \(1 - f\), where \(f\) is the sampling fraction. As a factor in all samples selected without replacement, it acts to reduce the sampling variance of statistics. If we have a situation in which the sample is a large fraction of the population, where \(f\) is closer to 1, then the fpc acts to decrease the sampling variance. Often, the frame population is so large relative to the sample that the fpc is ignored.

When \(f\) is small, the fpc is close to 1, and

\[ v(\bar{y}) \approx \dfrac{1}{n} s^2. \]

4.1.1.2.2 Cluster Sampling

One way to construct frames at reasonable cost is to sample clusters of elements and then collect a list of elements only for the selected clusters. And when a sample cluster is selected and visited, it makes sense to interview or collect data from more than one element in each cluster also to save costs.

How do we compute statistics for clustered samples? Let us use NAEP as an illustration. For example, suppose that for the NAEP, we could obtain a list of all \(A = 40,000\) 4th grade classrooms in the United States, and that each of those classrooms has exactly \(B = 25\) students. We do not have a list of the \(A \times B = N = 40,000 \times 25 = 1,000,000\) students, but we know that when we get to a classroom, we can easily obtain a list of the 25 students there.

The sampling procedure is simple: chose a sample of \(a\) classrooms using SRS, visit each classroom, and collect data from all students in each sample classroom. If we select \(a = 8\) classrooms, out sample size is \(n = 8 \times 25 = 200\) students. It should be clear that this is not the same kind of sampling as an SRS of elements. There are many SRS samples of size 200 students that cannot possibly be chosen in this kind of a design. For instance, some SRS sample realizations consist of exactly one student in each of 200 classrooms. None of these are possible cluster samples of 200 students from 8 classrooms.

Nesse caso, do ponto de vista de notação, a coisa fica um pouco mais complexa porque precisamos levar em conta a natureza agrupada dos dados. Imagine que obtemos para cada um dos 200 estudantes da amostra um score de teste \(y_{\alpha \beta}\) para o estudante \(\beta\) na sala de aula \(\alpha\), e queremos computar a média do score de teste para a população de estudantes:

\[ \bar{y} = \dfrac{1}{N} \sum_{\alpha=1}^{A} \sum_{\beta=1}^{B} y_{\alpha \beta} \]

The sampling variance of this mean is different from the SRS sampling variance. The randomization in the selection is applied only to the classrooms. They are the sampling units, and depending on which classrooms are chosen, the value of \(\bar{y}\) will vary. In one sense, everything remains the same as SRS, but we treat the clusters as elements in the sample. (p. 108)

In this case,

\[ v(\bar{y}) = \dfrac{(1 - f)}{A} s^2_a \]

where \(s^2_a\) is variability of the mean test scores across the \(A\) classrooms. That is,

\[ s^2_a = \dfrac{1}{A - 1} \sum_{\alpha=1}^{A} (\bar{y}_{\alpha} - \bar{y})^2, \]

where \(\bar{y}_{\alpha}\) is the mean test score in classroom \(\alpha\) (p. 108).

The statistic \(d^2\) is referred to as the design effect. It is a widely used tool in survey sampling in the determination of sample size, and in summarizing the effect of having sampled clusters instead of elements. It is defined to be the ration of the sampling variance for a statistic computed under the sample design devided by the sampling variance that would have been obtained from an SRS of exactly the same size. Every statistic in a survey has its own design effect. Different statistics in the same survey will have different magnitudes of the design effect. (p. 109)

4.1.1.2.3 The Design Effect and Within-Cluster Homogeneity

É importante nos atentarmos para a homogeneidade ou heterogeneidade das observações dentro de cada cluster. Note: se todos os alunos em uma turma tiram a mesma nota no teste, bastaria saber a nota de um único aluno para saber a nota de todos os alunos. Nesses casos, pouca informação nova é adquirida ao entrevistar mais alunos. A medida “roh” (rate of homogeneity) mede a homogeneidade dos clusters.

Em particular, vale notar que \(d^2\) cresce à medida que roh cresce. De fato, podemos escrever \(d^2\) como \(1 + (B - 1) \rho\), onde \(B\) é o número de elementos dentro de cada cluster. Assim, se \(\rho = 0\), temos \(d^2 = 1\) e, se \(\rho = 1\), temos \(d^2 = B\).

Podemos estimar roh como \(\dfrac{(d^2 - 1)}{(b - 1)}\).

4.1.1.2.3.1 Stratification and Stratified Sampling

On a frame of population elements, assume that there is information for every element on the list that can be used to divide elements into separate groups, or “strata”. Strata are mutually exclusive groups of elements on a sampling frame. Each frame element can be placed in a singular “stratum”. In stratified sampling, independent selections are made from each stratum, one by one. Separate samples are drawn from each such group, using the same selection procedure (such as SRS in each stratum, when the frame lists elements) or using different selection procedures (such as SRS of elements in some strata, and cluster sampling in others). (p. 113)

To obtain estimates for the whole population, results must be combined across strata. One method weights stratum results by the population proportions \(W_h\). Suppose we are interested in estimating the mean for the population, and we have computed means \(\bar{y}_h\) for each stratum. The stratified estimate for the population mean is called \(\bar{y}_{st}\) with the subscript st denoting “stratified”. It is computed by:

\[ \bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h = \text{weighted sum of strata means} \]

4.1.1.2.3.2 Summary

Most of the practices of sample design are deduced from theorems in probability sampling theory. Simple random sampling is a simple base design that is used for the comparison of all complex sample designs. There are four features of sample design on which samples vary:

The number of selected units on which sample statistics are computed (other things being equal, the larger the number of units, the lower the sampling variance of statistics)

The use of stratification, which sorts the frame population into different groups that are then sampled separately (other things being equal, the use of stratification decreases the sampling variance of estimates)

The use of clustering, which samples groups of elements into the sample simultaneously (other things being equal clustering increases the sampling variance as a function of homogeneity within the clusters on the survey statistics)

The assignment of variable selection probabilities to different elements on the frame (if the higher selection probabilities are assigned to groups that exhibit larger variation on the survey variables, this reduces sampling variance; if not, sampling variances can be increased with variable selection probabilities)

4.2 Anotações de aula

Em 1936, George Gallup entrevistou 50.000 eleitores de diferentes estados americanos. Mesmo assim, é a pessoa que chega mais perto de acertar os resultados das eleições, contrastando com a revista Literary Digest, que aplicava formulários “opt-in” e entrevistou 2 milhões de eleitores.

Há uma explicação ainda muito repetida de que a amostra da Literary Digest tinha uma amostra focada na classe média e, por outro lado, a amostra de Gallup chegava a pessoas mais pobres. No entanto Lusinchi (2012) mostra que, na verdade, não havia diferenças muito claras entre as amostras quando ajustadas — isto é, pobres e ricos votavam mais ou menos da mesma maneira. Daí, a culpa que era delegada ao erro de cobertura passa a ser delegada ao erro de não-resposta: as pessoas que não respondiam à Literary Digest votavam mais no Roosevelt, o que acaba subestimando o apoio a ele na amostra.

A soma dos erros de survey leva a erros na estimação de uma particular estatística. Por exemplo: se eu quero estimar a renda da população, eu preciso tomar cuidado para evitar uma longa sequência de erros.

Nota: fazer uso de dados de pesquisa de survey comparando o resultado entre institutos não é uma boa ideia. Na prática, a maneira como a pergunta é feita tem um efeito grande. Por exemplo, nas pesquisas da AtlasIntel, as respostas tendem a ser mais extremas do que nas pesquisas da Quaest, por exemplo, onde a taxa de avaliações “Regulares” tende a ser mais alta.

4.2.1 Amostragem aleatória simples (AAS)

Cada elemento \(i\) tem a mesma probabilidade de ser incluído na amostra
A probabilidade de inclusão é \(\pi_i = \dfrac{n}{N}\), onde \(n\) é o número de elementos da amostra e \(N\) é o número de elementos da população
Seleção tipicamente feita sem reposição, i.e., cada \(i\) só pode ser selecionado uma vez
Número total de amostras possíveis é dado por¹ \(\left(\begin{array}{c} N \\ n \end{array}\right) = \dfrac{N!}{n!(N-n)!}\)
Cada amostra específica tem probabilidade \(\mathbb{P}(S) = \dfrac{1}{\left(\begin{array}{c} N \\ n \end{array}\right)}\) de ser selecionada

4.2.1.1 Um tutorial

Para uma amostra aleatória simples, o quadro amostral precisa ser uma lista de elementos (ou uma proxy para uma lista de elementos, como uma lista telefônica ou de endereços).

Primeiro, portanto, definimos a população-alvo e o quadro amostral (e.g., lista de estudantes de ensino médio do RJ).
Depois, definimos o tamanho amostral (e.g., 100 estudantes).
E aí sorteamos sem reposição (de fato, não faz sentido sorteio com reposição, porque não faz sentido perguntar para a mesma pessoa mais de uma vez). No R, por exemplo, usamos sample() para escolher \(n\) pessoas da lista usando IDs para contabilizar a seleção.

Há prós e contras. Em particular, presencialmente ou pesquisa domiciliar é impossível de fazer dessa maneira. Suponha uma população de 3 pessoas: \(\{ Y_1, Y_2, Y_3 \}\). Queremos uma amostra com \(n=2\). Temos as seguintes combinações possíveis: \((Y_1, Y_2)\), \((Y_1, Y_3)\) e \((Y_2, Y_3)\). Se tivéssemos arranjos, teríamos \((Y_1, Y_2)\), \((Y_1, Y_3)\), \((Y_2, Y_1)\), \((Y_2, Y_3)\), \((Y_3, Y_1)\) e \((Y_3, Y_2)\). No caso da amostragem, não importa se a amostra é \((Y_1, Y_2)\) ou \((Y_2, Y_1)\), por exemplo. Assim, o denominador remove as duplicatas. No fundo, o denominador mostra quantas vezes cada arranjo é contado. Assim, se o denominador é \(2\), significa que cada arranjo é contado duas vezes. Também chamamos isso de coeficiente binomial.

Para uma AAS, no mais das vezes temos um quadro amostral como uma listagem de pessoas ou uma listagem de números de telefone. Se tivermos uma amostra com \(n=50\) e registrar a idade \(y_i\) de cada:

Cada amostra obtida é uma realização de uma variável aleatória: o plano amostral.
Distribuições de estatísticas amostrais revelam propriedades de um desenho amostral

Variância de uma estimativa: \(\text{Var}(\bar{y}) = \mathbb{E}[\bar{y}^2] - \mathbb{E}[\bar{y}]^2\). Já o viés de uma estimativa é dado por \(\text{Bias}(\bar{y}) = \mathbb{E}[\bar{y}] - \bar{y}\).

# cria um banco de pessoas com idades entre 18 e 80 anos
pessoas <- data.frame(id = 1:250, idade = sample(18:80, size = 250, replace = TRUE))

# 5 amostras de 50 pessoas
for (i in 1:5){
    x <- sample(pessoas$idade, size = 5, replace = FALSE)
    print(mean(x))
}

[1] 36.4
[1] 62.6
[1] 39.8
[1] 38.8
[1] 35.2

De outra maneira, podemos usar a função slice_sample() do dplyr para fazer amostras aleatórias simples. Por exemplo:

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# 5 amostras de 50 pessoas
for (i in 1:5){
    x <- pessoas %>% slice_sample(n = 5, replace = FALSE)
    print(mean(x$idade))
}

[1] 52
[1] 44.8
[1] 58.6
[1] 46.2
[1] 60.2

4.2.2 Amostragem sistemática

O jeito de selecionar uma amostra não é por sorteio aleatório, mas por “pulos” na lista. A ideia básica é que usamos um quadro amostral ordenado por \(i\) e um intervalo fixo, \(k\). Suponha que \(N = 10\) e queremos uma amostra \(n=3\).

Calculamos \(k = \dfrac{N}{n} = \dfrac{10}{3} \approx 3\). Esse é o tamanho do pulo. Assim, escolhemos um ponto inicial aleatório e vamos definindo os elementos. Uma grande vantagem da amostragem sistemática é que ela evita que recrutemos pessoas próximas da cauda da distribuição. Esse tipo de amostragem é extremamente comum – por exemplo, a cada \(100\) clientes você entrevista uma.

4.2.3 Amostragem estratificada

Uma amostra estratificada é uma amostra aleatória simples de cada estrato estratificado (criado em função de alguma característica da população que seja conhecida — isto é, uma característica que está sendo levada em conta no meu quadro amostral). Não pode haver sobreposição entre os estratos – isto é, uma pessoa deve pertencer a um, e somente a um estrato.

Ideia: dividimos a população em \(h\) estratos (e.g., região, tipo de moradia, etc.) e selecionamos uma amostra aleatória simples de cada estrato
Há vários tipos de alocação possíveis, mas a mais comum é a proporcional: \(n_h = \dfrac{N_h}{N} \times n\), onde \(N_h\) é o número de elementos do estrato \(h\) e \(N\) é o número total de elementos da população. Assim, a amostra total é dada por \(n = \sum_{h=1}^{H} n_h\).

Isso é uma ideia profundamente importante e interfere diretamente na maneira como calculamos as estatísticas amostrais.

Só há vantagens: sabemos como alocar pessoas de antemão, facilita logística, etc. Mas, para além das questões logísticas, diminuímos muito o espaço amostral das respostas possíveis. Dessa maneira, a distribuição marginal das entrevistas espelha a distribuição marginal do quadro amostral. De fato, são amostras aleatórias dentro dos estratos, mas os estratos sempre podem ser comparados entre si. Pesquisas telefônicas, por exemplo, fazem amostras estratificadas por DDD.

Quanto mais homogêneos são os estratos internamente, melhor. Se há pouca variabilidade, não importa quem a gente escolhe. A informação da homogeneidade é fundamental: escolher grupos que sejam semelhantes naquela dimensão que eu quero estudar. Assim, conseguimos diminuir a variância do meu desenho amostral, isto é, obtemos uma distribuição normal com caudas cada vez menos pesadas (isto é, mais centradas em torno da média).

4.2.4 A relação entre amostragem estratificada e amostragem sistemática

Na amostragem sistemática, se a lista for ordenada de acordo com características relevantes da população, ela também pode ser chamada de “estratificação implícita”. No fim das contas, diminuímos o espaço amostral e reduzimos a variância da amostra.

4.2.5 Amostragem por conglomerados

Ideia: a população é dividida em grupos (conglomerados ou clusters) e selecionamos aleatoriamente alguns inteiramente, com todos os seus elementos, para integrar a amostra.
Indexação dupla:
- \(y_{ij}\) é o valor da unidade \(j\) no conglomerado \(i\)
- \(N\) é o total de conglomerados
- \(M_i\) é o tamanho do conglomerado \(i\)
- \(n\) é o número de conglomerados selecionados

O uso de conglomerados é muito comum na prática. Em praticamente todas as entrevistas amostrais do IBGE usa-se amostragem por conglomerados, sorteando setores censitários, assim como o IPEC e a Quaest.

Via de regra, o conglomerado precisa ser viável e tornar factível o desenho amostral. De saída, não deve ser a primeira opção. Mas, se a pesquisa parecer inviável, é uma boa opção.

Do ponto de vista de estatística amostral, a conglomeração é o contrário da estratificação. No conglomerado, fazemos, por exemplo, \(5\) entrevistas em cada quarteirão. Mas fazer \(5\) entrevistas em um determinado prédio, por exemplo, pode dar respostas que, na média, são iguais. Conglomeração nos faz, no fim das contas, perder informação, porque eu acabo fazendo \(5\) entrevistas que equivalem a \(1\), talvez um pouco mais. Nesse caso, a margem de erro é maior, porque eu tenho mais incerteza sobre a minha coleta – fatalmente eu vou terminar com uma amostra mais extrema, porque eu vou capturar muito da mesma coisa. Quanto mais clusters eu visito, mais informação eu consigo obter².

4.2.6 Estratificação vs. conglomeração

Na estratificação, há redução de variância se os estratos forem homogêneos
Na conglomeração, a variância aumenta se os conglomerados forem homogêneos

4.2.7 Comparando desenhos

Uma forma de entender os diferentes impactos na variância de desenhos amostrais alternativos é o design effect, que é uma comparação implícita entre a variância de um desenho amostral e a variância de uma AAS:

\[ \text{Deff} = \dfrac{\text{Var}(\bar{y})}{\text{Var}(\bar{y}_\text{AAS})} \]

4.2.8 AAS e viés

Embora AAS seja geralmente confiável contra viés, há algumas situações em que o viés pode ocorrer. A mais comum delas é o problema no quadro amostral: se ele é bom, não há viés; se é ruim, há viés.

O efeito do denominador, aqui, é remover duplicatas de arranjos de \(n\) elementos. Por exemplo, se \(N = 3\) e \(n = 2\), temos as seguintes combinações possíveis: \((1,2)\), \((1,3)\) e \((2,3)\). Se tivéssemos arranjos, teríamos \((1,2)\), \((1,3)\), \((2,1)\), \((2,3)\), \((3,1)\), \((3,2)\). No caso da amostragem, não importa se a amostra é \((1,2)\) ou \((2,1)\), por exemplo. Assim, o denominador remove as duplicatas. No fundo, o denominador mostra quantas vezes cada arranjo é contado. Assim, se o denominador é \(2\), significa que cada arranjo é contado duas vezes. Também chamamos isso de coeficiente binomial.↩︎
Se eu faço uma única entrevista por conglomerado, há aqui um ponto de contato com a amostragem simples.↩︎