Thanks to Dr. Haojin Zhou for teaching me this course.
Reference Book:
Elementary Survey Sampling, 7th Edition, by Richard L. Scheaffer, William Mendenhall III, and Richard L. Ott
An important example of Survey Sampling
Sample from real population
SRP
# Load packages
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(sampling)
library(readxl)
# Chap 5: Sampling from Real Populations
# 5.1 Data on the population of the United States is given in Appendix C and on the data disk under USPOP. The goal is to estimate the total U.S. population in the 18–24 age group from a sample of states. The states are divided into four geographic regions. Using these regions as strata, select an appropriately sized stratified random sample of states and use their data on population in the 18- to 24-year-old group to estimate the total U.S. population in that age group. Because the total population is available from the data on all the states, check to see if your estimate is within the margin of error you established for your estimate. Compare your result with those of other students in the class.
setwd("/Users/luyu/Library/CloudStorage/OneDrive-Xi'anJiaotong-LiverpoolUniversity")
USPOP <- read_excel("USPOP(3).XLS")
summary(USPOP)
State Total Section 18-24
Length:51 Length:51 Min. :1.00 Length:51
Class :character Class :character 1st Qu.:2.00 Class :character
Mode :character Mode :character Median :3.00 Mode :character
Mean :2.66
3rd Qu.:3.75
Max. :4.00
NA's :1
18+ 15-44 65+ 85+
Length:51 Length:51 Length:51 Length:51
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
PerPov
Min. : 6.10
1st Qu.: 9.05
Median :10.80
Mean :11.47
3rd Qu.:13.40
Max. :18.90
# 5.2 The Florida Survey Research Center has completed a telephone survey on opinions about recycling for a group of cities in Florida. The questionnaire reproduced on the following pages shows the information that was coded to track the city, county, and interviewer, as well as the survey questions asked. The data are stored on the data disk in a file called RECYCLE. The survey used a stratified random sample design with three strata defined by the level of recycling education in the cities: stratum 1 (low education), stratum 2 (moderate education), and stratum 3 (high education). Each response in the dataset includes a stratum code, and the sample sizes were equal across all three strata. The population sizes for the strata are assumed to be nearly equal as well. Your task is to analyze the survey data by selecting two questions of your choice and following these steps: (a) estimate the true population proportion for each selected question; (b) determine whether the proportion of men responding in the category of interest differs from the proportion of women in the same category for each question; (c) for one of the selected questions, estimate the true population proportions within each of the three strata; and (d) for the question used in part (c), compare the true proportions across the three strata to assess whether they differ. To facilitate analysis, you may organize the sampled data in two-way tables for clearer results and calculations. The full survey questionnaire is available in a Word file linked from electronic Section 5.0.
RECYCLE <- read_excel("RECYCLE.XLS")
Analysisdata <- RECYCLE[, c("Q2a", "Q3", "Q24", "Stratum")]
table(Analysisdata$Stratum)
# 5.3 The CARS93 data, in Appendix C, has cars classified as to being one of six different types: small, compact, midsize, large, sporty, or van. A numerical type code is given in the data set, in addition to the actual name of the type. The goal of this activity is to see if poststratification on car type pays any dividends when estimating:
# Average city gasoline mileage
# Proportion of cars with air bags
# for the cars in this population.
CARS93 <- read_excel("CARS93.XLS")
summary(CARS93)
MANUFAC MODEL TYPE MINPRICE
Length:92 Length:92 Length:92 Min. : 6.70
Class :character Class :character Class :character 1st Qu.:10.88
Mode :character Mode :character Mode :character Median :14.70
Mean :17.23
3rd Qu.:20.48
Max. :45.40
MIDPRICE MAXPRICE MPGCITY MPGHIGH
Min. : 7.40 Min. : 7.90 Min. :15.00 Min. :20.00
1st Qu.:12.40 1st Qu.:14.57 1st Qu.:18.00 1st Qu.:26.00
Median :17.95 Median :20.30 Median :21.00 Median :28.00
Mean :19.59 Mean :21.96 Mean :22.29 Mean :29.04
3rd Qu.:23.40 3rd Qu.:25.55 3rd Qu.:24.25 3rd Qu.:31.00
Max. :61.90 Max. :80.00 Max. :46.00 Max. :50.00
AIRBAGS DRIVETR CYLINDR LITERS
Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:4.000 1st Qu.:1.875
Median :1.0000 Median :1.0000 Median :4.000 Median :2.400
Mean :0.8152 Mean :0.9348 Mean :4.978 Mean :2.680
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:6.000 3rd Qu.:3.300
Max. :2.0000 Max. :2.0000 Max. :8.000 Max. :5.700
NA's :1
HPOWER RPMMAX US? TYPECODE
Min. : 55.0 Min. :3800 Min. :0.0000 Min. :1.000
1st Qu.:104.5 1st Qu.:4800 1st Qu.:0.0000 1st Qu.:2.000
Median :140.0 Median :5200 Median :1.0000 Median :3.000
Mean :144.4 Mean :5273 Mean :0.5109 Mean :3.109
3rd Qu.:170.0 3rd Qu.:5712 3rd Qu.:1.0000 3rd Qu.:4.250
Max. :300.0 Max. :6500 Max. :1.0000 Max. :6.000
ROW
Min. : 1.00
1st Qu.:23.75
Median :46.50
Mean :46.50
3rd Qu.:69.25
Max. :92.00
# Simple random sampling
set.seed(2025)
CARS93Sample1 <- CARS93 %>% slice_sample(prop = 0.25)
CARS93Sample1
Height PROB-M PROB-F PROB
Min. :56 Min. :0.00000 Min. :0.00000 Min. :0.00025
1st Qu.:61 1st Qu.:0.00160 1st Qu.:0.00050 1st Qu.:0.01115
Median :66 Median :0.02230 Median :0.02000 Median :0.04390
Mean :66 Mean :0.04686 Mean :0.04732 Mean :0.04709
3rd Qu.:71 3rd Qu.:0.08400 3rd Qu.:0.08620 3rd Qu.:0.08070
Max. :76 Max. :0.14770 Max. :0.16670 Max. :0.10530
Data on the population of the United States is given in Appendix C and on the data disk under USPOP. The goal is to estimate the total U.S. population in the 18–24 age group from a sample of states. The states are divided into four geographic regions. Using these regions as strata, select an appropriately sized stratified random sample of states and use their data on population in the 18- to 24-year-old group to estimate the total U.S. population in that age group. Because the total population is available from the data on all the states, check to see if your estimate is within the margin of error you established for your estimate. Compare your result with those of other students in the class.
The Florida Survey Research Center has completed a telephone survey on opinions about recycling for a group of cities in Florida. The questionnaire reproduced on the following pages shows the information that was coded to track the city, county, and interviewer, as well as the survey questions asked. The data are stored on the data disk in a file called RECYCLE. The survey used a stratified random sample design with three strata defined by the level of recycling education in the cities: stratum 1 (low education), stratum 2 (moderate education), and stratum 3 (high education). Each response in the dataset includes a stratum code, and the sample sizes were equal across all three strata. The population sizes for the strata are assumed to be nearly equal as well. Your task is to analyze the survey data by selecting two questions of your choice and following these steps:
(a) estimate the true population proportion for each selected question;
(b) determine whether the proportion of men responding in the category of interest differs from the proportion of women in the same category for each question;
(c) for one of the selected questions, estimate the true population proportions within each of the three strata; and
(d) for the question used in part (c), compare the true proportions across the three strata to assess whether they differ.
To facilitate analysis, you may organize the sampled data in two-way tables for clearer results and calculations. The full survey questionnaire is available in a Word file linked from electronic Section 5.0.
The CARS93 data, in Appendix C, has cars classified as to being one of six different types: small, compact, midsize, large, sporty, or van. A numerical type code is given in the data set, in addition to the actual name of the type. The goal of this activity is to see if poststratification on car type pays any dividends when estimating the average city gasoline mileage and the proportion of cars with air bags for the cars in this population.
We now move from selecting samples from real sets of data to selecting samples from probability distributions. The probability distributions partially given in the following table represent the heights of adults in America. The complete set of data is available via a link from electronic Section 5.0. PROB-M denotes the probabilities of various heights (in inches) for males, PROB-F denotes the probabilities for females, and PROB denotes the combined probabilities for adults. The goal is to:
(P.S. The CLT also says that when the sample size is large enough (\(n \geq 30\)), the sampling distribution of the sample mean is approximately normal. Here we prove only the statement above.)
Proof: (based on MGF)
Let \(z_i =(x_i-\mu)/ \sigma\), so that the \(z_i\) are i.i.d. with mean 0 and variance 1 (written \(z_i \overset{iid}{\sim} F(0,1)\)), and let \(Y_n=\sqrt{n}\, \bar z\).
This leads to the construction of confidence intervals and sample-size formulas, which are among the keys to survey sampling.
Basic knowledge of Statistics
Sampling distribution
What we learned before about sampling distributions covers the case of sampling with replacement.
We are embarking on a journey that lets us learn about populations from sample data, since we rarely know all the values in an entire population.
The sampling distribution of a statistic is the probability distribution of a sample statistic (such as a mean or proportion, which targets the population mean or proportion) over all samples of the same size. This concept is important to understand: the behavior of a statistic can be known by understanding its distribution (the random variable in this case is the value of the sample statistic). Under certain conditions, the distribution of the sample mean or proportion approximates a normal distribution.
Although a statistic does not depend on unknown parameters, its distribution does (e.g., the normal distribution of sample means depends on the population mean, an unknown parameter, and on the standard deviation).
(P.S. The advantages of sampling with replacement:
When selecting a relatively small sample from a large population, it makes no significant difference whether we sample with or without replacement.
Sampling with replacement results in independent events that are unaffected by previous outcomes; independent events are easier to analyze and lead to simpler formulas.)
For a fixed sample size, the mean of all possible sample means equals the population mean, even though individual sample means vary (sampling variability).
Example
If the population follows \(N(\mu, \sigma^2)\), the sampling distribution of the sample mean is \(\bar{y} \sim N(\mu, \sigma^2/n)\); that is, its standard deviation is \(\sigma/\sqrt n\).
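As a quick check of the variance of the sample mean, the following simulation sketch draws many samples from a normal population and compares the empirical variance of the sample means with \(\sigma^2/n\). The values of `mu`, `sigma`, `n`, and the number of replications are arbitrary choices for illustration.

```r
set.seed(1)
mu <- 10      # assumed population mean (illustrative)
sigma <- 2    # assumed population standard deviation (illustrative)
n <- 25       # sample size

# Draw 10,000 samples of size n and record each sample mean
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Empirical mean and variance of the sample means vs. the theoretical variance
c(mean = mean(means), var = var(means), theory_var = sigma^2 / n)
```

The empirical variance should be close to `sigma^2 / n = 0.16`, not to `sigma^2 / sqrt(n)`.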
Basic knowledge in Survey sampling
Elementary Unit
In sampling we get information from an individual, which is called an elementary unit.
Population
The collection of all elementary units in a given investigation at a given time.
Sample
A subset of the population.
Subpopulation
A specific part of the population under study. Typically, subgroups and study domains are not the same.
Enumeration Unit and Sampling Unit
The units that can be selected under a particular sampling mechanism.
Chapter 4 Simple Random Sampling (SRS)
Estimator for population mean
Estimator for population mean \(\mu\): \[
\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i
\]
Estimator for the variance of \(\bar{y}\)\[
\hat{V}(\bar{y}) = \left( 1-\frac{n}{N} \right) \frac{s^2}{n}
\]
Estimator for population total
Estimator for population total \(\hat{\tau}\): \[
\hat{\tau} = N\bar{y} = \frac{N}{n}\sum_{i=1}^n y_i
\]
Estimator for the variance of \(\hat{\tau}\): \[
\hat{V}(\hat{\tau}) = \hat{V}(N\bar{y}) = N^2 \left( 1-\frac{n}{N} \right) \frac{s^2}{n}
\]
A note on Bounds
The estimated bound on the error of estimation is \(2\sqrt{\hat{V}(\hat\lambda)}\), where \(\hat{V}(\hat\lambda)\) is the estimated variance of the estimator \(\hat\lambda\) (\(\hat\mu\), \(\hat\tau\), or \(\hat{p}\)). (This notation deviates from the textbook.) The factor \(2\) approximates the normal critical value 1.96 for a two-sided 95% confidence interval; the text uses it to simplify calculation.
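The SRS estimators above translate directly to R. The helper below is a minimal sketch; the function name `srs_estimates` and the example data are made up for illustration.

```r
# Hypothetical helper: SRS estimates of the mean and total with their bounds
srs_estimates <- function(y, N) {
  n <- length(y)
  fpc <- 1 - n / N                 # finite population correction
  v_mean <- fpc * var(y) / n       # estimated variance of the sample mean
  c(mean        = mean(y),
    bound_mean  = 2 * sqrt(v_mean),
    total       = N * mean(y),
    bound_total = 2 * N * sqrt(v_mean))
}

y <- c(3, 5, 2, 6, 4)              # made-up sample of size 5
est <- srs_estimates(y, N = 50)
est
```

`var()` in R already divides by \(n-1\), so it matches \(s^2\) in the formulas.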
Sample size estimates for population mean and for population total
Sample size required to estimate \(\mu\) or \(\tau\) with a bound on the error of estimation \(B\). \[
n = \frac{N\sigma^2}{(N-1)D + \sigma^2} \\
\text{where} \\
D = \frac{B^2}{4} \text{ for } \mu \text{ and} \\
D = \frac{B^2}{4N^2} \text{ for } \tau
\]
Function in R for sample size estimate
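One way to code the sample size formula is sketched here. The function name `srs_sample_size` and its arguments are our own choices, not taken from a package; for a proportion, pass `sigma2 = p * q` with `target = "mean"`.

```r
# Hypothetical helper implementing n = N*sigma^2 / ((N - 1)*D + sigma^2)
srs_sample_size <- function(N, sigma2, B, target = c("mean", "total")) {
  target <- match.arg(target)
  # D = B^2/4 for a mean (or proportion), B^2/(4 N^2) for a total
  D <- if (target == "mean") B^2 / 4 else B^2 / (4 * N^2)
  ceiling(N * sigma2 / ((N - 1) * D + sigma2))   # round up to a whole unit
}

# Values taken from the example exercises in this chapter
n_trees   <- srs_sample_size(N = 1500,  sigma2 = 136,         B = 1500, target = "total")
n_hunters <- srs_sample_size(N = 99000, sigma2 = 0.43 * 0.57, B = 0.02)
n_campers <- srs_sample_size(N = 300,   sigma2 = (5/6) * (1/6), B = 0.05)
c(trees = n_trees, hunters = n_hunters, campers = n_campers)
```

Rounding up with `ceiling()` keeps the achieved bound at or below the requested \(B\).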
Estimator for population proportion
Estimator for the population proportion \(p\): \[
\hat{p} = \bar{y} = \frac{\sum_{i=1}^n y_i}{n}
\]
\(\sigma^2\) is replaced with \(pq\) in the sample size formula to estimate \(p\) with a bound on the error of estimation \(B\): \[
n = \frac{Npq}{(N-1)D + pq} \\
\text{where} \\
D = \frac{B^2}{4}
\]
Example Exercise
The Fish and Game Department of a particular state was concerned about the direction of its future hunting programs. To provide for a greater potential for future hunting, the department wanted to determine the proportion of hunters seeking any type of game bird. A simple random sample of \(n=1000\) of the \(N=99,000\) licensed hunters was obtained. Suppose 430 indicated that they hunted game birds. Estimate \(p\), the proportion of licensed hunters seeking game birds. Place a bound on the error of estimation. Using the data, determine the sample size the department must obtain to estimate the proportion of game bird hunters, given a bound on the error of estimation of magnitude \(B=0.02\).
Recall the estimator for a simple random sample. The estimated proportion is \(\hat{p}=\frac{1}{n} \sum_{i=1}^n y_i=\frac{430}{1000}=\frac{43}{100}\). The bound on the error of \(\hat{p}\) is \[
\begin{aligned}
2 \sqrt{\hat{V}(\hat{p})} & =2 \sqrt{\left(1-\frac{1000}{99000}\right) \times \frac{\frac{43}{100} \times\left(1-\frac{43}{100}\right)}{1000-1}} \\
& =2 \times \sqrt{\frac{98}{99} \times \frac{43}{100} \times \frac{57}{100} \times \frac{1}{999}} \\
& \approx 2 \times 0.0156 \\
& \approx 0.031
\end{aligned}
\]
If \(B=0.02\), then \(D=\frac{B^2}{4}=0.0001\), and the sample size \(n\) the department must obtain to estimate \(p\) is \(\hat{n}=\frac{N \hat{p} \hat{q}}{(N-1) D+\hat{p} \hat{q}}=\frac{99000 \times \frac{43}{100} \times \frac{57}{100}}{(99000-1) \times \frac{(0.02)^2}{4}+\frac{43}{100} \times \frac{57}{100}} \approx 2392\).
Therefore, in this question the estimate of \(p\) is \[
\hat{p}=\frac{25}{30}=\frac{5}{6}
\] and the bound on the error is \[
\begin{aligned}
2 \sqrt{\hat{V}(\hat{p})} & =2 \cdot \sqrt{\left(1-\frac{30}{300}\right) \times \frac{\frac{5}{6} \times \frac{1}{6}}{30-1}} \\
& =2 \times \sqrt{\frac{9}{10} \times \frac{5}{36} \times \frac{1}{29}} \\
& \approx 2 \times 0.0657 \\
& \approx 0.131
\end{aligned}
\]
Here, if \(B=0.05\), then the sample size required to estimate \(p\) is given by \[
\begin{aligned}
\hat{n} & =\frac{N p q}{(N-1) D+p q} \\
& =\frac{300 \times \frac{5}{6} \times \frac{1}{6}}{(300-1) \times \frac{(0.05)^2}{4}+\frac{5}{6} \times \frac{1}{6}} \approx 128
\end{aligned}
\]
Therefore, we need a sample of at least 128 camping parties.
State park officials were interested in the proportion of campers who consider the campsite spacing adequate in a particular campground. They decided to take a simple random sample of \(n=30\) from the first \(N=300\) camping parties that visited the campground. Let \(y_i=0\) if the head of the \(i\)th party sampled does not think the campsite spacing is adequate and \(y_i=1\) if he does \((i=1,2, \ldots, 30)\). Use the data in the accompanying table to estimate \(p\), the proportion of campers who consider the campsite spacing adequate. Place a bound on the error of estimation. Use the data to determine the sample size required to estimate \(p\) with a bound on the error of estimation of magnitude \(B=0.05\).
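\[
\begin{array}{cc}
\hline
\text{Camper sampled} & \text{Response, } y_i \\
\hline
1 & 1 \\
2 & 0 \\
3 & 1 \\
\vdots & \vdots \\
29 & 1 \\
30 & 1 \\
\hline
 & \sum_{i=1}^{30} y_i=25
\end{array}
\]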
\[
\begin{aligned}
& \text{The estimator of the population proportion } p \text{ is } \hat{p}=\bar{y}=\frac{\sum_{i=1}^n y_i}{n}, \\
& \text{and the estimated variance of } \hat{p} \text{ is } \\
& \qquad \hat{V}(\hat{p})=\left(1-\frac{n}{N}\right) \frac{\hat{p} \hat{q}}{n-1} \text{, where } \hat{q}=1-\hat{p}
\end{aligned}
\] and the bound on the error of estimation is \[
\begin{gathered}
2 \sqrt{\hat{V}(\hat{p})}=2 \cdot \sqrt{\left(1-\frac{n}{N}\right) \frac{\hat{p} \hat{q}}{n-1}} \\
\text{Let } B=2 \sqrt{\hat{V}(\hat{p})} \text{, then } 2 \sqrt{\left(1-\frac{n}{N}\right) \frac{p q}{n-1}}=B
\end{gathered}
\] Solving for \(n\), we have \(n=\frac{N p q}{(N-1) D+p q}\), where \(q=1-p\) and \(D=\frac{B^2}{4}\).
An investigator is interested in estimating the total number of “count trees” (trees larger than a specified size) on a plantation of \(N=1500\) acres. This information is used to determine the total volume of lumber for trees on the plantation. A simple random sample of \(n=100\) one-acre plots was selected, and each plot was examined for the number of count trees. The sample average for the \(n=100\) one-acre plots was \(\bar{y}=25.2\) with a sample variance of \(s^2=136\). Estimate the total number of count trees on the plantation. Place a bound on the error of estimation. Using the results of the survey, determine the sample size required to estimate \(\tau\), the total number of trees on the plantation, with a bound on the error of estimation of magnitude \(B=1500\).
The estimate of the total number of count trees \(\tau\) is given by \(\hat{\tau}=N \cdot \bar{y}=1500 \times 25.2=37800\), and the bound on the error of \(\hat{\tau}\) is \[
\begin{aligned}
2 \cdot N \cdot \sqrt{\hat{V}(\bar{y})} & =2 \cdot N \cdot \sqrt{\left(1-\frac{n}{N}\right) \cdot \frac{s^2}{n}} \\
& =2 \times 1500 \times \sqrt{\left(1-\frac{100}{1500}\right) \cdot \frac{136}{100}} \\
& \approx 3379.94
\end{aligned}
\]
If \(B=1500\), then the sample size required to estimate \(\tau\) is given by \[
\begin{aligned}
\hat{n} & =\frac{N S^2}{(N-1) D+S^2}, \text { where } D=\frac{B^2}{4 N^2} \\
& =\frac{1500 \times 136}{(1500-1) \times \frac{(1500)^2}{4 \times(1500)^2}+136} \approx 400
\end{aligned}
\]
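The arithmetic in the count-tree example can be reproduced in a few lines of R. This is only a check of the numbers given in the exercise; the variable names are our own.

```r
N <- 1500      # acres on the plantation
n <- 100       # sampled one-acre plots
ybar <- 25.2   # sample mean of count trees per plot
s2 <- 136      # sample variance

tau_hat <- N * ybar                                # estimated total
bound   <- 2 * N * sqrt((1 - n / N) * s2 / n)      # bound on the error of estimation

B <- 1500
D <- B^2 / (4 * N^2)                               # D for a total
n_req <- ceiling(N * s2 / ((N - 1) * D + s2))      # required sample size

c(tau_hat = tau_hat, bound = bound, n_req = n_req)
```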
Chapter 5 Stratified Sampling
Estimator for population mean
Estimator for population mean \(\mu\): \[
\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^LN_i\bar{y}_i
\]
Estimator of variance of \(\bar{y}_{st}\)\[
\hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^L \left[ N^2_i \left( \frac{N_i-n_i}{N_i} \right) \left( \frac{s^2_i}{n_i} \right) \right]
\]
Estimator for population total
Estimator for population total \(\tau\): \[
N\bar{y}_{st} = \sum_{i=1}^L N_i \bar{y}_i
\]
Estimator for the variance of \(N\bar{y}_{st}\): \[
N^2 \hat{V}(\bar{y}_{st}) = \sum_{i=1}^L N_i^2 \left ( \frac{N_i-n_i}{N_i} \right ) \left ( \frac{s_i^2}{n_i} \right )
\]
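The stratified estimators above can be sketched as a small R function that takes per-stratum summaries. The function name `strat_estimates` and the two-stratum inputs are hypothetical and serve only to illustrate the formulas.

```r
# Hypothetical helper: stratified estimates from per-stratum summaries
strat_estimates <- function(N_i, n_i, ybar_i, s2_i) {
  N <- sum(N_i)
  ybar_st <- sum(N_i * ybar_i) / N
  # Estimated variance of ybar_st, with the fpc applied within each stratum
  v_mean <- sum(N_i^2 * ((N_i - n_i) / N_i) * (s2_i / n_i)) / N^2
  c(mean        = ybar_st,
    bound_mean  = 2 * sqrt(v_mean),
    total       = N * ybar_st,
    bound_total = 2 * N * sqrt(v_mean))
}

# Made-up example with two strata
est_st <- strat_estimates(N_i = c(100, 200), n_i = c(10, 20),
                          ybar_i = c(5, 8), s2_i = c(4, 9))
est_st
```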
Approximate Sample size with a fixed Bound
Approximate sample size \(n\) required to estimate \(\mu\) or \(\tau\) with a bound \(B\) on the error of estimation, where \(a_i\) is the fraction of observations allocated to stratum \(i\):
\[
n = \frac{\sum_{i=1}^L N_i^2 \sigma^2_i/a_i}{N^2 D + \sum_{i=1}^L N_i \sigma^2_i} \\
D = \frac{B^2}{4} \text{ when estimating } \mu \\
D = \frac{B^2}{4N^2} \text{ when estimating } \tau \\
\]
Example Exercise
An advertising firm, interested in determining how much to emphasize television advertising in a certain county, decides to conduct a sample survey to estimate the average number of hours each week that households within the county watch television. The county contains two towns, A and B, and a rural area. Town A is built around a factory, and most households contain factory workers with school-age children. Town B is an exclusive suburb of a city in a neighboring county and contains older residents with few children at home. There are 155 households in town A, 62 in town B, and 93 in the rural area. Discuss the merits of using stratified random sampling in this situation.
The advertising firm decides to use telephone interviews rather than personal interviews because all households in the county have telephones, and this method reduces costs. The cost of obtaining an observation is then the same in all three strata. The stratum standard deviations are again approximated by \(\sigma_1 \approx 5, \sigma_2 \approx 15\), and \(\sigma_3 \approx 10\). The firm desires to estimate the population mean \(\mu\) with a bound on the error of estimation equal to 2 hours. Find the appropriate sample size \(n\) and the stratum sample sizes \(n_1, n_2, n_3\).
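Step (1): Given a bound \(B=2\), \(D=\frac{B^2}{4}=1\).
Step (2): Optimal allocation (Neyman allocation). Because the costs are the same for all strata, the allocation fractions are \(a_i = N_i\sigma_i / \sum_{k=1}^3 N_k\sigma_k\): \[
\sum_{i=1}^3 N_i\sigma_i = 155(5)+62(15)+93(10) = 2635, \quad a_1=\frac{775}{2635}\approx 0.30, \quad a_2=a_3=\frac{930}{2635}\approx 0.35
\]
Step (3): Total sample size. With these fractions the sample size formula becomes \[
n=\frac{\left(\sum_{i=1}^3 N_i \sigma_i\right)^2}{N^2 D+\sum_{i=1}^3 N_i \sigma_i^2}=\frac{2635^2}{310^2(1)+27125}=\frac{6943225}{123225} \approx 56.3 \rightarrow 57
\]
Stratum sample sizes \[
n_1=57 \times 0.30=17.1 \rightarrow 18 \text { to control the bound. }
\]\[
\begin{aligned}
& n_2=57 \times 0.35 \approx 20 \\
& n_3=57 \times 0.35 \approx 20
\end{aligned}
\]
Therefore, the sample size \(n\) is \(18+20+20=58\).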
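The total sample size and the Neyman allocation fractions for this exercise can be checked numerically. The sketch below plugs the exercise's values into the approximate sample size formula from this chapter, using \(a_i = N_i\sigma_i/\sum_k N_k\sigma_k\); the variable names are our own, and the fractions are left unrounded (the hand calculation rounds them to 0.30, 0.35, 0.35).

```r
N_i     <- c(townA = 155, townB = 62, rural = 93)   # stratum sizes
sigma_i <- c(5, 15, 10)                             # approximate stratum SDs
B <- 2
D <- B^2 / 4                                        # D for estimating a mean
N <- sum(N_i)

# Neyman allocation fractions (equal cost per observation in every stratum)
a_i <- N_i * sigma_i / sum(N_i * sigma_i)

# Approximate total sample size, rounded up
n_exact <- sum(N_i^2 * sigma_i^2 / a_i) / (N^2 * D + sum(N_i * sigma_i^2))
n_total <- ceiling(n_exact)

list(fractions = round(a_i, 3), n_exact = n_exact, n_total = n_total,
     unrounded_allocation = n_total * a_i)
```

How the fractional stratum sizes are rounded is a judgment call; rounding each one up guarantees the bound at the cost of a slightly larger sample.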
A quality control inspector must estimate the proportion of defective microcomputer chips coming from two different assembly operations. She knows that, among the chips in the lot to be inspected, \(60 \%\) are from assembly operation A and \(40 \%\) are from assembly operation B. In a random sample of 100 chips, 38 turn out to be from operation A and 62 from operation B. Among the sampled chips from operation A, six are defective. Among the sampled chips from operation B, ten are defective.
(a) Considering only the simple random sample of 100 chips, estimate the proportion of defectives in the lot, and place a bound on the error of estimation.
(b) Stratifying the sample, after selection, into chips from operation A and B, estimate the proportion of defectives in the population, and place a bound on the error of estimation. Ignore the fpc in both cases. Which answers do you find more acceptable?
Part (a): Estimation of the Proportion of Defective Chips in the Lot (Simple Random Sampling)
Calculation of the proportion of defective chips:
- Total number of defective chips in the sample: \(6+10=16\)
- Total sample size: 100
- The estimated proportion of defectives \(\hat{p}\) is: \(\hat{p}=\frac{\text { Number of defectives }}{\text { Total sample size }}=\frac{16}{100}=0.16\)
Estimation error:
- The standard error (SE) of the proportion, ignoring the fpc, is \(S E=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\). Plugging in the values, we get \(S E=\sqrt{\frac{0.16 \times 0.84}{100}}=\sqrt{0.001344} \approx 0.0367\)
- A 95% confidence interval for the proportion is \(\mathrm{CI}=\hat{p} \pm 1.96 \times S E=0.16 \pm 1.96 \times 0.0367 \approx 0.16 \pm 0.0719\). Thus, the \(95 \%\) CI is approximately [0.0881, 0.2319].
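Part (b) is not worked out above. One possible sketch follows, assuming the usual poststratified estimator \(\hat{p}_{post}=\sum_i W_i \hat{p}_i\) with the known lot shares \(W_A=0.6\) and \(W_B=0.4\) as weights. The variance line is an assumption: it ignores the fpc and treats the realized stratum sample sizes as fixed, so the textbook's poststratification variance may differ slightly.

```r
W   <- c(A = 0.60, B = 0.40)   # known shares of the lot
n_i <- c(A = 38, B = 62)       # sampled chips per operation
d_i <- c(A = 6, B = 10)        # defectives per operation

p_i <- d_i / n_i               # stratum proportions of defectives
p_post <- sum(W * p_i)         # poststratified estimate

# Assumed variance approximation (fpc ignored, stratum sizes treated as fixed)
v_post <- sum(W^2 * p_i * (1 - p_i) / (n_i - 1))
bound_post <- 2 * sqrt(v_post)

c(p_post = p_post, bound = bound_post)
```

The poststratified estimate is close to the SRS estimate of 0.16 because the two operations have nearly the same sample defect rate.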
Chapter 7 Systematic Sampling
Estimator for population mean
Estimator for population mean \(\mu\): \[
\hat{\mu} = \bar{y}_{sy} = \frac{1}{n}\sum_{i=1}^n y_i
\] Estimator of the variance of \(\bar{y}_{sy}\): \[
\hat{V}(\bar{y}_{sy}) = \left( 1-\frac{n}{N} \right) \frac{s^2}{n}
\] assuming a randomly ordered population.
Notice that this is the same estimator as used in simple random sampling.
The true variance of \(\bar{y}_{sy}\) is given by \[
V(\bar{y}_{sy}) = \frac{\sigma^2}{n} [1+(n-1) \rho]
\]
where \(\rho\) is a measure of the correlation between pairs of observations in the same systematic sample. It reflects the characteristics of a systematic sample compared with those of the population, and it can be approximated from the \(k\) possible systematic samples of size \(n\) by \[
\rho \approx \frac{MSB - MST}{(n-1)MST}
\]
\[
MSB = \frac{n}{k-1} \sum_{i=1}^k (\bar{y}_i - \bar{\bar{y}})^2 \\
MSW = \frac{1}{k(n-1)} \sum_{i=1}^k \sum_{j=1}^n (y_{ij} - \bar{y}_i)^2 \\
SST = \sum_{i=1}^k \sum_{j=1}^n (y_{ij} - \bar{\bar{y}})^2
\] where \(\bar{y}_i\) is the mean of the \(i\)th systematic sample, \(\bar{\bar{y}}\) is the overall mean per element, and \(MST = SST/(nk-1)\). The exact expression for \(\rho\) is \[
\rho = \frac{(k-1)nMSB - SST}{(n-1)SST}
\]
Systematic sampling uses the same estimators as simple random sampling because it is designed to be practically as random as a SRS, and a better estimate is not possible without taking multiple cluster samples. As such, the remaining equations for population total, proportions, sample size, etc. are the same as in SRS and can be found in Chapter 4.
1 in k sampling for mean
Estimator for the population mean \(\mu\) under \(1 \text{ in } k'\) systematic sampling: \[
\hat{\mu} = \sum_{i=1}^{n_s} \frac{\bar{y}_i}{n_s}
\] where \(\bar{y}_i\) is the mean of the \(i\)th systematic sample and \(n_s\) is the number of systematic samples.
A college is concerned about improving its relations with a neighboring community. A 1-in-150 systematic sample of the \(N=4500\) students listed in the directory is taken to estimate the total amount of money spent on clothing during one quarter of the school year. The results of the sample are listed in the accompanying table. Use these data to estimate \(\tau\) and place a bound on the error of estimation.
Chapter 8 Cluster Sampling
\(\bar{m}=\frac{1}{n} \sum_{i=1}^n m_i\): average cluster size for the sample
\(M=\sum_{i=1}^N m_i\): number of elements in the population
\(\bar{M}=\frac{M}{N}\): average cluster size for the population
\(y_i\): total of all observations in the \(i\)th cluster
Population mean
Estimator for the population mean \(\mu\): \[
\bar{y} = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i}
\]
Estimated variance of \(\bar{y}\): \[
\hat{V}(\bar{y}) = \left( \frac{N-n}{N n \bar{M}^2} \right) s_r^2 \\
\text{where} \\
s_r^2 = \frac{\sum_{i=1}^n (y_i - \bar{y} m_i)^2}{n-1}
\]Note: If \(\bar{M}\) is unknown, it can be approximated by \(\bar{m}\).
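The ratio estimator for cluster sampling and its variance can be sketched in R as follows. The function name `cluster_mean` is hypothetical; when \(\bar{M}\) is not supplied, the sketch falls back on \(\bar{m}\), as the note above suggests. The three-cluster data set is made up.

```r
# Hypothetical helper: cluster-sampling estimate of the mean per element
# y: cluster totals, m: cluster sizes, N: number of clusters in the population
cluster_mean <- function(y, m, N, M_bar = mean(m)) {
  n <- length(y)
  ybar <- sum(y) / sum(m)
  s2_r <- sum((y - ybar * m)^2) / (n - 1)
  v <- (N - n) / (N * n * M_bar^2) * s2_r
  c(mean = ybar, bound = 2 * sqrt(v))
}

# Made-up sample of 3 clusters from a population of 10 clusters
est_cl <- cluster_mean(y = c(12, 18, 30), m = c(2, 4, 6), N = 10)
est_cl
```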
Population total
Estimator for the population total \(\tau\): \[
M\bar{y} = M \left( \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n m_i} \right)
\]
Approximate sample size required to estimate population mean
Approximate sample size required to estimate \(\mu\), with a bound \(B\) on the error of estimation: \[
n = \frac{N\sigma^2_r}{ND + \sigma^2_r} \\
\sigma^2_r \text{ is estimated by } s_r^2 \\
D = \frac{B^2\bar{M}^2}{4}
\]
Approximate sample size required to estimate population total
Approximate sample size required to estimate \(\tau\), using \(M\bar{y}\), with a bound \(B\) on the error of estimation: \[
n = \frac{N\sigma^2_r}{ND + \sigma^2_r} \\
\sigma^2_r \text{ is estimated by } s_r^2 \\
D = \frac{B^2}{4N^2}
\]
(Note that the only difference between (8.12) and (8.13) is D.)
Approximate sample size required to estimate population total (without M)
Approximate sample size required to estimate \(\tau\), using \(N\bar{y_t}\), with a bound \(B\) on the error of estimation: \[
n = \frac{N\sigma^2_t}{ND + \sigma^2_t} \\
\sigma^2_t \text{ is estimated by } s_t^2 \\
D = \frac{B^2}{4N^2}
\]
Population proportion
Estimator for the population proportion \(p\): \[
\hat{p} = \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n m_i}
\]
Method: Cumulative sum method / Maximum size method / Catalog method
Estimator of the population mean \(\mu\): \[
\hat{\mu}_{pps} = \bar{\bar{y}} = \frac{1}{n}\sum_{i=1}^n \bar{y}_i
\]
Estimator for the variance of \(\hat{\mu}_{pps}\): \[
\hat{V}(\hat{\mu}_{pps}) = \frac{1}{n(n-1)} \sum_{i=1}^n (\bar{y}_i - \hat{\mu}_{pps})^2
\]
Estimator of the population total \(\tau\): \[
\hat{\tau}_{pps} = \frac{M}{n}\sum_{i=1}^n \bar{y}_i
\]
Estimator for the variance of \(\hat{\tau}_{pps}\): \[
\hat{V}(\hat{\tau}_{pps}) = \frac{M^2}{n(n-1)} \sum_{i=1}^n (\bar{y}_i - \hat{\mu}_{pps})^2
\]
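The four pps formulas above depend only on the sampled cluster means \(\bar{y}_i\) and the population size \(M\), so they fit in a few lines of R. The function name `pps_estimates` and the inputs are made up for illustration.

```r
# Hypothetical helper: pps estimates from the sampled cluster means
pps_estimates <- function(ybar_i, M) {
  n <- length(ybar_i)
  mu_hat <- mean(ybar_i)                               # mean of cluster means
  v_mu <- sum((ybar_i - mu_hat)^2) / (n * (n - 1))     # estimated variance
  c(mu        = mu_hat,
    bound_mu  = 2 * sqrt(v_mu),
    tau       = M * mu_hat,
    bound_tau = 2 * M * sqrt(v_mu))
}

# Made-up cluster means and population size
est_pps <- pps_estimates(ybar_i = c(2, 4, 6), M = 100)
est_pps
```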
Example Exercise
Cluster sampling
A sociologist wants to estimate the per-capita income in a certain small city. No list of resident adults is available. Here, cluster sampling seems to be the logical choice for the survey design because no lists of elements are available. Each of the city blocks will be considered one cluster, and the clusters are numbered on a city map, with the numbers from 1 to 415. The experimenter has enough time and money to sample n = 25 clusters and to interview every household within each cluster. Hence, 25 random numbers between 1 and 415 are selected, and the clusters having these numbers are marked on the map. Interviewers are then assigned to each of the sampled clusters.
Because \(M\) is not known, the \(\bar{M}\) appearing in Eq. (8.2) must be estimated by \(\bar{m}\), where \[
\bar{m}=\frac{\sum_{i=1}^n m_i}{n}=\frac{151}{25}=6.04
\]
Thus, the estimate of \(\mu\) with a bound on the error of estimation is given by \[
\bar{y} \pm 2 \sqrt{\hat{V}(\bar{y})}=8801 \pm 2 \sqrt{653,785}=8801 \pm 1617
\]
An investigator wishes to estimate the average number of defects per board on boards of electronic components manufactured for installation in computers. The boards contain varying numbers of components, and the investigator thinks that the number of defects should be positively correlated with the number of components on a board. Thus, pps sampling is used, with the probability of selecting any one board for the sample being proportional to the number of components on that board. A sample of n = 4 boards is to be selected from the N = 10 boards of one day of production. The number of components on each of the ten boards are
10, 12, 22, 8, 16, 24, 9, 10, 8, 31
Show how to select n = 4 boards with probabilities proportional to size.
(the sum of 10, 12, 22, 8, 16, 24, 9, 10, 8, 31 is 150)
Sol:
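Method 1: Cumulative sums method – form the cumulative sums of the board sizes (10, 22, 44, …, 150), draw 4 random integers from 1 to 150, and select the board whose cumulative range contains each number.
Method 2: Maximum size method – generate a random pair (i,j) with \(i \in [1,N]\), \(j\in[1,M]\), where N = 10 and M = 31 is the largest board size. Select board i if \(j \leq x_i\); otherwise discard the pair and draw another, until 4 boards are selected.
Method 3: Catalog method
Number the components consecutively: 1, 2, …, \(x_1\), \(x_1+1\), …, \(x_1+x_2\), \(x_1+x_2+1\), …, \(x_1+\dots+x_N\), which gives the sequence (1, 2, …, 150), where \(x_i\) is the size of the i-th board.
k = 150/4 = 37.5, rounded down to 37
\(R_1 =\) SRS(1,2,…,37)
\(R_2 = R_1 +37\)
\(R_3 = R_2 +37\)
\(R_4 = R_3 +37\)
For each \(R_t\) (t = 1, …, 4), if \(\sum _{i=1}^{j-1} x_i < R_t \leq \sum _{i=1}^{j} x_i\), select the j-th board.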
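The three selection methods can be sketched in base R as follows. This is an illustrative sketch, not the course's own code: the helper `board_of` and the seed are assumptions, and methods 1 and 2 as written here may select the same board more than once.

```r
sizes <- c(10, 12, 22, 8, 16, 24, 9, 10, 8, 31)  # components per board
N <- length(sizes)
n <- 4
upper <- cumsum(sizes)                            # 10, 22, 44, ..., 150

# Map a number r in 1..150 to the board whose cumulative range contains it
board_of <- function(r) findInterval(r, upper, left.open = TRUE) + 1

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Method 1: cumulative sums
m1 <- board_of(sample(sum(sizes), n, replace = TRUE))

# Method 2: maximum size; accept board i only if j <= its size
m2 <- integer(0)
while (length(m2) < n) {
  i <- sample(N, 1)
  j <- sample(max(sizes), 1)
  if (j <= sizes[i]) m2 <- c(m2, i)
}

# Method 3: systematic selection along the numbered components
k  <- floor(sum(sizes) / n)            # interval, 37
r1 <- sample(k, 1)                     # random start in 1..37
m3 <- board_of(r1 + k * (0:(n - 1)))

list(cumulative = m1, max_size = m2, systematic = m3)
```

Method 2 needs no cumulative totals, at the cost of rejected draws; method 3 spreads the selections along the list, so large boards are rarely skipped.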
Next, estimate the average number of defects per board, given that the numbers of defects found on boards 2, 3, 5, and 7 were 1, 3, 2, and 1, respectively.
First, calculate the cluster means and selection probabilities. \[
\begin{array}{llll}
\bar{y}_1=\frac{1}{12}, & \bar{y}_2=\frac{3}{22}, & \bar{y}_3=\frac{2}{16}, & \bar{y}_4=\frac{1}{9} \\
p_1=\frac{12}{150}, & p_2=\frac{22}{150}, & p_3=\frac{16}{150}, & p_4=\frac{9}{150}
\end{array}
\]
Second, the estimate of the average number is given by \[
\begin{aligned}
\hat{\mu}_{p p s} & =\frac{1}{n} \sum_{i=1}^n \frac{\bar{y}_i}{p_i}=\frac{1}{4}\left(\frac{\bar{y}_1}{p_1}+\frac{\bar{y}_2}{p_2}+\frac{\bar{y}_3}{p_3}+\frac{\bar{y}_4}{p_4}\right) \approx 1.25 \\
\text { And } \hat{v}\left(\hat{\mu}_{p p s}\right) & =\frac{1}{n(n-1)} \sum_{i=1}^n\left(\frac{\bar{y}_i}{p_i}-\hat{\mu}_{p p s}\right)^2 \\
& =\frac{1}{4 \times 3} \times(0.514) \\
& \approx 0.043
\end{aligned}
\]
Therefore, the bound \(B=2 \sqrt{\hat{v}\left(\hat{\mu}_{p p s}\right)}=2 \sqrt{0.043} \approx 0.41\)
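The arithmetic can be checked with a few lines of R. This sketch applies the formulas exactly as written in these notes to the sampled boards 2, 3, 5, and 7; the variable names are illustrative.

```r
sizes   <- c(10, 12, 22, 8, 16, 24, 9, 10, 8, 31)  # components per board
sampled <- c(2, 3, 5, 7)                            # boards in the sample
defects <- c(1, 3, 2, 1)                            # defects on those boards

ybar <- defects / sizes[sampled]      # defects per component on each board
p    <- sizes[sampled] / sum(sizes)   # selection probabilities
z    <- ybar / p                      # the terms averaged in the estimator
n    <- length(z)

mu_hat <- mean(z)
v_hat  <- sum((z - mu_hat)^2) / (n * (n - 1))  # same as var(z) / n
B      <- 2 * sqrt(v_hat)

c(estimate = mu_hat, variance = v_hat, bound = B)
```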
Chapter 9 Two-Stage Cluster Sampling
Advantages
The advantages of two-stage cluster sampling over other designs are the same as those listed in Chapter 8 for cluster sampling. First, a frame listing all elements in the population may be impossible or costly to obtain, whereas obtaining a list of all clusters may be easy. For example, compiling a list of all university students in the country would be expensive and time-consuming, but a list of universities can be readily acquired. Second, the cost of obtaining data may be inflated by travel costs if the sampled elements are spread over a large geographic area. Thus, sampling clusters of elements that are physically close together is often economical.
How to draw
Randomly select clusters first, then randomly sample elements within chosen clusters.
Population mean
Unbiased estimator of the population mean \(\mu\) : \[
\hat{\mu}=\left(\frac{N}{M}\right) \frac{\sum_{i=1}^n M_i \bar{y}_i}{n}=\frac{1}{\bar{M}} \frac{\sum_{i=1}^n M_i \bar{y}_i}{n}
\] assuming simple random sampling at each stage.
Estimated variance of \(\hat{\boldsymbol{\mu}}\) : \[
\hat{V}(\hat{\mu})=\left(1-\frac{n}{N}\right)\left(\frac{1}{n \bar{M}^2}\right) s_{\mathrm{b}}^2+\frac{1}{n N \bar{M}^2} \sum_{i=1}^n M_i^2\left(1-\frac{m_i}{M_i}\right)\left(\frac{s_i^2}{m_i}\right)
\] where \[
s_{\mathrm{b}}^2=\frac{\sum_{i=1}^n\left(M_i \bar{y}_i-\bar{M} \hat{\mu}\right)^2}{n-1}
\] and \[
s_i^2=\frac{\sum_{j=1}^{m_i}\left(y_{i j}-\bar{y}_i\right)^2}{m_i-1} \quad i=1,2, \ldots, n
\]
Notice that \(s_{\mathrm{b}}^2\) is simply the sample variance among the terms \(M_i \bar{y}_i\).
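The estimator and its variance translate directly into R. The sketch below assumes simple random sampling at both stages and a known total number of elements M; the function name `two_stage_mean` and the small data set are hypothetical.

```r
# Two-stage cluster sampling: unbiased estimator of mu.
# samples: list of numeric vectors, one per sampled cluster
# M_i: sizes of the sampled clusters, N: clusters in the population,
# M: total number of elements in the population
two_stage_mean <- function(samples, M_i, N, M) {
  n     <- length(samples)
  M_bar <- M / N
  m_i   <- lengths(samples)            # second-stage sample sizes
  ybar  <- sapply(samples, mean)       # cluster sample means
  s2    <- sapply(samples, var)        # within-cluster sample variances
  mu    <- sum(M_i * ybar) / (n * M_bar)
  s_b2  <- var(M_i * ybar)             # sample variance of the terms M_i * ybar_i
  within <- sum(M_i^2 * (1 - m_i / M_i) * s2 / m_i)
  v <- (1 - n / N) * s_b2 / (n * M_bar^2) + within / (n * N * M_bar^2)
  list(estimate = mu, variance = v, bound = 2 * sqrt(v))
}

# Hypothetical example: n = 3 clusters sampled out of N = 10, M = 500 elements
samples <- list(c(5, 7, 6, 8), c(10, 12, 9), c(4, 6, 5, 5, 5))
M_i <- c(50, 40, 60)
res <- two_stage_mean(samples, M_i, N = 10, M = 500)
res$estimate
res$bound
```

Using `var()` for \(s_{\mathrm{b}}^2\) works because \(\bar{M}\hat{\mu}\) equals the sample mean of the terms \(M_i \bar{y}_i\).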
Population total
Estimation of the population total \(\boldsymbol{\tau}\) : \[
\hat{\tau}=M \hat{\mu}=\frac{N}{n} \sum_{i=1}^n M_i \bar{y}_i
\] assuming simple random sampling at each stage. Estimated variance of \(\hat{\boldsymbol{\tau}}\) : \[
\begin{aligned}
\hat{V}(\hat{\tau}) & =M^2 \hat{V}(\hat{\mu}) \\
& =\left(1-\frac{n}{N}\right)\left(\frac{N^2}{n}\right) s_{\mathrm{b}}^2+\frac{N}{n} \sum_{i=1}^n M_i^2\left(1-\frac{m_i}{M_i}\right)\left(\frac{s_i^2}{m_i}\right)
\end{aligned}
\]
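The second line follows from \(M=N \bar{M}\): multiplying each coefficient of \(\hat{V}(\hat{\mu})\) by \(M^2\) gives \[
M^2 \cdot \frac{1}{n \bar{M}^2}=\frac{N^2 \bar{M}^2}{n \bar{M}^2}=\frac{N^2}{n}, \qquad M^2 \cdot \frac{1}{n N \bar{M}^2}=\frac{N^2 \bar{M}^2}{n N \bar{M}^2}=\frac{N}{n}
\]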
Ratio Estimation of Population Mean
Ratio estimator of the population mean \(\mu\) : \[
\hat{\mu}_{\mathrm{r}}=\frac{\sum_{i=1}^n M_i \bar{y}_i}{\sum_{i=1}^n M_i}
\]
Estimated variance of \(\hat{\boldsymbol{\mu}}_{\mathbf{r}}\) : \[
\hat{V}\left(\hat{\mu}_{\mathrm{r}}\right)=\left(1-\frac{n}{N}\right)\left(\frac{1}{n \bar{M}^2}\right) s_{\mathrm{r}}^2+\frac{1}{n N \bar{M}^2} \sum_{i=1}^n M_i^2\left(1-\frac{m_i}{M_i}\right)\left(\frac{s_i^2}{m_i}\right)
\]
where \[
s_{\mathrm{r}}^2=\frac{\sum_{i=1}^n M_i^2\left(\bar{y}_i-\hat{\mu}_{\mathrm{r}}\right)^2}{n-1}=\frac{\sum_{i=1}^n\left(M_i \bar{y}_i-M_i \hat{\mu}_{\mathrm{r}}\right)^2}{n-1}
\] and \[
s_i^2=\frac{\sum_{j=1}^{m_i}\left(y_{i j}-\bar{y}_i\right)^2}{m_i-1} \quad i=1,2, \ldots, n
\]
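A minimal R sketch of the ratio estimator follows. It assumes M is unknown, so \(\bar{M}\) in the variance is replaced by the sample mean of the \(M_i\); that substitution, the function name `two_stage_ratio_mean`, and the data are assumptions made for illustration.

```r
# Two-stage cluster sampling: ratio estimator of mu.
# samples: list of numeric vectors, one per sampled cluster
# M_i: sizes of the sampled clusters, N: clusters in the population
two_stage_ratio_mean <- function(samples, M_i, N) {
  n     <- length(samples)
  m_i   <- lengths(samples)
  ybar  <- sapply(samples, mean)
  s2    <- sapply(samples, var)
  M_bar <- mean(M_i)                   # assumption: M unknown, use sample mean
  mu_r  <- sum(M_i * ybar) / sum(M_i)
  s_r2  <- sum(M_i^2 * (ybar - mu_r)^2) / (n - 1)
  within <- sum(M_i^2 * (1 - m_i / M_i) * s2 / m_i)
  v <- (1 - n / N) * s_r2 / (n * M_bar^2) + within / (n * N * M_bar^2)
  list(estimate = mu_r, variance = v, bound = 2 * sqrt(v))
}

# Hypothetical example: n = 3 clusters sampled out of N = 10
samples <- list(c(5, 7, 6, 8), c(10, 12, 9), c(4, 6, 5, 5, 5))
M_i <- c(50, 40, 65)
res <- two_stage_ratio_mean(samples, M_i, N = 10)
res$estimate
res$bound
```

Unlike the unbiased estimator, this one needs no knowledge of M, which is why it is often the practical choice.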
Population proportion
Estimator of a population proportion \(\boldsymbol{p}\) : \[
\hat{p}=\frac{\sum_{i=1}^n M_i \hat{p}_i}{\sum_{i=1}^n M_i}
\]
Estimated variance of \(\hat{\boldsymbol{p}}\) : \[
\hat{V}(\hat{p})=\left(1-\frac{n}{N}\right)\left(\frac{1}{n \bar{M}^2}\right) s_{\mathrm{r}}^2+\frac{1}{n N \bar{M}^2} \sum_{i=1}^n M_i^2\left(1-\frac{m_i}{M_i}\right)\left(\frac{\hat{p}_i \hat{q}_i}{m_i-1}\right)
\]
where \[
s_{\mathrm{r}}^2=\frac{\sum_{i=1}^n M_i^2\left(\hat{p}_i-\hat{p}\right)^2}{n-1}=\frac{\sum_{i=1}^n\left(M_i \hat{p}_i-M_i \hat{p}\right)^2}{n-1}
\] and \(\hat{q}_i=1-\hat{p}_i\).
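The proportion estimator has the same shape as the ratio estimator, with \(\hat{p}_i\) in place of \(\bar{y}_i\). In the sketch below, `a_i` (the number of sampled elements with the attribute in each cluster), the function name `two_stage_prop`, the data, and the use of the sample mean of the \(M_i\) for \(\bar{M}\) are all illustrative assumptions.

```r
# Two-stage cluster sampling: estimator of a population proportion.
# a_i: sampled elements with the attribute, m_i: second-stage sample sizes,
# M_i: sizes of the sampled clusters, N: clusters in the population
two_stage_prop <- function(a_i, m_i, M_i, N) {
  n     <- length(a_i)
  p_i   <- a_i / m_i
  q_i   <- 1 - p_i
  M_bar <- mean(M_i)                   # assumption: M unknown, use sample mean
  p_hat <- sum(M_i * p_i) / sum(M_i)
  s_r2  <- sum(M_i^2 * (p_i - p_hat)^2) / (n - 1)
  within <- sum(M_i^2 * (1 - m_i / M_i) * p_i * q_i / (m_i - 1))
  v <- (1 - n / N) * s_r2 / (n * M_bar^2) + within / (n * N * M_bar^2)
  list(estimate = p_hat, variance = v, bound = 2 * sqrt(v))
}

# Hypothetical example: 10 elements sampled in each of n = 3 clusters, N = 20
res <- two_stage_prop(a_i = c(4, 6, 3), m_i = c(10, 10, 10),
                      M_i = c(50, 40, 60), N = 20)
res$estimate
res$bound
```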
Probabilities proportional to size (PPS)
Estimator of the population mean \(\mu\) : \[
\hat{\mu}_{\mathrm{pps}}=\frac{1}{n} \sum_{i=1}^n \bar{y}_i
\]
Estimator of the population total \(\boldsymbol{\tau}\) : \[
\hat{\tau}_{\mathrm{pps}}=\frac{M}{n} \sum_{i=1}^n \bar{y}_i
\]
Estimated variance of \(\hat{\boldsymbol{\tau}}_{\mathbf{p p s}}\) : \[
\hat{V}\left(\hat{\tau}_{\mathrm{pps}}\right)=\frac{M^2}{n(n-1)} \sum_{i=1}^n\left(\bar{y}_i-\hat{\mu}_{\mathrm{pps}}\right)^2
\]
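Under PPS selection of clusters the estimators reduce to simple averages of the cluster sample means, as this short R sketch shows. The function name `pps_two_stage` and the numbers are hypothetical.

```r
# Two-stage sampling with clusters drawn with probabilities proportional to size.
# ybar: sample means of the selected clusters, M: total elements in the population
pps_two_stage <- function(ybar, M) {
  n     <- length(ybar)
  mu    <- mean(ybar)
  tau   <- M * mu
  v_tau <- M^2 * sum((ybar - mu)^2) / (n * (n - 1))
  list(mu = mu, tau = tau, var_tau = v_tau, bound_tau = 2 * sqrt(v_tau))
}

# Hypothetical example: three selected clusters, M = 500 elements in total
res <- pps_two_stage(ybar = c(6, 10, 5), M = 500)
res$mu
res$tau
res$bound_tau
```

No cluster sizes appear in the formulas because the selection probabilities already account for them.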
Summary
Advantages and Disadvantages of Common Sampling Methods
Simple Random Sampling:
Advantages: It is simple to carry out and gives every individual in the population an equal probability of selection. In theory it yields unbiased estimates, and standard statistical inference applies directly.
Disadvantages: When the population is large, numbering the units and drawing the sample is cumbersome. If the population has distinct strata, the sample may not represent each stratum well.
Stratified Sampling:
Advantages: The population is first divided into strata according to known characteristics, and an independent sample is then drawn from each stratum, which makes the sample more representative and improves the precision of the estimates. It also allows each stratum to be analyzed separately to understand the differences among strata.
Disadvantages: It requires enough prior knowledge of the population to stratify sensibly. The sampling procedure is more complex than simple random sampling, and the computations are more involved.
Cluster Sampling:
Advantages: The sampling unit is a group of elements, so sampling and fieldwork are relatively convenient, which reduces the cost and difficulty of the survey. It suits situations where the population is widely dispersed and individual elements are hard to reach.
Disadvantages: Individuals within a cluster tend to be similar, so the sample may be less representative. For the same number of sampled elements, the sampling error is usually larger than under simple random sampling or stratified sampling.
Systematic Sampling:
Advantages: The population is arranged in some order and units are then selected at equal intervals. It is simple to carry out, and the sample is often spread evenly across the population, which can make it quite representative.
Disadvantages: If the population has a periodic pattern and the sampling interval is related to the period, the sample can be seriously biased.
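To make the four designs concrete, the sketch below draws each kind of sample from a small artificial frame in base R. The frame, its stratum and cluster labels, the allocation, and the seed are all made up for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Artificial frame: 120 units, 3 strata of unequal size, 12 clusters of 10
frame <- data.frame(id      = 1:120,
                    stratum = rep(c("A", "B", "C"), times = c(60, 40, 20)),
                    cluster = rep(1:12, each = 10))
n <- 12

# Simple random sampling: every unit has the same chance
srs <- frame[sample(nrow(frame), n), ]

# Stratified sampling with proportional allocation (6, 4, 2)
alloc <- c(A = 6, B = 4, C = 2)
strat <- do.call(rbind, lapply(names(alloc), function(s) {
  rows <- which(frame$stratum == s)
  frame[sample(rows, alloc[[s]]), ]
}))

# Cluster sampling: pick 2 whole clusters and keep all their units
chosen <- sample(unique(frame$cluster), 2)
clus   <- frame[frame$cluster %in% chosen, ]

# Systematic sampling: random start, then every k-th unit
k     <- nrow(frame) / n               # interval of 10
start <- sample(k, 1)
sys   <- frame[seq(start, nrow(frame), by = k), ]

table(srs$stratum)    # stratum counts vary from draw to draw
table(strat$stratum)  # stratum counts are fixed by design
```

Comparing the two tables shows the main trade-off: the stratified sample fixes each stratum's share, while the simple random sample leaves it to chance.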
Prevention and Correction of Sample Selection Bias
Prevention:
Reasonable Sampling Design: Adopt scientific sampling methods, such as random sampling and stratified sampling, to ensure that each individual has an appropriate probability of being selected. Clearly define the source and scope of the sample and consider the diversity of the population. For example, in consumer surveys, cover people of different ages, regions, and income levels.
Establish Strict Criteria: When selecting samples, establish clear inclusion and exclusion criteria to avoid