An Introduction to Two-Phase Sampling and How it Could be Used to Collect Learning Outcomes Data

Andres Parrado and I recently wrote an article in which we look at the reliability of learning outcomes data in India. The main findings of the paper are a) the government-run survey of learning outcomes (called the NAS) likely contains a lot of noise and b) the main independent survey of learning outcomes (ASER) is a tad bit noisier than the survey’s sample size would lead one to believe.

In the process of working on the paper, I spent a bit of time looking at learning outcomes surveys across the world. Something that somewhat surprised me is that most learning outcomes surveys, in both rich countries (e.g. PISA) and developing countries (e.g. SACMEQ), are school-based. That is, the surveyors randomly select a bunch of schools, then randomly select a bunch of students in each selected school, and finally administer the assessment to each student.

This makes sense in rich countries where you have a) high enrollment, b) accurate and up to date lists of schools, c) high attendance, d) high rates of basic literacy (and thus high capacity to take a paper and pencil test), and e) low incentives for teachers or other administrators to cheat (in the sense that their potential financial gains are a small fraction of what they stand to lose if they are caught).

In developing countries, you often only have (a) – high enrollment. Typically, there is no comprehensive roster of all schools due to the proliferation of low-cost, informal private schools. Attendance on exam day cannot be guaranteed. And a large share of students don’t have sufficient reading skills to take a paper and pencil test.

An alternative to the school-based approach which I think makes more sense in developing countries is to instead randomly select households. That is, you would randomly select villages, then randomly select households within each village, and finally administer the assessment to each child in each randomly selected household. This is the approach that ASER uses and seems to work really well.

Using existing household surveys to collect learning outcomes data

I think that the ideal way of collecting learning outcomes data in most developing countries would be to add a short, basic learning outcomes module to an existing large household survey like the DHS or LSMS. ASER has proved that you can get a good measure of a child’s basic literacy and numeracy in very little time so it wouldn’t add much to the total survey time or cost.

Of course, getting the org running an existing large household survey to add a module to their survey is never an easy task. Out of curiosity, I decided to look at whether it would be possible to get reasonably precise estimates of learning outcomes by administering a learning outcomes assessment to just a sub-sample of children included in a larger household survey. My motivation was that, at least in theory, administering a learning module to a sub-sample of respondents rather than the full sample is lower cost and hassle and thus it might be easier to convince someone to do this than to administer the module to the full sample. (With tools like SurveyCTO it is pretty easy to randomly select a sub-sample on the spot these days.) In practice, the procedures for adding modules / questions to existing major surveys like the DHS or NSSO rounds are extremely bureaucratic so this might just be wishful thinking on my part.

Estimating precision from estimating learning levels using a sub-sample from a larger survey

In theory, you can get reasonably high precision using only a sub-sample of a larger survey because a) you can draw a random sub-sample (i.e. without having to cluster) from the larger sample and b) a household survey like the DHS includes a lot of additional variables which are predictive of learning outcomes. To understand the first reason, suppose that sample size of the the existing household survey is so large that you can effectively consider it the whole population. If you draw a simple random sub-sample of size \(N_{sub}\) from the larger sample then your standard errors will be about as large as they would be if you drew a simple random sample from the entire population and much smaller than they would be under the typically sampling approach of two-stage clustering (e.g. selecting villages then hamlets). To understand the second reason, suppose that there is a variable in the larger sample (such as household wealth) that is extremely highly predictive of learning outcomes. You could then use this variable as an auxiliary variable when generating your estimates (or when selecting your sub-sample as a strata variable).

This approach, in which you collect some information for a large sample and then collect additional information for a second sub-sample is called two-phase sampling.

With a bit of math and data from the IHDS we can provide more formal estimates of the sample size requirements. The description below is a bit informal. For a more formal description, check out chapter 12 in Lohr (2019) or chapter 9 in Sarndal et al (Särndal, Swensson, and Wretman 2003).

Let \(\widehat{\bar{y_f}}\) be the estimate of average learning outcomes that we would obtain if we administered the learning outcome assessment to all children in the sample (f stands for “full”). Let \(\widehat{\bar{y_2}}\) be the estimate of average learning outcome if we only administer the assessment to a subsample of children (2 = “two phase”). Recall from intro stats that \(Var(y)=Var(E[y|x])+E[Var(y|x)]\). If Z is the vector indicating household inclusion in the full sample (also called the first phase sample) then…

\[ Var(\widehat{\bar{y_2}})=Var(E[\widehat{\bar{y_2}}|Z])+E[Var(\widehat{\bar{y_2}}|Z)]\approx Var(\widehat{\bar{y_f}})+E[Var(\widehat{\bar{y_2}}|Z)] \] Where the second approximate equality holds as long as the \(\widehat{\bar{y_2}}\) is approximately an unbiased estimator of \(\widehat{\bar{y_f}}\). The second term in the equation is variance of estimator treating the full sample (i.e. the first phase sample) as the full population. Since we have a lot of auxiliary information with which to create that estimator, it’s variance is likely to be quite small.

We can estimate both of these terms using IHDS data. We first estimate \(Var(\widehat{\bar{y_f}})\) where \(\bar{y}\) is the proportion of children 8-11 who are able to read a standard 2 level text in 2011-12. (IHDS only administered the ASER tool to 8-11 year olds.)

# Load required pacakges
library(tidyverse); library(survey); library(haven); library(tidymodels)

# Load the Stata version of the individual level IHDS file
# To get this file go to
# Then click "download" and "stata". You will need to create a login and you will get a bunch of other files in addition to this one.
ihds_ind_dir <- "C:/Users/dougj/Documents/Data/IHDS/IHDS 2012/DS0001"
ind_file <- file.path(ihds_ind_dir, "36151-0001-Data.dta")
ihds <- read_dta(ind_file, col_select = c(STATEID, DISTID, PSUID, URBAN2011, HHID, HHSPLITID, PERSONID, IDPSU, WT, RO3, RO7, RO5, COPC, ASSETS, GROUPS, HHEDUC, HHEDUCM, HHEDUCF, INCOME, NPERSONS,starts_with("CS"), starts_with("TA"), starts_with("ED")))

# Create variables for full PSU ID, HH ID, ASER score, and state
ihds <- ihds %>% 
  mutate(psu_expanded = paste(STATEID, DISTID, PSUID, sep ="-"), 
         hh_expanded = paste(STATEID, DISTID, PSUID, HHID, HHSPLITID, sep ="-"),
         ASER4 = (TA8B ==4),
         State = as_factor(STATEID)) %>% 

# Specify the survey design
# note that this and the line after can take a minute or two
ihds_svy <- svydesign(id =~ psu_expanded + hh_expanded, weights =~ WT, data = ihds)

# Estimate the mean of ASER4 for the full country and get the standard error
svymean(~ASER4, ihds_svy, na.rm =TRUE)
##               mean     SE
## ASER4FALSE 0.67279 0.0084
## ASER4TRUE  0.32721 0.0084

Our standard error for our estimate is a little under one percentage point. This makes sense since our sample size is over 10,000 children so this implies a design effect of around 2.

In the following code, I estimate the second term in the equation above – the additional variance due to the two-phase sampling design – assuming the random sub-sample is 1/20th the size or about 590 kids. If we just use the simple mean of our sub-sample as our estimator, the RMSE is about .021. (Just as we would expect for a SRS). If we use some of the other variables as auxiliary information, or RMSE is .0196. So, unfortunately, the auxiliary information didn’t buy us much in the way of improved precision here. (Though I should also point out that I pretty much just ran a regression with the first variables I could think of. A more sophisticated approach could potentially generate far better predictions.)

Still, our estimated RMSE for the overall estimator \(\sqrt{.0084^2+.0196^2} = .0213\). Not bad for a total sample size of 590 kids.

# Create a new dataframe with only children 8-11 and all the variables we want to use in our analysis
kids <- ihds %>% 
  filter(! %>% 
  transmute(ASER4, TA4, RO3, URBAN2011, ASSETS, log_inc = log(INCOME+1), group = as_factor(GROUPS), RO5, HHEDUC, NPERSONS, log_pcc = log(COPC), private = ifelse((CS4 ==4), 1, 0))
## Warning in log(INCOME + 1): NaNs produced
# Create a formula that we will use in our probit regression later
rx <- as.formula("ASER4 ~ TA4 + log_pcc + RO3 + RO5 + HHEDUC + private + NPERSONS + URBAN2011 + log_inc + group")

# 1/rho is the proportion of the main sample that we assume will be sub-sampled
rho <- 20

# set the seed so this is reproducible

# Create a 20-fold split. Note that ideally, I would randomly select households to be included/excluded in each split, but this is a pain and wouldn't likely make much of a difference. (Since IHDS only surveyed kids 8-11 there are typically max 1 per hh.)
split_obj <- vfold_cv(kids, v = rho, repeats = 10)

# Calculate the true overall mean
true <- mean(kids$ASER4)

# Create a function which will take one of the splits from the split_obj and calculate 
# the error assuming we don't use an auxiliary info and using auxiliary information.
# See here for more info
ggreg_error <- function(splits){
  # Please note that I use the assessment data as the training data and the analysis test 
  # as the holdout data.  This is the exact opposite of what a normal ML work flow looks like.
  # In a normal workflow, you fit your model on the 1-1/rho of the data and 
  #then test it on the 1/rho portion of the data.  I want to do the exact opposite. 
  sample <- assessment(splits)
  other <- analysis(splits)
  # Fit the model  
  mod <- glm(rx, data = sample, family = binomial(link = "probit"))
  # Generate model predictions for both datasets
  preds_other <- augment(mod, newdata = other, type.predict = "response")
  preds_sample <- augment(mod, newdata = sample, type.predict = "response")

  # Calculate the generalized GREG (see Pfefferman eq 4.8 for details)
  ggreg <- mean(preds_other$.fitted, na.rm = TRUE)*((rho-1)/rho) + mean(sample$ASER4, na.rm = TRUE)/rho 

  # Return   
  return(list(true-ggreg, true- mean(sample$ASER4, na.rm=TRUE)))

# Map this function to the split object
split_obj$error <- map(split_obj$splits, ggreg_error)

# Get the root mean squared error for both the simple mean and the generalized GREG estimator
ggreg_error_vec <- map_dbl(split_obj$error, 1)
simple_error_vec <- map_dbl(split_obj$error, 2)

sqrt_mse_ggreg <- (mean(ggreg_error_vec^2))^.5
## [1] 0.01969
sqrt_mse_simple <- (mean(simple_error_vec^2))^.5
## [1] 0.02088427


Lohr, Sharon L. 2019. Sampling: Design and Analysis: Design and Analysis. CRC Press.
Särndal, Carl-Erik, Bengt Swensson, and Jan Wretman. 2003. Model Assisted Survey Sampling. Springer Science & Business Media.