Index Variable Weirdness

There are many instances where you have a bunch of variables and you need to boil them down to one or just a few. For example, you may be testing the effect of an education program on students’ confidence, self-efficacy, and learning levels. You will likely have at least one measure (likely more) for each of those core outcome variables and want to combine them into one to avoid having to do any multiple hypothesis corrections.

Within economics, people tend to use a few different approaches to constructing such “index” variables. The simplest method, which I’ve seen attributed to Katz, is to standardize each variable, add them all up, and then standardize the sum again. Another common approach is to apply principal components analysis (PCA) and use the first component. Lastly, people often use the approach by Anderson, which assigns weights based on the inverse of the variance-covariance matrix of the variables.
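
The Katz-style recipe described above (standardize, sum, standardize again) can be sketched in a few lines of R. This is only an illustration: the function name `katz_index`, the toy data, and the seed are my own assumptions, not taken from any particular paper.

```r
# Hypothetical helper: equally weighted index (standardize, sum, standardize again)
katz_index <- function(data) {
  z <- scale(as.matrix(data))   # standardize each column to mean 0, sd 1
  total <- rowSums(z)           # add the standardized variables up
  as.numeric(scale(total))      # standardize the sum and drop matrix attributes
}

set.seed(1)                     # arbitrary seed for this illustration
toy <- data.frame(a = rnorm(500, mean = 10, sd = 3),
                  b = runif(500),
                  c = rexp(500))
idx <- katz_index(toy)
mean(idx)                       # numerically zero
sd(idx)                         # one, up to floating point error
```

Because every variable is standardized before summing, each one enters the index with the same weight regardless of its original units.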

The weird thing about these methods is that the PCA and Anderson methods are, in some ways, almost complete opposites, with the Katz method somewhere in between. With PCA, highly correlated variables tend to be assigned higher weights. With the Anderson method, highly correlated variables tend to be assigned lower weights. Both approaches have a certain logic to them and seem appropriate in certain circumstances. The code below demonstrates this for the case where you have three variables: x1, which is independent of the others, and z1 and z2, which are highly correlated with each other. All have mean 0 and variance 1. The Katz method weights each variable equally. PCA assigns essentially zero weight to x1 and equal weights to z1 and z2 (a loading of about 0.71 each, or half each once the weights are rescaled to sum to one). And the Anderson method assigns a higher weight to x1 (about 1.0) and lower weights to z1 and z2 (about 0.63 each).
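
As a quick check that does not depend on simulated data, the same comparison can be run directly on the population covariance matrix implied by this setup. This is a sketch under assumptions: it takes rho = 0.6, uses row sums of the inverse covariance matrix as the Anderson-style weights, and introduces a hypothetical `normalize` helper so the three sets of weights sit on the same scale.

```r
rho <- 0.6
# Population covariance matrix of (x1, z1, z2): x1 independent, z1 and z2 correlated
Sigma <- matrix(c(1, 0,   0,
                  0, 1,   rho,
                  0, rho, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("x1", "z1", "z2"), c("x1", "z1", "z2")))

katz_w <- c(x1 = 1, z1 = 1, z2 = 1)       # equal weights
eig <- eigen(Sigma, symmetric = TRUE)     # eigenvalues come back in decreasing order
pca_w <- abs(eig$vectors[, 1])            # first eigenvector; its sign is arbitrary
names(pca_w) <- rownames(Sigma)
anderson_w <- rowSums(solve(Sigma))       # row sums of the inverse covariance matrix

# Hypothetical helper: rescale weights to sum to 1 so they are comparable
normalize <- function(w) w / sum(w)
round(rbind(katz = normalize(katz_w),
            pca = normalize(pca_w),
            anderson = normalize(anderson_w)), 3)
```

Under these assumptions the normalized weights are one third each for Katz, (0, 0.5, 0.5) for PCA, and roughly (0.44, 0.28, 0.28) for the inverse-covariance weights, since the unnormalized weight on each correlated variable is 1 / (1 + rho) = 0.625.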

Each of these approaches makes sense in the appropriate context. If you are trying to measure some latent, unmeasurable quantity such as intelligence, then it probably makes sense to take a bunch of measures and weight those which are highly correlated more heavily (i.e. use PCA). On the other hand, if you have a bunch of variables and some subset of them may be more or less measuring the same thing, you should probably weight those variables lower than the others so that you don’t double count the thing you are measuring (i.e. use the Anderson method). If you are just aggregating a bunch of different measures, then the Katz method seems pretty reasonable.

The odd thing is that people rarely justify their choice in constructing an index.

library(tidyverse); library(MASS)
# generate random variable
n <- 100000
x1 <- rnorm(n)

# Generate two correlated random variables
rho <- .6
Sigma <- matrix(data = c(1,rho,rho, 1), nrow = 2)
z <- mvrnorm(n, mu = c(0,0), Sigma = Sigma)
z1 <- z[,1]
z2 <- z[,2]

# Create a dataframe from the three variables
df <- tibble(x1 = x1, z1 = z1, z2 = z2)

# Test that all means are 0, variances are 1, and covariance of z1 and z2 is roughly equal to rho
vars <- list(x1, z1, z2)
map_dbl(vars, mean)
## [1]  0.005211541  0.003045969 -0.001404589
map_dbl(vars, var)
## [1] 0.9981877 0.9984357 1.0017675
cov(z1, z2)
## [1] 0.5991689
# Weights from PCA
pca_weights <- prcomp(~ x1 + z1 + z2, data = df, rank. = 1)
pca_weights
## Standard deviations (1, .., p=3):
## [1] 1.2646238 0.9990934 0.6331904
## Rotation (n x k) = (3 x 1):
##            PC1
## x1 0.000895669
## z1 0.706123073
## z2 0.708088556
# Weights from Anderson method
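Sigma_inv <- solve(cov(df))
# Column sums (equal to row sums, since the matrix is symmetric) of the inverse covariance matrix
anderson_weights <- colSums(Sigma_inv)
anderson_weights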
##        x1        z1        z2 
## 1.0013371 0.6268521 0.6232446
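
To see how much the choice of weights matters for the resulting index itself, here is a rough sketch that turns each set of weights into a standardized index and compares them. It is self-contained and regenerates similar data in base R rather than reusing the objects above. The seed, the sample size, the `make_index` helper, and the sign flip on the PCA loadings are all my own assumptions, and it only uses the inverse-covariance weighting idea, not every detail of Anderson's published procedure.

```r
set.seed(42)                                  # arbitrary seed, chosen for this sketch
n <- 10000
rho <- 0.6
x1 <- rnorm(n)
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # corr(z1, z2) equals rho in expectation
dat <- scale(cbind(x1, z1, z2))               # standardized copies of the variables

# Hypothetical helper: weighted sum of the columns, re-standardized
make_index <- function(data, w) as.numeric(scale(data %*% w))

w_katz <- rep(1, 3)                           # equal weights
w_pca <- prcomp(dat)$rotation[, 1]            # loadings on the first component
w_pca <- w_pca * sign(sum(w_pca))             # the sign of a component is arbitrary; make it positive
w_anderson <- rowSums(solve(cov(dat)))        # row sums of the inverse covariance matrix

indices <- cbind(katz = make_index(dat, w_katz),
                 pca = make_index(dat, w_pca),
                 anderson = make_index(dat, w_anderson))
round(cor(indices), 2)                        # pairwise correlations between the three indices
```

In this setup the Katz and Anderson-style indices track each other more closely than either tracks the PCA index, because PCA effectively drops x1.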