From M-estimation math to code

In the basic set-up, M-estimation applies to estimators of the \(p \times 1\) parameter \(\theta=(\theta_1, \theta_2, \dots, \theta_p)^{\intercal}\) which can be obtained as solutions to an equation of the form

\[\begin{equation} \label{eq:psi} \sum_{i = 1}^m \psi(O_i, \theta) = 0, \end{equation}\]

where \(O_1, \ldots, O_m\) are \(m\) independent and identically distributed (iid) random variables, and the function \(\psi\) returns a vector of length \(p\) and does not depend on \(i\) or \(m\) (Stefanski and Boos 2002, hereafter SB). See SB for the case where the \(O_i\) are independent but not necessarily identically distributed. The roots of (1) are referred to as M-estimators and denoted by \(\hat \theta\). M-estimators can be solved for analytically in some cases or computed numerically in general. Under certain regularity conditions, the asymptotic properties of \(\hat{\theta}\) can be derived from Taylor series approximations, the law of large numbers, and the central limit theorem. In brief, let \(\theta_0\) be the true parameter value defined by \(\int \psi(o, \theta_0) dF(o) = 0\), where \(F\) is the distribution function of \(O\). Let \(\dot{\psi}(o, \theta) = {\partial \psi(O_i, \theta)/\partial \theta}^{\intercal}\), \(A(\theta_0) = E[-\dot{\psi}(O_1, \theta_0)]\), and \(B(\theta_0) = E[\psi(O_1, \theta_0)\psi(O_1, \theta_0)^{\intercal}]\). Then under suitable regularity assumptions, \(\hat{\theta}\) is consistent and asymptotically Normal, i.e.,

\[\begin{equation*} \label{eq:variance} \sqrt{m}(\hat{\theta} - \theta_0) \overset{d}{\to} N(0, V(\theta_0)) \text{ as } m \to \infty, \end{equation*}\]

where \(V(\theta_0) = A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1} \}^{\intercal}\). The sandwich form of \(V(\theta_0)\) suggests several possible large sample variance estimators. For some problems, the analytic form of \(V(\theta_0)\) can be derived and estimators of \(\theta_0\) and other unknowns simply plugged into \(V(\theta_0)\). Alternatively, \(V(\theta_0)\) can be consistently estimated by the empirical sandwich variance estimator, where the expectations in \(A(\theta)\) and \(B(\theta)\) are replaced with their empirical counterparts. Let \(A_i = - \dot{\psi}(O_i, \theta)|_{\theta = \hat{\theta}}\), \(A_m = m^{-1} \sum_{i = 1}^m A_i\), \(B_i = \psi(O_i, \hat{\theta}) \psi(O_i, \hat{\theta})^{\intercal}\), and \(B_m = m^{-1} \sum_{i = 1}^m B_i\). The empirical sandwich estimator of the variance of \(\hat{\theta}\) is:

\[\begin{equation} \label{eq:esve} \hat{\Sigma} = A_m^{-1} B_m \{A_m^{-1}\}^{\intercal}/m . \end{equation}\]

The geex package provides an application programming interface (API) for carrying out M-estimation. The analyst provides a function, called estFUN, corresponding to \(\psi(O_i, \theta)\) that maps data \(O_i\) to a function of \(\theta\). Numerical derivatives approximate \(\dot{\psi}\) so that evaluating \(\hat{\Sigma}\) is entirely a computational exercise. No analytic derivations are required from the analyst.

Consider estimating the population mean \(\theta = E[Y_i]\) using the sample mean \(\hat{\theta} = m^{-1} \sum_{i=1}^m Y_i\) of \(m\) iid random variables \(Y_1, \dots, Y_m\). The estimator \(\hat{\theta}\) can be expressed as the solution to the following estimating equation:

\[ \sum_{i = 1}^m (Y_i - \theta) = 0. \]

which is equivalent to solving \(\eqref{eq:psi}\) where \(O_i = Y_i\) and \(\psi(O_i, \theta) = Y_i - \theta\). An estFUN is a translation of \(\psi\) into a function whose first argument is data and returns a function whose first argument is theta. An estFUN corresponding to the estimating equation for the sample mean of \(Y\) is:

meanFUN <- function(data){ function(theta){ data$Y - theta } } .

If an estimator fits into the above framework, then the user need only specify estFUN. No other programming is required to obtain point and variance estimates. The remaining sections provide examples of translating \(\psi\) into an estFUN.

Calculus of M-estimation Examples

The three examples of M-estimation from SB are presented here for demonstration. For these examples, the data are \(O_i = \{Y_{1i}, Y_{2i}\}\) where \(Y_1 \sim N(5, 16)\) and \(Y_2 \sim N(2, 1)\) for \(m = 100\) and are included in the geexex dataset.

Example 1: Sample moments

The first example estimates the population mean (\(\theta_1\)) and variance (\(\theta_2\)) of \(Y_1\). The solution to the estimating equations below are the sample mean \(\hat{\theta}_1 = m^{-1} \sum_{i = 1}^m Y_{1i}\) and sample variance \(\hat{\theta}_2 = m^{-1} \sum_{i = 1}^m (Y_{1i} - \hat{\theta}_1)^2\).

\[ \psi(Y_{1i}, \theta) = \begin{pmatrix} Y_{1i} - \theta_1 \\ (Y_{1i} - \theta_1)^2 - \theta_2 \end{pmatrix} \]

SB1_estfun <- function(data){
  Y1 <- data$Y1
  function(theta){
      c(Y1 - theta[1],
       (Y1 - theta[1])^2 - theta[2])
  }
}

The primary geex function is m_estimate, which requires two inputs: estFUN (the \(\psi\) function), data (the data frame containing \(O_i\) for \(i = 1, \dots, m\)). The package defaults to rootSolve::multiroot or estimating roots of \(\eqref{eq:psi}\), though the solver algorithm can be specified in the root_control argument. Starting values for rootSolve::multiroot are passed via the root_control argument; e.g.,

library(geex)
results <- m_estimate(
    estFUN = SB1_estfun, 
    data   = geexex,
    root_control = setup_root_control(start = c(1,1)))
n <- nrow(geexex)
A <- diag(1, nrow = 2)

B <- with(geexex, {
  Ybar <- mean(Y1)
  B11 <- mean( (Y1 - Ybar)^2 )
  B12 <- mean( (Y1 - Ybar) * ((Y1 - Ybar)^2 - B11) )
  B22 <- mean( ((Y1 - Ybar)^2 - B11)^2 )
  matrix(
    c(B11, B12,
      B12, B22), nrow = 2
  )
})

# closed form roots
theta_cls <- c(mean(geexex$Y1),
               # since var() divides by n - 1, not n
               var(geexex$Y1) * (n - 1)/ n ) 

# closed form sigma
Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / n

comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))

The m_estimate function returns an object of the S4 class geex, which contains an estimates slot and vcov slot for \(\hat{\theta}\) and \(\hat{\Sigma}\), respectively. These slots can be accessed by the functions coef (or roots) and vcov. The point estimates obtained for \(\theta_1\) and \(\theta_2\) are analogous to the base R functions mean and var (after multiplying by \(m-1/m\) for the latter). SB gave a closed form for \(A(\theta_0)\) (an identity matrix) and \(B(\theta_0)\) (not shown here) and suggest plugging in sample moments to compute \(B(\hat{\theta})\). The point and variance estimates using both geex and the analytic solutions are shown below}. The maximum absolute difference between either the point or variance estimates is 4e-11, thus demonstrating excellent agreement between the numerical results obtained from geex and the closed form solutions for this set of estimating equations and data.

## $geex
## $geex$estimates
## [1]  5.044563 10.041239
## 
## $geex$vcov
##            [,1]       [,2]
## [1,] 0.10041239 0.03667969
## [2,] 0.03667969 2.49219638
## 
## 
## $cls
## $cls$estimates
## [1]  5.044563 10.041239
## 
## $cls$vcov
##            [,1]       [,2]
## [1,] 0.10041239 0.03667969
## [2,] 0.03667969 2.49219638

Example 2: Ratio estimator

This example calculates a ratio estimator and illustrates the delta method via M-estimation. The estimating equations target the means of \(Y_1\) and \(Y_2\), labelled \(\theta_1\) and \(\theta_2\), as well as the estimand \(\theta_3 = \theta_1/ \theta_2\).

\[ \psi(Y_{1i}, Y_{2i}, \theta) = \begin{pmatrix} Y_{1i} - \theta_1 \\ Y_{2i} - \theta_2 \\ \theta_1 - \theta_3\theta_2 \end{pmatrix} \]

The solution to \(\eqref{eq:psi}\) for this \(\psi\) function yields \(\hat{\theta}_3 = \bar{Y}_1 / \bar{Y}_2\), where \(\bar{Y}_j\) denotes the sample mean of \(Y_{j1}, \ldots,Y_{jm}\) for \(j=1,2\).

SB2_estfun <- function(data){
  Y1 <- data$Y1; Y2 <- data$Y2
  function(theta){
      c(Y1 - theta[1],
        Y2 - theta[2],
        theta[1] - (theta[3] * theta[2]) 
    )
  }
}
results <- m_estimate(
    estFUN = SB2_estfun, 
    data  = geexex, 
    root_control = setup_root_control(start = c(1, 1, 1)))
# Comparison to an analytically derived sanwich estimator
A <- with(geexex, {
 matrix(
  c(1 , 0, 0,
    0 , 1, 0,
    -1, mean(Y1)/mean(Y2), mean(Y2)),
  byrow = TRUE, nrow = 3)
})

B <- with(geexex, {
  matrix(
    c(var(Y1)   , cov(Y1, Y2), 0,
      cov(Y1, Y2), var(Y2)   , 0,
      0, 0, 0),
    byrow = TRUE, nrow = 3)
})

## closed form roots
theta_cls <- c(mean(geexex$Y1), mean(geexex$Y2))
theta_cls[3] <- theta_cls[1]/theta_cls[2]
## closed form covariance
Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / nrow(geexex)
comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))

SB gave closed form expressions for \(A(\theta_0)\) and \(B(\theta_0)\), into which we plug in appropriate estimates for the matrix components and compare to the results from geex. The point estimates again show excellent agreement (maximum absolute difference 4.4e-16), while the covariance estimates differ by the third decimal (maximum absolute difference 1e-03).

comparison
## $geex
## $geex$estimates
## [1] 5.044563 2.012793 2.506250
## 
## $geex$vcov
##              [,1]         [,2]        [,3]
## [1,]  0.100412389 -0.008718712  0.06074328
## [2,] -0.008718712  0.010693092 -0.01764626
## [3,]  0.060743283 -0.017646263  0.05215103
## 
## 
## $cls
## $cls$estimates
## [1] 5.044563 2.012793 2.506250
## 
## $cls$vcov
##             [,1]        [,2]        [,3]
## [1,]  0.10142666 -0.00880678  0.06135685
## [2,] -0.00880678  0.01080110 -0.01782451
## [3,]  0.06135685 -0.01782451  0.05267781

Example 3: Delta method continued

This example extends Example 1 to again illustrate the delta method. The estimating equations target not only the mean (\(\theta_1\)) and variance (\(\theta_2\)) of \(Y_1\), but also the standard deviation (\(\theta_3\)) and the log of the variance (\(\theta_4\)) of \(Y_1\).

\[ \psi(Y_{1i}, \mathbf{\theta}) = \begin{pmatrix} Y_{1i} - \theta_1 \\ (Y_{1i} - \theta_1)^2 - \theta_2 \\ \sqrt{\theta_2} - \theta_3 \\ \log(\theta_2) - \theta_4 \end{pmatrix} \]

SB3_estfun <- function(data){
  Y1 <- data$Y1
  function(theta){
      c(Y1 - theta[1],
       (Y1 - theta[1])^2 - theta[2],
       sqrt(theta[2]) - theta[3],
       log(theta[2]) - theta[4])
  }
}
## closed form roots
theta_cls <- numeric(4)
theta_cls[1] <- mean(geexex$Y1)
theta_cls[2] <- sum((geexex$Y1 - theta_cls[1])^2)/nrow(geexex)
theta_cls[3] <- sqrt(theta_cls[2])
theta_cls[4] <- log(theta_cls[2])

## Compare to closed form ##
theta2 <- theta_cls[2]
mu3 <- moments::moment(geexex$Y1, order = 3, central = TRUE)
mu4 <- moments::moment(geexex$Y1, order = 4, central = TRUE)
## closed form covariance
Sigma_cls <- matrix(
  c(theta2, mu3, mu3/(2*sqrt(theta2)), mu3/theta2,
    mu3, mu4 - theta2^2, (mu4 - theta2^2)/(2*sqrt(theta2)), (mu4 - theta2^2)/theta2,
    mu3/(2 * sqrt(theta2)), (mu4 - theta2^2)/(2*sqrt(theta2)), (mu4 - theta2^2)/(4*theta2), (mu4 - theta2^2)/(2*theta2^(3/2)),
    mu3/theta2, (mu4 - theta2^2)/theta2, (mu4 - theta2^2)/(2*theta2^(3/2)), (mu4/theta2^2) - 1) ,
  nrow = 4, byrow = TRUE) / nrow(geexex)
## closed form covariance
# Sigma_cls <- (solve(A) %*% B %*% t(solve(A))) / n
comparison <- list(geex = list(estimates = coef(results), vcov = vcov(results)), 
                   cls  = list(estimates = theta_cls, vcov = Sigma_cls))

SB again provided a closed form for \(A(\theta_0)\) and \(B(\theta_0)\), which we compare to the geex results. The maximum absolute difference between geex and the closed form estimates for both the parameters and the covariance is 3.8e-11.

## $geex
## $geex$estimates
## [1]  5.044563 10.041239  3.168791  2.306700
## 
## $geex$vcov
##             [,1]       [,2]        [,3]        [,4]
## [1,] 0.100412389 0.03667969 0.005787646 0.003652905
## [2,] 0.036679687 2.49219638 0.393240840 0.248196105
## [3,] 0.005787646 0.39324084 0.062049026 0.039162582
## [4,] 0.003652905 0.24819611 0.039162582 0.024717678
## 
## 
## $cls
## $cls$estimates
## [1]  5.044563 10.041239  3.168791  2.306700
## 
## $cls$vcov
##             [,1]       [,2]        [,3]        [,4]
## [1,] 0.100412389 0.03667969 0.005787646 0.003652905
## [2,] 0.036679687 2.49219638 0.393240840 0.248196105
## [3,] 0.005787646 0.39324084 0.062049026 0.039162582
## [4,] 0.003652905 0.24819611 0.039162582 0.024717678

References

Stefanski, Len A., and Dennis D. Boos. 2002. The Calculus of M-Estimation 56. https://doi.org/10.1198/000313002753631330.