Maximum score estimator

In statistics and econometrics, the maximum score estimator is a nonparametric estimator for discrete choice models developed by Charles Manski in 1975. Unlike the multinomial probit and multinomial logit estimators, it makes no assumptions about the distribution of the unobservable part of utility. However, its statistical properties (particularly its asymptotic distribution) are more complicated. To address the statistical issues, Joel Horowitz proposed a variant, called the smoothed maximum score estimator.

Setting

In discrete choice models, the choice is typically assumed to be determined by a comparison of underlying latent utilities.^[1] Denote the population of agents by $T$ and the common choice set for each agent by $C$ . For agent $t\in T$ , denote her choice by $y_{t,i}$ , which equals 1 if alternative $i$ is chosen and 0 otherwise. Assume that latent utility is linear in the explanatory variables and that there is an additive response error. Then for an agent $t\in T$ ,

y_{t,i}=1\leftrightarrow x_{t,i}\beta +\epsilon _{t,i}>x_{t,j}\beta +\epsilon _{t,j},\quad \forall j\neq i,\ j\in C

where $x_{t,i}$ and $x_{t,j}$ are the $q$ -dimensional observable covariates describing the agent and the alternative, and $\epsilon _{t,i}$ and $\epsilon _{t,j}$ are decision errors arising from cognitive limitations or incomplete information. The construction of the observable covariates is very general. For instance, if $C$ is a set of different brands of coffee, then $x_{t,i}$ includes characteristics both of the agent $t$ , such as age, gender, income and ethnicity, and of the coffee $i$ , such as price, taste and whether it is local or imported. All of the error terms are assumed to be i.i.d., and the goal is to estimate $\beta$ , which characterizes the effect of the different factors on the agent's choice.
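The latent-utility rule above can be illustrated with a short simulation. This is only a sketch: the function name `simulate_choices`, the standard normal errors and the example value of `beta` are assumptions made for illustration, since the model itself places no restriction on the error distribution.

```python
import random

def simulate_choices(x, beta, rng):
    """Simulate choices from the linear latent-utility model.

    x[t][i] is the covariate vector of alternative i for agent t.
    Returns y with y[t][i] = 1 for the alternative with the highest
    latent utility x[t][i] . beta + error, and 0 otherwise.
    """
    y = []
    for alternatives in x:
        utilities = [
            sum(xk * bk for xk, bk in zip(x_ti, beta)) + rng.gauss(0.0, 1.0)
            for x_ti in alternatives  # Gaussian errors are an assumption
        ]
        best = max(range(len(utilities)), key=utilities.__getitem__)
        y.append([1 if i == best else 0 for i in range(len(utilities))])
    return y

rng = random.Random(0)
beta = [1.0, -0.5]  # hypothetical true parameter
# 5 agents, 3 alternatives, 2 covariates per alternative
x = [[[rng.uniform(-1, 1) for _ in beta] for _ in range(3)] for _ in range(5)]
y = simulate_choices(x, beta, rng)
```

Each row of `y` contains exactly one 1, marking the chosen alternative.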

Parametric estimators

Usually a specific distributional assumption is imposed on the error term, so that the parameter $\beta$ can be estimated parametrically. For instance, if the error term is assumed to be normally distributed, the model is a multinomial probit model;^[2] if it is assumed to follow an extreme value distribution, the model becomes a multinomial logit model. The parametric model ^[3] is convenient for computation, but the resulting estimator may be inconsistent if the distribution of the error term is misspecified.^[4]

Binary response

Suppose, for example, that $C$ contains only two alternatives; this gives the latent utility representation^[5] of the binary response model. In this model, the choice is $Y_{t}=1[X_{1,t}\beta +\varepsilon _{1}>X_{2,t}\beta +\varepsilon _{2}]$ , where $X_{1,t},X_{2,t}$ are two vectors of explanatory covariates, $\varepsilon _{1}$ and $\varepsilon _{2}$ are i.i.d. response errors, and

X_{1,t}\beta +\varepsilon _{1}{\text{ and }}X_{2,t}\beta +\varepsilon _{2}

are the latent utilities of choosing alternatives 1 and 2. The log-likelihood function is then:

Q=\sum _{t=1}^{N}Y_{t}\log(P[X_{1,t}\beta -X_{2,t}\beta >\varepsilon _{2}-\varepsilon _{1}])+(1-Y_{t})\log(1-P[X_{1,t}\beta -X_{2,t}\beta >\varepsilon _{2}-\varepsilon _{1}])

If a distributional assumption is imposed on the response error, the log-likelihood function takes a specific closed-form representation.^[2] For instance, if the response errors are assumed to be distributed as $N(0,\sigma ^{2})$ , then the log-likelihood function can be rewritten as:

Q=\sum _{t=1}^{N}Y_{t}\log \left(\Phi \left[{\frac {X_{1,t}\beta -X_{2,t}\beta }{{\sqrt {2}}\sigma }}\right]\right)+(1-Y_{t})\log \left(\Phi \left[{\frac {X_{2,t}\beta -X_{1,t}\beta }{{\sqrt {2}}\sigma }}\right]\right)

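where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The scale factor arises because the difference $\varepsilon _{2}-\varepsilon _{1}$ of two independent $N(0,\sigma ^{2})$ errors is distributed as $N(0,2\sigma ^{2})$ . Although $\Phi$ has no closed-form representation, its derivative does. This is the probit model.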
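The probit log-likelihood above can be evaluated numerically as in the following sketch. The function names and the clipping constant `eps` are illustrative assumptions; $\Phi$ is computed from the error function in Python's standard library.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_loglik(beta, x1, x2, y, sigma=1.0, eps=1e-12):
    """Binary probit log-likelihood Q for errors distributed N(0, sigma^2).

    x1[t], x2[t] are the covariate vectors of alternatives 1 and 2,
    and y[t] = 1 if alternative 1 was chosen by agent t.
    """
    total = 0.0
    for x1_t, x2_t, y_t in zip(x1, x2, y):
        index = sum((a - c) * bk for a, c, bk in zip(x1_t, x2_t, beta))
        p = std_normal_cdf(index / (math.sqrt(2.0) * sigma))
        # Clip to avoid log(0); eps is an arbitrary numerical safeguard.
        p = min(max(p, eps), 1.0 - eps)
        # 1 - p equals Phi of the negated index, matching the second term of Q.
        total += y_t * math.log(p) + (1 - y_t) * math.log(1.0 - p)
    return total
```

In practice $\beta$ and $\sigma$ are not separately identified in this model, so $\sigma$ is usually fixed (here at 1 by default) before maximizing over $\beta$ .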

This model rests on a distributional assumption about the response error term. Adding a specific distributional assumption can make the model computationally tractable because a closed-form representation exists. But if the distribution of the error term is misspecified, estimates based on that assumption will be inconsistent.

The basic idea of the distribution-free model is to replace the two probability terms in the log-likelihood function with other weights. The general form of the log-likelihood function can be written as:

Q=\sum _{t=1}^{N}Y_{t}\cdot \log(W_{1}(X_{1,t}\beta ,X_{2,t}\beta ))+(1-Y_{t})\log(W_{0}(X_{1,t}\beta ,X_{2,t}\beta ))

Maximum score estimator

To make the estimator more robust to the distributional assumption, Manski (1975) proposed a nonparametric approach to estimating the parameters. Denote the number of elements in the choice set by $J$ and the total number of agents by $N$ , and let $W(J-1)>W(J-2)>\dots >W(1)>W(0)$ be a sequence of real numbers. The maximum score estimator ^[6] is defined as:

{\hat {b}}={\operatorname {arg\max } }_{b}{\frac {1}{N}}\sum _{t=1}^{N}\sum _{i=1}^{J}y_{t,i}W(\sum \nolimits _{j\in C,j\neq i}1(x_{t,i}b>x_{t,j}b))

Here, $\textstyle \sum \nolimits _{j\in C,j\neq i}1(x_{t,i}b>x_{t,j}b)$ is the rank of the deterministic part of the underlying utility of choosing $i$ , that is, the number of alternatives whose deterministic utility lies below that of $i$ . The intuition is that the higher the rank of the chosen alternative, the more weight it receives; on this basis, an objective function analogous to the likelihood function of the parametric model is constructed.
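The score objective can be computed directly from its definition, as in the sketch below. The function name `max_score_objective` and the representation of $W$ as a list `weights` with `weights[r]` equal to $W(r)$ are assumptions made for illustration.

```python
def max_score_objective(b, x, y, weights):
    """Sample score objective for a candidate parameter vector b.

    x[t][i] is the covariate vector of alternative i for agent t,
    y[t][i] = 1 if agent t chose alternative i, and weights[r] = W(r)
    for ranks r = 0, ..., J-1 (an increasing sequence).
    """
    total = 0.0
    for alternatives, choices in zip(x, y):
        # Deterministic part of utility for each alternative.
        v = [sum(xk * bk for xk, bk in zip(x_ti, b)) for x_ti in alternatives]
        for i, y_ti in enumerate(choices):
            if y_ti:
                # Rank = number of other alternatives strictly below i.
                rank = sum(1 for j, v_j in enumerate(v) if j != i and v[i] > v_j)
                total += weights[rank]
    return total / len(x)
```

Because the objective is a step function of `b`, it cannot be maximized with gradient-based methods; a grid or other derivative-free search is needed.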

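Under certain conditions, the maximum score estimator is weakly consistent, but its asymptotic behaviour is complicated:^[7] it converges at the rate $N^{-1/3}$ rather than the usual $N^{-1/2}$ , and its limiting distribution is not normal. This stems mainly from the non-smoothness of the objective function.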

For more on the consistency and asymptotic properties of the maximum score estimator, see Manski (1975).

Binary example

In the binary context, the maximum score estimator can be represented as:

W_{1}(X_{1,t}\beta ,X_{2,t}\beta )=w_{1}1[X_{1,t}\beta -X_{2,t}\beta >0]+w_{0}1[X_{1,t}\beta -X_{2,t}\beta <0],

where

W_{0}(X_{1,t}\beta ,X_{2,t}\beta )=1-W_{1}(X_{1,t}\beta ,X_{2,t}\beta )

and $w_{1}$ and $w_{0}$ are two constants in (0,1). The intuition behind this weighting scheme is that the probability of a choice depends on the relative order of the deterministic parts of the utilities.
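A minimal sketch of the binary case follows, assuming two covariates and a grid search over the unit circle. The names `binary_score` and `grid_search`, the default weights 0.75 and 0.25, and the treatment of ties are illustrative assumptions. The search is restricted to unit-length vectors because the objective depends on `b` only through the sign of the index, so `b` can be determined at most up to a positive scale factor.

```python
import math

def binary_score(b, x1, x2, y, w1=0.75, w0=0.25):
    """Objective Q with the step weights W_1 and W_0 = 1 - W_1."""
    q = 0.0
    for x1_t, x2_t, y_t in zip(x1, x2, y):
        index = sum((a - c) * bk for a, c, bk in zip(x1_t, x2_t, b))
        # Ties (index == 0) are assigned w0 here; this is an arbitrary choice.
        W1 = w1 if index > 0 else w0
        q += y_t * math.log(W1) + (1 - y_t) * math.log(1.0 - W1)
    return q

def grid_search(x1, x2, y, n_grid=360):
    """Return the unit vector on an angular grid that maximizes binary_score."""
    candidates = [
        (math.cos(2 * math.pi * k / n_grid), math.sin(2 * math.pi * k / n_grid))
        for k in range(n_grid)
    ]
    return max(candidates, key=lambda b: binary_score(b, x1, x2, y))
```

Since the objective is piecewise constant, many grid points typically attain the same maximum; `max` returns the first one it encounters.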

Smoothed maximum score estimator

Horowitz (1992) proposed a smoothed maximum score (SMS) estimator, which has much better asymptotic properties.^[8] The basic idea is to replace the non-smooth weight function $\textstyle W(\sum \nolimits _{j\in C,j\neq i}1(x_{t,i}b>x_{t,j}b))$ with a smooth one. Define a smooth kernel function $K$ satisfying the following conditions:

$|K(\cdot )|$ is bounded over R
$\lim _{u\to -\infty }K(u)=0$ and $\lim _{u\to +\infty }K(u)=1$
${\dot {K}}(u)={\dot {K}}(-u)$

Here, the kernel function is analogous to a CDF whose PDF is symmetric around 0. Then, the SMS estimator is defined as:

{\hat {b}}_{SMS}={\operatorname {arg\max } }_{b}{\frac {1}{N}}\sum _{t=1}^{N}\sum _{i=1}^{J}y_{t,i}\sum \nolimits _{j\in C,j\neq i}K\left({\frac {x_{t,i}b-x_{t,j}b}{h_{N}}}\right)

where $(h_{N},N=1,2,...)$ is a sequence of strictly positive numbers with $\lim _{N\to +\infty }h_{N}=0$ . The intuition is the same as for the traditional MS estimator: an agent is more likely to choose the alternative with the higher deterministic part of utility. Under certain conditions, the SMS estimator is consistent and, more importantly, asymptotically normal. Therefore, the usual tests and inference based on an asymptotic distribution can be implemented.^[9]
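The smoothed objective can be sketched as follows, taking the standard normal CDF as the kernel $K$ . This kernel is one convenient choice that satisfies the three conditions listed above, not necessarily the one used in the original paper; the function names and the bandwidth argument `h` are likewise illustrative assumptions.

```python
import math

def kernel(u):
    """Standard normal CDF: bounded, tends to 0 and 1, symmetric derivative."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def sms_objective(b, x, y, h):
    """Smoothed score objective for candidate b and bandwidth h > 0.

    x[t][i] is the covariate vector of alternative i for agent t and
    y[t][i] = 1 if agent t chose alternative i.
    """
    total = 0.0
    for alternatives, choices in zip(x, y):
        v = [sum(xk * bk for xk, bk in zip(x_ti, b)) for x_ti in alternatives]
        for i, y_ti in enumerate(choices):
            if y_ti:
                # Each indicator 1(v_i > v_j) is replaced by K((v_i - v_j) / h).
                total += sum(
                    kernel((v[i] - v_j) / h) for j, v_j in enumerate(v) if j != i
                )
    return total / len(x)
```

As `h` shrinks toward zero, each kernel term approaches the corresponding indicator, so the smoothed objective approaches the count of alternatives ranked below the chosen one. Unlike the step-function objective, it is differentiable in `b` for any fixed `h`.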

References

^ For more examples, refer to: Smith, Michael D. and Brynjolfsson, Erik, Consumer Decision-Making at an Internet Shopbot (October 2001). MIT Sloan School of Management Working Paper No. 4206-01.
^ ^a ^b Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, Mass: MIT Press. pp. 457–460. ISBN 978-0-262-23219-7.
^ For a concrete example, refer to: Tetsuo Yai, Seiji Iwakura, Shigeru Morichi, Multinomial probit with structured covariance for route choice behavior, Transportation Research Part B: Methodological, Volume 31, Issue 3, June 1997, Pages 195-207, ISSN 0191-2615
^ Jin Yan (2012), "A Smoothed Maximum Score Estimator for Multinomial Discrete Choice Models", Working Paper.
^ Walker, Joan; Ben-Akiva, Moshe (2002). "Generalized random utility model". Mathematical Social Sciences. 43 (3): 303–343. doi:10.1016/S0165-4896(02)00023-9.
^ Manski, Charles F. (1975). "Maximum Score Estimation of the Stochastic Utility Model of Choice". Journal of Econometrics. 3 (3): 205–228. CiteSeerX 10.1.1.587.6474. doi:10.1016/0304-4076(75)90032-9.
^ Kim, Jeankyung; Pollard, David (1990). "Cube Root Asymptotics". Annals of Statistics. 18 (1): 191–219. doi:10.1214/aos/1176347498. JSTOR 2241541.
^ Horowitz, Joel L. (1992). "A Smoothed Maximum Score Estimator for the Binary Response Model". Econometrica. 60 (3): 505–531. doi:10.2307/2951582. JSTOR 2951582.
^ For a survey study, refer to: Jin Yan (2012), "A Smoothed Maximum Score Estimator for Multinomial Discrete Choice Models", Working Paper.
