
Talk:Metropolis–Hastings algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Bryanrutherford0 (talk | contribs) at 23:42, 3 November 2013 (Adding class & importance to statistics rating template). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
WikiProject Statistics (C-class, Low-importance)
This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-class on Wikipedia's content assessment scale.
This article has been rated as Low-importance on the importance scale.

Hello. A comment on the title -- I've never seen the method in question called "Metropolis-Hastings Markov Chain Monte Carlo Sampling". A Google search finds only two instances of the words in the title ([1] and [2]), excluding many Wikipedia copies. I think the shorter names "Metropolis-Hastings algorithm" and "Markov chain Monte Carlo" are more common. I'm inclined to rename the article to "Markov chain Monte Carlo" and make "Metropolis-Hastings algorithm" redirect to that. Any comments? Happy editing, Wile E. Heresiarch 22:59, 6 Mar 2004 (UTC)

I think "Markov chain Monte Carlo" is often used for a more general class of algorithms, including Metropolis-Hastings and other algorithms. So, it might be better to rename this article "Metropolis-Hastings algorithm". Andris 23:06, 6 Mar 2004 (UTC)

OK, thanks for your input. I am going to rename the article to "Metropolis-Hastings algorithm" now. Wile E. Heresiarch 03:18, 7 Mar 2004 (UTC)

Shouldn't there be a separate entry for just the Metropolis algorithm? I don't think the extensions by Hastings are widely-known (Google: 34,300 vs. 9,310). 21:18, 26 Aug 2004 (UTC)

I agree. I am writing an article on the "plain" Metropolis algorithm, which I certainly believe to be more commonly used (at least it is in my field). Will add to Wikipedia when it's more properly formulated. Jackliddle 03:33, 20 Nov 2004 (UTC)

I think it would be clearer to say that if the proposed value is rejected then x^{t+1}=x^t. I got confused because I assumed that the algorithm was repeated (without stepping the time) until a proposed value was accepted.

I found this very confusing as well. The article should say exactly what happens when the proposed value is rejected. I don't know so I can't edit it. Dlakelan
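For what it's worth, here is a minimal Python sketch of the behaviour being asked about (my own illustration, assuming a symmetric Gaussian proposal and a standard-normal target): when the proposal is rejected, the chain records the current value again, i.e. x^{t+1} = x^t, and only then does time step forward.

```python
import math
import random

def target(x):
    # Unnormalized target density: standard normal up to a constant.
    return math.exp(-0.5 * x * x)

def metropolis_step(x_t, step=1.0):
    x_prop = x_t + random.gauss(0.0, step)   # symmetric proposal
    a = target(x_prop) / target(x_t)         # acceptance ratio
    if random.random() < a:                  # accept with probability min(1, a)
        return x_prop
    return x_t                               # rejected: the chain repeats x_t

random.seed(0)
chain = [0.0]
for _ in range(1000):
    chain.append(metropolis_step(chain[-1]))
# The repeated values produced by rejections are part of the sample,
# not discarded; that is what gives the correct stationary distribution.
```

Note that the loop never retries "until acceptance"; each iteration advances t exactly once.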

The N function needs to be defined. It seems to be a sum of Gaussians? The way it is written, it appears to be a single Gaussian in a high-dimensional space. --72.33.96.204 19:23, 16 May 2006 (UTC)[reply]

x prime or x_t?

Here are two conflicting comments on the case a>1, which I have cut and pasted for readability. --Rinconsoleao 15:12, 16 July 2007 (UTC)[reply]

There seems to be a typo in the last display. If a>1, we would accept the proposal, x', not x_t. --Youyifong 06:38, 28 July 2006 (UTC)[reply]
Hi, is there a mistake here? In the step-by-step instructions,
(Postdoc 02:30, 16 July 2007 (UTC))[reply]

The algorithm always accepts if a>1. That is, x^{t+1}=x' when a>1. Notice that this is consistent with the statements about the uniform random variable in the previous paragraph. I have rewritten the "step-by-step instructions" to make this clearer. --Rinconsoleao 15:12, 16 July 2007 (UTC)[reply]

Please check the page again. The page mis-typed x' as x_t, as Youyifong mentioned above. Also, should it be "a >= 1"? (Postdoc 05:38, 30 July 2007 (UTC))[reply]
Thanks! --Rinconsoleao 15:44, 31 July 2007 (UTC)[reply]
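A quick numerical illustration of the point settled above (my own sketch, not from the article): comparing a uniform draw u ~ U(0,1) against a accepts with probability min(1, a), so any a >= 1 is accepted with certainty, which is why "always accept when a > 1" and the uniform-variable rule are the same statement.

```python
import random

random.seed(0)

def accept_freq(a, trials=100000):
    # Empirical frequency of the rule: accept iff u < a, with u ~ U(0,1).
    return sum(random.random() < a for _ in range(trials)) / trials

print(accept_freq(2.5))   # a > 1: accepted every single time
print(accept_freq(0.3))   # a < 1: accepted in about 30% of trials
```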

The article says that the Gibbs sampler is a special case of the MH algorithm. As far as I know it is a complementary MCMC method which works quite differently. --Smeyen 01:07, 24 February 2007 (UTC)[reply]


In the Gibbs sampler, the proposal distribution is the full conditional distribution of some part of the parameter space given the rest, which always results in an acceptance probability of 1. See [http://www.stat.purdue.edu/~lebanon/notes/metropolis.pdf] (Postdoc 05:46, 30 July 2007 (UTC))[reply]
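One can check this claim numerically. The sketch below (my own example, using an illustrative bivariate normal with correlation 0.8) computes the Metropolis-Hastings ratio for a move that redraws y from its full conditional; the ratio comes out as exactly 1, so the move is always accepted.

```python
import math

RHO = 0.8  # correlation of the illustrative bivariate normal target

def p(x, y):
    # Unnormalized bivariate normal density, zero means, unit variances.
    return math.exp(-(x*x - 2*RHO*x*y + y*y) / (2 * (1 - RHO**2)))

def q(y_new, x):
    # Full conditional of y given x: Normal(RHO*x, 1 - RHO^2).
    var = 1 - RHO**2
    return math.exp(-(y_new - RHO*x)**2 / (2*var)) / math.sqrt(2*math.pi*var)

# Hastings ratio for the move (x, y) -> (x, y') with y' drawn from q(.|x):
x, y, y_new = 0.3, -1.2, 0.9
a = (p(x, y_new) * q(y, x)) / (p(x, y) * q(y_new, x))
print(a)  # 1.0 up to floating-point rounding
```

The cancellation is algebraic, not a numerical coincidence: p(x, y) factors as (marginal of x) times (conditional of y given x), and the conditional factor is precisely the proposal density.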

Subscripts instead of superscripts?

Shouldn't this article be using subscripts like x_t instead of superscripts like x^t? SteveWitham 05:10, 7 May 2007 (UTC)[reply]

yes please! when you've got x^t next to σ^2 the ambiguity is very obvious. wrtp. 17th Sept 2008 —Preceding unsigned comment added by 89.241.139.206 (talk) 14:54, 17 September 2008 (UTC)[reply]

The need for subscripts jumped out at me also. I have edited the article to change powers of x to subscripts. I'm disappointed it took 3 years to make that change. Would someone proofread it and ensure all the subscripts and superscripts are correct? Nickalh50 (talk) 17:59, 20 April 2011 (UTC)[reply]

Q confusion

The notation for Q is not consistent. Sometimes it is called Q(x'|x^t) and sometimes Q(x';x^t).

Also, with the present parameter order (other articles on the web call it Q(x^t, y), i.e. swap the parameters), I think the text is wrong when it talks about Q(x'|x^t) being independent of x'. I think it should be Q(x'|x^t) not depending on x^t, which makes more sense. -- BjarkeDahlEbert (talk) 22:30, 8 December 2007 (UTC)[reply]

The article now consistently uses the Q(x'|x^t) notation instead of Q(x';x^t). --Kinimod (talk) 12:50, 30 September 2012 (UTC)[reply]
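Whichever notation is used, the role of Q in the acceptance ratio is the same: the general ratio carries the Hastings correction Q(x^t|x') / Q(x'|x^t), which cancels exactly when Q is symmetric. A small sketch of my own (target and proposals are illustrative choices, not from the article):

```python
import math

def p(x):
    # Unnormalized target: Exp(1) density on x > 0.
    return math.exp(-x) if x > 0 else 0.0

def q_sym(x_new, x_old):
    # Symmetric Gaussian proposal density q(x_new | x_old).
    return math.exp(-0.5 * (x_new - x_old)**2) / math.sqrt(2 * math.pi)

def q_asym(x_new, x_old):
    # Asymmetric log-normal proposal density on x_new > 0.
    if x_new <= 0:
        return 0.0
    z = math.log(x_new) - math.log(x_old)
    return math.exp(-0.5 * z * z) / (x_new * math.sqrt(2 * math.pi))

def mh_ratio(q, x_t, x_prop):
    # q(x_t | x_prop) / q(x_prop | x_t) is the Hastings correction.
    return (p(x_prop) * q(x_t, x_prop)) / (p(x_t) * q(x_prop, x_t))

x_t, x_prop = 1.0, 2.0
print(mh_ratio(q_sym, x_t, x_prop))   # equals p(x_prop)/p(x_t): correction cancels
print(mh_ratio(q_asym, x_t, x_prop))  # differs: the correction matters
```

For the log-normal proposal the correction at (1, 2) is exactly 2, so ignoring it would bias the sampler.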

Multiple-try Metropolis

I have created an article called the Multiple-Try Metropolis algorithm, which describes how to sample more efficiently from high dimensional distributions. Hope it's useful :) Velocidex (talk) 02:49, 3 May 2008 (UTC)[reply]

I've moved it to Multiple-try Metropolis, with a lower-case t, as required by WP:MOS. In Wikipedia article titles, the first letter of the title, unlike the later letters, is case-insensitive in links and appears as a capital at the top of the article; later initial letters are to be lower case except when there's some special reason to capitalize them. Metropolis is in this case a person's name, so it has a capital M. Michael Hardy (talk) 16:47, 4 May 2008 (UTC)[reply]

Definition and notations

The symbol U(0;1) is not defined. I guess it is the uniform distribution on the interval [0;1]. Maybe it is obvious to people used to the statistics portal notation; it is not for others.

One symbol is defined, but not the other.

Vincent - 12/04/09

Complexity

The Markov chain is started from a random initial value x^0 and the algorithm is run for many iterations until this initial state is "forgotten".

No bounds are known on the number of iterations needed, e.g. the mixing time of the Markov chain. Can we really claim in the introduction that this algorithm allows one to compute something if we don't know how long it should be iterated? Levochik (talk) 07:43, 26 August 2009 (UTC)[reply]

Huh

This statement doesn't make any sense: "requiring only that a function dominating (being greater than or equal to) the density can be calculated at X". How about the function s(x) = P(x) + 1. That dominates P at x. Surely this isn't what you meant. —Preceding unsigned comment added by 64.134.176.249 (talk) 11:03, 22 December 2009 (UTC)[reply]

It is, but if you can compute P(x)+1 then surely P(x) should pose no challenge. 83.101.44.62 (talk) 18:22, 28 February 2010 (UTC)[reply]
The statement you refer to doesn't seem to actually apply to Metropolis-Hastings (it does to other samplers), and the article doesn't make use of this hypothetical dominating function. Someone anonymously changed 'proportional to' to 'dominating'. I will change it back Rjljr2 (talk) 15:30, 11 April 2010 (UTC)[reply]

Example

First of all, thank you for this article. It's the best I could find on the web about Monte Carlo simulation. But unfortunately, I still don't get it. How can I compute a=a_1a_2 if I don't know P(x)? Perhaps you could additionally include an example: something where one can see the step-by-step instructions applied to a simple case, like a record of a few steps from a real simulation. —Preceding unsigned comment added by Askamaster (talkcontribs) 10:30, 25 March 2010 (UTC)[reply]

You don't need P(x), you only need a function proportional to P(x), as the normalizations cancel in the ratio. It is quite often the case that you can easily determine a function proportional to P(x), but the integral to normalize it is very difficult to perform. This comes up quite often in Bayesian computations of the posterior distribution. Rjljr2 (talk) 15:49, 11 April 2010 (UTC)[reply]
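To make that answer concrete, here is a toy Bayesian example of my own devising (a coin's bias with a flat prior): the posterior is known only up to its normalizing constant, and the sampler only ever evaluates a ratio of unnormalized densities, in which that constant cancels.

```python
import math
import random

# Toy posterior for a coin's bias theta:
# p(theta | data) ∝ likelihood(theta) * prior(theta); the normalizing
# integral is never computed anywhere in this script.
heads, flips = 7, 10

def unnorm_post(theta):
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta**heads * (1 - theta)**(flips - heads)  # flat prior

random.seed(1)
theta = 0.5
samples = []
for _ in range(20000):
    prop = theta + random.gauss(0.0, 0.1)
    a = unnorm_post(prop) / unnorm_post(theta)  # normalization cancels here
    if random.random() < a:
        theta = prop
    samples.append(theta)

burned = samples[5000:]
print(sum(burned) / len(burned))  # close to the true posterior mean 2/3
```

The exact posterior here is Beta(8, 4), with mean 2/3, so the chain's long-run average can be checked against a known answer.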

Detailed balance

I believe Detailed balance and Ergodicity should at least be referred once on this page, as they are fundamental for the Markov chain to work for this algorithm. Jorgecarleitao (talk) 11:21, 10 June 2011 (UTC)[reply]
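Detailed balance is also easy to verify directly in a finite example. This sketch (my own construction, using a four-state target and a uniform proposal) builds the full Metropolis-Hastings transition matrix and checks pi_i P_ij = pi_j P_ji for every pair of states:

```python
# Build the MH transition matrix for a small discrete target and verify
# detailed balance pi_i * P[i][j] == pi_j * P[j][i] directly.
pi = [0.1, 0.2, 0.3, 0.4]           # target distribution over 4 states
n = len(pi)
prop = 1.0 / n                       # symmetric uniform proposal over all states

P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            # Accept a proposed move i -> j with probability min(1, pi_j/pi_i).
            P[i][j] = prop * min(1.0, pi[j] / pi[i])
    P[i][i] = 1.0 - sum(P[i])        # rejection mass stays at state i

for i in range(n):
    for j in range(n):
        assert abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
print("detailed balance holds")
```

Detailed balance plus ergodicity (here guaranteed because every entry of P is positive) is exactly what makes pi the stationary distribution.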

Adapting the proposal density after the measurements have started?

Is it allowed to change the proposal density after the measurements have started if, for example, some mechanism in the implementation discovers that the acceptance rate is much too low, say only about 5% or less, and decides that it has to do something about it? —Kri (talk) 00:58, 25 July 2011 (UTC)[reply]
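A common practical answer (my own summary, not from the article): adapting the proposal while collecting measurements changes the transition kernel and in general breaks the Markov-chain guarantees, unless diminishing-adaptation conditions hold. The usual workaround is to tune the proposal scale during a preliminary burn-in phase, discard those draws, and then sample with the scale frozen. A sketch under those assumptions:

```python
import math
import random

def target(x):
    return math.exp(-0.5 * x * x)    # unnormalized standard normal

random.seed(2)
x, scale = 0.0, 10.0                 # deliberately bad initial proposal scale
window_accepts = 0
# Tuning phase only: adjust the scale toward a moderate acceptance rate,
# then discard these draws and keep the final scale fixed for sampling.
for i in range(1, 5001):
    prop = x + random.gauss(0.0, scale)
    if random.random() < target(prop) / target(x):
        x = prop
        window_accepts += 1
    if i % 100 == 0:                 # adapt once per 100-step window
        rate = window_accepts / 100.0
        scale *= 1.1 if rate > 0.4 else 0.9
        window_accepts = 0
print(scale)  # has moved to a moderate value; now sample with it frozen
```

The 0.4 target rate, window size, and 1.1/0.9 factors are arbitrary tuning choices for the sketch, not canonical values.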

Proposal density

Clearly the Markov chain doesn't converge for all proposal densities Q. Certainly Q > 0 is sufficient, but I don't believe it is necessary. What condition is required for the proposal density? — Preceding unsigned comment added by WuTheFWasThat (talkcontribs) 19:24, 16 May 2012 (UTC)[reply]
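As a hedged illustration of why Q > 0 everywhere is sufficient but not necessary: a proposal with bounded support, say uniform on [x-1, x+1], assigns zero density to distant moves, yet any region can still be reached in several steps, so the chain remains irreducible. A quick simulation of my own (standard-normal target, chain started far out in the tail):

```python
import math
import random

def target(x):
    return math.exp(-0.5 * x * x)    # unnormalized standard normal

random.seed(4)
x, chain = 5.0, []                   # start far in the tail
for _ in range(20000):
    prop = x + random.uniform(-1.0, 1.0)   # Q(x'|x) = 0 outside [x-1, x+1]
    if random.random() < target(prop) / target(x):
        x = prop
    chain.append(x)

# Despite the bounded-support proposal, the chain leaves the tail and
# explores both sides of the mode.
print(min(chain), max(chain))
```

This only demonstrates the point empirically; the actual sufficient conditions (irreducibility and aperiodicity of the resulting chain) are what a careful statement should cite.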

Overview

The section Overview tries to give a formal derivation of the algorithm. The foregoing sections are so well written that they were very easy to understand for me, even though I am not an expert in the field. Bringing the section Overview to the same didactic level would be an asset. The questions I would like to have answered are: How can I see that the equation is still redundant, and what does the term redundant mean in this context? How does one come to the idea of defining the acceptance distribution as the specified minimum function?

Some other proposals for improvements: The meaning of one of the variables seems to be different from its meaning in the section Intuition. Would it be an option to use another notation? Furthermore, the section Overview repeats the definition of the algorithm from the section Intuition. However, at the same time it lists the steps of skipping samples to mitigate correlations. I suggest putting the complete list of steps into the previous section. Then a proper name for the section Overview would be, e.g., Formal Derivation.

Sorry, I am new to wikipedia and do not want to just poke your text. Please let me know if you wish me to do some of the mentioned improvements. —Contribo (talk) 20:20, 3 December 2012 (UTC)[reply]

I agree with you, "Overview" is not the proper name; maybe "Formal derivation", as you suggested, sounds better. Nevertheless, the Intuition section does not present the modern view of the Metropolis-Hastings algorithm, in the sense that it does not use a general proposal (it is always assumed symmetrical, which is not the general case). However, I think the Intuition part is there to convey the idea of the method; that is why it skips the fact that the proposal also has to be considered in the acceptance in the general case.
Redundant in the sense that there can be other choices of A which also fulfill the equation you mention (maybe under-determined). I don't recall my statistical physics lectures, but I know there is at least one other choice of A which also fulfills it and yet does not use the "min". Jorgecarleitao (talk) 20:53, 3 December 2012 (UTC)[reply]
Thank you for your explanations, I did not expect to get an answer so quickly. I bring these issues into the discussion not just because I would like to understand the matter but also to motivate you to work towards improving this article together with me. In my role as the learner I offer to ask you questions, and you as the author could modify the article so as to answer them. Of course I offer to do modifications as well, as far as I am qualified. Does this sound attractive?
In this sense let me point out some further aspects. You write that the proposal also has to be considered in the acceptance in the general case. I am not sure I understand you correctly. Isn't this considered in the section Step-by-step Instructions? This brings me to another point: the Step-by-step Instructions should include the skipping of T states to make the resulting sequence uncorrelated.
I noticed some more points regarding the section Overview: What is the role of ergodicity in this explanation? The text says that it can be shown that ergodicity is a requirement for the process to converge asymptotically. A reference to the proof would be great. How is it ensured that the process is ergodic? Another point: the Markov matrices in the wiki-reference have finite dimensions. However, the Markov matrix for the process under discussion is infinite-dimensional; this difference should at least be mentioned. --Contribo (talk) 21:24, 3 December 2012 (UTC)[reply]
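On the "skipping of T states" mentioned in this thread: that is plain thinning, and it is a post-processing step, not part of the transition rule. A sketch of my own showing where it fits (the target, step size, T, and burn-in length are arbitrary illustrative choices):

```python
import math
import random

def target(x):
    return math.exp(-0.5 * x * x)    # unnormalized standard normal

random.seed(3)
x, chain = 0.0, []
for _ in range(10000):
    prop = x + random.gauss(0.0, 0.5)
    if random.random() < target(prop) / target(x):
        x = prop
    chain.append(x)                  # every state is recorded, repeats included

T = 10                               # keep every T-th state to reduce autocorrelation
burn_in = 1000                       # also discard the initial transient
thinned = chain[burn_in::T]
print(len(thinned))  # (10000 - 1000) / 10 = 900 retained draws
```

Thinning trades sample size for lower autocorrelation; the full chain remains a valid sample either way.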