Talk:Regression analysis/Archive 2
This is an archive of past discussions about Regression analysis. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
External links
I removed two of the links from this article in accordance with WP:EL, yet as they constitute reliable sources I'll leave them here if anybody wants to write their contents into the article and cite them as references.
- Exegeses on Linear Models - Some comments on linear regression models by Bill Venables.
 - Perpendicular Regression of a Line at MathPages.
 
- ThemFromSpace 10:26, 30 July 2009 (UTC)
- We already have 2 (maybe more) articles devoted to the perpendicular regression of a line: Deming regression and Total least squares. ... stpasha » talk » 14:20, 30 July 2009 (UTC)
- An anon has been adding a link to a history of technical terms, which doesn't seem relevant. In his edit summary, he points out that it's referenced in other articles. — Arthur Rubin (talk) 19:27, 24 September 2009 (UTC)
 
 
 
Meaning of "linear" in linear regression
I moved this comment over from the main article:
- [ this explanation makes no sense -- I doubt that the quadratic can be introduced and the term "linear" retained! ]
 
This is a common misconception. As stated in the article, "linear regression" refers to models that are linear in the unknown parameters. This is not a controversial point, and reflects universally accepted use of the term among statistical researchers and practitioners. You are free to say that the "fitted regression function" or "estimated conditional mean function" are nonlinear in x, or that the "fitted relationship between the independent and dependent variables is nonlinear." But that does not change the fact that the practice described is linear regression. Skbkekas (talk) 21:37, 10 December 2009 (UTC)
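To make the distinction concrete (a standard textbook illustration, not an example given in this thread): a model can be nonlinear in x yet linear in the parameters, and it is linearity in the parameters that makes the estimation problem one of linear least squares.

```latex
% Linear in the parameters \beta_0, \beta_1, \beta_2 (hence "linear
% regression"), even though the fitted mean is quadratic in x:
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i
% Nonlinear in the parameter \beta_1; fitting this is nonlinear regression:
y_i = \beta_0 \, e^{\beta_1 x_i} + \varepsilon_i
```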
Disagreement with above paragraph: — Preceding unsigned comment added by Abhinavjha7 (talk • contribs) 04:09, 11 December 2011 (UTC)
I highly doubt the above statement that linear regression refers to models that are linear in the unknown parameters. A coefficient that is quadratic, say β², can still be written as γ, so the problem just reduces to finding a coefficient that is linear in the unknown parameter. Apart from this, there are multiple other references that state that the model should be linear in the independent variables. In fact, the main article on linear regression also expresses the independent variables as related to the dependent variable linearly, and most generally in matrix form. The field of multiple linear regression is actually about having multiple independent variables (see http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm). I request that this be corrected, since it can be very confusing for a reader. — Preceding unsigned comment added by Abhinavjha7 (talk • contribs) 04:07, 11 December 2011 (UTC)
Extrapolation versus interpolation
In high dimensions (even dimensions 5-10), extrapolation is needed for prediction, since the convex hull of a reasonably sized sample has very little volume. I believe that the current warnings against extrapolation rest on low-dimensional intuition --- i.e. on extrapolating from dimensions 1, 2 and 3! (A numerical sketch of the hull point appears below.)
I believe that the previous edit was providing a gloss on extrapolation. (Interpolation is a topic for deterministic models of perfect data.) Kiefer.Wolfowitz (talk) 14:49, 25 February 2010 (UTC)
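The hull claim above is easy to check numerically. Here is a minimal sketch (my own illustration, not code from this discussion): it samples standard-normal points and uses a linear-programming feasibility test to see how often a fresh draw from the same distribution lands inside the sample's convex hull.

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """Return True if x lies in the convex hull of `points`, i.e. if there
    exist lambda >= 0 with sum(lambda) = 1 and lambda @ points = x."""
    n = len(points)
    A_eq = np.vstack([points.T, np.ones(n)])  # d coordinate rows + 1 sum row
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

rng = np.random.default_rng(0)
for d in (2, 5, 10):
    sample = rng.standard_normal((100, d))
    inside = sum(in_hull(sample, rng.standard_normal(d)) for _ in range(200))
    print(f"dimension {d}: {inside / 200:.2f} of new points fall inside the hull")
```

Even at dimension 10 with 100 sample points, almost every new point falls outside the hull, so prediction there is extrapolation in the convex-hull sense.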
- "Needed" isn't the same as "gives good results" and things can't be better in high dimensions than they are in low dimensions. The term "interpolation" may have been appropriated to one meaning by one group of people but in other fields a phrase like "interpolation not passing exactly through the data values" would raise no problems, even in non-statistical situations. Anyway there was was certainly more to be said about extrapolation, and I have made a start by rasing the point into a higher-level section. Melcombe (talk) 17:16, 25 February 2010 (UTC)
- Melcombe, I support all that you wrote - thanks. Talgalili (talk) 18:51, 25 February 2010 (UTC)
 - Melcombe, I just made further edits to this section. It is not "well polished" yet, but I added some content which (according to my understanding of my thesis advisor's perspective on the subject) is (somewhat) correct. Talgalili (talk) 19:18, 25 February 2010 (UTC)
 
 
Can this image/notion be used in the article?
In this link:
http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/
There is a short post about how to visualize the coefficients (with standard deviations) of a regression. Do you think there is a place in the article where this could be used? Talgalili (talk) 15:53, 3 July 2010 (UTC)
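For readers without R, the same idea can be sketched in Python (the coefficient names and values below are hypothetical, chosen only for illustration): point estimates drawn with error bars of plus or minus two standard errors, with a zero reference line.

```python
import matplotlib.pyplot as plt
import numpy as np

names = ["intercept", "x1", "x2", "x3"]       # hypothetical predictors
coefs = np.array([1.2, 0.8, -0.3, 0.05])      # hypothetical estimates
ses = np.array([0.4, 0.1, 0.15, 0.2])         # hypothetical standard errors

pos = np.arange(len(names))
plt.errorbar(coefs, pos, xerr=2 * ses, fmt="o", capsize=4)
plt.axvline(0, linestyle="--", color="grey")  # zero reference line
plt.yticks(pos, names)
plt.xlabel("coefficient estimate (± 2 SE)")
plt.tight_layout()
plt.show()
```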
Notation
The section "Linear regression" briefly mentions the sum of squared residuals, but denotes it SSE rather than SSR. Would anyone object if I change it to SSR, since that's an acronym for "sum of squared residuals"? SSE is an acronym for "sum of squared errors"; the latter is generally viewed as incorrect terminology because "errors" is conventionally used for the unobserved errors in the true model, not the regression residuals. Leaving it like it is could cause the reader to become confused between the two concepts. Duoduoduo (talk) 18:46, 19 November 2010 (UTC)
- SSR is an acronym for "sum of squares for regression". To use it for "sum of squared residuals" would be confusing. Most textbooks on linear models use SSE for the sum of squared residuals. Making the change of notation could cause readers and too many students to become confused about the traditional use of the notation in computing formulae and ANOVA tables. In this context the meaning of SSE is clear and conforms to the use in the literature. Mathstat (talk) 19:26, 19 November 2010 (UTC)
 
- If you search Wikipedia for SSR, you get a disambiguation page one of whose entries goes to Sum of squared residuals, which redirects to Residual sum of squares. That article uses the acronym RSS, as does the article Sum of squares. On the other hand, regression analysis uses SSE. It's too bad one notation is not universally used. Duoduoduo (talk) 20:44, 19 November 2010 (UTC)
 
- In "Applied Linear Regression Models" 4th ed. by Kutner, Nachtsheim, and Neter (2004), page 25
- "Hence the deviations are the residuals ... and the appropriate sum of squares, denoted by SSE, is ... where SSE stands for the error sum of squares or residual sum of squares."
 
 - In "A First Course in Linear Model Theory", by Ravishankar and Dey (2002), p. 101:
- "Definition 4.2.3. Sums of squares. Let SST, SSR, and SSE respectively denote the total variation in Y, the variation explained by the fitted model, and the unexplained (residual) variation. ... SST=SSR+SSE, where SSR is the model sum of squares and SSE is the error sum of squares."
 
 - So the acronym SSE is used correctly for the error sum of squares; it is also correct for the residual sum of squares, but it would be very confusing to use SSR. Perhaps insert a note indicating that RSS is also used. In fact, the notation SSE is used very consistently: I could find only one textbook out of dozens that uses RSS instead of SSE in this context, and none use SSR. I am inserting the references in the article for clarification. (The decomposition behind these acronyms is written out below.) Mathstat (talk) 05:36, 20 November 2010 (UTC)
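For reference, the decomposition behind the acronyms quoted above, stated in the notation of the Ravishanker and Dey definition:

```latex
% Total, regression (model), and error (residual) sums of squares,
% with fitted values \hat{y}_i and sample mean \bar{y}:
SST = \sum_{i=1}^n (y_i - \bar{y})^2, \qquad
SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2, \qquad
SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2.
% For least squares fits with an intercept, SST = SSR + SSE.
```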