This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
In the analysis of variance section, what is m in the formula for the statistic involving R? -- SolarMcPanel ( talk) 19:40, 5 April 2009 (UTC)
The front page of this article tells me that someone has suggested it be merged with Multiple regression. I agree that it should be. Also, there are articles on
could also be merged in.
Please add to the above list if there are others.
Personally, I'd prefer "Linear model" as the title.
Since this is a subject on which a great many books have been written, an article on it is not going to be anything like comprehensive. It might therefore be sensible to lay down some rules on the content, such as the level of mathematical and theoretical rigour.
Perhaps someone should start a Wikibook to cover the gaps...
—The preceding unsigned comment was added by Tolstoy the Cat ( talk • contribs) .
In reference to recent edits which change stuff like <math>x_i</math> to ''x''<sub>''i''</sub> -- the math-tag processor is smart enough to use html markup in simple cases (instead of generating an image via latex). It seems that some of those changes weren't all that helpful, as the displayed text is unchanged and the html markup is harder to edit. I agree the in-line <math>x_1,\ldots,x_n</math> wasn't pretty; however, it does seem necessary to clarify "variables" for the benefit of readers who won't immediately see x as a vector. Wile E. Heresiarch 16:41, 2 Feb 2004 (UTC)
In reference to my recent edit on the paragraph containing the eqn y = a + bx + cx^2 + e, I moved the discussion of that eqn up into the section titled "Statement of the linear regression model" since it has to do with characterizing the class of models which are called "linear regression" models. I don't think it could be readily found in the middle of the discussion about parameter estimation. Wile E. Heresiarch 00:43, 10 Feb 2004 (UTC)
I have a question about the stronger set of assumptions (independent, normally distributed, equal variance, mean zero). What can be proven from these that can't be proven from assuming uncorrelated, equal variance, mean zero? Presumably there is some result stronger than the Gauss-Markov theorem. Wile E. Heresiarch 02:42, 10 Feb 2004 (UTC)
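For reference, one standard answer, sketched under the usual fixed-design model: normality buys exact finite-sample distributions, not just the Gauss–Markov optimality of the estimator. With $X$ the $n \times p$ design matrix,

$$\hat{\boldsymbol\beta} \;\sim\; N\!\left(\boldsymbol\beta,\ \sigma^2 (X^\top X)^{-1}\right)$$

holds exactly rather than asymptotically, which yields exact t and F tests and confidence intervals; moreover, under normality the least-squares estimator coincides with the maximum-likelihood estimator.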
Hello. In taking a closer look at Galton's 1885 paper, I see that he used a variety of terms -- "mean filial regression towards mediocrity", "regression", "regression towards mediocrity" (p 1207), "law of regression", "filial regression" (p 1209), "average regression of the offspring", "filial regression" (p 1210), "ratio of regression", "tend to regress" and "tendency to regress", "mean regression", "regression" (p 1212) -- although not exactly "regression to the mean". So it seems that the claim that Galton specifically used the term "regression to the mean" should be substantiated. -- Also this same paper shows that Galton was aware that regression works the other way too (parents are less exceptional than their children). I'll probably tinker with the history section in a day or two. Happy editing, Wile E. Heresiarch 06:00, 27 Mar 2004 (UTC)
I am confused -- I don't like the notation of d being the solution vector -- how about using Beta1 and Beta0?
The treatment is excellent but largely theoretical. It would be helpful to include additional material describing how regression is actually used by scientists. The following paragraph is a draft or outline of an introduction to this aspect of linear regression. (Needs work.)
Linear regression is widely used in biological and behavioral sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines. For example, early evidence relating cigarette smoking to mortality came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized experiments are considered to be more trustworthy than a regression analysis.
I also don't see any treatment of statistical significance testing with regression. Scientific researchers commonly test the statistical significance of the observed regression and place considerable emphasis on p values associated with r-squared and the coefficients of the equation. It would be nice to see a practical discussion of how one does this and how to interpret the results of such an analysis. --anon
I changed the terms independent / dependent variable to explanatory / response variable. This was to make the article more in line with the terminology used by the majority of statistics textbooks, and because independent / dependent variables are not statistically independent. Predictor / response might be ideal terminology, I'll see what textbooks are using.-- Theblackgecko 23:06, 11 April 2006 (UTC)
The terms independent and dependent are actually more precise in describing linear regression as such - terms like predictor/response or explanatory/response are related to the application of regression to different problems. Try checking a non-applied stats book.
Scientists tend to speak of independent/dependent variables, but statistics texts (such as mine) prefer explanatory/response. (These two pairs are not strictly interchangeable, in any event, though usually they are.) Here is a table a professor wrote up for the terms: http://www.tufts.edu/~gdallal/slr.htm.
At a broad and simplistic level, we can say that Linear Regression is generally used to estimate or project, often from a sample to a population. It is an estimating technique. Multiple Regression is often used as a measure of the proportion of variability explained by the Linear Regression.
Multiple Regression can be used in an entirely different scenario. For example when the data collected is a census rather than a sample, and Linear Regression is not necessary for estimating. In this case, Multiple Regression can be used to develop a measure of the proportionate contribution of independent variables.
Consequently, I would propose that merging Multiple Regression into the discussion of Linear Regression would bury it in an inappropriate and, in the case of the example given, a relatively unrelated topic.
-- 69.119.103.162 20:42, 5 August 2006 (UTC)A. McCready Aug. 5, 2006
Would it be within the scope of Wikipedia to mention or reference how to do linear regressions in Excel and scientific / financial calculators? I think it would be very helpful because that's how 99% of people will actually do a linear regression - no one cares about the stupid math to get to the coefficients and the R squared.
In Excel, there is the scatter plot for single variable linear regressions, and the linear regression data analysis and the linest() family of functions for multi-variable linear regressions.
I believe the hp 12c only supports single variable linear regressions. —The preceding unsigned comment was added by 12.196.4.146( talk • contribs) 15:21, 15 August 2006 (UTC)
I am not a mathematician, but a scientist. I am most familiar with the use of linear regression to calculate a best fit y = mx + b to suit a particular set of ordered pairs of data that are likely to be in a linear relationship. I was wondering if a much simpler explanation of how to calculate this without use of a calculator or knowing matrix notation might be fit into the article somewhere. The texts at my level of full mathematical understanding merely instruct the student to plug the coordinates or other information into a graphing calculator, giving a 'black box' sort of feel to the discussion. I would suspect many others who would consult Wikipedia about this sort of calculation would not be the same audience that the discussion in this entry seems to address, enlightening though it may be. I am not suggesting that the whole article be 'dumbed down' for us feeble non-mathematicians, but merely include a simplified explanation for the modern student wishing a slightly better explanation than "pick up your calculator and press these buttons", as is currently provided in many popular college math texts. The24frans 16:21, 18 September 2006 (UTC)frannie
Why are these labeled as "technically incorrect"? There is no explanation or citation given for this reasoning. The terms describe the relationship between the two variables. One is independent because it determines the other. Thus the second term is dependent on the first. Linear regressions have independent variables, and it is not incorrect to describe them as such. If one is examining, say, the amount of time spent on homework and its effect on GPA, then the hypothetical equation would be:
GPA = m * (homework time) + b
where homework time is the independent variable (independent from GPA), and GPA is dependent (on homework time, as shown in the equation). I will remove the references to these terms as technically incorrect unless someone can refute my reasoning. -- Chris53516 21:06, 3 October 2006 (UTC)
--->I added the terms, and I also added the note that they are technically incorrect. They are technically incorrect because they are considered to imply causation. (unsigned comment by 72.87.187.241)
I ran into linear regression for the purpose of forecasting. I believe this is done reasonably often for business planning; however, I suspect it is statistically incorrect to extend the model outside of its original range for the independent variables. Not that I am by any means a statistician. Worthy of a mention? -K
I have the problem of combining some indicators using a weighted sum. All weights have to be located in the range from 0 to 1. And the weights should add to one.
The probability distribution is rather irregular; therefore the application of the EM-algorithm would be rather difficult.
Therefore I am thinking about using a linear regression with a Lagrange condition that all weights sum to one.
One problem which can emerge is that a weight derived by linear regression might be negative. I have the idea to filter out indicators with negative weights and redo the linear regression with the remaining indicators until all weights are positive.
Is this sensible, or does someone know a better solution? Or is it better to use neural networks? (unsigned comments of Nulli)
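A minimal sketch of one direct way to do this, assuming Python with NumPy and SciPy (the indicator matrix X and target y below are synthetic placeholders): constrained least squares with box bounds and a sum-to-one equality constraint, which avoids the delete-and-refit loop described above.

```python
# Sketch: least-squares weights constrained to [0, 1] and summing to one,
# via scipy.optimize.minimize with the SLSQP solver.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # hypothetical indicator matrix
y = X @ np.array([0.4, 0.3, 0.2, 0.1]) + 0.05 * rng.normal(size=100)

def sse(w):
    """Sum of squared residuals for weight vector w."""
    r = y - X @ w
    return r @ r

res = minimize(
    sse,
    x0=np.full(4, 0.25),                      # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 4,                  # 0 <= w_i <= 1
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
print(res.x)                                  # constrained weight estimates
```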
The page gives useful formulae for estimating alpha and beta, but it does not give equations for the level of uncertainty in each (standard error, I think). For you mathematicians out there, I'd love it if this were available on wikipedia so I don't have to hunt through books. 99of9 00:11, 26 October 2006 (UTC)
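For the record, assuming the simple model $y_i = \alpha + \beta x_i + \varepsilon_i$ with independent, equal-variance errors, the standard-error formulas being asked for are

$$s^2=\frac{1}{n-2}\sum_{i=1}^{n}\hat\varepsilon_i^{\,2},\qquad \operatorname{se}(\hat\beta)=\frac{s}{\sqrt{\sum_{i}(x_i-\bar x)^2}},\qquad \operatorname{se}(\hat\alpha)=s\sqrt{\frac{1}{n}+\frac{\bar x^{2}}{\sum_{i}(x_i-\bar x)^2}},$$

where $\hat\varepsilon_i$ are the residuals.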
Would it be worth mentioning in this article that the linear regression is simply a special case of a polynomial fit? Below I copy part of the article rephrased to the more general case:
I recall using this in the past and it worked quite well for fitting polynomials to data. Simply zero out the epsilons and solve for the alpha coefficients, and you have a nice polynomial. It works as long as m < n. For the linear regression, of course, m = 1. - Amatulic 18:57, 30 October 2006 (UTC)
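A minimal sketch of the point, assuming Python with NumPy (the data are synthetic): a degree-m polynomial fit is just a linear regression on the columns $1, x, \ldots, x^m$ of a design matrix, solved by ordinary least squares.

```python
# Sketch: polynomial fitting as linear regression on powers of x.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2 - 3 * x + 5 * x**2 + 0.1 * rng.normal(size=x.size)

m = 2                                         # polynomial degree (m = 1 is a line)
X = np.vander(x, m + 1, increasing=True)      # columns: x^0, x^1, ..., x^m
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
print(coef)                                   # estimated m + 1 coefficients
```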
There are several ways to do regression in Excel. If you use the LINEST function in Excel you can show the results to any precision. Similarly if you use the Data Analysis ToolPak or the Solver. Blaise ( talk) 13:24, 28 August 2008 (UTC)
This section should be rewritten as it is not general: polynomial regression is only a special case of multiple regression. And there is no formula for the correlation coefficient, which, in the case of multiple regression, is called the coefficient of determination. Any comments?
TomyDuby 19:29, 2 January 2007 (UTC)
I accept your comment. But I would like to see the comment I made above about the correlation coefficient fixed. TomyDuby 21:24, 10 January 2007 (UTC)
Thanks! I made one more change. I consider the issue closed. TomyDuby 13:34, 11 January 2007 (UTC)
The French version of this article seems to be better written and more organised. If there are no objections, I intend to translate that article into English and incorporate it into this article in the next month or so. Another point: all the articles on regression tend to be poorly written, repetitive, and not clear on the terminology, including linear regression, least squares and its many derivatives, multiple linear regression... Woollymammoth 00:08, 21 January 2007 (UTC)
I used {{redirect5|Line of best fit|the song "Line of Best Fit"|You Can Play These Songs with Chords}} for the disambiguation template. It appears to be the best one to use based on Wikipedia:Template messages/General. The album page shows "Line of Best Fit" in 3 different sections, so linking to 1 section is pointless and misleading. — Chris53516 ( Talk) 17:25, 30 January 2007 (UTC)
"Line of best fit" redirects here. For the song "Line of Best Fit", see [[You Can Play These Songs with Chords|Line of Best Fit]].
Dude! Instead of wasting my time, did you even check "line of best fit"?? It doesn't even redirect here!! — Chris53516 ( Talk) 23:11, 30 January 2007 (UTC)
There seemed to be no objections to merging trend line into this article, so I went ahead. Cleanup would still be useful. Jeremy Tobacman 01:00, 23 February 2007 (UTC)
I have performed a major rewrite removing most of the repetitive and redundant information. A lot of the information has been, or soon will be, moved to the article least squares, where it belongs. This article should, in my opinion, contain information about different types of linear regression. There seem to be at least two, if not more, different types: least squares and robust regression. All the theoretical information about least squares should be in the article on least squares. -- Woollymammoth 02:03, 24 February 2007 (UTC)
It's still not clear (to me) what's the meaning of this:
The problem is that I don't know what is - is it the inverse of the diagonal ii-th element of ? (now that I ask, it seems that it is...).
Could we say that follows some Multivariate Student Distribution with parameters and ? Or is there some other expression for the distribution of ? Albmont 20:05, 6 March 2007 (UTC)
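For reference, the standard answer under normally distributed errors: once $\hat\sigma^2$ is plugged in for the unknown $\sigma^2$, $\hat\beta$ does follow a multivariate Student distribution, and each studentized coefficient is marginally t-distributed,

$$\frac{\hat\beta_j-\beta_j}{\hat\sigma\sqrt{\left[(X^\top X)^{-1}\right]_{jj}}}\;\sim\;t_{n-p},$$

where $p$ is the number of columns of the design matrix $X$.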
http://en.wikipedia.org/?title=Linear_regression&oldid=110403867
I know this is simple minded and hardly advanced statistics, but when I needed the equations to conduct a linear fit for some data points I expected Wikipedia, and more specifically this page, to have them, but they have been removed. See this page for the last occurrence. It would be nice if they could be worked back in.-- vossman 17:41, 29 March 2007 (UTC)
Estimating beta (the slope)
We use the summary statistics above to calculate $\hat\beta$, the estimate of β:
$$\hat\beta = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}.$$
Estimating alpha (the intercept)
We use the estimate of β and the other statistics to estimate α by:
$$\hat\alpha = \bar{y} - \hat\beta\,\bar{x}.$$
A consequence of this estimate is that the regression line will always pass through the "center" $(\bar{x},\,\bar{y})$.
One would think that such a basic article which has probably been seen by thousands of people and has been around for years would no longer contain careless imprecision in its explanation of the basic concept. OK, that's the rant. Here's the problem--the first sentence of the 2nd paragraph, which attempts to explain part of the fundamental concept is as follows:
This method is called "linear" because the relation of the response to the explanatory variables is assumed to be a linear function of the parameters.
The term "explanatory variables" is nowhere explained. One is expected to know what "explanatory variables" might be, yet they have not been defined, explained, or referenced before this point. Instead, the prior paragraph mentions "dependent variables" and "independent variables" and "parameters". What are "explanatory" variables? For that matter, "response" has also not been defined. I'm assuming the sentence means something like:
This method is called "linear" because the relation of the dependent variable to the independent variables is assumed to be a linear function of the parameters.
I'm sure this is all obvious to the people who know what linear regression is, but then they don't really need this part of the article anyway.
-fastfilm 198.81.125.18 16:37, 16 July 2007 (UTC)
How should I include this graphic... or, maybe it's a bad graphic. In that case, how could I improve it? Do we even need something like this? Cfbaf ( talk) 00:07, 29 November 2007 (UTC)
The current introduction is not that friendly to the lay person. Can the introduction not be structured like this? 1. Regression modelling is a group of statistical methods to describe the relationship between multiple risk factors and an outcome. 2. Linear regression is a type of regression model that is used when the outcome is a continuous variable (e.g., blood cholesterol level, or birth weight).
This explains what linear regression is commonly used for, and tells the reader briefly when it is used. The current introduction simply does not provide enough context for the lay reader. We could also add a section at the end for links to chapters describing other regression models. -- Gak ( talk) 01:53, 16 December 2007 (UTC)
I would argue that the term "regression model" is currently broader than just "linear regression", and therefore regression model should not redirect to this page. I instead propose a separate article on regression modelling that describes other types of model and when each type of model is selected and for what purpose:
That would tie together the various articles under one coherent banner and would be much more useful to the lay reader. It would also be a place that the various terms could be defined (outcome, outcome variable, dependent variable, independent variable, predictor, risk factor, estimator, covariate, etc.) and then each of the separate articles would then adopt the terminology agreed in the regression model article. -- Gak ( talk) 02:09, 16 December 2007 (UTC)
Shouldn't Pearson's Coefficient R2 be ESS/TSS instead of SSR/TSS? See the Coefficient of Determination page. I think what has happened here is that the earlier definitions of ESS and SSR have been switched. Amberhabib ( talk) 05:59, 16 January 2008 (UTC)
Well, I understand the matrix notation of multiple regression and have also implemented it. But today I just wanted to see the formula for bivariate regression and its coefficients. For the casual reader we should have these formulae here, too. In an encyclopedia, knowledge of matrix analysis should not be a prerequisite to getting a simple solution for a simple subcase of a general problem...
--Gotti 06:31, 30 January 2008 (UTC)
So I misused the term bi-variate. I meant 1-response-1-predictor variable (two variables involved)--Gotti 10:19, 16 February 2008 (UTC) —Preceding unsigned comment added by Druseltal2005 ( talk • contribs)
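A minimal sketch of the one-response, one-predictor case, assuming Python with NumPy (the data are placeholders): the closed-form estimates need no matrix algebra at all.

```python
# Sketch: closed-form simple least-squares fit y = alpha + beta * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(alpha, beta)                            # intercept and slope
```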
Along with a major revision of regression analysis and articles concerned with least squares I have given this article a thorough going-over. Notation is now consistent across all the main articles in least squares and regression, making cross-referencing more convenient and reducing the amount of duplicated material to what I hope is the minimum necessary. There has been considerable rearrangement of material to give the article a more logical structure, but nothing significant has been removed. The example involving a cubic polynomial has been transferred here from regression analysis. Petergans ( talk) 10:23, 22 February 2008 (UTC)
I have a couple of problems with this example:
Sorry to sound so critical. I do appreciate the work you've put in to overhauling these articles and they are important. This article got over 48 000 hits in January, more than one a minute. Regards and happy leap day, Qwfp ( talk) 17:48, 29 February 2008 (UTC)
The section on "Regression statistics" is a little unclear. Expressions for the "mean response confidence interval" and "predicted response confidence interval" are given, but the terms are not defined. They are also not defined or mentioned in the Error_propagation page. Can someone define what these terms mean? And also possibly give a more detailed derivation for where the expressions come from? -- Jonny5cents2 ( talk) 09:24, 10 March 2008 (UTC)
Consider the lines:
Writing the elements as , the mean response confidence interval for the prediction is given, using error propagation theory, by:
The multiplication does not make sense unless the vector is a column vector. I.e., the matrix has n rows and n columns, therefore the vector must have n rows and 1 column. The standard definition of a vector within both physics and maths is that vectors are, by default, column vectors. Therefore it should be a column vector, not a row vector as is implied.
I propose that the entry should read . This gives the correct contraction to a scalar. Velocidex ( talk) 03:21, 11 March 2008 (UTC)
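For reference, and to answer the definitional question raised above: under normal errors, writing $\mathbf{x}_0$ as a column vector of predictor values, the usual forms are

$$\hat y_0 \;\pm\; t_{n-p,\,1-\alpha/2}\,\hat\sigma\sqrt{\mathbf{x}_0^\top (X^\top X)^{-1}\mathbf{x}_0}\qquad\text{(mean response)},$$
$$\hat y_0 \;\pm\; t_{n-p,\,1-\alpha/2}\,\hat\sigma\sqrt{1+\mathbf{x}_0^\top (X^\top X)^{-1}\mathbf{x}_0}\qquad\text{(predicted response)};$$

the first covers the estimated mean of y at $\mathbf{x}_0$, the second a single new observation there, which is why it is wider.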
In the article it reads:
---
Thus, the normal equations are
---
Where do the +/- 16, 20 and 6 values come from?
62.92.124.145 ( talk) 09:05, 12 March 2008 (UTC)
The standard deviation on a parameter estimator is given by
Using plus-or-minus the SD does not make sense; one should instead use confidence intervals that get smaller as the sample size grows, even if the SD stays the same. Michael Hardy ( talk) 00:20, 14 March 2008 (UTC)
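In symbols, and assuming the design keeps accumulating information (i.e. $\sum_i (x_i-\bar x)^2 \to \infty$): in simple regression the confidence interval for the slope has width

$$2\,t_{n-2,\,1-\alpha/2}\,\frac{\hat\sigma}{\sqrt{\sum_i (x_i-\bar x)^2}},$$

which shrinks toward zero as n grows even though $\hat\sigma$ settles near the fixed error SD $\sigma$.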
Once again it becomes apparent that there are completely different approaches within different disciplines. In my discipline, chemistry, it is customary to give standard deviations for least squares parameters. Confidence limits are only calculated when they are required for a statistical test. There is a good reason for this. To derive confidence limits an assumption has to be made concerning the probability distribution of the errors on the dependent variable, y. For example, the parameters will belong to a Student's t distribution if the distribution is Gaussian. The assumption of a Gaussian distribution may or may not be justified. In short, the results of the linear regression calculations are a value and a standard deviation, regardless of the error distribution function. Confidence limits are derived from those results with a further assumption.
Concerning the factor, this is an estimator of the error of an observation of unit weight. Another way to eliminate this factor is to estimate the error experimentally as and do a weighted regression, minimizing. Petergans ( talk) 09:45, 14 March 2008 (UTC)
It is a common misunderstanding to think that the least squares procedure can only be applied when the errors are normally distributed. The Gauss–Markov theorem (and the Aitken extension) is independent of the error distribution function. The misunderstanding arises because Gauss's method is sometimes (e.g. in Numerical Recipes) derived using the maximum likelihood principle; when the errors are normally distributed the maximum likelihood solution coincides with the minimum variance solution.
A simple example is the fitting of radioactive decay data. In that case the errors belong to a Poisson distribution but this does not mean that least squares cannot be used. Admittedly this is a nonlinear case, though for a single (exponential) decay it can be made into a linear one by a logarithmic transformation which also transforms the error distribution into I know not what. Other nonlinear-to-linear transformations have been extensively used in the past; see non-linear least squares#Transformation to a linear model for more details. Even if the original data were normally distributed, the transformed data certainly would not be.
I have made the assumption of normality explicit in the example. Petergans ( talk) 08:55, 15 March 2008 (UTC)
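A minimal sketch of the decay example just described, assuming Python with NumPy (the data are synthetic, with multiplicative noise so the log transform produces additive errors):

```python
# Sketch: linearizing an exponential decay N(t) = N0 * exp(-lam * t)
# by taking logs, then fitting by ordinary least squares. Note the
# transform also changes the error distribution, as discussed above.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 5.0, 40)
counts = 1000.0 * np.exp(-0.8 * t) * rng.lognormal(sigma=0.05, size=t.size)

X = np.column_stack([np.ones_like(t), t])     # design matrix: [1, t]
b, *_ = np.linalg.lstsq(X, np.log(counts), rcond=None)
N0, lam = np.exp(b[0]), -b[1]                 # intercept = log N0, slope = -lam
print(N0, lam)                                # recovered amplitude and decay rate
```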
In the Definitions section right at the start, the number of values is given as m, not n (which is surely more usual): "The data consist of m values y1,...,ym taken etc". Then the first summation following uses n, which is not previously or nearby defined. It appears that the x-variable beta is potentially an order-n vector, and the simple case of (x,y) points would be with order n = 1. If this is so, words to this effect would be good, rather than exhibiting a tersely cryptic but high-formalism demonstration of superior knowledge in the minimum number of words. Regards, NickyMcLean ( talk) 21:18, 23 April 2008 (UTC)
I gather there has been some discussion of this notation. In which discipline(s) is it conventional to use "M" for the sample size and "N" for the number of covariates? This seems strange and confusing to me given that in, at least, statistics, econometrics, biostatistics, and psychometrics, "N" conventionally denotes the sample size. Common statistical jargon such as "root n consistency" even invokes that convention. I think it is non-standard and possibly very confusing to use "N" as it is used in this and other wiki articles. —Preceding unsigned comment added by 68.146.25.175( talk) 16:48, 28 July 2008 (UTC)
In the article's section " Checking model validity", the F-test method is mentioned. Can anyone explain this further?
IMHO there are two issues that need clarification.
I may be wrong. If you know about these issues please feel free to make them clear. Frigoris ( talk) 16:58, 25 April 2008 (UTC)
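For reference, the usual form of the F-test for comparing a smaller nested model 1 (with $p_1$ parameters) against a larger model 2 (with $p_2 > p_1$ parameters), assuming normal errors:

$$F=\frac{(\mathrm{RSS}_1-\mathrm{RSS}_2)/(p_2-p_1)}{\mathrm{RSS}_2/(n-p_2)}\;\sim\;F_{p_2-p_1,\;n-p_2}$$

under the null hypothesis that the extra parameters of model 2 are zero; large values favour the larger model.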
This is not a most important issue, but the two figures in the section " Example" would be even better in SVG format. —Preceding unsigned comment added by Frigoris ( talk • contribs) 17:04, 25 April 2008 (UTC)
NickyMcLean: please consider the following points.
Petergans ( talk) 08:11, 29 April 2008 (UTC)
I was surprised by Arthur Rubin's edit to linear regression with its edit summary that was so emphatic about PROPER explanation of linearity. I was concerned that some of your words might be misunderstood as meaning that polynomial regression is not an instance of linear regression, and then I came to your assertion that if one column of the design matrix X contains the logarithms of the corresponding entries of another column, that makes the regression nonlinear (presumably because the log function is nonlinear). That is grossly wrong and I reverted. Notice that the probability distributions of the least-squares estimators of the coefficients can be found simply by using the fact that they depend linearly on the vector of errors (at least if the error vector is multivariate normal). Nonlinearity of the dependence of one column of the matrix X upon the other columns does not change that at all, since the model attributes no randomness to the entries in X. Nonlinear regression, on the other hand, is quite a different thing from that. Michael Hardy ( talk) 19:10, 8 May 2008 (UTC)
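A minimal sketch of the point, assuming Python with NumPy (synthetic data): a design column that is a nonlinear function of another column still gives a perfectly ordinary linear regression, because the model remains linear in the coefficients.

```python
# Sketch: y = b0 + b1*x + b2*log(x) is still *linear* regression --
# linear in the b's, even though log(x) is a nonlinear function of x.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 60)
y = 1.0 + 0.5 * x + 2.0 * np.log(x) + 0.1 * rng.normal(size=x.size)

X = np.column_stack([np.ones_like(x), x, np.log(x)])  # design matrix
b, *_ = np.linalg.lstsq(X, y, rcond=None)             # ordinary least squares
print(b)                                              # estimates of b0, b1, b2
```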
While having some relevant points, it was basically wrong. It is possible for both variables X and Y to be measured with error, and to do regressions of both Y on X and X on Y, and both will be valid provided it is recognised what is being done: providing the best predictor of Y given values of X when they are measured with the same type of error as in the sample ... and this is a perfectly valid thing to do. Of course the requirement might be to try to identify an underlying relationship of "true" values, in which case the section on "errors in variables" is relevant. Melcombe ( talk) 09:29, 23 May 2008 (UTC)
For reader convenience, here is my reconstitution of the plot of the residuals.
Converting the height values to inches and then correctly converting to metres produces a different quadratic fit, with these (much smaller) residuals.
The resulting residuals suggest a cubic shape, so why not try fitting a cubic shape while I'm at it? These residuals are the result.
However, a quartic also produces somewhat reduced residuals. Properly, one must consider the question "significant reduction?", not just wander about. So enough of that. Instead, here are some numbers. (With excessive precision, as of course each exactly minimises its data set's sum of squared errors, and we're not comparing observational errors but analysis errors.)
x¹       x²        x³
128.81   -143.16    61.96             Bungled data.
119.02   -131.51    58.50             Corrected data.
408.01    831.24   526.17    118.04   Cubic fit.
As is apparent, the trivial detail of rounding data has produced a significant difference between the parameters of the quadratic fit. So obviously, none of this should be discussed. NickyMcLean ( talk) 23:47, 26 May 2008 (UTC)
I've just done a series of edits on the analysis of variance section that amount to a semi-major rewrite. Some idiot claimed that the "regression sum of squares" was THE SAME THING AS the sum of squares of residuals, and the error sum of square was NOT the same thing as the sum of squares of residuals. In other words, the section was basically nonsense. Michael Hardy ( talk) 12:18, 1 July 2008 (UTC)
Corrected some errors and misleading text on this page. I clarified that the assumptions that X is fixed and that the error is NID are for simple expositions, and that these assumptions are commonly relaxed (the previous text asserted that the case in which the mean of the error term is not zero is "beyond the scope of regression analysis"!). It is not true that either the residuals or the vector of estimates follows a Student distribution, even when we assume the errors are normal, and I have corrected these claims. I deleted some of "checking model structure," which was a recipe for the most naive form of data mining. Made a few other small clarifications or corrections. —Preceding unsigned comment added by 68.146.25.175 ( talk) 21:15, 26 July 2008 (UTC)
Probably the first thing we need to do to clean up this mess is to undo this edit that was done in February. Michael Hardy ( talk) 18:37, 28 July 2008 (UTC)
"Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called dependent variable, is modeled by a least squares function, called linear regression equation."
It's just wrong. Linear regression is a form of regression analysis in which the unknowns (the betas) are a linear (or affine if you prefer) function of the knowns (the response y and the predictors x1, x2,...xn). And what's this "least squares function"? Least squares is not part of the model, it's an estimation method. It's not the only one and is not essential to the definition of linear regression. Blaise ( talk) 16:34, 28 August 2008 (UTC)
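In symbols, the point being made: the model is

$$y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_p x_{ip}+\varepsilon_i,\qquad\text{i.e.}\qquad \mathbf y = X\boldsymbol\beta+\boldsymbol\varepsilon,$$

linear in the β's; least squares is then just one of several ways (alongside, e.g., maximum likelihood or robust M-estimation) of estimating β.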
I don't know a whole lot about statistics, but I've been reading about political polling lately, came across this term and wanted to know more. I usually go to Wikipedia and read at least the opening paragraph for a good introduction on a subject. This article uses jargon in the intro paragraph that doesn't help me at all. Can someone put Linear Regression in laymans terms? —Preceding unsigned comment added by GregCovey ( talk • contribs) 23:29, 15 October 2008 (UTC)
Shouldn't the first line of Definitions read: "The data consist of n values " instead of "The data consist of n values "? Colindente ( talk) 13:33, 17 October 2008 (UTC)
Flavio Guitian ( talk) 17:46, 28 November 2008 (UTC)
Can we get a layman's definition of this term? —Preceding unsigned comment added by Ken444444 ( talk • contribs) 22:53, 18 December 2008 (UTC)
Is nothing better than this picture available?
Its artificiality, in that the errors are uniformly distributed, glares at you the instant you see the page. Michael Hardy ( talk) 05:08, 20 January 2009 (UTC)
OK, that image has been replaced. Michael Hardy ( talk) 22:07, 22 February 2009 (UTC)
The conditional distribution of Yi given Xi is a linear transformation of the distribution of the error term.
It is? Seems to me it's a shifted version of the distribution of the error term, which is not a linear transformation. 72.75.126.69 ( talk) 02:22, 26 January 2009 (UTC)
There seem to be inconsistencies in Regression inference section.
1. Expression n - p should be replaced with n - p - 1, since there are p + 1 parameters beta.
2. This statement
(a) refers to normal equations, without explaining what they are.
(b) repeats formula that was already written a few lines above, in Least squares estimates section.
I cannot fix these by myself, because this is not my field of expertise. Alex -- talk 22:30, 23 March 2009 (UTC)
There is just so much wrong with this article...
Firstly and most importantly, it puts undue weight on the Least Squares method. Even the sentence in the introduction which states that "linear model and least squares are not synonyms" still somehow leaves an impression that they are almost synonyms. True, OLS was quite popular in the times of ENIACs and earlier, but today no serious research paper can be published with OLS in it (unless maybe in a field seriously lacking in mathematical background, like psychology or aesthetics for example). I'm not criticizing LS per se, merely the unnecessary emphasis on it (sections 2, 3 and 4, more than half of the article). LS belongs to its own subpage and only there...
The Example section is a perfect example of what a linear regression IS NOT. Linear regression is not OLS (it would be more instructive to conduct different estimation procedures and compare results). Linear regression is not about how to multiply matrices (more than half of the section is devoted to this exercise; modern software is capable of multiplying matrices for you). Linear regression does not have to assume normality of errors. Writing β0 = 129±16 = [92.9, 164.7] is misleading at best (mixing two notational conventions).
The Applications section contains a list which is neither complete nor to any degree representative of the wide variety of uses of linear regression models. Of course, compilation of such an exhaustive list is not a feasible task, but at least some note must be added that the list is far from complete.
The Segmented regression section does not even belong in this article, as it is an extension of the linear model to the piecewise-linear case.
Lead section is pretty good :)
// Stpasha ( talk) 04:07, 1 July 2009 (UTC)
I moved some things from the "extensions" section to the "estimation methods" section. This was done after modifying the "assumptions" section to state that "linear regression" need not assume independence of the data. I agree that topics like errors-in-variables are extensions, in that they add structure to the generating model that is not included in the classical formulation of linear regression. But correlations between the observations are too important, and have too long of a history in linear modeling to be termed an "extension." This would be much too restrictive. Skbkekas ( talk) 15:47, 7 July 2009 (UTC)
![]() | This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page. |
In the analysis of variance section, what is m in the formula for the statistic involving R? -- SolarMcPanel ( talk) 19:40, 5 April 2009 (UTC)
The front page of this article tells me that someone has suggested it be merged with Multiple regression. I agree that it should be. Also, there are articles on
could also be merged in.
Please add to the above list if there are others.
Personally, I'd prefer "Linear model" as the title.
Since this is a subject on which a great many books have been written, an article on it is not going to be anything like comprehensive. It might therefore be sensible to lay down some rules on the content, such as the level of mathematical and theoretical rigour.
Perhaps someone should start a Wikibook to cover the gaps...
—The preceding unsigned comment was added by Tolstoy the Cat ( talk • contribs) .
In reference to recent edits which change stuff like <math>x_i</math> to ''x''<sub>''i''</sub> -- the math-tag processor is smart enough to use html markup in simple cases (instead of generating an image via latex). It seems that some of those changes weren't all that helpful, as the displayed text is unchanged and the html markup is harder to edit. I agree the in-line <math>x_1,\ldots,x_n</math> wasn't pretty; however, it does seem necessary to clarify "variables" for the benefit of readers who won't immediately see x as a vector. Wile E. Heresiarch 16:41, 2 Feb 2004 (UTC)
In reference to my recent edit on the paragraph containing the eqn y = a + b x + c^2 + e, I moved the discussion of that eqn up into the section titled "Statement of the linear regression model" since it has to do with the characterizing the class of models which are called "linear regression" models. I don't think it could be readily found in the middle of the discussion about parameter estimation. Wile E. Heresiarch 00:43, 10 Feb 2004 (UTC)
I have a question about the stronger set of assumptions (independent, normally distributed, equal variance, mean zero). What can be proven from these that can't be proven from assuming uncorrelated, equal variance, mean zero? Presumably there is some result stronger than the Gauss-Markov theorem. Wile E. Heresiarch 02:42, 10 Feb 2004 (UTC)
Hello. In taking a closer look at Galton's 1885 paper, I see that he used a variety of terms -- "mean filial regression towards mediocrity", "regression", "regression towards mediocrity" (p 1207), "law of regression", "filial regression" (p 1209), "average regression of the offspring", "filial regression" (p 1210), "ratio of regression", "tend to regress" and "tendency to regress", "mean regression", "regression" (p 1212) -- although not exactly "regression to the mean". So it seems that the claim that Galton specifically used the term "regression to the mean" should be substantiated. -- Also this same paper shows that Galton was aware that regression works the other way too (parents are less exceptional than their children). I'll probably tinker with the history section in a day or two. Happy editing, Wile E. Heresiarch 06:00, 27 Mar 2004 (UTC)
I am confused -- i don't like the notation of d being the solution vector -- how about using Beta1 and Beta0?
The treatment is excellent but largely theoretical. It would be helpful to include additional material describing how regression is actually used by scientists. The following paragraph is a draft or outline of an introduction to this aspect of linear regression. (Needs work.)
Linear regression is widely used in biological and behavioral sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines. For example, early evidence relating cigarette smoking to mortality came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to insure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized experiments are considered to be more trustworthy than a regression analysis.
I also don't see any treatment of statistical probability testing with regression. Scientific researchers commonly test the statistical significance of the observed regression and place considerable emphasis on p values associated with r-squared and the coefficients of the equation. It would be nice to see a practical discussion of how one does this and how to interpret the results of such an analysis. --anon
I changed the terms independent / dependent variable to explanatory / response variable. This was to make the article more in line with the terminology used by the majority of statistics textbooks, and because independent / dependent variables are not statistically independent. Predictor / response might be ideal terminology, I'll see what textbooks are using.-- Theblackgecko 23:06, 11 April 2006 (UTC)
The terms independent and dependent are actually more precise in describing linear regression as such - terms like predictor/response or explanatory/response are related to the application of regression to different problems. Try checking a non-applied stats book.
Scientists tend to speak of independent/dependent variables, but statistics texts (such as mine) prefer explanatory/response. (These two pairs are not strictly interchangeable, in any event, though usually they are.) Here is a table a professor wrote up for the terms: http://www.tufts.edu/~gdallal/slr.htm.
At a broad and simplistic level, we can say that Linear Regression is generally used to estimate or project, often from a sample to a population. It is an estimating technique. Multiple Regression is often used as a measure of the proportion of variability explained by the Linear Regression.
Multiple Regression can be used in an entirely different scenario. For example when the data collected is a census rather than a sample, and Linear Regression is not necessary for estimating. In this case, Multiple Regression can be used to develop a measure of the proportionate contribution of independent variables.
Consequently, I would propose that merging Multiple Regression into the discussion of Linear Regression would bury it in an inappropriate and in the case of the example given, an relatively unrelated topic.
-- 69.119.103.162 20:42, 5 August 2006 (UTC)A. McCready Aug. 5, 2006
Would it be within the scope of Wikipedia to mention or reference how to do linear regressions in Excel and scientific / financial calculators? I think it would be very helpful because that's how 99% of people will actually do a linear regression - no one cares about the stupid math to get to the coefficients and the R squared.
In Excel, there is the scatter plot for single variable linear regressions, and the linear regression data analysis and the linest() family of functions for multi-variable linear regressions.
I believe the hp 12c only supports single variable linear regressions. —The preceding unsigned comment was added by 12.196.4.146( talk • contribs) 15:21, 15 August 2006 (UTC)
I am not a mathematician, but a scientist. I am most familiar with the use of linear regression to calculate a best fit y=mx + b to suit a particular set of ordered pairs of data that are likely to be in a linear relationship. I was wondering if a much simpler explanation of how to calculate this without use of a calculator or knowing matrix notation might be fit into the article somewhere. The texts at my level of full mathematical understanding merely instruct the student to plug in the coordinates or other information into a graphing calculator, giving a 'black box' sort of feel to the discussion. I would suspect many others who would consult wikipedia about this sort of calculation would not the same audience that the discussion in this entry seems to address, enlightening though it may be. I am not suggesting that the whole article be 'dumbed down' for us feeble non-mathematicians, but merely include a simplified explanation for the modern student wishing a slightly better explanation than "pick up your calculator and press these buttons", as is currently provided in many popular college math texts. The24frans 16:21, 18 September 2006 (UTC)frannie
Why are these labeled as "technically incorrect"? There is no explanation or citation given for this reasoning. The terms describe the relationship between the two variables. One is independent because it determines the the other. Thus the second term is dependent on the first. Linear regressions have independent variables, and it is not incorrect to describe them as such. If one is examining say, the amount of time spent on homework and its effect on GPA, then the hypothetical equation would be:
GPA = m * (homework time) + b
where homework time is the independent variable (independent from GPA), and GPA is dependent (on homework time, as shown in the equation). I will remove the references to these terms as technically incorrect unless someone can refute my reasoning. -- Chris53516 21:06, 3 October 2006 (UTC)
--->I added the terms, and I also added the note that they are technically incorrect. They are technically incorrect because they are considered to imply causation. (unsigned comment by 72.87.187.241)
I ran into linear regression for the purpose of forecasting. I believe this is done reasonably often for business planning however I suspect it is statistically incorrect to extend the model outside of its original range for the independent variables. Not that I am by any means a statistician. Worthly of a mention? -K
I have the problem to combine some indicators using a weighted sum. All weights have to be located in the range from 0 till 1. And the weights should add to one.
The probability distribution is rather irregular therefore the application of the EM-algorithm would be rather difficult.
Therefore I am thinking about using a linear regression with Lagrange condition that all indicators sum to one.
One problem which can emerge consists in the fact that a weight derived by linear regression might be negative. I have the idea to filter out indicators with negative weights and redo the linear regression with the remaining indicators until all weights are positive.
Is this sensible or does someone knows a better solution. Or is it better to use neural networks? (unsigned comments of Nulli)
The page gives useful formulae for estimating alpha and beta, but it does not give equations for the level of uncertainty in each (standard error, I think). For you mathematicians out there, I'd love it if this were available on wikipedia so I don't have to hunt through books. 99of9 00:11, 26 October 2006 (UTC)
Would it be worth mentioning in this article that the linear regression is simply a special case of a polynomial fit? Below I copy part of the article rephrased to the more general case:
I recall using this in the past and it worked quite well for fitting polynomials to data. Simply zero out the epsilons and solve for the alpha coefficients, and you have a nice polynomial. It works as long as m<n. For the linear regression, of course,m=1. - Amatulic 18:57, 30 October 2006 (UTC)
There are several ways to do regression in Excel. If you use the LINEST function in Excel you can show the results to any precision. Similary if you use the Data Analysis ToolPak or the Solver. Blaise ( talk) 13:24, 28 August 2008 (UTC)
This section should be rewriten as it is not general: polynomial regression is only a special case of multiple regression. And there is no formula for correlation coefficient, which, in case of multiple regression, is called coefficient of determination. Any comments?
TomyDuby 19:29, 2 January 2007 (UTC)
I accept your comment. But I would like to see the comment I made above about the correlation coefficient fixed. TomyDuby 21:24, 10 January 2007 (UTC)
Thanks! I made one more change. I consider the issue closed. TomyDuby 13:34, 11 January 2007 (UTC)
The French version of this article seems to be better written and more organised. If there are no objections, I intend on translating that article into English and incorporating it into this article in the next month or so. Another point: all the article on regression tend to be poorly written, repeatative, and not clear on the terminology including linear regression, least squares and its many derivatives, multiple linear regression... Woollymammoth 00:08, 21 January 2007 (UTC)
I used {{redirect5|Line of best fit|the song "Line of Best Fit"|You Can Play These Songs with Chords}}
for the disambiguation template. It appears to be the best one to use based on
Wikipedia:Template messages/General. The album page show "Line of Best Fit" in 3 different sections, so linking to 1 section is pointless and misleading. —
Chris53516 (
Talk)
17:25, 30 January 2007 (UTC)
"Line of best fit" redirects here. For the song "Line of Best Fit", see [[You Can Play These Songs with Chords|Line of Best Fit]].
Dude! Instead of wasting my time, did you even check "line of best fit"?? It doesn't even redirect here!! — Chris53516 ( Talk) 23:11, 30 January 2007 (UTC)
There seemed to be no objections to merging trend line into this article, so I went ahead. Cleanup would still be useful. Jeremy Tobacman 01:00, 23 February 2007 (UTC)
I have performed a major rewrite removing most of the repeative and redundant information. A lot of the information has been, or soon will be moved to the article least squares, where it should belong. This article should, in my opinion, contain information about different types of linear regression. There seems to be at least 2, if not more, different types: least squares and robust regression. All the theoretical information about least squares should be in the article on least squares. -- Woollymammoth 02:03, 24 February 2007 (UTC)
It's still not clear (to me) what's the meaning of this:
The problem is that I don't know what is - is it the inverse of the diagonal ii-th element of ? (now that I ask, it seems that it is...).
Could we say that follows some Multivariate Student Distribution with parameters and ? Or is there some other expression for the distribution of ? Albmont 20:05, 6 March 2007 (UTC)
http://en.wikipedia.org/?title=Linear_regression&oldid=110403867
I know this is simple minded and hardly advance statistics, but when need the equations to conduct a linear fit for some data points I expected wikipedia to have and more specifically this page, but they have been removed. See this page for the last occurance. It would be nice if they could be worked back in.-- vossman 17:41, 29 March 2007 (UTC)
Estimating beta (the slope)
We use the summary statistics above to calculate , the estimate of β.
Estimating alpha (the intercept)
We use the estimate of β and the other statistics to estimate α by:
A consequence of this estimate is that the regression line will always pass through the "center" .
One would think that such a basic article which has probably been seen by thousands of people and has been around for years would no longer contain careless imprecision in its explanation of the basic concept. OK, that's the rant. Here's the problem--the first sentence of the 2nd paragraph, which attempts to explain part of the fundamental concept is as follows:
This method is called "linear" because the relation of the response to the explanatory variables is assumed to be a linear function of the parameters.
The reference to "explanatory variables" is nowhere explained. One is expected to know what "explanatory variables" might be, yet they have not been defined, explained, or referenced before this. Instead, the prior paragraph mentions "dependent variables" and "independent variables" and "parameters". What are "explanatory" variables? For that matter, "response" has also not been defined. I'm assuming the sentence means something like:
This method is called "linear" because the relation of the dependent variable to the independent variables is assumed to be a linear function of the parameters.
I'm sure this is all obvious to the people who know what linear regression is, but then they don't really need this part of the article anyway.
-fastfilm 198.81.125.18 16:37, 16 July 2007 (UTC)
How should I include this graphic... or, maybe it's a bad graphic. In that case, how could I improve it? Do we even need something like this? Cfbaf ( talk) 00:07, 29 November 2007 (UTC)
The current introduction is not that friendly to the lay person. Can the introduction not be structured like this? 1. Regression modelling is a group of statistical methods to describe the relationship between multiple risk factors and an outcome. 2. Linear regression is a type of regression model that is used when the outcome is a continuous variable (e.g., blood cholesterol level, or birth weight).
This explains what linear regression is commonly used for, and tells the reader briefly when it is used. The current introduction simply does not provide enough context for the lay reader. We could also add a section at the end for links to chapters describing other regression models. -- Gak ( talk) 01:53, 16 December 2007 (UTC)
I would argue that the term "regession model" is currently broader than just "linear regression" and therefore regression modelshould not therefore redirect to this page. I instead propose a separate article on regression modelling that describes other types of model and when each type of model is selected and for what purpose:
That would tie together the various articles under one coherent banner and would be much more useful to the lay reader. It would also be a place that the various terms could be defined (outcome, outcome variable, dependent variable, independent variable, predictor, risk factor, estimator, covariate, etc.) and then each of the separate articles would then adopt the terminology agreed in the regression model article. -- Gak ( talk) 02:09, 16 December 2007 (UTC)
Shouldn't Pearson's Coefficient R2 be ESS/TSS instead of SSR/TSS? See the Coefficient of Determination page. I think what has happened here is that the earlier definitions of ESS and SSR have been switched. Amberhabib ( talk) 05:59, 16 January 2008 (UTC)
Well, I understand the matrix-notation of multiple regression and also have implemented this. But today I just wanted to see the formula for bivariate regression and its coefficients. For the casual reader here we should have these formulae, too. In an encyclopedia the knowledge of matrix-analysis should not be a prerequisite to get a simple solution for a simple subcase of a general problem...
--Gotti 06:31, 30 January 2008 (UTC)
So I misused the term bi-variate. I meant 1-response-1-predictor variable (two variables involved)--Gotti 10:19, 16 February 2008 (UTC) —Preceding unsigned comment added by Druseltal2005 ( talk • contribs)
Along with a major revision of regression analysis and articles concerned with least squares I have given this article a thorough going-over. Notation is now consistent across all the main articles in least squares and regression, making cross-referencing more convenient and reducing the amount of duplicated material to what I hope is the minimum necessary. There has been considerable rearrangement of material to give the article a more logical structure, but nothing significant has been removed. The example involving a cubic polynomial has been transferred here from regression analysis. Petergans ( talk) 10:23, 22 February 2008 (UTC)
I have a couple of problems with this example:
Sorry to sound so critical. I do appreciate the work you've put in to overhauling these articles and they are important. This article got over 48 000 hits in January, more than one a minute. Regards and happy leap day, Qwfp ( talk) 17:48, 29 February 2008 (UTC)
The section on "Regression statistics" is a little unclear. Expressions for the "mean response confidence interval" and "predicted response confidence interval" are given, but the terms are not defined. They are also not defined or mentioned in the Error_propagation page. Can someone define what these terms mean? And also possibly give a more detailed derivation for where the expressions come from? -- Jonny5cents2 ( talk) 09:24, 10 March 2008 (UTC)
Consider the lines:
Writing the elements as , the mean response confidence interval for the prediction is given, using error propagation theory, by:
The multiplication does not make sense unless is a column vector. i.e. the matrix has n rows and n columns, therefore must have n rows and 1 column. The standard definition of a vector within both physics and maths is that vectors are, by default, column vectors. Therefore should be a column vector, not a row vector as is implied.
I propose that the entry should read . This gives the correct contraction to a scalar. Velocidex ( talk) 03:21, 11 March 2008 (UTC)
In the article it reads:
---
Thus, the normal equations are
---
Where do the the +/- 16, 20 and 6 values come from?
62.92.124.145 ( talk) 09:05, 12 March 2008 (UTC)
The standard deviation on a parameter estimator is given by
Using plus-or-minus the SD does not make sense; one should instead use confidence intervals that get smaller as the sample size grows, even if the SD stays the same. Michael Hardy ( talk) 00:20, 14 March 2008 (UTC)
Once again it becomes apparent that there are completely different approaches within different disciplines. In my discipline, chemistry, it is customary to give standard deviations for least squares parameters. Confidence limits are only calculated when they are required for a statistical test. There is a good reason for this: to derive confidence limits, an assumption has to be made concerning the probability distribution of the errors on the dependent variable, y. For example, the parameters will follow a Student's t distribution if the error distribution is Gaussian, but the assumption of a Gaussian distribution may or may not be justified. In short, the results of the linear regression calculations are a value and a standard deviation, regardless of the error distribution function; confidence limits are derived from those results with a further assumption.
Concerning the factor <math>\sqrt{S/(n-p)}</math>: this is an estimator of the error of an observation of unit weight. Another way to eliminate this factor is to estimate the error <math>\sigma_i</math> on each observation experimentally and do a weighted regression, minimizing <math>\sum_i \left( \frac{y_i - \hat y_i}{\sigma_i} \right)^2</math>. Petergans ( talk) 09:45, 14 March 2008 (UTC)
It is a common misunderstanding to think that the least squares procedure can only be applied when the errors are normally distributed. The Gauss–Markov theorem (and the Aitken extension) is independent of the error distribution function. The misunderstanding arises because Gauss's method is sometimes (e.g. in Numerical Recipes) derived using the maximum likelihood principle; when the errors are normally distributed, the maximum likelihood solution coincides with the minimum variance solution.
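A quick simulation sketch of this point (made-up design and parameters, not from the article): the least-squares estimates remain unbiased when the errors are uniform rather than Gaussian, since only zero mean and equal variance are required.
<source lang="python">
# OLS stays unbiased under non-normal (here: uniform) errors.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # intercept + slope design
beta = np.array([1.0, 2.0])                 # hypothetical true parameters

estimates = []
for _ in range(5000):
    eps = rng.uniform(-1.0, 1.0, x.size)    # decidedly non-normal errors
    y = X @ beta + eps
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append(b)

print(np.mean(estimates, axis=0))           # close to [1.0, 2.0]
</source>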
A simple example is the fitting of radioactive decay data. In that case the errors belong to a Poisson distribution, but this does not mean that least squares cannot be used. Admittedly this is a nonlinear case, though for a single (exponential) decay it can be made into a linear one by a logarithmic transformation, which also transforms the error distribution into I know not what. Other nonlinear-to-linear transformations have been extensively used in the past; see non-linear least squares#Transformation to a linear model for more details. Even if the original data were normally distributed, the transformed data certainly would not be.
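A minimal sketch of that logarithmic transformation, with synthetic Poisson counts and made-up parameters: <math>N(t) = N_0 e^{-\lambda t}</math> becomes linear in <math>(\log N_0, -\lambda)</math> after taking logs, so ordinary linear regression applies, although the transformed errors are no longer Poisson (nor Gaussian).
<source lang="python">
# Fit an exponential decay by linear regression on log(counts).
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 50)
true_N0, true_lam = 1000.0, 0.3             # hypothetical decay parameters
counts = rng.poisson(true_N0 * np.exp(-true_lam * t))   # Poisson errors

mask = counts > 0                           # log is undefined at zero counts
slope, intercept = np.polyfit(t[mask], np.log(counts[mask]), 1)
print("N0 ~", np.exp(intercept), "  lambda ~", -slope)
# The unweighted fit over-weights the noisy low-count tail.
</source>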
I have made the assumption of normality explicit in the example. Petergans ( talk) 08:55, 15 March 2008 (UTC)
In the Definitions section, right at the start, the number of values is given as m, not n (which is surely more usual): "The data consist of m values y1,...,ym" etc. Then the first summation that follows uses n, which is not previously or nearby defined. It appears that each x-value (and the parameter vector beta) is potentially an order-n vector, and that the simple case of (x,y) points corresponds to n = 1. If this is so, words to this effect would be good, rather than a tersely cryptic, high-formalism demonstration of superior knowledge in the minimum number of words. Regards, NickyMcLean ( talk) 21:18, 23 April 2008 (UTC)
I gather there has been some discussion of this notation. In which discipline(s) is it conventional to use "M" for the sample size and "N" for the number of covariates? This seems strange and confusing to me given that in, at least, statistics, econometrics, biostatistics, and psychometrics, "N" conventionally denotes the sample size. Common statistical jargon such as "root n consistency" even invokes that convention. I think it is non-standard and possibly very confusing to use "N" as it is used in this and other wiki articles. —Preceding unsigned comment added by 68.146.25.175 ( talk) 16:48, 28 July 2008 (UTC)
In the article's section " Checking model validity" the F-test method is mentioned. Can anyone explain this further?
IMHO there are two issues that need clarification.
I may be wrong. If you know about these issues please feel free to make them clear. Frigoris ( talk) 16:58, 25 April 2008 (UTC)
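For what it's worth, the usual overall F-test (one common form, assuming Gaussian errors, with p slope parameters and n observations) compares the fitted model against the intercept-only model:
:<math>F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)},</math>
which is referred to an <math>F_{p,\,n-p-1}</math> distribution; a large value indicates that the regressors jointly explain a significant part of the variation.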
This is not the most important issue, but the two figures in the section " Example" would be even better in SVG format. —Preceding unsigned comment added by Frigoris ( talk • contribs) 17:04, 25 April 2008 (UTC)
NickyMcLean: please consider the following points.
Petergans ( talk) 08:11, 29 April 2008 (UTC)
I was surprised by Arthur Rubin's edit to linear regression with its edit summary that was so emphatic about PROPER explanation of linearity. I was concerned that some of your words might be misunderstood as meaning that polynomial regression is not an instance of linear regression, and then I came to your assertion that if one column of the design matrix X contains the logarithms of the corresponding entries of another column, that makes the regression nonlinear (presumably because the log function is nonlinear). That is grossly wrong and I reverted. Notice that the probability distributions of the least-squares estimators of the coefficients can be found simply by using the fact that they depend linearly on the vector of errors (at least if the error vector is multivariate normal). Nonlinearity of the dependence of one column of the matrix X upon the other columns does not change that at all, since the model attributes no randomness to the entries in X. Nonlinear regression, on the other hand, is quite a different thing from that. Michael Hardy ( talk) 19:10, 8 May 2008 (UTC)
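A sketch of the point above (made-up data): one column of the design matrix is the log of another, yet the fit is still linear regression, because the model is linear in the coefficients.
<source lang="python">
# A design matrix whose columns are nonlinearly related is still a
# *linear* regression problem: the model is linear in beta.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 100)
y = 2.0 + 0.5*x + 3.0*np.log(x) + rng.normal(0.0, 0.2, x.size)

X = np.column_stack([np.ones_like(x), x, np.log(x)])  # log(x) column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares
print(beta_hat)    # estimates near [2.0, 0.5, 3.0]
</source>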
While having some relevant points, it was basically wrong. It is possible for both variables X and Y to be measured with error and to do regressions of both Y on X and X on Y, and both will be valid provided it is recognised what is being done: providing the best predictor of Y given values of X measured with the same type of error as in the sample ... and this is a perfectly valid thing to do. Of course, the requirement might instead be to identify an underlying relationship between "true" values, in which case the section on "errors in variables" is relevant. Melcombe ( talk) 09:29, 23 May 2008 (UTC)
For reader convenience, here is my reconstitution of the plot of the residuals.
Converting the height values to inches and then correctly converting to metres produces a different quadratic fit, with these (much smaller) residuals.
The resulting residuals suggest a cubic shape, so why not try fitting a cubic while I'm at it? These residuals are the result.
However, a quartic also produces somewhat reduced residuals. Properly, one must ask whether the reduction is significant, not just wander about, so enough of that. Instead, here are some numbers. (With excessive precision, since each fit exactly minimises its data set's sum of squared errors, and we are comparing analysis errors rather than observational errors.)
                 x⁰        x¹        x²        x³
 Bungled data    128.81   -143.16    61.96
 Corrected data  119.02   -131.51    58.50
 Cubic fit       408.01    831.24   526.17   118.04
As is apparent, the trivial detail of rounding data has produced a significant difference between the parameters of the quadratic fit. So obviously, none of this should be discussed. NickyMcLean ( talk) 23:47, 26 May 2008 (UTC)
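The "significant reduction?" question can be made concrete with an extra-sum-of-squares F-test; here is a minimal sketch on made-up stand-in data (the real heights and weights are not reproduced in this thread, so the quadratic coefficients above are reused to generate something similar).
<source lang="python">
# Extra-sum-of-squares F-test: does the cubic term earn its keep?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(1.4, 1.8, 15)                     # hypothetical heights (m)
y = 128.81 - 143.16*x + 61.96*x**2 + rng.normal(0.0, 0.05, x.size)

def rss(deg):
    """Residual sum of squares of a degree-`deg` polynomial fit."""
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    return float(resid @ resid)

n = x.size
rss2, rss3 = rss(2), rss(3)
F = (rss2 - rss3) / (rss3 / (n - 4))   # 1 extra parameter; cubic fits 4
p_value = stats.f.sf(F, 1, n - 4)
print(F, p_value)  # a large p-value says the cubic term is not justified
</source>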
I've just done a series of edits on the analysis of variance section that amount to a semi-major rewrite. Some idiot claimed that the "regression sum of squares" was THE SAME THING AS the sum of squares of residuals, and that the "error sum of squares" was NOT the same thing as the sum of squares of residuals. In other words, the section was basically nonsense. Michael Hardy ( talk) 12:18, 1 July 2008 (UTC)
Corrected some errors and misleading text on this page. I clarified that the assumptions that X is fixed and that the error is NID are made for simple expositions, and that these assumptions are commonly relaxed (the previous text asserted that the case in which the mean of the error term is not zero is "beyond the scope of regression analysis"!). Neither the residuals nor the vector of estimates is Student-distributed, even when we assume the errors are normal, and I have corrected these claims. I deleted some of "checking model structure," which was a recipe for the most naive form of data mining. Made a few other small clarifications or corrections. —Preceding unsigned comment added by 68.146.25.175 ( talk) 21:15, 26 July 2008 (UTC)
Probably the first thing we need to do to clean up this mess is to undo this edit that was done in February. Michael Hardy ( talk) 18:37, 28 July 2008 (UTC)
"Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called dependent variable, is modeled by a least squares function, called linear regression equation."
It's just wrong. Linear regression is a form of regression analysis in which the unknowns (the betas) are a linear (or affine if you prefer) function of the knowns (the response y and the predictors x1, x2,...xn). And what's this "least squares function"? Least squares is not part of the model, it's an estimation method. It's not the only one and is not essential to the definition of linear regression. Blaise ( talk) 16:34, 28 August 2008 (UTC)
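For concreteness, the model being described, in one common notation (least squares, maximum likelihood, robust and Bayesian procedures are all possible ways of estimating the same model):
:<math>y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1,\ldots,n,</math>
with "linear" referring to linearity in the <math>\beta_j</math>, not in the predictors.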
I don't know a whole lot about statistics, but I've been reading about political polling lately, came across this term and wanted to know more. I usually go to Wikipedia and read at least the opening paragraph for a good introduction on a subject. This article uses jargon in the intro paragraph that doesn't help me at all. Can someone put Linear Regression in laymans terms? —Preceding unsigned comment added by GregCovey ( talk • contribs) 23:29, 15 October 2008 (UTC)
Shouldn't the first line of Definitions read: "The data consist of ''n'' values <math>y_1,\ldots,y_n</math>" instead of "The data consist of ''n'' values <math>y_1,\ldots,y_m</math>"? Colindente ( talk) 13:33, 17 October 2008 (UTC)
Can we get a layman's definition of this term? —Preceding unsigned comment added by Ken444444 ( talk • contribs) 22:53, 18 December 2008 (UTC)
Is nothing better than this picture available?
Its artificiality, in that the errors are uniformly distributed, glares at you the instant you see the page. Michael Hardy ( talk) 05:08, 20 January 2009 (UTC)
OK, that image has been replaced. Michael Hardy ( talk) 22:07, 22 February 2009 (UTC)
The conditional distribution of Yi given Xi is a linear transformation of the distribution of the error term.
It is? Seems to me it's a shifted version of the distribution of the error term, which is not a linear transformation. 72.75.126.69 ( talk) 02:22, 26 January 2009 (UTC)
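In symbols (the standard model, for reference): with <math>Y_i = \mathbf{x}_i^{\mathrm T}\boldsymbol\beta + \varepsilon_i</math>, the conditional distribution is
:<math>F_{Y_i \mid X_i = \mathbf{x}_i}(t) = F_{\varepsilon}\!\left(t - \mathbf{x}_i^{\mathrm T}\boldsymbol\beta\right),</math>
a location shift of the error distribution, which is affine rather than strictly linear.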
There seem to be inconsistencies in the Regression inference section.
1. The expression n − p should be replaced with n − p − 1, since there are p + 1 parameters beta.
2. This statement
(a) refers to the normal equations without explaining what they are.
(b) repeats a formula that was already written a few lines above, in the Least squares estimates section.
I cannot fix these by myself, because this is not my field of expertise. Alex -- talk 22:30, 23 March 2009 (UTC)
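For the record, the normal equations referred to are the standard least-squares ones (in the matrix notation used elsewhere on this page):
:<math>\mathbf{X}^{\mathrm T}\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{X}^{\mathrm T}\mathbf{y}</math>
They arise from setting the gradient of the sum of squared residuals to zero.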
There is just so much wrong with this article...
Firstly and most importantly, it puts undue weight on the least squares method. Even the sentence in the introduction which states that "linear model and least squares are not synonyms" still somehow leaves an impression that they are almost synonyms. True, OLS was quite popular in the times of ENIACs and earlier, but today no serious research paper can be published with OLS in it (unless maybe in a field seriously lacking in mathematical background, like psychology or aesthetics, for example). I'm not criticizing LS per se, merely the unnecessary emphasis on it (sections 2, 3 and 4, more than half of the article). LS belongs on its own subpage and only there...
The Example section is a perfect example of what linear regression IS NOT. Linear regression is not OLS (it would be more instructive to conduct different estimation procedures and compare the results). Linear regression is not about how to multiply matrices (more than half of the section is devoted to this exercise; modern software is capable of multiplying matrices for you). Linear regression does not have to assume normality of errors. Writing β0 = 129±16 = [92.9, 164.7] is misleading at best (it mixes two notational conventions).
The Applications section contains a list which is neither complete nor to any degree representative of the wide variety of uses of linear regression models. Of course, compiling such an exhaustive list is not a feasible task, but at least some note should be added that the list is far from complete.
The Segmented regression section does not even belong in this article, as it is an extension of the linear model to the piecewise-linear case.
Lead section is pretty good :)
// Stpasha ( talk) 04:07, 1 July 2009 (UTC)
I moved some things from the "extensions" section to the "estimation methods" section. This was done after modifying the "assumptions" section to state that "linear regression" need not assume independence of the data. I agree that topics like errors-in-variables are extensions, in that they add structure to the generating model that is not included in the classical formulation of linear regression. But correlations between the observations are too important, and have too long a history in linear modeling, to be termed an "extension"; that would be much too restrictive. Skbkekas ( talk) 15:47, 7 July 2009 (UTC)