I have implemented two methods of finding the parameter vector theta of a linear regression model: gradient (steepest) descent and the normal equation. On the same data they should both return an approximately equal theta vector. However they do not.

Both theta vectors are very similar on all elements but the first one, which is the element that multiplies the column of ones appended to the data.

Here is how the theta vectors look (first column is the output of gradient descent, second is the output of the normal equation):

    Grad desc    Norm eq
    -237.7752    -4.6736
    -5.8471      -5.8467
    9.9174       9.9178
    2.1135       2.1134
    -1.5001      -1.5003
    -37.8558     -37.8505
    -1.1024      -1.1116
    -19.2969     -19.2956
    66.6423      66.6447
    297.3666     296.7604
    -741.9281    -744.1541
    296.4649     296.3494
    146.0304     144.4158
    -2.9978      -2.9976
    -0.8190      -0.8189
What might be the reason for the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by the normal equation? Do I have a bug in my code?

Here is my code for the normal equation (normalEque):

    function theta = normalEque(X, y)
      [m, n] = size(X);
      X = [ones(m, 1), X];       % prepend a column of ones for the intercept
      theta = pinv(X'*X)*X'*y;   % normal equation
    end

And here is the code for gradient descent (gradientDesc):

    function theta = gradientDesc(X, y)
      options = optimset('GradObj', 'on', 'MaxIter', 9999);
      [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                      zeros(size(X, 2), 1), options);
    end

    function [J, grad] = cost(theta, X, y)
      m = size(X, 1);
      X = [ones(m, 1), X];                      % prepend a column of ones for the intercept
      J = sum((X * theta - y) .^ 2) ./ (2*m);   % squared-error cost
      for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;   % partial derivative w.r.t. theta(i)
      end
    end
I pass exactly the same data X and y to both functions (I do not normalize X).

EDIT:

Based on the answers and comments I checked my code and ran some tests.

First I wanted to check whether the problem could be caused by X being close to singular. When I replaced pinv with inv I got the warning Matrix is close to singular or badly scaled. To be sure that that is not the problem I obtained a much larger dataset and ran the tests with this new dataset. This time inv(X) did not display the warning, and using pinv and inv gave the same results. So I hope that X is not close to singular any more.

Then I changed the normalEque code as suggested by woodchips, so now it looks like:

    function theta = normalEque(X, y)
      X = [ones(size(X, 1), 1), X];   % prepend a column of ones for the intercept
      theta = pinv(X)*y;              % least-squares solution via the pseudoinverse
    end
However, the problem is still there. The new normalEque function on the new data, which is not close to singular, gives a different theta than gradientDesc.

To find out which algorithm is at fault I also ran linear regression in the data mining software Weka on the same data. The theta computed by Weka is very similar to the output of normalEque but different from the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is a comparison of the thetas computed by Weka, normalEque and gradientDesc:

    Weka(correct)   normalEque   gradientDesc
    779.8229        779.8163     302.7994
    1.6571          1.6571       1.7064
    1.8430          1.8431       2.3809
    -1.5945         -1.5945      -1.5964
    3.8190          3.8195       5.7486
    -4.8265         -4.8284      -11.1071
    -6.9000         -6.9006      -11.8924
    -15.6956        -15.6958     -13.5411
    43.5561         43.5571      31.5036
    -44.5380        -44.5386     -26.5137
    0.9935          0.9926       1.2153
    -3.1556         -3.1576      -1.8517
    -0.1927         -0.1919      -0.6583
    2.9207          2.9227       1.5632
    1.1713          1.1710       1.1622
    0.1091          0.1093       0.0084
    1.5768          1.5762       1.6318
    -1.3968         -1.3958      -2.1131
    0.6966          0.6963       0.5630
    0.1990          0.1990       -0.2521
    0.4624          0.4624       0.2921
    -12.6013        -12.6014     -12.2014
    -0.1328         -0.1328      -0.1359
I also compared the squared errors of both solutions. The output of normalEque gives a slightly lower squared error, but the difference is marginal. What is more, when I compute the gradient of the cost at a given theta using the function cost (the same one used by gradientDesc), the gradient at the theta returned by normalEque is near zero, while the gradient at the theta returned by gradientDesc is not. Here is what I mean:

    >> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
    >> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
    >> disp([J_gd, J_ne])
      120.9932  119.1469
    >> disp([grad_gd, grad_ne])
      -0.005172856743846  -0.000000000908598
      -0.026126463200876  -0.000000135414602
      -0.008365136595272  -0.000000140327001
      -0.094516503056041  -0.000000169627717
      -0.028805977931093  -0.000000045136985
      -0.004761477661464  -0.000000005065103
      -0.007389474786628  -0.000000005010731
       0.065544198835505  -0.000000046847073
       0.044205371015018  -0.000000046169012
       0.089237705611538  -0.000000046081288
      -0.042549228192766  -0.000000051458654
       0.016339232547159  -0.000000037654965
      -0.043200042729041  -0.000000051748545
       0.013669010209370  -0.000000037399261
      -0.036586854750176  -0.000000027931617
      -0.004761447097231  -0.000000027168798
       0.017311225027280  -0.000000039099380
       0.005650124339593  -0.000000037005759
       0.016225097484138  -0.000000039060168
      -0.009176443862037  -0.000000012831350
       0.055653840638386  -0.000000020855391
      -0.002834810081935  -0.000000006540702
       0.002794661393905  -0.000000032878097
I finally had time to get back to this. There is no "bug".
If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set, and get equally as good an answer. The pinv(X)*y solution is a good one that many like because it is the minimum norm solution.
There is NEVER a good reason to use inv(X)*y. Even worse is to use the inverse on the normal equations: inv(X'*X)*X'*y is simply numerical crap. I don't care who told you to use it, they are guiding you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don't know when it is about to give you crap. So why use it?)
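To see why, here is a small sketch of my own (the matrix below is made up purely for illustration): forming X'*X squares the condition number, so a merely ill-conditioned X turns into a numerically singular X'*X, and the explicit inverse falls apart while the pseudoinverse applied to X itself behaves.

    % hypothetical, nearly collinear design matrix (illustration only)
    X = [1 1; 1 1+1e-8; 1 1-1e-8; 1 1+2e-8];
    y = rand(4, 1);
    disp(cond(X))              % large, but still representable
    disp(cond(X'*X))           % roughly cond(X)^2 -- effectively singular
    t_bad  = inv(X'*X)*X'*y;   % normal equations with an explicit inverse
    t_good = pinv(X)*y;        % pseudoinverse applied to X directly
    disp([t_bad, t_good])      % the explicit-inverse result can be garbage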
The normal equations are in general a bad thing to do, EVEN if you are solving a regularized problem. There are ways to do that which avoid squaring the condition number of the system, although I won't explain them unless asked, as this answer has gotten long enough.
X\y will also yield a result that is reasonable.
There is ABSOLUTELY no good reason to throw an unconstrained optimizer at the problem, as this will yield results that are unstable, completely dependent on your starting values.
As an example, I'll start with a singular problem.
X = repmat([1 2],5,1);
y = rand(5,1);
>> X\y
Warning: Rank deficient, rank = 1, tol = 2.220446e-15.
ans =
0
0.258777984694222
>> pinv(X)*y
ans =
0.103511193877689
0.207022387755377
pinv and backslash return slightly different solutions. As it turns out, there is a basic solution, to which we can add ANY amount of a vector from the null space of X and still have an equally valid solution.
null(X)
ans =
0.894427190999916
-0.447213595499958
pinv generates the minimum norm solution. Of all of the solutions that might have resulted, this one has minimum 2-norm.
In contrast, backslash generates a solution that will have one or more variables set to zero.
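A quick numerical check of that claim (my own sketch, not part of the original answer): take the backslash solution, add an arbitrary multiple of the null-space vector, and the residual does not change, while the pinv solution is the one with the smallest norm.

    X  = repmat([1 2], 5, 1);
    y  = rand(5, 1);
    tb = X \ y;                  % basic solution: one variable is set to zero
    tp = pinv(X) * y;            % minimum-norm solution
    nv = null(X);                % direction along which the fit does not change
    t2 = tb + 3*nv;              % an arbitrary multiple of the null vector
    disp([norm(X*tb - y), norm(X*t2 - y), norm(X*tp - y)])   % identical residuals
    disp([norm(tb), norm(t2), norm(tp)])                      % pinv has the smallest norm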
But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANY amount of that null vector can be added to your solution, and you still have an entirely valid solution, with the same value of the sum of squares of errors.
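Here is a hedged sketch of that effect (my own code, not from the original answer), using fminsearch as a stand-in for any unconstrained optimizer on the same singular problem:

    X   = repmat([1 2], 5, 1);
    y   = rand(5, 1);
    sse = @(t) sum((X*t - y).^2);      % plain sum-of-squares objective
    t_a = fminsearch(sse, [0; 0]);     % start at the origin
    t_b = fminsearch(sse, [10; -10]);  % start somewhere else
    disp([t_a, t_b])                   % two different "solutions"
    disp([sse(t_a), sse(t_b)])         % ...with (numerically) the same error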
Note that even though no singularity warning is returned, this need not mean your matrix is not close to singular. You have changed little about the problem, so it is STILL close, just not enough to trigger the warning.
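One way to check this directly rather than relying on the warning (my own suggestion, not part of the original answer) is to inspect the conditioning of your augmented design matrix yourself; the warning only appears once the reciprocal condition estimate gets close to machine precision.

    Xa = [ones(size(X, 1), 1), X];   % your data with the intercept column
    disp(cond(Xa))                    % large value -> nearly singular
    disp(rcond(Xa'*Xa))               % inv() starts warning only when this nears eps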
You could also try

    theta = X\y

which will use a QR-decomposition method to solve it.

As others mentioned, an ill-conditioned Hessian matrix is likely the cause of your problem.
The number of steps that standard gradient descent algorithms take to reach a local optimum is a function of the largest eigenvalue of the Hessian divided by the smallest (this ratio is known as the condition number of the Hessian). So, if your matrix is ill-conditioned, it could take an extremely large number of iterations for gradient descent to converge to an optimum. (For the singular case, it could converge to many different points, of course.)
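For the squared-error cost used in the question, the Hessian is just X'*X/m with the column of ones included, so you can look at its eigenvalue spread directly. This is a sketch of my own, not part of the answer:

    Xa = [ones(size(X, 1), 1), X];   % augmented design matrix
    H  = (Xa' * Xa) / size(Xa, 1);   % Hessian of the squared-error cost J
    ev = eig(H);
    disp(max(ev) / min(ev))          % condition number: the larger, the slower gradient descent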
I would suggest trying three different things to verify that an unconstrained optimization algorithm works for your problem (which it should); a short sketch of the first two follows below.

1) Generate some synthetic data by computing the result of a known linear function for random inputs and adding a small amount of Gaussian noise. Make sure that you have many more data points than dimensions. This should produce a non-singular Hessian.

2) Add a regularization term to your error function to improve the conditioning of the Hessian (i.e. decrease its condition number).

3) Use a second-order method like conjugate gradient or L-BFGS rather than gradient descent to reduce the number of steps needed for the algorithm to converge. (You will probably need to do this in conjunction with #2.)
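Here is a rough sketch of suggestions 1 and 2 (my own code; the sizes and the lambda value are made up, and the regularized cost handle can be passed to fminunc just like the cost function in the question):

    % 1) synthetic data: many more points than dimensions
    m  = 1000;  n = 10;
    Xs = randn(m, n);                        % random inputs
    w  = randn(n + 1, 1);                    % known "true" parameters (incl. intercept)
    Xa = [ones(m, 1), Xs];
    ys = Xa * w + 0.01 * randn(m, 1);        % add a small amount of Gaussian noise

    % 2) add an L2 (ridge) penalty with a made-up lambda
    lambda  = 1e-2;
    cost_rg = @(t) sum((Xa*t - ys).^2) / (2*m) + lambda * (t' * t) / (2*m);

    % the regularized Hessian (Xa'*Xa + lambda*eye(n+1))/m is better
    % conditioned than Xa'*Xa/m
    disp(cond(Xa'*Xa / m))
    disp(cond((Xa'*Xa + lambda*eye(n + 1)) / m))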