<style>
.image.fit {
  all: unset;
  display: inline-block;
  margin-bottom: -5px;
}
</style>

# How do you derive the Gradient Descent rule for Linear Regression and Adaline?

Linear Regression and Adaptive Linear Neurons (Adalines) are closely related to each other. In fact, the Adaline algorithm is identical to linear regression except for a threshold function $g(\cdot)$ that converts the continuous output into a categorical class label:

$$\hat{y} = g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise,} \end{cases}$$

where $z$ is the net input, which is computed as the sum of the input features **x** multiplied by the model weights **w**:

$$z = \sum_{j} w_j x_j = \mathbf{w}^T \mathbf{x}$$

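To make the notation concrete, here is a minimal sketch (with made-up numbers, not from the original text) of the net input and Adaline's threshold function:

```python
import numpy as np

# Hypothetical example data: 3 features plus a bias unit x_0 = 1
x = np.array([1.0, 2.0, 0.5, -1.0])   # x[0] = 1 is the bias unit
w = np.array([0.1, 0.3, -0.2, 0.5])   # w[0] is the bias weight

# Net input: z = sum_j w_j * x_j = w^T x
z = w.dot(x)

# Adaline's threshold function g(z): +1 if z >= 0, else -1
y_hat = 1 if z >= 0 else -1
```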
(Note that $x_0$ refers to the bias unit, set so that $x_0 = 1$.)

In the case of linear regression and Adaline, the activation function $a(\cdot)$ is simply the identity function, so that $a(z) = z$:

$$\hat{y} = a(z) = z$$

Now, in order to learn the optimal model weights **w**, we need to define a cost function that we can optimize. Here, our cost function $J(\mathbf{w})$ is the sum of squared errors (SSE), which we multiply by $\frac{1}{2}$ to make the derivation easier:

$$J(\mathbf{w}) = \frac{1}{2} \sum_{i} \left( y^{(i)} - a\big(z^{(i)}\big) \right)^2$$

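As a quick illustration, the cost for a toy dataset (hypothetical values, assumed for this sketch) can be computed as:

```python
import numpy as np

# Hypothetical toy data; predictions use the identity activation a(z) = z
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])           # first column is the bias unit x_0 = 1
y = np.array([2.0, 3.0, 5.0])        # target labels y^(i)
w = np.array([0.5, 1.0])             # current model weights

output = X.dot(w)                    # a(z^(i)) = z^(i) for each sample
errors = y - output                  # y^(i) - a(z^(i))
cost = 0.5 * (errors ** 2).sum()     # J(w) = 1/2 * SSE
```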
where $y^{(i)}$ is the label (target value) of the *i*th training point $x^{(i)}$.

(Note that the SSE cost function is convex and differentiable, so gradient descent will converge to the global cost minimum.)

In simple words, we can summarize the gradient descent learning as follows:

1. Initialize the weights to 0 or small random numbers.
2. For *k* epochs (passes over the training set)
    A. For each training sample $\mathbf{x}^{(i)}$:
        a. Compute the predicted output value $\hat{y}^{(i)}$
        b. Compare $\hat{y}^{(i)}$ to the actual output $y^{(i)}$ and compute the "weight update" value
        c. Accumulate the "weight update" value
    B. Update the weight coefficients by the accumulated "weight update" values

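The steps above can be sketched in Python as follows (a minimal illustration with made-up toy data; `train_adaline_gd` is a hypothetical helper, not the book's implementation):

```python
import numpy as np

def train_adaline_gd(X, y, eta=0.01, epochs=50):
    """Batch gradient descent as in the steps above:
    accumulate per-sample weight updates, apply once per epoch."""
    w = np.zeros(X.shape[1])              # step 1: initialize weights to 0
    for _ in range(epochs):               # step 2: k passes over the training set
        delta_w = np.zeros_like(w)        # accumulated "weight update" values
        for x_i, y_i in zip(X, y):        # A: for each training sample
            y_hat = x_i.dot(w)            # a: predicted output (identity activation)
            error = y_i - y_hat           # b: compare to the actual output
            delta_w += eta * error * x_i  # c: accumulate the update
        w += delta_w                      # B: apply the accumulated update
    return w

# Toy data generated from y = 1 + 2*x, with a bias column x_0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w = train_adaline_gd(X, y, eta=0.05, epochs=200)
```

With this small, noise-free dataset the learned weights approach the true coefficients `[1.0, 2.0]`.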
Which we can translate into a more mathematical notation:

1. Initialize the weights to 0 or small random numbers.
2. For *k* epochs:
    - For each training sample $\mathbf{x}^{(i)}$:
        - Compute the output $\hat{y}^{(i)} = a\big(z^{(i)}\big) = \mathbf{w}^T \mathbf{x}^{(i)}$
        - $\Delta w_j := \Delta w_j + \eta \big( y^{(i)} - \hat{y}^{(i)} \big) x_j^{(i)}$ (where $\eta$ is the learning rate)
    - $w_j^{(t+1)} := w_j^{(t)} + \Delta w_j$ (where *t* is the time step)

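Equivalently, each epoch's accumulated per-sample updates amount to one vectorized step, $\Delta \mathbf{w} = \eta \, \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})$. A minimal sketch, again with assumed toy data:

```python
import numpy as np

# Same hypothetical toy data: y = 1 + 2*x, with a bias column x_0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

eta = 0.05
w = np.zeros(X.shape[1])
for _ in range(200):                 # k epochs
    errors = y - X.dot(w)            # y^(i) - a(z^(i)) for all samples at once
    w += eta * X.T.dot(errors)       # w := w + Delta_w  (one update per epoch)
```

Summing the per-sample contributions before updating is exactly what makes this *batch* gradient descent, as opposed to stochastic gradient descent, which would update `w` after every sample.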
Performing this global weight update,

$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w},$$

can be understood as "updating the model weights by taking a step in the opposite direction of the cost gradient, scaled by the learning rate *η*":

$$\Delta \mathbf{w} = -\eta \, \nabla J(\mathbf{w}),$$

where the partial derivative with respect to each weight $w_j$ can be written as

$$\frac{\partial J}{\partial w_j} = - \sum_i \left( y^{(i)} - a\big(z^{(i)}\big) \right) x_j^{(i)}.$$

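One way to convince yourself of this formula is to compare the analytic gradient against a numerical finite-difference approximation (a hedged sketch with assumed toy data):

```python
import numpy as np

# Toy data (hypothetical values) and a current weight vector
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 3.0])
w = np.array([0.2, -0.3])

def cost(w):
    """J(w) = 1/2 * sum_i (y^(i) - a(z^(i)))^2 with a(z) = z."""
    errors = y - X.dot(w)
    return 0.5 * (errors ** 2).sum()

# Analytic gradient: dJ/dw_j = -sum_i (y^(i) - a(z^(i))) * x_j^(i), for all j
analytic = -X.T.dot(y - X.dot(w))

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (cost(w_plus) - cost(w_minus)) / (2 * eps)
```

The two gradients agree to within floating-point error, which is a standard sanity check for hand-derived gradients.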
To summarize: in order to use gradient descent to learn the model coefficients, we simply update the weights **w** by taking a step in the opposite direction of the gradient for each pass over the training set -- that's basically it. But how do we get to the equation

$$\Delta w_j = -\eta \, \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - a\big(z^{(i)}\big) \right) x_j^{(i)} \; ?$$

Let's walk through the derivation step by step.

$$
\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j} \, \frac{1}{2} \sum_i \big( y^{(i)} - a(z^{(i)}) \big)^2 \\
&= \frac{1}{2} \sum_i 2 \big( y^{(i)} - a(z^{(i)}) \big) \frac{\partial}{\partial w_j} \big( y^{(i)} - a(z^{(i)}) \big) \\
&= \sum_i \big( y^{(i)} - a(z^{(i)}) \big) \frac{\partial}{\partial w_j} \Big( y^{(i)} - \sum_k w_k x_k^{(i)} \Big) \\
&= \sum_i \big( y^{(i)} - a(z^{(i)}) \big) \big( -x_j^{(i)} \big) \\
&= - \sum_i \big( y^{(i)} - a(z^{(i)}) \big) x_j^{(i)}
\end{aligned}
$$

(In the third step, $y^{(i)}$ does not depend on $w_j$, and only the $k = j$ term of the net input survives the differentiation.) Negating this gradient and scaling by $\eta$ gives exactly the weight update rule above.