Says One Neuron To Another
1. Statlog (Heart) Data Set
2. Breast Cancer Wisconsin Data Set
To demonstrate the neural network process, three main kinds of layers are constructed.
In general, we design a network as shown above with an input layer, several hidden layers, and an output layer, where
activation functions between the layers act as thresholds.
- data preparation
pandas and its read_csv function are used for loading and sorting the data. For separating the data into train/test sections, sklearn's train_test_split
makes life easier. Below, the header is printed along with the first several rows of data. As shown, this data set has 13 features.
Before going further in real life, the data must be checked for missing values and for the data type of each feature.
The data is split as shown:
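A minimal sketch of this preparation step, assuming the CSV file is named heart.csv and the label sits in the last column (both assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the heart data set; file name and column layout are assumptions here.
df = pd.read_csv("heart.csv")
print(df.head())          # inspect the header and the first rows
print(df.dtypes)          # check the data type of each feature
print(df.isnull().sum())  # check for missing values

# The last column is assumed to hold the binary label.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values.reshape(-1, 1)

# Hold out a test set; the 80/20 split and random_state are arbitrary choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```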
- design the hidden layer(s)
For the heart data, one hidden layer is specified with 13 nodes, matching the size of the input layer.
Weights and biases are initialized with random numbers drawn from a normal distribution, sized according to the input/hidden/output layers, and stored in a dictionary params.
Note that we also set the learning rate, which controls the size of each update step, and the number of iterations, which defines when the training stops.
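A minimal sketch of this initialization, assuming the layer sizes and dictionary keys described above (the function name and the fixed seed are illustrative):

```python
import numpy as np

def init_params(n_input=13, n_hidden=13, n_output=1, seed=1):
    """Draw initial weights and biases from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    params = {
        "W1": rng.standard_normal((n_input, n_hidden)),
        "b1": rng.standard_normal((1, n_hidden)),
        "W2": rng.standard_normal((n_hidden, n_output)),
        "b2": rng.standard_normal((1, n_output)),
    }
    return params

learning_rate = 0.01   # step size of each weight update
iterations = 1000      # number of training passes
```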
- activation functions
In this neural network, I played with both the sigmoid and ReLU activation functions to get familiar with them. The rationale for using sigmoid at the output layer is that we have a binary class output.
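A minimal sketch of the two activation functions used here:

```python
import numpy as np

def relu(z):
    # ReLU passes positive values through and zeroes out the rest.
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid squashes values into (0, 1), convenient for a binary output.
    return 1.0 / (1.0 + np.exp(-z))
```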
- loss function
A loss function is also specified for backward propagation, when we want to update the weights and biases. The choice of loss function depends on the task; in our case we are doing a classification problem, where cross-entropy loss is preferred.
Furthermore, for a binary classification task such as the heart data here, the following loss function is used:
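This points to the standard binary cross-entropy loss; written out, with y the true label, yhat the predicted probability, and N the number of samples:

$$\mathcal{L}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\Big]$$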
- forward propagation
In this phase we compute the weighted sum between the input and the first layer with 'Z1 = (W1 * X) + b1', then pass the result Z1 to our activation function (ReLU) to get A1. From there, the second weighted sum is computed based on the previous result A1, written 'Z2 = (W2 * A1) + b2', and Z2 is passed into the output layer's activation function, returning A2. Finally, the loss during propagation is computed between A2 and the true label.
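A minimal sketch of this forward pass, reusing the relu and sigmoid helpers and the params dictionary from the earlier snippets (the function names are illustrative):

```python
import numpy as np

def forward(X, params):
    """One forward pass: input -> hidden (ReLU) -> output (sigmoid)."""
    Z1 = X @ params["W1"] + params["b1"]   # weighted sum into the hidden layer
    A1 = relu(Z1)                          # hidden activation
    Z2 = A1 @ params["W2"] + params["b2"]  # weighted sum into the output layer
    A2 = sigmoid(Z2)                       # predicted probability yhat
    return Z1, A1, Z2, A2

def cross_entropy_loss(y, yhat, eps=1e-8):
    # Binary cross-entropy, clipped to avoid log(0).
    yhat = np.clip(yhat, eps, 1 - eps)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
```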
- backward propagation
To minimize the loss, we take the derivative of the error function with respect to the weights (W) of our NN and apply gradient descent. As shown in the code, the derivative of the loss function with respect to the output yhat is calculated first, then passed backward to the hidden layer outputs Z2 and Z1. The derivatives of all the variables involved are then computed, except for the input X.
That is where we update our weights and biases, deducting the 'learning rate' times the gradient of the loss with respect to each variable: self.params['W1'] = self.params['W1'] - self.learning_rate * dl_wrt_w1
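A minimal sketch of the backward pass and update under the same assumptions; the combination of a sigmoid output and cross-entropy loss makes the output-layer gradient simplify to (A2 - y):

```python
def backward_and_update(X, y, Z1, A1, A2, params, learning_rate=0.01):
    """One backward pass and gradient-descent update (illustrative sketch)."""
    N = X.shape[0]

    # Output layer: derivative of the loss with respect to Z2.
    dZ2 = (A2 - y) / N
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)

    # Hidden layer: propagate back through W2 and the ReLU.
    dA1 = dZ2 @ params["W2"].T
    dZ1 = dA1 * (Z1 > 0)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)

    # Gradient-descent step: deduct learning_rate * gradient from each parameter.
    params["W1"] -= learning_rate * dW1
    params["b1"] -= learning_rate * db1
    params["W2"] -= learning_rate * dW2
    params["b2"] -= learning_rate * db2
    return params
```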
For the first test, I chose the heart data to start with, since it has a straightforward binary outcome.
Given learning rate = 0.01 and iterations = 1000, the loss curve is as shown, which indicates the training goes smoothly.
I then tested the neural network by feeding it the test data; the result is about 74% accuracy, which is not bad.
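Putting the pieces together, a sketch of how the training loop and evaluation could look with the hypothetical helpers above; plotting losses gives the loss curve:

```python
# Train for the chosen number of iterations, recording the loss each pass.
params = init_params()
losses = []
for i in range(iterations):                      # iterations = 1000
    Z1, A1, Z2, A2 = forward(X_train, params)
    losses.append(cross_entropy_loss(y_train, A2))
    params = backward_and_update(X_train, y_train, Z1, A1, A2,
                                 params, learning_rate)  # learning_rate = 0.01

# Evaluate on the held-out test set: threshold the output probability at 0.5.
_, _, _, A2_test = forward(X_test, params)
preds = (A2_test > 0.5).astype(int)
accuracy = (preds == y_test).mean()
print(f"test accuracy: {accuracy:.2%}")
```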
So far, the basic 2-layer neural network has been successfully built and implemented.
In this project, there are several things we can tweak and play around with to adjust for different tasks and performance. The number of hidden layers and nodes chosen largely determines the runtime and accuracy of our NN. On the other hand, if we start working with deeper networks, we should add more optimization techniques such as Adam, momentum, and so on. On the data side, normalization is always helpful for preparing the data. With deeper NNs, cross-validation is usually used to prevent overfitting, which could be applied in our case as well, but since I'm pretty confident about the result, this step is skipped.
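As an illustration of the normalization point, a common approach is to standardize the features with sklearn's StandardScaler, fitting on the training split only; this is a sketch of that idea, not part of the run reported above:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics
```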
Activation functions are chosen based on the task and computing cost, which is why ReLU is widely used as a default.
Note that the accuracy on the test data is below 80%, which is often considered acceptable because pushing it higher would easily introduce overfitting.