I’m accumulating a lot of notes about these topics. I’m no expert so don’t take anything I say too seriously.

## Gaussian Naive Bayes Classifier

The following is handy for building a classifier model from a dataset. Data is `features_train` and labels are `labels_train` for the training data. Once the model is built, new classifications can be calculated with new data.

Gaussian Naive Bayes Classifier Example
``````>>> import numpy as np
>>> features_train= np.array([[-1,-1], [-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
>>> labels_train= np.array([1,1,1,2,2,2])
>>> from sklearn.naive_bayes import GaussianNB
>>> classifier= GaussianNB()
>>> classifier.fit(features_train,labels_train)
GaussianNB(priors=None)
>>> classifier.predict([[-.8,-1]])
array()
>>> classifier.predict([[5,4]])
array()
>>> classifier.predict([[0,0]])
array()``````

And to find out how well a classifier is doing, check it on a test set. Make predictions of the features that produced `labels_test` and then compare them.

``````from sklearn.metrics.classification import accuracy_score
correct_ratio= accuracy_score(labels_test,predicted_labels_test)``````

## SVM

Support Vector Machines are weirdly named, almost to the point of foolishness. The "margin" is the width of the space around the dividing (classifying) line generated by the SVM algorithm. I believe that the points that constrain and limit this margin, the ones touching the margin, are the "support vectors", like they’re supporting this line somehow. I think the algorithm is supposed to be thought of as a machine to generate these support vectors, thus the margin, thus the dividing/classifying line/vector.

This Support Vector Classifier (SVC) example uses the same data defined in the Gaussian example.

A "kernel" in this business is a function that maps a low dimensionality input to a higher dimensional space with the hope that a linear classifier can cut a line or plane through the resulting mess somewhere. This is a way of cheaply introducing non-linearity into a system that a linear classifier can still slice up. Possible kernels available to use include the following.

• linear

• poly

• rbf - radial basis function

• sigmoid

• precomputed

• A DIY callback.

Other important parameters. * c - Controls the trade off between smooth decision boundary and classifying training points correctly. Higher c means more training points are classified correctly at the risk of overfitting. * gamma - Defines how far the influence of a single training example reaches. Lower values mean that each value has a far reach.

SVM Classifier Example
``````>>> from sklearn import svm
>>> classifierSVM= svm.SVC() # Or SVC(kernel='linear')
>>> classifierSVM.fit(features_train,labels_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> classifierSVM.predict([[2,2]])
array()
>>> classifierSVM.predict([[-5,-5]])
array()``````

### Entropy For Decision Trees

``entropy = sum(Pi*log2(Pi))``

Where Pi is a fraction of a certain kind of classification. All items homogeneous, then entropy is 0. A 50/50 split, then maximal entropy at 1.

Decision trees maximize information gain.

``information_gain= entropy(parent) - [weighted average](entropy(potential children))``

## Decision Trees

Unlike SVM, this classifier is named so well it needs little further elaboration.

``````>>> from sklearn import tree
>>> tree_clf= tree.DecisionTreeClassifier()
>>> tree_clf.fit(features_train,labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
>>> tree_clf.predict([[2,2]])
array()
>>> tree_clf.predict([[-4,-4]])
array()``````

One parameter that may help is `min_samples_split`. This keeps a tree from splitting when there are very few items in a branch. The default is 2 which means that any branch with multiple items can be split.

## Neural Networks

I wondered if there was an analogous biological process to backpropagation. It looks like the answer is inconclusive.

### Forget About Neurons

…originally motivated by the goal of having machines that can mimic the brain. …[the reason for learning is] they work really well… and not, certainly not, just because they’re "biologically" [air quotes!] motivated.

— Andrew Ng

Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if you draw analogies between Neural Networks and real brains.

— Karpathy

It is often worth understanding models that are known to be wrong (but we must not forget that they are wrong!)

— Hinton

### Terminology

• Perceptron - inputs are binary and produces a single binary output. As it is a step function, it is difficult to fine tune so not so good with learning in incremental iterative techniques.

• Sigmoid neurons - input values are between 0 and 1. Includes a weight and a bias. The output is calculated with the sigmoid function, `1/(1+e^-z)`. Also known as "logistic neurons" using the "logistic function". It is the quintessential activation function though others are possible. This activation function also has helpful properties for the necessary differentiation of neural networks. This kind neuron/operation takes only 1 input and is an example of an activation function which can add non-linearity. `def sigmoid(z): return 1.0/(1.0+np.exp(-z))` `def sigmoid_derivative(z): sigmoid(z)*(1-sigmoid(z))`

• Multilayer Perceptrons - (MLP) Basically just a neural network with different layers, confusingly, made up of sigmoid neurons, not perceptrons.

• Cost Function - Says that our training algorithm has done a good job if C(w,b) is near 0. It is sometimes called a "loss function". This is really a proxy for the thing you really want (e.g. images classified correctly) the difference being that this is something that can be tractably optimized. Also known as the "objective function", or "quadratic cost function" or "mean squared error" or "MSE". Here is the MSE function, `C(w,b)=(1/(2*n))*sum((y(x)-a)^2)`, or shorter, `np.mean((y-a)**2)`. There can be other cost functions, for example, the cross-entropy cost function, `C=(-1/n)*sum( y*ln(a) + (1-y)*ln(1-a) )`.

• Stochastic Gradient Descent (SGD) - A technique to find minima. Rather than considering all of the training set as a basis for where to look for the lower cost function, a few (orders of magnitude less) training inputs are chosen randomly to represent the entire known set.

• Delta rule - This is the specific concept that establishes a strategy for updating weights in gradient descent. It is the reason that a when a change in a weight produces a large change in cost function, that weight is adjusted more aggressively. (I could imagine the complete opposite being plausible but that is just not how it’s done.) Specifically, the formula is `deltaW[j][i]= eta*(target[j]-y[j])*x[i]`.

• Training Set - The data that the model is trained with. If the model gets these wrong, then you know you’ve failed.

• Validation Set - Often the training data is segregated into two parts. The first part is used to do the actual training but a smaller reserved portion can be used to check if there is overfitting in between iterations of model adjustment. It’s like a mini synthetic test set.

• Test Set - This is the data you you will use to do a final check on the model. I may not even be available to you if it is part of some contest.

• Batch Size - Instead of trying a set of inputs and seeing how that goes, it is common to try many in parallel to get a more complete understanding of the system. With the right hardware (GPUs, multiple CPUs) this can speed things up enormously. Common batch sizes are 32, 64, 128, 256, 512, generally whatever can comfortably fit in memory. Turns the input vector into an input matrix.

• Bias - Good at generalizing, stubborn about fitting new training data. Consistent.

• Variance - Good at fitting but perhaps overfitting and missing the generalizations. Accurate. More on these terms here interestingly enough.

## Back Propagation

The critical mathematical breakthrough in neural networks isn’t necessarily the network, rather it’s a clever way to automatically adjust the weights of the neurons so that the outputs turn out to be the values they should be, i.e. error is minimized. This is done in three steps. First the inputs are propagated through the network. During training this produces an end value which can be compared against the known true value (the whole premise of "supervised learning" is that you know the answers for some training inputs). The amount the system differs from what is known to be the correct answer is the error. The trick is to figure out how much each of the weights of each neuron (each node in the NN DAG) contributes to that error. This is done in the second step which is back propagation. Back propagation figures out how much each neuron’s weight contributed to the error and therefore also how the weights of the neurons can be adjusted to reduce this error. The third step is to simply update the weights to reflect this new understanding of the system. With a hopefully improved model that should produce a smaller error, the whole process is run again. This is repeated until the error no longer can be reduced.

It turns out that backpropagation is a dynamic programming optimization technique because the intermediate partial derivatives are computed once and used repeatedly in the adjustment of all network parameters (weights, biases) saving considerable computational expense. This allows the networks to become more complex and therefore more expressive.

This post implements a very small and simple example of a neural net with back propagation. The essential code looks something like this.

```X= np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ]) # Input data set.
y= np.array([ [0,1,1,0] ]).T                     # Output data set.
syn0= 2*np.random.random((3,4))-1                # Hidden weights.
syn1= 2*np.random.random((4,1))-1
for j in xrange(60000):
l1= 1/(1+np.exp(-(np.dot(X,syn0))))           # Input Layer.
l2= 1/(1+np.exp(-(np.dot(l1,syn1))))          # Hidden Layer.
l2_delta= (y-l2)*(l2*(1-l2))                  # Back prop intermediate
l1_delta= l2_delta.dot(syn1.T) * (11 * (1-l1)) # Back prop intermediate
syn1 += l1.T.dot(l2_delta)
syn0 += X.T.dot(l1_delta)```

In back propagation, the question that must be answered for each variable weight in the network is how much does a particular weight contribute to the magnitude of the final error? Or put another way, when changing only a particular variable in the whole complex system by a certain amount, how much does that change the overall error? By answering this, the contributory effects of each weight variable can be ascertained and sensibly adjusted. Amazingly a good way to think about this question is with calculus in mind. The change in final error per (with respect to) the change in some contributory variable is a derivative. Each variable weight in the network graph must be considered in a derivative of the final error. dE/dw1, dE/dw2, etc.

### Chain Rule

In calculating the error contributions of each weight (dE/dwN) the graph can have many edges between the final known error value and the node with the weight of interest. The chain rule is used to differentiate composite functions which allows the process to proceed from the known final error node step by step propagating backwards through the graph until all nodes are covered.

``d(f(g(x)))/dx = f'(g(x))*g'(x)``
``h(x) = (sin(x))^2   # Example composite fn to find derivative of.``
``````f(x) = x^2          # Can be broken in to part 1...
g(x) = sin(x)       # ...and part 2.
f'(x) = 2x          # Simple derivative of part 1.
g'(x) = cos(x)      # Simple derivative of part 2.``````
``````h'(x) = f'(g(x)) * g'(x)     # Chain rule.
h'(x) = 2*sin(x) * cos(x)  # The complex derivative.``````

Here’s another excellent example from this nice slow explanation of the chain rule as applied to neural networks. (I’ve adjusted the notation, so it could be pedantically wrong, but it makes the point.)

``````f(x) = e^( sin( x^2 ) )
u = sin(v)
f(x) = e^u
v = x^2
f'(x) = e^u * u'(v)
u'(v) = cos(v) * v' = cos(x^2) * 2x
f'(x) = df/dx = e^(sin(x^2)) * cos(x^2) * 2x``````

This shows the chain rule being used twice, once for f' and again for u'.

Also the brilliant C. Olah has a fantastic and extremely clear explanation of back propagation (and more!).

Also called `steepest-descent`. This is a numerical algorithm to efficiently find a least-squares fit for non-linear equations. Simplified it looks something like this.

``X[i] = X[i-1] - learning_rate*derivative_of_f(X[i-1])``

This technique can perform badly if it has to work through a long narrow shape. For example, imagine a ball rolling back and forth across a bathtub before it finally makes it to the drain, the system’s minimum.

There are actually other algorithms that do pretty much the same thing with different trade offs. For example, the Gauss-Newton algorithm. Another is the Levenberg-Marquardt algorithm which is used, for example, by GNUPlot’s fit function.

Here’s a good simple explanation of gradient descent.

## Misc

It seems like the job of optimizing the execution time of trained neural nets with architecture changes could be a job for neural nets.

@ID_AA_Carmack 2017-11-09
— John Carmack

## Recurrent Neural Networks

LSTM

Long Short Term Memory

GRU

Gated Recurrent Unit

## Applied to Secondary Structure of Proteins

Q3

100 * residues_correctly_predicted/total_residues Worries only about helix, beta sheet, and coil (or, it seems to me "other").

Q8

involves fancier elements.

### History

If you tell me precisely what it is a machine cannot do, then I can always make a machine which will do just that.

— John von Neumann

#### PHD

"Profile network from HeiDelberg" link "We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict the secondary structure of water-soluble proteins." Burkhard Rost 1993!

"For networks, such a strategy is prohibited by the limitations of computational resources."

• Q3 = 70% (Which is frankly not bad for 1993 and only training on 130 proteins!)

#### Nearest Neighbor Secondary Structure Prediction

NNSSP Not a neural network. 1995

• Q3 = 72.2%

#### DSC

"Discrimination of Secondary structure Class" link Another ancient technique. 1996 Still 126 protein training set!

• Q3 = 70.1%

#### JPRED

JPred: a consensus secondary structure prediction server. link

• Q3 = 72.9% in 1998

Other Preds included PREDATOR, MULPRED, ZPRED, others?

#### RaptorX

RaptorX: a Web Portal for Protein Structure and Function Prediction link

#### MatLab

Using this exact problem as an example. Results not so hot.

• Q3 = <50%

#### DNSS

A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction ieee link 2014 Spencer M, Eickholt J, Cheng J Used CUDA on GPUs.

• Q3 = 80.7%

#### CNN Example

Next-Step Conditioned Deep Convolutional Neural Networks link From Google as part of Google Brain Residency. Feb 2017 - I’m not sure it’s really "published" (arXiv).

• Q8 = 71.4% (Probably couldn’t find an improved Q3.)

Padded to 700 residues. Whatever.

Still seem to be training on CB513 data set which seems to have only 5534 proteins. Whatever.

#### RNNs + some other stuff

Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks PDF 2016-04-25 This seems very complex. But maybe that’s what’s needed.

• Q3 87.8% on CASP10, 85.3% on CASP11