I’m accumulating a lot of notes about these topics. I’m no expert so don’t take anything I say too seriously.

Resources

  • TensorFlow’s Playground - A brilliant interactive neural network for educational purposes. Superb.

  • YOLO - Real-time object detection.

  • SSD - Single Shot MultiBox Detector object detection.

Frameworks and Libraries

Terms

Bias - A high-bias model is good at generalizing but stubborn about fitting the training data (underfitting). Variance - A high-variance model fits the training data well but may be overfitting and missing the generalizations.

Gaussian Naive Bayes Classifier

The following is handy for building a classifier model from a dataset. The training features are features_train and the training labels are labels_train. Once the model is fit, classifications can be predicted for new data.

Gaussian Naive Bayes Classifier Example
>>> import numpy as np
>>> features_train= np.array([[-1,-1], [-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
>>> labels_train= np.array([1,1,1,2,2,2])
>>> from sklearn.naive_bayes import GaussianNB
>>> classifier= GaussianNB()
>>> classifier.fit(features_train,labels_train)
GaussianNB(priors=None)
>>> classifier.predict([[-.8,-1]])
array([1])
>>> classifier.predict([[5,4]])
array([2])
>>> classifier.predict([[0,0]])
array([1])

To find out how well a classifier is doing, check it on a held-out test set: predict labels from the test features and compare those predictions to labels_test.

from sklearn.metrics import accuracy_score
# Fraction of predictions that match the true test labels.
correct_ratio = accuracy_score(labels_test, predicted_labels_test)
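
The whole flow might look like the following sketch; train_test_split, the 0.33 split, and any variable names beyond those above are my own assumptions, not part of the original example.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

features = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
labels = np.array([1, 1, 1, 2, 2, 2])

# Hold back a third of the data to test on.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33, random_state=0)

classifier = GaussianNB()
classifier.fit(features_train, labels_train)
predicted_labels_test = classifier.predict(features_test)

# Fraction of test labels predicted correctly (1.0 would be perfect).
print(accuracy_score(labels_test, predicted_labels_test))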

SVM

Support Vector Machines are weirdly named, almost to the point of foolishness. The "margin" is the width of the space around the dividing (classifying) line generated by the SVM algorithm. I believe that the points that constrain and limit this margin, the ones touching the margin, are the "support vectors", like they’re supporting this line somehow. I think the algorithm is supposed to be thought of as a machine to generate these support vectors, thus the margin, thus the dividing/classifying line/vector.

This Support Vector Classifier (SVC) example uses the same data defined in the Gaussian example.

A "kernel" in this business is a function that maps a low dimensionality input to a higher dimensional space with the hope that a linear classifier can cut a line or plane through the resulting mess somewhere. This is a way of cheaply introducing non-linearity into a system that a linear classifier can still slice up. Possible kernels available to use include the following.

  • linear

  • poly

  • rbf - radial basis function

  • sigmoid

  • precomputed

  • A DIY callable (your own custom kernel function).

Other important parameters:

  • C - Controls the trade-off between a smooth decision boundary and classifying training points correctly. Higher C means more training points are classified correctly, at the risk of overfitting.

  • gamma - Defines how far the influence of a single training example reaches. Lower values mean each example has a far reach (smoother boundaries); higher values mean a short reach (wigglier boundaries).

SVM Classifier Example
>>> from sklearn import svm
>>> classifierSVM= svm.SVC() # Or SVC(kernel='linear')
>>> classifierSVM.fit(features_train,labels_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> classifierSVM.predict([[2,2]])
array([2])
>>> classifierSVM.predict([[-5,-5]])
array([1])
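
To make kernel, C, and gamma concrete, here is a hedged sketch reusing the same training data; the specific values are arbitrary illustrations, not tuned recommendations.

from sklearn import svm

# Hedged sketch: explicitly choosing the kernel, C, and gamma discussed above.
classifierRBF = svm.SVC(kernel='rbf', C=10.0, gamma=0.1)
classifierRBF.fit(features_train, labels_train)

# Larger C chases the training points harder; larger gamma shrinks each
# example's radius of influence, which makes the decision boundary wigglier.
print(classifierRBF.predict([[3, 3]]))    # expect class 2 with this toy data
print(classifierRBF.predict([[-3, -3]]))  # expect class 1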

Entropy For Decision Trees

entropy = -sum(Pi * log2(Pi))

Where Pi is the fraction of items belonging to class i. If all items are homogeneous (one class), entropy is 0. For a 50/50 split of two classes, entropy is maximal at 1.

Decision trees maximize information gain.

information_gain = entropy(parent) - weighted_average(entropy(potential children))
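
A quick sketch of these two formulas in Python (the helper names are my own):

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent_labels, children_labels):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

print(entropy([1, 1, 1, 1]))  # zero: all items are one class
print(entropy([1, 1, 2, 2]))  # 1.0: a 50/50 split is maximal for two classes
# A split that perfectly separates the classes gains the full bit.
print(information_gain([1, 1, 2, 2], [[1, 1], [2, 2]]))  # 1.0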

Decision Trees

Unlike SVM, this classifier is named so well it needs little further elaboration.

>>> from sklearn import tree
>>> tree_clf= tree.DecisionTreeClassifier()
>>> tree_clf.fit(features_train,labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
>>> tree_clf.predict([[2,2]])
array([2])
>>> tree_clf.predict([[-4,-4]])
array([1])

One parameter that may help is min_samples_split. This keeps a tree from splitting a node when it holds very few items. The default is 2, which means any node with more than one item can be split.
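
For example, this hedged sketch sets it explicitly; the threshold of 50 is arbitrary.

from sklearn import tree

# Refuse to split any node holding fewer than 50 samples. (On the tiny toy
# data above this tree would never split at all; the parameter only starts
# to matter with a realistically sized dataset.)
tree_clf_pruned = tree.DecisionTreeClassifier(min_samples_split=50)
tree_clf_pruned.fit(features_train, labels_train)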

Neural Networks

I wondered if there was an analogous biological process to backpropagation. It looks like the answer is inconclusive.

Convolutional Neural Networks

Recurrent Neural Networks

Applied to Secondary Structure of Proteins

Q3

Q3 = 100 * residues_correctly_predicted / total_residues. It worries only about three states: helix, beta sheet, and coil (or, it seems to me, "other").

Q8

Q8 is the same idea but scored over the eight finer-grained DSSP structure states instead of three, so it involves fancier elements.
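
As a sketch of the Q3 arithmetic, here is a made-up illustration; the q3 helper and the example strings below are my own, not from any of the papers.

def q3(predicted, actual):
    """Percentage of residues whose three-state label is predicted correctly."""
    assert len(predicted) == len(actual)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

# H = helix, E = beta sheet/strand, C = coil ("other").
actual_ss    = "CCHHHHHHCCEEEEECC"
predicted_ss = "CCHHHHHCCCEEEEHCC"
print(q3(predicted_ss, actual_ss))  # 88.2...: 15 of 17 residues correct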

History

PHD

"Profile network from HeiDelberg" link "We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict the secondary structure of water-soluble proteins." Burkhard Rost 1993!

"For networks, such a strategy is prohibited by the limitations of computational resources."

  • Q3 = 70% (Which is frankly not bad for 1993 and only training on 130 proteins!)

Nearest Neighbor Secondary Structure Prediction

NNSSP. Not a neural network, despite the initials. 1995

  • Q3 = 72.2%

DSC

"Discrimination of Secondary structure Class" link Another ancient technique. 1996 Still 126 protein training set!

  • Q3 = 70.1%

JPRED

JPred: a consensus secondary structure prediction server. link

  • Q3 = 72.9% in 1998

Other Preds included PREDATOR, MULPRED, ZPRED, others?

RaptorX

RaptorX: a Web Portal for Protein Structure and Function Prediction link

MATLAB

MATLAB uses this exact problem as one of its examples. Results not so hot.

  • Q3 = <50%

DNSS

"A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction" ieee link. 2014, Spencer M, Eickholt J, Cheng J. Used CUDA on GPUs.

  • Q3 = 80.7%

CNN Example

"Next-Step Conditioned Deep Convolutional Neural Networks" link. From Google as part of the Google Brain Residency, Feb 2017. I’m not sure it’s really "published" (arXiv).

  • Q8 = 71.4% (Probably couldn’t find an improved Q3.)

Padded to 700 residues. Whatever.

Still seem to be training against the CB513 benchmark (513 proteins), with a training set that seems to have only 5534 proteins. Whatever.

RNNs + some other stuff

"Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks" PDF, 2016-04-25. This seems very complex, but maybe that’s what’s needed.

  • Q3 = 87.8% on CASP10, 85.3% on CASP11