I’m accumulating a lot of notes about these topics. I’m no expert so don’t take anything I say too seriously.


Frameworks and Libraries


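Bias - A high bias model is stubborn: it sticks to its own simple assumptions and underfits, no matter what the training data says. Variance - A high variance model is good at fitting the training data but is perhaps overfitting it and missing the generalizations.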
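To make the two extremes concrete, here is a minimal sketch that fits polynomials of different degrees to a handful of points that roughly follow y = x. The data values and the held-out point are made up for illustration; the only library calls are NumPy's np.polyfit and np.polyval.

```python
import numpy as np

# Made-up training data that roughly follows y = x (an assumption for illustration).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.0])
# One made-up held-out point just past the training range.
x_test, y_test = 6.0, 6.1

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (training MSE, held-out error)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = abs(np.polyval(coeffs, x_test) - y_test)
    return train_mse, test_err

# degree 0: high bias (a constant ignores the trend entirely)
# degree 1: about right for this data
# degree 5: high variance (passes through every training point, noise included)
results = {degree: fit_and_score(degree) for degree in (0, 1, 5)}
for degree, (train_mse, test_err) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, held-out error {test_err:.2f}")
```

The degree 5 fit has essentially zero training error but the worst held-out error of the three. The constant fit does poorly on both, and the straight line does well on both.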

Gaussian Naive Bayes Classifier

The following is handy for building a classifier model from a dataset. Data is features_train and labels are labels_train for the training data. Once the model is built, new classifications can be calculated with new data.

Gaussian Naive Bayes Classifier Example
>>> import numpy as np
>>> features_train= np.array([[-1,-1], [-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
>>> labels_train= np.array([1,1,1,2,2,2])
>>> from sklearn.naive_bayes import GaussianNB
>>> classifier= GaussianNB()
>>> classifier.fit(features_train,labels_train)
>>> classifier.predict([[-.8,-1]])
>>> classifier.predict([[5,4]])
>>> classifier.predict([[0,0]])

And to find out how well a classifier is doing, check it on a test set. Make predictions of the features that produced labels_test and then compare them.

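from sklearn.metrics import accuracy_score
# features_test holds the features that produced labels_test.
predicted_labels_test= classifier.predict(features_test)
correct_ratio= accuracy_score(labels_test,predicted_labels_test)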
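Here is a self-contained sketch of that check, reusing the training data from the Gaussian example. The test set (features_test and labels_test) is hypothetical; I made it up so that one label disagrees with what the classifier will say.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Training data from the Gaussian Naive Bayes example.
features_train = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
labels_train = np.array([1, 1, 1, 2, 2, 2])

# Hypothetical test set. The last point is deliberately labeled 1
# even though it sits in class 2 territory.
features_test = np.array([[-2, -2], [2, 3], [4, 1], [1, 1]])
labels_test = np.array([1, 2, 2, 1])

classifier = GaussianNB()
classifier.fit(features_train, labels_train)
predicted_labels_test = classifier.predict(features_test)

# Fraction of test labels the classifier got right.
correct_ratio = accuracy_score(labels_test, predicted_labels_test)
print(predicted_labels_test, correct_ratio)
```

Three of the four predictions match their labels, so correct_ratio comes out as 0.75.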


Support Vector Machines are weirdly named, almost to the point of foolishness. The "margin" is the width of the space around the dividing (classifying) line generated by the SVM algorithm. I believe that the points that constrain and limit this margin, the ones touching the margin, are the "support vectors", like they’re supporting this line somehow. I think the algorithm is supposed to be thought of as a machine to generate these support vectors, thus the margin, thus the dividing/classifying line/vector.

This Support Vector Classifier (SVC) example uses the same data defined in the Gaussian example.

A "kernel" in this business is a function that, in effect, maps a low dimensionality input to a higher dimensional space with the hope that a linear classifier can cut a line or plane through the resulting mess somewhere. (Strictly, the kernel just computes how similar two points would be in that higher dimensional space, so the mapping itself never has to be built.) This is a way of cheaply introducing non-linearity into a system that a linear classifier can still slice up. Possible kernels available to use include the following.

  • linear

  • poly

  • rbf - radial basis function

  • sigmoid

  • precomputed

  • A DIY callback.
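To see why mapping to a higher dimension helps, here is a small sketch using XOR-style data, which no straight line can separate in 2D. The lift function and the parameter values (C=10, gamma=2.0) are my own choices for illustration, not anything special to scikit-learn.

```python
import numpy as np
from sklearn import svm

# XOR-style data: no straight line separates the two classes in 2D.
features = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
labels = np.array([0, 0, 1, 1])

# A linear kernel on the raw 2D points cannot get them all right.
linear_raw = svm.SVC(kernel='linear', C=10).fit(features, labels)
linear_raw_score = linear_raw.score(features, labels)

# Hand-made mapping to 3D: append (x1 - x2)**2 as a third feature.
# In that space the plane z = 0.5 separates the classes.
def lift(points):
    return np.column_stack([points, (points[:, 0] - points[:, 1]) ** 2])

linear_lifted = svm.SVC(kernel='linear', C=10).fit(lift(features), labels)
linear_lifted_score = linear_lifted.score(lift(features), labels)

# The rbf kernel does a comparable trick implicitly, on the raw 2D points.
rbf_raw = svm.SVC(kernel='rbf', gamma=2.0, C=10).fit(features, labels)
rbf_raw_score = rbf_raw.score(features, labels)

print(linear_raw_score, linear_lifted_score, rbf_raw_score)
```

The linear classifier misses at least one point on the raw data but gets all four right once the third feature is added. The rbf kernel gets all four right without any hand-made mapping.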

Other important parameters.

  • C - Controls the trade off between a smooth decision boundary and classifying training points correctly. Higher C means more training points are classified correctly at the risk of overfitting.

  • gamma - Defines how far the influence of a single training example reaches. Lower values mean that each example has a far reach.
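A rough sketch of how these two parameters pull in opposite directions. It uses the training data from the Gaussian example plus one hypothetical outlier that I added; the specific values for C and gamma are exaggerated on purpose.

```python
import numpy as np
from sklearn import svm

# Training data from the Gaussian example plus one hypothetical outlier:
# a class 1 point sitting in the middle of the class 2 cluster.
features = np.array([[-1, -1], [-2, -1], [-3, -2],
                     [1, 1], [2, 1], [3, 2], [2.5, 1.5]])
labels = np.array([1, 1, 1, 2, 2, 2, 1])

# High C and high gamma: each point only influences its immediate
# neighborhood and mistakes are expensive, so the boundary wraps
# around the outlier.
tight = svm.SVC(kernel='rbf', C=1000, gamma=100).fit(features, labels)
tight_score = tight.score(features, labels)

# Low C and low gamma: each point has a far reach and mistakes are
# cheap, so the boundary stays smooth.
smooth = svm.SVC(kernel='rbf', C=1, gamma=0.01).fit(features, labels)
smooth_score = smooth.score(features, labels)

print("tight: ", tight_score)
print("smooth:", smooth_score)
```

The tight model should classify every training point correctly, outlier included, which is exactly the overfitting risk. The smooth model cannot bend around a single point, so expect a lower training score from it.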

SVM Classifier Example
>>> from sklearn import svm
>>> classifierSVM= svm.SVC() # Or SVC(kernel='linear')
>>> classifierSVM.fit(features_train,labels_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> classifierSVM.predict([[2,2]])
>>> classifierSVM.predict([[-5,-5]])

Entropy For Decision Trees

entropy = -sum(Pi*log2(Pi))

Where Pi is the fraction of items belonging to class i. If all items are homogeneous, the entropy is 0. A 50/50 split between two classes gives the maximal entropy of 1.

Decision trees maximize information gain.

information_gain= entropy(parent) - [weighted average](entropy(potential children))
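Both quantities are easy to compute by hand. This is a minimal sketch in plain Python, using the usual Shannon form of entropy with its leading minus sign. The function names and the example splits are my own and are not from scikit-learn.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child)
                   for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 2, 2, 2]
print(entropy([1, 1, 1]))                                # homogeneous
print(entropy(parent))                                   # 50/50 split
print(information_gain(parent, [[1, 1, 1], [2, 2, 2]]))  # perfect split
print(information_gain(parent, [[1, 2], [1, 1, 2, 2]]))  # useless split
```

The homogeneous list has entropy 0 and the 50/50 list has entropy 1. The perfect split gains the full 1 bit, while the split whose children are still 50/50 gains nothing.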

Decision Trees

Unlike SVM, this classifier is named so well it needs little further elaboration.

>>> from sklearn import tree
>>> tree_clf= tree.DecisionTreeClassifier()
>>> tree_clf.fit(features_train,labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
>>> tree_clf.predict([[2,2]])
>>> tree_clf.predict([[-4,-4]])

One parameter that may help is min_samples_split. This keeps a tree from splitting when there are very few items in a branch. The default is 2 which means that any branch with multiple items can be split.
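A quick sketch of the effect, using the same training data as the earlier examples. The value 10 is deliberately exaggerated (it is more than the whole data set) so the difference is obvious; random_state=0 is only there to keep runs repeatable.

```python
import numpy as np
from sklearn import tree

features_train = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
labels_train = np.array([1, 1, 1, 2, 2, 2])

# Default: any node holding at least 2 samples may be split.
tree_default = tree.DecisionTreeClassifier(random_state=0)
tree_default.fit(features_train, labels_train)

# Require 10 samples before splitting. With only 6 samples the root never splits.
tree_limited = tree.DecisionTreeClassifier(min_samples_split=10, random_state=0)
tree_limited.fit(features_train, labels_train)

# tree_.node_count counts every node, internal ones and leaves alike.
print(tree_default.tree_.node_count)
print(tree_limited.tree_.node_count)
```

With this data the default tree needs a single split (3 nodes: a root and two pure leaves), while the limited tree stays a single leaf.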

Neural Networks

I wondered if there was an analogous biological process to backpropagation. It looks like the answer is inconclusive.

Convolutional Neural Networks

Recurrent Neural Networks

Applied to Secondary Structure of Proteins


Q3 = 100 * residues_correctly_predicted/total_residues. Worries only about helix, beta sheet, and coil (or, it seems to me, "other").
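The accuracy formula above, which is what the Q3 scores quoted below report, can be sketched in a few lines. The function name, the one-letter state codes, and the 8 residue example are hypothetical.

```python
def q3_score(true_states, predicted_states):
    """Percentage of residues whose 3-state label (H, E, or C) was predicted correctly."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must be the same length")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100 * correct / len(true_states)

# Hypothetical 8 residue example: H = helix, E = beta sheet, C = coil.
true_states      = "HHHEECCC"
predicted_states = "HHHECCCC"
print(q3_score(true_states, predicted_states))
```

Seven of the eight residues match here, so the score is 87.5.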


involves fancier elements.


"Profile network from HeiDelberg": "We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict the secondary structure of water-soluble proteins." Burkhard Rost, 1993!

"For networks, such a strategy is prohibited by the limitations of computational resources."

  • Q3 = 70% (Which is frankly not bad for 1993 and only training on 130 proteins!)

Nearest Neighbor Secondary Structure Prediction

NNSSP. Not a neural network. 1995.

  • Q3 = 72.2%

"Discrimination of Secondary structure Class". Another ancient technique, from 1996. Still a 126 protein training set!

  • Q3 = 70.1%


JPred: a consensus secondary structure prediction server.

  • Q3 = 72.9% in 1998

Other Preds included PREDATOR, MULPRED, ZPRED, others?


RaptorX: a Web Portal for Protein Structure and Function Prediction


Using this exact problem as an example. Results not so hot.

  • Q3 = <50%


"A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction" (IEEE). 2014. Spencer M, Eickholt J, Cheng J. Used CUDA on GPUs.

  • Q3 = 80.7%

CNN Example

"Next-Step Conditioned Deep Convolutional Neural Networks". From Google as part of the Google Brain Residency. Feb 2017. I’m not sure it’s really "published" (arXiv).

  • Q8 = 71.4% (Probably couldn’t find an improved Q3.)

Padded to 700 residues. Whatever.

Still seems to be training on the filtered CB6133 data set, which seems to have only 5534 proteins, and testing on CB513. Whatever.

RNNs + some other stuff

"Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks" (PDF). 2016-04-25. This seems very complex. But maybe that’s what’s needed.

  • Q3 = 87.8% on CASP10, 85.3% on CASP11