This page explains how to use the most common of the implemented models.
Each model corresponds to a different structured prediction task, or possibly
a different parametrization of the model. As such, the training labels
Y have a slightly different form for each model.
A model is given by four functions: joint_feature, inference, loss and
loss_augmented_inference. If you just want to use the included models,
you don't need to worry about these, and can just use the fit, predict
and score methods of the learner.
For those interested in what happens behind the scenes, or those who might want to
adjust a model, there is a short explanation of these functions for each model below.
For all models, the joint_feature(x, y) function takes a data point and a
tentative prediction, and computes a continuous vector of a fixed length that
captures the relation between features and label. Learning (that is,
learner.fit(X, y)) will learn a parameter vector w, and predictions
will be made using

    y* = argmax_y w^T joint_feature(x, y)

That means the number of parameters in the model is the same as the length of
the vector returned by joint_feature. The actual maximization is performed in the
inference(x, w) function, which takes a sample x and a parameter vector w and
outputs a y which (at least approximately) maximizes the above equation.
The loss(y_true, y_pred) function gives a numeric loss for a ground truth
y_true and a prediction y_pred, and finally
loss_augmented_inference(x, y, w) gives an (approximate) maximizer of

    argmax_y' w^T joint_feature(x, y') + loss(y, y')
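To make these roles concrete, here is a minimal sketch (with hypothetical helper names, not PyStruct's internal code) of what inference and loss-augmented inference look like for a model whose label set is small enough to enumerate:

>>> import numpy as np
>>> def inference(model, x, w, label_set):
...     # pick the y maximizing the linear score w . joint_feature(x, y)
...     scores = [np.dot(w, model.joint_feature(x, y)) for y in label_set]
...     return label_set[int(np.argmax(scores))]
>>> def loss_augmented_inference(model, x, y_true, w, label_set):
...     # same maximization, but every candidate additionally earns its loss
...     scores = [np.dot(w, model.joint_feature(x, y)) + model.loss(y_true, y)
...               for y in label_set]
...     return label_set[int(np.argmax(scores))]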
A good place to understand these definitions is Multi-class SVM.
Currently all models expect labels to be integers from 0 to n_states - 1 (or n_classes - 1). Starting labels at 1 or using other label encodings may lead to errors and / or incorrect results.
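If your labels are in some other encoding, numpy can remap them; for example (y_raw here is just an illustrative array):

>>> import numpy as np
>>> y_raw = np.array(['cat', 'dog', 'cat', 'bird'])
>>> classes, y = np.unique(y_raw, return_inverse=True)
>>> y
array([1, 2, 1, 0])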
None of the models includes a bias (intercept) by default. It is therefore usually a good idea to add a constant feature to all feature vectors, both for unary and for pairwise features.
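For example, a constant feature can be appended like this (a minimal sketch; X stands for any feature array):

>>> X = np.random.rand(5, 3)            # any (n_samples, n_features) array
>>> X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])
>>> X_with_bias.shape
(5, 4)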
A precursor to structured SVMs was the multi-class SVM by Crammer and Singer. While in practice it is often faster to use a One-vs-Rest approach with an optimized binary SVM, this is a good hello-world example for structured prediction and for using PyStruct. For multi-class SVMs, in contrast to more structured models, the label set Y is just the set of classes, so inference can be performed by simply enumerating Y.
Let's say we want to classify the classical iris dataset. There are three classes and four features:
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> np.unique(iris.target)
array([0, 1, 2])
We split the data into training and test set:
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
The Crammer-Singer model is implemented in models.MultiClassClf.
As this is a simple multi-class classification task, we can pass in the training data
as a numpy array of shape (n_samples, n_features) and the training labels as
a numpy array of shape (n_samples,), with classes from 0 to 2.
For training, we pick the learner
learners.NSlackSSVM, which works
well with few samples and requires little tuning:
>>> from pystruct.learners import NSlackSSVM
>>> from pystruct.models import MultiClassClf
>>> clf = NSlackSSVM(MultiClassClf())
The learner has the same interface as a scikit-learn estimator:
>>> clf.fit(X_train, y_train)
NSlackSSVM(C=1.0, batch_size=100, break_on_bad=False, check_constraints=True,
      inactive_threshold=1e-05, inactive_window=50, logger=None,
      max_iter=100, model=MultiClassClf(n_features=4, n_classes=3),
      n_jobs=1, negativity_constraint=None, show_loss_every=0,
      switch_to=None, tol=0.001, verbose=0)
>>> clf.predict(X_test)
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0,
       0, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0, 0,
       1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2])
>>> clf.score(X_test, y_test)
0.96...
For this simple model, joint_feature(x, y) is a vector of size
n_features * n_classes, which corresponds to one copy of the input features
for each possible class. For any given pair (x, y), the features in x are
placed at the position corresponding to the class y, and the rest of the
vector is zero. Correspondingly, the weights that are learned are one vector
of length n_features for each class:
w = np.hstack([w_class_0, w_class_1, w_class_2]).
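As an illustration, a pure-Python version of this joint feature (not PyStruct's actual implementation) could look like:

>>> def multiclass_joint_feature(x, y, n_classes):
...     # one copy of x in the block belonging to class y, zeros elsewhere
...     n_features = x.shape[0]
...     result = np.zeros(n_classes * n_features)
...     result[y * n_features:(y + 1) * n_features] = x
...     return result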
For this simple model, inference is just the argmax over the inner products of x with each of these class weight vectors:
>>> y_pred = np.argmax(np.dot(w.reshape(n_classes, n_features), x))
To perform max-margin learning, we also need the loss-augmented inference. PyStruct has an optimized version, but a pure python version would look like this:
>>> scores = np.dot(w.reshape(n_classes, n_features), x)
>>> scores[np.arange(n_classes) != y] += 1
>>> y_pred = np.argmax(scores)
Essentially, the response (score / energy) of every wrong label is increased by 1, the loss incurred by predicting it instead of the true label.
A multi-label classification task is one where each sample can be labeled with any number of classes. In other words, there are n_classes binary labels, each indicating whether a sample belongs to a given class or not. This could be treated as n_classes independent binary classification problems, as the scikit-learn OneVsRestClassifier does. However, it can be beneficial to exploit correlations between labels to achieve better generalization.
In the scene classification dataset, each sample is a picture of an outdoor scene, represented using simple color aggregation. The labels characterize the kind of scene, which can be "beach", "sunset", "fall foliage", "field", "mountain" or "urban". Each image can belong to multiple classes, such as "fall foliage" and "field". Clearly some combinations are more likely than others.
We could try to model all possible combinations, which would result in a 2 ** 6
= 64 class multi-class classification problem. This would allow us to explicitly model all correlations between labels,
but it would prevent us from predicting combinations that don't appear in the training set.
Even for combinations that did appear in the training set, the number of samples in each class would be very small.
A compromise between modeling all correlations and modeling no correlations is to model only pairwise correlations,
which is the approach implemented in models.MultiLabelClf.
The input to this model is similar to the multi-class SVM, with the training data
a numpy array of shape (n_samples, n_features) and the training labels a binary
indicator matrix of shape (n_samples, n_classes):
>>> from pystruct.datasets import load_scene
>>> scene = load_scene()
>>> X_train, X_test = scene['X_train'], scene['X_test']
>>> y_train, y_test = scene['y_train'], scene['y_test']
>>> X_train.shape
(1211, 294)
>>> y_train.shape
(1211, 6)
>>> from pystruct.learners import NSlackSSVM
>>> from pystruct.models import MultiLabelClf
>>> clf = NSlackSSVM(MultiLabelClf())
Training looks as before, only that
y_train is now a matrix:
>>> # clf.fit(X_train, y_train)
>>> # clf.predict(X_test)
>>> # clf.score(X_test, y_test)
With only 64 possible label combinations, we can actually enumerate all states. Unfortunately, inference in a fully connected binary graph is in general NP-hard, so we might need to rely on approximate inference, like loopy belief propagation or AD3.
An alternative to using approximate inference for larger numbers of labels is to not create a fully connected graph,
but to restrict ourselves to pairwise interactions on a tree over the labels. In the above example of outdoor scenes,
some labels might be informative about others: maybe a beach picture is likely to also be a sunset picture, while
an urban scene might have as many sunset as non-sunset samples. The optimal tree structure for such a problem
can easily be found as the Chow-Liu tree, which is simply the maximum weight spanning tree over the graph whose
edge weights are given by the mutual information between labels on the training set.
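PyStruct computes this tree for you (see below), but the construction itself is simple; a sketch using scipy and scikit-learn could look like this:

>>> from sklearn.metrics import mutual_info_score
>>> from scipy.sparse.csgraph import minimum_spanning_tree
>>> n_labels = y_train.shape[1]
>>> mi = np.zeros((n_labels, n_labels))
>>> for i in range(n_labels):
...     for j in range(n_labels):
...         mi[i, j] = mutual_info_score(y_train[:, i], y_train[:, j])
>>> np.fill_diagonal(mi, 0)
>>> # the maximum spanning tree is the minimum spanning tree of the
>>> # negated mutual information
>>> tree = minimum_spanning_tree(-mi)
>>> edges = np.vstack(tree.nonzero()).T   # (n_labels - 1, 2) edge array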
You can use the Chow-Liu tree method simply by specifying edges="chow_liu" when
constructing the model. This allows us to use efficient and exact max-product
message passing for inference:
>>> clf = NSlackSSVM(MultiLabelClf(edges="chow_liu"))
Training and prediction then work exactly as before:
>>> # clf.fit(X_train, y_train)
>>> # clf.predict(X_test)
>>> # clf.score(X_test, y_test)
This model for multi-label classification with full connectivity is taken from the paper T. Finley, T. Joachims, Training Structural SVMs when Exact Inference is Intractable.
The model creates a graph over n_classes binary nodes, together with edges
between each pair of classes. Each binary node represents one class, and
therefore gets its own column in the weight vector, similar to the
Crammer-Singer multi-class classification.
In addition, there is a pairwise weight between each pair of labels. This leads to a joint feature function with one copy of the input features per class, switched on by the corresponding binary label, plus one interaction term per pair of labels.
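As a rough sketch (the exact ordering of terms inside PyStruct is an implementation detail), the joint feature could be written as:

>>> from itertools import combinations
>>> def multilabel_joint_feature(x, y):
...     # one copy of x per class, switched on by the binary label
...     unaries = np.hstack([x * y_i for y_i in y])
...     # one interaction term per pair of labels
...     pairwise = [y_i * y_j for y_i, y_j in combinations(y, 2)]
...     return np.hstack([unaries, np.array(pairwise)])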
The implementation of inference for this model creates a graph with unary potentials (given by the inner product of features and weights) and pairwise potentials given by the pairwise weights. This graph is then passed to the general graph inference, which runs the selected algorithm.
The following models are all pairwise models over nodes, that is, they model a
labeling of a graph using features at the nodes and relations between
neighboring nodes. The main assumption in these PyStruct models is that
nodes are homogeneous, that is, they all have the same meaning. That means
each node has the same number of classes, and these classes mean the same
thing. In practice this means that weights are shared across all nodes and
edges, and the model adapts via features.
This is in contrast to models.MultiLabelClf, which builds a binary graph
where nodes mean different things (each node represents a different class), so
they do not share weights.
I call these models Conditional Random Fields (CRFs), but this is a slight abuse of terminology, as PyStruct actually implements perceptron and max-margin learning, not maximum likelihood learning. So these models might better be called Maximum Margin Random Fields. However, in the computer vision community, most pairwise models seem to be called CRFs, independent of the method of training.
One of the most common use-cases for structured prediction is chain-structured
outputs, implemented in
models.ChainCRF. These occur naturally in
sequence labeling tasks, such as Part-of-Speech tagging or named entity
recognition in natural language processing, or segmentation and phoneme
recognition in speech processing.
As an example dataset, we will use the toy OCR dataset letters. In this dataset, each sample is a handwritten word, segmented into letters. This dataset has a slight oddity, in that the first letter of every word was removed, as it was capitalized, and therefore different from all the other letters.
Each letter is a node in our chain, and neighboring letters are connected with
an edge. The length of the chain varies with the number of letters in the
word. As in all CRF-like models, the nodes in
models.ChainCRF all have
the same meaning and share parameters.
The letters dataset comes with prespecified folds, we take one fold to be the training set, and the rest to be the test set, as in Max-Margin Markov Networks:
>>> from pystruct.datasets import load_letters
>>> letters = load_letters()
>>> X, y, folds = letters['data'], letters['labels'], letters['folds']
>>> X, y = np.array(X), np.array(y)
>>> X_train, X_test = X[folds == 1], X[folds != 1]
>>> y_train, y_test = y[folds == 1], y[folds != 1]
>>> len(X_train)
704
>>> len(X_test)
6173
The training data is an array of samples, where each sample is a numpy array of
shape (n_nodes, n_features). Here n_nodes is the length of the input
sequence, that is, the length of the word in our case. This means the input
array actually has dtype object: we cannot store the features in a simple
array, as the input sequences have different lengths:
>>> X_train[0].shape
(9, 128)
>>> y_train[0].shape
(9,)
>>> X_train[1].shape
(7, 128)
>>> y_train[1].shape
(7,)
Edges don’t need to be specified, as the input features are assumed to be in the order of the nodes in the chain.
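For example, for a word of four letters the implicit chain edges connect each node to its successor:

>>> n_nodes = 4
>>> np.column_stack([np.arange(n_nodes - 1), np.arange(1, n_nodes)])
array([[0, 1],
       [1, 2],
       [2, 3]])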
The default inference method is max-product message passing on the chain (aka
Viterbi), which is always exact and efficient. We use
learners.FrankWolfeSSVM, which is a very efficient learner when
inference is fast:
>>> from pystruct.models import ChainCRF
>>> from pystruct.learners import FrankWolfeSSVM
>>> model = ChainCRF()
>>> ssvm = FrankWolfeSSVM(model=model, C=.1, max_iter=10)
>>> ssvm.fit(X_train, y_train)
FrankWolfeSSVM(C=0.1, batch_mode=False, check_dual_every=10,
       do_averaging=True, line_search=True, logger=None, max_iter=10,
       model=ChainCRF(n_states: 26, inference_method: max-product),
       n_jobs=1, random_state=None, sample_method='perm',
       show_loss_every=0, tol=0.001, verbose=0)
>>> ssvm.score(X_test, y_test)
0.78...
The unary potential of node i being in state s is given by the inner product of the features at this node (the input image) with the weight vector for state s, which is shared over all nodes:

    unary_potential(i, s) = w_s^T x_i

The pairwise potentials are identical over the whole chain and given directly by the weights: the potential for the transition from state s to state t is just the learned scalar w_{s, t}.
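In code, the learned weight vector can be split into these two blocks roughly as follows (a sketch; the exact layout of w inside PyStruct is an implementation detail, and this assumes the unary block comes first and the chain is directed):

>>> n_states, n_features = 26, 128
>>> w = ssvm.w
>>> unary_weights = w[:n_states * n_features].reshape(n_states, n_features)
>>> pairwise_weights = w[n_states * n_features:].reshape(n_states, n_states)
>>> x_0 = X_train[0]                                 # first training word
>>> unary_potentials = np.dot(x_0, unary_weights.T)  # (n_nodes, n_states)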
The models.GraphCRF model is a generalization of the ChainCRF to
arbitrary graphs. While in the chain model the direction of an edge is
usually important, for many graphs the direction of an edge has no semantic
meaning. Therefore, by default, the pairwise interaction matrix of
models.GraphCRF is forced to be symmetric.
Each training sample for models.GraphCRF is a tuple (features, edges), where
features is a numpy array of node features (of shape
(n_nodes, n_features)), and edges is an array of edges between nodes, of shape
(n_edges, 2). Each row of the edge array gives the indices of the two
nodes connected by the edge, starting from zero.
>>> features, y, folds = letters['data'], letters['labels'], letters['folds']
>>> features, y = np.array(features), np.array(y)
>>> features_train, features_test = features[folds == 1], features[folds != 1]
>>> y_train, y_test = y[folds == 1], y[folds != 1]
For a single word made out of 9 characters:
>>> features_0 = features_train[0]
>>> features_0.shape
(9, 128)
>>> n_nodes = features_0.shape[0]
>>> edges_0 = np.vstack([np.arange(n_nodes - 1), np.arange(1, n_nodes)]).T
>>> edges_0
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8]])
>>> x = (features_0, edges_0)
For the whole training dataset:
>>> X_train = [(features_i,
...             np.vstack([np.arange(features_i.shape[0] - 1),
...                        np.arange(1, features_i.shape[0])]).T)
...            for features_i in features_train]
>>> X_train[0]
(array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 1, 1]], dtype=uint8), array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8]]))
Now we can fit a (directed)
models.GraphCRF on this data:
>>> from pystruct.models import GraphCRF
>>> from pystruct.learners import FrankWolfeSSVM
>>> model = GraphCRF(directed=True, inference_method="max-product")
>>> ssvm = FrankWolfeSSVM(model=model, C=.1, max_iter=10)
>>> ssvm.fit(X_train, y_train)
FrankWolfeSSVM(C=0.1, batch_mode=False, check_dual_every=10,
       do_averaging=True, line_search=True, logger=None, max_iter=10,
       model=GraphCRF(n_states: 26, inference_method: max-product),
       n_jobs=1, random_state=None, sample_method='perm',
       show_loss_every=0, tol=0.001, verbose=0)
The potentials are the same as in the ChainCRF model, with unary potentials given as a shared linear function of the node features, and pairwise potentials shared across all edges.
This model is the most general of the CRF models and contains all the others as special cases. It again assumes that the parameters of the potentials are shared over all nodes and all edges, but the pairwise potentials are now also computed as a linear function of features, namely the edge features.
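Schematically, the pairwise potential of a single edge is then a weighted combination of n_edge_features many (n_states, n_states) weight matrices (a sketch with made-up numbers, not PyStruct internals):

>>> n_states, n_edge_features = 3, 2
>>> pairwise_weights = np.random.rand(n_edge_features, n_states, n_states)
>>> edge_feature = np.array([0.5, 1.0])      # the features of one edge
>>> potentials = np.tensordot(edge_feature, pairwise_weights, axes=1)
>>> potentials.shape
(3, 3)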
Each training sample for
models.EdgeFeatureGraphCRF is a tuple
(node_features, edges, edge_features), where
node_features is a numpy array of node features (of shape
(n_nodes, n_features)), edges is an array of edges between nodes, of shape
(n_edges, 2) as in models.GraphCRF, and
edge_features is a feature vector for each edge, given as a
numpy array of shape (n_edges, n_edge_features).
The edge features allow the pairwise interactions to be modulated by the context. Two features important for image segmentation, for example, are color differences between the (super)pixels at given nodes, and whether one is above the other. If two neighboring nodes correspond to regions of similar color, they are more likely to have the same label. For the vertical direction, a node above a node representing "sky" is more likely to also represent "sky" than "water".
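A hypothetical sketch of building such edge features for superpixel segmentation (mean_colors and centers are assumed inputs, one row per node):

>>> def make_edge_features(edges, mean_colors, centers):
...     color_diff = np.linalg.norm(
...         mean_colors[edges[:, 0]] - mean_colors[edges[:, 1]], axis=1)
...     # 1 if the first node lies above the second in the image, else 0
...     above = (centers[edges[:, 0], 1] < centers[edges[:, 1], 1])
...     bias = np.ones(len(edges))  # constant edge feature as intercept
...     return np.column_stack([bias, color_diff, above])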
A great example of the importance of edge features is Conditional Interactions on the Snakes Dataset.
Latent variable models are models that involve interactions with variables that are not observed during training. These often model a "hidden cause" of the data, which can make it easier to learn about the actual observations.
Latent variable models are usually much harder to fit than fully observed
models, and require fitting using either
learners.LatentSSVM or learners.LatentSubgradientSSVM. The
learners.LatentSSVM alternates between
inferring the unobserved variables and fitting any of the other SSVM models
(such as learners.OneSlackSSVM). Each iteration of this alternation is as
expensive as fitting a fully observed model, and good initialization can be
very important. This method was published in Learning Structural SVMs with
Latent Variables. The
learners.LatentSubgradientSSVM approach instead tries to re-estimate the latent
variables for each batch, and corresponds to subgradient descent on the
non-convex objective involving the maximization over hidden variables. I am
unaware of any literature on this approach.
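As a construction sketch (using the grid-structured latent CRF from PyStruct; the parameters are illustrative, not a recipe):

>>> from pystruct.models import LatentGridCRF
>>> from pystruct.learners import LatentSSVM, OneSlackSSVM
>>> latent_model = LatentGridCRF(n_states_per_label=2)
>>> base_ssvm = OneSlackSSVM(latent_model)
>>> latent_ssvm = LatentSSVM(base_ssvm)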
There is an extensive benchmark of the learners in my thesis, TODO.
SubgradientSSVM : Good for many datapoints, fast inference. Usually worse than FrankWolfeSSVM, but takes less memory. Good for obtaining reasonable solutions fast.
NSlackSSVM : Good for very few datapoints (hundreds), slow inference. Good for obtaining high-precision solutions.
OneSlackSSVM : Good for mid-size datasets (thousands), slow inference. Good for obtaining high-precision solutions.
FrankWolfeSSVM : Good for fast inference, large datasets. Good for obtaining reasonable solutions fast.