CORELS: Learning Certifiably Optimal RulE ListS

Overview

CORELS (Certifiable Optimal RulE ListS) is a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. Our approach produces optimal rule lists on practical problems in seconds. This framework is a novel alternative to CART and other decision tree methods.

If you're not sure what a rule list is, head over to our rule list introduction before continuing.

Input/Output

CORELS is a supervised learning algorithm, trained on a dataset of n rule antecedents, where each antecedent captures some subset of m samples. These antecedents must be pre-mined - that is, CORELS is inputted not a list of samples with their associated features, but a list of rule antecedents, each represented by a bitvector of length m. The i'th bit of a rule's bitvector represents whether that antecedent captures the i'th sample (1) or if it doesn't (0). All features are binary, as is the classification of all the rules. Since all the sample features as well as the rule list classifications must be binary, continuous values or multi-class features must be broken down into categories. For instance, if a particular data set contained samples of people, and one of its features was age, instead of having one continuous 'age' feature you would have to break it down into discrete categories that could then become binary features. For instance, if the range of ages of your training set was 0 - 64, you could reasonably break the age feature into 4 binary features, age < 16, age < 32, age < 32, and age < 64.

Once trained, CORELS outputs a rule list. If allowed to run unconstrained, CORELS will output the certifiably most accurate rule list that can be generated from the given set of rule antecedents over the given training set. However, finding the optimal rule list may be computationally infeasible with large datasets, so we provide an option to stop searching after a certain user-specified number of nodes of the search trie have been evaluated. For suboptimal rule lists, an option is also provided to calculate the upper bound on the remaining search space size.

Algorithm

CORELS is a branch-and-bound algorithm, and leverages several algorithmic bounds to prune a trie that represents the search space of all possible rule lists to be examined.

The search space of all possible rule lists is represented by a prefix tree, or trie. Assuming n antecedents, the root node of this trie has n children, one for each antecedent. Each of those children have (n - 1) children, and so on, forming a trie with n levels (not counting the root node). Each node in this trie represents a rule list prefix (with each node's classification being 0 or 1, depending only on which gives the most accurate prediction over the training data set), and the bounds we calculate to remove nodes operate on these prefixes.

As with other supervised learning algorithms, CORELS defines an objective function for possible models (rule lists), and then seeks to minimize it. The objective of a rule list is simply the fraction of training samples that the rule list misclassifies, plus a factor that is proportional to the length of the rule list. By keeping track of the lowest objective found, it is possible to eliminate many rule lists without evaluating them at all, greatly speeding up the search. The objective function is defined as follows:

R(d, x, y) = misc(d, x, y) + c * length(d)

Where:

d is the rule list
x is the training data features
y is the training data classifications
misc is the fraction of misclassified samples
c is a user-defined constant (called the regularization constant). A larger value of this penalizes larger rule lists - it can be thought of as an accuracy penalty for each added rule to the list (if it is set to 0.01, for instance, every added rule would incur a penalty equal to misclassifying 1% of the samples)

For each prefix (again, a rule list prefix is a rule list without a default prediction) being evaluated in the trie, CORELS calculates a lower bound:

b(p, x, y) = misc(p, x, y) + c * length(p)

If this lower bound is greater than the lowest objective found so far, then this prefix along with every node that is a child of it can be removed from the tree, since any rule list beginning with that prefix must necessarily have an objective that is greater than the best objective found so far.

If the lower bound of a prefix is below the lowest objective, the node in the trie that represents that prefix is another one of CORELS' data structures: the priority queue. This queue is what determines the order in which rule lists are evaluated - the queue orders all the nodes within it using one of a variety of different metrics, chosen by the user.

As of right now, the following options are avaiable:

Breadth-First Search (BFS)
Depth-First Search (DFS)
Priority of objective: The nodes with the lowest objectives are evaluated first
Priority of lower bound: The nodes with the lowest lower bounds are evaluated first
Custom curiosity metric: This orders nodes based on their lower bounds as well as the number of samples they capture, prioritizing prefixes that capture few samples as well as those that are more accurate

The curiosity metric is still being tweaked - further improvements on it might be added in the future.
Depending on the particular dataset, any of the above ordering metrics could be best.

CORELS also uses a symmetry-aware map to eliminate prefixes that are permutations of each other from the search space. This map provides an important new tool for pruning the trie, since we can implement a tighter bound than only the objective. Specifically, if the lower bound of a particular prefix is greater than the lower bound of another prefix that is simply a permutation of the first prefix, then we can throw the first prefix out. This is because two different permutations of the same group of rules will capture the same set of samples, which means adding rules to the permutation with the higher lower bound would never yield a rule list with an objective lower than that of the best rule list starting with the prefix permutation with the better lower bound.

This permutation map is implemented in two ways - either using the canonical order of rules in a prefix as the map keys (a prefix permutation map), or using the array of captured samples as the keys (a captured vector map). The latter type could result in theoretically better results, as prefixes that are not permutations of each other but capture the same set of samples would be grouped together, but it requires far more memory than the former.

More information about CORELS can be found in the following papers:

Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning Certifiably Optimal Rule Lists for Categorical Data. KDD 2017. Journal of Machine Learning Research, 2018; 19: 1-77. arXiv:1704.01701, 2017.
Nicholas Larus-Stone, Elaine Angelino, Daniel Alabi, Margo Seltzer, Vassilios Kaxiras, Aditya Saligrama, and Cynthia Rudin. Systems Optimizations for Learning Certifiably Optimal Rule Lists. SysML Conference, 2018.
Nicholas Larus-Stone. Learning Certifiably Optimal Rule Lists: A Case For Discrete Optimization in the 21st Century. Senior thesis, 2017. PDF file