Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A.
The Mizar Mathematical Library (MML) is used as the formal corpus.
Premise length varies from 5 to 84299 characters, and over 60% of the words occur fewer than 10 times (the rare-word problem).
The model predicts the probability that a given axiom is useful for proving a given conjecture.
Conjecture and axiom sequences are separately embedded into fixed-length real vectors, then concatenated and passed to a third network with a few fully connected layers and a logistic loss.
The two embedding networks and the joint predictor path are trained jointly.
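The overall shape of the model can be sketched as follows. This is a minimal numpy toy, not the actual architecture: the mean-pooling embedder stands in for the CNN/RNN towers, and all shapes and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(seq, W):
    """Toy sequence embedder: mean of per-token vectors (a stand-in for
    the paper's CNN/RNN embedding towers), projected through W."""
    return np.tanh(seq.mean(axis=0) @ W)

def predict_useful(conj_seq, axiom_seq, params):
    """Embed conjecture and axiom separately, concatenate the two fixed-length
    vectors, and score with a small fully connected head ending in a sigmoid
    (matching the logistic loss)."""
    W_c, W_a, W1, w2 = params
    u = embed(conj_seq, W_c)                          # conjecture embedding
    v = embed(axiom_seq, W_a)                         # axiom embedding
    h = np.maximum(0.0, np.concatenate([u, v]) @ W1)  # fully connected + ReLU
    logit = h @ w2
    return 1.0 / (1.0 + np.exp(-logit))               # P(axiom useful)

# Hypothetical shapes: token dim 8, embedding dim 4, hidden dim 6.
params = (rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
          rng.normal(size=(8, 6)), rng.normal(size=6))
p = predict_useful(rng.normal(size=(10, 8)), rng.normal(size=(7, 8)), params)
assert 0.0 < p < 1.0
```

In the real system both embedding towers and the joint head are trained end-to-end; here the parameters are just random to show the data flow.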
Stage 1: Character-level Models
Premises are treated at the character level, with each character represented as an 80-dimensional one-hot vector.
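The character encoding itself is straightforward; a sketch, where the 80-symbol vocabulary shown is hypothetical (the paper only fixes the dimension, not the exact symbol set):

```python
import numpy as np

# Hypothetical character vocabulary; the model uses 80 one-hot dimensions.
VOCAB = sorted(set("abcdefghijklmnopqrstuvwxyz"
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "0123456789 ()+-*/=<>,.:;"))
CHAR_TO_ID = {c: i for i, c in enumerate(VOCAB)}

def one_hot_encode(text, dim=80):
    """Map each character of a premise to an 80-dimensional one-hot vector."""
    out = np.zeros((len(text), dim))
    for t, ch in enumerate(text):
        out[t, CHAR_TO_ID[ch]] = 1.0
    return out

x = one_hot_encode("for x holds x=x")
assert x.shape == (15, 80) and (x.sum(axis=1) == 1.0).all()
```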
Architectures for embedding:
pure recurrent LSTM and GRU networks
CNN (with max pooling)
Recurrent-convolutional network that shortens the sequence with a convolutional layer before feeding it to an LSTM.
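The shortening step of the recurrent-convolutional variant can be illustrated with a strided 1-D convolution: a stride greater than 1 reduces the sequence length before the (not shown) LSTM runs over it. All shapes and the stride value below are illustrative, not the paper's hyperparameters.

```python
import numpy as np

def conv1d_shorten(x, kernel, stride):
    """Strided 1-D convolution over a (T, d_in) sequence.
    kernel has shape (k, d_in, d_out); with stride > 1 the output has
    roughly T/stride steps, so the downstream LSTM sees a shorter sequence."""
    k, d_in, d_out = kernel.shape
    T = x.shape[0]
    steps = (T - k) // stride + 1
    out = np.empty((steps, d_out))
    for i in range(steps):
        window = x[i * stride : i * stride + k]  # (k, d_in) slice
        out[i] = np.maximum(
            0.0, np.tensordot(window, kernel, axes=([0, 1], [0, 1])))
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(2048, 16))                  # character-level input
shortened = conv1d_shorten(x, rng.normal(size=(5, 16, 32)), stride=4)
assert shortened.shape == (511, 32)              # ~4x shorter for the LSTM
```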
Stage 2: Word-level Models
MML dataset contains both implicit and explicit definitions.
To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word level embeddings.
This is better than recursively expanding and embedding each word's definition, as the definition chains can be very deep.
Once word level embeddings are obtained, the architecture from Character-level models can be reused.
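The definition-based embedding scheme can be sketched as a lookup table built in one pass: each identifier's whole defining statement is embedded once by a character-level model, and that vector becomes the identifier's word embedding. Everything below (the embedder, the definition table, the dimensions) is a hypothetical toy, not the corpus data.

```python
import numpy as np

def embed_chars(statement):
    """Stand-in for the Stage 1 character-level CNN: maps a full defining
    statement to a fixed-length vector (toy: normalized hashed bag of chars)."""
    v = np.zeros(16)
    for ch in statement:
        v[ord(ch) % 16] += 1.0
    return v / max(len(statement), 1)

# Hypothetical identifier -> defining-statement table.
DEFINITIONS = {
    "Group": "a set with an associative operation, identity and inverses",
    "Subgroup": "a subset of a Group closed under the Group operation",
}

# Each identifier's word embedding is the embedding of its entire definition,
# computed once -- no recursive expansion of definition chains.
WORD_EMB = {w: embed_chars(d) for w, d in DEFINITIONS.items()}

def embed_statement(tokens):
    """Word-level input for the Stage 2 model: one vector per token."""
    return np.stack([WORD_EMB[t] for t in tokens])

seq = embed_statement(["Subgroup", "Group"])
assert seq.shape == (2, 16)
```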
For each conjecture, the model ranks the possible premises.
The primary metric is the number of conjectures proved from the top-k ranked premises.
Average Max Relative Rank (AMMR) is a more sophisticated measure, motivated by the observation that a conjecture is easier to prove if all of its dependencies occur early in the ranking.
Since it is very costly to rank all axioms for every conjecture, AMMR is approximated by ranking the true dependencies against a fixed number of randomly sampled false dependencies.
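The approximated metric can be computed as follows. This is a sketch of the idea as described above, assuming higher model scores mean "more useful"; the scores and sample sizes in the example are made up.

```python
def max_relative_rank(scores_true, scores_neg):
    """Max relative rank of the true dependencies when ranked (by descending
    score) against a fixed sample of false dependencies. Lower is better."""
    pool = [(s, True) for s in scores_true] + [(s, False) for s in scores_neg]
    pool.sort(key=lambda p: -p[0])
    worst = max(i for i, (_, is_true) in enumerate(pool) if is_true)
    return (worst + 1) / len(pool)

def approx_ammr(per_conjecture):
    """Average the max relative ranks over all conjectures."""
    return (sum(max_relative_rank(t, n) for t, n in per_conjecture)
            / len(per_conjecture))

# Toy example: 2 conjectures, each with true-dep scores and 4 sampled negatives.
data = [([0.9, 0.7], [0.8, 0.3, 0.2, 0.1]),   # worst true dep ranks 3rd of 6
        ([0.95, 0.9], [0.5, 0.4, 0.3, 0.2])]  # worst true dep ranks 2nd of 6
assert abs(approx_ammr(data) - (3/6 + 2/6) / 2) < 1e-9
```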
Training uses asynchronous distributed stochastic gradient descent with the Adam optimizer.
Both clipped and non-clipped gradients are compared.
Maximum sequence length is 2048 for character-level sequences and 500 for word-level sequences.
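Fixed maximum lengths imply the usual truncate-or-pad preprocessing; a sketch, assuming right-padding with a reserved pad id (the padding convention is an assumption, not stated in the notes):

```python
import numpy as np

def pad_or_truncate(ids, max_len, pad_id=0):
    """Clip a token-id sequence to the model's maximum length (2048 chars
    or 500 words in these experiments) and right-pad shorter sequences."""
    ids = list(ids)[:max_len]
    return np.array(ids + [pad_id] * (max_len - len(ids)))

assert pad_or_truncate(range(3000), 2048).shape == (2048,)   # truncated
assert list(pad_or_truncate([5, 6], 5)) == [5, 6, 0, 0, 0]   # padded
```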
Best selection pipeline: a Stage 1 character-level CNN produces the word-level embeddings used by the Stage 2 model.
The best models use simple CNNs followed by max pooling, and the two-stage definition-based def-CNN outperforms the naive word-CNN (where word embeddings are learnt in a single pass).