
Learning Problems in Theorem Proving

Cezary Kaliszyk

Universität Innsbruck

July 3, 2017

LAIVe Summer School

http://cl-informatik.uibk.ac.at

http://cl-informatik.uibk.ac.at/~cek

Computer Theorem Proving: Historical Context

• 1940s: Algorithmic proof search (λ-calculus)

• 1960s: de Bruijn’s Automath

• 1970s: Small Certifiers (LCF)

• 1990s: Resolution (Superposition)

• 2000s: Large theories

• 2010s: ?

C. Kaliszyk (Universitat Innsbruck) Learning Problems in Theorem Proving 2/72

Lecture content

Theorem Proving Introduction

Machine Learning Problems in Theorem Proving

Premise Selection

Useful Intermediate Steps

Theorem Names

Internal Guidance



Outline



Premise Selection


Theorem Names

Internal Guidance



The Kepler Conjecture (year 1611)

The most compact way of stacking balls of the same size in space is a pyramid.

\[ V = \frac{\pi}{\sqrt{18}} \approx 74\% \]




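Proved in 1998

• Tom Hales, 300 page proof using computer programs

• Submitted to the Annals of Mathematics

• 99% correct... but we cannot verify the programs

1039 equalities and inequalities

For example:

\[
\frac{-x_1x_3 - x_2x_4 + x_1x_5 + x_3x_6 - x_5x_6 + x_2(-x_2 + x_1 + x_3 - x_4 + x_5 + x_6)}
{\sqrt{4x_2\left(\begin{array}{l}
x_2x_4(-x_2 + x_1 + x_3 - x_4 + x_5 + x_6) + {}\\
x_1x_5(x_2 - x_1 + x_3 + x_4 - x_5 + x_6) + {}\\
x_3x_6(x_2 + x_1 - x_3 + x_4 + x_5 - x_6) - {}\\
x_1x_3x_4 - x_2x_3x_5 - x_2x_1x_6 - x_4x_5x_6
\end{array}\right)}}
< \tan\left(\frac{\pi}{2} - 0.74\right)
\]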




Solution? Formalized Proof!

• Formalize the proof using Proof Assistants

• Implement the computer code in the system

• Prove the code correct

• Run the programs inside the Proof Assistant

Flyspeck Project

• Completed 2015

• Many Proof Assistants and contributors



What is a Proof Assistant? (1/2)

A Proof Assistant is

• a computer program
• to assist a mathematician
• in the production of a proof
• that is mechanically checked

What does a Proof Assistant do?

• Keep track of theories, definitions, assumptions
• Interaction - proof editing
• Proof checking
• Automation - proof search

What does it implement? (And how?)

• a formal logical system intended as foundation for mathematics
• decision procedures



    theorem sqrt2_not_rational:
      "sqrt (real 2) ∉ ℚ"
    proof
      assume "sqrt (real 2) ∈ ℚ"
      then obtain m n :: nat where
        n_nonzero: "n ≠ 0" and sqrt_rat: "|sqrt (real 2)| = real m / real n"
        and lowest_terms: "gcd m n = 1" ..
      from n_nonzero and sqrt_rat have "real m = |sqrt (real 2)| * real n" by simp
      then have "real (m²) = (sqrt (real 2))² * real (n²)"
        by (auto simp add: power2_eq_square)
      also have "(sqrt (real 2))² = real 2" by simp
      also have "... * real (m²) = real (2 * n²)" by simp
      finally have eq: "m² = 2 * n²" ..
      hence "2 dvd m²" ..
      with two_is_prime have dvd_m: "2 dvd m" by (rule prime_dvd_power_two)
      then obtain k where "m = 2 * k" ..
      with eq have "2 * n² = 2² * k²" by (auto simp add: power2_eq_square mult_ac)
      hence "n² = 2 * k²" by simp
      hence "2 dvd n²" ..
      with two_is_prime have "2 dvd n" by (rule prime_dvd_power_two)
      with dvd_m have "2 dvd gcd m n" by (rule gcd_greatest)
      with lowest_terms have "2 dvd 1" by simp
      thus False by arith
    qed

de Bruijn factor





Intel Pentium® P5 (1994)

Superscalar; Dual integer pipeline; Faster floating-point, ...

4 195 835 / 3 145 727 = 1.333820... (correct value)

4 195 835 / 3 145 727 = 1.333739... (value computed by the P5)

FPU division lookup table: for certain inputs the division result is off
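The size of the error can be checked with ordinary floating-point arithmetic. The sketch below is only an illustration: the flawed quotient is the truncated value quoted on the slide, and the variable names are made up for this example.

```python
# Operands of the well-known Pentium FDIV example.
numerator = 4_195_835
denominator = 3_145_727

# Quotient as computed by a correct FPU.
correct = numerator / denominator

# Quotient reported for the flawed P5, truncated as on the slide.
flawed = 1.333739

# Relative error of the flawed result.
relative_error = (correct - flawed) / correct

print(f"correct quotient: {correct:.6f}")
print(f"relative error:   {relative_error:.1e}")
```

The two results already differ in the fifth significant digit, far more than ordinary rounding could explain.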

Replacement

• Few customers cared, still cost of $475 million

• Testing and model checking insufficient:
  • Since then Intel and AMD processors formally verified
  • HOL Light and ACL2 (along with other techniques)






Typical proof assistant problem

• Does there exist a function f from R to R, such that for all x and y, f(x + y²) − f(x) ≥ y?

1. f(x + y²) − f(x) ≥ y for any given x and y

2. f(x + n·y²) − f(x) ≥ n·y for any x, y, and n ∈ N (easy induction using [1] for the step case)

3. f(1) − f(0) ≥ m + 1 for any m ∈ N (set n = (m + 1)², x = 0, y = 1/(m + 1) in [2])

4. Contradiction with the Archimedean property of R
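Spelled out, the substitution in step 3 and the final contradiction look as follows. This is only a worked version of the four steps above, with no new assumptions beyond them.

```latex
\begin{align*}
  &\text{Step 2 with } n = (m+1)^2,\ x = 0,\ y = \tfrac{1}{m+1}: \\
  &\qquad n\,y^2 = (m+1)^2 \cdot \frac{1}{(m+1)^2} = 1,
   \qquad n\,y = (m+1)^2 \cdot \frac{1}{m+1} = m+1, \\
  &\qquad\text{hence } f(0 + 1) - f(0) \;\ge\; m+1 \quad\text{for every } m \in \mathbb{N}. \\[4pt]
  &\text{Archimedean property: there is an } m \in \mathbb{N} \text{ with } f(1) - f(0) \le m. \\
  &\qquad\text{For this } m:\quad m + 1 \;\le\; f(1) - f(0) \;\le\; m, \quad\text{a contradiction.}
\end{align*}
```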


Formalization

    (* No f : real -> real satisfies f(x + y^2) - f(x) >= y for all x, y. *)
    let lemma = prove
     (`!f:real->real. ~(!x y. f(x + y * y) - f(x) >= y)`,
      REWRITE_TAC[real_ge] THEN REPEAT STRIP_TAC THEN
      (* Step 2: generalise to n steps, proved by induction on n. *)
      SUBGOAL_THEN `!n x y. &n * y <= f(x + &n * y * y) - f(x)` MP_TAC THENL
       [MATCH_MP_TAC num_INDUCTION THEN SIMP_TAC[REAL_MUL_LZERO; REAL_ADD_RID] THEN
        REWRITE_TAC[REAL_SUB_REFL; REAL_LE_REFL; GSYM REAL_OF_NUM_SUC] THEN
        GEN_TAC THEN REPEAT(MATCH_MP_TAC MONO_FORALL THEN GEN_TAC) THEN
        FIRST_X_ASSUM(MP_TAC o SPECL [`x + &n * y * y`; `y:real`]) THEN
        SIMP_TAC[REAL_ADD_ASSOC; REAL_ADD_RDISTRIB; REAL_MUL_LID] THEN
        REAL_ARITH_TAC;
        (* Steps 3 and 4: instantiate and contradict the Archimedean property. *)
        X_CHOOSE_TAC `m:num` (SPEC `f(&1) - f(&0):real` REAL_ARCH_SIMPLE) THEN
        DISCH_THEN(MP_TAC o SPECL [`SUC m EXP 2`; `&0`; `inv(&(SUC m))`]) THEN
        REWRITE_TAC[REAL_ADD_LID; GSYM REAL_OF_NUM_SUC; GSYM REAL_OF_NUM_POW] THEN
        REWRITE_TAC[REAL_FIELD `(&m + &1) pow 2 * inv(&m + &1) = &m + &1`;
                    REAL_FIELD `(&m + &1) pow 2 * inv(&m + &1) * inv(&m + &1) = &1`] THEN
        ASM_REAL_ARITH_TAC]);;





HOL(y)Hammer: general purpose proof assistant automation

• Machine Learning

• Automated Reasoning


Proof Assistant (2/2)

• Keep track of theories, definitions, assumptions
  • set up a theory that describes mathematical concepts (or models a computer system)
  • express logical properties of the objects

• Interaction - proof editing
  • typically interactive
  • specified theory and proofs can be edited
  • provides information about required proof obligations
  • allows further refinement of the proof
  • often manually providing a direction in which to proceed

• Automation - proof search
  • various strategies
  • decision procedures

• Proof checking
  • checking of complete proofs
  • sometimes providing certificates of correctness

• Why should we trust it?
  • small core



Can a Proof Assistant do all proofs?

Decidability!

• Validity of formulas is undecidable

• (for non-trivial logical systems)

Automated Theorem Provers

• Specific domains

• Adjust your problem

• Answers: Valid (Theorem with proof)

• Or: Countersatisfiable (Possibly with counter-model)

Proof Assistants

• Generally applicable

• Direct modelling of problems

• Interactive



Tool Categories

Computer Algebra

• Solving equations, simplifications, numerical approximations

• Maple, Mathematica, . . .

Model Checkers

• State space abstraction

• Spin, Uppaal, ...

ATPs

• Built-in automation (model elimination, resolution)

• ACL2, Vampire, Eprover, SPASS, . . .



Theorems and programs that use ITP/ATP

Software and Hardware

• Processors and Chips

• Security Protocols

• Project Cristal (CompCert)

• L4.verified

• Java Bytecode

Mathematical Theorems

• Kepler Conjecture (2015)

• 4 color theorem

• Feit-Thompson

• Robbins conjecture, AIM

Wiedijk’s 100






Fast progress in machine learning

Tasks involving logical inference

• Natural language question answering [Sukhbaatar+2015]

• Knowledge base completion [Socher+2013]

• Automated translation [Wu+2016]

Games

AlphaGo problems similar to proving [Silver+2016]

• Node evaluation

• Policy decisions

Computer Vision

Better than human performance on some tasks [Russakovsky+2015]

A lot of AI, but little use of learning in proving



AI theorem proving techniques

High-level AI guidance

• tactic selection and premise (lemma) selection

• based on suitable features (characterizations) of the formulas

• and on learning lemma-relevance from many related proofs

Mid-level AI guidance

• learn good ATP strategies/tactics/heuristics for classes of problems

• learning lemma and concept re-use

• learn conjecturing

Low-level AI guidance

• guide (almost) every inference step by previous knowledge

• good proof-state characterization and fast relevance






Problems for Machine Learning

• Is a statement useful?
  • For a conjecture

• What are the dependencies of a statement? (premise selection)

• Is a statement important? (named)

• What should the next proof step be?
  • Tactic? Instantiation?

• How to name a statement?

• What new problem is likely to be true?
  • Intermediate statement for a conjecture


Premise Selection


Premise selection

Intuition

Given:

• set of theorems T (together with proofs)

• conjecture c

Find:

• minimal subset of T that can be used to prove c

More formally

\[ \arg\min_{t \subseteq T} \, \{\, |t| \;\mid\; t \vdash c \,\} \]
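Read literally, the definition above asks for a smallest subset of T from which c is derivable. A brute-force sketch makes this concrete, under strong simplifying assumptions: the "theorems" are propositional Horn rules, derivability is decided by forward chaining, and the names `derives`, `minimal_premises` and the toy theory are invented for this illustration. The search is exponential in |T|, which is one reason learned approximations are used in practice.

```python
from itertools import combinations

def derives(premises, goal):
    """Forward chaining over Horn rules given as (body, head) pairs."""
    known = set()
    changed = True
    while changed:
        changed = False
        for body, head in premises:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return goal in known

def minimal_premises(theory, goal):
    """Return a smallest set of theorem names that derives goal, or None."""
    names = sorted(theory)
    for size in range(len(names) + 1):          # try small subsets first
        for subset in combinations(names, size):
            if derives([theory[n] for n in subset], goal):
                return set(subset)
    return None

# Toy theory: name -> (body, head); facts have an empty body.
theory = {
    "a": ((), "p"),
    "b": (("p",), "q"),
    "c": (("q",), "r"),
    "d": (("p",), "r"),
    "e": ((), "s"),
}

print(minimal_premises(theory, "r"))
```

Here "r" is derivable via a, b, c as well, but the two-premise subset {a, d} is found first because subsets are enumerated by increasing size.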


In machine learning terminology

Multi-label classification

Input: set of samples S, where samples are triples (s, F(s), L(s))

• s is the sample ID

• F(s) is the set of features of s

• L(s) is the set of labels of s

Output: function f that predicts n labels (sorted by relevance) for a set of features

Sample add_comm (a + b = b + a) could have:

• F(add_comm) = {"+", "=", "num"}
• L(add_comm) = {num_induct, add_0, add_suc, add_def}
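The triple representation and the shape of the predictor can be sketched as below. The `add_comm` sample follows the example above; the `mul_comm` sample and the scoring rule (count of shared features) are assumptions chosen only to keep the illustration short, not a method from the lecture.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    name: str             # sample ID s
    features: frozenset   # F(s)
    labels: frozenset     # L(s)

def predict(samples, features, n):
    """Rank labels by how many features their samples share with the query."""
    scores = Counter()
    for sample in samples:
        overlap = len(sample.features & features)
        if overlap:
            for label in sample.labels:
                scores[label] += overlap
    # Highest score first; ties broken alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:n]

samples = [
    Sample("add_comm", frozenset({"+", "=", "num"}),
           frozenset({"num_induct", "add_0", "add_suc", "add_def"})),
    # Hypothetical second sample, not taken from the slides.
    Sample("mul_comm", frozenset({"*", "=", "num"}),
           frozenset({"num_induct", "mul_0", "mul_suc", "mul_def"})),
]

print(predict(samples, frozenset({"+", "num"}), 3))
```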


Not exactly the usual machine learning problem

Labels correspond to premises and samples to theorems

• Very often the same

Similar theorems are likely to be useful in the proof

• Also likely to have similar premises

Theorems sharing logical features are similar

• Theorems sharing rare features are very similar

Recently considered theorems and premises are important



Not exactly for the usual machine learning tools

Multi-label classifier output

• Often asked for 1000 or more most relevant lemmas

Efficient classifier update

• Learning time + prediction time small

• User will not wait more than 10–30 sec for all phases

Large numbers of features

• Complicated feature relations


Tried Premise Selection Techniques

• Syntactic methods
  • Neighbours using various metrics
  • Clustering
  • Recursive: SInE, MePo

• k-NN, Naive Bayes
  • (space reductions)

• Regression
  • Kernel-based multi-output ranking

• Decision Trees (Random Forests)

• Neural Networks
  • Winnow, Perceptron (SNoW)
  • DeepMath


Problem pruning with SInE

• SInE: SUMO Inference Engine [Hoder 2008]
  • Original version implemented in Python and debuted in CASC-J4 (with E and Vampire)
  • Used by other ATPs: iProver-LTB, etc.
  • Efficient reimplementations: Vampire (2009/2010) and E (2010, GSInE)

• SInE selects relevant axioms from a large set

• Axiom selection based on function/predicate symbols:
  • The conjecture is relevant
  • Symbols in the conjecture are relevant
  • Formulas defining these symbols are relevant, and their other symbols become relevant
  • Repeat until a fixed point or limit is reached

• A formula is interpreted as defining the globally rarest symbol(s) occurring in it

• Very similar to MePo [MengPaulson2009]
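The selection loop described above can be sketched in a few lines. This is a simplified illustration, not the SInE implementation: formulas are reduced to their symbol sets, a formula is triggered only by its globally rarest symbol(s), and tuning parameters of the real tool are omitted. The function name `sine_select` and the toy axioms are assumptions made for this example.

```python
from collections import Counter

def sine_select(axioms, conjecture_symbols, max_depth=None):
    """axioms: dict mapping axiom name -> set of symbols occurring in it."""
    # Global symbol frequencies over all axioms.
    occurrences = Counter(s for syms in axioms.values() for s in syms)

    # Each axiom is treated as "defining" its globally rarest symbol(s).
    triggers = {}
    for name, syms in axioms.items():
        rarest = min((occurrences[s] for s in syms), default=0)
        triggers[name] = {s for s in syms if occurrences[s] == rarest}

    relevant = set(conjecture_symbols)
    selected = set()
    depth = 0
    while max_depth is None or depth < max_depth:
        new = {name for name, trig in triggers.items()
               if name not in selected and trig & relevant}
        if not new:                 # fixed point reached
            break
        selected |= new
        for name in new:            # other symbols become relevant
            relevant |= axioms[name]
        depth += 1
    return selected

axioms = {
    "add_comm": {"+", "="},
    "add_0":    {"+", "0", "="},
    "mul_comm": {"*", "="},
    "mul_add":  {"*", "+", "="},
    "sin_odd":  {"sin", "neg", "="},
}

print(sorted(sine_select(axioms, {"*", "0"})))
```

Note that the very common symbol "=" never triggers anything here, which is the point of the rarest-symbol rule: frequent symbols would otherwise pull in the whole library.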


GSInE in E: e_axfilter

• Parameterizable filters
  • Different generality measures (frequency count, generosity, benevolence)
  • Different limits (absolute/relative size, # of iterations)
  • Different seeds (conjecture/hypotheses)

• Efficient implementation
  • E data types and libraries
  • Indexing (symbol → formula, formula → symbol)

• Multi-filter support
  • Parse & index once (amortize costs)
  • Apply different independent filters

• Primary use: Initial over-approximation (efficiently reduce HUGE input files to manageable size)

• Secondary use: Filtering for individual E strategies


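Naive Bayes

\begin{align*}
 & P(f \text{ is relevant for proving } g) \\
 &= P(f \text{ is relevant} \mid g\text{'s features}) \\
 &= P(f \text{ is relevant} \mid f_1, \ldots, f_n) \\
 &\propto P(f \text{ is relevant}) \, \prod_{i=1}^{n} P(f_i \mid f \text{ is relevant}) \\
 &\propto \frac{\#\, f \text{ is a proof dependency}}{\#\, \text{all dependencies}}
   \cdot \prod_{i=1}^{n}
   \frac{\#\, f_i \text{ appears when } f \text{ is a proof dependency}}{\#\, f \text{ is a proof dependency}}
\end{align*}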


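Naive Bayes: adaptation to premise selection

Extended features F(φ) of a fact φ: features of φ and of the facts that were proved using φ (only one iteration)

More precise estimation of the relevance of φ to prove γ:

\begin{align*}
 & P(\varphi \text{ is used in } \psi\text{'s proof}) \\
 &\cdot \prod_{f \in F(\gamma) \cap F(\varphi)} P(\psi \text{ has feature } f \mid \varphi \text{ is used in } \psi\text{'s proof}) \\
 &\cdot \prod_{f \in F(\gamma) - F(\varphi)} P(\psi \text{ has feature } f \mid \varphi \text{ is not used in } \psi\text{'s proof}) \\
 &\cdot \prod_{f \in F(\varphi) - F(\gamma)} P(\psi \text{ does not have feature } f \mid \varphi \text{ is used in } \psi\text{'s proof})
\end{align*}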


All these probabilities can be computed efficiently

Update two functions (tables):

• t(φ): number of times a fact φ was a dependency

• s(φ, f): number of times a fact φ was a dependency of a fact described by feature f

Then:

\begin{align*}
 P(\varphi \text{ is used in a proof of (any) } \psi) &= \frac{t(\varphi)}{K} \\
 P(\psi \text{ has feature } f \mid \varphi \text{ is used in } \psi\text{'s proof}) &= \frac{s(\varphi, f)}{t(\varphi)} \\
 P(\psi \text{ does not have feature } f \mid \varphi \text{ is used in } \psi\text{'s proof}) &= 1 - \frac{s(\varphi, f)}{t(\varphi)} \approx 1 - \frac{s(\varphi, f) - 1}{t(\varphi)}
\end{align*}

Times for large dataset: seconds
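The two tables and the three-factor product can be sketched as follows. Several points are assumptions made for this illustration only: K is taken to be the number of learned proofs, F(φ) is approximated by the features recorded in s(φ, ·), the "φ is not used" factor is replaced by a small constant `unseen`, and scores are accumulated as logarithms to avoid underflow. The class and method names are hypothetical.

```python
import math
from collections import defaultdict

class NaiveBayesPremises:
    def __init__(self, unseen=1e-3):
        self.t = defaultdict(int)                       # t[phi]
        self.s = defaultdict(lambda: defaultdict(int))  # s[phi][f]
        self.proofs = 0                                 # K (assumed meaning)
        self.unseen = unseen                            # stand-in for the third kind of factor

    def learn(self, features, dependencies):
        """Record one proved fact with its features and proof dependencies."""
        self.proofs += 1
        for phi in dependencies:
            self.t[phi] += 1
            for f in features:
                self.s[phi][f] += 1

    def score(self, phi, goal_features):
        """Log of the product of the Naive Bayes factors for fact phi."""
        t = self.t.get(phi, 0)
        if t == 0:
            return float("-inf")
        seen = self.s[phi]
        log_p = math.log(t / self.proofs)
        for f in goal_features:
            if f in seen:                       # f in F(gamma) and F(phi)
                log_p += math.log(seen[f] / t)
            else:                               # f in F(gamma) only
                log_p += math.log(self.unseen)
        for f, count in seen.items():
            if f not in goal_features:          # f in F(phi) only
                log_p += math.log(1 - (count - 1) / t)
        return log_p

    def rank(self, goal_features):
        return sorted(self.t, key=lambda phi: self.score(phi, goal_features),
                      reverse=True)

nb = NaiveBayesPremises()
nb.learn({"+", "=", "num"}, {"add_0", "add_suc"})
nb.learn({"*", "=", "num"}, {"mul_0", "add_0"})
print(nb.rank({"+", "num"}))
```

Learning only increments counters, which is why updates after each new proof are cheap compared with retraining a model from scratch.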

Decision Trees

Definition

• each leaf stores a set of samples

• each branch stores a feature f and two subtrees, where:
  • the left subtree contains only samples having f
  • the right subtree contains only samples not having f

Example

    +
    ├── has "+":  ×
    │   ├── has "×":  a × (b + c) = a × b + a × c
    │   └── no "×":   a + b = b + a
    └── no "+":   sin
        ├── has "sin":  sin x = −sin(−x)
        └── no "sin":   ×
            ├── has "×":  a × b = b × a
            └── no "×":   a = a
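The definition translates directly into a small data structure. The sketch below rebuilds the example tree and queries it; the class and function names are chosen for this illustration only.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    samples: List[str]

@dataclass
class Branch:
    feature: str
    with_feature: "Tree"       # left subtree: samples having the feature
    without_feature: "Tree"    # right subtree: samples not having it

Tree = Union[Leaf, Branch]

def query(tree, features):
    """Follow the branches selected by the query features down to a leaf."""
    while isinstance(tree, Branch):
        tree = tree.with_feature if tree.feature in features else tree.without_feature
    return tree.samples

DISTRIB = "a × (b + c) = a × b + a × c"
ADD_COMM = "a + b = b + a"
SIN_ODD = "sin x = −sin(−x)"
MUL_COMM = "a × b = b × a"
REFL = "a = a"

example = Branch(
    "+",
    Branch("×", Leaf([DISTRIB]), Leaf([ADD_COMM])),
    Branch("sin", Leaf([SIN_ODD]),
           Branch("×", Leaf([MUL_COMM]), Leaf([REFL]))),
)

print(query(example, {"+", "×"}))
```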


Random Forests

• Online and Offline variants [Safari09, Agrawal13]
  • Online gives too much influence to the first learned nodes
  • Offline too slow to update

• Too few selected labels: Multi-path query

• Too slow to compute Gini-index for splitting feature selection

• With 50 trees and 1000 labels in each, performance is weak
  • Needs combining with another classifier: k-NN in the leaves


Random Forests: Results [Farber2015]

[Figure: precision (0.9 to 1) against number of evaluation samples (200 to 1,000) for RF and k-NN]

Runtimes to achieve same number of proven theorems

Classifier   Classifier runtime   E timeout   E runtime   Total
k-NN         0.5 min              15 sec      314 min     341 min
RF           22 min               10 sec      252 min     272 min


Deep Learning vs Shallow Learning

Traditional machine learning (Data → Hand-crafted Features → Predictor)

• Mostly convex, provably tractable

• Special purpose solvers

• Non-layered architectures

Deep Learning (Data → Learned Features → Predictor)

• Mostly NP-Hard

• General purpose solvers

• Hierarchical models


Deep Learning for Mizar Lemma Selection [Alemi+2016]

• Embed all lemmas into $\mathbb{R}^n$ using an LSTM

• Embed conjecture into $\mathbb{R}^n$ using an LSTM

• Simple classifier on top of concatenated embeddings

• Trained to estimate usefulness on positive and negative examples

[Diagram: Statement to be proved → Embedding network; Potential premise → Embedding network; both embeddings → Combiner network → Classifier/Ranker]


DeepMath (2/2)

Cutoff   k-NN Baseline (%)   char-CNN (%)   word-CNN (%)   def-CNN-LSTM (%)   def-CNN (%)   def+char-CNN (%)
16       674 (24.6)          687 (25.1)     709 (25.9)     644 (23.5)         734 (26.8)    835 (30.5)
32       1081 (39.4)         1028 (37.5)    1063 (38.8)    924 (33.7)         1093 (39.9)   1218 (44.4)
64       1399 (51)           1295 (47.2)    1355 (49.4)    1196 (43.6)        1381 (50.4)   1470 (53.6)
128      1612 (58.8)         1534 (55.9)    1552 (56.6)    1401 (51.1)        1617 (59)     1695 (61.8)
256      1709 (62.3)         1656 (60.4)    1635 (59.6)    1519 (55.4)        1708 (62.3)   1780 (64.9)
512      1762 (64.3)         1711 (62.4)    1712 (62.4)    1593 (58.1)        1780 (64.9)   1830 (66.7)
1024     1786 (65.1)         1762 (64.3)    1755 (64)      1647 (60.1)        1822 (66.4)   1862 (67.9)

Table 1: Results of ATP premise selection experiments with hard negative mining on a test set of 2,742 theorems.






Intermediate lemmas [JSC 2015, FroCoS 2015]

Size of the inference graphs

Flyspeck graph        nodes            edges
kernel inferences     1,728,861,441    1,953,406,411
reduced trace         159,102,636      233,488,673
tactical inferences   11,824,052       42,296,208
tactical trace        1,067,107        4,268,428

Same orders of magnitude for single ATP proofs

Repetitions → Re-use

• The graphs are already computed outside of HOL

• Feature extraction, prediction, translation do not scale...

• Pre-select heuristically interesting lemmas



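Heuristics for interesting lemmas (1/2)

Definition (Recursive dependencies and uses)

\[
D(i) =
\begin{cases}
 1 & \text{if } i \in \mathit{Named} \lor i \in \mathit{Axioms}, \\
 \sum_{j \in d(i)} D(j) & \text{otherwise.}
\end{cases}
\]

Definition (Lemma quality)

\[
Q_1(i) = \frac{U(i) \cdot D(i)}{S(i)} \qquad
Q_2(i) = \frac{U(i) \cdot D(i)}{S(i)^2} \qquad
Q_1^r(i) = \frac{U(i)^r \cdot D(i)^{2-r}}{S(i)} \qquad
Q_3(i) = \frac{U(i) \cdot D(i)}{1.1^{S(i)}}
\]

EpclLemma (longest chain), AGIntRater, ...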
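A small sketch of how D and Q1 could be computed on a dependency graph. The slide defines only D explicitly; here U (recursive uses) is assumed to be defined analogously over the inverted graph with named theorems as base case, and S(i) is assumed to be a given statement size. The toy graph, sizes and helper names are invented for this illustration.

```python
from functools import lru_cache

def recursive_count(neighbours, base):
    """Return a memoised function: 1 on base nodes, else sum over neighbours."""
    @lru_cache(maxsize=None)
    def count(i):
        if i in base:
            return 1
        return sum(count(j) for j in neighbours.get(i, ()))
    return count

# d(i): direct dependencies of each intermediate lemma or theorem.
deps = {"l1": ("ax1", "ax2"), "l2": ("l1", "ax1"), "thm": ("l2", "l1")}

# Inverted graph: direct uses of each node.
uses = {}
for lemma, direct in deps.items():
    for d in direct:
        uses.setdefault(d, []).append(lemma)

named = {"thm"}
axioms = {"ax1", "ax2"}
size = {"l1": 4, "l2": 2}            # S(i), assumed statement sizes

D = recursive_count(deps, named | axioms)
U = recursive_count(uses, named)     # assumed definition of U

def q1(i):
    return U(i) * D(i) / size[i]

for lemma in ("l1", "l2"):
    print(lemma, D(lemma), U(lemma), q1(lemma))
```

The memoisation matters on real inference graphs, where the same intermediate lemma is reached along many paths.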



Heuristics for interesting lemmas (2/2)

PageRank

• Eigenvector centrality of a graph

• Fast, non-iterative, usable on whole Flyspeck

• Dominant eigenvector of:

\[ PR_1(i) = \frac{1 - f}{N} + f \sum_{i \in d(j)} \frac{PR_1(j)}{|d(j)|} \]

• Size normalized

\[ PR_2(i) = \frac{PR_1(i)}{S(i)} \]
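One standard way to approximate the solution of the PR1 equation is power iteration, sketched below. This is a swapped-in textbook technique for illustration, not necessarily what was used on Flyspeck; the damping value 0.85, the iteration count and the toy graph are assumptions.

```python
def pagerank(deps, damping=0.85, iterations=50):
    """deps: dict mapping node j -> list of its dependencies d(j).

    Rank flows from each node j to the nodes it depends on, following
    PR1(i) = (1 - f) / N + f * sum over j with i in d(j) of PR1(j) / |d(j)|.
    """
    nodes = sorted(set(deps) | {i for ds in deps.values() for i in ds})
    n = len(nodes)
    rank = {i: 1.0 / n for i in nodes}
    for _ in range(iterations):
        new_rank = {i: (1.0 - damping) / n for i in nodes}
        for j, ds in deps.items():
            for i in ds:
                new_rank[i] += damping * rank[j] / len(ds)
        rank = new_rank
    return rank

deps = {"thm": ["l1", "l2"], "l1": ["ax"], "l2": ["ax"], "ax": []}
ranks = pagerank(deps)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Dividing each value by a statement size S(i) would give the size-normalised variant PR2.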

Maximum Graph Cut



Learning Lemma Usefulness [ICLR 2017 ]

HOLStep Dataset

• Intermediate steps of the Kepler proof

• Only relevant proofs of reasonable size

• Annotate steps as useful and unused
  • Same number of positive and negative

• Tokenization and normalization of statements

Statistics

              Train     Test     Positive   Negative
Examples      2013046   196030   1104538    1104538
Avg. length   503.18    440.20   535.52     459.66
Avg. tokens   87.01     80.62    95.48      77.40
Conjectures   9999      1411     -          -
Avg. deps     29.58     22.82    -          -



Considered Models



Baselines (Training Profiles)

[Figure: training profiles of the baseline models; columns: char-level, token-level; rows: unconditioned, conjecture conditioned]


Theorem Names


Example Theorem Names [Aspinall2016]

• Names attached to noteworthy theorems

  RIEMANN_MAPPING_THEOREM:
  ∀s. open s ∧ simply_connected s ⇐⇒
      s = {} ∨ s = (:real^2) ∨
      ∃f g. f holomorphic_on s ∧ g holomorphic_on ball(0, 1) ∧
            (∀z. z IN s =⇒ f z IN ball(Cx(&0), &1) ∧ g(f z) = z) ∧
            (∀z. z IN ball(Cx(&0), &1) =⇒ g z IN s ∧ f(g z) = z)

• Descriptive names

  ADD_ASSOC: ∀m n p. m + (n + p) = (m + n) + p
  REAL_MIN_ASSOC: ∀x y z. min x (min y z) = min (min x y) z
  SUC_GT_ZERO: ∀x. Suc x > 0

• Automatic names:

  HOJODCM LEBHIRJ OBDATYB MEEIXJO KBWPBHQ RYIUUVK DIOWAAS PNXVWFS PBFLHET QCDVKEA NCVIBWU PPLHULJ



Associations between names and statements

Consistent

• association between symbols and parts of theorem names

Abstract

• abstracting from concrete symbols

• association between patterns and name parts


Modified k-Nearest Neighbours multi-label classifier

The nearness of two statements s1, s2:

\[ n(s_1, s_2) = \sqrt{\sum_{f \in f(s_1) \cap f(s_2)} w(f)^2} \]

• w(f) is the IDF (inverse document frequency) of feature f

Training examples are indexed by features to efficiently evaluate the relevance of a label:

\[ R(l) = \sum_{s_1 \in N,\; l \in l(s_1)} \frac{n(s_1, s_2)}{|l(s_1)|} \]

Positions for a stem proposed using the weighted average of the positions in the recommendations.
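The nearness and relevance formulas can be sketched as below, with name stems as labels. Assumptions for this illustration: IDF is taken as log(N / document frequency), the neighbourhood N is the k nearest training statements, the feature strings such as "assoc_shape" are invented placeholders for shape features, and stem positions are left out.

```python
import math
from collections import defaultdict

def idf_weights(training):
    """training: dict name -> (feature set, label set)."""
    total = len(training)
    document_frequency = defaultdict(int)
    for features, _ in training.values():
        for f in features:
            document_frequency[f] += 1
    return {f: math.log(total / c) for f, c in document_frequency.items()}

def nearness(features1, features2, w):
    """n(s1, s2): Euclidean norm of the IDF weights of the shared features."""
    return math.sqrt(sum(w.get(f, 0.0) ** 2 for f in features1 & features2))

def label_relevance(query_features, training, w, k):
    """R(l): sum over the k nearest neighbours carrying label l."""
    neighbours = sorted(training.values(),
                        key=lambda fl: nearness(fl[0], query_features, w),
                        reverse=True)[:k]
    relevance = defaultdict(float)
    for features, labels in neighbours:
        near = nearness(features, query_features, w)
        for label in labels:
            relevance[label] += near / len(labels)
    return dict(relevance)

training = {
    "MULT_ASSOC": ({"*", "assoc_shape", "num"}, {"MULT", "ASSOC"}),
    "ADD_SYM": ({"+", "sym_shape", "num"}, {"ADD", "SYM"}),
    "REAL_ADD_ASSOC": ({"+R", "assoc_shape", "real"}, {"REAL", "ADD", "ASSOC"}),
}
w = idf_weights(training)
relevance = label_relevance({"+", "assoc_shape", "num"}, training, w, k=3)
print(sorted(relevance, key=relevance.get, reverse=True))
```

Dividing by the number of labels of a neighbour keeps statements with long names from dominating the vote.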


Selected features of ADD_ASSOC

Feature                               Frequency   Position   IDF
(V0 + (V1 + V2) = (V0 + V1) + V2)     1           0.37       7.82
((V0 + V1) + V2)                      1           0.75       7.13
(V0 + V1)                             1           0.84       3.95
+                                     4           0.72       2.62
num                                   3           0.21       1.15
=                                     1           0.43       0.23
∀                                     3           0.15       0.03


Nearest Neighbours of ADD_ASSOC

Theorem name      Statement                                         Nearness
MULT_ASSOC        (V0 * (V1 * V2)) = ((V0 * V1) * V2)               553
ADD_AC_1          ((V0 + V1) + V2) = (V0 + (V1 + V2))               264
EXP_MULT          (EXP V0 (V1 * V2)) = (EXP (EXP V0 V1) V2)         247
HREAL_ADD_ASSOC   (V0 +H (V1 +H V2)) = ((V0 +H V1) +H V2)           246
HREAL_MUL_ASSOC   (V0 *H (V1 *H V2)) = ((V0 *H V1) *H V2)           246
REAL_ADD_ASSOC    (V0 +R (V1 +R V2)) = ((V0 +R V1) +R V2)           246
REAL_MUL_ASSOC    (V0 *R (V1 *R V2)) = ((V0 *R V1) *R V2)           246
REAL_MAX_ASSOC    (MAXR V0 (MAXR V1 V2)) = (MAXR (MAXR V0 V1) V2)   246
INT_ADD_ASSOC     (V0 +Z (V1 +Z V2)) = ((V0 +Z V1) +Z V2)           246
REAL_MIN_ASSOC    (MINR V0 (MINR V1 V2)) = (MINR (MINR V0 V1) V2)   246


Predicted Stems for associativity of addition

Stem      Positions                       Stem      Positions
ADD       [0.18; 0.14; 0.12; 0; 0.12]     MONO      [0.44; 0.49]
ASSOC     [1]                             REFL      [1]
AC        [0; 1]                          SELECT    [0]
NUM       [0.67; 0.45; 0; 0.70]           SPLITS    [0]
FIXED     [0]                             RCANCEL   [1]
EQUAL     [0.25; 0.25]                    IITN      [0]
ONE       [0]                             GEN       [1]


Learning and Evaluation

• Use k-NN to predict stems and their positions
  • Based on constants and shapes of theorems (ADD_ASSOC)

• Leave-one-out cross-validation

• 2298 statements in HOL Light standard library

First Choice   Later Choice   Same Stems   All Stems   Stem Overlap   Fail
273            501            291          757         459            17

Failures:

• EXCLUDED_MIDDLE (proposed: DISJ_THM)

• INFINITY_AX (proposed: ONTO_ONE)

• ETA_AX, WLOG_RELATION, TOPOLOGICAL_SORT

Inconsistencies


Internal Guidance


leanCoP: Lean Connection Prover (Jens Otten)

• Connected tableaux calculus
  • Goal oriented, good for large theories

• Regularly beats Metis and Prover9 in CASC
  • despite their much larger implementation
  • very good performance on some ITP challenges

• Compact Prolog implementation, easy to modify
  • Variants for other foundations: iLeanCoP, mLeanCoP
  • First experiments with machine learning: MaLeCoP

• Easy to imitate
  • leanCoP tactic in HOL Light


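Lean connection Tableaux

Very simple rules:

• Reduction unifies the current literal with a literal on the path

• Extension unifies the current literal with a copy of a clause

\[
\frac{}{\{\},\; M,\; \mathit{Path}} \;\text{Axiom}
\qquad
\frac{C,\; M,\; \mathit{Path} \cup \{L_2\}}{C \cup \{L_1\},\; M,\; \mathit{Path} \cup \{L_2\}} \;\text{Reduction}
\]

\[
\frac{C_2 \setminus \{L_2\},\; M,\; \mathit{Path} \cup \{L_1\} \qquad C,\; M,\; \mathit{Path}}{C \cup \{L_1\},\; M,\; \mathit{Path}} \;\text{Extension}
\]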
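The three rules can be tried out in a propositional setting, where unification degenerates to checking that two literals are complementary. The sketch below is an illustration under that simplification and is not leanCoP itself: literals are non-zero integers (negative means negated), the matrix is a list of clauses to be refuted, and a regularity check (no clause literal already on the path) guarantees termination. Function names are invented for this example.

```python
def prove(clause, path, matrix):
    """Close every literal of clause using reduction or extension steps."""
    if not clause:
        return True                               # Axiom: empty clause
    if any(lit in path for lit in clause):
        return False                              # regularity violated
    lit, rest = clause[0], clause[1:]
    neg = -lit
    closed = (
        neg in path                               # Reduction against the path
        or any(                                   # Extension into another clause
            prove([l for l in other if l != neg], path + [lit], matrix)
            for other in matrix
            if neg in other
        )
    )
    return closed and prove(rest, path, matrix)

def refutable(matrix):
    """Try every clause of the matrix as start clause."""
    return any(prove(list(start), [], matrix) for start in matrix)

# p, p -> q, not q: unsatisfiable.
print(refutable([[1], [-1, 2], [-2]]))
# p, q: satisfiable, so no closed tableau exists.
print(refutable([[1], [2]]))
```

In the first-order case the choice among the extension candidates becomes the main source of search, which is where learned guidance can be attached.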


leanCoP: Basic Unoptimized Code

    prove([Lit|Cla],Path,PathLim,Lem,Set) :-
        (-NegLit=Lit ; -Lit=NegLit) ->
        (
          % reduction: complementary literal on the path
          member(NegL,Path), unify_with_occurs_check(NegL,NegLit)
          ;
          % extension: connect to a clause from the matrix
          lit(NegLit,NegL,Cla1,Grnd1),
          unify_with_occurs_check(NegL,NegLit),
          prove(Cla1,[Lit|Path],PathLim,Lem,Set)
        ),
        prove(Cla,Path,PathLim,Lem,Set).
    prove([],_,_,_,_).


leanCoP: Actual Optimized Code

    prove([Lit|Cla],Path,PathLim,Lem,Set) :-
        % regularity: no literal of the clause is already on the path
        \+ (member(LitC,[Lit|Cla]), member(LitP,Path), LitC==LitP),
        (-NegLit=Lit ; -Lit=NegLit) ->
        (
          % lemma: literal already solved earlier
          member(LitL,Lem), Lit==LitL
          ;
          % reduction
          member(NegL,Path), unify_with_occurs_check(NegL,NegLit)
          ;
          % extension, subject to the path limit
          lit(NegLit,NegL,Cla1,Grnd1),
          unify_with_occurs_check(NegL,NegLit),
          ( Grnd1=g -> true ;
            length(Path,K), K<PathLim -> true ;
            \+ pathlim -> assert(pathlim), fail ),
          prove(Cla1,[Lit|Path],PathLim,Lem,Set)
        ),
        ( member(cut,Set) -> ! ; true ),
        prove(Cla,Path,PathLim,[Lit|Lem],Set).
    prove([],_,_,_,_).


leanCoP in HOL Light

• OCaml leanCoP integrated as an automated proof tactic
  • Reconstruction technique for hammers

• Some Prolog technology missing in functional languages
  • explicit stack for current proof state
    • Including the trail of variable bindings
  • continuation passing style for flow control
    • backtracking alternatives
    • further tasks

• Improvement compared to existing reconstruction
  • Very secure HOL Light certification of leanCoP proofs

Prover                Theorem (%)
OCaml-leanCoP (cut)   759 (87.04)
Metis (2.3)           708 (81.19)
Meson                 683 (78.32)
any                   832 (95.41)


FEMaLeCoP: Advice Overview and Used Features

• Advise the:
  • selection of clause for every tableau extension step

• Proof state: weighted vector of symbols (or terms)
  • extracted from all the literals on the active path
  • Frequency-based weighting (IDF)
  • Simple decay factor (using maximum)

• Consistent clausification
  • formula ?[X]: p(X) becomes p('skolem(?[A]:p(A),1)')

• Advice using custom sparse naive Bayes
  • association of the features of the proof states
  • with contrapositives used for the successful extension steps


FEMaLeCoP: Data Collection and Indexing

• Slight extension of the saved proofs
  • Training Data: pairs (path, used extension step)

• External Data Indexing (incremental)
  • te_num: number of training examples
  • pf_no: map from features to number of occurrences ∈ Q
  • cn_no: map from contrapositives to numbers of occurrences
  • cn_pf_no: map of maps of cn/pf co-occurrences

• Problem Specific Data
  • Upon start FEMaLeCoP reads only current-problem relevant parts of the training data
  • cn_no and cn_pf_no filtered by contrapositives in the problem
  • pf_no and cn_pf_no filtered by possible features in the problem

Rest of Naive Bayes, as in previous lecture


It cannot work? But it does!

Inference speed drops to about 40%, but:

Prover                   Proved (%)
Unguided OCaml-leanCoP   574 (27.6%)
FEMaLeCoP                635 (30.6%)
together                 664 (32.0%)

(evaluation on MPTP bushy problems, 60 s)




Even more impressive in HOL (Satallax) [BrownFarber2016]

[Figure: problems solved (0 to 4,000) against time in seconds (0 to 30); legend: Offline, Training]



E-Prover given-clause loop

Most important choice: unprocessed clause selection [Schulz 2015 ]
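To show where the clause-selection choice sits, here is a propositional sketch of a given-clause loop with binary resolution as the only inference rule. It is a toy illustration and not E's implementation: clauses are sets of integer literals, the default `select` heuristic (smallest clause first) is an assumption, and a learned model would replace exactly that function.

```python
def resolvents(c1, c2):
    """All binary resolvents of two propositional clauses."""
    return [frozenset((c1 - {lit}) | (c2 - {-lit}))
            for lit in c1 if -lit in c2]

def is_tautology(clause):
    return any(-lit in clause for lit in clause)

def given_clause_loop(clauses, select=None, max_steps=10_000):
    # Default heuristic (assumption): prefer the smallest unprocessed clause.
    select = select or (lambda unprocessed: min(unprocessed, key=len))
    unprocessed = {frozenset(c) for c in clauses}
    processed = set()
    for _ in range(max_steps):
        if not unprocessed:
            return "saturated"            # no proof exists
        given = select(unprocessed)       # the most important choice
        unprocessed.remove(given)
        if not given:
            return "proof found"          # empty clause derived
        processed.add(given)
        for other in list(processed):
            for new in resolvents(given, other):
                if not is_tautology(new) and new not in processed:
                    unprocessed.add(new)
    return "gave up"

print(given_clause_loop([[1], [-1, 2], [-2]]))
print(given_clause_loop([[1, 2], [-1, 2]]))
```

Every derived clause waits in `unprocessed` until it is selected, so a poor selection function can bury the clauses needed for the proof under irrelevant ones.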



Data Collection

Mizar top-level theorems [Urban 2006 ]

• Encoded in FOF

32,521 Mizar theorems with ≥ 1 proof

• training-validation split (90%-10%)

• replay with one strategy

Collect all CNF intermediate steps

• and unprocessed clauses when proof is found


Deep Network Architectures

Overall network:

• Clause tokens → Clause Embedder
• Negated conjecture tokens → Negated conjecture embedder
• Concatenate → Fully Connected (1024 nodes) → Fully Connected (1 node) → Logistic loss

Convolutional Embedding:

• Input token embeddings → Conv 5 (1024) + ReLU (stacked) → Max Pooling

Non-dilated and dilated convolutions



Recursive Neural Networks

• Curried representation of first-order statements

• Separate nodes for apply, or, and, not

• Layer weights learned jointly for the same formula

• Embeddings of symbols learned with rest of network

• Tree-RNN and Tree-LSTM models


Model accuracy

Model                     Embedding Size   Accuracy on 50-50% split
Tree-RNN-256×2            256              77.5%
Tree-RNN-512×1            256              78.1%
Tree-LSTM-256×2           256              77.0%
Tree-LSTM-256×3           256              77.0%
Tree-LSTM-512×2           256              77.9%
CNN-1024×3                256              80.3%
*CNN-1024×3               256              78.7%
CNN-1024×3                512              79.7%
CNN-1024×3                1024             79.8%
WaveNet-256×3×7           256              79.9%
*WaveNet-256×3×7          256              79.9%
WaveNet-1024×3×7          1024             81.0%
WaveNet-640×3×7(20%)      640              81.5%
*WaveNet-640×3×7(20%)     640              79.9%

* = train on unprocessed clauses as negative examples


Hybrid Heuristic

Already on proved statements, performance requires modifications

[Figure: percent unproved (0% to 100%) against processed clause limit (10² to 10⁵); curves: Pure CNN, Hybrid CNN, Pure CNN; Auto, Hybrid CNN; Auto]

[Figure: percent unproved (0% to 100%) against processed clause limit (10² to 10⁵); curves: Auto, WaveNet 640*, WaveNet 256, WaveNet 256*, WaveNet 640, CNN, CNN*]


Harder Mizar top-level statements

Model            DeepMath 1   DeepMath 2   Union of 1 and 2
Auto             578          581          674
*WaveNet 640     644          612          767
*WaveNet 256     692          712          864
WaveNet 640      629          685          997
*CNN             905          812          1,057
CNN              839          935          1,101
Total (unique)   1,451        1,458        1,712

Overall proved 7.4% of the harder statements

• Batching and hybrid necessary

• Model accuracy unsatisfactory


Generative Models of Text

[A. Karpathy 2016]

Take Home

• Proofs are hard

• Machine learning key to most powerful proof assistant automation

• Older but very efficient algorithms with significant adjustments

• Many other learning problems and scenarios

Not covered

• Learning strategy selection [Jakubuv, Urban]

• Kernel methods [Kuhlwein]

• SVM [Holden]

• Deep Prolog [Rocktaschel]

• Semantic Features, Conjecturing

• Tactic selection [Nagashima, Gauthier]

• Adversarial Networks [Szegedy]

• Human proof optimization

• Theory exploration [Bundy+]

• Learning to formalize [Vyskocil]

• ...

