## Sledgehammer: some history, some tips

13 Apr 2022

Sledgehammer is the subsystem that links Isabelle/HOL to automatic theorem provers like Vampire and Z3. It is so much part of the Isabelle user’s everyday experience that it can be hard to remember a time before it was there. Let’s see if I can dig up some memories, and also come up with some usage tips relevant today.

### The beginnings

My memories are hazy, but I was at a conference (very likely IJCAR 2001, in Siena) sitting on some steps and talking to a representative from Springer. He said something like “How can we increase the circulation of Journal of Automated Reasoning?” Is it profitable?, I recall asking. “Oh yes!” Then perhaps you could try to hold down the cost of a subscription?

That conversation went nowhere, but I recall bumping into Andrei Voronkov, the architect of the Vampire prover and one of the tiny elite group of people who know how to make resolution theorem proving actually work. It was his idea to see if we could combine our technologies and build something that was greater than both. I forget what happened in the interim, but by the time I received the relevant grant, I found myself also working with Christoph Weidenbach and his SPASS prover. The Cambridge team included Jia Meng and Claire Quigley, and we already had a preliminary paper ready for IJCAR 2004.

### Why automation?

We speak of automatic versus interactive theorem provers, but this dichotomy is misleading. (If there are two tools that do exactly the same thing, and one of them is fully automatic, what is the other one for?) In fact they do quite different things.

• Automatic theorem provers solve tough, one-shot problems, typically in a fairly weak formalism like first-order logic, perhaps augmented with arithmetic.

• Interactive theorem provers are primarily specification editors. Users can build tangled webs of theories and theory libraries, and on the way, prove a series of facts leading to some deep results.

Automation is necessary because proofs reduced to bare logical rules can be unfeasibly long. You are going to struggle unless your prover regards a fact such as this is trivial:

\begin{align*} C\not=\emptyset \quad\Longrightarrow\quad \bigcap_{x\in C} \bigl(A(x) \cap B(x)\bigr) = \bigl(\bigcap_{x\in C} A(x)\bigr) \cap \bigl(\bigcap_{x\in C} B(x)\bigr) \end{align*}

Isabelle could prove this automatically already in 1988. What about the proof assistant you use?

### Why resolution?

It was trendy to despise resolution theorem proving in the early 2000s. Other technologies, such as model checkers, BDDs and SAT solvers, were making a huge impact on software and hardware verficiation. Resolution provers were incredibly powerful, but earlier ideas to use theorem-proving to accomplish planning or even general intelligence had clearly failed. Resolution was looking like a solution without a problem.

Around that time, I had been using Isabelle to verify cryptographic protocols (also here), with considerable success. I was quietly pleased to see an attempt to replicate my work using a Certain Other Ballyhooed System (not type theory based) fail utterly. There then appeared an automatic protocol verifier called TAPS (alternative link). Its results were too good to be true, and I’m sorry to confess that I was suspicious. I asked the author, Ernie Cohen, a series of technical questions designed to find out whether TAPS really was giving the right answers. It was: Cohen possessed some sort of magic bullet. I could not understand his translation from protocol models to first-order logic. But I did note that his system proved theorems using SPASS.

### The objectives and the obstacles

The original proposal (written in 2003) said

Twenty years ago, when many users had to share a single computer, a proof command could realistically take at most a few seconds of processor time. Now that 2GHz processors are commonplace, we should abandon the traditional mode of interaction, where the proof tool does nothing until the user types a command. Background processes (perhaps on several computers) should try to prove the outstanding subgoals.

It was clearly time to ask the hardware to do the hard work. By 2005, the objectives had become more clear-cut, as we can see in an early paper also here:

• user interaction should be minimal: provers should run in the background; users should not have to select relevant lemmas
• automatic provers should not be trusted
• proofs should be delivered in source form

These criteria came from reading about prior efforts to combine systems, which however impressive tended to require users to do quite a bit of work up front. Not trusting the external prover is intrinsic to our LCF ethos. So part of our task is somehow to extract a valid Isabelle/HOL proof from the proof emitted by Vampire or SPASS. And now we were also using the E prover, by Stephan Schulz.

The difficulties were already clear in that early paper. It included techniques for encoding the type polymorphism and type classes of Isabelle/HOL in first-order logic, and declared the necessity of techniques to identify relevant lemmas automatically.

### The first working version of sledgehammer

Sledgehammer became fully operational in 2007. Papers from that era described techniques for translating higher-order problems into first-order clauses (the published title of this paper is actually a typographical error) and a simple but workable relevance filter (see also here). Both papers were joint with Jia Meng and described techniques that had been refined by expanding vast amounts of computer time on a processor bank.

The task of translating resolution proofs into Isabelle/HOL proofs defeated us. We resorted to importing Joe Hurd’s own resolution prover, Metis, which had been designed with the specific aim of being integrated within LCF-style proof assistant (HOL4). Metis was not at all competitive with Vampire or SPASS, but it could do the job if given an absolutely minimal set of axioms, which we could extract from the proof emitted by the external prover. In short, the external prover was being used merely to check that a proof existed and from which specific facts. Someone told me that Christoph Weidenbach was shocked to hear this, but whatever works, works.

So the objective of delivering proofs to the user in source form boiled down to returning a call to metis with the appropriate set of lemmas. This part of the work was done by another postdoc, Kong Woei Susanto.

### The next generation

I know much less about what happened since 2007. It happened at TU Munich, and I was distracted by other research projects. It’s not surprising that others wanted to create a new version of sledgehammer: the original was a prototype, and frankly it was a mess. The single greatest flaw: its translation from Isabelle/HOL into first-order logic was unsound, so “proofs” were regularly returned that could not be run.

The Munich group conducted a big empirical evaluation and only a year later it had been extended to also call SMT solvers. These projects were done by Jasmin Blanchette, Sascha Böhme and Tobias Nipkow. Blanchette has gone on to pursue an extended project related to sledgehammer, including efficient, sound translations (unsoundness is completely gone now), vastly improved proof reconstruction and better support for higher-order logic theorem proving (e.g. through Zipperposition).

### Usage tips

Although sledgehammer offers configuration-free one-click invocation, there are some useful tips in the manual. I almost always run sledgehammer using the jEdit panel, but occasionally it’s a good idea to create a specialised invocation, and you need to read the manual for that. I tend to do this when the panel found a proof but a better one might be found if more time was allocated.

The most obvious advice: keep your subgoals simple. A simple tip is to first apply auto, then use sledgehammer to prove each of the resulting goals. This proof will be a mess and will require restructuring, but getting that first proof is a major step.

Goals containing giant set comprehensions can be difficult even for humans, let alone the computer. Introduce abbreviations judiciously to avoid repeated terms, especially constructions containing bound variables.

If you’ve done all those things and sledgehammer still fails, see if you can think of some intermediate fact that follows from your assumptions and that could help to prove your conclusion. If sledgehammer can prove that fact, you have made progress.

### Sledgehammer in action

To get an idea of how Isabelle/HOL and sledgehammer work together, you might consider watching Martin Kleppmann’s video entitled Correctness proofs of distributed systems with Isabelle. It is chiefly aimed at systems researchers and software developers unfamiliar with the concept of formal verification. He presents his problem in distributed systems and then introduces Isabelle/HOL, starting about five minutes in. He develops some Isabelle proofs right before your eyes, including a number of successful sledgehammer calls.

### That last snarky remark

One of the reasons I prefer higher-order logic to dependent type theories — apart from simple semantics, equality that works and no need to put everything in the kernel — is that dependent types seem to make automation much more difficult. Groups with access to institutional support and steady, ample resources still don’t seem to have duplicated what had been built at Cambridge on a single £250K grant. And please don’t say “dependent type checking makes other automation unnecessary”. Yes, I keep hearing this.