Harmonic is continuing to make progress toward mathematical superintelligence, and we have three updates to share today:

- A new state-of-the-art of 90% on the MiniF2F benchmark
- A public release of our internal updates to MiniF2F
- An all-new natural-language interface

Last month we announced Aristotle our first model under development. When prompted with natural-language math problems, Aristotle is able to formalize the problem in Lean 4, solve the problem in a formally verified way, and output the answer in both natural-language and Lean. This is an important step towards our long-term goal of solving advanced problems in research mathematics and verifying all of the world's reasoning.

We plan to track our system's growing capabilities on key mathematics benchmarks. We are starting with MiniF2F, the standard formal mathematics benchmark, which separates the problem of formal theorem proving from the translation between natural language and formal math.

Let's dig in.

90% on MiniF2F

We're excited to announce that we have achieved 90% on MiniF2F, a new state-of-the-art that improves on our previous result of 83% last month. 1 2

MiniF2F measures our model's core problem-solving ability when presented with formally-specified problems. These problems are drawn from national and international high-school math competitions, as well as high school and undergraduate math classes, which lets us benchmark Aristotle against both human performance and prior research. The problems span a range of difficulty, from simple calculations to extremely challenging proofs: three of the validation-set problems are drawn from the International Mathematical Olympiad (IMO), widely considered exceptionally difficult even for trained competitors. Going forward, we will continue to track progress publicly on formal-only benchmarks like MiniF2F, as well as tasks involving natural-language understanding.

An updated & corrected MiniF2F

To accurately assess our system's capabilities, we have made several improvements to MiniF2F:

- We translated the problem statements to Lean 4, the leading system for formal mathematics today 3
- We re-split the 488 problems uniformly randomly into a training set of 392 problems, a validation set of 48 problems, and a test set of 48 problems. 4
- We ensured that each formal statement is associated with an accurate natural-language statement, allowing us to evaluate autoformalization capabilities.
- We fixed many incorrect formalizations, including several theorem statements that became trivial or impossible in their original Lean encodings 5

Today, we are releasing our revision of MiniF2F . We hope it will support future work on AI for formal mathematics using Lean 4.

Natural-language interface

This automatic translation of natural-language problems and solutions into their formal representation is known as autoformalization, and it's a key capability of our system. Autoformalization enables Aristotle to collaborate with mathematicians and educators who may not know Lean, checking their work and filling in details grounded by formal understanding. It also makes the world of existing natural-language math, in the form of textbooks, research papers, and blog posts, available as training data for formal reasoning.

Below is an unmodified example of Aristotle at work on question 6 from the 2001 IMO. We prompt Aristotle with the natural-language statement of the problem, a natural-language proof of the problem, as well as the formalized statement of the problem.

` ````
/-
PROBLEM
Let $a, b, c, d$ be integers with $a>b>c>d>0$.
Suppose that $ac+bd=(b+d+a-c)(b+d-a+c)$.
Prove that $ab+cd$ is not prime.
PROVIDED SOLUTION
We'll use a proof by contradiction. Assume that $ab+cd$ is a prime number.
Rearranging terms in $ac+bd=(a+b-c+d)(-a+b+c+d)$, we get $b^2+bd+d^2=a^2-ac+c^2$.
Thus we have $(ac+bd)(b^2+bd+d^2)=ac(b^2+bd+d^2)+bd(a^2-ac+c^2)=(ab+cd)(ad+bc)$.
Therefore, $ac+bd$ divides $(ab+cd)(ad+bc)$.
Since we assumed that $ab+cd$ is a prime number, either it divides $ac+bd$, or $ac+bd$ divides $ad+bc$.
Therefore, we have either $ab+cd \le ac+bd$ or $ac+bd \le ad+bc$.
Each of these cases contradicts the rearrangement inequality.
-/
theorem IMO_2001_Q6 (a b c d : ℤ) (hab : a > b) (hbc : b > c) (hcd : c > d) (hd : d > 0)
(hacbd : a * c + b * d = (a + b - c + d) * (-a + b + c + d)) :
¬ Prime (a * b + c * d) := by
sorry
```

Fig 1: Unmodified prompt given to Aristotle to autoformalize the solution to question 6 of the 2001 IMO.

In its output, Aristotle fills in the proof, guided by the natural-language solution in the prompt. Aristotle is able to translate each step of the natural-language proof to valid Lean 4, eventually arriving at a complete formally-verified proof.

` ````
theorem IMO_2001_Q6 (a b c d : ℤ) (hab : a > b) (hbc : b > c) (hcd : c > d) (hd : d > 0)
(hacbd : a * c + b * d = (a + b - c + d) * (-a + b + c + d)) :
¬ Prime (a * b + c * d) := by
-- We'll use a proof by contradiction. Assume that $ab+cd$ is a prime number.
intro habcd
-- Rearranging terms in $ac+bd=(a+b-c+d)(-a+b+c+d)$, we get $b^2+bd+d^2=a^2-ac+c^2$.
have hb2bda2ac : b^2 + b * d + d^2 = a^2 - a * c + c^2 := by
linarith
-- Thus we have $(ac+bd)(b^2+bd+d^2)=ac(b^2+bd+d^2)+bd(a^2-ac+c^2)=(ab+cd)(ad+bc)$.
have multiply_out : (a * c + b * d) * (b^2 + b * d + d^2) = (a * b + c * d) * (a * d + b * c) := by
ring_nf
nlinarith [ mul_self_nonneg ( a - c ), mul_self_nonneg ( b - d ), hacbd, habcd, hab, hbc, hcd, hd, hb2bda2ac.symm, -1 < c * d, ( a - b - c - d ) / b < 1, -1 < a * d - b * c, ( ( -a + b - c - d ) / d : ℤ ) < 1, ( b * d : ℤ ) > 0, b * d + c * c < b * d + a * a ] <;> ring_nf <;> try aesop <;> interval_cases a <;> any_goals norm_num at habcd ⊢
-- Therefore, $ac+bd$ divides $(ab+cd)(ad+bc)$.
have divides : a * c + b * d ∣ (a * b + c * d) * (a * d + b * c) := by
exact ⟨ _, multiply_out.symm ⟩ <;> rw [ mul_comm ] <;> nlinarith [ hab, hbc, hcd, habcd, hb2bda2ac, multiply_out.symm ]
-- Since we assumed that $ab+cd$ is a prime number, either it divides $ac+bd$, or $ac+bd$ divides $ad+bc$.
have case1 : a * b + c * d ∣ a * c + b * d ∨ a * c + b * d ∣ a * d + b * c := by
exact?
rcases case1 with ( h | h ) <;> have := Int.le_of_dvd ( by nlinarith ) h <;> nlinarith
```

Fig 2: Unmodified output from Aristotle to the prompt in figure 1.

Join us

Aristotle's improved reasoning and autoformalization capabilities represent a major step toward mathematical superintelligence that can easily converse in natural-language. We look forward to sharing our progress as we continue to scale/advance Aristotle's capabilities. If building the world's most advanced mathematical reasoning engine excites you, join us or follow us

- Tudor, Vlad, and the Harmonic Team

^{1}
We evaluate Aristotle in a setting where additional external computer algebra systems are permitted to solve simple subproblems in a trusted way. Our Lean results are not based on expert iteration on the validation set.

^{2}
Deepseek results:
https://arxiv.org/html/2405.14333v1
and LEGO-prover results:
https://arxiv.org/abs/2310.00656

^{3}
We began with formal statements released by the
Hoskinson Center
,
themselves derived from
Facebook Research's fork
which fixed several issues with the
original benchmark
.

^{4}
The original dataset specified 244 validation and 244 test problems, without a training set. Prior research on MiniF2F typically involved training on the validation set, but we wanted to evaluate our model's ability to solve unseen problems.

^{5}
We believe our formalizations accurately represent the original, natural-language statement. We welcome any corrections, contributions, or feedback from the community.