

This project was developed as part of the mentorship program at Carreras con Impacto (CCI). Its goal was to address a critical challenge in evaluating large language models (LLMs) within the mathematical domain. The insights and findings presented here are the result of this mentorship initiative.

In this post, we describe the process and outcome of automating the result review process for the mathematics benchmark developed by the AIxo team within Carreras con Impacto, under the AI4Math initiative (Peñaloza-Perez et al., 2025). Using scripts and LLMs, we automated the review process and compared its performance against manual evaluations conducted by mathematicians.

Introduction

Robust and scalable evaluation of large language models (LLMs) in mathematical tasks is essential for understanding the evolution of frontier models’ capabilities, as well as the potential risks they may pose. Obtaining rapid and consistent results on model performance across different areas provides valuable insights into their capabilities, limitations, potential risks, and possible applications (Bubeck et al., 2023).

However, this process is often hindered by a major bottleneck: the manual review of answers. This post describes our effort to develop an automated evaluation system that not only replicates the accuracy of a human mathematician but also does so in a scalable, deterministic, and efficient manner. We explain how, starting from a simple text-comparison script, we managed to significantly reduce review time while maintaining the same quality, addressing challenges such as synonymy and semantic-numerical equivalence.

Problem Context and Proposed Solution

The Problem: A “Glass Wall” in Evaluation

For the AI4Math Benchmark, manual review of the 105 questions per model required an average of 1.5 hours. However, as the number of models, configurations, and languages increased, this process became unfeasible: slow, demanding, and prone to inconsistencies. This led us to explore automation as a necessary step.

The intuitive initial solution was to use LLMs directly as reviewers. However, this approach proved unreliable (see Table 1). In preliminary tests, these “AI evaluators” misclassified correct answers as incorrect and vice versa, introducing non-deterministic variability. What we needed was an infallible, predictable judge.

Table 1. Performance reported by human mathematicians and LLM evaluators.

| Model | LLM Evaluator (GPT-4o) | Mathematicians | % Difference |
|---|---|---|---|
| o3-mini (CoT, Spanish) | 50.40% | 74.28% | 23.88 |
| GPT-4o (ZS, Spanish) | 38.10% | 53.33% | 15.23 |
| DeepSeek-R1 (CoT, Spanish) | 59.05% | 69.52% | 10.47 |

Our Exploration: From Raw Text to Semantic Understanding

Our goal was clear: to build an automated review system that is:

  • Deterministic: same input, same output, every time.
  • Programmatic: zero dependence on LLMs for the final decision.
  • Robust: capable of understanding that “fifty-five,” “55,” and “the answer is 55” are equivalent.

We hypothesized that the real performance gap did not lie in the models’ logic, but rather in the inability of automated systems to interpret the linguistic diversity of responses.
 

How Did We Address These Inconsistencies?

Our methodology evolved through several iterations, each addressing the limitations of the previous one:


Phase 1: Manual, Rule-Based Approach

The first stage focused on a completely manual and programmatic approach. We developed a system that could load Excel files and select key columns (such as question ID, the mathematician’s “gold standard” answer, and the model-generated response text).

To extract candidate answers, we implemented an advanced regular-expression (regex) system. It searched for anchor phrases such as “final answer,” “the result is,” or “solution:” to identify where the answer began or ended and extracted the text that followed.
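
As an illustration, the sketch below captures the flavor of this first version. The file name, column names, and anchor list are placeholders rather than the exact ones used in the project:

```python
import re
import pandas as pd

# Illustrative anchor phrases only; the real list was longer and covered Spanish variants.
ANCHOR_PATTERN = re.compile(
    r"(?:final answer|the result is|solution)\s*[:\-]?\s*(?P<answer>.+)",
    flags=re.IGNORECASE,
)

def extract_answer(response_text: str) -> str | None:
    """Return the text following the first anchor phrase, or None if no anchor is found."""
    match = ANCHOR_PATTERN.search(str(response_text))
    return match.group("answer").strip() if match else None

# Hypothetical file and column names, for illustration only.
df = pd.read_excel("model_responses.xlsx")
df["extracted_answer"] = df["model_response"].apply(extract_answer)
df["exact_match"] = (
    df["extracted_answer"].fillna("").str.strip().str.lower()
    == df["gold_answer"].str.strip().str.lower()
)
```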

However, several limitations soon appeared: some responses contained multiple anchor phrases, creating ambiguity; answer positions were inconsistent (sometimes at the beginning, other times at the end of the text); and when multiple anchors coexisted, the script could not decide which was the true answer. Ultimately, regex revealed a fundamental weakness: its inability to capture semantic equivalence.
 

Phase 2: Incorporating AI for Intelligent Extraction

After recognizing the limitations of the rule-based approach, we integrated a large language model (LLM) as a specialized extractor. Its function was to identify the text fragment containing the final answer, omit intermediate reasoning or unnecessary explanations, and resolve cases with multiple possible answers.
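
The sketch below illustrates the idea. The prompt wording and the `call_llm` helper are placeholders (the post does not specify the exact model or client used); the key point is that the LLM only isolates the answer string and never decides whether it is correct:

```python
EXTRACTION_PROMPT = """You will receive a model's full solution to a math problem.
Return ONLY the final answer, with no reasoning, explanation, or extra text.
If several candidate answers appear, return the one presented as the final answer.

Solution:
{solution}
"""

def extract_final_answer(solution: str, call_llm) -> str:
    """Use an LLM purely as an extractor of the final answer.

    `call_llm` is a placeholder for whatever chat-completion client is used;
    it takes a prompt string and returns the model's text response.
    The correctness verdict is made later by deterministic code, not here.
    """
    return call_llm(EXTRACTION_PROMPT.format(solution=solution)).strip()
```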

Next, we applied a programmatic normalization process to standardize results before comparison. This included converting numbers written in words to digits (e.g., “fifty-five” to “55”), unifying fraction, decimal, and scientific notation formats, and performing basic text cleaning (lowercasing, punctuation removal, etc.).

The final evaluation remained fully deterministic and automated. The system performed exact comparisons between normalized strings and used predefined rules to recognize common mathematical equivalences. Thus, the process combined the semantic comprehension of LLMs with the precision and reproducibility of programmatic logic.
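
A minimal sketch of the normalization and deterministic comparison steps follows. The word-to-digit map and equivalence rules shown here are illustrative; the real ones were more extensive:

```python
import re
from fractions import Fraction

# Illustrative map only; the real one covered many more number words.
WORDS_TO_DIGITS = {"fifty-five": "55", "one half": "1/2", "zero": "0"}

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, and map number words to digits."""
    text = answer.lower().strip()
    text = re.sub(r"[^\w\s./\-]", "", text)  # basic punctuation cleanup
    for words, digits in WORDS_TO_DIGITS.items():
        text = re.sub(rf"\b{re.escape(words)}\b", digits, text)
    return re.sub(r"\s+", " ", text).strip().rstrip(".")

def numerically_equal(a: str, b: str) -> bool:
    """Equivalence rule: '1/2', '0.5', and '5e-1' all denote the same number."""
    try:
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return False

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """Deterministic verdict: exact match after normalization, or a known equivalence."""
    a, b = normalize(model_answer), normalize(gold_answer)
    return a == b or numerically_equal(a, b)
```

With a gold answer of “55”, this sketch accepts “fifty-five”, “55”, and “55.0”, while still rejecting “54”.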

Phase 3: Embeddings and Cosine Similarity for Semantic Equivalence

To handle the most complex cases of synonymy and semantic equivalence, we implemented an additional embedding-based layer.

An embedding converts words or phrases into numerical vectors that represent semantic meaning, while cosine similarity measures the cosine of the angle between those vectors: the smaller the angle, the higher the similarity and the closer the meanings. This allowed the system to recognize equivalent answers such as “infinity” and “infinite,” even when they did not match literally.
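
As a rough sketch of this fallback layer (the post does not name the embedding model or threshold; the sentence-transformers model `all-MiniLM-L6-v2` and the 0.85 cutoff are assumptions for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model and threshold, chosen for illustration only.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.85

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantically_equivalent(model_answer: str, gold_answer: str) -> bool:
    """Fallback check, used only when exact and normalized comparisons fail."""
    emb_a, emb_b = embedder.encode([model_answer, gold_answer])
    return cosine_similarity(emb_a, emb_b) >= SIMILARITY_THRESHOLD

# e.g. semantically_equivalent("infinity", "infinite") should pass,
# even though the strings do not match literally.
```

Because the embedding model and threshold are fixed, this layer stays deterministic: the same pair of strings always yields the same verdict.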
 

Results and Insights

When comparing our automated evaluation against the mathematicians’ gold standard:

  • Improved Accuracy: The automated system reproduced human evaluation with high fidelity, maintaining the same quality while drastically reducing review time.
  • Broader Coverage: The embedding layer captured an additional 15% of correct answers that Regex and basic normalization missed.
  • Sustained Efficiency: reviewing one model's 105 answers dropped from ~1.5 hours of manual work (1 mathematician) to ~3-5 minutes with the automated system.
  • Development time: ~60 hours (6 weeks × 5 days × 2 hours/day)
  • Efficiency gain: 18× faster.
     

Results

The outcome is a deterministic system with semantic capability and robust consistency. This means:

  • Feasibility of Automation: It is possible to replicate human evaluation with high precision in a specialized domain such as mathematics.
  • The Value of Hybridization: Combining programmatic logic (for control) and LLMs (for semantic understanding) outperforms using either approach alone.
  • The Real Issue: A substantial portion of what was previously perceived as “model error” was in fact evaluation system error.
     

Code developed:

  • GitHub Repository: Link
     

Lessons Learned During the Development

Throughout the project, we learned that perfection is the enemy of progress. Starting with a simple, deterministic (yet imperfect) solution gave us a solid foundation to iterate and measure real progress.

We also realized that LLMs work best as tools, not as oracles. Their real value lay not in making the final judgment but in helping us prepare and normalize data so that a deterministic system could make reliable and reproducible decisions.

Another key insight was the importance of ground truth. Without high-quality human evaluation (our gold standard), it would have been difficult to validate or improve the automated system’s results. Automation, therefore, does not replace expertise; it amplifies it.

Finally, we confirmed that semantics matter deeply. In mathematical tasks, semantic equivalence is a much more complex challenge than simple textual equality, and addressing it properly was crucial to achieving fair and comprehensive evaluations.
 

General Conclusions

This project demonstrates that it is possible to overcome the bottleneck of manual review in mathematical benchmarks.

We developed and validated a hybrid system that combines the precision and scalability of automation with the semantic flexibility of LLMs, maintaining deterministic control at its core.

The result is not just a script; it is a methodological framework that enables scaling from 105 to 250 problems and beyond, allowing researchers to evaluate models quickly, consistently, and reliably. In turn, this accelerates the evaluation cycle of LLMs.

The future of AI evaluation lies not in choosing between humans and machines, but in finding the perfect synergy between them.
 

Future Improvements

Based on our findings, we identified several opportunities to strengthen and scale the automated evaluation system:
 

Standardization of Prompts and Keywords:
Future iterations should define more uniform answer structures within benchmark design, including specific guiding keywords or phrases (in addition to existing ones) to facilitate automatic data extraction. Such standardization would harmonize outputs across models and providers, reducing variability and simplifying automation.
 

Performance and Runtime Optimization:
Although AI-based extraction improved flexibility, it also increased processing time and costs. A future improvement would be to optimize the balance between AI-driven components and purely programmatic processes, maintaining semantic understanding where necessary but reducing total execution time. With a more refined hybrid approach, equivalent results could be achieved in a fraction of the current time.
 

Extension to New Domains:
Because the code relies on generalizable structures and programmatic comparison, the system can easily be adapted to domains beyond mathematics, such as physics, chemistry, or logical reasoning, using the same methodology for rapid and consistent evaluation.

These improvements pave the way toward a more agile, adaptable, and generalizable system, one capable of scaling automated evaluation across diverse knowledge areas without sacrificing precision or reliability.
