Sign in
Measuring impact

When prompt engineering just isn’t enough: How Upright built a model for sizing impacts

Ground-truth on impact size is sparse, and no one can define it generally. Here’s how we trained a model anyway.

Juho Ojala

Juho Ojala

CTO & Co-Founder, Upright

Published Jun 15, 2026

Today we are introducing Leena-2, Upright's vertically-trained large language model for high-performance impact size inference. Leena-2 will soon power most impact estimates on the Upright Platform: for every product and service in our universe, it sizes positive and negative impacts on the environment, human health, society, and knowledge. Comprehensively, quantitatively, and, crucially, coherently.

This post explains why we built it, how it works conceptually, and what we learned. In brief:

  • Coherence lets the model build a theory of impact size. For some products and impacts, measured ground truth exists, but it's sparse, and no one has a general way to define impact, or its size, that holds across everything. So we lean on what science and the available data tell us, and train the model to make those judgments coherent across thousands of comparisons. In reconciling them, it implicitly discovers a consistent theory of impact size, anchored by real data and science-grounded research where they exist, and extended, consistently, to where they don't.
  • Training for coherent impact comparison unexpectedly improves general estimation ability. The post-trained model consistently outperforms its own base model on a hardened version of an academic Fermi estimation benchmark, despite never being trained on Fermi problems.
  • The coherence of final estimates comes from two multiplicative mechanisms: coherence trained into the model, and coherence computed at inference time by mathematically reconciling thousands of pairwise comparisons per product. The latter is only practical with an efficient vertically trained model, like Leena-2.
Why you can't just ask a general-purpose LLM

Imagine being asked whether the greenhouse gas emissions of steel production are a bigger deal than steel's contribution to societal infrastructure, or whether the joy people get from chocolate outweighs its health harms. Now imagine answering thousands of such questions, across hundreds of different impacts, in a way that stays logically consistent: if you say impact A is ten times B, and C is five times A, you are committed to C being fifty times B, without remembering your previous answers.

Last year, we published an evaluation of frontier LLMs' ability to quantify consequences. The headline finding: general-purpose models know a surprising amount about the impacts of products and services, but their quantitative judgments are locally plausible and globally incoherent. Sizes don't survive transitivity checks, flip depending on framing, and systematically underweight well-researched harms in favor of what is most commonly repeated in training data.

The reason is structural, not a fixable prompt issue. Questions about the relative magnitude of real-world consequences have no ground-truth answers. They bundle counterfactual reasoning ("compared to what?"), causal attribution ("how much of this is the product's doing?"), and value systems ("how do you weigh a tonne of CO₂e against a person's eyesight?"). There is no answer key anywhere, and so nothing in pre-training that would force a model's thousands of separate magnitude judgments to agree with each other.

Eight years of vertical machine learning

Upright has been answering exactly these no-answer-key questions since 2017, quantifying the net impact of companies in a way that is science-based, comprehensive, and comparable. We started using large language models for it in 2018, when we trained Google's BERT (the first publicly available LLM) on 30,000 research papers our team manually labeled. That model (introduced to the world in 2018 as "Leena-Leena, the net impact AI") and its successors have processed hundreds of millions of scientific articles into the structured impact knowledge behind our platform.

Leena-2 continues that lineage. The constant across eight years: we treat LLMs as machinery for integrating information and applying scientific research to impact quantification at scale.

Why didn't anyone else build this?

A fair question to ask about any claimed invention. Our honest answer: for most estimation problems, plenty of unambiguously correct answers exist, so everyone sensibly trains directly on correct answers. Impact-size comparisons offer no such luxury. As discussed above, they entangle value systems, counterfactuals, and causal attribution so thoroughly that even defining the question precisely remains elusive.

Our approach is to train the model to implicitly discover a coherent framework for these comparisons, as a workaround for the human inability (ours included) to define exactly what it means to compare the size of two impacts. That is a decidedly non-obvious move, and we suspect we only landed on it because answering questions with no "correct" answers has been this company's entire job since 2017. Most teams optimizing LLMs for estimation never had a reason to stop trusting single answers; we never had the option to start.

How Leena-2 is trained

Conceptually, training proceeds as a loop:

  1. A strong model (with access to web search and instructed to ground its answers in scientific research, using tools to perform computations) runs a large number of pairwise impact-size comparisons, each impact compared against many others.
  2. Because those answers are never fully consistent with each other, we mathematically solve for the implied impact sizes: the single set of sizes most coherent with all of the pairwise answers at once.
  3. The implied sizes become "correct answers" for generating training data.
  4. The model is post-trained on that data, and the loop repeats with the improved model.

In effect, the model is pulled toward the most coherent version of its own research-grounded judgments.

The base model is a 14-billion-parameter open-weights dense model from EU-based Mistral AI, which was the best-performing model fitting our criteria (dense model around 10B-20B params). We were happy to see the European option win on merit 🇪🇺.

Where the coherence comes from

A subtle but central point: the coherence of the final impact estimates served comes from two distinct mechanisms, not one.

1. Coherence trained into the model. As described above, Leena-2 is post-trained specifically to produce coherent answers. Its individual estimates are not independent guesses; they draw on one implicitly learned, consistent framework for weighing consequences.

2. Coherence computed at inference scale. The estimates we publish are never single model outputs. For each product, the production harness runs thousands of pairwise comparisons, both between the product's own impacts and against the impacts of other products, and then mathematically solves for the set of implied impact sizes most coherent with all of them at once. Any single comparison can be noisy; the solved system is not.

 
Two mechanisms, multiplied
1: Coherence trained into the model
Strong model runs pairwise impact-size comparisons, grounded in web search and scientific research
Solve for the single most coherent set of implied impact sizes
Use implied sizes as training targets to post‑train Leena‑2
↺ repeat with the improved model
2: Coherence computed at inference scale
Leena‑2
Thousands of pairwise comparisons per product, between a product's own impacts and across products
Solve all comparisons into one coherent set of impact sizes
Figure 1: Conceptual illustration of Leena-2's training loop (top) and the inference-scale coherence solving in the production harness (bottom). The training loop teaches the model one consistent framework for sizing impacts; the harness then reconciles thousands of its pairwise judgments per product into a single coherent set of impact sizes.

The two mechanisms are multiplicative, not merely additive, because mechanism 2 is only practical with a vertically trained, high-performance model in the first place. Thousands of comparisons per product across the whole product universe is an enormous inference volume: with a frontier-scale general model it would be prohibitively slow and expensive, and with an untrained general model the comparisons would not be coherent enough for the solved system to converge on anything meaningful. The trained-in coherence is what makes the computed coherence feasible.

Results

Our primary benchmark measures transitive coherence: what share of a model's pairwise comparison results align with the implied impact sizes solved from all of its comparisons together. Leena-2 improves significantly over its base model on held-out questions: individual judgments agree with the model's own aggregate worldview far more often.

The result that surprised us came from a benchmark we never trained for. Fermi problems (named after physicist Enrico Fermi, who loved questions where exact data doesn't exist) are estimation questions like "how many piano tuners are there in Chicago?". Solving them requires decomposing the problem into factors you can reasonably estimate and combining those, e.g. by multiplication, into a rough answer. On a hardened version of the Fermi estimation benchmark published by the Allen Institute for Artificial Intelligence, the post-trained model consistently outperforms its own base model, a surprising generalization, since it was never trained on Fermi problems.

Our working explanation is that comparing impact sizes coherently exercises a wide range of skills (decomposition, unit sanity, counterfactual discipline, robust handling of uncertainty) that transfer to estimation problems in general. These are early results from modest training runs; we expect scaling the dataset and further optimization to widen the gap.

Joy, in cents per dollar

We aim to measure impact comprehensively, and we take this very seriously. This includes also quantifying impacts that are hard to measure, including impacts related to fluffy topics like "Meaning & Joy", which we take as seriously as GHG emissions, or Jobs and Taxes. Therefore, Upright's Impact Data Engine (and Leena-2), is probably the world's most sophisticated machinery ever assembled for quantifying meaning & joy.

Some findings from the harness tables:

  • Pizza is the most sensory-pleasurable product we have measured: 34.6 ¢ of sensory pleasure per dollar of revenue, comfortably ahead of spirits (23.6) and cotton candy (15.7).
  • Board games are the #1 product for reducing boredom (17.2 ¢/$).
  • Pets, honestly quantified: 21.4 ¢/$ of emotional pleasure, alongside 8.0 ¢/$ of distress and 3.3 ¢/$ of frustration. Joy and chaos, in the same dataset.
  • Social media scores on both "reduces boredom" and "causes boredom" simultaneously, which sounds about right.
  • Electric bicycles increase riders' feeling of autonomy (3.6 ¢/$), a reminder that getting around is partly about freedom.
  • The overall Meaning & joy champions are gardening lessons (172.8) and sailing school lessons (148.9), while meditation apps lead both "creates mindfulness" and "creates spiritual fulfillment".
 
Select impact sizes within the Meaning & Joy impact category
Figure 2: Selected impact sizes within the Meaning & joy impact category, in cents of economic/social benefit (green) or cost (purple) per dollar of revenue, as estimated by the Leena-2 harness from experimental runs.

(These data points come from experimental harness runs, not released platform scores, and will go through a validation pass before being treated as anything more than evidence that the machinery has range.)

Running Leena-2 at scale

We run Leena-2 for every product and service traded globally. Each product is evaluated against Upright's Universal Impact Structure, a hierarchy of ~800 impacts. The harness starts from the top level of the hierarchy and descends where relevant, running thousands of pairwise comparisons per product (both within the same product and across products), solving them into a coherent set of impact sizes for all products and services. The product-level impact sizes are then combined with information on companies' products and services to create company-level impact profiles.

Why this matters beyond sustainability

In January, Anthropic published Claude's new constitution, arguing that aligned AI cannot be built from rule-following alone: models need to understand why we want them to behave in certain ways, and to exercise good judgment across a wide range of novel situations rather than mechanically applying specific rules.

We agree. Exercising good judgment in novel situations means weighing the consequences of actions: coherently, comprehensively, and quantitatively. As our earlier evaluation showed, today's LLMs reason about consequences well locally but incoherently globally. A model that cannot coherently size what happens cannot reliably judge what to do.

This is why we believe the capability Leena-2 targets (coherent, comprehensive, quantified consequence-reasoning) is a necessary ingredient for two alignment problems at once. Aligning AI systems requires models whose judgments about consequences hang together. And "aligning" the economy (Upright's mission since 2017) requires exactly the same machinery, pointed at capital allocation instead of model behavior. Same capability, two markets: steering models, and steering money.

Notices

Upright has a U.S. patent pending on the post-training technique described in this post.

Citing

To reference this post, you may use the citation information below:

APA

Ojala, J. (2026). When prompt engineering just isn't enough: How Upright built a model for sizing impacts (Technical Report No. UPTR-202601). Upright Oy. https://www.uprightproject.com/blog/leena-2

BibTeX

June 15th, 2026

Juho Ojala

CTO & Co-Founder, Upright

Share:

LinkedIn logo
Juho Ojala