Twenty years in aerospace have taught me the importance of understanding the limitations of our engineering models. We build systems where a single failure mode, such as combustion instability in a rocket engine or resonance in a wing, can be catastrophic. The extreme environments aerospace systems operate in demand rigor, but they also require taking smart risks to develop new and improved designs. We cannot test every scenario, including combined structural, thermal, acoustic, and aerodynamic loading, before first flight, so the success of a first flight depends on strong engineering models. A good model must be technically sound, but since all models are approximations, it must also declare its limits so we know when we're taking appropriate risks versus simply gambling.

As I work to integrate AI into my engineering workflow, I find that current AI models provide little to no information about applicability, uncertainty, or sensitivity to inform risk. When an LLM gives you a surprisingly helpful answer, it's like a rocket engine that performs better than expected: you're thrilled, but you should also be concerned. Systems that work for reasons you don't understand will eventually fail for reasons you don't understand.

The first step is understanding whether we have the right model: is it applicable to the topic of our problem, and to what degree? Next, we could use the concept of model anchoring from aerospace. A model is anchored when it is well correlated to real-world testing and you understand its limitations. For example, an aerodynamic model anchored to wind tunnel data within 5% accuracy for Mach 0.7 to 0.85, at angles of attack from -5° to +10°, is trustworthy within those bounds. Outside these bounds we know we're extrapolating, and we treat the results accordingly. An LLM should do the same: "This answer is interpolating within and near training data (high confidence)" vs. "This answer is extrapolating beyond my training envelope (lower confidence)." And finally, we would like to understand how stable the model's output is. Does it rely on one piece of training data, or does it fit a well-established trend?

Note that I use the term "confidence" in this discussion to mean probability of correctness (not a statistical confidence interval). Also, the values I provide for each category are for illustration only and are not rigorously derived.

 

Three basic questions we could ask are: Is the model domain relevant to the problem topic? Does the range of the model encompass the specific problem, and/or is the training data closely related to our specific problem? And how sensitive is the model output to the training data or problem input? Here is how this thought process could apply to LLMs in critical applications:

 

Question 1: Is the model domain relevant to the problem topic? Before using a structural model in a high-cycle fatigue application, we need to confirm it can handle fatigue calculations. Similarly, before using an LLM in a particular application, we'd want to know: did the model's training include significant data on this topic? Was the query about a topic with dense training representation (common Python questions, popular historical events) or sparse representation (niche technical domains, recent developments, uncommon language combinations)? What the user might see:

  • "High training density: Topic well-covered (confidence: 95%)”

  • "Moderate training density: Some coverage (confidence: 66%)"

  • "Low training density: Sparse coverage (confidence: 33%)"

Training density alone wouldn't cover data quality, conflicting sources, or temporal bias, but it could serve as a first-order approximation of model applicability. For proprietary models, the semantic distance from training-distribution clusters would need to be evaluated without exposing training data details.
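
To make this concrete, here is a minimal sketch of how a provider might approximate training density: embed the query and compare it to pre-computed centroids of training-topic clusters. The embedding model, the centroid file, and the thresholds are all illustrative assumptions on my part, not a real vendor API.

```python
# Minimal sketch: approximate "training density" as the similarity between a
# query and pre-computed centroids of training-topic clusters. The embedding
# model, the centroid file, and the thresholds are illustrative assumptions,
# not a real vendor API.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical: unit-normalized centroids of training-topic clusters,
# pre-computed offline by the model provider.
topic_centroids = np.load("training_topic_centroids.npy")

def training_density(query: str) -> tuple[str, float]:
    """Return a coarse density label and an illustrative confidence value."""
    q = embedder.encode(query, normalize_embeddings=True)
    nearest = float(np.max(topic_centroids @ q))  # cosine similarity to closest cluster
    if nearest > 0.7:
        return "High training density: topic well-covered", 0.95
    if nearest > 0.4:
        return "Moderate training density: some coverage", 0.66
    return "Low training density: sparse coverage", 0.33

label, conf = training_density("How do I reverse a list in Python?")
print(f"{label} (confidence: {conf:.0%})")
```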

 

Question 2: Does the range of the model encompass the specific problem, and/or is the training data closely related to our specific problem? It's always important in an engineering model to know whether the problem you are solving is within the model's anchored range and how close it is to any anchor point. For example, a model anchored on the linear elastic range of a material's stress-strain curve will not accurately predict strain near the transition or in the plastic deformation region. Does the training data cover the specific problem with closely related content (e.g., combining familiar Python concepts), allowing direct citation of sources or interpolation? Or will the model need to extrapolate beyond its training distribution or far from any training data (novel problem types or unusual combinations)? This 2×2 set of parameters (range and proximity) could be thought of as training data geometry. What the user might see:

  • "Well-covered: Interpolation + near anchors (confidence: 95%)"

  • "Moderate coverage: Interpolation but far from anchors (confidence: 75%)"

  • "Edge coverage: Extrapolation but near anchors (confidence: 50%)"

  • "Poor coverage: Extrapolation + far from anchors (confidence: 25%)"

Measuring interpolation vs. extrapolation in high-dimensional semantic space is far harder than identifying where you are on a 2D stress-strain curve, but if solved, it could significantly increase our confidence in a model's answer. For proprietary models, this would require vendors to disclose some training-distribution information, which might reveal model limitations they would rather keep private.
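
As a thought experiment, the 2×2 geometry check could reduce to something like the sketch below. The interpolation flag and the anchor distance would have to come from provider-side tooling that does not exist today; the threshold, labels, and confidence values are illustrative only.

```python
# Sketch of the 2x2 "training data geometry" check: is the query inside the
# training region (interpolation) and how close is it to the nearest anchor?
# Both inputs would have to come from provider-side tooling that does not
# exist today; the threshold, labels, and confidences are illustrative.
def coverage_label(interpolating: bool, distance_to_nearest_anchor: float,
                   near_threshold: float = 0.3) -> tuple[str, float]:
    near = distance_to_nearest_anchor <= near_threshold
    if interpolating and near:
        return "Well-covered: interpolation + near anchors", 0.95
    if interpolating:
        return "Moderate coverage: interpolation but far from anchors", 0.75
    if near:
        return "Edge coverage: extrapolation but near anchors", 0.50
    return "Poor coverage: extrapolation + far from anchors", 0.25

# Example: a query just outside the training region but close to known
# examples (hypothetical numbers).
label, conf = coverage_label(interpolating=False, distance_to_nearest_anchor=0.2)
print(f"{label} (confidence: {conf:.0%})")
```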

 

Question 3: How sensitive is the model output to the training data or problem input? In aerospace, we test model robustness by doing sensitivity studies or perturbing inputs: if changing the angle of attack by 0.1° causes predicted lift to swing by 50%, the model may not be stable. A stable model usually shows consistent, predictable responses to small input changes. We also test sensitivity to the underlying data. If removing a single test point drastically changes predictions, the model is over-reliant on limited data rather than capturing a robust trend. For LLMs, we could measure stability by varying the query phrasing or by using Monte Carlo dropout to measure prediction variance (running inference multiple times with random neurons disabled). High variation in responses suggests the model may be relying on sparse or conflicting training data. Low variation suggests the answer is well-supported by multiple independent sources in training. What the user might see:

  • "High stability: Consistent answers across various prompts (confidence: 95%)"

  • "Medium stability: Some variation in responses (confidence: 75%)"

  • "Low stability: Significant response variation (confidence: 50%)"

Both prompt variation testing and Monte Carlo inference methods exist but are computationally expensive, requiring multiple queries per answer. However, even occasional stability testing on critical queries could provide valuable confidence information.
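
Of the three questions, stability is the most testable from the outside today. Here is a rough sketch of prompt-variation testing, assuming a placeholder `ask_llm` function and a sentence-embedding model to compare answers; the similarity thresholds are illustrative.

```python
# Sketch of prompt-variation stability testing: ask the same question several
# ways, embed the answers, and use mean pairwise similarity as a stability
# score. `ask_llm` is a placeholder for whatever inference call you have;
# the similarity thresholds are illustrative.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def stability_score(paraphrases: list[str], ask_llm) -> tuple[str, float]:
    answers = [ask_llm(p) for p in paraphrases]
    vecs = embedder.encode(answers, normalize_embeddings=True)
    sims = [float(vecs[i] @ vecs[j])
            for i, j in combinations(range(len(vecs)), 2)]
    mean_sim = float(np.mean(sims))
    if mean_sim > 0.90:
        return "High stability: consistent answers", 0.95
    if mean_sim > 0.75:
        return "Medium stability: some variation", 0.75
    return "Low stability: significant variation", 0.50
```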

 

An overall confidence assessment could be made by compounding the factors above. Just as engineering safety factors combine structural, environmental, and operational factors, each source of uncertainty reduces the overall confidence. To illustrate how this could work, we can simplify the formula for overall confidence to Q1 × Q2 × Q3. For example, here are two high-consequence queries with this framework applied:

 

Query 1: "Based on these symptoms, do I have Lyme disease?"

  • Q1: High density (significant medical content on this topic, confidence: 95%)

  • Q2: Poor coverage for specific symptoms (diagnosis requires extrapolation beyond knowledge, confidence: 25%)

  • Q3: Low stability (symptoms map to multiple conditions, confidence: 50%)

Overall confidence: 95% × 25% × 50% ≈ 12%. Recommendation: The model recognizes Lyme disease is serious and requires prompt diagnosis (high domain knowledge) but cannot reliably map your specific symptoms to a diagnosis (poor geometric coverage, low stability). Consult a medical professional immediately.

 

Query 2: "What is the minimum wire gauge for a 20-amp circuit?"

  • Q1: High density (electrical codes well-covered in training, confidence: 95%)

  • Q2: Well-covered (standard residential wiring, direct citation, confidence: 95%)

  • Q3: High stability (consistent code requirements, confidence: 95%)

Overall confidence: 95% × 95% × 95% ≈ 86%. Recommendation: The model can reliably cite electrical code requirements for common residential wiring. This provides a high-confidence reference for licensed professionals, with direct citation links for verification. Due to fire and shock hazards, always consult a licensed electrician for installation.
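
The compounding step itself is trivial to express; the numbers below are just the illustrative values from the two queries above.

```python
# Compounding the three illustrative scores (Q1 x Q2 x Q3) for the two
# example queries above.
def overall_confidence(q1: float, q2: float, q3: float) -> float:
    return q1 * q2 * q3

print(f"Lyme disease query: {overall_confidence(0.95, 0.25, 0.50):.0%}")  # ~12%
print(f"Wire gauge query:   {overall_confidence(0.95, 0.95, 0.95):.0%}")  # ~86%
```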

 

In aerospace we often use formal risk assessments composed of two variables: likelihood and consequence. The highest risks have both high likelihood and high consequence. Query 1 is high risk, given the high consequence of the topic and the high likelihood of an improper diagnosis (low confidence in the right answer). Query 2 is medium risk: the consequence of the topic is also high, but the likelihood of incorrect information is low (high confidence in the right answer). This type of risk assessment could be added to critical query results, with likelihood derived from the overall confidence and consequence based on the criticality of the topic.
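
A sketch of how that risk assessment could be automated: derive likelihood from the overall confidence and take consequence from the criticality of the topic. The bucket boundaries here are invented for illustration.

```python
# Sketch of a likelihood x consequence risk assessment: likelihood of a wrong
# answer comes from the overall confidence, consequence from the criticality
# of the topic. The bucket boundaries are invented for illustration.
def risk_level(overall_conf: float, consequence: str) -> str:
    likely_wrong = overall_conf < 0.5
    if consequence == "high" and likely_wrong:
        return "HIGH risk: do not rely on this answer"
    if consequence == "high":
        return "MEDIUM risk: usable as a reference, verify independently"
    return "LOW risk"

print(risk_level(0.12, "high"))  # Lyme disease query -> HIGH risk
print(risk_level(0.86, "high"))  # wire gauge query   -> MEDIUM risk
```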

 

This framework faces real challenges, such as measuring semantic distance in high-dimensional space, accessing proprietary training distributions, and the high computational cost of stability testing. Additionally, the simple multiplicative combination of scores (Q1 × Q2 × Q3) treats the uncertainty sources as independent; in reality, training density, geometry, and stability are almost certainly related in complex ways and should be combined more rigorously. Despite all this, even imperfect risk assessments, provided with the context of their derivation, would be valuable. Aerospace has used risk assessments built on imperfect models for decades by understanding their limitations. Does this framework resonate with anyone else thinking about the use of AI in critical applications?
