An Interpreter Layer for AEC: Cross-Modal Grounding & Ontology-Driven Retrieval

A neuro-symbolic AI middleware that grounds site photos and field notes to specific IFC element GUIDs in a BIM model: hallucination-free and fully traceable.

Chia Hui Yen

M.Sc. Computational Design · Carnegie Mellon University

The Problem: The physical–digital traceability gap

AEC projects standardise on OpenBIM/IFC, yet day-to-day site coordination still runs on manual interpretive labour: people translate unstructured evidence (photos, chat messages, floorplan patches) into model-linked, schema-compliant records. The front-line worker sees a local condition but not the BIM GUID; the coordinator sees the model but not the site context.

The final step is choosing one element from a retrieved shortlist that already contains the answer (median 76 candidates, ground truth present 100% of the time). So the difficulty is not retrieval but discriminating among visually identical siblings: what separates two windows is not in their pixels but in the building's relational structure. A black-box matcher plateaus at 6.7% right-first (Top-1) even with the answer already in the pool. This work makes that relational signal explicit as a spatial address that can be read from evidence and executed against BIM.

Multi-stakeholder AEC workflow showing information loss between physical site and digital BIM cloud
Coordination spans the physical site and the digital BIM cloud across many stakeholders, and every manual handoff leaks context.

Abstract: Overview

Coordinating AEC projects means constantly translating messy site evidence into the structured BIM record, which is error-prone manual work. This thesis asks: how can AI act as an interpreter middleware to reliably align unstructured site evidence with digital project data? The answer is a hierarchical neuro-symbolic architecture: a fine-tuned vision-language model extracts typed spatial constraints, and a deterministic symbolic layer runs a priority-ordered traversal over an enriched IFC knowledge graph, with a Graph-RAG reranker producing the final ranked candidates. The load-bearing idea is a type-conditional spatial address: a relational key computable from the BIM model with no labels and recoverable from a flat image. Supplied perfectly, it lifts pool right-first from 4.9% to 78.5%; one realized deterministic detector lifts the addressable subset to 58.9% end-to-end, and a calibrated answer/defer gate raises the answered subset to 73.4%. In other words, the system knows when to abstain.

Contributions: What this work delivers

Symbolic

Hallucination-resistant retrieval

The neural layer never writes a query. A typed Constraints JSON fills deterministic Cypher templates over an IFC-native knowledge graph, so the system can only return GUIDs that exist in the model: zero hallucinated IDs, fully repeatable.
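A minimal sketch of that separation, assuming a small whitelist of IFC classes and two fixed templates. The node labels, property names, and template keys are hypothetical and are not taken from the project's schema; the point is only that extracted values travel as bound parameters and never as query text.

```python
import json

# Hypothetical whitelist and templates; labels, property names, and
# relationship usage are illustrative, not the project's actual schema.
ALLOWED_CLASSES = {"IfcWindow", "IfcDoor", "IfcWall"}

TEMPLATES = {
    "storey_class": (
        "MATCH (e:Element {ifc_class: $ifc_class})-[:ON_STOREY]->"
        "(s:Storey {name: $storey}) RETURN e.guid AS guid"
    ),
    "class_only": (
        "MATCH (e:Element {ifc_class: $ifc_class}) RETURN e.guid AS guid"
    ),
}


def plan_query(constraints_json: str) -> tuple[str, dict]:
    """Map a typed Constraints JSON onto a fixed template plus parameters."""
    constraints = json.loads(constraints_json)
    ifc_class = constraints.get("ifc_class")
    if ifc_class not in ALLOWED_CLASSES:
        # Anything outside the whitelist is rejected, never interpolated.
        raise ValueError(f"unsupported ifc_class: {ifc_class!r}")
    params = {"ifc_class": ifc_class}
    storey = constraints.get("storey")
    if isinstance(storey, str) and storey:
        params["storey"] = storey
        return TEMPLATES["storey_class"], params
    return TEMPLATES["class_only"], params


query, params = plan_query('{"storey": "Level 2", "ifc_class": "IfcWindow"}')
```

With the Neo4j Python driver, the resulting pair could be executed as `session.run(query, params)`, so the query text stays identical across runs and only the parameters vary.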

Neural · ML Engineering

Multimodal interpreter

A fine-tuned Qwen2.5-VL (LoRA) reads photo + note + plan into a typed spatial contract, trained with zero real labels: 990 cases generated deterministically from raw IFC (structure mining → Blender hard negatives → LLM-as-judge).

Architecture

Decomposable failures

Decoupling probabilistic extraction from deterministic retrieval removes black-box failure: every error is traceable to a specific stage (type, floor, position, confidence, or ranking).

Architecture: The neuro-symbolic interpreter

Five stages: multimodal input → neural extraction (VLM + LoRA) → Constraints JSON → symbolic retrieval (priority Cypher cascade over a Neo4j IFC graph) → Graph-RAG reranking → ranked GUIDs. The neural side translates evidence into a typed spatial address; the symbolic side names only GUIDs that exist in the model. Design rationale: reliability over flexibility (deterministic templates instead of Text-to-Cypher), and an IFC-native graph built from the schema itself rather than noisy document-level extraction.

Module decomposition as graph reasoning on a real case: pool narrows from 76 to 46 same-class siblings, the position-slot resolves to rank 1
Module decomposition as graph reasoning (real case AP_SK_092). The layout is the real knowledge graph — a force-directed layout on the actual FILLS (window→host-wall hub) and NEXT_TO (consecutive openings) edges, held fixed across all four panels. The recall-safe pool of 76 candidates narrows to 46 same-storey+class windows in 6 host-wall clusters; the target's wall is a 10-opening chain and the position-slot (8 of 10) re-weights the siblings (none removed); the target lands at rank 1.

Key Modules: How it works under the hood

IFC parse engine converting raw IFC into an enriched knowledge graph

IFC Parse & Enrich Engine

Converts raw IFC into a query-ready Neo4j graph, enriching topology beyond native containment/fill with retrieval-oriented edges: NEXT_TO, CONNECTS_TO, ADJACENT_TO, ON_STOREY.

IfcOpenShell · Neo4j
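As an illustration of one retrieval-oriented edge, the sketch below derives NEXT_TO pairs by ordering the openings that fill each host wall and linking consecutive ones. The input tuples, GUIDs, and the use of an offset along the wall as the ordering key are assumptions made for the example; the real engine reads its geometry from IFC.

```python
from collections import defaultdict

# Toy FILLS data: (opening_guid, host_wall_guid, offset_along_wall_m).
# GUIDs and offsets are made up for illustration.
FILLS = [
    ("win-C", "wall-1", 7.5),
    ("win-A", "wall-1", 1.0),
    ("win-B", "wall-1", 4.2),
    ("door-X", "wall-2", 0.8),
]


def next_to_edges(fills):
    """Return NEXT_TO pairs linking consecutive openings on each host wall."""
    by_wall = defaultdict(list)
    for opening, wall, offset in fills:
        by_wall[wall].append((offset, opening))
    edges = []
    for wall in sorted(by_wall):  # sorted for a repeatable edge order
        ordered = [guid for _, guid in sorted(by_wall[wall])]
        # Pair each opening with its successor along the wall.
        edges.extend(zip(ordered, ordered[1:]))
    return edges


edges = next_to_edges(FILLS)
```

A wall with a single opening contributes no NEXT_TO edge, which is why `door-X` does not appear in the result.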
Symbolic retrieval planner combining attribute filter and topology rerank

Symbolic Planner + Graph-RAG

Nine priority strategies; the recall-preserving p0∪p1 union planner combines attribute filtering with spatial-topology ranking. Templates are deterministic: same input, same query, same result.

Cypher · priority cascade
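The source does not spell out what p0 and p1 contain, so the following is only a sketch under an assumption: p0 is a strict storey-plus-class filter and p1 a looser class-only filter, and the planner unions their results in priority order. The toy graph and both strategy functions are hypothetical.

```python
# Toy stand-in for the IFC graph; GUIDs and storey names are made up.
GRAPH = [
    {"guid": "g1", "ifc_class": "IfcWindow", "storey": "L2"},
    {"guid": "g2", "ifc_class": "IfcWindow", "storey": "L3"},
    {"guid": "g3", "ifc_class": "IfcDoor", "storey": "L2"},
]


def p0_storey_and_class(constraints, graph):
    """Strict strategy (assumed): match both storey and class."""
    return [
        e["guid"] for e in graph
        if e["ifc_class"] == constraints["ifc_class"]
        and e["storey"] == constraints["storey"]
    ]


def p1_class_only(constraints, graph):
    """Looser strategy (assumed): match class only."""
    return [e["guid"] for e in graph if e["ifc_class"] == constraints["ifc_class"]]


def union_plan(strategies, constraints, graph, take=2):
    """Union the candidates of the first `take` strategies, keeping priority order."""
    seen, pool = set(), []
    for strategy in strategies[:take]:
        for guid in strategy(constraints, graph):
            if guid not in seen:
                seen.add(guid)
                pool.append(guid)
    return pool


# A wrong storey guess ("L3") still leaves the true element g1 in the pool.
pool = union_plan(
    [p0_storey_and_class, p1_class_only],
    {"ifc_class": "IfcWindow", "storey": "L3"},
    GRAPH,
)
```

The union trades a larger pool for recall: a mistaken coarse field demotes the true element instead of deleting it, and the output is the same for the same input.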
Synthetic data pipeline: IFC mining, Blender synthesis, LLM-as-judge filtering

Synthetic Data Pipeline

Skeleton-first generation fixes ground truth before adding stochastic visual evidence: IFC mining → Blender renders with hard negatives → Gemini text → LLM-as-judge filter → 990 cases, no real labels.

Blender · Gemini · LLM-as-judge
Deterministic visual heuristics: OpenCV counting and ResNet size classification

Deterministic Visual Specialists

OpenCV handles counting and ordinal slot position; ResNet classifies element size bands. These are the sub-tasks that VLMs perform unreliably, and the specialists inject the resulting discriminative cues into the reranker.

OpenCV · ResNet
Fine-tuned VLM measured per-field, with a calibratable detector confidence

Fine-tuned VLM Extractor

Qwen2.5-VL + LoRA (G-series) emits a typed contract {storey, ifc_class, spatial_relations[]}. Fine-tuning saturates coarse fields (storey/class → 100%) and learns relation typing (direction 0% → 82%), but the discriminating slot and size fields stay at 0%, which is the evidence for delegating them to specialists.

Qwen2.5-VL · LoRA · Constraints JSON
Calibrated confidence routing and selective-prediction curve

Calibrated Routing & Control

Every field carries a {value, confidence, source} record. An orchestration layer routes on calibrated confidence (AUROC 0.80), applying constraints softly and deferring to a human instead of guessing. Routing is studied as a static → learned router → LLM-agent ablation on accuracy, latency, and repeatability.

temperature scaling · answer/defer gate
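A small sketch of how temperature scaling and an answer/defer gate can fit together. The threshold 0.40 and the two confidence values are the ones quoted in the worked cases on this page; the temperature, the function names, and the `"opencv"` source tag are hypothetical, and a real temperature would be fitted on held-out data.

```python
import math

TAU = 0.40  # answer/defer threshold quoted in the worked cases


def temperature_scale(raw_conf: float, temperature: float) -> float:
    """Rescale a raw confidence in (0, 1) by dividing its logit by T."""
    eps = 1e-6
    p = min(max(raw_conf, eps), 1.0 - eps)  # keep the logit finite
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def route(record: dict, tau: float = TAU) -> str:
    """Commit when calibrated confidence clears tau, otherwise defer."""
    return "answer" if record["confidence"] >= tau else "defer"


# Per-field records in the {value, confidence, source} shape.
confident = {"value": "2 of 17", "confidence": 0.57, "source": "opencv"}
unsure = {"value": "1 of 10", "confidence": 0.05, "source": "opencv"}

decisions = [route(confident), route(unsure)]
softened = temperature_scale(0.9, temperature=2.0)  # T > 1 pulls toward 0.5
```

A temperature above 1 softens overconfident scores and a temperature below 1 sharpens them; the ordering of scores is preserved, so the gate's ranking behaviour does not change, only where the threshold bites.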

Engineering decisions, measured against alternatives

Each contract has an experiment behind it: what to let the VLM do, what to push into deterministic geometry, when to trust an extracted address, and how much graph context is recoverable from a flat image.

Engineering question | Decision | Comparison / evidence
Should the VLM directly name the BIM element? | No. The VLM emits typed JSON; deterministic queries name only existing IFC GUIDs. | Fine-tuned VLM reaches 100% GT-in-pool but only 6.7% Top-1; sibling selection needs explicit spatial structure.
Can generic retrieval solve the task? | No. Retrieval is ontology-constrained, then re-ranked by a spatial address over the IFC graph. | Lexical/dense baselines stay at 1.7% Top-1; the address ceiling reaches 78.5% Top-1 / 98.1% Top-10 (n=60).
Should noisy address fields hard-filter candidates? | No. The address is a soft prior with calibrated answer/defer routing, preserving recall. | Realized end-to-end reaches 58.9% Top-1 (67.6% with oracle coarse fields); selective prediction lifts the answered subset to 73.4%.
How deep should graph context go? | Compile one-hop context into the element record; do not chase deep relation chains from an image. | Realizable discrimination saturates at one hop: the confusable set shrinks from 13 to about 8, with little further realized gain.
Learned perception vs deterministic specialists? | VLM for coarse semantics; OpenCV/ResNet specialists for count, ordinal slot, and size. | The VLM reaches 100% on storey/class but 0% on slot/size; the realized slot specialist reaches 58.9% Top-1 end-to-end (n=35 fillers).

Innovation: Reading space from a flat image

The thesis proved the architecture sound: with perfect perception the symbolic engine keeps the correct element 100% of the time and compresses the shortlist toward one, so the bottleneck is perception, not graph logic. The core advance deepens the deterministic specialists into a principled spatial-interpretation layer: instead of ad-hoc cues, the system recovers a spatial address, a relational key read from plain 2D images that is computable from the BIM model and recoverable from the evidence. Three questions, answered with measurements:

① Representation

What is the minimal address?

It is type-conditional: a coarse prefix (storey + class) is necessary but saturated; the discriminator is class-specific, namely a position-slot (i of M) for an opening and a connectivity fingerprint for a wall.

ceiling: right-first 4.9% → 78.5% (n=60)
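To make the representation concrete, here is a hedged sketch of an address builder. The shared prefix and the position-slot follow the description above; the exact form of the wall fingerprint is not given on this page, so the sorted tuple of connected element classes used here is an assumption, as are all GUIDs and field names.

```python
OPENING_CLASSES = {"IfcWindow", "IfcDoor"}


def spatial_address(element, openings_by_wall, connected_classes):
    """Build a type-conditional address: shared prefix plus class-specific body."""
    prefix = (element["storey"], element["ifc_class"])
    if element["ifc_class"] in OPENING_CLASSES:
        chain = openings_by_wall[element["host_wall"]]  # ordered along the wall
        slot = chain.index(element["guid"]) + 1  # 1-based "i of M"
        return prefix + (("slot", slot, len(chain)),)
    if element["ifc_class"] == "IfcWall":
        # Hypothetical fingerprint: sorted classes of directly connected elements.
        fingerprint = tuple(sorted(connected_classes.get(element["guid"], ())))
        return prefix + (("connectivity", fingerprint),)
    return prefix  # other classes keep only the coarse prefix


chain = [f"win-{i}" for i in range(1, 11)]  # a 10-opening wall, toy GUIDs
window = {"guid": "win-8", "storey": "L2", "ifc_class": "IfcWindow", "host_wall": "wall-1"}
wall = {"guid": "wall-1", "storey": "L2", "ifc_class": "IfcWall"}

window_address = spatial_address(window, {"wall-1": chain}, {})
wall_address = spatial_address(wall, {}, {"wall-1": ["IfcWall", "IfcSlab", "IfcWall"]})
```

Both addresses are computed from model structure alone, which matches the claim that the key needs no labels; whether each body can also be read back from an image is a separate question.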
② Mechanism

How to use a noisy address?

Hard filtering deletes the answer, so the address is a soft prior in a recall-fixed pool. One real detector closes most of the gap; its confidence passes a calibration gate, and its payoff is knowing when to abstain.

realized slot: 6.6% → 58.9% end-to-end · defer → 73.4% (n=35)
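The difference between hard filtering and a soft prior can be shown on a toy pool. The scoring rule below (base score plus confidence times an inverse-distance slot prior) is an assumption chosen for illustration, not the reranker's actual formula.

```python
def hard_filter(pool, predicted_slot):
    """Keep only candidates whose slot equals the prediction."""
    return [c for c in pool if c["slot"] == predicted_slot]


def soft_rerank(pool, predicted_slot, confidence):
    """Re-weight candidates by slot agreement; no candidate is removed."""
    def score(candidate):
        distance = abs(candidate["slot"] - predicted_slot)
        prior = 1.0 / (1.0 + distance)  # 1.0 at the predicted slot, decaying away
        return candidate["base"] + confidence * prior
    return sorted(pool, key=score, reverse=True)


# Toy pool of sibling windows on one wall; the true element is w3 (slot 3).
pool = [
    {"guid": "w1", "slot": 1, "base": 0.0},
    {"guid": "w2", "slot": 2, "base": 0.0},
    {"guid": "w3", "slot": 3, "base": 0.0},
    {"guid": "w4", "slot": 4, "base": 0.0},
]

# The detector is off by one: it predicts slot 2.
filtered = [c["guid"] for c in hard_filter(pool, predicted_slot=2)]
reranked = [c["guid"] for c in soft_rerank(pool, predicted_slot=2, confidence=0.6)]
```

With an off-by-one prediction the hard filter returns a pool that no longer contains `w3`, while the soft prior keeps all four candidates and only changes their order.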
③ Architecture

How deep should context go?

A depth law: deeper relations are more unique but their recovery from an image collapses with distance, so discrimination saturates at one hop. The practical rule is to compile the relation into the element rather than chase deep chains.

confusable set 13 → 8 at one hop
A type-conditional spatial address: shared storey+class prefix, plus a class-specific body β€” position-slot for openings, connectivity fingerprint for walls.
What a type-conditional spatial address is, by class. A coarse ontological prefix (storey + IFC class) is shared but non-discriminating (oracle Top-1 4.9% alone). The discriminating body is class-specific: a position-slot (i of M) for an opening (image-recoverable, realized 58.9% end-to-end) and a connectivity fingerprint for a wall (oracle-discriminative but not image-recoverable). Each reorders the same retrieved pool to the per-class oracle Top-1 shown.
The spatial-interpretation pipeline traced through one real case with real screenshots, stages (a)-(e)
Spatial interpretation, traced through one real case (AP_SK_107, real artefacts). (a) evidence → (b) per-field extraction, each a {value, confidence, source} record (VLM → storey/class; OpenCV → the position-slot; ResNet → size) → (c) the depth-1 spatial-address record → (d) calibrated routing → (e) the knowledge-graph shortlist collapses to a GUID. The highlighted lane is the confidence-routing path.

Results: The address makes the architecture realizable

Headline: under perfect extraction the symbolic layer retains the correct element in 100% of cases and compresses the candidate pool from a median of 46 → 1. The architecture is provably sound: the remaining operational ceiling is set by neural extraction quality, not by graph logic.
Oracle experiment showing the symbolic ceiling and fingerprint ladder
Oracle ceiling: enriched topology + the address drive the retrievable pool toward 1, isolating extraction as the bottleneck.

Oracle fingerprint ladder

Median live pool · AP held-out benchmark (n=60)

Fingerprint level | Median pool
L0: no filter | 1233
L1: storey + class | 46
L3: + direction + subtype | 9
L4: + exact position slot* | 1

*L4 extractable for the addressable filler subset. L3 fingerprints provide the dominant compression.
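One way to compute such a ladder is to count, for each element, how many elements share its fingerprint at a given level and then take the median. The sketch below does this on a synthetic set of 12 windows; the field names, the level definitions, and the choice to treat every element as a query are assumptions for illustration and do not reproduce the benchmark numbers.

```python
from collections import Counter
from statistics import median

# Assumed field sets per level, mirroring the ladder's labels.
LEVELS = {
    "L0": (),
    "L1": ("storey", "ifc_class"),
    "L3": ("storey", "ifc_class", "direction", "subtype"),
    "L4": ("storey", "ifc_class", "direction", "subtype", "slot"),
}

# Synthetic model: 2 storeys x 2 facades x 3 slots of identical windows.
ELEMENTS = [
    {"storey": s, "ifc_class": "IfcWindow", "direction": d, "subtype": "fixed", "slot": i}
    for s in (1, 2)
    for d in ("N", "S")
    for i in (1, 2, 3)
]


def median_pool(elements, fields):
    """Median number of elements sharing each element's fingerprint."""
    def key(element):
        return tuple(element[f] for f in fields)
    sizes = Counter(key(e) for e in elements)
    return median(sizes[key(e)] for e in elements)


ladder = {level: median_pool(ELEMENTS, fields) for level, fields in LEVELS.items()}
```

Exact-match grouping is appropriate here only because the ladder is an oracle diagnostic with perfect fields; with noisy extracted fields the same grouping would act as a hard filter.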

Downstream retrieval: the black box plateaus; the address does not

AP held-out benchmark (n=60), p0∪p1 planner.

Model / diagnostic | GT-in-pool | Top-1 | Top-10 | MRR@10
Zero-shot Gemini | 95.0% | 1.7% | 18.3% | 0.056
Fine-tuned VLM (best end-to-end) | 100% | 6.7% | 30.0% | 0.110
+ realized position-slot specialist (end-to-end) | 100% | 58.9% | 67.1% | -
+ realized slot, oracle coarse (upper bound) | 100% | 67.6% | 80.9% | -
+ type-conditional spatial-address ceiling | 100% | 78.5% | 98.1% | 0.854

The ceiling row is a diagnostic upper bound, not a deployed end-to-end system. The realized row is the actual spatial-address path currently built: a deterministic position-slot specialist for the 35 window/door filler cases; realized wall/other extractors are future work.
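For reference, the ranking metrics in the table can be computed from the 1-based rank of the ground-truth element in each case. The sketch below uses the standard definitions of Top-k rate and MRR@k; the convention that a missing element is recorded as `None` and the example ranks are assumptions.

```python
def top_k_rate(ranks, k):
    """Fraction of cases whose ground truth is ranked within the first k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting ranks beyond k (or missing) as zero."""
    total = sum(1.0 / r for r in ranks if r is not None and r <= k)
    return total / len(ranks)


# Made-up ranks for four cases: None means the truth was not in the pool.
ranks = [1, 2, None, 12]

top1 = top_k_rate(ranks, 1)
top10 = top_k_rate(ranks, 10)
mrr10 = mrr_at_k(ranks, 10)
```

GT-in-pool is the same computation without a cutoff: the share of cases whose rank is not `None`.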

The calibration is real, and it pays off as deferral

ECE calibration gate
The calibration gate passes. The detector's confidence tracks correctness (AUROC 0.80) and is only moderately mis-calibrated (ECE 0.206), so routing on it is legitimate, not assumed.
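Both gate statistics have simple textbook forms, sketched here in plain Python. The ten equal-width bins and the toy confidence values are assumptions; the page does not state how its ECE was binned.

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confs, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confs)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def auroc(confs, correct):
    """Probability that a correct case outscores an incorrect one (ties count half)."""
    pos = [c for c, h in zip(confs, correct) if h]
    neg = [c for c, h in zip(confs, correct) if not h]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy detector outputs: confidence and whether the prediction was right.
confs = [0.9, 0.8, 0.3, 0.2]
correct = [1, 1, 0, 0]

ece = expected_calibration_error(confs, correct)
area = auroc(confs, correct)
```

AUROC measures whether confidence ranks correct predictions above wrong ones, while ECE measures whether the confidence values themselves can be read as probabilities; a gate needs the first and benefits from the second.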
Realizable confusable-set size by relational depth
The depth law, measured. The model reads deep relation types reliably, yet realized discrimination saturates at one hop because those types are homogeneous: the limit is informational, not extraction reliability.

Worked cases β€” answer vs defer

Two real held-out cases through the same gate: when the detector is correct and confident it commits; when it is wrong but unsure it abstains and returns candidates. That behaviour is what makes a triage tool trustworthy.

ANSWER case AP_SK_102: predicted slot 2 of 17 correct, calibrated confidence 0.57 above threshold, commit GUID.
ANSWER (AP_SK_102). Predicted slot 2 of 17, which is correct; calibrated confidence 0.57 ≥ τ = 0.40, so the system commits the GUID.
DEFER case AP_SK_092: predicted slot 1 but truth slot 8, calibrated confidence 0.05 below threshold, abstain.
DEFER (AP_SK_092). Predicted slot 1 where truth is 8, which is wrong; but calibrated confidence 0.05 < τ = 0.40, so the system abstains and returns the candidate pool rather than a confident mistake.

Workflow Impact: From manual search to verification

Before: manual coordination

  • Monthly site walk to inspect
  • Hand-written report; find element by sight
  • Manually type the IFC GUID
  • 1–3 day lag; knowledge lost on handover

With AEC Interpreter

  • Photo + field note in
  • ~1 s → ranked candidate pool
  • Worker taps the correct element
  • BCF package linked to the BIM cloud (CORENET X ready)

Triage effort: search becomes verification

Triage effort proxy: median candidate inspections drop from 38 to 0.5 with the spatial address.
The spatial address changes the coordinator's task from scanning a long candidate list to verifying a near-front-ranked element.
Arm | Median inspections | ≈ time / element | Success@1
Manual scan, no ranking | 38.0 | ~570 s | 1.8%
Coarse storey + class | 23.0 | ~345 s | 4.9%
+ spatial-address ceiling | 0.5 | ~8 s | 78.5%

A ranking-derived effort proxy (15 s/inspection assumption), not a user study.
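The time column follows directly from the stated assumption of 15 seconds per inspection, as this short check shows. The function and dictionary names are illustrative; the inspection counts are the ones in the table.

```python
SECONDS_PER_INSPECTION = 15  # assumption stated alongside the table


def effort_seconds(median_inspections):
    """Convert a median inspection count into an approximate time per element."""
    return median_inspections * SECONDS_PER_INSPECTION


arms = {
    "Manual scan, no ranking": 38.0,
    "Coarse storey + class": 23.0,
    "+ spatial-address ceiling": 0.5,
}

times = {arm: effort_seconds(n) for arm, n in arms.items()}
```

The last arm works out to 7.5 s, which the table rounds to about 8 s.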

Human-in-the-loop triage and downstream BCF / CORENET compliance
Human-in-the-loop triage: the coordinator's role shifts from searching a database to verifying model-linked evidence. Accept/reject becomes a fine-tuning signal.

Demo: One case, end-to-end

A browser demo takes an uploaded construction photo plus a natural-language description and highlights the matched IFC elements in a 3D viewer, with no AEC software required. The interactive version lets you pick a representative held-out case and inspect both the knowledge-graph spatial address that drives retrieval and the grounded element highlighted in 3D.

Full thesis walkthrough: on-site evidence grounded to digital BIM truth, end-to-end.
▶ Launch the interactive 3D demo: serve site/ over HTTP with cd site && python3 -m http.server, then open http://localhost:8000/demo.html