AEC Interpreter — Cross-Modal Grounding & Ontology-Driven Retrieval

ProblemThe physical–digital traceability gap

AEC projects standardise on OpenBIM/IFC, yet day-to-day site coordination still runs on manual interpretive labour. Subcontractors log unstructured on-site evidence — photos, chat messages, annotated floorplan patches — into forms; construction managers re-key it into the system. The front-line worker sees a local condition but not the BIM GUID; the coordinator sees the model but not the site context; the developer in the office receives it late and degraded, and plans against it anyway.

Bridging that gap by hand leaks in two ways: semantic drift as detail is re-described and re-typed, and context decay as notes go stale and people roll off the project. It is worst in high-rise work, where dozens of windows and walls are visually identical — a photo of "a window" could be any of them, so even a faithful description never pins down which one.

In a BIM-native workflow every issue should resolve to a GUID linked back to the model, so downstream teams — issue tracking, compliance, planning — inherit it cleanly. Ungrounded records break that chain: not query-able, not traceable, and the decisions built on them arrive late and drift from site reality.

Before any system can reason about what changed or what comes next, it must answer the question that unlocks the rest — where, exactly, is this? That is spatial grounding: a hallucination-resistant link from site evidence to the exact model element. This work introduces an interpreter middleware to close that gap — streamlining data across many stakeholders and formats by making on-site evidence traceable.

Multi-stakeholder AEC workflow showing information loss between physical site and digital BIM cloud — Coordination spans physical site and the digital BIM cloud across many stakeholders — every manual handoff leaks context.

ApproachNeuro-symbolic Architecture Interpreter

Coordinating AEC projects means constantly translating messy site evidence into the structured BIM record — error-prone, manual work. This paper asks: how can AI act as an interpreter middleware to reliably align unstructured site evidence with digital project data? The answer is a hierarchical neuro-symbolic architecture with two complementary layers. The neural layer is a fine-tuned, domain-specific vision-language model that reads a photo, field note, and plan patch and extracts the typed spatial relationships between elements — the flexible, probabilistic front end. The symbolic layer is a knowledge graph mined from the enriched IFC model: those typed constraints drive a deterministic, priority-ordered traversal that can only return GUIDs that exist in the graph — the hallucination-resistant back end. A deterministic, calibrated soft-rerank produces the final ranked candidates (an optional Graph-RAG/LLM reorder helps only on coarse pools). The load-bearing idea is a type-conditional spatial address — a relational key computable from the BIM model with no labels and recoverable from a flat image. Supplied perfectly it lifts pool right-first from 4.9% to 78.5%; one realized deterministic detector lifts the addressable subset to 58.9% end-to-end, and a calibrated answer/defer gate raises the answered subset to 73.4% — the system knows when to abstain.

Key result: the representation, not a bigger model, separates identical elements

Ranking on identical recall-safe pools. Generic text retrievers (BM25, MiniLM), a zero-shot VLM reranker, and zero-shot Gemini all stay near chance (1.7 to 3.3 percent Top-1) on visually identical siblings. The fine-tuned VLM leads them but plateaus at 6.7 percent. A realized position-slot detector reaches 58.9 percent Top-1 on the addressable window/door subset (n=35), and the oracle spatial address marks the 78.5 percent ceiling. — **Ranking on identical recall-safe pools.** Generic text retrievers (BM25, MiniLM), a zero-shot VLM reranker, and zero-shot Gemini all stay near chance on visually identical siblings; the fine-tuned VLM leads every learned baseline but plateaus at **6.7% Top-1** even with the target in the pool. The discriminator is the **type-conditional spatial address**: a realized position-slot detector reaches **58.9% Top-1** end-to-end on the addressable window/door subset (n=35), and the oracle address marks the **78.5%** information ceiling. The gain comes from the representation, not a larger model.

ContributionsWhat this work delivers

Neural · ML Engineering

Multimodal interpreter

A fine-tuned Qwen2.5-VL (LoRA) reads photo + note + plan into a typed spatial contract, trained with zero real labels — generated deterministically from raw IFC (structure mining → Blender hard-negative renders → Gemini scenario text → LLM-as-judge filter), then augmented to 990 training cases.

Symbolic

Hallucination-resistant retrieval

The neural layer never writes a query. A typed Constraints JSON fills deterministic Cypher templates over an IFC-native knowledge graph, so the system can only return GUIDs that exist in the model — zero hallucinated IDs, fully repeatable.

Architecture

Decomposable failures

Decoupling probabilistic extraction from deterministic retrieval removes black-box failure: every error is traceable to a specific stage — type, floor, position, confidence, or ranking.

ArchitectureThe neuro-symbolic interpreter

Architecture in five stages: multimodal input → neural extraction (VLM + LoRA) → Constraints JSON → symbolic retrieval (priority Cypher cascade over a Neo4j IFC graph) → calibrated soft-rerank (optional Graph-RAG/LLM reorder) → ranked GUIDs.

The neural side translates evidence into a typed spatial address; the symbolic side names only GUIDs that exist in the model. Design rationale: reliability over flexibility (deterministic templates instead of Text-to-Cypher), and an IFC-native graph built from the schema itself rather than noisy document-level extraction.

Multimodal input — a site photo, a short field note, and a marked floorplan crop.
Neural perception (VLM) — a fine-tuned Qwen2.5-VL + LoRA emits a typed Constraints JSON {storey, ifc_class, spatial_relations[{predicate, object_type, direction}]}, each field a {value, confidence, source} record. It describes; it never names a GUID.
Symbolic retrieval — the constraints compile into a deterministic Cypher cascade over the enriched Neo4j IFC graph; a recall-safe p0∪p1 union returns a candidate pool with the target retained ~100% of the time. The relational predicates build the pool, not the within-pool ranking.
Deterministic slot discriminator + soft rerank — an OpenCV specialist reads the opening position-slot (i of M) from the plan; candidates are reordered by a soft score (storey + class + confidence-weighted slot match) that never removes a candidate.
Calibrated answer / defer — a temperature-scaled confidence commits the top GUID when it clears the threshold, and otherwise defers the candidate pool to a human reviewer.

A fuller engineered stack — a ResNet size-band filter and a Gemini Graph-RAG LLM reranker — was evaluated and not adopted: on the topology-filtered pool the LLM reorder is redundant and lowers Top-1 (6.7% → 1.7%) while breaking the 100% recall guarantee. The deployed ranking is deterministic end-to-end apart from the VLM perception front end.

Module decomposition as graph reasoning on a real case: pool narrows from 76 to 46 same-class siblings, the position-slot resolves to rank 1 — **Module decomposition as graph reasoning** (sample case: AP_SK_092). The layout is the *real* knowledge graph — a force-directed layout on the actual `FILLS` (window→host-wall hub) and `NEXT_TO` (consecutive openings) edges, held fixed across all four panels. The recall-safe pool of 76 candidates narrows to 46 same-storey+class windows in 6 host-wall clusters; the target's wall is a 10-opening chain and the position-slot (8 of 10) re-weights the siblings (none removed); the target lands at rank 1.

Key ModulesHow it works under the hood

IFC parse engine converting raw IFC into an enriched knowledge graph

IFC Parse & Enrich Engine

Converts raw IFC into a query-ready Neo4j graph, enriching topology beyond native containment/fill with retrieval-oriented edges: NEXT_TO, CONNECTS_TO, ADJACENT_TO, ON_STOREY.

IfcOpenShell · Neo4j

Symbolic retrieval planner combining attribute filter and topology rerank

Symbolic Planner + Graph-RAG

Nine priority strategies; the recall-preserving p0∪p1 union planner combines attribute filtering with spatial-topology ranking. Templates are deterministic — same input, same query, same result. The final rerank is a deterministic calibrated soft-rerank; the Graph-RAG (LLM) reorder is optional and helps only on coarse pools.

Cypher · priority cascade

Synthetic Data Pipeline

Skeleton-first generation fixes ground truth before adding stochastic visual evidence: IFC mining → Blender hard-negative renders → Gemini scenario text → LLM-as-judge filter, then augmentation → 990 training cases, no real labels.

Blender · Gemini · LLM-as-judge

Deterministic visual heuristics: OpenCV counting and ResNet size classification

Deterministic Visual Specialists

OpenCV handles counting and ordinal slot position; ResNet classifies element size bands — the sub-tasks VLMs do unreliably. These inject discriminative cues into the reranker.

OpenCV · ResNet

Fine-tuned VLM measured per-field, with a calibratable detector confidence

Fine-tuned VLM Extractor

Qwen2.5-VL as backbone with LoRA finetuned with domain specific data emits a typed contract {storey, ifc_class, spatial_relations[]}, saturates coarse fields (storey/class → 100%) and learns relation typing (e.g. direction 0% → 82%), but discriminating slot/size stay at 0% — the evidence for delegating them to specialists.

Qwen2.5-VL · LoRA · Constraints JSON

Calibrated confidence routing and selective-prediction curve

Calibrated Routing & Control

The address fields carry a {value, confidence, source} record (position + size live today; generalizing to every field is in progress). The gate routes on this confidence — which discriminates correct from wrong (AUROC 0.80) and is temperature-calibrated (ECE 0.21) — applying constraints softly and deferring to a human instead of guessing. Whether a learned router or LLM-agent beats this static threshold at equal budget is an open ablation (accuracy / latency / repeatability).

temperature scaling · answer/defer gate

Engineering decisions, measured against alternatives

Each contract has an experiment behind it: what to let the VLM do, what to push into deterministic geometry, when to trust an extracted address, and how much graph context is recoverable from a flat image.

Engineering question	Decision	Comparison / evidence
Should the VLM directly name the BIM element?	No. The VLM emits typed JSON; deterministic queries name only existing IFC GUIDs.	Fine-tuned VLM reaches 100% GT-in-pool but only 6.7% Top-1 — sibling selection needs explicit spatial structure.
Can generic retrieval solve the task?	No. Retrieval is ontology-constrained, then re-ranked by a spatial address over the IFC graph.	Lexical/dense baselines stay at 1.7% Top-1; the address ceiling reaches 78.5% Top-1 / 98.1% Top-10 (n=60).
Should noisy address fields hard-filter candidates?	No. The address is a soft prior with calibrated answer/defer routing, preserving recall.	Realized end-to-end reaches 58.9% Top-1 (67.6% with oracle coarse fields); selective prediction lifts the answered subset to 73.4%.
How deep should graph context go?	Compile one-hop context into the element record; do not chase deep relation chains from an image.	Realizable discrimination saturates at one hop: the confusable set shrinks from 13 to about 8, with little further realized gain.
Learned perception vs deterministic specialists?	VLM for coarse semantics; OpenCV/ResNet specialists for count, ordinal slot, and size.	The VLM reaches 100% on storey/class but 0% on slot/size; the realized slot specialist reaches 58.9% Top-1 end-to-end (n=35 fillers).

InnovationReading space from a flat image

The paper proved the architecture sound — with perfect perception the symbolic engine keeps the correct element 100% of the time and compresses the shortlist toward one, so the bottleneck is perception, not graph logic. The core advance deepens the deterministic specialists into a principled spatial-interpretation layer: instead of ad-hoc cues, the system recovers a spatial address — a relational key read from plain 2D images, computable from the BIM model and recoverable from the evidence. Three questions, answered with measurements:

① Representation

What is the minimal address?

It is type-conditional: a coarse prefix (storey + class) is necessary but saturated; the discriminator is class-specific — a position-slot (i of M) for an opening, a connectivity fingerprint for a wall.

ceiling: right-first 4.9% → 78.5% (n=60)

② Mechanism

How to use a noisy address?

Hard filtering deletes the answer, so the address is a soft prior in a recall-fixed pool. One real detector closes most of the gap; its confidence passes a calibration gate, and its payoff is knowing when to abstain.

realized slot: 6.7% → 58.9% end-to-end · defer → 73.4% (n=35)

③ Architecture

How deep should context go?

A depth law: deeper relations are more unique but their recovery from an image collapses with distance, so discrimination saturates at one hop — compile the relation into the element rather than chase deep chains.

confusable set 13 → 8 at one hop

One image, one GUID — the address narrowing the pool at each stage

Median candidate pool on the AP held-out targets (n=60). Each stage adds one slice of the spatial address; the same real case is traced below.

Stage	What it does	Median pool
VLM typed output	The fine-tuned VLM emits `{storey, ifc_class, spatial_relations[]}` — each field a `{value, confidence, source}` record, no GUID	1233
Coarse attribute-only retrieval	storey + IFC class filter the graph to same-type elements	46
Topology-aware retrieval	`FILLS` / `NEXT_TO` / `CONNECTS_TO` edges narrow to same-host-wall siblings	~9
Spatial address (KG traversal)	the type-conditional address — position-slot `(i of M)` for an opening — reorders the siblings to the target	1

A calibration gate (AUROC 0.80) then routes ANSWER (commit the GUID) vs DEFER (return the pool), and the deterministic slot soft-rerank produces the final ranked candidates — traced on one real case below.

The spatial-interpretation pipeline traced through one real case with real screenshots, stages (a)-(e) — **Spatial interpretation, traced through one real case** (AP_SK_107, real artefacts). (a) evidence → (b) per-field extraction, each a `{value, confidence, source}` record (VLM → storey/class; OpenCV → the position-slot; ResNet → size) → (c) the depth-1 spatial-address record → (d) calibrated routing → (e) the knowledge-graph shortlist collapses to a GUID. The highlighted lane is the confidence-routing path.

ResultsArchitecture Oracle Experiment

The Oracle test measures the symbolic engine under perfect extraction — no VLM hallucination noise — to isolate how far the graph-query strategy can go. With the full spatial address it drives the median pool from 1233 candidates down to 1: the exact position-slot is a decisive discriminator once supplied, which is exactly why perception, not graph logic, is the bottleneck.

Oracle experiment showing the symbolic ceiling and fingerprint ladder — Oracle ceiling: enriched topology + the address drive the retrievable pool toward 1, isolating extraction as the bottleneck.

Oracle fingerprint ladder

Median live pool · AP held-out benchmark (n=60)

Fingerprint level	Median pool
L0 — no filter	1233
L1 — storey + class	46
L3 — + direction + subtype	9
L4 — + exact position slot*	1

*L4 extractable for the addressable filler subset. L3 fingerprints provide the dominant compression.

Downstream retrieval — the black box plateaus; the address does not

AP held-out benchmark (n=60), p0∪p1 planner.

Model / diagnostic	GT-in-pool	Top-1	Top-10	MRR@10
Zero-shot Gemini	91.7%	1.7%	18.3%	0.056
Fine-tuned VLM (best end-to-end)	100%	6.7%	30.0%	0.110
+ realized position-slot specialist (end-to-end)	100%	58.9%	67.1%	—
+ realized slot, oracle coarse (upper bound)	100%	67.6%	80.9%	—
+ type-conditional spatial-address ceiling	100%	78.5%	98.1%	0.854

The ceiling row is a diagnostic upper bound, not a deployed end-to-end system. The realized row is the actual spatial-address path currently built — a deterministic position-slot specialist for the 35 window/door filler cases; realized wall/other extractors are future work. The realized 58.9% is the aggregate over those 35 fillers; on the realistic cluttered-floor subset it is 21.2% end-to-end (39.1% with oracle coarse fields) — the aggregate is flattered by 18 sparser re-rendered upper-floor plans, so we report the split.

The calibration is real — and pays off as deferral

ECE calibration gate — **The calibration gate passes.** The detector's confidence tracks correctness (AUROC 0.80) and is only moderately mis-calibrated (ECE 0.206) — so routing on it is legitimate, not assumed.

Realizable confusable-set size by relational depth — **The depth law, measured.** The model reads deep relation types reliably, yet realized discrimination saturates at one hop because those types are homogeneous — the limit is informational, not extraction reliability.

Worked cases — answer vs defer

Two real held-out cases through the same gate: when the detector is correct and confident it commits; when it is wrong but unsure it abstains and returns candidates — the behaviour that makes a triage tool trustworthy.

ANSWER case AP_SK_102: predicted slot 2 of 17 correct, calibrated confidence 0.57 above threshold, commit GUID. — **ANSWER (AP_SK_102).** Predicted slot 2 of 17 — correct; calibrated confidence 0.57 ≥ τ 0.40, so the system commits the GUID.

DEFER case AP_SK_092: predicted slot 1 but truth slot 8, calibrated confidence 0.05 below threshold, abstain. — **DEFER (AP_SK_092).** Predicted slot 1 where truth is 8 — wrong; but calibrated confidence 0.05 < τ 0.40, so the system abstains and returns the candidate pool rather than a confident mistake.

Workflow ImpactFrom manual search to verification

Before — manual coordination

Monthly site walk to inspect
Hand-written report; find element by sight
Manually type the IFC GUID
1–3 day lag; knowledge lost on handover

With AEC Interpreter

Photo + field note in
~1s → ranked candidate pool
Worker taps the correct element
BCF package linked to the BIM cloud (CORENET X ready)

Triage effort proxy — pool review becomes verification

Filter level	Median pool	Expected inspections	≈ time / element	Top-1
No retrieval (building-wide)	1233	~616	~2.6 h	—
Storey + type	46	23	~345 s	4.9%
+ spatial relationship (1-hop)	~9	~5	~68 s	—
+ exact position slot	1	~0.5	~8 s	78.5%

What is measured?

Not a user study or observed field-labour time — a ranking-derived proxy over the held-out targets: the median candidate pool a coordinator would scan, with expected inspections ≈ ½·pool at ~15 s each, before reaching the correct element.

The address ladder turns a building-wide search (1233 elements) into a near-front verification task: storey + type alone still leaves ~46 look-alikes; the spatial address collapses the pool toward one.

Median candidate pool on the held-out targets; expected inspections ≈ ½·pool under a 15 s/inspection assumption — read as scale, not measured labour. The position-slot row is the oracle spatial-address ceiling (Top-1 78.5%).

Human-in-the-loop triage and downstream BCF / CORENET compliance — Human-in-the-loop triage: the coordinator's role shifts from searching a database to verifying model-linked evidence. Accept/reject becomes a fine-tuning signal.

An Interpreter Layer for AEC: Cross-Modal Grounding & Ontology-Driven Retrieval

VideoFull Paper Walkthrough