AEC Interpreter — Cross-Modal Grounding & Ontology-Driven Retrieval

The Problem

The physical–digital traceability gap

AEC projects standardise on OpenBIM/IFC, yet day-to-day site coordination still runs on manual interpretive labour: people translate unstructured evidence — photos, chat messages, floorplan patches — into model-linked, schema-compliant records. The front-line worker sees a local condition but not the BIM GUID; the coordinator sees the model but not the site context.

The final step is choosing one element from a retrieved shortlist that already contains the answer (median 76 candidates, ground truth present 100% of the time). The difficulty is therefore not retrieval but discrimination among visually identical siblings: what separates two windows is not in their pixels but in the building's relational structure. A black-box matcher plateaus at 6.7% right-first accuracy (Top-1: the correct element ranked first) even with the answer already in the pool. This work makes that relational signal explicit as a spatial address that can be read from evidence and executed against BIM.

Multi-stakeholder AEC workflow: coordination spans the physical site and the digital BIM cloud across many stakeholders, and every manual handoff leaks context.

Abstract

Overview

Coordinating AEC projects means constantly translating messy site evidence into the structured BIM record — error-prone, manual work. This thesis asks: how can AI act as an interpreter middleware to reliably align unstructured site evidence with digital project data? The answer is a hierarchical neuro-symbolic architecture: a fine-tuned vision-language model extracts typed spatial constraints, and a deterministic symbolic layer runs a priority-ordered traversal over an enriched IFC knowledge graph, with a Graph-RAG reranker producing the final ranked candidates. The load-bearing idea is a type-conditional spatial address — a relational key computable from the BIM model with no labels and recoverable from a flat image. Supplied perfectly it lifts pool right-first from 4.9% to 78.5%; one realized deterministic detector lifts the addressable subset to 58.9% end-to-end, and a calibrated answer/defer gate raises the answered subset to 73.4% — the system knows when to abstain.

Contributions

What this work delivers

Symbolic

Hallucination-resistant retrieval

The neural layer never writes a query. A typed Constraints JSON fills deterministic Cypher templates over an IFC-native knowledge graph, so the system can only return GUIDs that exist in the model — zero hallucinated IDs, fully repeatable.
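As a minimal sketch of the idea above, the snippet below fills a fixed Cypher template from a typed constraints dict. The query text never changes; only the parameters do. The template itself, the node labels `Element` and `Storey`, the property names, and the helper `build_query` are hypothetical and are not taken from the project's code; only the `ON_STOREY` edge name comes from this page.

```python
from typing import Any, Dict, Tuple

# Hypothetical template table: the query text is fixed, only parameters vary.
TEMPLATES = {
    "storey_class": (
        "MATCH (e:Element)-[:ON_STOREY]->(s:Storey {name: $storey}) "
        "WHERE e.ifc_class = $ifc_class "
        "RETURN e.guid AS guid ORDER BY e.guid"
    ),
}

# Assumed closed vocabulary; anything else is rejected before querying.
ALLOWED_CLASSES = {"IfcWindow", "IfcDoor", "IfcWall"}


def build_query(constraints: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Map a typed constraints dict to (query text, parameters)."""
    ifc_class = constraints["ifc_class"]
    if ifc_class not in ALLOWED_CLASSES:
        raise ValueError(f"unsupported ifc_class: {ifc_class!r}")
    params = {"storey": str(constraints["storey"]), "ifc_class": ifc_class}
    return TEMPLATES["storey_class"], params


query, params = build_query({"storey": "L3", "ifc_class": "IfcWindow"})
print(query)
print(params)
```

Because extracted values travel only as query parameters, a malformed extraction can at worst return an empty result; it cannot change the query or name a GUID that is absent from the graph.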

Neural · ML Engineering

Multimodal interpreter

A fine-tuned Qwen2.5-VL (LoRA) reads photo + note + plan into a typed spatial contract, trained with zero real labels — 990 cases generated deterministically from raw IFC (structure mining → Blender hard negatives → LLM-as-judge).

Architecture

Decomposable failures

Decoupling probabilistic extraction from deterministic retrieval removes black-box failure: every error is traceable to a specific stage — type, floor, position, confidence, or ranking.

Architecture

The neuro-symbolic interpreter

Five stages: multimodal input → neural extraction (VLM + LoRA) → Constraints JSON → symbolic retrieval (priority Cypher cascade over a Neo4j IFC graph) → Graph-RAG reranking → ranked GUIDs. The neural side translates evidence into a typed spatial address; the symbolic side names only GUIDs that exist in the model. Design rationale: reliability over flexibility (deterministic templates instead of Text-to-Cypher), and an IFC-native graph built from the schema itself rather than noisy document-level extraction.

**Module decomposition as graph reasoning** (real case AP_SK_092). The layout is the *real* knowledge graph: a force-directed layout on the actual `FILLS` (window→host-wall hub) and `NEXT_TO` (consecutive openings) edges, held fixed across all four panels. The recall-safe pool of 76 candidates narrows to 46 same-storey+class windows in 6 host-wall clusters. The target's wall is a 10-opening chain, and the position-slot (8 of 10) re-weights the siblings without removing any; the target lands at rank 1.

Key Modules

How it works under the hood

IFC parse engine converting raw IFC into an enriched knowledge graph

IFC Parse & Enrich Engine

Converts raw IFC into a query-ready Neo4j graph, enriching topology beyond native containment/fill with retrieval-oriented edges: NEXT_TO, CONNECTS_TO, ADJACENT_TO, ON_STOREY.

IfcOpenShell · Neo4j
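One way such an enrichment edge could be derived is sketched below: openings that fill the same host wall are sorted by their offset along the wall, consecutive pairs become `NEXT_TO` edges, and each opening's position in that order gives an `(i of M)` slot. The input tuples and the function `next_to_edges` are illustrative assumptions; the real engine works on IFC geometry via IfcOpenShell, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict


def next_to_edges(openings):
    """openings: iterable of (guid, host_wall_guid, offset_along_wall).

    Returns (edges, slots): NEXT_TO pairs between consecutive openings on the
    same wall, and a 1-based (i, M) slot for every opening.
    """
    by_wall = defaultdict(list)
    for guid, wall, offset in openings:
        by_wall[wall].append((offset, guid))

    edges, slots = [], {}
    for items in by_wall.values():
        items.sort()  # order openings along the wall (ties broken by GUID)
        total = len(items)
        for i, (_, guid) in enumerate(items, start=1):
            slots[guid] = (i, total)
        # Link each opening to its immediate successor on the same wall.
        edges.extend((a[1], b[1]) for a, b in zip(items, items[1:]))
    return edges, slots


edges, slots = next_to_edges([
    ("w3", "wall_A", 5.0),
    ("w1", "wall_A", 1.0),
    ("w2", "wall_A", 3.0),
    ("d1", "wall_B", 0.5),
])
print(edges)
print(slots)
```

Both outputs depend only on the model, which is what allows the address to be computed without labels.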

Symbolic retrieval planner combining attribute filter and topology rerank

Symbolic Planner + Graph-RAG

Nine priority strategies; the recall-preserving p0∪p1 union planner combines attribute filtering with spatial-topology ranking. Templates are deterministic — same input, same query, same result.

Cypher · priority cascade
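The union step can be pictured with a small sketch. It assumes, hypothetically, that each priority strategy returns a list of GUIDs and that p0 and p1 are the first two strategies; the function `union_pool` is illustrative only. Candidates are merged in priority order, so a stricter strategy contributes its hits first, while a looser one can add candidates but never remove them.

```python
def union_pool(*strategy_results):
    """Merge per-strategy GUID lists in priority order, dropping duplicates."""
    seen, pool = set(), []
    for guids in strategy_results:
        for guid in guids:
            if guid not in seen:
                seen.add(guid)
                pool.append(guid)
    return pool


p0 = ["guid_a", "guid_b"]            # hypothetical strict attribute match
p1 = ["guid_b", "guid_c", "guid_d"]  # hypothetical looser fallback
print(union_pool(p0, p1))
```

The merge is deterministic: the same strategy outputs always produce the same pool in the same order.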

Synthetic Data Pipeline

Skeleton-first generation fixes ground truth before adding stochastic visual evidence: IFC mining → Blender renders with hard negatives → Gemini text → LLM-as-judge filter → 990 cases, no real labels.

Blender · Gemini · LLM-as-judge

Deterministic visual heuristics: OpenCV counting and ResNet size classification

Deterministic Visual Specialists

OpenCV handles counting and ordinal slot position; ResNet classifies element size bands — the sub-tasks VLMs do unreliably. These inject discriminative cues into the reranker.

OpenCV · ResNet

Fine-tuned VLM measured per-field, with a calibratable detector confidence

Fine-tuned VLM Extractor

Qwen2.5-VL + LoRA (G-series) emits a typed contract {storey, ifc_class, spatial_relations[]}. Fine-tuning saturates the coarse fields (storey/class → 100%) and learns relation typing (direction 0% → 82%), but the discriminating fields (slot and size) stay at 0%, which is the evidence for delegating them to specialists.

Qwen2.5-VL · LoRA · Constraints JSON
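A typed contract is only useful if malformed output is rejected before it reaches the symbolic layer. The sketch below parses a model response into frozen dataclasses and fails loudly on anything that does not fit. The three top-level keys come from the contract above; the inner shape of a spatial relation (`type`, `direction`) and the names `Constraints`, `SpatialRelation`, and `parse_constraints` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpatialRelation:
    type: str                        # e.g. "NEXT_TO" (assumed vocabulary)
    direction: Optional[str] = None  # e.g. "left" (assumed field)


@dataclass(frozen=True)
class Constraints:
    storey: str
    ifc_class: str
    spatial_relations: Tuple[SpatialRelation, ...] = ()


def parse_constraints(raw: str) -> Constraints:
    """Parse and validate a JSON contract; raise ValueError if it is malformed."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("contract must be a JSON object")
    for key in ("storey", "ifc_class"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    relations = []
    for item in data.get("spatial_relations", []):
        if not isinstance(item, dict) or not isinstance(item.get("type"), str):
            raise ValueError("each spatial relation needs a string 'type'")
        relations.append(SpatialRelation(item["type"], item.get("direction")))
    return Constraints(data["storey"], data["ifc_class"], tuple(relations))


contract = parse_constraints(
    '{"storey": "L3", "ifc_class": "IfcWindow", '
    '"spatial_relations": [{"type": "NEXT_TO", "direction": "left"}]}'
)
print(contract)
```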

Calibrated confidence routing and selective-prediction curve

Calibrated Routing & Control

Every field carries a {value, confidence, source} record. An orchestration layer routes on calibrated confidence (AUROC 0.80), applying constraints softly and deferring to a human instead of guessing. The orchestration is studied as an ablation (static → learned router → LLM agent) on accuracy, latency, and repeatability.

temperature scaling · answer/defer gate
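Temperature scaling, named in the tag line above, can be sketched in a few lines. A single scalar T divides the detector's raw logit before the sigmoid, and T is chosen to minimise negative log-likelihood on held-out data. The sketch assumes the detector exposes a binary "slot is correct" logit and uses a plain grid search; both are assumptions, and the project's actual fitting procedure is not described on this page.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def nll(logits, labels, temperature: float) -> float:
    """Mean binary negative log-likelihood after dividing logits by T."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None) -> float:
    """Pick the T on a grid (default 0.25 to 10.0) that minimises held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy held-out set: very confident logits that are right only 75% of the time.
logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
T = fit_temperature(logits, labels)
print(T, sigmoid(4.0 / T))
```

On this toy data the fitted T is greater than 1, which softens the overconfident scores. Dividing every logit by the same positive T does not change how candidates rank against each other, so the detector's discrimination is unaffected and only the confidence values move.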

Engineering decisions, measured against alternatives

Each contract has an experiment behind it: what to let the VLM do, what to push into deterministic geometry, when to trust an extracted address, and how much graph context is recoverable from a flat image.

Engineering question	Decision	Comparison / evidence
Should the VLM directly name the BIM element?	No. The VLM emits typed JSON; deterministic queries name only existing IFC GUIDs.	Fine-tuned VLM reaches 100% GT-in-pool but only 6.7% Top-1 — sibling selection needs explicit spatial structure.
Can generic retrieval solve the task?	No. Retrieval is ontology-constrained, then re-ranked by a spatial address over the IFC graph.	Lexical/dense baselines stay at 1.7% Top-1; the address ceiling reaches 78.5% Top-1 / 98.1% Top-10 (n=60).
Should noisy address fields hard-filter candidates?	No. The address is a soft prior with calibrated answer/defer routing, preserving recall.	Realized end-to-end reaches 58.9% Top-1 (67.6% with oracle coarse fields); selective prediction lifts the answered subset to 73.4%.
How deep should graph context go?	Compile one-hop context into the element record; do not chase deep relation chains from an image.	Realizable discrimination saturates at one hop: the confusable set shrinks from 13 to about 8, with little further realized gain.
Learned perception vs deterministic specialists?	VLM for coarse semantics; OpenCV/ResNet specialists for count, ordinal slot, and size.	The VLM reaches 100% on storey/class but 0% on slot/size; the realized slot specialist reaches 58.9% Top-1 end-to-end (n=35 fillers).

Innovation

Reading space from a flat image

The thesis showed the architecture is sound: with perfect perception the symbolic engine keeps the correct element 100% of the time and compresses the shortlist toward one, so the bottleneck is perception, not graph logic. The core advance deepens the deterministic specialists into a principled spatial-interpretation layer: instead of ad-hoc cues, the system recovers a spatial address, a relational key that is computable from the BIM model and recoverable from plain 2D images. Three questions, answered with measurements:

① Representation

What is the minimal address?

It is type-conditional: a coarse prefix (storey + class) is necessary but saturated; the discriminator is class-specific — a position-slot (i of M) for an opening, a connectivity fingerprint for a wall.

ceiling: right-first 4.9% → 78.5% (n=60)

② Mechanism

How to use a noisy address?

Hard filtering deletes the answer, so the address is a soft prior in a recall-fixed pool. One real detector closes most of the gap; its confidence passes a calibration gate, and its payoff is knowing when to abstain.

realized slot: 6.6% → 58.9% end-to-end · defer → 73.4% (n=35)
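The contrast between a hard filter and a soft prior can be made concrete with a toy sketch. Each candidate keeps its base score, multiplied by a weight that blends "ignore the slot" and "trust the slot" according to the detector's confidence. The Gaussian kernel over slot distance, the blending rule, and the names `soft_rerank` and `hard_filter` are assumptions chosen for illustration; they are not the project's scoring function.

```python
import math


def hard_filter(pool, predicted_slot):
    """Keep only candidates whose slot equals the prediction."""
    return [guid for guid, slot, _ in pool if slot == predicted_slot]


def soft_rerank(pool, predicted_slot, confidence, sigma=1.5):
    """pool: list of (guid, slot_index, base_score). Nothing is removed."""
    scored = []
    for guid, slot, base in pool:
        prior = math.exp(-((slot - predicted_slot) ** 2) / (2 * sigma ** 2))
        # Confidence 0 ignores the prior entirely; confidence 1 applies it fully.
        weight = (1.0 - confidence) + confidence * prior
        scored.append((base * weight, guid))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [guid for _, guid in scored]


pool = [("win_a", 1, 1.0), ("win_b", 2, 1.0), ("win_target", 8, 1.0)]

# A wrong, low-confidence prediction (slot 1 when the truth is slot 8):
print(hard_filter(pool, predicted_slot=1))                   # target is gone
print(soft_rerank(pool, predicted_slot=1, confidence=0.05))  # target survives
```

With the hard filter a single wrong slot prediction removes the target from the pool for good. With the soft prior the pool size is unchanged, so a wrong prediction costs rank but not recall.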

③ Architecture

How deep should context go?

A depth law: deeper relations are more unique but their recovery from an image collapses with distance, so discrimination saturates at one hop — compile the relation into the element rather than chase deep chains.

confusable set 13 → 8 at one hop

**What a type-conditional spatial address is**, by class. A coarse ontological prefix (storey + IFC class) is shared but *non-discriminating* (oracle Top-1 4.9% alone). The discriminating body is class-specific: a **position-slot** `(i of M)` for an opening (image-recoverable, realized 58.9% end-to-end) and a **connectivity fingerprint** for a wall (oracle-discriminative but not image-recoverable). Each reorders the same retrieved pool to the per-class oracle Top-1 shown.

**Spatial interpretation, traced through one real case** (AP_SK_107, real artefacts). (a) evidence → (b) per-field extraction, each a `{value, confidence, source}` record (VLM → storey/class; OpenCV → the position-slot; ResNet → size) → (c) the depth-1 spatial-address record → (d) calibrated routing → (e) the knowledge-graph shortlist collapses to a GUID. The highlighted lane is the confidence-routing path.

Results

The address makes the architecture realizable

Headline: under perfect extraction the symbolic layer retains the correct element in 100% of cases and compresses the candidate pool from a median of 46 to 1. The architecture is therefore sound under oracle inputs; the remaining operational ceiling is bound by neural extraction quality, not by graph logic.

Oracle ceiling and fingerprint ladder: enriched topology + the address drive the retrievable pool toward 1, isolating extraction as the bottleneck.

Oracle fingerprint ladder

Median live pool · AP held-out benchmark (n=60)

Fingerprint level	Median pool
L0 — no filter	1233
L1 — storey + class	46
L3 — + direction + subtype	9
L4 — + exact position slot*	1

*L4 extractable for the addressable filler subset. L3 fingerprints provide the dominant compression.

Downstream retrieval — the black box plateaus; the address does not

AP held-out benchmark (n=60), p0∪p1 planner.

Model / diagnostic	GT-in-pool	Top-1	Top-10	MRR@10
Zero-shot Gemini	95.0%	1.7%	18.3%	0.056
Fine-tuned VLM (best end-to-end)	100%	6.7%	30.0%	0.110
+ realized position-slot specialist (end-to-end)	100%	58.9%	67.1%	—
+ realized slot, oracle coarse (upper bound)	100%	67.6%	80.9%	—
+ type-conditional spatial-address ceiling	100%	78.5%	98.1%	0.854

The ceiling row is a diagnostic upper bound, not a deployed end-to-end system. The realized row is the actual spatial-address path currently built — a deterministic position-slot specialist for the 35 window/door filler cases; realized wall/other extractors are future work.
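For readers unfamiliar with the column headers, the sketch below computes GT-in-pool, Top-1, Top-k, and MRR@k from the per-case rank of the ground-truth element, using the standard textbook definitions. It is not the project's evaluation script, and the page does not say how ties between equally scored siblings are handled, so this sketch is not expected to reproduce the table's figures.

```python
def retrieval_metrics(ranks, k=10):
    """ranks: 1-based rank of the ground truth per case, or None if it is
    absent from the retrieved pool."""
    n = len(ranks)
    in_pool = sum(r is not None for r in ranks)
    top1 = sum(r == 1 for r in ranks)
    topk = sum(r is not None and r <= k for r in ranks)
    # Reciprocal rank counts as 0 when the truth is missing or ranked below k.
    mrr = sum(1.0 / r for r in ranks if r is not None and r <= k)
    return {
        "gt_in_pool": in_pool / n,
        "top1": top1 / n,
        f"top{k}": topk / n,
        f"mrr@{k}": mrr / n,
    }


# Four illustrative cases: ranked 1st, ranked 2nd, ranked 11th, and missing.
print(retrieval_metrics([1, 2, 11, None]))
```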

The calibration is real — and pays off as deferral

**The calibration gate passes.** The detector's confidence tracks correctness (AUROC 0.80) and is only moderately mis-calibrated (ECE 0.206), so routing on it is legitimate, not assumed.
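Expected calibration error (ECE) measures the gap between stated confidence and observed accuracy, averaged over confidence bins and weighted by bin size. The sketch below uses the common equal-width binning with ten bins; the bin count and binning scheme behind the reported 0.206 are not stated on this page, so treat those choices as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1] and 0/1 correctness."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at confidence 0.9, of which three are correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the toy call the detector claims 0.9 but is right 75% of the time, so the single occupied bin contributes a gap of about 0.15.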

Realizable confusable-set size by relational depth. **The depth law, measured.** The model reads deep relation types reliably, yet realized discrimination saturates at one hop because those types are homogeneous: the limit is informational, not extraction reliability.

Worked cases — answer vs defer

Two real held-out cases through the same gate: when the detector is correct and confident it commits; when it is wrong but unsure it abstains and returns candidates — the behaviour that makes a triage tool trustworthy.

**ANSWER (AP_SK_102).** Predicted slot 2 of 17 is correct; calibrated confidence 0.57 ≥ τ 0.40, so the system commits the GUID.

**DEFER (AP_SK_092).** Predicted slot 1 where the truth is slot 8, so the prediction is wrong; but calibrated confidence 0.05 < τ 0.40, so the system abstains and returns the candidate pool rather than a confident mistake.
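The two worked cases reduce to a single threshold comparison. The sketch below replays them with the reported confidences and τ = 0.40; the `Decision` type, the `gate` function, and the placeholder GUID strings are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Decision:
    action: str       # "answer" or "defer"
    guids: List[str]  # one GUID when answering, the whole pool when deferring


def gate(ranked_guids, calibrated_confidence, tau=0.40):
    """Commit the top-ranked GUID only when confidence clears the threshold."""
    if calibrated_confidence >= tau:
        return Decision("answer", list(ranked_guids[:1]))
    return Decision("defer", list(ranked_guids))


pool = ["guid_1", "guid_2", "guid_3"]  # placeholder ranked pool
print(gate(pool, 0.57))  # AP_SK_102: confidence clears tau, commit
print(gate(pool, 0.05))  # AP_SK_092: confidence below tau, hand back the pool
```

Deferring returns the ranked pool rather than nothing, so a human reviewer starts from the system's best ordering instead of from scratch.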

Workflow Impact

From manual search to verification

Before — manual coordination

Monthly site walk to inspect
Hand-written report; find element by sight
Manually type the IFC GUID
1–3 day lag; knowledge lost on handover

With AEC Interpreter

Photo + field note in
~1s → ranked candidate pool
Worker taps the correct element
BCF package linked to the BIM cloud (CORENET X ready)

Triage effort — search becomes verification

Triage effort proxy: median candidate inspections drop from 38 to 0.5 with the spatial address. The coordinator's task changes from scanning a long candidate list to verifying a near-front-ranked element.

Arm	Median inspections	≈ time / element	Success@1
Manual scan, no ranking	38.0	~570 s	1.8%
Coarse storey + class	23.0	~345 s	4.9%
+ spatial-address ceiling	0.5	~8 s	78.5%

A ranking-derived effort proxy (15 s/inspection assumption), not a user study.
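The time column follows directly from the stated assumption of 15 seconds per inspection. The short check below reproduces it from the median inspection counts in the table; the constant and function names are illustrative.

```python
SECONDS_PER_INSPECTION = 15  # the page's stated assumption, not a measurement

# Median inspections per arm, copied from the table above.
ARMS = {
    "Manual scan, no ranking": 38.0,
    "Coarse storey + class": 23.0,
    "+ spatial-address ceiling": 0.5,
}


def seconds_per_element(median_inspections: float) -> float:
    """Convert a median inspection count into the effort proxy in seconds."""
    return median_inspections * SECONDS_PER_INSPECTION


for arm, inspections in ARMS.items():
    print(f"{arm}: {seconds_per_element(inspections):.1f} s")
```

The first two arms give 570 s and 345 s exactly; the last gives 7.5 s, which the table reports as roughly 8 s.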

Human-in-the-loop triage and downstream BCF / CORENET compliance: the coordinator's role shifts from searching a database to verifying model-linked evidence. Accept/reject becomes a fine-tuning signal.

Demo

One case, end-to-end

A browser demo takes an uploaded construction photo plus a natural-language description and highlights the matched IFC elements in a 3D viewer — no AEC software required. The interactive version lets you pick a representative held-out case and inspect both the knowledge-graph spatial address that drives retrieval and the grounded element highlighted in 3D.

Full thesis walkthrough — on-site evidence grounded to digital BIM truth, end-to-end.

Launch the interactive 3D demo: serve site/ over HTTP with `cd site && python3 -m http.server`, then open http://localhost:8000/demo.html