A neuro-symbolic AI middleware that grounds site photos and field notes to specific IFC element GUIDs in a BIM model: hallucination-free and fully traceable.
M.Sc. Computational Design · Carnegie Mellon University
AEC projects standardise on OpenBIM/IFC, yet day-to-day site coordination still runs on manual interpretive labour: people translate unstructured evidence (photos, chat messages, floorplan patches) into model-linked, schema-compliant records. The front-line worker sees a local condition but not the BIM GUID; the coordinator sees the model but not the site context.
The final step is choosing one element from a retrieved shortlist that already contains the answer (median 76 candidates, ground truth present 100% of the time). The difficulty is therefore not retrieval but discriminating among visually identical siblings: what separates two windows is not in their pixels but in the building's relational structure. A black-box matcher plateaus at 6.7% right-first even with the answer already in the pool. This work makes that relational signal explicit as a spatial address that can be read from evidence and executed against BIM.
Coordinating AEC projects means constantly translating messy site evidence into the structured BIM record, which is error-prone, manual work. This thesis asks: how can AI act as an interpreter middleware to reliably align unstructured site evidence with digital project data? The answer is a hierarchical neuro-symbolic architecture: a fine-tuned vision-language model extracts typed spatial constraints, and a deterministic symbolic layer runs a priority-ordered traversal over an enriched IFC knowledge graph, with a Graph-RAG reranker producing the final ranked candidates. The load-bearing idea is a type-conditional spatial address: a relational key computable from the BIM model with no labels and recoverable from a flat image. Supplied perfectly, it lifts pool right-first from 4.9% to 78.5%; one realized deterministic detector lifts the addressable subset to 58.9% end-to-end, and a calibrated answer/defer gate raises the answered subset to 73.4%, because the system knows when to abstain.
The neural layer never writes a query. A typed Constraints JSON fills deterministic Cypher templates over an IFC-native knowledge graph, so the system can only return GUIDs that exist in the model: zero hallucinated IDs, fully repeatable.
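As a minimal sketch of the template-filling idea, the snippet below validates a typed constraints record and binds it to a fixed, parameterized Cypher string. The node labels (`Element`, `Storey`), property names, the class whitelist, and the `build_query` helper are illustrative assumptions, not the project's actual schema or code; only the `ON_STOREY` edge name comes from the description on this page.

```python
import re

# Hypothetical whitelist; a real deployment would derive it from the IFC schema.
ALLOWED_CLASSES = {"IfcWindow", "IfcDoor", "IfcWall"}

# Fixed template: model output never becomes query text, only bound parameters.
# Labels and property names here are assumed for illustration.
TEMPLATE = (
    "MATCH (e:Element {ifc_class: $ifc_class})"
    "-[:ON_STOREY]->(s:Storey {name: $storey}) "
    "RETURN e.guid AS guid ORDER BY e.guid"
)


def build_query(constraints: dict) -> tuple[str, dict]:
    """Validate a typed constraints record and return (cypher, params)."""
    ifc_class = constraints.get("ifc_class")
    storey = constraints.get("storey")
    if ifc_class not in ALLOWED_CLASSES:
        raise ValueError(f"unsupported ifc_class: {ifc_class!r}")
    if not isinstance(storey, str) or not re.fullmatch(r"[\w .-]{1,64}", storey):
        raise ValueError(f"invalid storey: {storey!r}")
    return TEMPLATE, {"ifc_class": ifc_class, "storey": storey}


query, params = build_query({"ifc_class": "IfcWindow", "storey": "Level 2"})
print(query)
print(params)
```

Because the query text is constant and the extracted values travel only as parameters, identical constraints always yield an identical query, and malformed or unexpected values are rejected before anything reaches the graph.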
A fine-tuned Qwen2.5-VL (LoRA) reads photo + note + plan into a typed spatial contract, trained with zero real labels: 990 cases generated deterministically from raw IFC (structure mining → Blender hard negatives → LLM-as-judge).
Decoupling probabilistic extraction from deterministic retrieval removes black-box failure: every error is traceable to a specific stage (type, floor, position, confidence, or ranking).
Five stages: multimodal input → neural extraction (VLM + LoRA) → Constraints JSON → symbolic retrieval (priority Cypher cascade over a Neo4j IFC graph) → Graph-RAG reranking → ranked GUIDs. The neural side translates evidence into a typed spatial address; the symbolic side names only GUIDs that exist in the model. Design rationale: reliability over flexibility (deterministic templates instead of Text-to-Cypher), and an IFC-native graph built from the schema itself rather than noisy document-level extraction.
FILLS (window→host-wall hub) and NEXT_TO (consecutive openings) edges,
held fixed across all four panels. The recall-safe pool of 76 candidates narrows to 46
same-storey+class windows in 6 host-wall clusters; the target's wall is a 10-opening chain and
the position-slot (8 of 10) re-weights the siblings (none removed); the target lands at rank 1.
Converts raw IFC into a query-ready Neo4j graph, enriching topology beyond native containment/fill with retrieval-oriented edges: NEXT_TO, CONNECTS_TO, ADJACENT_TO, ON_STOREY.
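One way to picture the NEXT_TO enrichment is sketched below: openings are grouped by host wall, ordered, and linked pairwise, which also yields each opening's position slot (i of M). The input tuple layout and the use of an offset along the wall as the ordering key are assumptions made for illustration; the page does not specify how the real pipeline orders openings.

```python
from collections import defaultdict


def next_to_edges(openings):
    """Derive NEXT_TO edges and (i, M) position slots per host wall.

    openings: iterable of (guid, host_wall_guid, offset_along_wall).
    The offset is a hypothetical ordering key, e.g. metres from the wall start.
    """
    by_wall = defaultdict(list)
    for guid, wall, offset in openings:
        by_wall[wall].append((offset, guid))

    edges, slots = [], {}
    for items in by_wall.values():
        items.sort()  # order openings along their host wall
        ordered = [guid for _, guid in items]
        for i, guid in enumerate(ordered, start=1):
            slots[guid] = (i, len(ordered))  # position slot "i of M"
        edges.extend(zip(ordered, ordered[1:]))  # consecutive openings only
    return edges, slots


toy = [("w3", "A", 6.0), ("w1", "A", 1.0), ("w2", "A", 3.5), ("d1", "B", 2.0)]
edges, slots = next_to_edges(toy)
print(edges)
print(slots)
```

The toy GUIDs are placeholders. A wall with a single opening contributes a slot of (1, 1) and no NEXT_TO edge.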
Nine priority strategies; the recall-preserving p0∪p1 union planner combines
attribute filtering with spatial-topology ranking. Templates are deterministic: same input,
same query, same result.
Skeleton-first generation fixes ground truth before adding stochastic visual evidence: IFC mining → Blender renders with hard negatives → Gemini text → LLM-as-judge filter → 990 cases, no real labels.
OpenCV handles counting and ordinal slot position; ResNet classifies element size bands. These are the sub-tasks VLMs do unreliably, and the specialists inject discriminative cues into the reranker.
Qwen2.5-VL + LoRA (G-series) emits a typed contract
{storey, ifc_class, spatial_relations[]}. Fine-tuning saturates coarse fields
(storey/class reach 100%) and learns relation typing (direction 0% → 82%), but the discriminating
slot/size fields stay at 0%, which is the evidence for delegating them to specialists.
Every field carries a {value, confidence, source} record. An orchestration layer
routes on calibrated confidence (AUROC 0.80), applying constraints softly and
deferring to a human instead of guessing. Studied as a static → learned router →
LLM-agent ablation on accuracy, latency, and repeatability.
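A static version of the answer/defer routing could look like the sketch below. The `Field` record mirrors the {value, confidence, source} triple described above; the 0.7 threshold, the "weakest field decides" rule, and the sample values are assumptions chosen for illustration, not the calibrated gate used in the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    value: object
    confidence: float  # assumed to be calibrated, in [0, 1]
    source: str        # e.g. "vlm", "opencv", "resnet"


def route(address: dict, threshold: float = 0.7) -> str:
    """Return 'answer' if the weakest field clears the threshold, else 'defer'.

    The threshold and the min-confidence rule are hypothetical choices.
    """
    weakest = min(field.confidence for field in address.values())
    return "answer" if weakest >= threshold else "defer"


confident = {
    "storey": Field("Level 2", 0.98, "vlm"),
    "ifc_class": Field("IfcWindow", 0.97, "vlm"),
    "slot": Field((8, 10), 0.91, "opencv"),
}
unsure = {**confident, "slot": Field((3, 10), 0.42, "opencv")}

print(route(confident))
print(route(unsure))
```

Keeping `source` on every field means a deferral can be traced to the specific extractor whose confidence was too low.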
Each contract has an experiment behind it: what to let the VLM do, what to push into deterministic geometry, when to trust an extracted address, and how much graph context is recoverable from a flat image.
| Engineering question | Decision | Comparison / evidence |
|---|---|---|
| Should the VLM directly name the BIM element? | No. The VLM emits typed JSON; deterministic queries name only existing IFC GUIDs. | Fine-tuned VLM reaches 100% GT-in-pool but only 6.7% Top-1; sibling selection needs explicit spatial structure. |
| Can generic retrieval solve the task? | No. Retrieval is ontology-constrained, then re-ranked by a spatial address over the IFC graph. | Lexical/dense baselines stay at 1.7% Top-1; the address ceiling reaches 78.5% Top-1 / 98.1% Top-10 (n=60). |
| Should noisy address fields hard-filter candidates? | No. The address is a soft prior with calibrated answer/defer routing, preserving recall. | Realized end-to-end reaches 58.9% Top-1 (67.6% with oracle coarse fields); selective prediction lifts the answered subset to 73.4%. |
| How deep should graph context go? | Compile one-hop context into the element record; do not chase deep relation chains from an image. | Realizable discrimination saturates at one hop: the confusable set shrinks from 13 to about 8, with little further realized gain. |
| Learned perception vs deterministic specialists? | VLM for coarse semantics; OpenCV/ResNet specialists for count, ordinal slot, and size. | The VLM reaches 100% on storey/class but 0% on slot/size; the realized slot specialist reaches 58.9% Top-1 end-to-end (n=35 fillers). |
The thesis proved the architecture sound: with perfect perception the symbolic engine keeps the correct element 100% of the time and compresses the shortlist toward one, so the bottleneck is perception, not graph logic. The core advance deepens the deterministic specialists into a principled spatial-interpretation layer: instead of ad-hoc cues, the system recovers a spatial address, a relational key that is computable from the BIM model and recoverable from plain 2D image evidence. Three questions, answered with measurements:
It is type-conditional: a coarse prefix (storey + class) is necessary but
saturated; the discriminator is class-specific: a position-slot
(i of M) for an opening, a connectivity fingerprint for a wall.
Hard filtering deletes the answer, so the address is a soft prior in a recall-fixed pool. One real detector closes most of the gap; its confidence passes a calibration gate, and its payoff is knowing when to abstain.
A depth law: deeper relations are more unique but their recovery from an image collapses with distance, so discrimination saturates at one hop. The practical rule is to compile the relation into the element rather than chase deep chains.
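The soft-prior idea from the answers above can be sketched as a re-weighting step that reorders the sibling pool without deleting anyone. The scoring rule (one minus confidence times normalized slot distance) and the toy GUIDs are assumptions made for this example; the page does not give the reranker's actual formula.

```python
def rerank_by_slot(candidates, observed_slot, confidence):
    """Reorder (guid, slot) candidates by a soft position-slot prior.

    Score = 1 - confidence * normalized slot distance. No candidate is
    removed, so recall over the pool is unchanged. The formula is illustrative.
    """
    max_slot = max(slot for _, slot in candidates)
    span = max(max_slot - 1, 1)  # avoid division by zero for a single opening
    scored = []
    for guid, slot in candidates:
        distance = abs(slot - observed_slot) / span
        scored.append((1.0 - confidence * distance, guid))
    # Highest score first; ties broken by GUID so the order is repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [guid for _, guid in scored]


# Toy 10-opening chain; the detector reports slot 8 of 10.
chain = [(f"win-{i:02d}", i) for i in range(1, 11)]
ranked = rerank_by_slot(chain, observed_slot=8, confidence=0.9)
print(ranked[:3])
```

With a confidence of zero every sibling scores the same and the order falls back to the tie-break, which is the behaviour a soft prior should have when the detector is uninformative.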
(i of M) for an opening (image-recoverable, realized 58.9% end-to-end) and a
connectivity fingerprint for a wall (oracle-discriminative but not
image-recoverable). Each reorders the same retrieved pool to the per-class oracle Top-1 shown.
{value, confidence, source}
record (VLM → storey/class; OpenCV → the position-slot; ResNet → size) → (c) the depth-1
spatial-address record → (d) calibrated routing → (e) the knowledge-graph shortlist collapses to a
GUID. The highlighted lane is the confidence-routing path.
Median live pool · AP held-out benchmark (n=60)
| Fingerprint level | Median pool |
|---|---|
| L0: no filter | 1233 |
| L1: storey + class | 46 |
| L3: + direction + subtype | 9 |
| L4: + exact position slot* | 1 |
*L4 extractable for the addressable filler subset. L3 fingerprints provide the dominant compression.
AP held-out benchmark (n=60), p0∪p1 planner.
| Model / diagnostic | GT-in-pool | Top-1 | Top-10 | MRR@10 |
|---|---|---|---|---|
| Zero-shot Gemini | 95.0% | 1.7% | 18.3% | 0.056 |
| Fine-tuned VLM (best end-to-end) | 100% | 6.7% | 30.0% | 0.110 |
| + realized position-slot specialist (end-to-end) | 100% | 58.9% | 67.1% | n/a |
| + realized slot, oracle coarse (upper bound) | 100% | 67.6% | 80.9% | n/a |
| + type-conditional spatial-address ceiling | 100% | 78.5% | 98.1% | 0.854 |
The ceiling row is a diagnostic upper bound, not a deployed end-to-end system. The realized row is the actual spatial-address path currently built: a deterministic position-slot specialist for the 35 window/door filler cases. Realized wall/other extractors are future work.
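For readers unfamiliar with the column headers, the sketch below computes GT-in-pool, Top-1, Top-k, and MRR@k from per-case ground-truth ranks using the standard textbook definitions. The thesis may define or average these slightly differently, and the ranks in the example are toy values, not benchmark data.

```python
def ranking_metrics(ranks, k=10):
    """Compute ranking metrics from 1-based ground-truth ranks.

    ranks: one entry per case; None means the ground truth was not in the pool.
    Uses the usual definitions; reciprocal rank counts as 0 beyond rank k.
    """
    n = len(ranks)
    in_pool = sum(r is not None for r in ranks) / n
    top1 = sum(r == 1 for r in ranks) / n
    topk = sum(r is not None and r <= k for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks if r is not None and r <= k) / n
    return {"gt_in_pool": in_pool, "top1": top1, f"top{k}": topk, f"mrr@{k}": mrr}


# Toy ranks for four cases: hit at 1, hit at 2, hit at 12, and a pool miss.
print(ranking_metrics([1, 2, 12, None]))
```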
Two real held-out cases through the same gate: when the detector is correct and confident it commits; when it is wrong but unsure it abstains and returns candidates. This is the behaviour that makes a triage tool trustworthy.
| Arm | Median inspections | ≈ time / element | Success@1 |
|---|---|---|---|
| Manual scan, no ranking | 38.0 | ~570 s | 1.8% |
| Coarse storey + class | 23.0 | ~345 s | 4.9% |
| + spatial-address ceiling | 0.5 | ~8 s | 78.5% |
A ranking-derived effort proxy (15 s/inspection assumption), not a user study.
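The time column follows directly from the stated 15 s/inspection assumption, as this small calculation shows. The function name is illustrative; the inspection counts are the medians from the table above.

```python
SECONDS_PER_INSPECTION = 15  # the proxy's stated assumption, not a measurement


def effort_seconds(median_inspections):
    """Convert a median inspection count into the time-per-element proxy."""
    return median_inspections * SECONDS_PER_INSPECTION


arms = [
    ("Manual scan, no ranking", 38.0),
    ("Coarse storey + class", 23.0),
    ("+ spatial-address ceiling", 0.5),
]
for arm, inspections in arms:
    print(f"{arm}: ~{effort_seconds(inspections):g} s")
```

The ceiling arm works out to 7.5 s, which the table reports rounded to about 8 s.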
A browser demo takes an uploaded construction photo plus a natural-language description and highlights the matched IFC elements in a 3D viewer, with no AEC software required. The interactive version lets you pick a representative held-out case and inspect both the knowledge-graph spatial address that drives retrieval and the grounded element highlighted in 3D.
site/ over HTTP:
cd site && python3 -m http.server → http://localhost:8000/demo.html