Designing Grammar-Guided LLM Outputs for Open Data Integration – A DSR Approach to IoT Data Platforms
Samir Chatterjee; Jan Brocke; Ricardo Anderson (Eds.). Local Solutions for Global Challenges: 20th International Conference on Design Science Research in Information Systems and Technology, DESRIST 2025, Montego Bay, Jamaica, June 2-4, 2025, Proceedings, Part I. Vol. 1. Cham: Springer Nature Switzerland 2025, pp. 178-195
Year of publication: 2025
Publication type: Miscellaneous (conference paper)
Language: English
| Reviewed: | Library |
Abstract
This paper designs and implements an artifact for converting unstructured or semi-structured open data into outputs conforming to the OGC SensorThings API (STA). Motivated by the growing influx of heterogeneous data in Internet-of-Things environments, the study employs an Action Design Research process to apply formalized grammars to Large Language Models (LLMs) to produce valid, STA-compliant JSON documents. Early prototypes using JSON schemas and Pydantic models highlighted the need for stricter control mechanisms to handle real-world open data complexity. Evaluation across multiple open data sources demonstrates the effectiveness of grammar-driven constraints in reducing malformed or incomplete outputs. Three smaller LLMs (Qwen 2.5 Instruct, Llama 3.1 Instruct, and Phi-4) were tested, showing that grammar length and input context can significantly influence output quality and model throughput. The findings underscore the advantages of embedding strict syntax requirements without sacrificing flexibility for diverse use cases. While domain-level validation (e.g., verifying realistic time-series values) remains a future direction, this research confirms the promise of grammar-based generation for streamlining data ingestion in IoT platforms. The approach facilitates more consistent and maintainable pipelines, potentially boosting interoperability and data quality in sensor-driven environments.