
Prompt-Based and Transformer-Based Models Evaluation for Semantic Segmentation of Crowdsourced Urban Imagery Under Projection and Geometric Symmetry Variations

Symmetry. Vol. 18, Issue 1. MDPI AG 2025, p. 68

Year of publication: 2025

Publication type: Journal article

Language: English

DOI/URN: 10.3390/sym18010068

Full text via DOI/URN

Verified: Library

Abstract


Semantic segmentation of crowdsourced street-level imagery plays a critical role in urban analytics by enabling pixel-wise understanding of urban scenes for applications such as walkability scoring, environmental comfort evaluation, and urban planning, where robustness to geometric transformations and projection-induced symmetry variations is essential. This study presents a comparative evaluation of two primary families of semantic segmentation models: transformer-based models (SegFormer and Mask2Former) and prompt-based models (CLIPSeg, LangSAM, and SAM+CLIP). The evaluation is conducted on images with varying geometric properties, including normal perspective, fisheye distortion, and panoramic format, representing different forms of projection symmetry and symmetry-breaking transformations, using data from Google Street View and Mapillary. Each model is evaluated on a unified benchmark with pixel-level annotations for key urban classes, including road, building, sky, vegetation, and additional elements grouped under the “Other” class. Segmentation performance is assessed through metric-based, statistical, and visual evaluations, with mean Intersection over Union (mIoU) and pixel accuracy serving as the primary metrics. Results show that LangSAM demonstrates strong robustness across different image formats, with mIoU scores of 64.48% on fisheye images, 85.78% on normal perspective images, and 96.07% on panoramic images, indicating strong semantic consistency under projection-induced symmetry variations. Among transformer-based models, SegFormer proves to be the most reliable, attaining the highest accuracy of all models on fisheye and normal perspective images, with mean IoU scores of 72.21%, 94.92%, and 75.13% on fisheye, normal, and panoramic imagery, respectively. LangSAM not only demonstrates robustness across different projection geometries but also delivers the lowest segmentation error, consistently identifying the correct class for corresponding objects. In contrast, CLIPSeg remains the weakest prompt-based model, with mIoU scores of 77.60% on normal images, 59.33% on panoramic images, and a substantial drop to 59.33% on fisheye imagery, reflecting sensitivity to projection-related symmetry distortions.
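
The abstract's headline metrics are mean Intersection over Union (mIoU) and pixel accuracy. For reference only, below is a minimal sketch of how these two metrics are commonly computed from integer label maps, assuming NumPy; the function name, the random stand-in data, and the five-class layout (road, building, sky, vegetation, "Other") mirror the abstract but are not taken from the paper's code:

    import numpy as np

    def miou_and_pixel_accuracy(pred, gt, num_classes):
        # Flatten label maps and build a confusion matrix
        # (rows = ground truth, columns = prediction).
        pred, gt = pred.ravel(), gt.ravel()
        cm = np.bincount(gt * num_classes + pred,
                         minlength=num_classes ** 2).reshape(num_classes, num_classes)
        tp = np.diag(cm).astype(float)
        union = cm.sum(axis=0) + cm.sum(axis=1) - tp
        iou = tp / np.maximum(union, 1)      # guard against empty classes
        miou = iou[union > 0].mean()         # average only over classes present
        pixel_acc = tp.sum() / cm.sum()
        return miou, pixel_acc

    # Illustrative call with random stand-in maps for the five urban classes.
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 5, size=(512, 512))
    pred = rng.integers(0, 5, size=(512, 512))
    print(miou_and_pixel_accuracy(pred, gt, num_classes=5))

Per-projection scores such as those quoted for fisheye, normal, and panoramic imagery would follow by running the same computation over each image subset separately.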

  • semantic segmentation
  • transformer models
  • street-level imagery
  • prompt-based
  • urban scene
  • fisheye distortion
  • Mask2Former
  • CLIPSeg
  • LangSAM
  • SAM+CLIP

Authors


Yousefi, Aida (Author)
Arefi, Hossein (Author)

Classification


DFG subject area:
4.43-05 - Image and Language Processing, Computer Graphics and Visualization, Human Computer Interaction, Ubiquitous and Wearable Computing

DDC subject group:
Engineering

Linked persons