A semantic consistency benchmark for evaluating Vision-Language Models (VLMs) on agricultural crop-row environments. The benchmark compares five embedding models — CLIP, OpenCLIP, DINOv2, EVA-CLIP, and SigLIP — across three complementary metrics (CSC, TSC, SSC) using paired RGB-D frames collected by a Scout robot platform in real field conditions.
This benchmark isolates three sources of variation in semantic embeddings: viewpoint, time, and spatial structure. It defines one consistency task for each (cross-view, temporal, and structural), with corresponding metrics (CSC, TSC, SSC) that quantify how consistently the same semantic entity is represented under varying observations, enabling controlled analysis of embedding stability. All three metrics share the same form: they compare intra-class similarity (how similar embeddings of the same semantic target are) against inter-class similarity (how similar crop and non-crop representations are).
CSC measures whether the same object keeps a similar embedding when observed from different viewpoints.
$$\text{CSC} = \frac{\text{SC}_\text{intra}(\mathcal{P}_\text{CSC})}{\text{SC}_\text{inter}}, \quad \mathcal{P}_\text{CSC} = \{(e_i, e_j) \mid o_i = o_j,\ v_i \neq v_j\}$$
Here, \(e_i\) and \(e_j\) are embeddings of the same object (\(o_i = o_j\)) observed from different viewpoints (\(v_i \neq v_j\)).
TSC measures whether the same object keeps a similar embedding when observed at different times.
$$\text{TSC} = \frac{\text{SC}_\text{intra}(\mathcal{P}_\text{TSC})}{\text{SC}_\text{inter}}, \quad \mathcal{P}_\text{TSC} = \{(e_i, e_j) \mid o_i = o_j,\ t_i \neq t_j\}$$
This captures semantic stability across repeated observations as illumination and scene conditions change over time.
SSC measures whether crops belonging to the same structural group, such as the same row, are represented consistently in embedding space.
$$\text{SSC} = \frac{\text{SC}_\text{intra}(\mathcal{P}_\text{SSC})}{\text{SC}_\text{inter}}, \quad \mathcal{P}_\text{SSC} = \{(e_i, e_j) \mid g_i = g_j\}$$
Here, \(g_i\) and \(g_j\) denote the spatial group (e.g., crop row) to which each observation belongs.
In all cases, \(\text{SC}_\text{intra}\) and \(\text{SC}_\text{inter}\) are computed with cosine similarity, and the semantic consistency score is their ratio:
$$\text{SC} = \frac{\text{SC}_\text{intra}}{\text{SC}_\text{inter}}$$
A higher score means that semantically related observations stay close to each other while remaining separable from unrelated regions.
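The shared ratio above can be sketched in a few lines. The sketch below is illustrative, not the benchmark's actual implementation: the observation record layout (dicts with `emb`, `obj`, `view` keys) is an assumption, and \(\text{SC}_\text{intra}\) / \(\text{SC}_\text{inter}\) are taken here as mean cosine similarities over their pair sets.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_similarity(pairs):
    # Mean cosine similarity over a collection of embedding pairs.
    return float(np.mean([cosine(a, b) for a, b in pairs]))

def consistency_score(intra_pairs, inter_pairs):
    # SC = SC_intra / SC_inter; higher means same-entity embeddings stay
    # close while remaining separable from unrelated regions.
    return mean_similarity(intra_pairs) / mean_similarity(inter_pairs)

def csc_pairs(observations):
    # CSC pair set: same object id, different viewpoint
    # (hypothetical record layout: dicts with 'emb', 'obj', 'view').
    return [(a["emb"], b["emb"])
            for i, a in enumerate(observations)
            for b in observations[i + 1:]
            if a["obj"] == b["obj"] and a["view"] != b["view"]]
```

Swapping the pair filter to \(t_i \neq t_j\) or \(g_i = g_j\) yields the TSC and SSC pair sets while reusing the same scoring function.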
Embedding Instability Across Viewpoint Changes: Semantic consistency decreases as the observation angle difference increases: when the same object is viewed from different directions, the resulting embeddings become less similar. The plot shows CSC (y-axis) against the angular difference between two observations of the same object (x-axis). CLIP remains nearly stable, with a near-zero slope (−0.00229), whereas DINOv2 exhibits a much steeper negative slope (−0.02345), meaning its embeddings change substantially as the viewing angle varies. Although both models observe the same physical crops, their embeddings diverge as viewpoint differences grow. This demonstrates that semantic representations are not viewpoint-invariant, and such instability can directly degrade the reliability of semantic maps.
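The reported slopes come from fitting a line to consistency scores against angular difference; a minimal sketch with hypothetical data (the benchmark's real per-pair scores are not reproduced here):

```python
import numpy as np

def consistency_slope(angle_diff_deg, scores):
    # Least-squares slope of consistency score vs. viewpoint angle difference.
    slope, _intercept = np.polyfit(angle_diff_deg, scores, 1)
    return float(slope)

# Hypothetical curves: a viewpoint-stable model vs. a sensitive one.
angles = np.array([0.0, 15.0, 30.0, 45.0, 60.0])
stable = 1.0 - 0.002 * angles     # near-zero slope (CLIP-like behavior)
sensitive = 1.0 - 0.023 * angles  # steep negative slope (DINOv2-like)
```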
Spatial Instability of Semantic Embeddings: Semantic consistency is not uniform even within structurally identical regions. The plot is a top-down visualization of embedding variance across crop regions: each voxel is colored by the variance of the embeddings assigned to it, with red indicating high variance (low consistency) and blue indicating low variance (high consistency). Although all crops belong to the same semantic category, some regions produce highly consistent embeddings while others exhibit large variability, and this inconsistency appears even within a single crop row, which should ideally share a uniform semantic representation. Comparing models, CLIP produces more consistent embeddings overall, while DINOv2 shows higher variance across many regions. This result highlights that semantic inconsistency is not only caused by viewpoint changes but also emerges spatially within repetitive structures, leading to fragmented semantic representations in the final map.
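The variance map can be approximated by bucketing embeddings into 2-D voxels and scoring each cell by its total embedding variance; a sketch under assumed inputs (`points` as x–y coordinates, `embeddings` as matching vectors, `voxel_size` chosen arbitrarily):

```python
from collections import defaultdict
import numpy as np

def voxel_embedding_variance(points, embeddings, voxel_size=0.5):
    # Group embeddings by their (x, y) voxel; return per-voxel total
    # variance. High variance = low semantic consistency for that cell.
    buckets = defaultdict(list)
    for (x, y), emb in zip(points, embeddings):
        key = (int(x // voxel_size), int(y // voxel_size))
        buckets[key].append(emb)
    return {k: float(np.var(np.stack(v), axis=0).sum())
            for k, v in buckets.items()}
```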
A large-scale multimodal agricultural dataset is constructed in a confined and highly repetitive farm environment where semantic consistency is challenging. The dataset includes RGB-D, LiDAR, IMU, GPS, and camera poses, covering approximately 2.5 km of trajectories. It is built with a pose-aware pipeline that combines GPS and LiDAR-inertial SLAM, providing accurate spatial alignment for systematic evaluation.