Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity
landscape-analysis
agriculture
Author
Mo Wang
Published
December 15, 2024
1. Introduction and Motivation
In modern agriculture, the geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.
This research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.
In this post, we focus on identifying fake farms by analyzing field shapes for unrealistic geometry. Our goal is to improve data accuracy, providing a stronger foundation for future agriculture-related research.
2. Dataset Overview and Preparation
Data Source
The data for this study was extracted from Litefarm’s database, which contains detailed information about farm geometries, locations, and user associations. The dataset included the following key attributes:
Farm-Level Information:
Each farm is uniquely identified by a farm_ID, representing an individual farm within the Litefarm database.
Polygon-Level Information:
Each farm consists of multiple polygons, corresponding to distinct areas such as fields, gardens, or barns. Each polygon is uniquely identified by a location_ID, ensuring that every area within a farm is individually traceable.
Geometric Attributes:
Area: The total surface area of the polygon.
Perimeter: The boundary length of the polygon.
Vertex Coordinates:
The geographic shape of each polygon is defined by a list of vertex coordinates in latitude and longitude format, represented as: [(lat1, lon1), (lat2, lon2), ..., (latN, lonN)].
Polygon Types:
The polygons in each farm are categorized into various types:
Fields
Farm site boundaries
Residences
Barns
Gardens
Surface water
Natural areas
Greenhouses
Ceremonial areas
This rich dataset captures farm structures and geometries comprehensively, enabling the analysis of relationships between polygon features and agricultural outcomes.
This study concentrates on productive areas (gardens, greenhouses, and fields), as these contribute directly to agricultural output. Because each polygon type has distinct geometric characteristics, we restricted the analysis to a single type to maintain consistency.
As the Litefarm database is dynamic and continuously updated, the data captured as of November 28th showed that 36.4% of farms included garden areas, 20.7% had greenhouse areas, and nearly 70% contained fields. To ensure a robust and representative analysis, we focused on field polygons, which had the highest proportion within the dataset.
Refined Litefarm Dataset
To ensure that only valid and realistic farm data was included in the analysis, we applied rigorous SQL filters to the Litefarm database. These filters excluded:
Placeholder farms and internal test accounts.
Deleted records.
Farms located in countries with insufficient representation (fewer than 10 farms).
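For illustration, the country-representation filter can be sketched in pandas; the production filtering ran in SQL, and the `farms` DataFrame and its column names below are hypothetical:

```python
import pandas as pd

# Hypothetical `farms` DataFrame with `farm_id` and `country_name` columns;
# keep only farms from countries with at least 10 farms
counts = farms.groupby("country_name")["farm_id"].transform("count")
farms_kept = farms[counts >= 10]
```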
The table below summarizes the results of the filtering process and the composition of the cleaned dataset:
| Description | Count |
|---|---|
| Initial number of farms in Litefarm | 3,559 |
| Farms after SQL filtering | 2,919 |
| Farms with field areas | 2,022 |
| Farms with garden areas | 1,063 |
| Farms with greenhouse areas | 607 |
| Total number of field polygons | 6,340 |
By narrowing the focus to field polygons, we ensured that the dataset was both robust and suitable for exploring the relationship between geometric features and agricultural outcomes.
3. Shape Analysis
This study focuses on the geometric properties of field polygons, as these are essential for understanding farm structures and ensuring data reliability. Each field polygon is represented by a series of vertices in latitude-longitude pairs, which outline its geometric boundaries. These vertices are the foundation for calculating key metrics such as area, perimeter, and more complex shape properties.
To perform a robust analysis, we systematically processed and evaluated the field polygon data through the following steps:
1. Vertex Distribution Analysis
The first step in our analysis was to examine the vertex distribution of the field polygons to understand their general characteristics and ensure data quality. A box plot was created to visualize the distribution of vertex counts:
The results revealed a wide range of vertex counts, spanning from 3 to 189 vertices. This variability required filtering to address potential outliers. Using the z-score method, we identified and excluded extreme values, capping the maximum vertex count at 34.
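A minimal sketch of this filter, assuming the polygons live in a DataFrame `field_data` with a `num_vertices` column (both names, and the cutoff of three standard deviations, are assumptions):

```python
import numpy as np

# z-score filter on vertex counts; `field_data` and `num_vertices` are assumed names
v = field_data["num_vertices"]
z_scores = (v - v.mean()) / v.std()
field_data = field_data[np.abs(z_scores) < 3]  # assumed cutoff of 3 standard deviations
```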
After filtering, we analyzed the revised vertex distribution using a histogram, which revealed that 47.4% of field polygons had exactly four vertices:
2. Validation of Area and Perimeter Metrics
Recalculation Process:
Vertex coordinates, initially in latitude-longitude format, were projected into a planar, equal-area coordinate system (EPSG:6933) so that areas and lengths could be computed precisely in metric units.
Area and perimeter were computed directly from the transformed vertex data.
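A minimal sketch of this recalculation using shapely and pyproj (the function name and the hectare conversion are illustrative, not the production code):

```python
from pyproj import Transformer
from shapely.geometry import Polygon
from shapely.ops import transform

# WGS 84 lat/lon -> EPSG:6933 (equal-area, metre units); always_xy expects (lon, lat)
to_planar = Transformer.from_crs("EPSG:4326", "EPSG:6933", always_xy=True).transform

def recompute_metrics(latlon_coords):
    """Recompute (area_ha, perimeter_m) from [(lat1, lon1), ..., (latN, lonN)] vertices."""
    polygon = Polygon([(lon, lat) for lat, lon in latlon_coords])  # shapely expects (x, y) = (lon, lat)
    projected = transform(to_planar, polygon)
    return projected.area / 10_000, projected.length  # m² to hectares; metres
```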
Scatter plots comparing the user-provided values with the recalculated metrics showed strong alignment, with most points clustering around the diagonal (dashed line), confirming the accuracy of the recalculated values:
Figures: perimeter comparison and area comparison scatter plots.
This validation step provided confidence in the accuracy of the recalculated metrics, allowing us to proceed with subsequent shape analysis using reliable data.
Field Polygon Standardization and Preparation
To focus on the geometric properties of field polygons, we projected all polygons into a size-and-shape space. This transformation isolates the shape and scale of the polygons while removing variations caused by rotation and translation. The size-and-shape space ensures consistent and meaningful comparisons of the underlying geometric features.
While this study emphasizes polygon shapes, we recognize that area is a critical feature in agricultural studies due to its relationship with factors like regional regulations and agricultural policies. Thus, we preserved the size (scaling) component in our analysis to maintain the relevance of area.
To ensure uniformity and consistency in the dataset, we performed the following preprocessing steps:
Standardizing Landmark Points:
To enable meaningful comparisons in the size-and-shape space, each polygon was resampled to have exactly 34 evenly spaced points along its boundary. The following Python function illustrates this process:
```python
import folium
import json
import math
import ast
import logging
import itertools
from itertools import combinations
from functools import partial
from collections import defaultdict

import numpy as np
import pandas as pd
import geopandas as gpd
import geopy.distance
from geopy.distance import geodesic
from geographiclib.geodesic import Geodesic

import pyproj
from pyproj import Transformer, CRS

from shapely.geometry import (
    shape, Polygon, Point, MultiPoint, MultiPolygon,
    LineString, LinearRing, MultiLineString,
)
from shapely.ops import unary_union, transform, nearest_points
from shapely.validation import explain_validity

import altair as alt
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.express as px
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from geomstats.geometry.pre_shape import PreShapeSpace
# Euclidean and FrechetMean are needed by the Fréchet mean code below
from geomstats.geometry.euclidean import Euclidean
from geomstats.learning.frechet_mean import FrechetMean
from geomstats.visualization import KendallDisk, KendallSphere
```
```python
def resample_polygon(projected_coords, num_points=34):
    """
    Resample a polygon's boundary to have a specified number of evenly spaced points.

    Parameters:
    - projected_coords: List of coordinates defining the polygon's boundary.
    - num_points: The number of evenly spaced points to resample (default is 34).

    Returns:
    - new_coords: List of resampled coordinates.
    """
    ring = LinearRing(projected_coords)
    total_length = ring.length
    distances = np.linspace(0, total_length, num_points, endpoint=False)
    new_coords = [ring.interpolate(distance).coords[0] for distance in distances]
    return new_coords
```
Ensuring Consistent Vertex Direction:
All polygons were standardized to have vertices drawn in the same direction (clockwise or counterclockwise). This step ensures that the orientation of the vertices does not introduce inconsistencies in the analysis.
```python
def is_clockwise(coords):
    """
    Check if the polygon vertices are in a clockwise direction.

    Parameters:
    - coords: List of coordinates defining the polygon's boundary.

    Returns:
    - True if the polygon is clockwise; False otherwise.
    """
    ring = LinearRing(coords)
    return not ring.is_ccw


def make_clockwise(coords):
    """
    Convert the polygon's vertices to a clockwise direction, if needed.

    Parameters:
    - coords: List of coordinates defining the polygon's boundary.

    Returns:
    - List of coordinates ordered in a clockwise direction.
    """
    if not is_clockwise(coords):
        return [coords[0]] + coords[:0:-1]  # Reverse the vertex order, keeping the start point
    return coords
```
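A quick usage check of the two helpers on a toy quadrilateral (the coordinates are illustrative projected vertices, not dataset values):

```python
# Toy projected (x, y) vertices, listed counterclockwise
raw_coords = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]

standardized = make_clockwise(resample_polygon(raw_coords, num_points=34))
assert len(standardized) == 34
assert is_clockwise(standardized)
```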
The image illustrates four polygons that have been standardized by resampling them to have 34 evenly spaced points, with all vertices aligned in a clockwise direction.
Validation of Standardization
To confirm the accuracy of these transformations, we compared the areas and perimeters of the resampled polygons with the original values. The results demonstrated minimal deviation, indicating the transformations preserved the integrity of the shapes.
Figures: perimeter and area comparisons between original and resampled polygons.
By meeting these preprocessing requirements, we ensured that the polygons were accurately prepared for subsequent shape analysis.
Shape Alignment and Fréchet Mean Analysis
With data preparation complete, the polygons were ready for analysis in the size-and-shape space. This specialized framework enables consistent comparison of shapes by accounting for geometric differences, including scaling, translation, and rotation. It provides a robust foundation for meaningful geometric analysis.
The polygons were aligned using Procrustes analysis (Dryden and Mardia 2016), and their Fréchet Mean was iteratively computed in Euclidean space. This process standardizes the shapes, removing variation caused by translation and rotation and allowing accurate, meaningful comparisons.
The Fréchet Mean (Dryden and Mardia 2016) represents the “average” shape in a geometric space (manifold), minimizing the average squared distance to all sample shapes. It serves as a standardized and central representation of the dataset.
Step-by-Step Overview
Shape Alignment:
The align_shape function performs Procrustes alignment through the following steps:
Removing Translation:
The centroid (average position of all points) of each shape is computed. The shape is then centered by subtracting its centroid from all points, ensuring the shape is position-independent.
Removing Rotation:
Using Singular Value Decomposition (SVD), the optimal rotation matrix is calculated to align the target shape with the reference shape. This step removes rotation differences while preserving the relative positions of the points.
Measuring Shape Differences:
The riemannian_distance function computes the Riemannian distance between two shapes in size-and-shape space. This metric quantifies geometric differences between shapes, considering both size and rotation.
Riemannian Distance in Size-and-Shape Space
Given two \(k\)-point configurations in \(m\) dimensions, \(X_1^o, X_2^o \in \mathbb{R}^{k \times m}\), the Riemannian distance (Dryden and Mardia 2016) in size-and-shape space is defined as:

\[
d_S(X_1^o, X_2^o) = \sqrt{S_1^2 + S_2^2 - 2\, S_1 S_2 \cos \rho},
\]

where \(S_1\) and \(S_2\) are the centroid sizes of the two configurations (the Euclidean norms of the centred configurations) and \(\rho\) is the Riemannian shape distance between them. This formula ensures that the distance captures both shape similarity and scaling differences, making it a robust tool for geometric analysis.
Iterative Fréchet Mean Calculation:
The algorithm begins with an initial reference shape and aligns all other shapes to it using Procrustes alignment.
The Fréchet Mean is then calculated as the average shape in Euclidean space.
The shapes are iteratively re-aligned to the updated Fréchet Mean, refining the alignment and mean calculation until convergence is achieved.
Python Implementation
The following Python code implements the entire process of shape alignment, Riemannian distance computation, and iterative Fréchet Mean calculation.
```python
def align_shape(reference_shape, target_shape):
    """
    Align the target shape to the reference shape using Procrustes alignment.

    Parameters:
    - reference_shape: The reference shape to align to.
    - target_shape: The shape to be aligned.

    Returns:
    - aligned_shape: The aligned target shape.
    """
    reference_shape = np.array(reference_shape)
    target_shape = np.array(target_shape)

    # Step 1: Remove the translation
    centroid_reference = np.mean(reference_shape, axis=0)
    centroid_target = np.mean(target_shape, axis=0)
    centered_reference = reference_shape - centroid_reference
    centered_target = target_shape - centroid_target

    # Step 2: Remove the rotation
    u, s, vh = np.linalg.svd(np.matmul(np.transpose(centered_target), centered_reference))
    r = np.matmul(u, vh)
    aligned_shape = np.matmul(centered_target, r)
    return aligned_shape


def riemannian_distance(reference_shape, target_shape):
    """
    Compute the Riemannian distance between two shapes in size-and-shape space.

    Parameters:
    - reference_shape: The reference shape.
    - target_shape: The target shape.

    Returns:
    - distance: The Riemannian distance between the shapes.
    """
    reference_shape = np.array(reference_shape)
    target_shape = np.array(target_shape)

    # Step 1: Remove translation by centering the shapes
    centered_reference = reference_shape - np.mean(reference_shape, axis=0)
    centered_target = target_shape - np.mean(target_shape, axis=0)

    # Step 2: Compute centroid sizes (norms of the centered configurations)
    S1 = np.linalg.norm(centered_reference)
    S2 = np.linalg.norm(centered_target)

    # Step 3: Compute optimal rotation using SVD
    H = np.dot(centered_target.T, centered_reference)
    U, _, Vt = np.linalg.svd(H)
    R = np.dot(U, Vt)

    # Step 4: Align target shape
    aligned_target = np.dot(centered_target, R)

    # Step 5: Compute the Riemannian distance
    cosine_rho = np.trace(np.dot(aligned_target.T, centered_reference)) / (S1 * S2)
    cosine_rho = np.clip(cosine_rho, -1, 1)
    distance = np.sqrt(S1**2 + S2**2 - 2 * S1 * S2 * cosine_rho)
    return distance


# Iterative Fréchet Mean Calculation
epsilon = 1e-6
max_iterations = 100
reference_shape = field_data['resampled_point'].iloc[0]

# Align all shapes to the initial reference shape
aligned_shapes = []
for target_shape in field_data['resampled_point']:
    aligned_shapes.append(align_shape(reference_shape, target_shape))

# Initialize Euclidean space and calculate the initial Fréchet Mean
euclidean_space = Euclidean(dim=aligned_shapes[0].shape[1])
frechet_mean = FrechetMean(euclidean_space)
previous_frechet_mean_shape = frechet_mean.fit(aligned_shapes).estimate_

converged = False
iteration = 0
frechet_means = [previous_frechet_mean_shape]

while not converged and iteration < max_iterations:
    iteration += 1

    # Re-align all shapes to the current Fréchet Mean
    aligned_shapes2 = []
    for target_shape in field_data['resampled_point']:
        aligned_shapes2.append(align_shape(previous_frechet_mean_shape, target_shape))

    # Calculate the new Fréchet Mean
    frechet_mean = FrechetMean(euclidean_space)
    current_frechet_mean_shape = frechet_mean.fit(aligned_shapes2).estimate_
    frechet_means.append(current_frechet_mean_shape)

    # Check convergence
    difference = riemannian_distance(previous_frechet_mean_shape, current_frechet_mean_shape)
    if difference < epsilon:
        converged = True
    else:
        previous_frechet_mean_shape = current_frechet_mean_shape
```
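As a quick sanity check of riemannian_distance (a toy example, not dataset values): for two copies of the same centred shape that differ only in scale, the optimal rotation is the identity, \(\cos \rho = 1\), and the distance reduces to the size difference \(|S_1 - S_2|\):

```python
# Toy check: a centred square vs. the same square scaled by 3
square = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
d = riemannian_distance(square, 3 * square)
S1, S2 = np.linalg.norm(square), np.linalg.norm(3 * square)
assert np.isclose(d, abs(S1 - S2))  # identical shapes: distance equals the size difference
```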
Global Fréchet Mean and Outlier Detection
Here is the global Fréchet mean calculated from all field polygons:
The following image illustrates the original polygon and its alignment with the Fréchet mean:
After aligning all shapes to the Fréchet mean, the riemannian_distance function was used to calculate the distances between the mean shape and each aligned shape. To identify potential outliers, the z-score method was applied to these distance values.
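A sketch of this outlier step, reusing the names from the implementation above (the cutoff of three standard deviations is an assumption):

```python
# Distances from each aligned shape to the global Fréchet mean
distances = np.array([
    riemannian_distance(current_frechet_mean_shape, shape)
    for shape in aligned_shapes2
])

# Flag shapes whose distance z-score exceeds the assumed cutoff of 3
z_scores = (distances - distances.mean()) / distances.std()
outlier_mask = np.abs(z_scores) > 3
```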
Below are the four field polygons detected as outliers using the global Fréchet mean:
```python
import pandas as pd

# Load the CSV of flagged field polygons
four_potiential_fake_farm = pd.read_csv("data/potiential_fake_field.csv")

# Display the table
four_potiential_fake_farm
```
| Farm Number | country_name | type | calculated_perimeter_m | calculated_area_ha | number of vertices | distance_to_frechet_mean | z_score |
|---|---|---|---|---|---|---|---|
| Farm 310 | United States | field | 744797.7117 | 2.600780e+06 | 3 | 591590.0609 | 48.784896 |
| Farm 71 | Canada | field | 864206.5248 | 4.251124e+06 | 5 | 709800.2531 | 58.580655 |
| Farm 45 | Canada | field | 341370.9916 | 8.453115e+04 | 5 | 177371.8498 | 14.459753 |
| Farm 2792 | India | field | 200958.9993 | 2.170029e+05 | 4 | 166440.3554 | 13.553890 |
Fréchet Mean Shape by Country
The shape of field polygons varies significantly across different countries. To capture this variation, we calculated the Fréchet mean shape for each country based on the fields located within that specific country.
The plot below summarizes the Fréchet mean shapes for all countries in the dataset.
In this visualization, different colors represent different continents. It is evident that both the shapes and areas of the field polygons differ substantially across regions, highlighting the diversity in field geometry across countries.
Assessing Mean Shape Representation in Countries with Limited Data
To evaluate the representativeness of the mean shape, we specifically selected countries with fewer than 10 polygons. The small number of polygons in these cases allows for easier visualization, helping us assess whether the mean shape effectively captures the overall geometric characteristics of these datasets.
Figures: field polygons and their Fréchet mean shapes for Zambia and Chile.
From the above plots, we can draw the following conclusions:
Effective Representation with Similar Shapes:
When the field polygons within a country have similar shapes, the calculated Fréchet mean serves as an effective representation of the general shape trend.
Limitations with Diverse Shapes:
If the field polygons within a country show significant variation in their shapes, the Fréchet mean becomes less representative and may fail to adequately capture the geometric diversity of the dataset.
Detecting Potential Fake Field Polygons
Building on the country-level mean shape analysis, we applied the same methodology to detect potential fake field polygons. For each country, field polygons were aligned to their corresponding Fréchet mean, and the z-score technique was used to identify anomalies based on the distances between each polygon and the mean shape.
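A condensed sketch of the per-country procedure; `field_df`, its columns, and the `compute_country_frechet_mean` helper are hypothetical names standing in for the pipeline described above:

```python
import numpy as np
import pandas as pd

# Per-country outlier detection (hypothetical names; cutoff of 3 is assumed)
flagged_groups = []
for country, group in field_df.groupby("country_name"):
    mean_shape = compute_country_frechet_mean(group["aligned_shape"])
    d = group["aligned_shape"].apply(lambda s: riemannian_distance(mean_shape, s))
    z = (d - d.mean()) / d.std()
    flagged_groups.append(group[np.abs(z) > 3])

potential_fake_fields = pd.concat(flagged_groups)
```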
Through this analysis, we identified 51 potential fake field polygons. To verify their validity, we visualized each field polygon on satellite imagery. The results are summarized in the plot below:
Gray markers: Fake fields
Pink markers: True fields
Orange markers: Potential fake fields
After visualizing all 51 potential fake field polygons, the findings were as follows:
45.1% were confirmed as fake fields.
29.4% were ambiguous, meaning they could potentially be either fake or real fields, requiring further investigation.
25.5% were determined to be true fields.
Below are examples of confirmed fake fields. These polygons often exhibit:
Unusual geometric shapes
Sizes that are disproportionately large compared to neighboring field polygons
Future Work
Our analysis successfully identified a significant number of potential fake field polygons, with nearly half of these cases being validated as genuinely fake. While this demonstrates the effectiveness of our approach, there is still room to improve the accuracy and reliability of the detection process. To further refine our results, future efforts will focus on:
Incorporating Geographic Information:
Enrich the dataset with geographic features such as proximity to natural landmarks (e.g., mountains, rivers) or man-made structures (e.g., urban areas, roads). These features could provide valuable context for improving the calculation of the Fréchet mean and detecting anomalies more effectively.
Improving Outlier Detection Methods:
Leverage advanced machine learning models, such as clustering algorithms or ensemble methods, to identify subtle patterns and relationships that may indicate fake fields. Techniques like unsupervised learning or deep anomaly detection could also be explored to improve performance.
References
Dryden, Ian L., and Kanti V. Mardia. 2016. Statistical Shape Analysis, with Applications in R. Wiley Series in Probability and Statistics. Chichester, UK: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119072492.