GeoPlant Dataset
The GeoPlant dataset comprises species observation data, namely Presence-Only (PO) occurrences and Presence-Absence (PA) surveys, together with a wide set of environmental predictors. It covers 38 European countries and 8 major biogeographic regions (e.g., Alpine, Atlantic, and Boreal).
For each species observation, we provide:
- Diverse environmental rasters (e.g., elevation, human footprint, land use, soil)
- Sentinel-2 RGB and near-infrared (NIR) satellite images (128×128 pixels at 10 m resolution)
- A 20-year time series of climatic variables
- Satellite time-series point values for six bands (R, G, B, NIR, SWIR1, SWIR2) from Landsat
For a detailed description of all predictor modalities, see the Predictors & Modalities page.
Figure 1. Geospatial extent of the dataset. Presence-Only (PO) data spans all of habitable Europe, while Presence-Absence (PA) training and test sites are located primarily in France, Denmark, Switzerland, and Czechia.
Species Observation Data
The dataset contains approximately 5 million PO occurrences and around 94,000 PA surveys.
PO data covers most of Europe but is sampled opportunistically, without a standardized protocol, which introduces spatial, temporal, and taxonomic sampling biases. A PO record of one species at a location says nothing about whether other species are absent there. PA surveys, by contrast, are conducted by experts and provide much more reliable information on both presences and absences.
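To make the difference between the two observation types concrete, the sketch below turns PA survey records into a binary survey-by-species matrix in which zeros can be read as likely absences. It is a minimal illustration only: the long-format `(survey_id, species_id)` layout, the identifiers, and the function name `build_pa_matrix` are assumptions, not the dataset's actual file structure.

```python
# Hypothetical long-format PA records: one (survey_id, species_id) pair
# per species recorded in a survey. These values are made up for illustration.
pa_records = [
    (1, "sp_a"), (1, "sp_b"),
    (2, "sp_b"),
    (3, "sp_a"), (3, "sp_c"),
]


def build_pa_matrix(records):
    """Return (survey_ids, species_ids, matrix) with matrix[i][j] in {0, 1}.

    A 1 means the species was recorded in the survey; a 0 means it was not
    recorded, which for an exhaustive PA inventory is read as a likely absence.
    """
    survey_ids = sorted({survey for survey, _ in records})
    species_ids = sorted({species for _, species in records})
    row = {survey: i for i, survey in enumerate(survey_ids)}
    col = {species: j for j, species in enumerate(species_ids)}

    matrix = [[0] * len(species_ids) for _ in survey_ids]
    for survey, species in records:
        matrix[row[survey]][col[species]] = 1
    return survey_ids, species_ids, matrix


survey_ids, species_ids, pa_matrix = build_pa_matrix(pa_records)
```

The same construction is not valid for PO occurrences: a PO record supplies only a single positive entry, and the remaining entries for that location are unknown rather than zero.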
Presence-Absence (PA) surveys
A PA survey is an expert inventory of all plant species in a given plot (10–400 m²). Because the inventory is exhaustive, species that were not recorded in the plot are likely truly absent.
- Source: 29 datasets hosted in the European Vegetation Archive (EVA)
- Size: 93,703 surveys covering 5,016 species (≈ half of the European flora)
- Imbalance: Most species are rarely observed in PA surveys.
- Train/Test Splits: 95%/5% using spatial block hold-out (10×10 km grid) to balance biogeographical regions.
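The idea behind a spatial block hold-out can be sketched as follows: surveys are assigned to 10 km grid cells, and whole cells are held out so that test surveys are never in the same block as training surveys. This is a simplified illustration, not the procedure used to build the official split. It assumes coordinates already projected to metres, it holds out a fraction of blocks rather than of surveys, and it makes no attempt to balance biogeographical regions. The names `block_id` and `spatial_block_split` are hypothetical.

```python
import math
import random

BLOCK_SIZE_M = 10_000  # 10 km blocks, matching the 10 x 10 km grid


def block_id(x_m, y_m, block_size=BLOCK_SIZE_M):
    """Map projected coordinates (in metres) to a grid-cell index."""
    return (math.floor(x_m / block_size), math.floor(y_m / block_size))


def spatial_block_split(points, test_fraction=0.05, seed=0):
    """Split point indices into train/test by holding out whole blocks.

    points: list of (x_m, y_m) tuples in a projected CRS (assumption).
    Returns (train_idx, test_idx); no block contributes to both lists.
    """
    blocks = sorted({block_id(x, y) for x, y in points})
    rng = random.Random(seed)
    rng.shuffle(blocks)

    # Hold out at least one block, otherwise roughly test_fraction of them.
    n_test = max(1, round(test_fraction * len(blocks)))
    test_blocks = set(blocks[:n_test])

    train_idx, test_idx = [], []
    for i, (x, y) in enumerate(points):
        target = test_idx if block_id(x, y) in test_blocks else train_idx
        target.append(i)
    return train_idx, test_idx


# Made-up survey locations spread over three blocks.
example_points = [
    (500.0, 500.0), (9_999.0, 100.0),        # block (0, 0)
    (10_001.0, 100.0),                       # block (1, 0)
    (25_000.0, 31_000.0), (25_500.0, 31_500.0),  # block (2, 3)
]
train_idx, test_idx = spatial_block_split(example_points)
```

Holding out whole blocks rather than individual surveys reduces the chance that a test survey has a near-duplicate neighbour in the training set, which would otherwise inflate evaluation scores.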
Presence-Only (PO) occurrences
A PO occurrence is a geolocated species observation made under an unknown sampling protocol, so it provides no information on species absences. Sampling effort is highly heterogeneous in space, time, and across species: most PO records come from citizen science, are concentrated in accessible or populated areas, and focus on charismatic or easy-to-identify species. Nevertheless, PO data helps compensate for gaps in PA survey coverage, provided that models control for sampling bias.
- Size: ~5 million records for 9,709 plant species (2017–2021)
- Source: 13 pre-selected datasets from GBIF
Table 1. Presence-Only dataset sources. The selected GBIF datasets cover 38 European countries. "Uniq. species" is the number of species that occur in that dataset and in none of the others.
GBIF Dataset Name | Records | Species | Uniq. species |
---|---|---|---|
Pl@ntNet Observations + Pl@ntNet Occurrences | 2,298,884 | 4,631 | 295 |
Danmarks Miljøportals Naturdatabase | 691,313 | 1,457 | 14 |
iNaturalist Research-grade Observations | 625,681 | 7,496 | 1,754 |
Norwegian Species Observation Service | 601,101 | 2,243 | 167 |
Observation.org | 241,205 | 5,108 | 429 |
Non-native plant occurrences in Flanders/Brussels | 178,544 | 1,464 | 134 |
Artportalen (Swedish Species Observation System) | 163,513 | 2,771 | 464 |
National Plant Monitoring Scheme U.K. | 120,413 | 1,109 | 11 |
Vascular plant records via iRecord | 103,213 | 2,179 | 99 |
Swiss National Databank of Vascular Plants | 49,173 | 58 | 2 |
Invazivke - Invasive Alien Species in Slovenia | 4,171 | 60 | 1 |
Masaryk University - Herbarium BRNU | 2,586 | 1,321 | 122 |
GeoPlant PO data (Combined) | 5,079,797 | 9,709 | --- |
Environmental Predictors
Environmental predictors describe the conditions at each observation site and are the inputs from which models learn species-environment relationships.
Each observation (PO or PA) is accompanied by:
- A 4-band 128×128 satellite image at 10 m resolution (Sentinel-2)
- Time series of 6 satellite bands (Landsat; R, G, B, NIR, SWIR1, SWIR2; 1999–2020)
- Environmental rasters at European scale: climate, soil, elevation, land use, human footprint
- Monthly climatic rasters (CHELSA; 4 variables, 2000–2019)
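As a quick sanity check on the sizes implied by the list above, the following sketch works out the ground footprint of one Sentinel-2 patch and the number of monthly steps in the CHELSA series. The figures come from the descriptions above. The array layouts (channel-first image, variables-by-months climate series) and the inclusive 2000–2019 range are assumptions made for illustration, not the documented file format.

```python
# Sentinel-2 patch: 4 bands (R, G, B, NIR), 128 x 128 pixels at 10 m resolution.
PATCH_SIZE_PX = 128
PIXEL_SIZE_M = 10
N_IMAGE_BANDS = 4

patch_side_m = PATCH_SIZE_PX * PIXEL_SIZE_M   # side length on the ground, in metres
patch_area_km2 = (patch_side_m / 1000) ** 2   # footprint area, in km^2

# Channel-first layout is an assumption; the files may be organised differently.
image_shape = (N_IMAGE_BANDS, PATCH_SIZE_PX, PATCH_SIZE_PX)

# CHELSA monthly climate: 4 variables, 2000-2019 (assumed inclusive).
FIRST_YEAR, LAST_YEAR = 2000, 2019
N_CLIMATE_VARS = 4

n_years = LAST_YEAR - FIRST_YEAR + 1
n_months = n_years * 12

# Variables-by-months layout is likewise an assumption.
climate_series_shape = (N_CLIMATE_VARS, n_months)

print(f"Patch footprint: {patch_side_m} m x {patch_side_m} m")
print(f"Monthly climate steps: {n_months} over {n_years} years")
```

Each image therefore covers a square of 1,280 m on a side around the observation, and the monthly climate series has 240 time steps per variable, consistent with the 20-year span.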
For full details and variable lists, see the Predictors & Modalities page.
For data download and file structure, see the Resources page.
Please cite the GeoPlant paper if you use or redistribute this dataset.