Third International Scientific and Practical Conference “Eagles of the Palearctic: Study and Conservation”

Raptors Conservation. Suppl. 2. Proceedings of Conferences

Open Data Sources for Species Distribution Modelling: Biodiversity Information Systems and Spatial Datasets of Environmental Conditions Variables

Shashkov M.P. (Karaganda Buketov University, Karaganda, Kazakhstan)

Maxim Shashkov
Recommended citation: Shashkov M.P. Open Data Sources for Species Distribution Modelling: Biodiversity Information Systems and Spatial Datasets of Environmental Conditions Variables. – Raptors Conservation. 2023. S2: 358–362. DOI: 10.19074/1814-8654-2023-2-358-362 URL:

BIOCLIM, the first algorithm for species distribution modelling (SDM, also called habitat modelling), was developed in the 1980s. This area of population and ecological research gained prominence as computers became widely available, the Internet developed, and open resources emerged that provide access to data on species occurrences and environmental factors. Most SDM algorithms (with the exception of the first "bioclimatic envelope" methods) are based on regression analysis and machine learning. The most commonly used today is MaxEnt, a maximum entropy method. All methods share the same objective: to reveal quantitative relationships between occurrences of the focal species and the values of environmental variables where the species occurs, and then to extrapolate the resulting patterns across the entire study area. The result is an assessment of habitat suitability (probability of occurrence) for the species within the study area.
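The "bioclimatic envelope" idea mentioned above can be illustrated with a minimal sketch. It is not the original implementation: it assumes a simple rectilinear envelope in which a cell counts as suitable when every predictor value falls within the percentile range observed at the occurrence points. The function names and the 5th/95th percentile defaults are illustrative assumptions.

```python
def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an already sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def fit_envelope(samples, lower_pct=5.0, upper_pct=95.0):
    """Build one (low, high) range per predictor.

    `samples` holds one tuple of predictor values per occurrence point.
    The percentile limits are an assumption of this sketch.
    """
    envelope = []
    for j in range(len(samples[0])):
        values = sorted(row[j] for row in samples)
        envelope.append((percentile(values, lower_pct),
                         percentile(values, upper_pct)))
    return envelope


def is_suitable(cell_values, envelope):
    """A cell is inside the envelope if every predictor is within its range."""
    return all(low <= v <= high
               for v, (low, high) in zip(cell_values, envelope))


# Hypothetical occurrences: (mean annual temperature, annual precipitation)
occurrences = [(2.0, 300.0), (4.0, 350.0), (6.0, 400.0)]
envelope = fit_envelope(occurrences, lower_pct=0.0, upper_pct=100.0)
inside = is_suitable((5.0, 320.0), envelope)
outside = is_suitable((8.0, 320.0), envelope)
```

Regression and machine-learning methods replace this binary in/out rule with a fitted response to each predictor, which is why they yield a graded suitability estimate.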

Species distribution modelling methods are implemented as standalone software products (MaxEnt), as modules for GIS (for QGIS, SDMToolbox for ArcGIS, etc.), and as packages for the R environment (dismo, biomod2, ENMTools, etc.).

Any species distribution modelling method requires two types of input data: (1) occurrences of the focal species, represented as a set of points with geographic coordinates; and (2) environmental variables (predictors) that may be relevant to the species' distribution, in the format of continuous raster layers.

Considerable advances in the digitization of scientific collections around the globe and the development of other sources of species distribution data have made it possible for researchers to significantly augment their own data and develop more accurate models. Such data are available through thematic repositories, the largest of which is the Global Biodiversity Information Facility (GBIF), which currently provides over 2.5 billion occurrences, two-thirds of which relate to birds. Along with data derived from scientific collections, GBIF hosts data from multiple citizen science systems. The largest of these is eBird, with 1,277.5 million observations. The iNaturalist system has about 20 million bird observations. A much smaller fraction of data comes from biological collections (8.5 million) and automatic observation systems (camera traps and satellite trackers, 9.5 million). For Kazakhstan, GBIF has accumulated 195,000 bird observations, which originate, in addition to the sources mentioned above, from observation systems including Raptors of the World and RU-BIRDS.RU.
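As an illustration of how such records might be requested, the sketch below builds a query URL for the public GBIF occurrence search API. It only constructs the URL and does not contact the service. The species and country code are illustrative choices, and the helper name `build_occurrence_query` is hypothetical.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_occurrence_query(scientific_name, country_code, limit=300, offset=0):
    """Return a search URL for georeferenced records of one species.

    `limit` and `offset` page through the results; the example values
    below are assumptions made for illustration only.
    """
    params = {
        "scientificName": scientific_name,
        "country": country_code,      # ISO 3166-1 alpha-2 code
        "hasCoordinate": "true",      # keep only records with coordinates
        "limit": limit,
        "offset": offset,
    }
    return GBIF_SEARCH + "?" + urlencode(params)


# Example: Steppe Eagle records from Kazakhstan (illustrative only)
url = build_occurrence_query("Aquila nipalensis", "KZ")
```

For large extracts, the asynchronous download service that GBIF provides is generally a better fit than paging through search results, and it issues a citable DOI for the dataset.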

The volume of available data on the occurrence of a focal species can reach tens of thousands of records, but far fewer are required for modelling; for this reason, data filtering and quality control are important steps. When compiling an input dataset of occurrences, the researcher must consider the biological features of the focal species. In birds, the circumstances in which a particular individual was encountered are important (on the nest, hunting during the nesting period, overwintering, migrating, etc.), as is its age group. It is also necessary to decide which part of the range is used in the model: the breeding range, the wintering grounds, or the area of year-round presence. The occurrence records of the target species should be more or less evenly distributed over the area of interest, should not raise questions regarding identification, and should have a geographical accuracy comparable to the resolution of the predictor layers used.
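Two of the quality-control steps described above, discarding records whose positional accuracy is coarser than the predictor resolution and evening out the spatial distribution, can be sketched as follows. This is a minimal illustration under stated assumptions: the record keys (`lat`, `lon`, `uncertainty_m`) are hypothetical, thinning keeps the first record per grid cell, and records without a stated uncertainty are dropped.

```python
import math


def filter_occurrences(records, cell_size_deg, max_uncertainty_m):
    """Keep at most one sufficiently accurate record per grid cell.

    records           -- list of dicts with 'lat', 'lon', 'uncertainty_m'
    cell_size_deg     -- thinning grid size, in degrees
    max_uncertainty_m -- accuracy limit, chosen to match the raster resolution
    """
    kept = {}
    for rec in records:
        unc = rec.get("uncertainty_m")
        # Assumption of this sketch: unknown accuracy is treated as unusable
        if unc is None or unc > max_uncertainty_m:
            continue
        cell = (math.floor(rec["lat"] / cell_size_deg),
                math.floor(rec["lon"] / cell_size_deg))
        kept.setdefault(cell, rec)  # first record in each cell wins
    return list(kept.values())


# Hypothetical records
records = [
    {"lat": 49.80, "lon": 73.10, "uncertainty_m": 50},
    {"lat": 49.802, "lon": 73.102, "uncertainty_m": 30},   # same cell as above
    {"lat": 50.30, "lon": 73.10, "uncertainty_m": 20000},  # too imprecise
    {"lat": 51.20, "lon": 71.40, "uncertainty_m": 100},
]
thinned = filter_occurrences(records, cell_size_deg=0.5, max_uncertainty_m=1000)
```

Filters for season, behaviour, and age group would be applied in the same pass, using whichever fields the source dataset actually provides.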

The environmental variables most in demand are the bioclimatic data from the WorldClim resource. These data describe long-term average patterns of temperature and precipitation. Information on soil conditions is provided by SoilGrids250m. Layers for land surface classification by habitat type are also available, both qualitative (Global Land Cover 2000) and quantitative (Global 1-km Consensus Land Cover). Remote sensing imagery from the Landsat and Sentinel satellite series is often used as a source of predictors. Both individual image channels and layers with indices calculated from them (e.g., the Normalized Difference Vegetation Index, NDVI) can be included in the analysis. The SRTM (Shuttle Radar Topography Mission) digital surface model is also in wide use.
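The NDVI mentioned above is computed per pixel from the near-infrared and red channels as (NIR - Red) / (NIR + Red). The sketch below applies that formula to plain Python lists of reflectance values. Reading the actual image bands is outside its scope, and the function names are illustrative.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns None where both reflectances are zero (no signal), since the
    ratio is undefined there.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def ndvi_layer(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2-D bands."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Hypothetical 2 x 2 reflectance tiles
nir_band = [[0.5, 0.4], [0.0, 0.3]]
red_band = [[0.1, 0.4], [0.0, 0.1]]
layer = ndvi_layer(nir_band, red_band)
```

Values close to 1 indicate dense green vegetation, values near 0 indicate bare ground, and negative values are typical of water.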

It is important to test the predictors for multicollinearity, as strongly correlated factors introduce uncertainty into the resulting model. The test is performed on the set of values that spatially correspond to the species occurrences, rather than on the entire area of the layers. Of two correlated layers, the one usually retained is the less environmentally dependent one, the one for which the working hypothesis is being tested, or the one that allows the results to be compared with other studies. It is recommended that absolute values of the correlation coefficient above 0.7 be treated as critical.

Predictor selection should be based on the biology and ecology of the focal species. For some species, topography may be important: not only elevation but also, for example, slope steepness. For species associated with wetlands, it is important to include layers related to the hydrological network. The influence of factors can be both direct and indirect. For example, a particular bird species nests in an area with a certain range of mean annual temperatures, but at the local level it chooses habitats rich in food resources, which in turn may be associated with certain soil characteristics or vegetation types. Therefore, initial model builds typically use many layers of environmental variables in order to identify the significant factors and the nature of their influence on the probability of encountering the target species. Usually, no more than ten predictors remain in the final model. To build a good-quality model, there must be at least ten occurrence points of the focal species for each predictor.
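The multicollinearity check and the sample-size rule of thumb described above can be sketched as follows. This assumes the Pearson coefficient as the correlation measure and takes as input the predictor values already extracted at the occurrence points. The layer names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant layer carries no correlation information
    return cov / (sx * sy)


def correlated_pairs(values_at_points, threshold=0.7):
    """List predictor pairs whose |r| at the occurrence points exceeds threshold."""
    names = sorted(values_at_points)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(values_at_points[a], values_at_points[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


def max_predictors(n_occurrences, points_per_predictor=10, cap=10):
    """Rule of thumb: ten occurrence points per predictor, at most ten predictors."""
    return min(cap, n_occurrences // points_per_predictor)


# Hypothetical predictor values sampled at five occurrence points
values_at_points = {
    "bio1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "elevation": [500.0, 400.0, 300.0, 200.0, 100.0],
    "precip": [300.0, 320.0, 290.0, 310.0, 300.0],
}
flagged = correlated_pairs(values_at_points)
```

Which member of a flagged pair to drop remains an ecological decision; the code only identifies the candidates.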