DWD SQLite database archive

About

This page contains information about SQLite database files produced by the phenodata program and its export-observations-all subcommand.

Prior art

Jonas Dierenbach, Franz-W. Badeck, and Jörg Schaber presented PPODB, the Plant-Phenological Online Database, that provides unrestricted and free access to over 16 million plant phenological observations, mostly in Germany.

Its PPODB SQL interface offers online SQL access to the database, running on premises of the Institute for Theoretical Biophysics at the Humboldt University.

In the same spirit, we are producing and publishing SQLite database files about the same matter, for everyone to run database queries on their own hardware.

See also

In order to learn the important details about PPODB, we recommend reading the PPODB summary. More details can be discovered within the scientific paper The plant phenological online database (PPODB) » an online database for long-term phenological data, and the Plant-Phenological Online Database (PPODB) handbook.

Data source

Data and metadata is acquired from the DWD CDC Open Data Server, specifically the observations_germany/phenology and help directories.

SQLite database files

The data folder contains four SQLite database files. Before downloading and using them, please recognize the DWD data copyright notices referenced below at Data copyright.

  • phenodata-dwd-annual-historical.sqlite (1.7 GB)

  • phenodata-dwd-annual-recent.sqlite (24 MB)

  • phenodata-dwd-immediate-historical.sqlite (90 MB)

  • phenodata-dwd-immediate-recent.sqlite (5.5 MB)

Usage

Getting started

Let’s start by defining the database name, and downloading the file.

export DBPATH=phenodata-dwd-immediate-historical.sqlite
wget --no-clobber "https://phenodata.hiveeyes.org/data/dwd/${DBPATH}"

Inquire the database schema.

# Display all tables and views.
sqlite3 "${DBPATH}" '.tables'

# Display schema of all tables and views.
sqlite3 "${DBPATH}" '.fullschema --indent'

# Display database metadata information.
sqlite3 "${DBPATH}" -csv -header 'SELECT * FROM dwd_about'

The database about historical observations from immediate reporters contains ~250,000 records as of 2023.

sqlite3 "${DBPATH}" 'SELECT COUNT(*) FROM dwd_phenology;'
252378

It spans the time range of observations between 1979 and 2021.

sqlite3 "${DBPATH}" 'SELECT MIN(reference_year), MAX(reference_year) FROM dwd_phenology;'
1979|2021

Run a query on the dwd_phenology view, with output in CSV format.

sqlite3 -csv -header "${DBPATH}" 'SELECT * FROM dwd_phenology ORDER BY date;'

The same query, but more suitable when aiming to write your query using multiple lines, for example within a program or script file.

sqlite3 -csv -header "${DBPATH}" <<SQL
SELECT * FROM dwd_phenology ORDER BY date;
SQL

PPODB examples

Those examples have been taken from the PPODB SQL interface page, and slightly adjusted to use the DWD/Phenology/SQLite database schema.

To invoke those queries, start an interactive shell using sqlite3:

sqlite3 "${DBPATH}" -csv -header

At first, you usually want to get an overview over the database and list all available tables:

.tables

Often, you want to check whether a certain table contains the information you are interested in. Therefore, you want to have a quick overlook over the columns in the table of interest:

.schema dwd_phase --indent

Stations and observations are uniquely referenced by identifiers. Therefore, it is safer and more efficient to access phenological observations by their identifiers rather than by names. The identifiers of all stations with a name similar to “Geisenheim” can be retrieved with the query:

SELECT station_id, station_full
FROM dwd_phenology
WHERE station_full LIKE '%Geisenheim%';

Within the database, a combination of plant and phase is referenced by a single unique identifier, which is handy:

SELECT *
FROM dwd_phenology
WHERE
    species_name_en LIKE '%hazel%' AND
    phase_name_en LIKE '%flowering%';

With a station-id and a phase-id, you can efficiently retrieve time series, e.g. flowering of hazel at Geisenheim (DWD):

SELECT day_of_year, reference_year, source, species_name_en
FROM dwd_phenology
WHERE
    station_id=19476 AND
    species_id=113 AND
    phase_id=5
ORDER BY reference_year, day_of_year;

Contrary to PPODB’s recommendation, we think it is acceptable to use human-readable labels for querying. If you will discover this to be a bottleneck for your application, please consider adding additional indexes:

SELECT day_of_year, reference_year, source, species_name_en
FROM dwd_phenology
WHERE
    station_full LIKE '%Geisenheim%' AND
    species_name_en LIKE '%hazel%' AND
    phase_name_en LIKE '%flowering%'
ORDER BY reference_year, day_of_year;

You can also ask more complex questions, e.g. which of the following plants flowered earliest after 1951 on average, hazelnut, chestnut or birch?:

SELECT AVG(day_of_year) mean, reference_year, species_name_en plant, phase_name_en phase
FROM dwd_phenology
WHERE phase_name_en LIKE '%flowering%'
GROUP BY phase_id ORDER BY mean;

or, e.g. how many single station time series are there that have a certain length including all lengths and phases (see Figure 1 in the documentation)?:

SELECT c, COUNT(c) FROM
    (
    SELECT station_id AS sid, phase_id AS pid, COUNT(DISTINCT reference_year) AS c
    FROM dwd_phenology
    WHERE phase_id != 0 GROUP BY station_id, phase_id
    )
AS sq GROUP BY c;

Specialist’s toolbox

At phenological calendar for foraging plants, we are discussing the development of a convenient phenological calendar for beekeepers. Here, we are presenting corresponding database queries suitable for that purpose.

In order to query the database for multiple plants conveniently, there is the dwd_species_group table, derived from phenodata’s presets.json file. The statement below uses the group mellifera-de-primary-openhive, to list all observations of “flowering” events for primary foraging plants of honeybees (apis mellifera), filtering by location on behalf of the synthesized station_full field:

SELECT
    reference_year,
    day_of_year,
    source,
    species_name_de,
    phase_name_de,
    station_name
FROM dwd_phenology_group
WHERE true
    AND group_name = 'mellifera-de-primary-openhive'
    AND phase_name_en LIKE '%flowering%'
    AND station_full LIKE '%brandenburg%';

In order to list the available plant group names, query the dwd_species_group table:

SELECT
    dwd_species.*
FROM dwd_species_group, dwd_species
WHERE true
    AND dwd_species_group.species_id=dwd_species.id
    AND group_name='mellifera-de-primary-openhive';
id,species_name_de,species_name_en,species_name_la
205,Winterraps,"winter oilseed rape","Brassica napus var. napus"
209,Sonnenblume,sunflower,"Helianthus annuus"
215,Mais,maize,"Zea mays"
310,Apfel,apple,"Malus domestica"
320,Birne,pear,"Pyrus communis"
330,"Süßkirsche",cherry,"Prunus avium"
340,Sauerkirsche,morello,"Prunus cerasus"
382,Himbeere,raspberry,"Rubus idaeus"
383,Brombeere,blackberry,"Rubus fructicosus"
113,Hasel,"common hazel","Corylus avellana"
114,Heidekraut,"common heather","Calluna vulgaris"
120,"Löwenzahn",dandelion,"Taraxacum officinale"
121,Robinie,"black locust","Robinia pseudoacacia"
122,Rosskastanie,"horse chestnut","Aesculus hippocastanum"
124,Sal-Weide,"goat willow","Salix caprea"
131,Spitz-Ahorn,"Norway maple","Acer platanoides"
137,Winter-Linde,"small leafed lime","Tilia cordata"

Note

If you have a different use case, or think the existing species groups should be expanded, do not hesitate to drop us a line by creating an issue, in order to propose changes to the dwd_species_group table.

Attributions

Acknowledgements

Thanks to the many observers of »Deutscher Wetterdienst« (DWD), the »Global Phenological Monitoring programme« (GPM), and all people working behind the scenes for their commitment on recording observations and making the excellent datasets available to the community. You know who you are.

Reproduce

This section explains how to export all available datasets into corresponding SQLite database files, on your own machine, using the export-observations-all subcommand.

The process will take about five to ten minutes, based on the capacity of your computing device. Processing the immediate/historical+recent and annual/recent data is pretty fast. The annual/historical data however, as the largest one with a size of ~1.7 GB, takes the major share of computing time on the export operation.

phenodata export-observations-all --source=dwd

The command will create four SQLite database files, they can be consumed using the sqlite3 command, or other tools.

Tip

If you want to create database files by selecting individual subsets of the data, please refer to the SQLite database export documentation.

Note

The cache directory, for example located at /Users/<username>/Library/Caches/phenodata on macOS machines, will hold all the data downloaded from DWD servers. It is about 160 MB in size for both of the “recent” datasets, while immediate-historical weighs in with about 500 MB, and annual-historical with about another 3 GB.

Upload

rsync -azuv phenodata-dwd-*.sqlite root@elbanco.hiveeyes.org:/var/lib/phenodata/dwd

Backlog

Todo

  • [o] Publish using datasette

  • [o] Publish using Grafana SQLite Datasource

  • [o] Outline other end-user tools to consume the databases

  • [o] Implement phenodata.open_database("dwd", "immediate", "recent") to consume the databases


Enjoy your research.