
Publication

Oral presentation, Henry Engledow: Data cleaning, an iterative process: lessons learned from the second mass digitization project at Meise Botanic Garden

Book Contribution - Chapter

The second mass digitization project at Meise Botanic Garden (Belgium), DOE2, finished in 2021. Most of the label transcription was done by a contracted third party. Although the transcribed data were delivered to a quality-controlled, high standard, extensive data cleaning was still needed before they could be imported into our database. The poor data quality was the result of four main factors: (1) the herbarium being digitized is not as well curated as the one in the previous project; (2) the content of the collection is poorly known; (3) specimen labels are often handwritten and barely legible; and (4) the information on the labels is often ambiguous, unclear or absent. These issues led to a dataset that needed extensive cleaning. As the dataset comprises some 1,200,000 entries, records could be compared within it: fields were sorted and grouped, allowing us to normalise and link data. This first step removed the most obvious errors; semi-automated tools were tried, but they proved to be more work than help. Secondly, certain absent data important to the collection and to data cleaning, such as the country of origin, could be deduced from other label information, such as the locality. This required someone to interpret and group the records lacking a country code; again, a semi-automated approach was attempted, but with a high level of uncertainty. The more data one has, the better one is able to clean other data; for example, an ambiguous collector name can be resolved using the country and year of the collecting event. Certain linked data, such as taxon and collector, are also important in our database. In many cases these data were missing from our database and needed to be created and linked before importing. Once the data are close to complete, one enters the final stage, which looks for logical inconsistencies in the data, such as a collecting date that falls outside the collector's life span. Data cleaning is an iterative process: at first the image is out of focus, but with each iteration the picture becomes clearer.
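Two of the steps described above, the sort-and-group pass used to spot variant values and the final-stage check for collecting dates outside a collector's life span, lend themselves to simple scripting. The sketch below is a hypothetical illustration only, not the project's actual tooling; the file name and column names (collector, collection_date, collector_birth_year, collector_death_year) are invented for the example, and it assumes the transcriptions have been loaded into a pandas DataFrame.

```python
# Minimal sketch, assuming a CSV export of transcribed label data with
# hypothetical columns; not the pipeline used in the DOE2 project.
import pandas as pd

records = pd.read_csv("doe2_transcriptions.csv", parse_dates=["collection_date"])

# Step 1 (sort and group): count occurrences of each collector string so that
# variant spellings of the same name surface next to each other and can be
# normalised and linked by hand.
collector_variants = (
    records.groupby("collector")
           .size()
           .sort_values(ascending=False)
)
print(collector_variants.head(20))

# Final stage (logical consistency): flag records whose collecting year falls
# outside the collector's recorded life span.
year = records["collection_date"].dt.year
suspect = records[
    (year < records["collector_birth_year"]) |
    (year > records["collector_death_year"])
]
print(f"{len(suspect)} records with a collecting date outside the collector's life span")
```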
Book: Society for the Preservation of Natural History Collections (SPNHC) - Annual Meeting 2022
Publication year: 2022
Accessibility: Open