How to transform a complex project-specific web application into a large static site

Workshop at DCMI 2025, Barcelona, 2025-10-25

Joachim Neubert

Workshop Agenda

  1. 14:00 Experiences from the PM20 project
  2. ~15:00 Discussion
  3. ~15:40 Preparation for peer consulting
  4. 16:00 Coffee break (16 - 16:30 h)
  5. 16:30 Peer consulting

The 20th Century Press Archives

by Max-Michael Wannags (Wikimedia Commons)

The press archives (cont.)

  • roughly 20 million newspaper clippings
  • collected 1909 - 2005
  • from all over the world
  • organized in thematic folders
  • inherited by ZBW - Leibniz Information Centre for Economics

Outline of the presentation

  • Static site migration
    • Legacy starting point
    • Aims and design guidelines
    • Implementation
  • Wikidata integration
    • Data donation
    • Search and other extensions via Wikidata

Digitization project

  • 2004-2007, funded by German Research Foundation (DFG)
  • digitization of roll films (1908-1960) plus the paper clippings in the persons archives (dossiers starting before 1949)
  • electronic reconstruction of dossiers (folders)
  • tedious work, particularly due to intellectual property law
  • kept up until 2018, resulting in:
    • total 2 million indexed pages
    • plus 4 million “raw” digital images

Legacy application (up to 2023)

Specialized application for sophisticated discovery and access, architecturally outdated and expensive to maintain

Aims (from an institutional POV)

  • drastically reduce maintenance effort and cost
    • outsource operation to hosting provider
  • make accumulated knowledge useful beyond our own web site
      => CC0 license for all metadata
  • make metadata findable, extendable and maintainable by community
      => data donation to Wikidata

Technical design guidelines

  • should work “as is” for the next 10, 20, 40, … years
  • avoid “moving parts” as much as possible
    • dynamically created content (server-side via PHP/Java/Node.js/… or client-side via JavaScript frameworks) makes web sites brittle
    • OS dependencies require updates of, e.g., PHP or database versions
    • security flaws make updates mandatory

design guidelines (cont.)

  • plain HTML files, richly interlinked
  • no database, no page generation at runtime
  • avoid non-essential abstraction layers
  • provide clean and long-term-reliable URLs for every web page and every digitized image

design guidelines (cont.)

  • relying on Apache (mostly .htaccess) magic for
    • access restrictions based on the request IP address
    • rewriting to clean, persistent URLs (sketch below)
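
A minimal sketch of what such rules can look like (not the actual PM20 configuration; the network range and paths are invented for illustration):

    # .htaccess in a directory with access-restricted images:
    # serve them only to requests from an allowed network range
    Require ip 192.0.2.0/24

    # .htaccess at the document root: map a clean, persistent folder URL
    # to the generated static HTML file
    RewriteEngine On
    RewriteRule ^folder/pe/([0-9]{6})$ /folder/pe/$1.html [L]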

Altogether
=> the web site can be considered frozen at any time

Expectations and how things work out …

Two major changes 2023

  • intellectual property law changes in the EU and Germany permitted cultural heritage institutions to publish “archival units” of out-of-commerce material with mixed content
    • that allowed ZBW to grant access (within EU) to many thousands of digitized roll films with millions of pages without evaluation of the IPR status of every single page
  • ongoing cooperation between ZBW and “Wikiprojekt Pressearchiv” to extend metadata coverage for the new material

Page and navigation examples

PM20 Homepage

Implementation

Step 1: Generate JSON-LD files from database

  • the processing in this step is based on a closed, historically evolved database, now running on Wikimedia infrastructure, which I consider out of scope for this talk
  • JSON-LD as a file format and interface was chosen because
    • web developers are familiar with JSON
    • LOD freaks - like myself - are happy with semantics in RDF
    • the JSON-LD context file provides better documentation of the meaning of data elements, which benefits all (snippet below)
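
A tiny illustrative snippet of the idea (not the actual PM20 data layout; field names and URIs are invented):

    {
      "@context": {
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "prefLabel": { "@id": "skos:prefLabel", "@container": "@language" }
      },
      "@id": "https://pm20.example.org/folder/pe/000012",
      "prefLabel": { "en": "Example person", "de": "Beispielperson" }
    }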

JSON-LD dataset available at Zenodo

overall flow (again)

Step 2: Generate pages in markdown format from JSON-LD

  • produces a large number of Markdown files for folder, alphabetical index, country and topic index pages
    • implemented with Perl scripts (see the sketch below)
    • needs to be executed only if the archive’s metadata has been extended
  • highly dataset-specific
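
A minimal sketch of this step (not the actual PM20 code; file layout and field names are assumptions, reusing the snippet above):

    #!/usr/bin/perl
    # read a folder's JSON-LD file and emit a Markdown page with a YAML header
    use strict;
    use warnings;
    use JSON;
    use Path::Tiny;

    my $folder = decode_json( path('jsonld/pe/000012.jsonld')->slurp_raw );

    my $md = "---\ntitle: \"$folder->{prefLabel}{en}\"\n---\n\n";
    $md .= "German label: $folder->{prefLabel}{de}\n" if $folder->{prefLabel}{de};

    path('markdown/pe/000012.md')->spew_utf8($md);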

Code available at Codeberg

overall flow (again)

Step 3: Create HTML from markdown pages

  • implemented through Pandoc with an HTML page template (sketch below)
  • executed by Make
  • overall controlled by Bash scripts
  • this part of the process is highly generic
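
A minimal sketch of the conversion, written here as a plain Bash loop (not the actual PM20 scripts; paths and the template name are assumptions) - in the real setup, Make rebuilds only pages whose sources changed:

    #!/bin/bash
    # convert every generated Markdown page to HTML with a shared page template
    find markdown -name '*.md' | while read -r md; do
      html="html/${md#markdown/}"        # markdown/pe/000012.md -> html/pe/000012.md
      html="${html%.md}.html"            #                       -> html/pe/000012.html
      mkdir -p "$(dirname "$html")"
      pandoc --template=template.html --from markdown --to html5 \
             --output "$html" "$md"
    done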

Template and Makefile available at Codeberg

Result of the static page generation

  • overall runtime for recreating all pages less than 4 hours
  • total number of HTML files 143,000
  • total number of JSON files 2,118,000 (for IIIF image display)
  • HTML pages are easily grasped by search engines; no continuous SEO effort necessary
  • robust against reckless genAI crawlers

Image viewing

  • implemented in two rather standardized ways:
    • DFG viewer service, consuming METS/MODS (XML) files
      mandatory for funded digitization projects in Germany
    • IIIF viewers consuming JSON-LD files
  • plus a custom solution in plain PHP for digitized roll films

Folder view in Mirador IIIF viewer

IIIF infrastructure for folders

  • static manifest.json (IIIF Presentation API 3.0) (example)
  • generated by scripts and templates during step 2 (“business logic”) of the process described above
  • for each image, an info.json (IIIF Image API 3.0) with the available resolutions (example; sketch below)
  • Mirador viewer downloaded from the PM20 site
    • but: manifests work with other viewers (UV example)
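
A sketch of such a static info.json (a “level 0” image service; the URL and sizes are illustrative, not actual PM20 values):

    {
      "@context": "http://iiif.io/api/image/3/context.json",
      "id": "https://pm20.example.org/iiif/image/000012_0001",
      "type": "ImageService3",
      "protocol": "http://iiif.io/api/image",
      "profile": "level0",
      "width": 6000,
      "height": 8000,
      "sizes": [
        { "width": 750,  "height": 1000 },
        { "width": 1500, "height": 2000 },
        { "width": 6000, "height": 8000 }
      ]
    }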

More documentation on our use of static IIIF

DFG viewer infrastructure for folders

  • public viewer service of SLUB Dresden is invoked (example)
  • static METS/MODS XML files, generated like the IIIF files

  • both methods are offered on the folder pages, deliberately
  • image viewing has moving parts; it will break
    • the JavaScript of the aging Mirador version may be blocked by browsers
    • the DFG Viewer service may be discontinued
  • redundancy => resilience

Wikidata cooperation

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.

Wikidata acts as central storage for the structured data of Wikipedia and other projects, within and beyond the Wikiverse.

Example item

Important aspects for our project


When all PM20 folders are connected to Wikidata items,
we gain …
  • current context information from Wikipedia
    • without own maintenance effort
  • links to Wikidata item as hub into the Web of Data
    • additional discovery and access path from Wikidata
  • sharing our metadata and making it more helpful

PM20 data donation to Wikidata

Formally announced with Wikimedia Deutschland in 2019
  • process continuously discussed with Wikidata community
  • several properties introduced, e.g. “PM20 folder ID”
  • tedious, laborious work to match PM20 folders to Wikidata items
    • Wikidata community provides tools for that, such as “Mix’n’Match” or “OpenRefine”

data donation (cont.)

Carefully and intellectually verified links then allowed us to automatically …
  • creating new items (particularly historical companies)
  • adding synonyms to improve findability
  • adding data elements, such as birth dates for persons
  • adding relations between items, e.g. between a company and a person on its board

Search: the donation pays back

  • people want to search sites comfortably, using synonyms, truncation, …
  • single hardest problem for a large static site
    • Lucene, Elasticsearch and the like massively increase complexity
    • Google search with “site:{domain}” ??
  • Wikidata has our metadata! Our synonyms - and many more synonyms! An open search interface!

Search implemented via Wikidata

How does that work?

  • the search link opens a page on the wikimedia tool server
  • a simple PHP script takes the search box input,
  • fills it into a SPARQL query which uses Wikidata’s full-text index (sketch below),
  • runs the query and formats the result as links to PM20 folders
  • the synonyms are in Wikidata - contributed by ZBW’s data donation and many others
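
A rough sketch of the kind of query involved (simplified; the “PM20 folder ID” property is assumed here to be P4293, and the search term is just an example):

    SELECT DISTINCT ?item ?itemLabel ?folderId WHERE {
      # full-text search via the MediaWiki API service
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                        wikibase:api "Search" ;
                        mwapi:srsearch "Krupp" .
        ?item wikibase:apiOutputItem mwapi:title .
      }
      # keep only items linked to a PM20 folder
      ?item wdt:P4293 ?folderId .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
    }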

pm20-search repo on Wikimedia Gitlab

Ways of expanding the static site in Wikidata

  • reports created via the Wikidata query interface - e.g.
    • map of PM20 economists by place of birth (sketch below)
    • companies by NACE industry classification
  • links from Wikidata items to specific, not-yet-indexed material - e.g. the 1920 Schleswig plebiscite
  • one place where new and corrected data about the topic of a folder can be collected
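
As an illustration of the first report mentioned above, a sketch of an economists-by-birthplace map query (P106 = occupation, Q188094 = economist, P19 = place of birth, P625 = coordinates; “PM20 folder ID” again assumed to be P4293):

    #defaultView:Map
    SELECT ?person ?personLabel ?coord WHERE {
      ?person wdt:P4293 ?folderId ;      # has a PM20 folder
              wdt:P106 wd:Q188094 ;      # occupation: economist
              wdt:P19  ?birthPlace .
      ?birthPlace wdt:P625 ?coord .      # coordinates of the birthplace
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
    }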

sparql-queries repo on Codeberg, pm20-report repo on WM Gitlab

Expanding the static site via Wikidata as a strategy

  • of course, this fundamentally depends on agreement in the community that the data is useful in Wikidata
  • everybody can contribute without authorization or formal access restrictions
  • e.g., when somebody has researched material about a certain topic on the digitized roll films, they can add PM20 film section links

Conclusion

  • for other sites, challenges and solutions will be different
  • it could be easier if long-term maintainability were made a priority at the start of a project

Acknowledgements

Thanks to the colleagues of the Wikiproject Pressearchiv.

Thanks in particular to Max-Michael Wannags, who worked on the digitized archives for many years up to his retirement and has since, almost single-handedly, created detailed folder indexes for hundreds of thousands of digitized pages.

Thank you, and I look forward to further questions and discussion

Joachim Neubert (Wiki user: Jneubert)

This presentation: https://jneubert.de/slides/website-dcmi2025

Auxiliary material

Aside: Some sources on LOUD - Linked Open Usable Data

The use of JSON-LD for interoperability between independently developed, linked systems has been described at

Side note: Why not a static site generator?

  • successfully used in other projects, in many different flavours
  • PM20 content pages are generated from the database with complex logic anyway
  • an SSG would have added its own complexities

Roll film viewer

Films starting with Land ownership : British India, Roads Bridges : Dutch East Indies

Press Page-Down for navigation links and help (in German)

Conclusion: The case for static sites

=>
  • reduced maintenance cost and dependency on specialized knowledge
  • sustainability and enhanced robustness re. cost cuts or political havoc
  • hopefully, the collections will survive the next decades on the web