How to transform a complex project-specific web application into a large static site
Workshop at DCMI 2025, Barcelona, 2025-10-25
Joachim Neubert
Workshop Agenda
14:00 Experiences from the PM20 project
~15:00 Discussion
~15:40 Preparation for peer consulting
16:00 Coffee break (16:00 - 16:30)
16:30 Peer consulting
From that point onwards, the workshop depends on you. Please consider whether you want to receive feedback on a topic from your project, or would just like to discuss a thought experiment or an open question with us as your peers after the coffee break.
> I will now introduce the press archives collection
> Questions encouraged during talk
The 20th Century Press Archives
by Max-Michael Wannags (Wikimedia Commons)
the paper archives looked like this in an image from 2015
within the boxes, you see folders with clippings on a certain topic,
a person, a company, …
intro press archives (cont.)
roughly 20 million newspaper clippings
collected 1909 - 2005
from all over the world
organized in thematic folders
inherited by ZBW - Leibniz
Information Centre for Economics
to the best of my knowledge, the largest public clippings archive in the world
it is unique in its geographical, thematic and temporal
coverage
worth preserving forever
Remainder of the presentation
Static site migration
Legacy starting point
Aims and design guidelines
Implementation
Wikidata integration
Data donation
Search and other extensions via Wikidata
Implementation with a special look at image viewing
> All this relates to the digitized part of the archives - about one third of the total holdings; another third is on microfiche, and the last third is on paper, now inaccessibly stored in cellar rooms
Digitization project
2004-2007, funded by German Research Foundation (DFG)
digitization of roll films (1908-1960) plus the paper clippings in
the persons archives (dossiers starting before 1949)
electronic reconstruction of dossiers (folders)
tedious work, particularly due to intellectual property law
kept up until 2018, resulted in:
2 million indexed pages in total
plus 4 million “raw” digital images
each image normally containing two clippings
> indexed pages presented in legacy application
Legacy application (up to 2023)
Specialized application for sophisticated discovery and access, architecturally outdated and expensive to maintain
… discovery and access: multi-dimensional search, GND identifiers
web design from the 2000s
implemented in ColdFusion (not conforming to web standards)
Oracle backend database
maintenance contracts with external companies
not affordable any more
particularly with the upcoming retirement of the single person
familiar with the application
Aims (from an institutional POV)
drastically reduce maintenance effort and cost
outsource operation to hosting provider
make accumulated knowledge useful beyond our own web site => CC0 license for all metadata
make metadata findable, extendable and maintainable by the community => data donation to Wikidata
CC0 …: and ZBW does not claim any rights in the digital
copies
> the first aim resulted in a massive technical change
Technical design guidelines
should work “as is” for the next 10, 20, 40, … years
avoid “moving parts” as much as possible
dynamically created content (server-side via PHP/Java/Node.js/… or client-side via JavaScript frameworks) makes web sites brittle
OS dependencies require updates of, e.g., PHP or database versions
security flaws make updates mandatory
I don’t think that’s utopian - we know that sites from the very
first days of the web can still be displayed and navigated by the latest
browsers
design guidelines (cont.)
plain HTML files, richly interlinked
no database, no page generation at runtime
avoid non-essential abstraction layers
provide clean and long-term-reliable URLs for every web page and
every digitized image
> these URLs are the most important human and machine interface
to your data
design guidelines (cont.)
relying on Apache (mostly .htaccess) magic for
access restrictions based on request IP address
rewriting to clean, persistent URLs (a sketch follows below)
Altogether => the web site can be considered frozen at any time
the website code, including the .htaccess files, is online (code
icon)
site would not have been possible without Apache
a bet on the future that HTML/CSS/Apache will stay
Apache was founded in 1995, active developer community, large market
share
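To give an idea of the kind of .htaccess magic involved, here is a minimal sketch of a deployment snippet that writes such rules. The paths, URL pattern and IP range are hypothetical placeholders, not the actual PM20 configuration.

```bash
#!/bin/bash
# Minimal sketch (hypothetical paths and patterns, not the actual PM20 rules):
# write .htaccess files that (a) rewrite clean, persistent folder URLs to the
# generated static HTML files and (b) restrict one subtree by request IP.
DOCROOT=/var/www/pm20          # hypothetical document root

cat > "$DOCROOT/.htaccess" <<'EOF'
RewriteEngine On

# clean URL  /folder/pe/012345  ->  generated static page (hypothetical layout)
RewriteRule ^folder/pe/([0-9]{6})$ /folder/pe/$1/about.en.html [L]
EOF

mkdir -p "$DOCROOT/film"
cat > "$DOCROOT/film/.htaccess" <<'EOF'
# restrict access to digitized films to requests from a given IP range
# (hypothetical range standing in for "within the EU")
Require ip 192.0.2.0/24
EOF
```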
Expectations and how things worked out …
I started the development of this new site in 2019. My expectation, firmly shared by the ZBW directorate, was that the site would be finished and frozen by my scheduled retirement in January 2024 at the latest
Two major changes in 2023
intellectual property law changes in the EU and Germany permitted cultural
heritage institutions to publish “archival units” of out-of-commerce
material with mixed content
that allowed ZBW to grant access (within the EU) to many thousands of digitized roll films with millions of pages, without evaluating the IPR status of every single page
ongoing cooperation between ZBW and “Wikiprojekt Pressearchiv” to
extend metadata coverage for the new material
… mixed content: under a lot of conditions
… single page: this was a major breakthrough - however:
these millions of pages were not indexed
happily, there were a few people who were willing to do the indexing
work and to extend the website accordingly
the “Wikiprojekt Pressearchiv” was founded, and ZBW was willing to accept extended datasets and pull requests for program code
> So what had been expected to be a one-time migration is now a process chain which is executed twice a year
> Before I come to the implementation, I’ll give you a glance at how the site works
Implementation
Step 1: Generate JSON-LD files from database
the processing in this step is based on a closed, historically
evolved database, now running on Wikimedia infrastructure, which I
consider out of scope for this talk
JSON-LD as a file format and interface was chosen because
web developers are familiar with JSON
LOD freaks - like myself - are happy with semantics in RDF
the JSON-LD context file provides better documentation of the meaning of data elements, which benefits all (a tiny made-up example follows below)
JSON-LD dataset available at Zenodo
< I will however show the results
… RDF: and the interoperability it provides
> the JSON-LD dataset constitutes not only the interface between the Wikiprojekt and ZBW, but is at the same time a core part of the Linked Open Usable Data concept. You will find some links on this in the auxiliary material after the end of the presentation.
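To illustrate why the combination works for both groups, here is a tiny, made-up JSON-LD record; the keys and the context mapping are hypothetical and not the actual PM20 context.

```bash
#!/bin/bash
# Hypothetical mini example (not the actual PM20 context or field names):
# the same file is plain JSON for web developers and, via @context, RDF for LOD users.
cat > folder-example.jsonld <<'EOF'
{
  "@context": {
    "folder":  "@id",
    "label":   "http://www.w3.org/2004/02/skos/core#prefLabel",
    "country": "http://purl.org/dc/terms/coverage"
  },
  "folder":  "https://example.org/folder/pe/012345",
  "label":   "Example person",
  "country": "Example country"
}
EOF

# a web developer can simply read the plain keys with jq …
jq -r '.label' folder-example.jsonld
# … while an RDF tool can expand the very same file into triples via the context
```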
overall flow (again)
the rest of the process, steps 2 and 3, runs on the ZBW server
these steps are highly resilient
if at some point ZBW decided that it could no longer keep up this service and another cultural heritage institution were willing to step in, the steps could easily be transferred
published code and metadata, and hopefully harvested digitized images, also provide some technical resilience against fascist attacks on science and knowledge, because the website can be re-erected elsewhere
Step 2: Generate pages in markdown format from JSON-LD
produces a large number of markdown files, for folder, alphabetic index, country and topic index pages (a sketch follows below)
implemented with Perl scripts
needs to be executed only if the archive’s metadata has been extended
highly dataset-specific
Code available at Codeberg
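As a rough illustration of what step 2 does (the real implementation is the dataset-specific Perl code on Codeberg), here is a minimal sketch that turns one hypothetical folder record into a markdown page with YAML front matter.

```bash
#!/bin/bash
# Sketch only: the real step 2 is a set of dataset-specific Perl scripts.
# Turns one hypothetical folder record (JSON) into a markdown page with YAML
# front matter, which the Pandoc step can then render to HTML.
set -e

# hypothetical input record, stand-in for one entry of the JSON-LD dataset
cat > /tmp/folder.json <<'EOF'
{ "folder": "https://example.org/folder/pe/012345", "label": "Example person" }
EOF

id=$(jq -r '.folder' /tmp/folder.json)
label=$(jq -r '.label' /tmp/folder.json)

mkdir -p md/folder
cat > md/folder/example.md <<EOF
---
title: "$label"
folder-url: "$id"
---

# $label

[Open folder]($id)
EOF
```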
overall flow (again)
from these markdown files we create HTML files 1:1
Step 3: Create HTML from markdown pages
implemented through Pandoc with an HTML page template
executed by Make
overall controlled by Bash scripts
this part of the process is highly generic
Template and Makefile available at Codeberg
Pandoc has existed for almost 20 years and is continuously maintained and extended
only ONE Pandoc template is used for all German and English pages, with some logic injected by parameters provided by make (as sketched below)
this makes changes cheap when, e.g., the mail address in the page footer changes for all pages
use of “make” guarantees that the template is applied to any changed file (and only to changed files)
the whole process can be executed by invoking one bash script
This is, of course, a bet on the future that Perl, Pandoc, make and bash will stay available for a long time
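A minimal sketch of what the Pandoc step could look like; the real template and Makefile are on Codeberg, and the file names and metadata variables used here are made up. The timestamp check mimics what make does: only changed pages are rebuilt.

```bash
#!/bin/bash
# Sketch of step 3 (hypothetical paths and metadata names; the real template
# and Makefile are on Codeberg). Make-like behaviour: rebuild only pages whose
# markdown source is newer than the generated HTML.
set -e
TEMPLATE=templates/page.html      # hypothetical Pandoc template file

find md -name '*.md' | while read -r src; do
  out="html/${src#md/}"; out="${out%.md}.html"
  mkdir -p "$(dirname "$out")"
  if [ "$src" -nt "$out" ]; then
    pandoc "$src" --from markdown --to html \
      --template "$TEMPLATE" \
      --metadata lang=en \
      --metadata footer-mail=info@example.org \
      -o "$out"
  fi
done
```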
Result of the static page generation
overall runtime for recreating all pages: less than 4 hours
total number of HTML files: 143,000
total number of JSON files: 2,118,000 (for IIIF image display)
HTML pages are easily grasped by search engines; no continuous SEO work is necessary
robust against reckless genAI crawlers
the rather complex page generation, step 2, is only necessary when we want to deploy data enhancements
overall page changes, e.g. in the footer, involve only step 3 and are much faster
the site is large, but hard-wired with HTML links, technically the most stable navigation mechanism on the web
Image viewing
implemented in two rather standardized ways:
DFG viewer service, consuming METS/MODS (XML) files, which are mandatory for funded digitization projects in Germany
IIIF viewers consuming JSON-LD files
plus a custom solution in plain PHP for digitized roll films
< this was the most challenging part of this endeavour
< users want to browse images, zoom and pan - highly interactive
activities
< in the past, there were hundreds of different implementations,
often with heavy external dependencies
Folder view in Mirador IIIF viewer
IIIF infrastructure for folders
static manifest.json (IIIF Presentation API 3.0) (example)
generated by scripts and templates during step 2 (“business logic”) of the process described above
for each image, an info.json (IIIF Image API 3.0) with the available resolutions (example; a sketch follows below)
Mirador viewer downloaded from the PM20 site
but: manifests work with other viewers (UV example)
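As a hint at how the static IIIF Image API side works, here is a sketch that pre-generates a level-0 info.json for a single image; the identifier and sizes are hypothetical, see the linked documentation for the real setup.

```bash
#!/bin/bash
# Sketch: pre-generate a static, level-0 IIIF Image API 3.0 info.json for one
# image (hypothetical id and sizes; no image server needed at runtime).
IMG_ID="https://example.org/iiif/image/012345_0001"   # hypothetical image id

mkdir -p info/012345_0001
cat > info/012345_0001/info.json <<EOF
{
  "@context": "http://iiif.io/api/image/3/context.json",
  "id": "$IMG_ID",
  "type": "ImageService3",
  "protocol": "http://iiif.io/api/image",
  "profile": "level0",
  "width": 2000,
  "height": 3000,
  "sizes": [
    { "width": 500,  "height": 750 },
    { "width": 1000, "height": 1500 },
    { "width": 2000, "height": 3000 }
  ]
}
EOF
```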
More documentation on our use of static IIIF
the static IIIF service implementation could be the subject of a talk of its own - here I just want to point you to the documentation linked below
both methods are offered on the folder pages, deliberately
image viewing has moving parts, it will break
the JavaScript of the aging Mirador version may be blocked by browsers
DFG Viewer service may be discontinued
redundancy => resilience
… break: at some point in time, most probably in less than
20 years
… resilience: different dependencies, different vulnerabilities
hopefully, they will not break at the same time => chance to fix
a Mirador fix requires an update or the installation of another IIIF viewer (such as Universal Viewer)
DFG viewer can be installed on premises (open source)
> So much for the migration-to-static part. We now come to …
Wikidata cooperation
Wikidata is a free and open knowledge base that can be read and
edited by both humans and machines.
Wikidata acts as central storage for the structured data of Wikipedia
and other projects, within and beyond the Wikiverse.
Example item
structured data, links to other items
multilingual
links to Wikipedia articles
links to identifiers outside Wikidata
Important aspects for our project
When all PM20 folders are connected to Wikidata items, we gain …
current context information from Wikipedia
without own maintenance effort
links to Wikidata item as hub into the Web of Data
additional discovery and access path from Wikidata
sharing our metadata and making it more helpful
click first link item - Wikipedia link (in English
and German)
click Wikidata icon, search for “identifiers”, scroll to PM20
permanent, available without effort
that motivated our data donation
PM20 data donation to Wikidata
Formally announced with Wikimedia Deutschland in 2019
process continuously discussed with Wikidata community
several properties introduced, e.g. “PM20 folder ID”
tedious, laborious work to match PM20 folders to Wikidata items
the Wikidata community provides tools for that, such as “Mix’n’Match” or “OpenRefine”
data donation (cont.)
These carefully, intellectually verified links then allowed us to automatically …
create new items (particularly historical companies)
add synonyms to improve findability
add data elements, such as birth dates for persons
add relations between items, e.g. between a company and a person on its board
lots of such companies and organizations
… persons: or founding date, seat or industry for companies
Search: the donation pays back
people want to search sites comfortably, using synonyms, truncation,
…
the single hardest problem for a large static site
Lucene, Elasticsearch and the like massively increase
complexity
Google search with “site:{domain}” ??
Wikidata has our metadata! Our synonyms - and many more synonyms! An open search interface!
we have seen a very simple search in HTML lists
the accelerating enshittification of Google prohibits even the thought
Search implemented via Wikidata
Shift-Click on “overview pages”
I will walk you through the search two times - as a user and as a
software architect
How does that work?
the search link opens a page on the Wikimedia tool server
a simple PHP script takes the search box input,
fills it into a SPARQL query which uses Wikidata’s fulltext index (sketched below),
runs the query and formats the result as links to PM20 folders
the synonyms are in Wikidata - contributed by ZBW’s data donation
and many others
pm20-search repo on Wikimedia GitLab
the complete process, from invoking the “Folder search” link to displaying a result, runs on the Wikimedia tool server
it can be improved and adapted there to changes, in the indexing for example
a useful enhancement in production which does not require changes in the ZBW infrastructure
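A sketch of what such a query might look like, run here with curl instead of the PHP script; the “PM20 folder ID” property is assumed to be P4293 (please verify on Wikidata), and the search term is a placeholder.

```bash
#!/bin/bash
# Sketch only - the real implementation is the PHP script in the pm20-search repo.
# Fulltext-ish search via Wikidata's entity search, restricted to items that
# carry a PM20 folder ID (property assumed to be P4293 - please verify).
TERM="${1:-Hamburg}"   # search term from the (hypothetical) search box

curl -s -G 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode query="
SELECT ?item ?itemLabel ?folderId WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint \"www.wikidata.org\" ;
                    wikibase:api \"EntitySearch\" ;
                    mwapi:search \"$TERM\" ;
                    mwapi:language \"en\" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  ?item wdt:P4293 ?folderId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en,de\" . }
}
LIMIT 20"
```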
Ways of expanding the static site in Wikidata
reports
created via the Wikidata query interface - e.g.
map of PM20 economists by place of birth
companies by NACE industry classification
links from Wikidata items to specific not-yet-indexed material - e.g. 1920 Schleswig plebiscite
one place where new and corrected data about the topic of a folder can be collected
sparql-queries repo on Codeberg, pm20-report repo on WM GitLab
these queries on PM20 data are executed against the Wikidata SPARQL endpoint, without the need for any other data sources (a sketch follows below)
this uses the clean and reliable URLs I mentioned before as an essential design goal
this is sustainable: if at some point in time ZBW discontinued the service and the data were provided by another institution, it would be sufficient to update a handful of prefix definitions to keep all Wikidata links working
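In the same spirit, a sketch of a report-style query: PM20 person folders by place of birth, with coordinates that the Wikidata Query Service can render as a map (prepend #defaultView:Map in the query UI). The PM20 folder ID property is again assumed to be P4293 (please verify); P19 (place of birth) and P625 (coordinate location) are standard Wikidata properties.

```bash
#!/bin/bash
# Sketch of a "report" query: items with a PM20 folder and a place of birth,
# plus that place's coordinates. Property IDs: PM20 folder ID assumed to be
# P4293 (verify on Wikidata); P19 = place of birth, P625 = coordinate location.
curl -s -G 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode query='
SELECT ?person ?personLabel ?birthPlaceLabel ?coord WHERE {
  ?person wdt:P4293 ?folderId ;          # has a PM20 folder
          wdt:P19   ?birthPlace .        # place of birth
  ?birthPlace wdt:P625 ?coord .          # coordinates of that place
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
LIMIT 100'
```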
Expanding the static site via Wikidata as a strategy
of course, this fundamentally depends on agreement in the community that the data is useful in Wikidata
everybody can contribute without authorization or formal access
restrictions
e.g., when somebody has researched material about a certain topic on the digitized roll films, they can add PM20 film section links
easy for the Press Archives, where a journalist found a topic worth publishing about, and an archivist considered it worth collecting
the archival guide to German Colonialism has done exactly that
a final statement: for me, and for many other contributors to Wikidata, the most satisfying part is that every contribution not only improves the dataset you are working on, but is also valuable for everybody else relying on Wikidata
Conclusion
for other sites, challenges and solutions will be different
it could be easier if long-term maintainability were made a priority at the start of projects
< to wrap everything up: all of this was a lot of work
… different: hopefully, the code examples and repositories linked from these slides may be helpful
Acknowledgements
Thanks to the colleagues of the Wikiproject Pressearchiv.
Thanks in particular to Max-Michael Wannags, who worked on the digitized archives for many years up to his retirement and has since created, almost single-handedly, detailed folder indexes for hundreds of thousands of digitized pages.
Aside: Some sources on LOUD - Linked Open Usable Data
The use of JSON-LD for interoperability between independently
developed, linked systems has been described at
Side note: Why not a static site generator?
successfully used in projects, many different flavours
PM20 content pages are anyway generated from a database with complex logic
an SSG would have added its own complexities
DCMI ported all its sites to Hugo a few years ago, which made the sites much more maintainable
very few pages are manually edited at all
> I now come to the particular topic of image viewing, which of
course is essential for our site
Conclusion: The case for static sites
=>
reduced maintenance cost and dependency on specialized knowledge
sustainability and enhanced robustness regarding cost cuts or political havoc
hopefully, the collections will survive the coming decades on the web