How to transform a complex project-specific web application into a large static site
Workshop at DCMI 2025, Barcelona, 2025-10-25
Joachim Neubert
Workshop Agenda
14:00 Experiences from the PM20 project
~15:00 Discussion
~15:40 Preparation for peer consulting
16:00 Coffee break (16:00 - 16:30)
16:30 Peer consulting
From that point onwards, the workshop depends on you. Please consider whether you want to receive feedback on a topic from your project, or would just like to discuss a thought experiment or an open question with us as your peers after the coffee break.
> I will now introduce the press archives collection
> Questions encouraged during talk
The 20th Century Press Archives
by Max-Michael Wannags (Wikimedia Commons)
the paper archives looked like this in an image from 2015
within the boxes, you see folders with clippings on a certain topic,
a person, a company, …
intro press archives (cont.)
roughly 20 million newspaper clippings
collected 1909 - 2005
from all over the world
organized in thematic folders
inherited by ZBW - Leibniz
Information Centre for Economics
to the best of my knowledge, the largest public clippings archive in the world
it is unique in its geographical, thematic and temporal
coverage
worth preserving forever
Remainder of the presentation
Static site migration
Legacy starting point
Aims and design guidelines
Implementation
Wikidata integration
Data donation
Search and other extensions via Wikidata
Implementation with a special look at image viewing
> All this relates to the digitized part of the archives - about one third of the total holdings; another third is on microfiche, and the last third is on paper, now inaccessibly stored in cellar rooms
Digitization project
2004-2007, funded by German Research Foundation (DFG)
digitization of roll films (1908-1960) plus the paper clippings in
the persons archives (dossiers starting before 1949)
electronic reconstruction of dossiers (folders)
tedious work, particularly due to intellectual property law
kept up until 2018, resulted in:
2 million indexed pages in total
plus 4 million “raw” digital images
each image normally containing two clippings
> indexed pages presented in legacy application
Legacy application (up to 2023)
Specialized application for sophisticated discovery and access, architecturally outdated and expensive to maintain
… discovery and access: multi-dimensional search, GND identifiers
web design from the 2000s
implemented in ColdFusion (not conforming to web standards)
Oracle backend database
maintenance contracts with external companies
not affordable any more
particularly with the upcoming retirement of the single person
familiar with the application
Aims (from an institutional POV)
drastically reduce maintenance effort and cost
outsource operation to hosting provider
make accumulated knowledge useful beyond our own web site => CC0 license for all metadata
make metadata findable, extendable and maintainable by the community => data donation to Wikidata
CC0 …: and ZBW does not claim any rights in the digital
copies
> the first aim resulted in a massive technical change
Technical design guidelines
should work “as is” for the next 10, 20, 40, … years
avoid “moving parts” as much as possible
dynamically created content (server-side via PHP/Java/Node.js/… or client-side via JavaScript frameworks) makes web sites brittle
OS dependencies require updates of, e.g., PHP or database versions
security flaws make updates mandatory
I don’t think that’s utopian - we know that sites from the very
first days of the web can still be displayed and navigated by the latest
browsers
design guidelines (cont.)
plain HTML files, richly interlinked
no database, no page generation at runtime
avoid non-essential abstraction layers
provide clean and long-term-reliable URLs for every web page and
every digitized image
> these URLs are the most important human and machine interface
to your data
design guidelines (cont.)
relying on Apache (mostly .htaccess) magic for
access restrictions based on request IP address
rewriting to clean, persistent URLs (a sketch follows below)
Altogether => the web site can be considered frozen at any time
the website code, including the .htaccess files, is online (code
icon)
site would not have been possible without Apache
a bet on the future that HTML/CSS/Apache will stay
Apache was founded in 1995, active developer community, large market
share
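To give an idea of the kind of .htaccess magic involved, here is a minimal sketch of a deployment snippet that writes such rules. The paths, URL pattern and IP range are hypothetical placeholders, not the actual PM20 configuration.

```bash
#!/bin/bash
# Minimal sketch (hypothetical paths and patterns, not the actual PM20 rules):
# write .htaccess files that (a) rewrite clean, persistent folder URLs to the
# generated static HTML files and (b) restrict one subtree by request IP.
DOCROOT=/var/www/pm20          # hypothetical document root

cat > "$DOCROOT/.htaccess" <<'EOF'
RewriteEngine On

# clean URL  /folder/pe/012345  ->  generated static page (hypothetical layout)
RewriteRule ^folder/pe/([0-9]{6})$ /folder/pe/$1/about.en.html [L]
EOF

mkdir -p "$DOCROOT/film"
cat > "$DOCROOT/film/.htaccess" <<'EOF'
# restrict access to digitized films to requests from a given IP range
# (hypothetical range standing in for "within the EU")
Require ip 192.0.2.0/24
EOF
```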
Expectations and how things worked out …
I started the development of this new site in 2019. My expectation, firmly shared by the ZBW directorate, was that the site would be finished and frozen by my scheduled retirement in January 2024 at the latest
Two major changes in 2023
intellectual property law changes in the EU and Germany permitted cultural
heritage institutions to publish “archival units” of out-of-commerce
material with mixed content
that allowed ZBW to grant access (within the EU) to many thousands of digitized roll films with millions of pages, without evaluating the IPR status of every single page
ongoing cooperation between ZBW and “Wikiprojekt Pressearchiv” to
extend metadata coverage for the new material
… mixed content: under a lot of conditions
… single page: this was a major breakthrough - however:
these millions of pages were not indexed
happily, there were a few people who were willing to do the indexing
work and to extend the website accordingly
the “Wikiprojekt Pressearchiv” was founded, and ZBW was willing to accept extended datasets and pull requests for program code
> So what had been expected to be a one-time migration is now a process chain which is executed twice a year
> Before I come to the implementation, I’ll give you a glance at how the site works
Implementation
Step 1: Generate JSON-LD files from database
the processing in this step is based on a closed, historically
evolved database, now running on Wikimedia infrastructure, which I
consider out of scope for this talk
JSON-LD as a file format and interface was chosen because
web developers are familiar with JSON
LOD freaks - like myself - are happy with semantics in RDF
the JSON-LD context file provides better documentation of the meaning of data elements, which benefits all (a tiny made-up example follows below)
JSON-LD dataset available at Zenodo
< I will however show the results
… RDF: and the interoperability it provides
> the JSON-LD dataset constitutes not only the interface between the Wikiprojekt and ZBW, but is at the same time a core part of the Linked Open Usable Data concept. You will find some links on this in the auxiliary material after the end of the presentation.
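To illustrate why the combination works for both groups, here is a tiny, made-up JSON-LD record; the keys and the context mapping are hypothetical and not the actual PM20 context.

```bash
#!/bin/bash
# Hypothetical mini example (not the actual PM20 context or field names):
# the same file is plain JSON for web developers and, via @context, RDF for LOD users.
cat > folder-example.jsonld <<'EOF'
{
  "@context": {
    "folder":  "@id",
    "label":   "http://www.w3.org/2004/02/skos/core#prefLabel",
    "country": "http://purl.org/dc/terms/coverage"
  },
  "folder":  "https://example.org/folder/pe/012345",
  "label":   "Example person",
  "country": "Example country"
}
EOF

# a web developer can simply read the plain keys with jq …
jq -r '.label' folder-example.jsonld
# … while an RDF tool can expand the very same file into triples via the context
```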
overall flow (again)
the rest of the process, steps 2 and 3, runs on the ZBW server
these steps are highly resilient
if at some point ZBW decided that it could no longer keep up this service and another cultural heritage institution were willing to step in, the steps could easily be transferred
published code and metadata, and hopefully harvested digitized images, also provide some technical resilience against fascist attacks on science and knowledge, because the website can be re-erected elsewhere
Step 2: Generate pages in markdown format from JSON-LD
produces a large number of markdown files, for folder, alphabetic index, country and topic index pages (a sketch follows below)
implemented with Perl scripts
needs to be executed only if the archive’s metadata has been extended
highly dataset-specific
Code available at Codeberg
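As a rough illustration of what step 2 does (the real implementation is the dataset-specific Perl code on Codeberg), here is a minimal sketch that turns one hypothetical folder record into a markdown page with YAML front matter.

```bash
#!/bin/bash
# Sketch only: the real step 2 is a set of dataset-specific Perl scripts.
# Turns one hypothetical folder record (JSON) into a markdown page with YAML
# front matter, which the Pandoc step can then render to HTML.
set -e

# hypothetical input record, stand-in for one entry of the JSON-LD dataset
cat > /tmp/folder.json <<'EOF'
{ "folder": "https://example.org/folder/pe/012345", "label": "Example person" }
EOF

id=$(jq -r '.folder' /tmp/folder.json)
label=$(jq -r '.label' /tmp/folder.json)

mkdir -p md/folder
cat > md/folder/example.md <<EOF
---
title: "$label"
folder-url: "$id"
---

# $label

[Open folder]($id)
EOF
```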
overall flow (again)
from these markdown files we create HTML files 1:1
Step 3: Create HTML from markdown pages
implemented through Pandoc with an HTML page template
executed by Make
overall controlled by Bash scripts
this part of the process is highly generic
Template and Makefile available at Codeberg
Pandoc has existed for almost 20 years and is continuously maintained and extended
only ONE Pandoc template is used for all German and English pages, with some logic injected by parameters provided by make (as sketched below)
this makes changes cheap when, e.g., the mail address in the page footer changes for all pages
use of “make” guarantees that the template is applied to any changed file (and only to changed files)
the whole process can be executed by invoking one bash script
This is, of course, a bet on the future that Perl, Pandoc, make and bash will stay available for a long time
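A minimal sketch of what the Pandoc step could look like; the real template and Makefile are on Codeberg, and the file names and metadata variables used here are made up. The timestamp check mimics what make does: only changed pages are rebuilt.

```bash
#!/bin/bash
# Sketch of step 3 (hypothetical paths and metadata names; the real template
# and Makefile are on Codeberg). Make-like behaviour: rebuild only pages whose
# markdown source is newer than the generated HTML.
set -e
TEMPLATE=templates/page.html      # hypothetical Pandoc template file

find md -name '*.md' | while read -r src; do
  out="html/${src#md/}"; out="${out%.md}.html"
  mkdir -p "$(dirname "$out")"
  if [ "$src" -nt "$out" ]; then
    pandoc "$src" --from markdown --to html \
      --template "$TEMPLATE" \
      --metadata lang=en \
      --metadata footer-mail=info@example.org \
      -o "$out"
  fi
done
```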
Result of the static page generation
overall runtime for recreating all pages: less than 4 hours
total number of HTML files: 143,000
total number of JSON files: 2,118,000 (for IIIF image display)
HTML pages are easily grasped by search engines; no continuous SEO work is necessary
robust against reckless genAI crawlers
the rather complex page generation, step 2, is only necessary when we want to deploy data enhancements
overall page changes, e.g. in the footer, involve only step 3 and are much faster
the site is large, but hard-wired with HTML links, technically the most stable navigation mechanism on the web
Image viewing
implemented in two rather standardized ways:
DFG viewer service, consuming METS/MODS (XML) files, which are mandatory for funded digitization projects in Germany
IIIF viewers consuming JSON-LD files
plus a custom solution in plain PHP for digitized roll films
< this was the most challenging part of this endeavour
< users want to browse images, zoom and pan - highly interactive
activities
< in the past, there were hundreds of different implementations,
often with heavy external dependencies
Folder view in Mirador IIIF viewer
IIIF infrastructure for folders
static manifest.json (IIIF Presentation API 3.0) (example)
generated by scripts and templates during step 2 (“business logic”) of the process described above
for each image, an info.json (IIIF Image API 3.0) with the available resolutions (example; a sketch follows below)
Mirador viewer downloaded from the PM20 site
but: manifests work with other viewers (UV example)
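As a hint at how the static IIIF Image API side works, here is a sketch that pre-generates a level-0 info.json for a single image; the identifier and sizes are hypothetical, see the linked documentation for the real setup.

```bash
#!/bin/bash
# Sketch: pre-generate a static, level-0 IIIF Image API 3.0 info.json for one
# image (hypothetical id and sizes; no image server needed at runtime).
IMG_ID="https://example.org/iiif/image/012345_0001"   # hypothetical image id

mkdir -p info/012345_0001
cat > info/012345_0001/info.json <<EOF
{
  "@context": "http://iiif.io/api/image/3/context.json",
  "id": "$IMG_ID",
  "type": "ImageService3",
  "protocol": "http://iiif.io/api/image",
  "profile": "level0",
  "width": 2000,
  "height": 3000,
  "sizes": [
    { "width": 500,  "height": 750 },
    { "width": 1000, "height": 1500 },
    { "width": 2000, "height": 3000 }
  ]
}
EOF
```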
More documentation on our use of static IIIF
the static IIIF service implementation could be the subject of a talk of its own - here I just want to point you to the documentation linked below
both methods are offered on the folder pages, deliberately
image viewing has moving parts, it will break
the JavaScript of the aging Mirador version may be blocked by browsers
DFG Viewer service may be discontinued
redundancy => resilience
… break: at some point in time, most probably in less than
20 years
… resilience: different dependencies, different vulnerabilities
hopefully, they will not break at the same time => chance to fix
a Mirador fix requires an update or the installation of another IIIF viewer (such as Universal Viewer)
DFG viewer can be installed on premises (open source)
> So much for the migration-to-static part. We now come to …
Wikidata cooperation
Wikidata is a free and open knowledge base that can be read and
edited by both humans and machines.
Wikidata acts as central storage for the structured data of Wikipedia
and other projects, within and beyond the Wikiverse.
Example item
structured data, links to other items
multilingual
links to Wikipedia articles
links to identifiers outside Wikidata
Important aspects for our project
When all PM20 folders are connected to Wikidata items, we gain …
current context information from Wikipedia
without own maintenance effort
links to Wikidata item as hub into the Web of Data
additional discovery and access path from Wikidata
sharing our metadata and making it more helpful
click first link item - Wikipedia link (in English
and German)
click Wikidata icon, search for “identifiers”, scroll to PM20
permanent, available without effort
that motivated our data donation
PM20 data donation to Wikidata
Formally announced with Wikimedia Deutschland in 2019
process continuously discussed with Wikidata community
several properties introduced, e.g. “PM20 folder ID”
tedious, laborious work to match PM20 folders to Wikidata items
the Wikidata community provides tools for that, such as “Mix’n’Match” or “OpenRefine”
data donation (cont.)
These carefully, intellectually verified links then allowed us to automatically …
create new items (particularly historical companies)
add synonyms to improve findability
add data elements, such as birth dates for persons
add relations between items, e.g. between a company and a person on its board
lots of such companies and organizations
… persons: or founding date, seat or industry for companies
Search: the donation pays back
people want to search sites comfortably, using synonyms, truncation,
…
the single hardest problem for a large static site
Lucene, Elasticsearch and the like massively increase
complexity
Google search with “site:{domain}” ??
Wikidata has our metadata! Our synonyms - and many more synonyms! An open search interface!
we have seen a very simple search in HTML lists
the accelerating enshittification of Google prohibits even the thought
Search implemented via Wikidata
Shift-Click on “overview pages”
I will walk you through the search two times - as a user and as a
software architect
How does that work?
the search link opens a page on the Wikimedia tool server
a simple PHP script takes the search box input,
fills it into a SPARQL query which uses Wikidata’s fulltext index (sketched below),
runs the query and formats the result as links to PM20 folders
the synonyms are in Wikidata - contributed by ZBW’s data donation
and many others
pm20-search repo on Wikimedia GitLab
the complete process, from invoking the “Folder search” link to displaying a result, runs on the Wikimedia tool server
it can be improved and adapted there to changes, in the indexing for example
a useful enhancement in production which does not require changes in the ZBW infrastructure
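A sketch of what such a query might look like, run here with curl instead of the PHP script; the “PM20 folder ID” property is assumed to be P4293 (please verify on Wikidata), and the search term is a placeholder.

```bash
#!/bin/bash
# Sketch only - the real implementation is the PHP script in the pm20-search repo.
# Fulltext-ish search via Wikidata's entity search, restricted to items that
# carry a PM20 folder ID (property assumed to be P4293 - please verify).
TERM="${1:-Hamburg}"   # search term from the (hypothetical) search box

curl -s -G 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode query="
SELECT ?item ?itemLabel ?folderId WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint \"www.wikidata.org\" ;
                    wikibase:api \"EntitySearch\" ;
                    mwapi:search \"$TERM\" ;
                    mwapi:language \"en\" .
    ?item wikibase:apiOutputItem mwapi:item .
  }
  ?item wdt:P4293 ?folderId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en,de\" . }
}
LIMIT 20"
```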
Ways of expanding the static site in Wikidata
reports
created via the Wikidata query interface - e.g.
map of PM20 economists by place of birth
companies by NACE industry classification
links from Wikidata items to specific not-yet-indexed material - e.g. 1920 Schleswig plebiscite
one place where new and corrected data about the topic of a folder can be collected
sparql-queries repo on Codeberg, pm20-report repo on WM GitLab
these queries on PM20 data are executed against the Wikidata SPARQL endpoint, without the need for any other data sources (a sketch follows below)
this uses the clean and reliable URLs I mentioned before as an essential design goal
this is sustainable: if at some point in time ZBW discontinued the service and the data were provided by another institution, it would be sufficient to update a handful of prefix definitions to keep all Wikidata links working
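In the same spirit, a sketch of a report-style query: PM20 person folders by place of birth, with coordinates that the Wikidata Query Service can render as a map (prepend #defaultView:Map in the query UI). The PM20 folder ID property is again assumed to be P4293 (please verify); P19 (place of birth) and P625 (coordinate location) are standard Wikidata properties.

```bash
#!/bin/bash
# Sketch of a "report" query: items with a PM20 folder and a place of birth,
# plus that place's coordinates. Property IDs: PM20 folder ID assumed to be
# P4293 (verify on Wikidata); P19 = place of birth, P625 = coordinate location.
curl -s -G 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode query='
SELECT ?person ?personLabel ?birthPlaceLabel ?coord WHERE {
  ?person wdt:P4293 ?folderId ;          # has a PM20 folder
          wdt:P19   ?birthPlace .        # place of birth
  ?birthPlace wdt:P625 ?coord .          # coordinates of that place
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
LIMIT 100'
```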
Expanding the static site via Wikidata as a strategy
of course, this fundamentally depends on agreement in the community that the data is useful in Wikidata
everybody can contribute without authorization or formal access
restrictions
e.g., when somebody has researched material about a certain topic on the digitized roll films, they can add PM20 film section links
easy for the Press Archives, where a journalist found a topic worth publishing about, and an archivist considered it worth collecting
the archival guide to German Colonialism has done exactly that
a final statement: for me, and for many other contributors to Wikidata, the most satisfying part is that every contribution not only improves the dataset you are working on, but is also valuable for everybody else relying on Wikidata
Conclusion
for other sites, challenges and solutions will be different
it could be easier if long-term maintainability were made a priority at the start of projects
< to wrap everything up: all of this was a lot of work
… different: hopefully, the code examples and repositories linked from these slides may be helpful
Acknowledgements
Thanks to the colleagues of the Wikiproject Pressearchiv.
Thanks in particular to Max-Michael Wannags, who worked on the digitized archives for many years up to his retirement and has since created, almost single-handedly, detailed folder indexes for hundreds of thousands of digitized pages.
Aside: Some sources on LOUD - Linked Open Usable Data
The use of JSON-LD for interoperability between independently
developed, linked systems has been described at
Side note: Why not a static site generator?
successfully used in projects, many different flavours
PM20 content pages are anyway generated from a database with complex logic
an SSG would have added its own complexities
DCMI ported all its sites to Hugo a few years ago, which made the sites much more maintainable
very few pages are manually edited at all
> I now come to the particular topic of image viewing, which of
course is essential for our site
Conclusion: The case for static sites
=>
reduced maintenance cost and dependency on specialized knowledge
sustainability and enhanced robustness regarding cost cuts or political havoc
hopefully, the collections will survive the coming decades on the web