Acquisition and Indexing of Web Content

The web content to be acquired is defined by the responsibilities and the collection profile of the archival institution. In most cases, this will be content originally created by or on behalf of the relevant persons and institutions themselves.

In individual cases, content made available online by third parties may also be significant. Increasingly, social media activities must be archived in addition to the website itself. The archivist is responsible for deciding whether such content is relevant, since the boundary between private and public statements is frequently blurred on social media platforms. In any case, the site owner, who is in most cases the rights holder, should be asked to approve the archiving of the respective web content.

An analysis of the technical form of content presentation makes it possible to gather various data from and about the relevant website in advance, e.g. the number of subpages, available image or video galleries, documents offered for download, etc. At this point, tasks for the further acquisition of the site, e.g. keyword assignment, can be defined. How detailed this analysis will be depends on the importance of the provenance, the number of pages to be archived, and not least on the available human and technical resources.
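Such a preliminary survey can be partly automated. The following minimal sketch uses only the Python standard library to count subpage links, images, and downloadable documents in a page's HTML; the sample markup, the class name `SiteSurvey`, and the list of download extensions are illustrative assumptions, not a prescribed method.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical sample markup standing in for a fetched homepage.
SAMPLE_HTML = """
<html><body>
  <a href="/about.html">About</a>
  <a href="/gallery/photos.html">Photo gallery</a>
  <a href="/files/annual_report.pdf">Annual report (PDF)</a>
  <img src="/img/logo.png" alt="Logo">
  <img src="/img/banner.jpg" alt="Banner">
</body></html>
"""

# Assumed set of file extensions counted as downloadable documents.
DOWNLOAD_EXTENSIONS = (".pdf", ".doc", ".docx", ".zip")

class SiteSurvey(HTMLParser):
    """Collects subpage links, images, and downloadable documents."""

    def __init__(self):
        super().__init__()
        self.subpages = []
        self.images = []
        self.downloads = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            path = urlparse(attrs["href"]).path.lower()
            if path.endswith(DOWNLOAD_EXTENSIONS):
                self.downloads.append(attrs["href"])
            else:
                self.subpages.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

survey = SiteSurvey()
survey.feed(SAMPLE_HTML)
print(len(survey.subpages), len(survey.images), len(survey.downloads))
# → 2 2 1
```

In practice the HTML would be fetched from the live site and the survey run over every discovered page; the resulting counts can inform both the crawler configuration and the keyword assignment mentioned above.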

Defining a URL as the starting point for the mirroring process, along with the file types and file sizes the site contains, update intervals, etc., allows the archiving software to be configured specifically for the site. Some of this metadata can be relevant for the conversion or migration of data for long-term archiving: which browser version and which version of the archiving software were used? The acquired data, containing content as well as technical and structural information, can and should also be made available to users and explained where necessary.
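One simple way to keep such site-tailored settings and technical provenance information together is to store them as a structured record alongside the archived content. The sketch below writes such a record as JSON; all field names, the seed URL, and the software and browser names are illustrative assumptions rather than a fixed standard.

```python
import json

# Hypothetical, site-tailored crawl settings plus the technical
# metadata needed later for conversion or migration decisions.
crawl_record = {
    "seed_url": "https://www.example.org/",          # starting point for mirroring
    "included_file_types": ["html", "pdf", "jpg", "png"],
    "max_file_size_mb": 50,                           # per-file size limit
    "update_interval_days": 30,                       # recrawl frequency
    "archiving_software": "ExampleCrawler 3.1",       # assumed name/version
    "rendering_browser": "ExampleBrowser 118.0",      # assumed name/version
}

# Serialized record, to be stored with the archive and made
# available to users together with the archived content.
metadata_json = json.dumps(crawl_record, indent=2)
print(metadata_json)
```

Keeping this record with the archive means that, years later, a migration tool or a researcher can determine exactly which software environment produced the capture.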