Archiving websites: how and why?

Your organisation has probably switched websites at some point, or substantially changed the content of its website. The old versions of your website may nevertheless have historical value, so it makes sense to archive your website from time to time.
In this tool, you will learn the following:

  • Why should you archive your website, and when should you do it?
  • How do you go about archiving your website?

Most organisations have already moved on from one or more of their previous websites. But when transitioning to a new website, they often wonder how to archive the old one. Their old website will often contain interesting information that is no longer relevant for the new one, but still holds historical value for the organisation. So what’s the simplest way to archive this information?

Why should you archive your website?

Not so long ago, websites consisted solely of static HTML pages, which are simple text pages with formatting that the web browser can transform into a web page. To archive these websites, you simply had to copy the folder with the files to your own computer. But more recent websites use a Content Management System (CMS), which is a database that manages website information and compiles web pages at the moment they are opened. This makes the website dynamic, but also much harder to archive.

In this article, we discuss how to digitally archive such (dynamic) websites in a simple way. The website will be made static again and stored offline in a format that can be preserved in the long term. Just like with emails, the digital aspect of websites is an essential property that needs to be preserved. Without digital preservation, you would lose the ‘look & feel’, and the experience of browsing the website.[1]

How?

Analyse your website

Start by analysing your website. The choice of archiving method depends on the type, content and elements of your website.

There are basically three types of websites:

  • static websites with fixed content;
  • dynamic websites that retrieve content from the deep web;[2]
  • a hybrid of the two.

Static websites consist of a number of interconnected pages and are usually formatted in HTML. They may contain links to images or links to other websites. All files are stored in a hierarchical folder structure on the web server.

A dynamic website is a website that is compiled at the moment it is opened. The pages themselves do not contain any content. Instead, they are filled with content from a backend database, such as a CMS. Cookies store specific user information on the user’s computer, which allows the browser to adjust the web page’s content to the user’s personal preferences. Most websites are a hybrid of static and dynamic.[3]

You also need to consider the content and elements that your website consists of. Does it contain many links to other websites? Does it use external services, such as Google Maps, YouTube videos or photos hosted on an online photo service? Are there any animated or interactive images and buttons? All these elements determine the complexity of archiving a website because they are often harder to preserve. You may lose certain functionalities, such as playing Flash animations[4] or elements for which plug-ins[5] need to be installed. Interactive elements may not work in archived websites, just like files retrieved from another website.
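As a rough illustration of this inventory step, the following Python sketch lists the external hosts that a single page refers to. The sample page, the example.org address and the function names are assumptions made for the example; a real inventory would need to cover every page of the site, and would also miss resources that are loaded by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collect every URL referenced through an href or src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def external_hosts(html, page_url):
    """Return the sorted host names that differ from the page's own host."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).hostname
    hosts = set()
    for url in collector.urls:
        host = urlparse(url).hostname
        if host and host != own_host:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical page used only to demonstrate the function.
sample_page = """
<html><body>
  <a href="/about.html">About</a>
  <img src="images/logo.png">
  <iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>
  <a href="https://www.google.com/maps">Map</a>
</body></html>
"""

print(external_hosts(sample_page, "https://www.example.org/index.html"))
# prints ['www.google.com', 'www.youtube.com']
```

The resulting list can be added to the documentation of your website, so that it is clear later on which parts depended on third parties.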

You can measure your website’s ‘archivability’ at archiveready.com. If you’re developing new websites, try to ensure, where possible, that they will be easy to archive at a later date.

Set clear objectives

It is also important to set clear objectives before choosing an archiving method. This choice involves several considerations. Firstly, what needs to be captured during archiving: the entire site, including external web pages that your site links to, or just your own website’s domain? Secondly, how frequently should the components be archived?[6]

Archiving web pages presents several challenges due to their unique nature. Websites are highly transient as they are frequently updated and modified. Furthermore, the way in which a web page is displayed on the screen depends on the interaction with the user (e.g. web browser, personal settings and preferences). Web pages are also closely interconnected: they are linked to each other, sometimes hosted on multiple servers, or retrieve information from external services or websites.[7]

You will therefore need to decide when to archive your website and how to define the scope of the website to be archived. Will you only capture the website when it’s taken offline, annually, or with every update? Will only your own domain’s website be archived, or also all the pages that it refers to? You’ll need to accept that there will always be gaps when archiving a website.

Preserve your website’s essential features

The transient nature and personalisation of web pages make authenticity a challenging concept when it comes to archiving websites. Several essential properties can be defined, however:[8]

  • Context: this refers to the data that indicates the relationship between the website and the archive creator. You can preserve it by recording descriptive metadata about your website.
  • Content: this includes all the text, photos, videos, maps, etc. on your website. Some elements, such as information retrieved from external services (e.g. YouTube, Google Maps and Flickr), are difficult to archive. You therefore need to document the external services that your website uses.
  • Structure: this shows the relationship between the website and its components. Most websites have a sitemap[9] that displays the structure of the website. You can preserve this property by saving the original structure of your website (i.e. the original structure of the web pages on the web server) and maintaining the relationships between the different pages.
  • Look & feel: in addition to content, structure and context, a website’s ‘look & feel’ is also an essential component that needs to be preserved. It is therefore important to document the technical environment in which your website was created, such as the CMS software used, the plug-ins required to display certain components, and the server configuration. You also need to keep a record of the period during which your website was online. This provides information about the HTML version used, the software, and the browser versions in which the website can be viewed. This information can serve as the basis for reconstructing the website.
  • Behaviour and functionalities: websites can also have specific ‘behaviours’ and ‘functionalities’, such as animations, interactive elements and hyperlinks. You also need to record your website’s technical environment for this because certain functionalities may be lost when choosing a specific archiving method.

Essential features are preserved so that a faithful reconstruction of the website is possible, and the website is archived within its context. On the eDAVID website, you can find a document with a list of all metadata that needs to be preserved (in Dutch). Save this document as a structured text file (e.g. an XML, CSV or Excel file) and store it along with the archived website in the digital archive. You should also keep all additional documentation about your website as this can come in very handy in case emulation is needed in the future.
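A minimal sketch of such a structured metadata file, written in Python, could look as follows. The field names and values are purely illustrative assumptions and do not reproduce the eDAVID metadata list; replace them with the fields your organisation has decided to record.

```python
import csv
import io

# Hypothetical descriptive metadata; field names and values are illustrative only.
metadata = {
    "title": "Example organisation website",
    "url": "https://www.example.org",
    "creator": "Example organisation",
    "online_from": "2012-03-01",
    "online_until": "2020-06-30",
    "cms": "ExampleCMS 3.2",
    "external_services": "YouTube; Google Maps",
    "required_plugins": "Flash Player",
}


def metadata_to_csv(record):
    """Serialise one metadata record as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


csv_text = metadata_to_csv(metadata)
print(csv_text)

# To store the file next to the archived website (file name is an assumption):
# with open("website_metadata.csv", "w", encoding="utf-8", newline="") as handle:
#     handle.write(csv_text)
```

CSV is used here because it stays readable in any text editor; the same record could just as well be written as XML.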

Documenting which plug-ins the website uses enables you to reconstruct the website using emulation, for example, and avoid losing certain elements. Always archive a website before taking it offline and removing it from the web server. This gives you the opportunity to perform quality control after archiving and ensure that all essential properties have been preserved.

Sustainably preserve the website

The general rules for sustainable preservation also apply to website preservation. Always make sure you use good back-up procedures and store multiple back-ups of your files in different (geographical) locations. Monitor the integrity of your archived website by using checksums and periodically checking the files.
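The checksum check can be sketched in Python as follows. This is one possible approach, assuming SHA-256 as the checksum algorithm and an archived website stored as a folder of files; the function names and the temporary demo folder are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's relative path to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """List files that were added, removed or altered since the manifest was made."""
    current = build_manifest(root)
    names = manifest.keys() | current.keys()
    return sorted(name for name in names if manifest.get(name) != current.get(name))


# Demonstration on a temporary folder standing in for the archived website.
with tempfile.TemporaryDirectory() as archive_dir:
    page = Path(archive_dir) / "index.html"
    page.write_text("<html><body>Archived page</body></html>", encoding="utf-8")

    manifest = build_manifest(archive_dir)            # made right after archiving
    before = changed_files(archive_dir, manifest)     # periodic check: nothing changed

    page.write_text("<html><body>Corrupted</body></html>", encoding="utf-8")
    after = changed_files(archive_dir, manifest)      # periodic check: damage detected

print(before)  # prints []
print(after)   # prints ['index.html']
```

In practice the manifest would be saved alongside each back-up copy and compared at regular intervals.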

One challenge for the long-term preservation of websites is the large number of file formats that can be placed on websites. It is a complex process to migrate these to sustainable file formats because this can break the relationship between the web page and the file. Research has shown, however, that websites mainly use standardised formats, such as HTML, JPEG and MP3, which helps to make this problem less of an issue.

One solution to this challenge is to archive websites in the WARC format[10]. This is a standard format for storing various digital resources with metadata in a single archive file. This article[11] describes two methods for archiving websites in the WARC format: a simple one, and a slightly more complex one that is less time-consuming.
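To give an idea of what "resources with metadata in a single file" means, the Python sketch below builds one WARC record of type "resource" by hand: a version line, a block of named header fields, a blank line and then the stored content. It is only meant to show the layout under simplifying assumptions; the function name and sample page are invented for the example, and an actual capture would normally be made with a dedicated web archiving tool rather than with code like this.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type):
    """Build a single WARC 'resource' record as bytes (illustrative sketch)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line separates the headers from the content block
    # Each record is closed by two line breaks.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Hypothetical page content and address.
page = b"<html><body>Archived page</body></html>"
record = warc_resource_record("https://www.example.org/index.html", page, "text/html")
print(record.decode("utf-8"))
```

Several such records, one per page, image or other file, can be written one after another into the same WARC file.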

Archiving methods

This section discusses three archiving methods:

Each method has its limitations, so it may be necessary to combine multiple methods to preserve every aspect of your website.


Author: Nastasia Vanderperren (meemoo) with help from Joris Janssens

  1. F. Boudrez, Archiveren van websites: een kwestie van waardering en ‘capture’, p. 5.
  2. The deep web refers to the part of the internet that is not accessible to search engines, such as databases that are protected by passwords. The database behind a CMS is part of the deep web. See: https://en.wikipedia.org/wiki/Deep_web
  3. Boudrez, Archiveren van websites: een kwestie van waardering en ‘capture’, p. 7.
  4. Flash is Adobe software that is used to create animations, videos and applications, and enhance websites. You need a Flash Player plug-in on your web browser to play these files. See: https://en.wikipedia.org/wiki/Adobe_Flash.
  5. A plug-in or add-on is an extension of a computer program. In a web browser, they are used to show special information on a website, such as Flash animations.
  6. Boudrez, Archiveren van websites: een kwestie van waardering en ‘capture’, p. 5.
  7. Boudrez, Archiveren van websites: een kwestie van waardering en ‘capture’, p. 7.
  8. Boudrez, Archiveren van websites: een kwestie van waardering en ‘capture’, p. 7.
  9. A sitemap is a page or document containing links to all pages of a website, which serves as a handy tool for visitors and search engines to find specific pages. See: https://en.wikipedia.org/wiki/Site_map.
  10. For more info, see Wikipedia
  11. M. Pennock, Web-archiving, pp. 15-16.
