Archiving Studio ORKA's website and social media accounts

Theatre company Studio ORKA transferred their archive to the Letterenhuis following the cessation of their activities. The archive includes a website and various social media accounts.

Problem definition

Studio ORKA wanted to ensure that all information about their theatre productions, along with the visual material, would be preserved during the transfer of their archive. This information was located on their website, and they also expressed the desire to transfer their social media accounts.

Method and results

Website

Since the website contained extensive descriptions of the performances, the archivist decided to start with this material. The first attempt was to automate the process with Heritrix, a versatile web crawler often used for such tasks, which would scan and store the entire website. For this project, where every link had to be captured correctly, the approach proved problematic: some links were saved, while others were missing or did not work correctly, leaving the results incomplete and unreliable. The archivist therefore moved away from Heritrix and opted for Archive WebPage, manually going through all the links on the Studio ORKA website to save the entire site in both WARC (Web ARChive) and WACZ formats.
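A completeness problem like the one described above can be made visible by comparing the links you expect to find with the links that were actually captured. The sketch below is only an illustration of that idea, not the method used in this project: the function names and the example.org URLs are hypothetical, and it assumes you already have both lists as plain strings.

```python
from urllib.parse import urldefrag


def normalise(url):
    """Drop the #fragment and any trailing slash so equivalent links compare equal."""
    without_fragment, _fragment = urldefrag(url.strip())
    return without_fragment.rstrip("/")


def missing_urls(expected, captured):
    """Return the expected URLs that have no counterpart among the captured ones."""
    captured_set = {normalise(url) for url in captured}
    expected_set = {normalise(url) for url in expected}
    return sorted(expected_set - captured_set)


if __name__ == "__main__":
    # Hypothetical lists: one from the site's navigation, one from the capture.
    expected = ["https://example.org/productions/", "https://example.org/contact#form"]
    captured = ["https://example.org/productions"]
    for url in missing_urls(expected, captured):
        print("not captured:", url)
```

A check of this kind only tells you which pages are absent. Whether a captured page actually replays correctly still has to be verified by opening it.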

The WARC format saves not only the HTML pages but also all associated files such as images, videos and scripts, so the archived website can later be replayed with its interactive elements. The WACZ format is a ZIP package that bundles the WARC data with additional metadata, making the archived website easier to open and helping dynamic content, such as videos and forms, to be replayed correctly.
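Because a WACZ file is a ZIP package, its contents can be inspected with standard tools. The following is a minimal sketch, assuming the layout described in the WACZ specification as I understand it: WARC files under `archive/` and a page list in `pages/pages.jsonl`, whose first line is a header without a `url` field. The function name `summarise_wacz` and the file name in the usage example are hypothetical.

```python
import json
import zipfile


def summarise_wacz(path):
    """List the WARC files and the page URLs recorded in a WACZ package."""
    with zipfile.ZipFile(path) as package:
        names = package.namelist()
        # Assumption: the raw capture data is stored under archive/.
        warcs = [
            name for name in names
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]
        pages = []
        # Assumption: pages/pages.jsonl holds one JSON object per line.
        if "pages/pages.jsonl" in names:
            with package.open("pages/pages.jsonl") as handle:
                for raw_line in handle:
                    line = raw_line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    if "url" in record:  # skips the header line
                        pages.append(record["url"])
    return {"warcs": warcs, "pages": pages}


if __name__ == "__main__":
    import sys

    # Usage: python summarise_wacz.py my-archive.wacz  (hypothetical file name)
    if len(sys.argv) > 1:
        summary = summarise_wacz(sys.argv[1])
        print(len(summary["warcs"]), "WARC file(s)")
        for url in summary["pages"]:
            print(url)
```

The page list gives a quick overview of what was captured without opening a replay tool, which can help when describing the archive in a finding aid.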



These WACZ files can be viewed in various ways, and several online tools are available for consulting WARC/WACZ files. ReplayWeb.page proved to be the best choice, as it is associated with the tool that was used to archive the website. Archive WebPage itself also allows archived websites to be opened and explored locally. This is a simple process: you load the WARC/WACZ files into Archive WebPage, click on the links you want to view, and the website appears with all functional buttons intact. You can find more information in the Archive WebPage guide.




Social media

In addition to the website, Studio ORKA’s Facebook and Instagram accounts were archived. Meta, the parent company of both, offers built-in options that allow users to archive their accounts and export all data in a user-friendly way.

On Facebook and Instagram, the data was requested and downloaded via the privacy settings of Studio ORKA’s account. The download includes all posts and messages that Studio ORKA has ever posted, liked or shared, supplemented with other account activity that Meta itself records. For Studio ORKA, the full account was archived, although it is also possible to select what to archive and what to leave out.



When downloading the data, you can choose the output format: JSON or HTML. HTML was the most straightforward option, as it provides an approximate rendering of the web version of Studio ORKA’s Facebook/Instagram pages. The design is not an exact copy, but the content corresponds one-to-one.



The data was also downloaded in JSON format, which is the better option if you want to analyse the data or import it into other systems. The downside is that JSON is harder to read for someone browsing the export directly.
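To get a first overview of a JSON export before analysing or importing it, the files can be inventoried with a short script. This is a minimal sketch that assumes the export has been unzipped into a folder. It makes no assumptions about how Meta structures the individual files and only reports what it finds. The function name `inventory_export` and the folder name in the usage example are hypothetical.

```python
import json
from pathlib import Path


def inventory_export(root):
    """Describe the top-level structure of every JSON file under root."""
    root = Path(root)
    summary = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        key = path.relative_to(root).as_posix()
        if isinstance(data, dict):
            summary[key] = "object with keys: " + ", ".join(sorted(data))
        elif isinstance(data, list):
            summary[key] = f"list of {len(data)} items"
        else:
            summary[key] = "single value of type " + type(data).__name__
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python inventory_export.py facebook-export  (hypothetical folder name)
    if len(sys.argv) > 1:
        for name, description in inventory_export(sys.argv[1]).items():
            print(name, "->", description)
```

An inventory like this shows which files hold posts, media references or account activity, which is a practical starting point for deciding what to map to other systems.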



Author: Ghaith Al-Ani (Letterenhuis)

