The creation of a Wide Area Archive & Library (WAAL)

Proposal for the International Institute of Social History

Author: Tjebbe van Tijen
Date: 7 April 1994
Comment: This project has indeed been realized as the Occasio Project by Antenna in close cooperation with IISG and Tjebbe van Tijen.


The archive & library consists of digital documents representing all kinds of information, from text (in the first stages of the project) to images and sound (in the future). The reason for constituting a WAAL is that, although the production and proliferation of electronic documents has been astronomical, very little has been done for the long-term preservation of this kind of information. It is clear that the new information technologies have a strong impact on society, especially through the diffusion of information by telecommunication. This phenomenon has been compared on several occasions with the 'revolution of the printed word' as it developed from the 15th century onwards. The 'digital revolution' will soon be a popular subject for historical study. To make such studies possible we have to act now to rescue what will otherwise be lost forever.

There are distinct differences between printed and digital information. The first is tangible and readable without any devices (except spectacles in some cases); the second is disembodied and can only be perceived with the help of special appliances. Papyrus, parchment and paper have carried information from generation to generation for more than 4000 years. It is not likely that this 'paper memory system' will be fully replaced by digital documents, as some over-enthusiastic computer lovers claim. Nevertheless we should also start to take care of the 'digital memory system' if we do not want to leave our successors with a historical void. The new form of information circulation over electronic networks has a very ephemeral quality. Text is often written and read directly on and from the computer terminal screen. Not much thought is given to the long-term preservation of such texts and, where it is, the necessary facilities, finances and expertise are not available.

There are a few characteristics of these new media that will force us to rethink the concepts we use to determine the selection criteria for building historical collections of information items. The notions of 'small' and 'big' publishers, of limited and wide circulation, are less applicable. The ease with which documents can be duplicated, adapted and re-circulated, moved from one electronic bulletin board to a whole network and from one network to other networks, does away with the earlier distinction between mass media and their implicit counterpart, 'non mass media'. The ease with which one can now circulate information from local to global scale also has consequences for another concept of 'paper world' collection building: the importance given to the 'place of origin' of an information item. Collections are often built up geographically. Consequently, work tasks are also divided over different geographical areas. With the implosion of physical space in the worldwide electronic network, collection structuring and task division should be revised. Collecting digital information can be done from any point in the interconnected global network. The traditional division into document types such as correspondence, manuscript, book, periodical, press release, pamphlet, handout and leaflet is becoming less distinct. A whole chain of activities of the publisher, printer, distributor and bookshop has suddenly been united in one process: computer networking.

Of course the International Institute of Social History should make a selection from the hundreds of thousands of digital documents that are (still) available now. One major network with an information content closest to the collection profile of the Institute is the Association for Progressive Communications (APC). APC started in 1984 in the San Francisco Bay Area as an initiative of the Ark Communications Institute, the Center for Innovative Diplomacy, Community Data Processing and the Foundation for the Arts of Peace (at that time called PeaceNet). In 1987 PeaceNet was managed by the newly formed Institute for Global Communications (IGC), set up by the Tides Foundation. Other networks were created, such as EcoNet and ConflictNet. Among the financial supporters of these initiatives was Apple Computer. Later the network made connections with similar initiatives in other countries, such as GreenNet in England. In 1987 Peter Gabriel directed financial support to the project from a fund-raising rock concert held in Tokyo the year before. The transatlantic link with GreenNet proved so successful that other funds for furthering the net could be found from foundations like MacArthur, Ford and General Service, and from the United Nations Development Program. In 1990 the Association for Progressive Communications was formed to coordinate the by now global networking activities. There were more than 15,000 subscribers in 90 countries in 1993, mostly Non-Governmental Organisations (NGOs) (see map).

The outline of the proposal that I discussed last week with Michael Polman and Alfred Heitink from the Antenna Foundation in Nijmegen reads as follows:

First step: gather all archive material of the APC network, insofar as it has been preserved somewhere in the world. A rough estimate is that it amounts to between 2 and 3 gigabytes since 1984. The daily feed of material is now around 1 MB per day. This estimate covers mainly material in English, but also includes text in Spanish, German and Portuguese. The proposal is to make a contract with the representative of the APC network in the Netherlands, the Antenna Foundation. The Antenna Foundation will make a separate agreement with another partner, GreenNet in London, to assure long-term continuity. In principle all APC materials are free on the network. Participating host organisations in different countries have an agreement that they will charge only for the transport costs of the information, not for the information itself. There are some exceptions, as with the materials from International Press Service (IPS). In such cases separate deals need to be made with these information providers.

At the moment the most cost-effective and safe method of preservation is writing the digital archive material to CD-ROM. Each CD-ROM has a capacity of a bit more than 600 MB. The whole APC archive could be written on 5 to 6 such CD-ROMs. With the falling prices of hardware and software it is now feasible to buy a CD-Recordable writing device with a dedicated computer and appliances for around fl. 15.000,-. Blank CD-Recordable discs now cost between fl. 50,- and fl. 75,- apiece. The writing of the CDs can thus be done 'in house'. The great advantage is that once the material has been prepared for storing on a CD-ROM, further copies can be made easily and cheaply, either by 'burning' another CD-Recordable or by having a small run duplicated by a duplicating company. The same digital material can also be formatted on CD-ROM for use on different platforms (PC, MAC, UNIX), and duplicates of archives can be exchanged with other institutions or made into a publication. Of course permission from the copyright holders is needed before such a publication can be made.
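
The disc and media estimates above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a usable capacity of exactly 600 MB per disc and 1 GB = 1000 MB, and the function names are invented for this example. Under these round numbers the raw data fits on 4 to 5 discs; the figure of 5 to 6 quoted above is slightly higher, which would be consistent with leaving room for indexes and overhead (an assumption, not stated in the proposal).

```python
import math

CD_CAPACITY_MB = 600        # "a bit more than 600 MB" per disc, rounded down
DISC_PRICE_FL = (50, 75)    # price range per blank CD-Recordable, in guilders


def discs_needed(archive_mb, capacity_mb=CD_CAPACITY_MB):
    """Smallest whole number of discs that can hold archive_mb megabytes."""
    return math.ceil(archive_mb / capacity_mb)


def media_cost_range(discs, prices=DISC_PRICE_FL):
    """(lowest, highest) cost in guilders for the given number of blank discs."""
    return discs * prices[0], discs * prices[1]


if __name__ == "__main__":
    for archive_gb in (2, 3):
        n = discs_needed(archive_gb * 1000)   # assumes 1 GB = 1000 MB
        low, high = media_cost_range(n)
        print(f"{archive_gb} GB: {n} discs, blank media fl. {low} to fl. {high}")
    # New material arrives at roughly 1 MB per day.
    print("discs per year of new material:", discs_needed(365 * 1))
```

At the current daily feed, one further disc per year would be enough to keep the archive up to date.
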
The main steps for the APC project will be: 
- archiving/preservation;
- classifying/normalisation;
- making the material publicly available.
Each of these tasks can be divided into separate steps:

Archiving/preservation
- through direct Internet connections;
- archive materials on DAT cartridges;
- the original structure of bulletin boards and networks with news groups, subject lists, conferences, electronic journals and file sections will be preserved as much as possible;
- deselection by automatic filtering, for instance of all messages smaller than 5 KB, or of messages that consist mainly of quotations of other messages;
- detection of duplicate items on the basis of the unique 'message ID' (only within a news group);
- registration of verification by using ... of original text.
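
The filtering and duplicate detection steps listed above could be sketched as follows. This is an illustration under stated assumptions, not the software to be developed: it assumes messages arrive as text in the usual header/body mail and news format, that '5 KB' means 5 × 1024 bytes, that 'mainly quotations' means more than half of the non-empty lines start with '>', and that the function is called once per news group. All names and thresholds are hypothetical.

```python
from email import message_from_string

MIN_SIZE_BYTES = 5 * 1024     # assumed reading of the 5 KB threshold
MAX_QUOTED_FRACTION = 0.5     # assumed meaning of "mainly quotations"


def quoted_fraction(body):
    """Share of non-empty body lines that start with the '>' quote marker."""
    lines = [line for line in body.splitlines() if line.strip()]
    if not lines:
        return 0.0
    quoted = sum(1 for line in lines if line.lstrip().startswith(">"))
    return quoted / len(lines)


def select_messages(raw_messages):
    """Yield parsed messages of ONE news group that pass all filters."""
    seen_ids = set()
    for raw in raw_messages:
        # Deselect small messages.
        if len(raw.encode("utf-8")) < MIN_SIZE_BYTES:
            continue
        msg = message_from_string(raw)
        body = "" if msg.is_multipart() else msg.get_payload()
        # Deselect messages that mainly quote other messages.
        if quoted_fraction(body) > MAX_QUOTED_FRACTION:
            continue
        # Deselect duplicates by Message-ID; keep messages without an ID.
        message_id = msg.get("Message-ID")
        if message_id is not None:
            if message_id in seen_ids:
                continue
            seen_ids.add(message_id)
        yield msg
```

Because the set of seen IDs lives inside one call, the same message posted to two news groups is kept in both, which matches the 'only within a news group' rule.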

Classifying/normalisation
- automatic description on the basis of formal elements in the headers of messages (from - date - subject line);
- registration of the conference(s) or list(s) where the message has been posted (including multiple appearances);
- automatic classification of specific names derived from the full text (persons, corporations, geographical names) on the basis of expertise dictionaries;
- semi-automatic classification with descriptors/keywords on the basis of expertise dictionaries, in such a way that sets of message descriptions can easily be selected or deselected by the classifier;
- normalisation of text that has been incorrectly formatted;
- reformatting, for the viewing copy on CD-ROM, of texts that use national or language-specific routines for characters outside lower ASCII.
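
The automatic description and name classification steps above might look like the sketch below. It is a minimal illustration: the dictionary contents, the conference names in the usage note and the record layout are all hypothetical, and a real 'expertise dictionary' would be compiled by subject specialists. Names are matched as whole words, ignoring case, which is a simplification.

```python
import re
from email import message_from_string

# Hypothetical mini-dictionary, for illustration only.
NAME_DICTIONARY = {
    "person": ["Tjebbe van Tijen"],
    "corporation": ["GreenNet", "PeaceNet", "Antenna Foundation"],
    "place": ["London", "Nijmegen", "Tokyo"],
}


def find_names(text, dictionary=NAME_DICTIONARY):
    """Return, per category, the dictionary terms that occur in the text."""
    found = {}
    for kind, terms in dictionary.items():
        found[kind] = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        ]
    return found


def describe(raw, conferences):
    """Build a description record from the headers and full text of a message."""
    msg = message_from_string(raw)
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg.get("From"),
        "date": msg.get("Date"),
        "subject": msg.get("Subject"),
        # One message may have been posted to several conferences or lists.
        "conferences": sorted(set(conferences)),
        "names": find_names(body),
    }
```

A call such as `describe(raw_text, ["gn.announce", "apc.general"])` would return one record per message; the semi-automatic keyword step could then work on sets of such records rather than on the full texts.
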
Making the material publicly available
- bringing the indexes that refer to the full text on line (through an existing bulletin board system, direct dialling, on Internet, distributing the index to other bulletin boards);
- bringing the whole text on line (so-called FTP site), either based at a computer at the Institute or for instance on the GreenNet computer in London;
- establishing a service whereby, on the basis of the descriptions (indexes), selections of text can be made 'on line' or by buying a floppy disc for use at home; the requested material can then be delivered on floppy, in an email box or on a CD-Recordable (with an automatic billing and payment registration program);
- and of course consultation directly at the Institute.
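
Selecting texts on the basis of the indexes, as described above, amounts to a lookup from descriptors to message identifiers. The sketch below shows one simple way to do this with an inverted index; the data layout and function names are assumptions made for the example, and nothing here implies a particular database product.

```python
from collections import defaultdict


def build_index(descriptions):
    """Map each keyword (lower-cased) to the set of message IDs that carry it.

    `descriptions` maps a message ID to its list of keywords.
    """
    index = defaultdict(set)
    for message_id, keywords in descriptions.items():
        for keyword in keywords:
            index[keyword.lower()].add(message_id)
    return index


def select(index, *keywords):
    """Return the message IDs that carry every requested keyword."""
    if not keywords:
        return set()
    matches = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*matches)
```

Only the index has to be on line for a user to make a selection; the full texts behind the selected identifiers can then be delivered by whichever route is chosen.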

Once the information is preserved on CD-ROM an 'on line resource center' will be constituted at the International Institute of Social History.

A rough estimate of costs can be divided into one-time investments and annual operating costs, although the dynamic hardware and software market will make it necessary to renew the hardware and software on a regular basis.
Starting options:
- Hardware and software for archiving materials on CD-ROM: fl. 15.000,-
- Multiple CD-ROM player to put in local and external network: fl. 5.000,-
- Software development, training and support: fl. 10.000,-
- Transport costs of data: fl. 5.000,-
- Peripheral equipment (high speed modems, cabling, network facilities): fl. 5.000,-
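
For reference, the starting options listed above can be added up as follows. The sketch assumes all amounts are in guilders and treats every item as part of the starting budget, since the list does not separate annual costs; the shortened item labels and function names are chosen for this example.

```python
# Amounts in guilders, taken from the list of starting options.
STARTING_COSTS_FL = {
    "Hardware and software for archiving on CD-ROM": 15_000,
    "Multiple CD-ROM player": 5_000,
    "Software development, training and support": 10_000,
    "Transport costs of data": 5_000,
    "Peripheral equipment": 5_000,
}


def total_fl(costs):
    """Sum of all cost items, in guilders."""
    return sum(costs.values())


def format_fl(amount):
    """Format an amount in the notation used in this proposal, e.g. fl. 5.000,-"""
    return "fl. " + f"{amount:,}".replace(",", ".") + ",-"


if __name__ == "__main__":
    print("Total starting options:", format_fl(total_fl(STARTING_COSTS_FL)))
```
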
How to proceed 
I propose that the project be developed in stages, with the first stage set up by an external company on the basis of a fixed-price contract. The Antenna Foundation would be the most suitable candidate. A steering committee will be formed for the project, with two representatives of the Institute and two of the company. The project will include training of personnel of the Institute. The dedicated software that has to be developed should, as much as possible, be made up of combinations of existing, widely accepted software modules; its construction should be modular and open to adaptation by the Institute. The software should be able to handle a wide variety of text and database formats and platforms.

Of course, if the Institute decides to do the writing of CD-ROMs 'in house', the equipment can be used at the same time for other projects, such as:
- a compilation of existing text format inventories of the Institute and affiliates (with one general index);
- publication of new inventories on CD-ROM (ID Archiv);
- Archives de Bakunin;
- publication of the general catalogue (OPC) on CD-ROM;
- backup safety copies of image files of the iconographic department.