Since March 17, the date on which France went into lockdown to fight the coronavirus pandemic, web archivists at the National Library of France (BNF) have been compiling online documents related to Covid-19 and its impact on the daily life of French Internet users. Ultimately, this collection will allow researchers to study web content that may since have disappeared from the live Internet.
In addition to the huge annual backup carried out automatically each fall by a “crawler” called Heritrix, the team of ten librarians and computer experts dedicated to the legal deposit of the Web at the BNF also carries out targeted collections for specific events (elections, sporting events, unforeseen major events…). “The latest emergency collection operations that we carried out concerned, for example, the 2015 attacks, the #metoo movement, but also the ‘yellow vests’ movement and the Notre-Dame fire,” lists Benoît Tuleu, director of the legal deposit department at the BNF.
Gathering all the facets of the pandemic
“At the end of January, the permanent team had started to take an interest in the coronavirus in France as part of its usual monitoring, in particular via the first French-language hashtags to appear, such as #JeNeSuisPasUnVirus, which denounced the stigmatization of the Asian community,” explains Benoît Tuleu. “But the fact that the population was confined overnight while remaining highly connected seemed new, and necessarily worth archiving for future studies.”
Even more unprecedented, the BNF staff in charge of the operation were themselves confined and working remotely. For the past month, the permanent digital legal deposit team has been reinforced by the monitoring work of 50 additional librarians from within the BNF. A network of correspondents from 26 territorial libraries and archive services has also been activated to refine the collection locally.
“Our Covid-19 collection tracks the evolution and overall impact of the pandemic on the French web. We focus on all facets of the health crisis: medical, scientific, cultural, daily life, relationship with the body, moral issues, socialization, etc. For the sake of representativeness, there are both official and personal accounts, blogs, lockdown diaries, public content on social networks, and also videos on YouTube,” lists Benoît Tuleu.
The addresses of the websites or content to be saved are submitted manually by the librarians to the robot, which captures them at different frequencies depending on their relevance: several times a day for a social network, for example; daily for press sites; once a month for a less active site. So far, the robot has collected 2,000 URLs linked to Covid-19.
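The frequency-based capture described above can be illustrated with a minimal sketch. It is not the BNF's actual configuration: the category names, the exact intervals (for instance, eight hours standing in for "several times a day") and the `Seed` structure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical categories and intervals, loosely following the article's examples.
CRAWL_INTERVALS = {
    "social": timedelta(hours=8),        # "several times a day" (assumed: every 8 hours)
    "press": timedelta(days=1),          # daily capture for press sites
    "low_activity": timedelta(days=30),  # roughly once a month
}


@dataclass
class Seed:
    """A URL selected by a librarian, with its assumed capture category."""
    url: str
    category: str
    last_crawled: datetime


def is_due(seed: Seed, now: datetime) -> bool:
    """Return True when the seed's interval has elapsed since its last capture."""
    return now - seed.last_crawled >= CRAWL_INTERVALS[seed.category]


def due_seeds(seeds: list[Seed], now: datetime) -> list[Seed]:
    """Keep only the seeds that should be captured again at time `now`."""
    return [seed for seed in seeds if is_due(seed, now)]
```

Under these assumptions, a scheduler would call `due_seeds` periodically and hand the resulting URLs to the crawler; the selection of seeds itself remains a manual, editorial decision.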
Between 8 and 10 terabytes
About 40% of this coronavirus content comes from social media, the librarians estimate. “It is difficult to estimate today the size of this collection, which is not yet finished. But if we extrapolate from what has already been collected, the volume of the coronavirus collection should be between 8 and 10 terabytes of data, depending on the duration of the epidemic. For comparison, the annual collections as a whole amount to 150 terabytes,” specifies Benoît Tuleu.
Unlike the legal deposit that the BNF operates for books and for other media such as video games, web archiving is not exhaustive (it would simply be impossible to keep the entire web), and the archivists do not wait for site owners to send them a copy. “It is our teams who harvest the sites, with a concern for representativeness in mind, without judging quality or applying preconceptions, in an encyclopedic approach,” says Benoît Tuleu. This mission was established by the DADVSI law of 2006, but the librarians had started experimenting with it several years earlier; it was completed by the acquisition of archives of the French web from 1996 to 2000, precious “incunabula of the Web,” as they call them.
Today, the BNF holds a total of two petabytes (two million gigabytes) of archived web content on its servers, collected at a rate of approximately 2.6 billion URLs crawled each year. These archives can be consulted by researchers on site in the 13th arrondissement of Paris and in a network of partner libraries across France.
Beyond its value for future research, this curated coronavirus collection, stored on the BNF's servers, will be shared with the International Internet Preservation Consortium (IIPC) as a contribution to the international Novel Coronavirus (2019-nCoV) outbreak archive, launched in February 2020 in association with the Internet Archive. It is a collective memory of the health crisis in which some thirty libraries around the world are participating.
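To put these figures in proportion, the short calculation below compares the estimated size of the coronavirus collection with the 150 terabytes quoted for the annual collections. The function name is hypothetical and the result is only as reliable as the estimates in the quote; it suggests the Covid-19 collection would represent roughly 5 to 7 percent of a year's volume.

```python
ANNUAL_COLLECTIONS_TB = 150        # figure quoted in the article
COVID_ESTIMATE_TB = (8, 10)        # low and high estimates quoted in the article


def share_of_annual(volume_tb: float, annual_tb: float = ANNUAL_COLLECTIONS_TB) -> float:
    """Return a volume as a percentage of the annual collections, rounded to one decimal."""
    return round(volume_tb / annual_tb * 100, 1)


low, high = (share_of_annual(v) for v in COVID_ESTIMATE_TB)
print(f"Estimated share of annual volume: {low}% to {high}%")
```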