What is web archiving?












            ******************************************************************************************
            * Why is the web archived?
            ******************************************************************************************


            The archived internet will serve as a basic source of information for future researchers. Vast
            scientific and cultural information is nowadays published only in a digital form. Web cont
            is short-lived: it quickly changes, links rot, and information that was online yesterday i

            This is why various institutions interested in preserving data harvest and archive also in







            ******************************************************************************************
            * Web archiving technology
            ******************************************************************************************

            To harvest or scrape the content of internet pages, the Webarchiv of the National Library o
            Republic, like many other institutions, uses the Heritrix [ URL "https://webarchive.jira.c
            heritrix"] web crawler. Smooth and efficient harvesting, however, requires further extensi
            The crawler browses the web, harvests content, and creates snapshots of pages at a particu
            in time. It also creates an index, which is then used to emulate archived pages in order t
            accessible.
            Archived content is stored in ARC or WARC [ URL "http://www.digitalpreservation.gov/format
            fdd000236.shtml"] containers, which not only store web content but also supplement it 
            and administrative metadata.
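
            As a rough illustration of how a container can pair content with metadata, the sketch
            below serializes a single WARC-style "resource" record: a block of named header fields,
            an empty line, and then the payload. It is a simplified sketch based on the general
            WARC 1.0 record layout, not the output of Heritrix; the function name and the example
            URL are assumptions made for illustration, and real archives typically write several
            record types and compress them.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one simplified WARC 'resource' record (illustrative only)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        # Capture time in UTC: the "particular moment" the snapshot represents.
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is terminated by two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"


# Hypothetical page content; example.com is a placeholder address.
record = build_warc_record("http://example.com/", b"<html>hello</html>", "text/html")
print(record.decode("utf-8"))
```

            The header fields carry the descriptive and administrative information (what was
            captured, when, and how large it is), while the payload holds the harvested bytes
            unchanged.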







            ******************************************************************************************
            * What can be archived using web archiving technology?
            ******************************************************************************************


            In principle, web archiving amounts to downloading HTML and CSS files, images, PDF, do
            objects, as well as audio and video files, and possibly also JavaScript.
            Web archiving technology enables the harvesting of only a fraction of the internet. What remai
            is a large part of the deep web, paid content or content which requires login, contents of
            problematic is also the harvesting of social networks or sites which contain streamed cont
            impossible to harvest the content of digital libraries and similar applications.

            In addition to technological limitations, web harvesting also has organisational and finan
            Webarchiv of the National Library of the Czech Republic does not have unlimited resources,
            instance, a comprehensive harvest of the Czech internet can only be carried out several times a
            limitations are user-defined, such as the number of links a crawler follows, maximum size 
            downloaded objects, and the like.
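
            How such user-defined limits shape a harvest can be sketched with a toy crawler. The
            example below is only an illustration under stated assumptions: the site is a small
            in-memory dictionary rather than the real web, "number of links a crawler follows" is
            read here as the number of link hops from the seed page, and the names SITE, harvest,
            max_hops and max_bytes are hypothetical and not taken from Heritrix.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (body, outgoing links).
SITE = {
    "/": (b"home", ["/a", "/b"]),
    "/a": (b"page a", ["/a/deep"]),
    "/b": (b"x" * 5000, []),
    "/a/deep": (b"deep page", []),
}


def harvest(seed, max_hops, max_bytes):
    """Breadth-first harvest limited by link hops and by object size."""
    snapshot = {}
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        body, links = SITE[url]
        if len(body) > max_bytes:
            continue  # object too large: not stored, its links are not followed
        snapshot[url] = body
        if hops == max_hops:
            continue  # hop limit reached: store the page but go no further
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return snapshot


print(sorted(harvest("/", max_hops=1, max_bytes=1000)))
print(sorted(harvest("/", max_hops=2, max_bytes=10000)))
```

            With the tighter settings the large object and the deeper page are left out of the
            snapshot; relaxing both limits captures the whole toy site.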






            ******************************************************************************************
            * Comprehensive versus topic and selective harvests
            ******************************************************************************************


            Harvesting as such is implemented by either comprehensive or topic and selective harvests.
            comprehensive harvest of an entire domain creates a snapshot of the Czech internet at a partic
            Topic harvests focus on documenting, for instance, the impact of a particular event on the
            information space. Additionally, there are some important sources which the Webarchiv of t
            Library of the Czech Republic archives selectively, that is, over and above the regular co
            harvests.





            ******************************************************************************************
            * What is the difference between web archiving and a local backup of a web page and a data
            ******************************************************************************************


            The description of web archiving technologies listed above clearly indicates that web arch
            replace the backing up of files which make up a web page, its CMS, and its database
            however, make it possible to access an image of internet pages at a particular time even a
            can no longer be accessed by the usual methods.





            ******************************************************************************************
            * For webmasters
            ******************************************************************************************


            For webmasters, harvesting by the National Library crawler usually represents no risk. Rob
            Webarchiv of the National Library can be identified in access logs and its access to some 
            objects can be denied in robots.txt.
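
            To check how a set of robots.txt rules would be interpreted before publishing them,
            the rules can be tested with the robots.txt parser in the Python standard library.
            This is a minimal sketch: "ExampleArchiveBot" is a placeholder and not the actual
            user-agent string of the Webarchiv crawler (the real name should be taken from your
            access logs), and the paths and example.com address are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "ExampleArchiveBot" is a placeholder user-agent,
# not the real identifier of any archive crawler.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether the placeholder crawler may fetch two example URLs.
blocked = parser.can_fetch("ExampleArchiveBot", "https://example.com/private/report.pdf")
allowed = parser.can_fetch("ExampleArchiveBot", "https://example.com/index.html")
print("private report:", blocked)
print("index page:", allowed)
```

            Rules in a crawler-specific group apply only to that crawler; other robots fall
            back to the "*" group, which in this sketch allows everything.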

            You can inform the Webarchiv of the National Library of the Czech Republic about your inte
            using the form at https://www.webarchiv.cz/en/add [ URL "https://www.webarchiv.cz/en/add"].