Project name: oduwsdl/ipwb
Project URL: https://github.com/oduwsdl/ipwb
Programming language: Python 67.8%
Project description:

InterPlanetary Wayback (ipwb)
Peer-To-Peer Permanence of Web Archives

InterPlanetary Wayback (ipwb) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. ipwb splits the header and payload of WARC response records before disseminating them into IPFS to leverage this deduplication, builds a CDXJ index with references to the IPFS hashes returned, and combines the header and payload from IPFS at the time of replay.

InterPlanetary Wayback primarily consists of two scripts:
A pictorial representation of the ipwb indexing and replay process:

An important aspect of archival replay systems is rewriting resource references for proper memento reconstruction, so that they are dereferenced from the archive at around the same datetime as the root memento and not from the live site (where the resource might have changed or gone missing). Many archival replay systems perform server-side rewriting, but this has limitations when URIs are generated using JavaScript. To handle this, we use a Service Worker to reroute requests on the client side when they are dereferenced, which avoids any server-side rewriting. For this, we have implemented a separate library, Reconstructive, which is reusable and extendable by any archival replay system.

Another important feature of archival replay is the inclusion of an archival banner in mementos. The purpose of an archival banner is to highlight that a replayed page is a memento and not a live page, to provide metadata about the memento and the archive, and to facilitate additional interactivity. Many archival banners used in web archival replay systems are obtrusive and have issues such as style leakage. To eliminate both of these issues, we have implemented a Custom HTML Element as part of the Reconstructive library, which ipwb uses.

Installing

InterPlanetary Wayback (ipwb) requires Python 3.7+. ipwb can also be used with Docker (see below). For conventional usage, the latest release of ipwb can be installed using pip:
The latest development version containing changes not yet released can be installed from source:
Setup

The InterPlanetary File System (ipfs) daemon must be installed and running before starting ipwb. See the Install IPFS page to accomplish this. In the future, we hope to make this more automated. Once ipfs is installed, start the daemon:
If you encounter a conflict with the default API port of 5001 when starting the daemon, run the following before launching the daemon to change the API port to one of your choosing (here, 5002):
Indexing

In a separate terminal session (or the same one if you started the daemon in the background), instruct ipwb to push the contents of a WARC file into IPFS and create an index of records:
...for example, from the root of the ipwb repository:
The ipwb indexer partitions the WARC into WARC records and extracts the WARC response headers, HTTP response headers, and HTTP response bodies (payloads). Relevant information is extracted from the WARC response headers, temporary byte strings are created for the HTTP response headers and payload, and these two byte strings are pushed into IPFS. The resulting CDXJ data is written to
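The indexing steps described above can be illustrated with a minimal Python sketch. This is not ipwb's code: `fake_ipfs_add` is a hypothetical stand-in that keys bytes by their SHA-256 digest in a local dict instead of talking to an IPFS daemon, `surt` is a deliberately simplified key function, and the JSON field names in the emitted CDXJ-style line are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, Tuple
from urllib.parse import urlsplit


def split_http_response(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw HTTP response into (header block, payload)."""
    header, _, payload = raw.partition(b"\r\n\r\n")
    return header, payload


def fake_ipfs_add(store: Dict[str, bytes], data: bytes) -> str:
    """Hypothetical stand-in for adding content to IPFS.

    The bytes are keyed by a digest of their content, so identical
    content always maps to the same key (deduplication).
    """
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def surt(uri: str) -> str:
    """Simplified SURT-style key: reversed host labels, then the path."""
    parts = urlsplit(uri)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"{host}){parts.path or '/'}"


def index_record(store: Dict[str, bytes], uri: str, datetime14: str,
                 raw_response: bytes) -> str:
    """Store header and payload separately and return one CDXJ-style line."""
    header, payload = split_http_response(raw_response)
    status = header.split(b" ", 2)[1].decode()
    header_key = fake_ipfs_add(store, header)
    payload_key = fake_ipfs_add(store, payload)
    # Field names below are illustrative assumptions.
    fields = {"locator": f"urn:ipfs/{header_key}/{payload_key}",
              "status_code": status}
    return f"{surt(uri)} {datetime14} {json.dumps(fields)}"


store: Dict[str, bytes] = {}
page = b"<html>hello</html>"
r1 = b"HTTP/1.1 200 OK\r\nDate: Sat, 05 Mar 2016 19:22:47 GMT\r\n\r\n" + page
r2 = b"HTTP/1.1 200 OK\r\nDate: Sun, 06 Mar 2016 08:00:00 GMT\r\n\r\n" + page

line1 = index_record(store, "http://example.com/", "20160305192247", r1)
line2 = index_record(store, "http://example.com/", "20160306080000", r2)
print(line1)
print(line2)
print(len(store))  # two distinct headers plus one shared payload
```

Splitting the header from the payload is what makes the deduplication effective in this sketch: the two captures differ only in their `Date` header, so the payload is stored once and both index lines point at the same payload key.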
Replaying

An archival replay system is also included with ipwb to re-experience the content disseminated to IPFS. The replay system requires a CDXJ index, which is supplied by passing the path of the index file as a parameter:
ipwb also supports using an IPFS hash or any HTTP location as the source of the CDXJ:
Once started, the replay system's web interface can be accessed through a web browser, e.g., http://localhost:2016/ by default. To run it under a domain name other than
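As a rough illustration of what replay involves conceptually, the following Python sketch looks up the capture closest to a requested datetime in a CDXJ-style index and recombines the stored header and payload. It is a sketch under stated assumptions, not ipwb's replay code: the `store` dict stands in for IPFS, the keys `hdr1`, `hdr2`, and `body` are placeholder values rather than real IPFS hashes, and the `locator` field layout is assumed for illustration.

```python
import json
from datetime import datetime


def parse_cdxj_line(line):
    """Parse 'urlkey datetime14 {json}' into (urlkey, datetime, fields)."""
    urlkey, dt14, fields = line.split(" ", 2)
    return urlkey, datetime.strptime(dt14, "%Y%m%d%H%M%S"), json.loads(fields)


def closest_memento(index_lines, urlkey, target):
    """Return the record for urlkey whose datetime is nearest to target."""
    candidates = [rec for rec in map(parse_cdxj_line, index_lines)
                  if rec[0] == urlkey]
    if not candidates:
        return None
    return min(candidates, key=lambda rec: abs(rec[1] - target))


def replay(store, record):
    """Fetch header and payload by key and rejoin them into one response."""
    _, _, fields = record
    header_key, payload_key = fields["locator"].split("/")[1:3]
    return store[header_key] + b"\r\n\r\n" + store[payload_key]


# Placeholder content store standing in for IPFS (keys are not real hashes).
store = {
    "hdr1": b"HTTP/1.1 200 OK",
    "hdr2": b"HTTP/1.1 200 OK\r\nX-Demo: 2",
    "body": b"<html>hello</html>",
}
index_lines = [
    'com,example)/ 20160305192247 {"locator": "urn:ipfs/hdr1/body"}',
    'com,example)/ 20160306080000 {"locator": "urn:ipfs/hdr2/body"}',
]

record = closest_memento(index_lines, "com,example)/",
                         datetime(2016, 3, 6, 9, 0, 0))
response = replay(store, record)
print(response)
```

Here the requested datetime is one hour after the second capture, so that capture is selected and its header is rejoined with the shared payload.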
Using Docker

A pre-built Docker image is available and can be run as follows:
The container will run an IPFS daemon, index a sample WARC file, and replay it using the newly created index. It will take a few seconds to be ready, then the replay will be accessible at http://localhost:2016/ with a sample archived page. To index and replay your own WARC file, bind mount your data folders inside the container using
If the host folder structure is something other than
To build an image from the source, run the following command from the directory where the source code is checked out. The name of the locally built image could be anything, but we use
By default, the image building process also performs tests, so it might take a while to build the image. It ensures that an image will not be created with failing tests. However, it is possible to skip tests by supplying a build-arg
Help

Usage of sub-commands in ipwb can be accessed by providing the

Project History

This repo contains the code for integrating WARCs and IPFS as developed at the Archives Unleashed: Web Archive Hackathon in Toronto, Canada in March 2016. The project was also presented at:

Citing Project

We have numerous publications related to this project, but the most significant and primary one was published in TPDL 2016. (Read the PDF)
@INPROCEEDINGS{ipwb-tpdl2016,
AUTHOR = {Mat Kelly and
Sawood Alam and
Michael L. Nelson and
Michele C. Weigle},
TITLE = {{InterPlanetary Wayback}: Peer-To-Peer Permanence of Web Archives},
BOOKTITLE = {Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries},
PAGES = {411--416},
MONTH = {June},
YEAR = {2016},
ADDRESS = {Hamburg, Germany},
DOI = {10.1007/978-3-319-43997-6_35}
}

License

MIT