Open source project: ZachisGit/ipfs-arxiv
Project URL: https://github.com/ZachisGit/ipfs-arxiv
Languages: Python 38.7%

IPFS-Arxiv

IPFS-Arxiv is hosted via the InterPlanetary File System protocol. It is a decentralized collection of the newest 1000 machine learning papers from arxiv. Important: all papers are actually uploaded and stored in a decentralized fashion!

https://gateway.ipfs.io/ipns/QmbahoVpU7qr5NWu8SYLP4my3iPnn9skgV7uFkFwXCfYmX/

Siraj - IPFS challenge

I created this site for two reasons. My primary goal was to create something for IPFS, because I think it is amazing and may very well be the future of the internet. The second reason I created this particular project at this time is Siraj's IPFS challenge. So Siraj, this is my official submission. (https://www.youtube.com/watch?v=BA2rHlbB5i0)
Description

The project consists of two parts. The first part is a webscraper written in Python; it requests machine learning paper search results via the arxiv API. The second part is the website itself: it reads the JSON index generated by the webscraper and displays the results. It also links to the papers contained in the "pdfs" folder.
Webscraper

The webscraper uses the arxiv API to send a search query in this format:
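Roughly, using the public export.arxiv.org endpoint (the exact URL built in arxiv_scraper.py may differ slightly):

http://export.arxiv.org/api/query?search_query=all:<query>&start=<start_index>&max_results=<result_count>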
The query in the machine learning case is "machine%20learning", start_index is a zero-based offset for which paper to start from, and result_count is the number of results that should be returned. API results are in XML format, so the next step is to parse them and add the titles, ids and summaries (these are all the tags we are using) to the index.json file. There we save all the information, but not without checking for duplicates, so no redundant entries are written to the index after a potential restart. Then all PDFs of the non-redundant entries are downloaded and saved to the "pdfs/" folder.
PDF request URL:
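Presumably the standard arxiv form (the exact host or suffix used in arxiv_scraper.py may differ):

https://arxiv.org/pdf/<paper_id>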
Yah, paper_id is just the id of the paper!
All the code is in arxiv_scraper.py
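For illustration, the duplicate check described above could look roughly like this. This is only a sketch, not the actual code from arxiv_scraper.py, and it assumes the index is a plain JSON list of dicts keyed by 'id':

import json
import os

def write_entries_to_index(entries, index_file='index.json'):
    # Sketch of the dedupe step; the real implementation in arxiv_scraper.py may differ.
    # Load the existing index (if any) so a restart does not create duplicate entries.
    index = []
    if os.path.exists(index_file):
        with open(index_file) as f:
            index = json.load(f)
    known_ids = {entry['id'] for entry in index}
    # Keep only entries whose id is not already in the index.
    new_entries = [e for e in entries if e['id'] not in known_ids]
    index.extend(new_entries)
    with open(index_file, 'w') as f:
        json.dump(index, f)
    return new_entries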
Website

The website reads the index.json file generated by the webscraper and creates the website entries from it. The site consists of index.html, the main HTML file; ipfs_arxiv.js, which creates the entries from the JSON and handles next/prev page management; jquery, which is completely pointless; and pdfs/, which contains all the machine learning paper PDFs.
Instructions

1. Scraping arxiv papers
raw = get_arxiv(i,length=100,query='machine%20learning')
xml_root = xml(raw)
entries = get_entries(xml_root)

In the get_arxiv() function, [i] is the start index of the papers you are requesting, [length] is the amount you get back, and [query] holds the search query for which the API returns the XML entries. Then we parse the response into XML and extract the id, title and summary from all the entries.

entries = write_entries_to_index(entries,index_file='index.json')
download_pdfs(entries,folder='pdfs')

These two lines write the retrieved entries to the index.json file, which contains all the entry information we are interested in, and download_pdfs downloads the PDFs (if they exist) from all the entries to the "pdfs/" folder.

2. The website

In the next step we take a look at the website code. The whole site consists of three files and a folder containing all the PDFs. Let's first look at index.html: it contains all the HTML and CSS code of the site, plus it loads the javascripts (jquery for no reason, index.json and ipfs_arxiv.js). I wanted to use jquery to load the index.json file into the website so the javascript code could access it, but that didn't work out in Chrome because of its cross-domain protection, so I had to get a little creative. I put "var index_json = [index.json content];" around the previously created entry index and loaded it like this:
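<!-- roughly, in index.html; the exact attributes in the original may differ -->
<script src="index.json"></script>
<script src="ipfs_arxiv.js"></script>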
Now that index.json is loaded as a javascript, we can just access the index_json variable containing the entire index table as a list of dictionaries. The second line loads the main js script.
3. Host it via IPFS

Put all the website files (index.html, ipfs_arxiv.js, jquery, index.json) and the pdfs/ dir in a new folder; I called it "published". Then, if you haven't yet, install ipfs from here and initialize it:
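ipfs init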
then start the daemon in a separate terminal:
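ipfs daemon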
and upload the published folder like this:
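ipfs add -r published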
In the output of this command you want to take the hash from the last line; it should look like this:
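added <your-site-hash> published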
You can now access your IPFS site via your hash, either over the online gateway or the daemon running on your machine:
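https://gateway.ipfs.io/ipfs/<your-site-hash>/
http://localhost:8080/ipfs/<your-site-hash>/

(8080 is the daemon's default local gateway port.)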
Congrats, you did it :) But if you want to make changes to your site in the future and don't want to hand everyone who accesses your site a new hash every time you do, there is one last step. Essentially we publish the site hash under our peerID so that the ID resolves to our site hash:
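ipfs name publish <your-site-hash>
ipfs name resolve <your-peer-id>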
The first line connects the site hash to your peerID; the second one checks that it worked. Now if you want to access your site with the peerID instead, change ipfs to ipns in your request URL:
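https://gateway.ipfs.io/ipns/<your-peer-id>/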
That's it!