Open source project: ZachisGit/ipfs-arxiv
Project URL: https://github.com/ZachisGit/ipfs-arxiv
Languages: Python 38.7%

IPFS-Arxiv

IPFS-Arxiv is hosted via the InterPlanetary File System protocol. It is a decentralized collection of the newest 1000 machine learning papers from arxiv. Important: all papers are actually uploaded and stored in a decentralized fashion!

https://gateway.ipfs.io/ipns/QmbahoVpU7qr5NWu8SYLP4my3iPnn9skgV7uFkFwXCfYmX/

Siraj - IPFS challenge

I created this site for two reasons. My primary goal was to create something for IPFS, because I think it is amazing and may very well be the future of the internet. The second reason I created this particular project at this time is Siraj's IPFS challenge. So Siraj, this is my official submission. (https://www.youtube.com/watch?v=BA2rHlbB5i0)
Description

The project consists of two parts. The first part is a webscraper written in Python; it requests machine learning paper search results via the arxiv API. The second part is the website itself: it reads the JSON index generated by the webscraper and displays the results. It also links to the papers contained in the "pdfs" folder.
Webscraper

The webscraper uses the arxiv API to send a search query in this format:
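Roughly, using the public export.arxiv.org endpoint (the exact URL built in arxiv_scraper.py may differ slightly):

http://export.arxiv.org/api/query?search_query=all:<query>&start=<start_index>&max_results=<result_count>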
The query in the machine learning case is "machine%20learning", start_index is a zero-based offset for which paper to start from, and result_count is the number of results that should be returned. API results are in XML format, so the next step is to parse them and add the titles, ids and summaries (these are all the tags we are using) to the index.json file. There we save all the information, but not without checking for duplicates, so no redundant entries are written to the index after a potential restart. Then all PDFs of the non-redundant entries are downloaded and saved to the "pdfs/" folder.
PDF request URL:
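Presumably the standard arxiv form (the exact host or suffix used in arxiv_scraper.py may differ):

https://arxiv.org/pdf/<paper_id>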
Yah, paper_id is just the id of the paper!
All the code is in arxiv_scraper.py
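For illustration, the duplicate check described above could look roughly like this. This is only a sketch, not the actual code from arxiv_scraper.py, and it assumes the index is a plain JSON list of dicts keyed by 'id':

import json
import os

def write_entries_to_index(entries, index_file='index.json'):
    # Sketch of the dedupe step; the real implementation in arxiv_scraper.py may differ.
    # Load the existing index (if any) so a restart does not create duplicate entries.
    index = []
    if os.path.exists(index_file):
        with open(index_file) as f:
            index = json.load(f)
    known_ids = {entry['id'] for entry in index}
    # Keep only entries whose id is not already in the index.
    new_entries = [e for e in entries if e['id'] not in known_ids]
    index.extend(new_entries)
    with open(index_file, 'w') as f:
        json.dump(index, f)
    return new_entries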
Website

The website reads the index.json file generated by the webscraper and creates the website entries from it. The site consists of index.html, the main HTML file; ipfs_arxiv.js, which creates the entries from the JSON and handles next/prev page management; jquery, which is completely pointless; and pdfs/, which contains all the machine learning paper PDFs.
Instructions

1. Scraping arxiv papers
raw = get_arxiv(i,length=100,query='machine%20learning')
xml_root = xml(raw)
entries = get_entries(xml_root)

In the get_arxiv() function, [i] is the start index of the papers you are requesting, [length] is the amount you get back, and [query] holds the search query for which the API returns the XML entries. Then we parse the response into XML and extract the id, title and summary from all the entries.

entries = write_entries_to_index(entries,index_file='index.json')
download_pdfs(entries,folder='pdfs')

These two lines write the retrieved entries to the index.json file, which contains all the entry information we are interested in, and download_pdfs downloads the PDFs (if they exist) from all the entries to the "pdfs/" folder.

2. The website

In the next step we take a look at the website code. The whole site consists of three files and a folder containing all the PDFs. Let's first look at index.html: it contains all the HTML and CSS code of the site, plus it loads the javascripts (jquery for no reason, index.json and ipfs_arxiv.js). I wanted to use jquery to load the index.json file into the website so the javascript code could access it, but that didn't work out in Chrome because of its cross-domain protection, so I had to get a little creative. I put "var index_json = [index.json content];" around the previously created entry index and loaded it like this:
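<!-- roughly, in index.html; the exact attributes in the original may differ -->
<script src="index.json"></script>
<script src="ipfs_arxiv.js"></script>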
Now that index.json is loaded as a javascript, we can just access the index_json variable containing the entire index table as a list of dictionaries. The second line loads the main js script.
3. Host it via IPFS

Put all the website files (index.html, ipfs_arxiv.js, jquery, index.json) and the pdfs/ dir in a new folder; I called it "published". Then, if you haven't yet, install ipfs from here and initialize it:
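ipfs init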
then start the daemon in a separate terminal:
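ipfs daemon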
and upload the published folder like this:
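ipfs add -r published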
In the output of this command you want to take the hash from the last line; it should look like this:
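added <your-site-hash> published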
You can now access your IPFS site via your hash, either over the online gateway or the daemon running on your machine:
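https://gateway.ipfs.io/ipfs/<your-site-hash>/
http://localhost:8080/ipfs/<your-site-hash>/

(8080 is the daemon's default local gateway port.)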
Congrats, you did it :) But if you want to make changes to your site in the future and don't want to hand everyone who accesses your site a new hash every time you do, there is one last step. Essentially we publish the site hash under our peerID so that the ID resolves to our site hash:
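ipfs name publish <your-site-hash>
ipfs name resolve <your-peer-id>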
The first line connects the site hash to your peerID; the second one checks that it worked. Now if you want to access your site with the peerID instead, change ipfs to ipns in your request URL:
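https://gateway.ipfs.io/ipns/<your-peer-id>/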
That's it!