WARNING: This project no longer works and is deprecated.
Reason: Ajax support is completely deprecated by Google
See also #42 (comment)
Download all messages from Google Group archive
google-group-crawler is a Bash-4 script to download all (original)
messages from a Google group archive.
Private groups require a cookie string or cookie file.
Groups with adult content are not supported yet.
# export _CURL_OPTIONS="-v" # use curl options to provide, e.g., cookies
# export _HOOK_FILE="/some/path" # provide a hook file, see in #the-hook
# export _ORG="your.company" # required if you are using G Suite
export _GROUP="mygroup" # specify your group
./crawler.sh -sh # first run for testing
./crawler.sh -sh > curl.sh # save your script
bash curl.sh # downloading mbox files
You can execute the curl.sh script multiple times; curl will quickly skip
any files that have already been fully downloaded.
Update your local archive with the RSS feed
After you have an archive from the first run, you only need to add the latest
messages shown in the feed. You can do that with the -rss option and,
optionally, the _RSS_NUM environment variable:
export _RSS_NUM=50 # (optional. See Tips & Tricks.)
./crawler.sh -rss > update.sh # using rss feed for updating
bash update.sh # download the latest posts
Run this regularly to keep your local archive up to date.
Private group or group hosted by an organization
To download messages from a private group, or from a group hosted by your
organization, you need to provide some cookie information to the script.
In the past the script used wget and the Netscape cookie file format;
now it uses curl with a cookie string and a configuration file.
Open Firefox, press F12 to open the developer tools, and select the Network
tab in the debug console. (You may find a similar way in your favorite
browser.)
Now, from the Network tab, select the request for the group's address and
choose Copy -> Copy Request Headers. The result contains many headers;
paste it into your text editor and keep only the Cookie part.
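One possible way to feed that Cookie value to the script is through the
_CURL_OPTIONS variable mentioned above, by pointing curl at a small
configuration file. This is only a sketch: the file path is arbitrary, and
you need to paste your own Cookie value.

# Sketch only: store the copied Cookie header in a curl config file,
# then let the crawler pass it to curl via the -K option.
# Use a path without spaces so _CURL_OPTIONS splits cleanly.
cat > "$HOME/gg-crawler.conf" <<'EOF'
header = "Cookie: PASTE_YOUR_COOKIE_VALUE_HERE"
EOF
export _CURL_OPTIONS="-K $HOME/gg-crawler.conf"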
The hook
If you want to execute a hook command after an mbox file is downloaded,
you can do as below.
Prepare a Bash script file that contains a definition of the __curl_hook
function. The first argument is the output filename, and the second
argument is the URL. For example, here is a simple hook:
# $1: output file
# $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
__curl_hook() {
  if [[ "$(stat -c %b "$1")" == 0 ]]; then
    echo >&2 ":: Warning: empty output '$1'"
  fi
}
In this example, the hook checks whether the output file is empty and,
if so, prints a warning to standard error.
Set the environment variable _HOOK_FILE to the path of your hook file.
For example:
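# The path below is only an example; use the file where __curl_hook is defined.
export _HOOK_FILE="$HOME/hook.sh"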
The hook file will then be loaded by the output of future
crawler.sh -sh or crawler.sh -rss runs.
What to do with your local archive
The downloaded messages are found under $_GROUP/mbox/*.
They are in RFC 822 format (possibly with obfuscated email addresses)
and can easily be converted to mbox format before being imported
into your email client (Thunderbird, claws-mail, etc.)
You can also use the mhonarc utility to convert
the downloaded messages to HTML files.
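Here is a minimal sketch of converting the raw messages into a single mbox
file. It is not part of the crawler: the output filename is arbitrary, and it
simply prepends the mbox "From " separator without quoting "From " lines
inside message bodies.

# Glue the raw RFC 822 messages into one mbox file by prepending the
# "From " separator line that the mbox format expects.
for f in "$_GROUP"/mbox/m.*; do
  printf 'From crawler@localhost %s\n' "$(date -u '+%a %b %e %T %Y')"
  cat "$f"
  printf '\n'
done > "$_GROUP.mbox"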
This script may not recover the original email addresses from public groups.
When you use valid cookies, you may see the original addresses if you are
a manager of the group. See also #16.
When cookies are used, the original addresses may be recovered,
and you must filter them before making your archive public.
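How you filter is up to you. As a rough, illustrative sketch (not part of the
crawler, and relying on GNU sed's -i option), you could mask the domain part
of every address in the downloaded files:

# Illustrative only: replace everything after '@' with 'xxx' in each message.
# This is a blunt filter; review the result before publishing.
for f in "$_GROUP"/mbox/m.*; do
  sed -i -E 's/@[A-Za-z0-9._-]+/@xxx/g' "$f"
done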
The script can't fetch from a group whose name contains special characters (e.g., +).
See also #30
This work is released under the terms of the MIT license.
Author
This script is written by Anh K. Huynh.
He wrote this script because he couldn't solve the problem
using nodejs, phantomjs, or Watir.
New web technology just makes life harder, doesn't it?
For script hackers
Please skip this section unless you really know how to work with Bash and shells.
If you clean your files (as below), you may notice that re-downloading
all files is very slow. Consider using the -rss option instead;
it fetches data from an RSS feed.
It's recommended to use the -rss option for daily updates. By default,
the number of items is 50; you can change it with the _RSS_NUM variable.
However, don't use a very large number, because Google will ignore it.
Because Topics is a FIFO list, you only need to remove the last file.
The script will re-download the last item, and if there is a new page,
that page will be fetched. The snippet below prints those last items:
ls $_GROUP/msgs/m.* \
  | sed -e 's#\.[0-9]\+$##g' \
  | sort -u \
  | while read f; do
      last_item="$f.$( \
        ls $f.* \
          | sed -e 's#^.*\.\([0-9]\+\)#\1#g' \
          | sort -n \
          | tail -1 \
      )";
      echo $last_item;
    done
The list of threads is a LIFO list. If you want to rescan your list,
you will need to delete all files under $_D_OUTPUT/threads/.
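For example (illustrative one-liner; $_D_OUTPUT is the output directory
used by the script):

# Remove the cached thread list so the next run rescans it from scratch.
rm -fv "$_D_OUTPUT"/threads/*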
You can set the modification time of each mbox output file to match its
Date: header, as below:
ls $_GROUP/mbox/m.* \
  | while read FILE; do
      date="$( \
        grep ^Date: $FILE \
          | head -1 \
          | sed -e 's#^Date: ##g' \
      )";
      touch -d "$date" $FILE;
    done
This will be very useful, for example, when you want to use the
mbox files with mhonarc.
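If you built a single mbox file as sketched earlier, an MHonArc run could
look like the following (the output directory name is an assumption):

# Render the archive to HTML; mhonarc reads mbox files directly.
mkdir -p "$_GROUP-html"
mhonarc -outdir "$_GROUP-html" "$_GROUP.mbox"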