WARNING: This project no longer works and is deprecated.
Reason: Ajax support is completely deprecated by Google
See also #42 (comment)
Download all messages from Google Group archive
google-group-crawler is a Bash-4 script to download all (original)
messages from a Google group archive.
Private groups require a cookie string or cookie file.
Groups with adult content are not supported yet.
# export _CURL_OPTIONS="-v" # use curl options to provide, e.g., cookies
# export _HOOK_FILE="/some/path" # provide a hook file, see in #the-hook
# export _ORG="your.company" # required if you are using G Suite
export _GROUP="mygroup" # specify your group
./crawler.sh -sh # first run for testing
./crawler.sh -sh > curl.sh # save your script
bash curl.sh # downloading mbox files
You can execute the curl.sh script multiple times; curl will quickly skip
any files that have already been fully downloaded.
Update your local archive with the RSS feed
After you have an archive from the first run, you only need to add the latest
messages shown in the feed. You can do that with the -rss option and,
optionally, the _RSS_NUM environment variable:
export _RSS_NUM=50 # (optional. See Tips & Tricks.)
./crawler.sh -rss > update.sh # using rss feed for updating
bash update.sh # download the latest posts
Run this regularly to keep your local archive up to date.
Private group or group hosted by an organization
To download messages from a private group, or from a group hosted by your
organization, you need to provide some cookie information to the script.
In the past the script used wget and the Netscape cookie file format;
now it uses curl with a cookie string and a configuration file.
Open Firefox, press F12 to open the developer tools, and select the Network
tab in the debug console. (You may find a similar way in your favorite
browser.)
Now, from the Network tab, select the request for the group's address and
choose Copy -> Copy Request Headers. The result contains many headers;
paste it into your text editor and keep only the Cookie part.
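One possible way to feed that Cookie value to the script is through the
_CURL_OPTIONS variable mentioned above, by pointing curl at a small
configuration file. This is only a sketch: the file path is arbitrary, and
you need to paste your own Cookie value.

# Sketch only: store the copied Cookie header in a curl config file,
# then let the crawler pass it to curl via the -K option.
# Use a path without spaces so _CURL_OPTIONS splits cleanly.
cat > "$HOME/gg-crawler.conf" <<'EOF'
header = "Cookie: PASTE_YOUR_COOKIE_VALUE_HERE"
EOF
export _CURL_OPTIONS="-K $HOME/gg-crawler.conf"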
The hook
If you want to execute a hook command after an mbox file is downloaded,
you can do as below.
Prepare a Bash script file that contains a definition of the __curl_hook
function. The first argument is the output filename, and the second
argument is the URL. For example, here is a simple hook:
# $1: output file
# $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
__curl_hook() {
  if [[ "$(stat -c %b "$1")" == 0 ]]; then
    echo >&2 ":: Warning: empty output '$1'"
  fi
}
In this example, the hook checks whether the output file is empty and,
if so, prints a warning to standard error.
Set the environment variable _HOOK_FILE to the path of your hook file.
For example:
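# The path below is only an example; use the file where __curl_hook is defined.
export _HOOK_FILE="$HOME/hook.sh"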
The hook file will then be loaded by the output of future
crawler.sh -sh or crawler.sh -rss runs.
What to do with your local archive
The downloaded messages are found under $_GROUP/mbox/*.
They are in RFC 822 format (possibly with obfuscated email addresses)
and can easily be converted to mbox format before being imported
into your email client (Thunderbird, claws-mail, etc.)
You can also use the mhonarc utility to convert
the downloaded messages to HTML files.
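Here is a minimal sketch of converting the raw messages into a single mbox
file. It is not part of the crawler: the output filename is arbitrary, and it
simply prepends the mbox "From " separator without quoting "From " lines
inside message bodies.

# Glue the raw RFC 822 messages into one mbox file by prepending the
# "From " separator line that the mbox format expects.
for f in "$_GROUP"/mbox/m.*; do
  printf 'From crawler@localhost %s\n' "$(date -u '+%a %b %e %T %Y')"
  cat "$f"
  printf '\n'
done > "$_GROUP.mbox"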
This script may not recover the original email addresses from public groups.
When you use valid cookies, you may see the original addresses if you are
a manager of the group. See also #16.
When cookies are used, the original addresses may be recovered,
and you must filter them before making your archive public.
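How you filter is up to you. As a rough, illustrative sketch (not part of the
crawler, and relying on GNU sed's -i option), you could mask the domain part
of every address in the downloaded files:

# Illustrative only: replace everything after '@' with 'xxx' in each message.
# This is a blunt filter; review the result before publishing.
for f in "$_GROUP"/mbox/m.*; do
  sed -i -E 's/@[A-Za-z0-9._-]+/@xxx/g' "$f"
done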
The script can't fetch from a group whose name contains special characters (e.g., +).
See also #30
This work is released under the terms of the MIT license.
Author
This script is written by Anh K. Huynh.
He wrote this script because he couldn't solve the problem
using nodejs, phantomjs, or Watir.
New web technology just makes life harder, doesn't it?
For script hackers
Please skip this section unless you really know how to work with Bash and shells.
If you clean your files (as below), you may notice that re-downloading
all files is very slow. Consider using the -rss option instead;
it fetches data from an RSS feed.
It's recommended to use the -rss option for daily updates. By default,
the number of items is 50; you can change it with the _RSS_NUM variable.
However, don't use a very large number, because Google will ignore it.
Because Topics is a FIFO list, you only need to remove the last file.
The script will re-download the last item, and if there is a new page,
that page will be fetched. The snippet below prints those last items:
ls $_GROUP/msgs/m.* \
  | sed -e 's#\.[0-9]\+$##g' \
  | sort -u \
  | while read f; do
      last_item="$f.$( \
        ls $f.* \
          | sed -e 's#^.*\.\([0-9]\+\)#\1#g' \
          | sort -n \
          | tail -1 \
      )";
      echo $last_item;
    done
The list of threads is a LIFO list. If you want to rescan your list,
you will need to delete all files under $_D_OUTPUT/threads/.
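For example (illustrative one-liner; $_D_OUTPUT is the output directory
used by the script):

# Remove the cached thread list so the next run rescans it from scratch.
rm -fv "$_D_OUTPUT"/threads/*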
You can set the modification time of each mbox output file to match its
Date: header, as below:
ls $_GROUP/mbox/m.* \
  | while read FILE; do
      date="$( \
        grep ^Date: $FILE \
          | head -1 \
          | sed -e 's#^Date: ##g' \
      )";
      touch -d "$date" $FILE;
    done
This will be very useful, for example, when you want to use the
mbox files with mhonarc.
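If you built a single mbox file as sketched earlier, an MHonArc run could
look like the following (the output directory name is an assumption):

# Render the archive to HTML; mhonarc reads mbox files directly.
mkdir -p "$_GROUP-html"
mhonarc -outdir "$_GROUP-html" "$_GROUP.mbox"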