Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
223 views
in Technique[技术] by (71.8m points)

Snakemake: use checksums instead of timestamps?

My project is likely to have instances where input datasets are overwritten but the contents are not changed. Is there a way in Snakemake to check for build changes using checksums instead of timestamps?

For example, Scons checks for build changes in both code and data using md5 hashes (hashes computed only where timestamps have changed). But I'd so much prefer to use Snakemake because of its other killer features.

The desired behavior is similar to the between workflow caching functionality described in the docs. In the docs it says:

There is no need to use this feature to avoid redundant computations within a workflow. Snakemake does this already out of the box.

But all references to this issue point to Snakemake only using timestamps within a normal workflow.

Using the ancient marker or using touch to adjust timestamps won't work for me as that will require too much manual intervention.

I eventually found an old SO post indicating that I could do this by writing my own script to compare checksums and then feeding that into Snakemake, but I'm not sure if this is still the only option.

question from:https://stackoverflow.com/questions/65907256/snakemake-use-checksums-instead-of-timestamps

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I'm not aware of a built-in solution in snakemake. Maybe here's how I would go about it.

Say your input data is data.txt. This is the file(s) that is overwritten possibly without changing. Instead of using this file directly in the snakemake rules, you use a cached copy that is overwritten only if the md5 has changed between original and cache. The checking can be done before the rule all using standard python code.

Here's a pseudo-code example:

input_md5 = get_md5('data.txt')
cache_md5 = get_md5('cache/data.txt')

if input_md5 != cache_md5:
    # This will trigger the pipeline because cache/data.txt is newer than output
    copy('data.txt', 'cache/data.txt') 

rule all:
    input:
        'stuff.txt'

rule one:
    input:
        'cache/data.txt',
    output:
        'stuff.txt',

EDIT: This pseudo-code saves the md5 of the cached input files so they don't need to be recomputed each time. It also saves to file the timestamp of the input data so that the md5 of the input is recomputed only if such timestamp is newer than the cached one:

for each input datafile do:

current_input_timestamp = get_timestamp('data.txt')
cache_input_timestamp = read('data.timestamp.txt')

if current_input_timestamp > cache_input_timestamp:
    input_md5 = get_md5('data.txt')
    cache_md5 = read('cache/data.md5')

    if input_md5 != cache_md5:
        copy('data.txt', 'cache/data.txt') 
        write(input_md5, 'cache/data.md5')
        write(current_input_timestamp, 'data.timestamp.txt')

# If any input datafile is newer than the cache, the pipeline will be triggered

However, this adds complexity to the pipeline so I would check whether it is worth it.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...