Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Open XML document protection implementation (documentProtection class)

I'm trying to implement the Open XML documentProtection password hashing of an MS Word (2019) document in Python, to test the hashing algorithm. I created a Word document and protected it against editing with the password johnjohn. Opening the document as ZIP/XML, I see the following in the documentProtection element:

<w:documentProtection w:edit="readOnly" w:enforcement="1" w:cryptProviderType="rsaAES" w:cryptAlgorithmClass="hash" w:cryptAlgorithmType="typeAny" w:cryptAlgorithmSid="14" w:cryptSpinCount="100000" w:hash="pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==" w:salt="pH1TDVHSfGBxkd3Q88UNhQ==" /> 

According to the Open XML docs (ECMA-376-1:2016 #17.15.1.29):

  • cryptAlgorithmSid="14" points to the SHA-512 algorithm
  • cryptSpinCount="100000" means that hashing must be done in 100k rounds, using the following algorithm (quoted from the standard above):

Specifies the number of times the hashing function shall be iteratively run (runs using each iteration's result plus a 4 byte value (0-based, little endian) containing the number of the iteration as the input for the next iteration) when attempting to compare a user-supplied password with the value stored in the hashValue attribute.

The BASE64-encoded salt used for hashing ("pH1TDVHSfGBxkd3Q88UNhQ==") is prepended to the original password. The target BASE64-encoded hash must be "pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=="
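As a quick sanity check on the attribute values, the two BASE64 strings can be decoded and their lengths compared with the SHA-512 digest size. This is only a sketch; the variable names are illustrative and not part of the script below.

```python
import base64
import hashlib

# Values copied from the documentProtection element above
salt_b64 = 'pH1TDVHSfGBxkd3Q88UNhQ=='
hash_b64 = ('pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//'
            'T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==')

salt_bytes = base64.b64decode(salt_b64)
hash_bytes = base64.b64decode(hash_b64)

print(len(salt_bytes))               # length of the raw salt in bytes
print(len(hash_bytes))               # length of the raw stored hash in bytes
print(hashlib.sha512().digest_size)  # SHA-512 digest size in bytes
```

The stored hash has the same length as a SHA-512 digest, which is consistent with cryptAlgorithmSid="14".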

So my Python script attempts to generate the same hash value with the described algorithm as follows:

import hashlib
import base64
import struct

TARGET_HASH = 'pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=='

TARGET_SALT = 'pH1TDVHSfGBxkd3Q88UNhQ=='
bsalt = base64.b64decode(TARGET_SALT)

def hashit(what, alg='sha512', **kwargs):
    if alg == 'sha1':
        return hashlib.sha1(what)
    elif alg == 'sha512':
        return hashlib.sha512(what)
    # etc...
    else:
        raise Exception(f'Unsupported hash algorithm: {alg}')

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    # encode password in UTF-16LE
    # ECMA-376-1:2016 17.15.1.29 (p. 1026)
    if isinstance(data, str): data = data.encode('utf-16-le')
    
    # prepend salt if provided
    if salt is not None:
        if isinstance(salt, str): salt = salt.encode('utf-16-le')
        ghash = salt + data
    else:
        ghash = data
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data) + 4-byte integer (previous round number) with LE byte ordering
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash, alg).digest() + struct.pack('<I', i)
        except Exception as err:
            print(err)
            break
    
    # remove trailing round number bytes
    ghash = ghash[:-4]

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash

But then when I run

print(gethash('johnjohn', bsalt))

I get the following hash which is not equal to the target one:

G47RT4/+JdE6pnrP6MqUKa3JyL8abeYSCX+E4+9J+6shiZqImBJ8M6bb+IMKEdvKd6+9dVnQ3oeOsgQz/aCdcQ==

Could I be wrong in my implementation somewhere or do you think there's a difference in the low-level hash function implementation (Python's hashlib vs. Open XML)?

Updated

I realized that Word uses a legacy algorithm to pre-process passwords (for compatibility with older versions). This algorithm is described at length in ECMA-376-1:2016 Part 4 (Transitional Migration Features, #14.8.1 "Legacy Password Hash Algorithm"). So I've managed to make a script that reproduces the official ECMA example:

def strtobytes(s, trunc=15):
    # 'utf-16-le' never emits a BOM, so no BOM stripping is needed here
    b = s.encode('utf-16-le')
    pwdlen = min(trunc, len(s))
    if pwdlen < 1: return None
    # for each character take the low byte, or the high byte if the low byte is 0
    return bytes([b[i] or b[i+1] for i in range(0, pwdlen * 2, 2)])

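# InitialCodeArray and EncryptionMatrix are presumably the lookup tables from the
# ECMA section cited above; they are not shown in this post
def process_pwd(pwd):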
    # 1. PREPARE PWD STRING (TRUNCATE, CONVERT TO BYTES)
    pw = strtobytes(pwd) if isinstance(pwd, str) else pwd[:15]
    pwdlen = len(pw)
    
    # 2. HIGH WORD CALC
    HW = InitialCodeArray[pwdlen - 1]
    for i in range(pwdlen):
        r = 15 - pwdlen + i
        for ibit in range(7):
            if (pw[i] & (0x0001 << ibit)):                
                HW ^= EncryptionMatrix[r][ibit]
    
    # 3. LO WORD CALC
    LW = 0
    for i in reversed(range(pwdlen)):
        LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pw[i]
    LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pwdlen ^ 0xCE4B    
    
    # 4. COMBINE AND REVERSE
    return bytes([LW & 0xff, LW >> 8, HW & 0xff, HW >> 8])
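The low-word step does not depend on the lookup tables, so it can be checked on its own. The sketch below mirrors step 3 with a hypothetical helper named low_word. The high word is not computed here: it is an assumption inferred by reading the example value 0x7EEDCE64 as the byte sequence 7E ED CE 64.

```python
import struct

def low_word(pw: bytes) -> int:
    """Low-order word of the legacy key, mirroring step 3 of process_pwd."""
    def rotl15(x: int) -> int:
        # rotate a 15-bit value left by one bit
        return ((x >> 14) & 0x0001) | ((x << 1) & 0x7FFF)

    lw = 0
    for b in reversed(pw):
        lw = rotl15(lw) ^ b
    return rotl15(lw) ^ len(pw) ^ 0xCE4B

lw = low_word(b'Example')
print(hex(lw))  # 0xed7e

# ASSUMPTION: high word inferred from the quoted example value, not computed
hw = 0x64CE

# Same byte order as step 4: low word first, each word little-endian
key_bytes = struct.pack('<HH', lw, hw)
print(key_bytes.hex())  # 7eedce64
```

Read this way, the four bytes returned in step 4 are the little-endian form of the 32-bit key 0x64CEED7E.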

So when I call process_pwd('Example') I get the value given in the ECMA example (0x7EEDCE64). I also modified the hashing function: as I found on a forum, the initial hash of salt + password should be computed once before the main iteration loop and not counted among its rounds:

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    
    def hashit(what, alg='sha512'):
        return getattr(hashlib, alg)(what)
    
    # encode password with legacy algorithm if a string is given
    if isinstance(data, str): 
        data = process_pwd(data)
        
    if data is None: 
        print('WRONG PASSWORD STRING!')
        return None
    
    # prepend salt if provided
    if salt is not None:
        if isinstance(salt, str): 
            salt = process_pwd(salt)
            if salt is None:
                print('WRONG SALT STRING!')
                return None
        ghash = salt + data
    else:
        ghash = data
    
    # initial hash (salted)
    ghash = hashit(ghash, alg).digest()
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data + 4-byte integer (previous round number) with LE byte ordering)
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash + struct.pack('<I', i), alg).digest()
        except Exception as err:
            print(err)
            return None

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
        
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash
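To see what changed in the loop between the two gethash versions, the two iteration schemes can be compared in isolation. This sketch uses hypothetical helper names (spin_v1, spin_v2), hard-codes SHA-512, and uses an arbitrary seed. It covers only the round structure, not the separate change in how the password bytes are prepared.

```python
import hashlib
import struct

def spin_v1(seed: bytes, iters: int) -> bytes:
    """First version: counter appended after each hash, stripped at the end."""
    g = seed
    for i in range(iters):
        g = hashlib.sha512(g).digest() + struct.pack('<I', i)
    return g[:-4]

def spin_v2(seed: bytes, iters: int) -> bytes:
    """Second version: one initial hash, then `iters` rounds over digest + counter."""
    g = hashlib.sha512(seed).digest()
    for i in range(iters):
        g = hashlib.sha512(g + struct.pack('<I', i)).digest()
    return g

seed = b'arbitrary test input'
print(spin_v1(seed, 11) == spin_v2(seed, 10))  # True
print(spin_v1(seed, 10) == spin_v2(seed, 10))  # False
```

For the same iters value, the first version performs one round fewer over digest + counter than the second, so the two schemes differ only by that single round.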

No matter how many times I re-check this code, I can't find any more errors, yet I still can't reproduce the target hash from the test Word document:

myhash = gethash('johnjohn', base64.b64decode('pH1TDVHSfGBxkd3Q88UNhQ=='))
print(myhash)
print(TARGET_HASH == myhash)

I get:

wut2VOpT+X8pKXky6u/+YtwRX2inDv1WVC8FtZcdxKsyX0gHNBJGYwBgV8xzq7Rke/hWMfWe9JVvqDQAZ11A5w==

False

Question from: https://stackoverflow.com/questions/65877620/open-xml-document-protection-implementation-documentprotection-class
