Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

parsing - Creating a tree/deeply nested dict from an indented text file in Python

Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.

Essentially the aim is to take something like this:

a
    b
        c
    d
        e

And turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}}

So that I can send it to Python's JSON module and make some JSON.

At the moment I'm trying to build a dict and a list in steps like this:

  1. {"a":""} ["a"]
  2. {"a":"b"} ["a"]
  3. {"a":{"b":"c"}} ["a","b"]
  4. {"a":{"b":{"c":"d"}}} ["a","b","c"]
  5. {"a":{"b":{"c":"d"},"e":""}} ["a","e"]
  6. {"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
  7. {"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]

etc.

The list acts like 'breadcrumbs' showing where I last put in a dict.

To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"] to get at that last dict. I've had a look at the AutoVivification class that someone has made, which looks very useful, but I'm really unsure of:

  1. Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
  2. How to use AutoVivification in this instance
  3. Whether there's a better way in general to approach this problem.
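
For reference, autovivification can be sketched with the standard library alone. This is a minimal illustration using collections.defaultdict, not the AutoVivification class mentioned above; the helper name tree and the sample keys are made up for the example.

```python
from collections import defaultdict
import json


def tree():
    """Return a dict that creates a nested dict the first time a key is read."""
    return defaultdict(tree)


data = tree()
# Intermediate levels ("a", then "e" and "b") are created on first access.
data["a"]["e"]["f"] = "g"
data["a"]["b"]["c"] = "d"

# defaultdict is a dict subclass, so the json module can serialise it directly.
print(json.dumps(data, sort_keys=True))
```

One trade-off of this approach is that merely reading a missing key also creates it, so typos in keys pass silently.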

I came up with the following function but it doesn't work:

def get_nested(dict,array,i):
    if i != None:
        i += 1
        if array[i] in dict:
            return get_nested(dict[array[i]],array)
        else:
            return dict
    else:
        i = 0
        return get_nested(dict[array[i]],array)
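
One likely reason the function above fails is that its recursive calls pass two arguments to a three-parameter function. A working lookup does not need the index at all. The following is only a sketch of following the 'breadcrumbs' list down into the nested dict; the parameter names root and breadcrumbs are chosen for the example.

```python
def get_nested(root, breadcrumbs):
    """Follow each key in breadcrumbs down from root and return what is found there."""
    node = root
    for key in breadcrumbs:
        node = node[key]  # raises KeyError if the path does not exist
    return node


jsondict = {"a": {"b": {"c": "d"}, "e": {"f": "g"}}}

# The returned value is the same object that sits inside jsondict,
# so adding a key to it changes the original structure too.
inner = get_nested(jsondict, ["a", "e"])
inner["h"] = "i"
print(jsondict)
```

An empty breadcrumbs list simply returns the top-level dict.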

Would appreciate help!

(The rest of my extremely incomplete code is here:)

#Import relevant libraries
import codecs
import sys

#Functions
def stripped(str):
    if tab_spaced:
        return str.lstrip('\t').rstrip('\n\r')
    else:
        return str.lstrip().rstrip('\n\r')

def current_ws():
    if whitespacing == 0 or not tab_spaced:
        return len(line) - len(line.lstrip())
    if tab_spaced:
        return len(line) - len(line.lstrip('\t\n\r'))

def get_nested(adict,anarray,i):
    if i != None:
        i += 1
        if anarray[i] in adict:
            return get_nested(adict[anarray[i]],anarray)
        else:
            return adict
    else:
        i = 0
        return get_nested(adict[anarray[i]],anarray)

#initialise variables
jsondict = {}
unclosed_tags = []
debug = []

vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
    jsfilename = sys.argv[2]
else:
    jsfilename = vividfilename.split('.')[0] + '.json'

whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False

#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
    for line in vividfile:
        #work out how many whitespaces at start
        whitespace_array.append(current_ws())

        #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
        if whitespacing == 0 and whitespace_array[-1] > 0:
            whitespacing = whitespace_array[-1]
            if line[0] == '\t':
                tab_spaced = True

        #strip out whitespace at start and end
        stripped_line = stripped(line)

        if whitespace_array[-1] == 0:
            jsondict[stripped_line] = ""
            unclosed_tags.append(stripped_line)

        if whitespace_array[-2] < whitespace_array[-1]:
            oldnested = get_nested(jsondict,whitespace_array,None)
            print oldnested
            # jsondict.pop(unclosed_tags[-1])
            # jsondict[unclosed_tags[-1]]={stripped_line:""}
            # unclosed_tags.append(stripped_line)

        print jsondict
        print unclosed_tags

print jsondict
print unclosed_tags

1 Reply

Here is an object-oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = """
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

A Node class:

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

To parse the text, first create a root node. Then remove empty lines from the text, create a Node instance for every remaining line, and pass the resulting list to the add_children method of the root node.

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

Result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}
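
The result above is a list of single-key dicts rather than the single dict asked for in the question. One way to get closer is to post-process it. The sketch below uses a hypothetical helper named merge that folds any list made up entirely of dicts into one dict; it assumes sibling keys are unique, since later duplicates would overwrite earlier ones.

```python
import json


def merge(value):
    """Fold lists that contain only dicts into a single dict, recursively."""
    if isinstance(value, dict):
        return {key: merge(child) for key, child in value.items()}
    if isinstance(value, list):
        if value and all(isinstance(item, dict) for item in value):
            merged = {}
            for item in value:
                merged.update(merge(item))  # duplicate keys: last one wins
            return merged
        return [merge(item) for item in value]
    return value  # leaf text is returned unchanged


# The structure printed by the answer's code.
d = {'apple': [
    {'colours': ['red', 'yellow', 'green']},
    {'type': 'granny smith'},
    {'price': '0.10'}]}

print(json.dumps(merge(d)))
```

The price is still the string '0.10' here; converting numeric-looking leaves to numbers would be a separate step.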

I think that it should be possible to do it in one step, where you simply call the constructor of Node once, with the indented text as an argument.
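
As a rough illustration of a one-step parse, here is a sketch that uses a plain function and a stack instead of the Node constructor. The name parse_indented is made up for the example, and the output shape differs from as_dict: every line becomes a key, and leaves map to empty dicts.

```python
def parse_indented(text):
    """Build nested dicts from indented text in a single pass over the lines."""
    root = {}
    # Each stack entry is (indent level, dict holding that line's children).
    # The sentinel level -1 is shallower than any real line, so it is never popped.
    stack = [(-1, root)]
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        level = len(line) - len(line.lstrip())
        # Pop until the top of the stack is shallower than this line: its parent.
        while stack[-1][0] >= level:
            stack.pop()
        children = {}
        stack[-1][1][line.strip()] = children
        stack.append((level, children))
    return root


sample = """
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

print(parse_indented(sample))
```

Because only relative indentation is compared, this works with tabs or any consistent number of spaces per level.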

