Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to avoid code repetition and redundancy

I am trying to simplify some code which does the following:

  • create an empty list to store the information scraped from a website
  • apply a function to fill the list
  • add this information to a dataframe
  • repeat the process for the elements that have not been scraped yet

The process looks like

list1 = []

def fun(df):
    for x in df['Col']: 
        url = "my_website"+x
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
...
        list1.append(data1)
    return list1

list1 = fun(my_df)

my_df['List1'] = list1

(I tried to keep the code as simple as possible.) The output looks like this (the column Col is my initial dataframe, i.e. my_df):

Col          List1
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  

Then, I repeat the process for strings in the list for each row:

# 2nd round
list1 = []
my_df2 = my_df.explode('List1')
my_list2 = pd.Series(list(set(my_df2['List1']) - set(my_df['Col'])), name='Col')
new_df2 = pd.DataFrame(my_list2, columns=['Col'])

list1 = fun(new_df2)

new_df2['List1'] = list1

This gives me another dataframe with new values, so I append these results to my original dataframe, my_df:

my_df2= my_df.append(new_df2) 

I repeat the process again:

# 3rd round
list1 = []
my_df3 = my_df2.explode('List1')
my_list3 = pd.Series(list(set(my_df3['List1']) - set(my_df2['Col'])), name='Col')
new_df3 = pd.DataFrame(my_list3, columns=['Col'])

list1 = fun(new_df3)

new_df3['List1'] = list1

and so on, until I have finished scraping all the data.

Since I am repeating these 'rounds' manually every time, I would like to ask whether there is a way to simplify the code and avoid all this awful repetition. Any tips will be appreciated.

EDIT: My difficulty is in setting up the condition for the first step: when I have my original dataset, i.e. before the column List1 exists, I need to create the empty list list1 and then apply fun to the dataset.

In the other steps, I should:

  • Re-initialise list1, then build a new dataframe by exploding the List1 column of my original (now updated) dataset
  • Take the difference between this new dataframe and the previous one, to remove duplicates
  • Run fun again, storing the results in a column List1
  • Append the results to the updated dataframe (which is always the previous dataframe)
  • Repeat the process.

If you need more information, I will be happy to provide it.

question from:https://stackoverflow.com/questions/65864016/how-to-avoid-code-repetition-and-redundancy

1 Reply

As far as I can tell from my_df, the list1 declaration should be inside fun, or you're emptying it elsewhere.

First, I would change fun to only work on one entry (not whole Series):

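import requests
from bs4 import BeautifulSoup

def fun(x):
    list1 = []  # fresh list on every call, so results never leak between entries
    url = "my_website"+x
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    ...
    list1.append(data1)
    return list1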

Then, you can do the first transformation (populating second column List1) by doing:

my_df['List1'] = my_df['Col'].apply(fun)
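
To try this step without touching the network, here is a minimal sketch in which fun is swapped for a stand-in that looks values up in a dictionary. FAKE_SITE and its contents are assumptions made up for illustration (they mirror the sample output in the question); the real fun would do the scraping.

```python
import pandas as pd

# Hypothetical stand-in for the website: key -> values "found" on its page.
FAKE_SITE = {
    "mouse": ["dog", "horse", "cat"],
    "horse": ["mouse", "elephant"],
    "tiger": [],
}

def fun(x):
    """Return the list of values for one entry (no network access in this sketch)."""
    return list(FAKE_SITE.get(x, []))

my_df = pd.DataFrame({"Col": ["mouse", "horse", "tiger"]})

# Passing the function itself is equivalent to lambda x: fun(x).
my_df["List1"] = my_df["Col"].apply(fun)
print(my_df)
```

Each row ends up with its own list because fun builds a new list per call instead of sharing a global one.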

After that, you could do something like:

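while scraping_to_do:  # placeholder: the stopping condition is up to you
    # lists are unhashable, so explode before building a set from List1
    my_df = my_df.explode('List1')
    newCol = pd.Series(list(set(my_df['List1']) - set(my_df['Col'])))
    newList1 = newCol.apply(fun)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    my_df = pd.concat([my_df, pd.DataFrame({'Col': newCol, 'List1': newList1})], ignore_index=True)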

You need to figure out when to stop scraping (when the set difference is the empty set?), as well as deal with the NaNs that explode produces from empty lists.
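
One way to settle both points is to keep the bookkeeping in plain Python and only build the dataframe at the end. The sketch below is an assumption about how that could look, not the asker's actual code: crawl, fetch and FAKE_SITE are hypothetical names, and fetch stands in for the single-entry fun. The loop stops when a round discovers nothing new, and because nothing is exploded, empty lists never turn into NaN.

```python
def crawl(seeds, fetch):
    """Scrape every key reachable from seeds exactly once.

    fetch(key) must return the list of values found for that key.
    Returns a dict mapping each scraped key to its list.
    """
    results = {}
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    while frontier:  # an empty frontier means nothing is left to scrape
        for key in frontier:
            results[key] = fetch(key)
        # Everything found in this round, minus what has already been scraped.
        found = [value for key in frontier for value in results[key]]
        frontier = [v for v in dict.fromkeys(found) if v not in results]
    return results


# Hypothetical data standing in for the website.
FAKE_SITE = {
    "mouse": ["dog", "horse", "cat"],
    "horse": ["mouse", "elephant"],
    "tiger": [],
    "dog": ["cat"],
}

table = crawl(["mouse", "horse", "tiger"], lambda key: FAKE_SITE.get(key, []))
for key, values in table.items():
    print(key, values)
```

The result converts to the two-column layout with pd.DataFrame(list(table.items()), columns=['Col', 'List1']).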

