I am trying to simplify some code which does the following:
- create an empty list to store the information scraped from a website
- apply a function to fill the list
- add this information to a dataframe
- repeat the process, selecting the elements not scraped yet
The process looks like this:
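import requests
from bs4 import BeautifulSoup

list1 = []

def fun(df):
    # Scrape one page per value in 'Col' and collect the results
    for x in df['Col']:
        url = "my_website" + x
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        ...
        list1.append(data1)
    return list1

list1 = fun(my_df)
my_df['List1'] = list1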
(I tried to keep the code as simple as possible.)
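As a side note, a variant of fun that builds and returns a fresh list on every call would probably remove the need to reset list1 before each round. This is only a sketch: scrape_one is a hypothetical helper standing in for the elided requests/BeautifulSoup step, and fake_site is made-up data used so the example runs without network access.

```python
def fun(values, scrape_one):
    """Return one scraped result per value, in the same order.

    `scrape_one` is a hypothetical callable standing in for the
    requests/BeautifulSoup step elided in the original code.
    """
    # A new list is built on every call, so no global needs resetting.
    return [scrape_one(x) for x in values]


# Stand-in for the website (made-up data matching the sample output).
fake_site = {
    "mouse": ["dog", "horse", "cat"],
    "horse": ["mouse", "elephant"],
    "tiger": [],
}

first = fun(["mouse", "horse", "tiger"], lambda x: fake_site.get(x, []))
second = fun(["tiger"], lambda x: fake_site.get(x, []))
```

With a dataframe, the call would then look something like my_df['List1'] = fun(my_df['Col'], scrape_one).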
The output looks like this (the column Col comes from my initial dataframe, i.e. my_df):
Col      List1
mouse    [dog, horse, cat]
horse    [mouse, elephant]
tiger    []
Then I repeat the process for the strings in each row's list:
# 2nd round
list1 = []
my_df2 = my_df.explode('List1')
my_list2 = pd.Series(list(set(my_df2['List1']) - set(my_df['Col'])), name='Col')
new_df2 = pd.DataFrame(my_list2, columns=['Col'])
list1 = fun(new_df2)
new_df2['List1'] = list1
Then I have another dataframe with other values, so I append these results to my original dataframe, my_df:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
my_df2 = pd.concat([my_df, new_df2])
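The "which values have not been scraped yet" step of each round could be pulled into one small function. The sketch below assumes the column names Col and List1 from above; next_round is a hypothetical name. Note that the empty list in the tiger row explodes to NaN, so the sketch drops missing values before taking the set difference.

```python
import pandas as pd


def next_round(df):
    """Return a one-column dataframe of values found in List1 but not yet in Col."""
    # Flatten the lists; empty lists become NaN, which dropna() removes.
    found = set(df["List1"].explode().dropna())
    # sorted() is used only to make the order reproducible.
    todo = sorted(found - set(df["Col"]))
    return pd.DataFrame({"Col": todo})


my_df = pd.DataFrame({
    "Col": ["mouse", "horse", "tiger"],
    "List1": [["dog", "horse", "cat"], ["mouse", "elephant"], []],
})
new_df2 = next_round(my_df)
```

Each round then reduces to calling next_round on the dataframe accumulated so far, and an empty result signals that nothing is left to scrape.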
I repeat the process again:
# 3rd round
list1 = []
my_df3 = my_df2.explode('List1')
my_list3 = pd.Series(list(set(my_df3['List1']) - set(my_df2['Col'])), name='Col')
new_df3 = pd.DataFrame(my_list3, columns=['Col'])
list1 = fun(new_df3)
new_df3['List1'] = list1
and so on, until I have finished scraping all the data.
Since I am repeating these 'rounds' manually every time, I would like to ask if there is a way to simplify the code and avoid all this awful repetition.
Any tips will be appreciated.
EDIT: my difficulty is in setting up a condition so that, when I have my original dataset (i.e. before the column List1 has been created), I create the empty list list1 and then apply fun to my dataset.
In the other steps, I should:
- Initialise list1 again, and get a new dataframe by exploding the List1 column of my original (now updated) dataset
- Calculate the difference between this new dataframe and the previous one, to remove duplicates
- Run fun again, storing the results in a column List1
- Append the results to the updated dataframe (which is always the previous dataframe)
- Repeat the process again.
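The steps above might be folded into a single while loop that stops once a round discovers nothing new, which would also make the "first round versus later rounds" condition unnecessary. This is a sketch under assumptions, not a definitive implementation: crawl and scrape_one are hypothetical names, results are kept in a plain dict instead of a dataframe, and fake_site is made-up data replacing the real website.

```python
def crawl(start, scrape_one):
    """Scrape `start`, then keep scraping newly discovered values until none remain."""
    results = {}        # value -> list scraped for that value
    todo = list(start)  # values to scrape in the current round
    while todo:
        for x in todo:
            results[x] = scrape_one(x)
        # Next round: every value seen in some list that has no entry yet.
        discovered = {y for ys in results.values() for y in ys}
        todo = sorted(discovered - set(results))
    return results


# Made-up stand-in for the website, so the sketch runs offline.
fake_site = {
    "mouse": ["dog", "horse", "cat"],
    "horse": ["mouse", "elephant"],
    "tiger": [],
    "elephant": ["lion"],
}

results = crawl(["mouse", "horse", "tiger"], lambda x: fake_site.get(x, []))
```

Once the loop ends, the dict could presumably be turned back into a dataframe with something like pd.DataFrame({'Col': list(results), 'List1': list(results.values())}).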
If you need more information, I will be happy to provide it.
question from:
https://stackoverflow.com/questions/65864016/how-to-avoid-code-repetition-and-redundancy