

python - Optimizing for loops, dataframe partition

I'm writing a feature-selection algorithm for a binary classification problem. It sweeps through a np.array or pd.Series looking for intervals with a good target split, using a greedy approach.

The code works fine; however, it relies on a for loop with an if conditional, so performance is quite slow. I was wondering if there's a smarter (and faster) way to do this. My code looks something like this:

import pandas as pd

df = pd.DataFrame([[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0],
                   [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]],
                  columns=['feat1', 'feat2', 'target'])
target = df['target']  # binary target: values are 0 or 1

def my_generic_metric_function(y):
    # This is just a generic metric that I'm using as an example:
    # the fraction of rows in the interval whose target is 1.
    if len(y) > 0:
        tgt = sum(y == 1)
        no_tgt = sum(y == 0)
        return 1.0 * tgt / (tgt + no_tgt)
    else:
        return 0


def find_intervals(x, min_metric=10):
    # Important: all my features receive a treatment that "fits" them into a range from 0 to 100.
    # Note that I'm not iterating through the DataFrame; I'm iterating over a range of values
    # and finding the partitions in the DataFrame.
    print(x.name)
    steps = [0]
    metric_partition = []
    for i in range(0, 101):
        # The target series filtered by the current interval of x values
        band = target[(x > steps[-1]) & (x <= i)]
        partition_metric = my_generic_metric_function(band)

        if partition_metric >= min_metric:
            steps.append(i)
            metric_partition.append(partition_metric)

    return {'f': x.name, 's': steps, 'm': metric_partition}

And I would apply this function to an entire DataFrame using .apply():

bi_df = df.drop("target", axis=1).apply(find_intervals)

This problem looks a lot like a CART algorithm; however, I haven't found any implementation that could help me optimize mine.
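
For reference, a minimal sketch of how an off-the-shelf CART implementation exposes its split points, using scikit-learn's DecisionTreeClassifier (the max_depth of 3 is arbitrary, and the tree optimizes its own impurity criterion, not my metric):

from sklearn.tree import DecisionTreeClassifier

X = df.drop("target", axis=1)
tree = DecisionTreeClassifier(max_depth=3).fit(X, df["target"])
# tree_.feature gives the column index used at each internal node (-2 marks leaves),
# tree_.threshold the corresponding split value
for feat_idx, thr in zip(tree.tree_.feature, tree.tree_.threshold):
    if feat_idx >= 0:
        print(X.columns[feat_idx], thr)

So it does produce per-feature thresholds, but under Gini/entropy rather than my interval metric, which is why it doesn't directly solve my problem.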



1 Reply


It seems that my_generic_metric_function is monotonically increasing with respect to i in the loop. I can't prove it right now, but I've run a few simulations with random numbers and they all showed this behaviour.
If that is the case, then instead of doing a linear search (for i in range(0, 101)) you can do an (almost) binary search for the first value of i that reaches your min_metric threshold.
It would be almost a plain binary search, but as a second verification step you would need to check that the value just before the one you found is still below the threshold.
This would reduce the inner loop from O(n) to O(log n).
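
A minimal sketch of that idea, assuming the metric really is monotonically non-decreasing in i (the helper name find_first_step is mine, not part of your code):

def find_first_step(x, target, lower, min_metric):
    # Binary search for the smallest i in (lower, 100] whose band (lower, i]
    # reaches min_metric. Valid only if the metric never decreases as i grows.
    lo, hi = lower + 1, 100
    found = None
    while lo <= hi:
        mid = (lo + hi) // 2
        band = target[(x > lower) & (x <= mid)]
        if my_generic_metric_function(band) >= min_metric:
            found = mid        # mid passes; look for an even smaller step
            hi = mid - 1
        else:
            lo = mid + 1       # mid is too small; search the upper half
    return found               # None if no i in (lower, 100] reaches the threshold

find_intervals would then call find_first_step(x, target, steps[-1], min_metric) repeatedly, appending the result to steps until it returns None, instead of scanning every i. The "check the value just before" step is implicit here: hi keeps shrinking until mid - 1 no longer passes.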

Besides that, you can also try to index your feature columns.

Other options would be to use numpy instead of pandas for everything, to adapt and compile find_intervals, or even to parallelize the per-column apply (with multiprocessing or another simple approach). But these three won't reduce your algorithm's complexity, only its execution time (although for some operations numpy is orders of magnitude faster than pandas).
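
As a rough illustration of the numpy route, here is a sketch that computes the example metric (fraction of positive targets) for every candidate step at once, under the assumption that the scaled feature takes integer values in 0..100; the function name metric_for_all_steps is mine:

import numpy as np

def metric_for_all_steps(x, y, lower):
    # Metric of the band (lower, i] for every i in 0..100, computed in one
    # vectorised pass instead of building a boolean mask per i.
    vals = np.asarray(x, dtype=int)                  # assumed to already lie in 0..100
    pos = np.bincount(vals, weights=(np.asarray(y) == 1).astype(float), minlength=101)
    tot = np.bincount(vals, minlength=101).astype(float)
    cum_pos = np.cumsum(pos)                         # positives with feature value <= i
    cum_tot = np.cumsum(tot)                         # rows with feature value <= i
    band_pos = cum_pos - cum_pos[lower]              # positives in (lower, i]
    band_tot = cum_tot - cum_tot[lower]              # rows in (lower, i]
    return np.where(band_tot > 0, band_pos / np.maximum(band_tot, 1), 0.0)

For each segment you could then pick the first index where metric_for_all_steps(x, target, steps[-1]) >= min_metric (for example with np.argmax on that comparison, after checking that at least one entry passes), rather than filtering the Series 101 times.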

