

python - Optimizing for loops, dataframe partition

I'm writing a feature-selection algorithm for a binary classification problem. It sweeps through a np.array or pd.Series looking for intervals with a good target split, using a greedy approach.

The code works fine; however, it relies on a for loop with an if conditional, so performance is quite slow. I was wondering if there's a smarter (and faster) way to do this. My code looks something like this:

import pandas as pd

df = pd.DataFrame([[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0],
                   [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]],
                  columns=['feat1', 'feat2', 'target'])
target = df['target']  # binary target: values are 0 or 1

def my_generic_metric_function(y):
    # This is just a generic metric that I'm using as an example:
    # the fraction of rows in the interval whose target is 1.
    if len(y) > 0:
        tgt = sum(y == 1)
        no_tgt = sum(y == 0)
        return 1.0 * tgt / (tgt + no_tgt)
    else:
        return 0


def find_intervals(x, min_metric=10):
    # Important: all my features receive a treatment that "fits" them into a range from 0 to 100.
    # Note that I'm not iterating through the DataFrame; I'm iterating over a range of values
    # and finding the partitions in the DataFrame.
    print(x.name)
    steps = [0]
    metric_partition = []
    for i in range(0, 101):
        # The target series filtered by the current interval of x values
        band = target[(x > steps[-1]) & (x <= i)]
        partition_metric = my_generic_metric_function(band)

        if partition_metric >= min_metric:
            steps.append(i)
            metric_partition.append(partition_metric)

    return {'f': x.name, 's': steps, 'm': metric_partition}

And I would apply this function to an entire DataFrame using .apply():

bi_df = df.drop("target", axis=1).apply(find_intervals)

This problem looks a lot like a CART algorithm; however, I haven't found any implementation that could help me optimize mine.
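
For reference, a minimal sketch of how an off-the-shelf CART implementation exposes its split points, using scikit-learn's DecisionTreeClassifier (the max_depth of 3 is arbitrary, and the tree optimizes its own impurity criterion, not my metric):

from sklearn.tree import DecisionTreeClassifier

X = df.drop("target", axis=1)
tree = DecisionTreeClassifier(max_depth=3).fit(X, df["target"])
# tree_.feature gives the column index used at each internal node (-2 marks leaves),
# tree_.threshold the corresponding split value
for feat_idx, thr in zip(tree.tree_.feature, tree.tree_.threshold):
    if feat_idx >= 0:
        print(X.columns[feat_idx], thr)

So it does produce per-feature thresholds, but under Gini/entropy rather than my interval metric, which is why it doesn't directly solve my problem.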



1 Reply


It seems that my_generic_metric_function is monotonically increasing with respect to i in the loop. I can't prove it right now, but I've run a few simulations with random numbers and they all showed this behaviour.
If that is the case, then instead of doing a linear search (for i in range(0, 101)) you can do an (almost) binary search for the first value of i that reaches your min_metric threshold.
It would be almost a plain binary search, but as a second verification step you would need to check that the value just before the one you found is still below the threshold.
This would reduce the inner loop from O(n) to O(log n).
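
A minimal sketch of that idea, assuming the metric really is monotonically non-decreasing in i (the helper name find_first_step is mine, not part of your code):

def find_first_step(x, target, lower, min_metric):
    # Binary search for the smallest i in (lower, 100] whose band (lower, i]
    # reaches min_metric. Valid only if the metric never decreases as i grows.
    lo, hi = lower + 1, 100
    found = None
    while lo <= hi:
        mid = (lo + hi) // 2
        band = target[(x > lower) & (x <= mid)]
        if my_generic_metric_function(band) >= min_metric:
            found = mid        # mid passes; look for an even smaller step
            hi = mid - 1
        else:
            lo = mid + 1       # mid is too small; search the upper half
    return found               # None if no i in (lower, 100] reaches the threshold

find_intervals would then call find_first_step(x, target, steps[-1], min_metric) repeatedly, appending the result to steps until it returns None, instead of scanning every i. The "check the value just before" step is implicit here: hi keeps shrinking until mid - 1 no longer passes.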

Besides that, you can also try to index your feature columns.

Other options would be to use numpy instead of pandas for everything, to adapt and compile find_intervals, or even to parallelize the per-column apply (with multiprocessing or another simple approach). But these three won't reduce your algorithm's complexity, only its execution time (although for some operations numpy is orders of magnitude faster than pandas).
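
As a rough illustration of the numpy route, here is a sketch that computes the example metric (fraction of positive targets) for every candidate step at once, under the assumption that the scaled feature takes integer values in 0..100; the function name metric_for_all_steps is mine:

import numpy as np

def metric_for_all_steps(x, y, lower):
    # Metric of the band (lower, i] for every i in 0..100, computed in one
    # vectorised pass instead of building a boolean mask per i.
    vals = np.asarray(x, dtype=int)                  # assumed to already lie in 0..100
    pos = np.bincount(vals, weights=(np.asarray(y) == 1).astype(float), minlength=101)
    tot = np.bincount(vals, minlength=101).astype(float)
    cum_pos = np.cumsum(pos)                         # positives with feature value <= i
    cum_tot = np.cumsum(tot)                         # rows with feature value <= i
    band_pos = cum_pos - cum_pos[lower]              # positives in (lower, i]
    band_tot = cum_tot - cum_tot[lower]              # rows in (lower, i]
    return np.where(band_tot > 0, band_pos / np.maximum(band_tot, 1), 0.0)

For each segment you could then pick the first index where metric_for_all_steps(x, target, steps[-1]) >= min_metric (for example with np.argmax on that comparison, after checking that at least one entry passes), rather than filtering the Series 101 times.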

