I'm writing an algorithm for feature selection in a binary classification problem. It sweeps through a np.array or pd.Series to find intervals with a good target split, using a greedy approach. The code works, but because I use a for loop with an if conditional, the performance is quite slow. I was wondering if there's a smarter (and faster) way to do this. My code looks something like this:
import pandas as pd

df = pd.DataFrame([[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0],
                   [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]],
                  columns=['feat1', 'feat2', 'target'])
target = df['target']  # values are either 0 or 1

def my_generic_metric_function(y):
    # This is just a generic metric that I'm using as an example:
    # the fraction of positive targets in the band.
    if len(y) > 0:
        tgt = sum(y == 1)
        no_tgt = sum(y == 0)
        return 1.0 * tgt / (no_tgt + tgt)
    else:
        return 0

def find_intervals(x, min_metric=0.5):
    ## Important: all my features receive a treatment that "fits" them into a range from 0 to 100.
    ## Note that I'm not iterating over the DataFrame; I'm iterating over a range of
    ## values and finding the matching partitions in the DataFrame.
    print(x.name)
    steps = [0]
    metric_partition = []
    for i in range(0, 101):
        ## The target series filtered by the interval of x values
        band = target[(x > steps[-1]) & (x <= i)]
        partition_metric = my_generic_metric_function(band)
        if partition_metric >= min_metric:
            steps.append(i)
            metric_partition.append(partition_metric)
    return {'f': x.name, 's': steps, 'm': metric_partition}
And I would apply this function to an entire DataFrame using .apply():

bi_df = df.drop("target", axis=1).apply(find_intervals)
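One way the boolean-mask cost could be avoided is with cumulative counts: since the features fit into 0-100, np.bincount can tally samples and positives per value once, after which each band's metric is an O(1) array lookup instead of a full filter over the Series. This is a sketch, not a drop-in replacement: `find_intervals_fast` is a hypothetical name, it assumes the features only take integer values in [0, 100], and the 0.5 default for `min_metric` is an arbitrary fraction threshold.

```python
import numpy as np
import pandas as pd

def find_intervals_fast(x, target, min_metric=0.5):
    # Hypothetical sketch: precompute per-value counts once with np.bincount,
    # so each candidate band costs O(1) instead of a full boolean mask.
    # Assumes the feature only takes integer values in [0, 100].
    v = x.to_numpy().astype(int)
    t = target.to_numpy()
    cum_tot = np.cumsum(np.bincount(v, minlength=101))             # samples with value <= i
    cum_pos = np.cumsum(np.bincount(v, weights=t, minlength=101))  # positives with value <= i
    steps, metrics = [0], []
    for i in range(101):
        n = cum_tot[i] - cum_tot[steps[-1]]  # size of the band (steps[-1], i]
        p = cum_pos[i] - cum_pos[steps[-1]]  # positives in the band
        metric = p / n if n else 0
        if metric >= min_metric:
            steps.append(i)
            metrics.append(metric)
    return {'f': x.name, 's': steps, 'm': metrics}
```

The greedy loop itself stays (each accepted step changes which bands are considered next, so it is inherently sequential), but the per-iteration work drops from O(len(x)) to O(1).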
This problem looks a lot like the CART algorithm; however, I haven't found any implementation that would help me optimize it.
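On the CART connection: this is not the greedy interval sweep above, but for comparison a shallow scikit-learn DecisionTreeClassifier fit on a single feature exposes its learned split thresholds via `tree_.threshold`, which play a role similar to the interval boundaries in `steps`. The `max_depth=3` below is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame([[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0],
                   [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]],
                  columns=['feat1', 'feat2', 'target'])

# Fit a shallow tree on a single feature; its internal split thresholds
# are analogous to the interval boundaries found by the greedy sweep.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(df[['feat1']], df['target'])

# Leaf nodes carry the sentinel threshold -2; keep only the real split points.
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)
```

For this toy data the tree separates the pure run of positives below the negatives (a split between 53 and 61) and then isolates the positive at 95 (a split between 83 and 95).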