Creating and ID for each sequence and using groupby:
df['Seq'] = (df.state.diff() != 0).cumsum() # ID for each sequence
df.groupby(['Seq', 'state']).agg(
{'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]} # compute return of first each sequence
).groupby('state').mid_price.mean() # compute average return for each state
The Seq
column is computed so that all the rows belonging to a sequence have the same ID:
Timestamp mid_price state Seq
0 2020-01-01 00:05:25+00:00 528.5 0 1
1 2020-01-01 00:05:25.500000+00:00 529.0 0 1
2 2020-01-01 00:05:26+00:00 527.5 1 2
3 2020-01-01 00:05:26.500000+00:00 528.0 1 2
4 2020-01-01 00:05:27+00:00 529.0 1 2
5 2020-01-01 00:05:27.500000+00:00 531.5 -1 3
6 2020-01-01 00:05:28+00:00 531.5 -1 3
7 2020-01-01 00:05:28.500000+00:00 532.5 1 4
8 2020-01-01 00:05:29+00:00 532.5 1 4
9 2020-01-01 00:05:29.500000+00:00 530.0 1 4
The main idea is to identify the elements of the state
column where the value change of state
change using diff
: if the state is different w.r.t. the previous column, the difference will be different than zero. Then, you can create an increasing ID using cumsum
, i.e. the cumulative sum. This works also because the first element returned by the diff is NaN, that is different from zero.
Once you have the Seq
column, the solution of your problem is very simple: firstly, group by state
and Seq
and get the return for each sequence and then compute the average of returns for each state.
Here the complete code and the result:
import pandas as pd
df = pd.DataFrame(
columns=["Timestamp", "mid_price", "state"],
data=[
["2020-01-01 00:05:25+00:00", 528.50, 0],
["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
["2020-01-01 00:05:26+00:00", 527.50, 1],
["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
["2020-01-01 00:05:27+00:00", 529.00, 1],
["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
["2020-01-01 00:05:28+00:00", 531.50, -1],
["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
["2020-01-01 00:05:29+00:00", 532.50, 1],
["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
],
)
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()
df.groupby(['Seq', 'state']).agg(
{'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]} # compute return of first each sequence
).groupby('state').mid_price.mean() # compute average return for each state
state
-1 0.000000
0 0.000946
1 -0.000926
Name: mid_price, dtype: float64