pandas - How to calculate average returns over separate consecutive ranges determined by another column in Python?

Question

Welcome To Ask or Share your Answers For Others

pandas - How to calculate average returns over separate consecutive ranges determined by another column in Python?

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

pandas - How to calculate average returns over separate consecutive ranges determined by another column in Python?

I currently have a Pandas DataFrame which contains a time series of asset prices and a column containing a "state". There are three states -1, 0, 1 that occur at various points in the data.

I am trying to find the average return on the asset in each of these states, ideally using a vectorised method.

Here is an example of the DataFrame:

| Timestamp                        | mid_price | state |
|----------------------------------|-----------|-------|
| 2020-01-01 00:05:25+00:00        | 528.50    | 0     |
| 2020-01-01 00:05:25.500000+00:00 | 529.00    | 0     |
| 2020-01-01 00:05:26+00:00        | 527.50    | 1     |
| 2020-01-01 00:05:26.500000+00:00 | 528.00    | 1     |
| 2020-01-01 00:05:27+00:00        | 529.00    | 1     |
| 2020-01-01 00:05:27.500000+00:00 | 531.50    | -1    |
| 2020-01-01 00:05:28+00:00        | 531.50    | -1    |
| 2020-01-01 00:05:28.500000+00:00 | 532.50    | 1     |
| 2020-01-01 00:05:29+00:00        | 532.50    | 1     |
| 2020-01-01 00:05:29.500000+00:00 | 530.00    | 1     |

I am trying to calculate the average return for each state, so for example for state 1:

First sequence: (529 - 527.50) / 527.50 = 0.0028...

Second sequence: (530 - 532.50) / 532.50 = -0.0047...

Average return: (0.0028 - 0.0047) / 2 = -0.00095

Is there a neat, vectorised way to do this?

question from:https://stackoverflow.com/questions/65871841/how-to-calculate-average-returns-over-separate-consecutive-ranges-determined-by

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:23:22+0000

Creating and ID for each sequence and using groupby:

df['Seq'] = (df.state.diff() != 0).cumsum()  # ID for each sequence
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state

The Seq column is computed so that all the rows belonging to a sequence have the same ID:

                         Timestamp  mid_price  state  Seq
0        2020-01-01 00:05:25+00:00      528.5      0    1
1 2020-01-01 00:05:25.500000+00:00      529.0      0    1
2        2020-01-01 00:05:26+00:00      527.5      1    2
3 2020-01-01 00:05:26.500000+00:00      528.0      1    2
4        2020-01-01 00:05:27+00:00      529.0      1    2
5 2020-01-01 00:05:27.500000+00:00      531.5     -1    3
6        2020-01-01 00:05:28+00:00      531.5     -1    3
7 2020-01-01 00:05:28.500000+00:00      532.5      1    4
8        2020-01-01 00:05:29+00:00      532.5      1    4
9 2020-01-01 00:05:29.500000+00:00      530.0      1    4

The main idea is to identify the elements of the state column where the value change of state change using diff: if the state is different w.r.t. the previous column, the difference will be different than zero. Then, you can create an increasing ID using cumsum, i.e. the cumulative sum. This works also because the first element returned by the diff is NaN, that is different from zero.

Once you have the Seq column, the solution of your problem is very simple: firstly, group by state and Seq and get the return for each sequence and then compute the average of returns for each state.

Here the complete code and the result:

import pandas as pd

df = pd.DataFrame(
    columns=["Timestamp", "mid_price", "state"],
    data=[
        ["2020-01-01 00:05:25+00:00", 528.50, 0],
        ["2020-01-01 00:05:25.500000+00:00", 529.00, 0],
        ["2020-01-01 00:05:26+00:00", 527.50, 1],
        ["2020-01-01 00:05:26.500000+00:00", 528.00, 1],
        ["2020-01-01 00:05:27+00:00", 529.00, 1],
        ["2020-01-01 00:05:27.500000+00:00", 531.50, -1],
        ["2020-01-01 00:05:28+00:00", 531.50, -1],
        ["2020-01-01 00:05:28.500000+00:00", 532.50, 1],
        ["2020-01-01 00:05:29+00:00", 532.50, 1],
        ["2020-01-01 00:05:29.500000+00:00", 530.00, 1],
    ],
)

df['Timestamp'] = pd.to_datetime(df.Timestamp)
df['Seq'] = (df.state.diff() != 0).cumsum()
df.groupby(['Seq', 'state']).agg(
    {'mid_price': lambda x: (x.values[-1] - x.values[0]) / x.values[0]}  # compute return of first each sequence
).groupby('state').mid_price.mean()  # compute average return for each state

state
-1    0.000000
 0    0.000946
 1   -0.000926
Name: mid_price, dtype: float64

Categories

pandas - How to calculate average returns over separate consecutive ranges determined by another column in Python?

pandas - How to calculate average returns over separate consecutive ranges determined by another column in Python?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags