Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
353 views
in Technique[技术] by (71.8m points)

python - Pandas custom function to find whether it is the 1st, 2nd etc Monday, Tuesday, etc - all suggestions welcome

So I have the following code which reads in 5 columns, date ohlc. It then creates a column 'dow' to hold day of week. So far so good:

import numpy as np
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Forex/EURUSD-2018_12_18-2020_11_01.csv',parse_dates=True,names = ['date','1','2','3','4',])

df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
df['dow'] = df['date'].dt.dayofweek
#df['downum'] = df.apply(lambda x: downu(x['date']))
df

Producing the following output:

                    date                1       2       3       4       dow
date                        
2018-12-18 00:00:00 2018-12-18 00:00:00 1.13498 1.13497 1.13508 1.13494 1
2018-12-18 00:01:00 2018-12-18 00:01:00 1.13497 1.13500 1.13500 1.13496 1
2018-12-18 00:02:00 2018-12-18 00:02:00 1.13500 1.13498 1.13502 1.13495 1
2018-12-18 00:03:00 2018-12-18 00:03:00 1.13498 1.13513 1.13513 1.13498 1
2018-12-18 00:04:00 2018-12-18 00:04:00 1.13513 1.13511 1.13515 1.13511 1
... ... ... ... ... ... ...
2020-11-01 23:55:00 2020-11-01 23:55:00 1.16402 1.16408 1.16410 1.16401 6
2020-11-01 23:56:00 2020-11-01 23:56:00 1.16409 1.16408 1.16410 1.16405 6
2020-11-01 23:57:00 2020-11-01 23:57:00 1.16409 1.16417 1.16418 1.16408 6
2020-11-01 23:58:00 2020-11-01 23:58:00 1.16417 1.16416 1.16418 1.16414 6
2020-11-01 23:59:00 2020-11-01 23:59:00 1.16418 1.16419 1.16419 1.16413 6

Now I want to do add the following custom function:

def downu(dtime):
  d = dtime.dt.day
  x = np.ceil(d/7)
  return x

and call it before displaying the dataframe like this:

df['downum'] = df.apply(lambda x: downu(x['date']))

to add a column indicating first '1', second '2'.... fifth '5' xxxday in the month

However this produces the following error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

/usr/local/lib/python3.6/dist-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375 

11 frames
ParserError: Unknown string format: date

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

ValueError: could not convert string to Timestamp

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    603                 key = self._maybe_cast_for_get_loc(key)
    604             except ValueError as err:
--> 605                 raise KeyError(key) from err
    606 
    607         elif isinstance(key, timedelta):

KeyError: 'date'

I have seen the following suggested in similar situations:

df['downum'] = df.apply(lambda x: downu(x.date))

but this produces the following (understandable) error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-7f9aa69c7ea7> in <module>()
     12 df.index = df['date']
     13 df['dow'] = df['date'].dt.dayofweek
---> 14 df['downum'] = df.apply(lambda x: downu(x.date))
     15 df

5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142 
   5143     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'date'

Any solutions?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Try:

df['downum'] = df['date'].apply(downu)

and change downu to:

def downu(dtime):
  d = dtime.day      # use dtime.day rather than dtime.dt.day
  x = np.ceil(d/7)
  return int(x)      # cast to int since d/7 is float even after np.ceil()

df.apply() works on the whole df, i.e. all columns one by one in turn (column-wise). The index of each processing column Series is still the DataFrame index. Hence, the column label 'date' cannot be used as index to this intermediate Series being processed. You have to use apply() on the 'date' Series instead unless your downu() function can accept values of all columns and ignore irrelevant columns.

Edit:

Add alternate solutions using apply(... ,axis=1) and list(map(...))

Here are further solutions, some of which I think was OP's originally attempted target way of coding. I will also discuss their pros and cons with respect to system performance (execution time) for large dataset and also for program readability (clarity).

Alternate solution 1:

%%timeit
df['downum'] = df.apply(lambda x: downu(x['date']), axis=1)

988 μs ± 8.26 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 2:

%%timeit
df['downum'] = df.apply(lambda x: downu(x.date), axis=1)

1.01 ms ± 13.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 3:

%%timeit
df['downum'] = list(map(downu, df['date']))

244 μs ± 3.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

%%timeit 
df['downum'] = df['date'].apply(downu)

810 μs ± 484 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solutions 1 and 2 are supposed to be OP's originally attempted target format of coding. The only difference is that axis=1 is added to make them workable. Now, with axis=1 added to the DataFrame.apply() function, downu(x['date'] and downu(x.date) can be used within the lambda function. Effectively, axis=1 modifies behavior of DataFrame.apply() function to allow the column index names be used. This can be better understood by thinking the function inside apply() is passed with a Series object, row-wise. The Series object has the original DataFrame column names/indices now become the Series index. Hence, you can access the elements in the same way as you access the Series elements by coding in format like series_obj['index'].

Comparison of solutions:

Comparing the execution time of the original solution (using pandas.Series.apply()) with the 2 alternate solutions using pandas.DataFrame.apply(... ,axis=1), the original solution is still a little bit faster. In terms of program readability, the original solution working on the df['date'] pandas Series is perceived to be simple and better.

In consideration of system performance, alternate solution 3 using list(map(...)) is 3x ~ 4x times faster than all other solutions. Note that this performance comparison result of DataFrame.apply(..., axis=1) vs list(map(..)) is generic rather than specific to this question. You can refer to this answer of the post How to apply a function to two columns of Pandas dataframe for a more in-depth discussion of the topic. Some other answers of that same post are also very useful for better understanding of the apply() function.

In summary, if the dataset is not large and system performance not a major consideration, use the original solution using pandas.Series.apply() in favor of program readability and clarity. Otherwise, for system performance consideration, using list(map(...)) is far superior to the pandas.DataFrame.apply(... ,axis=1) approach.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...