如何对使用其自身输出的滞后值的函数进行矢量化处理?
问题内容:
我很抱歉这个问题的措辞不佳,但这是我所能做的最好的。我确切地知道我想要什么,但不完全知道如何提出要求。
这是一个示例演示的逻辑:
取值为1或0的两个条件触发一个信号,取值为1或0的一个信号。条件A触发该信号(如果A = 1,则signal = 1,否则signal =
0),无论如何。条件B不会触发信号,但是如果条件B先前已触发信号后条件B保持等于1,则信号将保持触发状态。只有在A和B都回到0后,信号才返回0。
1.输入:
2.所需的输出(signal_d) 并确认for循环可以解决它(signal_l):
3.我尝试使用numpy.where():
4.可复制的代码段:
# Settings
import numpy as np
import pandas as pd
import datetime
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)
i=0
for observations in df['signal_l']:
if df.ix[i,'condition_A'] == 1:
df.ix[i,'signal_l'] = 1
else:
# Signal previously triggered by condition_A
# AND kept "alive" by condition_B:
if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
df.ix[i,'signal_l'] = 1
else:
df.ix[i,'signal_l'] = 0
i = i + 1
# My attempt with np.where in column signal_v1
df['Signal_v1'] = df['condition_A'].copy()
df['Signal_v1'] = np.where(df.condition_A == 1, 1, np.where( (df.shift(1).Signal_v1 == 1) & (df.condition_B == 1), 1, 0))
print(df)
使用带有滞后值和嵌套if语句的for循环,这非常简单,但是我无法使用诸如的矢量化函数来弄清楚numpy.where()
。我知道对于更大的数据帧,这会更快。
感谢您的任何建议!
问题答案:
我认为没有矢量化此操作的方法会比Python循环快得多。(至少,如果您只想使用Python,pandas和numpy,则不需要。)
但是,您可以通过简化代码来提高此操作的性能。您的实现使用if
语句和许多DataFrame索引。这些是相对昂贵的操作。
这是对脚本的修改,其中包括两个功能:add_signal_l(df)
和add_lagged(df)
。第一个是您的代码,仅包装在一个函数中。第二个使用更简单的函数来达到相同的结果-
仍然是Python循环,但它使用numpy数组和按位运算符。
import numpy as np
import pandas as pd
import datetime
#-----------------------------------------------------------------------
# Create the test DataFrame
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
#-----------------------------------------------------------------------
def add_signal_l(df):
# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)
i=0
for observations in df['signal_l']:
if df.ix[i,'condition_A'] == 1:
df.ix[i,'signal_l'] = 1
else:
# Signal previously triggered by condition_A
# AND kept "alive" by condition_B:
if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
df.ix[i,'signal_l'] = 1
else:
df.ix[i,'signal_l'] = 0
i = i + 1
def compute_lagged_signal(a, b):
x = np.empty_like(a)
x[0] = a[0]
for i in range(1, len(a)):
x[i] = a[i] | (x[i-1] & b[i])
return x
def add_lagged(df):
df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
这是在IPython会话中运行的两个函数的计时比较:
In [85]: df
Out[85]:
condition_A condition_B signal_d
dates
2017-06-09 0 0 0
2017-06-10 0 1 0
2017-06-11 0 1 0
2017-06-12 0 1 0
2017-06-13 1 0 1
2017-06-14 1 0 1
2017-06-15 0 1 1
2017-06-16 0 1 1
2017-06-17 0 1 1
2017-06-18 0 1 1
2017-06-19 0 1 1
2017-06-20 1 0 1
2017-06-21 1 0 1
2017-06-22 0 0 0
In [86]: %timeit add_signal_l(df)
8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [87]: %timeit add_lagged(df)
137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
如您所见,add_lagged(df)
速度更快。