Gradients can be fetched wrt weights or outputs - we'll be needing the latter. Further, for best results, an architecture-specific treatment is desired. Below code & explanations cover every possible case of a Keras/TF RNN, and should be easily expandable to any future API changes.
Completeness: the code shown is a simplified version - the full version can be found at my repository, See RNN (this post included w/ bigger images).
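To give a flavor of the simplification, here is a minimal sketch of fetching gradients wrt a layer's outputs in TF2 eager mode - an illustrative assumption on my part (the repository version handles more configurations), assuming a functional/Sequential model and an MSE loss:

```python
import tensorflow as tf

def get_rnn_gradients(model, x, y, layer_idx):
    """d(loss)/d(layer outputs) - simplified sketch for TF2 eager mode."""
    # model returning both the RNN layer's outputs and the final predictions
    grad_model = tf.keras.Model(model.inputs,
                                [model.layers[layer_idx].output, model.output])
    with tf.GradientTape() as tape:
        rnn_out, preds = grad_model(x)   # rnn_out: (samples, timesteps, units)
        # substitute the model's actual loss; MSE assumed here for brevity
        loss = tf.reduce_mean(tf.keras.losses.mse(y, preds))
    return tape.gradient(loss, rnn_out).numpy()
```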
I/O dimensionalities (all RNNs):

- Input: `(batch_size, timesteps, channels)` - or, equivalently, `(samples, timesteps, features)`
- Output: same as Input, except (see the shape check below):
  - `channels`/`features` is now the # of RNN units, and:
  - `return_sequences=True` --> `timesteps_out = timesteps_in` (output a prediction for each input timestep)
  - `return_sequences=False` --> `timesteps_out = 1` (output prediction only at the last timestep processed)
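To make these dimensionalities concrete, a quick shape check (a minimal sketch; the layer sizes here are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(16, 100, 8).astype('float32')   # (samples, timesteps, channels)

lstm_seq  = tf.keras.layers.LSTM(6, return_sequences=True)
lstm_last = tf.keras.layers.LSTM(6, return_sequences=False)

print(lstm_seq(x).shape)    # (16, 100, 6): timesteps_out = timesteps_in
print(lstm_last(x).shape)   # (16, 6):      prediction at last timestep only
```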
Visualization methods:
```python
# for below examples
grads = get_rnn_gradients(model, x, y, layer_idx=1)  # return_sequences=True
grads = get_rnn_gradients(model, x, y, layer_idx=2)  # return_sequences=False
```
EX 1: one sample, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations

```python
show_features_1D(grads[0], n_rows=2)
```
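`show_features_1D` comes from the See RNN repository; a stripped-down sketch of the idea (my simplification, not the repo's exact code) is a grid of per-channel line plots:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_features_1D(data, n_rows=1):
    """Line-plot sketch: data is (timesteps, channels) or (samples, timesteps, channels)."""
    if data.ndim == 2:                      # single sample -> one curve per channel
        data = data[None]
    n_channels = data.shape[-1]
    n_cols = int(np.ceil(n_channels / n_rows))
    fig, axes = plt.subplots(n_rows, n_cols, sharey=True)
    for ch, ax in zip(range(n_channels), np.atleast_1d(axes).ravel()):
        for sample in data:                 # same color ordering per sample across channels
            ax.plot(sample[:, ch])
    plt.show()
```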
EX 2: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations

```python
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
```
- Each sample shown in a different color (but same color per sample across channels)
- Some samples perform better than the one shown above, but not by much
- The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue = -0.01, red = 0.01, white = 0 (gradient values)
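Likewise, a minimal sketch of what `show_features_2D` does (again my simplification): one heatmap per sample, channels vs. timesteps, with a symmetric color norm:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_features_2D(data, n_rows=1, norm=None):
    """Heatmap sketch: data is (samples, timesteps, channels)."""
    vmin, vmax = norm if norm else (None, None)
    n_cols = int(np.ceil(len(data) / n_rows))
    fig, axes = plt.subplots(n_rows, n_cols)
    for sample, ax in zip(data, np.atleast_1d(axes).ravel()):
        # transpose -> channels on y-axis, timesteps on x-axis
        ax.imshow(sample.T, cmap='bwr', vmin=vmin, vmax=vmax, aspect='auto')
    plt.show()
```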
EX 3: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 200 iterations

```python
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
```
- Both plots show the LSTM performing clearly better after 180 additional iterations
- Gradient still vanishes for about half the timesteps
- All LSTM units better capture time dependencies of one particular sample (blue curve, all plots) - which we can tell from the heatmap to be the first sample. We can plot that sample vs. other samples to try to understand the difference (sketched below)
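One way to do that comparison (a hypothetical snippet, assuming `grads` with the shapes from the examples above): overlay the first sample's gradient against the remaining samples' average, channel by channel:

```python
import matplotlib.pyplot as plt

first, rest = grads[0], grads[1:].mean(axis=0)   # (timesteps, channels) each
fig, axes = plt.subplots(2, 3, sharey=True)
for ch, ax in enumerate(axes.ravel()):
    ax.plot(first[:, ch], label='sample 0')
    ax.plot(rest[:, ch],  label='mean of others')
axes.ravel()[0].legend()
plt.show()
```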
EX 4: 2D vs. 1D, uni-LSTM: 256 units, `return_sequences=True`, trained for 200 iterations

```python
show_features_1D(grads[0])
show_features_2D(grads[:, :, 0], norm=(-.0001, .0001))
```
- 2D is better suited for comparing many channels across few samples
- 1D is better suited for comparing many samples across a few channels
EX 5: bi-GRU, 256 units (512 total) -- `return_sequences=True`, trained for 400 iterations

```python
show_features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)
```
- Backward layer's gradients are flipped for consistency wrt time axis (see the sketch below)
- Plot reveals a lesser-known advantage of Bi-RNNs - information utility: the collective gradient covers about twice the data. However, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented
- Lower `norm` for more units is expected, as approx. the same loss-derived gradient is being distributed across more parameters (hence the squared numeric average is less)
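What `reflect_half` does, in essence (my guess at the mechanics, assuming the `Bidirectional` wrapper's default `'concat'` merge mode, which stacks forward and backward outputs along the channel axis):

```python
# grads[0]: (timesteps, 512) for one sample; first 256 channels = forward GRU,
# last 256 = backward GRU (assuming the default 'concat' merge_mode)
forward, backward = grads[0][:, :256], grads[0][:, 256:]
backward = backward[::-1]   # flip along time axis for consistency wrt time
```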
![image](https://i.stack.imgur.com/ueGVB.png)
EX 6: 0D, all (16) samples, uni-LSTM, 6 units -- `return_sequences=False`, trained for 200 iterations

```python
show_features_0D(grads)
```
- `return_sequences=False` utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach
- Plot color-codes each RNN unit consistently across samples for comparison (can use one color instead)
- Evaluating gradient flow is less direct and more theoretically involved. One simple approach is to compare distributions at beginning vs. later in training: if the difference isn't significant, the RNN does poorly in learning long-term dependencies (a sketch of such a comparison follows)
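A hypothetical way to run that comparison (the names here are mine, not from the repository): capture the 0D gradients early and late in training and compare summary statistics:

```python
import numpy as np

# grads_early, grads_late: (samples, units) last-timestep gradients captured
# e.g. after 10 vs. 200 training iterations (hypothetical names)
def grad_stats(g):
    return np.abs(g).mean(), np.abs(g).std()

mean_early, std_early = grad_stats(grads_early)
mean_late,  std_late  = grad_stats(grads_late)
# little change in magnitude/spread suggests poor long-term dependency learning
print(f"early: mean={mean_early:.2e}, std={std_early:.2e}")
print(f"late:  mean={mean_late:.2e},  std={std_late:.2e}")
```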
![image](https://i.stack.imgur.com/693EO.png)