Gradients can be fetched wrt weights or outputs - we'll be needing the latter. Further, for best results, an architecture-specific treatment is desired. Below code & explanations cover every possible case of a Keras/TF RNN, and should be easily expandable to any future API changes.
Completeness: the code shown is a simplified version - the full version can be found at my repository, See RNN (this post included w/ bigger images).
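To give a flavor of the simplification, here is a minimal sketch of fetching gradients wrt a layer's outputs in TF2 eager mode - an illustrative assumption on my part (the repository version handles more configurations), assuming a functional/Sequential model and an MSE loss:

```python
import tensorflow as tf

def get_rnn_gradients(model, x, y, layer_idx):
    """d(loss)/d(layer outputs) - simplified sketch for TF2 eager mode."""
    # model returning both the RNN layer's outputs and the final predictions
    grad_model = tf.keras.Model(model.inputs,
                                [model.layers[layer_idx].output, model.output])
    with tf.GradientTape() as tape:
        rnn_out, preds = grad_model(x)   # rnn_out: (samples, timesteps, units)
        # substitute the model's actual loss; MSE assumed here for brevity
        loss = tf.reduce_mean(tf.keras.losses.mse(y, preds))
    return tape.gradient(loss, rnn_out).numpy()
```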
I/O dimensionalities (all RNNs):

- Input: `(batch_size, timesteps, channels)` - or, equivalently, `(samples, timesteps, features)`
- Output: same as Input, except (see the shape check below):
  - `channels`/`features` is now the # of RNN units, and:
  - `return_sequences=True` --> `timesteps_out = timesteps_in` (output a prediction for each input timestep)
  - `return_sequences=False` --> `timesteps_out = 1` (output prediction only at the last timestep processed)
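To make these dimensionalities concrete, a quick shape check (a minimal sketch; the layer sizes here are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = np.random.randn(16, 100, 8).astype('float32')   # (samples, timesteps, channels)

lstm_seq  = tf.keras.layers.LSTM(6, return_sequences=True)
lstm_last = tf.keras.layers.LSTM(6, return_sequences=False)

print(lstm_seq(x).shape)    # (16, 100, 6): timesteps_out = timesteps_in
print(lstm_last(x).shape)   # (16, 6):      prediction at last timestep only
```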
Visualization methods:
```python
# for below examples
grads = get_rnn_gradients(model, x, y, layer_idx=1)  # return_sequences=True
grads = get_rnn_gradients(model, x, y, layer_idx=2)  # return_sequences=False
```
EX 1: one sample, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations

```python
show_features_1D(grads[0], n_rows=2)
```
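`show_features_1D` comes from the See RNN repository; a stripped-down sketch of the idea (my simplification, not the repo's exact code) is a grid of per-channel line plots:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_features_1D(data, n_rows=1):
    """Line-plot sketch: data is (timesteps, channels) or (samples, timesteps, channels)."""
    if data.ndim == 2:                      # single sample -> one curve per channel
        data = data[None]
    n_channels = data.shape[-1]
    n_cols = int(np.ceil(n_channels / n_rows))
    fig, axes = plt.subplots(n_rows, n_cols, sharey=True)
    for ch, ax in zip(range(n_channels), np.atleast_1d(axes).ravel()):
        for sample in data:                 # same color ordering per sample across channels
            ax.plot(sample[:, ch])
    plt.show()
```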
EX 2: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations

```python
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
```
- Each sample shown in a different color (but same color per sample across channels)
- Some samples perform better than the one shown above, but not by much
- The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue = -0.01, red = 0.01, white = 0 (gradient values)
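Likewise, a minimal sketch of what `show_features_2D` does (again my simplification): one heatmap per sample, channels vs. timesteps, with a symmetric color norm:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_features_2D(data, n_rows=1, norm=None):
    """Heatmap sketch: data is (samples, timesteps, channels)."""
    vmin, vmax = norm if norm else (None, None)
    n_cols = int(np.ceil(len(data) / n_rows))
    fig, axes = plt.subplots(n_rows, n_cols)
    for sample, ax in zip(data, np.atleast_1d(axes).ravel()):
        # transpose -> channels on y-axis, timesteps on x-axis
        ax.imshow(sample.T, cmap='bwr', vmin=vmin, vmax=vmax, aspect='auto')
    plt.show()
```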
EX 3: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 200 iterations

```python
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
```
- Both plots show the LSTM performing clearly better after 180 additional iterations
- Gradient still vanishes for about half the timesteps
- All LSTM units better capture time dependencies of one particular sample (blue curve, all plots) - which we can tell from the heatmap to be the first sample. We can plot that sample vs. other samples to try to understand the difference (sketched below)
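One way to do that comparison (a hypothetical snippet, assuming `grads` with the shapes from the examples above): overlay the first sample's gradient against the remaining samples' average, channel by channel:

```python
import matplotlib.pyplot as plt

first, rest = grads[0], grads[1:].mean(axis=0)   # (timesteps, channels) each
fig, axes = plt.subplots(2, 3, sharey=True)
for ch, ax in enumerate(axes.ravel()):
    ax.plot(first[:, ch], label='sample 0')
    ax.plot(rest[:, ch],  label='mean of others')
axes.ravel()[0].legend()
plt.show()
```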
EX 4: 2D vs. 1D, uni-LSTM: 256 units, `return_sequences=True`, trained for 200 iterations

```python
show_features_1D(grads[0])
show_features_2D(grads[:, :, 0], norm=(-.0001, .0001))
```
- 2D is better suited for comparing many channels across few samples
- 1D is better suited for comparing many samples across a few channels
EX 5: bi-GRU, 256 units (512 total) -- `return_sequences=True`, trained for 400 iterations

```python
show_features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)
```
- Backward layer's gradients are flipped for consistency wrt time axis (see the sketch below)
- Plot reveals a lesser-known advantage of Bi-RNNs - information utility: the collective gradient covers about twice the data. However, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented
- Lower `norm` for more units is expected, as approx. the same loss-derived gradient is being distributed across more parameters (hence the squared numeric average is less)
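What `reflect_half` does, in essence (my guess at the mechanics, assuming the `Bidirectional` wrapper's default `'concat'` merge mode, which stacks forward and backward outputs along the channel axis):

```python
# grads[0]: (timesteps, 512) for one sample; first 256 channels = forward GRU,
# last 256 = backward GRU (assuming the default 'concat' merge_mode)
forward, backward = grads[0][:, :256], grads[0][:, 256:]
backward = backward[::-1]   # flip along time axis for consistency wrt time
```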
![image](https://i.stack.imgur.com/ueGVB.png)
EX 6: 0D, all (16) samples, uni-LSTM, 6 units -- `return_sequences=False`, trained for 200 iterations

```python
show_features_0D(grads)
```
- `return_sequences=False` utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach
- Plot color-codes each RNN unit consistently across samples for comparison (can use one color instead)
- Evaluating gradient flow is less direct and more theoretically involved. One simple approach is to compare distributions at beginning vs. later in training: if the difference isn't significant, the RNN does poorly in learning long-term dependencies (a sketch of such a comparison follows)
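A hypothetical way to run that comparison (the names here are mine, not from the repository): capture the 0D gradients early and late in training and compare summary statistics:

```python
import numpy as np

# grads_early, grads_late: (samples, units) last-timestep gradients captured
# e.g. after 10 vs. 200 training iterations (hypothetical names)
def grad_stats(g):
    return np.abs(g).mean(), np.abs(g).std()

mean_early, std_early = grad_stats(grads_early)
mean_late,  std_late  = grad_stats(grads_late)
# little change in magnitude/spread suggests poor long-term dependency learning
print(f"early: mean={mean_early:.2e}, std={std_early:.2e}")
print(f"late:  mean={mean_late:.2e},  std={std_late:.2e}")
```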
![image](https://i.stack.imgur.com/693EO.png)