
python - How to visualize RNN/LSTM gradients in Keras/TensorFlow?

I've come across research publications and Q&As discussing the need to inspect RNN gradients per backpropagation through time (BPTT) - i.e., the gradient for each timestep.

The main use is introspection: how do we know if an RNN is learning long-term dependencies?

That's a question of its own, but the most important insight here is gradient flow:

  • If a non-zero gradient flows through every timestep, then every timestep contributes to learning - i.e., the resultant gradients stem from accounting for every input timestep, so the entire sequence influences weight updates

  • Per the above, an RNN no longer ignores portions of long sequences, and is forced to learn from them

... but how do I actually visualize these gradients in Keras / TensorFlow?

Some related answers are in the right direction, but they seem to fail for bidirectional RNNs, and only show how to get a layer's gradients, not how to meaningfully visualize them (the output is a 3D tensor - how do I plot it?)

asked by OverLordGoldDragon, translated from Stack Overflow


1 Reply


Gradients can be fetched w.r.t. weights or outputs - we'll be needing the latter.

Further, for best results, an architecture-specific treatment is desired.

Below code & explanations cover every possible case of a Keras/TF RNN, and should be easily expandable to any future API changes.

(以下代码和说明涵盖了Keras / TF RNN的所有可能情况 ,并且应该可以轻松扩展为将来的任何API更改。)


Completeness: the code shown is a simplified version - the full version can be found in my repository, See RNN (this post included, with bigger images); included are:

  • Greater visual customizability

  • Docstrings explaining all functionality

  • Support for Eager, Graph, TF1, TF2, and from keras & from tf.keras

  • Activations visualization

  • Weights gradients visualization (coming soon)

  • Weights visualization (coming soon)


I/O dimensionalities (all RNNs) - a quick shape check is sketched after this list:

  • Input: (batch_size, timesteps, channels) - or, equivalently, (samples, timesteps, features)

  • Output: same as Input, except:

    • channels / features is now the # of RNN units, and:

    • return_sequences=True --> timesteps_out = timesteps_in (output a prediction for each input timestep)

    • return_sequences=False --> timesteps_out = 1 (output prediction only at the last timestep processed)


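A quick way to sanity-check these shapes (sizes below are arbitrary, for illustration only):

import numpy as np
import tensorflow as tf

batch_size, timesteps, channels, units = 16, 20, 8, 6
x = np.random.randn(batch_size, timesteps, channels).astype('float32')

lstm_seq  = tf.keras.layers.LSTM(units, return_sequences=True)
lstm_last = tf.keras.layers.LSTM(units, return_sequences=False)

print(lstm_seq(x).shape)   # (16, 20, 6) - one output (and gradient) per input timestep
print(lstm_last(x).shape)  # (16, 6)     - only the last processed timestep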

Visualization methods (minimal matplotlib sketches of the first two follow this list):

  • 1D plot grid: plot gradient vs. timesteps for each of the channels

  • 2D heatmap: plot channels vs. timesteps with a gradient-intensity heatmap

  • 0D aligned scatter: plot gradient for each channel per sample

  • histogram: no good way to represent "vs. timesteps" relations

  • One sample: do each of above for a single sample

  • Entire batch: do each of above for all samples in a batch; requires careful treatment

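For orientation, here is a minimal sketch of the first two methods in plain matplotlib - the repository's show_features_1D / show_features_2D are considerably more full-featured, and the function names below are illustrative only:

import numpy as np
import matplotlib.pyplot as plt

def plot_1d_grid(grads_sample, n_rows=2):
    # 1D plot grid: gradient vs. timesteps, one subplot per channel.
    # grads_sample: (timesteps, channels), e.g. grads[0] for a single sample.
    timesteps, channels = grads_sample.shape
    n_cols = int(np.ceil(channels / n_rows))
    fig, axes = plt.subplots(n_rows, n_cols, sharex=True, sharey=True)
    for ch, ax in zip(range(channels), np.array(axes).ravel()):
        ax.plot(grads_sample[:, ch])
        ax.set_title('unit %d' % ch, fontsize=8)
    plt.show()

def plot_2d_heatmap(grads_sample, norm=(-.01, .01)):
    # 2D heatmap: channels (y-axis) vs. timesteps (x-axis), color = gradient value.
    plt.imshow(grads_sample.T, aspect='auto', cmap='bwr', vmin=norm[0], vmax=norm[1])
    plt.xlabel('timesteps'); plt.ylabel('channels')
    plt.colorbar(); plt.show()
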
# for below examples
grads = get_rnn_gradients(model, x, y, layer_idx=1) # return_sequences=True
grads = get_rnn_gradients(model, x, y, layer_idx=2) # return_sequences=False
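
get_rnn_gradients is defined in the repository linked above; it isn't reproduced in full here, so below is a minimal sketch of the core idea - fetching gradients of the loss w.r.t. a layer's outputs - assuming TF2 eager execution and an explicitly supplied loss. The function name and signature are illustrative, not the repo's exact API:

import tensorflow as tf

def layer_output_gradients(model, x, y, layer_idx=1,
                           loss_fn=tf.keras.losses.MeanSquaredError()):
    # Sub-model exposing the target layer's output alongside the model's prediction
    layer = model.layers[layer_idx]
    grad_model = tf.keras.Model(model.inputs, [layer.output, model.output])

    with tf.GradientTape() as tape:
        layer_out, preds = grad_model(x, training=False)
        tape.watch(layer_out)              # gradient w.r.t. outputs, not weights
        loss = loss_fn(y, preds)
    # (batch_size, timesteps, units) if return_sequences=True, else (batch_size, units)
    return tape.gradient(loss, layer_out).numpy()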

EX 1: one sample, uni-LSTM, 6 units -- return_sequences=True, trained for 20 iterations
show_features_1D(grads[0], n_rows=2)

  • Note: gradients are to be read right-to-left, as they're computed (from last timestep to first)

  • Rightmost (latest) timesteps consistently have a higher gradient

  • Vanishing gradient: ~75% of leftmost timesteps have a zero gradient, indicating poor time dependency learning

[image]


EX 2: all (16) samples, uni-LSTM, 6 units -- return_sequences=True, trained for 20 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))

  • Each sample shown in a different color (but same color per sample across channels)

  • Some samples perform better than the one shown above, but not by much

  • The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue = -0.01, red = 0.01, white = 0 (gradient values)

[images]


EX 3: all (16) samples, uni-LSTM, 6 units -- return_sequences=True, trained for 200 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))

  • Both plots show the LSTM performing clearly better after 180 additional iterations

  • Gradient still vanishes for about half the timesteps

  • All LSTM units better capture time dependencies of one particular sample (blue curve, all plots) - which we can tell from the heatmap to be the first sample. We can plot that sample vs. other samples to try to understand the difference

[images]


EX 4: 2D vs. 1D, uni-LSTM: 256 units, return_sequences=True, trained for 200 iterations
show_features_1D(grads[0])
show_features_2D(grads[:, :, 0], norm=(-.0001, .0001))

  • 2D is better suited for comparing many channels across few samples

  • 1D is better suited for comparing many samples across a few channels

[image]


EX 5: bi-GRU, 256 units (512 total) -- return_sequences=True, trained for 400 iterations
show_features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)

  • The backward layer's gradients are flipped for consistency w.r.t. the time axis (a minimal sketch of this flip follows after the image below)

  • Plot reveals a lesser-known advantage of Bi-RNNs - information utility: the collective gradient covers about twice the data. However, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented

  • A lower gradient norm for more units is expected, as approximately the same loss-derived gradient is being distributed across more parameters (hence the squared numeric average is less)

[image]

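For reference, a minimal sketch of the kind of flip reflect_half performs - assuming Bidirectional(..., merge_mode='concat'), so the gradient tensor's channel dimension is [forward units, backward units]; this is an illustration, not the repository's implementation:

import numpy as np

def align_backward_half(grads, units):
    # grads: (batch_size, timesteps, 2 * units), forward channels first, backward second.
    # Reverse the backward half along the time axis so both halves read left-to-right.
    fwd, bwd = grads[..., :units], grads[..., units:]
    return np.concatenate([fwd, bwd[:, ::-1, :]], axis=-1)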

EX 6: 0D, all (16) samples, uni-LSTM, 6 units -- return_sequences=False, trained for 200 iterations
show_features_0D(grads)

  • return_sequences=False utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach

  • Plot color-codes each RNN unit consistently across samples for comparison (can use one color instead)

  • Evaluating gradient flow is less direct and more theoretically involved. One simple approach is to compare distributions at the beginning vs. later in training (a minimal sketch follows after the image below): if the difference isn't significant, the RNN is doing poorly at learning long-term dependencies

<img src="https://stackoom.com/link/aHR0cHM6Ly9pLnN0YWNrLmltZ3VyLmNvbS82OTNFTy5wbmc=" width="560" referrerpolicy="no-

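One minimal way to implement that early-vs-late comparison (names and the |gradient| histogram choice are illustrative, not from the repository):

import numpy as np
import matplotlib.pyplot as plt

def compare_grad_distributions(grads_early, grads_late, bins=50):
    # Each input: (batch_size, units) - last-timestep gradients captured at two
    # points in training. Near-identical histograms suggest the gradient isn't
    # being shaped much by training, i.e. poor long-term dependency learning.
    plt.hist(np.abs(grads_early).ravel(), bins=bins, alpha=0.5, label='early in training')
    plt.hist(np.abs(grads_late).ravel(),  bins=bins, alpha=0.5, label='later in training')
    plt.xlabel('|gradient|'); plt.ylabel('count'); plt.legend(); plt.show()

# e.g.: compare_grad_distributions(grads_at_iter_10, grads_at_iter_200)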
