I recommend first learning the concepts of BPTT (Backpropagation Through Time) and mini-batch SGD (Stochastic Gradient Descent); then you'll have a better understanding of the LSTM training procedure.
As for your questions:
Q1. In the stateless case, the LSTM updates its parameters on batch1 and then initializes fresh hidden states and cell states (usually all zeros) for batch2, while in the stateful case, it uses batch1's last output hidden states and cell states as the initial states for batch2.
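Since this stateful/stateless distinction is the one Keras exposes, here is a minimal Keras sketch of the two setups (the layer width, batch size, and input shape are made-up values, just for illustration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Stateless LSTM: hidden/cell states are reset to zeros at the start of every batch.
stateless = Sequential([
    LSTM(32, input_shape=(10, 1)),  # (timesteps, features)
    Dense(1),
])

# Stateful LSTM: the states left after batch i become the initial states for batch i+1.
# A fixed batch size must be declared via batch_input_shape, and sample k of each
# batch is assumed to continue sample k of the previous batch.
stateful = Sequential([
    LSTM(32, batch_input_shape=(16, 10, 1), stateful=True),  # (batch, timesteps, features)
    Dense(1),
])
```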
Q2. As you can see above, when the sequences in two consecutive batches are connected (e.g. successive prices of one stock), you'd better use stateful mode; otherwise (e.g. when each sequence represents a complete sentence), you should use stateless mode.
BTW, @vu.pham said that if we use a stateful RNN, then in production, the network is forced to deal with infinitely long sequences. This doesn't seem correct: as you can see in Q1, the LSTM won't learn on the whole sequence at once. It first learns the sequence in batch1, updates its parameters, and then learns the sequence in batch2, so only the state values (not gradients) are carried from one batch to the next, and backpropagation is still truncated at batch boundaries.
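A rough sketch of what that batch-by-batch training looks like, reusing the stateful model from above (the data here is hypothetical, just shaped to match batch_input_shape):

```python
import numpy as np

stateful.compile(optimizer="adam", loss="mse")

# Hypothetical data: one long sequence split into 8 consecutive chunks,
# each shaped (16, 10, 1) / (16, 1) to match batch_input_shape above.
x_chunks = [np.random.rand(16, 10, 1) for _ in range(8)]
y_chunks = [np.random.rand(16, 1) for _ in range(8)]

for epoch in range(5):
    for x_batch, y_batch in zip(x_chunks, y_chunks):
        # Parameters are updated after every chunk; only the hidden/cell state
        # *values* carry over to the next chunk, so BPTT never spans the whole sequence.
        stateful.train_on_batch(x_batch, y_batch)
    stateful.reset_states()  # start each epoch again from fresh (zero) states
```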