The answer to the short question is "No". Warp-level branch divergence around a __syncthreads() instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code would be like this:
__global__ void kernel(...)
{
    if (tidx < N) {
        // Code stanza #1
    }
    __syncthreads();  // outside the branch: reached by every thread
    if (tidx < N) {
        // Code stanza #2
    }
    // etc
}
so that the __syncthreads() instructions are executed unconditionally.
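To make the pattern concrete, here is a minimal sketch of a kernel that guards its work with `tidx < n` but keeps the barrier outside the branch. The kernel name `reverse_prefix`, the block size `BLOCK`, and the host helper `run_reverse_check` are all hypothetical names chosen for illustration; they are not from the original question. With `n = 50` the second warp diverges on the guard, yet every thread still reaches `__syncthreads()`.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;  // threads per block (assumed for this sketch)

// Threads with tidx < n stage one element in shared memory (stanza #1),
// every thread then reaches the barrier, and threads with tidx < n read
// an element staged by a different thread (stanza #2).
__global__ void reverse_prefix(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK];
    int tidx = threadIdx.x;

    if (tidx < n) {
        tile[tidx] = in[tidx];           // stanza #1
    }
    __syncthreads();                     // unconditional: all threads arrive
    if (tidx < n) {
        out[tidx] = tile[n - 1 - tidx];  // stanza #2
    }
}

// Hypothetical host helper: launches one block and returns true if the
// first n elements of the output are the first n inputs in reverse order.
bool run_reverse_check(int n)
{
    assert(n >= 0 && n <= BLOCK);

    int h_in[BLOCK], h_out[BLOCK];
    for (int i = 0; i < BLOCK; ++i) { h_in[i] = i; h_out[i] = -1; }

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, h_out, sizeof(h_out), cudaMemcpyHostToDevice);

    reverse_prefix<<<1, BLOCK>>>(d_in, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n && ok; ++i) {
        ok = (h_out[i] == n - 1 - i);
    }
    return ok;
}
```

Calling `run_reverse_check(50)` from a `main` function exercises the case where the guard splits a warp while the barrier itself stays unconditional.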
EDIT: Just to add a bit of additional information which confirms this assertion, __syncthreads() calls get compiled into the PTX bar.sync instruction on all architectures. The PTX 2.0 guide (p. 133) documents bar.sync and includes the following warning:
Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.
So despite any assertions to the contrary, it is not safe to have conditional branching around a __syncthreads() call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.
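One situation where the condition is known to evaluate identically for every thread is when it depends only on a kernel argument rather than on the thread index. The sketch below assumes that case; `rotate_if`, `THREADS`, and `run_rotate_check` are hypothetical names introduced for illustration. Because `rotate` has the same value in every thread of the launch, either all threads reach the barrier or none do.

```cuda
#include <cuda_runtime.h>

constexpr int THREADS = 64;  // threads per block (assumed for this sketch)

// The branch condition is a kernel argument, so it is uniform across the
// whole block: no warp can diverge around the barrier.
__global__ void rotate_if(int *data, bool rotate)
{
    __shared__ int tile[THREADS];
    int tidx = threadIdx.x;

    if (rotate) {
        tile[tidx] = data[tidx];
        __syncthreads();  // reached by every thread, or by none
        data[tidx] = tile[(tidx + 1) % THREADS];
    }
}

// Hypothetical host helper: returns true if the data was rotated left by
// one element when rotate is true, and left untouched otherwise.
bool run_rotate_check(bool rotate)
{
    int h[THREADS];
    for (int i = 0; i < THREADS; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    rotate_if<<<1, THREADS>>>(d, rotate);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < THREADS && ok; ++i) {
        int expected = rotate ? (i + 1) % THREADS : i;
        ok = (h[i] == expected);
    }
    return ok;
}
```

A condition such as `tidx < N` gives no such guarantee, which is why the barrier in that case belongs outside the branch.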