c - Float32 to Float16

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?

(s = sign, e = exponent, and m = mantissa)

If 32-bit float is 1s8e23m
And 16-bit float is 1s5e10m

Then is it as simple as doing?

int     fltInt32;
short   fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );

fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);

I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?

Edit: I can see I've got my exponent shift wrong ... so would THIS be better?

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

I'm hoping this is correct. Apologies if I'm missing something obvious that has already been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)

Edit 2: Ooops. Buggered it again. I want to lose the top 3 bits not the lower! So how about this:

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

Final code should be:

fltInt16    =  ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
fltInt16    |= ((fltInt32 & 0x80000000) >> 16);
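For reference, those two lines can be wrapped into a self-contained function. This is only a sketch: the name `float_to_half_trunc` and the fixed-width types from `<stdint.h>` are assumptions, not part of the original post, and it assumes `float` is a 32-bit IEEE 754 value. It truncates the mantissa and is only meaningful for inputs inside the normal float16 range (magnitudes from 2^-14 up to 65504); zero, denormals, infinities, NaN and out-of-range values are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper around the two-line conversion above. */
static uint16_t float_to_half_trunc(float flt)
{
    uint32_t fltInt32;
    uint16_t fltInt16;

    /* Copy the bit pattern without violating aliasing rules. */
    memcpy(&fltInt32, &flt, sizeof fltInt32);

    /* Clear the sign, drop the low 13 mantissa bits, then subtract the
       bias difference (127 - 15) = 112 at exponent position (112 << 10). */
    fltInt16 = (uint16_t)(((fltInt32 & 0x7fffffffu) >> 13) - (0x38000000u >> 13));

    /* Move the sign bit from bit 31 down to bit 15. */
    fltInt16 |= (uint16_t)((fltInt32 & 0x80000000u) >> 16);
    return fltInt16;
}
```

For example, `float_to_half_trunc(1.0f)` yields `0x3C00`, the float16 bit pattern for 1.0.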

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:10:23+0000

The exponent needs to be unbiased, clamped and rebiased. This is the fast code I use:

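unsigned int fltInt32;      /* bit pattern of the input float, e.g. copied in with memcpy */
unsigned short fltInt16;    /* resulting float16 bit pattern */

fltInt16 = (fltInt32 >> 31) << 5;               /* sign bit, placed just above the 5-bit exponent */
unsigned short tmp = (fltInt32 >> 23) & 0xff;   /* biased 8-bit exponent */
/* rebias by 127 - 15 = 0x70; assuming an arithmetic right shift of negative
   ints, the mask is 0x1f when tmp > 0x70 and 0 otherwise */
tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
fltInt16 = (fltInt16 | tmp) << 10;              /* move sign and exponent into place */
fltInt16 |= (fltInt32 >> 13) & 0x3ff;           /* top 10 mantissa bits, truncated */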

This code will be even faster with a lookup table for the exponent, but I use this one because it is easily adapted to a SIMD workflow.
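A lookup-table variant of the exponent step might look like the sketch below. The names `exp_table`, `init_exp_table` and `float_to_half_lut` are hypothetical, and the code assumes `float` is a 32-bit IEEE 754 value. The table is meant to reproduce the behaviour of the branchless expression, so it shares the same limitations: no clamping on overflow and no proper handling of denormals.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical table: 8-bit float32 exponent field -> 5-bit float16 field. */
static uint8_t exp_table[256];

static void init_exp_table(void)
{
    for (unsigned e = 0; e < 256; ++e) {
        /* Rebias by 127 - 15 = 0x70. Exponents at or below 0x70 map to 0;
           exponents above the float16 range wrap to 5 bits (not clamped). */
        exp_table[e] = (e > 0x70u) ? (uint8_t)((e - 0x70u) & 0x1fu) : 0u;
    }
}

/* init_exp_table() must have been called once before using this. */
static uint16_t float_to_half_lut(float flt)
{
    uint32_t fltInt32;
    memcpy(&fltInt32, &flt, sizeof fltInt32);

    uint16_t sign     = (uint16_t)((fltInt32 >> 31) << 15);
    uint16_t exponent = (uint16_t)(exp_table[(fltInt32 >> 23) & 0xffu] << 10);
    uint16_t mantissa = (uint16_t)((fltInt32 >> 13) & 0x3ffu); /* truncated */
    return (uint16_t)(sign | exponent | mantissa);
}
```

Whether the table actually beats the arithmetic version depends on cache behaviour and the target CPU, so it is worth measuring before committing to it.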

Limitations of the implementation:

Overflowing values that cannot be represented in float16 will give undefined values.
Underflowing values will return an undefined value between 2^-15 and 2^-14 instead of zero.
Denormals will give undefined values.
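If those limitations matter, a slower branching version can clamp instead. The following is a minimal sketch and not the code from this answer: the name `float_to_half_clamped` is hypothetical, `float` is assumed to be a 32-bit IEEE 754 value, and the policy chosen here (overflow becomes infinity, anything below the normal float16 range is flushed to signed zero, the mantissa is truncated rather than rounded) is one of several reasonable options.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: float32 -> float16 bits with clamping. */
static uint16_t float_to_half_clamped(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint16_t sign  = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp32 = (bits >> 23) & 0xffu;   /* biased float32 exponent */
    uint32_t mant  = bits & 0x007fffffu;     /* 23-bit mantissa */

    if (exp32 == 0xffu) {
        /* Infinity stays infinity; any NaN becomes a quiet NaN. */
        return (uint16_t)(sign | 0x7c00u | (mant ? 0x0200u : 0u));
    }
    if (exp32 > 0x70u + 30u) {
        /* Rebiased exponent would exceed 30: clamp to infinity. */
        return (uint16_t)(sign | 0x7c00u);
    }
    if (exp32 <= 0x70u) {
        /* Rebiased exponent would be <= 0 (this includes float32 zero
           and denormals): flush to signed zero. */
        return sign;
    }
    /* Normal case: rebias the exponent and keep the top 10 mantissa bits. */
    return (uint16_t)(sign | ((exp32 - 0x70u) << 10) | (mant >> 13));
}
```

With this policy, `1e10f` maps to `0x7C00` (positive infinity) and `1e-10f` maps to `0x0000`.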

Be careful with denormals. If your architecture uses them, they may slow down your program tremendously.
