ARM FPUs have evolved over time, and for one period the term for them was VFP. The VFP assembly language instructions are still supported AFAIK. (I do not use the newer assembly syntax; I use the older syntax on various cores, although what I use may not be the original VFP assembly, it might be somewhere in the middle.)
At the time, and it still appears to be true today, the FPU was a coprocessor. The coprocessor interface is a feature that perhaps did not take hold with third-party vendors, but you could add coprocessors to the core and reach them with the coprocessor instructions (MCR/MRC, CDP, LDC/STC; not MRS/MSR, which access the status registers). I have not looked today, but at that time the VFP instructions were nothing more than coprocessor instructions aimed at coprocessors 10 and 11. The assembler took care of this, so you could ask to add two registers and not have to know the gory details.
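To make the "VFP instructions are just coprocessor instructions" point concrete, here is a minimal sketch that pulls the coprocessor number out of a 32-bit ARM instruction word. The helper names are mine, and the two instruction words are hand-assembled from my reading of the encoding tables, so treat them as assumptions and check them against your own assembler's output.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hand-assembled example words (assumptions, verify with your assembler). */
#define VADD_F32_S0_S0_S1 0xEE300A20u /* vadd.f32 s0, s0, s1 */
#define VADD_F64_D0_D0_D1 0xEE300B01u /* vadd.f64 d0, d0, d1 */

/* True if the word sits in the coprocessor data processing (CDP) space
   of the 32-bit ARM encoding: bits 27..24 == 0b1110 and bit 4 == 0. */
bool is_cdp_encoding(uint32_t insn)
{
    return ((insn >> 24) & 0xFu) == 0xEu && ((insn >> 4) & 1u) == 0u;
}

/* Coprocessor number field, bits 11..8 of a coprocessor instruction. */
unsigned coproc_number(uint32_t insn)
{
    return (insn >> 8) & 0xFu;
}
```

Feeding it the two words above should report coprocessor 10 for the single-precision add and coprocessor 11 for the double-precision add, while an ordinary integer instruction such as add r0, r1, r2 does not land in the CDP space at all.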
Floating point solutions for ARM (well, and everyone else) have evolved over time, and the term VFP is no longer used in normal conversation (for ARM at least).
How it accelerates the processor is that it is additional logic (just like a cache accelerates a processor) that is connected to the processor, and we the programmers offload this work to that coprocessor. We could use the normal ARM integer instruction set and do soft-float operations, which take a while because every floating point operation turns into many integer instructions. Or you can pass the operation over to the coprocessor, where its logic can do the work directly and give you a result much faster. The net result is better overall performance. Not unlike when speeding down the highway and asking the passenger to open a beer for you, you are offloading that work...
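A small sketch of the soft versus hard difference, assuming a GCC toolchain targeting 32-bit ARM (the function name is just for illustration). The C source is identical in both cases; only the compiler flags decide whether the addition becomes a library call or a single FPU instruction.

```c
/* Built with -mfloat-abi=soft, GCC turns this addition into a call to a
 * runtime helper (__aeabi_fadd in the ARM EABI), which does the work with
 * many integer instructions.
 * Built with -mfloat-abi=hard -mfpu=vfp (or another FPU the chip actually
 * has), the same line becomes a single vadd.f32 instruction. */
float add_floats(float a, float b)
{
    return a + b;
}
```

Disassembling the object file from each build (objdump -d) is the easiest way to see which of the two you actually got.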
For the case of ARM, the floating point instructions map into the core as instructions that are aimed at this logic, be it a coprocessor like the old days or logic implemented directly in the core. (I am not sure that is how it works today; I still need to enable a coprocessor in the ARM to enable the FPU, so I suspect they are still coprocessors in some form.)
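As an illustration of the "enable a coprocessor to enable the FPU" step, here is a sketch for one specific case that I am assuming, a Cortex-M4F, where access to coprocessors 10 and 11 is granted through the CPACR register. The helper name is hypothetical, and other cores do this differently (on the A-class cores it goes through CP15 and FPEXC), so check the manual for your part.

```c
#include <stdint.h>

/* Full access for CP10 and CP11: two bits each, bits 23..20 of CPACR. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

/* Return the CPACR value with the FPU coprocessors enabled, leaving
   every other bit as it was. */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

/* On a Cortex-M4F (assumed) the register lives at 0xE000ED88:
 *   volatile uint32_t *cpacr = (volatile uint32_t *)0xE000ED88u;
 *   *cpacr = cpacr_with_fpu_enabled(*cpacr);
 * followed by DSB and ISB barriers before the first FPU instruction. */
```

Executing a floating point instruction before this is done faults on that core, which is a common surprise when switching a project from soft float to hard float.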
How x86 and others do it is a separate topic; it may or may not be similar. Certainly in the early days the 8087 was a separate coprocessor chip, but as with ARM these things have evolved. The best solution is to have the core take the instructions directly, but you can still offload things and get an overall performance gain (think video cards).
Re-reading your question
From a current ARM document:
The Vector Floating-Point (VFP) architecture is a coprocessor extension to the ARM® architecture. It provides single-precision and double-precision floating-point arithmetic, as defined by ANSI/IEEE Std. 754-1985 IEEE Standard for Binary Floating-Point Arithmetic. This document is referred to as the IEEE 754 standard in the following text.
You can read on from there. This is from the ARMv5 ARM ARM (the ARM Architecture Reference Manual), which dates from the ARM7/ARM9 days.
When you see VFP with respect to ARM just think FPU or floating point instruction set. It is a coprocessor directly attached to the ARM core (if you paid for that and compiled it into your core) and the ARM core "executes" these instructions.
Because of the combinations of cores and features, and what each chip vendor can and cannot do, a specific chip may not have a hard FPU. In that case you have to use a soft FPU (a software floating point library), and that soft library may only support a certain instruction set.