Author: Hubert Lamontagne
Date: 2016-04-03 19:00
Suppose the SP is at 0x02018, the L1 cache lines are 64 bytes in size, and you want to save a vector that's, say, 24 bytes long (6*32bit floats). Besides the data, you also need to save the control word that tells you the size, compression etc. of the vector. Fair enough, the vector data goes to 0x02000..0x02017. Then you have to save the vector size control word, which puts you at 0x01FFC if you assume 32 bits... but this doesn't work, because then your SP is no longer 8-byte aligned for 64-bit integer and floating-point values. So you must save the vector size word to 0x01FF8 instead, with some extra padding (and the CPU either stores the amount of padding in the vector size word, or recalculates it from SP alignment and vector size when reloading the vector).

Another possibility is to add some post-padding, so that the vector data is saved to 0x01FE8..0x01FFF and the control word goes to 0x01FE0, and the whole thing fits in a single cache line. The amount of post-padding must then be saved in the vector size control word.

Yeah, it's doable. But it's a long multicycle instruction, probably microcoded: after all, it writes an unpredictable number of bytes to unpredictable offsets, often spanning 2 different cache lines, updates the SP, and involves multiple address calculations to figure out just how much pre-padding and post-padding you need to keep your stack and your data well aligned. And it's very likely to completely block memory operation reordering (i.e. act like a memory barrier), because it's too difficult for concurrent memory operations to figure out whether they will overlap or not.

Agner wrote:
Hardware multipliers are expensive, and divisors are
even more expensive. I wonder if we need to support
multiplication and division of all operand sizes,
including vectors of 8-bit and 16-bit integers, if
programmers are using floating point anyway?
Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering (when OpenGL/DirectX are unavailable due to software constraints, such as running as a plugin). For scalars, 32*32->32 multiplies cover everything (and are common in C++ code), but some CPUs (ARM, for instance) also provide 16*16->32 multiplies because they run faster.
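Going back to the vector save example at the top of this post, here is a rough Python sketch of the address arithmetic such a push would have to do, starting from the same values (SP = 0x02018, 64-byte cache lines, 24-byte vector). The function names, the 32-bit control word, and the rule "drop the block below the cache line boundary if that makes it fit in one line" are my own assumptions for illustration, not part of any actual proposal.

```python
CACHE_LINE = 64  # L1 line size in bytes, as in the example
ALIGN = 8        # stack alignment needed for 64-bit values
CTRL_SIZE = 4    # assumed 32-bit vector size control word


def align_down(addr, align):
    """Round addr down to a multiple of align (a power of two)."""
    return addr & ~(align - 1)


def layout(top, nbytes):
    """Place nbytes of data just below top, then the control word below it.

    Returns (new_sp, data_start, pre_pad). The control word sits at new_sp,
    and pre_pad bytes of padding separate it from the data.
    """
    data_start = top - nbytes
    new_sp = align_down(data_start - CTRL_SIZE, ALIGN)
    pre_pad = data_start - CTRL_SIZE - new_sp
    return new_sp, data_start, pre_pad


def push_vector(sp, nbytes, avoid_split=False):
    """Hypothetical variable-length vector push (stack grows downward)."""
    post_pad = 0
    new_sp, data_start, pre_pad = layout(sp, nbytes)
    if avoid_split and new_sp // CACHE_LINE != (sp - 1) // CACHE_LINE:
        # The block straddles a line boundary: try starting it at the
        # boundary instead, leaving post-padding above the data.
        boundary = align_down(sp, CACHE_LINE)
        candidate = layout(boundary, nbytes)
        if boundary - candidate[0] <= CACHE_LINE:
            post_pad = sp - boundary
            new_sp, data_start, pre_pad = candidate
    # Cache lines covered by control word + data (post-padding is not written).
    lines = (sp - post_pad - 1) // CACHE_LINE - new_sp // CACHE_LINE + 1
    return {"new_sp": new_sp, "data_start": data_start,
            "pre_pad": pre_pad, "post_pad": post_pad, "lines": lines}


if __name__ == "__main__":
    for split_avoidance in (False, True):
        r = push_vector(0x02018, 24, avoid_split=split_avoidance)
        print(split_avoidance, {k: hex(v) for k, v in r.items()})
```

With these assumptions, the plain push puts the data at 0x02000, the control word (and new SP) at 0x01FF8 with 4 bytes of pre-padding, and touches 2 cache lines. With post-padding, the data moves to 0x01FE8, the control word to 0x01FE0, and 24 bytes of post-padding are left above the data, so only 1 cache line is touched. Even in this toy form, that's several dependent address calculations and a data-dependent branch before the first byte can be stored.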