Agner's CPU blog

Do we need instructions with two outputs?

Author:

Date: 2016-04-02 17:09

Agner wrote:

Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.

Am I correct in recalling that x86-64 doesn't actually expose a 64-bit address space, but rather a 48-bit one? See stackoverflow.com/questions/6716946/why-do-64-bit-systems-have-only-a-48-bit-address-space
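For reference, the 48-bit limit shows up in software as the "canonical address" rule: bits 63 through 47 of a virtual address must all be copies of bit 47. Below is a minimal C sketch of that check, assuming 48-bit virtual addresses (four-level paging). The name is_canonical48 is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is canonical under 48-bit virtual
   addressing, i.e. bits 63..47 are all copies of bit 47. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top17 = addr >> 47;   /* bits 63..47, 17 bits in total */
    return top17 == 0 || top17 == 0x1FFFF;
}
```

The address 0x7fffd1510060 used as an example in this post passes the check, since all of its upper 17 bits are zero.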

However, this doesn't matter for my purposes. I'm asking why, in virtual memory, we mirror the physical memory addressing scheme. Why does a process use an address like 0x7fffd1510060 when it could use an address like 1 or 2 or D4? It's the process's exclusive virtual memory space; wouldn't it save a lot of memory if it could use one-byte pointers? I imagine that the TLB or MMU can translate these virtual addresses just as easily as it translates 0x7fffd1510060.
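To make the question concrete, here is a rough software analogue in C: a per-process table that maps a one-byte handle to a full pointer. This is only a sketch of the idea, not how any MMU or TLB works, and every name in it (handle_table, handle_put, handle_get) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES 256

/* Hypothetical per-process table: a one-byte handle indexes a full pointer. */
typedef struct {
    void  *slot[MAX_HANDLES];
    size_t used;                /* number of handles handed out so far */
} handle_table;

/* Register a pointer and store its handle in *out.
   Returns 0 on success, -1 if all 256 handles are taken. */
static int handle_put(handle_table *t, void *p, uint8_t *out)
{
    if (t->used >= MAX_HANDLES) return -1;
    *out = (uint8_t)t->used;
    t->slot[t->used++] = p;
    return 0;
}

/* Translate a handle back to the full pointer (NULL if never issued). */
static void *handle_get(const handle_table *t, uint8_t h)
{
    return h < t->used ? t->slot[h] : NULL;
}
```

Note that a one-byte handle can name at most 256 objects, and it names whole objects rather than individual bytes inside them.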

I've also wondered if we could use time stamps as virtual memory addresses: start the clock at zero nanoseconds for each process, every allocation marked by the nanoseconds transpired since the start of the process, each process having its own little Unix epoch if you will. This would also be more compact than the status quo, given some simple truncation and compression techniques. Time-stamped allocations might also be useful for a capabilities-based system, like the CHERI ISA at Cambridge or the Barrelfish OS that Microsoft and ETH Zurich have worked on.

Agner wrote:

I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?
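A string compare with vectors of 8-bit elements can be sketched in C as follows. This is an illustration using today's x86 SSE2 intrinsics, not the proposed instruction set, and it falls back to a plain byte loop when SSE2 is not enabled. The name bytes_equal is made up. Because the comparison is byte-wise, it works unchanged on UTF-8 data.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 if the n-byte buffers a and b are equal. */
static int bytes_equal(const char *a, const char *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* 0xFF in every byte lane where the two inputs match */
        __m128i eq = _mm_cmpeq_epi8(va, vb);
        /* one bit per lane; 0xFFFF means all 16 bytes matched */
        if (_mm_movemask_epi8(eq) != 0xFFFF) return 0;
    }
#endif
    /* scalar tail (or the whole buffer when SSE2 is not enabled) */
    for (; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```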

Parsing, parsing, and more parsing (and lexing). I'm not sure that processor and ISA designers have thoroughly explored how parsing performance might be improved. And of course there is the actual compilation to machine code. JITs have to make hard trade-offs between generating maximally optimized code and generating code quickly. They have to forsake some of the optimizations that a static C/C++ compiler would provide. (Relatedly, you might find this interesting: Apple recently added LLVM as the last-stage compiler for their WebKit/Safari JavaScript JIT, but more recently replaced it with a completely new compiler. Very interesting deep dive here: https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/)

The other big thing JITs and many non-JIT runtimes have to do is garbage collection. I think it's well worth thinking about how an ISA could be designed to optimize garbage collection. There are some papers out there on hardware-accelerated garbage collection, but I haven't seen anyone model how an ISA's design decisions could help (or hurt) garbage collection.

Agner wrote:

My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?

We need them for the web, for JavaScript. This will be the case for many years to come. And for REPL and interpreter-like execution environments. Julia comes to mind the most.

WebAssembly is rolling out later this year. It looks excellent and thankfully everyone is behind it: Microsoft, Mozilla, Google, and perhaps Apple. It will be a partly compiled bytecode that browsers can execute much, much faster than JavaScript. However it won't replace JavaScript, as it's meant more for games, multimedia, and compute-intensive workloads. For now, the only source languages supported are C and C++, but more are expected. https://github.com/WebAssembly/design

Now that I think about it, you really ought to be involved in that project. They would benefit from your input. Relatedly, the web standards authorities and browser makers have been working on SIMD.JS, which I think would also benefit from your insights. I'm surprised they haven't asked for your help (if in fact they haven't). https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SIMD

Agner wrote:

The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.

I don't think text length is the issue. The new instructions are mostly designed to parse XML (and would work just as well for parsing any kind of structured text, such as HTML, or even bytecode, depending on some particulars). From one of the Intel papers:

XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a series of character range rules.[1] The main work of Intel XML parsing is to recognize these entities and their logic structures.
From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding entity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.

(From https://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4)
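The character checking loop described in the quote can be sketched in C as follows. This is my own illustration, not Intel's code: first_illegal is a made-up name, and the range pairs are whatever the caller supplies. When compiled with SSE4.2 enabled it uses PCMPESTRI in ranges mode to test 16 bytes per instruction; otherwise only the byte-at-a-time loop runs.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#endif

/* Hypothetical helper: index of the first byte of s[0..n) that lies outside
   every inclusive [lo, hi] pair in ranges, or n if every byte is legal.
   ranges is laid out lo0, hi0, lo1, hi1, ...; npairs must be 1..8. */
static size_t first_illegal(const unsigned char *s, size_t n,
                            const unsigned char *ranges, int npairs)
{
    size_t i = 0;
#if defined(__SSE4_2__)
    unsigned char rbuf[16] = {0};
    memcpy(rbuf, ranges, (size_t)npairs * 2);
    const __m128i rv = _mm_loadu_si128((const __m128i *)rbuf);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* range-test all 16 bytes at once; negative polarity makes the
           result the first byte that matched no range (16 if none) */
        int idx = _mm_cmpestri(rv, npairs * 2, chunk, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES |
                               _SIDD_NEGATIVE_POLARITY |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16) return i + (size_t)idx;
    }
#endif
    /* scalar tail: the serialized, byte-at-a-time form of the check */
    for (; i < n; i++) {
        int ok = 0;
        for (int k = 0; k < npairs; k++)
            if (s[i] >= ranges[2 * k] && s[i] <= ranges[2 * k + 1]) ok = 1;
        if (!ok) return i;
    }
    return n;
}
```

For example, with the ranges a-z, A-Z and 0-9, the function stops at the first byte that cannot be part of an alphanumeric name.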

I think the instructions would be useful for superfast user agent detection in web servers. I think PCMPESTRI and the related instructions work with 16-byte strings, so you could probably take a 16-byte chunk from a certain area of the user agent string that uniquely identifies the key factors you care about across all user agents, like mobile or not, or the specific browser and version. (The latter could, for example, tell you whether you could use the vectors in SIMD.JS, because you'd know which browsers support it.) The web is too slow, and I think common web servers, applications, and databases would be much faster if they used modern CPU instructions in their code.

(Example user agent string: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36)
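One way this could look in C is sketched below. It is an assumption-laden illustration rather than a tested server component: find_token is a made-up name, and instead of relying on a fixed 16-byte window it scans the string in 16-byte chunks for a short token such as "Chrome/". With SSE4.2 enabled at compile time it uses PCMPESTRI in equal-ordered mode; otherwise only the plain loop runs.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#endif

/* Hypothetical helper: offset of the first occurrence of token
   (1..16 bytes) in s[0..n), or n if it does not occur. */
static size_t find_token(const char *s, size_t n,
                         const char *token, size_t tlen)
{
    size_t i = 0;
    if (tlen == 0 || tlen > 16) return n;
#if defined(__SSE4_2__)
    char tbuf[16] = {0};
    memcpy(tbuf, token, tlen);
    const __m128i needle = _mm_loadu_si128((const __m128i *)tbuf);
    while (i + 16 <= n) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* index of the first position where the token, or a prefix of it
           cut off by the end of the chunk, matches; 16 if there is none */
        int idx = _mm_cmpestri(needle, (int)tlen, chunk, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx == 16) { i += 16; continue; }
        i += (size_t)idx;
        /* confirm the candidate, since it may be a prefix-only match */
        if (i + tlen <= n && memcmp(s + i, token, tlen) == 0) return i;
        i++;
    }
#endif
    /* scalar tail, and the whole scan when SSE4.2 is not enabled */
    for (; i + tlen <= n; i++)
        if (memcmp(s + i, token, tlen) == 0) return i;
    return n;
}
```

Called on the example string above with the token "Chrome/", it returns the offset of that token, and the version digits follow immediately after it. A token that is absent, such as "Mobile" here, yields n.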

Cheers,

Joe D.
