Implementing performant, scalable solutions in confidential computing
Confidential computing, typically implemented with Trusted Execution Environments (TEEs) such as Intel SGX enclaves, is a means of computing over secret data while maintaining confidence that malicious software installed on the same machine cannot see or tamper with those computations.
The traditional way to communicate with enclaves is expensive
The normal way to communicate with these enclaves is through machine-generated code, produced from an interface that you design using EDL files. These EDL (Enclave Definition Language) files follow the design patterns of the older IDL-style interfaces used in RPC, COM and CORBA. The EDL describes a set of functions which Intel's code generator (the edger8r) converts into proxy and stub code, allowing communication between host applications and enclaves.
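For illustration, a minimal EDL interface might look something like the sketch below; the function names and parameters are hypothetical, but the structure (a `trusted` section for ecalls into the enclave and an `untrusted` section for ocalls out to the host) follows the EDL format the edger8r consumes:

```
enclave {
    trusted {
        // ecall: host -> enclave; the [in] attribute tells the edger8r
        // to copy len bytes of data into enclave memory
        public int ecall_process([in, size=len] uint8_t* data, size_t len);
    };
    untrusted {
        // ocall: enclave -> host
        void ocall_log([in, string] const char* msg);
    };
};
```

The generated proxy and stub code handles the copying and validation implied by these attributes, which is exactly the marshalling cost discussed below.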
These calls, however, are expensive. For a conventional C-style call, we typically place the parameters in a few registers before jumping to the function's address; at a minimum, this takes one or two clock cycles. An enclave call takes approximately 8,000 cycles. That is not a serious problem when setting up an enclave's environment, but it is prohibitive when writing high-throughput code.
Intel’s ‘switchless’ design has pros and cons
Intel's development team has been working on this challenge and has come up with the 'switchless' design. This involves marshalling data between the host and enclave using separate threads, with the sender effectively suspended until the receiver completes. In a limited environment this locks up valuable resources which then cannot be used for other purposes. The advantage is that the ocall is faster, as the thread does not need to be sanitised when calling across the enclave boundary.
Asynchronous programming for 100% capacity of enclave threads
However, we think the most promising approach is to use our own queues and employ an asynchronous programming style throughout the entire enclave, so that while work is being done on the other side of the enclave boundary, the thread can continue with other work. This maximises the efficiency of the enclave threads, allowing them to work at 100% capacity, something not possible with conventional or switchless ecall/ocall implementations.
We are investigating SPSC (single-producer, single-consumer) queues built on C++ atomic circular buffers, which we believe are more efficient. The trick is to allocate the queues in host memory and then share their pointers with the enclave. Enclaves can read and write host memory, but not the other way around; if you set up such a queue when the thread starts, you can send messages in both directions.
Performance test results
These are the results of modifying Intel's own switchless performance test app to send 50,000 zero-byte messages between host and enclave:
![Performance test results](/uploads/performant_scalable_solutions_0a95d753e1.png)_Source:_ <https://github.com/secretarium/demo_circular_buffer>
As one can see, this approach delivers impressive performance improvements. Of course, this is not a real use case, and if a large amount of data needs to be marshalled between host and enclave, secondary costs start to mount. You need to implement your own serialization logic across the queue, which adds complexity. A large blob of serialized data must be broken into sections that fit into the blocks of the queue, and the complexity increases further when the data being marshalled is larger than the circular buffer itself. You also have to deal with race conditions if multiple threads want to work over the same channel.
The nice thing about the edger8r and the Intel runtime is that all of this complexity is hidden from you, but at the cost of performance and inefficient use of those precious enclave resources.
However, if you are looking at C++20, we now have coroutines: regular functions that suspend at a blocking call and resume when an executor has data for them. The only changes you need to make to your functions are to wrap your return values in futures, use "co_return" instead of "return", and "co_await" every call into another coroutine. This radically reduces the cost of migrating your application to a much more efficient enclave implementation.
So now you’re ready to increase the performance of your enclave applications by orders of magnitude, without having to radically rewrite existing code bases.