What we solve…

Each of us uses applications every day that, behind the scenes, run on a large team of computers working together.  Whether we are using collaboration tools or social media, storing or viewing photos or video in the cloud, or just doing business with an online store or bank, all of us have experienced those distributed applications getting into trouble.  They suddenly get very slow for seemingly no reason, lose or duplicate a post or email we’ve written, or worse, lose or corrupt an item in a database, or an entire folder or directory.  Most of us just shrug these events off as software bugs, get annoyed (or panic if it destroyed a work deliverable just before a deadline), and recover as best we can.

Sometimes these are just software bugs, and sometimes a person typed an incorrect command into the application or into some piece of equipment it runs on.  But there is also a whole class of problems that occur in distributed software when imperfect communication leaves the individual computers disagreeing on the basic facts they need to run the application correctly, as a team.  To oversimplify: grey failures, where performance collapses while computers wait for each other to catch up on updates to those basic facts, are among the better outcomes of these ugly situations; silent data corruption is among the worse.

Our founder set out 22 years ago to understand how a straightforward disaster recovery algorithm had, upon destruction of the building where the application normally ran, left corrupted data at the site where the application was supposed to resume.  This post mortem, which one would expect to simply find a software bug, instead found that distributed algorithms built as well as we know how, running over networks we consider best practice even today, simply do not work correctly in certain situations.  Computer science has over the decades since provided proof of this problem, in the form of the CAP theorem, and more recently in practical form in the 2018 Alquraan paper and the 2023 “partial network partitioning” paper.  Daedaelus exists, as a software company, to provide the robust network foundation a distributed application needs to work correctly all the time, not just most of the time.

Daedaelus itself is a new company, formed in late 2022 and seeking out customers and investment as we put our demo together in the first half of 2023.  But over the years our principals have learned that our focus also helps with several other “interesting” problems in today’s best-practice computer infrastructure:

  • By pausing to route packets around a failed link, we ensure that each packet is delivered.  We do not drop packets.
  • By checking the link state at both ends as part of the route-around process, we avoid duplicate packets: we resend only if the receiver did not get the packet before the failure (see the sketch after this list)
  • This check also means that if a true network partition has occurred, the sender knows promptly that the packet cannot be delivered
  • By using a mesh of servers, without switches, not only are network partitions essentially eliminated, but most cables are also very short and thus can be very cheap
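
To make the route-around concrete, here is a minimal sketch in Go, assuming a simplified sequence-number protocol and hypothetical names (routeAround, receiverGot); the actual mesh protocol differs:

```go
package main

import "fmt"

// routeAround is called when a link fails mid-transfer. The sender has
// already asked the receiver, over an alternate path, for the highest
// sequence number it received (receiverGot). Only the gap is resent,
// so a packet that arrived just before the failure is never duplicated.
func routeAround(send func(seq uint64) error, receiverGot, senderSent uint64) error {
	if receiverGot >= senderSent {
		return nil // receiver already has everything; resending would duplicate
	}
	for seq := receiverGot + 1; seq <= senderSent; seq++ {
		if err := send(seq); err != nil {
			// No alternate path reached the receiver: a true partition.
			// The sender learns promptly that delivery is impossible,
			// instead of waiting on a timeout.
			return fmt.Errorf("partition: packet %d undeliverable: %w", seq, err)
		}
	}
	return nil
}

func main() {
	// The receiver reported (via an alternate path) that it got packets
	// up to 7 before the link failed; the sender had sent through 10.
	err := routeAround(func(seq uint64) error {
		fmt.Println("resending packet", seq, "via alternate path")
		return nil
	}, 7, 10)
	fmt.Println("result:", err)
}
```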

Three very interesting, practical problems are “accidentally” solved by the technology choices enabling this robust network foundation:

  • Module-scale (palm of your hand) servers can be cost-effectively networked with 2 cm “cables” to adjacent neighbors.  There is no longer a need to cable each one to a top-of-rack or blade-enclosure switch
  • Because addressing names a software endpoint in the mesh rather than a physical location, and rerouting is dynamic, an endpoint can move to another computer without losing packets…or its sessions
  • Addressing by multicast group rather than by endpoint allows failover of a destination without packet loss and without the sender even knowing it has occurred (both are sketched below)
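
A minimal sketch of both addressing modes, with an in-memory map standing in for the distributed routing state; every name here (directory, deliver, module-3, orders) is illustrative, not our actual implementation:

```go
package main

import "fmt"

// node identifies a physical module in the mesh.
type node string

// directory maps endpoint names to their current physical node, and
// multicast group names to their member endpoints. In the real mesh
// this state is distributed; a local map stands in for it here.
type directory struct {
	endpoints map[string]node
	groups    map[string][]string
}

// deliver resolves a destination name at send time, so a sender never
// needs to know where (or on how many nodes) the destination lives.
func (d *directory) deliver(dest, payload string) {
	if members, ok := d.groups[dest]; ok {
		for _, ep := range members {
			fmt.Printf("-> %s on %s: %s\n", ep, d.endpoints[ep], payload)
		}
		return
	}
	fmt.Printf("-> %s on %s: %s\n", dest, d.endpoints[dest], payload)
}

func main() {
	d := &directory{
		endpoints: map[string]node{"orders": "module-3", "orders-standby": "module-9"},
		groups:    map[string][]string{"orders-group": {"orders", "orders-standby"}},
	}
	d.deliver("orders", "packet 1") // by endpoint name

	// The endpoint moves to another computer; only the directory changes.
	d.endpoints["orders"] = "module-5"
	d.deliver("orders", "packet 2") // same name, new location, no loss

	d.deliver("orders-group", "packet 3") // by group: failover is invisible to the sender
}
```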

Through the scars of earlier startups, we have learned:

  • The robustness we seek, namely the certainty of exactly-once prompt delivery or of prompt explicit failure, cannot be achieved as an overlay on today’s best-practice networking
  • Although we obviously still run under the TCP/IP stack, existing Ethernet software stacks and drivers are allowed to drop packets in corner cases, so they do not meet our robustness goal
  • Existing NICs are allowed to drop packets under congestion and in corner cases, so they do not meet our robustness goal

Thus the outline of our product/business model is:

  • Our initial product is off-the-shelf x86 servers, each with an off-the-shelf FPGA-based 8-port NIC, with short cables to the 8 nearest neighbors in our mesh’s plane
  • We contribute the code which runs in that FPGA
  • Our mature product is someone else’s module-scale server, with an onboard FPGA running our code, with 2 cm “cables” to 6 or 8 neighbors
  • There are lots of ways this could be built.  We think the industry will find a better way than we currently have in mind.
  • We create a library/API which distributed applications may, but are not required to, use when running on our mesh
  • We believe that certainty around updating shared facts (“consensus”) is worth writing to our API; a sketch of what such an API might look like follows this list
  • Once we have both the FPGA and the API, there are opportunities to offload application-specific work to the FPGA.  There is real value in latency and certainty.
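
As a sketch only, since the library does not exist yet, the consensus portion of such an API might take roughly this shape; all names (Fact, Consensus, Propose) are hypothetical:

```go
package mesh

// Fact is a shared fact the mesh keeps in agreement across nodes.
type Fact struct {
	Key     string
	Value   []byte
	Version uint64 // every agreed update increments the version
}

// Consensus is what a distributed application would program against.
type Consensus interface {
	// Propose applies the update if, and only if, the fact is still at
	// the expected version; otherwise it returns the current fact so the
	// caller can retry deliberately. There are no timeouts: failure to
	// reach agreement is reported explicitly and promptly.
	Propose(key string, expected uint64, value []byte) (Fact, error)

	// Read returns the latest agreed version of a fact.
	Read(key string) (Fact, error)
}
```

A compare-and-set on a version number makes “applied exactly once, or explicit failure” visible to the application, mirroring the delivery guarantee of the mesh underneath it.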

Over those decades there are technical principles we have realized are critical to this robustness, principles which will be counterintuitive to many in the industry:

  • Timeouts, although ubiquitous in computing today, do not work in the general case.  Actions and responses through alternate paths should be used instead.
  • Time stamps only work as well as clock synchronization, which is always an approximation. Kudos to Google for approaching the asymptote here.
  • When the goal is robustness (absolute accuracy), only the order of events at each given observer can be relied upon (see the sketch after this list)
  • Events should occur at first opportunity, not sit in queues.
  • If more bandwidth than 1/latency is needed (more than one dependent event per latency interval), explicitly make events parallel
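
A minimal sketch of per-observer ordering, assuming hypothetical names (event, happenedBefore); the point is that order within one observer’s history is reliable, while cross-observer comparisons are refused rather than guessed from timestamps:

```go
package main

import "fmt"

// Each observer stamps its own events with a local counter, so ordering
// never depends on wall-clock time.
type event struct {
	observer string
	seq      uint64 // position in that observer's own history
	what     string
}

// happenedBefore is defined only for two events seen by the same
// observer. Comparing timestamps across observers would silently depend
// on clock synchronization, which is always an approximation.
func happenedBefore(a, b event) (before, defined bool) {
	if a.observer != b.observer {
		return false, false // no cross-observer order without agreement
	}
	return a.seq < b.seq, true
}

func main() {
	var n uint64
	stamp := func(what string) event {
		n++ // events are emitted at first opportunity, never parked in a queue
		return event{observer: "node-A", seq: n, what: what}
	}
	e1, e2 := stamp("write x=1"), stamp("write x=2")
	fmt.Println(happenedBefore(e1, e2)) // true true: local order is reliable
	foreign := event{observer: "node-B", seq: 1, what: "write y=9"}
	fmt.Println(happenedBefore(e1, foreign)) // false false: order undefined
}
```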
We’re working on our demo, and starting to seek out distributed software developers (whom we see as customers) to get our direction on target as we seek funding so we can write code at scale.