FMS 2024 - CTO Vision Panel

Paul organized the CTO Vision Panel for the 2024 Future of Memory and Storage Summit
August 8, 2024
San Jose
CTO Vision Panelists

OCP Time Appliance Project

Paul gave a three-part series of presentations to the Open Compute Time Appliance Project (OCP TAP) on the nature of time and its applications to distributed systems.
Open Compute Project
OCP Time Appliance Project

Race Conditions and Exactly Once Semantics in Distributed Systems

A Façade of Newtonianism in Networking and its Consequences

Clock Coherence: from Terrestrial Micro Data Centers to Proxima Centauri

Chiplet Summit 2024

February 6-8, 2024
San Jose, CA
Clos networks are plagued by queuing delays at the switches
Mesh-based networks enable efficient routing behaviors
Dynamic routing protocols quickly heal from link failures in a mesh network

SmartNICs Summit

June 13-15, 2023
San Jose, CA

Charlie Johnson presenting the Dædælus Protocol

Daedaelus Presentation
Superpanel Discussion

Asilomar Microcomputer Workshop

April 26-28, 2023
Pacific Grove, CA
Clock Synchronization

This clock talk is motivated by a project by Robert G. Kennedy III, who wants to send a swarm of nano-satellites to Alpha Centauri and return pictures. He will reveal his approach, and a project to actually do it, at the Asilomar Microcomputer Workshop.

Techniques for distributed clock synchronization are apparently critical to the success of the project. Clock synchronization is, as we know, a hard problem.

Robert is seeking experts to transfer knowledge and assist: physicists, computer scientists, mathematicians, neuroscientists, philosophers, and practicing engineers from the ItsAboutTime.club.

Chiplet Summit 2023

January 25-26, 2023
San Jose, CA
Jonathan Gorard with John Moussouris, Chiplet Summit 2023

Daedaelus: Problem and Our Solution

Daedaelus: Entanglement for Distributed Systems

25 years ago, computing chose scale-out distributed software, running on millions of commodity servers, on commodity networks. We as an industry assumed we could throw packets over the wall to a best-effort network, and distributed software endpoints could clean up after packet drops, delays, and reordering. It turns out that doing so is intractable in practice. Today’s implementations approach but do not achieve the End-to-End Principle. Gray failures of large distributed systems are common, due to retry storms and other situations where nodes can’t rapidly agree on basic facts. More insidious is data corruption that occurs because the nodes of a distributed database or storage system got confused amid a storm of pending interactions.

Daedaelus: The communication distributed software needs

When a packet is sent, the sender knows within microseconds whether the receiver has it or not. If the packet doesn’t get there, every trace of its existence goes away, eliminating intractable corner cases. When a link fails, it’s a small local failure the network routes around without dropping a single packet. If a packet fails to be delivered, it’s easy to determine whether the problem is a dead node or a network partition.
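The delivery contract described above can be sketched in a few lines of Python. This is purely illustrative: the class and result names are hypothetical, not Daedaelus code. The point is that every send resolves promptly to one of exactly two outcomes, with no buffered half-state left behind on failure.

```python
# Illustrative sketch of the two-outcome delivery contract (hypothetical API).
from enum import Enum

class SendResult(Enum):
    DELIVERED = "delivered"   # the receiver provably holds the packet
    REVOKED = "revoked"       # the packet and every trace of it are gone

class Link:
    """Toy stand-in for a mesh link; `up` models link health."""
    def __init__(self, up=True):
        self.up = up
        self.received = []    # packets the receiver actually holds

    def send(self, packet) -> SendResult:
        if self.up:
            self.received.append(packet)
            return SendResult.DELIVERED
        # On failure nothing is retained anywhere: no queued copy,
        # no half-applied state, so the intractable corner cases
        # (is it in flight? duplicated? partially applied?) cannot arise.
        return SendResult.REVOKED

link = Link(up=False)
assert link.send(b"update-42") is SendResult.REVOKED
assert link.received == []    # no trace of the failed packet remains
```

Contrast this with best-effort networking, where a failed send can leave the packet queued, duplicated, or delivered without the sender ever learning which.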

Daedaelus: There’s a lot of science behind getting this right

Computer Science: we get around the CAP theorem by using a mesh network to prevent partitions within a subnet. Physics: we are doing our best to emulate entanglement between communication endpoints, an utterly different End-to-End principle. Time: based on thinking about quantum mechanics, absolute time doesn’t work between locations, only events between them do… and we have to run time backward in very limited contexts to make those failed packets go away with no possible side effects.

Daedaelus: Who is this for?

Our market is secure, reliable, distributed computing on the edge. Use cases include transaction processing, infrastructure failover, digital twins, and any other distributed application with persistent state. Tracking conserved quantities is another interesting application for Daedaelus technology.

Daedaelus: Where could this technology end up?

The algorithms can be used over WAN distances if we can control the choice of paths between locations. Our mesh network is a better choice than today’s star wiring for a tiled plane of dozens or hundreds of single-module-of-chiplets-scale servers.

Each of us uses applications every day, which behind the scenes run on a large team of computers working together.  Whether we are using collaboration, social media, storing or viewing  photos or video in the cloud, or just doing business with an online store or bank, all of us have experienced those distributed applications getting into trouble.  They get very slow suddenly for seemingly no reason, or lose or duplicate a post or email we’ve written, or worse, lose or corrupt an item in a database, or lose or corrupt an entire folder or directory.  Most of us just shrug these events off as software bugs, get annoyed (or panic if it destroyed a work deliverable just before deadline), and recover as best we can.

Sometimes these are just software bugs, and sometimes a person typed an incorrect command into the application or some piece of equipment it runs on.  But there is also a whole class of problems which occur in distributed software when imperfect communication leaves the individual computers not agreeing on the basic facts they need to correctly run the application together, as a team.  To oversimplify, gray failures, where performance collapses while computers wait for each other to catch up on updates to those basic facts, are one of the better outcomes of these ugly situations, and silent data corruption is one of the worst.

Our founder set out 22 years ago to understand how a straightforward disaster recovery algorithm had, upon destruction of the building where the application normally ran, left corrupted data at the site where the application was supposed to resume.  This post mortem, which one would expect to simply find a software bug, instead found that distributed algorithms built the best we know how, running over the networks we consider best practice even today, simply don’t work correctly in certain situations.  Computer science has over these decades provided proof of this problem, in the form of the CAP theorem, and more recently in practical form in the 2018 Alquraan paper and the 2023 “partial network partitioning” paper.  Daedaelus exists, as a software company, to provide the robust network foundation a distributed application needs to work correctly all the time, not just most of the time.

Daedaelus itself is a new company, formed in late 2022 and seeking out customers and investment as we put our demo together in the first half of 2023.  But over the years our principals have learned that our focus turns out to help with several other “interesting” problems of best practice computer infrastructure today:

  • By pausing to route packets around a failed link, we ensure that each packet is delivered.  We do not drop packets.
  • By checking the link state at both ends as part of the route-around process, we avoid duplicate packets by resending only if the receiver didn’t get it before the failure.
  • This check also means that if a true network partition has occurred, the sender knows promptly that the packet cannot be delivered.
  • By using a mesh of servers, without switches, not only are network partitions essentially eliminated, but also most cables are very short and thus can be very cheap.
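The route-around logic in the bullets above reduces to one decision: after a link failure, consult the receiver's state and resend only if the packet never arrived. A minimal sketch, with hypothetical names (`backup_path`, `receiver_log` stand in for the real both-ends link-state check):

```python
# Hedged sketch of duplicate-free route-around: resend via a backup path
# only if the receiver's state shows the packet was not already delivered.
def route_around(packet_id, backup_path, receiver_log):
    """Reroute packet_id after a link failure, preserving exactly-once delivery."""
    if packet_id in receiver_log:
        # Receiver got it before the link died; resending would duplicate it.
        return "already-delivered"
    backup_path.append(packet_id)    # deliver via the alternate route
    receiver_log.add(packet_id)
    return "rerouted"

receiver_log = {7}                   # receiver already holds packet 7
backup_path = []
assert route_around(7, backup_path, receiver_log) == "already-delivered"
assert route_around(8, backup_path, receiver_log) == "rerouted"
assert backup_path == [8]            # packet 7 was never resent
```

The same check doubles as partition detection: if the receiver's state cannot be consulted at all, the sender learns promptly that delivery is impossible rather than timing out.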

Three very interesting, practical problems are “accidentally” solved by the technology choices enabling the robust network foundation:

  • Module-scale (palm of your hand) servers can be cost-effectively networked with 2 cm “cables” to adjacent neighbors, so there is no longer a need to cable each server to a Top of Rack or blade-enclosure switch
  • By addressing a named software endpoint in the mesh rather than a physical location, and with dynamic rerouting, an endpoint can move to another computer without losing packets…or its sessions
  • Addressing by multicast group rather than by endpoint allows a destination to fail over without packet loss and without the sender knowing this has occurred
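The two addressing bullets share one idea: senders name a logical endpoint, and a routing layer resolves it to a physical location at delivery time, so migration or failover is invisible to the sender. A toy sketch, with invented names (`routing`, `mailboxes`) standing in for the mesh's real resolution mechanism:

```python
# Hypothetical sketch: address by endpoint name, not physical location.
routing = {"orders-db": "node-3"}    # logical name -> current physical node

def deliver(name, payload, mailboxes):
    """Sender names the endpoint; the routing layer picks the node."""
    node = routing[name]             # resolved at delivery time, not send time
    mailboxes.setdefault(node, []).append(payload)

mailboxes = {}
deliver("orders-db", "txn-1", mailboxes)
routing["orders-db"] = "node-9"      # endpoint migrates or fails over
deliver("orders-db", "txn-2", mailboxes)   # sender code is unchanged
assert mailboxes == {"node-3": ["txn-1"], "node-9": ["txn-2"]}
```

A multicast group generalizes this: the name maps to a set of nodes, so one member can take over for another without the sender ever observing the change.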

Through the scars of earlier startups, we have learned:

  • The robustness we seek — the certainty of exactly once prompt delivery, or prompt explicit failure — cannot be achieved as an overlay on today’s best practice networking
  • Although we obviously still run under the TCP/IP stack, existing Ethernet software stacks and drivers are allowed to drop packets in corner cases, so they do not meet our robustness goal
  • Existing NICs are allowed to drop packets on congestion and in corner cases, so they do not meet our robustness goal

Thus the outline of our product/business model is:

  • Our initial product is off-the-shelf x86 servers, each with an off-the-shelf FPGA-based 8-port NIC, with short cables to the 8 nearest neighbors in our mesh’s plane
  • We contribute code which runs in that FPGA
  • Our mature product is someone else’s module-scale server, with an onboard FPGA running our code, with 2 cm “cables” to 6 or 8 neighbors
  • There are lots of ways this could be built.  We think the industry will find a better way than we currently have in mind.
  • We create a library/API which distributed applications may, but are not required to, use when running on our mesh
  • We believe that certainty around updating shared facts (“consensus”) is worth writing to our API
  • Once we have both the FPGA and the API, there are opportunities to offload application specific work to the FPGA. There is real value in latency and certainty.

Over those decades we have realized technical principles that are critical to this robustness, and that will be counterintuitive to many in the industry:

  • Timeouts, although ubiquitous in computing today, do not work in the general case.  Actions and responses through alternate paths should be used instead.
  • Time stamps only work as well as clock synchronization, which is always an approximation.  Kudos to Google for approaching the asymptote here.
  • When the goal is robustness (absolute accuracy), only the order of events at each given observer can be relied upon.
  • Events should occur at first opportunity, not sit in queues.
  • If more bandwidth than 1/latency is needed, explicitly make events parallel.

We’re working on our demo, and starting to seek out distributed software developers (whom we see as customers) to get our direction on target as we seek funding so we can write code at scale.
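The principle that only per-observer event order can be relied upon is the insight behind Lamport's logical clocks, which order events without any synchronized wall clock at all. A minimal sketch of that classic technique (not Daedaelus code; standard logical-clock rules):

```python
# Lamport logical clocks: ordering from local events and message passing,
# with no shared or synchronized wall-clock time.
class Observer:
    def __init__(self):
        self.clock = 0                      # per-observer logical counter

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock                   # timestamp travels with the message

    def receive(self, ts):
        # Jump past the sender's timestamp so cause precedes effect.
        self.clock = max(self.clock, ts) + 1
        return self.clock

a, b = Observer(), Observer()
a.local_event()                             # a's clock: 1
ts = a.send()                               # a's clock: 2, carried on the wire
b.local_event()                             # b's clock: 1
assert b.receive(ts) == 3                   # b orders the receive after the send
```

Note what this does and does not give you: every causally related pair of events is correctly ordered at each observer, but concurrent events at different observers have no global order, which is exactly the limit the bullet above describes.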