Trinity Mode for Larger Scales

Note: A full breakout of the underlying components is documented in the SDD. For the purposes of this discussion, we focus on the key components relevant to increasing the scale of total distributed workers.

Basic Deployment between Mozart resource manager and Verdi compute nodes

A basic deployment of HySDS has the Mozart resource manager component handling the full load of the distributed Verdi compute nodes. Figure 1 shows this baseline deployment mode: with n workers, Mozart must support 3n persistent connections, since each compute node makes three connections back to the resource manager:

  1. (redis) job status events - Verdi emits job state changes back to Mozart

  2. (rabbitmq) job descriptors - Verdi pops the next job descriptor off the queue

  3. (rabbitmq) control messages - Verdi receives control commands, such as revoking a running job.
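The fan-in implied by the list above can be sketched as a small calculation (a sketch only; the connection labels mirror the three items above, and the function name is illustrative, not part of HySDS):

```python
# Each Verdi worker holds three persistent connections back to Mozart
# in the basic deployment: redis job status, rabbitmq job queue, and
# rabbitmq control messages.
WORKER_CONNECTIONS = ("redis:job-status", "rabbitmq:job-queue", "rabbitmq:control")

def mozart_connection_load(n_workers: int) -> int:
    """Total persistent connections Mozart must support for n workers."""
    return n_workers * len(WORKER_CONNECTIONS)

print(mozart_connection_load(1000))  # 3n -> 3000
```

This linear 3n growth on a single host is what motivates the redistributions described in the following sections.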

At the end of each job, if datasets are created for ingestion back into the GRQ dataset catalog, there are additional periodic calls to submit percolator jobs that evaluate whether any production rules act on the dataset type just ingested.

Figure 1. Basic HySDS deployment of network connectivity between Mozart resource manager and Verdi compute nodes.

Trinity Mode Deployment: move RabbitMQ and ElasticSearch out of original Mozart

The next iteration is to break apart Mozart to enable more scaling. By moving RabbitMQ and ElasticSearch out to their own standalone services, the worker connections are spread across more services. Figure 2 shows the rearranged network topology in this mode, where n workers make 2n persistent connections to the standalone RabbitMQ service; the main Mozart component now needs only n connections for n workers.
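The shift in per-service fan-in between the two modes can be sketched as follows (a sketch; the mode and service names are illustrative labels, not HySDS configuration values):

```python
# Sketch: per-service connection fan-in for n Verdi workers under the
# two deployment modes discussed so far.
def connection_fanin(n_workers: int, mode: str) -> dict:
    if mode == "basic":
        # All three connections per worker terminate on the single Mozart host.
        return {"mozart": 3 * n_workers}
    if mode == "trinity":
        # RabbitMQ (job queue + control) is standalone; only the redis
        # job-status connection still lands on Mozart.
        return {"rabbitmq": 2 * n_workers, "mozart": n_workers}
    raise ValueError(f"unknown mode: {mode}")

print(connection_fanin(1000, "trinity"))  # {'rabbitmq': 2000, 'mozart': 1000}
```

No single service sees the full 3n load anymore, which is the scaling benefit of this rearrangement.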

Figure 2. Trinity Mode deployment where RabbitMQ and ElasticSearch are moved out of original Mozart component.

In this approach, the ElasticSearch component can more easily be replaced with managed services such as AWS OpenSearch; this is what the SWOT SDS PCM uses. Similarly, the RabbitMQ component can be updated to high availability (HA) mode, or alternatively replaced with Amazon MQ.
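For the HA option, one way to mirror queues across a RabbitMQ cluster is a classic mirroring policy applied via rabbitmqctl (a config sketch; the policy name `ha-all` and the catch-all pattern are illustrative choices, and newer RabbitMQ releases recommend quorum queues over classic queue mirroring):

```shell
# Mirror every queue across all nodes of the RabbitMQ cluster and
# synchronize new mirrors automatically. Run on any cluster node.
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```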

Trinity+ Mode Deployment: move Redis, RabbitMQ, and ElasticSearch out of original Mozart

Yet another iteration is to also move Redis out to its own standalone service. This reduces the footprint of core Mozart to its value-added components: the task workers and Logstash for high-rate job management (Figure 3). As in Trinity Mode, the network topology is spread out further, now with Redis as its own standalone service. This enables switching out Redis for managed service offerings such as AWS ElastiCache or MemoryDB.
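With Redis standalone, workers only need a pointer to the new endpoint rather than the Mozart host. A minimal sketch, assuming an environment variable named `REDIS_JOB_STATUS_URL` (the variable name and default URL are hypothetical, not actual HySDS configuration):

```python
import os

# Sketch: resolve the job-status Redis endpoint from the environment so
# the same worker code can point at a Redis colocated with Mozart, a
# standalone Redis, AWS ElastiCache, or MemoryDB without code changes.
DEFAULT_URL = "redis://mozart.local:6379/0"

def job_status_redis_url() -> str:
    return os.environ.get("REDIS_JOB_STATUS_URL", DEFAULT_URL)

# Point the workers at a managed endpoint (hypothetical hostname).
os.environ["REDIS_JOB_STATUS_URL"] = "rediss://my-cluster.cache.amazonaws.com:6379/0"
print(job_status_redis_url())
```

Keeping the endpoint in configuration rather than code is what makes the managed-service swap a deployment change instead of a software change.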

References

https://hysds-core.atlassian.net/wiki/spaces/HYS/pages/199786902#SoftwareDesignDocument(SDD)-EnvironmentView

Diagram Source (Google Slides)

https://www.rabbitmq.com/connections.html

https://www.rabbitmq.com/networking.html#open-file-handle-limit

Note: JPL employees can also get answers to HySDS questions at Stack Overflow Enterprise: