2016-10-28 HySDS v2 large-scale 1M dumby-landsat test run


2:28pm: ASG at 643 dumby-workers on c3.large/us-west-2

  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 40% cpu
    • 8.0MB/s network in
    • 5.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 15% cpu
    • 1.1MB/s network in
    • 1.9MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.0% cpu
    • 2.4MB/s network in
    • 0.75MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 11% cpu
    • 3.2MB/s network in
    • 2.9MB/s network out


3:49pm: ASG at 3000 dumby-workers on c3.large/us-west-2

  • Status:
    • job-queued (947614)
    • job-completed (49369)
    • job-started (3001)
    • job-failed (16)
  • Redeliveries:
    • F (52340)
    • T (46)
  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 21% cpu
    • 25.0MB/s network in
    • 14.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 53% cpu
    • 3.5MB/s network in
    • 2.3MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.7% cpu
    • 5.8MB/s network in
    • 0.45MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 24% cpu
    • 3.7MB/s network in
    • 5.0MB/s network out


4:27pm: increased ASG from 3000 to 4000.

c3.large spot price at 3000 instances


4:37pm: ASG at 3469 dumby-workers on c3.large/us-west-2

  • Status:
    • job-queued (918997)
    • job-completed (77978)
    • job-started (3004)
    • job-failed (21)
  • Redelivered:
    • F (80957)
    • T (46)
  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 15% cpu
    • 29.0MB/s network in
    • 16.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 60% cpu
    • 4.0MB/s network in
    • 2.3MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.7% cpu
    • 6.2MB/s network in
    • 0.47MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 25% cpu
    • 3.7MB/s network in
    • 5.0MB/s network out


4:57pm: ASG showing 4000 c3.large spot instances in the running state.

Mozart FacetView shows intermittent 503 Service Unavailable errors.

  • Status counts and instance metrics unchanged from the 4:37pm snapshot above.


5:05pm: metrics

1000000 jobs / 3000 workers - run 1

  • products ingested into FacetView: 999727
  • missing products: 273
  • dumby-landsat job breakdown:
    • job-completed: 998977
    • job-failed: 975
      • of these, 724 had “timed out” errors trying to queue the product for user_rules_product after publishing products (traceback below):

Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/worker.py", line 811, in run_job
    prod_dir, job_dir)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 384, in ingest
    queue_product(ipath, update_json, product_processed_queue)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 47, in queue_product
    hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/task.py", line 573, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/amqp.py", line 323, in publish_task
    **kwargs
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 382, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 250, in connect
    return self.connection
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 756, in connection
    self._connection = self._establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 711, in _establish_connection
    conn = self.transport.establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: timed out

TODO:

  • break out product ingest to its own worker queue
  • add “full jitter” retry to the submit_job call (the exception raised is socket.error); see the sketch below

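A minimal sketch of what that full-jitter retry could look like; this is not the current HySDS implementation, and the retry count, base, and cap are placeholder values:

import random
import socket
import time

def retry_with_full_jitter(func, max_retries=5, base=1.0, cap=60.0,
                           exceptions=(socket.error,)):
    """Call func(), retrying on the given exceptions with capped exponential
    backoff and full jitter (sleep a random amount in [0, backoff])."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_retries:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# in queue_product(), the orchestrator submission would then become something like:
# retry_with_full_jitter(
#     lambda: hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name))
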
  • the remaining 251 failed jobs (975 - 724) had other errors:
    • “failed to download…Connection timed out”: 150 (no product published)
    • “failed to download…BadStatusLine”: 2 (no product published)
    • “failed to download…500 Internal Server Error”: 1 (no product published)
    • “failed to upload…S3UploadFailedError…”: 1 (no product published)
      • An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
    • “connection timed out to GRQ update”: 95 (no product published)
    • “no such file or directory: …/_context.json”: 2 (unknown cause; no product published; need a running instance to debug further)
  • Of the 273 missing products, 251 are accounted for as failed jobs; 22 are unaccounted for.

  • job-started: 25
    • 4 of them published products
      • job stuck in job-started
      • task is task-failed
      • worker is worker-offline
    • 21 of them did not publish products
      • job stuck in job-started
      • task is task-failed
      • worker is worker-offline
    • in all cases, it appears that the process_events handling of worker-offline failed to find the task-id in ES as job-started
    • TODO: update process_events to query all jobs run by a celery worker and check for job-started in redis (see the query sketch after this breakdown)
    • of the 273 missing products, 251 are accounted for as failed jobs and 21 as stuck in job-started/task-failed that failed to publish; 1 is unaccounted for.

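A rough sketch of the kind of query the updated process_events could run when it sees a worker-offline event. The index name (job_status-current) and field names (celery_hostname, status) below are assumptions for illustration and would need to be checked against the actual mozart ES mappings:

from elasticsearch import Elasticsearch

def find_stuck_jobs(es_url, worker_hostname):
    """Return job status docs for a celery worker that are still job-started."""
    es = Elasticsearch([es_url])
    query = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"celery_hostname": worker_hostname}},  # assumed field name
                    {"term": {"status": "job-started"}},
                ]
            }
        }
    }
    res = es.search(index="job_status-current", body=query, size=1000)  # assumed index name
    return [hit["_source"] for hit in res["hits"]["hits"]]

# on worker-offline, process_events could mark each returned job as job-offline
# (or cross-check the dedup key in redis before doing so).
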
  • job-offline: 23
    • 1 of them did not publish products
      • job stuck in job-offline
      • task is task-failed
      • worker is worker-offline
    • 22 of them published products
      • job stuck in job-offline
      • task is task-succeeded
      • worker is worker-offline
    • looking at the verdi log for task ff10b759-3100-4ea7-b434-2ef4986c9d72:



[2016-10-28 22:21:51,206: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Got exception trying to log job status: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,210: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/log_utils.py", line 141, in log_job_status
    job['status']) # for dedup
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 1093, in setex
    return self.execute_command('SETEX', name, time, value)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error 110 connecting to 172.31.9.0:6379. Connection timed out.

[2016-10-28 22:21:51,222: INFO/MainProcess] Task hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72] succeeded in 561.831776954s: {'payload_id': '7312e18c-a49a-4028-9bb5-ac64c1cf8182', 'status': 'job-completed', 'uuid':...



  • TODO: add “full jitter” retry to log_job_status() and log_job_info() (the exception family is redis.RedisError); the same helper sketched above could wrap these calls, e.g. the snippet below

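For instance, the retry_with_full_jitter() helper from the earlier sketch could wrap the redis dedup write that raised here. The host and port come from the error message; the key, TTL, and value are placeholders rather than what log_job_status() actually writes:

import redis

# assumes retry_with_full_jitter() from the earlier sketch is defined/importable;
# the key, TTL, and value below are placeholders for illustration only
r = redis.StrictRedis(host="172.31.9.0", port=6379)
retry_with_full_jitter(
    lambda: r.setex("dedup-key-for-task-id", 86400, "job-completed"),
    exceptions=(redis.RedisError,))
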



Accounting for the 273 missing products:

  • 251 are accounted for as job-failed (real PGE errors)
    • task-failed
    • job-failed
    • worker-offline (harikiri)
  • 21 are accounted for as stuck in job-started
    • task-failed
    • job-started
    • worker-offline
  • 1 is accounted for as job-offline
    • task-failed
    • job-offline
    • worker-offline


