2016-10-28 HySDS v2 large scale 1M dumby-landsat test run


2:28pm: ASG at 643 dumby-workers on c3.large/us-west-2

  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 40% cpu
    • 8.0MB/s network in
    • 5.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 15% cpu
    • 1.1MB/s network in
    • 1.9MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.0% cpu
    • 2.4MB/s network in
    • 0.75MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 11% cpu
    • 3.2MB/s network in
    • 2.9MB/s network out


3:49pm: ASG at 3000 dumby-workers on c3.large/us-west-2

  • Status:
    • job-queued (947614)
    • job-completed (49369)
    • job-started (3001)
    • job-failed (16)
  • Redeliveries:
    • F (52340)
    • T (46)
  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 21% cpu
    • 25.0MB/s network in
    • 14.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 53% cpu
    • 3.5MB/s network in
    • 2.3MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.7% cpu
    • 5.8MB/s network in
    • 0.45MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 24% cpu
    • 3.7MB/s network in
    • 5.0MB/s network out


4:27pm: increased ASG from 3000 to 4000.

[chart: c3.large spot price at 3000 instances]


4:37pm: ASG at 3469 dumby-workers on c3.large/us-west-2

  • Status:
    • job-queued (918997)
    • job-completed (77978)
    • job-started (3004)
    • job-failed (21)
  • Redelivered:
    • F (80957)
    • T (46)
  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 15% cpu
    • 29.0MB/s network in
    • 16.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 60% cpu
    • 4.0MB/s network in
    • 2.3MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.7% cpu
    • 6.2MB/s network in
    • 0.47MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 25% cpu
    • 3.7MB/s network in
    • 5.0MB/s network out


4:57pm: ASG showing 4000 c3.large spot in running state.

Mozart facetview shows intermittent 503 Service Unavailable

  • Status:
    • job-queued (918997)
    • job-completed (77978)
    • job-started (3004)
    • job-failed (21)
  • Redelivered:
    • F (80957)
    • T (46)
  • hysds-v2-dev-mozart on m4.4xlarge/us-west-2
    • 15% cpu
    • 29.0MB/s network in
    • 16.0MB/s network out
  • hysds-v2-dev-grq on m4.4xlarge/us-west-2
    • 60% cpu
    • 4.0MB/s network in
    • 2.3MB/s network out
  • hysds-v2-dev-metrics on m4.4xlarge/us-west-2
    • 1.7% cpu
    • 6.2MB/s network in
    • 0.47MB/s network out
  • hysds-v2-dev-factotum on c4.2xlarge/us-west-2
    • 25% cpu
    • 3.7MB/s network in
    • 5.0MB/s network out


5:05pm: metrics

1000000 jobs / 3000 workers - run 1

  • products ingested into FacetView: 999727
  • missing products: 273
  • dumby-landsat job breakdown:
    • job-completed: 998977
    • job-failed: 975
      • of these, 724 had “timed out” errors trying to queue the product for user_rules_product after publishing products:



Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/worker.py", line 811, in run_job
    prod_dir, job_dir)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 384, in ingest
    queue_product(ipath, update_json, product_processed_queue)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 47, in queue_product
    hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/task.py", line 573, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/amqp.py", line 323, in publish_task
    **kwargs
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 382, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 250, in connect
    return self.connection
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 756, in connection
    self._connection = self._establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 711, in _establish_connection
    conn = self.transport.establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: timed out

TODO:

  • break out product ingest to its own worker queue
  • add “full jitter” to the submit_job call (the exception is socket.error); see the retry sketch below

  • thus 251 jobs failed with other errors:
    • “failed to download…Connection timed out”: 150 (no product published)
    • “failed to download…BadStatusLine”: 2 (no product published)
    • “failed to download…500 Internal Server Error”: 1 (no product published)
    • “failed to upload…S3UploadFailedError…”: 1 (no product published)
      • An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
    • “connection timed out to GRQ update”: 95 (no product published)
    • “no such file or directory: …/_context.json”: 2 (unknown cause; no product published; need a running instance to debug further)
  • Of the 273 missing products, 251 are accounted for as failed jobs. 22 are unaccounted for.
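
A minimal sketch of the “full jitter” retry called out in the TODO above. The helper name and defaults are illustrative assumptions, not the actual HySDS change; the real fix would wrap the apply_async call inside queue_product() in product_ingest.py (the call that timed out in the traceback above).

# Sketch only: full-jitter backoff (sleep a random interval in
# [0, min(cap, base * 2**attempt)]) around a flaky call.
import random
import socket
import time

def retry_with_full_jitter(func, exceptions=(socket.error,), retries=5,
                           base=1.0, cap=60.0):
    """Call func(); on a listed exception, back off with full jitter and retry."""
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Hypothetical use around the submit that timed out above:
# retry_with_full_jitter(
#     lambda: hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name),
#     exceptions=(socket.error,))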

  • job-started: 25
    • 4 of them published products:
      • job stuck in job-started
      • task is task-failed
      • worker is worker-offline
    • 21 of them did not publish products:
      • job stuck in job-started
      • task is task-failed
      • worker is worker-offline
    • in all cases, it appears that the process_events handling of worker-offline failed to find the task-id in ES as job-started
    • TODO: update process_events to query all jobs run by a celery worker and check for job-started in redis; see the sketch below
  • of the 273 missing products, 251 are accounted for as failed jobs and 21 are accounted for as stuck in job-started/task-failed and failed to publish. 1 is unaccounted for.
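
A hedged sketch of the process_events TODO above: on a worker-offline event, look up that worker’s jobs still marked job-started so they can be cleaned up. The index and field names (job_status-current, celery_hostname, status) are assumptions about the mozart job-status documents, not a confirmed schema.

# Sketch only: find jobs for an offline worker that are still stuck in
# job-started in ES. Index/field names are assumed, not verified.
from elasticsearch import Elasticsearch

def find_stuck_jobs(es_url, worker_hostname, index="job_status-current"):
    """Return job-status docs for this worker that are still job-started."""
    es = Elasticsearch([es_url])
    query = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"celery_hostname": worker_hostname}},
                    {"term": {"status": "job-started"}},
                ]
            }
        }
    }
    res = es.search(index=index, body=query, size=1000)
    return [hit["_source"] for hit in res["hits"]["hits"]]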

  • job-offline: 23
    • 1 of them did not publish products:
      • job stuck in job-offline
      • task is task-failed
      • worker is worker-offline
    • 22 of them published products:
      • job stuck in job-offline
      • task is task-succeeded
      • worker is worker-offline
    • looking at the verdi log for task ff10b759-3100-4ea7-b434-2ef4986c9d72:



[2016-10-28 22:21:51,206: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Got exception trying to log job status: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,210: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/log_utils.py", line 141, in log_job_status
    job['status']) # for dedup
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 1093, in setex
    return self.execute_command('SETEX', name, time, value)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error 110 connecting to 172.31.9.0:6379. Connection timed out.

[2016-10-28 22:21:51,222: INFO/MainProcess] Task hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72] succeeded in 561.831776954s: {'payload_id': '7312e18c-a49a-4028-9bb5-ac64c1cf8182', 'status': 'job-completed', 'uuid':...



  • TODO: add “full jitter” to log_job_status() and log_job_info() as well (here the exception is redis.RedisError); see the sketch below
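
The same full-jitter helper sketched earlier could wrap the dedup SETEX that timed out in the verdi log above; again only a sketch, with the function name, argument names, and TTL assumed rather than taken from log_utils.py.

# Sketch only: retry the dedup SETEX (see the log_utils.py traceback above)
# on transient redis failures instead of letting the status log fail.
# The ttl default here is arbitrary.
import redis

def set_dedup_status(redis_client, key, status, ttl=86400):
    retry_with_full_jitter(
        lambda: redis_client.setex(key, ttl, status),
        exceptions=(redis.RedisError,))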




Summary of the 273 missing products:

  • 251 are accounted for as job-failed (real PGE errors):
    • task-failed
    • job-failed
    • worker-offline (harikiri)
  • 21 are accounted for as stuck in job-started:
    • task-failed
    • job-started
    • worker-offline
  • 1 is accounted for as job-offline:
    • task-failed
    • job-offline
    • worker-offline
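
Reconciliation check: 251 + 21 + 1 = 273, so every missing product is now accounted for. The job-state counts also sum to the full run: 998977 job-completed + 975 job-failed + 25 job-started + 23 job-offline = 1000000 jobs.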


