2016-10-28 HySDS v2 large scale 1M dumby-landsat test run
2:28pm: ASG at 643 dumby-workers on c3.large/us-west-2
3:49pm: ASG at 3000 dumby-workers on c3.large/us-west-2
4:27pm: increased ASG from 3000 to 4000. c3.large spot price at 3000 instances
4:37pm: ASG at 3469 dumby-workers on c3.large/us-west-2
4:57pm: ASG showing 4000 c3.large spot in running state. Mozart FacetView shows intermittent 503 Service Unavailable
5:05pm: metrics
1000000 jobs/3000 workers - run 1
products ingested into FacetView: 999727
missing products: 273
dumby-landsat job breakdown
job-completed: 998977
job-failed: 975
of these, 724 had “timed out” errors trying to queue product for user_rules_product after publishing products
Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/worker.py", line 811, in run_job
    prod_dir, job_dir)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 384, in ingest
    queue_product(ipath, update_json, product_processed_queue)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 47, in queue_product
    hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/task.py", line 573, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/amqp.py", line 323, in publish_task
    **kwargs
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 382, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 250, in connect
    return self.connection
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 756, in connection
    self._connection = self._establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 711, in _establish_connection
    conn = self.transport.establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: timed out
TODO:
break out product ingest to its own worker queue
add “full jitter” to submit_job call (exception is socket.error)
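A minimal sketch of what that full-jitter retry could look like, assuming a hypothetical wrapper in product_ingest.py (the function name and retry parameters below are illustrative, not existing HySDS code); it wraps the same apply_async call shown in the traceback above and retries socket.error with exponentially growing, randomly jittered sleeps:

import random
import socket
import time

import hysds.orchestrator

def queue_product_with_full_jitter(payload, queue_name, max_retries=10, base=1, cap=60):
    """Submit the product payload to the orchestrator, retrying socket.error with full jitter."""
    for attempt in range(max_retries):
        try:
            return hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)
        except socket.error:
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep a random amount in [0, min(cap, base * 2**attempt)] seconds
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Full jitter spreads the retries uniformly over the backoff window, so thousands of workers that hit the broker timeout at the same moment do not all retry in lockstep.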
Thus, 251 jobs failed with other errors:
“failed to download…Connection timed out”: 150 (no product published)
“failed to download…BadStatusLine”: 2 (no product published)
“failed to download…500 Internal Server Error”: 1 (no product published)
“failed to upload…S3UploadFailedError…”: 1 (no product published)
An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
“connection timed out to GRQ update”: 95 (no product published)
“no such file or directory: …/_context.json”: 2 (unknown cause; no product published; need running instance to further debug)
Of the 273 missing products, 251 are accounted for as failed jobs. 22 are unaccounted for.
job-started: 25
4 of them published products
job stuck in job-started
task is task-failed
worker is worker-offline
21 of them did not publish products
job stuck in job-started
task is task-failed
worker is worker-offline
In all cases, it seems the process_events handling of worker-offline failed to find the task-id in ES as job-started
TODO: update process_events to query all jobs run by a celery worker and check for job-started in redis (see the sketch after this list)
Of the 273 missing products, 251 are accounted for as failed jobs and 21 are accounted for as stuck in job-started/task-failed that failed to publish. 1 is unaccounted for.
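A rough sketch of that process_events TODO, assuming illustrative names throughout (the ES/redis endpoints, the job_status-current index, the celery_hostname field, and the redis key layout are all assumptions, not the actual HySDS schema): on a worker-offline event, query ES for every job the worker ran and flip any job still stuck in job-started to job-offline.

from elasticsearch import Elasticsearch
import redis

# placeholder endpoints for the mozart ES and redis instances
es = Elasticsearch(["http://mozart-es:9200"])
rd = redis.StrictRedis(host="mozart-redis", port=6379)

def handle_worker_offline(worker_hostname):
    """On a worker-offline event, find jobs this worker ran that are still
    marked job-started and flip them to job-offline."""
    body = {
        "query": {"term": {"celery_hostname": worker_hostname}},  # assumed field name
        "size": 1000,
    }
    res = es.search(index="job_status-current", body=body)  # assumed index name
    for hit in res["hits"]["hits"]:
        job = hit["_source"]
        # assumed redis key layout, mirroring the dedup key log_job_status() writes
        redis_status = rd.get("hysds-job-status-%s" % job["uuid"])
        if redis_status == "job-started" or job.get("status") == "job-started":
            job["status"] = "job-offline"
            job["error"] = "worker %s went offline" % worker_hostname
            es.index(index="job_status-current", doc_type=hit["_type"],
                     id=hit["_id"], body=job)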
job-offline: 23
1 of them did not publish products
job stuck in job-offline
task is task-failed
worker is worker-offline
22 of them published products
job stuck in job-offline
task is task-succeeded
worker is worker-offline
Looking at the verdi log for task ff10b759-3100-4ea7-b434-2ef4986c9d72:
[2016-10-28 22:21:51,206: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Got exception trying to log job status: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,210: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/log_utils.py", line 141, in log_job_status
    job['status']) # for dedup
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 1093, in setex
    return self.execute_command('SETEX', name, time, value)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,222: INFO/MainProcess] Task hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72] succeeded in 561.831776954s: {'payload_id': '7312e18c-a49a-4028-9bb5-ac64c1cf8182', 'status': 'job-completed', 'uuid':...
TODO: Need to add “full jitter” to log_job_status() and log_job_info() (exception is redis.RedisError)
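The same full-jitter idea could be applied to the redis writes, e.g. as a hypothetical decorator (retry_on_redis_error below is not existing HySDS code) that log_job_status() and log_job_info() get wrapped with, so a redis.RedisError triggers jittered retries instead of a single failure:

import functools
import random
import time

import redis

def retry_on_redis_error(max_retries=10, base=1, cap=60):
    """Retry the wrapped call on redis.RedisError using full-jitter backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except redis.RedisError:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        return wrapper
    return decorator

# hypothetical usage in hysds/log_utils.py:
# @retry_on_redis_error()
# def log_job_status(job): ...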
Summary of the 273 missing products:
251 are accounted for as job-failed (real PGE errors)
task-failed
job-failed
worker-offline (harikiri)
21 are accounted for as stuck in job-started
task-failed
job-started
worker-offline
1 is accounted for as job-offline
task-failed
job-offline
worker-offline