2016-10-28 HySDS v2 large-scale 1M dumby-landsat test run
2:28pm: ASG at 643 dumby-workers on c3.large/us-west-2
3:49pm: ASG at 3000 dumby-workers on c3.large/us-west-2
4:27pm: increased ASG from 3000 to 4000. c3.large spot price at 3000 instances
4:37pm: ASG at 3469 dumby-workers on c3.large/us-west-2
4:57pm: ASG showing 4000 c3.large spot instances in running state. Mozart FacetView shows intermittent 503 Service Unavailable errors
5:05pm: metrics:
1000000 jobs/3000 workers - run 1
products ingested into FacetView: 999727
missing products: 273
dumby-landsat job breakdown
job-completed: 998977
job-failed: 975
of these, 724 had “timed out” errors trying to queue product for user_rules_product after publishing products
Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/worker.py", line 811, in run_job
    prod_dir, job_dir)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 384, in ingest
    queue_product(ipath, update_json, product_processed_queue)
  File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 47, in queue_product
    hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/task.py", line 573, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
    reply_to=reply_to or self.oid, **options
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured
    interval_max)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 382, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 250, in connect
    return self.connection
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 756, in connection
    self._connection = self._establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 711, in _establish_connection
    conn = self.transport.establish_connection()
  File "/home/ops/verdi/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: timed out
TODO:
break out product ingest to its own worker queue
add “full jitter” to submit_job call (exception is socket.error)
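A minimal sketch of what the “full jitter” retry around the submit_job call in queue_product() could look like; the helper name, retry count, and backoff bounds are hypothetical, not existing HySDS code:

import random
import socket
import time

def retry_with_full_jitter(func, args=(), kwargs=None, tries=10, base=1, cap=64,
                           exceptions=(socket.error,)):
    """Call func(*args, **kwargs), retrying on the given exceptions with
    exponential backoff plus full jitter. Hypothetical helper, not existing
    HySDS code."""
    kwargs = kwargs or {}
    for attempt in range(tries):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == tries - 1:
                raise
            # full jitter: sleep a random time in [0, min(cap, base * 2**attempt)]
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# usage in queue_product() (call taken from the traceback above), wrapping the existing submit:
# retry_with_full_jitter(hysds.orchestrator.submit_job.apply_async,
#                        args=((payload,),), kwargs={"queue": queue_name},
#                        exceptions=(socket.error,))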
the remaining 251 jobs failed with other errors:
“failed to download…Connection timed out”: 150 (no product published)
“failed to download…BadStatusLine”: 2 (no product published)
“failed to download…500 Internal Server Error”: 1 (no product published)
“failed to upload…S3UploadFailedError…”: 1 (no product published)
An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
“connection timed out to GRQ update”: 95 (no product published)
“no such file or directory: …/_context.json”: 2 (unknown cause; no product published; need running instance to further debug)
Of the 273 missing products, 251 are accounted for as failed jobs; 22 are unaccounted for.
job-started: 25
4 of them published products
job stuck in job-started
task is task-failed
worker is worker-offline
21 of them did not publish products
job stuck in job-started
task is task-failed
worker is worker-offline
in all cases, it seems the process_events handling of worker-offline failed to find the task-id in ES as job-started
TODO: update process_events to query all jobs run by a celery worker and check for job-started in redis (see the sketch below)
Of the 273 missing products, 251 are accounted for as failed jobs and 21 are accounted for as stuck in job-started/task-failed without publishing. 1 is unaccounted for.
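A rough sketch of the process_events reconciliation described in the TODO above, under stated assumptions: the ES endpoint, job_status index name, celery_hostname/status fields, and the redis dedup key convention written by log_job_status() are all illustrative, not verified HySDS internals.

import redis
from elasticsearch import Elasticsearch

# hypothetical endpoints/index names; real values come from the mozart configuration
ES_URL = "http://mozart-es:9200"
JOB_STATUS_INDEX = "job_status-current"
REDIS_URL = "redis://mozart-redis:6379"

def reconcile_offline_worker(worker_hostname):
    """On a worker-offline event, find every job run by this celery worker
    that is still marked job-started and flag it. Sketch only: index name,
    field names, and the redis dedup key convention are assumptions."""
    es = Elasticsearch([ES_URL])
    r = redis.StrictRedis.from_url(REDIS_URL)
    query = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"celery_hostname": worker_hostname}},
                    {"term": {"status": "job-started"}},
                ]
            }
        }
    }
    res = es.search(index=JOB_STATUS_INDEX, body=query, size=1000)
    for hit in res["hits"]["hits"]:
        task_id = hit["_id"]
        # log_job_status() also caches the latest status in redis for dedup;
        # if redis still shows job-started (or the key expired), the job is
        # stuck and should be transitioned out of job-started
        cached = r.get(task_id)
        if cached is None or cached.decode() == "job-started":
            print("job %s stuck in job-started on offline worker %s"
                  % (task_id, worker_hostname))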
job-offline: 23
1 of them did not publish products
job stuck in job-offline
task is task-failed
worker is worker-offline
22 of them published products
job stuck in job-offline
task is task-succeeded
worker is worker-offline
looking at the verdi log for task ff10b759-3100-4ea7-b434-2ef4986c9d72:
[2016-10-28 22:21:51,206: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Got exception trying to log job status: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,210: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds/hysds/log_utils.py", line 141, in log_job_status
    job['status'])  # for dedup
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 1093, in setex
    return self.execute_command('SETEX', name, time, value)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command
    connection.send_command(*args)
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error 110 connecting to 172.31.9.0:6379. Connection timed out.
[2016-10-28 22:21:51,222: INFO/MainProcess] Task hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72] succeeded in 561.831776954s: {'payload_id': '7312e18c-a49a-4028-9bb5-ac64c1cf8182', 'status': 'job-completed', 'uuid':...
TODO: Need to add “full jitter” to log_job_status() and log_job_info() (exception is redis.RedisError)
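The same full-jitter helper sketched after the earlier TODO list could wrap the redis write in log_job_status()/log_job_info(), retrying on redis.RedisError; the dedup key and TTL below are illustrative, not the actual log_utils.py values:

import redis

# inside log_job_status(), reusing retry_with_full_jitter() from the earlier
# sketch; `r` is the redis client, `job` the status payload, and the
# dedup key/TTL values are illustrative
retry_with_full_jitter(r.setex,
                       args=(job['uuid'], 86400, job['status']),
                       exceptions=(redis.RedisError,))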
Final accounting of the 273 missing products:
251 are accounted for as job-failed (real PGE errors)
task-failed
job-failed
worker-offline (harikiri)
21 are accounted for as stuck in job-started
task-failed
job-started
worker-offline
1 is accounted for as job-offline
task-failed
job-offline
worker-offline