2:28pm: ASG at 643 dumby-workers on c3.large/us-west-2

hysds-v2-dev-mozart on m4.4xlarge/us-west-2

40% cpu
8.0MB/s network in
5.0MB/s network out

hysds-v2-dev-grq on m4.4xlarge/us-west-2

15% cpu
1.1MB/s network in
1.9MB/s network out

hysds-v2-dev-metrics on m4.4xlarge/us-west-2

1.0% cpu
2.4MB/s network in
0.75MB/s network out

hysds-v2-dev-factotum on c4.2xlarge/us-west-2

11% cpu
3.2MB/s network in
2.9MB/s network out

3:49pm: ASG at 3000 dumby-workers on c3.large/us-west-2

Status:

job-queued (947614)
job-completed (49369)
job-started (3001)
job-failed (16)

Redeliveries:

F (52340)
T (46)

hysds-v2-dev-mozart on m4.4xlarge/us-west-2

21% cpu
25.0MB/s network in
14.0MB/s network out

hysds-v2-dev-grq on m4.4xlarge/us-west-2

53% cpu
3.5MB/s network in
2.3MB/s network out

hysds-v2-dev-metrics on m4.4xlarge/us-west-2

1.7% cpu
5.8MB/s network in
0.45MB/s network out

hysds-v2-dev-factotum on c4.2xlarge/us-west-2

24% cpu
3.7MB/s network in
5.0MB/s network out

4:27pm: increased ASG from 3000 to 4000.

c3.large spot price at 3000 instances

4:37pm: ASG at 3469 dumby-workers on c3.large/us-west-2

Status:

job-queued (918997)
job-completed (77978)
job-started (3004)
job-failed (21)

Redelivered:

F (80957)
T (46)

hysds-v2-dev-mozart on m4.4xlarge/us-west-2

15% cpu
29.0MB/s network in
16.0MB/s network out

hysds-v2-dev-grq on m4.4xlarge/us-west-2

60% cpu
4.0MB/s network in
2.3MB/s network out

hysds-v2-dev-metrics on m4.4xlarge/us-west-2

1.7% cpu
6.2MB/s network in
0.47MB/s network out

hysds-v2-dev-factotum on c4.2xlarge/us-west-2

25% cpu
3.7MB/s network in
5.0MB/s network out

4:57pm: ASG showing 4000 c3.large spot in running state.

Mozart facetview shows intermittent 503 Service Unavailable

Status:

job-queued (918997)
job-completed (77978)
job-started (3004)
job-failed (21)

Redelivered:

F (80957)
T (46)

hysds-v2-dev-mozart on m4.4xlarge/us-west-2

15% cpu
29.0MB/s network in
16.0MB/s network out

hysds-v2-dev-grq on m4.4xlarge/us-west-2

60% cpu
4.0MB/s network in
2.3MB/s network out

hysds-v2-dev-metrics on m4.4xlarge/us-west-2

1.7% cpu
6.2MB/s network in
0.47MB/s network out

hysds-v2-dev-factotum on c4.2xlarge/us-west-2

25% cpu
3.7MB/s network in
5.0MB/s network out

5:05pm: metrics

1000000 jobs/3000 workers - run 1

products ingested into FacetView: 999727

missing products: 273

dumby-landsat job breakdown

job-completed: 998977

job-failed: 975

of these, 724 had “timed out” errors trying to queue product for user_rules_product after publishing products

Traceback (most recent call last):

File "/home/ops/verdi/ops/hysds/hysds/worker.py", line 811, in run_job

prod_dir, job_dir)

File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 384, in ingest

queue_product(ipath, update_json, product_processed_queue)

File "/home/ops/verdi/ops/hysds/hysds/product_ingest.py", line 47, in queue_product

hysds.orchestrator.submit_job.apply_async((payload,), queue=queue_name)

File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/task.py", line 573, in apply_async

dict(self._get_exec_options(), options)

File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task

reply_to=reply_to or self.oid, **options

File "/home/ops/verdi/lib/python2.7/site-packages/celery/app/amqp.py", line 323, in publish_task

**kwargs

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish

routing_key, mandatory, immediate, exchange, declare)

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 470, in _ensured

interval_max)

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 382, in ensure_connection

interval_start, interval_step, interval_max, callback)

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/utils/init.py", line 246, in retry_over_time

return fun(*args, **kwargs)

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 250, in connect

return self.connection

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 756, in connection

self._connection = self._establish_connection()

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/connection.py", line 711, in _establish_connection

conn = self.transport.establish_connection()

File "/home/ops/verdi/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection

conn = self.Connection(**opts)

File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 165, in init

self.transport = self.Transport(host, connect_timeout, ssl)

File "/home/ops/verdi/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport

return create_transport(host, connect_timeout, ssl)

File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport

return TCPTransport(host, connect_timeout)

File "/home/ops/verdi/lib/python2.7/site-packages/amqp/transport.py", line 95, in init

raise socket.error(last_err)

error: timed out

TODO

break out product ingest to it’s own worker queue

add “full jitter” to submit_job call (exception is socket.error)

thus 251 jobs failed with other errors

“failed to download…Connection timed out”: 150 (no product published)

“failed to download…BadStatusLine”: 2 (no product published)

“failed to download…500 Internal Server Error”: 1 (no product published)

“failed to upload…S3UploadFailedError…”: 1 (no product published)

An error occurred (RequestTimeout) when calling the PutObject operation (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.

“connection timed out to GRQ update”: 95 (no product published)

“no such file or directory: …/_context.json”: 2 (unknown cause; no product published; need running instance to further debug)

Of the 273 missing products, 251 are accounted for as failed jobs. 22 are unaccounted.

job-started: 25

4 of them published products

job stuck in job-started

task is task-failed

worker is worker-offline

21 of them did not published products

job stuck in job-started

task is task-failed

worker is worker-offline

in all cases, seems that process_events handling of worker-offline failed to find the task-id in ES as job-started

TODO: update process_events to query all jobs ran by a celery worker and check for job-started in redis

of the 273 missing products, 251 are accounted for as failed jobs and 21 are accounted for as stuck in job-started/task-failed and failed to publish. 1 is unaccounted for.

job-offline: 23:

1 of them did not publish products

job stuck in job-offline

task is task-failed

worker is worker-offline

22 of them published products

job stuck in job-offline

task is task-succeeded

worker is worker-offline

looking at task ff10b759-3100-4ea7-b434-2ef4986c9d72 verdi log

[2016-10-28 22:21:51,206: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Got exception trying to log job status: Error 110 connecting to 172.31.9.0:6379. Connection timed out.

[2016-10-28 22:21:51,210: ERROR/Worker-1] hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72]: Traceback (most recent call last):

File "/home/ops/verdi/ops/hysds/hysds/log_utils.py", line 141, in log_job_status

job['status']) # for dedup

File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 1093, in setex

return self.execute_command('SETEX', name, time, value)

File "/home/ops/verdi/lib/python2.7/site-packages/redis/client.py", line 578, in execute_command

connection.send_command(*args)

File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command

self.send_packed_command(self.pack_command(*args))

File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 538, in send_packed_command

self.connect()

File "/home/ops/verdi/lib/python2.7/site-packages/redis/connection.py", line 442, in connect

raise ConnectionError(self._error_message(e))

ConnectionError: Error 110 connecting to 172.31.9.0:6379. Connection timed out.

[2016-10-28 22:21:51,222: INFO/MainProcess] Task hysds.worker.run_job[ff10b759-3100-4ea7-b434-2ef4986c9d72] succeeded in 561.831776954s: {'payload_id': '7312e18c-a49a-4028-9bb5-ac64c1cf8182', 'status': 'job-completed', 'uuid':...

TODO: Need to add “full jitter” to log_job_status() and log_job_info() (exception is redis.RedisError)

251 are accounted for as job-failed (real PGE errors)

task-failed

job-failed

worker-offline (harikiri)

21 are accounted for as stuck in job-started

task-failed

job-started

worker-offline

1 is accounted for as job-offline

task-failed

job-offline

worker-offline

Browser not supported