
Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Integrating a "Hello World" Dataset Type

For definitions of terminology used, please refer to our terminology reference.

In this tutorial, we continue where we left off in integrating our first job type, Hello World. We will update our job type to generate a HySDS dataset that will be automatically ingested and catalogued into the system.

  1. On the mozart instance, navigate to the hello_world repository

    Code Block
    cd ~/mozart/ops/hello_world
    
  2. We need to author a dataset ID scheme according to the HySDS dataset specification. In this tutorial, we will use the following dataset ID scheme

    Code Block
    hello_world-product-<year><month><day>T<hour><minute><second>Z-<hash>
    

    where <year> will be replaced by a fictional observation year, and likewise for <month>, <day>, <hour>, <minute>, <second>, and <hash>. The seconds field may also carry a fractional part, as in this example instance of the dataset ID

    Code Block
    hello_world-product-20170901T000102.021Z-a3d0x
    
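
    Purely as an illustration of this ID scheme (not a step in the tutorial), you could check an example ID against it with a bash regular expression. The pattern below is a hand-written assumption for this example only and is not used anywhere by HySDS

    Code Block
    # illustrative only: a hand-written pattern for the ID scheme above,
    # not a pattern used anywhere by HySDS itself
    id="hello_world-product-20170901T000102.021Z-a3d0x"
    re='^hello_world-product-[0-9]{8}T[0-9]{6}([.][0-9]+)?Z-[a-z0-9]+$'
    if [[ $id =~ $re ]]; then echo "ID matches the scheme"; fi
    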
  3. Edit the script that runs our job type

    Code Block
    vi run_hello_world.sh
    

    and add code to create an instance of our dataset. The script should look like this

    Code Block
    #!/bin/bash
    echo "Hello World"
    echo "Hello World to STDERR" 1>&2
    
    # generate dataset ID
    timestamp=$(date -u +%Y%m%dT%H%M%S.%NZ)
    hash=$(echo $timestamp | sha224sum | cut -c1-5)
    id=hello_world-product-${timestamp}-${hash}
    echo "dataset ID: $id"
    
    # create dataset directory
    mkdir $id
    
    # create fake data
    fake_data_file=${id}/fake_data.dat
    dd if=/dev/urandom of=$fake_data_file bs=1M count=5
    
    # create minimal dataset JSON file
    dataset_json_file=${id}/${id}.dataset.json
    echo "{\"version\": \"v1.0\"}" > $dataset_json_file
    
    # create minimal metadata file
    metadata_json_file=${id}/${id}.met.json
    echo "{}" > $metadata_json_file
    

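    The dataset JSON written by the script is the bare minimum: just the required version field. If your product has a time range or a footprint, the dataset specification allows additional fields. The sketch below is only an illustration; every field name other than "version" is an assumption that you should verify against the HySDS dataset JSON specification before use. It would replace the echo line that writes $dataset_json_file above

    Code Block
    # hypothetical richer dataset JSON -- field names other than "version"
    # are assumptions; verify them against the HySDS dataset specification
    cat > $dataset_json_file <<EOF
    {
      "version": "v1.0",
      "label": "Hello World product $timestamp",
      "starttime": "2017-09-01T00:01:02Z",
      "endtime": "2017-09-01T00:01:03Z",
      "location": {
        "type": "Polygon",
        "coordinates": [[[-118.0, 34.0], [-117.0, 34.0], [-117.0, 35.0], [-118.0, 35.0], [-118.0, 34.0]]]
      }
    }
    EOF
    
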
    See the documentation on the specifications for the HySDS dataset JSON file and metadata JSON file for more detail. Finally, commit your change and push the update

    Code Block
    git commit -a -m "add dataset generation"
    git push
    

    If your Jenkins job for the master branch of your hello_world repo was configured correctly, the git push should have kicked off a rebuild of the container. If not, navigate to your CI instance (e.g. http://:8080) and manually schedule a build

    (screenshot: Jenkins)
  4. Test the script by running it

    Code Block
    ./run_hello_world.sh  
    dataset ID: hello_world-product-20170907T141956.052530118Z-8d5a7
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 0.439011 s, 11.9 MB/s
    

    You should see the dataset directory you just created in your working directory. Ensure that the fake_data.dat file and the HySDS dataset and metadata JSON files were generated

    Code Block
    ls -l hello_world-product-*
    total 5128
    -rw-r--r-- 1 ops ops 5242880 Sep  7 14:19 fake_data.dat
    -rw-r--r-- 1 ops ops      20 Sep  7 14:19 hello_world-product-20170907T141956.052530118Z-8d5a7.dataset.json
    -rw-r--r-- 1 ops ops       3 Sep  7 14:19 hello_world-product-20170907T141956.052530118Z-8d5a7.met.json
    
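
    Optionally, confirm that the JSON files are well-formed before attempting ingestion. This check uses Python's built-in json.tool module (Python is already present on the mozart instance, since the ingest script used below is written in it); substitute your own dataset ID, and repeat for the .met.json file

    Code Block
    python -m json.tool hello_world-product-20170907T141956.052530118Z-8d5a7/hello_world-product-20170907T141956.052530118Z-8d5a7.dataset.json
    {
        "version": "v1.0"
    }
    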
  5. Now let's try to ingest this new dataset into our catalog

    Code Block
    ~/mozart/ops/hysds/scripts/ingest_dataset.py hello_world-product-20170907T141956.052530118Z-8d5a7 ~/mozart/etc/datasets.json
    

    You should see an error indicating that the dataset was not recognized

    Code Block
    Traceback (most recent call last):
      File "/home/ops/mozart/ops/hysds/scripts/ingest_dataset.py", line 16, in <module>
        os.path.abspath(args.ds_dir), None)
      File "/home/ops/mozart/ops/hysds/hysds/dataset_ingest.py", line 204, in ingest
        r = Recognizer(dsets_file, local_prod_path, objectid, version)
      File "/home/ops/mozart/ops/hysds/hysds/recognize.py", line 53, in __init__
        self._recognize(path)
      File "/home/ops/mozart/ops/hysds/hysds/recognize.py", line 84, in _recognize
        raise RecognizerError("No dataset configured for %s. Check %s." % (path, self.dataset_file))
    hysds.recognize.RecognizerError: No dataset configured for /home/ops/mozart/ops/hello_world/hello_world-product-20170907T141956.052530118Z-8d5a7. Check /home/ops/mozart/etc/datasets.json.
    
  6. Let's configure our new dataset so that it can be recognized by our HySDS cluster. Since we want to ensure that all nodes in our cluster are able to recognize this new dataset, we need to make updates to the datasets.json file in ~/.sds/files. Still on mozart

    Code Block
    cd ~/.sds/files
    

    edit the datasets.json file

    Code Block
    vi datasets.json
    

    and add the following dataset configuration below the dumby-product configuration

    Code Block
     { 
       "ipath": "hysds::data/hello_world-product",
       "match_pattern": "/(?P<id>hello_world-product-(?P<year>\\d{4})(?P<month>\\d{2})(?P<day>\\d{2})T.*)$",
       "alt_match_pattern": null,
       "extractor": null,
       "level": "l0",
       "type": "hello_world-data",
       "publish": {
         "s3-profile-name": "default",
         "location": "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/products/hello_world/{version}/{year}/{month}/{day}/{id}",
         "urls": [
           "http://{{ DATASET_BUCKET }}.{{ DATASET_S3_WEBSITE_ENDPOINT }}/products/hello_world/{version}/{year}/{month}/{day}/{id}",
           "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/products/hello_world/{version}/{year}/{month}/{day}/{id}"
         ]
       },
       "browse": {
         "s3-profile-name": "default",
         "location": "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/browse/hello_world/{version}/{year}/{month}/{day}/{id}",
         "urls": [
           "http://{{ DATASET_BUCKET }}.{{ DATASET_S3_WEBSITE_ENDPOINT }}/browse/hello_world/{version}/{year}/{month}/{day}/{id}",
           "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/browse/hello_world/{version}/{year}/{month}/{day}/{id}"
         ]
       }
     }
    

    and save.
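
    Before pushing this configuration out, you can optionally sanity-check that the match_pattern recognizes the full local path of your dataset directory. The one-liner below is only an illustration and is not part of the HySDS tooling; adjust the path to your own dataset ID

    Code Block
    python -c 'import re; pat = r"/(?P<id>hello_world-product-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})T.*)$"; path = "/home/ops/mozart/ops/hello_world/hello_world-product-20170907T141956.052530118Z-8d5a7"; m = re.search(pat, path); print(m.groupdict() if m else "no match")'
    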

  7. Run the following fabric command to push the update you made to the datasets.json template out to mozart

    Code Block
    cd ..
    fab -f cluster.py -R mozart send_template:datasets.json,~/mozart/etc/datasets.json
    

    You should have output similar to the following

    Code Block
    [172.31.4.210] Executing task 'send_template'
    [172.31.4.210] run: cp "$(echo ~/mozart/etc/datasets.json)"{,.bak}
    [172.31.4.210] put: <file obj> -> /home/ops/mozart/etc/datasets.json
    
    Done.
    
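
    Optionally, confirm that the rendered datasets.json on mozart now contains the new entry

    Code Block
    grep -n hello_world-product ~/mozart/etc/datasets.json
    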
  8. Now let's try again to ingest this new dataset into our catalog

    Code Block
    cd ~/mozart/ops/hello_world
    ~/mozart/ops/hysds/scripts/ingest_dataset.py hello_world-product-20170907T141956.052530118Z-8d5a7 ~/mozart/etc/datasets.json
    
  9. Verify that your new dataset was published to your HySDS cluster. Open up a browser and point it to your grq instance's tosca (Dataset FacetSearch) interface (e.g. https:///search). On the left panel, you should see your new dataset type listed under the "dataset" facet, showing 1 result

    (screenshot: tosca)

    Click on "hello_world-product" under the "dataset" facet, and the result set should now be constrained to just the dataset you ingested

    (screenshot: dataset facet)
  10. Now that we've verified that the new dataset is recognized by the system, let's push the updates out to the rest of the cluster. On mozart

    Code Block
    sds stop all
    sds update all
    sds ship
    sds start all
    

That's it!

Congratulations, you've integrated a new dataset type into your HySDS cluster.

To configure Autoscaling groups for your HySDS cluster, continue on to Create-AWS-Autoscaling-Group-for-Verdi.

To configure a staging area for your HySDS cluster, continue on to Create-AWS-Resources-for-Staging-Area.


Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.

JPLers can also ask HySDS questions at Stack Overflow Enterprise.
