And not even data clustering – here I’m just whacking at the Sun Grid Engine cluster to run my data processing scripts more efficiently. The good news is that I’ve done a barebones port and successfully improved my cluster submission experience by noticing the following:
- you can put default options for submitted jobs in ~/.sge_request
- for me this is:
  - -v PATH to export PATH (yay! now my scripts can find mysql, and the rest of my shell scripts)
  - -j y to merge the error output stream into the output file
  - -cwd to run the batch script from the current working directory
  - -S /bin/bash to use bash as the shell for invoking the scripts
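For reference, here is a minimal sketch of what that defaults file could look like with the four options above, one per line. Grid Engine reads per-user defaults from ~/.sge_request; the snippet writes to a scratch file instead (the `request_file` variable is just a name picked for this example), so nothing in your home directory is overwritten until you have checked the contents.

```shell
# Write the default qsub options to a scratch file first.
# request_file is a hypothetical name used only for this sketch.
request_file="$(mktemp)"

cat > "$request_file" <<'EOF'
-v PATH
-j y
-cwd
-S /bin/bash
EOF

# Show what was written so it can be checked by eye.
cat "$request_file"
```

Once it looks right, copying it into place with `cp "$request_file" ~/.sge_request` should make every later qsub pick up these options without repeating them on the command line.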
Planned addition: I notice there’s parallel make support via qmake, but it seems to be broken on our current cluster. Once that works, though, I could theoretically rewrite my script files (urgh…again) as Makefiles (or at least invoke them via Makefiles), and then the parallel make would take care of scheduling all the jobs…and making sure that they happen in the proper sequence. For this to work, I need to be able to submit jobs from the cluster nodes as well as from the head node.
Also discovered – I need to create indexes for some of my derived tables, which should speed up queries significantly. I was going the route of mesh_ids for the mesh terms, but I think the indexing is far more critical.