diff options
author      Matthias Beyer <mail@beyermatthias.de>  2020-12-15 09:39:54 +0100
committer   Matthias Beyer <mail@beyermatthias.de>  2020-12-15 10:14:26 +0100
commit      f4592ab5d21f4e4dcd23f0153699b27b687bfc65 (patch)
tree        e19fb185708ae441e0c07aadabb7e07395383288 /src/endpoint/scheduler.rs
parent      352390565280165af60a6597247d4ed4008482f6 (diff)
Refactor job running on endpoint
This commit refactors the job running in the endpoint implementation to
be a multi-stage (multi-function-call) process.
Before this patch, preparing, starting, and running the container, as well
as shutting it down, all happened in one huge function.
This was sub-optimal, because we had to accumulate information during
the run and be extra careful when returning from that function.
Ultimately, we did not succeed in doing a good job here, which
resulted in data loss if a script exited too early, because the
database code was simply never reached.
Consider the following packaging script:
exit 1
this does nothing. In fact, it does not even notify butido that it
errored. This is not a contrived scenario, because a valid script might
(unintentionally) fail on its first command and thus produce the same
result as the example above.
The issue is that this causes butido to continue operating as
normal: it tries to copy the artifacts from the container. That, of
course, fails, and butido stops processing.
As a result, the job information is never written to the database,
because butido exits before that can happen.
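The pre-patch failure mode can be sketched as follows. This is a simplified,
hypothetical model (the names run_job_old, copy_artifacts, and write_job_to_db
are illustrative, not butido's actual API): the early `?` on artifact
collection returns before the database write is ever reached.

```rust
// Simplified sketch of the pre-patch, single-function flow (hypothetical
// names). If the script fails, copy_artifacts() errors and the `?` returns
// early, so write_job_to_db() is never reached and the job row is lost.

fn copy_artifacts(exit_code: i32) -> Result<Vec<String>, String> {
    if exit_code != 0 {
        Err(format!("no artifacts: script exited with {}", exit_code))
    } else {
        Ok(vec!["pkg.tar".to_string()])
    }
}

fn write_job_to_db(db: &mut Vec<String>) {
    db.push("job row".to_string());
}

fn run_job_old(script_exit_code: i32, db: &mut Vec<String>) -> Result<(), String> {
    // ... container preparation, start, and script execution elided ...
    let _artifacts = copy_artifacts(script_exit_code)?; // fails on nonzero exit
    write_job_to_db(db);                                // only reached on success
    Ok(())
}
```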
With this patch, the whole chain of container-orchestration commands
gets split up into multiple, chainable functions.
There is a function to prepare the container, which returns information
that can then be used to execute the container, which returns
information that can be used to finalize the container.
You get the idea.
Because the process is multi-staged now, information can be retrieved and
processed _in between these steps_.
This allows us to write information to the database as soon as it is
available, which is what we do now.
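The staged chain described above can be sketched like this. The type and
method names (PreparedContainer, start, execute_script, finalize, etc.) mirror
the ones mentioned in this message but are a simplified, hypothetical model of
the new API, not butido's exact implementation; the point is that the container
hash and script are in hand before finalization, so the DB write can happen
even when the script failed.

```rust
// Hypothetical staged container API: each stage consumes the previous one
// and returns the data needed for the next step, so callers can persist
// state (e.g. a DB record) between stages.

struct PreparedContainer { script: String }
struct StartedContainer  { script: String, hash: String }
struct ExecutedContainer { script: String, hash: String, exit_code: i32 }

#[derive(Debug)]
struct FinalizedContainer { artifacts: Vec<String>, exit_code: i32 }

fn prepare_container(script: &str) -> PreparedContainer {
    PreparedContainer { script: script.to_string() }
}

impl PreparedContainer {
    fn start(self) -> StartedContainer {
        StartedContainer { script: self.script, hash: "deadbeef".into() }
    }
}

impl StartedContainer {
    fn execute_script(self) -> ExecutedContainer {
        // A script like "exit 1" yields a nonzero exit code, but the hash
        // and script are still available for the database record.
        let code = if self.script.trim() == "exit 1" { 1 } else { 0 };
        ExecutedContainer { script: self.script, hash: self.hash, exit_code: code }
    }
}

impl ExecutedContainer {
    fn container_hash(&self) -> &str { &self.hash }
    fn script(&self) -> &str { &self.script }
    fn finalize(self) -> FinalizedContainer {
        let artifacts = if self.exit_code == 0 {
            vec!["pkg.tar".to_string()]
        } else {
            vec![]
        };
        FinalizedContainer { artifacts, exit_code: self.exit_code }
    }
}
```

A caller can now record the job between execute_script() and finalize(), so a
failing script no longer prevents the database write:

```rust
fn demo() -> FinalizedContainer {
    let executed = prepare_container("exit 1").start().execute_script();
    // Write the job row here, before artifact collection can fail:
    println!("DB record: hash={} script={:?}", executed.container_hash(), executed.script());
    executed.finalize()
}
```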
Of course, this is only the refactoring patch, and butido runs with it.
More fixes and minor tweaks in the whole processing chain might be
required to make the process even smoother, but this can be the
stepping stone for such improvements.
Tested-by: Matthias Beyer <mail@beyermatthias.de>
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
Diffstat (limited to 'src/endpoint/scheduler.rs')
-rw-r--r--  src/endpoint/scheduler.rs  24
1 file changed, 15 insertions, 9 deletions
diff --git a/src/endpoint/scheduler.rs b/src/endpoint/scheduler.rs
index 74155e7..25165fa 100644
--- a/src/endpoint/scheduler.rs
+++ b/src/endpoint/scheduler.rs
@@ -138,8 +138,11 @@ impl JobHandle {
         let envs = self.create_env_in_db()?;
         let job_id = self.job.uuid().clone();
         trace!("Running on Job {} on Endpoint {}", job_id, ep.name());
-        let res = ep
-            .run_job(self.job, log_sender, self.staging_store);
+        let running_container = ep.prepare_container(self.job, self.staging_store.clone())
+            .await?
+            .start()
+            .await?
+            .execute_script(log_sender);

         let logres = LogReceiver {
             package_name: &package.name,
@@ -150,19 +153,22 @@ impl JobHandle {
             bar: &self.bar,
         }.join();

-        let (res, logres) = tokio::join!(res, logres);
-
-        trace!("Found result for job {}: {:?}", job_id, res);
+        let (run_container, logres) = tokio::join!(running_container, logres);

         let log = logres.with_context(|| anyhow!("Collecting logs for job on '{}'", ep.name()))?;
-        let (paths, container_hash, script) = res.with_context(|| anyhow!("Error during running job on '{}'", ep.name()))?;
-
-        let job = dbmodels::Job::create(&self.db, &job_id, &self.submit, &endpoint, &package, &image, &container_hash, &script, &log)?;
+        let run_container = run_container.with_context(|| anyhow!("Running container {} failed"))?;
+        let job = dbmodels::Job::create(&self.db, &job_id, &self.submit, &endpoint, &package, &image, &run_container.container_hash(), run_container.script(), &log)?;
         trace!("DB: Job entry for job {} created: {}", job.uuid, job.id);
         for env in envs {
             let _ = dbmodels::JobEnv::create(&self.db, &job, &env)?;
         }
-        let paths = paths?;
+
+        let res : crate::endpoint::FinalizedContainer = run_container
+            .finalize(self.staging_store.clone())
+            .await?;
+        trace!("Found result for job {}: {:?}", job_id, res);
+        let (paths, res) = res.unpack();
+        let _ = res.with_context(|| anyhow!("Error during running job on '{}'", ep.name()))?;

         // Have to do it the ugly way here because of borrowing semantics
         let mut r = vec![];