summaryrefslogtreecommitdiffstats
path: root/src/endpoint/scheduler.rs
AgeCommit message (Collapse)Author
2021-11-19Update Copyright string to 2020-2022Matthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-11-11Restructure error output for more screen real estateMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-11-11Add line which can be used to print the job log with butidoMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-11-02Fix missing fmt argument in error messageDavid Tolnay
Without this, the error message would literally be "Running container {} failed" with curly braces in it instead of the container id. Signed-off-by: David Tolnay <dtolnay@gmail.com>
2021-09-16Schedule to host where fewer containers are runningMatthias Beyer
This is not perfect, of course, as we rather should check the load on the host rather than some artificial number of running containers (more failed jobs result in more running containers, which results in more scheduled to the host), but still, this is an improvement over the previous state because utilization did not express enough. So the issue was that the utilization only represents the utilization of an endpoint _from the current process_. If butido is called three times, the same host was selected for a job because the utilization in each butido process resulted in the same values for this endpoint. In practice, this resulted in a build for "gcc" to be scheduled to the same host when butido was executed three times for "gcc" (e.g. for different platforms). The endresult was a load of ~70 on a host where 24 was "100% load". This is not ideal. Checking the load of a host is not possible via the docker API, so this is what I came up with. Now, builds are scheduled to endpoints which have fewer running containers. This might not result in a perfect distribution (or even a "good" one), but still it might result in a better distribution of workload than in before this patch. Some completely non-artificial testing resulted in the expected behaviour. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-08-17Fix: Construct logfile name properlyMatthias Beyer
Previously we constructed <log dir>/<job id>/.log but the job id shouldn't be a directory name but the filename of the logfile. This patch fixes this bug. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-08-17Add error context when opening logfileMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-05-31Fix: Insert shouldn't be done with LogItem::display()Matthias Beyer
This fixes a nasty bug; because we used the LogItem::display() function to build the log string that was put into the database, the database contained _colorized_ log lines. Because of this, our parsing of the log text later (when reading historical data from the database) failed to parse whether the job was successfull or not. The issue was actually introduced in b3a6458ce34e3065192208826b2fc85edd4761f9, where we changed the `LogItem::display()` implementation. With this fix in place, the log text in the database is corrected again (only for new logs that get put into the database). Fixes: b3a6458ce34e3065192208826b2fc85edd4761f9 ("Add color support in LogItem::display() impl") Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-05-20Update dependency: indicatif 0.15 -> 0.16.1Matthias Beyer
0.16.1 has been published which fixes the bug introduced by 0.16.0, hence update this dependency. This reverts commit ddbd9629b3188c9f08d023829683c40ab9e1448b. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-05-18Preallocate a reasonable large bufferMatthias Beyer
This patch changes the LogReceiver to preallocate a reasonable large buffer for holding logs in memory. Because of the fact that we have to write the log _in one piece_ to the database (there might be a way to append to the log field, but that's optimization of a problem we shouldn't have - we rather should use another approach for storing logs), we have to keep the log in memory until the job is finished. For keeping the log of a job in memory, we use a `Vec`. After consulting the documentation https://doc.rust-lang.org/stable/std/collections/index.html#when-should-you-use-which-collection https://doc.rust-lang.org/stable/std/collections/index.html#sequences we found that `Vec` should be appropriate here, although `VecDeque` might be an option as well because of O(1) insertion time (and we're only inserting). Still, because `Vec` has amortized constant time for adding elements at the end, this should be sufficient. Preallocating a reasonable large amount of elements could yield big benefits with only minor disadvantage. If the job fails early (after only a few lines of log output), there's memory wasted. Also, if we have a large number of jobs, we allocate a huge amount of memory before filling it up. Because we need to have that memory anyways (considering all jobs succeed), that is not really a disadvantage. Now, what is "reasonable large"? This value might be changed later on, but for now, I took this value from experience we had when using butido in practice: select AVG(length.l), MAX(length.l), MIN(length.l) FROM ( SELECT LENGTH(log_text) - LENGTH(REPLACE(log_text, chr(10), '')) AS l FROM jobs ) AS length +-----------------------+--------+-------+ | avg | max | min | |-----------------------+--------+-------| | 2863.0497427101200686 | 165213 | 11 | +-----------------------+--------+-------+ The max and min values seem to be extreme. Now, the values contain a lot of failed jobs and the maximum value (165k log lines is _huge_!) was also a bogus setup. Removing these from the equation, using only the successful jobs, gives us a not-so-different number: SELECT AVG(l.len), MAX(l.len), MIN(l.len) FROM ( SELECT LENGTH(log_text) - LENGTH(REPLACE(log_text, CHR(10), '')) AS len, STRPOS(log_text, 'BUTIDO:STATE:OK') AS okpos FROM JOBS ) AS l WHERE l.okpos != 0 AND l.len != 165213 +-----------------------+-------+-------+ | avg | max | min | |-----------------------+-------+-------| | 2661.7306791569086651 | 55422 | 66 | +-----------------------+-------+-------+ Using the average (arithmetic mean) as a baseline, we decided to go for 4096 (2^12) preallocated elements in the buffer, which should be reasonable. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-05-17Revert "Update dependency: indicatif 0.15 -> 0.16"Matthias Beyer
This reverts the dependency update because indicatif 0.16 introduced a reachable `unreachable!()` statement in their sourcecode that we actually hit if we `--hide-bars`. This reverts commit 6ceccb679d9c2d19389c6c6eef792d8db9086f31. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-05-11Add endpoint name and container id in progress messageMatthias Beyer
This patch adds the endpoint name and a (7 character) appreviated container id in the progress message, so that the user can see where the package is built. This reduces the compile+debug loop by one step in the case where the user does not want to inspect the logs anyways, because they can jump to the right container right away and do not have to `butido db job -sL <job id>` to find out where the job was running. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-05-11Update dependency: indicatif 0.15 -> 0.16Matthias Beyer
Because the interfaces of indicatif have changed, this PR changes a lot of calls in the codebase. (Yay, moving all the things!) Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-04-08Fix: Update the progress bar each secondMatthias Beyer
This patch tries to work around a longer-than-1-sec blocking future while waiting for the channel .recv() call in the LogReceiver. The issue at hand is: If the channel does not produce log output for longer than 1 second, the progress bar won't be `tick()`ed for that amount of time. That leads to progress bars that seem to block (no update of the time in the progress bar output), which might confuse users. This patch works around that by wrapping the recv() in a timeout future and then catch the timeout, ping the progress bar and try to `recv()` again. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-03-05Move endpoint connection setup to utility moduleMatthias Beyer
Because we can re-use this function in our commandline endpoint maintenance commands implementation. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-03-04Remove relative speed setting, select endpoint by utilization insteadMatthias Beyer
This patch removes the "speed" setting from the configuration, which was introduced to set a relative speed for each endpoint, with the idea that the scheduler then would select a faster node preferably. Instead, the utilization of an endpoint is now calculated (number of running jobs vs allowed maximum jobs on the endpoint), and the endpoint with lower utilization is selected. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-03-02Implement scheduling with max jobs per endpointMatthias Beyer
This patch implements support for max jobs per endpoint. The number of running jobs on one endpoint are tracked with a wrapper around the Endpoint object, which increases the job counter on allocation and decreases it on deallocation. This way, the scheduler can know how many jobs are running on one endpoint and select the next endpoint accordingly. The loading/comparing is not perfect, so it might happen that more jobs run on one endpoint than configured, but this is the first step into the right direction. Also, the selection happens on a tokio job which runs in a loop{}. Because this almost blocks the whole executor thread, we use `tokio::task::yield_now()` as soon as there is no free endpoint anymore, to yield the execution to another future to free resources for doing actual work, not scheduling. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-03-01Remove RwLock around EndpointMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-25Multiple release storesMatthias Beyer
This patch adds the ability to have more than one release store. With this patch, a user can (has to) configure release store names in the configuration file, and can then specify one of the configured names to release the artifacts to. This way, different release "channels" can be served, for example a stable channel and a rolling release channel (although "channel" is not in our wording). The code was adapted to be able to fetch releases from multiple release directories, in the crate::db::find_artifact implementation, so that re-using artifacts works across all release directories. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-22Remove MergedStoresMatthias Beyer
The concept of the MergedStores type was okay in the beginning, but it got more and more complex to use properly and most of the time, we used the release/staging stores directly anyways. So this removes the MergedStores type, which is a preparation for the change to have multiple release stores. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-17Pass MergedStores to endpoint/jobs rather than only staging storeMatthias Beyer
This patch changes the code so that the MergedStores object is known in the endpoints and job execution code. This is necessary, because the jobs must be able to load artifacts from the release store as well, for example if a library was released in another submit and we can reuse it because the script for that library didn't change. For that, the interface of StoreRoot::join() was changed - -> Result<FullArtifactPath<'a>> + -> Result<Option<FullArtifactPath<'a>>> whereas it returns an Ok(None) if the artifact requested does not exist. This was necessary to try-joining a path on the store root of the staging store, and if there is no such file continue with try-joining the path on the release store. The calling code got a bit more complex with that change, though. Next, the MergedStores got a `derive(Clone)` because clone()ing it is cheap (two `Arc`s) and necessary to pass it easily to the jobs. Each instance of the code where the staging store was consulted was changed to consult the the release store as well. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-15Add more error contextMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-15Fix: Return script errors properlyMatthias Beyer
This patch fixes error returning from the JobHandle::run() implementation. Somehow, in the many rewrites, the error returning resulted in code that did not differentiate between script run errors and scheduling errors. This patch fixes this by making the JobHandle::run() method return Result<Result<Vec<ArtifactPath>>> whereas the outer error is the scheduling result and the inner error is the script result. The calling code was adapted accordingly. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-08Colorize error messageMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-08Remove `Artifact` typeMatthias Beyer
This patch follows-up on the shrinking of the `Artifact` type and removes it entirely. The type is not needed. Only the `ArtifactPath` type is needed, which is a thin wrapper around `PathBuf`, ensuring that the path is relative to the store root. The `Artifact` type used `pom` to parse the name and version of the package from the `ArtifactPath` object it contained, which resulted in the restriction that the path must always be <name>-<version>... Which should not be a requirement and actually caused issues with a package named "foo-bar" (as an example). Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-06Fix: Do not set message when setting progressMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-02-02Fix: Make sure that bar is moved to LogReceiver, drop it afterwardsMatthias Beyer
This patch makes the bar moved to the LogReceiver instead of borrowing it to it. Because ProgressBar::clone() is cheap (according to indicatif documentation it is just an Arc<> holding the actual object), we can do this here without fearing the overhead. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-30Update tokio: 0.2 -> 1.0, shipliftMatthias Beyer
Because tokio 1.0 does not ship with the Stream trait, this patch also introduces tokio_stream as new dependency. For more information, look here: https://docs.rs/tokio/1.0.3/tokio/stream/index.html Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-30itertools: 0.9 -> 0.10Matthias Beyer
We don't need resiter::Map here anymore because itertools 0.10 provides a map_ok() extension. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-25Merge branch 'more-parallelism' into masterMatthias Beyer
2021-01-25Make progress bar message format uniformMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-25Let the JobHandle::run() return a Vec<Artifact>Matthias Beyer
Before that change, it returned the dbmodels::Artifact objects, for which we needed to fetch the filestore::Artifact again. This change removes that restriction (improving runtime, of course). Signed-off-by: Matthias Beyer <matthias.beyer@atos.net> Tested-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-25Remove unnecessary qualificationMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-19Fix clippy: calling `as_bytes()` on a string literalMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-18Post-format clippy cleanupMatthias Beyer
This patch cleans some rustfmt code that was formatted, but where clippy had something to complain: a scope was unnecessary here, so remove it. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-18Run `cargo fmt`Matthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-15Fix clippy: using `clone` on a `Copy` typeMatthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-15Fix clippy: useless conversion to the same type: `anyhow::Error`Matthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-15Fix style: Remove trailing whitespaceMatthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-15Fix clippy: calls to `std::mem::drop` with a reference instead of an owned ↵Matthias Beyer
value. Dropping a reference does nothing Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-14Update implementation for dedicated release tableMatthias Beyer
This patch updates the implementation for the dedicated release table. First of all, it updates the model types and adds the new `crate::db::models::Release` type for representing releases in the database (which are just pointers to the released artifact). The `db artifacts` command is updated by using a LEFT JOIN with the "releases" table. The `release` command is updated as well, though the change was a bit more complex. First of all, a LEFT OUTER JOIN was added which is essentially the same as the filter for un-released artifacts from before the change. Then, a `.select()` call was added, so we don't load all the objects we don't need anyways. This reduces overhead not only with the database querying but also in the code here, as we do not allocate objects from the database we destruct right away. The updating of the artifact object in the database was changed to generate a new release object in the "releases" table, of course. Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-13Add LICENSE file and license headersMatthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2021-01-04Make use of JobHandle::create_job_run_error() to add verbose error messageMatthias Beyer
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2021-01-04Add JobHandle::create_job_run_error() helper function for constructing nice ↵Matthias Beyer
error message Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
2020-12-15Refactor job running on endpointMatthias Beyer
This commit refactors the job running in the endpoint implementation to be a multi-stage (multi-function-call) process. Before this patch, preparing, starting and running the container as well as shutdown of the container was one huge function. This was sub-optimal, because we had to accummulate information during the run and be extra-careful when returning from that function. Ultimatively, we did not succeed in doing a good job here, which resulted in data-loss if a script exited to early because the database-code was just not hit yet. Consider the following packaging script: exit 1 this does nothing. This does, in fact, not even notify butido that it errored. This is not a scientific scenario, because a valid script might exit in the first command (unintentionally) and thus result in the same as the above example. The issue here is, that this causes butido to continue operation as normal -> It tries to copy the artifacts from the container. This, of course, then fails and butido stops processing. Because of that, the job information is never written to the database, because butido exits before that can happen. With this patch, the whole chain of container-orchestration commands gets split up into multiple, chainable functions. There is a function to prepare the container, which returns information that can then be used to execute the container, which returns information that can be used to finalize the container. You get the idea. Because this is multi-staged now, information can be retrieved and processed _in between these steps_. This results in us being able to write information to the database as soon as possible, which is what we do now. Of course this is only the refactoring patch and butido runs with this. More fixes and minor tweaks in the whole processing chain might be required to make the whole process even smoother. But this can be the stepping stone for such improvements. Tested-by: Matthias Beyer <mail@beyermatthias.de> Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2020-12-14Add more trace logging for database objects createdMatthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2020-12-10Fix: Write job to database, even if it failedMatthias Beyer
This patch fixes an error where a failing (as in {{stage "ERR" "something"}}) job was not written to the database. The implementation and signatore of Endpoint::run_job() had to be altered for that, see code comment. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2020-12-10Remove feature: State message in "OK" case.Matthias Beyer
The "OK" case had a state message parameter. This was removed because "OK" does mean that everything went well. No need to have a message here. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2020-12-10Rewrite container error reportingMatthias Beyer
This patch rewrites the container error reporting to be more simple. The ContainerError type was rewritten to not wrap other errors anymore, but be the root error itself. It only has one variant now, because there's only one kind of error: "It didn't work". The reporting in the calling functions can now use anyhow::Result<_> instead of std::result::Result because of that. Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
2020-12-09Add package name and version in progress bar messageMatthias Beyer
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>