Age | Commit message | Author |
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Without this, the error message would literally be "Running container {}
failed" with curly braces in it instead of the container id.
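As a minimal illustration (not butido's actual error-construction code; the function names here are made up), the difference is whether the container id is actually interpolated into the message template:

```rust
// Illustrative only: using the template string verbatim leaves the literal
// "{}" placeholder in the message.
fn error_message_buggy(_container_id: &str) -> String {
    "Running container {} failed".to_string()
}

// Fixed: format the message so the container id is substituted.
fn error_message_fixed(container_id: &str) -> String {
    format!("Running container {} failed", container_id)
}
```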
Signed-off-by: David Tolnay <dtolnay@gmail.com>
|
|
This is not perfect, of course: we should rather check the actual load on the
host than some artificial number of running containers (more failed jobs result
in more running containers, which results in more jobs being scheduled to the
host). Still, this is an improvement over the previous state, because the
utilization value alone did not express enough.
So the issue was that the utilization only represents the utilization of an
endpoint _from the current process_.
If butido is called three times, the same host was selected for a job because
the utilization in each butido process resulted in the same values for this
endpoint.
In practice, this resulted in a build for "gcc" to be scheduled to the same
host when butido was executed three times for "gcc" (e.g. for different
platforms).
The end result was a load of ~70 on a host where 24 was "100% load".
This is not ideal. Checking the load of a host is not possible via the docker
API, so this is what I came up with. Now, builds are scheduled to endpoints
which have fewer running containers.
This might not result in a perfect distribution (or even a "good" one), but it
should still distribute the workload better than before this patch.
Some completely non-artificial testing resulted in the expected behaviour.
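The selection strategy can be sketched like this (the type and function names are illustrative, not butido's actual code): among the candidate endpoints, pick the one with the fewest running containers.

```rust
// Hypothetical, simplified model of an endpoint for this sketch.
struct Endpoint {
    name: &'static str,
    running_containers: usize,
}

/// Pick the endpoint with the fewest running containers.
fn select_endpoint(endpoints: &[Endpoint]) -> Option<&Endpoint> {
    endpoints.iter().min_by_key(|e| e.running_containers)
}
```

Because the container count is queried from docker itself, it reflects work scheduled by *all* butido processes, not just the current one.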
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Previously we constructed
<log dir>/<job id>/.log
but the job id shouldn't be a directory name but the filename of the logfile.
This patch fixes this bug.
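The difference can be sketched with plain path operations (function names are illustrative):

```rust
use std::path::{Path, PathBuf};

// Buggy: the job id becomes a directory, producing "<log dir>/<job id>/.log".
fn log_file_path_buggy(log_dir: &Path, job_id: &str) -> PathBuf {
    log_dir.join(job_id).join(".log")
}

// Fixed: the job id is the file name, producing "<log dir>/<job id>.log".
fn log_file_path_fixed(log_dir: &Path, job_id: &str) -> PathBuf {
    log_dir.join(format!("{}.log", job_id))
}
```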
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This fixes a nasty bug: because we used the LogItem::display() function to build
the log string that was put into the database, the database contained
_colorized_ log lines. Because of this, our parsing of the log text later (when
reading historical data from the database) failed to determine whether the job
was successful or not.
The issue was actually introduced in b3a6458ce34e3065192208826b2fc85edd4761f9,
where we changed the `LogItem::display()` implementation.
With this fix in place, the log text in the database is corrected again (only
for new logs that get put into the database).
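The core of the problem can be sketched like this (a simplified stand-in for the real LogItem type): the colorized output contains ANSI escape sequences that a later parser does not expect, so the plain text is what must reach the database.

```rust
// Simplified stand-in for butido's LogItem; method names are illustrative.
struct LogItem {
    text: String,
}

impl LogItem {
    // Colorized rendering for the terminal (ANSI escape sequences).
    fn display(&self) -> String {
        format!("\x1b[32m{}\x1b[0m", self.text)
    }

    // Plain text, safe to store in the database and parse later.
    fn raw(&self) -> &str {
        &self.text
    }
}
```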
Fixes: b3a6458ce34e3065192208826b2fc85edd4761f9 ("Add color support in LogItem::display() impl")
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
0.16.1 has been published which fixes the bug introduced by 0.16.0, hence update
this dependency.
This reverts commit ddbd9629b3188c9f08d023829683c40ab9e1448b.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch changes the LogReceiver to preallocate a reasonably large buffer for
holding logs in memory.
Because we have to write the log _in one piece_ to the database (there might be
a way to append to the log field, but that would be optimizing a problem we
shouldn't have; we should rather use another approach for storing logs), we
have to keep the log in memory until the job is finished.
For keeping the log of a job in memory, we use a `Vec`. After consulting the
documentation
https://doc.rust-lang.org/stable/std/collections/index.html#when-should-you-use-which-collection
https://doc.rust-lang.org/stable/std/collections/index.html#sequences
we found that `Vec` should be appropriate here, although `VecDeque` might be an
option as well because of O(1) insertion time (and we're only inserting).
Still, because `Vec` has amortized constant time for adding elements at the end,
this should be sufficient.
Preallocating a reasonably large number of elements could yield big benefits
with only a minor disadvantage: if a job fails early (after only a few lines of
log output), some memory is wasted. Also, if we have a large number of jobs, we
allocate a huge amount of memory before filling it up.
But because we need that memory anyway (assuming all jobs succeed), that is not
really a disadvantage.
Now, what is "reasonably large"? This value might be changed later, but for
now I took it from the experience we had when using butido in practice:
select
AVG(length.l), MAX(length.l), MIN(length.l)
FROM
(
SELECT
LENGTH(log_text) - LENGTH(REPLACE(log_text, chr(10), '')) AS l
FROM
jobs
) AS length
+-----------------------+--------+-------+
| avg | max | min |
|-----------------------+--------+-------|
| 2863.0497427101200686 | 165213 | 11 |
+-----------------------+--------+-------+
The max and min values seem extreme. The data includes a lot of failed jobs,
and the maximum value (165k log lines is _huge_!) came from a bogus setup.
Removing these from the equation, using only the successful jobs, gives us a
not-so-different number:
SELECT
AVG(l.len), MAX(l.len), MIN(l.len)
FROM
(
SELECT
LENGTH(log_text) - LENGTH(REPLACE(log_text, CHR(10), '')) AS len,
STRPOS(log_text, 'BUTIDO:STATE:OK') AS okpos
FROM JOBS
) AS l
WHERE
l.okpos != 0
AND
l.len != 165213
+-----------------------+-------+-------+
| avg | max | min |
|-----------------------+-------+-------|
| 2661.7306791569086651 | 55422 | 66 |
+-----------------------+-------+-------+
Using the average (arithmetic mean) as a baseline, we decided to go for 4096
(2^12) preallocated elements in the buffer, which should be reasonable.
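The preallocation itself is simple (constant and function names here are illustrative):

```rust
// 2^12 = 4096 lines, chosen from the measured average of ~2862 log lines
// per successful job.
const LOG_BUFFER_CAPACITY: usize = 4096;

// Preallocate the per-job log buffer so that the common case never reallocates.
fn new_log_buffer() -> Vec<String> {
    Vec::with_capacity(LOG_BUFFER_CAPACITY)
}
```

Jobs with more lines than the capacity still work; `Vec` grows with amortized constant-time pushes, so only the outliers pay for reallocation.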
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This reverts the dependency update because indicatif 0.16 introduced a
reachable `unreachable!()` statement in their source code that we
actually hit when running with `--hide-bars`.
This reverts commit 6ceccb679d9c2d19389c6c6eef792d8db9086f31.
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
This patch adds the endpoint name and an abbreviated (7-character) container id
to the progress message, so that the user can see where the package is built.
This reduces the compile+debug loop by one step in the case where the user does
not want to inspect the logs anyway, because they can jump to the right
container right away and do not have to run `butido db job -sL <job id>` to find
out where the job was running.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Because the interfaces of indicatif have changed, this PR changes a lot of
calls in the codebase. (Yay, moving all the things!)
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch tries to work around a longer-than-1-sec blocking future while
waiting for the channel .recv() call in the LogReceiver.
The issue at hand is: If the channel does not produce log output for longer than
1 second, the progress bar won't be `tick()`ed for that amount of time.
That leads to progress bars that seem to block (no update of the time in the
progress bar output), which might confuse users.
This patch works around that by wrapping the recv() in a timeout future,
catching the timeout, ticking the progress bar, and trying to `recv()` again.
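The control flow can be modeled with std channels (butido uses tokio channels and a timeout future; `recv_timeout` plays the role of the timeout wrapper here, and the function name is made up):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Receive log lines; whenever no line arrives within the timeout, call
// `tick` (standing in for ProgressBar::tick) so the bar keeps updating.
fn receive_with_ticks(rx: Receiver<String>, mut tick: impl FnMut()) -> Vec<String> {
    let mut lines = Vec::new();
    loop {
        match rx.recv_timeout(Duration::from_millis(100)) {
            Ok(line) => lines.push(line),
            Err(RecvTimeoutError::Timeout) => tick(), // keep the bar alive
            Err(RecvTimeoutError::Disconnected) => break, // sender is done
        }
    }
    lines
}
```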
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Because we can re-use this function in our commandline endpoint maintenance
commands implementation.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch removes the "speed" setting from the configuration, which was
introduced to set a relative speed for each endpoint, with the idea that the
scheduler then would select a faster node preferably.
Instead, the utilization of an endpoint is now calculated (number of running
jobs vs allowed maximum jobs on the endpoint), and the endpoint with lower
utilization is selected.
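The utilization measure is just the ratio of running to allowed jobs (a sketch; the real code lives in butido's scheduler):

```rust
// Utilization of an endpoint: running jobs relative to the configured maximum.
// The scheduler prefers the endpoint with the lowest value.
fn utilization(running_jobs: usize, max_jobs: usize) -> f64 {
    running_jobs as f64 / max_jobs as f64
}
```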
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch implements support for max jobs per endpoint.
The number of running jobs on an endpoint is tracked with a wrapper around the
Endpoint object, which increases the job counter on allocation and decreases it
on deallocation.
This way, the scheduler can know how many jobs are running on one endpoint and
select the next endpoint accordingly.
The loading/comparing is not perfect, so it might happen that more jobs run on
one endpoint than configured, but this is the first step in the right
direction.
Also, the selection happens on a tokio task which runs in a loop{}. Because this
almost blocks the whole executor thread, we use `tokio::task::yield_now()` as
soon as there is no free endpoint anymore, to yield execution to another future
and free resources for doing actual work instead of scheduling.
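The counter-wrapper idea can be sketched with an RAII guard (all names here are hypothetical, not butido's actual types): the guard increments the counter on allocation and decrements it on drop.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct Endpoint {
    max_jobs: usize,
    running: Arc<AtomicUsize>,
}

struct EndpointGuard {
    running: Arc<AtomicUsize>,
}

impl Endpoint {
    // Note: the check and the increment are two separate atomic operations,
    // which mirrors the "loading/comparing is not perfect" caveat above.
    fn try_allocate(&self) -> Option<EndpointGuard> {
        if self.running.load(Ordering::SeqCst) < self.max_jobs {
            self.running.fetch_add(1, Ordering::SeqCst);
            Some(EndpointGuard { running: self.running.clone() })
        } else {
            None
        }
    }
}

impl Drop for EndpointGuard {
    // Deallocation releases the slot automatically.
    fn drop(&mut self) {
        self.running.fetch_sub(1, Ordering::SeqCst);
    }
}
```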
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch adds the ability to have more than one release store.
With this patch, a user must configure release store names in the configuration
file, and can then specify one of the configured names to release the artifacts
to.
This way, different release "channels" can be served, for example a stable
channel and a rolling-release channel (although "channel" is not part of our
wording).
The code was adapted to be able to fetch releases from multiple release
directories, in the crate::db::find_artifact implementation, so that re-using
artifacts works across all release directories.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
The concept of the MergedStores type was okay in the beginning, but it got more
and more complex to use properly, and most of the time we used the
release/staging stores directly anyway.
So this removes the MergedStores type, which is a preparation for the change to
have multiple release stores.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch changes the code so that the MergedStores object is known in the
endpoints and job execution code.
This is necessary, because the jobs must be able to load artifacts from the
release store as well, for example if a library was released in another submit
and we can reuse it because the script for that library didn't change.
For that, the interface of StoreRoot::join() was changed
- -> Result<FullArtifactPath<'a>>
+ -> Result<Option<FullArtifactPath<'a>>>
where it returns Ok(None) if the requested artifact does not exist.
This was necessary to try-joining a path on the store root of the staging store,
and if there is no such file continue with try-joining the path on the release
store.
The calling code got a bit more complex with that change, though.
Next, the MergedStores got a `derive(Clone)` because clone()ing it is cheap
(two `Arc`s) and necessary to pass it easily to the jobs.
Each instance of the code where the staging store was consulted was changed to
consult the release store as well.
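The try-joining fallback can be sketched with plain paths (butido's StoreRoot and FullArtifactPath types are omitted and the function names are made up): try the staging store first, then fall back to the release store.

```rust
use std::path::{Path, PathBuf};

// Stand-in for StoreRoot::join() returning Ok(Some(..)) / Ok(None):
// Some(path) only if the artifact actually exists under this store root.
fn try_join(store_root: &Path, artifact: &Path) -> Option<PathBuf> {
    let full = store_root.join(artifact);
    if full.exists() { Some(full) } else { None }
}

// Consult the staging store first, then the release store.
fn find_in_stores(staging: &Path, release: &Path, artifact: &Path) -> Option<PathBuf> {
    try_join(staging, artifact).or_else(|| try_join(release, artifact))
}
```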
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch fixes error returning from the JobHandle::run() implementation.
Somehow, in the many rewrites, the error returning resulted in code that did not
differentiate between script run errors and scheduling errors.
This patch fixes this by making the JobHandle::run() method return
Result<Result<Vec<ArtifactPath>>>
where the outer error is the scheduling result and the inner error is the
script result.
The calling code was adapted accordingly.
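A caller can then distinguish the two failure modes cleanly (a sketch; the reporting function is made up):

```rust
use std::path::PathBuf;

type ArtifactPath = PathBuf;

// The outer Result is the scheduling outcome, the inner Result the script
// outcome, matching the JobHandle::run() return type described above.
fn report(outcome: Result<Result<Vec<ArtifactPath>, String>, String>) -> String {
    match outcome {
        Err(e) => format!("scheduling failed: {}", e),
        Ok(Err(e)) => format!("script failed: {}", e),
        Ok(Ok(artifacts)) => format!("success, {} artifact(s)", artifacts.len()),
    }
}
```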
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch follows up on the shrinking of the `Artifact` type and removes it
entirely.
The type is not needed. Only the `ArtifactPath` type is needed, which is a thin
wrapper around `PathBuf`, ensuring that the path is relative to the store root.
The `Artifact` type used `pom` to parse the name and version of the package from
the `ArtifactPath` object it contained, which resulted in the restriction that
the path must always be
<name>-<version>...
which should not be a requirement, and which actually caused issues with a
package named "foo-bar" (as an example).
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch moves the progress bar into the LogReceiver instead of borrowing it.
Because ProgressBar::clone() is cheap (according to the indicatif documentation
it is just an Arc<> around the actual object), we can do this here without
fearing the overhead.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Because tokio 1.0 does not ship with the Stream trait, this patch also
introduces tokio_stream as new dependency.
For more information, look here:
https://docs.rs/tokio/1.0.3/tokio/stream/index.html
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
We don't need resiter::Map here anymore because itertools 0.10 provides
a map_ok() extension.
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Before that change, it returned the dbmodels::Artifact objects, for which we
needed to fetch the filestore::Artifact again.
This change removes that extra step (improving runtime, of course).
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
Tested-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This patch cleans up some code that rustfmt formatted but clippy complained
about: a scope was unnecessary here, so remove it.
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
value. Dropping a reference does nothing
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
This patch updates the implementation for the dedicated release table.
First of all, it updates the model types and adds the new
`crate::db::models::Release` type for representing releases in the
database (which are just pointers to the released artifact).
The `db artifacts` command is updated by using a LEFT JOIN with the
"releases" table.
The `release` command is updated as well, though the change was a bit
more complex.
First of all, a LEFT OUTER JOIN was added which is essentially the same
as the filter for un-released artifacts from before the change.
Then, a `.select()` call was added so that we don't load objects we don't need
anyway.
This reduces overhead not only in the database querying but also in the code
here, as we no longer allocate objects from the database only to discard them
right away.
The updating of the artifact object in the database was changed to
generate a new release object in the "releases" table, of course.
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
error message
Signed-off-by: Matthias Beyer <matthias.beyer@atos.net>
|
|
This commit refactors the job running in the endpoint implementation to
be a multi-stage (multi-function-call) process.
Before this patch, preparing, starting and running the container as well
as shutdown of the container was one huge function.
This was sub-optimal, because we had to accumulate information during
the run and be extra careful when returning from that function.
Ultimately, we did not succeed in doing a good job here, which
resulted in data loss if a script exited too early, because the
database code was just not hit yet.
Consider the following packaging script:
exit 1
this does nothing. It does, in fact, not even notify butido that it
errored. This is not a purely academic scenario, because a valid script might
exit in its first command (unintentionally) and thus produce the same result
as the above example.
The issue here is that this causes butido to continue operation as
normal: it tries to copy the artifacts from the container. This, of
course, then fails, and butido stops processing.
Because of that, the job information is never written to the database,
because butido exits before that can happen.
With this patch, the whole chain of container-orchestration commands
gets split up into multiple, chainable functions.
There is a function to prepare the container, which returns information
that can then be used to execute the container, which returns
information that can be used to finalize the container.
You get the idea.
Because this is multi-staged now, information can be retrieved and
processed _in between these steps_.
This results in us being able to write information to the database as
soon as possible, which is what we do now.
Of course, this is only the refactoring patch, and butido works with it.
More fixes and minor tweaks in the whole processing chain might be
required to make the whole process even smoother.
But this can be the stepping stone for such improvements.
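The multi-stage shape can be sketched like this (all type and function names are hypothetical; butido's real staging functions carry much more state): each step returns the data the next step needs, so the caller can persist information between steps.

```rust
// Data handed from one stage to the next.
struct Prepared { container_id: String }
struct Finished { container_id: String, exit_code: i32 }

fn prepare_container(image: &str) -> Result<Prepared, String> {
    Ok(Prepared { container_id: format!("{}-c0ffee", image) })
}

fn execute_container(p: Prepared) -> Result<Finished, String> {
    Ok(Finished { container_id: p.container_id, exit_code: 0 })
}

fn run(image: &str) -> Result<i32, String> {
    let prepared = prepare_container(image)?;
    let finished = execute_container(prepared)?;
    // <- at this point the job result can be written to the database,
    //    before any artifact copying that might still fail.
    Ok(finished.exit_code)
}
```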
Tested-by: Matthias Beyer <mail@beyermatthias.de>
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
This patch fixes an error where a failing
(as in {{stage "ERR" "something"}}) job was not written to the database.
The implementation and signature of Endpoint::run_job() had to be
altered for that; see the code comment.
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
The "OK" case had a state message parameter. This was removed because
"OK" means that everything went well; there is no need for a message
here.
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
This patch rewrites the container error reporting to be simpler.
The ContainerError type was rewritten to not wrap other errors anymore,
but be the root error itself.
It only has one variant now, because there's only one kind of error: "It
didn't work".
The reporting in the calling functions can now use anyhow::Result<_>
instead of std::result::Result because of that.
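A single-variant root error of this kind can be sketched as a plain struct (field names are illustrative, not butido's actual definition):

```rust
use std::fmt;

// The error *is* the root cause; it no longer wraps another error.
#[derive(Debug)]
struct ContainerError {
    container_id: String,
}

impl fmt::Display for ContainerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Container {} failed", self.container_id)
    }
}

impl std::error::Error for ContainerError {}
```

Because it implements `std::error::Error`, it converts into `anyhow::Error` automatically, which is what lets the callers use `anyhow::Result<_>`.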
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|
|
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
|