author	Matthias Beyer <mail@beyermatthias.de>	2021-07-17 13:01:12 +0200
committer	Matthias Beyer <mail@beyermatthias.de>	2021-07-17 13:01:12 +0200
commit	c96f80873e3c368496fc40a7996438fa8e8bb0db (patch)
tree	61f911595215b690de3f8ae10ffd0f52e33461e1
parent	b59e6442db050fed0e9ea4a4d2b9a6da259d1727 (diff)
Import blog-import json because we need it to reproduce the build
Signed-off-by: Matthias Beyer <mail@beyermatthias.de>
-rw-r--r--	.gitignore	1
-rw-r--r--	blog-import.json	1
2 files changed, 1 insertions, 1 deletions
diff --git a/.gitignore b/.gitignore
index e15c718..933b071 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,2 @@
/themes/*
public
-blog-import.json
diff --git a/blog-import.json b/blog-import.json
new file mode 100644
index 0000000..b91fca1
--- /dev/null
+++ b/blog-import.json
@@ -0,0 +1 @@
+{"username":"musicmatze","has_pass":false,"email":"mail@beyermatthias.de","created":"2021-03-30T13:40:20+02:00","status":0,"collections":[{"alias":"musicmatze","title":"musicmatzes blog","description":"","style_sheet":"","public":false,"views":6951,"url":"https://beyermatthias.de/","total_posts":238,"posts":[{"id":"ple6slfspy","slug":"thoughts-on-approaching-minimalism","appearance":"norm","language":"en","rtl":false,"created":"2021-07-13T12:38:45+02:00","updated":"2021-07-13T12:38:45+02:00","title":"Thoughts on approaching Minimalism","body":"I first heard of minimalism because of some documentary on YouTube. I think it was from \"The Minimalists\" although I'm not quite sure.\n\nMinimalism online really is a rabbit hole. You can quickly end up spending a whole night discovering different minimalists online, each giving some insight on how they execute their minimalism. Some are more \"extremist\" than others, but they do have one thing in common: The definition of minimalism is vague enough so everyone could potentially define it for themselves, and still everyone can agree on each others definition.\n\nSounds vague? Let me try to explain. Minimalism in one sentence, for me is\n\n\u003e the idea that material things are not a measurement of value in ones life.\n\nA bit longer explanation would be that the concept of filling ones life with material things, also sometimes deprecatingly called \"stuff\", does not improve your life at all. Some material things do provide value, but only for a short amount of time. And people tend to replace the emptiness that expands after that value runs out with more material things, ending up in a loop of buying things to get value in life.\n\nI one manages to see through that loop and manages to actually break it, _real_ value in life can be established by owning only things that _matter_.\n\nFinding more definitions, examples, opinions and so on online is left as value to the reader. There's a lot. Don't get caught up in a night of research, you've been warned!\n\n## Starting minimalism\n\nI started not only thinking about how these concepts but actually trying to implement them in my life a few weeks, maybe a few months now. I started really slowly by opening one cupboard/wardrobe a week and going through things. Every time I found something where I reacted like \"Ah, I didn't remember I have this\", I put the thing into a dedicated \"give-away-box\". After the box was full, I put things online to give them away.\n\nI gave a handful of clothes to friends and family already and I am giving away more things slowly. Not at a very fast pace, but still noticeable.\n\n## Things improve!\n\nI really love how my life changes for the better right now, I feel the improvement every single time I get rid (selling or giving away for free, not trashing of course) of something that I do not need. And by \"improvement\" I mean not that my life, objectively, improves. But every time I give something away I feel, for a few days, lighter. It might seem strange, but that's what I feel. \n\nAnd albeit my progress is rather slow and small, it is noticeable for me. 
If you would come to my home and you'd know me, you'd not see any difference (yet?), though!\n\n## Concerns\n\nOne thing that concerns me with the topic of minimalism, is that if you try to get some information about the community and individual minimalists online, you quickly end up irritated because _all_ the minimalists out there are like \"I have one bag full of things, nothing more\" or something like that.\n\nThe impression that minimalism is only minimalism if executed extremely is a false feeling and I hope I can save myself from that. I am clearly threatened by it!\n\nI don't know whether this impression is only a result of \"American minimalism\", which is clearly the majority you'd find online. For example, I can not think of a way where a person from Europe can only have 100 things. At least my documents (bank account documents, tax documents, insurance stuff and other things you _have to have_ in Europe) would qualify as 100 things, if not more. I'm not sure how the \"extremist minimalists\" do this, or whether one even needs these things printed out in america.\n\nEither way, I hope that some day I might be able to pack all my belongings (or almost all, I really don't want to give away my 32 inch screen or my 7.1 sound system!) into a camper and just tour Europe as a digital nomad. The idea of minimalism combined with digital nomadism (is that a word?) is really interesting to me, although I'd like to have a \"home-base\" where I can come back to and stay for several weeks or even months - sometimes one just needs to cuddle into the same bed every night for several consecutive nights!\n\nWhat I really don't want, though, is dropping out. Neither from society nor from my work/industry. My approach towards minimalism is only of philosophical nature - not needing _things_ to define my living standard.","tags":[],"views":0},{"id":"i1bc9d9bna","slug":"deleting-my-github-sources","appearance":"norm","language":"en","rtl":false,"created":"2021-07-11T19:43:55+02:00","updated":"2021-07-11T19:43:55+02:00","title":"Deleting my github sources","body":"I [deleted](https://mastodon.technology/@musicmatze/106527484217736270) the repositories I own(ed) on #github.\n\nAfter it became clear that [Microsoft GitHub Copilot](https://copilot.github.com/) was [trained](https://cybre.space/@tindall/106539167944483388) using open source and GPLed work, keeping my (both public and private) github repositories is just _wrong_. So I deleted all of them, except for the forks I contribute to and maintain (for example [config-rs](https://github.com/mehcode/config-rs) and [shiplift](https://github.com/softprops/shiplift)).\n\nI [hope](https://mastodon.technology/@musicmatze/106544251935596970) that others will follow suit and delete their repositories as well. I can understand if people don't mind the vendor lock that \"discussions\", \"actions\" and other features have created for them. But this (copilot) is pure abuse of free software codebases.\n\n\u003e The \"extend\" phase is over, we're in the \"extinguish\" phase!\n\n([me, on mastodon](https://mastodon.technology/@musicmatze/106544362192261973))\n\nIt [might be legal](https://en.osm.town/@qeef/106538016853397341) for github to do this (IANAL), but nonetheless it is more than just a bad move. If their ToS allows this and we, as a community, can not act upon this because we agreed to these terms, the only sensible thing to do is to move our development away from github to some more open and less abusive services. 
I'm a big fan of [sourcehut](https://sourcehut.org), of course, but there are others, most prominently [codeberg](https://codeberg.org) and of course, self-hosting.\n\n## Self-hosting and email patches\n\nI am self-hosting all my sources on [git.beyermatthi.as](https://git.beyermatthi.as) plus I host some of these projects on [my sourcehut account](https://sr.ht/~matthiasbeyer/) for more visibility.\nI take patches via mail for all my repositories.\n\nIf you plan on moving away from github, learning how to [send patches via mail](https://git-send-email.io/) and of course also how to accept patches via mail is a [viable skill](https://drewdevault.com/2018/07/02/Email-driven-git.html) that you [will benefit from](https://nasamuffin.github.io/git/open-source/email/code-review/2019/05/22/how-i-learned-to-love-email-patches.html)! Just make sure to [use plain text email](https://useplaintext.email/) instead of [html emails](https://drewdevault.com/2016/04/11/Please-use-text-plain-for-emails.html).\n\nThere are tons and tons of tutorials out there how to work with email patches. Just go and read them, it will make you a better developer, even if you then go to one of the other code forges and don't need the skill, you will start to understand why git works the way it works!\n\nI am using git for over a decade now, over eight years in opensource and over two years professionally (so my whole professional career so far), and it is _the one tool_ I cannot exist without! It amplifies the speed I develop code with by a magnitude!\n\n## If you don't want email...\n\nIf you don't want an email-based workflow for your git repositories, which I can understand (although not approve), and you want a shiny web-interface with all the bells and whistles, you can still go to [codeberg](https://codeberg.org) (or [gitlab.com](https://gitlab.com), fwiw) or self-host one of the great tools that are already out there.\n\nFor example, you can self-host [gitlab](https://about.gitlab.com/) or [gitea](https://gitea.io/en-us/) as a bit more lightweight alternative. Both of them feature issues/discussions and a nice web-UI.\n\nIf you don't want to collaborate but just put your code out there, you can use [cgit](https://git.zx2c4.com/cgit/) (which is really not hard to host) plus, optionally, some [gitolite](https://gitolite.com/gitolite/index.html) if you want to host repositories for others as well.\n\nWith nginx as reverse-proxy and some mild rate-limiting because web-crawlers are still a thing, you can even host this on a $1-VPS instance somewhere (I'm not recommending any service here because this would be advertisement). I'd even say that a Raspberry Pi can handle hundreds of repositories with cgit and nginx as reverse proxy. I did not test this, though I'm fairly sure because git is very well optimized and cgit is written in C (hence the name), so there's only a very minimal footprint!\n","tags":["github"],"views":46},{"id":"p0l8qqyvqa","slug":"thoughts-10","appearance":"norm","language":"en","rtl":false,"created":"2021-07-09T19:59:53+02:00","updated":"2021-07-09T20:01:08+02:00","title":"Thoughts #10","body":"---\n“Thoughts” is (will be) a weekly roll-up of my mastodon feed with some notable thoughts collected into a long-form blog post. 
“Long form” is relative here, as I will only expand a little on some selected subjects, not write screens and screens of text on each of these subjects.\n\nIf you think I toot too much to follow, this is an alternative to follow some of my thoughts.\n\n---\n\nThis week (2021-07-03– 2021-07-09) I released a long-prepared blog article, deleted github repositories and thought a lot about minimalism.\n\n# butido\n\n[This week](https://mastodon.technology/@musicmatze/106523417189447101), I released [the article on butido](https://beyermatthias.de/butido-a-linux-package-building-tool-in-rust), the packaging tool written in Rust that I wrote for/at my company and was allowed to release as open source.\n\nI got very positive feedback on the article - thank you all for reading and your comments!\n\n# Deleting GitHub repos\nI also [deleted](https://mastodon.technology/@musicmatze/106527484217736270) all my repositories (source) on GitHub. After GitHub was acquired by Microsoft, things [went downhill](https://mastodon.technology/@musicmatze/106544251935596970). Now is [the time to move away](https://mastodon.technology/@musicmatze/106544362192261973).\n\nI already wrote an blog port exclusively on that topic which I will release next week.\n\n# Minimalism\nLately I am getting more and more into the [topic of minimalism (german)](https://mastodon.technology/@musicmatze/106541435193619801). \n\nWhile writing this section of the weekly roll-up, I decided that my thoughts here are worth a dedicated article, that's why I won't lose too much words here. Just so much: I really like the philosophical idea of minimalism, although I have some issues in the practical execution of the/se ideas.","tags":[],"views":12},{"id":"drcdsk5uq4","slug":"butido-a-linux-package-building-tool-in-rust","appearance":"norm","language":"en","rtl":false,"created":"2021-07-04T18:57:48+02:00","updated":"2021-07-05T13:42:16+02:00","title":"butido - a Linux Package Building Tool in Rust","body":"---\n\nPlease keep in mind that I cannot go into too much detail about my companies customer specific requirements, thus, some details here are rather vague.\n\nAlso: All statements and thoughts in this article are my own and do not necessarily represent those of my employer.\n\n---\n\nIn my dayjob, I package [FLOSS](https://en.wikipedia.org/wiki/Free_and_open-source_software#FLOSS) software with special adaptions, patches and customizations for enterprises on a range of linux distributions. As an example, we package Bash in up-to-date versions for [SLES 11](https://en.wikipedia.org/wiki/SUSE_Linux_Enterprise_Server) installations of our customers.\n\nMost software isn't as easy to build and package as, for example, bash. Also, our customers have special requirements, ranging from type of package format (rpm, deb, targz, or even custom formats...) to installation paths, compile time flags, special library versions used in the dependencies of a software and so on...\n\n\nOur main Intellectual Property is not the packaging mechanisms, as there are many tools already available for packaging. Our IP is the knowledge of _how_ to make libraries work together in custom environments, special sites, with non-standard requirements and how to package properly for the needs of our customers.\n\nNaturally, we have tooling for the packaging. That tooling grew over years and years of changing requirements, even target platforms (almost nobody uses IRIX/HPUX/AIX/Solaris anymore, right? 
at least our customers moved from those systems to Linux completely) and also technology available for packaging.\n\nEnd of 2020, I got the opportunity to re-think the problem at hand and develop a prototype for solving the problem (again) with some more state-of-the-art approaches. This article documents the journey until day.\n\n---\n \nThe target audience of this article is technical users with a background in linux and especially in linux packaging, either with Debian/Ubuntu (.deb, APT) or RedHat/CentOS/SuSE (.rpm, YUM/ZYPPER) packages.\nBasic understanding of the concept of software packages and docker might be good, but are not necessarily needed.\n\n---\n\n## Finding the requirements\n\nWhen I started thinking about how to re-solve the problem, I already worked for about 1.5 years in my team. I compiled and packaged quite a bunch of software for customers and, in the process of rethinking the approach, also improved some parts of the tooling infrastructure.\nSo it was not that hard to find requirements. Still, patience and thoroughness was key. If I'd miss a critical point, that new software I wanted to develop might result in a mess.\n\n### Docker\n\nThe first big requirement was rather obvious to find. The existing tooling used docker containers as build hosts. For example, if we needed to package a software for debian 10 Buster, we would spin up a docker container for that distribution, mount everything we needed to that container, copy sources and our custom build configuration file to the right location and then start our tool to do its job.\nAfter some automatic dependency resolving, some magic and a fair bit of compiling, a package was born. That package was then tested by us and, if found appropriate, shipped to the customer for installation.\n\nIn about 90% of the cases, we had to investigate some build failures if the package was new (we did not package that software before). In about 75% of the cases the build just succeeded if the package was a mere update of the software, i.e. git 2.25.0 was already packaged and now we needed to package 2.26.0 - the customizations from the old version were just copied over to the new one and it most of the time just worked fine that way.\n\nStill, the basic approach used in the old tooling was stateful as hell. Tooling, scripts, sources and artifacts were _mounted_ to the container, resulting in state which I wanted to avoid in the new tool. Even the filesystem layout was one big pile of state, with symlinks that were required to point to certain locations or else the tooling would fail in weird (and sometimes non-obvious) ways.\n\nI knew from years of programming (and a fair amount of system configuration experience with NixOS) that state is the devil. So my new tool should use docker for the build procedure, but inside the running container, there should be a defined state _before_ the build, and a defined state _after_ the build, but nothing in between. Especially if a build failed, it shouldn't clutter the filesystem with stuff that might break the next or some other build.\n\n### Package scripting\n\nThe next requirement was a bit more difficult to establish - and by that I mean acceptance in the team. The old tooling used bash scripts for build configuration. These bash scripts had a lot of variables that implicitely (at least from my point of view) told the calling tool what to do. 
That felt like another big pile of state to me and I thought of ways to improve that situation without giving up on the IP that was _in_ those bash scripts, but also providing us with a better way to define structured data that was attached to a package.\n\nThe scheme of our bash scripts were \"phases\" (bash functions) that were called in a certain order to do certain things. For example, there was a function to prepare the build (usually calling the `configure` script with the desired parameters, or preparing the `cmake` build). Then, there was a build function that actually executed the build and a package function that created the package after a successful build.\n\nI wanted to keep that scheme, possibly expanding on it.\n\nAlso, I needed a (new) way of specifying package metadata. This metadata then must be available during the build process so that the build-mechanism could use it.\n\nBecause of the complexity of the requirements of our customers for a package, but also because we package for multiple customers, all with different requirements, the build procedure itself must be highly configurable. Re-using our already existing bash scripts seemed like a good way to go here.\n\nSomething like the [nix expression language](https://nixos.wiki/wiki/Nix_Expression_Language) first came to my mind, but was then discarded as too complex and too powerful for our needs.\n\nSo I developed a simple, yet powerful mechanism based on hierarchical configuration files using [\"Tom's Obvious, Minimal Language\"](https://en.wikipedia.org/wiki/TOML). I will come back to that later.\n\n### Parallelization of Builds\n\nMost of the time, we are building software where the dependencies are few in depth, but lots in breadth. Visually, that means that we'd rather have this:\n\n```\n\n A\n +\n |\n +---------------+----------------+\n | |\n v v\n B C\n + +\n | +--------------------+\n +-------------+ | | |\n v v v v v v\n D E F G H I\n\n```\n\ninstead of this:\n\n```\n A\n +\n |\n +-------------+\n v v v\n B C D\n + +\n | |\n v v\n E F\n + +\n +--+---+ |\n v v v\n G H I\n\n```\n\nWhich means we could optimize a good part by parallelizing builds. In the first visualization, six packages can be built in parallel in the first step, and two in the second build step . In the second visualization, four packages could be built in parallel in the first step, and so on.\n\nOur old tooling, though, would build all packages in sequence, in one container, on one host.\n\nHow much optimization could benefit us can be calculated easily: If each package would take one minute to build, without considering the tool runtime overhead, our old tooling would take 9 minutes for all 9 packages to be built. Not considering parallelization overhead, a parallel build procedure would result in 3 minutes for the first visualization and 4 minutes for the second.\n\nOf course, parallelizing builds on one host does not result in the same duration because of sharing of CPU cores, and also the tooling overhead must be considered as well as waiting for dependencies to finish - but as a first approximation, this promised a *big* speedup (which turned out to be true in real life), especially because we got the option to parallelize over multiple build hosts, which is not possible at all with the old tooling.\n\n\n### Packaging targets\n\nBecause we provide support for not only one linux distribution, but a range of different distributions, the tool must be able to support all of them. 
This does not only mean that we must support APT- and RPM-based distributions, but we must also be able to extend our tooling later to be able to support other packaging tools. Also, the tool should be able to package into tar archives or other - proprietary - packaging formats, because apparently that's still a thing in the industry.\n\nSo having the option of packaging for different package formats was another dimension in our requirements matrix.\n\nMy idea here was simple and proved to be effective: Not to care about packaging formats at all! All these formats have one thing in common: At the end of the day, they are just files. And thus, the tool should only care about files. So the actual packaging is just the final phase which would ideally results in artifacts in all formats we provide.\n\n### Replicability\n\nBy using #nixos for several years now, I knew about the benefits of replicability (and reproducibility - patience!)!\n\nIn the old tooling, there was _some_ replicability, because after all, the scripts that implemented the packaging procedure were there. But artifacts were only losely associated with the scripts. Also, as dependencies were implicitly calculated, updating a library broke replicability for packages that dependended on that library - consider the following package dependency chain:\n\n```\nA 1.0 -\u003e B 1.0 -\u003e C 1.0\n```\n\nIf we'd update `B` to `2.0`, nothing in `A` would tell us that a `a-1.0.tar.gz` was built with `B 1.0` instead of `2.0`, because dependencies were only specified by name, not by version. Only _very_ careful investigation of filesystem timestamps and implicitely created files somewhere in the filesystem _could_ lead to the right information. It was there - buried, hard to find and implicitly created during the packaging procedure.\n\nThere clearly was big room for improvement as well. The most obvious way was to make dependencies with their version number explicit. Updating dependencies, though, had to be easy with the new tooling, because you want to be able to update a lot of dependencies easily. Think of openssl, which releases bugfixes more than once a month. You want to be able to update all your packages without a hassle.\n\nAnother big point in the topic of replicability are logs. Logs are important to be kept, because they are the one thing that can tell you the most about that old build job you did two months ago for some customer, that resulted in this one package with that very specific configuration that you need to reproduce with a new version of the source now.\n\nYep, logs are important.\n\nFrom time to time we did a cleanup of the build directories and just deleted them completely along with all the logs. That was, in my opinion, not good enough. After all, why not keep _all_ logs?\n\nThe next thing were, of course, the build scripts. With the old tool, each package was a git repository with some configuration. On each build, a commit was automatically created to ensure the build configuration was persisted. I didn't like how that worked at all. Why not have one big repository, where all build configurations for all packages reside in? Maybe that's just me coming from #nixos where #nixpkgs is just this big pile of packages - maybe. 
I just felt that this should be the way this should work.\nAlso, with some careful engineering, this would result in _way_ less code duplication, because reusing stuff would be easy - it wouldn't depend on some file that should be somewhere in the filesystem, but a file that is just _there_ because it was committed to the repository.\n\nAlso, replicability could be easily ensured by\n\n1. not allowing any build with uncommitted stuff in the package repository\n2. recording the git commit hash a package was built from\n\nThis way, if you have a package in a format that allows to record arbitrary data in its meta-information (.tar.gz does not, of course), you can just check out the commit the package was built from, get the logs for the relevant build-job, and off you go debugging your problem!\n\n### Reproducability\n\nWith everything recorded in one big git repository for all packages, one big step towards reproducibility was made.\n\nBut there was more.\n\nFirst of all, because of our requirements with different customers, different package configurations and so on, our tooling must be powerful when it comes parametrization. We also needed to be able to record the package parameters. For example, if we package a software for a customer \"CoolCompany\" and we needed, for whatever reason, to change some `CFLAGS` to some or all packages, these parameters needed to be recorded, along with possible default-parameters we provide for individual packages anyways.\n\nIf we package, for whatever reason, some package with some library dependencies explicitely set to _old_ versions of said library, that needs to be recorded as well.\n\n## Selecting Technologies\n\nThe requirements we defined for the project were not decisive for the selected technologies, but rather the history we had in our tooling and how our intellectual property and existing technology stack was built. And that is bash scripts for the package configuration(s) and docker for our build infrastructure.\nNevertheless, a goal we had in mind while developing our new tooling was always that we might, one day, step away from docker. A new and better-suited technology might surface (not that I say that docker is particularly well suited for what we do here), or docker might just become too much of a hassle to work with - you'll never know!\nEither way, the \"backend\" of our tool should be convertible to not only be able to talk to docker, but also other services/tools, e.g. kubernetes or even real virtual machines.\n\nAs previously said (see \"Package scripting\"), the existing tooling (mostly) used bash scripts for package definition. Because these scripts contained a lot of intellectual property, and easy transferation of knowledge from the old tooling to the new tooling was crucial, the new tooling had to use bash scripts as well. Structural and static package metadata, although, could be moved to some other form. TOML, as established markup language in the Rust community, was considered and found appropriate.\nBut because we knew from the old tooling, that repetition was an easy habit to fall for, we needed some setup that made re-use of existing definitions easy and enable flexibility, but also give us the option to fine-tune settings of individual packages, all while not giving away our IP.\nThus, the idea of hierarchical package definition came up. 
Rather than inventing our own DSL for package definition, we would use a set of predefined keys in the TOML data structures and apply some \"layering\" technique on top to be able to have a generic package definition and alter that per-package (or more fine granular). The details on this follow below in the section \"Defining the Package format\".\n\nFor replicability, aggregation of log output and tracking of build parameters we wanted some form of database. The general theme we quickly recognized was, that all data we were interested in, was immutable once created. i.e., a line of log output is of course not to be altered, but to be recorded as-is, but also parameters we submit to the build jobs, dependency versions, every form of flags or other metadata. Thus, an append-only storage format was considered best.\nDue to the nature of the project (mostly being a \"let's see what we can accomplish in a certain timeframe\"), the least complicated approach for data storage was to use what we knew: postgres - and never _alter_ data once it hit the database. (To date, we have no `UPDATE` sql statements in the codebase and only one `DELETE` statement.)\n\n### Rust\n\nUsing Rust as the language to implement the tool was a no-brainer.\n\nFirst of all, and I admit that was the main reason that drove me, I am most familiar with the Rust programming language. Other languages I am familiar with are not applicable for that kind of problem (that is: Ruby, Bash or C).\nOther languages that would be available for the problem domain we're talking about here would be Python or Go (or possibly even a JVM language), but I'd not consider any of them, mostly because not a single one of these languages gives me the expressiveness and safety that Rust can provide. After all, I needed a language where I could be certain that things actually worked after I wrote the code.\n\nThe Rust ecosystem gives me a handful of awesome crates (read: libraries) to solve certain types of problems:\n\n* The `serde` framework for easy de- and serialization from different formats, most notably for the problem at hand: the TOML, with the `toml` crate as an implementation of that format\n* The `tokio` framework for async/await, which is also needed for\n* `shiplift`, an library to talk to the Docker API from native Rust, leveraging the async/await capabilities of Rust\n* `handlebars` for rendering of templates, used for the package format implementation (read later in \"Defining the Package format\")\n* The `config` crate, which I got maintainer rights during the implementation of butido, to ensure the continued development of that awesome configuration-handling library\n* the `diesel` SQL framework for talking to the PostgreSQL database\n* ... 
and many, many more libraries which made development a breeze (at the time of writing this article, butido is using 51 crates as dependency, 3 of which I am the author and 2 of which I got maintainer rights during the development of butido - after all, [bus factor](https://en.wikipedia.org/wiki/Bus_factor) is a thing!)\n\n## Defining the Package format\n\nAs stated in one of the previous sections, we had to define a package format that could be used for describing the metadata of packages while keeping the possibility to customize packages per customer and also giving us the ability to use the knowledge from our old bash-based tooling without having to convert everything to the new environment.\n\nMost of the metadata, which is rather static per package, could easily be defined in the format:\n\n* The name of the package\n* The version of the package\n* A list of sources of a package because packages can have multiple sources - think [git, where sources and manpages are distributed as two individual tarballs](https://mirrors.edge.kernel.org/pub/software/scm/git/)\n* Buildtime dependencies (dependencies that are only needed for building the package, a common one being \"make\")\n* Runtime dependencies (dependencies which needed to be installed on a system for the package itself to be usable, could be system packages or packages we've built ourselves)\n* A list of patches we apply to the package before building\n* A key-value list of environment variables for the build procedure that were local for the package but not for all packages in the tree\n* An allowlist of distributions (or rather: docker images) a package can be built on\n* A blocklist of distributions (or rather: docker images) a package can not be built on\n* A key-value list of additional meta information for the package, e.g. a description, a license field or the website of the project that develops the package\n\nAll these settings are rather static per package. The real complexity, though, comes with the definition of the build script.\n\nThe idea of \"phases\" in the build script was valuable. So we decided that each package should be able to define a list of \"phases\", that could be compiled to a \"script\" that, when executed, transformed the sources of a package to some artifacts (for example, a .rpm file).\nThe idea was, that the script alone knew about packaging formats, because (if you remember from earlier in the article), we needed flexibility in packaging targets: rpm, deb, tar.gz or even proprietary formats must be supported.\n\nSo, each package had a predefined list of phases that could be set to a string. These strings are then concatenated by butido and the resulting script is, along with the package sources, artifacts from the build of the package dependencies and the patches for the package, copied to the build-container and executed. To visualize this:\n\n```\nSources Patches Script\n │ │ │\n │ │ │\n └────────────┼────────────┘\n ▼\n ┌─────────────┐\n │ │\n │ Container │◄────┐\n │ │ │\n └──────┬──────┘ │\n │ │\n │ │\n ▼ │\n Artifacts────────┘\n```\n\nOne ciritcal point, though, was repetition. Having to repeat parts of a script in multiple packages was a deal-breaker. Therefore being able to reuse parts of a script is necessary. 
Because we did not want to invent our own scripting language/DSL for this, we decided to use the layering mentioned before to implement reuse of parts of scripts.\n\nConsider the following tree of package definition files:\n\n```\n/pkg.toml\n\n/packageA/pkg.toml\n/packageA/1.0/pkg.toml\n/packageA/2.0/pkg.toml\n/packageA/2.0/1/pkg.toml\n/packageA/2.0/2/pkg.toml\n\n/packageB/pkg.toml\n/packageB/0.1/pkg.toml\n/packageB/0.2/pkg.toml\n```\n\nThe idea with that scheme was that we implement a high-level package definition (`/pkg.toml`), where variables and build-functionality was predefined, and later alter variables and definitions as needed in the individual packages (`/packageA/pkg.toml` or `/packageB/pkg.toml`), in different versions of these packages (`/packageA/1.0/pkg.toml`, `/packageA/2.0/pkg.toml`, `/packageB/0.1/pkg.toml` or `/packageB/0.2/pkg.toml`), or even in different builds of a single version of a package (`/packageA/2.0/1/pkg.toml`, `/packageA/2.0/2/pkg.toml`).\n\nHere, the top level `/pkg.toml` would define, for example, `CFLAGS = [\"-O2\"]`, so that all packages had that `CFLAGS` passed to their build by default. Later, this environment variable could be overwritten in `/packageB/0.2/pkg.toml`, only having an effect in that very version.\nMeanwhile, `/packageA/pkg.toml` would define `name = \"packageA\"` as the name of the package. That setting automatically applies to all sub-directories (and their `pkg.toml` files).\nPackage-local environment variables, special build-system scripts and metadata would be defined _once_ and reused in all sub-`pkg.toml` files, so that repetition is not necessary.\n\nThat scheme is also true for the script phases of the packages. That means, that we implement a generic framework of how a package is built in `/pkg.toml`, with a lot of bells and whistles - but not tied to a build tool (autotools or cmake or...) or a packaging target (rpm or deb or...), but with flexibility to handle all of these cases gracefully.\nLater, we customize parts of the script by overwriting environment variables to configure the generic implementation, or we overwrite whole phases of the generic implementation with a specialized version to meet the needs of the specific package.\n\nIn reality, this looks approximately like this:\n\n```toml\n# /pkg.toml\n# no name = \"\"\n# no version = \"\"\n\n[phases]\nunpack.script = '''\n tar xf $sourcefile\n'''\n\nbuild.script = '''\n make -j 4\n'''\n\ninstall.script = '''\n make install\n'''\n\npackage.script = '''\n # .. you get the hang of it\n'''\n```\n\nand later:\n\n```toml\n# in /tmux/pkg.toml\nname = \"tmux\"\n# still no version = \"\"\n\n[phases]\nbuild.script = '''\n make -j 8\n'''\n```\n\nand even later:\n\n```toml\n# in /tmux/3.2/pkg.toml\nversion = \"3.2\"\n\n[phases]\ninstall.script = '''\n\tmake install PREFIX=/usr/local/tmux-3.2\n'''\n```\n\nnot that the above example is accurate or sane, but it demonstrates the power of the approach: In the top-level `/pkg.toml`, we define a generic way of building packages. 
in `/tmux/pkgs.toml` we overwrite some settings that are equal for all packaged tmux versions: `name` and the `build` part of the script, and in the `/tmux/3.2/pkg.toml` file we define the last bits of the package definition: The `version` field and, because of some reason, we overwrite the `install` part of the script to install to a certain location.\n\nOne could even go further and have a `/tmux/3.2/fastbuild/pkg.toml`, where `version` is set to `\"3.2-fast\"` and build tmux with `-O3` in that case.\n\n### Templating package scripts\n\nThe approach described above is very powerful and flexible. It has one critical problem, though: What if we needed information from the lowest `pkg.toml` file in the tree (e.g. `/tmux/3.2/fastbuild/pkg.toml`), but that information had to be available in `/pkg.toml`.\n\nThere are two solutions to this problem. The first one would be that we would define a phase at the very beginning of the package script that would define all the variables. The `/tmux/3.2/fastbuild/pkg.toml` file would overwrite that phase and define all the variables, and later phases would use them.\n\nThat approach had one critical problem: It would yield the layering of `pkg.toml` files meaningless, because each `pkg.toml` file would need to overwrite that phase with the appropriate settings for the package: if `/tmux/3.2/pkg.toml` defined all the variables, but one variable needs to be overwritten for `/tmux/3.2/fastbuild/pkg.toml`, the latter would still need to overwrite the complete phase.\nThis was basically a pothole for the don't-repeat-yourself idea and thus a no-go.\n\nSo we asked ourselves: what data do we need in the top-level generic scripts that gets defined in the more specific package files? Turns out: only the static stuff! The generic script phases need the name of a package, or the version of a package, or meta-information of the package... and all this data is static.\nSo, we could just sprinkle a bit of templating over the whole thing and be done with it!\n\nThat's why we added [handlebars](https://crates.io/crates/handlebars) to our dependencies: To be able to access variables of a package in the build script. 
Now, we can define:\n\n```\nbuild.script = '''\n cd /build/{{this.name}}-{{this.version}}/\n make\n'''\n```\n\nAnd even more complicated things, for example iterating over all defined dependencies of a package and check whether they are installed correctly in the container.\nAnd all that can be scripted in an easy generic way, without knowing about the package-specific details.\n\n### PostgreSQL\n\nWe decided to use postgresql for logging structured information of the build processes.\n \nAfter identifying the entities that needed to be stored, setting up the database with the appropriate scheme was not too much of a hassle, given the awesome [diesel](https://diesel.rs/) crate.\n\nBefore we started with the implementation of butido, we identified the entities with the following diagram:\n\n```\n+------+ 1 N +---+ 1 1 +-----------------------+\n|Submit|\u003c---------------\u003e|Job|-------+-----\u003e|Endpoint * |\n+--+---+ +---+ | 1 +-----------------------+\n | +-----\u003e|Package * |\n | | 1 +-----------------------+\n | +-----\u003e|Log |\n | 1 +-----------------------+ | 1 +-----------------------+\n +----\u003e|Config Repository HEAD | +-----\u003e|OutputPath |\n | 1 +-----------------------+ | N:M +-----------------------+\n +----\u003e|Requested Package | +-----\u003e|Input Files |\n | 1 +-----------------------+ | N:M +-----------------------+\n +----\u003e|Requested Image Name | +-----\u003e|ENV |\n | M:N +-----------------------+ | 1 +-----------------------+\n +----\u003e|Additional ENV | +-----\u003e|Script * |\n | 1 +-----------------------+ +-----------------------+\n +----\u003e|Timestamp |\n | 1 +-----------------------+\n +----\u003e|Unique Build Request ID|\n | 1 +-----------------------+\n +----\u003e|Package Tree (JSON) |\n 1 +-----------------------+\n```\n\nWhich is explained in a few sentences: \n \n 1. Each job builds one package\n 2. Each job runs on one endpoint\n 3. Each job produces one log\n 4. Each job results in one output path\n 5. Each job runs one script\n 6. Each job has N input files, and each file belongs to M jobs\n 7. Each job has N environment variables, and each environment belongs to M jobs\n 8. One submit results in N jobs\n 9. Each submit was started from one config repository commit\n 10. Each submit has one package that was requested\n 11. Each submit runs on one image\n 12. Each submit has one timestamp\n 13. Each submit has a unique ID\n 14. Each submit has one Tree of packages that needed to be built\n \nI know that this method is not necessarily \"following the book\" of how to develop software, but this was the very first sketch-up of how our data needed to be structured. Of course, this is not the final database layout that is implemented in butido today, but nevertheless it hasn't changed fundamentally since. The idea of a package \"Tree\" that is stored in the database was removed, because after all, the packages are not a tree but a DAG (more details below). What hasn't changed, is that the script that was executed in a container is stored in the database, as well as the log output of that script.\n\n## Implementing a MVP\n\nAfter a considerable planning phase and several whiteboards of sketches, the implementation started. Initially, I was allowed to spend 50 hours of work on the problem. That was enough for get some basics done and a plan on how to reach the MVP. 
After 50 hours, I was able to say that with approximately another 50 hours I could get a prototype that could be used to actually build a software package.\n\n### Architecture\n\nThe architecture of butido is not as complex as it might seem. As these things are best described with a visualization, here we go: \n \n```\n ┌─────────────────────────────────────────────────┐\n │ │\n │ Orchestrator │\n │ │\n └─┬──────┬───────┬───────┬───────┬───────┬──────┬─┘\n │ │ │ │ │ │ │\n │ │ │ │ │ │ │\n ┌───▼─┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌─▼───┐\n │ │ │ │ │ │ │ │ │ │ │ │ │ │\n │ Job │ │ Job │ │ Job │ │ Job │ │ Job │ │ Job │ │ Job │\n │ │ │ │ │ │ │ │ │ │ │ │ │ │\n └───┬─┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └─┬───┘\n │ │ │ │ │ │ │\n ┌─▼──────▼───────▼───────▼───────▼───────▼──────▼─┐\n │ │\n │ Scheduler │\n │ │\n └────────┬───────────────────────────────┬────────┘\n │ │\n ┌────────▼────────┐ ┌────────▼────────┐\n │ │ │ │\n │ Endpoint │ │ Endpoint │\n │ │ │ │\n └────────┬────────┘ └────────┬────────┘\n │ │\n┌─────┬────────▼────────┬─────┐ ┌─────┬────────▼────────┬─────┐\n│ │ Docker Endpoint │ │ │ │ Docker Endpoint │ │\n│ └─────────────────┘ │ │ └─────────────────┘ │\n│ │ │ │\n│ Physical Machine │ │ Physical Machine │\n│ │ │ │\n│ ┌───────────┐ ┌───────────┐ │ │ ┌───────────┐ ┌───────────┐ │\n│ │ │ │ │ │ │ │ │ │ │ │\n│ │ Container │ │ Container │ │ │ │ Container │ │ Container │ │\n│ │ │ │ │ │ │ │ │ │ │ │\n│ └───────────┘ └───────────┘ │ │ └───────────┘ └───────────┘ │\n│ │ │ │\n└─────────────────────────────┘ └─────────────────────────────┘\n```\n\nOne part I could not visualize properly without messing up the whole thing is that each job talks to some other jobs. Also, some helpers are not visualized here, but they do not play a part in the overal architecture. But lets start at the beginning.\n\nFrom top to bottom: The orchestrator uses a `Repository` type to load the definitions of packages from the filesystem. It then fetches the packages that need to be build using said `Repository` type, which does a recursive traversal of the packages. That process results in a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of packages. For each package, a `Job` is created, which is a set of variables that need to be associated with the `Package` so that can be built. That is environment variables, but also the image that should be used for executing the build script and some more settings.\n\nEach of those jobs is then given to a \"Worker\" (named \"Job\" in above visualization). Each of these workers gets associated with the jobs that need to be successfully built before the job itself can be run - that's how dependencies are resolved. This association itself is a DAG, and it is automatically executed in the right order because each job waits on its dependents. If an error happened, either during the execution of the script in the container, or while processing of the jobs itself, the job sends the error to its parent job in the DAG. This way, errors propagate through the DAG and all jobs exit, either with success, an error from a child, or an own error.\n\nEach of those jobs knows a \"scheduler\" object, which can be used to submit work to a docker endpoint. This scheduler keeps track of how many builds run at every point in time and blocks further spawning of further builds if there are too many running (which is a configuration option for the user of butido).\n\nThe scheduler knows each configured and connected `Endpoint` and uses it for submitting builds to a docker endpoint. 
These docker endpoints could be physical machines (like in the visualzation) or VMs or whatever else.\n\nOf course, during the whole process there are some helper objects involved, the database connection is passed around, progress bar API objects are passed around and a lot of necessary type-safety stuff is done. But for the architectural part, that really was the essence.\n\n### Problems during implementation\n\nDuring the implementation of butido, some problems were encountered and dealt with. Nothing too serious that made us re-do the architecture and nothing too complicated. Still, I want to highlight some of the problems and how they were solved.\n\n#### Artifacts and Paths\n\nOne problem we encountered, and which is a constant source of happyness in every software project, was the _paths_ to our input and output artifacts and the handling of them. \n \nLuckily, Rust has a very convenient and concise path handling API in the standard library, so over several refactorings, we were able to not introduce more bugs than we squashed.\n\nThere's no individual commit I can link here, because quite a few commits were made that changed how we track(ed) our Artifact objects and path to artifacts during the execution of builds. `git log --grep=path --oneline | wc -l` lists 37 of them at the time of writing this article.\n\n\u003e It is not enough for code to work.\n\u003e (Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship)\n\nLessons learned is: Make use of strong types as much as you can when working with paths. For example, we have a `StoreRoot` type, that points to a directory where artifacts are stored in. Because artifacts are identified by their filename, there's a `ArtifactPath` type. If `StoreRoot::join(artifact_path)` is called, the caller gets a `FullArtifactPath`, which is an _absolute_ path to the actual file on disk. Objects of the aformentioned types cannot be altered, but only new ones can be created from them. That, plus some careful API crafting, makes some classes of bugs impossible.\n\n#### Orchestrator rewrite(s)\n\nIn above section on the general architecture of butido, the \"Orchestrator\" was already mentioned. This type takes care of loading the repository from disk, selecting the package that needs to be build, finds all the (reverse) dependencies of the package and transforms each package into a runnable job that can be submitted to the scheduler and to the endpoints from there, using workers to orchestrate the whole thing.\n\nThe Orchestrator itself, though, was [rewritten](https://git.sr.ht/~science-computing/butido/commit/889649ac16367fe671ce61363bb6ce82531e5a6b) [twice](https://git.sr.ht/~science-computing/butido/commit/5a51e23ba57491d100f4ffeac5c8657aaa1b011b).\n\nIn the beginning, the orchestrator was still working on a Tree of packages. This tree was processed layer-by-layer: \n \n```\n A\n / \\\n B E\n / \\ \\\nC D F\n```\n\nSo this tree resulted in the following sets of jobs: \n \n``` \n[\n [ C, D, F ]\n [ B, E ]\n [ A ]\n]\n```\n\nand because the packages did not depend on each other, these lists would then be processed in parallel.\n\nThis was a simple implementation of a simple case of the problem that worked very well until we were able to run the first prototype. It was far from optimal though. 
In above tree, the package named `E` could start to build eventhough `C` and `D` were not finished building yet.\n\nThe first rewrite mentioned above solved that by reimplementing the job-spawning algorithm to perform better on such and similar cases - which are in fact not that uncommon for us.\n\nThe second rewrite of the `Orchestrator`, which happened shortly after the first one (after we understood the problem at hand even better), optimized that algorithm to the best possible solution.\nThe new implementation uses a trick: It spawns one worker for each job. Each of those workers has \"incoming\" and \"outgoing\" channels (that's actually [Multi-Producer-Single-Consumer-Queues](https://docs.rs/tokio/1.5.0/tokio/sync/mpsc/index.html)). The orchestrator associates each job in the dependency DAG with its parent by connecting the childs \"outgoing\" channel with the parents \"incoming\" channel. Leaf nodes in the DAG have no \"incoming\" channel (they are closed right away after instantiation) and the \"outgoing\" channel of the \"root node\" in the DAG sends to the orchestrator itself.\n\nThe channels are used to send either successfully built artifacts to the parent, or an error.\n\nEach worker then waits on the \"incoming\" channels for the artifacts it depends on. If it gets an error, it does nothing but send that error to its parent. If all artifacts are received, it starts scheduling its own build on the scheduler, sending the result of that process to its parent.\n\nThis way, artifacts and/or errors propagate through the DAG until the `Orchestrator` gets all results. And the `tokio` runtime, which is used for the async-await handling, orchestrates the execution of the processes automatically.\n\n#### Tree vs. DAG\n\nAnother problem we encountered was not so much of a problem but rather a simplification we used when writing the package-to-job conversion algorithm the first time. When we began implementing the package-loading mechansim, we used a tree data structure for the recursive loading of dependencies.\nThat meant that a structure like this: \n \n```\nA ------\u003e B\n| ^\n| |\n `-\u003e C --´ \n```\n\nWas actually represented like this:\n\n```\nA -----\u003e B\n|\n|\n `-\u003e C -\u003e B \n```\n\nThat meant that `B` was built twice: Once as a dependency of `A` and once as a dependency of `C`. 
That was a simplification we used to get a working prototype _fast_, because implementing the proper structure (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) was considered too involved at the time, given the \"prototype nature\" the project had back then.\n\nBecause we used clear seperation of concerns in the codebase and the module that implemented the datatype for holding the loaded package dependency collection was seperated from the rest of the logic, replacing the tree structure with a DAG structure [was not too much involved](https://git.sr.ht/~science-computing/butido/commit/efe07be74cf1ae704cb73b0f20c28b33aa46c217) ([merge](https://git.sr.ht/~science-computing/butido/commit/2d82860bbd867a328fa1ab21e77705439ba4636b)).\n\nTo lay a bit more emphasize on that: When we changed the implementation of the package dependency collection handling, [we didn't need to rewrite the `Orchestrator` for that](https://git.sr.ht/~science-computing/butido/commit/efe07be74cf1ae704cb73b0f20c28b33aa46c217#src/orchestrator/orchestrator.rs) (except changing some interfaces), because the two parts of the codebase were seperated enough so that changing the implementation on one side didn't result in a rewrite on the other side.\n\n## Open sourcing\n\nFor me it was clear from the outset that this tool was not what we call our intellectual property. Our expertise is in the scripts that actually executed the build, in our handling of the large number of special cases for every package/target distribution/customer combination. The tool that orchestrated these builds is nothing that could be sold to someone, because it is not rocket science.\n\nBecause of that, I asked whether butido could be open sourced. Long story short: I was allowed to release the source code under a open source license and since we had some experience with the Eclipse Public License (EPL 2.0), I released it under that license.\n\nBefore we were able to release it under that license, I had to add some package lints (`cargo-deny`), to verify that all our dependencies were on terms with that and I had to remove one package that was pulled in but had an unclear license. Because our CI does check the license of all (reverse) dependencies, we can be sure that such an package is not added again.\n\n## butido 0.1.0\n\nAfter about 380 hours of work, butido 0.1.0 was released. Since then, four maintenance releases (`v0.1.{1, 2, 3, 4}`) were released to fix some minor bugs.\n\nbutido is usable for our infrastructure, we've started packaging our software packages with it and started (re)implementing our scripts in the butido specific setup. For now, everything seems to work properly and at the time of writing this article, about 30 packages were already successfully packaged and built.\n\n\u003e In general, the longer you wait before fixing a bug, the costlier (in time and money) it is to fix.\n\u003e (Joel Spolsky, Joel on Software)\n\nOf course there might be still some bugs, but I think for now, the most serious problems are dealt with and I'm eager to start using butido at a large scale.\n\n## Plans for the future\n\nWe also have plans for the future: some of these plans are more in the range of \"maybe in a year\", but some of them are also rather short-term.\n\nSome refactoring stuff is always on such a list, as it is with butido: When implementing certain parser helpers, we used the \"pom\" crate, for example. This should be replaced by the way more popular \"nom\" crate, just to be future-proof here. 
Some frontend (CLI) cleanup should be