author Sean E. Russell <ser@ser1.net> 2021-03-02 03:16:40 -0600
committer Sean E. Russell <ser@ser1.net> 2021-03-02 03:17:53 -0600
commit a44ced4bba471325bba98cabce5709d52c90a60c
tree b628af5fc503ab90ba7d3f1c8ab47811dc624fbf
parent 94d5c2e33da1af733644932003d619a7ba537207
Bring extensions under the umbrella.
-rw-r--r-- README.md 17
-rw-r--r-- assets/screenshots/fourby.png bin 0 -> 186287 bytes
-rw-r--r-- devices/nvidia.go 184
-rw-r--r-- devices/remote.go 271
-rw-r--r-- docs/extensions.md 12
-rw-r--r-- docs/remote-monitoring.md 62
6 files changed, 531 insertions, 15 deletions
diff --git a/README.md b/README.md
index cd58020..6e8b96b 100644
--- a/README.md
+++ b/README.md
@@ -35,23 +35,14 @@ If you install gotop by hand, or you download or create new layouts or colorsche
```
- **OSX**: gotop is in *homebrew-core*. `brew install gotop`. Make sure to uninstall and untap any previous installations or taps.
- **Prebuilt binaries**: Binaries for most systems can be downloaded from [the github releases page](https://github.com/xxxserxxx/gotop/releases). RPM and DEB packages are also provided.
-- **Prebuild binaries with extensions**:
- - [NVidia GPU support](https://github.com/xxxserxxx/gotop-nvidia/releases)
- - [Remote gotop support](https://github.com/xxxserxxx/gotop-remote/releases)
-- **Source**: This requires Go >= 1.14. `go get -u github.com/xxxserxxx/gotop/cmd/gotop`
+- **Source**: gotop requires Go >= 1.14: `go get -u github.com/xxxserxxx/gotop/cmd/gotop`
### Extension builds
-An evolving mechanism in gotop are extensions. This is designed to allow gotop to support feature sets that are not universally needed without blowing up the application for average users with unused features. Examples are support for specific hardware sets like video cards, or things that are just obviously not a core objective of the application, like remote server monitoring.
+Extensions have proven problematic: Go plugins are not usable in real-world cases, and the solution I had running for a while was hacky at best. Consequently, extensions have been moved into the main code base for now.
-The path to these extensions is a tool called [gotop-builder](https://github.com/xxxserxxx/gotop-builder). It is easy to use and depends only on having Go installed. You can read more about it on the project page, where you can also find binaries for Linux that have *all* extensions built in. If you want less than an all-inclusive build, or one for a different OS/architecture, you can use gotop-builder itself to create your own.
-
-There are currently two extensions:
-
-- Support for [NVidia GPUs](https://github.com/xxxserxxx/gotop-nvidia), which add GPU usage, memory, and temperature data to the respective widgets
-- Support for [remote devices](https://github.com/xxxserxxx/gotop-remote), which allows running gotop on a remote machine and seeing the sensors from that as if they were local sensors.
-
-There are builds for those binaries for Linux in each of the repositories.
+- nvidia support: requires the `enable` flag. Detecting NVidia hardware, or rather, the absence of NVidia hardware, can take seconds; this greatly slows down gotop's start-up time. To avoid this, the NVidia code will not be run unless it has been enabled with the `--enable nvidia` runtime flag.
+- remote: allows gotop to pull sensor data from applications exporting Prometheus metrics, including remote gotop instances themselves.
### Console Users
diff --git a/assets/screenshots/fourby.png b/assets/screenshots/fourby.png
new file mode 100644
index 0000000..12c99ba
--- /dev/null
+++ b/assets/screenshots/fourby.png
Binary files differ
diff --git a/devices/nvidia.go b/devices/nvidia.go
new file mode 100644
index 0000000..0e50dba
--- /dev/null
+++ b/devices/nvidia.go
@@ -0,0 +1,184 @@
+package devices
+
+import (
+ "bytes"
+ "encoding/csv"
+ "errors"
+ "fmt"
+ "os/exec"
+ "strconv"
+ "sync"
+ "time"
+
+ "github.com/xxxserxxx/opflag"
+)
+
+// Set up variables and register this plug-in with the main code.
+// The functions Register*(f) tell gotop which of these plugin functions to
+// call to update data; the RegisterStartup() function sets the function
+// that gotop will call when everything else has been done and the plugin
+// should start collecting data.
+//
+// In this plugin, one call to the nvidia program returns *all* the data
+// we're looking for, but gotop will call each update function during each
+// cycle. This means that the nvidia program would be called 3 (or more)
+// times per update, which isn't very efficient. Therefore, we make this
+// code more complex to run a job in the background that runs the nvidia
+// tool periodically and puts the results into hashes; the update functions
+// then just sync data from those hashes into the return data.
+func init() {
+ opflag.BoolVarP(&nvidia, "nvidia", "", false, "Enable NVidia GPU support")
+ RegisterStartup(startNVidia)
+}
+
+// updateNvidiaTemp copies data from the local _temps cache into the passed-in
+// return-value map. It is called once per cycle by gotop.
+func updateNvidiaTemp(temps map[string]int) map[string]error {
+ nvidiaLock.Lock()
+ defer nvidiaLock.Unlock()
+ for k, v := range _temps {
+ temps[k] = v
+ }
+ return _errors
+}
+
+// updateNvidiaMem copies data from the local _mems cache into the passed-in
+// return-value map. It is called once per cycle by gotop.
+func updateNvidiaMem(mems map[string]MemoryInfo) map[string]error {
+ nvidiaLock.Lock()
+ defer nvidiaLock.Unlock()
+ for k, v := range _mems {
+ mems[k] = v
+ }
+ return _errors
+}
+
+// updateNvidiaUsage copies data from the local _cpus cache into the passed-in
+// return-value map. It is called once per cycle by gotop.
+func updateNvidiaUsage(cpus map[string]int, _ bool) map[string]error {
+ nvidiaLock.Lock()
+ defer nvidiaLock.Unlock()
+ for k, v := range _cpus {
+ cpus[k] = v
+ }
+ return _errors
+}
+
+// startNVidia is called once by gotop, and forks a thread to call the nvidia
+// tool periodically and update the cached cpu, memory, and temperature
+// values that are used by the update*() functions to return data to gotop.
+//
+// The vars argument contains command-line arguments to allow the plugin
+// to change runtime options; the only option currently supported is the
+// `nvidia-refresh` arg, which is expected to be a time.Duration value and
+// sets how frequently the nvidia tool is called to refresh the data.
+func startNVidia(vars map[string]string) error {
+ if !nvidia {
+ return nil
+ }
+ _, err := exec.Command("nvidia-smi", "-L").Output()
+ if err != nil {
+ return errors.New(fmt.Sprintf("NVidia GPU error: %s", err))
+ }
+ _errors = make(map[string]error)
+ _temps = make(map[string]int)
+ _mems = make(map[string]MemoryInfo)
+ _cpus = make(map[string]int)
+ _errors = make(map[string]error)
+ RegisterTemp(updateNvidiaTemp)
+ RegisterMem(updateNvidiaMem)
+ RegisterCPU(updateNvidiaUsage)
+
+ nvidiaLock = sync.Mutex{}
+ // Get the refresh period from the passed-in command-line/config
+ // file options
+ refresh := time.Second
+ if v, ok := vars["nvidia-refresh"]; ok {
+ if refresh, err = time.ParseDuration(v); err != nil {
+ return err
+ }
+ }
+ // update once to populate the device names, for the widgets.
+ update()
+ // Fork off a long-running job to call the nvidia tool periodically,
+ // parse out the values, and put them in the cache.
+ go func() {
+ timer := time.Tick(refresh)
+ for range timer {
+ update()
+ }
+ }()
+ return nil
+}
+
+// Caches for the output from the nvidia tool; the update() functions pull
+// from these and return the values to gotop when requested.
+var (
+ _temps map[string]int
+ _mems map[string]MemoryInfo
+ _cpus map[string]int
+ // A cache of errors generated by the background job running the nvidia tool;
+ // these errors are returned to gotop when it calls the update() functions.
+ _errors map[string]error
+)
+
+var nvidiaLock sync.Mutex
+
+// update calls the nvidia tool, parses the output, and caches the results
+// in the various _* maps. The metric data parsed is: name, index,
+// temperature.gpu, utilization.gpu, memory.total, and
+// memory.used (the fields requested in the --query-gpu argument below).
+//
+// If this function encounters an error calling `nvidia-smi`, it caches the
+// error and returns immediately. We expect exec errors only when the tool
+// isn't available, or when it fails for some reason; no exec error cases
+// are recoverable. This does **not** stop the cache job; that will continue
+// to run and continue to call update().
+func update() {
+ bs, err := exec.Command(
+ "nvidia-smi",
+ "--query-gpu=name,index,temperature.gpu,utilization.gpu,memory.total,memory.used",
+ "--format=csv,noheader,nounits").Output()
+ if err != nil {
+ _errors["nvidia"] = err
+ //bs = []byte("GeForce GTX 1080 Ti, 0, 31, 9, 11175, 206")
+ return
+ }
+ csvReader := csv.NewReader(bytes.NewReader(bs))
+ csvReader.TrimLeadingSpace = true
+ records, err := csvReader.ReadAll()
+ if err != nil {
+ _errors["nvidia"] = err
+ return
+ }
+
+ // Ensure we're not trying to modify the caches while they're being read by the update() functions.
+ nvidiaLock.Lock()
+ defer nvidiaLock.Unlock()
+ // Errors during parsing are recorded, but do not stop parsing.
+ for _, row := range records {
+ // The name of the devices is the nvidia-smi "<name>.<index>"
+ name := row[0] + "." + row[1]
+ if _temps[name], err = strconv.Atoi(row[2]); err != nil {
+ _errors[name] = err
+ }
+ if _cpus[name], err = strconv.Atoi(row[3]); err != nil {
+ _errors[name] = err
+ }
+ t, err := strconv.Atoi(row[4])
+ if err != nil {
+ _errors[name] = err
+ }
+ u, err := strconv.Atoi(row[5])
+ if err != nil {
+ _errors[name] = err
+ }
+ _mems[name] = MemoryInfo{
+ Total: 1048576 * uint64(t),
+ Used: 1048576 * uint64(u),
+ UsedPercent: (float64(u) / float64(t)) * 100.0,
+ }
+ }
+}
+
+var nvidia bool
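
Review note on `update()` above: the row loop indexes `row[0]` through `row[5]` without checking the column count, and computes `UsedPercent` by dividing by the reported total. The following is a minimal standalone sketch, not the committed code, of how the same CSV could be parsed more defensively. The names `gpuSample`, `parseSMI`, and `atoiAll` are hypothetical and exist only for this illustration; the sample row is the one from the commented-out line in `update()`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

const mib = 1048576 // nvidia-smi reports memory in MiB when nounits is set

// gpuSample is a hypothetical stand-in for the values update() caches.
type gpuSample struct {
	Name        string
	Temp        int
	Usage       int
	TotalBytes  uint64
	UsedBytes   uint64
	UsedPercent float64
}

// parseSMI parses rows of "name, index, temp, util, mem.total, mem.used".
// Bad rows are reported in the error slice and skipped; good rows are kept.
func parseSMI(out string) ([]gpuSample, []error) {
	r := csv.NewReader(strings.NewReader(out))
	r.TrimLeadingSpace = true
	r.FieldsPerRecord = -1 // accept any column count; length is checked below
	records, err := r.ReadAll()
	if err != nil {
		return nil, []error{err}
	}
	var samples []gpuSample
	var errs []error
	for _, row := range records {
		if len(row) < 6 {
			errs = append(errs, fmt.Errorf("expected 6 columns, got %d", len(row)))
			continue
		}
		name := row[0] + "." + row[1]
		nums, err := atoiAll(row[2:6])
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", name, err))
			continue
		}
		s := gpuSample{
			Name:       name,
			Temp:       nums[0],
			Usage:      nums[1],
			TotalBytes: mib * uint64(nums[2]),
			UsedBytes:  mib * uint64(nums[3]),
		}
		// Guard the division so a zero total yields 0% instead of NaN.
		if nums[2] > 0 {
			s.UsedPercent = float64(nums[3]) / float64(nums[2]) * 100.0
		}
		samples = append(samples, s)
	}
	return samples, errs
}

// atoiAll converts every field to a non-negative int or fails on the first bad one.
func atoiAll(fields []string) ([]int, error) {
	nums := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(strings.TrimSpace(f))
		if err != nil {
			return nil, err
		}
		if n < 0 {
			return nil, fmt.Errorf("negative value %d", n)
		}
		nums[i] = n
	}
	return nums, nil
}

func main() {
	samples, errs := parseSMI("GeForce GTX 1080 Ti, 0, 31, 9, 11175, 206\n")
	fmt.Println(samples, errs)
}
```

Skipping a bad row entirely, rather than caching a half-parsed one, is a design choice of this sketch; the committed code records the error and still stores whatever values it parsed.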
diff --git a/devices/remote.go b/devices/remote.go
new file mode 100644
index 0000000..a86f3ee
--- /dev/null
+++ b/devices/remote.go
@@ -0,0 +1,271 @@
+package devices
+
+import (
+ "bufio"
+ "log"
+ "net/http"
+ "net/url"
+ "strconv"
+ "strings"
+ "sync"
+ "time"
+
+ "github.com/xxxserxxx/opflag"
+)
+
+var name string
+var remote_url string
+var sleep time.Duration
+var remoteLock sync.Mutex
+
+// FIXME Widgets don't align values
+// TODO remote network & disk aren't reported
+// TODO network resiliency; I believe it currently crashes gotop when the network goes down
+// TODO Replace custom decoder with https://github.com/prometheus/common/blob/master/expfmt/decode.go
+// TODO MQTT / Stomp / MsgPack
+func init() {
+ opflag.StringVarP(&name, "remote-name", "", "", "Remote: name of remote gotop")
+ opflag.StringVarP(&remote_url, "remote-url", "", "", "Remote: URL of remote gotop")
+ opflag.DurationVarP(&sleep, "remote-refresh", "", 0, "Remote: Frequency to refresh data, in seconds")
+
+ RegisterStartup(startup)
+}
+
+type Remote struct {
+ url string
+ refresh time.Duration
+}
+
+func startup(vars map[string]string) error {
+ // Don't set anything up if there's nothing to do
+ if name == "" || remote_url == "" {
+ return nil
+ }
+ _cpuData = make(map[string]int)
+ _tempData = make(map[string]int)
+ _netData = make(map[string]float64)
+ _diskData = make(map[string]float64)
+ _memData = make(map[string]MemoryInfo)
+
+ remoteLock = sync.Mutex{}
+ remotes := parseConfig(vars)
+ if remote_url != "" {
+ r := Remote{
+ url: remote_url,
+ refresh: 2 * time.Second,
+ }
+ if name == "" {
+ name = "Remote"
+ }
+ if sleep != 0 {
+ r.refresh = sleep
+ }
+ remotes[name] = r
+ }
+ if len(remotes) == 0 {
+ log.Println("Remote: no remote URL provided; disabling extension")
+ return nil
+ }
+ RegisterTemp(updateTemp)
+ RegisterMem(updateMem)
+ RegisterCPU(updateUsage)
+
+ // We need to know what we're dealing with, so the following code does two
+ // things, one of them sneakily. It forks off background processes
+ // to periodically pull data from remote sources and cache the results for
+ // when the UI wants it. When it's run the first time, it sets up a WaitGroup
+ // so that it can hold off returning until it's received data from the remote
+ // so that the rest of the program knows how many cores, disks, etc. it needs
+ // to set up UI elements for. After the first run, each process discards
+ // the wait group.
+ w := &sync.WaitGroup{}
+ for n, r := range remotes {
+ n = n + "-"
+ r.url = r.url
+ var u *url.URL
+ w.Add(1)
+ go func(name string, remote Remote, wg *sync.WaitGroup) {
+ for {
+ res, err := http.Get(remote.url)
+ if err == nil {
+ u, err = url.Parse(remote.url)
+ if err == nil {
+ if res.StatusCode == http.StatusOK {
+ bi := bufio.NewScanner(res.Body)
+ process(name, bi)
+ } else {
+ u.User = nil
+ log.Printf("unsuccessful connection to %s: http status %s", u.String(), res.Status)
+ }
+ } else {
+ log.Print("error processing remote URL")
+ }
+ } else {
+ }
+ res.Body.Close()
+ if wg != nil {
+ wg.Done()
+ wg = nil
+ }
+ time.Sleep(remote.refresh)
+ }
+ }(n, r, w)
+ }
+ w.Wait()
+ return nil
+}
+
+var (
+ _cpuData map[string]int
+ _tempData map[string]int
+ _netData map[string]float64
+ _diskData map[string]float64
+ _memData map[string]MemoryInfo
+)
+
+func process(host string, data *bufio.Scanner) {
+ remoteLock.Lock()
+ for data.Scan() {
+ line := data.Text()
+ if line == "" || line[0] == '#' {
+ continue
+ }
+ if !strings.HasPrefix(line, _gotop) {
+ continue
+ }
+ sub := line[6:]
+ switch {
+ case strings.HasPrefix(sub, _cpu): // int gotop_cpu_CPU0
+ procInt(host, line, sub[4:], _cpuData)
+ case strings.HasPrefix(sub, _temp): // int gotop_temp_acpitz
+ procInt(host, line, sub[5:], _tempData)
+ case strings.HasPrefix(sub, _net): // int gotop_net_recv
+ parts := strings.Split(sub[len(_net):], " ")
+ if len(parts) < 2 {
+ log.Printf(`bad data; not enough columns in "%s"`, line)
+ continue
+ }
+ val, err := strconv.ParseFloat(parts[1], 64)
+ if err != nil {
+ log.Print(err)
+ continue
+ }
+ _netData[host+parts[0]] = val
+ case strings.HasPrefix(sub, _disk): // float % gotop_disk_:dev:mmcblk0p1
+ parts := strings.Split(sub[5:], " ")
+ if len(parts) < 2 {
+ log.Printf(`bad data; not enough columns in "%s"`, line)
+ continue
+ }
+ val, err := strconv.ParseFloat(parts[1], 64)
+ if err != nil {
+ log.Print(err)
+ continue
+ }
+ _diskData[host+parts[0]] = val
+ case strings.HasPrefix(sub, _mem): // float % gotop_memory_Main
+ parts := strings.Split(sub[7:], " ")
+ if len(parts) < 2 {
+ log.Printf(`bad data; not enough columns in "%s"`, line)
+ continue
+ }
+ val, err := strconv.ParseFloat(parts[1], 64)
+ if err != nil {
+ log.Print(err)
+ continue
+ }
+ _memData[host+parts[0]] = MemoryInfo{
+ Total: 100,
+ Used: uint64(100.0 / val),
+ UsedPercent: val,
+ }
+ default:
+ // NOP! This is a metric we don't care about.
+ }
+ }
+ remoteLock.Unlock()
+}
+
+func procInt(host, line, sub string, data map[string]int) {
+ parts := strings.Split(sub, " ")
+ if len(parts) < 2 {
+ log.Printf(`bad data; not enough columns in "%s"`, line)
+ return
+ }
+ val, err := strconv.Atoi(parts[1])
+ if err != nil {
+ log.Print(err)
+ return
+ }
+ data[host+parts[0]] = val
+}
+
+func updateTemp(temps map[string]int) map[string]error {
+ remoteLock.Lock()
+ for name, val := range _tempData {
+ temps[name] = val
+ }
+ remoteLock.Unlock()
+ return nil
+}
+
+// FIXME The units are wrong: getting bytes, assuming they're %
+func updateMem(mems map[string]MemoryInfo) map[string]error {
+ remoteLock.Lock()
+ for name, val := range _memData {
+ mems[name] = val
+ }
+ remoteLock.Unlock()
+ return nil
+}
+
+func updateUsage(cpus map[string]int, _ bool) map[string]error {
+ remoteLock.Lock()
+ for name, val := range _cpuData {
+ cpus[name] = val
+ }
+ remoteLock.Unlock()
+ return nil
+}
+
+func parseConfig(vars map[string]string) map[string]Remote {
+ rv := make(map[string]Remote)
+ for key, value := range vars {
+ if strings.HasPrefix(key, "remote-") {
+ parts := strings.Split(key, "-")
+ if len(parts) == 2 {
+ log.Printf("malformed Remote extension configuration '%s'; must be 'remote-NAME-url' or 'remote-NAME-refresh'", key)
+ continue
+ }
+ name := parts[1]
+ remote, ok := rv[name]
+ if !ok {
+ remote = Remote{}
+ }
+ if parts[2] == "url" {
+ remote.url = value
+ } else if parts[2] == "refresh" {
+ sleep, err := strconv.Atoi(value)
+ if err != nil {
+ log.Printf("illegal Remote extension value for %s: '%s'. Must be a duration in seconds, e.g. '2'", key, value)
+ continue
+ }
+ remote.refresh = time.Duration(sleep) * time.Second
+ } else {
+ log.Printf("bad configuration option for Remote extension: '%s'; must be 'remote-NAME-url' or 'remote-NAME-refresh'", key)
+ continue
+ }
+ rv[name] = remote
+ }
+ }
+ return rv
+}
+
+const (
+ _gotop = "gotop_"
+ _cpu = "cpu_"
+ _temp = "temp_"
+ _net = "net_"
+ _disk = "disk_"
+ _mem = "memory_"
+)
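
Review note on `process()` above: the parser slices each line at fixed offsets, so the offset and the prefix constant have to be kept in sync by hand. A rough sketch of an alternative follows, assuming the exported lines look like `gotop_cpu_CPU0 12` (as the inline comments in `process()` suggest) and carry no Prometheus labels or timestamps. `parseMetric` is a hypothetical helper, not part of gotop.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kinds lists the metric families process() switches on.
var kinds = []string{"cpu", "temp", "net", "disk", "memory"}

// parseMetric splits one line such as "gotop_cpu_CPU0 12" into its kind
// ("cpu"), device ("CPU0"), and value. ok is false for blank lines,
// comments, metrics that are not gotop's, and malformed lines.
func parseMetric(line string) (kind, device string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", "", 0, false
	}
	if !strings.HasPrefix(line, "gotop_") {
		return "", "", 0, false
	}
	fields := strings.Fields(line[len("gotop_"):])
	if len(fields) < 2 {
		return "", "", 0, false
	}
	for _, k := range kinds {
		prefix := k + "_"
		// Require at least one character of device name after the prefix.
		if strings.HasPrefix(fields[0], prefix) && len(fields[0]) > len(prefix) {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return "", "", 0, false
			}
			return k, fields[0][len(prefix):], v, true
		}
	}
	return "", "", 0, false
}

func main() {
	for _, l := range []string{"# HELP gotop_cpu_CPU0", "gotop_cpu_CPU0 12", "gotop_net_recv 1024.5", "gotop"} {
		kind, dev, val, ok := parseMetric(l)
		fmt.Println(kind, dev, val, ok)
	}
}
```

Deriving every offset from the prefix string itself means a new metric family only needs one more entry in `kinds`.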
diff --git a/docs/extensions.md b/docs/extensions.md
index b88f63f..b013554 100644
--- a/docs/extensions.md
+++ b/docs/extensions.md
@@ -1,9 +1,17 @@
% Plugins
+# Current state
-# Extensions
+First, there were Go plugins. These turned out to be impractical: their limitations make them unsuitable for anything outside of a small, strict, and (one could argue) useless use case.
-- Plugins will supply an `Init()` function that will call the appropriate
+Then I tried external static extensions. This approach used a trick to copy and modify the gotop main executable, which then imported its own packages from upstream. This worked, but was awkward and required several steps to build.
+
+Currently, as I've only written two modules since I started down this path, and there's no clean, practical solution yet in Go, I've folded the extensions into the main codebase. This means there's no programmatic extension mechanism for gotop.
+
+
+# Devices
+
+- Devices supply an `Init()` function that will call the appropriate
`Register\*()` functions in the `github.com/xxxserxxx/gotop/devices` package.
- `devices` will supply:
- RegisterCPU (opt)
diff --git a/docs/remote-monitoring.md b/docs/remote-monitoring.md
new file mode 100644
index 0000000..48fd0a8
--- /dev/null
+++ b/docs/remote-monitoring.md
@@ -0,0 +1,62 @@
+# Remote monitoring extension for gotop
+
+
+Show data from gotop running on remote servers in a locally-running gotop. This allows gotop to be used as a simple terminal dashboard for remote servers.
+
+![Screenshot](/assets/screenshots/fourby.png)
+
+
+## Configuration
+
+gotop exports metrics on a local port with the `--export <port>` argument. This is a simple, read-only interface with the expectation that it will be run behind some proxy that provides security. A gotop built with this extension can read this data and render it as if the devices being monitored were on the local machine.
+
+On the local side, gotop gets the remote information from a config file; it is not possible to pass this in on the command line. The recommended approach is to create a remote-specific config file, and then run gotop with the `-C <remote-config-filename>` option. Two options are available for each remote server; one of these, the connection URL, is required.
+
+The configuration keys have the format `remote-SERVERNAME-url` and `remote-SERVERNAME-refresh`. `SERVERNAME` can be anything -- it doesn't have to reflect any real attribute of the server, but it will be used in widget labels for data from that server. For example, CPU data from `remote-Jerry-url` will show up as `Jerry-CPU0`, `Jerry-CPU1`, and so on; memory data will be labeled `Jerry-Main` and `Jerry-Swap`. If the refresh rate option is omitted, it defaults to 1 second.
+
+
+### An example
+
+One way to set this up is to run gotop behind [Caddy](https://caddyserver.com). The `Caddyfile` would have something like this in it:
+
+```
+gotop.myserver.net {
+ basicauth / gotopusername supersecretpassword
+ proxy / http://localhost:8089
+}
+```
+
+Then, gotop would be run in a persistent terminal session such as [tmux](https://github.com/tmux/tmux) with the following command:
+
+```
+gotop -x :8089
+```
+
+Then, on a local laptop, create a config file named `myserver.conf` with the following lines:
+
+```
+remote-myserver-url=https://gotopusername:supersecretpassword@gotop.myserver.net/metrics
+remote-myserver-refresh=2
+```
+
+Note the `/metrics` at the end -- don't omit that, and don't strip it in Caddy. The refresh value is in seconds. Run gotop with:
+
+```
+gotop -C myserver.conf
+```
+
+and you should see your remote server's sensors as if they were on your local machine.
+
+You can add as many remote servers as you like in the config file; just follow the naming pattern.
+
+## Why
+
+This can combine multiple servers into one view, which makes it more practical to use a terminal-based monitor when you have more than a couple of servers, or when you don't want to dedicate an entire wide-screen monitor to a bunch of gotop instances. It's simple to set up, configure, and run, and reasonably resource efficient.
+
+## How
+
+Since v3.5.2, gotop has been able to export its sensor data as [Prometheus](https://prometheus.io/) metrics using the `--export` flag. Prometheus is simple to integrate into clients and has a nice on-demand design in which the *aggregator* pulls data from monitors, rather than clients pushing data to a server. In essence, it inverts the usual client/server relationship between the aggregating server and the things it monitors. In gotop's case, this means you can turn on `-x` and it will not impact your gotop instance at all until you actively poll it. Control of the measurement frequency sits in a single place -- your local gotop. You can simply stop your local gotop instance (e.g., when you go to bed) and the demand on the servers you were monitoring drops to 0.
+
+On the client (local) side, sensors are abstracted as devices that are read by widgets, and we've simply implemented virtual devices that poll data from remote endpoints exporting Prometheus metrics. At a finer grain, a single goroutine is spawned for each remote server; it periodically polls that server and caches the results. When a widget updates and asks the virtual device for data, the device returns the cached values as the measurement.
+
+The next iteration will optimize the metrics transfer protocol; while it'll likely remain HTTP, optimizations may include HTTP/2.0 streams to reduce the HTTP connection overhead, and a binary payload format for the metrics -- although HTTP/2.0 compression may eliminate any benefit of doing that.
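
Review note on the polling design described under "How": the per-remote loop can be sketched in isolation as below. This is an illustration under stated assumptions, not the code in `devices/remote.go`. The helpers `redact`, `fetchOnce`, and `poll`, the 5-second client timeout, and the 1-second fallback refresh are all hypothetical choices made for the example. The points it tries to show are that a failed request returns a nil response (so there is no body to close), and that credentials embedded in the URL are stripped before anything is logged.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// client uses an assumed 5s timeout so a dead remote cannot hang a poll forever.
var client = &http.Client{Timeout: 5 * time.Second}

// redact removes any user:password part so the URL is safe to log.
func redact(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "<unparseable URL>"
	}
	u.User = nil
	return u.String()
}

// fetchOnce performs one GET and returns the response body split into lines.
func fetchOnce(rawURL string) ([]string, error) {
	res, err := client.Get(rawURL)
	if err != nil {
		// res is nil on this path, so it must not be touched.
		var ue *url.Error
		if errors.As(err, &ue) {
			return nil, fmt.Errorf("fetching %s: %w", redact(rawURL), ue.Err)
		}
		return nil, fmt.Errorf("fetching %s failed", redact(rawURL))
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: http status %s", redact(rawURL), res.Status)
	}
	var lines []string
	sc := bufio.NewScanner(res.Body)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines, sc.Err()
}

// poll fetches immediately, then once per refresh, until stop is closed.
// Errors are logged and the loop keeps running.
func poll(rawURL string, refresh time.Duration, handle func([]string), stop <-chan struct{}) {
	if refresh <= 0 {
		refresh = time.Second // assumed fallback; avoids a busy loop on a zero value
	}
	t := time.NewTicker(refresh)
	defer t.Stop()
	for {
		if lines, err := fetchOnce(rawURL); err != nil {
			log.Print(err)
		} else {
			handle(lines)
		}
		select {
		case <-t.C:
		case <-stop:
			return
		}
	}
}

func main() {
	// A throwaway local server stands in for a remote `gotop -x` instance.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "gotop_cpu_CPU0 12")
	}))
	defer srv.Close()

	lines, err := fetchOnce(srv.URL)
	fmt.Println(lines, err)
}
```

In a real caller, `poll` would run in its own goroutine per remote, with `handle` writing into the cache under a lock.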