2016-04-07

powerd++: Better CPU Clock Control for FreeBSD

Setting of P-States (power states a.k.a. steppings) on FreeBSD is managed by powerd(8). It has been with us since 2005, a time when the Pentium-M single-core architecture was the cutting edge choice for notebooks and dual-core just made its way to the desktop.

That is not to say that multi-core architectures were not considered when powerd was designed, but as the number of cores grows and hyper-threading has made its way onto notebook CPUs, powerd falls short.

Incentive

Don't you know it? You sit at your desk, reading technical documentation, occasionally scrolling or clicking on the next page link. The only (interactive) programs running are your web browser, an e-mail client and a couple of terminals waiting for input. There is a constant fan noise, which occasionally picks up for no apparent reason, making it a million times more annoying.

You can't work like this!

You start looking at the load, which is low but not minuscule. In the age of IMAP and node.js, web browsers and e-mail clients are always a little busy. Still, this is not enough to explain the fan noise.

You're running powerd to reduce your energy footprint (for various reasons), or are you? Yes you are. So you start monitoring dev.cpu.0.freq and it turns out your CPU clock is stuck at maximum like the speedometer of an adrenaline junkie with a death wish.

Something is wrong: your 15% to 30% load is way below the 50% default clock-down threshold of powerd. You start digging, thinking you can tune powerd to do the right thing. Turns out you can't.

An Introduction to powerd

The following illustration shows powerd's operation on a dual-CPU system with two cores and hyper-threading each. That is not a realistic system today, but it saves space in the illustration and contains all the cases that need to be covered.

Note that …

  • … the sysctl(3) interface flattens the architecture of the CPUs into a list of pipelines, each presented as individual CPUs.
  • … powerd has the first CPU hard coded as the one controlling the clock frequency for all cores.
  • … powerd uses the sum of all loads to control the clock frequency.

powerd Architecture

powerd's use of the sum of all loads to rate the overall load of the system allows single-threaded loads to trigger higher P-States, but comes at the cost of triggering high P-States with low distributed loads. The problem grows with the number of available cores. In the illustrated system a mean load of 12.5% results in a 100% load rating. The same applies to a single quad-core CPU with hyper-threading.

Another problem resulting from this approach is that the optimal boundaries for the hysteresis change with the number of cores. Also, to protect single-core loads, powerd only permits boundaries from 0% to 100%. This results in powerd changing into the highest P-State at the drop of a hat and only clocking down if the load is close to 0.

The Design of powerd++

The powerd++ design differs in three significant ways: how it manages the CPUs/cores/threads presented through the sysctl interface, how load is calculated, and how the target frequency is determined.

During its initialisation phase powerd++ assigns a frequency-controlling core to each core, grouping them by the core that offers the handle to change the clock frequency. Contrary to what the following illustration shows, all cores are currently controlled by dev.cpu.0, because the cpufreq(4) driver only supports global P-State changes. But powerd++ is built unaware of this limitation and will perform fine-grained control the moment the driver offers it.

To rate the load within a core group, each core determines its own load and then passes it to the controlling core. The controlling core uses the maximum of the loads in the group as the group load. This approach allows single threaded applications to cause high load ratings (i.e. up to 100%), but having small loads on all cores in a group still results in a small load rating. Another advantage of this design is that load ratings always stay within the 0% to 100% range. Thus the same settings (including the defaults) work equally well for any number of cores.
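The group-load rule can be sketched in a few lines of shell (the function name and the load values are purely illustrative, not from the powerd++ source, which is C++):

```shell
#!/bin/sh
# Group load rating as described above: take the maximum of the
# per-core loads, so the rating always stays within 0..100%.
group_load() {
	max=0
	for load in "$@"; do
		if [ "$load" -gt "$max" ]; then
			max=$load
		fi
	done
	echo "$max"
}

group_load 100 3 5 2     # a single busy thread rates the group at 100
group_load 12 13 12 13   # small loads on all cores still rate low: 13
```

With a sum instead of a maximum, the second call would rate 50, and four such groups would already saturate the rating.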

Instead of using a hysteresis to decide whether the clock frequency should be increased, lowered or kept, powerd++ uses a target load: it determines the frequency at which the current load would have rated as the target load. This approach results in quick frequency changes in either direction. E.g. given a target of 50% and a current load of 100%, the new clock frequency would be twice the current frequency. To reduce sensitivity to signal noise, more than two samples (5 by default) can be collected. This works as a low-pass filter, but is less damaging to the responsiveness of the system than increasing the polling interval.
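The target-load rule boils down to one line of arithmetic, sketched here in shell (function name and numbers are illustrative assumptions; powerd++ itself is written in C++):

```shell
#!/bin/sh
# Pick the frequency at which the current load would have rated
# as the target load: freq_new = freq_cur * load / target.
target=50   # target load in percent

next_freq() {
	# $1 = current frequency in MHz, $2 = current load in percent
	echo "$(($1 * $2 / target))"
}

next_freq 1200 100   # 100% load at a 50% target: double to 2400
next_freq 2400 25    # 25% load: halve to 1200
```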

powerd++ Architecture

Resources

The code is on github. A FreeBSD port is available as sysutils/powerdxx.

Afterthoughts

My experience in automotive and race car engineering came in handy. If your noise filter is not in O(1) (per frame), you're doing it wrong. If you have one control for many inputs, a maximum or minimum is usually the right choice; the sum rarely is. E.g. if you have 3 sensors that report 62°C, 74°C and 96°C, you want to adjust your coolant throughput for 96°C, not 232°C.

I hope that powerd++ will be widely used (within the community) and inspire the maintainers of cpufreq(4) to add support for per-CPU frequency controls.

TODOs

Currently the power source detection depends on ACPI; I need to implement something similar for older and non-x86/amd64 systems. For now, those just fall back to the unknown state.

2015-02-01

/bin/sh: Writing Your Own watch Command

The watch command in FreeBSD has a completely different function from the popular GNU command of the same name. Since I find GNU watch convenient, I wrote a short shell script to provide that functionality on my systems. The script is a nice way to show off some basics as well as some advanced shell-scripting features.

To resolve the ambiguity with watch(8) I called it observe on my system. My observe command takes the time to wait between updates as the first argument. Successive arguments are interpreted as commands to run. The following listing is the complete code:

#!/bin/sh
set -f
sleep=$1
clear=
shift

runcmd() {
        tput cm 0 0
        (eval "$@")
        tput AL `tput li`
}

trap 'runcmd "$@"; tput ve; exit' EXIT INT TERM
trap 'clear=1' HUP INFO WINCH

tput vi
clear
runcmd "$@"
while sleep $sleep; do
        eval ${clear:+clear;clear=}
        runcmd "$@"
done

Careful observers may notice that there is no parameter checking and the code is not commented. These shortcomings are part of what makes it a convenient example in a tutorial.

Turning Off Glob-Pattern Expansion

The second line already shows a good convention:

#!/bin/sh
set -f

The set builtin can be used to set parameters as if they were provided on the command line. It is also able to turn them off again, e.g. set +x would turn off tracing. The -f option turns off glob pattern expansion for command arguments. This is a good habit to pick up; glob pattern expansion is very dangerous in scripts. Of course the -f option could be set as part of the shebang, e.g. #!/bin/sh -f, but that would allow the script user to override it: by calling bash ./observe 2 ccache -s the shell could be invoked without setting the option, which is dangerous for options with safety implications.
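The effect is easy to demonstrate (pattern and variable name are made up for the example):

```shell
#!/bin/sh
set -f
pattern='*.txt'
# Unquoted expansion would normally glob against the files in the
# current directory; with -f the pattern is passed through literally.
printf '%s\n' $pattern   # prints: *.txt
set +f                   # expansion can be turned back on at any time
```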

Global Variable Initialisation

The next block initialises some global variables:

sleep=$1
clear=
shift

Initialising global variables at the beginning of a script is not just good style (because there is one place to find them all), it also protects the script from whatever the caller put into the environment using export or the interactive shell's equivalent.

The shift builtin can be a very useful feature. It throws away the first argument, so what was $2 becomes $1, $3 turns into $2 and so on. With an optional argument, the number of arguments to be removed can be specified.
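A tiny demonstration of both forms of shift (the function exists only for this example):

```shell
#!/bin/sh
# shift drops the first positional argument; with a count it
# drops several at once.
demo() {
	shift            # a b c d e -> b c d e
	first_after=$1   # b
	shift 2          # b c d e -> d e
	echo "$first_after $1 $#"
}

demo a b c d e   # prints: b d 2
```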

The runcmd Function

The runcmd function is responsible for invoking the command in a fashion that overwrites its last output:

runcmd() {
        tput cm 0 0
        (eval "$@")
        tput AL `tput li`
}

The tput(1) command is handy to directly talk to the terminal. What it can do depends on the terminal it is run in, so it is good practice to test it in as many terminals as possible. A list of available commands is provided by the terminfo(5) manual page. The following commands were used here:

  • cm: cursor_address #row #col
    Used to position the cursor in the top-left corner
  • AL: parm_insert_line #lines
    Used to push any garbage on the terminal (e.g. random key inputs) out of the terminal
  • li: lines
    Returns the number of terminal lines on stdout

The tput AL `tput li` basically acts as a clear below the cursor command.

The eval "$@" command executes all the arguments (apart from the one that was shifted away) as shell commands. The command is enclosed in parentheses to invoke it in a subshell. That effectively prevents it from affecting the script: it is not able to change signal handlers or variables of the script, because it runs in its own process.

Signal Handlers

Signal handlers provide a method of overriding the shell's default actions. The trap builtin takes the code to execute as the first argument, followed by a list of signals to catch. Providing a dash as the first argument restores the default action:

trap 'runcmd "$@"; tput ve; exit' EXIT INT TERM
trap 'clear=1' HUP INFO WINCH

The INT signal represents a user interrupt, usually caused by the user pressing CTRL+C. The TERM signal is a request to terminate; e.g. it is sent when the system shuts down. EXIT is a pseudo-signal that occurs when the shell terminates regularly, i.e. by reaching the end of the script (in this case if sleep fails) or by an exit call.

The HUP signal is frequently used to reconfigure daemons without terminating them. WINCH occurs when the terminal is resized. The INFO signal is a very useful BSDism. It is usually invoked by pressing CTRL+T and causes a process to print status information.

The Output Cycle

The output cycle heavily interacts with the signal handlers:

tput vi
clear
runcmd "$@"
while sleep $sleep; do
        eval ${clear:+clear;clear=}
        runcmd "$@"
done

The tput vi command hides the cursor, tput ve turns it back on.

The clear command clears up the terminal before the command is run the first time.

The runcmd "$@" call occurs once before the loop, because the first call within the loop occurs after the first sleep interval.

The clear global is set by the HUP/WINCH/INFO handler. The eval ${clear:+clear;clear=} line runs the clear command if the variable is set and resets it afterwards. The clear command is not run every cycle, because it would cause flickering. The ability to trigger it is required to clean up the screen in case a command does not override all the characters from a previous cycle.
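The ${clear:+…} expansion is easiest to understand in isolation (here echo stands in for the real clear command, and the variable is renamed for the example):

```shell
#!/bin/sh
# ${var:+word} expands to word only if var is set and non-empty,
# so eval either runs the commands in word or does nothing at all.
flag=
eval ${flag:+echo would clear;flag=}   # flag empty: expands to nothing
flag=1
eval ${flag:+echo would clear;flag=}   # runs: echo would clear; flag=
echo "flag is now empty: ${flag:-yes}"
```

Note that the semicolon is part of the expansion word, which is why a single eval can both run the command and reset the variable.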

Conclusion

If you made it here, thank you for reading this till the end! You probably already knew a lot of what you read. But maybe you also learned a trick or two. That's what I hope.

2015-01-17

/bin/sh: Using Named Pipes to Talk to Your Main Process

Screenshot
A shell application with the main thread providing output and status display for the worker processes

You want to fork off a couple of subshells and have them talk back to your main process? Then this post is for you.

What is a Named Pipe?

A named pipe is a pipe with a file system node. This allows arbitrary numbers of processes to read and write from the pipe, which in turn makes multiple usage scenarios possible. This post just covers one of them; others may be covered in future posts.

The Shell

The following examples should work in any Bourne Shell clone, such as the Almquist Shell (/bin/sh on FreeBSD) or the Bourne-Again Shell (bash).

HowTo

The first step is to create a Named Pipe. This can be done with the mkfifo(1) command:

# Get a temporary file name
node="$(mktemp -u)" || exit
# Create a named pipe
mkfifo -m0600 "$node" || exit

Running that code should produce a Named Pipe in /tmp.

The next step is to open a file descriptor. In this example a file descriptor is used for reading and writing, which avoids a number of pitfalls like deadlocking the script:

# Attach the pipe to file descriptor 3
exec 3<> "$node"
# Remove file system node
rm "$node"

Note how the file system node of the named pipe is removed immediately after assigning a file descriptor. The exec 3<> "$node" command has opened a permanent file descriptor, which remains open until manually closed or until the process terminates. So deleting the file system node will cause the system to remove the Named Pipe as soon as the process terminates, even when it is terminated by a signal like SIGINT (user presses CTRL-C).

Forking and Writing into the Named Pipe

From this point on the subshells can be forked using the & operator:

# This function does something
do_something() {
    echo "do_something() to stdout"
    echo "do_something() to named pipe" >&3
}

# Fork do_something()
do_something &
# Fork do_something(), attach stdout to the named pipe
do_something >&3 &

# Fork inline
(
    echo "inline to pipe" >&3
) &
# Fork inline, attach stdout to the named pipe
(
    echo "inline to stdout"
) >&3 &

Whether output is redirected per command or for the entire subshell is a matter of personal taste. Either way the processes inherit the file descriptor to the Named Pipe. It is also possible to redirect stderr as well, or redirect it into a different named pipe.

The Named Pipe is buffered, so all the subshells can start writing into it immediately. Once the buffer is full, processes trying to write into the pipe will block, so sooner or later the data needs to be read from the pipe.

Reading from the Named Pipe

To read from the pipe the shell-builtin command read is used.

Using non-builtin commands like head(1) usually leads to problems, because they may read more data from a pipe than they output, causing that data to be lost.

# Make sure white space does not get mangled by read (IFS only contains the newline character)
IFS='
'

# Blocking read, this will halt the process until data is available
read line <&3

# Non-blocking read that reads as much data as is currently available
line_count=0
lines=
while read -t0 line <&3; do
    line_count=$((line_count + 1))
    lines="$lines$line$IFS"
done

Using a blocking read causes the process to sleep until data is available. The process does not require any CPU time, the kernel takes care of waking the process.

That's all that is required to establish ongoing communication between your processes.

The direction of communication can be reversed to use the pipe as a job queue for forked processes. Or a second pipe can be used to establish 2-way communications. With just two processes a single pipe might suffice for two way communications. A named pipe can be connected to an ssh(1) session or nc(1).
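The job-queue variant can be sketched with the same pipe setup (the worker loop is shown inline here; in a real script it would be forked with &, and the DONE sentinel is a convention invented for this example):

```shell
#!/bin/sh
# Create the pipe exactly as described above.
node="$(mktemp -u)" || exit
mkfifo -m0600 "$node" || exit
exec 3<> "$node"
rm "$node"

# The main process queues jobs as lines.
for job in job-a job-b job-c; do
	echo "$job" >&3
done
echo "DONE" >&3   # sentinel telling a worker to stop

# A worker loop; each line is consumed by exactly one reader.
while read -r job <&3 && [ "$job" != "DONE" ]; do
	echo "worker handling $job"
done
```

With several forked workers, one sentinel per worker would have to be queued.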

Basically named pipes are a way to establish a pipe to background processes or completely independent processes, which do not even have to run on the same machine. So, happy hacking!

2014-09-27

Another day in my love affair with AWK

I consider myself a C/C++ developer. Right now I am embracing C++11 (I wanted to wait till it is actually well supported by compilers) and I am loving it.

Despite my happy relationship with C/C++ I have maintained a torrid affair with AWK for many years, which has spilled into this blog before:

  • Almost a year ago I concluded that MAWK is freakin' fast and GNU AWK freakin' fast as a snail
  • The past summer I stumbled over a bottleneck in the one-true-AWK, default for *BSD and Mac OS-X

A Matter of Accountability

So far circumstances dictated that either the script or the input data or both had to be kept secret. In this post both will be publicly available. The purpose of this post is to give people the chance to perform their own tests.

The following is required to perform the test:

The dbc2c.awk script was already part of my first post. It parses Vector DBC (Database CAN) files, an industry standard for describing a set of devices, messages and signals for the real-time bus CAN (one can argue it's soft real time; it depends). It does the following things:

  1. Parse data from 1 or more input files
  2. Store the data in arrays, use indexes as references to describe relationships
  3. Output the data
    1. Traverse the data structure and store attributes of objects in an array
    2. Read a template
    3. Insert data into the template and print on stdout

Test Environment

  • The operating system:
    FreeBSD AprilRyan.norad 10.1-BETA2 FreeBSD 10.1-BETA2 #0 r271856: Fri Sep 19 12:55:39 CEST 2014 root@AprilRyan.norad:/usr/obj/S403/amd64/usr/src/sys/S403 amd64
  • The compiler:
    FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
    Target: x86_64-unknown-freebsd10.1
    Thread model: posix
  • CPU: Core i7@2.4GHz (Haswell)
  • NAWK version: awk version 20121220 (FreeBSD)
  • MAWK version: mawk 1.3.4.20140914
  • GNU AWK version: GNU Awk 4.1.1, API: 1.1

Tests

With the recent changeset 219:01114669a8bf, the script switched from array iteration (for (index in array) { … }) to creating a numbered index for each object type and iterating through it in order of creation, to make sure data is output in the same order by every AWK implementation. This makes it much easier to compare and validate outputs from different flavours of AWK.

To reproduce the tests, run:

time -l awk -f scripts/dbc2c.awk -vDATE=whenever j1939_utf8.dbc | sha256

The checksum for the output should read:

9f0a105ed06ecac710c20d863d6adefa9e1154e9d3a01c681547ce1bd30890df

Here are my runtime results:

  • NAWK: 6.23 s, 6.32 s, 6.27 s
  • GAWK: 11.79 s, 11.88 s, 11.80 s
  • MAWK: 1.98 s, 2.02 s, 1.97 s

Memory usage (maximum resident set size):

  • NAWK: 22000 k
  • GAWK: 50688 k
  • MAWK: 26644 k

Conclusion

Once again the usual order of things establishes itself. GNU AWK wastes our time and memory while MAWK takes the winner's crown and NAWK keeps to the middle ground.

The dbc2c.awk script has been tested before, and GNU AWK actually performs much better this time: 6.0 instead of 9.6 times slower than MAWK. Maybe parsing just one file instead of 3 helps, or the input data produces fewer collisions for the hashing algorithm (AWK array indexes are always cast to string and stored in hash tables).

In any case, I'd love to see some more benchmarks out there. And maybe someone bringing their favourite flavour of AWK to the table.

2014-07-03

AWK Reloaded

Last year I compared the performance of 3 AWK interpreters: NAWK, GAWK and MAWK. For the test I used 3 of my .awk scripts (available under Beerware), but the data I processed with them was confidential. Anyway, NAWK won 1/3, MAWK 2/3 (with astonishing leads); GAWK was the clear loser with abysmal performance in 2/3 tests.

Recently I developed a run time interpreter for the Heidenhain NC (Numerical Control) language. The code of the script as well as the program it interprets are confidential, unfortunately. But the results are interesting nonetheless.

Tests

Test Environment

Unfortunately I cannot run the tests on the same machine as last time, it was stolen during a visit in the UK last winter.

So this time the tests are run on its replacement, an Intel Haswell Core i7 (2 cores, 4 pipelines) at 2.4 GHz under FreeBSD 10 r267867 (amd64).

AWK Versions
  • nawk 20121220 (FreeBSD)
  • gawk 4.1.1
  • mawk 1.3.4

hhrti.awk [1 pt/s]

This is a test run with the aforementioned run time interpreter. There is a more in depth explanation at the end of this article.

  • NAWK: 150.43 s, 149.41 s, 149.29 s
  • GAWK: 67.40 s, 67.86 s, 67.61 s
  • MAWK: 48.97 s, 47.48 s, 48.02 s

Memory usage (maximum resident set size):

  • NAWK: 2864 k
  • GAWK: 4240 k
  • MAWK: 2760 k

xml.awk [100 pt/s]

This is one of the tests run last time, to confirm that the interpreters still compare similarly with the previous scripts, despite the updated test platform. That seems to be the case here.

  • NAWK: 0.16 s, 0.16 s, 0.16 s
  • GAWK: 0.57 s, 0.57 s, 0.58 s
  • MAWK: 0.02 s, 0.02 s, 0.02 s

Conclusions

Consistently with the previous performance tests, MAWK takes the lead. What is surprising is that GAWK performs well, with only 1.38 times the run time of MAWK, a far cry from the abysmal performance it exhibited in some of the other tests. A quick rerun of the previous tests shows the same performance gaps as before, so neither the slight version changes nor the new compiler version (clang 3.4.1) introduced a performance boost in GAWK.

The real surprise is the performance of NAWK. This is the first test case where it performs worse than GAWK, with a runtime factor of 3.0. That's a far cry from GAWK's sad >25 in the xml.awk case, but it still hints at a bottleneck in NAWK.

Differences to Previous Tests

This test is a lot less array-heavy than the dbc2c.awk and xml.awk test cases. The parsing stage barely takes any time; after that only small local arrays are used for temporary tokenizing. The most time-consuming operation seems to be evaluating arithmetic, because whenever an operation is performed it creates a number of copy operations. Depending on the operator, all following tokens need to be shifted one or two places.

Bottleneck Test

In order to verify the assumption that copies in arrays might be responsible I created a small script that performs this operation repeatedly:

BEGIN {
 TXT = "111 222 333 444 555 666 777 888 999 000"
 REPEAT = 100000
 if (ARGC > 1) {
  REPEAT = ARGV[--ARGC]
  delete ARGV[ARGC]
 }
 srand() # Seed

 # Perform the test this many times
 for (i = 1; i <= REPEAT; i++) {

  # Create an array with tokens
  len = split(TXT, a)
  a[0] = len # Store the length in index 0, this is very
             # convenient in real apps with lots of arrays

  # Test case, delete a random field until none
  # are left
  while (a[0]) {
   # Select a random entry to delete
   del = int(rand() * 65536) % a[0] + 1

   # Shift the following tokens left
   for (p = del; p < a[0]; p++) {
    a[p] = a[p + 1]
   }
   # Delete the tail
   delete a[a[0]--]
  }
 }
}

bottleneck.awk [10 pt/s]

This artificial test seems to confirm the assumption, reproducing the same performance pattern and amplifying the performance problem of NAWK. The script was run with 200000 repetitions.

  • NAWK: 12.24 s, 12.28 s, 12.17 s
  • GAWK: 2.29 s, 2.29 s, 2.28 s
  • MAWK: 1.68 s, 1.63 s, 1.69 s

Memory usage (maximum resident set size):

  • NAWK: 2488 k
  • GAWK: 3596 k
  • MAWK: 2476 k

The Heidenhain Real Time Interpreter

The Heidenhain NC language can be used to control an NC mill, i.e. to access various functions of the machine, such as cooling systems and automatic tool changers, and to provide milling instructions. Additionally it has programming instructions that can be used to make on-the-fly calculations and decisions. The purpose of the interpreter is to perform the arithmetic and conditional flow in advance.

The need for this arose with a program written for a research project, which is so computation heavy that it causes the machine to stutter.

The interpreter works in two stages, a code parsing stage and an evaluation stage.

Parsing

In this stage every command is stored in a one-way linked list. Additional code files may be called within a program, those are parsed after the current file has been completed and appended to the same list.

The Heidenhain NC language has several kinds of commands, most of these are pretty static, they access machine functions, or describe target coordinates or curves. These kinds of commands are what the interpreter outputs in the evaluation stage.

The other kind of commands provide arithmetic and program flow:

  • Variable assignments
  • Arithmetic expressions (really part of variable assignments)
  • Labels
  • Label calls
  • Conditional label calls (i.e. IF)
  • Program calls

The list is not complete, but it should get the idea across.

Every list entry is classified during parsing stage, and some are preprocessed. E.g. labels and subprogram entries are recorded in an associative array so they can be branched to in the evaluation stage.

Evaluation

In this stage the interpretation is performed. The program starts with an empty call stack at the first parsed code line. Each line is evaluated according to its classification.

  • Variable assignments are evaluated and stored in an associative array containing name, value pairs
  • Label calls are performed
  • Conditions are evaluated and either branch to a label, that is fetched from the array recorded during parsing, or continue with the next line
  • Calls to other programs cause a reference to the current code line to be pushed on the stack
  • Ends of programs cause a return to the code line recorded on the stack; if the stack is empty the interpreter terminates

Every command that is not classified for special treatment receives the following default treatment:

  1. Substitute variables with their current values
  2. Output the command

The result is a flat NC program that no longer contains arithmetic and conditional code. E.g.:

0    BEGIN PGM mandelbrot_kante MM
1    BLK FORM 0.1 Z X+0 Y-90.0000 Z-50
2    BLK FORM 0.2 X+220.0000 Y+90.0000 Z+0
3    TOOL CALL 4 Z S32000 F8000
4    M3
5    L Z+20 FMAX
6    L X+110.0000 Y-52.2387 FMAX
7    L Z+2 FMAX
8    L Z-0.5000 F800
9    L X+110.2000 Y-52.2387 F8000
[...]
7758 L X+109.4000 Y-52.8387 F8000
7759 L X+109.4000 Y-52.6387 F8000
7760 L X+109.4000 Y-52.4387 F8000
7761 L X+109.4000 Y-52.2387 F8000
7762 L X+109.6000 Y-52.2387 F8000
7763 L X+109.8000 Y-52.2387 F8000
7764 L X+110.0000 Y-52.2387 F8000
7765 L X+110.2000 Y-52.2387 F8000
7766 L Z+50 FMAX
7767 END PGM mandelbrot_kante MM

2014-04-01

geli suspend/resume with Full Disk Encryption

This article was updated on 2014-04-01.

This article details my solution of the geli resume deadlock. It is the result of much fiddling and locking myself out of the file system.

The presented solution works most of the time, but it is still possible to lock up the system so far that VT-switching is no longer possible.

After my good old HP6510b notebook was stolen I decided to set up full disk encryption for its replacement. However after I set it up I faced the problem that the device would be wide open after resuming from suspend. That said I rarely reboot my system, I usually keep everything open permanently and suspend the laptop for transport or extended non-use. So the problem is quite severe.

Luckily the FreeBSD encryption solution geli(8) provides a mechanism called geli suspend that deletes the key from memory and stalls all processes trying to access the file system. Unfortunately geli resume would be one such process.

The System

So first things first, a quick overview of the system. If you ever set up full disk encryption yourself, you can probably skip ahead.

The boot partition containing the boot configuration, the kernel and its modules is not encrypted. It resides in the device ada0p2 labelled gpt/6boot. The encrypted device is ada0p4 labelled 6root. For easy maintenance and use the 6boot:/boot directory is mounted into 6root.eli:/boot (the .eli marks an attached encrypted device). Because /boot is a subdirectory in the 6boot file system, a nullfs(5) mount is required to access 6boot:/boot and mount it into 6root:/boot. To access 6boot:/boot, 6boot is mounted into /mnt/boot.

Usually mount automatically loads the required modules when invoked, but this doesn't work when the root file system doesn't contain them. So the required modules need to be loaded during the loader stage.

/boot/loader.conf
# Encrypted root file system
vfs.root.mountfrom="ufs:gpt/6root.eli"
geom_eli_load="YES"                     # FS crypto
aesni_load="YES"                        # Hardware AES

# Allow nullfs mounting /boot
nullfs_load="YES"
tmpfs_load="YES"

/etc/fstab
# Device           Mountpoint   FStype Options    Dump Pass
/dev/gpt/6root.eli /            ufs    rw,noatime 1    1
/dev/gpt/6boot     /mnt/boot    ufs    rw,noatime 1    1
/mnt/boot/boot     /boot        nullfs rw         0    0
/dev/gpt/6swap.eli none         swap   sw         0    0
# Temporary files
tmpfs              /tmp         tmpfs  rw         0    0
tmpfs              /var/run     tmpfs  rw         0    0

The Problem

The problem with geli suspend/resume is that calling geli resume ada0p4 deadlocks, because geli is located on the partition that is supposed to be resumed.

The Approach

The solution is quite simple. Put geli somewhere unencrypted.

To implement this several challenges need to be faced:

  • Challenge: Programming
    Approach: Shell-scripting
  • Challenge: Technology, avoiding file system access
    Approach: Use tmpfs(5)
  • Challenge: Usability, how to enter passphrases
    Approach: Use a system console
  • Challenge: Safety, the solution needs to be running before a suspend
    Approach: Use an always on, unauthenticated console
  • Challenge: Security, an unauthenticated interactive service is prone to abuse
    Approach: Only allow password entry, no other kinds of interactive control
  • Challenge: Safety, what about accidentally terminating the script
    Approach: Ignore SIGINT

The Script

The complete script can be found at the bottom.

Constants

At the beginning of the script some read-only variables (the closest available thing to constants) are defined, mostly for convenience and to avoid typos.

#!/bin/sh
set -f

readonly gcdir="/tmp/geliconsole"
readonly dyn="/sbin/geli;/usr/bin/grep;/bin/sleep;/usr/sbin/acpiconf"
readonly static="/rescue/sh"

Bootstrapping

The script is divided into two parts, the first part is the bootstrapping section that requires file system access and creates the tmpfs with everything that is needed to resume suspended partitions.

The bootstrap is performed in a conditional block that checks whether the script is running from gcdir. It ends with calling a copy of the script. The exec call means the bootstrapping process is replaced by the new call. The copy of the script will detect that it is running from the tmpfs and skip the bootstrapping:

# If this process isn't running from the tmpfs, bootstrap
if [ "${0#${gcdir}}" = "$0" ]; then
 …
 # Complete bootstrap
 exec "${gcdir}/sh" "${gcdir}/${0##*/}" "$@"
fi

Before completing the bootstrap, the tmpfs needs to be set up. Creating it is a good start:

# Create tmpfs
/bin/mkdir -p "${gcdir}"
/sbin/mount -t tmpfs tmpfs "$gcdir" || exit 1

# Copy the script before changing into gcdir, $0 might be a
# relative path
/bin/cp "$0" "${gcdir}/" || exit 1

# Enter tmpfs
cd "${gcdir}" || exit 1

The next step is to populate it with everything that is needed. I.e. all binaries required after performing the bootstrap. Two kinds of binaries are used, statically linked (see the static read-only) and dynamically linked (see the dyn read-only).

The static binaries can simply be copied into the tmpfs, the dynamically linked ones also require libraries, a list of which is provided by ldd(1).

Note the use of IFS (Input Field Separator) to split variables into multiple arguments and how subprocesses are used to limit the scope of IFS changes.

# Get shared objects
(IFS='
'
 for lib in $(IFS=';';/usr/bin/ldd -f '%p;%o\n' ${dyn}); do
  (IFS=';' ; /bin/cp ${lib})
 done
)

# Get executables
(IFS=';' ; /bin/cp ${dyn} ${static} "${gcdir}/")

The resulting tmpfs contains the binaries sh, geli, sleep, grep, acpiconf and all required libraries.
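Reduced to its essence, the splitting trick looks like this (paths shortened for the example):

```shell
#!/bin/sh
# Split a ';'-separated list by setting IFS inside a subshell,
# so the parent shell's IFS is left untouched.
dyn="/sbin/geli;/usr/bin/grep;/bin/sleep"
(IFS=';'
 for bin in ${dyn}; do
	echo "would copy ${bin}"
 done
)
```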

Interactive Stage

When reaching the interactive stage, the script is already run by a static shell within the tmpfs. The first order of business is to make sure the shell won't look for executables outside the tmpfs:

export PATH="./"

The next step is to trap some signals to make sure the script exits gracefully and is not terminated by pressing CTRL+C:

trap 'echo geliconsole: Exiting' EXIT
trap "/sbin/umount -f '${gcdir}' ; exit 0" SIGTERM
trap '' SIGINT SIGHUP

The last stage is a while-true loop that checks for suspended partitions and calls geli resume.

echo "geliconsole: Activated"
while :; do
 if geli list | grep -qFx 'State: SUSPENDED'; then
  geom="$(geli list | grep -FxB1 'State: SUSPENDED')"
  geom="${geom#Geom name: }"
  geom="${geom%%.eli*}"
  echo "geliconsole: Resume $geom"
  geli resume "$geom"
  echo .
 else
  sleep 2
 fi
done
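The two parameter expansions that extract the geom name can be traced with a canned sample of the grep -FxB1 output; the device name ada0p2 is hypothetical:

```shell
#!/bin/sh
# Sample of what `geli list | grep -FxB1 'State: SUSPENDED'` might
# return: the geom name line directly above the state line.
geom="Geom name: ada0p2.eli
State: SUSPENDED"
geom="${geom#Geom name: }"	# strip the leading label
geom="${geom%%.eli*}"		# cut everything from ".eli" onwards
echo "$geom"			# prints: ada0p2
```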

The System Console

Because the script does not take care of grabbing the right console, it cannot simply be run from /etc/ttys. Instead it needs to be started by getty(8). To do this a new entry in /etc/gettytab is required:

#
# geliconsole
#
geliconsole|gc.9600:\
 :al=root:tc=std.9600:lo=/root/bin/geliconsole:

The entry defines a new terminal type called geliconsole with auto login.

The new terminal can now be started by the init(8) process by adding the following line to /etc/ttys:

ttyvb "/usr/libexec/getty geliconsole" xterm on  secure

With kill -HUP 1 the init process can be notified of the change.

The console should now be available on virtual terminal ttyvb (CTRL+ALT+F12) and look similar to this:

FreeBSD/amd64 (AprilRyan.norad) (ttyvb)

geliconsole: Activated

Suspending

In order to suspend disks automatically, update /etc/rc.suspend:

…
/usr/bin/logger -t $subsystem suspend at `/bin/date +'%Y%m%d %H:%M:%S'`
/bin/sync && /bin/sync && /bin/sync
/bin/rm -f /var/run/rc.suspend.pid
/usr/sbin/vidcontrol -s 12 < /dev/ttyv0 > /dev/ttyv0
/sbin/geli suspend -a
# The following delay may be reduced; by how much depends on the system. I am using a 1 second delay.
/tmp/geliconsole/sleep 3

if [ $subsystem = "apm" ]; then
 /usr/sbin/zzz
else
 # Notify the kernel to continue the suspend process
 /tmp/geliconsole/acpiconf -k 0
fi

exit 0

The vidcontrol command VT-switches to the geli console before the geli command suspends all encrypted partitions. They can be recovered by pressing CTRL+ALT+F12 to enter the console and entering the passphrase there.

In order for the VT-switch to work reliably, the automatic VT switch to console 0 needs to be turned off:

# sysctl hw.syscons.sc_no_suspend_vtswitch=1
# echo hw.syscons.sc_no_suspend_vtswitch=1 >> /etc/sysctl.conf

Desirable Improvements

For people running X, especially with a version where X breaks the console (as is currently the case with KMS support), it would be nice to enter the passphrases through a screen locker.

Also, it is not really necessary to run the script with root privileges. A dedicated, less privileged user account should be created and used.

Files

/root/bin/geliconsole
#!/bin/sh
set -f

readonly gcdir="/tmp/geliconsole"
readonly dyn="/sbin/geli;/usr/bin/grep;/bin/sleep;/usr/sbin/acpiconf"
readonly static="/rescue/sh"

# If this process isn't running from the tmpfs, bootstrap
if [ "${0#${gcdir}}" = "$0" ]; then
 # Create tmpfs
 /bin/mkdir -p "${gcdir}"
 /sbin/mount -t tmpfs tmpfs "$gcdir" || exit 1

 # Copy the script before changing into gcdir, $0 might be a
 # relative path
 /bin/cp "$0" "${gcdir}/" || exit 1

 # Enter tmpfs
 cd "${gcdir}" || exit 1

 # Get shared objects
 (IFS='
'
  for lib in $(IFS=';';/usr/bin/ldd -f '%p;%o\n' ${dyn}); do
   (IFS=';' ; /bin/cp ${lib})
  done
 )

 # Get executables
 (IFS=';' ; /bin/cp ${dyn} ${static} "${gcdir}/")

 # Complete bootstrap
 exec "${gcdir}/sh" "${gcdir}/${0##*/}" "$@"
fi

export PATH="./"

trap 'echo geliconsole: Exiting' EXIT
trap "/sbin/umount -f '${gcdir}' ; exit 0" SIGTERM
trap '' SIGINT SIGHUP

echo "geliconsole: Activated"
while :; do
 if geli list | grep -qFx 'State: SUSPENDED'; then
  geom="$(geli list | grep -FxB1 'State: SUSPENDED')"
  geom="${geom#Geom name: }"
  geom="${geom%%.eli*}"
  echo "geliconsole: Resume $geom"
  geli resume "$geom"
  echo .
 else
  sleep 2
 fi
done

2013-11-25

Legacy Jails on a FreeBSD 10 Tinderbox

This unfinished article has been sitting here for some time; it was an attempt to keep track of my efforts to fix building legacy Tinderbox jails on FreeBSD 10. I have not made any progress for a while, so I decided to publish it. Maybe it is of use to someone else working on the same problems.

It has become customary for me to build libreoffice packages whenever an update is available in the ports tree. The packages are published on the BSDForen.de Wiki. Recently I updated the Tinderbox host system to FreeBSD 10.

The Error

I use the oldest supported release for each branch to maximize the number of people who can use the packages. I use apply(1) to update the whole batch of jails:

# cd /usr/local/tinderbox/jails
# apply 'tc makeJail -j' *
8.3-amd64: updating jail with SVN
8.3-amd64: cleaning out /usr/local/portstools/tinderbox/jails/8.3-amd64/obj
8.3-amd64: cleaning out /usr/local/portstools/tinderbox/jails/8.3-amd64/tmp
8.3-amd64: making world
ERROR: world failed - see /usr/local/portstools/tinderbox/jails/8.3-amd64/world.tmp
Cleaning up after Jail creation.  Please be patient.
8.3-i386: updating jail with SVN
8.3-i386: cleaning out /usr/local/portstools/tinderbox/jails/8.3-i386/obj
8.3-i386: cleaning out /usr/local/portstools/tinderbox/jails/8.3-i386/tmp
8.3-i386: making world
ERROR: world failed - see /usr/local/portstools/tinderbox/jails/8.3-i386/world.tmp
Cleaning up after Jail creation.  Please be patient.
...

So I checked the first log file 8.3-amd64/world.tmp:

--- upgrade_checks ---
A failure has been detected in another branch of the parallel make

make[1]: stopped in /usr/local/portstools/tinderbox/jails/8.3-amd64/src
*** [upgrade_checks] Error code 2

make: stopped in /usr/local/portstools/tinderbox/jails/8.3-amd64/src
1 error

make: stopped in /usr/local/portstools/tinderbox/jails/8.3-amd64/src

Diagnostics

That error message wasn't really useful; apparently make had some issues. So I hacked the Tinderbox a bit:

--- lib/tc_command.sh   19 Oct 2013 20:13:08 -0000      1.179
+++ lib/tc_command.sh   30 Oct 2013 08:06:26 -0000
@@ -889,7 +889,7 @@
         fi
 
         cd ${SRCBASE} && env DESTDIR=${J_TMPDIR} ${crossEnv} \
-           make -j${factor} -DNO_CLEAN world > ${jailBase}/world.tmp 2>&1
+           make -B -j${factor} -DNO_CLEAN world > ${jailBase}/world.tmp 2>&1
         rc=$?
         execute_hook "postJailBuild" "JAIL=${jailName} DESTDIR=${J_TMPDIR} JAIL_ARCH=${jailArch} MY_ARCH=${myArch} JAIL_OBJDIR=${JAIL_OBJDIR} SRCBASE=${SRCBASE} PB=${pb} RC=${rc}"
         if [ ${rc} -ne 0 ]; then

FreeBSD 10 comes with a new version of make that has some incompatibilities. The -B flag causes make to behave in a backwards-compatible fashion.

The next log turned out to be a lot more telling:

...
c++ -O2 -pipe -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/legacy/usr/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/lib -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf -c /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/src/version.cc
c++ -O2 -pipe -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/legacy/usr/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/lib -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf -c /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/lib/getline.cc
c++ -O2 -pipe -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/legacy/usr/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/lib -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf -c /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/gperf/../../../contrib/gperf/lib/hash.cc
make: don't know how to make /usr/lib/libstdc++.a. Stop
*** Error code 2

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src.
*** Error code 1

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src.
*** Error code 1

Stop.
make: stopped in /usr/local/portstools/tinderbox/jails/8.3-amd64/src

Fix C++

The meaning of the error should be clear to experienced FreeBSD admins: FreeBSD 10 introduces a new C++11-capable C++ stack, and the legacy jails require the old stack to bootstrap the build process.

The solution was to add a line to /etc/src.conf:

WITH_GNUCXX=1

And updating world:

# cd /usr/src
# make -DNO_CLEAN buildworld
...
--------------------------------------------------------------
>>> World build completed on Wed Oct 30 10:28:06 CET 2013
--------------------------------------------------------------
# make installworld
...

That didn't take long, because due to the NO_CLEAN flag only the missing parts were built instead of the entire world.

Note that the tc_command.sh hack should stay in place to ensure make compatibility. Unfortunately it seriously slows down makeJail, because it prevents parallel make. A workaround that does not require meddling with the bootstrapping process is to update several jails at the same time.
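That workaround boils down to backgrounding one update per jail and waiting for all of them. A hedged sketch of the pattern; do_update stands in for tc makeJail -j, which is not assumed to be available here:

```shell
#!/bin/sh
# Launch one background job per jail, then wait for all of them.
do_update() { echo "updated $1"; }	# stand-in for: tc makeJail -j "$1"
for jail in 8.3-amd64 8.3-i386; do
	do_update "$jail" &
done
wait
```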

So once again it was time to kick off the builds:

# cd /usr/local/tinderbox/jails
# apply 'tc makeJail -j' *
8.3-amd64: updating jail with SVN
8.3-amd64: cleaning out /usr/local/portstools/tinderbox/jails/8.3-amd64/obj
8.3-amd64: cleaning out /usr/local/portstools/tinderbox/jails/8.3-amd64/tmp
8.3-amd64: making world
ERROR: world failed - see /usr/local/portstools/tinderbox/jails/8.3-amd64/world.tmp
Cleaning up after Jail creation.  Please be patient.
8.3-i386: updating jail with SVN
8.3-i386: cleaning out /usr/local/portstools/tinderbox/jails/8.3-i386/obj
8.3-i386: cleaning out /usr/local/portstools/tinderbox/jails/8.3-i386/tmp
8.3-i386: making world
ERROR: world failed - see /usr/local/portstools/tinderbox/jails/8.3-i386/world.tmp
Cleaning up after Jail creation.  Please be patient.
...

Fix cc != gcc

So the builds got a lot further, but still didn't complete. It was time to look at 8.3-amd64/world.tmp again:

...
cc -O2 -pipe -DIN_GCC -DHAVE_CONFIG_H -DPREFIX=\"/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/usr\" -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../cc_tools -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../cc_tools -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/config -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/libcpp/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/libdecnumber   -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/legacy/usr/include -c /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/timevar.c
cc -O2 -pipe -DIN_GCC -DHAVE_CONFIG_H -DPREFIX=\"/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/usr\" -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../cc_tools -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../cc_tools -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/config -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/libcpp/include -I/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcclibs/libdecnumber   -I/usr/local/portstools/tinderbox/jails/8.3-amd64/obj/usr/local/portstools/tinderbox/jails/8.3-amd64/src/tmp/legacy/usr/include -DTARGET_NAME=\"amd64-undermydesk-freebsd\" -c /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.c
In file included from /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.c:58:
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/output.h:123:6: warning: 'format' attribute argument not supported: __asm_fprintf__ [-Wignored-attributes]
     ATTRIBUTE_ASM_FPRINTF(2, 3);
     ^
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/output.h:113:53: note: expanded from macro 'ATTRIBUTE_ASM_FPRINTF'
#define ATTRIBUTE_ASM_FPRINTF(m, n) __attribute__ ((__format__ (__asm_fprintf__, m, n))) ATTRIBUTE_NONNULL(m)
                                                    ^
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.c:542:1: error: redefinition of a 'extern inline' function 'floor_log2' is not supported in C99 mode
floor_log2 (unsigned HOST_WIDE_INT x)
^
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.h:174:1: note: previous definition is here
floor_log2 (unsigned HOST_WIDE_INT x)
^
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.c:577:1: error: redefinition of a 'extern inline' function 'exact_log2' is not supported in C99 mode
exact_log2 (unsigned HOST_WIDE_INT x)
^
/usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int/../../../../contrib/gcc/toplev.h:180:1: note: previous definition is here
exact_log2 (unsigned HOST_WIDE_INT x)
^
1 warning and 2 errors generated.
*** Error code 1

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc/cc_int.
*** Error code 1

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src/gnu/usr.bin/cc.
*** Error code 1

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src.
*** Error code 1

Stop in /usr/local/portstools/tinderbox/jails/8.3-amd64/src.
*** Error code 1

Stop.
make: stopped in /usr/local/portstools/tinderbox/jails/8.3-amd64/src

In retrospect the cause is obvious: pre-10 releases of FreeBSD expect cc to be gcc. So it was time to update /etc/src.conf again:

WITH_GCC=1
WITH_GNUCXX=1

And of course to update world like before.

So it was time for another try building a jail using gcc. To save time I opened multiple terminals (I use tmux) and built one jail in each:

# cd /usr/local/tinderbox/jails
# env CC=gcc CXX=g++ apply 'tc makeJail -j' 8.3-amd64
TODO: Insert output