Testing like a hardware engineer

I’m a software engineer who has mostly worked alongside hardware engineers, especially CPU designers and verification teams. Hardware and software engineering are superficially very similar- architect a design, implement it in a coding language, write tests, run the code through some tools to turn it into something useful, and ship. I don’t usually envy my computer engineering friends- release timelines are by nature long, expertise is often narrow, waterfall reigns, and the tooling has a substandard reputation. But I do envy the testing in the hardware world. To build chips they are confident in, hardware companies like Intel and ARM combine very normal techniques with deeply esoteric but powerful ones. Software projects rarely use any of these testing methodologies, but the underlying principles could make your next test plan far more comprehensive.

Generating correct data

For tests to be meaningful, there must be a definition of correctness. Software usually uses a combination of hand-coded expected values for each input and internal assertions. Hardware companies use these too (especially assertions), but generating expected values for a complex program running on a modern CPU is usually too hard to do by hand. Instead, a CPU company is likely to have a simulator. Or, many simulators.

The simulator is usually some C/C++ behavioral model of the architecture under test, with quality and fidelity to the real device varying radically by company and need. Many companies will have a variety of simulators with different tradeoffs- for example, a cycle-accurate simulator might be absurdly slow, but it lets you check that everything happens exactly when you expect. A faster simulator might just do dynamic binary translation to run the target architecture on x86- the exact timings won’t match, but you’ll get the expected data values nearly as fast as running natively, especially for loop-heavy tests which can reuse the translation. If a chip uses a well-documented architecture, the simulator need not even be written manually- a specification can be executed directly to generate expected values. This is obviously non-trivial, especially if the spec includes undefined behavior, but it can be a great way to arbitrate between conflicting simulators. It can also be a great way to experiment with possible architectural changes, and can help formally verify components with techniques we will see later.

Most software projects can’t afford to maintain a second implementation just for comparison, but it doesn’t have to double the development cost. A reference implementation can skip most performance and scalability concerns. It might use higher-level languages or off-the-shelf components that wouldn’t be suitable for production. In fact, since many refactorings maintain behavior but improve performance at the expense of clarity, a reference implementation might even just be an unoptimized, pre-refactor version of the build.
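In software terms, comparing against a reference implementation is essentially differential testing: run the production code and the simple reference side by side on lots of inputs and flag any disagreement. Here is a minimal sketch, where production_sort and reference_sort are stand-ins I invented for an optimized implementation and its deliberately simple reference:

    import random

    def production_sort(items):
        # Stand-in for the optimized, complicated production implementation.
        return sorted(items)

    def reference_sort(items):
        # Stand-in for the slow but obviously-correct reference implementation.
        result = []
        for x in items:
            i = 0
            while i < len(result) and result[i] <= x:
                i += 1
            result.insert(i, x)
        return result

    def differential_test(iterations=1000):
        for _ in range(iterations):
            data = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
            expected = reference_sort(data)   # the "simulator" output
            actual = production_sort(data)    # the implementation under test
            assert actual == expected, f"mismatch on {data}: {actual} != {expected}"

    if __name__ == "__main__":
        differential_test()
        print("production matches reference on all generated inputs")

The reference doesn’t need to be fast or clever- it just needs to be obviously correct and independent of the production code.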

Making tests interesting

Now that we have a way to check if our code is doing the right thing in a situation, we need to put it in situations that would reveal bugs. In the case of a CPU, that usually means “interesting” small programs, but for a software project it could mean any series of API calls. CPU companies do manually write lots and lots of unit and integration tests targeting specific features, but that process is basically identical to writing white-box integration tests in software, so let’s take it for granted and focus on random and formal testing.

Random tests

Random test generators (often called fuzzers) understand the domain of all possible inputs (e.g. sequences of 32-bit instructions for a CPU, JSON documents for many software systems) and sample test cases from that universe. The range of complexity here can be vast. A simple, fast, and sometimes surprisingly effective test generator might just read from /dev/random and throw bytes at an API. This is great if a random input is likely to be meaningful. But many purely random inputs just fail cursory input checks, so you end up testing your input validation very well and the core functionality not at all. Mutation-based generators instead take a known-good input, make random modifications to it, and use the modified input as a test case. The more dramatically the input is modified, the greater the variety of tests generated, but the more likely a test case is to simply fail input validation. Mutation-based generators can also be very effective, but they depend on a wide range of representative seed inputs to cover lots of ground.
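A minimal sketch of the mutation-based style, assuming you already have a corpus of known-good inputs and an entry point to exercise (seed_inputs and process below are placeholders, not real APIs):

    import random

    def mutate(seed: bytes, max_edits: int = 8) -> bytes:
        """Copy a known-good (non-empty) input and overwrite a few random bytes."""
        data = bytearray(seed)
        for _ in range(random.randint(1, max_edits)):
            data[random.randrange(len(data))] = random.randrange(256)
        return bytes(data)

    def fuzz(seed_inputs, process, iterations=10000):
        """Throw mutated seeds at the system under test; collect suspicious failures."""
        suspects = []
        for _ in range(iterations):
            case = mutate(random.choice(seed_inputs))
            try:
                process(case)            # placeholder for the real API under test
            except ValueError:
                pass                     # rejected by input validation: expected, not a bug
            except Exception:
                suspects.append(case)    # any other exception is a potential bug
        return suspects

    # Hypothetical usage: fuzz a JSON parser starting from one valid document.
    # import json
    # suspects = fuzz([b'{"name": "test", "values": [1, 2, 3]}'], json.loads)

Note how many mutated documents will simply be rejected as invalid input- exactly the validation-heavy behavior described above.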
To explore a wide range of internal state, it may be better to instead build a fuzzer which understands what an input “should” look like and randomly generates mostly-valid input. For example, a fuzzer for a calculator could understand that the “+” operator comes between two expressions, that an open parenthesis needs a close parenthesis, and so on. The requirements could be written formally as a grammar:

    Number = {[0-9], [0-9]$Number}    // A Number is a digit, optionally followed by more digits
    Binary_Operator = {-, +, *, /, ^}    // e.g. 4 ^ 3
    Unary_Operator = {-, sin, cos, tan}    // Allow negative numbers, trig functions
    Statement = {$Number, "(" $Statement ")", $Statement $Binary_Operator $Statement, $Unary_Operator $Statement}

The fuzzer could read these definitions and produce random statements (test cases) by selecting one of the four types of statement and repeatedly replacing the $ variables with generated instances of that type. For example:

    Test case = $Statement
    Test case = $Statement $Binary_Operator $Statement
    Test case = ( $Statement ) ^ $Number
    Test case = ( $Number ) ^ 4$Number
    Test case = ( 8 ) ^ 42$Number
    Test case = ( 8 ) ^ 428

This type of testing can generate very complex, interesting tests which might explore areas humans wouldn’t consider (for example, ‘ - ( - - - 2)’ would be a valid test in this grammar). More interesting areas can be weighted more heavily- for example, the tan operator might be more bug-prone than exponentiation, and should be tested more heavily. However, generating the resulting tests quickly can be tricky, and the most interesting test cases can be hard to express, especially if there are dependencies between parts of the input. (A small sketch of such a grammar-driven generator appears at the end of this section.)

So far, we’ve discussed building one input (test case) at a time. In a CPU, the most interesting bugs actually come from a series of inputs- for example, a program might have one instruction which overwrites instructions later in the program flow (a form of self-modifying code). Then another instruction might cause an exception, temporarily flushing a lot of state. When the program reaches the overwritten instructions, does it see the original or the revised code? Which was it supposed to see? With so much state maintained by a CPU, the interesting bugs are often in the interaction of several instructions, and may even depend on the exact timing between events.

To find sequence-dependent bugs, CPU verification teams use tools across the static-vs-dynamic generator spectrum. A static random sequence generator just generates a huge number of instructions, using one of the methods above, and hopes to find interesting sequences through sheer force of numbers. A dynamic generator models the internal state of the device under test and chooses the next item in the sequence based on that state. For example, if the memory unit is already under heavy load, you’re more likely to find a bug if the next instruction does a cache line flush on a recently used address, reachable from the current register state, than if it is a totally random instruction. Most sequence-based tools fall somewhere in the middle of the spectrum- for example, generating most instructions from a static template but choosing loads, stores, and branches carefully to make sure they mostly go to valid addresses. Dynamic generation is somewhat related to symbolic execution in the software world, though it typically relies less on inspection of the system under test and more on code written by the tester indicating what types of input to generate based on the current system state.
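Returning to the calculator grammar above, here is a minimal sketch of a grammar-driven generator. The table layout, depth limit, and digit handling are my own choices for illustration; per-production weights (e.g. favoring tan) could be added with random.choices:

    import random

    # Each rule maps to a list of productions; a "$Name" token means "expand rule Name".
    GRAMMAR = {
        "Number": [["[digit]"], ["[digit]", "$Number"]],
        "Binary_Operator": [["-"], ["+"], ["*"], ["/"], ["^"]],
        "Unary_Operator": [["-"], ["sin"], ["cos"], ["tan"]],
        "Statement": [
            ["$Number"],
            ["(", "$Statement", ")"],
            ["$Statement", "$Binary_Operator", "$Statement"],
            ["$Unary_Operator", "$Statement"],
        ],
    }

    def generate(symbol="Statement", depth=0, max_depth=8):
        """Expand one grammar symbol into a concrete string, bounding recursion depth."""
        if symbol == "[digit]":
            return str(random.randint(0, 9))
        productions = GRAMMAR[symbol]
        if depth >= max_depth:
            # Near the depth limit, take the shortest production so expansion terminates.
            productions = [min(productions, key=len)]
        chosen = random.choice(productions)   # random.choices(..., weights=...) would add weighting
        sep = "" if symbol == "Number" else " "
        return sep.join(
            generate(t[1:], depth + 1, max_depth) if t.startswith("$") else t
            for t in chosen
        )

    if __name__ == "__main__":
        for _ in range(5):
            print(generate())   # e.g. "( 35 ) ^ 7" or "- sin 2"

Even this toy version illustrates the tradeoffs above: the depth limit keeps generation fast and terminating, but it also biases which tests get produced.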

Formal tests

If you’ve generated a bunch of random inputs and ironed out the bugs, the obvious next step is to try all possible inputs and confirm there cannot possibly be bugs. That is rarely literally feasible. Instead, automated “formal” analysis tools can be used to exhaustively prove certain properties of a design, for example that a certain component is deadlock-free. The most common formal tools in hardware rely on assumptions to prove assertions. For each assertion, the tool inspects the code and searches for any combination of relevant inputs which, given the assumptions, could violate the assertion; either the assertion is proven (and can be reused as a new assumption) or a counterexample is produced. The technology is very similar to satisfiability solving- the underlying problem is NP-complete, yet many practical instances complete in reasonable time. In practice, the hardest part of formal verification is encoding enough assumptions and assertions for the tool to make progress in reasonable time, to really cover all feasible input, and to cover all the correctness properties of the system. Formal tools are used on select hardware subsystems, but rarely in software, with notable exceptions.
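To give a flavor of the assume/assert style in software, here is a small sketch using the z3-solver Python package (my choice of tool, not one named above) to prove, for all 2^32 inputs, that a branchless absolute-value trick matches a straightforward definition and returns a non-negative result- provided we assume the input is not INT_MIN:

    from z3 import BitVec, BitVecVal, Solver, Not, And, If, unsat   # pip install z3-solver

    x = BitVec("x", 32)

    mask = x >> 31                       # arithmetic shift: 0 for non-negative x, -1 otherwise
    branchless_abs = (x ^ mask) - mask   # candidate branch-free abs()
    reference_abs = If(x >= 0, x, -x)    # the "specification"

    s = Solver()
    s.add(x != BitVecVal(0x80000000, 32))              # assumption: x is not INT_MIN
    s.add(Not(And(branchless_abs == reference_abs,     # assertion: matches the reference...
                  branchless_abs >= 0)))               # ...and is non-negative
    # unsat means no input can violate the assertion under the assumption.
    print("proven" if s.check() == unsat else s.model())

Dropping the assumption makes the check report x = INT_MIN as a counterexample, since its absolute value cannot be represented in 32 bits- a small taste of the back-and-forth between assumptions and assertions described above.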

Testing the tests

You’ve spent a lot of time on these tests, but how do you know if the tests themselves are high quality?

Checking coverage

In both hardware and software, the most common metric for test quality is some version of coverage- what fraction of the things that could have happened, did happen? For example, what fraction of lines of code were exercised, or what fraction of if() statements had both their if and else branches execute (exercise for the reader- why is 100% line coverage not the same as 100% condition coverage?).

Coverage in the hardware world usually involves hand-coded “cover points”, with each cover point declaring the possible states of some property. For example, there might be a cover point for experiencing each possible exception type, a cover point for running in each possible protection ring, a cover point for whether a certain buffer is full, and so on. Frequently, a CPU test plan calls for crosses, meaning that each value one cover point can take should occur together with each possible value of another cover point, with exemptions manually added for the logically impossible (for example, certain privilege exceptions can’t occur if the CPU is already running at the most privileged level). The tendency for coverage to include crosses leads to an exponential explosion in the amount of coverage to hit. For coverage to remain a useful metric, a CPU company might maintain teams of engineers whose whole job is to write test coverage requirements. Writing good coverage requirements is key to making tests high quality, since coverage teams need to target crosses on interactions which are actually likely to reveal bugs, without requiring so many crosses that the test plan is totally infeasible.

Even with carefully chosen coverage, writing either directed or random tests to hit certain cover crosses can be very difficult. For example, a coverage cross might require a memory access to fail while a store buffer is full. But getting certain very large buffers to fill up can be tricky on its own, sometimes requiring carefully tuned code that maximizes performance in certain parts of the CPU while limiting the ability of another part to respond. Adding other requirements on top can make the coverage cross nearly impossible to hit, while also being hard to prove impossible (which would justify an exemption for that cross). To hit these hard-to-reach cases, one handy technique is to test a different system, related to the one you really want to test. For example, templatize the buffer size and build a version of the system with a buffer 1/10th the size of the production buffer. Test the modified version, and if it doesn’t have a bug in that test, you can have at least some confidence the real system doesn’t have bugs related to that buffer being full.
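A rough sketch of what hand-coded cover points and a cross might look like in software, with the cover points, states, and exemption all invented for illustration:

    from itertools import product

    # Hand-coded cover points: each declares the states we want to observe at least once.
    COVER_POINTS = {
        "exception": ["none", "page_fault", "divide_by_zero", "illegal_instruction"],
        "priv_level": ["user", "kernel"],
        "store_buffer_full": [False, True],
    }

    # A cross: every combination of two cover points' values...
    CROSS = [("exception", "priv_level")]
    # ...except combinations the test plan declares impossible (hypothetical example).
    EXEMPT = {("illegal_instruction", "kernel")}

    seen = set()

    def coverage_goals():
        goals = {(p, v) for p, values in COVER_POINTS.items() for v in values}
        for a, b in CROSS:
            for va, vb in product(COVER_POINTS[a], COVER_POINTS[b]):
                if (va, vb) not in EXEMPT:
                    goals.add(((a, va), (b, vb)))
        return goals

    def record(sample):
        """Call once per test (or cycle) with the observed value of every cover point."""
        for point, value in sample.items():
            seen.add((point, value))
        for a, b in CROSS:
            seen.add(((a, sample[a]), (b, sample[b])))

    def report():
        goals = coverage_goals()
        hit = seen & goals
        print(f"coverage: {len(hit)}/{len(goals)}")
        for missing in sorted(goals - hit, key=str):
            print("  not yet hit:", missing)

    # Hypothetical usage, fed by whatever instrumentation the tests provide:
    record({"exception": "none", "priv_level": "user", "store_buffer_full": False})
    report()

Even in this toy, adding a second cross would multiply the number of goals- the exponential explosion described above.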

Checking bugs

Rather than checking how much of the design is tested, you can evaluate test completeness by which bugs you’ve found, and how. The simplest heuristic is to just stop testing when continued testing stops finding bugs. Early in the project, you’ll find so many bugs that you can’t usefully continue testing until they are resolved. Later, each hand-written or randomly generated test becomes less and less likely to find a bug. At that point you can declare that if there are more bugs, they are unlikely to be very important, and start to ship the product.

If staring at bug curves and deciding they look flat doesn’t seem as data-driven as you’d like, you can instead try to estimate what fraction of bugs your tests would find in a theoretical system, and guess that that’s the fraction they did find in your system. For example, in mutation testing, you deliberately introduce a defect into the design and run the test suite. If the tests detect the error, the mutant is “killed”. The rate at which mutants are killed approximates the fraction of bugs the test suite has found (a toy sketch appears at the end of this section). The biggest issue with mutation testing is realism- unsophisticated mutations can introduce errors so egregiously wrong that they would never have made it to the test suite, so mutation testing tends to overestimate the effectiveness of a test suite.

The idea behind mutation testing is sound, but it needs a more realistic set of errors. Luckily, every project has a whole team constantly introducing realistic errors into the codebase- the engineers who originally wrote it. Instead of automatically generating new errors, it’s possible to re-introduce old errors identified through version control commit messages, or errors from related projects, and check that the test suite for this project identifies all of those bugs. There’s obviously some sample bias here- you’ll never find out that your tests miss a type of error you’ve never detected.

An interesting, if probably impractical, idea for estimating test completeness is to infer the total number of bugs from the overlap between two independent test suites. If two different test suites only find the same bugs, it’s probably because those are all the bugs that exist. If they each catch totally different bugs, it’s probably because there are tons of bugs and both test suites are finding only a small fraction of them. In the capture-mark-recapture method, run two test suites, report the number of bugs caught by each suite individually and the number caught by both, and assume the test suites are independent. Then:

    Estimated total bugs = (bugs caught by suite A * bugs caught by suite B) / bugs caught by both

Like mutation testing, this method is likely to overestimate the effectiveness of a test suite. Some bugs are easier to catch than others, and those bugs will be overrepresented in the “caught by both” category, which drags down the estimate of total bugs.
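Here is the promised toy sketch of mutation testing- the clamp function, its tests, and the operator swaps are all invented for illustration:

    import textwrap

    # The "design" under test, kept as source text so it can be mutated.
    SOURCE = textwrap.dedent("""
        def clamp(value, low, high):
            if value < low:
                return low
            if value > high:
                return high
            return value
    """)

    # The test suite whose quality we want to measure.
    def test_suite(clamp):
        assert clamp(5, 0, 10) == 5
        assert clamp(-3, 0, 10) == 0
        assert clamp(99, 0, 10) == 10

    # Crude mutations: change one small thing at a time in the source.
    SWAPS = [
        ("<", ">"),
        (">", "<"),
        ("return low", "return value"),
        ("return high", "return value"),
        ("<", "<="),   # a boundary-condition mutant the tests above fail to kill
    ]

    def make_mutants(source):
        for old, new in SWAPS:
            if old in source:
                yield source.replace(old, new, 1)

    def mutation_test():
        mutants = list(make_mutants(SOURCE))
        killed = 0
        for mutant in mutants:
            namespace = {}
            exec(mutant, namespace)              # build the mutated clamp()
            try:
                test_suite(namespace["clamp"])
                print("mutant survived (a gap in the tests, or a harmless mutant):")
                print(mutant)
            except AssertionError:
                killed += 1
        print(f"kill rate: {killed}/{len(mutants)}")

    if __name__ == "__main__":
        mutation_test()

Running this reports one surviving mutant, which is exactly the signal mutation testing is meant to provide: either the tests have a gap, or the mutation doesn’t meaningfully change behavior.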

Handling errors

Despite all this effort, sometimes bugs get through. Recalling and replacing a CPU is absurdly expensive, so computer engineering companies plan for the possibility of bugs. Usually the bugs are minor- every company produces errata documents listing small bugs, which are often ignored except in specialized workloads.

Sometimes a CPU designer has a sense that an optimization has the potential to be buggy, so they add undocumented functionality to disable the optimization and fall back to a more conservative path. If there really is an issue, these “chicken bits” can later be set, e.g. by a kernel, to sacrifice a little performance for correctness. After Spectre, for example, Intel released a microcode update enabling access to a new register controlling the optimizations that made the CPU vulnerable to the exploit. Some software projects have a similar fallback process- think of instructions to work around a bug by setting a secret environment variable (a sketch of what this looks like in code appears at the end of this section).

Finally, sometimes a CPU design can be changed partway through the manufacturing process. After all the transistors are fabricated, wires are deposited on top of the wafer to connect transistors to each other. These metal layers can be redesigned far more cheaply than redesigning every layer in the process, so companies sometimes use a “metal change” to work around a bug. The closest software analogy is hacking up a linker script to patch a bug because recompiling the software would be too expensive. I don’t recommend looking to metal changes for inspiration for your next bugfix.
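For completeness, here is what the environment-variable chicken bit mentioned above might look like- the variable name and parser functions are invented for illustration:

    import os

    # "Chicken bit": an escape hatch shipped in advance and normally left off.
    # If the fast path turns out to be buggy in the field, users can set
    # MYAPP_DISABLE_FAST_PARSER=1 rather than wait for a new release.
    USE_FAST_PARSER = os.environ.get("MYAPP_DISABLE_FAST_PARSER", "0") != "1"

    def fast_parser(data: bytes):
        ...   # hypothetical optimized implementation

    def simple_parser(data: bytes):
        ...   # hypothetical slow-but-trusted implementation

    def parse(data: bytes):
        if USE_FAST_PARSER:
            return fast_parser(data)
        return simple_parser(data)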

2021