Designing New Operating Primitives to Improve Fuzzing Performance
Posted by Ned Williamson, Project Zero
Introduction
When I started my 20% project – an initiative where employees are allocated twenty percent of their paid work time to pursue personal projects – with Project Zero, I wanted to see if I could apply the techniques I had learned fuzzing Chrome to XNU, the kernel used in iOS and macOS. My interest was sparked after learning some prominent members of the iOS research community believed the kernel was "fuzzed to death," and my understanding was that most of the top researchers used auditing for vulnerability research. This meant finding new bugs with fuzzing would be meaningful in demonstrating the value of implementing newer fuzzing techniques. In this project, I pursued a somewhat unusual approach to fuzz XNU networking in userland by converting it into a library, "booting" it in userspace and using my standard fuzzing workflow to discover vulnerabilities. Somewhat surprisingly, this worked well enough to reproduce some of my peers' recent discoveries and report some of my own, one of which was a reliable privilege escalation from the app context, CVE-2019-8605, dubbed "SockPuppet." I'm excited to open source this fuzzing project, "sockfuzzer," for the community to learn from and adapt. In this post, we'll do a deep dive into its design and implementation.
Attack Surface Review and Target Planning
Choosing Networking
We're at the beginning of a multistage project, and I had enormous respect for the difficulty of the task ahead of me. I knew I would need to be careful about investing time at each stage of the process, constantly looking for evidence that I needed to change direction. The first big decision was what exactly we wanted to target.
I started by downloading the XNU sources and reviewing them, looking for areas that handled a lot of attacker-controlled input and seemed amenable to fuzzing – immediately the networking subsystem jumped out as worthy of research. I had just exploited a Chrome sandbox bug that leveraged collaboration between an exploited renderer process and a server working in concert. I recognized these attack surfaces' power, where some security-critical code is "sandwiched" between two attacker-controlled entities. The Chrome browser process is prone to use after free vulnerabilities due to the difficulty of managing state for large APIs, and I suspected XNU would have the same issue. Networking features both parsing and state management. I figured that even if others had already fuzzed the parsers extensively, there could still be use after free vulnerabilities lying dormant.
I then proceeded to look at recent bug reports. Two bugs caught my eye: the mptcp overflow discovered by Ian Beer and the ICMP out-of-bounds write found by Kevin Backhouse. Both are somewhat "straightforward" buffer overflows. The bugs' simplicity hinted that kernel networking, even packet parsing, was sufficiently undertested. A fuzzer combining network syscalls and arbitrary remote packets should be broad enough in scope to reproduce these issues and find new ones.
Digging deeper, I wanted to understand how to reach these bugs in practice. By cross-referencing the functions and setting kernel breakpoints in a VM, I managed to get a more concrete idea. Here's the call stack for Ian's MPTCP bug:
The buggy function in question is mptcp_usr_connectx. Moving up the call stack, we find the connectx syscall, which we see in Ian's original testcase. If we were to write a fuzzer to find this bug, how would we do it? Ultimately, whatever we do has to both find the bug and give us the information we need to reproduce it on the real kernel. Calling mptcp_usr_connectx directly should surely find the bug, but this seems like the wrong idea because it takes a lot of arguments. Modeling a fuzzer well enough to call this function directly in a way representative of the real code is no easier than auditing the code in the first place, so we've not made things any easier by writing a targeted fuzzer. It's also wasted effort to write a target for each function this small. On the other hand, the further up the call stack we go, the more complexity we may have to support and the less chance we have of landing on the bug. If I were trying to unit test the networking stack, I would probably avoid the syscall layer and call the intermediate helper functions as a middle ground. This is exactly what I tried in the first draft of the fuzzer; I used sock_socket to create struct socket* objects to pass to connectitx in the hopes that it would be easy to reproduce this bug while being high-enough level that this bug could plausibly have been discovered without knowing where to look for it. Surprisingly, after some experimentation, it turned out to be easier to simply call the syscalls directly (via connectx). This makes it easier to translate crashing inputs into programs to run against a real kernel since testcases map 1:1 to syscalls. We'll see more details about this later.
We can't test networking properly without accounting for packets. In this case, data comes from the hardware, not via syscalls from a user process. We'll have to expose this functionality to our fuzzer. To figure out how to extend our framework to support random packet delivery, we can use our next example bug. Let's take a look at the call stack for delivering a packet to trigger the ICMP bug reported by Kevin Backhouse:
To reach the buggy function, icmp_error, the call stack is deeper, and unlike with syscalls, it's not immediately obvious which of these functions we should call to cover the relevant code. Starting from the very top of the call stack, we see that the crash occurred in a kernel thread running the dlil_input_thread_func function. DLIL stands for Data Link Interface Layer, a reference to the OSI model's data link layer. Moving further down the stack, we see ether_inet_input, indicating an Ethernet packet (since I tested this issue using Ethernet). We finally make it down to the IP layer, where ip_dooptions signals an icmp_error. As an attacker, we probably don't have a lot of control over the interface a user uses to receive our input, so we can rule out some of the uppermost layers. We also don't want to deal with threads in our fuzzer, another design tradeoff we'll describe in more detail later. proto_input and ip_proto_input don't do much, so I decided that ip_input was where I would inject packets, simply by calling the function when I wanted to deliver a packet. After reviewing proto_register_input, I discovered another function called ip6_input, which was the entry point for the IPv6 code. Here's the prototype for ip_input:
void ip_input(struct mbuf *m);
Mbufs are message buffers, a standard buffer format used in network stacks. They enable multiple small packets to be chained together through a linked list. So we just need to generate mbufs with random data before calling ip_input .
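To make the chaining concrete, here's a toy sketch of splitting a blob of fuzzer-generated bytes across a linked chain of fixed-size buffers. This is my own simplification (struct fake_mbuf, FAKE_MLEN, and the helpers are illustrative names), not XNU's real struct mbuf or the code used in sockfuzzer:

```c
/* Toy model of mbuf-style chaining: fixed-size buffers linked through a
 * next pointer, together holding one packet's worth of data. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_MLEN 256 /* hypothetical inline-data capacity per buffer */

struct fake_mbuf {
  struct fake_mbuf *next; /* next buffer in this packet's chain */
  int len;                /* bytes of data stored inline */
  unsigned char data[FAKE_MLEN];
};

/* Split an arbitrary byte blob across a chain of fake mbufs. */
struct fake_mbuf *chain_from_bytes(const unsigned char *buf, size_t n) {
  struct fake_mbuf *head = NULL, **tail = &head;
  while (n > 0) {
    size_t take = n < FAKE_MLEN ? n : FAKE_MLEN;
    struct fake_mbuf *m = calloc(1, sizeof(*m));
    memcpy(m->data, buf, take);
    m->len = (int)take;
    *tail = m;        /* append to the chain */
    tail = &m->next;
    buf += take;
    n -= take;
  }
  return head;
}

/* Walk the chain, summing the data length of each link. */
size_t chain_total_len(const struct fake_mbuf *m) {
  size_t total = 0;
  for (; m != NULL; m = m->next) total += (size_t)m->len;
  return total;
}
```

A 600-byte payload, for example, would land in three links (256 + 256 + 88); a fuzzer backend built this way just fills the blob with random bytes before handing the chain to the input function.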
I was surprised by how easy it was to work with the network stack compared to the syscall interface. ip_input and ip6_input are functions we can call without needing to set up any state first. But stepping back, this makes sense. Packet delivery is inherently a clean interface: our kernel has no idea what arbitrary packets may be coming in, so the interface takes a raw packet and decides further down the stack how to handle it. Many packets contain metadata that affects the kernel state once received. For example, TCP or UDP packets will be matched to an existing connection by their port number.
Most modern coverage guided fuzzers, including this LibFuzzer-based project, use a design inspired by AFL. When a test case with some known coverage is mutated and the mutant produces coverage that hasn't been seen before, the mutant is added to the current corpus of inputs. It becomes available for further mutations to produce even deeper coverage. Lcamtuf, the author of AFL, has an excellent demonstration of how this algorithm created JPEGs using coverage feedback with no well-formed starting samples. In essence, most poorly-formed inputs are rejected early. When a mutated input passes a validation check, the input is saved. Then that input can be mutated until it manages to pass the second validation check, and so on. This hill climbing algorithm has no problem generating dependent sequences of API calls, in this case to interleave syscalls with ip_input and ip6_input . Random syscalls can get the kernel into some state where it's expecting a packet. Later, when libFuzzer guesses a packet that gets the kernel into some new state, the hill climbing algorithm will record a new test case when it sees new coverage. Dependent sequences of syscalls and packets are brute-forced in a linear fashion, one call at a time.
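The feedback loop described above can be sketched in miniature. In this toy (entirely illustrative: "coverage" is just a matched-prefix length standing in for real edge coverage, and the corpus holds a single entry), a mutant is only promoted when it reaches coverage no prior input has, so dependent progress is brute-forced one step at a time:

```c
/* Toy sketch of the AFL-style feedback loop: mutate a corpus entry and keep
 * the mutant only if it produces coverage we haven't seen before. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SECRET "SYN!" /* the input the "target" wants, byte by byte */
#define INPUT_LEN 4

/* Stand-in for running the instrumented target: the coverage score is how
 * many leading bytes of the input match the secret. */
static int run_target(const unsigned char *input) {
  int cov = 0;
  while (cov < INPUT_LEN && input[cov] == (unsigned char)SECRET[cov]) cov++;
  return cov;
}

/* One fuzzing campaign; returns 1 if the full match ("bug") was reached. */
int fuzz_campaign(unsigned seed, int max_iters) {
  unsigned char best[INPUT_LEN] = {0}; /* single-entry corpus */
  int best_cov = run_target(best);
  srand(seed);
  for (int i = 0; i < max_iters; i++) {
    unsigned char mutant[INPUT_LEN];
    memcpy(mutant, best, INPUT_LEN);
    /* Mutate one random byte of the best-known input. */
    mutant[rand() % INPUT_LEN] = (unsigned char)(rand() % 256);
    int cov = run_target(mutant);
    if (cov > best_cov) { /* new coverage: promote mutant into the corpus */
      memcpy(best, mutant, INPUT_LEN);
      best_cov = cov;
    }
    if (best_cov == INPUT_LEN) return 1;
  }
  return 0;
}
```

Guessing all four bytes at once would take billions of attempts on average; climbing one byte at a time takes a few thousand. The same dynamic lets interleaved syscalls and packets build on each other.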
Designing for (Development) Speed
Now that we know where to attack this code base, it's a matter of building out the fuzzing research platform. I like thinking of it this way because it emphasizes that this fuzzer is a powerful assistant to a researcher, but it can't do all the work. Like any other test framework, it empowers the researcher to make hypotheses and run experiments over code that looks buggy. For the platform to be helpful, it needs to be comfortable and fun to work with and get out of the way.
When it comes to standard practice for kernel fuzzing, there's a pretty simple spectrum for strategies. On one end, you fuzz self-contained functions that are security-critical, e.g., OSUnserializeBinary. These are easy to write and manage and are generally quite performant. On the other end, you have "end to end" kernel testing that performs random syscalls against a real kernel instance. These heavyweight fuzzers have the advantage of producing issues that you know are actionable right away, but setup and iterative development are slower. I wanted to try a hybrid approach that could preserve some of the benefits of each style. To do so, I would port the networking stack of XNU out of the kernel and into userland while preserving as much of the original code as possible. Kernel code can be surprisingly portable and amenable to unit testing, even when run outside its natural environment.
There has been a push to add more user-mode unit testing to Linux. If you look at the documentation for Linux's KUnit project, there's an excellent quote from Linus Torvalds: "… a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true. That is not what performance is all about. If you can do something really fast, really well, people will start using it differently." This statement echoes the experience I had writing targeted fuzzers for code in Chrome's browser process. Due to extensive unit testing, Chrome code is already well-factored for fuzzing. In a day's work, I could try out many iterations of a fuzz target thanks to a fast edit/build/run cycle. I didn't have a similar mechanism out of the box with XNU. In order to perform a unit test, I would need to rebuild the kernel. And despite XNU being considerably smaller than Chrome, incremental builds were slower due to the older kmk build system. I wanted to try bridging this gap for XNU.
Setting up the Scaffolding
"Unit" testing a kernel up through the syscall layer sounds like a big task, but it's easier than you'd expect if you forgo some complexity. We'll start by building all of the individual kernel object files from source using the original build flags. But instead of linking everything together to produce the final kernel binary, we link in only the subset of objects containing code in our target attack surface. We then stub or fake the rest of the functionality. Thanks to the recon in the previous section, we already know which functions we want to call from our fuzzer. I used that information to prepare a minimal list of source objects to include in our userland port.
Before we dive in, let's define the overall structure of the project as pictured below. There's going to be a fuzz target implemented in C++ that translates fuzzed inputs into interactions with the userland XNU library. The target code, libxnu , exposes a few wrapper symbols for syscalls and ip_input as mentioned in the attack surface review section. The fuzz target also exposes its random sequence of bytes to kernel APIs such as copyin or copyout , whose implementations have been replaced with fakes that use fuzzed input data.
To make development more manageable, I decided to create a new build system using CMake, as it supported Ninja for fast rebuilds. One drawback here is the original build system has to be run every time upstream is updated to deal with generated sources, but this is worth it to get a faster development loop. I captured all of the compiler invocations during a normal kernel build and used those to reconstruct the flags passed to build the various kernel subsystems. Here's what that first pass looks like:
project(libxnu)

set(XNU_DEFINES
    -DAPPLE
    -DKERNEL
    # ...
)

set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    bsd/net/if.c
    bsd/netinet/ip_input.c
    # ...
)

add_library(xnu SHARED ${XNU_SOURCES} ${FUZZER_FILES} ${XNU_HEADERS})

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer PRIVATE ${FUZZER_CXX_FLAGS})
Of course, without the rest of the kernel, we see tons of missing symbols.
"_zdestroy", referenced from:
    _if_clone_detach in libxnu.a(if.c.o)
"_zfree", referenced from:
    _kqueue_destroy in libxnu.a(kern_event.c.o)
    _knote_free in libxnu.a(kern_event.c.o)
    _kqworkloop_get_or_create in libxnu.a(kern_event.c.o)
    _kev_delete in libxnu.a(kern_event.c.o)
    _pipepair_alloc in libxnu.a(sys_pipe.c.o)
    _pipepair_destroy_pipe in libxnu.a(sys_pipe.c.o)
    _so_cache_timer in libxnu.a(uipc_socket.c.o)
    ...
"_zinit", referenced from:
    _knote_init in libxnu.a(kern_event.c.o)
    _kern_event_init in libxnu.a(kern_event.c.o)
    _pipeinit in libxnu.a(sys_pipe.c.o)
    _socketinit in libxnu.a(uipc_socket.c.o)
    _unp_init in libxnu.a(uipc_usrreq.c.o)
    _cfil_init in libxnu.a(content_filter.c.o)
    _tcp_init in libxnu.a(tcp_subr.c.o)
    ...
"_zone_change", referenced from:
    _knote_init in libxnu.a(kern_event.c.o)
    _kern_event_init in libxnu.a(kern_event.c.o)
    _socketinit in libxnu.a(uipc_socket.c.o)
    _cfil_init in libxnu.a(content_filter.c.o)
    _tcp_init in libxnu.a(tcp_subr.c.o)
    _ifa_init in libxnu.a(if.c.o)
    _if_clone_attach in libxnu.a(if.c.o)
    ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
To get our initial targeted fuzzer working, we can do a simple trick: link against a file containing stubbed implementations of all of these missing symbols. We take advantage of C's weak type system here: for each function we need to implement, we link an implementation of the form void func() { assert(false); }. The arguments passed to the function are simply ignored, and a crash will occur whenever the target code attempts to call it. The same goal can be achieved with linker flags, but this solution was simple enough and gave me nice backtraces whenever I hit an unimplemented function.
// Unimplemented stub functions
// These should be replaced with real or mock impls.
#include <kern/assert.h>
#include <stdbool.h>

int printf(const char *format, ...);

void Assert(const char *file, int line, const char *expression) {
  printf("%s: assert failed on line %d: %s\n", file, line, expression);
  __builtin_trap();
}

void IOBSDGetPlatformUUID() { assert(false); }
void IOMapperInsertPage() { assert(false); }
// ...
Then we just link this file into the XNU library we're building by adding it to the source list:
set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    # ...
    fuzz/syscall_wrappers.c
    fuzz/ioctl.c
    fuzz/backend.c
    fuzz/stubs.c
    fuzz/fake_impls.c
)
As you can see, there are some other files I included in the XNU library that represent faked implementations and helper code to expose some internal kernel APIs. To make sure our fuzz target will call code in the linked library, and not some other host functions (syscalls) with a clashing name, we hide all of the symbols in libxnu by default and then expose a set of wrappers that call those functions on our behalf. I hide all the names by default using a CMake setting set_target_properties(xnu PROPERTIES C_VISIBILITY_PRESET hidden) . Then we can link in a file (fuzz/syscall_wrappers.c) containing wrappers like the following:
__attribute__((visibility("default"))) int accept_wrapper(int s, caddr_t name,
                                                          socklen_t *anamelen,
                                                          int *retval) {
  struct accept_args uap = {
      .s = s,
      .name = name,
      .anamelen = anamelen,
  };
  return accept(kernproc, &uap, retval);
}
Note the visibility attribute that explicitly exports the symbol from the library. Due to the simplicity of these wrappers, I created a script, generate_fuzzer.py, that generates them automatically from syscalls.master.
With the stubs in place, we can start writing a fuzz target now and come back to deal with implementing them later. We will see a crash every time the target code attempts to use one of the functions we initially left out. Then we get to decide to either include the real implementation (and perhaps recursively require even more stubbed function implementations) or to fake the functionality.
A bonus of getting a build working with CMake was to create multiple targets with different instrumentation. Doing so allows me to generate coverage reports using clang-coverage:
target_compile_options(xnu-cov PRIVATE ${XNU_C_FLAGS} -DLIBXNU_BUILD=1 -D_FORTIFY_SOURCE=0 -fprofile-instr-generate -fcoverage-mapping)
With that, we just add a fuzz target file and a protobuf file to use with protobuf-mutator and we're ready to get started:
protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer
                       PRIVATE -g
                               -std=c++11
                               -Werror
                               -Wno-address-of-packed-member
                               ${FUZZER_CXX_FLAGS})

if(APPLE)
  target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES})
else()
  target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES} pthread)
endif(APPLE)
Writing a Fuzz Target
At this point, we've assembled a chunk of XNU into a convenient library, but we still need to interact with it by writing a fuzz target. At first, I thought I might write many targets for different features, but I decided to write one monolithic target for this project. I'm sure fine-grained targets could do a better job for functionality that's harder to fuzz, e.g., the TCP state machine, but we will stick to one for simplicity.
We'll start by specifying an input grammar using protobuf, part of which is depicted below. This grammar is completely arbitrary and will be used by a corresponding C++ harness that we will write next. LibFuzzer has a plugin called libprotobuf-mutator that knows how to mutate protobuf messages. This will enable us to do grammar-based mutational fuzzing efficiently, while still leveraging coverage guided feedback. This is a very powerful combination.
message Socket {
  required Domain domain = 1;
  required SoType so_type = 2;
  required Protocol protocol = 3;
  // TODO: options, e.g. SO_ACCEPTCONN
}

message Close {
  required FileDescriptor fd = 1;
}

message SetSocketOpt {
  optional Protocol level = 1;
  optional SocketOptName name = 2;
  // TODO(nedwill): structure for val
  optional bytes val = 3;
  optional FileDescriptor fd = 4;
}

message Command {
  oneof command {
    Packet ip_input = 1;
    SetSocketOpt set_sock_opt = 2;
    Socket socket = 3;
    Close close = 4;
  }
}

message Session {
  repeated Command commands = 1;
  required bytes data_provider = 2;
}
I left some TODO comments intact so you can see how the grammar can always be improved. As I've done in similar fuzzing projects, I have a top-level message called Session that encapsulates a single fuzzer iteration or test case. This session contains a sequence of "commands" and a sequence of bytes that can be used when random, unstructured data is needed (e.g., when doing a copyin ). Commands are syscalls or random packets, which in turn are their own messages that have associated data. For example, we might have a session that has a single Command message containing a "Socket" message. That Socket message has data associated with each argument to the syscall. In our C++-based target, it's our job to translate messages of this custom specification into real syscalls and related API calls. We inform libprotobuf-mutator that our fuzz target expects to receive one "Session" message at a time via the macro DEFINE_BINARY_PROTO_FUZZER .
DEFINE_BINARY_PROTO_FUZZER(const Session &session) {
  // ...
  std::set<int> open_fds;
  for (const Command &command : session.commands()) {
    int retval = 0;
    switch (command.command_case()) {
      case Command::kSocket: {
        int fd = 0;
        int err = socket_wrapper(command.socket().domain(),
                                 command.socket().so_type(),
                                 command.socket().protocol(), &fd);
        if (err == 0) {
          // Make sure we're tracking fds properly.
          if (open_fds.find(fd) != open_fds.end()) {
            printf("Found existing fd %d\n", fd);
            assert(false);
          }
          open_fds.insert(fd);
        }
        break;
      }
      case Command::kClose: {
        open_fds.erase(command.close().fd());
        close_wrapper(command.close().fd(), nullptr);
        break;
      }
      case Command::kSetSockOpt: {
        int s = command.set_sock_opt().fd();
        int level = command.set_sock_opt().level();
        int name = command.set_sock_opt().name();
        size_t size = command.set_sock_opt().val().size();
        std::unique_ptr<char[]> val(new char[size]);
        memcpy(val.get(), command.set_sock_opt().val().data(), size);
        setsockopt_wrapper(s, level, name, val.get(), size, nullptr);
        break;
      }
While syscalls are typically a straightforward translation of the protobuf message, other commands are more complex. In order to improve the structure of randomly generated packets, I added custom message types that I then converted into the relevant on-the-wire structure before passing it into ip_input. Here's how this looks for TCP:
message Packet {
  oneof packet {
    TcpPacket tcp_packet = 1;
  }
}

message TcpPacket {
  required IpHdr ip_hdr = 1;
  required TcpHdr tcp_hdr = 2;
  optional bytes data = 3;
}

message IpHdr {
  required uint32 ip_hl = 1;
  required IpVersion ip_v = 2;
  required uint32 ip_tos = 3;
  required uint32 ip_len = 4;
  required uint32 ip_id = 5;
  required uint32 ip_off = 6;
  required uint32 ip_ttl = 7;
  required Protocol ip_p = 8;
  required InAddr ip_src = 9;
  required InAddr ip_dst = 10;
}

message TcpHdr {
  required Port th_sport = 1;
  required Port th_dport = 2;
  required TcpSeq th_seq = 3;
  required TcpSeq th_ack = 4;
  required uint32 th_off = 5;
  repeated TcpFlag th_flags = 6;
  required uint32 th_win = 7;
  required uint32 th_sum = 8;
  required uint32 th_urp = 9;
  // Ned's extensions
  required bool is_pure_syn = 10;
  required bool is_pure_ack = 11;
}
Unfortunately, protobuf doesn't support a uint8 type, so I had to use uint32 for some fields. That's some lost fuzzing performance. You can also see some synthetic TCP header flags I added to make certain flag combinations more likely: is_pure_syn and is_pure_ack . Now I have to write some code to stitch together a valid packet from these nested fields. Shown below is the code to handle just the TCP header.
std::string get_tcp_hdr(const TcpHdr &hdr) {
  struct tcphdr tcphdr = {
      .th_sport = (unsigned short)hdr.th_sport(),
      .th_dport = (unsigned short)hdr.th_dport(),
      .th_seq = __builtin_bswap32(hdr.th_seq()),
      .th_ack = __builtin_bswap32(hdr.th_ack()),
      .th_off = hdr.th_off(),
      .th_flags = 0,
      .th_win = (unsigned short)hdr.th_win(),
      .th_sum = 0,  // TODO(nedwill): calculate the checksum instead of skipping it
      .th_urp = (unsigned short)hdr.th_urp(),
  };
  for (const int flag : hdr.th_flags()) {
    tcphdr.th_flags ^= flag;
  }
  // Prefer pure syn
  if (hdr.is_pure_syn()) {
    tcphdr.th_flags &= ~(TH_RST | TH_ACK);
    tcphdr.th_flags |= TH_SYN;
  } else if (hdr.is_pure_ack()) {
    tcphdr.th_flags &= ~(TH_RST | TH_SYN);
    tcphdr.th_flags |= TH_ACK;
  }
  std::string dat((char *)&tcphdr, (char *)&tcphdr + sizeof(tcphdr));
  return dat;
}
As you can see, I make liberal use of a custom grammar to enable better quality fuzzing. These efforts are worth it, as randomizing high level structure is more efficient. It will also be easier for us to interpret crashing test cases later as they will have the same high level representation.
High-Level Emulation
Now that we have the code building and an initial fuzz target running, we begin the first pass at implementing all of the stubbed code that is reachable by our fuzz target. Because we have a fuzz target that builds and runs, we now get instant feedback about which functions our target hits. Some core functionality has to be supported before we can find any bugs, so the first attempt to run the fuzzer deserves its own development phase. For example, until dynamic memory allocation is supported, almost no kernel code we try to cover will work considering how heavily such code is used.
We'll be implementing our stubbed functions with fake variants that attempt to have the same semantics. For example, when testing code that uses an external database library, you could replace the database with a simple in-memory implementation. If you don't care about finding database bugs, this often makes fuzzing simpler and more robust. For some kernel subsystems unrelated to networking we can use entirely different or null implementations. This process is reminiscent of high-level emulation, an idea used in game console emulation. Rather than aiming to emulate hardware, you can try to preserve the semantics but use a custom implementation of the API. Because we only care about testing networking, this is how we approach faking subsystems in this project.
I always start by looking at the original function implementation. If it's possible, I just link in that code as well. But some functionality isn't compatible with our fuzzer and must be faked. For example, zalloc should call the userland malloc since virtual memory is already managed by our host kernel and we have allocator facilities available. Similarly, copyin and copyout need to be faked as they no longer serve to copy data between user and kernel pages. Sometimes we also just "nop" out functionality that we don't care about. We'll cover these decisions in more detail later in the "High-Level Emulation" phase. Note that by implementing these stubs lazily whenever our fuzz target hits them, we immediately reduce the work in handling all the unrelated functions by an order of magnitude. It's easier to stay motivated when you only implement fakes for functions that are used by the target code. This approach successfully saved me a lot of time and I've used it on subsequent projects as well. At the time of writing, I have 398 stubbed functions, about 250 functions that are trivially faked (return 0 or void functions that do nothing), and about 25 functions that I faked myself (almost all related to porting the memory allocation systems to userland).
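As an illustration of this style of faking, here's roughly how a fake copyin might draw from the fuzzer's test case bytes instead of copying from user pages. This is a hedged sketch under assumptions: the names (g_fuzz_data, fake_set_fuzz_data, fake_copyin) and the EFAULT-on-exhaustion policy are hypothetical stand-ins, not the actual sockfuzzer implementation:

```c
/* Sketch of faking copyin: instead of copying from a user address, consume
 * bytes from the fuzzer-provided data blob for this test case. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long user_addr_t; /* stand-in for XNU's user_addr_t */

static const unsigned char *g_fuzz_data; /* current test case bytes */
static size_t g_fuzz_len;                /* bytes remaining */

/* Called once per fuzzer iteration with the session's data_provider bytes. */
void fake_set_fuzz_data(const unsigned char *data, size_t len) {
  g_fuzz_data = data;
  g_fuzz_len = len;
}

/* Returns 0 on success like the real copyin; when the provider runs dry we
 * fail the call so the target exercises its error path instead of reading
 * uninitialized memory. */
int fake_copyin(user_addr_t uaddr, void *kaddr, size_t len) {
  (void)uaddr; /* the "user address" is ignored; data comes from the fuzzer */
  if (len > g_fuzz_len) return 14; /* EFAULT */
  memcpy(kaddr, g_fuzz_data, len);
  g_fuzz_data += len;
  g_fuzz_len -= len;
  return 0;
}
```

The design point is that the fuzzer, not the harness author, decides what "userland" memory contains, so copyin-heavy syscalls get mutated payloads for free.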
Booting Up
As soon as we start running the fuzzer, we'll run into a snag: many resources require a one-time initialization that happens on boot. The BSD half of the kernel is mostly initialized by calling the bsd_init function. That function, in turn, calls several subsystem-specific initialization functions. Keeping with the theme of supporting a minimally necessary subset of the kernel, rather than call bsd_init , we create a new function that only initializes parts of the kernel as needed.
Here's an example crash that occurs without the one time kernel bootup initialization:
#7 0x7effbc464ad0 in zalloc /source/build3/../fuzz/zalloc.c:35:3
#8 0x7effbb62eab4 in pipepair_alloc /source/build3/../bsd/kern/sys_pipe.c:634:24
#9 0x7effbb62ded5 in pipe /source/build3/../bsd/kern/sys_pipe.c:425:10
#10 0x7effbc4588ab in pipe_wrapper /source/build3/../fuzz/syscall_wrappers.c:216:10
#11 0x4ee1a4 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:979:19
Our zalloc implementation (covered in the next section) failed because the pipe zone wasn't yet initialized:
static int
pipepair_alloc(struct pipe **rp_out, struct pipe **wp_out)
{
	struct pipepair *pp = zalloc(pipe_zone);
Scrolling up in sys_pipe.c, we see where that zone is initialized:
void
pipeinit(void)
{
	nbigpipe = 0;
	vm_size_t zone_size;
	zone_size = 8192 * sizeof(struct pipepair);
	pipe_zone = zinit(sizeof(struct pipepair), zone_size, 4096, "pipe zone");
Sure enough, this function is called by bsd_init. After adding that call to our initial setup function, the zone works as expected. After some development cycles spent supporting all the needed bsd_init function calls, we have the following:
__attribute__((visibility("default"))) bool initialize_network() {
  mcache_init();
  mbinit();
  eventhandler_init();
  pipeinit();
  dlil_init();
  socketinit();
  domaininit();
  loopattach();
  ether_family_init();
  tcp_cc_init();
  net_init_run();
  int res = necp_init();
  assert(!res);
  return true;
}
The original bsd_init is 683 lines long, but our initialize_network clone is the preceding short snippet. I want to remark how cool I found it that you could "boot" a kernel like this and have everything work so long as you implemented all the relevant stubs. It just goes to show a surprising fact: a significant amount of kernel code is portable, and simple steps can be taken to make it testable. These codebases can be modernized without being fully rewritten. As this "boot" relies on dynamic allocation, let's look at how I implemented that next.
Dynamic Memory Allocation
Providing a virtual memory abstraction is a fundamental goal of most kernels, but the good news is this is out of scope for this project (this is left as an exercise for the reader). Because networking already assumes working virtual memory, the network stack functions almost entirely on top of high-level allocator APIs. This makes the subsystem amenable to "high-level emulation". We can create a thin shim layer that intercepts XNU specific allocator calls and translates them to the relevant host APIs.
In practice, we have to handle three types of allocations for this project: "classic" allocations (malloc/calloc/free), zone allocations (zalloc), and mbuf (memory buffers). The first two types are more fundamental allocation types used across XNU, while mbufs are a common data structure used in low-level networking code.
The zone allocator is reasonably complicated, but we use a simplified model for our purposes: we just track the size assigned to a zone when it is created and make sure we malloc that size when zalloc is later called using the initialized zone. This could undoubtedly be modeled better, but this initial model worked quite well for the types of bugs I was looking for. In practice, this simplification affects exploitability, but we aren't worried about that for a fuzzing project as we can assess that manually once we discover an issue. As you can see below, I created a custom zone type that simply stored the configured size, knowing that my zinit would return an opaque pointer that would be passed to my zalloc implementation, which could then use calloc to service the request. zfree simply freed the requested bytes and ignored the zone, as allocation sizes are tracked by the host malloc already.
struct zone {
  uintptr_t size;
};

struct zone* zinit(uintptr_t size, uintptr_t max, uintptr_t alloc,
                   const char* name) {
  struct zone* zone = (struct zone*)calloc(1, sizeof(struct zone));
  zone->size = size;
  return zone;
}

void* zalloc(struct zone* zone) {
  assert(zone != NULL);
  return calloc(1, zone->size);
}

void zfree(void* zone, void* dat) {
  (void)zone;
  free(dat);
}
Kalloc, kfree, and related functions were passed through to malloc and free as well. You can see fuzz/zalloc.c for their implementations. Mbufs (memory buffers) are more work to implement because they contain considerable metadata that is exposed to the "client" networking code.
struct m_hdr {
	struct mbuf     *mh_next;       /* next buffer in chain */
	struct mbuf     *mh_nextpkt;    /* next chain in queue/record */
	caddr_t         mh_data;        /* location of data */
	int32_t         mh_len;         /* amount of data in this mbuf */
	u_int16_t       mh_type;        /* type of data in this mbuf */
	u_int16_t       mh_flags;       /* flags; see below */
};

/*
 * The mbuf object
 */
struct mbuf {
	struct m_hdr m_hdr;
	union {
		struct {
			struct pkthdr MH_pkthdr;        /* M_PKTHDR set */
			union {
				struct m_ext MH_ext;    /* M_EXT set */
				char MH_databuf[_MHLEN];
			} MH_dat;
		} MH;
		char M_databuf[_MLEN];                  /* !M_PKTHDR, !M_EXT */
	} M_dat;
};
I didn't include the pkthdr or m_ext structure definitions, but they are nontrivial (you can see for yourself in bsd/sys/mbuf.h). A lot of trial and error was needed to create a simplified mbuf format that would work. In practice, I use an inline buffer when possible and, when necessary, locate the data in one large external buffer and set the M_EXT flag. As these allocations must be aligned, I use posix_memalign to create them rather than malloc. Fortunately, ASAN tracks these allocations like any other heap allocations, so we can still detect memory safety bugs on mbuf data with this modification.
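To make the inline-versus-external policy concrete, here is a minimal, hypothetical sketch of that approach. The names (fake_mbuf, FAKE_MLEN, FAKE_M_EXT) and the 4096-byte alignment are illustrative assumptions, not XNU's real constants or sockfuzzer's actual code:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Illustrative constants; XNU's real values differ.
#define FAKE_MLEN 256
#define FAKE_M_EXT 0x0001

struct fake_mbuf {
  uint16_t flags;
  char *data;                 // points at inline_buf or ext_buf
  char *ext_buf;              // external storage when FAKE_M_EXT is set
  char inline_buf[FAKE_MLEN]; // small payloads live here
};

struct fake_mbuf *fake_mbuf_alloc(size_t len) {
  struct fake_mbuf *m = calloc(1, sizeof(*m));
  if (m == NULL) return NULL;
  if (len <= FAKE_MLEN) {
    // Small payload: use the inline buffer.
    m->data = m->inline_buf;
  } else {
    // Large payload: aligned external allocation via posix_memalign,
    // which ASAN intercepts like any other heap allocation.
    if (posix_memalign((void **)&m->ext_buf, 4096, len) != 0) {
      free(m);
      return NULL;
    }
    m->data = m->ext_buf;
    m->flags |= FAKE_M_EXT;
  }
  return m;
}

void fake_mbuf_free(struct fake_mbuf *m) {
  if (m->flags & FAKE_M_EXT) free(m->ext_buf);
  free(m);
}
```

Because both paths bottom out in host allocator calls, a use after free or out-of-bounds access on either the inline or external buffer is visible to ASAN.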
Two bugs I reported via the Project Zero tracker highlight the benefit of the heap-based mbuf implementation. In the first report, I detected an mbuf double free using ASAN. While the m_free implementation tries to detect double frees by checking the state of the allocation, ASAN goes even further by quarantining recently freed allocations to detect the bug. In this case, it looks like the fuzzer would have found the bug either way, but it was impressive. The second issue linked is much subtler and requires some instrumentation to detect the bug, as it is a use after free read of an mbuf:
==22568==ERROR: AddressSanitizer: heap-use-after-free on address 0x61500026afe5 at pc 0x7ff60f95cace bp 0x7ffd4d5617b0 sp 0x7ffd4d5617a8
READ of size 1 at 0x61500026afe5 thread T0
    #0 0x7ff60f95cacd in tcp_input bsd/netinet/tcp_input.c:5029:25
    #1 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #2 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10

0x61500026afe5 is located 229 bytes inside of 256-byte region [0x61500026af00,0x61500026b000)
freed by thread T0 here:
    #0 0x4a158d in free /b/swarming/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7ff60fb7444d in m_free fuzz/zalloc.c:220:3
    #2 0x7ff60f4e3527 in m_freem bsd/kern/uipc_mbuf.c:4842:7
    #3 0x7ff60f5334c9 in sbappendstream_rcvdemux bsd/kern/uipc_socket2.c:1472:3
    #4 0x7ff60f95821d in tcp_input bsd/netinet/tcp_input.c:5019:8
    #5 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #6 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10
Apple managed to catch this issue before I reported it, fixing it in iOS 13. I believe Apple has added some internal hardening or testing for mbufs that caught this bug. It could be anything from a hardened mbuf allocator like GWP-ASAN , to an internal ARM MTE test, to simple auditing, but it was really cool to see this issue detected in this way, and also that Apple was proactive enough to find this themselves.
Accessing User Memory
When talking about this project with a fellow attendee at a fuzzing conference, their biggest question was how I handled user memory accesses. Kernels are never supposed to trust pointers provided by userspace, so whenever the kernel wants to access memory mapped in userspace, it goes through the intermediary functions copyin and copyout. By replacing these functions with our own fake implementations, we can supply fuzzer-provided input to the tested code where the real kernel would have copied data between user and kernel pages. Because these copies are driven by the target code and not by our test case, I added a buffer to the protobuf specification to be used to service these requests.
Here's a backtrace from our stub before implementing `copyin`. As you can see, when calling the `recvfrom` syscall, our fuzzer passed in a pointer as an argument.
#6 0x7fe1176952f3 in Assert /source/build3/../fuzz/stubs.c:21:3
#7 0x7fe11769a110 in copyin /source/build3/../fuzz/fake_impls.c:408:3
#8 0x7fe116951a18 in __copyin_chk /source/build3/../bsd/libkern/copyio.h:47:9
#9 0x7fe116951a18 in recvfrom_nocancel /source/build3/../bsd/kern/uipc_syscalls.c:2056:11
#10 0x7fe117691a86 in recvfrom_nocancel_wrapper /source/build3/../fuzz/syscall_wrappers.c:244:10
#11 0x4e933a in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:936:9
#12 0x4e43b8 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1
I've extended the copyin specification with my fuzzer-specific semantics: when the pointer (void*)1 is passed as an address, we interpret this as a request to fetch arbitrary bytes. Otherwise, we copy directly from that virtual memory address. This way, we can begin by passing (void*)1 everywhere in the fuzz target to get as much cheap coverage as possible. Later, as we want to construct well-formed data to pass into syscalls, we build the data in the protobuf test case handler and pass a real pointer to it, allowing it to be copied. This flexibility saves us time while permitting the construction of highly-structured data inputs as we see fit.
int __attribute__((warn_unused_result))
copyin(void* user_addr, void* kernel_addr, size_t nbytes) {
  // Address 1 means use fuzzed bytes, otherwise use real bytes.
  // NOTE: this does not support nested useraddr.
  if (user_addr != (void*)1) {
    memcpy(kernel_addr, user_addr, nbytes);
    return 0;
  }

  if (get_fuzzed_bool()) {
    return -1;
  }

  get_fuzzed_bytes(kernel_addr, nbytes);
  return 0;
}
Copyout is designed similarly. We often don't care about the data copied out; we just care about the safety of the accesses. For that reason, we make sure to memcpy from the source buffer in all cases, using a temporary buffer when a copy to (void*)1 occurs. If the kernel copies out of bounds or from freed memory, for example, ASAN will catch it and inform us about a memory disclosure vulnerability.
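A hypothetical sketch of that copyout counterpart follows (the real implementation is in sockfuzzer's sources; the fail-injection flag and scratch-buffer size here are illustrative assumptions). The key property is that the read from the kernel source buffer always happens, so ASAN can flag an out-of-bounds or use-after-free read even when no real user destination exists:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Stand-in for the fuzzer's get_fuzzed_bool() failure injection.
bool fuzzed_copyout_should_fail = false;

int copyout(const void *kernel_addr, void *user_addr, size_t nbytes) {
  if (fuzzed_copyout_should_fail) {
    return -1;  // occasionally simulate a faulting user address
  }
  if (user_addr == (void *)1) {
    // No real user mapping: copy into a scratch buffer so the read
    // from kernel_addr still occurs and ASAN checks the source access.
    char scratch[4096];
    const char *src = kernel_addr;
    while (nbytes > 0) {
      size_t chunk = nbytes < sizeof(scratch) ? nbytes : sizeof(scratch);
      memcpy(scratch, src, chunk);
      src += chunk;
      nbytes -= chunk;
    }
    return 0;
  }
  // Real destination provided by the test case handler: copy normally.
  memcpy(user_addr, kernel_addr, nbytes);
  return 0;
}
```

With this shape, a kernel memory disclosure manifests as an ASAN report on the memcpy source rather than silently copying stale bytes.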
Synchronization and Threads
Among the many changes made to XNU's behavior to support this project, perhaps the most extensive and invasive are the changes I made to the synchronization and threading model. Before beginning this project, I had spent over a year working on Chrome browser process research, where high-level "sequences" are preferred to physical threads. Despite a paucity of data races, Chrome still had sequence-related bugs that were triggered by randomly servicing some of the pending work in between performing synchronous IPC calls. In an exploit for a bug found by the AppCache fuzzer, sleep calls were needed to let the asynchronous work complete before queueing up more work synchronously. So I already knew that asynchronous continuation-passing style concurrency could have exploitable bugs that are easy to discover with this fuzzing approach.
I suspected I could find similar bugs if I used a similar model for sockfuzzer. Because XNU uses multiple kernel threads in its networking stack, I would have to port it to a cooperative style. To do this, I provided no-op implementations for all of the thread management functions and synchronization primitives, and instead randomly called the work functions that the real threads would have run. This involved modifying code: most worker threads run in a loop, processing new work as it comes in. I modified these infinitely looping helper functions to perform one iteration of work, exposed them to the fuzzer frontend, and called them randomly as part of the protobuf message. Places where the kernel could block the fuzzer were modified to return early. The main benefit of doing the project this way was improved performance and determinism; overall, a single-threaded process was much simpler and easier to manage. But this decision did not end up yielding as many bugs as I had hoped. For example, I suspected that interleaving garbage collection of various network-related structures with syscalls would be more effective than it turned out to be. The cooperative model did achieve the goal of removing threading-related headaches from deploying the fuzzer, but this is a serious weakness that I would like to address in future fuzzer revisions.
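The loop-to-single-iteration conversion can be sketched as follows. This is a hypothetical illustration, not XNU's actual worker code: the work-queue structure and function names are invented for clarity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// A trivial work queue standing in for a kernel thread's pending work.
struct work_item {
  struct work_item *next;
  void (*fn)(void *arg);
  void *arg;
};

static struct work_item *work_queue = NULL;

void enqueue_work(struct work_item *item) {
  item->next = work_queue;
  work_queue = item;
}

// Original kernel style (never returns):
//   static void worker_thread(void) {
//     for (;;) { block_until_work(); do_one_work_item(); }
//   }
//
// Fuzzer-friendly version: service at most one pending item, then
// return so the protobuf test case handler can interleave it with
// syscalls deterministically. Returns true if any work was done.
bool worker_thread_one_iteration(void) {
  if (work_queue == NULL) {
    return false;  // where the real thread would block, we just return
  }
  struct work_item *item = work_queue;
  work_queue = item->next;
  item->fn(item->arg);
  return true;
}

// Example work function used below.
static int demo_runs = 0;
static void demo_work_fn(void *arg) {
  (void)arg;
  demo_runs++;
}
```

The fuzzer frontend can then call worker_thread_one_iteration at points chosen by the test case, making "thread scheduling" a deterministic function of the input.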
Randomness
Randomness is another service the kernel provides, both to userland (e.g. /dev/random) and to in-kernel consumers that require it. This is easy to emulate: we can just return as many bytes as were requested from the current test case's data_provider field.
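A minimal sketch of such a shim, assuming a read_random-style entry point and a global pointer into the test case's data_provider bytes (both names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// "Random" bytes come straight from the current test case, so every
// run is deterministic and replayable from the input alone.
static const unsigned char *fuzz_data = NULL;
static size_t fuzz_len = 0;

void set_fuzz_data(const unsigned char *data, size_t len) {
  fuzz_data = data;
  fuzz_len = len;
}

// Stand-in for the kernel's randomness entry point: consume fuzzer
// bytes, zero-filling once the test case's data is exhausted.
void read_random(void *buf, size_t nbytes) {
  size_t take = nbytes < fuzz_len ? nbytes : fuzz_len;
  memcpy(buf, fuzz_data, take);
  memset((char *)buf + take, 0, nbytes - take);
  fuzz_data += take;
  fuzz_len -= take;
}
```

Determinism matters here: if the "entropy" were truly random, a crashing input might not reproduce on replay.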
Authentication
XNU features several mechanisms (KAuth, MAC checks, user checks) that determine whether a given syscall is permissible. Because of the importance and relative rarity of bugs in XNU, and because I was willing to triage false positives, I decided to allow all actions by default. For example, the TCP multipath code requires a special entitlement, but disabling this functionality would have precluded finding Ian's multipath vulnerability. Rather than fuzz only the code accessible inside the app sandbox, I figured I would triage whatever came up and report it with the appropriate severity in mind.
For example, when we create a socket, the kernel checks whether the running process is allowed to make a socket of the given domain, type, and protocol, given its KAuth credentials:
static int
socket_common(struct proc *p,
    int domain,
    int type,
    int protocol,
    pid_t epid,
    int32_t *retval,
    int delegate)
{
	struct socket *so;
	struct fileproc *fp;
	int fd, error;

	AUDIT_ARG(socket, domain, type, protocol);
#if CONFIG_MACF_SOCKET_SUBSET
	if ((error = mac_socket_check_create(kauth_cred_get(), domain,
	    type, protocol)) != 0) {
		return error;
	}
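In the fuzzer, checks like the mac_socket_check_create call above can simply be stubbed out to succeed. A minimal sketch of such a permissive stub follows; the simplified signature is an assumption (XNU's real prototype takes a kauth_cred_t), and the real stubs presumably live alongside the others in fuzz/fake_impls.c:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical permissive stub: every MAC policy check succeeds, so
// fuzzing reaches entitlement-gated code. Whether a discovered bug is
// actually reachable from the app sandbox is triaged manually later.
int mac_socket_check_create(void *cred, int domain, int type, int protocol) {
  (void)cred;
  (void)domain;
  (void)type;
  (void)protocol;
  return 0;  // 0 means the action is allowed
}
```

This trades some false positives (bugs behind entitlements) for coverage of code that a sandbox-accurate model would never exercise.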