Designing New Operating Primitives to Improve Fuzzing Performance

Posted by Ned Williams, Project Zero

Introduction

When I started my 20% project – an initiative where employees are allocated twenty percent of their paid work time to pursue personal projects – with Project Zero, I wanted to see if I could apply the techniques I had learned fuzzing Chrome to XNU, the kernel used in iOS and macOS. My interest was sparked after learning some prominent members of the iOS research community believed the kernel was "fuzzed to death," and my understanding was that most of the top researchers used auditing for vulnerability research. This meant finding new bugs with fuzzing would be meaningful in demonstrating the value of implementing newer fuzzing techniques. In this project, I pursued a somewhat unusual approach to fuzz XNU networking in userland by converting it into a library, "booting" it in userspace and using my standard fuzzing workflow to discover vulnerabilities. Somewhat surprisingly, this worked well enough to reproduce some of my peers' recent discoveries and report some of my own, one of which was a reliable privilege escalation from the app context, CVE-2019-8605, dubbed "SockPuppet." I'm excited to open source this fuzzing project, "sockfuzzer," for the community to learn from and adapt. In this post, we'll do a deep dive into its design and implementation.

Attack Surface Review and Target Planning

Choosing Networking

We're at the beginning of a multistage project, and I had enormous respect for the difficulty of the task ahead of me. I knew I would need to be careful investing time at each stage of the process, constantly looking for evidence that I needed to change direction. The first big decision was what exactly we wanted to target.

I started by downloading the XNU sources and reviewing them, looking for areas that handled a lot of attacker-controlled input and seemed amenable to fuzzing – immediately the networking subsystem jumped out as worthy of research. I had just exploited a Chrome sandbox bug that leveraged collaboration between an exploited renderer process and a server working in concert. I recognized the power of these attack surfaces, where some security-critical code is "sandwiched" between two attacker-controlled entities. The Chrome browser process is prone to use-after-free vulnerabilities due to the difficulty of managing state for large APIs, and I suspected XNU would have the same issue. Networking features both parsing and state management. I figured that even if others had already fuzzed the parsers extensively, there could still be use-after-free vulnerabilities lying dormant.

I then proceeded to look at recent bug reports. Two bugs caught my eye: the mptcp overflow discovered by Ian Beer and the ICMP out-of-bounds write found by Kevin Backhouse. Both of these are somewhat "straightforward" buffer overflows. The bugs' simplicity hinted that kernel networking, even packet parsing, was sufficiently undertested. A fuzzer combining network syscalls and arbitrary remote packets should be large enough in scope to reproduce these issues and find new ones.

Digging deeper, I wanted to understand how to reach these bugs in practice. By cross-referencing the functions and setting kernel breakpoints in a VM, I managed to get a more concrete idea. Here's the call stack for Ian's MPTCP bug:

The buggy function in question is mptcp_usr_connectx. Moving up the call stack, we find the connectx syscall, which we see in Ian's original testcase. If we were to write a fuzzer to find this bug, how would we do it? Ultimately, whatever we do has to both find the bug and give us the information we need to reproduce it on the real kernel. Calling mptcp_usr_connectx directly should surely find the bug, but this seems like the wrong idea because it takes a lot of arguments. Modeling a fuzzer well enough to call this function directly in a way representative of the real code is no easier than auditing the code in the first place, so we haven't made anything easier by writing a targeted fuzzer. It's also wasted effort to write a target for each function this small. On the other hand, the further up the call stack we go, the more complexity we may have to support and the less chance we have of landing on the bug. If I were trying to unit test the networking stack, I would probably avoid the syscall layer and call the intermediate helper functions as a middle ground. This is exactly what I tried in the first draft of the fuzzer; I used sock_socket to create struct socket* objects to pass to connectitx, in the hopes that it would be easy to reproduce this bug while being high-enough level that the bug could plausibly have been discovered without knowing where to look for it. Surprisingly, after some experimentation, it turned out to be easier to simply call the syscalls directly (via connectx). This makes it easier to translate crashing inputs into programs to run against a real kernel, since testcases map 1:1 to syscalls. We'll see more details about this later.

We can't test networking properly without accounting for packets. In this case, data comes from the hardware, not via syscalls from a user process. We'll have to expose this functionality to our fuzzer. To figure out how to extend our framework to support random packet delivery, we can use our next example bug. Let's take a look at the call stack for delivering a packet to trigger the ICMP bug reported by Kevin Backhouse:

To reach the buggy function, icmp_error, the call stack is deeper, and unlike with syscalls, it's not immediately obvious which of these functions we should call to cover the relevant code. Starting from the very top of the call stack, we see that the crash occurred in a kernel thread running the dlil_input_thread_func function. DLIL stands for Data Link Interface Layer, a reference to the OSI model's data link layer. Moving further down the stack, we see ether_inet_input, indicating an Ethernet packet (since I tested this issue using Ethernet). We finally make it down to the IP layer, where ip_dooptions signals an icmp_error. As an attacker, we probably don't have a lot of control over the interface a user uses to receive our input, so we can rule out some of the uppermost layers. We also don't want to deal with threads in our fuzzer, another design tradeoff we'll describe in more detail later. proto_input and ip_proto_input don't do much, so I decided that ip_input was where I would inject packets, simply by calling the function when I wanted to deliver a packet. After reviewing proto_register_input, I discovered another function called ip6_input, which was the entry point for the IPv6 code. Here's the prototype for ip_input:

void ip_input(struct mbuf *m);


Mbufs are message buffers, a standard buffer format used in network stacks. They enable multiple small packets to be chained together through a linked list. So we just need to generate mbufs with random data before calling ip_input.

I was surprised by how easy it was to work with the network stack compared to the syscall interface. `ip_input` and `ip6_input` are pure functions that don't require us to know any state to call them. But stepping back, it made more sense. Packet delivery is inherently a clean interface: our kernel has no idea what arbitrary packets may be coming in, so the interface takes a raw packet and then further down in the stack decides how to handle it. Many packets contain metadata that affects the kernel state once received. For example, TCP or UDP packets will be matched to an existing connection by their port number.
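As a rough mental model (a simplified sketch, not XNU's real struct mbuf, which carries far more metadata), a chain of message buffers is just a singly linked list whose payload spans multiple nodes:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a kernel mbuf: a small fixed-size buffer that
// can be chained to hold payloads larger than one node.
struct FakeMbuf {
  FakeMbuf *next = nullptr;  // next buffer in the chain
  size_t len = 0;            // bytes used in this node
  char data[64];             // payload storage
};

// Total payload length across the whole chain, the way a network stack
// would compute a packet's size before parsing it.
size_t chain_length(const FakeMbuf *m) {
  size_t total = 0;
  for (; m != nullptr; m = m->next) total += m->len;
  return total;
}
```

A fuzzer only has to fill the data fields with random bytes and wire up the next pointers before handing the head of the chain to the input function.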

Most modern coverage guided fuzzers, including this LibFuzzer-based project, use a design inspired by AFL. When a test case with some known coverage is mutated and the mutant produces coverage that hasn't been seen before, the mutant is added to the current corpus of inputs. It becomes available for further mutations to produce even deeper coverage. Lcamtuf, the author of AFL, has an excellent demonstration of how this algorithm created JPEGs using coverage feedback  with no well-formed starting samples. In essence, most poorly-formed inputs are rejected early. When a mutated input passes a validation check, the input is saved. Then that input can be mutated until it manages to pass the second validation check, and so on. This hill climbing algorithm  has no problem generating dependent sequences of API calls, in this case to interleave syscalls with ip_input  and ip6_input . Random syscalls can get the kernel into some state where it's expecting a packet. Later, when libFuzzer guesses a packet that gets the kernel into some new state, the hill climbing algorithm will record a new test case when it sees new coverage. Dependent sequences of syscalls and packets are brute-forced in a linear fashion, one call at a time.
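The corpus-growth loop described above can be sketched in a few lines. This is a toy model, not libFuzzer's actual implementation; here "coverage" is abstracted as a set of edge IDs reported by running an input:

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy coverage-guided fuzzing loop: keep a mutant only if running it
// reports an edge we have never seen before (AFL-style corpus growth).
using Coverage = std::set<int>;

struct ToyFuzzer {
  std::vector<std::string> corpus;
  Coverage seen;

  // `run` maps an input to the set of edges it covers.
  void step(const std::string &mutant,
            const std::function<Coverage(const std::string &)> &run) {
    Coverage cov = run(mutant);
    bool new_edge = false;
    for (int edge : cov) {
      if (seen.insert(edge).second) new_edge = true;
    }
    if (new_edge) corpus.push_back(mutant);  // mutant becomes a new seed
  }
};
```

Each retained mutant becomes a new starting point for mutation, which is how the hill climbing algorithm gets past successive validation checks one step at a time.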

Designing for (Development) Speed

Now that we know where to attack this code base, it's a matter of building out the fuzzing research platform. I like thinking of it this way because it emphasizes that this fuzzer is a powerful assistant to a researcher, but it can't do all the work. Like any other test framework, it empowers the researcher to make hypotheses and run experiments over code that looks buggy. For the platform to be helpful, it needs to be comfortable and fun to work with and get out of the way.

When it comes to standard practice for kernel fuzzing, there's a pretty simple spectrum for strategies. On one end, you fuzz self-contained functions that are security-critical, e.g., OSUnserializeBinary. These are easy to write and manage and are generally quite performant. On the other end, you have "end to end" kernel testing that performs random syscalls against a real kernel instance. These heavyweight fuzzers have the advantage of producing issues that you know are actionable right away, but setup and iterative development are slower. I wanted to try a hybrid approach that could preserve some of the benefits of each style. To do so, I would port the networking stack of XNU out of the kernel and into userland while preserving as much of the original code as possible. Kernel code can be surprisingly portable and amenable to unit testing, even when run outside its natural environment.

There has been a push to add more user-mode unit testing to Linux. If you look at the documentation for Linux's KUnit project, there's an excellent quote from Linus Torvalds: "… a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true. That is not what performance is all about. If you can do something really fast, really well, people will start using it differently." This statement echoes the experience I had writing targeted fuzzers for code in Chrome's browser process. Due to extensive unit testing, Chrome code is already well-factored for fuzzing. In a day's work, I could try out many iterations of a fuzz target thanks to a short edit/build/run cycle. I didn't have a similar mechanism out of the box with XNU. In order to perform a unit test, I would need to rebuild the kernel. And despite XNU being considerably smaller than Chrome, incremental builds were slower due to the older kmk build system. I wanted to try bridging this gap for XNU.

Setting up the Scaffolding

"Unit" testing a kernel up through the syscall layer sounds like a big task, but it's easier than you'd expect if you forgo some complexity. We'll start by building all of the individual kernel object files from source using the original build flags. But instead of linking everything together to produce the final kernel binary, we link in only the subset of objects containing code in our target attack surface. We then stub or fake the rest of the functionality. Thanks to the recon in the previous section, we already know which functions we want to call from our fuzzer. I used that information to prepare a minimal list of source objects to include in our userland port.

Before we dive in, let's define the overall structure of the project as pictured below. There's going to be a fuzz target implemented in C++ that translates fuzzed inputs into interactions with the userland XNU library. The target code, libxnu , exposes a few wrapper symbols for syscalls and ip_input  as mentioned in the attack surface review section. The fuzz target also exposes its random sequence of bytes to kernel APIs such as copyin  or copyout , whose implementations have been replaced with fakes that use fuzzed input data.

To make development more manageable, I decided to create a new build system using CMake, as it supported Ninja for fast rebuilds. One drawback here is the original build system has to be run every time upstream is updated to deal with generated sources, but this is worth it to get a faster development loop. I captured all of the compiler invocations during a normal kernel build and used those to reconstruct the flags passed to build the various kernel subsystems. Here's what that first pass looks like:

project(libxnu)

set(XNU_DEFINES
    -DAPPLE
    -DKERNEL
    # ...
)

set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    bsd/net/if.c
    bsd/netinet/ip_input.c
    # ...
)

add_library(xnu SHARED ${XNU_SOURCES} ${FUZZER_FILES} ${XNU_HEADERS})

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer PRIVATE ${FUZZER_CXX_FLAGS})


Of course, without the rest of the kernel, we see tons of missing symbols.

"_zdestroy", referenced from:
      _if_clone_detach in libxnu.a(if.c.o)
"_zfree", referenced from:
      _kqueue_destroy in libxnu.a(kern_event.c.o)
      _knote_free in libxnu.a(kern_event.c.o)
      _kqworkloop_get_or_create in libxnu.a(kern_event.c.o)
      _kev_delete in libxnu.a(kern_event.c.o)
      _pipepair_alloc in libxnu.a(sys_pipe.c.o)
      _pipepair_destroy_pipe in libxnu.a(sys_pipe.c.o)
      _so_cache_timer in libxnu.a(uipc_socket.c.o)
...
"_zinit", referenced from:
      _knote_init in libxnu.a(kern_event.c.o)
      _kern_event_init in libxnu.a(kern_event.c.o)
      _pipeinit in libxnu.a(sys_pipe.c.o)
      _socketinit in libxnu.a(uipc_socket.c.o)
      _unp_init in libxnu.a(uipc_usrreq.c.o)
      _cfil_init in libxnu.a(content_filter.c.o)
      _tcp_init in libxnu.a(tcp_subr.c.o)
...
"_zone_change", referenced from:
      _knote_init in libxnu.a(kern_event.c.o)
      _kern_event_init in libxnu.a(kern_event.c.o)
      _socketinit in libxnu.a(uipc_socket.c.o)
      _cfil_init in libxnu.a(content_filter.c.o)
      _tcp_init in libxnu.a(tcp_subr.c.o)
      _ifa_init in libxnu.a(if.c.o)
      _if_clone_attach in libxnu.a(if.c.o)
...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.


To get our initial targeted fuzzer working, we can use a simple trick: link against a file containing stubbed implementations of all of these missing symbols, taking advantage of C's weak type system. For each function we need to implement, we can link an implementation of the form void func() { assert(false); }. Any arguments passed to the function are simply ignored, and a crash occurs whenever the target code attempts to call it. The same goal can be achieved with linker flags, but this solution was simple enough and gave me nice backtraces whenever I hit an unimplemented function.

// Unimplemented stub functions
// These should be replaced with real or mock impls.

#include <kern/assert.h>
#include <stdbool.h>

int printf(const char *format, ...);

void Assert(const char *file, int line, const char *expression) {
  printf("%s: assert failed on line %d: %s\n", file, line, expression);
  __builtin_trap();
}

void IOBSDGetPlatformUUID() { assert(false); }
void IOMapperInsertPage() { assert(false); }
// ...


Then we just link this file into the XNU library we're building by adding it to the source list:

set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    # ...
    fuzz/syscall_wrappers.c
    fuzz/ioctl.c
    fuzz/backend.c
    fuzz/stubs.c
    fuzz/fake_impls.c
)


As you can see, there are some other files I included in the XNU library that represent faked implementations and helper code to expose some internal kernel APIs. To make sure our fuzz target calls code in the linked library, and not other host functions (or syscalls) with a clashing name, we hide all of the symbols in libxnu by default and then expose a set of wrappers that call those functions on our behalf. I hide all the names by default using the CMake setting set_target_properties(xnu PROPERTIES C_VISIBILITY_PRESET hidden). Then we can link in a file (fuzz/syscall_wrappers.c) containing wrappers like the following:

__attribute__((visibility("default"))) int accept_wrapper(int s, caddr_t name,
                                                           socklen_t *anamelen,
                                                           int *retval) {
  struct accept_args uap = {
      .s = s,
      .name = name,
      .anamelen = anamelen,
  };
  return accept(kernproc, &uap, retval);
}

Note the visibility attribute that explicitly exports the symbol from the library. Because these wrappers are so simple, I wrote a script, generate_fuzzer.py, to generate them automatically from syscalls.master.

With the stubs in place, we can start writing a fuzz target now and come back to deal with implementing them later. We will see a crash every time the target code attempts to use one of the functions we initially left out. Then we get to decide to either include the real implementation (and perhaps recursively require even more stubbed function implementations) or to fake the functionality.

A bonus of getting the build working with CMake was the ability to create multiple targets with different instrumentation. Doing so allows me to generate coverage reports using clang-coverage:

target_compile_options(xnu-cov PRIVATE ${XNU_C_FLAGS} -DLIBXNU_BUILD=1 -D_FORTIFY_SOURCE=0 -fprofile-instr-generate -fcoverage-mapping)


With that, we just add a fuzz target file and a protobuf file to use with protobuf-mutator and we're ready to get started:

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer
                       PRIVATE -g
                               -std=c++11
                               -Werror
                               -Wno-address-of-packed-member
                               ${FUZZER_CXX_FLAGS})

if(APPLE)
target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES})
else()
target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES} pthread)
endif(APPLE)

Writing a Fuzz Target

At this point, we've assembled a chunk of XNU into a convenient library, but we still need to interact with it by writing a fuzz target. At first, I thought I might write many targets for different features, but I decided to write one monolithic target for this project. I'm sure fine-grained targets could do a better job for functionality that's harder to fuzz, e.g., the TCP state machine, but we will stick to one for simplicity.

We'll start by specifying an input grammar using protobuf, part of which is depicted below. This grammar is completely arbitrary and will be used by a corresponding C++ harness that we will write next. LibFuzzer has a plugin called libprotobuf-mutator that knows how to mutate protobuf messages. This will enable us to do grammar-based mutational fuzzing efficiently, while still leveraging coverage guided feedback. This is a very powerful combination.

message Socket {
  required Domain domain = 1;
  required SoType so_type = 2;
  required Protocol protocol = 3;
  // TODO: options, e.g. SO_ACCEPTCONN
}

message Close {
  required FileDescriptor fd = 1;
}

message SetSocketOpt {
  optional Protocol level = 1;
  optional SocketOptName name = 2;
  // TODO(nedwill): structure for val
  optional bytes val = 3;
  optional FileDescriptor fd = 4;
}

message Command {
  oneof command {
    Packet ip_input = 1;
    SetSocketOpt set_sock_opt = 2;
    Socket socket = 3;
    Close close = 4;
  }
}

message Session {
  repeated Command commands = 1;
  required bytes data_provider = 2;
}

I left some TODO comments intact so you can see how the grammar can always be improved. As I've done in similar fuzzing projects, I have a top-level message called Session  that encapsulates a single fuzzer iteration or test case. This session contains a sequence of "commands" and a sequence of bytes that can be used when random, unstructured data is needed (e.g., when doing a copyin ). Commands are syscalls or random packets, which in turn are their own messages that have associated data. For example, we might have a session that has a single Command message containing a "Socket" message. That Socket message has data associated with each argument to the syscall. In our C++-based target, it's our job to translate messages of this custom specification into real syscalls and related API calls. We inform libprotobuf-mutator that our fuzz target expects to receive one "Session" message at a time via the macro DEFINE_BINARY_PROTO_FUZZER .

DEFINE_BINARY_PROTO_FUZZER(const Session &session) {
  // ...
  std::set<int> open_fds;
  for (const Command &command : session.commands()) {
    int retval = 0;
    switch (command.command_case()) {
      case Command::kSocket: {
        int fd = 0;
        int err = socket_wrapper(command.socket().domain(),
                                 command.socket().so_type(),
                                 command.socket().protocol(), &fd);
        if (err == 0) {
          // Make sure we're tracking fds properly.
          if (open_fds.find(fd) != open_fds.end()) {
            printf("Found existing fd %d\n", fd);
            assert(false);
          }
          open_fds.insert(fd);
        }
        break;
      }
      case Command::kClose: {
        open_fds.erase(command.close().fd());
        close_wrapper(command.close().fd(), nullptr);
        break;
      }
      case Command::kSetSockOpt: {
        int s = command.set_sock_opt().fd();
        int level = command.set_sock_opt().level();
        int name = command.set_sock_opt().name();
        size_t size = command.set_sock_opt().val().size();
        std::unique_ptr<char[]> val(new char[size]);
        memcpy(val.get(), command.set_sock_opt().val().data(), size);
        setsockopt_wrapper(s, level, name, val.get(), size, nullptr);
        break;
      }

While syscalls are typically a straightforward translation of the protobuf message, other commands are more complex. In order to improve the structure of randomly generated packets, I added custom message types that I then converted into the relevant on-the-wire structure before passing it into ip_input. Here's how this looks for TCP:

message Packet {
  oneof packet {
    TcpPacket tcp_packet = 1;
  }
}

message TcpPacket {
  required IpHdr ip_hdr = 1;
  required TcpHdr tcp_hdr = 2;
  optional bytes data = 3;
}

message IpHdr {
  required uint32 ip_hl = 1;
  required IpVersion ip_v = 2;
  required uint32 ip_tos = 3;
  required uint32 ip_len = 4;
  required uint32 ip_id = 5;
  required uint32 ip_off = 6;
  required uint32 ip_ttl = 7;
  required Protocol ip_p = 8;
  required InAddr ip_src = 9;
  required InAddr ip_dst = 10;
}

message TcpHdr {
  required Port th_sport = 1;
  required Port th_dport = 2;
  required TcpSeq th_seq = 3;
  required TcpSeq th_ack = 4;
  required uint32 th_off = 5;
  repeated TcpFlag th_flags = 6;
  required uint32 th_win = 7;
  required uint32 th_sum = 8;
  required uint32 th_urp = 9;
  // Ned's extensions
  required bool is_pure_syn = 10;
  required bool is_pure_ack = 11;
}

Unfortunately, protobuf doesn't support a uint8 type, so I had to use uint32 for some fields. That's some lost fuzzing performance. You can also see some synthetic TCP header flags I added to make certain flag combinations more likely: is_pure_syn  and is_pure_ack . Now I have to write some code to stitch together a valid packet from these nested fields. Shown below is the code to handle just the TCP header.

std::string get_tcp_hdr(const TcpHdr &hdr) {
  struct tcphdr tcphdr = {
      .th_sport = (unsigned short)hdr.th_sport(),
      .th_dport = (unsigned short)hdr.th_dport(),
      .th_seq = __builtin_bswap32(hdr.th_seq()),
      .th_ack = __builtin_bswap32(hdr.th_ack()),
      .th_off = hdr.th_off(),
      .th_flags = 0,
      .th_win = (unsigned short)hdr.th_win(),
      .th_sum = 0,  // TODO(nedwill): calculate the checksum instead of skipping it
      .th_urp = (unsigned short)hdr.th_urp(),
  };

  for (const int flag : hdr.th_flags()) {
    tcphdr.th_flags ^= flag;
  }

  // Prefer pure syn
  if (hdr.is_pure_syn()) {
    tcphdr.th_flags &= ~(TH_RST | TH_ACK);
    tcphdr.th_flags |= TH_SYN;
  } else if (hdr.is_pure_ack()) {
    tcphdr.th_flags &= ~(TH_RST | TH_SYN);
    tcphdr.th_flags |= TH_ACK;
  }

  std::string dat((char *)&tcphdr, (char *)&tcphdr + sizeof(tcphdr));
  return dat;
}


As you can see, I make liberal use of a custom grammar to enable better quality fuzzing. These efforts are worth it, as randomizing high level structure is more efficient. It will also be easier for us to interpret crashing test cases later as they will have the same high level representation.

High-Level Emulation

Now that we have the code building and an initial fuzz target running, we begin the first pass at implementing all of the stubbed code that is reachable by our fuzz target. Because we have a fuzz target that builds and runs, we now get instant feedback about which functions our target hits. Some core functionality has to be supported before we can find any bugs, so the first attempt to run the fuzzer deserves its own development phase. For example, until dynamic memory allocation is supported, almost no kernel code we try to cover will work considering how heavily such code is used.

We'll be implementing our stubbed functions with fake variants that attempt to have the same semantics. For example, when testing code that uses an external database library, you could replace the database with a simple in-memory implementation. If you don't care about finding database bugs, this often makes fuzzing simpler and more robust. For some kernel subsystems unrelated to networking we can use entirely different or null implementations. This process is reminiscent of high-level emulation, an idea used in game console emulation. Rather than aiming to emulate hardware, you can try to preserve the semantics but use a custom implementation of the API. Because we only care about testing networking, this is how we approach faking subsystems in this project.
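To make the database analogy concrete (an illustrative sketch, not code from this project), a fake in-memory store can stand in for a real database behind the same interface the code under test already uses:

```cpp
#include <map>
#include <optional>
#include <string>

// The interface the code under test depends on.
struct KVStore {
  virtual ~KVStore() = default;
  virtual void put(const std::string &key, const std::string &val) = 0;
  virtual std::optional<std::string> get(const std::string &key) = 0;
};

// High-level emulation: same semantics as the real store, but backed by
// a plain std::map so tests need no database, disk, or network.
struct FakeKVStore : KVStore {
  std::map<std::string, std::string> data;
  void put(const std::string &key, const std::string &val) override {
    data[key] = val;
  }
  std::optional<std::string> get(const std::string &key) override {
    auto it = data.find(key);
    if (it == data.end()) return std::nullopt;
    return it->second;
  }
};
```

The fake preserves the behavior the caller observes (what you put, you can get back) while discarding everything else, which is exactly the tradeoff we make when faking XNU subsystems unrelated to networking.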

I always start by looking at the original function implementation. If it's possible, I just link in that code as well. But some functionality isn't compatible with our fuzzer and must be faked. For example, zalloc  should call the userland malloc since virtual memory is already managed by our host kernel and we have allocator facilities available. Similarly, copyin  and copyout  need to be faked as they no longer serve to copy data between user and kernel pages. Sometimes we also just "nop" out functionality that we don't care about. We'll cover these decisions in more detail later in the "High-Level Emulation" phase. Note that by implementing these stubs lazily whenever our fuzz target hits them, we immediately reduce the work in handling all the unrelated functions by an order of magnitude. It's easier to stay motivated when you only implement fakes for functions that are used by the target code. This approach successfully saved me a lot of time  and I've used it on subsequent projects as well. At the time of writing, I have 398 stubbed functions, about 250 functions that are trivially faked (return 0 or void functions that do nothing), and about 25 functions that I faked myself (almost all related to porting the memory allocation systems to userland).
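To give a flavor of what such a fake looks like (a minimal sketch using assumed names, not sockfuzzer's exact implementation), copyin can be modeled as consuming the next chunk of fuzzer-provided bytes instead of copying from user pages:

```cpp
#include <cstring>
#include <string>

// Fuzzed bytes standing in for "user memory"; in the real fuzzer this
// kind of buffer would be fed from the Session's data_provider field.
static std::string g_fuzz_data;
static size_t g_fuzz_cursor = 0;

// Fake copyin: instead of copying from a user address, hand the kernel
// code the next chunk of fuzzer-controlled bytes. Returns 0 on success
// and a nonzero errno-style value (14, i.e. EFAULT) when the provider
// runs out of data.
int fake_copyin(const void * /*uaddr*/, void *kaddr, size_t len) {
  if (g_fuzz_cursor + len > g_fuzz_data.size()) return 14;  // EFAULT
  memcpy(kaddr, g_fuzz_data.data() + g_fuzz_cursor, len);
  g_fuzz_cursor += len;
  return 0;
}
```

Failing with an error when the provider runs dry mirrors what a real copyin does on a bad user pointer, so the kernel code under test exercises both its success and failure paths.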

Booting Up

As soon as we start running the fuzzer, we'll run into a snag: many resources require a one-time initialization that happens on boot. The BSD half of the kernel is mostly initialized by calling the bsd_init  function. That function, in turn, calls several subsystem-specific initialization functions. Keeping with the theme of supporting a minimally necessary subset of the kernel, rather than call bsd_init , we create a new function that only initializes parts of the kernel as needed.

Here's an example crash that occurs without the one time kernel bootup initialization:

#7 0x7effbc464ad0 in zalloc /source/build3/../fuzz/zalloc.c:35:3

#8 0x7effbb62eab4 in pipepair_alloc /source/build3/../bsd/kern/sys_pipe.c:634:24

#9 0x7effbb62ded5 in pipe /source/build3/../bsd/kern/sys_pipe.c:425:10

#10 0x7effbc4588ab in pipe_wrapper /source/build3/../fuzz/syscall_wrappers.c:216:10

#11 0x4ee1a4 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:979:19

Our zalloc implementation (covered in the next section) failed because the pipe zone wasn't yet initialized:

static int
pipepair_alloc(struct pipe **rp_out, struct pipe **wp_out)
{
        struct pipepair *pp = zalloc(pipe_zone);

Scrolling up in sys_pipe.c, we see where that zone is initialized:

void
pipeinit(void)
{
        nbigpipe = 0;

        vm_size_t zone_size;

        zone_size = 8192 * sizeof(struct pipepair);
        pipe_zone = zinit(sizeof(struct pipepair), zone_size, 4096, "pipe zone");

Sure enough, this function is called by bsd_init. By adding that call to our initial setup function, the zone works as expected. After some development cycles spent supporting all the needed bsd_init function calls, we have the following:

__attribute__((visibility("default"))) bool initialize_network() {
  mcache_init();
  mbinit();
  eventhandler_init();
  pipeinit();
  dlil_init();
  socketinit();
  domaininit();
  loopattach();
  ether_family_init();
  tcp_cc_init();
  net_init_run();
  int res = necp_init();
  assert(!res);
  return true;
}


The original bsd_init is 683 lines long, but our initialize_network clone is the preceding short snippet. I want to remark how cool I found it that you could "boot" a kernel like this and have everything work so long as you implemented all the relevant stubs. It just goes to show a surprising fact: a significant amount of kernel code is portable, and simple steps can be taken to make it testable. These codebases can be modernized without being fully rewritten. As this "boot" relies on dynamic allocation, let's look at how I implemented that next.

Dynamic Memory Allocation

Providing a virtual memory abstraction is a fundamental goal of most kernels, but the good news is this is out of scope for this project (this is left as an exercise for the reader). Because networking already assumes working virtual memory, the network stack functions almost entirely on top of high-level allocator APIs. This makes the subsystem amenable to "high-level emulation". We can create a thin shim layer that intercepts XNU specific allocator calls and translates them to the relevant host APIs.

In practice, we have to handle three types of allocations for this project: "classic" allocations (malloc/calloc/free), zone allocations (zalloc), and mbufs (memory buffers). The first two are fundamental allocation types used across XNU, while mbufs are a common data structure used in low-level networking code.

The zone allocator is reasonably complicated, but we use a simplified model for our purposes: we just track the size assigned to a zone when it is created and allocate that many bytes when zalloc is later called on the initialized zone. This could undoubtedly be modeled better, but this initial model worked quite well for the types of bugs I was looking for. In practice, this simplification affects exploitability, but we aren't worried about that for a fuzzing project, as we can assess exploitability manually once we discover an issue. As you can see below, I created a custom zone type that simply stores the configured size, knowing that my zinit would return an opaque pointer to be passed to my zalloc implementation, which could then use calloc to service the request. zfree simply frees the requested bytes and ignores the zone, as allocation sizes are already tracked by the host malloc.

struct zone {
  uintptr_t size;
};

struct zone *zinit(uintptr_t size, uintptr_t max, uintptr_t alloc,
                   const char *name) {
  struct zone *zone = (struct zone *)calloc(1, sizeof(struct zone));
  zone->size = size;
  return zone;
}

void *zalloc(struct zone *zone) {
  assert(zone != NULL);
  return calloc(1, zone->size);
}

void zfree(void *zone, void *dat) {
  (void)zone;
  free(dat);
}

Kalloc, kfree, and related functions were passed through to malloc and free as well. You can see fuzz/zalloc.c for their implementations. Mbufs (memory buffers) are more work to implement because they contain considerable metadata that is exposed to the "client" networking code.

struct m_hdr {
        struct mbuf     *mh_next;       /* next buffer in chain */
        struct mbuf     *mh_nextpkt;    /* next chain in queue/record */
        caddr_t         mh_data;        /* location of data */
        int32_t         mh_len;         /* amount of data in this mbuf */
        u_int16_t       mh_type;        /* type of data in this mbuf */
        u_int16_t       mh_flags;       /* flags; see below */
};

/*
 * The mbuf object
 */
struct mbuf {
        struct m_hdr m_hdr;
        union {
                struct {
                        struct pkthdr MH_pkthdr;        /* M_PKTHDR set */
                        union {
                                struct m_ext MH_ext;    /* M_EXT set */
                                char MH_databuf[_MHLEN];
                        } MH_dat;
                } MH;
                char M_databuf[_MLEN];  /* !M_PKTHDR, !M_EXT */
        } M_dat;
};


I didn't include the pkthdr or m_ext structure definitions, but they are nontrivial (you can see for yourself in bsd/sys/mbuf.h). A lot of trial and error was needed to create a simplified mbuf format that would work. In practice, I use an inline buffer when possible and, when necessary, locate the data in one large external buffer and set the M_EXT flag. As these allocations must be aligned, I use posix_memalign to create them rather than malloc. Fortunately, ASAN can help manage these allocations, so we can detect some bugs with this modification.
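To make the strategy concrete, here is a minimal sketch of the idea. Everything here is hypothetical and simplified: the names, the inline capacity, and the alignment are invented for illustration and do not match sockfuzzer's or XNU's real definitions.

```c
#define _POSIX_C_SOURCE 200112L /* for posix_memalign */
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical simplified mbuf shim: FAKE_MLEN and FAKE_M_EXT are
 * illustrative stand-ins for XNU's _MLEN and M_EXT. */
#define FAKE_MLEN 224
#define FAKE_M_EXT 0x0001

struct fake_mbuf {
  uint16_t flags;
  char *data;               /* points into inline_buf or ext_buf */
  char *ext_buf;            /* aligned external allocation, if any */
  char inline_buf[FAKE_MLEN];
};

struct fake_mbuf *fake_mget(size_t len) {
  struct fake_mbuf *m = calloc(1, sizeof(*m));
  if (m == NULL) return NULL;
  if (len <= FAKE_MLEN) {
    /* Small payload: use the inline buffer. */
    m->data = m->inline_buf;
  } else {
    /* Large payload: one aligned external buffer, flagged M_EXT.
     * Using the host allocator means ASAN tracks the region and can
     * flag out-of-bounds and use-after-free accesses. */
    if (posix_memalign((void **)&m->ext_buf, 4096, len) != 0) {
      free(m);
      return NULL;
    }
    m->data = m->ext_buf;
    m->flags |= FAKE_M_EXT;
  }
  return m;
}

void fake_mfree(struct fake_mbuf *m) {
  if (m->flags & FAKE_M_EXT) {
    free(m->ext_buf);
  }
  free(m);
}
```

The key design point is that every data region is an ordinary host allocation, so the sanitizer, not a custom mbuf pool, mediates all accesses.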

Two bugs I reported via the Project Zero tracker highlight the benefit of the heap-based mbuf implementation. In the first report, I detected an mbuf double free using ASAN. While the m_free implementation tries to detect double frees by checking the state of the allocation, ASAN goes even further by quarantining recently freed allocations to detect the bug. In this case, it looks like the fuzzer would have found the bug either way, but it was impressive. The second issue linked is much subtler and requires some instrumentation to detect, as it is a use-after-free read of an mbuf:

==22568==ERROR: AddressSanitizer: heap-use-after-free on address 0x61500026afe5 at pc 0x7ff60f95cace bp 0x7ffd4d5617b0 sp 0x7ffd4d5617a8
READ of size 1 at 0x61500026afe5 thread T0
    #0 0x7ff60f95cacd in tcp_input bsd/netinet/tcp_input.c:5029:25
    #1 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #2 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10

0x61500026afe5 is located 229 bytes inside of 256-byte region [0x61500026af00,0x61500026b000)
freed by thread T0 here:
    #0 0x4a158d in free /b/swarming/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7ff60fb7444d in m_free fuzz/zalloc.c:220:3
    #2 0x7ff60f4e3527 in m_freem bsd/kern/uipc_mbuf.c:4842:7
    #3 0x7ff60f5334c9 in sbappendstream_rcvdemux bsd/kern/uipc_socket2.c:1472:3
    #4 0x7ff60f95821d in tcp_input bsd/netinet/tcp_input.c:5019:8
    #5 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #6 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10


Apple managed to catch this issue before I reported it, fixing it in iOS 13. I believe Apple has added some internal hardening or testing for mbufs that caught this bug. It could be anything from a hardened mbuf allocator like GWP-ASAN, to an internal ARM MTE test, to simple auditing, but it was really cool to see this issue detected in this way, and also that Apple was proactive enough to find this themselves.

Accessing User Memory

When talking about this project with a fellow attendee at a fuzzing conference, their biggest question was how I handled user memory access. Kernels are never supposed to trust pointers provided by userspace, so whenever the kernel wants to access memory mapped in userspace, it goes through the intermediate functions copyin and copyout. By replacing these functions with our fake implementations, we can supply fuzzer-provided input to the tested code; the real kernel would have done the relevant copies from user to kernel pages. Because these copies are driven by the target code and not our test case, I added a buffer to the protobuf specification to service these requests.

Here's a backtrace from our stub before we implement `copyin`. As you can see, when calling the `recvfrom` syscall, our fuzzer passed in a pointer as an argument.

#6 0x7fe1176952f3 in Assert /source/build3/../fuzz/stubs.c:21:3

#7 0x7fe11769a110 in copyin /source/build3/../fuzz/fake_impls.c:408:3

#8 0x7fe116951a18 in __copyin_chk /source/build3/../bsd/libkern/copyio.h:47:9

#9 0x7fe116951a18 in recvfrom_nocancel /source/build3/../bsd/kern/uipc_syscalls.c:2056:11

#10 0x7fe117691a86 in recvfrom_nocancel_wrapper /source/build3/../fuzz/syscall_wrappers.c:244:10

#11 0x4e933a in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:936:9

#12 0x4e43b8 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

I've extended the copyin specification with my fuzzer-specific semantics: when the pointer (void*)1  is passed as an address, we interpret this as a request to fetch arbitrary bytes. Otherwise, we copy directly from that virtual memory address. This way, we can begin by passing (void*)1  everywhere in the fuzz target to get as much cheap coverage as possible. Later, as we want to construct well-formed data to pass into syscalls, we build the data in the protobuf test case handler and pass a real pointer to it, allowing it to be copied. This flexibility saves us time while permitting the construction of highly-structured data inputs as we see fit.

int __attribute__((warn_unused_result))
copyin(void *user_addr, void *kernel_addr, size_t nbytes) {
  // Address 1 means use fuzzed bytes, otherwise use real bytes.
  // NOTE: this does not support nested useraddr.
  if (user_addr != (void *)1) {
    memcpy(kernel_addr, user_addr, nbytes);
    return 0;
  }
  if (get_fuzzed_bool()) {
    return -1;
  }
  get_fuzzed_bytes(kernel_addr, nbytes);
  return 0;
}

Copyout is designed similarly. We often don't care about the data copied out; we just care about the safety of the accesses. For that reason, we make sure to memcpy from the source buffer in all cases, using a temporary buffer when a copy to (void*)1  occurs. If the kernel copies out of bounds or from freed memory, for example, ASAN will catch it and inform us about a memory disclosure vulnerability.
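That copyout design can be sketched as follows. This is my illustrative reconstruction rather than sockfuzzer's actual code, and the get_fuzzed_bool stub here is a trivial stand-in for the fuzzer's real data provider:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the fuzzer's data-provider helper; the real one would
 * consume bytes from the current test case. */
static bool get_fuzzed_bool(void) { return false; }

// Mirror of the copyin shim: (void*)1 means "no real user buffer".
// We always read the full source region so ASAN can catch an
// out-of-bounds or use-after-free read (i.e. a memory disclosure).
int copyout(const void *kernel_addr, void *user_addr, size_t nbytes) {
  if (user_addr != (void *)1) {
    memcpy(user_addr, kernel_addr, nbytes);
    return 0;
  }
  if (get_fuzzed_bool()) {
    return -1; /* occasionally simulate a copyout fault */
  }
  /* No destination: still perform the read via a scratch buffer so
   * the sanitizer observes the access, then discard the data. */
  void *scratch = malloc(nbytes);
  if (scratch == NULL) return -1;
  memcpy(scratch, kernel_addr, nbytes);
  free(scratch);
  return 0;
}
```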

Synchronization and Threads

Among the many changes made to XNU's behavior to support this project, perhaps the most extensive and invasive are the changes I made to the synchronization and threading model. Before beginning this project, I had spent over a year working on Chrome browser process research, where high-level "sequences" are preferred to physical threads. Despite a paucity of data races, Chrome still had sequence-related bugs that were triggered by randomly servicing some of the pending work in between performing synchronous IPC calls. In an exploit for a bug found by the AppCache fuzzer, sleep calls were needed to get the asynchronous work to complete before queueing up more work synchronously. So I already knew that asynchronous continuation-passing style concurrency could have exploitable bugs that are easy to discover with this fuzzing approach.

I suspected I could find similar bugs if I used a similar model for sockfuzzer. Because XNU uses multiple kernel threads in its networking stack, I would have to port it to a cooperative style. To do this, I provided no-op implementations for all of the thread management functions and sync primitives, and instead randomly called the work functions that would have been called by the real threads. This involved modifying code: most worker threads run in a loop, processing new work as it comes in. I modified these infinitely looping helper functions to do one iteration of work and exposed them to the fuzzer frontend, then called them randomly as part of the protobuf message. The main benefits of doing the project this way were improved performance and determinism: places where the kernel could block the fuzzer were modified to return early, and overall it was a lot simpler to manage a single-threaded process. But this decision did not yield as many bugs as I had hoped; for example, I suspected that interleaving garbage collection of various network-related structures with syscalls would be more fruitful than it turned out to be. The approach did achieve the goal of removing threading-related headaches from deploying the fuzzer, but losing real concurrency is a serious weakness that I would like to address in future fuzzer revisions.
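As a toy illustration of that conversion (all names here are invented; XNU's real worker threads are far more involved), an infinitely looping worker becomes a run-one-iteration function the fuzzer frontend can invoke at will:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work queue. The original thread body would be
 * "for (;;) { block until work; drain; }"; instead we expose a single
 * iteration and let the fuzzer decide when it runs. */
#define QLEN 8
static int work_queue[QLEN];
static size_t q_head, q_tail;
static int processed_total;

void worker_enqueue(int item) {
  work_queue[q_tail % QLEN] = item;
  q_tail++;
}

/* One iteration of the former thread loop. Where the real thread
 * would block waiting for work, we simply return early. */
bool worker_run_one(void) {
  if (q_head == q_tail) {
    return false; /* would have blocked */
  }
  processed_total += work_queue[q_head % QLEN];
  q_head++;
  return true;
}
```

The fuzzer can then schedule `worker_run_one` as just another message in the protobuf test case, which makes work interleavings reproducible.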

Randomness

Randomness is another service provided by kernels to userland (e.g. /dev/random) and to in-kernel services requiring it. This is easy to emulate: we can just return as many bytes as were requested from the current test case's data_provider field.
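A minimal sketch of that emulation follows; the data-provider plumbing here is a hypothetical stand-in for the protobuf data_provider field:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data provider backed by the current test case. */
static const uint8_t *g_provider;
static size_t g_provider_len, g_provider_pos;

static void get_fuzzed_bytes(void *buf, size_t n) {
  uint8_t *out = buf;
  for (size_t i = 0; i < n; i++) {
    /* Deterministically zero-fill once the test case is exhausted. */
    out[i] = (g_provider_pos < g_provider_len)
                 ? g_provider[g_provider_pos++]
                 : 0;
  }
}

/* Emulated kernel randomness: entirely driven by the test case,
 * which keeps every run reproducible. */
void read_random(void *buffer, unsigned int numbytes) {
  get_fuzzed_bytes(buffer, numbytes);
}
```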

Authentication

XNU features some mechanisms (KAuth, MAC checks, user checks) to determine whether a given syscall is permissible. Because of the importance and relative rarity of bugs in XNU, and my willingness to triage false positives, I decided to allow all actions by default. For example, the TCP multipath code requires a special entitlement, but disabling this functionality would preclude us from finding Ian's multipath vulnerability. Rather than fuzz only code accessible inside the app sandbox, I figured I would just triage whatever came up and report it with the appropriate severity in mind.

For example, when we create a socket, the kernel checks whether the running process is allowed to make a socket of the given domain, type, and protocol based on its KAuth credentials:

static int
socket_common(struct proc *p,
    int domain,
    int type,
    int protocol,
    pid_t epid,
    int32_t *retval,
    int delegate)
{
        struct socket *so;
        struct fileproc *fp;
        int fd, error;

        AUDIT_ARG(socket, domain, type, protocol);
#if CONFIG_MACF_SOCKET_SUBSET
        if ((error = mac_socket_check_create(kauth_cred_get(), domain,
            type, protocol)) != 0) {
                return error;
        }
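In the fuzzer, checks like this one are simply stubbed out to succeed. A sketch of the idea, with signatures simplified to void * for illustration (the real XNU functions take a kauth_cred_t, and there are many more checks than these two):

```c
#include <stddef.h>

/* Hypothetical permissive stubs. Allowing everything widens coverage
 * beyond the app sandbox; any finding is then triaged manually for
 * reachability and reported with the appropriate severity. */
int mac_socket_check_create(void *cred, int domain, int type,
                            int protocol) {
  (void)cred; (void)domain; (void)type; (void)protocol;
  return 0; /* always permit socket creation */
}

int priv_check_cred(void *cred, int priv, int flags) {
  (void)cred; (void)priv; (void)flags;
  return 0; /* every privilege check passes */
}
```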
