RE: March 2013

Wednesday, March 13, 2013

KeLoaderBlock and you

My goal with this blog is to post undocumented details of the Windows operating system, meaning topics that should interest both software reverse-engineers and malware analysts. One of those topics is a lot more prominent to me than the rest: mechanisms that attempt to detect or evade debugging. Whether it be DRM or actual malware, I'd have to say it's my favorite topic.

What we're going to discuss today has probably already been discussed elsewhere. However, out of all the methods used to detect whether a kernel debugger is attached to the system, I think this one is hardly used or mentioned, so it warrants a quick discussion.

As you probably already know, KeLoaderBlock is the first argument to KiSystemStartup. Among a plethora of other details, this structure contains the boot flags from the BCD entry corresponding to our current boot: for instance the boot option selection timeout, test-signing, NX opt-in or opt-out, the /debug flags for the kernel debugger, etc.

KeLoaderBlock is not accessible from user-mode, but I'm always surprised how many are unaware that during initialization the startup flags are written to the following registry value.

HKLM\System\CurrentControlSet\Control - SystemStartOptions

From these flags, software can easily find out if the system was booted with /TESTSIGNING or /DEBUG ON.
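To make the check concrete, here is a minimal sketch in C. The helper names (has_boot_flag, read_start_options) are made up for illustration, and the parsing assumes the value is a whitespace-separated list of tokens that may carry a value after '=' (something like "NOEXECUTE=OPTIN DEBUG DEBUGPORT=COM1"); check the actual contents on your own target before relying on that format. The registry read uses RegGetValueA and only compiles on Windows.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int equal_nocase(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (toupper((unsigned char)a[i]) != toupper((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/*
 * Returns 1 if `flag` (e.g. "DEBUG") appears as a whole token name in a
 * SystemStartOptions-style string, 0 otherwise. A leading '/' and anything
 * after '=' are ignored, so "DEBUGPORT=COM1" does not match "DEBUG".
 * The token format is an assumption, not something the OS documents.
 */
int has_boot_flag(const char *options, const char *flag)
{
    assert(options != NULL && flag != NULL);
    size_t flag_len = strlen(flag);
    const char *p = options;

    while (*p != '\0') {
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;                                  /* skip separators */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;                                  /* p is now the token end */
        if (start < p && *start == '/')
            start++;                              /* tolerate "/DEBUG" */

        size_t name_len = 0;
        while (start + name_len < p && start[name_len] != '=')
            name_len++;                           /* name stops at '=' */

        if (flag_len > 0 && name_len == flag_len &&
            equal_nocase(start, flag, flag_len))
            return 1;
    }
    return 0;
}

#ifdef _WIN32
#include <windows.h>

/* Reads the value into buf; returns 1 on success. Link with advapi32. */
int read_start_options(char *buf, DWORD buf_size)
{
    DWORD size = buf_size;
    LSTATUS status = RegGetValueA(HKEY_LOCAL_MACHINE,
                                  "SYSTEM\\CurrentControlSet\\Control",
                                  "SystemStartOptions",
                                  RRF_RT_REG_SZ, NULL, buf, &size);
    return status == ERROR_SUCCESS;
}
#endif
```

On Windows, you would call read_start_options into a local buffer and then test has_boot_flag(buf, "DEBUG") or has_boot_flag(buf, "TESTSIGNING").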

As you can see, this method is very simple. So simple that it's often overlooked.

Friday, March 8, 2013

2 anti-tracing mechanisms specific to windows x64

In this context, the term "anti-trace" refers to detecting the mode in which RFLAGS.TF is set and the processor raises a #DB exception at every instruction boundary or, in the case of branch tracing, after each taken branch.

Please note that the first method I am about to describe will in fact work under the wow64 subsystem on Windows x64, but only because of the long-mode implementation underneath. It's best to leave these methods in 64-bit code only, because they will not function without long-mode and Windows x64.

Nested task bit

The nested task bit of RFLAGS was used and set by the legacy task switching system when the processor would transfer control through a task-gate, a task segment or the actual TSS descriptor in the GDT itself. When this happened, the processor would set EFLAGS.NT. This would enable a subsequent IRET to use the TSS backlink selector to return to the previous task. However like the segmented memory model, the hardware task switching mechanism was hardly used. The only purpose it served in most cases was for storing stack pointers for different privilege levels.

In 64-bit mode, most of the segmented memory model was done away with. Except for GS, FS and system descriptors, the base is always treated as zero, and the displacement offset usually found in a general purpose register is treated as the actual linear address.

The same goes for the hardware task switching model. In long-mode the TSS's only purpose is to hold the stack pointers for each CPL change and an IST, which can be used for secure stacks when needed for NMIs etc.

Thus since there is no TSS backlink for an IRET to dispatch to, nor is the hardware task switching mechanism even available in long-mode, an IRET with RFLAGS.NT=1 will cause a general protection exception. In user-mode, depending on the scenario, these are usually dispatched as STATUS_ACCESS_VIOLATION (0xC0000005).

Now, when the trap flag is set for a task, we know that a debug exception will occur at the next instruction boundary. The trap flag is also set in the RFLAGS image on the interrupt handler stack; however, prior to dispatching the exception the kernel masks the trap flag in that image to 0. As you already know, this requires the user-level debugger to call SetThreadContext to re-enable the trap flag and continue single-stepping (or branch tracing). However, an interesting thing shows up in x64 kernels when we have a look at PspSetContext. This function is part of the APC routine used to modify a thread's context in its saved trap-frame.

If the CONTEXT_CONTROL flag is specified in the ContextFlags member (which it needs to be in order to mask on RFLAGS.TF and continue single-stepping), PspSetContext will mask off RFLAGS.NT each time it's called. This means that if we are single-stepping over an IRET with RFLAGS.NT=1, no general protection fault will be generated; without the stepping, it will be.

Here is another interesting scenario, because this isn't just limited to detecting tracing. Notice how PspSetContext will mask off RFLAGS.NT each time the APC is queued to the thread and the CONTEXT_CONTROL flag is set? CONTEXT_CONTROL is not only used for RFLAGS; it also covers the instruction pointer, the stack pointer and the CS and SS selectors. Let's say somewhere during the initialization of our program we set RFLAGS.NT. Then somewhere down the road we use the IRET GP fault mechanism to cause some indirection. If at any time a debugger has re-adjusted the context of our thread with CONTEXT_CONTROL (which it would need to do for int3 ;p), we can assume a debugger is attached, because RFLAGS.NT will no longer be set and therefore no GP fault will be generated.
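The reasoning above can be written down as a small model in C. This is only a sketch of the logic, not kernel code: debugger_resume_step stands in for what the text says a CONTEXT_CONTROL write ends up doing to the flags image, and all three function names are invented for this example. Only the bit positions (TF is bit 8, NT is bit 14) come from the architecture.

```c
#include <assert.h>
#include <stdint.h>

#define RFLAGS_TF (1ULL << 8)    /* trap flag */
#define RFLAGS_NT (1ULL << 14)   /* nested task */

static_assert(RFLAGS_NT == 0x4000, "NT is bit 14 of RFLAGS");

/*
 * Model of a debugger resuming a single-step: it sets TF through a
 * CONTEXT_CONTROL write, and (per the behaviour described in the text)
 * the flags image comes out with NT cleared.
 */
uint64_t debugger_resume_step(uint64_t rflags)
{
    return (rflags | RFLAGS_TF) & ~RFLAGS_NT;
}

/* In long mode an IRET executed with NT=1 raises #GP. */
int iret_raises_gp(uint64_t rflags)
{
    return (rflags & RFLAGS_NT) != 0;
}

/*
 * The detection itself: we set NT earlier and expected the IRET to fault.
 * If it completed without a fault, something rewrote our context.
 */
int context_was_rewritten(int nt_was_set, int gp_fault_seen)
{
    return nt_was_set && !gp_fault_seen;
}
```

In a real program, gp_fault_seen would come from your own exception handler observing (or not observing) the access violation at the IRET.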

Hopefully you see how this goes beyond just a simple anti-tracing mechanism to a pretty powerful anti-debugging trick altogether.

Alignment check

The second is based on the exact same logic we just discussed, except in this case it is applied to RFLAGS.AC. When this flag is set, an alignment check fault is raised when the task attempts to access data at an address that is not a multiple of the operand size. For example, the following instruction would cause an alignment check fault if RFLAGS.AC was masked on:

mov rax, qword ptr [rsp+04h]

However, following the same logic as in our discussion above, this flag is also masked off each time PspSetContext is called. Thus if we were stepping over the instruction, it would not generate an alignment check fault. The same applies if PspSetContext is called at any point after RFLAGS.AC is set: the flag will have been cleared, and no fault will occur at the desired location.
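As a quick illustration of when the fault is expected, here is a hedged sketch in C. The function names are hypothetical, and the model assumes the other architectural preconditions for #AC (CR0.AM set, code running at CPL 3) already hold, which the text does not discuss.

```c
#include <assert.h>
#include <stdint.h>

#define RFLAGS_AC (1ULL << 18)   /* alignment check */

/* Returns 1 if address is not a multiple of operand_size (a power of two). */
int is_misaligned(uint64_t address, uint64_t operand_size)
{
    assert(operand_size != 0 && (operand_size & (operand_size - 1)) == 0);
    return (address & (operand_size - 1)) != 0;
}

/*
 * Returns 1 if the access would be expected to raise #AC.
 * Assumes CR0.AM is set and the code runs in user-mode (CPL 3).
 */
int expects_alignment_fault(uint64_t rflags, uint64_t address,
                            uint64_t operand_size)
{
    return (rflags & RFLAGS_AC) != 0 && is_misaligned(address, operand_size);
}
```

With an 8-byte aligned stack pointer, the qword read at rsp+04h from the example is misaligned, so the fault is expected exactly when RFLAGS.AC survived. If a context write cleared the flag in the meantime, expects_alignment_fault returns 0 and the missing fault is the tell.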

An important thing to note however is that the first mechanism we described today (Nested task bit) will work within wow64. However the x64 kernel will not dispatch alignment faults that are generated in user-mode within the context of a wow64 process. Instead it will simply mask off RFLAGS.AC and IRET to the faulted instruction. This is why these methods should be left strictly to code that runs in a 64 bit process.

Sunday, March 3, 2013

Utilizing paged virtual memory as an anti-debug and anti-dumping mechanism

The Windows memory manager logic is designed around performance, reliability, physical page repurposing, sharing, low disk writes and a hierarchy of named objects and directories. Today we are going to talk about paged memory, user-mode memory in particular.

In most cases, as you probably already know, the memory your user-mode software uses is paged unless specified otherwise. This means that prior to the first access to a page, there is no associated physical page frame. This is because the Windows memory manager won't commit a physical page until it's absolutely needed. This is done via a page fault.

A page fault is an exception, and servicing it costs processing time: the kernel has to dispatch it, find an unused physical page (or in the case of an image, a shared one) and add it into the corresponding page tables for that virtual address. Each process has a 'working set' limit and a list which contains virtual addresses that have valid translations and should not be paged out. This is to reduce time spent dispatching page faults, which can otherwise cause the process to take a major performance hit.

When you allocate memory to your process from user-mode, for example with VirtualAlloc or NtMapViewOfSection, these functions do not actually set up mappings to PFNs in the process' page tables. Instead they allocate VAD nodes (virtual address descriptors) in the process' VAD tree. Each process has a VAD tree, and its nodes represent and describe valid virtual addresses within the process address space. This is where the VirtualQuery function gets its data from.

Now notice I said that a virtual address translation is not created. As said before, the Windows memory manager isn't going to commit a page frame or page in already paged-out data until it's absolutely needed. So let's do a basic walk through of NtAllocateVirtualMemory:

-Find an empty address range within the VAD tree

-Allocate a VAD node describing the memory

-Return

Now let's say the returned virtual address is 0x30000 and it is a 4 KB page.

When we access this page for the first time, there is no valid translation, so a page fault is generated. The VAD tree is used to resolve the page fault, a physical page is committed and we IRET right back to the faulted instruction. All of this generally goes unnoticed by the program and its author, as if the memory had always been available.

Wouldn't it be neat if there was a way to see if the page translation is valid for an arbitrary virtual address, other than just VirtualQuery telling us it's there even though it's not really been paged in yet?

Well of course there is! NtQueryVirtualMemory provides an infoclass of 0x4 (MemoryWorkingSetExInformation), and there is even a higher level API called QueryWorkingSetEx which will do the dirty work for us.

This is how we can easily determine if the page has ever been accessed. For instance, the kernel implementation of NtReadVirtualMemory will directly access the virtual address; if the translation is not valid, the page will be paged in and its contents returned to the caller. By examining whether bit 0 is set in the data provided to us by QueryWorkingSetEx, we can determine if the page table entry is valid for that virtual address. If it is, the memory has been accessed.
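Here is a minimal sketch of that query in C. The bit test is plain C, and the Windows part wraps QueryWorkingSetEx with the PSAPI_WORKING_SET_EX_INFORMATION structure from psapi.h. The helper names ws_attributes_valid and page_is_resident are mine, not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 of the per-page attribute word is the Valid bit. */
int ws_attributes_valid(uintptr_t attributes)
{
    return (int)(attributes & 1u);
}

#ifdef _WIN32
#include <windows.h>
#include <psapi.h>

/*
 * Returns 1 if the page holding `address` has a valid translation in the
 * current process, 0 if it does not, and -1 if the query itself failed.
 * Depending on the SDK you may need to link with psapi.lib.
 */
int page_is_resident(const void *address)
{
    PSAPI_WORKING_SET_EX_INFORMATION info;

    assert(address != NULL);
    info.VirtualAddress = (PVOID)address;
    info.VirtualAttributes.Flags = 0;

    if (!QueryWorkingSetEx(GetCurrentProcess(), &info, (DWORD)sizeof(info)))
        return -1;

    return ws_attributes_valid(info.VirtualAttributes.Flags);
}
#endif
```

The idea would be to allocate a page with VirtualAlloc, never touch it yourself, and call page_is_resident on it later. A result of 1 means something else read or wrote the page.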

Another way is to use NtRaiseException. Specify the newly allocated virtual address as the instruction pointer in the context argument, and be sure to set the ContextFlags accordingly. Most debuggers will then read the instruction bytes at the instruction pointer address for disassembly, which pages the memory in. A valid translation for a page we never touched ourselves indicates that a debugger is undoubtedly present.

Another method not involving the use of an API would be to measure time deltas using the kernel/user shared page or processor cycles with rdtsc between instructions that access memory from the linear virtual address. This is because the time to dispatch the page fault will be extremely noticeable compared to a few cycles to access already available memory.
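A rough sketch of the timing variant in C follows. The comparison is deliberately crude: first_access_faulted and its factor parameter are assumptions made for illustration, and a usable threshold has to be found by measuring on the target machine. timed_read uses the __rdtsc intrinsic and is only compiled on x86; note that rdtsc is not serializing, so the numbers are noisy.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#  ifdef _MSC_VER
#    include <intrin.h>
#  else
#    include <x86intrin.h>
#  endif

/* Cycles spent on a single byte read from p (x86 only, unserialized). */
uint64_t timed_read(const volatile unsigned char *p)
{
    uint64_t start = __rdtsc();
    (void)*p;                       /* the volatile read we are timing */
    return __rdtsc() - start;
}
#endif

/*
 * Returns 1 if the first access took more than `factor` times as long as
 * a warm access, which is what a page fault on first touch looks like.
 * A return of 0 on a page we never touched suggests someone else did.
 */
int first_access_faulted(uint64_t first_cycles, uint64_t warm_cycles,
                         uint64_t factor)
{
    assert(factor > 0);
    if (warm_cycles == 0)
        warm_cycles = 1;            /* guard against a zero baseline */
    return first_cycles / factor > warm_cycles;   /* division avoids overflow */
}
```

You would call timed_read twice on the same fresh page, feed both results to first_access_faulted, and treat a fast first access as suspicious.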

Use your imagination, there are many possibilities ;p