A few months ago the word “syscall” was a fairly abstract and kind of intimidating concept to me. In this post, I will try to help anyone in a similar situation learn about system calls and feel comfortable around their nomenclature.

As a note, this is specifically for Linux on the 32-bit x86 architecture. Syscall numbers and registers differ on other architectures, including 64-bit x86.

Why should you care about syscalls?

As a web developer, learning about syscalls and the infrastructure around them made me feel quite a bit more confident in my daily work. Ruby and C++ each have their own idiomatic way of opening a file, but both ultimately end up using the syscall open(). This is because userland processes (like the ones we write) have only one way of communicating with the operating system: syscalls.
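To make that concrete, here is a small sketch in Python (my choice of language for illustration, not something the rest of this post depends on). It reads the same file twice: once with the idiomatic built-in open(), and once with the thin os.open / os.read / os.close wrappers, which map closely onto individual syscalls. One caveat: on current systems the C library may issue openat(2) rather than open(2) under the hood.

```python
import os
import tempfile


def read_with_builtin(path):
    # Idiomatic Python: a buffered file object that hides the syscalls.
    with open(path, "rb") as f:
        return f.read()


def read_with_os_calls(path):
    # Thin wrappers: each os.* call corresponds closely to one syscall.
    fd = os.open(path, os.O_RDONLY)       # open(2), or openat(2) on newer libcs
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)     # read(2)
            if not chunk:                 # empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)                      # close(2)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from userland\n")
    try:
        # Both routes end up asking the kernel for the same bytes.
        print(read_with_builtin(tmp.name) == read_with_os_calls(tmp.name))
    finally:
        os.unlink(tmp.name)
```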

What to except when you’re excepting

In order for a process to communicate with the kernel, it has to pass execution to it somehow along with a number of arguments. It does that by issuing an exception, which moves the control flow from your process to the kernel’s interrupt handler, which processes the arguments and selects the correct syscall.

An exception is just one name for this concept - but there are a lot of names for the same thing: “different manufacturers have used terms like exceptions, faults, aborts, traps, and interrupts.”[0]

In order to better understand this, let’s take a look at a very simple syscall in x86 assembly: getpid, which returns the id of the calling process. Its syscall number is 20, so we put that into the eax CPU register, since that’s where the kernel will look to determine which syscall to call.

mov eax, 20
int 0x80

The int instruction above triggers a software interrupt, which causes the CPU to stop executing your process and jump to the kernel’s interrupt handler. The kernel sees that the interrupt vector we specified was 0x80, or 128, which corresponds to the syscall interrupt vector. It then looks in the eax register and checks whether that number is in its syscall table. If it is, the kernel calls that syscall.
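If you would rather poke at this without an assembler, here is a rough equivalent in Python. It is a sketch under a few assumptions: it goes through libc’s generic syscall() wrapper via ctypes instead of executing int 0x80 itself, and the numbers for architectures other than 32-bit x86 are taken from the usual Linux syscall tables, not from this post. The names GETPID_NUMBERS and raw_getpid are made up for the example.

```python
import ctypes
import os
import platform

# Syscall numbers are per architecture. 20 is the 32-bit x86 number used above;
# the other entries are assumptions from the standard Linux syscall tables.
# This also assumes a native Python build (not a 32-bit Python on a 64-bit kernel).
GETPID_NUMBERS = {
    "i386": 20,
    "i686": 20,
    "x86_64": 39,
    "aarch64": 172,
}


def raw_getpid():
    """Invoke getpid by number through libc's generic syscall() wrapper."""
    number = GETPID_NUMBERS[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)  # symbols already loaded in this process
    return libc.syscall(number)


if __name__ == "__main__":
    # The kernel only ever saw a number, never the name "getpid".
    print(raw_getpid(), os.getpid())
```

Both values printed should be the same process id, since os.getpid() ends up asking the kernel the same question.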

Let’s take a look at exactly where that takes you inside the Linux kernel:

sysenter_do_call:
  cmpl $(NR_syscalls), %eax
  jae sysenter_badsys           
  call *sys_call_table(,%eax,4) 

Ok that went by pretty fast for me. Let’s see that again in slo-mo. Also, this code might look different since it’s using AT&T syntax, not Intel syntax as used elsewhere in this post.

sysenter_do_call:
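  # cmpl - compare
  # Computes %eax - NR_syscalls and sets the CPU flags; the result itself is discarded
  cmpl $(NR_syscalls), %eax

  # jae - jump if Above or Equal (an unsigned comparison)
  # If the syscall number is >= NR_syscalls it is out of range, so handle the bad call
  jae sysenter_badsys           

  # call - call a subroutine
  # *sys_call_table(,%eax,4)
  #   - The * marks an indirect call: the target address is read from memory
  #   - sys_call_table is the base address of the table of function pointers
  #   - %eax is the index, and 4 is the size in bytes of each 32-bit entry
  # Call the syscall you wanted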
  call *sys_call_table(,%eax,4) 

As we saw before, the syscall number goes in register eax. The Linux kernel knows nothing about syscall names. All it knows is their numbers, and this is where it looks up the syscall’s function pointer and calls it. Here are some examples of some syscalls you might recognize and their numbers:

  • 5 - open(2) - open a file
  • 12 - chdir(2) - your good friend, cd
  • 34 - nice(2) - change a process’s nice value

Here’s a full table of syscalls and their arguments.

Once a syscall number is decided, it is never changed. As you can imagine, doing so would break every program already compiled against the old number.
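To make the dispatch step concrete, here is a toy model of it in Python. This is only a sketch of the idea, not how the kernel is written: the sys_* functions are hypothetical stand-ins, the table holds just the three syscalls listed above, and NR_SYSCALLS is shrunk to fit. It mirrors the two steps from the assembly: a bounds check, then an indexed call through a table of functions.

```python
import errno


# Hypothetical stand-ins for kernel functions; they just describe the call.
def sys_open(path, flags, mode):
    return f"open({path!r})"


def sys_chdir(path):
    return f"chdir({path!r})"


def sys_nice(increment):
    return f"nice({increment})"


# A toy table indexed by syscall number (32-bit x86 numbers from the list above).
NR_SYSCALLS = 35
SYS_CALL_TABLE = [None] * NR_SYSCALLS
SYS_CALL_TABLE[5] = sys_open
SYS_CALL_TABLE[12] = sys_chdir
SYS_CALL_TABLE[34] = sys_nice


def dispatch(eax, *args):
    """Mimic the cmpl/jae bounds check followed by the indirect call."""
    # jae compares unsigned, so a "negative" eax also lands out of range.
    if not 0 <= eax < NR_SYSCALLS:
        return -errno.ENOSYS          # the bad-syscall path
    handler = SYS_CALL_TABLE[eax]     # sys_call_table(,%eax,4)
    if handler is None:
        return -errno.ENOSYS          # hole in this toy table
    return handler(*args)             # call *...


if __name__ == "__main__":
    print(dispatch(12, "/tmp"))
    print(dispatch(99))
```

Notice that dispatch never sees a name, only a number, which is exactly why the numbers cannot change once programs depend on them.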

Aside: when you see syscalls written like this: open(2), execve(2), the 2 refers to the man page section, and section 2 is the one for syscalls.

Passing arguments to syscalls

Ok, so a syscall is just a function in the kernel you call in a special interrupt-y way. How do you pass it arguments?

We saw that you put the syscall number in register eax. The kernel looks for the arguments, in order, in registers ebx, ecx, edx, esi, edi, and ebp. Let’s take a look at a hello world program using the syscalls write() and exit(), which only need the first three.

global _start
 
section .text
_start:
  mov eax, 4 ; write
  mov ebx, 1 ; stdout
  mov ecx, msg
  mov edx, msg.len
  int 0x80   ; write(stdout, msg, strlen(msg));
 
  mov eax, 1 ; exit
  mov ebx, 0
  int 0x80   ; exit(0)
 
section .data
msg:  db  "Hello, world!", 10
.len: equ $ - msg

The first argument (in ebx) is a file descriptor - in this case stdout. The second argument (ecx) is a pointer to the start of the message, and the third (edx) is the message’s length.

exit takes one argument, the exit code - which was 0.
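Here is the same write call sketched in Python, for comparison. As before, this is an illustration under assumptions: it uses libc’s syscall() wrapper through ctypes, which loads the registers for us, and the syscall numbers for architectures other than 32-bit x86 come from the usual Linux tables. WRITE_NUMBERS and raw_write are names invented for the example.

```python
import ctypes
import os
import platform

# write's number per architecture. 4 is the 32-bit x86 value used above;
# the others are assumptions from the standard Linux syscall tables.
WRITE_NUMBERS = {"i386": 4, "i686": 4, "x86_64": 1, "aarch64": 64}


def raw_write(fd, data):
    """Call write(fd, data, len(data)) by syscall number."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long       # write returns a byte count or -1
    number = WRITE_NUMBERS[platform.machine()]
    buf = ctypes.c_char_p(data)                # pointer to the start of the message
    length = ctypes.c_size_t(len(data))        # the message's length
    written = libc.syscall(number, fd, buf, length)
    if written < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return written


if __name__ == "__main__":
    raw_write(1, b"Hello, world!\n")           # 1 is stdout, as in the assembly
```

The three arguments after the number line up with ebx, ecx, and edx in the assembly version: file descriptor, pointer, length.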

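If the syscall you’re using takes more arguments than there are registers for, or needs to exchange a larger chunk of data, then instead of putting the values themselves in registers, you’ll pass a pointer to a data structure you own in userspace, and the kernel reads from or writes to that memory.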
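A small example of the pointer-to-a-structure pattern is uname(2). It is not a many-argument syscall, but it shows the same mechanism: our process allocates a structure, hands the kernel a single pointer, and the kernel fills the structure in. The sketch below goes through libc’s uname() wrapper via ctypes, and the field layout (six 65-byte character arrays) is an assumption that holds for Linux with glibc; Utsname and raw_uname are names made up for the example.

```python
import ctypes
import os


class Utsname(ctypes.Structure):
    # Assumed layout of struct utsname on Linux with glibc:
    # six NUL-terminated strings of 65 bytes each.
    _fields_ = [
        (name, ctypes.c_char * 65)
        for name in ("sysname", "nodename", "release",
                     "version", "machine", "domainname")
    ]


def raw_uname():
    """Ask the kernel to fill in a structure that lives in our own memory."""
    libc = ctypes.CDLL(None, use_errno=True)
    buf = Utsname()                            # memory owned by our process
    if libc.uname(ctypes.byref(buf)) != 0:     # only the pointer is passed
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.sysname.decode(), buf.release.decode()


if __name__ == "__main__":
    print(raw_uname())
```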

Fin

If you want to learn more about syscalls, please consult these fine sources of nougaty syscall goodness:

If you want to see syscalls in action, try using the strace command on Linux. There’s a fantastic writeup on it by Julia Evans. She has some useful links at the end, too, so you can click on them if you want to read more. FYI: Those links might have even more links.

Footnotes