Understanding the sleeping mechanism in the Linux kernel

The students of INFO-0940 were asked to remove a task from its current scheduler class when a new syscall is called, and to put it back when that syscall is called again. The question is: what should happen if the task is currently sleeping? Its timer will expire at some point, which may set the state back to TASK_RUNNING and run a task that should not have been able to run again. Either the timer expires after the process is back in its previous scheduler class, and that’s fine, since it simply makes the task runnable again. Or it expires while the task is still removed, and that’s a problem, since it will put the task back in its previous state (or not, depending on how the syscall is implemented).
Warning: this post is just the result of a quick look. It is meant to give you some ideas and a basis for understanding the sleeping mechanism.

So we should try to understand how sleep works.

Let’s look at the nanosleep syscall (which is what gets called when you use the sleep() or usleep() functions in C).

SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
		struct timespec __user *, rmtp)
{
	struct timespec tu;

	if (copy_from_user(&tu, rqtp, sizeof(tu)))
		return -EFAULT;

	if (!timespec_valid(&tu))
		return -EINVAL;

	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}

It copies the requested duration from userspace, checks that it is valid, and calls hrtimer_nanosleep().
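To see this entry point from the userspace side, here is a small sketch that calls nanosleep() directly. The helper names sleep_and_measure_ns() and invalid_request_is_rejected() are made up for this post; only nanosleep() and clock_gettime() are real POSIX calls. The second helper exercises the timespec_valid() check above: a tv_nsec of one full second or more should come back as EINVAL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: sleep for `ns` nanoseconds with nanosleep() and
 * return the time that actually elapsed on CLOCK_MONOTONIC, in
 * nanoseconds, or -1 on error.
 */
long long sleep_and_measure_ns(long ns)
{
	struct timespec req, rem, start, end;

	assert(ns >= 0);
	req.tv_sec = ns / 1000000000L;
	req.tv_nsec = ns % 1000000000L;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;

	/*
	 * If a signal handler interrupts the sleep, nanosleep() returns -1
	 * with errno set to EINTR and stores the remaining time in rem.
	 * Retry with that remainder.
	 */
	while (nanosleep(&req, &rem) != 0) {
		if (errno != EINTR)
			return -1;
		req = rem;
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;

	return (end.tv_sec - start.tv_sec) * 1000000000LL
		+ (end.tv_nsec - start.tv_nsec);
}

/* Hypothetical helper: returns 1 if an out-of-range tv_nsec gives EINVAL. */
int invalid_request_is_rejected(void)
{
	struct timespec bad;

	bad.tv_sec = 0;
	bad.tv_nsec = 1000000000L; /* must be below one second's worth */
	errno = 0;
	return nanosleep(&bad, NULL) == -1 && errno == EINVAL;
}
```

Calling sleep_and_measure_ns(1000), roughly what usleep(1) asks for, should return at least 1000, and usually quite a bit more because of timer slack and scheduling latency.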

long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
		       const enum hrtimer_mode mode, const clockid_t clockid)
{
	struct restart_block *restart;
	struct hrtimer_sleeper t;
	int ret = 0;
	unsigned long slack;

	slack = current->timer_slack_ns;
	if (rt_task(current))
		slack = 0;

	hrtimer_init_on_stack(&t.timer, clockid, mode);
	hrtimer_set_expires_range_ns(&t.timer, timespec_to_ktime(*rqtp), slack);
	if (do_nanosleep(&t, mode))
		goto out;

	/* Absolute timers do not update the rmtp value and restart: */
	if (mode == HRTIMER_MODE_ABS) {
		ret = -ERESTARTNOHAND;
		goto out;
	}

	if (rmtp) {
		ret = update_rmtp(&t.timer, rmtp);
		if (ret <= 0)
			goto out;
	}

	restart = &current_thread_info()->restart_block;
	restart->fn = hrtimer_nanosleep_restart;
	restart->nanosleep.clockid = t.timer.base->clockid;
	restart->nanosleep.rmtp = rmtp;
	restart->nanosleep.expires = hrtimer_get_expires_tv64(&t.timer);

	ret = -ERESTART_RESTARTBLOCK;
out:
	destroy_hrtimer_on_stack(&t.timer);
	return ret;
}

It initializes a timer on the stack and calls do_nanosleep(), passing it the timer. Once do_nanosleep() returns, whatever the outcome, it destroys the timer structure (the out: label).

static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
	hrtimer_init_sleeper(t, current);

	do {
		set_current_state(TASK_INTERRUPTIBLE);
		hrtimer_start_expires(&t->timer, mode);
		if (!hrtimer_active(&t->timer))
			t->task = NULL;

		if (likely(t->task))
			schedule();

		hrtimer_cancel(&t->timer);
		mode = HRTIMER_MODE_ABS;

	} while (t->task && !signal_pending(current));

	__set_current_state(TASK_RUNNING);

	return t->task == NULL;
}

This is the interesting part.

The first thing it does is call hrtimer_init_sleeper(), which sets up the timer so that the function hrtimer_wakeup() is called when the timer expires, and records the task to wake up at that time. hrtimer_wakeup() simply sets the sleeper’s task field (t->task) to NULL and calls wake_up_process() on that task. We’ll come back to wake_up_process() later.
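The relation between the sleeper, its timer callback and the task can be modelled in a few lines of plain C. This is only a toy model under my own assumptions, not kernel code: every toy_ name is hypothetical, and the real wake_up_process() does far more than flip two fields.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; every name here is hypothetical. */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
	enum toy_state state;
	int on_rq;              /* 1 while the task sits on a runqueue */
};

struct toy_timer {
	void (*function)(struct toy_timer *timer); /* expiry callback */
};

struct toy_sleeper {
	struct toy_timer timer;
	struct toy_task *task;  /* task to wake, NULL once the timer fired */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Make the task runnable again; it is NOT switched to here. */
int toy_wake_up_process(struct toy_task *task)
{
	if (task->state == TOY_RUNNING)
		return 0;
	task->state = TOY_RUNNING;
	task->on_rq = 1;
	return 1;
}

/* Models hrtimer_wakeup(): clear the task field, then wake the task. */
void toy_wakeup(struct toy_timer *timer)
{
	struct toy_sleeper *t = toy_container_of(timer, struct toy_sleeper, timer);
	struct toy_task *task = t->task;

	t->task = NULL;
	if (task)
		toy_wake_up_process(task);
}

/* Models hrtimer_init_sleeper(): install the callback, remember the task. */
void toy_init_sleeper(struct toy_sleeper *sl, struct toy_task *task)
{
	sl->timer.function = toy_wakeup;
	sl->task = task;
}

/* Models what the timer interrupt does for one expired timer. */
void toy_timer_expire(struct toy_timer *timer)
{
	assert(timer->function != NULL);
	timer->function(timer);
}
```

After toy_init_sleeper() on a task in TOY_INTERRUPTIBLE, a call to toy_timer_expire() leaves the sleeper’s task pointer NULL and the task in TOY_RUNNING with on_rq set, which is the condition do_nanosleep() checks with t->task == NULL.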

After that, the state is changed to TASK_INTERRUPTIBLE. Then the timer is actually started, and the task calls schedule().

Remember that schedule(), before scheduling a new task, will test the state of the current (which is now “previous”) task, and if the state is not TASK_RUNNING, it will remove that task from its runqueue (and that is the case as it is in the TASK_INTERRUPTIBLE state).

schedule() then goes on and picks another task to run; if there is none, it runs the “idle” task.
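The dequeue-then-pick behaviour described above can be sketched as a toy scheduler. This is a deliberately naive model with hypothetical names (toy_rq, toy_schedule()); the real schedule() works with scheduler classes and per-CPU runqueues, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; all names are hypothetical. */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
	const char *name;
	enum toy_state state;
	int on_rq;                    /* 1 while queued as runnable */
};

#define TOY_MAX_TASKS 4

struct toy_rq {
	struct toy_task *tasks[TOY_MAX_TASKS]; /* tasks known to this CPU */
	size_t nr_tasks;
	struct toy_task *idle;        /* always TOY_RUNNING, never on_rq */
	struct toy_task *curr;        /* task currently on the CPU */
};

struct toy_task *toy_schedule(struct toy_rq *rq)
{
	struct toy_task *prev = rq->curr;
	struct toy_task *next = rq->idle;
	size_t i;

	assert(rq->nr_tasks <= TOY_MAX_TASKS);

	/* A task that is no longer TASK_RUNNING leaves the runqueue. */
	if (prev->state != TOY_RUNNING)
		prev->on_rq = 0;

	/* Pick the first queued task other than prev. */
	for (i = 0; i < rq->nr_tasks; i++) {
		if (rq->tasks[i]->on_rq && rq->tasks[i] != prev) {
			next = rq->tasks[i];
			break;
		}
	}

	/* Nobody else is runnable: keep prev if it still is, else go idle. */
	if (next == rq->idle && prev->on_rq)
		next = prev;

	rq->curr = next;
	return next;
}
```

With two tasks a and b queued and a running, setting a.state to TOY_INTERRUPTIBLE and calling toy_schedule() dequeues a and returns b. If b then sleeps too, the next call returns the idle task.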

At some point, the timer will expire. The timer relies on a hardware timer that raises an interrupt, so whatever task is currently running is interrupted to run the interrupt handler (this does not go through schedule()!). The interrupt handler for the hardware timer is hrtimer_interrupt(), which runs the callback of every expired timer; for a sleeper, that callback is hrtimer_wakeup(). As said before, the callback for our sleeping timer sets the sleeper’s task to NULL and calls wake_up_process(). But that function just puts the process back on the runqueue and sets its state to TASK_RUNNING; it does not actually schedule it. The interrupt handler then finishes and the CPU goes back to the process that was running (maybe the idle process, running the cpu_idle() function).

That process will eventually finish its work or be preempted (the normal scheduling mechanism), and the process that was sleeping will run again once pick_next_task() picks it.

But where does this process resume? After the sleep() or usleep() call? No! It resumes where it stopped: inside the nanosleep syscall, right after the call to schedule(). What does the syscall do after schedule() returns? It cancels the timer, sets the state back to TASK_RUNNING, and returns to userspace. That makes sense: as we said before, the syscall is responsible for destroying the timer structure, so execution cannot resume directly after the syscall.

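Note that I skipped the part where do_nanosleep() loops. I tried usleep(1) and sleep(1), and the loop never repeated. Judging from the loop condition, it only repeats when schedule() returns while t->task is still set (the timer has not fired) and no signal is pending, that is, when something other than the timer or a signal woke the task up. I have not observed that case myself.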
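One way to read the loop condition t->task && !signal_pending(current) is to replay it against a scripted list of wakeups. The sketch below does that. It is a toy model under my own assumptions: the toy_ names and the idea of a “wakeup script” are invented for illustration, and nothing here touches real timers or signals.

```c
#include <assert.h>
#include <stddef.h>

/* Why schedule() returned, in this toy model (hypothetical names). */
enum toy_wake { TOY_WAKE_SPURIOUS, TOY_WAKE_TIMER, TOY_WAKE_SIGNAL };

/*
 * Replays the control flow of do_nanosleep() against a scripted list of
 * wakeups. Returns 1 if the sleep ended because the timer fired (like
 * `return t->task == NULL`), 0 if a signal cut it short. *iterations
 * receives the number of passes through the loop body.
 */
int toy_do_nanosleep(const enum toy_wake *wakeups, size_t n, size_t *iterations)
{
	int task_set = 1;        /* models t->task != NULL */
	int signal_pending = 0;  /* models signal_pending(current) */
	size_t i = 0;

	assert(iterations != NULL);
	*iterations = 0;

	do {
		/* schedule() returns once something wakes the task; when the
		 * script runs out, assume the timer finally fires. */
		enum toy_wake why = (i < n) ? wakeups[i++] : TOY_WAKE_TIMER;

		(*iterations)++;
		if (why == TOY_WAKE_TIMER)
			task_set = 0;        /* the wakeup callback cleared t->task */
		else if (why == TOY_WAKE_SIGNAL)
			signal_pending = 1;
		/* TOY_WAKE_SPURIOUS changes nothing, so the loop goes around */
	} while (task_set && !signal_pending);

	return !task_set;
}
```

In this model a plain timer expiry takes exactly one pass, which matches what I saw with usleep(1) and sleep(1). Only a wakeup that is neither the timer nor a signal makes the loop go around again.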