This article covers how to use select() and how the kernel implements it.

1. Usage

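The first step is to understand the idea of I/O multiplexing: a single thread waits on several file descriptors at once and is told which of them are ready.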

    int select(int nfds, fd_set *readfds, fd_set *writefds,
               fd_set *exceptfds, struct timeval *timeout);

The following example, from the select(2) man page, watches stdin for up to five seconds:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    int
    main(void)
    {
        fd_set rfds;
        struct timeval tv;
        int retval;

        /* Watch stdin (fd 0) to see when it has input. */
        FD_ZERO(&rfds);
        FD_SET(0, &rfds);

        /* Wait up to five seconds. */
        tv.tv_sec = 5;
        tv.tv_usec = 0;

        /* nfds is the highest fd in any set plus one, hence 1. */
        retval = select(1, &rfds, NULL, NULL, &tv);

        if (retval == -1)
            perror("select()");
        else if (retval)
            printf("Data is available now.\n");
            /* FD_ISSET(0, &rfds) will be true. */
        else
            printf("No data within five seconds.\n");

        exit(EXIT_SUCCESS);
    }
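The example above watches a single descriptor, but the point of multiplexing is to wait on several at once. Here is a minimal sketch of that, assuming a hypothetical helper named wait_first_readable (it is not part of any library). It rebuilds the fd_set and the timeval on every call, because select() overwrites the sets with its result and, on Linux, may also modify the timeout.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a library function): wait until one of
 * fds[0..count-1] becomes readable or timeout_ms elapses.
 * Returns the first readable fd, -1 on timeout, -2 on error.
 */
int wait_first_readable(const int *fds, int count, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int maxfd = -1;
    int i, ready;

    /* select() overwrites the set, so build it fresh on every call. */
    FD_ZERO(&rfds);
    for (i = 0; i < count; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -2;          /* an fd_set cannot represent this fd */
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest-numbered fd plus one, not the number of fds. */
    ready = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -2;
    if (ready == 0)
        return -1;

    /* After the call, only the ready descriptors remain set. */
    for (i = 0; i < count; i++)
        if (FD_ISSET(fds[i], &rfds))
            return fds[i];
    return -2;                  /* not expected when ready > 0 */
}
```

Called with the read ends of two pipes, the helper returns -1 while both are empty and returns the matching descriptor once a byte is written to either pipe.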

2. Kernel implementation

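This section covers only the core of the kernel's select() implementation. It is based on the article "select()/poll() 的内核实现" (The kernel implementation of select()/poll()).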

kernel version: v3.9-rc8

    SYSCALL_DEFINE5(select, ...)
        core_sys_select()
            do_select()
                poll_schedule_timeout()

2.1 The do_select() loop

do_select() is essentially one big loop: for every fd (file descriptor) the program asked it to watch, it calls the poll operation from that file's struct file_operations.

    int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
    {
        // …
        for (;;) {
            // …
            for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
                // … (the inner per-bit loop, which advances i, is elided here)
                struct fd f;
                f = fdget(i);
                if (f.file) {
                    const struct file_operations *f_op;
                    f_op = f.file->f_op;
                    mask = DEFAULT_POLLMASK;
                    if (f_op->poll) {
                        wait_key_set(wait, in, out,
                                     bit, busy_flag);
                        // Check this fd for I/O events
                        mask = (*f_op->poll)(f.file, wait);
                    }
                    fdput(f);
                    // …
                }
            }
            // Leave the loop
            if (retval || timed_out || signal_pending(current))
                break;
            // Go to sleep
            if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
                                       to, slack))
                timed_out = 1;
        }
    }

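(*f_op->poll) returns the current state of the device behind the fd (for example, whether it is readable or writable). Depending on that state, do_select() does one of the following: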

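  • If the fd's state matches an I/O event the program is interested in, do_select() records it, leaves the loop, and returns the result to the caller.
  • If nothing matches but the timeout has expired or a signal is pending for the process, do_select() also leaves the loop, but with an empty result: select() returns 0 on timeout, and fails with EINTR when a signal interrupts it.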
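The control flow described above can be sketched in user space. This is a toy model rather than kernel code: struct toy_fd, toy_do_select and the tick-based clock are all made up for illustration, and signal handling is left out. A "device" becomes ready at a given tick, and sleeping is modelled by jumping the clock to the earlier of the next device event and the deadline.

```c
#include <stddef.h>

/* Toy stand-in for a device: it becomes readable at tick ready_at. */
struct toy_fd {
    int ready_at;
};

/*
 * User-space model of the do_select() control flow (not kernel code).
 * now      - current tick of the simulated clock
 * deadline - tick at which the timeout expires
 * Returns the number of ready descriptors, or 0 on timeout.
 * On return *woke_at holds the tick at which the loop exited.
 */
int toy_do_select(const struct toy_fd *fds, size_t n,
                  int now, int deadline, int *woke_at)
{
    int timed_out = 0;

    for (;;) {
        int retval = 0;
        int next_event = deadline;
        size_t i;

        /* One full scan: "poll" every descriptor. */
        for (i = 0; i < n; i++) {
            if (fds[i].ready_at <= now)
                retval++;                       /* event matches: record it */
            else if (fds[i].ready_at < next_event)
                next_event = fds[i].ready_at;   /* earliest future wakeup */
        }

        /* Same exit test as the kernel loop, minus signal_pending(). */
        if (retval || timed_out) {
            *woke_at = now;
            return retval;
        }

        /* "Sleep" until a device event or the deadline, whichever is first. */
        now = next_event;
        if (now >= deadline)
            timed_out = 1;
    }
}
```

As in the kernel loop, setting timed_out does not exit immediately. One more scan runs first, so an event that arrives exactly at the deadline is still reported.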

But what if do_select() finds no events, the timeout has not expired, and no signal is pending? Does the kernel keep spinning in this loop forever?

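No. After a full pass over the fds with no I/O event available, do_select() puts the current process to sleep until a device behind one of the fds, or the timer, wakes it up. It then runs the loop again to see which fds are ready, which is far cheaper than busy polling.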

    int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
                              ktime_t *expires, unsigned long slack)
    {
        int rc = -EINTR;

        // Go to sleep, unless a wakeup has already been triggered
        set_current_state(state);
        if (!pwq->triggered)
            rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
        __set_current_state(TASK_RUNNING);

        /*
         * Prepare for the next iteration.
         *
         * The following set_mb() serves two purposes. First, it's
         * the counterpart rmb of the wmb in pollwake() such that data
         * written before wake up is always visible after wake up.
         * Second, the full barrier guarantees that triggered clearing
         * doesn't pass event check of the next iteration. Note that
         * this problem doesn't exist for the first iteration as
         * add_wait_queue() has full barrier semantics.
         */
        set_mb(pwq->triggered, 0);

        return rc;
    }
    EXPORT_SYMBOL(poll_schedule_timeout);

2.2 struct file_operations: the driver's operation functions

How does a device wake up the process when an I/O event occurs? Where is each fd's wait queue, and when did we add the current process to those wait queues?

    mask = (*f_op->poll)(f.file, wait);

The line above is what does it. Before looking at how, we need to understand how the kernel and the device drivers are coupled.

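The f_op->poll called for each device above is a kernel function specific to each kind of file or device. It is not the poll() system call we use in user space. It is also the common foundation underneath select(), poll() and epoll().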

Support for any of these calls requires support from the device driver. This support (for all three calls, select() poll() and epoll()) is provided through the driver’s poll method.

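Linux's design is flexible here. The core kernel does not know how any particular device is operated (how it is opened, read, or written). Instead, every open file carries a pointer to a struct file_operations, a table of function pointers that point at the operations implemented by the device's driver, i.e. the driver's callbacks.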

    struct file {
        struct path f_path;
        struct inode *f_inode; /* cached value */
        const struct file_operations *f_op;

        // …

    } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */

(The listing below includes members such as read_iter and write_iter, which were added after v3.9, so it is taken from a newer kernel tree than the one named above.)

    struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
        int (*iterate) (struct file *, struct dir_context *);
        // The operation select() uses to poll the device behind an fd
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        // …
    };
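The callback-table idea can be illustrated outside the kernel. The sketch below is a user-space toy: every toy_ name and TOY_ constant is invented for illustration and none of it is the kernel's API. Generic code dispatches through the table without knowing which driver is behind it, and falls back to a default mask when a driver supplies no poll method, as the do_select() excerpt does with DEFAULT_POLLMASK.

```c
#include <stddef.h>

#define TOY_POLLIN   0x1u
#define TOY_POLLOUT  0x4u
/* Assumed default for files without a poll method: always ready. */
#define TOY_DEFAULT_POLLMASK (TOY_POLLIN | TOY_POLLOUT)

struct toy_file;

/* Table of callbacks supplied by each "driver". */
struct toy_file_operations {
    unsigned int (*poll)(struct toy_file *);
};

struct toy_file {
    const struct toy_file_operations *f_op;
    int bytes_buffered;     /* driver-private state for this sketch */
};

/* A driver that reports readable only when it has buffered data. */
static unsigned int toy_pipe_poll(struct toy_file *f)
{
    unsigned int mask = TOY_POLLOUT;

    if (f->bytes_buffered > 0)
        mask |= TOY_POLLIN;
    return mask;
}

/* One driver implements poll, the other leaves the slot empty. */
static const struct toy_file_operations toy_pipe_fops = { .poll = toy_pipe_poll };
static const struct toy_file_operations toy_plain_fops = { .poll = NULL };

/* Generic code: knows nothing about the driver, only the table. */
unsigned int toy_vfs_poll(struct toy_file *f)
{
    if (f->f_op->poll)
        return f->f_op->poll(f);
    return TOY_DEFAULT_POLLMASK;
}
```

A file backed by toy_pipe_fops reports TOY_POLLIN only once bytes_buffered is positive, while a file backed by toy_plain_fops always gets the default mask.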

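What does f_op->poll do to the device? Two things: it calls poll_wait() to add the caller to the device's wait queue, and it returns a mask describing the device's current state.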

    unsigned int (*poll) (struct file *filp, struct poll_table_struct *pwait);

For every file descriptor, it calls that fd’s poll() method, which will add the caller to that fd’s wait queue, and return which events (readable, writeable, exception) currently apply to that fd.
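The two duties of a poll method, registering the caller and reporting state, can be modelled with a small user-space toy. Everything here is hypothetical (toy_waiter, toy_device, toy_poll_wait and friends are not kernel symbols). The triggered field only loosely mirrors pwq->triggered from poll_schedule_timeout(), and real wait queues and wakeups are far more involved.

```c
#include <stddef.h>

#define TOY_MAX_WAITERS 8
#define TOY_POLLIN 0x1u

/* The sleeping side: one per select() call. */
struct toy_waiter {
    int triggered;      /* set by the device when it wakes us */
};

/* Per-device wait queue plus the state the driver reports. */
struct toy_device {
    struct toy_waiter *queue[TOY_MAX_WAITERS];
    size_t nwaiters;
    int has_data;
};

/* Duty 1: enqueue the caller. A NULL waiter means "just check state". */
static void toy_poll_wait(struct toy_device *dev, struct toy_waiter *w)
{
    if (w && dev->nwaiters < TOY_MAX_WAITERS)
        dev->queue[dev->nwaiters++] = w;
}

/* The driver's poll method: register first, then report current state. */
unsigned int toy_dev_poll(struct toy_device *dev, struct toy_waiter *w)
{
    toy_poll_wait(dev, w);
    return dev->has_data ? TOY_POLLIN : 0;   /* duty 2 */
}

/* Called when data arrives: mark the device ready and wake every waiter. */
void toy_dev_data_arrived(struct toy_device *dev)
{
    size_t i;

    dev->has_data = 1;
    for (i = 0; i < dev->nwaiters; i++)
        dev->queue[i]->triggered = 1;
}
```

Registering before checking the state matters in this model: if data arrives between the check and the sleep, the waiter is already on the queue, so its triggered flag is set and the sleep can be skipped.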

3. Summary

To summarize, select() roughly works as follows:

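  1. Scan all the fds once.
  2. If any fd is ready, go to step 5.
  3. Otherwise, put the current process to sleep for at most the remaining timeout.
  4. When the timeout expires, or an fd whose state changed wakes the process, go back to step 1.
  5. Leave the loop and return.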

References:

  1. man select
  2. select()/poll() 的内核实现 (The kernel implementation of select()/poll())