- 1 Block Devices Handling
- 2 The Generic Block Layer
- 3 The I/O Scheduler
- 4 Block Device Drivers
- 5 Opening a Block Device File
We start in the first section “Block Devices Handling” to explain the general architecture of the Linux block I/O subsystem. In the sections “The Generic Block Layer,” “The I/O Scheduler,” and “Block Device Drivers,” we will describe the main components of the block I/O subsystem. Finally, in the last section, “Opening a Block Device File,” we will outline the steps performed by the kernel when opening a block device file.
Each operation on a block device driver involves a large number of kernel components; the most important ones are shown in Figure 14-1.
Let us suppose, for instance, that a process issued a read() system call on some disk file. Here is what the kernel typically does to service the process request:
- The service routine of the
read()system call activates a suitable VFS function, passing to it a file descriptor and an offset inside the file.
- The VFS function determines if the requested data is already available and, if necessary, how to perform the read operation.
- Let’s assume that the kernel must read the data from the block device, thus it must determine the physical location of that data. To do this, the kernel relies on the mapping layer, which typically executes two steps:
- It determines the block size of the filesystem including the file and computes the extent of the requested data in terms of file block numbers. Essentially, the file is seen as split in many blocks, and the kernel determines the numbers (indices relative to the beginning of file) of the blocks containing the requested data.
- Next, the mapping layer invokes a filesystem-specific function that accesses the file’s disk inode and determines the position of the requested data on disk in terms of logical block numbers. Essentially, the disk is seen as split in blocks, and the kernel determines the numbers (indices relative to the beginning of the disk or partition) corresponding to the blocks storing the requested data. Because a file may be stored in nonadjacent blocks on disk, a data structure stored in the disk inode maps each file block number to a logical block number.
- The kernel can now issue the read operation on the block device. It makes use of the generic block layer, which starts the I/O operations that transfer the requested data. In general, each I/O operation involves a group of blocks that are adjacent on disk. Because the requested data is not necessarily adjacent on disk, the generic block layer might start several I/O operations. Each I/O operation is represented by a “block I/O” (in short, “bio”) structure, which collects all information needed by the lower components to satisfy the request.
The generic block layer hides the peculiarities of each hardware block device, thus offering an abstract view of the block devices. Because almost all block devices are disks, the generic block layer also provides some general data structures that describe “disks” and “disk partitions.”
- Below the generic block layer, the “I/O scheduler” sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests of data that lie near each other on the physical medium.
- Finally, the block device drivers take care of the actual data transfer by sending suitable commands to the hardware interfaces of the disk controllers.
As you can see, there are many kernel components that are concerned with data stored in block devices; each of them manages the disk data using chunks of different length:
- The controllers of the hardware block devices transfer data in chunks of fixed length called “sectors.” Therefore, the I/O scheduler and the block device drivers must manage sectors of data.
- The Virtual Filesystem, the mapping layer, and the filesystems group the disk data in logical units called “blocks.” A block corresponds to the minimal disk storage unit inside a filesystem.
- Block device drivers should be able to cope with “segments” of data: each segment is a memory page—or a portion of a memory page—including chunks of data that are physically adjacent on disk.
- The disk caches work on “pages” of disk data, each of which fits in a page frame.
- The generic block layer glues together all the upper and lower components, thus it knows about sectors, blocks, segments, and pages of data.
The generic block layer is a kernel component that handles the requests for all block devices in the system. Thanks to its functions, the kernel may easily:
- Implement—with some additional effort—a “zero-copy” schema, where disk data is directly put in the User Mode address space without being copied to kernel memory first.
- Manage logical volumes—such as those used by LVM(the Logical Volume Manager) and RAID (Redundant Array of Inexpensive Disks): several disk partitions, even on different block devices, can be seen as a single partition.
- Exploit the advanced features of the most recent disk controllers.
The core data structure of the generic block layer is a descriptor of an ongoing I/O block device operation called bio. Each bio essentially includes an identifier for a disk storage area—the initial sector number and the number of sectors included in the storage area—and one or more segments describing the memory areas involved in the I/O operation. A bio is implemented by the
bio data structure.
Each segment in a bio is represented by a
bio_vec data structure.
A disk is a logical block device that is handled by the generic block layer. Usually a disk corresponds to a hardware block device such as a hard disk, a floppy disk, or a CD-ROM disk. However, a disk can be a virtual device built upon several physical disk partitions, or a storage area living in some dedicated pages of RAM. In any case, the upper kernel components operate on all disks in the same way thanks to the services offered by the generic block layer.
A disk is represented by the
Hard disks are commonly split into logical partitions. Each block device file may represent either a whole disk or a partition inside the disk. If a disk is split in partitions, their layout is kept in an array of
hd_struct structures whose address is stored in the
part field of the
Although block device drivers are able to transfer a single sector at a time, the block I/O layer does not perform an individual I/O operation for each sector to be accessed on disk; this would lead to poor disk performance, because locating the physical position of a sector on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several sectors and handle them as a whole, thus reducing the average number of head movements.
When a kernel component wishes to read or write some disk data, it actually creates a block device request. That request essentially describes the requested sectors and the kind of operation to be performed on them (read or write). However, the kernel does not satisfy a request as soon as it is created—the I/O operation is just scheduled and will be performed at a later time.
Each block device driver maintains its own request queue, which contains the list of pending requests for the device. If the disk controller is handling several disks, there is usually one request queue for each physical block device. I/O scheduling is performed separately on each request queue, thus increasing disk performance.
backing_dev_info field is a small object of type
backing_dev_info, which stores information about the I/O data flow traffic for the underlying hardware block device. For instance, it holds information about read-ahead and about request queue congestion state.
When a new request is added to a request queue, the generic block layer invokes the I/O scheduler to determine that exact position of the new element in the queue. The I/O scheduler tries to keep the request queue sorted sector by sector. If the requests to be processed are taken sequentially from the list, the amount of disk seeking is significantly reduced because the disk head moves in a linear way from the inner track to the outer one (or vice versa) instead of jumping randomly from one track to another.