Tasks, timers, jobs, and memory management

In the abstract, we can model the operation of a participant in routing protocol exchange and computation as a loop consisting of three primitive operations: read a protocol message; perform a route computation (which may trigger sub-computations or cause further computation to be scheduled for later execution); and write a protocol message.

GateD provides a common substrate, which includes abstractions such as tasks, timers and jobs, for implementing these primitive operations efficiently. A task represents a separately schedulable thread of protocol computation. A timer represents an event scheduled for a future instant, which causes some non-preemptable computation to be performed at that instant; this computation is associated with a specific task. A job (either foreground or background) represents some non-preemptable protocol computation that can be scheduled at any time in the future.

A more detailed description of the task module data types, macros and function definitions may be found here.


Tasks in GateD

Since GateD provides a platform for prototyping and operating multiple routing protocol instances in a single entity (e.g. a router), it must be able to harmoniously manage primitive operations belonging to different routing protocol instances (e.g. a BGP instance and an OSPF instance). To do this, GateD provides the task, a thread of execution for a sequence of related primitive operations. These related operations can be scheduled independently of (i.e. arbitrarily interleaved with) the operation of other protocol instances or other operations within the same protocol instance. Tasks are roughly equivalent to co-routines (tasks are scheduled non-preemptively).

Different protocol implementations use tasks differently. The simplest approach is to allocate a single task for all computation that happens within a protocol: OSPF and RIP each have a single task in whose context all protocol computation is performed. BGP, which maintains a transport connection to each individual peer, instead assigns a task to all protocol activity (message processing, route computation etc.) associated with each peer (or peer group).

In the current GateD implementation, tasks do not correspond to separate UNIX processes (with the exception of the system trace dump task).

Task Data Structures

The central data structure in the task module is the task. It is a descriptor for all the processing state associated with a particular sequence of primitive operations. Such state includes pointers to protocol specific information (such as the protocol implementation state block, e.g. the bgpPeer struct associated with the task). A task is usually associated with a UNIX socket; the task structure contains the socket descriptor.

In addition, the task structure contains pointers to protocol specific routines for performing the different primitive operations that the task module supports: socket reads, socket writes, socket exception handling, timer expiry processing, and foreground and background job execution.

For example, when it is ready to read update messages from a peer, the BGP protocol implementation registers a pointer to bgp_recv_v4_update() with the corresponding peer's task. Then, when data actually arrives on the socket, the task module simply calls this handler to perform the protocol specific read. To notify protocols of GateD reconfiguration, tasks also maintain pointers to protocol specific notification handlers (e.g., the handlers to be called before and after reconfiguration, and so on).

GateD individually prioritizes each of these operations. At any given instant, pending timers have the highest priority, followed by foreground jobs, reads from sockets, protocol specific exception handling, writes to sockets, and finally background jobs. GateD maintains a global list of tasks (task_read_queue, task_write_queue etc.) waiting to perform these operations; the task structure also contains linked list pointers for placement on each of these queues. (Jobs and timers are actually queued in separate data structures, because a task may have more than one timer or job pending.)
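
As a rough illustration, a task descriptor along these lines might look like the following minimal C sketch. All field names here are hypothetical, intended only to mirror the description above, not the actual GateD declarations.

    /* Hypothetical sketch of a task descriptor; field names are
     * illustrative, not the actual GateD declarations. */
    struct task {
        const char *task_name;        /* e.g. "BGP_peer"                */
        int         task_socket;      /* UNIX socket descriptor         */
        int         task_priority;    /* static scheduling priority     */
        void       *task_data;        /* protocol state, e.g. a bgpPeer */

        /* Protocol specific handlers for the primitive operations. */
        void (*task_recv)(struct task *);      /* data ready to read */
        void (*task_send)(struct task *);      /* socket writable    */
        void (*task_except)(struct task *);    /* socket exception   */

        /* Reconfiguration/termination notification handlers. */
        void (*task_reconfig)(struct task *);
        void (*task_terminate)(struct task *);

        /* Linkage for the global read and write queues. */
        struct task *task_read_next;
        struct task *task_write_next;
    };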

Task Module Entry Points

Protocol instances normally create a task by first allocating a task data structure (by calling task_alloc()), instantiating the various fields of the task structure, and then calling task_create() to insert the task structure into a global list of tasks. This insertion takes place in priority order (tasks are assigned a static priority). The latter function also creates a socket for use by the task for reads and writes.

Operations on the socket associated with the task include task_close() (to close the socket associated with the task, and remove it from the global read and write queues), task_set_socket() and task_reset_socket().

Storage associated with the task data structure is freed when task_delete() is called. This also frees any other resources (timers, jobs) associated with the task and removes the task from any other lists to which it may belong.

Some task module functions are used to notify each individual task of events of global interest. Some of these are processed in task priority order. Thus, task_reconfigure() is called to force all protocol instances to re-read the configuration files. Similarly, task_ifachange() and task_iflchange() are used to notify protocol instances of changes in the status of one or more network interfaces. task_terminate() is used to notify a protocol instance of GateD's intention to wind up operations (e.g. in response to operator action). task_newpolicy() is used to inform interested protocol modules of routing policy configuration changes. Finally, task_flash() calls each protocol instance's flash handler whenever any change happens to GateD's global route database.

Once a task has been created, individual protocol instances may schedule reads on the task socket using task_set_recv() (note the naming inconsistency), and schedule blocked writes to the socket using task_set_write() (when a protocol instance has some data to send on a socket, it attempts to write it directly; if that blocks, it calls this function). No protocol implementation seems to use the socket exception feature.
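
Putting these entry points together, a protocol might create and arm a task roughly as follows. This is a hedged sketch: the argument lists of task_alloc(), task_create() and task_set_recv() are assumptions for illustration, and my_recv_handler() is a hypothetical protocol routine.

    /* Hypothetical task creation sequence; real signatures differ. */
    static void my_recv_handler(struct task *tp)
    {
        /* protocol specific read, e.g. bgp_recv_v4_update() for BGP */
    }

    static struct task *my_task_setup(void *proto_state)
    {
        struct task *tp = task_alloc();      /* allocate the descriptor   */
        if (tp == NULL)
            return NULL;
        tp->task_data = proto_state;         /* protocol state block      */
        tp->task_priority = 10;              /* static priority (assumed) */
        task_create(tp);                     /* enqueue task, open socket */
        task_set_recv(tp, my_recv_handler);  /* arm reads on the socket   */
        return tp;
    }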

To actually read data from, or write data to, a socket, the task module provides task_receive_packet() and task_send_packet() respectively. While ideal for protocols that use unreliable datagrams for sending routing protocol messages, these functions do not suffice for protocols that use stream-based transports; BGP prefers instead to access the socket directly. Task reads from the operating system go directly into a single global task_recv_buffer (GateD can do this because tasks are non-preemptive). It is then the protocol instance's responsibility to consume all the data in the buffer. Similarly, in preparing to write data, protocol instances formulate messages in a single global task_send_buffer; if the write blocks, it is the protocol instance's responsibility to spool the data elsewhere.
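
The single-global-buffer discipline only works because tasks never preempt one another. A hedged sketch of a datagram read handler under that discipline (the buffer name comes from the text above; the task_receive_packet() argument list, the buffer length, and my_process_message() are assumptions):

    /* Hypothetical read handler for a datagram-based protocol. */
    extern char task_recv_buffer[];        /* single global read buffer */

    static void my_dgram_recv(struct task *tp)
    {
        /* Read one datagram straight into the shared global buffer;
         * safe only because tasks are scheduled non-preemptively. */
        int len = task_receive_packet(tp, task_recv_buffer,
                                      TASK_RECV_BUFLEN /* assumed */);
        if (len <= 0)
            return;                        /* nothing read, or an error */

        /* Consume everything now: the next read by any task reuses
         * the very same buffer. my_process_message() is hypothetical. */
        my_process_message(tp, task_recv_buffer, len);
    }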

Task Processing and the GateD Main Loop

All protocol processing in GateD is scheduled from its main() loop. As described above, the main loop prioritizes different computations and repeatedly iterates to discover which computations have become runnable since the previous iteration.

Basically, the main loop does a select() on all socket descriptors used for reading, writing, or exception handling. It falls into the select() when no timers are scheduled and the job queues are empty, and stays there until the next scheduled timer expiry or for 8 seconds, whichever comes first. After coming out of the select(), it processes "high priority" timer events (see the timer section). If any operator actions occurred (in the form of UNIX signals), it processes these (from task_process_signals()) and causes any queued foreground jobs to run before returning to the select(). Otherwise, it performs socket reads and writes (from task_process_sockets()), pending timers and background jobs, in that order.
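
In outline, the scheduling discipline of the main loop looks roughly like the sketch below. Only select() and the dispatch routines named in this section are real; the helper names are assumptions standing in for the corresponding logic.

    /* Schematic of the main loop's priority order; not actual GateD code. */
    void gated_main_loop_sketch(void)
    {
        for (;;) {
            /* Sleep in select() until the earliest timer is due or
             * 8 seconds elapse, whichever comes first. */
            struct timeval timeout = time_until_next_timer_capped(8);
            wait_for_sockets(&timeout);        /* wraps select()       */

            task_timer_hiprio_dispatch();      /* high priority timers */

            if (signals_pending()) {
                task_process_signals();        /* operator actions     */
                task_job_fg_dispatch();        /* all foreground jobs  */
                continue;                      /* back to select()     */
            }

            task_process_sockets();            /* reads, then writes   */
            task_timer_dispatch();             /* pending timers       */
            task_job_bg_dispatch();            /* one background job   */
        }
    }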

Thus, a protocol's message read and write operations, as well as its timers and jobs, are scheduled from GateD's main loop. The only exception is the flash routines, which are invoked whenever (as a result of a read) GateD's routing database is updated.


Timers in GateD

A timer in GateD schedules a particular computation for a specified future instant. The GateD timer module allows specification of both one-shot timers and interval timers. Time specification granularity is 1 second. Timer expiry is affected by system load; thus, a timer may not expire at exactly the scheduled interval since its last expiry.

GateD allows protocols to specify higher priority timers (these have a greater probability of expiring on time compared to "normal" timers). GateD also allows specification of drifting timers: depending on how the timer is initialized, the next expiry is calculated from when the timer actually expired last, plus the interval. It is also possible to configure the timer to be non-drifting: expiry is then scheduled at exact integral multiples of the interval after the timer was first scheduled.
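
The difference between the two modes is in how the next expiry instant is computed. A minimal sketch follows; the struct name task_timer is from the module, but the field names are hypothetical.

    /* Hypothetical next-expiry computation; 1-second granularity. */
    #include <time.h>

    struct task_timer {
        int    timer_drifting;    /* drifting or fixed-grid timer  */
        time_t timer_start;       /* when first scheduled          */
        time_t timer_last_fired;  /* when it last actually expired */
        time_t timer_interval;    /* interval, in whole seconds    */
    };

    time_t next_expiry(const struct task_timer *t, time_t now)
    {
        if (t->timer_drifting) {
            /* Drifting: count from the last actual expiry, so a late
             * expiry pushes all later ones back. */
            return t->timer_last_fired + t->timer_interval;
        }
        /* Non-drifting: stay on integral multiples of the interval
         * counted from when the timer was first scheduled. */
        time_t periods = (now - t->timer_start) / t->timer_interval + 1;
        return t->timer_start + periods * t->timer_interval;
    }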

Most protocols use timers for periodic activities such as monitoring connection status, timing out connection opens and so on.

In the current GateD implementation, timer expiry is checked by polling all active timers in the main loop; operating system facilities for timer expiry notification are not used.

Timer data structures

The central data structure is the timer descriptor block task_timer. This includes timer parameters (interval, start time) as well as state describing the action to perform on expiry.

These task_timer data structures are queued into one of three global queues maintained by the task module: a list of high priority timers, a list of active timers and a list of inactive timers (the last is used when protocols temporarily need to idle timer expiry). The first two lists are sorted in increasing order of next expiry, to ease processing.

High priority timers are used by some protocols like OSPF for timing their hello deliveries. BGP uses "normal" timers for all its timing needs: keepalives, connection retries and so on.

Timer Module Entry Points

The function task_timer_create() creates a timer with the specified parameters and inserts it into the appropriate global list of timers. Conversely, task_timer_delete() removes a timer from its queue and frees any resources that the timer holds. task_timer_reset() is used to idle a timer (i.e. move it to the inactive list), and task_timer_set() is used to reschedule an idle timer or a timer already in progress.
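
For example, a BGP-style keepalive timer might be managed roughly as follows; only the function names come from the module, so the argument lists here are assumptions, and my_send_keepalive() is a hypothetical handler.

    /* Hypothetical usage of the timer entry points. */
    static void my_peer_timers(struct task *tp)
    {
        struct task_timer *keepalive;

        /* One interval timer per peer, firing every 30 seconds. */
        keepalive = task_timer_create(tp, "keepalive", 30,
                                      my_send_keepalive, NULL);

        task_timer_reset(keepalive);    /* idle while the peer is down   */
        task_timer_set(keepalive, 30);  /* re-arm once the session opens */
        task_timer_delete(keepalive);   /* tear down when finished       */
    }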

Timer Module Processing

The primary timer handling is performed in the task_timer_dispatch() and task_timer_hiprio_dispatch() routines. These simply traverse the list of timers, call the appropriate handlers, recompute the next interval at which the timer should fire and reschedule the timer for then. These are called from the main loop.
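
Schematically, dispatch amounts to walking the sorted queue until it reaches a timer that is not yet due. The sketch below extends the timer sketch above with dispatch-related fields; all names beyond those cited in the text remain hypothetical.

    /* Sketch of timer dispatch over a queue sorted by firing time. */
    #include <time.h>

    struct task;
    struct task_timer {
        time_t       timer_next_fire;   /* when it fires next */
        struct task *timer_task;        /* owning task        */
        void (*timer_action)(struct task *, struct task_timer *);
    };

    extern struct task_timer *timer_queue_head(void);
    extern void timer_queue_remove(struct task_timer *);
    extern void timer_queue_insert_sorted(struct task_timer *);
    extern time_t next_expiry(struct task_timer *, time_t);

    void timer_dispatch_sketch(time_t now)
    {
        struct task_timer *t;

        /* The queue is sorted, so stop at the first timer not yet due. */
        while ((t = timer_queue_head()) != NULL
               && t->timer_next_fire <= now) {
            timer_queue_remove(t);
            t->timer_action(t->timer_task, t);  /* run the handler */
            t->timer_next_fire = next_expiry(t, now);
            timer_queue_insert_sorted(t);       /* reschedule      */
        }
    }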

Jobs in GateD

A GateD job represents a unit of computation that can be scheduled for a time later than the present. Jobs are either foreground or background. Foreground jobs are time-limited computations that run when it is safe to modify the GateD routing database; they are typically scheduled from high priority timer handlers or flash handlers. Background jobs are longer running computations scheduled when no timers or I/O are pending. Each background job has a priority between zero and seven; background jobs are scheduled in priority order.

For example, BGP uses a foreground job to close a peer connection if an error is observed when flushing routes to that peer. This cannot be performed correctly from the task's write routine, since there may be reads queued up or timers scheduled for the task. Similarly, the OSPF implementation sets a timer after an SPF computation to ensure that another SPF computation is not started within that time period; on expiry of the timer, it sets up a background job to run the SPF computation.

Job data structures

The descriptor block for a single job, task_job, contains pointers to the task for which the job is scheduled, the routine encompassing the computation to be performed, and a single argument to pass to that routine.

The task module maintains a global list of foreground jobs (ordered FIFO) and a list of background jobs (ordered by priority). In addition, there is a single array of pointers to the locations in the background queue marking the ends of blocks of task_job structs with the same priority.
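
Because the array tracks the end of each priority block, insertion into the single background queue takes constant time. A hedged sketch of that idea (names and linkage fields are hypothetical, fields beyond linkage are elided, and dequeue bookkeeping is omitted):

    /* Sketch of priority insertion into the background queue using
     * per-priority block-end pointers. */
    #define TASK_JOB_PRIOS 8

    struct task_job {
        struct task_job *next;
        int priority;                 /* 0 (highest) .. 7 (lowest) */
    };

    struct task_job *bg_head;                   /* whole queue       */
    struct task_job *bg_tail[TASK_JOB_PRIOS];   /* end of each block */

    void bg_enqueue(struct task_job *job)
    {
        struct task_job *prev = NULL;

        /* Insertion point: the end of this job's priority block, or
         * failing that, the end of the nearest higher-priority block. */
        for (int q = job->priority; q >= 0 && prev == NULL; q--)
            prev = bg_tail[q];

        if (prev == NULL) {            /* no higher-priority jobs     */
            job->next = bg_head;
            bg_head = job;
        } else {
            job->next = prev->next;    /* splice in after the block   */
            prev->next = job;
        }
        bg_tail[job->priority] = job;  /* new end of this prio block  */
    }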

Job Entry Points

Protocol instances call task_job_create() to allocate a task_job structure and insert it into the appropriate list. They call task_job_delete() to do the opposite. Foreground jobs are automatically deleted after completion. Background jobs must be explicitly deleted.
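
For instance, the OSPF hold-down pattern described earlier might look roughly like this; the task_job_create() argument list is an assumption, and ospf_run_spf() is a hypothetical routine.

    /* Hypothetical sketch: a hold-down timer handler queues the SPF
     * run as a background job. */
    extern void ospf_run_spf(struct task *tp, void *arg);  /* hypothetical */

    static void spf_holddown_expired(struct task *tp, struct task_timer *t)
    {
        /* Background priority 0..7; the job runs only when no timer
         * or socket work is pending, and must later be removed with
         * task_job_delete(). */
        task_job_create(tp, "spf", ospf_run_spf, NULL, 5 /* priority */);
    }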

Job Processing

The GateD main loop calls task_job_fg_dispatch() to schedule computation of foreground jobs. This runs all foreground jobs to completion. When possible, the GateD main loop calls task_job_bg_dispatch() to schedule computation of background jobs. Unlike the former, this function simply removes the first, highest-priority background job from the list of background jobs, schedules its execution and returns.

Memory Management in GateD

In an effort to optimize memory usage, GateD provides its own memory management routines. While logically not part of the task module, memory management is probably included here for historical reasons.

Protocol modules (and other support modules as well) allocate memory in GateD in two distinctly different ways. For dynamic allocation of fixed size descriptors, GateD uses the memory allocator described below. For variable sized blocks (e.g. the BGP attribute descriptor structure as_path), GateD uses malloc(). (Actually, GateD optimizes this latter allocation as well, using malloc() only for the uncommon case and allocating fixed sized blocks for the common case.)

The fixed block allocation model is optimized for the case when a module repeatedly requests the same sized block. The module first contacts the memory manager to obtain a descriptor block (called a task_block, see below) which indicates the size of blocks it will request in the future. To later allocate a block, it calls the memory manager with this descriptor block; the memory manager returns a pointer to the appropriately sized block.

Memory Management Data Structures

The fixed block allocator has two main data structures, a task_block and a task_size_block. The task_block, which is the descriptor block referred to above, contains the size of each allocation request and some bookkeeping data.

A task_size_block is a descriptor for a block allocation of a particular size. It contains a linked-list of task_blocks for blocks of that size. In addition, it contains a pointer to a free-list of blocks of that particular size.

Memory Management Entry Points

Protocol modules register with the memory manager by calling task_block_init(), specifying the size of the allocations they expect to make with this registration.

The macros task_block_alloc() and task_block_free() respectively allocate and free a block of the registered size. The memory allocation strategy is explained below.
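
Putting the two entry points together, the common pattern is one registration per structure type, followed by cheap per-object allocation and freeing. The exact signatures below are assumptions for illustration, as is the my_neighbor structure.

    /* Hypothetical usage of the fixed-block allocator. */
    struct my_neighbor {                        /* fixed size protocol state */
        int id;
        /* ... */
    };

    static struct task_block *neighbor_block;   /* per-type descriptor */

    void my_module_init(void)
    {
        /* Register once, declaring the size of all future requests. */
        neighbor_block = task_block_init(sizeof(struct my_neighbor),
                                         "my_neighbor");
    }

    struct my_neighbor *my_neighbor_new(void)
    {
        /* Served from the matching size-class free list. */
        return (struct my_neighbor *) task_block_alloc(neighbor_block);
    }

    void my_neighbor_free(struct my_neighbor *n)
    {
        task_block_free(neighbor_block, n);     /* back onto the free list */
    }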

Some protocol modules allocate data directly in units of pages. For example, the task module allocates its send and receive buffers this way. These modules may explicitly free such pages by calling task_block_reclaim(). The pages go onto a global free list, out of which the memory manager may allocate data for its block allocator.

Memory Management Internals

The internals of the memory manager can be explained with the aid of the following example. Suppose a protocol module registers with the memory manager for a block allocation of 16 bytes. The memory manager returns to the protocol module a task_block descriptor, after linking that descriptor to the task_size_block descriptor for 16 byte blocks.

The latter descriptor maintains a linked list of free 16 byte blocks. Freed blocks are linked in a singly-linked list (even though the pointer to a free block is cast into the head of a doubly-linked list, only the forward pointer is used in the free list). Whenever the free list for a task_size_block is empty, the memory manager allocates a page, then carves up the page into a linked list of appropriately sized blocks (the fragment of the page that doesn't fit into one block, called a runt, is linked into a free list corresponding to its size).

Allocations are done out of this free list. When a protocol module frees a block, the freed block goes back on the free list.
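
A minimal sketch of the carve-and-free-list scheme follows. All names are hypothetical, and the runt handling described above is reduced to a comment.

    /* Hypothetical sketch of refilling a size class from a fresh page. */
    #include <stddef.h>

    #define PAGE_SIZE 4096

    extern char *get_page(void);        /* hypothetical page allocator */

    struct free_block { struct free_block *next; };

    struct size_class {
        size_t             size;        /* e.g. 16 bytes           */
        struct free_block *free;        /* head of the free list   */
    };

    static void refill(struct size_class *sc)
    {
        char  *page = get_page();
        size_t n    = PAGE_SIZE / sc->size;  /* whole blocks only; the
                                              * runt would go onto the
                                              * free list of its own
                                              * size class */
        for (size_t i = 0; i < n; i++) {
            struct free_block *b =
                (struct free_block *) (page + i * sc->size);
            b->next  = sc->free;             /* push onto free list */
            sc->free = b;
        }
    }

    void *alloc_block(struct size_class *sc)
    {
        if (sc->free == NULL)
            refill(sc);                      /* carve a new page    */
        struct free_block *b = sc->free;     /* pop the head        */
        sc->free = b->next;
        return b;
    }

    void free_block(struct size_class *sc, void *p)
    {
        struct free_block *b = p;            /* freed block goes    */
        b->next  = sc->free;                 /* back on the list    */
        sc->free = b;
    }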