2.0 Code Generation with Ptolemy

2.5 Interprocessor Communication


2.5.1 Send/Receive

When the target architecture is a multiple-processor system, the programmer selects the parallel scheduler that best fits the target and the application. The parallel scheduler determines which actors to assign to which processing elements, as well as when to execute them on each processing element. As an example, consider the simple case in figure 6, where all blocks are homogeneous (producing and consuming a single token).

Suppose that the scheduler generates the schedule shown in the Gantt chart in figure 7. By assigning star B and star A to different processors, the parallel scheduler introduces interprocessor communication between processor 2 and processor 1. The cost of this communication overhead depends on the target. Based on the information specified in the target definition, the scheduler schedules the communication resources and reserves the corresponding time slots in the generated schedule.

The next step is to generate code for each processor. For processor 2, code for star B and the "send" star should be generated sequentially. It is not sufficient, however, simply to concatenate the code of star B and the code of the "send" star; memory and registers must first be allocated appropriately in each processor. Since each processor is itself a target, it can allocate its hardware resources for the generated code, given a galaxy. Thus, sub-universes are generated for the individual processors after parallel scheduling is performed, as shown in figure 8. Note that the "send" and "receive" stars are inserted automatically by the Ptolemy kernel when it creates the sub-universes. The multiple-processor target class is responsible for defining the "send" and "receive" stars.
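
To make the division of responsibility concrete, the following C++ sketch shows one way a multiple-processor target class could supply the communication stars that the kernel splices into the sub-universes. The class and method names here (MultiProcessorTarget, CommPair, createCommPair) are invented for illustration; they are not the actual Ptolemy classes.

    // Hypothetical sketch -- illustrative names, not the actual Ptolemy API.
    class Star {};            // stand-in for a code-generation star
    class Target {};          // stand-in for a Ptolemy target

    // One send/receive pair per interprocessor arc chosen by the scheduler.
    struct CommPair {
        Star* send;           // spliced into the sending processor's sub-universe
        Star* receive;        // spliced into the receiving processor's sub-universe
    };

    // A multiple-processor target is responsible for supplying the pair;
    // the kernel inserts the stars when it builds the sub-universes.
    class MultiProcessorTarget : public Target {
    public:
        virtual CommPair createCommPair(int fromProc, int toProc, int numTokens) = 0;
        virtual ~MultiProcessorTarget() {}
    };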

Once the generated code is loaded, the processors run autonomously. The synchronization protocol between processors is hardwired into the "send" and "receive" stars. One common approach in shared-memory architectures is the use of semaphores: the send star writes the data and then sets a semaphore signaling that the transfer is complete; the receive star waits for the proper semaphore to be set, then reads the data and clears the semaphore. In a message-passing architecture, the send star may form a message header specifying the source and destination processors; the receive star then decodes the message by examining this header. The routing path from the source to the destination processor is determined at compile time, as explained in section 2.3.2. Ptolemy assumes no specific routing algorithm or mechanism; these are provided by the target class.
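
As an illustration of the shared-memory handshake described above, the following C++ sketch shows one possible send/receive pair built around a single semaphore word. The shared layout, names, and busy-wait discipline are assumptions made for this sketch; the actual code is generated in a target-specific way.

    // Minimal sketch of the shared-memory handshake described above.
    // The shared layout and busy-wait discipline are assumptions for
    // illustration; the real code is generated per target.
    #include <atomic>

    struct SharedSlot {
        std::atomic<int> full{0};   // semaphore: 1 = data available
        int              data[4];   // transfer buffer (assumes n <= 4)
    };

    // Body of a "send" star: write the data, then set the semaphore.
    void sendStar(SharedSlot& slot, const int* tokens, int n) {
        while (slot.full.load(std::memory_order_acquire) != 0)
            ;                                   // wait until previous transfer is consumed
        for (int i = 0; i < n; ++i)
            slot.data[i] = tokens[i];
        slot.full.store(1, std::memory_order_release);  // signal completion
    }

    // Body of a "receive" star: wait for the semaphore, read, then clear it.
    void receiveStar(SharedSlot& slot, int* tokens, int n) {
        while (slot.full.load(std::memory_order_acquire) == 0)
            ;                                   // busy-wait for data
        for (int i = 0; i < n; ++i)
            tokens[i] = slot.data[i];
        slot.full.store(0, std::memory_order_release);  // clear the semaphore
    }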

2.5.2 Spread/Collect

In the example of figure 6, we assume that the graph is homogeneous: no sample-rate change occurs in the blocks. In such homogeneous applications, each star is naturally assigned to one processor. Many signal processing applications, however, are multirate, which allows us to split the invocations of a star across multiple processors. Furthermore, operations on blocks of samples, such as an FFT, or operations on vectors make an SDF graph non-homogeneous. Consider the simple multirate application in figure 9, where block A generates two tokens and block B consumes three tokens per firing. One iteration of this universe consists of three invocations of block A and two invocations of block B. The precedence relation among these invocations can be described with the acyclic precedence graph (APG) shown in figure 10. We assume that there is no data dependency between invocations of block A, and likewise for block B. In the figure, A1 represents the first invocation of A, A2 the second, and so on. The APG represents the communication pattern between the invocations.
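
The invocation counts follow from the SDF balance equation: with A producing two tokens and B consuming three per firing, the smallest integer solution of 2*rA = 3*rB is rA = 3 and rB = 2. The short C++ sketch below computes these repetition counts for a single arc; the function and variable names are ours, not Ptolemy's.

    // Sketch: smallest repetition counts for a two-actor SDF arc where the
    // producer emits 'p' tokens and the consumer absorbs 'c' tokens per
    // firing.  Solves p * rProd = c * rCons in smallest positive integers.
    #include <cstdio>

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    int main() {
        int p = 2, c = 3;                 // figure 9: A produces 2, B consumes 3
        int g = gcd(p, c);
        int rProd = c / g;                // invocations of A per iteration -> 3
        int rCons = p / g;                // invocations of B per iteration -> 2
        std::printf("A fires %d times, B fires %d times per iteration\n",
                    rProd, rCons);
        return 0;
    }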

For a target architecture with two processors, a valid schedule for the APG is shown in figure 11. According to the schedule, the target splices the appropriate send and receive stars into the graph. As can be seen in the APG, B1 receives data not only from A1 but also from A2, and A2 has been assigned to the other processor. Since invocation B1 consumes 3 tokens (figure 10), we need a special block that collects the tokens from A1 and A2, in the appropriate order, as the input to B1. This special block is called a Collect star. The sub-universe created for processor 1 is illustrated in figure 12-(a): the Collect star gathers the outputs of block A and of the receive star.

On the other hand, two invocations of block A are assigned to the second processor. Of the four output tokens generated by block A on this processor, the first is routed to processor 1 and the rest are fed into block B. This behavior is expressed by introducing another special block, called a Spread star, as shown in figure 12-(b). Note that the sample rate changes between block A and the Spread star, causing block A to be executed twice. The Spread star directs the first output token of block A to the input buffer of the "send" star; the remaining three tokens are directed to the input buffer of block B.

If memory is used to communicate between blocks, then in most cases the Collect and Spread stars can be implemented simply by overlaying memory buffers; in such cases no code is required to implement them. The AnyAsm pseudo-domain described in section 2.4 provides facilities for actors that work by this kind of buffer-address manipulation.
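
The sketch below illustrates the buffer-overlay idea for the Spread star of figure 12-(b); the buffer name and layout are invented for this example. The two firings of block A write four tokens into one contiguous buffer, and the Spread star simply hands out sub-ranges of that buffer, one token to the "send" star and three to block B, without any copying.

    // Sketch: Spread realized as a buffer overlay (illustrative only;
    // buffer name and size are invented for this example).
    int main() {
        int buf[4] = {0, 0, 0, 0};      // filled by the two firings of A on processor 2

        int* toSend = &buf[0];          // Spread: first token -> "send" star's input
        int* toB    = &buf[1];          // Spread: remaining three tokens -> block B

        // No code is generated for the Spread itself: the send star and block B
        // simply read from these overlapping addresses.  A Collect star is the
        // mirror image, with the receive star and block A writing adjacent
        // regions of one buffer that B then reads whole.
        (void)toSend; (void)toB;
        return 0;
    }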

It is worth emphasizing that the sub-universe does not express the execution order of the blocks; that order is already determined by the parallel scheduler. For example, in figure 12-(b), the execution order is not A, A, Spread, "send", B, as might be expected if SDF scheduling were performed on the sub-universe; according to the schedule in figure 11, the order is A, "send", A, B. The sub-universes are created only to allocate memory and other resources before generating code.

2.5.3 Wormholes

A significant feature of Ptolemy is the capability of intermixing different domains or targets through wormholes. Suppose a code-generation domain lies inside the SDF domain: part of the application is to be run in simulation mode on the user's workstation, while the remainder is to be downloaded to a DSP target system. When we schedule the actors that are to run in the outside SDF-simulation domain at compile time, we also generate, download, and run the code for the target architecture in the inside code-generation domain. For the purposes of this section, we will say "SDF domain" to refer to actors that run in simulation mode, and "code-generation domain" for actors for which code is generated.

In the example of figure 13-(a), a DSP target system is programmed to estimate the power spectrum of a certain signal. At run time, the estimated spectrum is transferred to the host computer and displayed on the screen; thus, the host computer monitors the DSP system. In the next example, figure 13-(b), a DSP system performs a complicated filtering operation on a signal passed from the host computer and sends the filtered result back. In this case, the DSP hardware serves as a hardware accelerator for number crunching. Through the wormhole mechanism in Ptolemy, as demonstrated in these examples, the host computer can interact with the DSP system. In Ptolemy, a wormhole is an entity that, from the outside, obeys the semantics of one domain (here, it behaves like an SDF simulation actor) but, on the inside, contains actors of an entirely different domain.

Data communication between the host computer and the DSP target architecture is achieved at the wormhole boundary. In the SDF domain, data is transferred to the input porthole of the wormhole. The input porthole of a wormhole consists of two parts: one visible from the outside SDF domain and the other visible in the inside code-generation domain. The inner part of the porthole is designed in a target-specific manner, so that it sends the incoming data to the target architecture. At the output porthole of the wormhole, the inner part, corresponding to the inside code-generation domain, receives the data from the DSP hardware and transfers it to the outer part visible from the outside SDF domain. In summary, for each target architecture we can optionally design target-specific wormholes to communicate data with the Ptolemy simulation environment; all that is needed to create this capability for a new Target is to write a pair of routines for transferring data that use a standard interface.
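
As a rough sketch of such a pair of routines, the C++ interface below shows the shape that a target-specific data-transfer layer might take. The class and method names are assumptions made for illustration, not the actual Ptolemy interface.

    // Hypothetical sketch of the "pair of routines" a new Target supplies to
    // move data across the wormhole boundary.  The interface shown here is an
    // assumption for illustration, not the actual Ptolemy base-class API.
    class TargetLink {
    public:
        // Outside SDF domain -> target hardware (inner part of the input porthole).
        virtual void sendToTarget(const float* tokens, int numTokens) = 0;
        // Target hardware -> outside SDF domain (inner part of the output porthole).
        virtual void receiveFromTarget(float* tokens, int numTokens) = 0;
        virtual ~TargetLink() {}
    };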


