As a seasoned high-performance networking code writer (with a PhD focused on Cache Server for Distributed Applications Adapted to Multicore Systems), I’ve noticed a common oversight in tutorials: they often gloss over the fundamentals of network server models. This article aims to demystify this crucial aspect of high-performance networking by providing a comprehensive overview and comparison of these models.

This piece caters to “system programmers”—back-end developers who delve into the nitty-gritty of their applications, crafting network server code. While this is often done in C++ or C, most contemporary languages and frameworks now offer respectable low-level functionality, albeit with varying efficiency.
I’m assuming you understand that the trend of scaling CPUs by adding cores necessitates software adaptation to leverage these cores effectively. The challenge then becomes how to distribute software tasks among threads (or processes) for parallel execution on multiple CPUs.
I’m also assuming you’re familiar with “concurrency” as essentially “multitasking”: multiple code instances in flight at the same time. Concurrency is achievable even on a single CPU and was the norm before the multi-core era; the operating system rapidly switches between processes or threads, giving the illusion of simultaneous execution. “Parallelism,” by contrast, means code genuinely executing at the same time on multiple CPUs or cores.
Dividing an Application (Threads vs. Processes)
For our purposes, the distinction between threads and full processes is largely irrelevant. Modern operating systems (except, notably, Windows) handle processes with nearly the same lightweight efficiency as threads (or vice versa in some cases). The key difference now lies in cross-process/thread communication and data sharing capabilities. I’ll highlight any distinctions relevant to our discussion; otherwise, consider “thread” and “process” interchangeable in this context.
Network Server Models and Their Tasks
This article zeroes in on network server code, which inherently handles three core tasks:
- Task #1: Establishing (and terminating) network connections
- Task #2: Network communication (IO)
- Task #3: The application’s core function or payload
Several general network server models dictate how these tasks are split across processes:
- MP: Multi-Process
- SPED: Single Process, Event-Driven
- SEDA: Staged Event-Driven Architecture
- AMPED: Asymmetric Multi-Process Event-Driven
- SYMPED: SYmmetric Multi-Process Event-Driven
These terms, commonly used in academia, might have alternative names in practice. Remember, the names are secondary to understanding the underlying code mechanics.
The following sections will delve into each network server model.
The Multi-Process (MP) Model
The MP model, a classic introduction to multithreading, involves a “master” process handling connection acceptance (Task #1). Upon connection, it spawns a new process, delegating the connection socket—resulting in one process per connection. This new process then typically interacts with the connection sequentially: reading data (Task #2), processing it (Task #3), and writing back (Task #2).
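To make the flow concrete, here is a minimal, heavily abbreviated sketch of the MP model in C. The port number and the echo-style handle_client() payload are placeholders of my own, and real code would need proper error handling.

```c
/* MP model sketch: one forked process per accepted connection.
 * The echo-style payload and port 8080 are illustrative placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

static void handle_client(int conn) {
    char buf[4096];
    ssize_t n;
    while ((n = read(conn, buf, sizeof(buf))) > 0)  /* Task #2: read           */
        write(conn, buf, (size_t)n);                /* Task #3, then #2: write */
}

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 128);
    signal(SIGCHLD, SIG_IGN);                       /* let the OS reap children */

    for (;;) {
        int conn = accept(srv, NULL, NULL);         /* Task #1: master accepts  */
        if (conn < 0)
            continue;
        if (fork() == 0) {                          /* one process per connection */
            close(srv);
            handle_client(conn);
            close(conn);
            _exit(0);
        }
        close(conn);                                /* parent keeps only the listener */
    }
}
```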
The MP model is remarkably straightforward to implement, but it performs well only while the process count stays relatively low (generally no more than about twice the number of CPU cores). Beyond that point the operating system spends a growing share of its time juggling processes (“thrashing”), and performance drops off.
Pros: Simplicity and efficiency with a limited number of connections.
Cons: Potential for OS overload with a large number of processes and latency jitter due to network IO waiting on payload processing.
The Single Process Event-Driven (SPED) Model
Popularized by high-profile servers like Nginx, the SPED model handles all three tasks within a single process through efficient multiplexing. It relies heavily on advanced kernel features such as epoll and kqueue. Driven by incoming connection and data “events,” it operates on an “event loop” (a minimal sketch follows the list):
- Check for new network “events” (new connections or incoming data)
- Establish new connections (Task #1)
- Read available data (Task #2) and process it (Task #3)
- Repeat until termination
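Below is a minimal sketch of such a loop using Linux’s epoll (kqueue on the BSDs is analogous). The process() payload function is a hypothetical placeholder, the listening socket is assumed to be set up already, and error handling is abbreviated.

```c
/* SPED sketch: one process, one epoll-driven event loop handling all tasks. */
#include <sys/epoll.h>
#include <unistd.h>

void process(int fd, const char *buf, size_t len);   /* hypothetical Task #3 handler */

void event_loop(int srv) {                           /* srv: non-blocking listening socket */
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = srv };
    epoll_ctl(ep, EPOLL_CTL_ADD, srv, &ev);

    struct epoll_event events[64];
    for (;;) {                                       /* repeat until termination      */
        int n = epoll_wait(ep, events, 64, -1);      /* wait for new network "events" */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == srv) {                         /* Task #1: accept new connection */
                int conn = accept(srv, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
            } else {                                 /* Task #2: data is ready to read */
                char buf[4096];
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) { close(fd); continue; }
                process(fd, buf, (size_t)len);       /* Task #3: payload processing    */
            }
        }
    }
}
```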
This single-process approach, minimizing context switching, allows for handling tens of thousands of connections concurrently. However, it has two major drawbacks:
- The sequential nature of the loop means lengthy payload processing (Task #3) can stall other tasks, creating latency inconsistencies.
- It utilizes only a single CPU core, leaving others idle.
These limitations pave the way for more sophisticated models.
Pros: High performance, minimal OS overhead, and the simplicity of single-threaded code (no locking or synchronization between connections).
Cons: Single CPU utilization and potential for uneven response latency with variable payload processing times.
The Staged Event-Driven Architecture (SEDA) Model
SEDA is a more intricate model, breaking a complex application down into stages linked by queues. It functions as follows (a rough sketch of a single stage follows the list):
- Payload work (Task #3) is divided into modules, each a separate process with a specific function, communicating through message queues. This architecture resembles a graph, with processes as nodes and message queues as edges.
- A single process (often SPED-based) handles Task #1, directing new connections to entry point nodes, which can be network-focused (Task #2) or handle payload processing (Task #3). Responses originate from individual nodes, often eliminating the need for a central “master” process.
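To make the stage-and-queue idea concrete, here is a rough sketch of one stage, using threads rather than full processes (which this article treats as interchangeable). The queue, msg, and stage names are my own illustrative choices, not taken from any particular SEDA implementation.

```c
/* SEDA sketch: a stage is a thread with an input queue; queues are the graph's edges. */
#include <pthread.h>
#include <stddef.h>

typedef struct msg { void *data; struct msg *next; } msg;

typedef struct queue {                       /* simple unbounded blocking queue */
    msg *head, *tail;
    pthread_mutex_t mu;
    pthread_cond_t cv;
} queue;

static void queue_init(queue *q) {
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->cv, NULL);
}

static void queue_push(queue *q, msg *m) {
    pthread_mutex_lock(&q->mu);
    m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

static msg *queue_pop(queue *q) {
    pthread_mutex_lock(&q->mu);
    while (!q->head)
        pthread_cond_wait(&q->cv, &q->mu);
    msg *m = q->head;
    q->head = m->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->mu);
    return m;
}

typedef struct stage {                       /* a node of the processing graph */
    queue *in, *out;
    void (*work)(msg *m);                    /* this stage's slice of Task #3  */
    pthread_t thread;
} stage;

static void *stage_loop(void *arg) {
    stage *s = arg;
    for (;;) {
        msg *m = queue_pop(s->in);           /* wait for work from upstream    */
        s->work(m);                          /* do this stage's processing     */
        if (s->out) queue_push(s->out, m);   /* forward to the next stage      */
    }
    return NULL;
}

static void stage_start(stage *s) {
    pthread_create(&s->thread, NULL, stage_loop, s);
}
```

Wiring stages together then amounts to calling queue_init() on each edge, pointing one stage’s out at the next stage’s in, and starting each stage with stage_start().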
Theoretically scalable to complex scenarios, SEDA can become unwieldy in practice. The message passing overhead can cripple performance compared to SPED, especially with lightweight per-node processing. SEDA is usually reserved for scenarios demanding intricate and time-consuming payload processing.
Pros: Highly modular design, ideal for compartmentalizing tasks.
Cons: Prone to complexity, with message queuing potentially becoming a bottleneck.
The Asymmetric Multi-Process Event-Driven (AMPED) Model
A simplified variant of SEDA, the AMPED model uses fewer modules, processes, and message queues. It operates as follows (a partial sketch of the worker side follows the list):
- Tasks #1 and #2 are handled by a single “master” process (SPED-style), solely responsible for network IO.
- Task #3 is delegated to a separate “worker” process (potentially multiple instances), connected to the master through a queue (one per process).
- The master process, upon receiving data, finds an available worker process and queues the data. Once the worker processes the data and has a response ready, it notifies the master, which relays the response back through the connection.
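The sketch below shows only the worker side of this hand-off. The job, jobq, and do_payload() names are hypothetical, the queues could be like the one sketched in the SEDA section, and the master’s event loop is assumed to watch notify_fd[0] so that a single byte on the pipe wakes it when a response is ready.

```c
/* AMPED sketch (worker side): pull a job from the master, run the payload,
 * push the result onto the done-queue, and poke the pipe to wake the master. */
#include <unistd.h>

typedef struct job {
    int conn_fd;            /* connection the response belongs to              */
    char *request;          /* data the master read from the socket (Task #2)  */
    char *response;         /* filled in by the worker (Task #3)                */
    struct job *next;
} job;

/* Hypothetical blocking queues and payload function, supplied elsewhere. */
typedef struct jobq jobq;
extern jobq *todo_queue, *done_queue;
extern void jobq_push(jobq *q, job *j);
extern job *jobq_pop(jobq *q);
extern char *do_payload(const char *request);

static int notify_fd[2];    /* pipe(notify_fd): workers write, master's epoll reads */

/* Started by the master, e.g. with pthread_create() or fork(). */
void *worker_main(void *arg) {
    (void)arg;
    for (;;) {
        job *j = jobq_pop(todo_queue);          /* block until the master queues work */
        j->response = do_payload(j->request);   /* Task #3: the actual payload        */
        jobq_push(done_queue, j);
        write(notify_fd[1], "x", 1);            /* wake the master's event loop       */
    }
    return NULL;
}
```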
This model decouples payload processing, allowing for complexity without affecting network IO and potentially enhancing security.
Pros: Clear separation of network IO and payload processing.
Cons: Potential bottlenecks due to message queue reliance for inter-process communication.
The SYmmetric Multi-Process Event-Driven (SYMPED) Model
Often considered the gold standard, the SYMPED model essentially runs multiple independent SPED “worker” processes in parallel. A single process accepts connections and distributes them to the workers, each of which runs its own SPED-style event loop. This approach offers several benefits:
- Full CPU utilization, typically with one worker process (and its own event loop) per core, each handling both network IO and payload work for its connections.
- Minimal inter-process communication for independent connections (e.g., HTTP).
Modern Nginx uses this model, spawning worker processes that each run their own event loop. Many operating systems streamline this further by allowing multiple processes to listen on the same TCP port (e.g., SO_REUSEPORT on Linux), eliminating the need for a dedicated connection-handling process. If your application permits it, the SYMPED model is highly recommended.
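Here is a minimal sketch of that arrangement on Linux 3.9+ using SO_REUSEPORT, reusing the event_loop() from the SPED sketch above; the port and worker count are illustrative, and error handling is omitted.

```c
/* SYMPED sketch: N worker processes, each with its own listening socket
 * (via SO_REUSEPORT) and its own independent SPED-style event loop. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

extern void event_loop(int listen_fd);           /* the SPED loop sketched earlier */

static int make_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}

int main(void) {
    long workers = sysconf(_SC_NPROCESSORS_ONLN); /* one worker per core */
    for (long i = 0; i < workers; i++) {
        if (fork() == 0) {
            event_loop(make_listener(8080));      /* each worker accepts on its own */
            _exit(0);
        }
    }
    for (;;)
        pause();                                  /* parent only supervises */
}
```

With SO_REUSEPORT, the kernel itself balances incoming connections across the listening sockets, so no user-space distributor process is involved.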
Pros: CPU usage you control by choosing the number of worker processes, each running an efficient SPED-style loop.
Cons: Variable latency can still occur with non-uniform payload processing due to the SPED-like loop in each process.
Low-Level Optimization Techniques
Beyond choosing the right architecture, a few low-level optimizations can further enhance network code performance (illustrative snippets follow the list):
- Minimize Dynamic Memory Allocation: Memory allocators are complex, often relying on intricate data structures and mutexes. For instance, jemalloc alone comprises roughly 450 KiB of C code! Most models discussed can be implemented with static or pre-allocated buffers, efficiently transferring ownership between threads.
- Maximize OS Capabilities: Leverage features like multiple processes listening on the same socket (SO_REUSEPORT) and delaying connection acceptance until data actually arrives (e.g., TCP_DEFER_ACCEPT on Linux). Use sendfile() whenever possible to avoid copying file data through user space.
- Profound Protocol Understanding: Optimize based on the specifics of the network protocol. For example, disabling Nagle’s algorithm or connection lingering can be beneficial depending on the context. Familiarize yourself with TCP congestion control algorithms and consider utilizing newer, more efficient options.
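A few Linux-specific illustrations of the knobs mentioned above follow. Option names, defaults, and availability (for example, the bbr congestion control module) vary across kernels and operating systems, so treat these as starting points rather than a recipe.

```c
/* Illustrative socket tuning for a Linux TCP server; availability varies by kernel. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>

static void tune_listener(int listen_fd) {
    int secs = 5;
    /* Don't wake the accept path until the client has actually sent data. */
    setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &secs, sizeof(secs));
}

static void tune_connection(int conn_fd) {
    int one = 1;
    /* Disable Nagle's algorithm for latency-sensitive, small writes. */
    setsockopt(conn_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    /* Opt in to a newer congestion control algorithm, if the kernel provides it. */
    const char *cc = "bbr";
    setsockopt(conn_fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
}

/* Send a file to a connection without copying it through user space. */
static ssize_t send_file(int conn_fd, int file_fd, size_t length) {
    off_t offset = 0;
    return sendfile(conn_fd, file_fd, &offset, length);
}
```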
I’ll likely delve deeper into these and other optimization techniques in a future post. Hopefully, this article provides a solid foundation for understanding the architectural options available for crafting high-performance networking code.