The remoteness of many Pacific Islands, combined with small populations and low GDP, often makes connection to international fibre networks prohibitively expensive. The only alternative for local ISPs is satellite connectivity. While widespread, satellite connections are expensive, and island users often complain about slow speeds and connection timeouts. Efficient use of the satellite resource is therefore important for ISPs in the Pacific. Data supplied by a number of Pacific ISPs suggests, however, that peak satellite capacity usage is sometimes nowhere near 100%.

Under a previous ISIF grant (2014, under the auspices of the Pacific Island Chapter of the Internet Society (PICISOC) and in collaboration with colleagues at the Massachusetts Institute of Technology and Aalborg University), we implemented network encoders/decoders in four satellite-connected island locations around the Pacific: Aitutaki, Niue, Rarotonga and Tuvalu. Our original assumption was that general congestion and packet losses on hopelessly overloaded links were to blame for low goodput, and the aim of the project was to investigate whether network coding could improve TCP goodput on such links. On closer investigation, however, we found that most links were not hopelessly overloaded. Instead, they suffered from a slightly different problem: TCP queue oscillation.

TCP queue oscillation occurs when a TCP sender tries to regulate its sending rate to fit the narrowband capacity of the satellite link but fails, because the link’s long latency prevents the sender from detecting packet losses at the input to the satellite gateway quickly enough to respond appropriately. The sender thus keeps transmitting into an already overflowing input queue, and subsequently holds back data once the queue has cleared.
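To illustrate the mechanism, the toy model below simulates a single AIMD sender whose loss feedback arrives a full satellite round trip late, feeding a small gateway queue. It is a deliberate caricature, not our simulator, and all parameters are invented; running it shows the queue swinging between overflow (heavy loss) and empty (an idle link):

```python
# Toy single-flow model of TCP queue oscillation on a long-latency link.
# A crude caricature for illustration only - all parameters are invented
# and this is not the simulator described in this article.

from collections import deque

RTT_TICKS = 50     # feedback delay in ticks (long satellite round trip)
QUEUE_LIMIT = 25   # packets the gateway input queue can hold
LINK_RATE = 2      # packets the satellite link drains per tick

cwnd = 1.0                             # AIMD congestion window
queue = 0                              # gateway queue occupancy
feedback = deque([False] * RTT_TICKS)  # loss signals still in flight

for tick in range(400):
    # The sender reacts to a loss signal that is a full RTT old.
    if feedback.popleft():
        cwnd = max(1.0, cwnd / 2)      # multiplicative decrease
    else:
        cwnd += 0.2                    # additive increase

    # The sender transmits cwnd packets into the gateway queue;
    # whatever doesn't fit is lost, but the sender won't know for an RTT.
    arriving = int(cwnd)
    accepted = min(arriving, QUEUE_LIMIT - queue)
    lost = arriving - accepted
    queue += accepted

    # The link drains the queue (and idles when the queue is empty).
    sent = min(queue, LINK_RATE)
    queue -= sent
    feedback.append(lost > 0)

    if tick % 20 == 0:
        print(f"t={tick:3d}  cwnd={cwnd:5.1f}  queue={queue:2d}  "
              f"lost={lost:2d}  link={sent}/{LINK_RATE}")
```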
TCP queue oscillation has been known for several decades and had been considered solved, albeit only in the context of a small number of parallel connections, or in cases where the satellite link didn’t really represent much of a bottleneck. Provisioning hundreds of clients through a satellite bottleneck had not really been considered, however. We found the effect, characterised by high packet loss during peak demand periods and simultaneous link underutilisation, to be alive and well in the Pacific. For ISPs in the islands, this is a problem because it leaves expensive satellite capacity inaccessible during the “queue empty” phase of the oscillation. We demonstrated that network coding could achieve significantly better goodput for individual connections under queue oscillation conditions, allowing them to claw back some of the unused capacity.

What was not so easy to show was that this would scale to the traffic of an entire island: coded solutions require a non-trivial change in network topology, and this was simply too much to ask of production links. The alternative is simulation. We initially tried software-based network simulators but quickly found them unpromising: generating a realistic traffic mix is difficult, especially for a large number of clients. Simulators also have an inherent tendency to process clients serially, while real network hosts operate in parallel. Moreover, if a software simulator does not rely on “real” TCP stack code and its associated timing, it is difficult to verify that its behaviour in complex scenarios is correct; if it does use real components, any simulation must run in real time. We know from observation that some island queues don’t oscillate until they see over 2000 simultaneous connections, which is impossible to simulate in software on a single machine.
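For readers unfamiliar with network coding, the idea can be shown in miniature: rather than retransmitting specific lost packets across the long satellite hop, the sender transmits random linear combinations of the packets in a fixed-size “generation”, and the receiver decodes the whole generation from any sufficiently large set of independent combinations, regardless of which individual transmissions were lost. The sketch below uses GF(2), where mixing is plain XOR; the codes we actually deployed are more capable, and all names and parameters here are illustrative:

```python
# Minimal random linear network coding sketch over GF(2) (mixing = XOR).
# Illustrative only - real deployments use larger fields and generations.

import random

GEN_SIZE = 4   # source packets per generation ("generation size")
PKT_LEN = 8    # payload bytes per packet, kept tiny for the example

def coded_packet(generation):
    """XOR a random non-empty subset of the generation together."""
    coeffs = [0] * GEN_SIZE
    while not any(coeffs):
        coeffs = [random.randint(0, 1) for _ in range(GEN_SIZE)]
    payload = bytes(PKT_LEN)
    for c, pkt in zip(coeffs, generation):
        if c:
            payload = bytes(a ^ b for a, b in zip(payload, pkt))
    return coeffs, payload

def decode(received):
    """Gauss-Jordan elimination over GF(2); returns the original packets
    once GEN_SIZE linearly independent combinations have arrived."""
    coeffs = [list(c) for c, _ in received]
    payloads = [bytearray(p) for _, p in received]
    pivot = {}                               # column -> pivot row index
    for col in range(GEN_SIZE):
        rows = [i for i in range(len(coeffs))
                if coeffs[i][col] and i not in pivot.values()]
        if not rows:
            return None                      # not yet decodable
        p = rows[0]
        for i in range(len(coeffs)):         # clear the column elsewhere
            if i != p and coeffs[i][col]:
                coeffs[i] = [a ^ b for a, b in zip(coeffs[i], coeffs[p])]
                payloads[i] = bytearray(a ^ b for a, b in
                                        zip(payloads[i], payloads[p]))
        pivot[col] = p
    return [bytes(payloads[pivot[c]]) for c in range(GEN_SIZE)]

generation = [bytes(random.randint(0, 255) for _ in range(PKT_LEN))
              for _ in range(GEN_SIZE)]
received = []
while True:
    pkt = coded_packet(generation)
    if random.random() < 0.3:                # simulate a 30% loss link
        continue
    received.append(pkt)
    decoded = decode(received)
    if decoded is not None:
        break
assert decoded == generation
print(f"decoded after {len(received)} coded packets received")
```

No individual packet needs to survive; only enough independent combinations do, which is what lets a coded connection keep filling the link through the loss bursts that queue oscillation produces.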
In 2015, we therefore embarked on building laboratory facilities that would enable us to simulate the behaviour of an entire island. Our satellite simulator currently consists of over 110 computers in two 7-foot 19″ racks. One rack contains the “island clients”, the other the “satellite link” and the “servers around the world”. The simulator’s hardware was bought with a grant from Internet NZ and university CAPEX, plus a research grant balance kindly donated by former IETF chair Brian Carpenter. We have also developed software that lets us run a large number of simultaneous TCP connections whose overall statistics follow actual traffic profiles observed in the Pacific. That is, we can make our computers behave as if they were all the Internet users in, say, Rarotonga. This allows us to create island scenarios (coded, uncoded, with a performance-enhancing proxy (PEP), etc.) and simulate them in real time in a controlled and reproducible manner.

We are currently in the final stages of taking the simulator through its “unencoded baseline” tests. Their aim is to verify that the simulator is fit for purpose and that neither our software nor our hardware creates unexpected bottlenecks that we would not see in a real scenario. Once the baseline tests are complete, the simulator’s first task (under our Internet NZ grant) will be to see how network coding stacks up against performance-enhancing proxies.

Beyond the work to be delivered under the Internet NZ grant, the simulator now represents a unique resource that lets us investigate a plethora of other research questions pertaining to satellite networks and network coding in island scenarios. These include:

1) Under which circumstances exactly does queue oscillation occur? How much traffic load is required for a given amount of bandwidth and latency?

2) If an ISP gives us their traffic profile, can we predict whether their satellite link will oscillate, now or in the future as traffic grows? Can we predict whether they would benefit from coding, and which code generation size will work best?

3) Our present network coding software can adapt the coding overhead to network conditions on a timescale of seconds to minutes, and we already know that this can improve goodput over the fixed-overhead version. But how do we best size the adaptive algorithm’s observation window in different scenarios? (See the sketch after this list.)

4) Would different types of codes work better? The network codes we are using at this point require a lot of overhead to handle the long burst error runs caused by queue oscillation. We expect these runs to shrink once we can encode all the traffic to and from an island, but this may not always be possible, e.g., if an ISP is unwilling to encode/decode. In such scenarios in particular, network coding may still be an option for individual users, who would then want codes that are especially good at burst error correction. Prof. Martin Bossert from the University of Ulm (one of the fathers of the GSM mobile communication system) suggested the use of partial unit memory (PUM) codes during a recent visit; interleaved Reed-Solomon codes could also work in this context.
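To make question 3 concrete, here is a minimal sketch of the kind of adaptation loop involved, assuming a sender that estimates the loss rate over a sliding window of recent delivery reports and sizes each generation’s redundancy accordingly. The class name, window mechanism and safety margin are our inventions for illustration, not the algorithm in our deployed software:

```python
# Sketch of loss-adaptive coding overhead with a sliding window.
# Illustrative only - the window policy is precisely the open question.

from collections import deque

class AdaptiveOverhead:
    def __init__(self, gen_size=32, window=2000, margin=1.1):
        self.gen_size = gen_size              # source packets/generation
        self.reports = deque(maxlen=window)   # 1 = delivered, 0 = lost
        self.margin = margin                  # safety factor on estimate

    def record(self, delivered):
        """Feed in one delivery report from the receiver."""
        self.reports.append(1 if delivered else 0)

    def coded_packets_per_generation(self):
        """Send enough coded packets that gen_size of them are expected
        to survive at the currently estimated loss rate."""
        if not self.reports:
            return self.gen_size
        loss = 1 - sum(self.reports) / len(self.reports)
        loss = min(loss, 0.9)                 # keep the formula bounded
        return int(self.gen_size * self.margin / (1 - loss)) + 1
```

The window length is exactly the trade-off in question: a short window chases noise and overreacts within a single oscillation cycle, while a long window averages the loss bursts away and lags the onset of the next “queue full” phase. The simulator lets us measure which choice wins under which traffic profile.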
All of this represents work for several years to come, and it hinges on our ability to turn the simulator into a flexible, easy-to-operate system that new research students find easy to learn and use. The core of this application is therefore a one-year scholarship for Lei Qian, one of my current PhD students. His PhD project centres on the simulator’s construction and commissioning, and being able to support him for the next year will enable us to turn the simulator into the resource it can be.