Best-Effort Cache Synchronization with Source Cooperation by Chris Olston and Jennifer Widom Summary: This paper presents a push-based method of cache synchronization, where data sources are aware of a cache and actively coordinate in sending updates to the cache.The method prioritizes which data items within a particular source should be sent. This priority is calculated using a previous update times and some weighted divergence metric variables. Once the priority for a certain data item has reached a local source threshold, the source sends the update. Because of unstable system parameters, such as cache and source side bandwidths, this threshold is made adaptive through negative feedback messages from the cache. Each source gradually increases its update threshold. If the cache detects available bandwidth, it will send a message to as many sources as possible, indicating that sources should lower their thresholds. Comments: Large scale pushed based systems seem somewhat questionable. It is not very likely that a cache's data sources will decide to coordinate with the cache. Also, assuming the additional complexity (though fairly lightweight) required to be implemented at each source, independent sources will be unlikely to adopt such a scheme. Although this problem may be orthogonal to the actual efficiency of the paper's algorithm, the authors nonetheless decided to compare it to a pull-based system which is much more likely to be implemented in practice. Pull-based systems will undoubtedly be less efficient and have higher divergence. This is because push-based systems push only changed data, and pull-based systems are expected to obtain at least some unchanged information in predicting which data items might be changed. However, if a coordinated system of independent data sources and a cache could be developed, the algorithm does seem quite useful. What makes this algorithm attractive is that sources only need to check their data items when the item is updated, and they only need to send updates once the threshold is reached. In addition, the threshold-setting portion of the algorithm ensures that thresholds are kept at reasonable levels. The experimental results are impressive as well, with the actual implemented algorithm performed similarly to an idealized version. There are some questions, however. The paper claims that an ? = 10 is an optimal value. In addition, the paper states that when available bandwidth is detected by the cache, the cache sends a feedback messages "to as many sources as possible." This would result in 10 times more data coming into the cache. This seems odd, because it implies a considerable amount of bandwidth is unused until the cache sends out a feedback message (the cache must make sure it is safe to receive 10 times as much data). Large amounts of unused bandwidth would make pull-based schemes much more appealing, since pull-based schemes never waste bandwidth. If a push-based scheme wastes more bandwidth (through lack of use) than a pull-based scheme (through obtaining unchanged items) than a pull-based scheme may appear more attractive. A more efficient solution would be to send out messages to selected data sources. Perhaps some other algorithm to determine priority data sources (at the cost of more overhead) would then be required as well. There are also some implementation questions, such as how to reorder a priority queue as certain data items overtake other items in priority. The question is not so much in reordering as in how this will effect efficiency of keeping track of all data items. Similarly, how a cache could detect violations in a source's updates (perhaps they violate a cache priority policy) might become problematic as well. In general the algorithm will probably work well when designing a single system with multiple components. It remains to be seen whether independent organizations would be willing to participate in a system that would require more complexity to implement.