Inspired by Should I read papers? by Michael R. Bernstein and the amazing Papers We Love GitHub repository, I set out at the beginning of January 2015 with the goal of reading 15 academic papers. In this post I want to share with you my notes for the 5 I loved the most.

Consult your local Computer Scientist to get the full meaning and significance of each paper. If you catch something or have a comment, please file an issue.

Takeaways

I quickly realized that reading papers is like getting access to a secret treasure trove of new and amazing ideas. Or as James Mickens put it (in reference to the Google BigTable paper):

For a lot of people reading that paper was like receiving knowledge from an alien race whose intelligence vastly surpassed our own.

Most papers are written in a way that someone (like me) with a basic smattering of CS knowledge can understand, and yet the ideas presented in them are extremely thought-provoking.

Additionally, in a certain way reading papers is a comforting activity. In a world where the front-end framework you’re using might be eclipsed in usefulness in a year or two, many of these papers present ideas that are as applicable today as they were 43 years ago. Here are the publication dates of the 15 papers I read: (please humor me, I really wanted to make a timeline with SVG)

Reading papers also gives you a more personal view of the tools you use every day. Sure, we all use TCP, but if you read A Protocol for Packet Network Intercommunication you might feel like you’ve gotten to know Vint Cerf and Bob Kahn a little bit. Then every time you hear about TCP, you can be like, “Hey, I know those guys!” And you get to feel lucky that you peeked through a little window into the past and saw where TCP came from and the design tradeoffs Cerf and Kahn made.

Anyway, enough rambling. On to the notes.

Bufferbloat: Dark Buffers in the Internet

Something unintuitive is slowing down the internet. You’d think that increasing the packet buffer size on network routers can only be a good thing, but when combined with the way TCP works, it increases latency and degrades performance across the entire network.

TCP was designed in a world where memory was far more expensive and thus network buffers were far smaller than they are now. To begin transmission, TCP starts out sending a small number of packets at a time and gradually increases that number until it detects packet loss. This is called the slow-start strategy.

However, if TCP is sending packets across a network whose routers and equipment have very large buffers, it will not experience enough packet loss for its window sizing algorithm to work correctly. This causes TCP to send out larger and larger windows of packets and simply wait until they (slowly) come back. This increases round-trip time (RTT) for everyone on the network.
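To convince myself of the effect, here is a toy simulation. It is only a sketch under heavy assumptions: a single sender, one bottleneck router, a window that doubles every round and halves on loss, and time measured in abstract "rounds". The function name `simulate` and its parameters are mine, not from the paper, and real TCP (with congestion avoidance, timers, and ACK clocking) is far more subtle.

```python
def simulate(buffer_size, link_rate=10, rounds=40):
    """Return the peak queueing delay (in rounds) seen at a toy bottleneck.

    Each round the sender offers `cwnd` packets, the router queues what fits
    in `buffer_size`, drops the rest, and the link drains `link_rate` packets.
    """
    cwnd = 1           # congestion window, in packets
    queue = 0          # packets currently sitting in the router buffer
    peak_delay = 0.0
    for _ in range(rounds):
        queue += cwnd
        dropped = max(0, queue - buffer_size)
        queue -= dropped
        # A packet at the back of the queue waits queue / link_rate rounds.
        peak_delay = max(peak_delay, queue / link_rate)
        queue = max(0, queue - link_rate)
        if dropped:
            cwnd = max(1, cwnd // 2)   # loss is the only signal to back off
        else:
            cwnd *= 2                  # no loss seen, so keep growing
    return peak_delay
```

Calling `simulate(20)`, `simulate(200)`, and `simulate(2000)` shows the pattern: the sender only backs off once the buffer overflows, so in this model the worst-case wait grows with the buffer and is capped at `buffer_size / link_rate`.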

One way to combat this is via queue management, which Gettys details in the paper. It’s a really interesting read.

Live Migration of Virtual Machines

Imagine you’re upgrading your infrastructure and need to restart a server. Wouldn’t it be nice if you could have all the VMs running on that machine seamlessly transfer and keep running on another host?

This sort of VM “teleportation” has been accomplished before, but the authors of this paper set out to create a solution that migrates machines in as little time as possible. They achieved this by running a pre-copy phase that would attempt to copy over as much memory as possible from the original VM to the new VM before the actual stop-and-transfer phase happened.
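The pre-copy idea is easier to see with numbers. The sketch below is my own simplification, not the authors' algorithm: it assumes that while a round of pages is being copied, the still-running VM dirties a fixed percentage of that many pages, which then have to be sent again. The names `precopy_rounds`, `dirty_pct`, and `stop_threshold` are hypothetical.

```python
def precopy_rounds(total_pages, dirty_pct, stop_threshold=64, max_rounds=30):
    """Simulate iterative pre-copy.

    Returns (history, remaining): the number of pages sent in each pre-copy
    round, and the pages left for the final stop-and-transfer phase.
    """
    pending = total_pages      # round one sends every page
    history = []
    for _ in range(max_rounds):
        if pending <= stop_threshold:
            break              # small enough to pause the VM briefly
        history.append(pending)
        # Pages dirtied while this round was in flight must be re-sent.
        pending = min(total_pages, pending * dirty_pct // 100)
    return history, pending
```

With `precopy_rounds(100_000, 10)` the pending set shrinks each round until only a handful of pages remain for the pause. With `dirty_pct=100` it never shrinks, which hints at why a VM that writes memory as fast as the network can copy it is the hard case.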

There were two things that made the migration process much harder than just copying memory:

Complication 1: How do you keep the same IP address and make sure packets get routed to the new host?

Other solutions the authors mention require that the old VM still be kept up to forward packets to the new one. This wasn’t an ideal solution. The authors got around this by utilizing the network infrastructure to broadcast a new MAC address using ARP. One limitation, though, is that this only works in trusted networks.
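For the curious, this is roughly what such an announcement looks like on the wire. The sketch only builds the bytes of an unsolicited ARP reply using the standard Ethernet and ARP field layout; it does not send anything (that would need a raw socket and elevated privileges). The function name, the choice of a broadcast target hardware address, and the example addresses are my assumptions, not details from the paper.

```python
import socket
import struct


def unsolicited_arp_reply(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame announcing that `ip` now lives at `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4,
    # operation 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Sender pair is the new location; the target IP repeats the sender IP.
    arp += mac_bytes + ip_bytes + broadcast + ip_bytes
    return ethernet + arp
```

For example, `unsolicited_arp_reply("02:00:00:aa:bb:cc", "10.0.0.5")` returns a 42-byte frame. Nothing in it is authenticated, which is one way to see why the approach depends on a trusted network.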

Complication 2: What if you can’t use a 3rd machine to orchestrate the migration?

How do you write a program that can move itself and only execute on one machine at a time? The authors detail how they did this in the paper by using checkpointing.

  • Published in 2005 by Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, Andrew Warfield
  • Read the PDF

The Akamai Network: A Platform for High Performance Internet Applications

Coming into this paper without much knowledge of Distributed Systems, I was really excited to learn about some of the hard-won design decisions borne out of the sheer scale of Akamai.

Assume that a significant number of failures are happening at all levels.

It’s cheaper to build software redundancy than hardware redundancy.

Some of the coolest things I took from this paper were:

The internet was never built with guarantees about uptime and deliverability in mind. TCP helped improve this somewhat. Akamai literally built a layer on top of the Internet to be able to make those kinds of guarantees.

Akamai has a service - kind of like a massively-parallel Heroku - that will take your Java code and run it as a service on hundreds of their servers all over the world. For processing-intensive, lag-sensitive applications like video games, this must be a huge win.

  • Published in 2010 by Erik Nygren, Ramesh K. Sitaraman, and Jennifer Sun
  • See Andy Gross (Akamai alum and creator of Riak) present this paper at Papers We Love SF #2.
  • Read the PDF

Improving SSL Warnings: Comprehension and Adherence

This paper revealed to me how much of computer security can be a user interface problem.

Through iteration and user testing, the Google Chrome security team was able to decrease their invalid SSL warning page click-through rate by 30%.

In the beginning, the paper also revealed some disheartening beliefs users have about security, like:

“People believed the opposite was true: that SSL warnings could be ignored because banking websites have good security practices.”

“People thought that because they were using a Mac, nothing bad could happen to them.”

The paper is very accessible and well written and underscores how bringing fields like HCI and Security together with empirical glue can be a huge win for users everywhere.

  • Published in 2015 by Adrienne Porter Felt, Alex Ainslie, Robert W. Reeder, Sunny Consolvo, Somas Thyagaraja, Alan Bettes, Helen Harris, Jeff Grimes
  • Read the PDF

Orleans: Distributed Virtual Actors for Programmability and Scalability

I read this because I was going to my first PWL meetup and wanted to be prepared. Caitie McCaffrey’s talk on the subject was great, and there were a bunch of people in the audience who worked on distributed systems and asked thought-provoking questions after the talk.

Orleans’s biggest contribution is the concept of virtual actors. The actor model concept allows you to encapsulate state across a distributed system, but managing each actor is difficult because you have to make sure the actor is running on a certain physical machine and babysit it.

Orleans and virtual actors abstract the tie between actor and host machine away from the application writer. An instantiated class (called an activation in the paper) can be running anywhere, or even nowhere, if it’s not being called. This new model allowed the team to scale Halo’s services linearly - handling up to 600,000 requests per second - while reducing programming complexity by not having to deal with physical machines. I definitely recommend reading this paper.
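To make the "anywhere, or even nowhere" idea concrete, here is a tiny single-process sketch. It is not the Orleans API (Orleans is a .NET framework); `Runtime`, `Counter`, and every method name are hypothetical, and a plain dict stands in for both the cluster and persistent storage.

```python
class Counter:
    """A hypothetical actor type: private state, changed only through calls."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


class Runtime:
    """Toy stand-in for a virtual-actor runtime."""

    def __init__(self):
        self._activations = {}   # actors currently in memory
        self._storage = {}       # saved state of deactivated actors

    def call(self, actor_cls, key, method, *args):
        ident = (actor_cls.__name__, key)
        actor = self._activations.get(ident)
        if actor is None:
            # The caller never creates or places the actor: it is activated
            # on demand and any saved state is restored.
            actor = actor_cls(key)
            actor.__dict__.update(self._storage.get(ident, {}))
            self._activations[ident] = actor
        return getattr(actor, method)(*args)

    def deactivate_all(self):
        # Idle actors can be dropped from memory; their state is kept.
        for ident, actor in self._activations.items():
            self._storage[ident] = dict(vars(actor))
        self._activations.clear()

    def active_count(self):
        return len(self._activations)
```

The caller only ever writes something like `rt.call(Counter, "player-1", "increment")`. Whether that actor was already in memory, or had been deactivated and needed to be brought back, is the runtime's problem, and that is the part I understood virtual actors to be hiding.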