Update (2/2/22): I've actually joined this class as one of the TAs for the Spring 2022 semester! I'm pretty excited to help other OMSCS students learn about distributed systems and push my understanding of the material even deeper. One thing I should point out about this review is that it only reflects my experience and perspective; my suggestions (such as the "Suggested Background Knowledge/Prereqs" section) are based on things that I found to be helpful personally, but none of them are officially backed by the course or the program. With that being said, I hope you find the review useful!
Phew! It's been a few weeks since the fall semester ended, and I've finally had enough time to wind down and reflect on it. This past semester I took CS 7210, Georgia Tech's Distributed Computing course. Although this was only the 2nd time that this course has been offered online, it has already garnered a reputation in the OMSCS community for being extremely difficult.
In fact, at the time of writing this, CS 7210 has 31 reviews on OMSCentral that rank it as the most difficult class with the highest workload, and I think that matches my own experience. This was my 7th course in the OMSCS program (I've previously completed GIOS, CN, HPCA, IIS, AOS, and ML4T in that order), and it was definitely the most challenging class that I've taken. But you shouldn't take that as a criticism of the class - as challenging as it is, you'll learn a tremendous amount and grow immensely from the experience. If your goal is to learn, you'll definitely satisfy your appetite (and then some) with this class.
I should also say that the class doesn't try to be unnecessarily difficult. From my perspective, the class is difficult because the subject matter itself is difficult. Understanding and building distributed systems requires you to adopt an entirely new mental model that's much looser than the traditional model that you may be used to. Things that you might've taken for granted are no longer guaranteed, and issues that you might've never considered before are now central problems that you constantly have to stay on top of. As a result, I think this class is as difficult as it needs to be for students to really develop a strong understanding of distributed systems.
Areas to Improve
I wouldn't say the class is perfect either. Since this was just the 2nd offering of this course to the OMSCS program, there are still a few minor rough edges to it. However, I can tell that the teaching staff are actively working to improve the experience. For example, they've restructured the schedule to better allocate time among the various deadlines, and the teaching assistants have gone above and beyond to help students along the way (such as by organizing introductory walkthroughs of the last two projects). These efforts were tremendously helpful and appreciated by the students. My only suggestion would be to redo some of the lecture videos - I felt that some of them were slightly lacking in clarity and could be improved.
CS 7210 is structured as a scheduled series of lecture videos and readings, weekly office hour meetings (which are also recorded for asynchronous viewing), and a set of projects sprinkled throughout the semester.
Lectures & Readings
The lectures and readings are effectively organized into two parts. The first half of the semester establishes the foundations of distributed systems and their core challenges (e.g. RPC, time, ordering, state, consensus, and fault tolerance). This sets the scene for the second half of the course, which surveys the different applications of distributed systems and dives into the modern developments and research in each of those areas (with a large focus on the recent massive growth in datacenter and cloud-scale work). Essentially, the concepts that you learn about in the first half of the semester come alive in the research that you study in the second half.
You'll then exercise these core concepts yourself through a set of projects. These projects actually come from Ellis Michael's DSLabs, which is a Java framework and collection of labs developed at the University of Washington for teaching distributed systems.
DSLabs gives you a framework to implement logical nodes that can be composed together to form a distributed system. These logical nodes have a very simple interface - at their core, they can fire events using timers and interoperate with each other via message-passing. The framework also allows you to control the distributed system as a whole, such as defining network partitions or the probability of message drops. These mechanisms will be the base upon which you will build and test your distributed systems.
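To make the "timers plus message-passing" idea concrete, here's a rough toy simulation of a client that retries a request until it gets a reply, running on a network that drops messages with some probability. To be clear, this is not the DSLabs API: every name here (`ToyNetworkDemo`, `Event`, `send`, `setTimer`, the 1-tick and 3-tick delays) is something I made up for illustration, and the real framework is far more capable.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

/** Toy event-driven simulation. All names are hypothetical; this is NOT the DSLabs API. */
class ToyNetworkDemo {
    /** One scheduled event: a message delivery or a timer firing at a virtual time. */
    static final class Event {
        final long time; final long seq; final String to; final String kind;
        Event(long time, long seq, String to, String kind) {
            this.time = time; this.seq = seq; this.to = to; this.kind = kind;
        }
    }

    // Events are processed in virtual-time order; seq breaks ties deterministically.
    private final PriorityQueue<Event> queue = new PriorityQueue<>(
        Comparator.<Event>comparingLong(e -> e.time).thenComparingLong(e -> e.seq));
    private final Random rng;
    private final double dropProbability;
    private long now = 0, nextSeq = 0;
    int sends = 0;            // requests the client has sent, retries included
    int serverReceived = 0;   // requests that actually reached the server
    boolean clientDone = false;

    ToyNetworkDemo(long seed, double dropProbability) {
        this.rng = new Random(seed);
        this.dropProbability = dropProbability;
    }

    /** Messages take 1 tick to arrive and may be dropped by the network. */
    private void send(String to, String kind) {
        if (rng.nextDouble() < dropProbability) return; // simulated message loss
        queue.add(new Event(now + 1, nextSeq++, to, kind));
    }

    /** Timers are local to a node, so they are never dropped; they fire after 3 ticks. */
    private void setTimer(String to) {
        queue.add(new Event(now + 3, nextSeq++, to, "TIMER"));
    }

    private void clientSendRequest() {
        sends++;
        send("server", "REQUEST");
        setTimer("client");
    }

    private void handle(Event e) {
        if (e.to.equals("server")) {             // server: answer every request it sees
            serverReceived++;
            send("client", "REPLY");
        } else if (e.kind.equals("REPLY")) {     // client: request acknowledged
            clientDone = true;
        } else if (!clientDone) {                // client: retry timer fired with no reply yet
            clientSendRequest();
        }
    }

    /** Runs until no events remain or maxEvents were processed; returns total sends. */
    int run(int maxEvents) {
        clientSendRequest();
        for (int i = 0; i < maxEvents && !queue.isEmpty(); i++) {
            Event e = queue.poll();
            now = e.time;
            handle(e);
        }
        return sends;
    }

    public static void main(String[] args) {
        ToyNetworkDemo lossy = new ToyNetworkDemo(7L, 0.5);
        int sends = lossy.run(10_000);
        System.out.println("sends=" + sends + " serverReceived=" + lossy.serverReceived);
    }
}
```

Even in this tiny example, the server can see the same request more than once (when a reply is lost and the client retries), which is exactly the kind of thing the labs force you to handle.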
On top of this framework, DSLabs also provides 5 labs/projects. I won't bother describing each project because they can be found on the DSLabs GitHub, but I will say that (in my own experience) the difficulty of each lab increases exponentially. You can probably finish the first lab in an evening, but you'll definitely need all the time you can get (3 or 4 weeks) to finish the later labs. I think it's also worth mentioning that since these projects are externally sourced, there's not much coupling between the labs and the lecture material, so don't expect to be able to complete the labs from just watching the lectures - you'll need to do a decent amount of further research (for example, to understand a protocol that you'll need to implement or to find design inspiration). Just like everyone else, I recommend starting each lab as soon as it becomes available (note: "starting" doesn't necessarily mean writing code right away - it could be as simple as thinking about the design or doing a first pass of a relevant research paper).
These labs are the primary reason this class is so challenging, and this is because they perform model checking on your implementation. Instead of just running your code a few times to see if anything happens to break, DSLabs will also perform "search tests", which are comprehensive searches on the state space of your distributed system. This exhaustive search is possible because of how much control the framework has over the simulated environment of your distributed system. This means that if anything can go wrong with your implementation, then DSLabs will find it (even if it's a very-rare-almost-impossible edge case). Ultimately, this means that your implementations must be ✨flawless✨.
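If "searching the state space" sounds abstract, here's a heavily simplified sketch of the idea. It is not how DSLabs implements its search tests; it's a toy I wrote for illustration, and the scenario and names (`ToySearchTest`, `keepMax`, two replicas receiving the writes 1 and 2) are all my own assumptions. The search tries every possible delivery order of four in-flight messages and checks that both replicas end up with the same value.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy exhaustive search over message delivery orders. Hypothetical, NOT DSLabs code. */
class ToySearchTest {
    /** Delivers msg = {replicaIndex, value} and returns the resulting replica state. */
    static int[] apply(int[] replicas, int[] msg, boolean keepMax) {
        int[] next = replicas.clone();
        // Naive replica: the last write to arrive wins. "Fixed" replica: keep the largest value.
        next[msg[0]] = keepMax ? Math.max(next[msg[0]], msg[1]) : msg[1];
        return next;
    }

    /** Depth-first search over every delivery order; returns a violating trace or null. */
    static List<String> search(int[] replicas, List<int[]> pending,
                               List<String> trace, boolean keepMax) {
        if (pending.isEmpty()) {
            // Invariant: once everything is delivered, the replicas must agree.
            return replicas[0] == replicas[1] ? null : new ArrayList<>(trace);
        }
        for (int i = 0; i < pending.size(); i++) {
            int[] msg = pending.get(i);
            List<int[]> rest = new ArrayList<>(pending);
            rest.remove(i); // the network delivers message i next
            trace.add("deliver " + msg[1] + " to replica " + msg[0]);
            List<String> bad = search(apply(replicas, msg, keepMax), rest, trace, keepMax);
            trace.remove(trace.size() - 1);
            if (bad != null) return bad;
        }
        return null;
    }

    /** Both replicas start at 0 and are each sent the writes 1 and 2. */
    static List<String> check(boolean keepMax) {
        List<int[]> pending = List.of(
            new int[]{0, 1}, new int[]{0, 2}, new int[]{1, 1}, new int[]{1, 2});
        return search(new int[]{0, 0}, pending, new ArrayList<>(), keepMax);
    }

    public static void main(String[] args) {
        System.out.println("naive replica trace: " + check(false));
        System.out.println("keep-max replica trace: " + check(true));
    }
}
```

The naive replica passes in some orderings and fails in others, so a handful of random runs could easily miss the bug, while the exhaustive search returns a trace that reproduces it. A real state space also includes dropped messages, duplicated messages, and timers, so it's vastly larger than this.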
On the bright side, DSLabs provides some nice tools to help you debug, like returning a trace when it finds an invalid state or supplying a visual debugger to help you explore the state space yourself.
As frustrating as these labs are, they accurately reflect what's demanded of real distributed systems. In my own experience, this is especially true at cloud-scale where anything and everything can go wrong, Murphy's Law is working overtime, and things will break... all... the... time.
If you're concerned about grades, I don't think you have to worry very much because the curve was pretty lenient. I won't list any hard numbers because it'll probably fluctuate, but getting a B shouldn't be too difficult, and getting an A is very doable. However, since this curve is applied at the end, it might be hard to gauge how you're doing throughout the semester.
Suggested Background Knowledge/Prereqs
For the labs, you should probably be somewhat familiar with Java (or have experience with similar object-oriented languages and be able to pick up Java quickly). This is because when the projects suddenly begin to pick up in difficulty (again, I'd say each project gets exponentially harder), you don't want to waste any time struggling with the basics of the language. To point out some specifics, you should probably be vaguely familiar with common collections and containers, reflection, Java's object model & garbage collection, and how to use a Java IDE (like IntelliJ) and its various tools (like the debugger). Knowing these ahead of time or being able to quickly pick them up isn't just for writing your code - it also equips you to understand and tinker with the actual DSLabs framework itself. Being able to do this will allow you to gain valuable insights about the design and implementation of your system that might be much harder to obtain otherwise.
I'd also suggest taking CS 6200 (GIOS) & CS 6210 (AOS) before taking this class. This is for multiple reasons:
- Their workloads, difficulty, and time commitments will help acclimate you to the demands of this class.
- They'll teach you how to reason about concurrent or parallel code, event handling, and message passing.
  - GIOS teaches you this through its multithreading projects.
  - AOS also teaches you this with most of its projects (especially the gRPC one).
  - This skill is absolutely necessary for these projects, because the labs then pile on additional complications like message loss, reordering, and duplication.
- Also, if you're like I was and aren't super comfortable reading CS research papers, AOS will help you develop those skills.
  - Both AOS and this class will require you to read a large body of research papers, so developing those skills in AOS will be very helpful when you take this class.
  - Once you're comfortable with quickly reading a bunch of operating system papers and being able to understand the main points, you shouldn't have a problem doing the same thing for distributed systems.
- There's also a little bit of overlap between the lesson materials in this class and AOS. For the most part, AOS will present them at an introductory level whereas this class will explore them more in-depth. Some of these overlapping topics include:
  - IPC (especially via Message Passing)
  - Logical Clocks and Ordering
  - Distributed Hash Tables
  - Fault Tolerance, Undo/Redo Logs, and Transactions
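To give a taste of one of those overlapping topics, here's a minimal sketch of a Lamport logical clock. The class and method names are my own choices for illustration (they don't come from the course materials or DSLabs); the update rules are the standard ones: increment on every local event or send, and on receive take the larger of the local and received timestamps, plus one.

```java
/** Minimal Lamport logical clock. Names are illustrative, not from any course code. */
class LamportClock {
    private long time = 0;

    /** Call for a local event or just before sending; returns the new timestamp. */
    synchronized long tick() {
        return ++time;
    }

    /** Call when a message stamped with `received` arrives; returns the new timestamp. */
    synchronized long onReceive(long received) {
        time = Math.max(time, received) + 1;
        return time;
    }

    /** Current timestamp, without advancing the clock. */
    synchronized long now() {
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long sent = a.tick();              // A stamps an outgoing message
        long received = b.onReceive(sent); // B receives it and jumps ahead of the stamp
        System.out.println("sent=" + sent + " received=" + received);
    }
}
```

The point is that a receive is always stamped later than the matching send, so causally related events get ordered without any synchronized physical clocks.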
Overall, I thought this was a great class. It was definitely the most difficult one that I've taken, but I grew and learned a lot from it. If you are interested in learning about Distributed Systems, want to grow as a computer science student and software engineer, and are prepared to challenge yourself (and let pain be your teacher), then I'd highly recommend this course.