Recently I came across a few performance experts who turned upside down what little I knew about performance. I used to think this stuff was boring and low-level, better left to crazy assembler gurus who think microbenchmarking in the middle of the night is pure fun. Turns out I was wrong on many levels, and I want to share the most important things I’ve learned so far.
The subject turned out to be surprisingly exciting and at the same time quite challenging, so I’ll keep writing about this as my learning progresses.
Myth #1. Performance is a trade-off between clean design and low-level ugliness
I’ve always thought of “highly optimized C#” as impossible-to-understand IL tinkering. According to Martin Thompson, that approach might be appropriate for maybe 5% of the code we write. Most of the time we should simply rely on clean design rules such as the single responsibility principle, high cohesion, keeping our functions small, etc. What we really should invest in understanding, though, are data structures and algorithms: their characteristics and how they are implemented. Understanding how their choice impacts performance is the single most useful thing we can learn. Martin talks a lot about Mechanical Sympathy, and it turns out that it’s possible to explain why clean code is faster by analyzing how modern computers work, especially CPUs and GCs. This knowledge is also very useful when deciding which data structures to use.
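A classic mechanical-sympathy illustration (my own sketch, not one of Martin’s examples) is summing a two-dimensional array row by row versus column by column. The work is identical, but the row-wise loop walks memory sequentially, which the CPU caches and prefetchers reward:

```csharp
using System;
using System.Diagnostics;

class CacheTraversal
{
    const int N = 4096;
    static readonly int[,] Matrix = new int[N, N];

    static void Main()
    {
        Console.WriteLine($"Row-major:    {Measure(SumRowMajor)} ms");
        Console.WriteLine($"Column-major: {Measure(SumColumnMajor)} ms");
    }

    // Walks memory sequentially - friendly to CPU caches and prefetchers.
    static long SumRowMajor()
    {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += Matrix[row, col];
        return sum;
    }

    // Same work, but each read jumps a whole row ahead - typically much slower.
    static long SumColumnMajor()
    {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += Matrix[row, col];
        return sum;
    }

    static long Measure(Func<long> body)
    {
        body(); // warm-up so the JIT has already compiled the method
        var sw = Stopwatch.StartNew();
        body();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```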
A slightly different perspective came from Gil Tene, who made me realize how intelligent modern compilers are, especially JIT compilers (which we have both on the JVM and in .NET). Knowing how advanced compilers are, Gil advises not to waste our time trying to compete with them. It’s much better to focus on making our code clean, readable and easy to maintain, because the compiler will take care of optimizing it anyway (e.g. it’ll remove redundant variables, inline functions, etc.). Most of the simple “optimizations” turn out to be what the compiler would do anyway, so it doesn’t make sense to sacrifice readability for an imaginary “performance improvement”. It might seem otherwise, though, because in some settings (e.g. in Visual Studio by default) JIT optimizations are disabled in debug builds. Of course, it’s still useful to understand how all this works under the hood, because then we can recognize which of our pseudo-optimizations are a clear waste of time. Besides, sometimes we might hit subtle issues when the compiler optimizes our code too aggressively.
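To make the “don’t compete with the compiler” point concrete, here is a small sketch of my own: in a Release build the JIT will typically generate essentially the same code for both versions, so the “hand-optimized” one buys nothing except worse readability.

```csharp
static class Averages
{
    // Readable version: the temporary local is "redundant",
    // but a Release-mode JIT will typically erase it entirely.
    public static double AverageReadable(double[] values)
    {
        double sum = 0;
        for (int i = 0; i < values.Length; i++)
        {
            double current = values[i];
            sum += current;
        }
        return sum / values.Length;
    }

    // "Hand-optimized" version: no temporary, same generated code in practice.
    public static double AverageTerse(double[] values)
    {
        double sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum / values.Length;
    }
}
```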
Myth #2. Performance optimization is a combination of crazy tricks
Both Martin and Gil talk a lot about “behaving nicely”: being predictable, following best practices and understanding the assumptions underlying the platforms and tools we’re using. Lots of super-smart people work on making our tools intelligent enough to do all the hard low-level work for us, so if we just understand how we’re supposed to use them, we should be all right most of the time. The crazy tricks might sometimes be useful, but unless we work on a real cutting-edge system, we probably won’t need them often.
Again, a lot of those assumptions are related to clean design and simple fundamentals, for example:
– Method inlining – the .NET JIT compiler will inline our function if it’s below 32 bytes of IL (keep functions short!) and not virtual (be careful with inheritance!)* – see the sketch below this list;
– Garbage collection – in a “well-behaved” application your objects either die young (generations 0 and 1) or live “forever” (generation 2). Contrary to what many developers believe, you can have lots of GCs and still have great performance – just make sure they’re not full collections (generation 2).
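As a rough sketch of what the inlining rule means in practice (my own example; the exact heuristics vary between runtime versions):

```csharp
using System.Runtime.CompilerServices;

class Prices
{
    // Tiny and non-virtual: a good inlining candidate for the JIT.
    static decimal ApplyTax(decimal net) => net * 1.23m;

    // Virtual methods are normally not inlined, because the actual
    // target isn't known until runtime.
    public virtual decimal ApplyDiscount(decimal net) => net * 0.9m;

    // We can ask the JIT to relax the size limit, but it's a hint, not a command.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static decimal ApplyTaxHinted(decimal net) => net * 1.23m;
}
```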
So before I get to those crazy tricks that I can talk about over a beer to impress other devs, I’ll spend more time learning about .NET, especially garbage collection and JIT compilation. I’ll pay special attention to which design choices were made and why, what alternatives were dismissed, how all that impacts me, and what expectations it sets for my applications (in other words, what a “well-behaved” application really means).
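One cheap way to start building that understanding (a sketch of my own, using only the standard GC API) is to watch how many collections each generation actually sees while the application runs. Lots of gen 0/1 collections are usually fine; a steadily growing gen 2 count is what deserves attention.

```csharp
using System;

static class GcStats
{
    // Prints collection counts per generation; gen 2 ("full") collections
    // are the expensive ones to keep an eye on.
    public static void Report()
    {
        Console.WriteLine($"Gen 0: {GC.CollectionCount(0)}");
        Console.WriteLine($"Gen 1: {GC.CollectionCount(1)}");
        Console.WriteLine($"Gen 2: {GC.CollectionCount(2)}");
        Console.WriteLine($"Heap size: {GC.GetTotalMemory(forceFullCollection: false)} bytes");
    }
}
```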
Myth #3. More threads = faster
I was really amazed when I first learned that all of LMAX’s business logic is executed in a single thread. Michael Barker and Trisha Gee showed one of the reasons why it was designed this way. I didn’t realize that the cost of locks is that high, and even though the specific numbers (which Michael noted were suspiciously impressive) might be the result of some bug on macOS, it left me with a lot of food for thought. Maybe the differences are not that dramatic once the bug is fixed, but it’s still not at all obvious that adding threads will make my application faster. What is certain, though, is that it significantly complicates the design, increases maintenance cost and can result in more bugs.
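To get a feel for why single-threaded designs can win, here is a minimal sketch (my own, not the LMAX benchmark) comparing a plain increment, an Interlocked increment and a lock-protected increment. Even without any contention, the synchronized versions are typically many times slower on ordinary hardware.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class LockCost
{
    const long Iterations = 100_000_000;
    static readonly object Gate = new object();
    static long _counter;

    static void Main()
    {
        Console.WriteLine($"Plain increment:       {Measure(Plain)} ms");
        Console.WriteLine($"Interlocked increment: {Measure(WithInterlocked)} ms");
        Console.WriteLine($"Lock + increment:      {Measure(WithLock)} ms");
    }

    static void Plain()
    {
        for (long i = 0; i < Iterations; i++) _counter++;
    }

    static void WithInterlocked()
    {
        for (long i = 0; i < Iterations; i++) Interlocked.Increment(ref _counter);
    }

    static void WithLock()
    {
        for (long i = 0; i < Iterations; i++)
        {
            lock (Gate) { _counter++; }
        }
    }

    static long Measure(Action body)
    {
        _counter = 0;
        var sw = Stopwatch.StartNew();
        body();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```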
Right now we talk a lot about parallelization, using multiple cores to the maximum, microservices… The hype is high and it’s very natural to pay attention to new, shiny silver bullets. However, it’s useful to keep in mind that most of the things done in really cutting-edge systems are way more than we need in the good old BLOBAs. Chances are we don’t have to be “as good as Netflix”, simply because our system doesn’t need it. We can’t forget that increasing the complexity of our solution significantly increases the cost of maintenance, training people, etc.
While most of the world started talking about how much we really, ReAlLy, REALLY need multi-threaded systems to survive in the current market, the amazing team at LMAX took a completely unfancy approach – they focused on their specific requirements, measurements, architecture and data structures, used known patterns in a fresh way, and came up with an amazing result. They process 6 million transactions per second on a single thread. Not sure about you, but that seems more than enough for the systems I work on at the moment.
Myth #4. First get your application working, then uglify optimize
In my experience performance is treated as an afterthought. We only talk about it when something bad happens, but not many teams really monitor it on a regular basis or are even aware of what is needed. Sometimes we have requirements specifying that the system should handle a given number of requests per second under a given load, and that’s the best I’ve seen so far. But as Gil Tene has said many times, even when we have some “requirements”, very often they are not precise enough, or we fall prey to common misconceptions regarding measurement, or we simply ignore “all the rest” that is not part of our current requirement (such as the 99th percentile).
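As a tiny illustration of why “all the rest” matters, here is a sketch of my own that reports a few percentiles from recorded latencies instead of just the average; the average can look great while the 99th percentile is awful.

```csharp
using System;
using System.Linq;

static class LatencyReport
{
    // Naive percentile over a sorted copy - fine for an offline report,
    // not for a hot path (an HdrHistogram-style recorder fits better there).
    public static double Percentile(double[] latenciesMs, double p)
    {
        var sorted = latenciesMs.OrderBy(x => x).ToArray();
        int index = (int)Math.Ceiling(p / 100.0 * sorted.Length) - 1;
        return sorted[Math.Max(index, 0)];
    }

    public static void Print(double[] latenciesMs)
    {
        Console.WriteLine($"avg:   {latenciesMs.Average():F1} ms");
        Console.WriteLine($"p50:   {Percentile(latenciesMs, 50):F1} ms");
        Console.WriteLine($"p99:   {Percentile(latenciesMs, 99):F1} ms");
        Console.WriteLine($"p99.9: {Percentile(latenciesMs, 99.9):F1} ms");
    }
}
```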
All performance experts agree that we have to start the design with a clear idea of what we need. It’s virtually impossible to design an application without precise expectations regarding the worst-case scenarios and unacceptable behaviours, and at least some rough estimates of the important numbers (e.g. how many users we might expect, what their expectations are, what would happen if performance were bad for a few people while excellent most of the time, which metrics matter for us and how we will collect them). It’s also useful to determine at the very beginning what the real limits are and start formulating requirements from there (e.g. network latency, performance of other applications we integrate with, industry “best examples”).
Gil noted in his presentation that without understanding what we really need, we use tools in inappropriate ways and waste time optimizing things that are already good enough, or try to optimize them in a completely inappropriate way. Sometimes the good solution might be completely counterintuitive, such as deliberately sacrificing performance in some places in order to guarantee that our system meets its SLAs even in “exceptionally bad” moments. What I took from that is that performance is not a binary quality – our system’s performance is not simply good or bad. Moreover, there are no silver bullets and no simple answers: what is good for one system and one team doesn’t have to be good for another. In most cases there’s no single way to improve performance; there are plenty of them, each with its own advantages and problems. I need to understand their trade-offs and verify why my system doesn’t perform well. Maybe the problem is in the architecture, maybe in a data structure, maybe in a language construct, maybe an array doesn’t fit in the cache, maybe I need to tweak GC settings, introduce pooling for big objects or upgrade the hardware. Very often I can solve the problem in multiple ways, each of which lets me achieve my goal but comes with different trade-offs.
All of this sounds very complicated, and I think it really is in the beginning – just like everything in programming. There’s a lot of conflicting advice, experts disagree or find problems in standard approaches, and there’s a lot of jargon and references to concepts I don’t understand very well. Last but not least, every system is different and I can’t simply blindly follow somebody else’s solution without first understanding the problem. What is easy is to grab a random tool and follow its “Getting started” tutorial without spending much time trying to understand what I’m actually doing and whether my optimizations are even required. So for me the lesson is that it’s important to know your application and have data regarding expectations and requirements, but also to constantly collect information about performance in production and in test environments. That sounds like a lot of work, but without it the best I can do is count on luck.
I really recommend looking at the slides from Gil’s presentation – he explains the issues with performance requirements (slides 15–29) and discusses common problems with measurement tools (slides 30–46).
Myth #5. Microbenchmarks are the most basic tool
I thought that the first thing I should learn about performance was microbenchmarks: how to create them correctly and when they are really useful. It turns out they’re not that important. This article concerns Java, but it summarizes nicely what I’ve read in a few different places.
It seems that although microbenchmarks might be valuable from time to time, it’s far less often than I expected. They are prone to errors due to JIT optimizations (which will most likely differ between production and test environments), hardware differences, unrepresentative inputs, etc. It’s still useful to understand how to construct a meaningful benchmark, but they should definitely be used with caution, and only when we’re really sure we need them and nothing else will do.
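To show what can go wrong, here is a deliberately naive sketch of my own: without a warm-up the first run measures JIT compilation, and because the result is never used, the compiler may throw part of the work away – which is exactly why a dedicated harness (BenchmarkDotNet in the .NET world) is the safer choice when a microbenchmark really is needed.

```csharp
using System;
using System.Diagnostics;

class NaiveBenchmark
{
    static void Main()
    {
        // Pitfall 1: no warm-up, so this run includes JIT compilation time.
        // Pitfall 2: the result is discarded, so dead-code elimination may remove work.
        // Pitfall 3: a single run tells us nothing about variance.
        var sw = Stopwatch.StartNew();
        SumOfSquares(1_000_000);
        sw.Stop();
        Console.WriteLine($"Took {sw.ElapsedMilliseconds} ms (do not trust this number)");
    }

    static long SumOfSquares(int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long)i * i;
        return sum;
    }
}
```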
* Note that there are more inlining rules than I mentioned, but these are the ones most obviously related to clean code recommendations.