Achieving Scalability

Over the last 10 years I’ve seen a change in the way the top engineers/architects design, architect and deploy software. This new generation of computer scientists have their eyes firmly focused on quality attributes such as availability, scalability and fail-over-capabilities; issues that are rarely discussed in most enterprises.

Many companies tell me they have moved to web-scale and deploy to the cloud, however, when I study their architecture in detail I notice that they do not architect the software in a way that allows for scale. For some of them, my warnings come too late when their success catches up to them and their systems are not able to cope with the increased traffic.

In this article, I’ll explain why the way we traditionally architect software systems doesn’t scale and suggest some approaches for how to improve the scaling of your software.

Moore’s Law

Gordon E. Moore, the cofounder of Intel Corporation, observed that the number of transistors in a dense IC doubles approximately every 2 years (later adjusted to 18 months). As software developers we were made to look like heroes by the improvements in hardware. The software we built ran twice as fast a couple of years later.

The software speedup trend no longer holds for most large-scale software systems. Only those that write software in a way that allows you to take advantage of the hardware improvements continue the trend of Moore law. This excellent illustration (not sure exactly what the original source is, it is flagged as sharable in Google Images, so I’m simply linking to one of the places were it exists), show us why we no longer can rely fully on the improvement of hardware.


Notice how we keep adding more transistors (fulfilling Moore’s law). The computers are continuing the trend of becoming more and more powerful. However, the method of improvement has changed. Clock speed is no longer the main source of the improved performance (that trend stopped around 2004). Today, we improve performance by adding more cores.

In addition to the increasing number of cores on individual machines, we also are trending towards computing on a set of machines (mainframes are not quite dead yet, but even those are often constructed using a grid of computing units).
Here is the bad news for us software developers:

Unless you change the way you write code, you will not be able to take advantage of the hardware improvements.

I think Herb Sutter put it best

“The Free Lunch is Over!!!” - Herb Sutter (Dr. Dobb’s Journal, 2005)

Herb Sutter goes on to say:

"Concurrency is the next major revolution in how we write software"

Amdahl’s law

To understand why Mr. Sutter predicted, “the Free Lunch is over”, we have to take a look at what has been named Amdahl’s law (sometimes also called Amdahl’s argument, named after Gene Amdahl and a presentation from 1967!). Amdahl’s law states that the maximum performance increase achievable by adding more CPU’s at a problem is inverse propositional to the percentage of the code that runs sequential.


The graph above (source Wikipedia) shows a rather trivial relationship. If the percentage of your code that has to be run sequential is 5% (0.05), then the maximum speedup of your system when adding more processors is 1/0.05 = 20X. If 25% of your software is sequential, your max performance by adding cores is 4X.

The Amdahl problem in traditional enterprise architecture

Most traditional enterprise architecture runs primarily sequential algorithms. In fact, I would argue that over the last 20+ years we have deliberately been constructing languages, platforms and frameworks that promote sequential programming.

The main reason for us to promote sequential programming has been that it is (arguably) significantly easier on the programmers. We optimized for quality attributes such as readability, maintainability, ease of programming, etc.

In a sequential code, it is very easy to understand the state of the program/algorithm. We know that the list of statements is executed one-after-the-other. If the program counter is on statement 4, we know that statement 3 has been completed and that statement 5 will be executed next (after 4 is complete).
We also constructed various ways of locking resources so that the programmer was guaranteed exclusive access and did not have to worry about interference from other programs. This locking was typically achieved with locking semaphores or using transactions (e.g., SQL transactions).

The ease of programming, we though, led to fewer bugs, easier to maintain code, etc. We were taught not to optimize if it made the code harder to read. The argument was often:

“If it doesn’t run fast enough, we can always throw more hardware at it"

Amdahl’s law shows that this approach simply doesn’t work. At some point (quite an early point with current algorithms), you’ll run into the max potential scale. Perhaps this was not an issue for you. If you are building software that services less than 100 concurrent users, you probably never had to worry about this. However, if you intend for your software to be used by thousands (or perhaps millions) of concurrent users, you eventually have to change the way you construct your software.

In some of our projects, we have to support several million concurrent users/transactions. If we built software the traditional way, we would simply not be able to keep up, no matter how much hardware you put on the problem.

Good for us, there are known ways to construct software that promote parallelism and hence increases our chances to scale. We’ll take a look at a couple of interesting approaches later in this article, but before I discuss some of these approaches, perhaps we should first focus on what kind of synchronization/sequential algorithms we try to avoid.

Resource synchronization

Most enterprise software takes advantage of lightweight threads to ensure some degree of parallelism. This allows a programmer to start several sequential programs at the same time. Unfortunately, these programs rarely achieve orthogonality. In particular, they often share resources. The good news is that we already have invented ways for the multiple programs to collaborate when accessing the resources.

Unfortunately, the most common collaboration is to sequentialize the access to the resource. That is, if more than one parallel task need to read/update the same resource, we ‘queue up’ the tasks and only allow one of the programs to read/update the resource at one point in time. While one task is accessing the resource, the other tasks are suspended waiting for the accessing task to complete.

Sequential to improve readability

Let’s take a look at a typical algorithm in enterprise software:

A compiler will ensure that the two statements above are executed in sequence, however, they could really have been executed in parallel. Most likely, there is no need for the program to wait for the log statement to be complete before opening the database.
If one studies a typical sequential algorithm, there are often lots of opportunities for parallelism. However, the style/rules of how the program is constructed often prevents natural parallelism.

Concurrency and parallelism

I’ve read different books giving completely different description of concurrency and parallelism. Books that focus on low level execution tend to specify the difference as:

Concurrency is when two threads of execution can make independent progress (but does not necessarily execute at the same time). Parallelism is when the two threads of execution make progress at the same time.

If you read a book on design or architecture, you may have read:

Concurrency is when two program incidentally work separately but may have interference when accessing resources. Parallelism occurs when two programs were designed to work at the same time with no interference.

I’m going to focus on parallelism in the design/architecture form. That is, I’ll focus on deliberately designed programs that can run independently.

Actor model

One famous computation model that promotes parallelism is the actor model. The actor model is a mathematical model for concurrent computation, but the ideas have been actualized in several programming environments. Perhaps one of the most famous such environments have been concurrency model of the Erlang programming language.
The actor model is inherently concurrent. From Wikipedia:

An actor is a computational entity that, in response to a message it receives, can concurrently:

  • send a finite number of messages to other actors
  • create a finite number of new actors
  • designate the behavior to be used for the next message it receives.

There is no assumed sequence to the above actions and they could be carried out in parallel.

Perhaps in a later blog article, I’ll go through the actor model in details, however, the linked wiki article is quite good and I’ll refer to it for further details.

There have been many critics of the actor model. The most typical complaint about the actor model is that it is difficult (some say impossible) to compose actors. I actually strongly disagree with this statement, I think the composition is different than what people are used to. I believe when designing with actors, one naturally designs clusters of actors that collaborate on some task and that a careful architect/designer will compose the actors into what I call conversations. However, that is a discussion for another article.

Another criticism (which I do agree with), is that the programming model is more difficult and that only a subset of the programmers out there will master this paradigm. In particular, programmers seem to have problems proving the correctness of their program when some required behavior requires collaboration between a set of actors

Lambda architecture

A quickly emerging architectural pattern is called “Lambda Architecture”. Personally, I find the name quite misleading, but trying to be compliant to the rest of the world, I’ll use the term in this article (The Lambda comes from the idea of using Lambda Calculus, Lambda being a letter in the Greek alphabet that Church used to represent a functional abstraction… A long derivation and in my opinion a suboptimal name).

In a Lambda Architecture, the processing of incoming data is delayed as long as possible. This is achieved by storing non-mutable facts and processing these facts at a later time. The facts are often simple time-series data. Some of the facts have to be processed in real-time, but in most application, a surprisingly small subset of the facts requires real-time attention.

Let me try to illustrate with an example:
Say we are building a banking system for online trading (e.g., PayPal, Google Wallet). We’re bringing together payers and payees. It is important to us to ensure that we know what the account balances are in real time. However, the individual transactions and all details around them (e.g., where was it performed, what did they buy, what browser did they use, ...) can be processed at our convenience (or at least several seconds/minutes/hours or perhaps even days later).

Using the Lambda Architecture, we would separate out the real-time data (the account balances) into a data store with extreme performance (e.g., some distributed memory model) from all other data. When a transaction is performed, the balance is checked and updated based on some real-time data store.

The non-real-time data would be stored as facts in a specialized distributed database. We typically store is the incoming event with a time-stamp and all the available knowledge of the event. It may be as simple as a log file, but it could be more sophisticated using a specialized time-series database (our (SciSpike’s) default architecture uses Cassandra which has proven to be perfect for this purpose). Since the facts are stored and never updated, this storage doesn’t require typical locking and we only strive for eventual consistency (as opposed to real-time consistency).
At some later time, we then compute specialized views from the sequence of facts. The views work on the dataset in total atomicity, we can parallelize the generation of these views with ease (e.g., using separate map-reduce tasks).

Popular programming environment

Today we have several frameworks that promotes parallelism. I’ll focus on a couple of them and discuss why they make it easy achieve parallelism.

Non-blocking and asynchronous programming environments

One popular new(-ish) technology is Node.js. Why is it that Node.js has become popular so quickly? There are many reasons:

  • The popularity of JavaScript
  • The ease of getting started
  • The anti-framework initiatives

Although the above properties of Node.js may be the most commonly sited, I’m going to focus on one aspect that is related to the topic at hand, namely how Node.js promotes parallelism.
For most of you, you may be terrified to hear that Node.js uses a single thread! It’s like we went back to the old Windows environment where there was a single event dispatcher thread that processed all incoming events. I’m sure some of you can remember the hourglass and total lack of responsiveness when someone decided to do something that took time on the event dispatcher’s thread. So how can this improve parallelism?

Node.js achieves its parallelism by using non-blocking libraries. What we mean by that is that whenever we execute one of the functions in the library by using non-blocking asynchronous communication (Node.js runs on the Google V8 engine that implements a complete set of non-blocking functions) .
So, for example, say we want to read some data from a console. In node this would be done this way:

The code here is quite different from most other environment where the supplied I/O functions would suspend the calling thread until the input was collected from the user. In Node.js, we ‘sending a message’ to the console asking it to read some input, then we’re passing on a function that will handle the next steps when the data has been collected. It may very well be that the Google V8 engine uses a separate thread to read the user input from the console, however, as a developer I’ve delegated this problem to the V8 engine.

When programmers use the asynchronous messaging style, the algorithm often is naturally concurrent.
Other languages have introduced similar concepts.

  • Java
    • Introduced non-blocking I/O from version 1.4.
    • JEE 6 had partial support for non-blocking I/O and JEE 7 will have full support.
    • The new concurrency library provides many features for supporting algorithmic parallelism (e.g., futures)
    • Libraries are coming starting to take advantage (e.g., AsyncLogger in Log4J), Vert.x
  • Scala
    • Very strong support for parallelism
    • Akka framework (also available for Java programmers)
  • C#/.NET
  • Python


To achieve scalability you may have to change the way you construct software. A very important change is to maximize parallelism. Other architectural tricks are delayed and orthogonal processing of time-series events (also called Lambda Architecture)

If you have not already done so, it is time to learn and take advantage of the new programming paradigms. It makes a huge difference in your ability to scale and take full advantage of the processors in your deployed system.


Electronics, Volume 38, Number 8, April 19, 1965

No Comments Yet.

Leave a comment