I am happy to see Facebook release Tornado, their Python based web-server that was developed to run the FriendFeed service. They have done their best to optimize the stack within the Python process. However, I'm inclined to believe that it is more of the same problem from which many HTTP toolkits suffer.
Since Tornado is written in Python it is typically restricted[1] by the Global Interpreter Lock. The GIL, for short, is a technique often used by dynamic languages such as Python and Ruby 1.9 to simplify development of the languages runtime as well as providing certain features to be implemented safely in the language. For example, it is the GIL that makes multi-swap features atomic in Python (a, b = b, a). The question is "How do we scale Tornado"? The answer is the same as most GIL based systems. You scale using processes, typically, equal to the number of concurrent threads your system can execute.
Update: The previous paragraph reads differently than I intended. Tornado doesn't really suffer from the GIL as its single threaded. What I meant to get across that is if you needed to scale to all the cores of your system you need either threads or processes. Future paragraphs explain this in better detail.
Scaling in this model, however, presents a few challenges. Each process ends up requiring connections to your database and cache tiers. So now instead of N connections you will ultimately grow to N*N_PROCESSES connections. Scaling the number of cores will inversely effect your data tier performance which is typically your bottleneck to begin with.
I think the need to design the system in this direction is ultimately due to the real web-servers letting us down. Tornado handling data in an event-based fashion should not be news in general. Most web-servers have been doing this for a while. Cherokee[2] does it quite well, Apache's mpm-event does it, and Python Twisted's reactor has done it for years. However, all of these systems let your infrastructure developers down in a critical component, the module API.
Once you have a socket accepted you need to do something with it. Web-servers tend to differentiate themselves in this area but I don't think they have gotten it right. Apache provides different models such as MPM which is a combination of worker processes and threads. This was done primarily to work around inefficiencies in massive numbers of threads per process. "Why would you need a massive number of threads" you might ask? It's due to the design-flaw that the module API's are designed around executing synchronously. After the bottom-half (the socket handler) accepts a new request, it is passed to a handler. The handler calls the configured modules to process the request and blocks while this happens. Ultimately, this means you are wasting a thread regardless of your application having the resources to immediately process the request. When your system cannot keep up with requests you create a massive number of threads/processes up to your configured bounds which slows down the threads that actually are able to process a request. Obviously this exacerbates the problem.
What needs to happen is a module API that supports a fully asynchronous and event-based processing of requests from the socket handling bottom-half through the application-based upper-half. A primary example of where this can be useful is exemplified by "partition-based" data such as most user-content systems. If you accept too many requests on a single server for a data partition you will slow down the average response time per request. This is due to the contention you create on your database and/or cache connections. So while you may want to accept the request, you may not want to start processing it until a number of other requests in that partition range are completed.
Once we have switched to a server that can work in this model we will run into the problem of how to handle all those cores efficiently. Thankfully, Apache styled MPM models wont be necessary anymore. If applications can be processed in an asynchronous model then they will be able to yield the thread on IO and other shared resources that are not yet available. Doing so will mean the optimum number of threads will be equal to the number of cores you have to process.
For this to work optimally with language runtimes dependent on a GIL we need to use a little-known technique to have multiple runtimes within a single process. If you have multiple copies of the runtimes shared library each subsequent call to dlopen() will result in new memory zones for the loaded library. In such cases, you can even have different versions of the same runtime loaded within one process (python2.3, 2.4, 2.5, etc). We can load a runtime per core so that each worker has it's own runtime with the GIL disabled (since its single threaded). As I'm sure you noticed this is an implementation detail.
A few things are preventing this from becoming an immediate reality. There needs to be a standard for languages to be able to integrate into the web-server so that shared resources can be stored outside of each runtime. For example, one of our python workers might yield on getting a database connection from the servers connection pool. Now the worker can pause that request until the connection is available and process other requests. This should feel natural to those who have worked in co-routine based systems.
One of the problems that interests me is getting the request from the bottom-half to workers in the upper-half as efficient as possible. Typically this is done with a thread pool. Most thread pools are implemented with a single Queue and various worker threads that pop their future work item off the queue. What research has shown is that this creates a considerable amount of lock-contention as you scale the number of cores. For each additional worker you create a new contender for the lock to retrieve an item from the queue. Nir Shavit et al pioneered a concept called work-stealing which greatly reduces this problem. It consists of a queue per worker and distributing work items in a round-robin fashion among the workers. To prevent worker-starvation, when a thread runs out of work-items, it will look to its neighbors to try to steal an item off their queue. Some fancy footwork is performed to reduce the contention by stealing from the opposite side of the queue and implementing the fast path as lock-free. Additionally, you can choose neighbors that share your die on NUMA. I was so interested in this and the challenges of implementing lock-free algorithms in C (they are easier to implement with garbage collection), that I implemented it in Iris[3], my concurrency toolkit for glib.
The primary thing we've learned from web-servers over the recent years is that to handle high-load scenarios you need to do everything you can to reduce resource usage when forward-motion cannot be performed. By continuing to build web and application frameworks within the container of the synchronous module API we cannot hope to get the massive improvements in performance that we all desire. We need to consider the high-performance web-server as part of our application which means writing HTTP container standards.
[1] I imagine that someone will make this work on IronPython or Google's LLVM based Python.
[2] http://www.cherokee-project.org
[3] http://git.dronelabs.com/iris

Comments (15)
You seem to recognize (correctly) that Apache’s mpm idiom is crap, but why do you think that its model of loading multiple language runtimes into one monolithic process is a good idea?
Why does it matter if you use 4 or 8 event-handling processes on one machine, each with their own database connection? If that’s your scaling bottleneck, you’d be running on many machines anyway, so a small constant multiplier doesn’t change the story. Besides, you could use crap like ODBC to have one persistent DB connection per machine.
Well like I mentioned I believe that is an implementation detail. For example, if you have the subprocess you have two fundamental designs. The first being where the master process still manages the client socket (so data is transfered from the worker back to the master). The second being where you pass the client socket (over a UNIX message with send_msg) and allow the worker to flush the buffers and close the socket. The problem with the first is that you increase the amount of wake-ups your event loop needs to do by 2x (since it needs to handle data in and out for the worker) which could increase your handling latency. This is a no-go in some applications. The problem with the second model is you lose the ability to have connections live longer their request (otherwise they are restricted to that workers affinity which will not be evenly balanced).
ODBC is an option (or SQL Relay) but it adds an increased latency without providing the ability to yield on the resource being ready. For example, even with the models I described above to reduce over subscription of resources, your container may not be correct in its assumption that the connection has little contention. So you effectively add latency and reduce correctness in your ability to run efficiently.
But why do the independent processes need to be sharing sockets at all?
If you’re already going to be running the same app on more than one machine, there’s no reasons left to stick to port 80.
There’s a parallel discussion on Hacker News
Okay I’ll continue discussion over there.
The GIL is irrelevant to Tornado because it’s not a multi-threaded server. The documentation very explicitly states that the intended usage is to run a number of Tornado instances equal to the number of cores and to use nginx (or similar) proxy frontend, not least of all to avoid using Tornado for serving static files or streaming content.
So far as complicated web servers to work around the GIL in Python… screw it. Fix Python to not be a piece of crap and use an interpreter architecture designed using any of the neat scalable techniques we’ve had for the last 20 years, or use a better language than Python.
There are dozens of high performance, easy to program, “fun” languages that don’t have the numerous design idiocies that Python does. If your application architecture doesn’t work with Python, do the intelligent thing and use a different language, instead of wasting a ton of time and money working around a problem that you caused yourself by picking an inferior language for the task at hand.
Hi Sean,
I thought I was quite explicit in that tornado’s short comings weren’t due to the GIL.
The discussion about the GIL was simply to exemplify WHY you have to scale with processes and what you lose by doing so. I agree that this is problem in Python, not Tornado.
The point I’m trying to get across is that we need a fully asynchronous and concurrent stack in the web-server that *hosts* the Tornado server.
The GIL is an implementation-detail of CPython & derived implementations. There are also at least 2 or 3 Python-implementations that don’t have a GIL.
People who want to write python code without thinking about the GIL should just use one of those instead of ranting about why CPython was designed as it is 20 years ago…
BTW: IronPython & Jython are the 2 most mature implementations that don’t have a GIL.
Indeed; I used a subtle hint about that in my footnote #1.
You’re certainly on the right track, but there is plenty further to go.
One of the things people do not realize is that the CPU is behind several layers of cache and to get performance, you have to respect that.
For instance scheduling your worker threads FIFO is guaranteed to pick the thread in the pool which has the least chance of having any part of its state in cache any more.
By induction, it follows that this will also give you the lowest cache-benefit because you maximize your footprint.
You want to schedule them in FILO order, so that the thread that just completed a request gets the next one right away, thus concentrating the workload on the least number of necessary threads.
I’ve worked a lot with these issues in the Varnish HTTP-accellerator, and I would recommend you read my “architects notes”
Poul-Henning
That makes a lot of sense Poul-Henning. I’m looking forward to reading your notes
Also, FWIW, my scheduler in Iris does this with respect to the co-routine like idiom called IrisTask.
For example, if you have a task that yields a new task, that new task will be executed immediately to utilize that cache.
I think the database connection issue is a red herring.
In general, you’ll need as many database connections as you have concurrent transactions (most databases only support a single transaction at a time on a connection). Whether you have all those transactions run from a single process or one process per core won’t make much difference.
You might end up with slightly more connections in the multi-process model, but only if the load is balanced badly (and a few extra idle connections probably won’t hurt much).
@James
Sadly, my only experience scaling massively was with sql server and it was a problem. my dba’s told me each connection would do a static allocation on the server side when the connection was initiated.
however, they may have lied to me, which would be unfortunate, heh.
i assume postgresql is better at this.
@cherget: your DBA is correct about the database allocating resources for each connection.
My point is that if you want to run 16 concurrent transactions on a quad core system, you are going to need 16 connections whether it is one process managing those connections or 4 processes managing 4 connections each.
You may have a small amount of waste if the load balancing is wonky, but you’ll be running into exactly the same problems when you scale to multiple physical machines for handling requests.
Okay, that makes sense.
The nice thing about being able to queue the incoming request without tying up a thread is that as soon as txn1 finishes the next request can be resumed to reuse that connection.