Freesteel Blog » A plan to make Python 2.4 multicore

A plan to make Python 2.4 multicore

Friday, September 20th, 2013 at 11:30 am Written by:

I love Python, but its lack of multi-core support is killing us in the Adaptive Clearing strategy.

We’ve divided the algorithm into 8 separate threads that all work in a pipeline. Some of them can be operated in parallel (for example, the swirly Z-constant clearing strategy within an independent step-down level).

The heavy calculation is done in C++, but the high level toolpath planning, reordering, maintenance and selection from categories of start points is undertaken in Python.

While the C++ functions release the GIL (the Global Interpreter Lock) on entry, so that the Python interpreter can carry on in parallel and perhaps call another concurrent C++ function, this only gets us so far. Our limit is full use of two cores, which on a 4- or 8-core machine is a little embarrassing.
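The release-on-entry pattern can be modelled without Python at all. The following is a toy sketch, not our kernel code: a std::mutex stands in for the GIL, and the names toy_gil, py_total, heavy_sum, extension_call and run_two_calls are all made up for illustration. In real CPython the release and re-acquire would be done with the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>

// Toy stand-in for the GIL: whoever holds it may run "Python-level" code.
static std::mutex toy_gil;

// "Python-level" state, only touched while toy_gil is held.
static std::uint64_t py_total = 0;

// Heavy native computation that touches no interpreter state.
static std::uint64_t heavy_sum(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 1; i <= n; ++i) s += i;
    return s;
}

// Models a C++ extension function that Python calls with the lock held.
static void extension_call(std::uint64_t n) {
    std::unique_lock<std::mutex> gil(toy_gil);  // entered holding the lock
    gil.unlock();                                // release it on entry
    const std::uint64_t s = heavy_sum(n);        // may overlap other threads
    gil.lock();                                  // re-acquire before returning
    py_total += s;                               // back in "Python" territory
}

// Run two extension calls on two threads and return the combined result.
std::uint64_t run_two_calls(std::uint64_t a, std::uint64_t b) {
    {
        std::lock_guard<std::mutex> gil(toy_gil);
        py_total = 0;
    }
    std::thread t1(extension_call, a);
    std::thread t2(extension_call, b);
    t1.join();
    t2.join();
    std::lock_guard<std::mutex> gil(toy_gil);
    return py_total;
}
```

Only the heavy_sum() part can overlap between threads. Everything that touches py_total is serialized by the lock, which is the same reason our Python-level planning code cannot spread across more cores.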

So we have two options, neither of which is good. One is to port our thousands of lines of Python code into C++, which is not going to work because it would be too painful to debug this code yet again in another, much harder language. The second is to make Python multi-core.

Why Python 2.4? It’s what our HSMWorks kernel has compiled in. We haven’t needed to upgrade, and it’s stable. It’s just used like an internal scripting language which, to the outside, doesn’t exist.

The source code for Python 2.4 is here.

Now, how do we make this multicore/able to run in parallel?

The major issue is the GIL, the Global Interpreter Lock.

The source code line for it is here:

static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */

We noticed the pre-processor directive

#ifdef WITH_THREAD

surrounding that block of code. So we disabled it, and compiled Python without any threading support and without the GIL.

Now the plan is we want to create several worker threads and put one isolated Python interpreter into each one, and direct work to it. The Python interpreters would have no interaction between them, so their operations should never clash, and everything would be fine.

Conventionally, one would simply make one Python interpreter per process, and let them communicate through files and sockets, and everything would be good. But processes don’t share memory, so if our functions were, say, interacting with an object representing a model of several million triangles, this object would need to be duplicated in the memory of every process. What a waste.
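To make the memory argument concrete, here is a minimal sketch of several threads reading one shared model in place. Triangle, count_crossing and count_crossing_parallel are hypothetical names invented for this example, and the "model" is just a vector of Z ranges rather than our real data structure.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one triangle of a large model.
struct Triangle {
    float z_min;
    float z_max;
};

// Count triangles in [begin, end) whose Z range contains the level z.
// The model is only read, never modified, so no locking is needed.
static void count_crossing(const std::vector<Triangle>& model,
                           std::size_t begin, std::size_t end,
                           float z, std::size_t& out) {
    std::size_t n = 0;
    for (std::size_t i = begin; i < end; ++i) {
        if (model[i].z_min <= z && z <= model[i].z_max) ++n;
    }
    out = n;
}

// Split the work across `workers` threads that all read the same model.
std::size_t count_crossing_parallel(const std::vector<Triangle>& model,
                                    float z, unsigned workers) {
    if (workers == 0) workers = 1;
    const std::size_t size = model.size();
    const std::size_t chunk = (size + workers - 1) / workers;
    std::vector<std::size_t> partial(workers, 0);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(size, static_cast<std::size_t>(w) * chunk);
        const std::size_t end = std::min(size, begin + chunk);
        // std::cref passes a reference: the model itself is never copied.
        pool.emplace_back(count_crossing, std::cref(model), begin, end, z,
                          std::ref(partial[w]));
    }
    for (std::thread& t : pool) t.join();
    std::size_t total = 0;
    for (std::size_t n : partial) total += n;
    return total;
}
```

Every worker sees the same vector at the same address. With one process per worker, each process would need its own copy of the model.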

So, what is it that prevents us from creating threads running independent Python interpreters in each one, now that we have rid ourselves of the GIL?

Well, it’s all the other Global variables littered throughout the CPython interpreter code-base that have come in as part of the rot. Once you allow one global variable into the system, everyone gets lazy, so they are all over the place.

Here’s a thing. The method for invoking Python from C++ is as follows:

Py_Initialize();
PyRun_SimpleString("print 'hello world'");
Py_Finalize();

These are all global functions, and there’s only one interpreter in the whole of the process they could refer to.

But then there’s this pair of weird functions

PyThreadState* Py_NewInterpreter();
void Py_EndInterpreter(PyThreadState *tstate);

Clearly the PyThreadState is an object that represents the sub-interpreter — of which you can make as many as you like. But how does the function PyRun_SimpleString() know which of these interpreters you want to send your little Python program to?

Turns out there’s a set of global variables here:

static PyInterpreterState *interp_head = NULL;

PyThreadState *_PyThreadState_Current = NULL;
PyThreadFrameGetter _PyThreadState_GetFrame = NULL;

that reference the sub-interpreter PyRun_SimpleString() is going to send its code snippet to. (Corresponding interpreterstates and threadstates seem to be interchangeable because they have links from one to another.)

We compiled under debug and found that the _PyThreadState_Current value is accessed through PyThreadState_Get(), which is called repeatedly throughout the operation of the interpreter. So whatever the purpose of these multiple interpreters was, it’s not even for running things asynchronously. More like shelving an interpreter on the stack, working on something else, and then pulling it back off. Nothing remotely in parallel.
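The shelving behaviour is easier to see in a toy model. This sketch is not CPython’s code: ToyInterp, g_current, toy_swap and toy_run are hypothetical names, with g_current playing the role of _PyThreadState_Current and toy_swap behaving in the spirit of PyThreadState_Swap().

```cpp
#include <string>
#include <vector>

// Toy stand-in for an interpreter state; not CPython's real structure.
struct ToyInterp {
    std::string name;
    std::vector<std::string> log;  // programs this interpreter has "run"
};

// One process-wide pointer, playing the role of _PyThreadState_Current.
static ToyInterp* g_current = nullptr;

// Make `next` current and hand back whichever interpreter was current before.
ToyInterp* toy_swap(ToyInterp* next) {
    ToyInterp* prev = g_current;
    g_current = next;
    return prev;
}

// Like PyRun_SimpleString(), this takes no interpreter argument, so it can
// only act on whatever the global pointer refers to at this moment.
bool toy_run(const std::string& program) {
    if (g_current == nullptr) return false;
    g_current->log.push_back(program);
    return true;
}
```

A caller would swap in interpreter A, run something, swap in B while keeping the returned pointer to A, and later swap A back. Because there is only one g_current for the whole process, two threads doing this at once would redirect each other’s programs.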

So what can we do?

Well, there’s this thing called thread-local storage where your defined variables in the global scope can be prepended with the declaration thread_local (or __declspec(thread) in Microsoft C++), and you’ll get one independent syntactically global variable per thread.
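As a minimal sketch of what that buys you, assuming a C++11 compiler: tls_counter below is written like a global, but each thread increments its own copy. The names tls_counter, bump and run_workers are made up for this example.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Looks like a global, but every thread gets its own zero-initialized copy.
thread_local int tls_counter = 0;

// Bump this thread's copy `n` times and report the final value.
static void bump(int n, int& result) {
    for (int i = 0; i < n; ++i) ++tls_counter;
    result = tls_counter;
}

// Start one worker thread per entry in `counts`; return what each one saw.
std::vector<int> run_workers(const std::vector<int>& counts) {
    std::vector<int> results(counts.size(), -1);
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        pool.emplace_back(bump, counts[i], std::ref(results[i]));
    }
    for (std::thread& t : pool) t.join();
    return results;
}
```

Each worker reports exactly its own count, and the calling thread’s tls_counter is left untouched, even though no locking is involved anywhere.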

So, to make it configurable, we’ve defined the macro:

#define PYGLOBAL __declspec(thread)

and then just stuck it in front of every global variable we can find, like so:

int Py_DebugFlag; /* Needed by parser.c */
int Py_VerboseFlag; /* Needed by import.c */
int Py_InteractiveFlag; /* Needed by Py_FdIsInteractive() below */
int Py_NoSiteFlag; /* Suppress 'import site' */
int Py_UseClassExceptionsFlag = 1; /* Needed by bltinmodule.c: deprecated */
int Py_FrozenFlag; /* Needed by getpath.c */
int Py_UnicodeFlag = 0; /* Needed by compile.c */

becomes

PYGLOBAL int Py_DebugFlag; /* Needed by parser.c */
PYGLOBAL int Py_VerboseFlag; /* Needed by import.c */
PYGLOBAL int Py_InteractiveFlag; /* Needed by Py_FdIsInteractive() below */
PYGLOBAL int Py_NoSiteFlag; /* Suppress 'import site' */
PYGLOBAL int Py_UseClassExceptionsFlag = 1; /* Needed by bltinmodule.c: deprecated */
PYGLOBAL int Py_FrozenFlag; /* Needed by getpath.c */
PYGLOBAL int Py_UnicodeFlag = 0; /* Needed by compile.c */

Now, this doesn’t work in certain places, like here:

char dllVersionBuffer[16] = ""; // a private buffer

// Python Globals
HMODULE PyWin_DLLhModule = NULL;
const char *PyWin_DLLVersionString = dllVersionBuffer;

because this gives error “C2099: initializer is not a constant”. Once dllVersionBuffer is thread-local, its address is different in each thread, so it is no longer a compile-time constant that can be used to initialize PyWin_DLLVersionString. That you can do this at all for proper global variables is edgy and a bonus gift, given the compiler implementation.

We can work round this by creating a function to initialize all these miscellaneous values that we can call before Py_Initialize().
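A sketch of that workaround follows, under some assumptions. The function init_thread_globals, the demo helpers worker and version_seen_by_new_thread, and the fallback to thread_local on non-Microsoft compilers are all inventions for this example. Only the two variable names and the __declspec(thread) spelling come from the code above.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Assumption: pick whichever thread-local spelling the compiler offers.
#if defined(_MSC_VER)
#define PYGLOBAL __declspec(thread)
#else
#define PYGLOBAL thread_local
#endif

// Per-thread copies. The pointer deliberately has no address initializer.
PYGLOBAL char dllVersionBuffer[16] = "";
PYGLOBAL const char* PyWin_DLLVersionString = nullptr;

// Hypothetical helper: call once in each thread before Py_Initialize().
void init_thread_globals(const char* version) {
    // snprintf truncates safely and always null-terminates.
    std::snprintf(dllVersionBuffer, sizeof(dllVersionBuffer), "%s", version);
    PyWin_DLLVersionString = dllVersionBuffer;  // this thread's own buffer
}

// Demo helper: initialize inside a fresh thread and read the value back.
static void worker(const char* version, std::string& seen) {
    init_thread_globals(version);
    seen = PyWin_DLLVersionString;
}

std::string version_seen_by_new_thread(const char* version) {
    std::string seen;
    std::thread t(worker, version, std::ref(seen));
    t.join();
    return seen;
}
```

The assignment that the compiler rejected as a static initializer is simply performed at run time, once per thread, where taking the address of a thread-local buffer is allowed.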

So that’s where we are up to right now. I can run it in two threads one after another, but if they happen at the same time it crashes. So I must have missed something.

Anyway, if this does work, then it’ll be quite a neat trick, and I could dine out on it at the next PyCon. We’ll see. If it doesn’t work, then I don’t know what I’m going to do. As they say, necessity is the mother of invention, especially when it requires you to break all the conventions and taboos in software engineering, such as panicked hacking of the source code of a large, well-established library you depend on but don’t understand.

3 Comments

  • 1. Freesteel… replies at 3rd October 2013, 6:00 pm :

    […] is the follow-on from last week’s A plan to make Python 2.4 multicore. It’s been a disaster, but I’ve proved a […]

  • 2. Craig Armstrong replies at 19th November 2013, 7:19 am :

    Julian,

    I’m about to undertake the same course of action to allow multiple instances of the interpreter. Although your post on 10/3 isn’t encouraging, I’m going to try it.

    Using this approach I’ve successfully modified NMAP and puTTY to support multiple instances and I’m hopeful that the same can be done with Python.

    I’ll be more than happy to share what I find and hopefully I will get a head start based on what you’ve already encountered.

  • 3. Julian replies at 20th November 2013, 1:00 pm :

    Are they multiple instances in the same process? This isn’t possible unless you have found a way to make the global variables local to each instance.

    Doing it with multiple instances each in a different sub-process is easy. We’ve got this to work via pushing data between them through their stdin and stdout ports. It’s very successful.
