
Python 2.4 multicore results

Thursday, October 3rd, 2013 at 6:00 pm Written by:

This is the follow-on from last week’s A plan to make Python 2.4 multicore. It’s been a disaster, but I’ve proved a point.

In summary, the idea was to run a single threading version of Python (compiled with WITH_THREADS undefined) in multiple different Windows threads in a way that these totally isolated interpreters could interact with and access the same pool of C++ functions and very large objects.

And it might have been easy if the Python interpreter wasn’t so totally infested with unnecessary global variables — especially in the garbage collector and object allocators.

A sample of the hacks is at: bitbucket.org/goatchurch/python24threadlocal. (Go into the files and click on [DIFF] to see the sorts of changes I was making.)

First the results.

The Python code I was able to run in parallel is as follows:

import time
t=time.time()
for i in range(9):
    print i, sum(xrange(i*4000000))
print 'T=', time.time()-t

Here is the table of timings on my Intel Core i7-3610QM CPU @ 2.30GHz, which has 4 cores and 8 hyperthreads:

Threads Total time (seconds)
1       14.67
2       15.08
4       15.75
6       21.39
8       26.09
10      26.21 -> 35.09

I don’t really understand this hyperthreads business, but the numbers are consistent: almost no increase in time as the workload is quadrupled across the 4 real cores, a moderate rise when you try to do 8 things at the same time (because the extra 4 hyperthreads are not real cores), and the whole performance falling apart beyond 8, with some jobs finishing much faster than others kicked off at the same time (hence the spread of 26.21 to 35.09 seconds in the 10-thread row).

Here’s a snapshot from the task manager showing how the usage gets towards the 100% mark:

The code in the thread is as follows:

char* code3 = "import time\nt=time.time()...

DWORD WINAPI ThreadProc(LPVOID lpdwThreadParam)
{
    Py_ThreadGlobalsInitialize();  // initializes the thread-local copies of the global variables
    Py_NoSiteFlag++; 
    Py_InitializeEx(0);
    PyRun_SimpleString(code3);

//    Py_Finalize();  // couldn't stop this from crashing
    return 0;
}

The code in the main loop that kicks off the threads looks like:

int nthreads = 10; 
std::vector<HANDLE> threads;
for (int ithread = 0; ithread < nthreads; ithread++)
    threads.push_back(CreateThread(NULL, 0, ThreadProc, NULL, CREATE_SUSPENDED, NULL));

// kick off the threads
for (int i = 0; i < nthreads; i++)
{
    ResumeThread(threads[i]);
    WaitForMultipleObjects(1, &threads[i], TRUE, 50);  // wait up to 50ms, staggering the start-ups
}

// wait until done
for (int i = 0; i < nthreads; i++)
    WaitForMultipleObjects(1, &threads[i], TRUE, INFINITE);

And the vast amount of senseless hacking I did to the CPython interpreter can be witnessed at: python24threadlocal

So the conclusion is that this idea worked -- barely. It is in no way maintainable. If the Python interpreter writers hadn't so carelessly littered thousands of global variables all over the place in every corner of the code, it might have been workable.

In general, the code is not that bad. It just suffers totally from being written in C rather than C++, necessitating the use of #defines to implement in-line functions and inheritance structures, as well as severely suffering from the lack of class constructors and destructors to make everything sane.

Anyways, I can declare myself bored, having pushed this idea far enough for someone else to either draw informed conclusions or take it up for themselves. So hopefully it's not all wasted.

Meanwhile, Martin has been working on Plan B, which is to serialize our complex C++/Python objects into a pipe to be popened for execution in another process. This is a far more normal way to engineer it, which is guaranteed to work as long as the IO load doesn't get too high. It also has the advantage that you can spread the work to completely different machines.

Why didn't I do it that way at the start? Good question. As I said, you need to have failure to do any innovation, and I now know more about the internal workings of the Python interpreter than I had ever intended to learn. Might be useful.

2 Comments

  • 1. Hugh replies at 9th October 2013, 5:16 pm :

    Julian,

    I have been patiently waiting for the cloud release from AD as I believe it is a perfect solution to the distribution of software, and combining the $$$ and experience of AutoDesk with your toolpath algorithms must scare the hell out of many CADCAM vendors.

    I also have wondered what is to become of the CAM companies that licensed the adaptive clearing technology now that AutoDesk owns the source code.

    Will the companies have access to future advances or be permitted to include the adaptive clearing in future versions, or are their products now incapable of advancing in HSM unless they develop similar or license from the volumill or moduleworks?

    I know it's been a while since this transaction took place, but I have been meaning to ask for some time.

    Hugh

  • 2. Julian replies at 11th October 2013, 7:21 am :

    As it’s calculation software that works on a batch process, it doesn’t really know if it is being run on the cloud or not. I’m not sure what everyone else in the team is up to, but I think it’s mostly been integrating the system into other CAD products. This is one reason I have been having an easy time as they have been too preoccupied to generate any bugs. Maybe we should both sign up to this list: http://cam.autodesk.com/cam.php
