Use a numpy array in shared memory for multiprocessing

Question

I would like to use a numpy array in shared memory with the multiprocessing module. The difficulty is using it like a numpy array, and not just as a ctypes array.

    from multiprocessing import Process, Array
    import numpy

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        unshared_arr = numpy.random.rand(N)
        arr = Array('d', unshared_arr)
        print("Originally, the first two elements of arr = %s" % (arr[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(arr,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % arr[:2])

This produces output such as:

    Originally, the first two elements of arr = [0.3518653236697369, 0.517794725524976]
    Now, the first two elements of arr = [-0.3518653236697369, 0.517794725524976]

The array can be accessed in a ctypes manner, e.g. arr[i] makes sense. However, it is not a numpy array, and I cannot perform operations such as -1*arr or arr.sum(). I suppose a solution would be to convert the ctypes array into a numpy array. However (besides not being able to make this work), I don't believe it would still be shared.

It seems there should be a standard solution to what must be a common problem.

jfs · Answer

To add to the answers from @unutbu (no longer available) and @Henry Gomersall: you can use shared_arr.get_lock() to synchronize access when needed:

    shared_arr = mp.Array(ctypes.c_double, N)
    # ...
    def f(i):  # could be anything numpy accepts as an index, such as another numpy array
        with shared_arr.get_lock():  # synchronize access
            arr = np.frombuffer(shared_arr.get_obj())  # no data copying
            arr[i] = -arr[i]

Example

    import ctypes
    import logging
    import multiprocessing as mp

    from contextlib import closing

    import numpy as np

    info = mp.get_logger().info

    def main():
        logger = mp.log_to_stderr()
        logger.setLevel(logging.INFO)

        # create shared array
        N, M = 100, 11
        shared_arr = mp.Array(ctypes.c_double, N)
        arr = tonumpyarray(shared_arr)

        # fill with random values
        arr[:] = np.random.uniform(size=N)
        arr_orig = arr.copy()

        # write to arr from different processes
        with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
            # many processes access the same slice
            stop_f = N // 10
            p.map_async(f, [slice(stop_f)]*M)

            # many processes access different slices of the same array
            assert M % 2  # odd
            step = N // 10
            p.map_async(g, [slice(i, i + step) for i in range(stop_f, N, step)])
        p.join()
        assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)

    def init(shared_arr_):
        global shared_arr
        shared_arr = shared_arr_  # must be inherited, not passed as an argument

    def tonumpyarray(mp_arr):
        return np.frombuffer(mp_arr.get_obj())

    def f(i):
        """synchronized."""
        with shared_arr.get_lock():  # synchronize access
            g(i)

    def g(i):
        """no synchronization."""
        info("start %s" % (i,))
        arr = tonumpyarray(shared_arr)
        arr[i] = -1 * arr[i]
        info("end %s" % (i,))

    if __name__ == '__main__':
        mp.freeze_support()
        main()

If you don't need synchronized access, or you create your own locks, then mp.Array() is unnecessary. You can use mp.sharedctypes.RawArray in this case.
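As a rough illustration of the lock-free variant, here is a minimal sketch that wraps a RawArray in a numpy view and lets a single child process modify it. The helper names as_numpy and negate are hypothetical, chosen for this example only, and it assumes only one process writes at a time, so no locking is needed.

```python
import ctypes
import multiprocessing as mp

import numpy as np


def as_numpy(raw_arr):
    """Wrap a shared ctypes array in a numpy view (no data is copied)."""
    return np.frombuffer(raw_arr, dtype=np.float64)


def negate(raw_arr, index):
    """Negate the selected elements in place; meant to run in a child process."""
    arr = as_numpy(raw_arr)
    arr[index] = -arr[index]


def main():
    n = 10
    raw_arr = mp.RawArray(ctypes.c_double, n)  # shared memory, no lock attached
    arr = as_numpy(raw_arr)
    arr[:] = np.arange(n)

    # A single writer at a time, so no synchronization is required here.
    p = mp.Process(target=negate, args=(raw_arr, slice(0, 5)))
    p.start()
    p.join()

    print(arr)  # the first five elements were negated by the child


if __name__ == "__main__":
    main()
```

Unlike mp.Array, a RawArray has no get_obj() or get_lock(); it is passed to np.frombuffer directly.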

Henry Gomersall · Answer

The Array object has a get_obj() method, which returns the ctypes array; that array exposes a buffer interface. I think the following should work...

    from multiprocessing import Process, Array
    import numpy

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        unshared_arr = numpy.random.rand(N)
        a = Array('d', unshared_arr)
        print("Originally, the first two elements of arr = %s" % (a[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(a,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % a[:2])

        # Wrap the shared buffer in a numpy array (no copy)
        b = numpy.frombuffer(a.get_obj())

        b[0] = 10.0
        print(a[0])

When run, this prints the first element of a as 10.0, showing that a and b are just two views into the same memory.

To make sure it is still multiprocess-safe, I believe you will have to use the acquire and release methods that exist on the Array object, a, and its built-in lock, so that all access is safe (though I'm not an expert on the multiprocessing module).
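One way to picture the locking described above is the sketch below, in which several processes increment the same element through a numpy view. It uses get_lock() as a context manager, which acquires the lock on entry and releases it on exit. The function name add_one and the worker and repeat counts are arbitrary choices for this example.

```python
import multiprocessing as mp

import numpy as np


def add_one(shared, repeats):
    """Increment element 0 `repeats` times, holding the lock for each update."""
    for _ in range(repeats):
        with shared.get_lock():  # acquire on entry, release on exit
            view = np.frombuffer(shared.get_obj())  # numpy view, no copy
            view[0] += 1.0


def main():
    shared = mp.Array('d', 3)  # three zero-initialised doubles plus a lock
    workers = [mp.Process(target=add_one, args=(shared, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Each read-modify-write held the lock, so no increment is lost.
    print(shared[0])


if __name__ == "__main__":
    main()
```

Without the lock, the read-modify-write on view[0] could interleave between processes and some increments might be lost.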

EelkeSpaak · Answer

While the answers already given are good, there is a much simpler solution to this problem, provided two conditions are met:

You are on a POSIX-compliant operating system (e.g. Linux, Mac OSX); and
Your child processes need read-only access to the shared array.

In this case you do not need to explicitly make variables shared, because the child processes will be created using a fork. A forked child automatically shares the parent's memory space. In the context of Python multiprocessing, this means it shares all module-level variables; note that this does not hold for arguments that you explicitly pass to your child processes or to the functions you call on a multiprocessing.Pool or similar.

A simple example:

    import multiprocessing
    import numpy as np

    # will hold the (implicitly mem-shared) data
    data_array = None

    # child worker function
    def job_handler(num):
        # built-in id() returns unique memory ID of a variable
        return id(data_array), np.sum(data_array)

    def launch_jobs(data, num_jobs=5, num_worker=4):
        global data_array
        data_array = data

        pool = multiprocessing.Pool(num_worker)
        return pool.map(job_handler, range(num_jobs))

    # create some random data and execute the child jobs
    mem_ids, sumvals = zip(*launch_jobs(np.random.rand(10)))

    # this will print 'True' on POSIX OS, since the data was shared
    print(np.all(np.asarray(mem_ids) == id(data_array)))
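The approach above depends on the fork start method, which is no longer the default on every POSIX platform (macOS switched to spawn in Python 3.8, and Python 3.14 changed the default on other POSIX systems to forkserver). A minimal sketch that requests fork explicitly might look like the following; the names child_sum and sum_in_child are hypothetical, and the code assumes a platform where fork is available.

```python
import multiprocessing as mp

import numpy as np

data_array = None  # module-level, so a forked child inherits it


def child_sum(queue):
    """Read the inherited array in the child and send back its sum."""
    queue.put(float(np.sum(data_array)))


def sum_in_child(data):
    """Sum `data` in a forked child that inherits it without pickling."""
    global data_array
    data_array = data
    ctx = mp.get_context("fork")  # POSIX only; raises ValueError elsewhere
    queue = ctx.Queue()
    p = ctx.Process(target=child_sum, args=(queue,))
    p.start()
    result = queue.get()
    p.join()
    return result


if __name__ == "__main__":
    print(sum_in_child(np.arange(10.0)))  # 45.0
```

Only the small result travels back through the queue; the array itself is never sent to the child. Writes made by the child would not be visible to the parent, which is why this pattern suits read-only access.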

mat · Answer

I've written a small Python module that uses POSIX shared memory to share numpy arrays between Python interpreters. Maybe you will find it handy.

https://pypi.python.org/pypi/SharedArray

Here's how it works:

    import numpy as np
    import SharedArray as sa

    # Create an array in shared memory
    a = sa.create("test1", 10)

    # Attach it as a different array. This can be done from another
    # python interpreter as long as it runs on the same computer.
    b = sa.attach("test1")

    # See how they are actually sharing the same memory block
    a[0] = 42
    print(b[0])

    # Destroying a does not affect b.
    del a
    print(b[0])

    # See how "test1" is still present in shared memory even though we
    # destroyed the array a.
    sa.list()

    # Now destroy the array "test1" from memory.
    sa.delete("test1")

    # The array b is not affected, but once you destroy it then the
    # data are lost.
    print(b[0])
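For comparison, the same create-then-attach-by-name pattern can be sketched with the standard library's multiprocessing.shared_memory module (Python 3.8 and later). This is a different technique from SharedArray, not its API, and the helper names create_shared and attach_shared are hypothetical.

```python
from multiprocessing import shared_memory

import numpy as np


def create_shared(shape, dtype=np.float64):
    """Allocate a named shared block and return (block, numpy view on it)."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


def attach_shared(name, shape, dtype=np.float64):
    """Attach to an existing block by name and return (block, numpy view)."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


if __name__ == "__main__":
    shm_a, a = create_shared((10,))
    a[:] = 0.0
    # Attaching could equally happen in another process on the same machine.
    shm_b, b = attach_shared(shm_a.name, (10,))

    a[0] = 42
    print(b[0])  # b sees the write because both views share one block

    # Drop the numpy views before closing, then free the block exactly once.
    del a, b
    shm_b.close()
    shm_a.close()
    shm_a.unlink()
```

The shape and dtype are not stored in the block, so the attaching side has to know them in advance.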

Velimir Mlaker · Answer

You can use the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem

Here is your original code, this time using shared memory that behaves like a NumPy array (note the additional last statement, which calls the NumPy sum() function):

    from multiprocessing import Process
    import sharedmem
    import numpy

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        unshared_arr = numpy.random.rand(N)
        arr = sharedmem.empty(N)
        arr[:] = unshared_arr.copy()
        print("Originally, the first two elements of arr = %s" % (arr[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(arr,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % arr[:2])

        # Perform some NumPy operation
        print(arr.sum())