ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://bnikolic.co.uk/blog/python-numerical-gil |
| Last Crawled | 2026-04-10 17:37:08 (1 day ago) |
| First Indexed | 2023-04-10 22:32:02 (3 years ago) |
| HTTP Status Code | 200 |
| Meta Title | B. Nikolic: Python Global Interpreter Lock (GIL) + numerical data processing |
| Meta Description | null |
| Meta Canonical | null |
| Boilerpipe Text | The global interpreter lock may naively seem to imply that a Python
process can only use one core of a processor. That is
not
at all the
case: one (well-known) reason is that Python extension code (such as
ubiquitous NumPy/SciPy kernels) can themselves use multiple
threads. This is what I'd call fine-grained threading, which is easy
to use and effective when the kernels have available sufficient
parallelism. Coarse-grained parallelism through the
threading
module can however also be effective for the following reasons.
The Python GIL ensures that the Python interpreter in a process is
only ever executing one opcode (Python instruction) at any one
time. Instructions in a compiled extension to Python, or a syscall do
not intrinsically need to the lock acquired, and indeed the lock is
released before most such calls, e.g., the lock is released before
NumPy kernels are called. For this reason many different NumPy
kernels from /different Python threads/ can be executing at the same
time, utilising multiple CPU cores.
Illustration
As an illustration I've slightly Adapted the code Travis posted here
(
http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html
) so that it uses either
threading
or
multiprocessing
for
coarse-grained threading. The
multiprocessing
approach uses
multiple Python processes and so can not be affected by the GIL.
# Adapted from: http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html
from numpy import zeros, array_split, linspace
import threading
import multiprocessing
NT
=
1
dx
=
0
.1
dy
=
0
.1
dx2
=
dx*dx
dy2
=
dy*dy
def num_update
(
u
)
:
u
[
1
:-1,1:-1
]
=
((
u
[
2
:,1:-1
]
+u
[
:-2,1:-1
])
*dy2 +
(
u
[
1
:-1,2:
]
+ u
[
1
:-1,:-2
])
*dx2
)
/
(
2
*
(
dx2+dy2
))
def py_update
(
u
)
:
nx,
ny
=
u.shape
for
i in range
(
1
,nx-1
)
:
for
j in range
(
1
, ny-1
)
:
u
[
i,j
]
=
((
u
[
i+1, j
]
+ u
[
i-1, j
])
* dy2 +
(
u
[
i, j+1
]
+ u
[
i, j-1
])
* dx2
)
/
(
2
*
(
dx2+dy2
))
def calc
(
N,
Niter
=
100
,
func
=
py_update,
icl
=[
1
])
:
for
ic in icl:
u
=
zeros
([
N, N
])
u
[
0
]
=
ic
for
i in range
(
Niter
)
:
func
(
u
)
def blot
(
np
=
100
,
N
=
30
,
mp
=
False,
num
=
True
)
:
threads
=[]
if
num:
update
=
num_update
else
:
update
=
py_update
for
icl in array_split
(
linspace
(
1
,2,np
)
, NT
)
:
if
mp:
t
=
multiprocessing.Process
(
target
=
calc,
args
=(
N,
30
, update, icl,
))
else
:
t
=
threading.Thread
(
target
=
calc,
args
=(
N,
30
, update, icl,
))
t.start
()
threads.append
(
t
)
for
t in threads:
t.join
()
Measuring the performance of this example as function of number of
threads (NT) gives following results:
The graph labelled Threading+PurePy is for the pure Python version of
the problem. It can be seen there is a small
increase
in the
execution time as more threads are added. This is due to the GIL: each
pure Python thread has to acquire the lock in order to execute any
instructions, so speed of execution engine can never be faster than
single threaded. The small increases in execution time are due to
overheads of acquiring the GIL and switching between the threads.
The graph labelled Threading+NumPy uses NumPy kernels for the
computation. It can be seen that as the number of threads increases
execution time decreases close to linearly and then levels off. A
similar pattern can be seen in the MultiProc+Numpy, in which the
multipricessing
module is used instead of threading. In both of
these scenarious the speed of execution with multiple threads exceeds
the single core case, demonstrating that GIL is not limiting the
execution to an equivalent of one core. It can be seen that
multiprocessing
performance levels of at slightly faster than
threading
but it isn't clear if this is due to lock-acquisition
and thread switching overheads or due O/S level effects. |
| Markdown | [Bojan Nikolic](https://bnikolic.co.uk/blog/python-numerical-gil)
- [\[home\]](http://www.bnikolic.co.uk/)
- [\[blog\]](http://www.bnikolic.co.uk/blog/)
- [\[BN Algorithms Ltd\]](http://www.bnikolic.co.uk/about-bn-algorithms)
- [\[Quantlib\]](http://www.bnikolic.co.uk/ql/)
# Python Global Interpreter Lock (GIL) + numerical data processing
The global interpreter lock may naively seem to imply that a Python process can only use one core of a processor. That is *not* at all the case: one (well-known) reason is that Python extension code (such as ubiquitous NumPy/SciPy kernels) can themselves use multiple threads. This is what I'd call fine-grained threading, which is easy to use and effective when the kernels have available sufficient parallelism. Coarse-grained parallelism through the threading module can however also be effective for the following reasons.
The Python GIL ensures that the Python interpreter in a process is only ever executing one opcode (Python instruction) at any one time. Instructions in a compiled extension to Python, or a syscall do not intrinsically need to the lock acquired, and indeed the lock is released before most such calls, e.g., the lock is released before NumPy kernels are called. For this reason many different NumPy kernels from /different Python threads/ can be executing at the same time, utilising multiple CPU cores.
## Illustration
As an illustration I've slightly Adapted the code Travis posted here (<http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html>) so that it uses either threading or multiprocessing for coarse-grained threading. The multiprocessing approach uses multiple Python processes and so can not be affected by the GIL.
```
# Adapted from: http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html
from numpy import zeros, array_split, linspace
import threading
import multiprocessing
NT=1
dx = 0.1
dy = 0.1
dx2 = dx*dx
dy2 = dy*dy
def num_update(u):
u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
(u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
def py_update(u):
nx, ny = u.shape
for i in range(1,nx-1):
for j in range(1, ny-1):
u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
(u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))
def calc(N, Niter=100, func=py_update, icl=[1]):
for ic in icl:
u = zeros([N, N])
u[0] = ic
for i in range(Niter):
func(u)
def blot(np=100, N=30, mp=False, num=True):
threads=[]
if num:
update=num_update
else:
update=py_update
for icl in array_split(linspace(1,2,np), NT):
if mp:
t=multiprocessing.Process(target=calc, args=(N, 30, update, icl,))
else:
t=threading.Thread(target=calc, args=(N, 30, update, icl,))
t.start()
threads.append(t)
for t in threads:
t.join()
```
Measuring the performance of this example as function of number of threads (NT) gives following results:

The graph labelled Threading+PurePy is for the pure Python version of the problem. It can be seen there is a small *increase* in the execution time as more threads are added. This is due to the GIL: each pure Python thread has to acquire the lock in order to execute any instructions, so speed of execution engine can never be faster than single threaded. The small increases in execution time are due to overheads of acquiring the GIL and switching between the threads.
The graph labelled Threading+NumPy uses NumPy kernels for the computation. It can be seen that as the number of threads increases execution time decreases close to linearly and then levels off. A similar pattern can be seen in the MultiProc+Numpy, in which the multipricessing module is used instead of threading. In both of these scenarious the speed of execution with multiple threads exceeds the single core case, demonstrating that GIL is not limiting the execution to an equivalent of one core. It can be seen that multiprocessing performance levels of at slightly faster than threading but it isn't clear if this is due to lock-acquisition and thread switching overheads or due O/S level effects.
© 2008 to 2019, [Bojan Nikolic](http://www.bnikolic.co.uk/). Send comments/enquiries to: [webs@bnikolic.co.uk](mailto:webs@bnikolic.co.uk). [Terms and Conditions](http://www.bnikolic.co.uk/siteterms).
*** |
| Readable Markdown | The global interpreter lock may naively seem to imply that a Python process can only use one core of a processor. That is *not* at all the case: one (well-known) reason is that Python extension code (such as ubiquitous NumPy/SciPy kernels) can themselves use multiple threads. This is what I'd call fine-grained threading, which is easy to use and effective when the kernels have available sufficient parallelism. Coarse-grained parallelism through the threading module can however also be effective for the following reasons.
The Python GIL ensures that the Python interpreter in a process is only ever executing one opcode (Python instruction) at any one time. Instructions in a compiled extension to Python, or a syscall do not intrinsically need to the lock acquired, and indeed the lock is released before most such calls, e.g., the lock is released before NumPy kernels are called. For this reason many different NumPy kernels from /different Python threads/ can be executing at the same time, utilising multiple CPU cores.
## Illustration
As an illustration I've slightly Adapted the code Travis posted here (<http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html>) so that it uses either threading or multiprocessing for coarse-grained threading. The multiprocessing approach uses multiple Python processes and so can not be affected by the GIL.
```
# Adapted from: http://technicaldiscovery.blogspot.co.uk/2011/06/speeding-up-python-numpy-cython-and.html
from numpy import zeros, array_split, linspace
import threading
import multiprocessing
NT=1
dx = 0.1
dy = 0.1
dx2 = dx*dx
dy2 = dy*dy
def num_update(u):
u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
(u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
def py_update(u):
nx, ny = u.shape
for i in range(1,nx-1):
for j in range(1, ny-1):
u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
(u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))
def calc(N, Niter=100, func=py_update, icl=[1]):
for ic in icl:
u = zeros([N, N])
u[0] = ic
for i in range(Niter):
func(u)
def blot(np=100, N=30, mp=False, num=True):
threads=[]
if num:
update=num_update
else:
update=py_update
for icl in array_split(linspace(1,2,np), NT):
if mp:
t=multiprocessing.Process(target=calc, args=(N, 30, update, icl,))
else:
t=threading.Thread(target=calc, args=(N, 30, update, icl,))
t.start()
threads.append(t)
for t in threads:
t.join()
```
Measuring the performance of this example as function of number of threads (NT) gives following results:

The graph labelled Threading+PurePy is for the pure Python version of the problem. It can be seen there is a small *increase* in the execution time as more threads are added. This is due to the GIL: each pure Python thread has to acquire the lock in order to execute any instructions, so speed of execution engine can never be faster than single threaded. The small increases in execution time are due to overheads of acquiring the GIL and switching between the threads.
The graph labelled Threading+NumPy uses NumPy kernels for the computation. It can be seen that as the number of threads increases execution time decreases close to linearly and then levels off. A similar pattern can be seen in the MultiProc+Numpy, in which the multipricessing module is used instead of threading. In both of these scenarious the speed of execution with multiple threads exceeds the single core case, demonstrating that GIL is not limiting the execution to an equivalent of one core. It can be seen that multiprocessing performance levels of at slightly faster than threading but it isn't clear if this is due to lock-acquisition and thread switching overheads or due O/S level effects. |
| Shard | 7 (laksa) |
| Root Hash | 3261792786456953207 |
| Unparsed URL | uk,co,bnikolic!/blog/python-numerical-gil s443 |