Last updated March 14, 2025 • 4 min read
# Building an ETL flow with asyncio, multiprocessing & asyncpg

Renato Vieira
Fullstack Developer
[Original post](https://www.vintasoftware.com/blog/etl-with-asyncio-asyncpg)

This post will explain how to implement a concurrent ETL (**E**xtract, **T**ransform, **L**oad) flow combining Python `asyncio` with `multiprocessing` to get the best of both worlds. ETL itself is a procedure that starts with data extraction from sources such as a database (or many databases). Following extraction, we apply functions to the data in order to modify it. These transformation functions are generally used for cleaning or converting extracted data. This flow is particularly useful for data warehousing.
Operations with databases are core parts of many applications. Many of these procedures deal with tables containing millions of rows, which can hugely impact the performance of said applications. With Python we’d most commonly use the `psycopg2` library to perform operations in a PostgreSQL database.
To speed up ETL operations, we can benefit from concurrent code. Python provides us with [`asyncio`](https://docs.python.org/3/library/asyncio.html), a built-in library which allows developers to write concurrent code without dealing with low-level programming such as thread management. In `asyncio`, functions that are executed concurrently are called coroutines. For our ETL context, we’ll wrap those coroutines into tasks and wait for all of them to be done with the `asyncio.gather` function. Our code waits for every concurrent task to be completed before it can advance, similarly to the barriers used to synchronize the execution of threads running concurrently.
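A minimal, self-contained sketch of that barrier-like behavior (the stage names and delays below are our own invention, not from the post):

```python
import asyncio

async def stage(name, delay):
    await asyncio.sleep(delay)
    return name

async def main():
    # gather() only returns once every wrapped task has completed,
    # and results come back in the order the tasks were passed in.
    results = await asyncio.gather(
        asyncio.create_task(stage("extract", 0.02)),
        asyncio.create_task(stage("load", 0.01)),
    )
    return results

print(asyncio.run(main()))  # ['extract', 'load']
```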
For our context, we are going to implement the following steps to model ETL:
1. Fetch records from a database (`extract`), each row having a `name` and an `age`;
2. Perform an operation over the obtained data (`transform`) to split the `name` from each record into `first_name` and `last_name`, while keeping the `age` value unaltered;
3. Write the transformed data into a database (`load`).
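Before diving into the concurrent machinery, the transform step itself can be shown on a single made-up row (the record values are our own example):

```python
record = {"name": "Ada Lovelace", "age": 36}

# maxsplit=1 keeps multi-word last names in one piece
first_name, last_name = record["name"].split(" ", 1)
transformed = (first_name, last_name, record["age"])
print(transformed)  # ('Ada', 'Lovelace', 36)
```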
Since `psycopg2` doesn’t support `asyncio`, we’ll turn to [`asyncpg`](https://github.com/MagicStack/asyncpg), one of the available libraries that implement asynchronous database operations. We’ll use it to implement extract and load, the steps that deal with database access.
According to `asyncpg` developers, [the library on average is three times faster than psycopg2](https://github.com/MagicStack/asyncpg#performance). It also provides tools to perform database operations, such as prepared statements and cursors. There are however some missing features, such as a built-in mechanism to prevent SQL injection (there’s a trick to implement it though, and we will discuss it later in the text).
# Synchronizing all tasks
In order to implement an ETL flow, we must provide functions to represent each stage. In non-concurrent code, we’d have functions such as `fetch_from_db`, `transform_data` and `write_into_db` called sequentially. In a concurrent run, we want each function to run concurrently, starting computation as soon as there’s data to work with while also providing data to the next stage whenever the previous function’s job is done.
To establish this kind of synchronization between each stage, one approach is to implement a consumer & producer architecture with one queue being used to pass data between them.
We begin structuring our code writing the `etl` function:
```
import asyncio
import multiprocessing
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

async def etl():
    with ProcessPoolExecutor(
        max_workers=multiprocessing.cpu_count(),
    ) as pool:
        loop = asyncio.get_running_loop()
        queue = asyncio.Queue(maxsize=1000)
        await asyncio.gather(
            asyncio.create_task(producer(queue)),
            asyncio.create_task(consumer(loop, pool, queue)),
            return_exceptions=False,
        )

def main():
    asyncio.run(etl())
```
Firstly, our `etl` function must be asynchronous (through the `async` keyword) in order to fully work with `asyncio`.
We'll be using a `ProcessPoolExecutor` to perform our transformation procedures. This allows us to not halt our execution when processing data. Setting the number of workers in the pool to all CPUs our machine has will help to speed things up by fully using the available resources. The usage of these structures will be further explained in a later section.
Notice that when instantiating our queue, we’ve used the `maxsize` parameter. Why do we do that?
First of all, an unbounded queue risks depleting memory. Secondly, the producer would fill the queue indefinitely, and only then would the consumer start running. That may end up as synchronous code, since the consumer would only start when the producer had nothing else to do.
Defining a queue size limit makes the consumer start earlier, as soon as the queue reaches the set limit for the first time. When space is freed up, the producer will continue its job and add more elements to the queue, making the code run concurrently. This means the queue is acting as a [backpressure mechanism](https://lucumr.pocoo.org/2020/1/1/async-pressure/). It’s positive to have such a mechanism as long as production is faster than consumption, which happens in many cases. For more details regarding these issues, please check [this article](https://nullprogram.com/blog/2020/05/24/).
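A small sketch of that backpressure in action, using an arbitrary `maxsize` of 2:

```python
import asyncio

async def demo():
    queue = asyncio.Queue(maxsize=2)
    await queue.put("row-1")
    await queue.put("row-2")
    # A third put() would now suspend the producer until a consumer
    # frees a slot with get() -- that suspension is the backpressure.
    print(queue.full())  # True
    await queue.get()
    print(queue.full())  # False

asyncio.run(demo())
```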
`main` will be responsible for invoking `etl`. While `main` itself is not asynchronous, it uses `asyncio.run` to execute an asynchronous function. After `etl` is over, any code after `asyncio.run(etl())` would be executed as regular synchronous code.
Producer has a fairly straightforward implementation:
```
async def producer(queue):
    connection = await get_asyncpg_connection()
    async for record in extract(connection):
        await queue.put(record)
    await queue.put(None)
    await connection.close()
```
We start by grabbing an `asyncpg` connection object. Once attained, we loop over the records the `extract` function fetches from our database. Each record is inserted into the queue, to be processed by the consumer, until there is no more data to fetch. To signal its job is done, the producer then inserts a `None` value into the queue, which must be handled gracefully by the consumer. Lastly, we close the `asyncpg` connection.
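The sentinel handshake can be sketched in isolation; `drain` below is a hypothetical stand-in for the consumer loop:

```python
import asyncio

async def drain(queue):
    items = []
    while True:
        item = await queue.get()
        if item is None:  # the producer's "I'm done" signal
            break
        items.append(item)
    return items

async def demo():
    queue = asyncio.Queue()
    for value in ("row-1", "row-2", None):
        await queue.put(value)
    return await drain(queue)

print(asyncio.run(demo()))  # ['row-1', 'row-2']
```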
Regarding both `get_asyncpg_connection` and `extract` functions, let’s dive into their implementations.
```
import asyncpg

async def get_asyncpg_connection():
    db_conf = {
        "user": "user_name",
        "password": "password",
        "database": "db_name",
        "host": "host",
        "port": "port",
    }
    connection = await asyncpg.connect(**db_conf)
    return connection

async def extract(connection):
    async with connection.transaction():
        query = "SELECT name, age FROM input_table"
        async for record in connection.cursor(query, prefetch=1000):
            yield record
```
`get_asyncpg_connection` is the only part of the code that explicitly interacts with the `asyncpg` module. It creates a connection with our database using the `asyncpg.connect` method. This method's parameters are quite similar to the ones found in `psycopg2.connect`. To simplify this post, we opted to instantiate a `db_conf` variable inside `get_asyncpg_connection`, using dummy values for the connection parameters. When adapting this code to a real context, simply replace the dummy values with real ones.
The `extract` function is a generator that receives an open `asyncpg` connection as its single parameter. It's responsible for accessing a table in our database (in our case, the `input_table` table) via the `cursor` method inside `connection`. This fetches records from the database according to the query defined in `extract` and yields each grabbed row.
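The generator's shape can be mimicked without a database; `fake_cursor` here is a made-up stand-in for `connection.cursor`, not part of the post's code:

```python
import asyncio

async def fake_cursor(rows):
    # Imitates a server-side cursor: rows arrive one at a time.
    for row in rows:
        await asyncio.sleep(0)  # yield control, like a network round-trip
        yield row

async def extract(rows):
    async for record in fake_cursor(rows):
        yield record

async def demo():
    return [record async for record in extract([{"name": "Ada Lovelace", "age": 36}])]

print(asyncio.run(demo()))  # [{'name': 'Ada Lovelace', 'age': 36}]
```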
Now it’s time to talk about `consumer` implementation:
```
def transform(batch):
    transformed_batch = []
    for record in batch:
        first_name, last_name = record["name"].split(" ", 1)
        age = record["age"]
        transformed_batch.append((first_name, last_name, age))
    return transformed_batch

async def load(batch, connection):
    async with connection.transaction():
        columns = ("first_name", "last_name", "age")
        await connection.copy_records_to_table(
            "output_table", records=batch, columns=columns
        )

async def task_set_load_helper(task_set, connection):
    for future in task_set:
        await load(await future, connection)

async def consumer(loop, pool, queue):
    connection = await get_asyncpg_connection()
    task_set = set()
    batch = []
    while True:
        record = await queue.get()
        if record is not None:
            record = dict(record)
            batch.append(record)
        if queue.empty():
            task = loop.run_in_executor(pool, transform, batch)
            task_set.add(task)
            if len(task_set) >= pool._max_workers:
                done_set, task_set = await asyncio.wait(
                    task_set, return_when=asyncio.FIRST_COMPLETED
                )
                await task_set_load_helper(done_set, connection)
            batch = []
        if record is None:
            break
    if task_set:
        await task_set_load_helper(
            asyncio.as_completed(task_set), connection
        )
    await connection.close()
```
The `consumer` function will be responsible for grabbing the records our producer stores in the queue and adding them to a batch. We must remember to convert each record into a `dict` before adding it to the batch in order to process the data later - not doing so would raise a `TypeError: can't pickle asyncpg.Record objects` exception. Whenever the queue becomes empty, the consumer will start running the transformation step over the batch it built.
The `transform` function receives a batch from the consumer, which is nothing more than a `list` of `dict` objects fetched from the database. Since we want to split the full names from each record into first and last names, we start by looping over the batch and grabbing every record. Our context is simple, so the transformation is pretty straightforward. After obtaining the transformed data we create a `tuple` containing each value and finally add it to a `list` representing a modified batch.
As applications become more complex and involve multiple output tables, it’s recommended to keep track of the columns and which table the records should be inserted into. This can be achieved by using a `dict` to group records from each table. Since our use-case involves a single output table, using a `list` of `tuple` is enough.
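One possible shape for that multi-table bookkeeping (the table names and rows below are hypothetical):

```python
from collections import defaultdict

# Group transformed records by their target table.
batches = defaultdict(list)
batches["output_table"].append(("Ada", "Lovelace", 36))
batches["output_table"].append(("Grace", "Hopper", 85))
batches["audit_table"].append(("etl-run", "2024-01-01"))

for table, records in batches.items():
    # In the real flow this would be:
    # await connection.copy_records_to_table(table, records=records, ...)
    print(table, len(records))
```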
When we use the `run_in_executor` method on line 30, our transformation step runs inside a given executor (in our case it’ll run in `pool`, the instance of `ProcessPoolExecutor` we've created in `etl`). Since the transformation function is CPU-bound, executing `transform(batch)` directly here would halt other coroutines until the batch had been fully processed, dragging out execution time.
Moreover, `run_in_executor` returns a `Future` object, not a processed batch. According to the Python docs, [“a `Future` represents an eventual result of an asynchronous operation”](https://docs.python.org/3/library/asyncio-future.html#asyncio.Future), meaning that our data won’t necessarily be ready when line 30 is executed, but will be at some point in the future. This means `transform` will be running in the background, so we can move on and process the next batches concurrently.
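To keep a demo self-contained we can show the same `Future` semantics with a `ThreadPoolExecutor` (the post itself uses a `ProcessPoolExecutor`; `run_in_executor` behaves identically with either):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def transform(batch):
    # Stands in for the CPU-bound transformation step
    return [name.upper() for name in batch]

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        future = loop.run_in_executor(pool, transform, ["ada", "grace"])
        # `future` is awaitable; we could schedule more batches before awaiting
        return await future

print(asyncio.run(main()))  # ['ADA', 'GRACE']
```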
The conditional block starting on line 32 prevents the application from running out of memory by accumulating too many transform results. Whenever `task_set` holds more objects than our pool has available workers, we can speed things up by processing some tasks to obtain transformation results earlier. Since they’re `Future` objects, we must wait for them to be completed. After calling `asyncio.wait`, we’ll wait for a task to become ready. When this happens, we grab a new `done_set` containing all completed tasks and update `task_set` to keep track only of the pending ones.
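A stripped-down sketch of that `FIRST_COMPLETED` pattern, with sleep times standing in for transform work:

```python
import asyncio

async def worker(i):
    await asyncio.sleep(0.01 * (i + 1))
    return i

async def main():
    task_set = {asyncio.create_task(worker(i)) for i in range(3)}
    done_set, task_set = await asyncio.wait(
        task_set, return_when=asyncio.FIRST_COMPLETED
    )
    # done_set holds every task finished so far; task_set keeps the pending ones
    for task in task_set:
        task.cancel()
    return [task.result() for task in done_set]

print(asyncio.run(main()))  # typically [0], the fastest worker
```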
As we now have completed tasks, we can store the results in the database. Since we are going to use our storing procedures elsewhere, we created a `task_set_load_helper` function to avoid code repetition. It’s responsible for iterating over a set of completed tasks and calling the `load` function that will effectively write into the database.
Whenever `load` is called from within `task_set_load_helper`, it will receive a `list` with transformed data and store it into `output_table` in our database. To achieve this, we use the `copy_records_to_table` method from `asyncpg` connection object to bulk insert the records, [which is faster than inserting each record individually](https://github.com/MagicStack/asyncpg/issues/346#issuecomment-416070209). It receives a table to write data into and a `list` of `tuple` objects representing the records to insert. It may additionally receive an iterable with the columns from the table. It’s important to note that each column in this iterable must match the order of the respective value in the objects from the `records` parameter.
When there are no more records to process, we finish creating batches and move on to obtain the last transformation results, in case there are still some objects to work with. We perform a similar procedure to what was done on line 36, but now we have to pass the entire set with the remaining tasks, which could still be pending. That means we must apply the `asyncio.as_completed` function over `task_set` to be assured that we’ll be iterating exclusively over completed tasks as soon as they become available.
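A minimal illustration of `asyncio.as_completed` yielding results in completion order (the delays are arbitrary):

```python
import asyncio

async def worker(i):
    await asyncio.sleep(0.03 - 0.01 * i)
    return i

async def main():
    task_set = {asyncio.create_task(worker(i)) for i in range(3)}
    results = []
    for future in asyncio.as_completed(task_set):
        results.append(await future)  # yields each result as its task finishes
    return results

print(asyncio.run(main()))  # completion order: typically [2, 1, 0]
```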
# Caveats
[SQL injection](https://en.wikipedia.org/wiki/SQL_injection) is one of the main concerns when using `asyncpg`. When using `psycopg2`, we can avoid this issue by composing our queries using the `psycopg2.sql` module objects, such as in the example below:
```
from psycopg2 import sql

table_name = "input_table"
column_name = "name"
value = "foo"

query = sql.SQL(
    "SELECT * FROM {table_name} WHERE {column_name}={value}"
).format(
    table_name=sql.Identifier(table_name),
    column_name=sql.Identifier(column_name),
    value=sql.Literal(value),
)
```
By using `sql.Identifier` and `sql.Literal` we can respectively sanitize identifiers (i.e. table and column names) and literals (i.e. strings and integers) to build parameterized queries, while being assured that the resulting query won’t harm our database due to a maliciously crafted input.
`asyncpg` allows us to build parameterized queries using [syntax similar to native PostgreSQL](https://magicstack.github.io/asyncpg/current/usage.html#asyncpg-usage), with `$n` placeholders providing query arguments. However, since PostgreSQL queries don’t allow us to parameterize table and column names, we are stuck with Python’s string formatting to compose queries with dynamic identifiers. Without proper care, this can lead to catastrophic results if these identifiers are user-provided. A malicious user could send a value to our system that, when processed, leads to a `DROP DATABASE` statement being executed.
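To see why, here is a purely illustrative example of a malicious "table name" surviving naive string formatting (no database involved; the value is invented):

```python
# A user-supplied "table name" smuggling extra SQL into the query:
table_name = "input_table; DROP DATABASE prod; --"
query = f"SELECT * FROM {table_name} WHERE name=$1"
print(query)
# SELECT * FROM input_table; DROP DATABASE prod; -- WHERE name=$1
```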
Build safe systems. Have user profile in mind whenever designing permissions. Allowing admins to dynamically choose tables is sound, but granting an end-user the same privileges may be compromising.
In order to work around these issues we can look how `psycopg2` works behind the scenes with `sql.Identifier`.
```
async def sanitize_identifier(identifier, connection): sanitized = await connection.fetchval("SELECT quote_ident($1)", identifier) return sanitized async def sanitize_table(table, connection): try: dirty_schema, dirty_table = table.split(".") sanitized_schema = await sanitize_identifier(dirty_schema, connection) sanitized_table = await sanitize_identifier(dirty_table, connection) sanitized = f"{sanitized_schema}.{sanitized_table}" except ValueError: sanitized = await sanitize_identifier(table, connection) return sanitized
```
The magic happens on `sanitize_identifier`, where we establish a connection with the database (via an `asyncpg` connection object) and retrieve the result for the `quote_ident` PostgreSQL function, which helps us to sanitize input. Because table name is a literal and not an identifier in this case, notice that we are using the `$n` syntax to pass the dynamic value to our query.
`sanitize_table` builds upon the previous function in order to cover situations where our table name is prefixed by a schema name. We have to sanitize both components separately, not as a single string. After cleaning, we can safely merge them again into a single value using Python’s formatting.
Below is an example of a more robust `extract` function, which would allow us to fetch data from our database filtering the results from a dynamic table and column.
```
async def extract(table_name, column_name, value, connection) async with connection.transaction(): sanitized_table_name = sanitize_table(table_name, connection) sanitized_column_name = sanitize_identifier(column_name, connection) query = f"SELECT * FROM {sanitized_table_name} WHERE {sanitized_column_name}=$1" async for record in connection.cursor(query, value): yield dict(record)
```
# More info
Below are some approaches to our topic that were implemented on third-party solutions:
- `aiomultiprocess` - A Python lib that also combines `asyncio` and `multiprocessing` and implements a solution quite similar to ours. Here's also a [PyCon 2018 talk](https://www.youtube.com/watch?v=0kXaLh8Fz3k&feature=youtu.be) that introduces this library.
- `pypeln` - A Python lib that abstracts the process when creating a concurrent data pipeline.
Special thanks to [Rafael CarĂcio](https://twitter.com/rafaelcaricio) who kindly reviewed this post.
This post will explain how to implement a concurrent ETL (**E**xtract, **T**ransform, **L**oad) flow combining Python `asyncio` with `multiprocessing` to get the best of both worlds. ETL itself is a procedure that starts with data extraction from sources such as a database (or many databases). Following extraction, we apply functions to the data in order to modify it. These transformation functions are generally used for cleaning or converting extracted data. This flow is particularly valuable for data warehousing.
Operations with databases are core parts of many applications. Many of these procedures deal with tables containing millions of rows, which can hugely impact the performance of said applications. With Python we’d most commonly use the `psycopg2` library to perform operations in a PostgreSQL database.
To speed up ETL operations, we can benefit from concurrent code. Python provides us with [`asyncio`](https://docs.python.org/3/library/asyncio.html), a built-in library that allows developers to write concurrent code without dealing with low-level programming such as thread management. In `asyncio`, functions that execute concurrently are called coroutines. For our ETL context, we'll wrap those coroutines into tasks and wait for all of them to be done with the `asyncio.gather` function. Its behavior means our code waits for every concurrent task to complete before it can advance, similar to the barriers used to synchronize threads running concurrently.
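As a minimal sketch (with hypothetical coroutines standing in for the ETL stages), `asyncio.gather` behaves like a barrier: it resumes only after every awaited task has finished, regardless of the order in which they complete.

```
import asyncio

# Hypothetical stages; each sleeps to simulate I/O and logs when done.
results = []

async def stage(name, delay):
    await asyncio.sleep(delay)
    results.append(name)
    return name

async def run_stages():
    # gather acts as a barrier: execution resumes only after
    # every task has completed, and results keep argument order.
    return await asyncio.gather(
        asyncio.create_task(stage("extract", 0.02)),
        asyncio.create_task(stage("transform", 0.01)),
    )

done = asyncio.run(run_stages())
```

Note that `done` preserves the order the tasks were passed in, even though `transform` finishes first.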
For our context, we are going to implement the following steps to model ETL:
1. Fetch records from a database (`extract`), each row having a `name` and an `age`;
2. Perform an operation over the obtained data (`transform`) to split the `name` from each record into `first_name` and `last_name`, while keeping the `age` value unaltered;
3. Write the transformed data into a database (`load`).
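As a point of reference before adding concurrency, the three steps can be sketched sequentially with in-memory stand-ins for the database tables (the data here is hypothetical; the article's real steps use `asyncpg`):

```
# Sequential baseline with hypothetical in-memory "tables".
input_table = [
    {"name": "Ada Lovelace", "age": 36},
    {"name": "Alan Turing", "age": 41},
]
output_table = []

def extract_sync():
    # Step 1: fetch records.
    yield from input_table

def transform_sync(record):
    # Step 2: split name into first and last, keep age unaltered.
    first_name, last_name = record["name"].split(" ", 1)
    return (first_name, last_name, record["age"])

def load_sync(row):
    # Step 3: write the transformed row.
    output_table.append(row)

for record in extract_sync():
    load_sync(transform_sync(record))
```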
Since `psycopg2` doesn’t support `asyncio`, we’ll use [`asyncpg`](https://github.com/MagicStack/asyncpg), one of the available libraries that implement asynchronous database operations, for the extract and load steps that deal with database access.
According to `asyncpg` developers, [the library on average is three times faster than psycopg2](https://github.com/MagicStack/asyncpg#performance). It also provides tools to perform database operations, such as prepared statements and cursors. There are however some missing features, such as a built-in mechanism to prevent SQL injection (there’s a trick to implement it though, and we will discuss it later in the text).
## Synchronizing all tasks
In order to implement an ETL flow, we must provide functions to represent each stage. In a non-concurrent code, we’d have functions such as `fetch_from_db`, `transform_data` and `write_into_db` that would be called sequentially. In a concurrent run, we want each function to run concurrently, starting computation as soon as there’s data to work with while also providing data to the next stage whenever the previous function’s job is done.
To establish this kind of synchronization between each stage, one approach is to implement a consumer & producer architecture with one queue being used to pass data between them.
We begin structuring our code writing the `etl` function:
```
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


async def etl():
    with ProcessPoolExecutor(
        max_workers=multiprocessing.cpu_count(),
    ) as pool:
        loop = asyncio.get_running_loop()
        queue = asyncio.Queue(maxsize=1000)
        await asyncio.gather(
            asyncio.create_task(producer(queue)),
            asyncio.create_task(consumer(loop, pool, queue)),
            return_exceptions=False,
        )


def main():
    asyncio.run(etl())
```
Firstly, our `etl` function must be asynchronous (through the `async` keyword) in order to fully work with `asyncio`.
We'll be using a `ProcessPoolExecutor` to perform our transformation procedures. This allows us to not halt our execution when processing data. Setting the number of workers in the pool to all CPUs our machine has will help to speed things up by fully using the available resources. The usage of these structures will be further explained in a later section.
Notice that when instantiating our queue, we’ve used the `maxsize` parameter. Why do we do that?
First of all, an unbounded queue risks depleting memory. Secondly, the producer would fill the queue indefinitely, and only then would the consumer start running. That would effectively make the code synchronous, as the consumer would only start once the producer had nothing else to do.
Defining a queue size limit makes the consumer start earlier, as soon as the queue reaches the set limit for the first time. When space is freed up, the producer will continue its job and add more elements to the queue, making the code run concurrently. This means the queue is acting as a [backpressure mechanism](https://lucumr.pocoo.org/2020/1/1/async-pressure/). Such a mechanism is beneficial as long as production is faster than consumption, which is often the case. For more details regarding these issues, please check [this article](https://nullprogram.com/blog/2020/05/24/).
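The effect is easy to observe in a toy sketch (the names and delays here are made up): with `maxsize=2`, a fast producer is forced to pause until the slow consumer frees up a slot.

```
import asyncio

events = []

async def fast_producer(queue):
    for i in range(4):
        await queue.put(i)  # blocks once the queue holds 2 items
        events.append(f"put {i}")
    await queue.put(None)  # sentinel: no more data

async def slow_consumer(queue):
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.01)  # simulate slow processing
        events.append(f"got {item}")

async def demo():
    queue = asyncio.Queue(maxsize=2)
    await asyncio.gather(fast_producer(queue), slow_consumer(queue))

asyncio.run(demo())
```

In the recorded `events`, `put 3` only appears after the consumer has logged `got 0`: the bounded queue held the producer back.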
`main` will be responsible for invoking `etl`. While `main` itself is not asynchronous, it uses `asyncio.run` to execute an asynchronous function. After `etl` is over, any code after `asyncio.run(etl())` would be executed as regular synchronous code.
Producer has a fairly straightforward implementation:
```
async def producer(queue):
    connection = await get_asyncpg_connection()
    async for record in extract(connection):
        await queue.put(record)
    await queue.put(None)
    await connection.close()
```
We start by grabbing an `asyncpg` connection object. Once attained, we loop over the records that the `extract` function fetches from our database, inserting each one into the queue for the consumer to process, until there is no more data to fetch. To signal that its job is done, the producer then inserts a `None` value into the queue, which must be handled gracefully by the consumer. Lastly, we close the `asyncpg` connection.
Regarding both `get_asyncpg_connection` and `extract` functions, let’s dive into their implementations.
```
import asyncpg


async def get_asyncpg_connection():
    db_conf = {
        "user": "user_name",
        "password": "password",
        "database": "db_name",
        "host": "host",
        "port": "port",
    }
    connection = await asyncpg.connect(**db_conf)
    return connection


async def extract(connection):
    async with connection.transaction():
        query = "SELECT name, age FROM input_table"
        async for record in connection.cursor(query, prefetch=1000):
            yield record
```
`get_asyncpg_connection` is the only part of the code that explicitly interacts with the `asyncpg` module. It creates a connection with our database using the `asyncpg.connect` method. This method's parameters are quite similar to the ones found at `psycopg2.connect`. To simplify this post, we opted to instantiate a `db_conf` variable inside `get_asyncpg_connection`, using dummy values for the connection parameters. When adapting this to a real context, one would simply replace them with real values.
The `extract` function is a generator that receives an open `asyncpg` connection as its single parameter. It's responsible for accessing a table in our database (in our case, the `input_table` table) via the `cursor` method inside `connection`. This fetches records from the database according to the query defined in `extract` and yields each grabbed row.
Now it’s time to talk about `consumer` implementation:
```
def transform(batch):
    transformed_batch = []
    for record in batch:
        first_name, last_name = record["name"].split(" ", 1)
        age = record["age"]
        transformed_batch.append((first_name, last_name, age))
    return transformed_batch


async def load(batch, connection):
    async with connection.transaction():
        columns = ("first_name", "last_name", "age")
        await connection.copy_records_to_table(
            "output_table", records=batch, columns=columns
        )


async def task_set_load_helper(task_set, connection):
    for future in task_set:
        await load(await future, connection)


async def consumer(loop, pool, queue):
    connection = await get_asyncpg_connection()
    task_set = set()
    batch = []
    while True:
        record = await queue.get()
        if record is not None:
            record = dict(record)
            batch.append(record)
        if queue.empty():
            task = loop.run_in_executor(pool, transform, batch)
            task_set.add(task)
            if len(task_set) >= pool._max_workers:
                done_set, task_set = await asyncio.wait(
                    task_set, return_when=asyncio.FIRST_COMPLETED
                )
                await task_set_load_helper(done_set, connection)
            batch = []
        if record is None:
            break
    if task_set:
        await task_set_load_helper(
            asyncio.as_completed(task_set), connection
        )
    await connection.close()
```
The `consumer` function is responsible for grabbing the records our producer stores in the queue and adding them to a batch. We must remember to convert each record into a `dict` before adding it to the batch in order to process the data later - not doing so would raise a `TypeError: can't pickle asyncpg.Record objects` exception. Whenever the queue becomes empty, the consumer will start running the transformation step over the batch it built.
The `transform` function receives a batch from the consumer, which is nothing more than a `list` of `dict` objects fetched from the database. Since we want to split the full names from each record into first and last names, we start by looping over the batch and grabbing every record. Our context is simple, so the transformation is pretty straightforward. After obtaining the transformed data we create a `tuple` containing each value and finally add it to a `list` representing a modified batch.
As applications are more complex and involve multiple output tables, it’s recommended to keep track of the columns and which table the records should be inserted into. This can be achieved by using a `dict` to group records from each table. Since our use-case involves a single output table, using a `list` of `tuple` is enough.
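A minimal sketch of that grouping (table names and records here are hypothetical):

```
from collections import defaultdict

# Hypothetical transformed records, each tagged with its target table.
transformed = [
    ("people_table", ("Ada", "Lovelace", 36)),
    ("pets_table", ("Rex", 3)),
    ("people_table", ("Alan", "Turing", 41)),
]

# Group records by destination so each table can be bulk-loaded once.
batches_by_table = defaultdict(list)
for table, row in transformed:
    batches_by_table[table].append(row)
```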
When we use the `run_in_executor` method, our transformation step runs inside a given executor (in our case it runs in `pool`, the `ProcessPoolExecutor` instance we created in `etl`). Since the transformation function is CPU-bound, calling `transform(batch)` directly here would halt other coroutines until the batch had been fully processed, dragging out execution time.
Moreover, `run_in_executor` returns a `Future` object, not a processed batch. According to the Python docs, [“a `Future` represents an eventual result of an asynchronous operation”](https://docs.python.org/3/library/asyncio-future.html#asyncio.Future), meaning that our data won’t necessarily be ready when `run_in_executor` returns, but will be at some point in the future. This means `transform` will be running in the background, so we can move on to process the next batches concurrently.
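The `Future` behavior can be seen in isolation in the sketch below; it uses the default thread-pool executor (passing `None`) just to stay self-contained, whereas the pipeline above hands `run_in_executor` a `ProcessPoolExecutor` because the transform is CPU-bound.

```
import asyncio
import time

def blocking_transform(batch):
    # Stand-in for a CPU-bound transform step.
    time.sleep(0.01)
    return [value * 2 for value in batch]

async def main():
    loop = asyncio.get_running_loop()
    # run_in_executor returns a Future immediately; the data is not
    # ready yet, so the coroutine is free to do other work first.
    future = loop.run_in_executor(None, blocking_transform, [1, 2, 3])
    other_work = "next batch prepared while transform runs"
    return await future, other_work

result, note = asyncio.run(main())
```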
The conditional block guarded by `len(task_set) >= pool._max_workers` prevents the application from running out of memory by accumulating too many transform results. Whenever `task_set` holds more objects than the pool has available workers, we can speed things up and process some tasks to obtain transformation results earlier. Since they’re `Future` objects, we must wait for them to be completed. After calling `asyncio.wait`, we’ll wait for a task to become ready. When this happens, we grab a new `done_set` containing all completed tasks and update `task_set` to keep track only of the pending ones.
As we now have completed tasks, we can store the results in the database. Since we are going to use our storing procedures elsewhere, we created a `task_set_load_helper` helper function to avoid code repetition. It’s responsible for iterating over a set of completed tasks and calling the `load` function that will effectively write into database.
Whenever `load` is called from within `task_set_load_helper`, it will receive a `list` with transformed data and store it into `output_table` in our database. To achieve this, we use the `copy_records_to_table` method from `asyncpg` connection object to bulk insert the records, [which is faster than inserting each record individually](https://github.com/MagicStack/asyncpg/issues/346#issuecomment-416070209). It receives a table to write data into and a `list` of `tuple` objects representing the records to insert. It may additionally receive an iterable with the columns from the table. It’s important to note that each column in this iterable must match the order of the respective value in the objects from the `records` parameter.
When there are no more records to process, we finish creating batches and move on to obtain the last transformation results, in case there are still some objects to work with. We perform a procedure similar to the earlier call to `task_set_load_helper`, but now we have to pass the entire set of remaining tasks, which could still be pending. That means we must apply the `asyncio.as_completed` function over `task_set` to be assured that we’ll be iterating exclusively over completed tasks as soon as they become available.
## Caveats
[SQL injection](https://en.wikipedia.org/wiki/SQL_injection) is one of the main concerns when using `asyncpg`. When using `psycopg2`, we can avoid this issue by composing our queries using the `psycopg2.sql` module objects, such as in the example below:
```
from psycopg2 import sql

table_name = "input_table"
column_name = "name"
value = "foo"

query = sql.SQL(
    "SELECT * FROM {table_name} WHERE {column_name}={value}"
).format(
    table_name=sql.Identifier(table_name),
    column_name=sql.Identifier(column_name),
    value=sql.Literal(value),
)
```
By using `sql.Identifier` and `sql.Literal` we can respectively sanitize identifiers (i.e., table and column names) and literals (i.e., strings and integers) to build parameterized queries, while being assured that the resulting query won’t harm our database due to a maliciously crafted input.
`asyncpg` allows us to build parameterized queries using [syntax similar to the one used by native PostgreSQL](https://magicstack.github.io/asyncpg/current/usage.html#asyncpg-usage), using the `$n` syntax to provide query arguments. However, since PostgreSQL queries don’t allow us to parameterize table and column names, we are stuck with using Python’s string formatting to compose queries with dynamic identifiers. Without proper care, this can lead to catastrophic results in case these identifiers are user-provided. A malicious user can send a value to our system that, when processed, leads to a `DROP DATABASE` statement being executed.
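A contrived example of the danger: if the identifier comes straight from user input, plain string formatting will happily build the hostile query (the table name below is of course hypothetical).

```
# User-controlled "table name" smuggling in a second statement.
malicious_table = "input_table; DROP DATABASE production; --"

# Naive interpolation produces exactly what the attacker wants.
query = f"SELECT * FROM {malicious_table} WHERE name=$1"
```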
Build safe systems. Keep the user’s profile in mind whenever designing permissions. Allowing admins to dynamically choose tables is sound, but granting an end user the same privileges can be compromising.
In order to work around these issues, we can look at how `psycopg2` works behind the scenes with `sql.Identifier`.
```
async def sanitize_identifier(identifier, connection):
    sanitized = await connection.fetchval("SELECT quote_ident($1)", identifier)
    return sanitized


async def sanitize_table(table, connection):
    try:
        dirty_schema, dirty_table = table.split(".")
        sanitized_schema = await sanitize_identifier(dirty_schema, connection)
        sanitized_table = await sanitize_identifier(dirty_table, connection)
        sanitized = f"{sanitized_schema}.{sanitized_table}"
    except ValueError:
        sanitized = await sanitize_identifier(table, connection)
    return sanitized
```
The magic happens in `sanitize_identifier`, where we use the database connection (an `asyncpg` connection object) to retrieve the result of the `quote_ident` PostgreSQL function, which sanitizes the input for us. Notice that we use the `$n` syntax to pass the dynamic value to the query: here the table name is a literal (a string argument), not an identifier.
`sanitize_table` builds upon the previous function in order to cover situations where our table name is prefixed by a schema name. We have to sanitize both components separately, not as a single string. After cleaning, we can safely merge them again into a single value using Python’s formatting.
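To get a feel for what `quote_ident` does server-side, here is a rough pure-Python approximation (it ignores reserved words, which the real function also quotes; the actual sanitization in this post always happens in PostgreSQL via `SELECT quote_ident($1)`):

```
import re

def quote_ident_approx(identifier):
    # Already a plain lowercase identifier: safe to leave unquoted.
    if re.fullmatch(r"[a-z_][a-z0-9_]*", identifier):
        return identifier
    # Otherwise double-quote it, escaping embedded double quotes,
    # so the value can only ever be read as a single identifier.
    return '"' + identifier.replace('"', '""') + '"'
```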
Below is an example of a more robust `extract` function, which would allow us to fetch data from our database filtering the results from a dynamic table and column.
```
async def extract(table_name, column_name, value, connection):
    async with connection.transaction():
        sanitized_table_name = await sanitize_table(table_name, connection)
        sanitized_column_name = await sanitize_identifier(column_name, connection)
        query = f"SELECT * FROM {sanitized_table_name} WHERE {sanitized_column_name}=$1"
        async for record in connection.cursor(query, value):
            yield dict(record)
```
## More info
Below are some approaches to our topic that were implemented on third-party solutions:
- `aiomultiprocess` - A Python lib that also combines `asyncio` and `multiprocessing` and implements a solution quite similar to ours. Here's also a [PyCon 2018 talk](https://www.youtube.com/watch?v=0kXaLh8Fz3k&feature=youtu.be) that introduces this library.
- `pypeln` - A Python lib that abstracts the process of creating a concurrent data pipeline.
Special thanks to [Rafael Carício](https://twitter.com/rafaelcaricio) who kindly reviewed this post.