🕷️ Crawler Inspector

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 143 (from laksa109)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa143.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d 'SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical, ml_categories_json AS ml_categories_json, ml_types_json AS ml_types_json, ml_intent_types_json AS ml_intent_types_json, meta_language AS meta_language, attrs_author AS attrs_author, ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time, ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time, ifNull(attrs_is_republished, 0) AS attrs_is_republished, ifNull(attrs_nr_words, 0) AS attrs_nr_words, ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words, ifNull(body_ext_links_number, 0) AS body_ext_links_number, ifNull(body_int_links_number, 0) AS body_int_links_number, ifNull(meta_nofollow, 0) AS meta_nofollow, ifNull(meta_noarchive, 0) AS meta_noarchive, ifNull(props_was_rendered, 0) AS props_was_rendered, ifNull(src_redirect, \'\') AS src_redirect, ifNull(download_time_msec, 0) AS download_time_msec, ifNull(download_ttfb_msec, 0) AS download_ttfb_msec, ifNull(download_size, 0) AS download_size FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL(\'https://enccs.github.io/sycl-workshop/queues-cgs-kernels/\')), getAhrefsUnparsedNoserviceFromURL(\'https://enccs.github.io/sycl-workshop/queues-cgs-kernels/\'))) FORMAT JSONEachRow'

Response:

{"found_url":"https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/","crawl_time":1774144840,"first_indexed_time":0,"http_code":200,"src_unparsed":"io,github!enccs,\/sycl-workshop\/queues-cgs-kernels\/ s443","src_root_hash":"2566890010099092343","history_drop_reason":null,"meta_title":"Queues, command groups, and kernels — Heterogeneous programming with SYCL documentation","meta_descriptions":[],"attrs_boilerpipe_text":"Questions\nHow do we organize work in a SYCL application?\nObjectives\nLearn about queues to describe ordering of operations.\nCommand groups.\nUnderstand that kernels are units of parallelism in SYCL.\nSYCL\nqueue\nobjects are the abstraction connecting a host program to a single\ndevice.  The\nqueue\nis a central abstraction in\nSYCL\n: all device code is\nsubmitted\nto a queue as\nactions\n. The runtime\nschedules\nthe actions and\nexecutes them\nasynchronously\n.  The runtime keeps track of action\nprerequisites in its scheduling, for example, availability of data.\nWe can state that the tracking of actions and their dependencies is the essence\nof\nSYCL\n.  The SYCL standard models our program as a\ntask graph\n, a set of\nnodes\nconnected by\nedges\n:\nNodes\nare actions to be performed on a device, such as the invocation of a\nkernel or explicit data movements.\nEdges\nare dependencies between the actions and express when it’s legal for\na node to execute. 
Edges arise most often because of data dependencies between\nnodes.\nThe task graph is a\ndirected acyclic graph (DAG)\n: it has a well-defined\nstart-to-finish direction and no nodes are self-connected.\nThe SYCL runtime can resolve dependencies and thus\ngenerate\nthe task graph.\nFurthermore, it can schedule how to execute the nodes,\ni.e.\ntraversal\nof\nthe task graph, in a completely asynchronous manner from the execution of the\nhost code.\nWe will see in\nThe task graph: data, dependencies, synchronization\nhow to manually modify the\ntask graph.\nTwo kinds of actions can be part of the task graph:\nExecution of device code\nThese actions add nodes to the graph that will, eventually, execute device\ncode. They accept kernel code and its execution space as argument and you\ninvoke them as methods on the\nqueue\nclass directly or on the\nhandler\nclass. They come in three flavors, which represent different abstractions for\nwork distribution in SYCL:\nsingle_task\n: as the name says, this will execute one single instance of\nthe kernel code.\nparallel_for\n: this will launch a kernel with given work-size\nspecification in a single instruction, multiple threads (SIMT) fashion.\nparallel_for_work_group\n: launches a kernel with hierarchical\nparallelism. This is only available on the\nhandler\nclass.\nExplicit memory operations.\nThese actions add nodes to the graph that will, eventually, perform data migrations.\nYou invoke them as methods on the\nqueue\nclass directly or on the\nhandler\nclass:\ncopy\n: copies data.\nupdate_host\n: updates data in the buffer on the host-side.\nfill\n: initializes data in a buffer to the given value.\nWe have given a high-level overview of the abstractions in the execution model:\nfrom the queue to the execution on a device, passing through submission of work,\ndescribed as a data-parallel kernel.\nBut how do we write a kernel?\nKernels\n\nKernels are the fundamental building blocks for performing work in a SYCL\nprogram. 
We will only consider two ways of writing kernels in SYCL:\nlambda expressions\nKernels as lambdas are very concise, thanks especially to the\ncapture\nsyntax. They cannot be templated and might be cumbersome to reuse. In some\ncases, lambdas can be too terse.\n+1 as a lambda\n[\n=\n](\nid\n<\n1\n>\nidx\n)\n->\nvoid\n{\ndata_acc\n[\nidx\n]\n+=\n1\n;\n}\nfunction objects\nA kernel is a class that overloads\noperator()\nfunction call operator. They\ncan be templated, easily reused, and give full control over what data is\npassed in and out.  They are more verbosee.\n+1 as a function object\nclass\nPlusOne\n{\npublic\n:\nPlusOne\n(\naccessor\n<\nint\n>\nacc\n)\n:\ndata_acc_\n(\nacc\n)\n{}\nvoid\noperator\n()(\nid\n<\n1\n>\nidx\n)\n{\ndata_acc\n[\nidx\n]\n+=\n1\n;\n}\nprivate\n:\naccessor\n<\nint\n>\ndata_acc_\n;\n};\nThere are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. Regardless of the chosen style, kernel code has some restrictions:\nIt must have\nvoid\nas return type.\nIt cannot use\nruntime type identification (RTTI)\n.\nIt cannot dynamic allocate memory.\nQueues\n\nOne queue maps to one device: the mapping happens upon construction of a\nqueue\nobject and cannot be changed subsequently.\nIt is not possible to use a single\nqueue\nobject to:\nmanage more than one device. The runtime would face ambiguities in deciding\nwhich device should actually do the work!\nspread enqueued work over multiple devices.\nWhile these might appear as limitations, we are free to declare as many\nqueue\nobject as we like in our programs. It is also valid to create multiple\nqueues to the\nsame\ndevice.  Thus, the relation between queues and devices is\nmany-to-one\n.\nWork on a device can be enqueued with the shortcut methods described above. 
For\nexample, we can launch a data-parallel kernel with\nparallel_for\ninvoked on\nthe desired queue object:\nCreating work on a device using\nqueue\nshortcuts.\nauto\nQ\n=\nqueue\n{\nmy_selector\n{}};\nQ\n.\nparallel_for\n(\nrange\n<\n1\n>\n{\nsz\n},\n[\n=\n](\nauto\n&\nidx\n){\n\/* kernel code *\/\n});\nCommand groups\n\nA command group handler gives more control over how code is submitted to the\nqueue. Submission is slightly more verbose, but we get access to features of\nhierarchical parallelism.\nThe abstraction for command groups is the class\nhandler\n: these objects are\nconstructed for us by the SYCL runtime.  As such, we will meet them only as\narguments of the lambda functions passed to the\nsubmit\nmethod of our queues.\nA command group handler contains:\nhost code, to set up the dependencies of the corresponding node in the task graph.\nHost code is executed immediately upon submission.\nexactly one\naction of the ones described above. The action executes\nasynchronously on the device.  Parallel work actions will, furthermore, need\nan execution range and a kernel function.\nCreating work on a device using a command group\nhandler\n.\nauto\nQ\n=\nqueue\n{\nmy_selector\n{}};\nQ\n.\nsubmit\n([\n&\n](\nhandler\n&\ncgh\n){\n\/* host code: sets up the dependencies of this node. It executes **immediately!** *\/\naccessor\nacc\n{\nB\n,\nh\n};\n\/* exactly **one** of the available actions. It executes **asynchronously** *\/\ncgh\n.\nparallel_for\n(\nrange\n<\n1\n>\n{\nsz\n},\n[\n=\n](\nauto\n&\nidx\n){\n\/* kernel code *\/\n});\n});\nsingle_task\nand streams\nWe’ll walk through the use of the\nsingle_task\nmethod to create work on a\ndevice.\nAs the name suggests, this will create a task for sequential execution:\nprobably not a method you will use often, but definitely something to be\naware of!\nThe task we would like to perform is a print-out on the device. If you are\nfamiliar with CUDA\/HIP, you probably know that\nprintf\ncan be used in\ndevice code. 
In keeping with C++, the SYCL standard defines a\nstream\nclass, which works similar to the standard streams. A SYCL stream needs a\nhandler\nobject on construction:\nauto\nout\n=\nstream\n(\n1024\n,\n\/* maximum size of output per kernel invocation *\/\n256\n,\n\/* maximum size before flushing the stream *\/\ncgh\n);\nSYCL streams behave just like standard C++ streams. We can write something to\na stream using\noperator<<\n:\nout\n<<\n\"my message\"\n<<\nstd\n::\nendl\n;\nYou can find a scaffold for the code in the\ncontent\/code\/day-1\/04_single-task\/single-task.cpp\nfile,\nalongside the CMake script to build the executable. You will have to complete\nthe source code to compile and run correctly: follow the hints in the source\nfile.  A working solution is in the\nsolution\nsubfolder.\nCreate a queue object. You’re free to use any of the device selection\nstrategies we have encountered in the previous episode.\nSubmit work to the queue using a command handler group.\nCreate a\nstream\nobject.\nCreate a single task on the\nhandler\nprinting a string to the stream. A\nsingle_task\nonly accepts a function with no input arguments as\nparameter:\ncgh\n.\nsingle_task\n([\n=\n](){\n\/* task code *\/\n});\nKeypoints\nOne queue maps to one device, such that there is no ambiguity in\nspreading work.\nA program can have as many queues as desired. Multiple queues can use the\nsame device: the queue-device mapping is many-to-one.\nEnqueing actions can happen by submitting\ncommand groups\nusing the\nhandler\nclass.\nYou can also enqueue actions with\nshortcut\nmethods on the\nqueue\nclass.\nWork can be enqueued with a command group handler. 
This gives more\nflexibility over the definition of the corresponding node in the task\ngraph.\nKernels are\ncallables\n: either lambda\nfunctions or function objects.\nKernel code cannot use neither RTTI nor dynamic memory allocation.","attrs_markdown":"[Heterogeneous programming with SYCL ![Logo](https:\/\/enccs.github.io\/sycl-workshop\/_static\/ENCCS.jpg)](https:\/\/enccs.github.io\/sycl-workshop\/)\n\n- [Setting up your system](https:\/\/enccs.github.io\/sycl-workshop\/karolina\/)\n\nThe lesson\n\n- [What is SYCL?](https:\/\/enccs.github.io\/sycl-workshop\/what-is-sycl\/)\n- [Device discovery](https:\/\/enccs.github.io\/sycl-workshop\/device-discovery\/)\n- [Queues, command groups, and kernels](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/)\n  - [Kernels](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#kernels)\n  - [Queues](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#queues)\n  - [Command groups](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#command-groups)\n- [Data management with buffers and accessors](https:\/\/enccs.github.io\/sycl-workshop\/buffers-accessors\/)\n- [Data management with unified shared memory](https:\/\/enccs.github.io\/sycl-workshop\/unified-shared-memory\/)\n- [Expressing parallelism with SYCL: basic data-parallel kernels](https:\/\/enccs.github.io\/sycl-workshop\/expressing-parallelism-basic\/)\n- [Expressing parallelism with SYCL: nd-range data-parallel kernels](https:\/\/enccs.github.io\/sycl-workshop\/expressing-parallelism-nd-range\/)\n- [The task graph: data, dependencies, synchronization](https:\/\/enccs.github.io\/sycl-workshop\/task-graphs-synchronization\/)\n- [Heat equation mini-app](https:\/\/enccs.github.io\/sycl-workshop\/heat-equation\/)\n- [Using sub-groups in SYCL](https:\/\/enccs.github.io\/sycl-workshop\/sub-groups\/)\n- [Profiling SYCL applications](https:\/\/enccs.github.io\/sycl-workshop\/profiling\/)\n- [Buffer-accessor model *vs* unified shared 
memory](https:\/\/enccs.github.io\/sycl-workshop\/buffer-accessor-vs-usm\/)\n\nReference\n\n- [Quick Reference](https:\/\/enccs.github.io\/sycl-workshop\/quick-reference\/)\n- [Bibliography](https:\/\/enccs.github.io\/sycl-workshop\/zbibliography\/)\n- [Instructor’s guide](https:\/\/enccs.github.io\/sycl-workshop\/guide\/)\n\n[Heterogeneous programming with SYCL](https:\/\/enccs.github.io\/sycl-workshop\/)\n\n- Queues, command groups, and kernels\n- [Edit on GitHub](https:\/\/github.com\/ENCCS\/sycl-workshop\/blob\/main\/content\/queues-cgs-kernels.rst)\n***\n\n# Queues, command groups, and kernels[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#queues-command-groups-and-kernels \"Permalink to this heading\")\nQuestions\n\n- How do we organize work in a SYCL application?\n\nObjectives\n\n- Learn about queues to describe ordering of operations.\n- Command groups.\n- Understand that kernels are units of parallelism in SYCL.\n\nSYCL `queue` objects are the abstraction connecting a host program to a single device. The [queue](https:\/\/enccs.github.io\/sycl-workshop\/quick-reference\/#term-queue) is a central abstraction in [SYCL](https:\/\/www.khronos.org\/sycl\/): all device code is **submitted** to a queue as *actions*. The runtime **schedules** the actions and executes them **asynchronously**. The runtime keeps track of action prerequisites in its scheduling, for example, availability of data. We can state that the tracking of actions and their dependencies is the essence of [SYCL](https:\/\/www.khronos.org\/sycl\/). The SYCL standard models our program as a **task graph**, a set of *nodes* connected by *edges*:\n\n- **Nodes** are actions to be performed on a device, such as the invocation of a kernel or explicit data movements.\n- **Edges** are dependencies between the actions and express when it’s legal for a node to execute. 
Edges arise most often because of data dependencies between nodes.\n\nThe task graph is a *directed acyclic graph (DAG)*: it has a well-defined start-to-finish direction and no nodes are self-connected. The SYCL runtime can resolve dependencies and thus **generate** the task graph. Furthermore, it can schedule how to execute the nodes, *i.e.* **traversal** of the task graph, in a completely asynchronous manner from the execution of the host code. We will see in [The task graph: data, dependencies, synchronization](https:\/\/enccs.github.io\/sycl-workshop\/task-graphs-synchronization\/#task-graphs-synchronization) how to manually modify the task graph.\n\nTwo kinds of actions can be part of the task graph:\n\nExecution of device code\n\nThese actions add nodes to the graph that will, eventually, execute device code. They accept kernel code and its execution space as argument and you invoke them as methods on the `queue` class directly or on the `handler` class. They come in three flavors, which represent different abstractions for work distribution in SYCL:\n\n- `single_task`: as the name says, this will execute one single instance of the kernel code.\n- `parallel_for`: this will launch a kernel with given work-size specification in a single instruction, multiple threads (SIMT) fashion.\n- `parallel_for_work_group`: launches a kernel with hierarchical parallelism. This is only available on the `handler` class.\n\nExplicit memory operations.\n\nThese actions add nodes to the graph that will, eventually, perform data migrations. 
You invoke them as methods on the `queue` class directly or on the `handler` class:\n\n- `copy`: copies data.\n- `update_host`: updates data in the buffer on the host-side.\n- `fill`: initializes data in a buffer to the given value.\n\nWe have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel.\n\nBut how do we write a kernel?\n\n## Kernels[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#kernels \"Permalink to this heading\")\nKernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL:\n\n[lambda expressions](https:\/\/en.cppreference.com\/w\/cpp\/language\/lambda)\n\nKernels as lambdas are very concise, thanks especially to the *capture* syntax. They cannot be templated and might be cumbersome to reuse. In some cases, lambdas can be too terse.\n\n\\+1 as a lambda\n```\n[=](id<1> idx) -> void {\n  data_acc[idx] += 1;\n}\n```\n\n[function objects](https:\/\/en.cppreference.com\/w\/cpp\/utility\/functional)\n\nA kernel is a class that overloads `operator()` function call operator. They can be templated, easily reused, and give full control over what data is passed in and out. They are more verbosee.\n\n\\+1 as a function object\n```\nclass PlusOne {\n  public:\n   PlusOne(accessor<int> acc) : data_acc_(acc) {}\n\n   void operator()(id<1> idx) {\n     data_acc[idx] += 1;\n   }\n\n  private:\n   accessor<int> data_acc_;\n};\n```\n\nThere are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. 
Regardless of the chosen style, kernel code has some restrictions:\n\n- It must have `void` as return type.\n- It cannot use [runtime type identification (RTTI)](https:\/\/en.m.wikibooks.org\/wiki\/C%2B%2B_Programming\/RTTI).\n- It cannot dynamic allocate memory.\n\n## Queues[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#queues \"Permalink to this heading\")\nOne queue maps to one device: the mapping happens upon construction of a `queue` object and cannot be changed subsequently. It is not possible to use a single `queue` object to:\n\n- manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work\\!\n- spread enqueued work over multiple devices.\n\nWhile these might appear as limitations, we are free to declare as many `queue` object as we like in our programs. It is also valid to create multiple queues to the *same* device. Thus, the relation between queues and devices is **many-to-one**.\n\nWork on a device can be enqueued with the shortcut methods described above. For example, we can launch a data-parallel kernel with `parallel_for` invoked on the desired queue object:\n\nCreating work on a device using `queue` shortcuts.\n```\nauto Q = queue{my_selector{}};\n\nQ.parallel_for(range<1>{sz}, [=](auto &idx){\n  \/* kernel code *\/\n});\n```\n\n## Command groups[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#command-groups \"Permalink to this heading\")\nA command group handler gives more control over how code is submitted to the queue. Submission is slightly more verbose, but we get access to features of hierarchical parallelism. The abstraction for command groups is the class `handler`: these objects are constructed for us by the SYCL runtime. As such, we will meet them only as arguments of the lambda functions passed to the `submit` method of our queues. 
A command group handler contains:\n\n- host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission.\n- **exactly one** action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function.\n\nCreating work on a device using a command group `handler`.\n```\nauto Q = queue{my_selector{}};\n\nQ.submit([&](handler &cgh){\n \/* host code: sets up the dependencies of this node. It executes **immediately!** *\/\n accessor acc{B, h};\n\n \/* exactly **one** of the available actions. It executes **asynchronously** *\/\n cgh.parallel_for(range<1>{sz}, [=](auto &idx){\n    \/* kernel code *\/\n });\n});\n```\n\n`single_task` and streams\n\nWe’ll walk through the use of the `single_task` method to create work on a device. As the name suggests, this will create a task for sequential execution: probably not a method you will use often, but definitely something to be aware of! The task we would like to perform is a print-out on the device. If you are familiar with CUDA\/HIP, you probably know that `printf` can be used in device code. In keeping with C++, the SYCL standard defines a `stream` class, which works similar to the standard streams. A SYCL stream needs a `handler` object on construction:\n```\nauto out = stream(1024, \/* maximum size of output per kernel invocation *\/\n                   256, \/* maximum size before flushing the stream *\/\n                   cgh);\n```\nSYCL streams behave just like standard C++ streams. We can write something to a stream using `operator<<`:\n```\nout << \"my message\" << std::endl;\n```\nYou can find a scaffold for the code in the `content\/code\/day-1\/04_single-task\/single-task.cpp` file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. 
A working solution is in the `solution` subfolder.\n\n1. Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode.\n2. Submit work to the queue using a command handler group.\n3. Create a `stream` object.\n4. Create a single task on the `handler` printing a string to the stream. A `single_task` only accepts a function with no input arguments as parameter:\n   ```\n   cgh.single_task([=](){\n  \/* task code *\/\n});\n   ```\n\nKeypoints\n\n- One queue maps to one device, such that there is no ambiguity in spreading work.\n- A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one.\n- Enqueing actions can happen by submitting **command groups** using the `handler` class.\n- You can also enqueue actions with *shortcut* methods on the `queue` class.\n- Work can be enqueued with a command group handler. This gives more flexibility over the definition of the corresponding node in the task graph.\n- Kernels are [callables](https:\/\/en.cppreference.com\/w\/cpp\/named_req\/Callable): either lambda functions or function objects.\n- Kernel code cannot use neither RTTI nor dynamic memory allocation.\n\n[Previous](https:\/\/enccs.github.io\/sycl-workshop\/device-discovery\/ \"Device discovery\") [Next](https:\/\/enccs.github.io\/sycl-workshop\/buffers-accessors\/ \"Data management with buffers and accessors\")\n***\n© Copyright 2021, Roberto Di Remigio and individual contributors..\n\nBuilt with [Sphinx](https:\/\/www.sphinx-doc.org\/) using a [theme](https:\/\/github.com\/readthedocs\/sphinx_rtd_theme) provided by [Read the Docs](https:\/\/readthedocs.org\/).","attrs_readable_markdown":"Questions\n\n- How do we organize work in a SYCL application?\n\nObjectives\n\n- Learn about queues to describe ordering of operations.\n- Command groups.\n- Understand that kernels are units of parallelism in SYCL.\n\nSYCL `queue` objects are the 
abstraction connecting a host program to a single device. The [queue](https:\/\/enccs.github.io\/sycl-workshop\/quick-reference\/#term-queue) is a central abstraction in [SYCL](https:\/\/www.khronos.org\/sycl\/): all device code is **submitted** to a queue as *actions*. The runtime **schedules** the actions and executes them **asynchronously**. The runtime keeps track of action prerequisites in its scheduling, for example, availability of data. We can state that the tracking of actions and their dependencies is the essence of [SYCL](https:\/\/www.khronos.org\/sycl\/). The SYCL standard models our program as a **task graph**, a set of *nodes* connected by *edges*:\n\n- **Nodes** are actions to be performed on a device, such as the invocation of a kernel or explicit data movements.\n- **Edges** are dependencies between the actions and express when it’s legal for a node to execute. Edges arise most often because of data dependencies between nodes.\n\nThe task graph is a *directed acyclic graph (DAG)*: it has a well-defined start-to-finish direction and no nodes are self-connected. The SYCL runtime can resolve dependencies and thus **generate** the task graph. Furthermore, it can schedule how to execute the nodes, *i.e.* **traversal** of the task graph, in a completely asynchronous manner from the execution of the host code. We will see in [The task graph: data, dependencies, synchronization](https:\/\/enccs.github.io\/sycl-workshop\/task-graphs-synchronization\/#task-graphs-synchronization) how to manually modify the task graph.\n\nTwo kinds of actions can be part of the task graph:\n\nExecution of device code\n\nThese actions add nodes to the graph that will, eventually, execute device code. They accept kernel code and its execution space as argument and you invoke them as methods on the `queue` class directly or on the `handler` class. 
They come in three flavors, which represent different abstractions for work distribution in SYCL:\n\n- `single_task`: as the name says, this will execute one single instance of the kernel code.\n- `parallel_for`: this will launch a kernel with given work-size specification in a single instruction, multiple threads (SIMT) fashion.\n- `parallel_for_work_group`: launches a kernel with hierarchical parallelism. This is only available on the `handler` class.\n\nExplicit memory operations.\n\nThese actions add nodes to the graph that will, eventually, perform data migrations. You invoke them as methods on the `queue` class directly or on the `handler` class:\n\n- `copy`: copies data.\n- `update_host`: updates data in the buffer on the host-side.\n- `fill`: initializes data in a buffer to the given value.\n\nWe have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel.\n\nBut how do we write a kernel?\n\n## Kernels[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#kernels \"Permalink to this heading\")\nKernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL:\n\n[lambda expressions](https:\/\/en.cppreference.com\/w\/cpp\/language\/lambda)\n\nKernels as lambdas are very concise, thanks especially to the *capture* syntax. They cannot be templated and might be cumbersome to reuse. In some cases, lambdas can be too terse.\n\n\\+1 as a lambda\n```\n[=](id<1> idx) -> void {\n  data_acc[idx] += 1;\n}\n```\n\n[function objects](https:\/\/en.cppreference.com\/w\/cpp\/utility\/functional)\n\nA kernel is a class that overloads `operator()` function call operator. They can be templated, easily reused, and give full control over what data is passed in and out. 
They are more verbosee.\n\n\\+1 as a function object\n```\nclass PlusOne {\n  public:\n   PlusOne(accessor<int> acc) : data_acc_(acc) {}\n\n   void operator()(id<1> idx) {\n     data_acc[idx] += 1;\n   }\n\n  private:\n   accessor<int> data_acc_;\n};\n```\n\nThere are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. Regardless of the chosen style, kernel code has some restrictions:\n\n- It must have `void` as return type.\n- It cannot use [runtime type identification (RTTI)](https:\/\/en.m.wikibooks.org\/wiki\/C%2B%2B_Programming\/RTTI).\n- It cannot dynamic allocate memory.\n\n## Queues[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#queues \"Permalink to this heading\")\nOne queue maps to one device: the mapping happens upon construction of a `queue` object and cannot be changed subsequently. It is not possible to use a single `queue` object to:\n\n- manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work\\!\n- spread enqueued work over multiple devices.\n\nWhile these might appear as limitations, we are free to declare as many `queue` object as we like in our programs. It is also valid to create multiple queues to the *same* device. Thus, the relation between queues and devices is **many-to-one**.\n\nWork on a device can be enqueued with the shortcut methods described above. For example, we can launch a data-parallel kernel with `parallel_for` invoked on the desired queue object:\n\nCreating work on a device using `queue` shortcuts.\n```\nauto Q = queue{my_selector{}};\n\nQ.parallel_for(range<1>{sz}, [=](auto &idx){\n  \/* kernel code *\/\n});\n```\n\n## Command groups[](https:\/\/enccs.github.io\/sycl-workshop\/queues-cgs-kernels\/#command-groups \"Permalink to this heading\")\nA command group handler gives more control over how code is submitted to the queue. 
Submission is slightly more verbose, but we get access to features of hierarchical parallelism. The abstraction for command groups is the class `handler`: these objects are constructed for us by the SYCL runtime. As such, we will meet them only as arguments of the lambda functions passed to the `submit` method of our queues. A command group handler contains:\n\n- host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission.\n- **exactly one** action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function.\n\nCreating work on a device using a command group `handler`.\n```\nauto Q = queue{my_selector{}};\n\nQ.submit([&](handler &cgh){\n \/* host code: sets up the dependencies of this node. It executes **immediately!** *\/\n accessor acc{B, h};\n\n \/* exactly **one** of the available actions. It executes **asynchronously** *\/\n cgh.parallel_for(range<1>{sz}, [=](auto &idx){\n    \/* kernel code *\/\n });\n});\n```\n\n`single_task` and streams\n\nWe’ll walk through the use of the `single_task` method to create work on a device. As the name suggests, this will create a task for sequential execution: probably not a method you will use often, but definitely something to be aware of! The task we would like to perform is a print-out on the device. If you are familiar with CUDA\/HIP, you probably know that `printf` can be used in device code. In keeping with C++, the SYCL standard defines a `stream` class, which works similar to the standard streams. A SYCL stream needs a `handler` object on construction:\n```\nauto out = stream(1024, \/* maximum size of output per kernel invocation *\/\n                   256, \/* maximum size before flushing the stream *\/\n                   cgh);\n```\nSYCL streams behave just like standard C++ streams. 
We can write something to a stream using `operator<<`:\n```\nout << \"my message\" << std::endl;\n```\nYou can find a scaffold for the code in the `content\/code\/day-1\/04_single-task\/single-task.cpp` file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the `solution` subfolder.\n\n1. Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode.\n2. Submit work to the queue using a command handler group.\n3. Create a `stream` object.\n4. Create a single task on the `handler` printing a string to the stream. A `single_task` only accepts a function with no input arguments as parameter:\n   ```\n   cgh.single_task([=](){\n  \/* task code *\/\n});\n   ```\n\nKeypoints\n\n- One queue maps to one device, such that there is no ambiguity in spreading work.\n- A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one.\n- Enqueing actions can happen by submitting **command groups** using the `handler` class.\n- You can also enqueue actions with *shortcut* methods on the `queue` class.\n- Work can be enqueued with a command group handler. 
This gives more flexibility over the definition of the corresponding node in the task graph.\n- Kernels are [callables](https:\/\/en.cppreference.com\/w\/cpp\/named_req\/Callable): either lambda functions or function objects.\n- Kernel code cannot use neither RTTI nor dynamic memory allocation.","meta_canonical":null,"ml_categories_json":"","ml_types_json":"","ml_intent_types_json":"","meta_language":"en","attrs_author":null,"attrs_publish_time":0,"attrs_original_publish_time":1638545810,"attrs_is_republished":0,"attrs_nr_words":"1571","attrs_boilerpipe_nr_words":"1441","body_ext_links_number":10,"body_int_links_number":17,"meta_nofollow":0,"meta_noarchive":0,"props_was_rendered":0,"src_redirect":"","download_time_msec":49,"download_ttfb_msec":49,"download_size":6986}

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE

✅ CRAWLED (1 month ago)

🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	1.1 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set

Page Details

Property	Value
URL	https://enccs.github.io/sycl-workshop/queues-cgs-kernels/
Last Crawled	2026-03-22 02:00:40 (1 month ago)
First Indexed	not set
HTTP Status Code	200
Content
Meta Title	Queues, command groups, and kernels — Heterogeneous programming with SYCL documentation
Meta Description	null
Meta Canonical	null
Boilerpipe Text	Questions How do we organize work in a SYCL application? Objectives Learn about queues to describe ordering of operations. Command groups. Understand that kernels are units of parallelism in SYCL. SYCL queue objects are the abstraction connecting a host program to a single device. The queue is a central abstraction in SYCL : all device code is submitted to a queue as actions . The runtime schedules the actions and executes them asynchronously . The runtime keeps track of action prerequisites in its scheduling, for example, availability of data. We can state that the tracking of actions and their dependencies is the essence of SYCL . The SYCL standard models our program as a task graph , a set of nodes connected by edges : Nodes are actions to be performed on a device, such as the invocation of a kernel or explicit data movements. Edges are dependencies between the actions and express when it’s legal for a node to execute. Edges arise most often because of data dependencies between nodes. The task graph is a directed acyclic graph (DAG) : it has a well-defined start-to-finish direction and no nodes are self-connected. The SYCL runtime can resolve dependencies and thus generate the task graph. Furthermore, it can schedule how to execute the nodes, i.e. traversal of the task graph, in a completely asynchronous manner from the execution of the host code. We will see in The task graph: data, dependencies, synchronization how to manually modify the task graph. Two kinds of actions can be part of the task graph: Execution of device code These actions add nodes to the graph that will, eventually, execute device code. They accept kernel code and its execution space as argument and you invoke them as methods on the queue class directly or on the handler class. They come in three flavors, which represent different abstractions for work distribution in SYCL: single_task : as the name says, this will execute one single instance of the kernel code. 
parallel_for : this will launch a kernel with given work-size specification in a single instruction, multiple threads (SIMT) fashion. parallel_for_work_group : launches a kernel with hierarchical parallelism. This is only available on the handler class. Explicit memory operations. These actions add nodes to the graph that will, eventually, perform data migrations. You invoke them as methods on the queue class directly or on the handler class: copy : copies data. update_host : updates data in the buffer on the host-side. fill : initializes data in a buffer to the given value. We have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel. But how do we write a kernel? Kernels  Kernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL: lambda expressions Kernels as lambdas are very concise, thanks especially to the capture syntax. They cannot be templated and might be cumbersome to reuse. In some cases, lambdas can be too terse. +1 as a lambda [ = ]( id < 1 > idx ) -> void { data_acc [ idx ] += 1 ; } function objects A kernel is a class that overloads operator() function call operator. They can be templated, easily reused, and give full control over what data is passed in and out. They are more verbosee. +1 as a function object class PlusOne { public : PlusOne ( accessor < int > acc ) : data_acc_ ( acc ) {} void operator ()( id < 1 > idx ) { data_acc [ idx ] += 1 ; } private : accessor < int > data_acc_ ; }; There are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. Regardless of the chosen style, kernel code has some restrictions: It must have void as return type. It cannot use runtime type identification (RTTI) . It cannot dynamic allocate memory. 
Queues  One queue maps to one device: the mapping happens upon construction of a queue object and cannot be changed subsequently. It is not possible to use a single queue object to: manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work! spread enqueued work over multiple devices. While these might appear as limitations, we are free to declare as many queue object as we like in our programs. It is also valid to create multiple queues to the same device. Thus, the relation between queues and devices is many-to-one . Work on a device can be enqueued with the shortcut methods described above. For example, we can launch a data-parallel kernel with parallel_for invoked on the desired queue object: Creating work on a device using queue shortcuts. auto Q = queue { my_selector {}}; Q . parallel_for ( range < 1 > { sz }, [ = ]( auto & idx ){ /* kernel code / }); Command groups  A command group handler gives more control over how code is submitted to the queue. Submission is slightly more verbose, but we get access to features of hierarchical parallelism. The abstraction for command groups is the class handler : these objects are constructed for us by the SYCL runtime. As such, we will meet them only as arguments of the lambda functions passed to the submit method of our queues. A command group handler contains: host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission. exactly one action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function. Creating work on a device using a command group handler . auto Q = queue { my_selector {}}; Q . submit ([ & ]( handler & cgh ){ / host code: sets up the dependencies of this node. It executes immediately! / accessor acc { B , h }; / exactly one of the available actions. 
It executes asynchronously / cgh . parallel_for ( range < 1 > { sz }, [ = ]( auto & idx ){ / kernel code / }); }); single_task and streams We’ll walk through the use of the single_task method to create work on a device. As the name suggests, this will create a task for sequential execution: probably not a method you will use often, but definitely something to be aware of! The task we would like to perform is a print-out on the device. If you are familiar with CUDA/HIP, you probably know that printf can be used in device code. In keeping with C++, the SYCL standard defines a stream class, which works similar to the standard streams. A SYCL stream needs a handler object on construction: auto out = stream ( 1024 , / maximum size of output per kernel invocation / 256 , / maximum size before flushing the stream / cgh ); SYCL streams behave just like standard C++ streams. We can write something to a stream using operator<< : out << "my message" << std :: endl ; You can find a scaffold for the code in the content/code/day-1/04_single-task/single-task.cpp file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the solution subfolder. Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode. Submit work to the queue using a command handler group. Create a stream object. Create a single task on the handler printing a string to the stream. A single_task only accepts a function with no input arguments as parameter: cgh . single_task ([ = ](){ / task code */ }); Keypoints One queue maps to one device, such that there is no ambiguity in spreading work. A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one. Enqueing actions can happen by submitting command groups using the handler class. 
You can also enqueue actions with shortcut methods on the queue class. Work can be enqueued with a command group handler. This gives more flexibility over the definition of the corresponding node in the task graph. Kernels are callables : either lambda functions or function objects. Kernel code cannot use neither RTTI nor dynamic memory allocation.
Markdown	[Heterogeneous programming with SYCL ![Logo](https://enccs.github.io/sycl-workshop/_static/ENCCS.jpg)](https://enccs.github.io/sycl-workshop/) - [Setting up your system](https://enccs.github.io/sycl-workshop/karolina/) The lesson - [What is SYCL?](https://enccs.github.io/sycl-workshop/what-is-sycl/) - [Device discovery](https://enccs.github.io/sycl-workshop/device-discovery/) - [Queues, command groups, and kernels](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/) - [Kernels](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#kernels) - [Queues](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#queues) - [Command groups](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#command-groups) - [Data management with buffers and accessors](https://enccs.github.io/sycl-workshop/buffers-accessors/) - [Data management with unified shared memory](https://enccs.github.io/sycl-workshop/unified-shared-memory/) - [Expressing parallelism with SYCL: basic data-parallel kernels](https://enccs.github.io/sycl-workshop/expressing-parallelism-basic/) - [Expressing parallelism with SYCL: nd-range data-parallel kernels](https://enccs.github.io/sycl-workshop/expressing-parallelism-nd-range/) - [The task graph: data, dependencies, synchronization](https://enccs.github.io/sycl-workshop/task-graphs-synchronization/) - [Heat equation mini-app](https://enccs.github.io/sycl-workshop/heat-equation/) - [Using sub-groups in SYCL](https://enccs.github.io/sycl-workshop/sub-groups/) - [Profiling SYCL applications](https://enccs.github.io/sycl-workshop/profiling/) - [Buffer-accessor model vs unified shared memory](https://enccs.github.io/sycl-workshop/buffer-accessor-vs-usm/) Reference - [Quick Reference](https://enccs.github.io/sycl-workshop/quick-reference/) - [Bibliography](https://enccs.github.io/sycl-workshop/zbibliography/) - [Instructor’s guide](https://enccs.github.io/sycl-workshop/guide/) [Heterogeneous programming with 
SYCL](https://enccs.github.io/sycl-workshop/) - Queues, command groups, and kernels - [Edit on GitHub](https://github.com/ENCCS/sycl-workshop/blob/main/content/queues-cgs-kernels.rst) * # Queues, command groups, and kernels[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#queues-command-groups-and-kernels "Permalink to this heading") Questions - How do we organize work in a SYCL application? Objectives - Learn about queues to describe ordering of operations. - Command groups. - Understand that kernels are units of parallelism in SYCL. SYCL `queue` objects are the abstraction connecting a host program to a single device. The [queue](https://enccs.github.io/sycl-workshop/quick-reference/#term-queue) is a central abstraction in [SYCL](https://www.khronos.org/sycl/): all device code is submitted** to a queue as actions. The runtime schedules the actions and executes them asynchronously. The runtime keeps track of action prerequisites in its scheduling, for example, availability of data. We can state that the tracking of actions and their dependencies is the essence of [SYCL](https://www.khronos.org/sycl/). The SYCL standard models our program as a task graph, a set of nodes connected by edges: - Nodes are actions to be performed on a device, such as the invocation of a kernel or explicit data movements. - Edges are dependencies between the actions and express when it’s legal for a node to execute. Edges arise most often because of data dependencies between nodes. The task graph is a directed acyclic graph (DAG): it has a well-defined start-to-finish direction and no nodes are self-connected. The SYCL runtime can resolve dependencies and thus generate the task graph. Furthermore, it can schedule how to execute the nodes, i.e. traversal of the task graph, in a completely asynchronous manner from the execution of the host code. 
We will see in [The task graph: data, dependencies, synchronization](https://enccs.github.io/sycl-workshop/task-graphs-synchronization/#task-graphs-synchronization) how to manually modify the task graph. Two kinds of actions can be part of the task graph: Execution of device code These actions add nodes to the graph that will, eventually, execute device code. They accept kernel code and its execution space as argument and you invoke them as methods on the `queue` class directly or on the `handler` class. They come in three flavors, which represent different abstractions for work distribution in SYCL: - `single_task`: as the name says, this will execute one single instance of the kernel code. - `parallel_for`: this will launch a kernel with given work-size specification in a single instruction, multiple threads (SIMT) fashion. - `parallel_for_work_group`: launches a kernel with hierarchical parallelism. This is only available on the `handler` class. Explicit memory operations. These actions add nodes to the graph that will, eventually, perform data migrations. You invoke them as methods on the `queue` class directly or on the `handler` class: - `copy`: copies data. - `update_host`: updates data in the buffer on the host-side. - `fill`: initializes data in a buffer to the given value. We have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel. But how do we write a kernel? ## Kernels[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#kernels "Permalink to this heading") Kernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL: [lambda expressions](https://en.cppreference.com/w/cpp/language/lambda) Kernels as lambdas are very concise, thanks especially to the capture syntax. They cannot be templated and might be cumbersome to reuse. 
In some cases, lambdas can be too terse. \+1 as a lambda ``` [=](id<1> idx) -> void { data_acc[idx] += 1; } ``` [function objects](https://en.cppreference.com/w/cpp/utility/functional) A kernel is a class that overloads `operator()` function call operator. They can be templated, easily reused, and give full control over what data is passed in and out. They are more verbosee. \+1 as a function object ``` class PlusOne { public: PlusOne(accessor<int> acc) : data_acc_(acc) {} void operator()(id<1> idx) { data_acc[idx] += 1; } private: accessor<int> data_acc_; }; ``` There are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. Regardless of the chosen style, kernel code has some restrictions: - It must have `void` as return type. - It cannot use [runtime type identification (RTTI)](https://en.m.wikibooks.org/wiki/C%2B%2B_Programming/RTTI). - It cannot dynamic allocate memory. ## Queues[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#queues "Permalink to this heading") One queue maps to one device: the mapping happens upon construction of a `queue` object and cannot be changed subsequently. It is not possible to use a single `queue` object to: - manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work\! - spread enqueued work over multiple devices. While these might appear as limitations, we are free to declare as many `queue` object as we like in our programs. It is also valid to create multiple queues to the same device. Thus, the relation between queues and devices is many-to-one. Work on a device can be enqueued with the shortcut methods described above. For example, we can launch a data-parallel kernel with `parallel_for` invoked on the desired queue object: Creating work on a device using `queue` shortcuts. 
``` auto Q = queue{my_selector{}}; Q.parallel_for(range<1>{sz}, [=](auto &idx){ /* kernel code / }); ``` ## Command groups[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#command-groups "Permalink to this heading") A command group handler gives more control over how code is submitted to the queue. Submission is slightly more verbose, but we get access to features of hierarchical parallelism. The abstraction for command groups is the class `handler`: these objects are constructed for us by the SYCL runtime. As such, we will meet them only as arguments of the lambda functions passed to the `submit` method of our queues. A command group handler contains: - host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission. - exactly one* action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function. Creating work on a device using a command group `handler`. ``` auto Q = queue{my_selector{}}; Q.submit([&](handler &cgh){ /* host code: sets up the dependencies of this node. It executes immediately! / accessor acc{B, h}; / exactly one of the available actions. It executes asynchronously / cgh.parallel_for(range<1>{sz}, [=](auto &idx){ / kernel code / }); }); ``` `single_task` and streams We’ll walk through the use of the `single_task` method to create work on a device. As the name suggests, this will create a task for sequential execution: probably not a method you will use often, but definitely something to be aware of! The task we would like to perform is a print-out on the device. If you are familiar with CUDA/HIP, you probably know that `printf` can be used in device code. In keeping with C++, the SYCL standard defines a `stream` class, which works similar to the standard streams. 
A SYCL stream needs a `handler` object on construction: ``` auto out = stream(1024, / maximum size of output per kernel invocation / 256, / maximum size before flushing the stream / cgh); ``` SYCL streams behave just like standard C++ streams. We can write something to a stream using `operator<<`: ``` out << "my message" << std::endl; ``` You can find a scaffold for the code in the `content/code/day-1/04_single-task/single-task.cpp` file, alongside the CMake script to build the executable. You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the `solution` subfolder. 1. Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode. 2. Submit work to the queue using a command handler group. 3. Create a `stream` object. 4. Create a single task on the `handler` printing a string to the stream. A `single_task` only accepts a function with no input arguments as parameter: ``` cgh.single_task([=](){ / task code / }); ``` Keypoints - One queue maps to one device, such that there is no ambiguity in spreading work. - A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one. - Enqueing actions can happen by submitting command groups* using the `handler` class. - You can also enqueue actions with shortcut methods on the `queue` class. - Work can be enqueued with a command group handler. This gives more flexibility over the definition of the corresponding node in the task graph. - Kernels are [callables](https://en.cppreference.com/w/cpp/named_req/Callable): either lambda functions or function objects. - Kernel code cannot use neither RTTI nor dynamic memory allocation. 
[Previous](https://enccs.github.io/sycl-workshop/device-discovery/ "Device discovery") [Next](https://enccs.github.io/sycl-workshop/buffers-accessors/ "Data management with buffers and accessors") *** © Copyright 2021, Roberto Di Remigio and individual contributors.. Built with [Sphinx](https://www.sphinx-doc.org/) using a [theme](https://github.com/readthedocs/sphinx_rtd_theme) provided by [Read the Docs](https://readthedocs.org/).
Readable Markdown	Questions - How do we organize work in a SYCL application? Objectives - Learn about queues to describe ordering of operations. - Command groups. - Understand that kernels are units of parallelism in SYCL. SYCL `queue` objects are the abstraction connecting a host program to a single device. The [queue](https://enccs.github.io/sycl-workshop/quick-reference/#term-queue) is a central abstraction in [SYCL](https://www.khronos.org/sycl/): all device code is submitted to a queue as actions. The runtime schedules the actions and executes them asynchronously. The runtime keeps track of action prerequisites in its scheduling, for example, availability of data. We can state that the tracking of actions and their dependencies is the essence of [SYCL](https://www.khronos.org/sycl/). The SYCL standard models our program as a task graph, a set of nodes connected by edges: - Nodes are actions to be performed on a device, such as the invocation of a kernel or explicit data movements. - Edges are dependencies between the actions and express when it’s legal for a node to execute. Edges arise most often because of data dependencies between nodes. The task graph is a directed acyclic graph (DAG): it has a well-defined start-to-finish direction and no nodes are self-connected. The SYCL runtime can resolve dependencies and thus generate the task graph. Furthermore, it can schedule how to execute the nodes, i.e. traversal of the task graph, in a completely asynchronous manner from the execution of the host code. We will see in [The task graph: data, dependencies, synchronization](https://enccs.github.io/sycl-workshop/task-graphs-synchronization/#task-graphs-synchronization) how to manually modify the task graph. Two kinds of actions can be part of the task graph: Execution of device code These actions add nodes to the graph that will, eventually, execute device code. 
They accept kernel code and its execution space as argument and you invoke them as methods on the `queue` class directly or on the `handler` class. They come in three flavors, which represent different abstractions for work distribution in SYCL: - `single_task`: as the name says, this will execute one single instance of the kernel code. - `parallel_for`: this will launch a kernel with given work-size specification in a single instruction, multiple threads (SIMT) fashion. - `parallel_for_work_group`: launches a kernel with hierarchical parallelism. This is only available on the `handler` class. Explicit memory operations. These actions add nodes to the graph that will, eventually, perform data migrations. You invoke them as methods on the `queue` class directly or on the `handler` class: - `copy`: copies data. - `update_host`: updates data in the buffer on the host-side. - `fill`: initializes data in a buffer to the given value. We have given a high-level overview of the abstractions in the execution model: from the queue to the execution on a device, passing through submission of work, described as a data-parallel kernel. But how do we write a kernel? ## Kernels[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#kernels "Permalink to this heading") Kernels are the fundamental building blocks for performing work in a SYCL program. We will only consider two ways of writing kernels in SYCL: [lambda expressions](https://en.cppreference.com/w/cpp/language/lambda) Kernels as lambdas are very concise, thanks especially to the capture syntax. They cannot be templated and might be cumbersome to reuse. In some cases, lambdas can be too terse. \+1 as a lambda ``` [=](id<1> idx) -> void { data_acc[idx] += 1; } ``` [function objects](https://en.cppreference.com/w/cpp/utility/functional) A kernel is a class that overloads `operator()` function call operator. They can be templated, easily reused, and give full control over what data is passed in and out. 
They are more verbosee. \+1 as a function object ``` class PlusOne { public: PlusOne(accessor<int> acc) : data_acc_(acc) {} void operator()(id<1> idx) { data_acc[idx] += 1; } private: accessor<int> data_acc_; }; ``` There are no technical reasons to prefer one style over the other, it will ultimately boil down to personal preference. Regardless of the chosen style, kernel code has some restrictions: - It must have `void` as return type. - It cannot use [runtime type identification (RTTI)](https://en.m.wikibooks.org/wiki/C%2B%2B_Programming/RTTI). - It cannot dynamic allocate memory. ## Queues[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#queues "Permalink to this heading") One queue maps to one device: the mapping happens upon construction of a `queue` object and cannot be changed subsequently. It is not possible to use a single `queue` object to: - manage more than one device. The runtime would face ambiguities in deciding which device should actually do the work\! - spread enqueued work over multiple devices. While these might appear as limitations, we are free to declare as many `queue` object as we like in our programs. It is also valid to create multiple queues to the same device. Thus, the relation between queues and devices is many-to-one. Work on a device can be enqueued with the shortcut methods described above. For example, we can launch a data-parallel kernel with `parallel_for` invoked on the desired queue object: Creating work on a device using `queue` shortcuts. ``` auto Q = queue{my_selector{}}; Q.parallel_for(range<1>{sz}, [=](auto &idx){ /* kernel code / }); ``` ## Command groups[](https://enccs.github.io/sycl-workshop/queues-cgs-kernels/#command-groups "Permalink to this heading") A command group handler gives more control over how code is submitted to the queue. Submission is slightly more verbose, but we get access to features of hierarchical parallelism. 
The abstraction for command groups is the class `handler`: these objects are constructed for us by the SYCL runtime. As such, we will meet them only as arguments of the lambda functions passed to the `submit` method of our queues. A command group handler contains: - host code, to set up the dependencies of the corresponding node in the task graph. Host code is executed immediately upon submission. - exactly one* action of the ones described above. The action executes asynchronously on the device. Parallel work actions will, furthermore, need an execution range and a kernel function. Creating work on a device using a command group `handler`. ``` auto Q = queue{my_selector{}}; Q.submit([&](handler &cgh){ /* host code: sets up the dependencies of this node. It executes immediately! / accessor acc{B, h}; / exactly one of the available actions. It executes asynchronously / cgh.parallel_for(range<1>{sz}, [=](auto &idx){ / kernel code / }); }); ``` `single_task` and streams We’ll walk through the use of the `single_task` method to create work on a device. As the name suggests, this will create a task for sequential execution: probably not a method you will use often, but definitely something to be aware of! The task we would like to perform is a print-out on the device. If you are familiar with CUDA/HIP, you probably know that `printf` can be used in device code. In keeping with C++, the SYCL standard defines a `stream` class, which works similar to the standard streams. A SYCL stream needs a `handler` object on construction: ``` auto out = stream(1024, / maximum size of output per kernel invocation / 256, / maximum size before flushing the stream / cgh); ``` SYCL streams behave just like standard C++ streams. We can write something to a stream using `operator<<`: ``` out << "my message" << std::endl; ``` You can find a scaffold for the code in the `content/code/day-1/04_single-task/single-task.cpp` file, alongside the CMake script to build the executable. 
You will have to complete the source code to compile and run correctly: follow the hints in the source file. A working solution is in the `solution` subfolder. 1. Create a queue object. You’re free to use any of the device selection strategies we have encountered in the previous episode. 2. Submit work to the queue using a command handler group. 3. Create a `stream` object. 4. Create a single task on the `handler` printing a string to the stream. A `single_task` only accepts a function with no input arguments as parameter: ``` cgh.single_task([=](){ / task code / }); ``` Keypoints - One queue maps to one device, such that there is no ambiguity in spreading work. - A program can have as many queues as desired. Multiple queues can use the same device: the queue-device mapping is many-to-one. - Enqueing actions can happen by submitting command groups* using the `handler` class. - You can also enqueue actions with shortcut methods on the `queue` class. - Work can be enqueued with a command group handler. This gives more flexibility over the definition of the corresponding node in the task graph. - Kernels are [callables](https://en.cppreference.com/w/cpp/named_req/Callable): either lambda functions or function objects. - Kernel code cannot use neither RTTI nor dynamic memory allocation.
ML Classification
ML Categories	null
ML Page Types	null
ML Intent Types	null
Content Metadata
Language	en
Author	null
Publish Time	not set
Original Publish Time	2021-12-03 15:36:50 (4 years ago)
Republished	No
Word Count (Total)	1,571
Word Count (Content)	1,441
Links
External Links	10
Internal Links	17
Technical SEO
Meta Nofollow	No
Meta Noarchive	No
JS Rendered	No
Redirect Target	null
Performance
Download Time (ms)	49
TTFB (ms)	49
Download Size (bytes)	6,986
Shard	143 (laksa)
Root Hash	2566890010099092343
Unparsed URL	io,github!enccs,/sycl-workshop/queues-cgs-kernels/ s443