See also:

  • performance
  • algorithms
  • concurrency

Umut Acar, Parallel Computing: Theory and Practice — intro in C++. Oregon Programming Languages Summer School lectures on parallel algorithms. Sam Westrick: MPL ("MaPLe"), a parallel ML compiler.

Disentangling

https://enccs.github.io/gpu-programming/# GPU programming: when, why, how

GPU

Why: roofline model, latency vs. bandwidth, arithmetic intensity
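The roofline model in one line: attainable FLOP/s is the minimum of the compute roof and the bandwidth roof times arithmetic intensity. A minimal sketch — the peak numbers below are illustrative placeholders, not any real card's specs:

```python
# Roofline: attainable FLOP/s = min(peak_flops, peak_bandwidth * intensity).

def roofline(peak_flops, peak_bw_bytes, intensity):
    """Attainable performance given arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, peak_bw_bytes * intensity)

# SAXPY y = a*x + y in fp32: 2 FLOPs per element, 12 bytes moved
# (read x, read y, write y, 4 bytes each).
saxpy_intensity = 2 / 12

peak_flops = 10e12   # 10 TFLOP/s (illustrative)
peak_bw = 500e9      # 500 GB/s   (illustrative)

perf = roofline(peak_flops, peak_bw, saxpy_intensity)
ridge = peak_flops / peak_bw  # intensity where the two roofs meet
# SAXPY's intensity is far below the ridge point, so it is bandwidth-bound.
assert perf == peak_bw * saxpy_intensity
print(f"SAXPY: {saxpy_intensity:.3f} FLOP/B -> {perf/1e9:.0f} GFLOP/s; ridge at {ridge:.0f} FLOP/B")
```

Kernels left of the ridge point are bandwidth-bound, right of it compute-bound; raising arithmetic intensity (fusion, tiling) is what moves a kernel rightward.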

SM = streaming multiprocessor. A warp is 32 threads executing in lockstep.
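A tiny sketch of the SIMT decomposition: a block's flat thread index splits into a warp id and a lane id (warp size 32 on NVIDIA hardware):

```python
WARP_SIZE = 32  # NVIDIA warp size: 32 lanes scheduled together in lockstep

def warp_lane(block_dim, thread_idx):
    """Map a thread's index within its block to (warp id, lane id)."""
    assert 0 <= thread_idx < block_dim
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

# A 128-thread block is scheduled on an SM as 4 warps.
assert warp_lane(128, 0) == (0, 0)
assert warp_lane(128, 31) == (0, 31)   # last lane of warp 0
assert warp_lane(128, 32) == (1, 0)    # first lane of warp 1
assert warp_lane(128, 127) == (3, 31)
```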

Cuda

tensor cores

https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/

https://docs.nvidia.com/cuda/ docs. nvcc compiler: -gencode, -arch (also on godbolt). Tools: nvvm, nvtx, nvdisasm, nvprune, cuda-gdb, cudafe++ (separates host code from GPU code), cuobjdump, nvprof, nvlink, ptxas, bin2c

#nvcc
cd /usr/local/cuda-12.3/bin/
ls
#./nvcc --help

NVVM IR https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html — an LLVM IR variant. https://developer.nvidia.com/cupti CUPTI, the CUDA Profiling Tools Interface. https://developer.nvidia.com/nvidia-visual-profiler Visual Profiler (libnvvp). https://developer.nvidia.com/nsight-systems Nsight Systems, performance analysis tool

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions C++ extensions: the <<< >>> execution configuration. nvcc lowers it to CUDA runtime calls
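The usual sizing behind `kernel<<<grid, block>>>(...)` is a ceil-divide plus a tail guard inside the kernel. A CPU stand-in sketching that launch pattern (the SAXPY kernel and `launch` helper here are illustrative, not any real API):

```python
# Sketch of <<<grid, block>>> sizing: ceil-divide N by the block size so
# every element gets a thread, then guard the overshoot inside the kernel.

def grid_size(n, block=256):
    return (n + block - 1) // block  # ceil(n / block)

def launch(kernel, n, block=256):
    """CPU stand-in for kernel<<<grid_size(n, block), block>>>(...)."""
    for block_idx in range(grid_size(n, block)):
        for thread_idx in range(block):
            kernel(block_idx, thread_idx, block)

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, n):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # tail guard: grid may overshoot n
        y[i] = a * x[i] + y[i]

n = 1000
x = [1.0] * n
y = [2.0] * n
launch(lambda b, t, bd: saxpy_kernel(b, t, bd, 3.0, x, y, n), n)
assert grid_size(1000, 256) == 4        # 4 blocks * 256 threads = 1024 >= 1000
assert all(v == 5.0 for v in y)         # y = 3*1 + 2 everywhere
```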

PTX https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html — virtual instruction set, JITed by the driver. https://github.com/cloudcores/CuAssembler — the native ISA changes between cards (sm_86, sm_80, sm_75, sm_70, sm_60). https://github.com/Danil6969/Ghidra-Nvidia-GPU Ghidra spec for one GPU, not very complete-looking. https://arxiv.org/abs/2301.11389 "A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code"

runtime: cudart.a / libcudart.so — cudaInitDevice(), cudaSetDevice(), cudaMalloc, cudaFree

thrust https://github.com/NVIDIA/thrust, CUB — now part of https://github.com/NVIDIA/cccl, the CUDA Core Compute Libraries

https://github.com/cuda-mode/resource-stream https://github.com/cuda-mode/lectures/ — numba has a CUDA simulator

https://github.com/openai/triton triton jax torch.compile

cudnn, physx, tensorrt, cublas, curand, cufft, cusolver, cusparse, NPP (NVIDIA Performance Primitives), NVML (management library), cudart (runtime), NVRTC (runtime compilation), CUTLASS, https://github.com/NVIDIA/cuCollections (cuco)

book - programming massively parallel processors https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-dp-0323912311/dp/0323912311/

pycuda https://documen.tician.de/pycuda/

https://github.com/inducer/loopy A code generator for array-based code on CPUs and GPUs

import pycuda.autoinit  # creates a context on the first available device
import pycuda.driver as cuda

https://github.com/rapidsai/cudf gpu dataframe https://github.com/rapidsai

opencl

https://developer.nvidia.com/opencl

https://documen.tician.de/pyopencl/ pyopencl

https://github.com/intel/compute-runtime for my integrated graphics. https://github.com/intel/gmmlib graphics memory management library

Host-side flow: create context → create queue → create buffers → create program → create kernel → set arguments → enqueue kernel → enqueue copy → enqueue map → release

import numpy as np
import pyopencl as cl

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
__kernel void sum(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
knl = prg.sum  # Use this Kernel object for repeated calls
knl(queue, a_np.shape, None, a_g, b_g, res_g)

res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with Numpy:
print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))
assert np.allclose(res_np, a_np + b_np)

https://github.com/KhronosGroup/OpenCL-Guide/tree/main https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/getting_started_linux.md

sudo apt install opencl-headers ocl-icd-opencl-dev -y
echo "
// C standard includes
#include <stdio.h>

// OpenCL includes
#include <CL/cl.h>

int main()
{
    cl_int CL_err = CL_SUCCESS;
    cl_uint numPlatforms = 0;

    CL_err = clGetPlatformIDs( 0, NULL, &numPlatforms );

    if (CL_err == CL_SUCCESS)
        printf(\"%u platform(s) found\n\", numPlatforms);
    else
        printf(\"clGetPlatformIDs(%i)\n\", CL_err);

    return 0;
}
" > /tmp/test.c
gcc -Wall -Wextra -D CL_TARGET_OPENCL_VERSION=300  /tmp/test.c -o /tmp/test -lOpenCL 
# -std=c++11 -lOpenCL /tmp/test.cpp -o /tmp/test
/tmp/test

https://github.com/boostorg/compute

https://github.com/KhronosGroup/OpenCL-TTL tensor tiling library

https://github.com/ProjectPhysX/FluidX3D lattice Boltzmann fluid simulation. https://github.com/pypr/pysph Smoothed Particle Hydrodynamics

https://github.com/Polytonic/Chlorine opencl wrapper “dead simple”

HIP

SYCL

https://github.com/triSYCL/triSYCL

WebGPU

https://developer.mozilla.org/en-US/docs/Web/API/WebGPU_API

Vulkan

https://github.com/google/clspv opencl to vulkan compiler

SPIR-V

Metal

Algorithms

Parallel scan, sort, reduce; parallel hashmap
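Scan is the canonical GPU primitive. A sequential Python model of the Blelloch work-efficient exclusive scan (up-sweep then down-sweep); on a GPU each inner loop body runs as one thread per active index:

```python
def exclusive_scan(a):
    """Blelloch work-efficient exclusive prefix sum (power-of-two length)."""
    x = list(a)
    n = len(x)
    assert n & (n - 1) == 0, "sketch assumes power-of-two length"
    # Up-sweep (reduce): build partial sums up a binary tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):          # one GPU thread per i
            x[i + 2 * d - 1] += x[i + d - 1]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):          # one GPU thread per i
            t = x[i + d - 1]
            x[i + d - 1] = x[i + 2 * d - 1]
            x[i + 2 * d - 1] += t
        d //= 2
    return x

assert exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]) == [0, 3, 4, 11, 11, 15, 16, 22]
```

Both phases do O(n) total work over O(log n) steps, which is why this formulation (rather than the naive O(n log n) scan) is the one in CUB/thrust-style libraries.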

union-find on GPU?

Joins

https://moderngpu.github.io/join.html https://digitalcommons.usf.edu/etd/8484/ https://www.usenix.org/system/files/atc23-shovon.pdf "Towards Iterative Relational Algebra on the GPU", Micinski et al., USENIX ATC '23 https://github.com/harp-lab/usenixATC23 https://github.com/harp-lab/gdlog https://github.com/harp-lab/GPUJoin
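GPU equi-joins are typically hash joins: a parallel build phase over one relation, then a parallel probe phase over the other (one thread per tuple, hash table in device memory — cf. cuco, GPUJoin). A sequential sketch of the two phases:

```python
from collections import defaultdict

def hash_join(r, s):
    """Equi-join on the first column: build on R, probe with S.
    On a GPU both phases run in parallel, one thread per tuple."""
    # Build phase: hash table key -> list of payloads (handles duplicate keys).
    table = defaultdict(list)
    for key, payload in r:
        table[key].append(payload)
    # Probe phase: each S tuple looks up its key independently.
    return [(key, rp, sp) for key, sp in s for rp in table[key]]

r = [(1, "a"), (2, "b"), (2, "c")]
s = [(2, "x"), (3, "y"), (1, "z")]
# Key 2 matches twice, key 3 not at all.
assert hash_join(r, s) == [(2, "b", "x"), (2, "c", "x"), (1, "a", "z")]
```

The hard parts on a GPU are exactly what this sketch hides: concurrent inserts into the table and unknown output size (usually handled with a count pass plus a prefix scan to allocate output slots).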

PL

TACO https://github.com/tensor-compiler/taco https://github.com/manya-bansal/mosaic TVM halide

accelerate repa futhark https://github.com/diku-dk/futhark co-dfns https://github.com/Co-dfns/Co-dfns

Gröbner bases on GPU, Datalog on GPU, HVM https://news.ycombinator.com/item?id=37805759, SAT https://link.springer.com/article/10.1007/s10703-023-00432-z inprocessing https://www.worldscientific.com/doi/10.1142/9789811223334_0178 3-SAT in CUDA https://github.com/muhos/ParaFROST https://github.com/muhos/gpu4bmc. Resolution? Term rewriting (there was that K/J webpage)

fluids, molecular dynamics, bioinformatics, structure from motion, CT scan something, rendering

a grid of SHOs (simple harmonic oscillators)? cellular automata
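Cellular automata are embarrassingly parallel: every cell's next state depends only on its neighbourhood, so a GPU version is one thread per cell. A numpy sketch of one Game of Life step on a toroidal grid (numpy stands in for the per-cell kernel):

```python
import numpy as np

def life_step(grid):
    """One Game of Life step on a toroidal grid.
    Each cell updates independently: one GPU thread per cell."""
    # Sum the 8 neighbours by rolling the grid in every direction.
    nbrs = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    # Birth on 3 neighbours; survival on 2 or 3.
    return ((nbrs == 3) | ((grid == 1) & (nbrs == 2))).astype(grid.dtype)

# A blinker oscillates with period 2: horizontal bar <-> vertical bar.
g = np.zeros((5, 5), dtype=np.uint8)
g[2, 1:4] = 1                       # horizontal bar
g2 = life_step(g)
assert (g2[1:4, 2] == 1).all()      # now a vertical bar
assert life_step(g2).tolist() == g.tolist()
```

The same structure (stencil over a grid, double-buffered) covers the SHO-grid idea too: replace the life rule with a finite-difference update per cell.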