This talk introduces a practical mental model for understanding GPU acceleration in Python, with a focus on real performance limits rather than abstract theory. Using a simple but representative example problem, we explore how modern hardware actually executes tensor computations and why performance is often constrained by memory rather than compute.
We walk through tensors, parallelism, and GPU architecture, then build toward a quantitative framework using FLOPs, memory bandwidth, and arithmetic intensity. The talk culminates in applying the roofline model to analyze performance, followed by practical optimization techniques such as kernel fusion and tiling, with a real implementation example.
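As a taste of the quantitative framework, here is a minimal sketch of computing arithmetic intensity (FLOPs per byte moved) and comparing it against a hardware ridge point. The GPU numbers below are illustrative assumptions, not figures from the talk:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of memory traffic: the x-axis of the roofline model."""
    return flops / bytes_moved

# Elementwise add of two float32 vectors of length n:
# n FLOPs, 3n * 4 bytes moved (read a, read b, write result).
n = 1_000_000
ai_add = arithmetic_intensity(n, 3 * n * 4)

# Square matmul of size m: 2*m**3 FLOPs, 3 * m*m * 4 bytes
# (counting each matrix once, ignoring cache re-reads).
m = 1024
ai_matmul = arithmetic_intensity(2 * m**3, 3 * m * m * 4)

# Ridge point of a hypothetical GPU: peak FLOP/s divided by bandwidth.
# Below this intensity a kernel is memory-bound; above it, compute-bound.
peak_flops = 20e12   # 20 TFLOP/s (assumed)
bandwidth = 900e9    # 900 GB/s (assumed)
ridge = peak_flops / bandwidth  # ~22 FLOPs/byte

for name, ai in [("add", ai_add), ("matmul", ai_matmul)]:
    bound = "memory" if ai < ridge else "compute"
    print(f"{name}: {ai:.2f} FLOPs/byte -> {bound}-bound")
```

The elementwise add lands far below the ridge point (memory-bound), while the matmul lands well above it (compute-bound) — the kind of contrast the roofline analysis in the talk makes precise.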
The goal is to move beyond “make it faster” and toward understanding why something is fast or slow.
PyOhio 2026