While the conventional system waits for post-processing to complete before starting its second wave of matmul operations, Einsummable is already on its fourth and final wave of matmul operations. Due to the long pause required for post-processing in conventional systems, they finish much later than fine-grained systems like Einsummable, which maintain continuous computation throughout.
Deep Dive
Voraussetzung
- Keine Daten verfügbar.
Nächste Schritte
- Keine Daten verfügbar.
Deep Dive
EinsummableIndiziert:
Einsummable is a deep learning system that we are developing at Rice University. This video shows how Einsummable can be faster than other systems, such as PyTorch or JAX.
I'm summable uses a fine-grained data flow graph to execute a computation on a set of GPUs.
Fine-grained means that for a matrix multiply chain like X * Y * Z, I'm summable breaks the computation up into a large number of pieces.
This is the graph that I'm summable might use to run the computation on four GPUs. The operations it uses in this example are matmul, move, and matadd.
The graph is produced automatically by I'm summable.
In this example, I'm summable starts by running a first wave of matmul operations load balanced across the four GPUs.
Once those matmul operations finish, a second wave of matmul operations starts.
At the same time, I'm summable begins the transmission of results from the first wave through the network, where they are aggregated.
I'm summable is an asynchronous work conserving system where communication and computation are overlapped, and there is no pause for communication.
Once the second wave of matmul operations finishes, the third wave of matmul operations begins.
As the third wave runs, the intermediate results from the second wave are transmitted through the network and aggregated.
Again, there is no pause for communication.
Then the third wave finishes, and the fourth wave of matmul operations begins.
As the fourth wave runs, the results of the third wave are transmitted to the location required for the final aggregation.
The entire computation has been run without the need to pause and wait for communication and aggregation.
Let's compare with how a conventional system, such as PyTorch or JAX, might operate.
Typically, the system would break things up so that the amount of work matches the number of GPUs.
Keep an eye on the comparison with Einsummable, shown running at the same time at the bottom right.
The conventional system starts its first wave of matmul ops, which take twice as long as the Einsummable matmul ops because they operate on larger chunks.
We're going to have to wait a little while for those matmul ops to finish.
Once those matmul ops finish, the process of post-processing the results begins.
In this case, that means concatenating the intermediate results to form larger submatrices, and then replicating those submatrices on the appropriate GPUs.
Note that while this is happening, the GPUs sit idle as no expensive matmul operations are being run.
Finally, the second wave of matmul operations can begin.
Note that at this point, Einsummable is already on its fourth and final wave of matmul ops.
Now, Einsummable finishes the computation.
Because of the long pause required to post-process the outputs of the first wave, the conventional system finishes much later than Einsummable.
Ähnliche Videos
Ubuntu Touch Q&A 190
UBports
241 views•2026-05-17
Iterators and Generators: Real Use Cases
jsmentor-uk
188 views•2026-05-17
TCS NQT Coding Questions Solution (One Shot) | TCS NQT Preparation 2027 | TCS Actual PYQ 2026
knacademy20
2K views•2026-05-17
The 4 Bit AI Training Trick
explaquiz
414 views•2026-05-19
Image to 3D World Workflow 👀
badxstudio
843 views•2026-05-16
Why Learn Algorithms in the AI Era
bitsandproofs
245 views•2026-05-17
NFA - Transition Diagram and Transition Table
nesoacademy
198 views•2026-05-19
BCS | BASIC COMPUTER SKILLS | WHOLE SUBJECT EXPLANATION | OSMANIA UNIVERSITY | @shivanipallela
shivanipallela
345 views•2026-05-22











