Note: FLOP and FLOPS are different concepts, even though the terms look nearly identical and are often conflated.
FLOP vs. FLOPS
FLOP (Floating Point Operation)
- Refers to a single floating-point arithmetic operation.
- These are calculations involving real numbers (decimals), which are more complex than integer operations.
- Used in tasks like simulations, rendering, and machine learning, where precision is crucial.
- Unit of Measurement: FLOP is a count or measure of how many floating-point calculations are needed for a task or algorithm.
- Example: A matrix multiplication of two large matrices may require millions of FLOPs.
- Examples of floating-point operations include:
- Addition: 2.1 + 3.2
- Multiplication: 4.5 × 6.7
- Division: 9.6 ÷ 3.2
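For concreteness, a minimal Python sketch (the values are arbitrary) of how individual operations map to FLOP counts:

```python
# Each basic floating-point arithmetic operation counts as one FLOP.
# The values are arbitrary examples.
a, b = 2.1, 3.2

add_result = a + b          # 1 FLOP (addition)
mul_result = a * b          # 1 FLOP (multiplication)
div_result = a / b          # 1 FLOP (division)
combined   = (a + b) * 4.5  # 2 FLOPs (one addition, one multiplication)

print(add_result, mul_result, div_result, combined)
```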
FLOPS (Floating Point Operations Per Second)
- Refers to the rate at which floating-point operations are performed, measuring computational performance.
- Unit of Measurement: FLOPS (or variations like GFLOPS, TFLOPS, PFLOPS) is used to indicate the speed or throughput of a processor.
- GFLOPS = Billion (10^9) FLOPs per second.
- TFLOPS = Trillion (10^12) FLOPs per second.
- PFLOPS = Quadrillion (10^15) FLOPs per second.
- Example usage:
- A GPU capable of 10 TFLOPS can perform 10 trillion (10^13) floating-point operations per second (a quick runtime estimate using this rate follows this list).
- Importance in GPUs:
- GPUs are optimized for parallel processing, making them capable of achieving extremely high FLOPS compared to CPUs.
- For instance, a modern GPU like NVIDIA’s A100 has FP32 performance around 19.5 TFLOPS, meaning it can perform ~19.5 trillion 32-bit floating-point calculations per second.
- Practical Applications: graphics rendering, machine learning training and inference, and scientific simulations all depend on high floating-point throughput.
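To connect the two units, here is a minimal Python sketch of how long a workload of a given FLOP count would ideally take at a given FLOPS rate; both numbers are taken from examples in this note and are ideal-case assumptions:

```python
# Hypothetical numbers: a workload of 400 million FLOPs (the matrix example
# later in this note) on a GPU sustaining 10 TFLOPS.
workload_flops = 400e6   # total floating-point operations needed
device_flops = 10e12     # 10 TFLOPS = 10^13 FLOPs per second

seconds = workload_flops / device_flops
print(f"Ideal runtime: {seconds * 1e6:.0f} microseconds")  # ~40 µs
```

Real runtimes are usually longer, since memory traffic, kernel-launch overhead, and under-utilisation eat into the peak rate.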
FLOP vs. FLOPS: Key Differences

| Aspect | FLOP | FLOPS |
| --- | --- | --- |
| Meaning | A single floating-point operation. | Floating-point operations completed per second. |
| Purpose | Describes workload size or algorithm complexity. | Measures processing power or performance. |
| Example Context | "This algorithm requires ~400 million FLOPs." | "This GPU delivers 19.5 TFLOPS." |
Analogy:
==Think of a FLOP as a single task (like hammering one nail) and FLOPS as the rate at which you can hammer nails per second. A project may require many hammer strikes (FLOPs), and how quickly it gets done depends on how fast you can hammer (FLOPS).==
FLOP in detail
FLOP stands for Floating Point Operation, a fundamental arithmetic operation on floating-point numbers (e.g., addition, subtraction, multiplication, division). In the context of GPUs and other computing devices, the plural FLOPs counts how many such operations a workload requires, while the rate at which a system can perform them (FLOPS) is used as a measure of computational performance.
Key Points About FLOP in GPUs
- Floating-Point Precision:
- GPUs can perform operations at different precision levels, which affects their FLOP throughput:
- FP32 (Single Precision): Commonly used in many ML/DL tasks and general computing.
- FP64 (Double Precision): Required for high-precision scientific computations.
- FP16 (Half Precision): Used in ML/DL for faster computations with reduced precision.
- BF16/INT8: Used for specific tasks like deep learning inference for further speedup.
- Measuring FLOPS:
- FLOPS (Floating Point Operations per Second) measures the computational power of a GPU:
- For instance, a GPU rated at 5 TFLOPS can perform 5 trillion floating-point operations per second.
- Relevance of FLOP in GPUs:
- GPUs are optimized for high FLOP throughput because many tasks in graphics rendering, machine learning, and scientific simulations require massive numbers of floating-point operations.
- In deep learning, FLOPs are a common benchmark for measuring the compute intensity of a model (e.g., number of FLOPs required to process one forward pass of a neural network).
- Theoretical vs. Practical FLOPs:
- Theoretical Peak FLOPs: Maximum achievable performance under ideal conditions.
- Practical FLOPs: Actual performance achieved in real-world workloads, often lower due to inefficiencies like memory bottlenecks, control flow divergence, or under-utilised cores.
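To make the theoretical-vs-practical gap concrete, here is a minimal benchmarking sketch using Python/NumPy on the CPU (the "peak" value is an assumed placeholder; the same arithmetic applies when benchmarking a GPU):

```python
import time
import numpy as np

# A matrix multiply performs roughly 2*m*n*p floating-point operations.
m = n = p = 2048
a = np.random.rand(m, n).astype(np.float32)
b = np.random.rand(n, p).astype(np.float32)
flop_count = 2 * m * n * p

start = time.perf_counter()
_ = a @ b
elapsed = time.perf_counter() - start

achieved_gflops = flop_count / elapsed / 1e9
assumed_peak_gflops = 500  # placeholder theoretical peak for this machine
print(f"Achieved: {achieved_gflops:.1f} GFLOPS "
      f"({achieved_gflops / assumed_peak_gflops:.0%} of assumed peak)")
```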
Example of FLOP Throughput in GPUs
- NVIDIA A100:
- FP64 (Tensor Cores): 19.5 TFLOPS
- TF32 (Tensor Cores): 156 TFLOPS
- FP16 (Tensor Cores): 312 TFLOPS
This highlights how GPUs are optimized for specific workloads, with ML tasks benefiting from lower-precision formats like FP16 for higher throughput.
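A small Python/NumPy sketch of what reduced precision means in practice (the values are arbitrary, and NumPy is used only to emulate the storage formats, not Tensor Cores):

```python
import numpy as np

x = np.float64(1.0) / np.float64(3.0)   # double precision (FP64)
y = np.float32(1.0) / np.float32(3.0)   # single precision (FP32)
z = np.float16(1.0) / np.float16(3.0)   # half precision (FP16)

# Fewer bits per value -> less accuracy, but also less memory traffic and,
# on hardware with FP16/BF16 units, higher FLOPS.
print(f"FP64: {x:.10f} ({np.float64(0).nbytes} bytes per value)")
print(f"FP32: {y:.10f} ({np.float32(0).nbytes} bytes per value)")
print(f"FP16: {z:.10f} ({np.float16(0).nbytes} bytes per value)")
```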
Examples of FLOPs in Action
- 1 FLOP:
- 2.1 + 3.2 → A single floating-point addition.
- 4.5 × 6.7 → A single floating-point multiplication.
- Multiple FLOPs:
- (2.1 + 3.2) × 4.5 → Consists of two FLOPs:
- 1 FLOP for addition: 2.1 + 3.2.
- 1 FLOP for multiplication: Result × 4.5.
- Matrix Multiplication Example:
- Consider multiplying two matrices A and B to get C = A × B:
- A: m × n matrix (say, 1000 × 200)
- B: n × p matrix (say, 200 × 1000)
- C = A × B will be an m × p matrix (1000 × 1000)
- Each element in C is computed by performing a dot (·) product of a row from A and a column from B, and each of those dot products involves:
- n (200) multiplications (one for each element of the row and column)
- n − 1 (199) additions to sum these products together
- The above multiplications and additions can be approximated as 2n (400) operations per element of C.
- Since there are m × p (1,000,000) elements in C, there will be a total of 1,000,000 × 400 = 400,000,000 operations = 400 million FLOPs (reproduced in the sketch below).
- Complex Operations:
- Operations like matrix multiplications or solving differential equations can require millions or billions of FLOPs depending on their size and complexity.
In summary, every basic floating-point arithmetic operation constitutes one FLOP, making this a fundamental measure for evaluating computational workloads.
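The matrix example above can be reproduced with a short Python helper; this is just a sketch of the standard 2·m·n·p approximation:

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """Approximate FLOPs to multiply an (m x n) matrix by an (n x p) matrix."""
    # Each of the m*p output elements needs n multiplications and n - 1
    # additions, commonly rounded up to 2*n operations.
    return 2 * m * n * p

# The example from this note: A is 1000 x 200, B is 200 x 1000.
print(matmul_flops(1000, 200, 1000))  # 400_000_000 -> 400 million FLOPs
```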
FLOP vs Other Metrics
While FLOPs provide a useful measure of raw computational power, they should be considered alongside:
- Memory bandwidth (important for data-intensive tasks).
- Latency (critical for real-time applications).
- Power efficiency (relevant for deployment and scaling).
For optimal performance, the right balance between FLOPs and these other factors is key!
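One common way to reason about that balance is arithmetic intensity (FLOPs per byte of memory traffic). The sketch below uses assumed peak figures roughly in line with an A100-class GPU, not measured values:

```python
# Roofline-style estimate: a kernel is memory-bound if its arithmetic
# intensity (FLOPs per byte moved) is below the hardware's ridge point.
peak_flops = 19.5e12        # assumed FP32 peak, ~A100 class
peak_bandwidth = 1.555e12   # assumed memory bandwidth in bytes/s (~1555 GB/s)

# The matrix example from this note, stored in FP32 (4 bytes per element).
m, n, p = 1000, 200, 1000
flops = 2 * m * n * p
bytes_moved = 4 * (m * n + n * p + m * p)   # read A and B, write C

intensity = flops / bytes_moved             # FLOPs per byte
ridge = peak_flops / peak_bandwidth         # intensity where the two limits cross

print(f"intensity = {intensity:.1f} FLOPs/byte, ridge = {ridge:.1f} FLOPs/byte")
print("compute-bound" if intensity > ridge else "memory-bound")
```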