
Note: FLOP and FLOPS are different concepts, though in casual usage the terms often overlap.

FLOP vs. FLOPS

FLOP (Floating Point Operation)

  • Refers to a single operation involving floating-point arithmetic.
    • These are calculations involving real numbers (decimals), which are more complex than integer operations.
    • Used in tasks like simulations, rendering, and machine learning, where precision is crucial.
  • Unit of Measurement: FLOP is a count or measure of how many floating-point calculations are needed for a task or algorithm.
    • Example: Multiplying two large matrices may require millions of FLOPs.
  • Examples of floating-point operations include:
    • Addition (e.g., 2.1 + 3.2)
    • Multiplication (e.g., 4.5 × 6.7)
    • Division (e.g., 6.7 ÷ 2.1)
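
A minimal Python sketch of the operations listed above; the FLOP counts in the comments are manual tallies for illustration, not something the interpreter measures.

```python
# Each of these single floating-point operations counts as 1 FLOP.
a, b = 2.1, 3.2

add_result = a + b        # 1 FLOP (addition)
mul_result = a * b        # 1 FLOP (multiplication)
div_result = a / b        # 1 FLOP (division)

# A compound expression costs one FLOP per arithmetic step:
compound = (a + b) * 4.5  # 1 addition + 1 multiplication = 2 FLOPs

print(add_result, mul_result, div_result, compound)
```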

FLOPS (Floating Point Operations Per Second)

  • Refers to the rate at which floating-point operations are performed, measuring computational performance.
  • Unit of Measurement: FLOPS (or variations like GFLOPS, TFLOPS, PFLOPS) is used to indicate the speed or throughput of a processor.
    • GFLOPS = Billion ($10^9$) FLOPs per second.
    • TFLOPS = Trillion ($10^{12}$) FLOPs per second.
    • PFLOPS = Quadrillion ($10^{15}$) FLOPs per second.
  • Example usage:
    • A GPU capable of 10 TFLOPS can perform 10 trillion ($10^{13}$) floating-point operations per second.
  • Importance in GPUs:
    • GPUs are optimized for parallel processing, making them capable of achieving extremely high FLOPS compared to CPUs.
    • For instance, a modern GPU like NVIDIA’s A100 has FP32 performance around 19.5 TFLOPS, meaning it can perform ~19.5 trillion 32-bit floating-point calculations per second.
  • Practical Applications: graphics rendering, machine learning training and inference, and scientific simulations.
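
To make "operations per second" concrete, here is a rough sketch (assuming NumPy is installed) that times a large matrix multiplication and divides its approximate FLOP count (about 2·n³ for an n×n matmul) by the elapsed time; the achieved GFLOPS will vary a lot with your hardware and BLAS library.

```python
import time
import numpy as np

# Rough sketch: estimate achieved FLOPS by timing an n x n matrix multiply.
# An n x n matmul needs roughly 2 * n^3 floating-point operations
# (n multiplications and ~n additions per output element).
n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b  # warm-up run so one-time setup costs don't skew the timing

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3                 # approximate FLOP count of the workload
gflops = flops / elapsed / 1e9   # achieved throughput in GFLOPS

print(f"{flops:.3e} FLOPs in {elapsed:.4f} s -> ~{gflops:.1f} GFLOPS")
```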

FLOP vs. FLOPS: Key Differences

| Aspect | FLOP | FLOPS |
| --- | --- | --- |
| Meaning | A single floating-point operation. | Floating-point operations completed per second. |
| Purpose | Describes workload size or algorithm complexity. | Measures processing power or performance. |
| Example Context | "This algorithm requires 400 million FLOPs." | "This GPU delivers 10 TFLOPS." |
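
A tiny sketch of how the two quantities combine: dividing a workload's FLOP count (the ~400-million-FLOP matmul worked out later in this note) by a device's FLOPS (the 10 TFLOPS GPU mentioned above) gives an idealized lower bound on runtime.

```python
# Tie the two concepts together: time ≈ workload FLOPs / device FLOPS.
workload_flops = 400e6   # ~400 million FLOPs (the matmul example below)
device_flops = 10e12     # a GPU delivering 10 TFLOPS

ideal_seconds = workload_flops / device_flops
print(f"Ideal runtime: {ideal_seconds * 1e6:.0f} microseconds")  # ~40 µs
# Real runtimes are longer, since peak FLOPS is rarely sustained.
```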

Analogy:

==Think of a FLOP as a single task (like hammering one nail) and FLOPS as the rate at which you can hammer nails per second. A project may require many hammering actions (FLOPs), and how quickly you finish it depends on how fast you can hammer (FLOPS).==


FLOP in detail

FLOP stands for Floating Point Operation: a fundamental arithmetic operation on floating-point numbers (e.g., addition, subtraction, multiplication, division). In the context of GPUs and other computing devices, the plural "FLOPs" is often used loosely to mean FLOPS, i.e., the number of such operations a system can perform per second, and in that sense serves as a measure of computational performance.

Key Points About FLOP in GPUs

  1. Floating-Point Precision:
    • GPUs can perform operations at different precision levels, which affects their FLOP throughput:
      • FP32 (Single Precision): Commonly used in many ML/DL tasks and general computing.
      • FP64 (Double Precision): Required for high-precision scientific computations.
      • FP16 (Half Precision): Used in ML/DL for faster computations with reduced precision.
      • BF16 (bfloat16) / INT8: Used for tasks like deep learning inference for further speedup (INT8 is an integer format, so strictly its throughput is counted in OPS rather than FLOPS).
  2. Measuring FLOPS:
    • FLOPS (floating-point operations per second) measures a GPU's computational throughput.
    • For instance, a GPU rated at 5 TFLOPS can perform 5 trillion floating-point operations per second.
  3. Relevance of FLOP in GPUs:
    • GPUs are optimized for high FLOP throughput because many tasks in graphics rendering, machine learning, and scientific simulations require massive numbers of floating-point operations.
    • In deep learning, FLOPs are a common benchmark for measuring the compute intensity of a model (e.g., number of FLOPs required to process one forward pass of a neural network).
  4. Theoretical vs. Practical FLOPs:
    • Theoretical Peak FLOPs: Maximum achievable performance under ideal conditions.
    • Practical FLOPs: Actual performance achieved in real-world workloads, often lower due to inefficiencies like memory bottlenecks, control-flow divergence, or under-utilized cores.
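
As a rough illustration of point 4, the sketch below derives a theoretical peak from cores × clock × FLOPs-per-cycle and compares a measured throughput against it; all the hardware numbers are made-up placeholders, not the specs of any particular GPU.

```python
# Sketch: theoretical peak FLOPS vs. practically achieved FLOPS.
# All hardware numbers below are illustrative placeholders.
cuda_cores = 4096             # number of FP32 cores (hypothetical)
clock_hz = 1.5e9              # boost clock in Hz (hypothetical)
flops_per_core_per_cycle = 2  # e.g. one fused multiply-add = 2 FLOPs

theoretical_peak = cuda_cores * clock_hz * flops_per_core_per_cycle  # FLOPS

measured = 7.4e12             # throughput observed on a real workload (hypothetical)

utilization = measured / theoretical_peak
print(f"Theoretical peak: {theoretical_peak / 1e12:.1f} TFLOPS")
print(f"Measured:         {measured / 1e12:.1f} TFLOPS")
print(f"Utilization:      {utilization:.1%}")  # usually well below 100%
```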

Example of FLOP Throughput in GPUs

  • NVIDIA A100:
    • FP64: 9.7 TFLOPS (19.5 TFLOPS with Tensor Cores)
    • FP32: 19.5 TFLOPS (156 TFLOPS for TF32 on Tensor Cores)
    • FP16: 312 TFLOPS (Tensor Cores)

This highlights how GPUs are optimized for specific workloads, with ML tasks benefiting from lower-precision formats like FP16 for higher throughput.
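
To see the precision effect directly, here is a sketch (assuming PyTorch and a CUDA-capable GPU are available) that times the same matrix multiply in FP32 and FP16; the helper name matmul_tflops is just an illustrative function, not a library API, and the exact numbers depend entirely on your GPU.

```python
import time
import torch

def matmul_tflops(dtype, n=4096, iters=10):
    """Time n x n matmuls at the given precision and return achieved TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                     # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters  # ~2 * n^3 FLOPs per matmul
    return flops / elapsed / 1e12

print(f"FP32: ~{matmul_tflops(torch.float32):.1f} TFLOPS")
print(f"FP16: ~{matmul_tflops(torch.float16):.1f} TFLOPS")  # typically much higher
```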

Examples of FLOPs in Action

  1. 1 FLOP:
    • 2.1 + 3.2 → A single floating-point addition.
    • 4.5 × 6.7 → A single floating-point multiplication.
  2. Multiple FLOPs:
    • (2.1 + 3.2) × 4.5 → Consists of two FLOPs:
      • 1 FLOP for the addition: 2.1 + 3.2.
      • 1 FLOP for the multiplication: result × 4.5.
  3. Matrix Multiplication Example:
    • Consider multiplying two matrices, $A$ and $B$, to get $C$:
      • $A$: an $m \times n$ matrix (say, $1000 \times 200$)
      • $B$: an $n \times p$ matrix (say, $200 \times 1000$)
      • $C = A \cdot B$ will be an $m \times p$ matrix ($1000 \times 1000$)
    • Each element of $C$ is computed as a dot ($\cdot$) product of a row from $A$ and a column from $B$, and each dot product involves
      • $n$ (= 200) multiplications (one for each element of the row and column)
      • $n - 1$ (= 199) additions to sum these products together
      • These multiplications and additions can be approximated as $2n$ (= 400) operations.
    • Since there are $m \times p$ (= $10^6$) elements in $C$, the total is about $2 \cdot m \cdot n \cdot p = 2 \times 1000 \times 200 \times 1000 = 4 \times 10^8$ operations = 400 million FLOPs (see the counting sketch after this list).
  4. Complex Operations:
    • Operations like large matrix multiplications or solving differential equations can require millions or billions of FLOPs, depending on their size and complexity.
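
Here is a minimal pure-Python sketch of the counting argument from example 3: a naive triple-loop matrix multiply with explicit multiplication/addition counters, checked against the 2·m·n·p approximation (the function name is just an illustrative helper).

```python
# Naive matrix multiply with explicit FLOP counters, to verify the
# "about 2 * m * n * p operations" approximation from example 3.
def matmul_with_flop_count(A, B):
    m, n = len(A), len(A[0])   # A is m x n
    n2, p = len(B), len(B[0])  # B is n x p
    assert n == n2
    mults = adds = 0
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            acc = A[i][0] * B[0][j]       # first product of the dot product
            mults += 1
            for k in range(1, n):
                acc += A[i][k] * B[k][j]  # one multiply and one add per step
                mults += 1
                adds += 1
            C[i][j] = acc
    return C, mults, adds

# Small dimensions so it runs instantly; the same count scales to the
# 1000 x 200 by 200 x 1000 case (~400 million FLOPs) discussed above.
m, n, p = 4, 3, 5
A = [[1.0] * n for _ in range(m)]
B = [[2.0] * p for _ in range(n)]
_, mults, adds = matmul_with_flop_count(A, B)
print(mults, adds, 2 * m * n * p)  # m*n*p mults, m*(n-1)*p adds, vs. 2*m*n*p
```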

In summary, every basic floating-point arithmetic operation constitutes one FLOP, making this a fundamental measure for evaluating computational workloads.

FLOP vs Other Metrics

While FLOPs provide a useful measure of raw computational power, they should be considered alongside:
  • Memory bandwidth (important for data-intensive tasks).
  • Latency (critical for real-time applications).
  • Power efficiency (relevant for deployment and scaling).

For optimal performance, striking a balance between FLOPs and these factors is key!
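
One common way to relate FLOPs to memory bandwidth is arithmetic intensity (FLOPs per byte of data moved), compared against the machine balance (peak FLOPS per byte of bandwidth). The sketch below applies this roofline-style check to the matmul example from this note; the bandwidth figure and the idealized byte count are assumptions for illustration only.

```python
# Roofline-style back-of-the-envelope: is a workload compute-bound or
# memory-bandwidth-bound? Hardware figures below are illustrative.
peak_tflops = 19.5     # peak FP32 throughput, in TFLOPS
bandwidth_gbps = 1555  # memory bandwidth, in GB/s (assumed figure)

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Example: the 1000x200 @ 200x1000 matmul in FP32 (4 bytes per element).
m, n, p = 1000, 200, 1000
flops = 2 * m * n * p
bytes_moved = 4 * (m * n + n * p + m * p)  # read A and B, write C (ideal case)

ai = arithmetic_intensity(flops, bytes_moved)
balance = (peak_tflops * 1e12) / (bandwidth_gbps * 1e9)  # FLOPs per byte the GPU can sustain
print(f"Arithmetic intensity: {ai:.1f} FLOPs/byte, machine balance: {balance:.1f} FLOPs/byte")
print("compute-bound" if ai > balance else "memory-bandwidth-bound")
```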