Benchmarks

Metriq-Gym provides a comprehensive suite of quantum benchmarks to characterize and compare quantum hardware performance.

Running Benchmarks

Single Benchmark

mgym job dispatch metriq_gym/schemas/examples/quantum_volume.example.json \
    --provider local --device aer_simulator

Poll for Results

mgym job poll <JOB_ID>

Configuration

All benchmarks use JSON configuration files:

  • Schemas: metriq_gym/schemas/*.schema.json - Define parameters, types, and allowed values
  • Examples: metriq_gym/schemas/examples/*.example.json - Ready-to-run configurations
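
As an illustration, a configuration file typically pairs a benchmark name with its parameters. The field names below are hypothetical placeholders, not the actual schema; consult the schema and example files above for the real keys:

```json
{
  "benchmark_name": "Quantum Volume",
  "num_qubits": 4,
  "shots": 1000
}
```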

Available Benchmarks

  • Mirror Circuits: tests state fidelity via forward/reverse Clifford layers
  • EPLG: error per layered gate across qubit chains
  • BSEQ: Bell state effective qubits via CHSH violation
  • WIT: wormhole-inspired teleportation protocol
  • LR-QAOA: linear-ramp QAOA for Max-Cut optimization
  • QML Kernel: quantum machine learning kernel accuracy
  • QED-C Benchmarks: application-oriented benchmarks (BV, QFT, etc.)
  • CLOPS: circuit layer operations per second to measure device speed

mirror_circuits

Mirror Circuits benchmark implementation.

Summary

Generates randomly parameterized mirror circuits that apply layers of Clifford gates, add a middle Pauli layer, and then revert the forward layers to test how well a device preserves state fidelity across the forward and reverse halves of the circuit.

Result interpretation

Polling yields MirrorCircuitsResult with:

  • success_probability: fraction of runs matching the expected bitstring.
  • polarization: rescales success_probability to remove the uniform-random baseline; higher implies better performance.
  • binary_success: boolean indicating whether polarization exceeded 1/e.
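
The polarization rescaling can be sketched from the definitions above (a minimal illustration; Metriq-Gym's exact estimator may differ):

```python
import math

def polarization(success_probability: float, num_qubits: int) -> float:
    """Rescale success probability to remove the 1/2^n uniform-random baseline."""
    baseline = 1.0 / 2**num_qubits
    return (success_probability - baseline) / (1.0 - baseline)

def binary_success(pol: float) -> bool:
    """Pass/fail threshold at 1/e, as described above."""
    return pol > 1.0 / math.e
```

A perfectly random device lands at polarization 0, a noiseless one at 1, so the 1/e threshold gives a scale-free pass/fail criterion.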

References

eplg

EPLG (Error Per Layered Gate) benchmark implementation.

Summary

Measures layer fidelity across qubit chains using randomized benchmarking techniques. Computes EPLG scores at various chain lengths to characterize two-qubit gate performance across the device.

Result interpretation

Polling returns EPLGResult with:

  • chain_lengths: list of qubit chain lengths tested
  • chain_eplgs: EPLG values at each chain length
  • eplg_10/20/50/100: EPLG at standard reference points
  • score: average EPLG across reference points (lower is better)
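
EPLG is conventionally derived from the measured layer fidelity. A sketch of that relation, following the IBM layer-fidelity convention and assuming a length-L chain contributes L-1 two-qubit gates (the estimator inside Metriq-Gym may differ):

```python
def eplg_from_layer_fidelity(layer_fidelity: float, chain_length: int) -> float:
    """EPLG = 1 - LF**(1/n_gates), with n_gates two-qubit gates per layer."""
    n_gates = chain_length - 1  # a length-L chain couples L-1 qubit pairs
    return 1.0 - layer_fidelity ** (1.0 / n_gates)
```

Note that for a fixed layer fidelity, a longer chain implies a smaller per-gate error, which is why EPLG is reported at several reference chain lengths.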

References

bseq

BSEQ (Bell state effective qubits) benchmark implementation.

Summary

Evaluates how well a device generates Bell pairs that violate the CHSH inequality across its connectivity graph. Circuits are built per coloring of the topology and executed in four measurement bases to detect correlations.

Connectivity graph

The benchmark uses the device's native connectivity graph to determine which qubit pairs can be coupled. For superconducting devices (e.g., IBM), this reflects the physical coupling map with sparse connectivity. For trapped-ion devices (e.g., IonQ, Quantinuum) and simulators, all-to-all connectivity is assumed (complete graph). The graph structure affects edge coloring: a complete graph on n qubits requires n-1 colors when n is even (n when n is odd), while sparse topologies typically require fewer colors but test fewer qubit pairs.
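
The edge-coloring idea can be sketched in a few lines: each color class is a set of qubit pairs that share no qubit, so all Bell pairs within one class can be prepared and measured in parallel. A minimal greedy sketch (not Metriq-Gym's actual coloring strategy):

```python
from itertools import combinations

def greedy_edge_coloring(edges):
    """Assign each edge the smallest color not used by edges sharing a qubit."""
    colors = {}
    for u, v in edges:
        used = {c for (a, b), c in colors.items() if {a, b} & {u, v}}
        color = 0
        while color in used:
            color += 1
        colors[(u, v)] = color
    return colors

# Complete graph on 4 qubits (all-to-all, e.g. a trapped-ion device):
k4 = list(combinations(range(4), 2))
coloring = greedy_edge_coloring(k4)
```

On this K4 example the greedy pass happens to reach the optimum of 3 colors; in general a greedy coloring may use more colors than an optimal one.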

Result interpretation

Polling returns BSEQResult with:

  • largest_connected_size: size of the biggest connected subgraph of qubit pairs that violated CHSH (> 2). Higher means entanglement spans more of the device.
  • fraction_connected: largest_connected_size normalized by the discovered qubit count, making it easier to compare devices of different sizes.
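
The CHSH test behind these results can be sketched as follows: estimate a correlator from the counts in each of the four measurement bases, then combine the four correlators into the CHSH value. An illustrative sketch; the estimator used by Metriq-Gym may differ in detail:

```python
def correlator(counts: dict) -> float:
    """E = (N_same - N_diff) / N for two-qubit outcome counts like '00', '01', ..."""
    total = sum(counts.values())
    same = counts.get("00", 0) + counts.get("11", 0)
    diff = counts.get("01", 0) + counts.get("10", 0)
    return (same - diff) / total

def chsh_value(e_ab, e_ab2, e_a2b, e_a2b2) -> float:
    """Combine the four basis correlators; > 2 violates the classical bound."""
    return abs(e_ab + e_ab2 + e_a2b - e_a2b2)
```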

References

wit

WIT (wormhole-inspired teleportation) benchmark implementation.

Summary

Runs a six- or seven-qubit teleportation-inspired circuit that mimics the protocol from Shapoval et al. (2023) and reports a Pauli-Z expectation value with binomial uncertainty.

Result interpretation

Polling returns WITResult.expectation_value as a BenchmarkScore:

  • value: estimated Pauli-Z expectation (ideal teleportation trends toward +1).
  • uncertainty: binomial standard deviation computed from the observed counts.

Compare value against uncertainty to decide whether more shots are required or whether noise is degrading the teleportation fidelity.
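
The value/uncertainty pair can be reproduced from raw counts as follows (an assumed estimator, shown to clarify how the two quantities relate):

```python
import math

def z_expectation(zeros: int, ones: int):
    """Pauli-Z expectation and binomial standard deviation from shot counts."""
    n = zeros + ones
    p = zeros / n                       # probability of the +1 outcome
    value = 2 * p - 1                   # <Z> = p(+1) - p(-1)
    uncertainty = 2 * math.sqrt(p * (1 - p) / n)
    return value, uncertainty
```

Because the uncertainty shrinks as 1/sqrt(shots), a value within one or two uncertainties of zero usually calls for more shots before drawing conclusions.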

References

lr_qaoa

Linear Ramp QAOA (LR-QAOA) benchmark implementation.

Summary

Solves weighted Max-Cut instances with a linear-ramp parameter schedule and compares results against classical optima to estimate approximation ratios and optimal sampling probabilities.

For a deeper dive into results across various graph types, see the authors' benchmarking dashboard.

Result interpretation

Polling returns LinearRampQAOAResult with metrics including:

  • approx_ratio_mean / stddev: how close average costs are to the optimum.
  • optimal_probability_mean / stddev: frequency of sampling an optimal bitstring.
  • confidence_pass: boolean indicating whether results meet the configured confidence.

Higher approximation ratios and optimal probabilities reflect better QAOA performance.
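
The linear-ramp schedule and the approximation ratio can be sketched as follows. The ramp form follows Montanez-Barrera et al.; the slope delta and the ratio convention here are illustrative assumptions:

```python
def linear_ramp_schedule(p: int, delta: float = 0.6):
    """gamma_k ramps up and beta_k ramps down linearly over p QAOA layers."""
    gammas = [delta * k / p for k in range(1, p + 1)]
    betas = [delta * (1 - k / p) for k in range(1, p + 1)]
    return gammas, betas

def approximation_ratio(mean_cost: float, optimal_cost: float) -> float:
    """For Max-Cut (maximization): average sampled cut value over the optimum."""
    return mean_cost / optimal_cost
```

The appeal of the linear ramp is that it removes the classical parameter-optimization loop, so the benchmark probes the hardware rather than the optimizer.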

References
  • Montanez-Barrera et al., "Evaluating the performance of quantum processing units at large width and depth", arXiv:2502.06471.

qml_kernel

Quantum Machine Learning Kernel benchmark implementation.

Summary

Constructs a ZZ feature map kernel, computes the inner-product circuit, and measures the probability of returning to the all-zero state as a proxy for kernel quality.

Result interpretation

Polling returns QMLKernelResult.accuracy_score as a BenchmarkScore where:

  • value: fraction of shots measuring the expected all-zero bitstring.
  • uncertainty: binomial standard deviation from the sample counts.

Higher accuracy suggests better kernel reproducibility on the selected hardware.
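
The score can be sketched from raw counts: the all-zero return probability of the compute-uncompute circuit estimates the kernel entry, with a binomial error bar. An illustrative sketch, not the exact Metriq-Gym implementation:

```python
import math

def kernel_estimate(counts: dict):
    """All-zero probability and binomial standard deviation from shot counts."""
    shots = sum(counts.values())
    width = len(next(iter(counts)))          # bitstring length, e.g. '000' -> 3
    p_zero = counts.get("0" * width, 0) / shots
    uncertainty = math.sqrt(p_zero * (1 - p_zero) / shots)
    return p_zero, uncertainty
```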

References

qedc_benchmarks

QED-C application-oriented benchmark wrapper.

Summary

Provides a generic dispatch/poll pipeline around the QED-C benchmark suite (Bernstein-Vazirani, Phase Estimation, Hidden Shift, Quantum Fourier Transform) via the QC-App-Oriented-Benchmarks submodule.

Result interpretation

Polling returns QEDCResult.circuit_metrics, a nested dictionary keyed by qubit count and circuit identifier, populated with the fidelity or related metrics computed by the QED-C analyser. Inspect the per-circuit entries to understand performance trends.

References

clops

CLOPS (Circuit Layer Operations Per Second) benchmark implementation.

Summary

Measures the throughput of a quantum system by timing the execution of parameterized quantum volume-style circuits. CLOPS captures end-to-end performance including compilation, communication, and execution overhead.

Result interpretation

Polling returns ClopsResult with:

  • clops_score: circuit layer operations per second (higher is better), measured from job submission to completion as reported by the cloud platform.
  • steady_state_clops: circuit layer operations per second (higher is better), ignoring the first execution span to exclude pipeline startup costs and thereby measure sustained throughput. Available only on cloud platforms that provide execution-span metadata (currently only IBM Runtime); None when spans are unavailable or fewer than two spans exist.

This metric reflects real-world workload performance rather than isolated gate speeds.
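
For intuition, IBM's published CLOPS definition combines the number of circuit templates (M), parameter updates (K), shots (S), and QV layer depth (D) into a single throughput figure. A sketch of that formula (Metriq-Gym's exact accounting may differ):

```python
def clops(m: int, k: int, s: int, d: int, elapsed_seconds: float) -> float:
    """CLOPS = (M * K * S * D) / elapsed time, i.e. circuit layers executed per second."""
    return (m * k * s * d) / elapsed_seconds
```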

References

Adding Custom Benchmarks

See Adding New Benchmarks to contribute new benchmarks.