Benchmarks¶
Metriq-Gym provides a comprehensive suite of quantum benchmarks to characterize and compare quantum hardware performance.
Running Benchmarks¶
Single Benchmark¶
```shell
mgym job dispatch metriq_gym/schemas/examples/quantum_volume.example.json \
  --provider local --device aer_simulator
```
Poll for Results¶
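A polling sketch; the exact subcommand shape (`mgym job poll` plus a job ID argument) is assumed by analogy with the dispatch command above, so check `mgym job --help` for the actual syntax:

```shell
# Poll a previously dispatched job for its results (placeholder job ID)
mgym job poll <job_id>
```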
Configuration¶
All benchmarks use JSON configuration files:
- Schemas (`metriq_gym/schemas/*.schema.json`): define parameters, types, and allowed values
- Examples (`metriq_gym/schemas/examples/*.example.json`): ready-to-run configurations
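Configurations are plain JSON. A minimal illustrative sketch (the field names below are hypothetical, not taken from the actual schema files; consult the `*.schema.json` files for the real parameters):

```python
import json

# Hypothetical benchmark configuration -- the real field names and allowed
# values are defined by the *.schema.json files, not this sketch.
config_text = """{
  "benchmark_name": "Quantum Volume",
  "num_qubits": 4,
  "shots": 1000
}"""

config = json.loads(config_text)
print(config["benchmark_name"], config["num_qubits"])
```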
Available Benchmarks¶
| Benchmark | Description |
|---|---|
| Mirror Circuits | Tests state fidelity via forward/reverse Clifford layers |
| EPLG | Error per layered gate across qubit chains |
| BSEQ | Bell state effective qubits via CHSH violation |
| WIT | Wormhole-inspired teleportation protocol |
| LR-QAOA | Linear-ramp QAOA for Max-Cut optimization |
| QML Kernel | Quantum machine learning kernel accuracy |
| QED-C Benchmarks | Application-oriented benchmarks (BV, QFT, etc.) |
| CLOPS | Circuit layer operations per second to measure device speed |
mirror_circuits¶
Mirror Circuits benchmark implementation.
Summary
Generates randomly parameterized mirror circuits that apply layers of Clifford gates, add a middle Pauli layer, and then revert the forward layers to test how well a device preserves state fidelity across the forward and reverse halves of the circuit.
Result interpretation
Polling yields MirrorCircuitsResult with:
- success_probability: fraction of runs matching the expected bitstring.
- polarization: rescales success_probability to remove the uniform-random baseline; higher implies better performance.
- binary_success: boolean indicating whether polarization exceeded 1/e.
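The polarization rescaling can be sketched in a few lines, using the standard definition from the mirror-circuits literature, p = (s - 1/2^n) / (1 - 1/2^n); the example numbers are made up:

```python
from math import e

def polarization(success_probability: float, num_qubits: int) -> float:
    """Rescale success probability to remove the uniform-random baseline:
    p = (s - 1/2**n) / (1 - 1/2**n)."""
    baseline = 1.0 / 2**num_qubits
    return (success_probability - baseline) / (1.0 - baseline)

s = 0.80  # fraction of shots returning the expected bitstring (illustrative)
n = 4     # qubits in the mirror circuit
p = polarization(s, n)
binary_success = p > 1.0 / e  # threshold described above
print(round(p, 4), binary_success)
```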
References
eplg¶
EPLG (Error Per Layered Gate) benchmark implementation.
Summary
Measures layer fidelity across qubit chains using randomized benchmarking techniques. Computes EPLG scores at various chain lengths to characterize two-qubit gate performance across the device.
Result interpretation
Polling returns EPLGResult with:
- chain_lengths: list of qubit chain lengths tested
- chain_eplgs: EPLG values at each chain length
- eplg_10/20/50/100: EPLG at standard reference points
- score: average EPLG across reference points (lower is better)
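The relation between a measured layer fidelity and EPLG can be sketched as follows, assuming the definition from IBM's layer-fidelity protocol (McKay et al., arXiv:2311.05933), where a length-n chain contains n - 1 two-qubit gates across its two disjoint layers; the layer fidelity value is made up:

```python
def eplg(layer_fidelity: float, chain_length: int) -> float:
    """Error per layered gate: EPLG = 1 - LF**(1 / n_2q), with
    n_2q = chain_length - 1 two-qubit gates in the layer pair."""
    n_2q = chain_length - 1
    return 1.0 - layer_fidelity ** (1.0 / n_2q)

# EPLG at the standard reference chain lengths for an example layer fidelity
for n in (10, 20, 50, 100):
    print(n, round(eplg(0.90, n), 5))
```

Note that the same layer fidelity spread over a longer chain yields a smaller per-gate error.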
References
bseq¶
BSEQ (Bell state effective qubits) benchmark implementation.
Summary
Evaluates how well a device generates Bell pairs that violate the CHSH inequality across its connectivity graph. Circuits are built per coloring of the topology and executed in four measurement bases to detect correlations.
Connectivity graph
The benchmark uses the device's native connectivity graph to determine which qubit pairs can be coupled. For superconducting devices (e.g., IBM), this reflects the physical coupling map with sparse connectivity. For trapped-ion devices (e.g., IonQ, Quantinuum) and simulators, all-to-all connectivity is assumed (complete graph). The graph structure affects edge coloring: a complete graph on n qubits requires n - 1 colors when n is even (n when n is odd), while sparse topologies typically require fewer colors but test fewer qubit pairs.
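As a concrete check of the coloring claim for the all-to-all case, a pure-Python sketch (the classic "circle method" from round-robin tournament scheduling; this is not metriq-gym code):

```python
def round_robin_matchings(n: int) -> list:
    """Partition the edges of the complete graph K_n (n even) into n - 1
    perfect matchings -- one matching per 'color', showing why an
    all-to-all device needs n - 1 circuit colorings."""
    assert n % 2 == 0, "circle method as written needs an even node count"
    others = list(range(n - 1))
    matchings = []
    for r in range(n - 1):
        rot = others[r:] + others[:r]      # rotate every node except n - 1
        pairs = [(rot[0], n - 1)]          # the fixed node pairs with the front
        pairs += [(rot[i], rot[-i]) for i in range(1, n // 2)]
        matchings.append(pairs)
    return matchings

for color, matching in enumerate(round_robin_matchings(4)):
    print(color, matching)
```

Each round is one "color": a set of disjoint qubit pairs whose Bell circuits can run simultaneously.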
Result interpretation
Polling returns BSEQResult with:
- largest_connected_size: size of the biggest connected subgraph of qubit pairs that violated CHSH (> 2). Higher means entanglement spans more of the device.
- fraction_connected: largest_connected_size normalized by the discovered qubit count, making it easier to compare devices of different sizes.
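The CHSH test behind the violation check can be sketched directly from counts: the CHSH value is S = |E(AB) - E(AB') + E(A'B) + E(A'B')|, and S > 2 breaks the classical bound. The counts below are idealized Bell-pair statistics at the optimal angles (each correlator has magnitude about 1/√2), not real device data:

```python
def chsh_value(counts_by_basis: list) -> float:
    """CHSH value from counts in the four measurement-basis settings,
    each a dict mapping two-bit strings to shot counts."""
    def correlator(counts):
        total = sum(counts.values())
        same = counts.get("00", 0) + counts.get("11", 0)
        diff = counts.get("01", 0) + counts.get("10", 0)
        return (same - diff) / total

    e_ab, e_abp, e_apb, e_apbp = (correlator(c) for c in counts_by_basis)
    return abs(e_ab - e_abp + e_apb + e_apbp)

counts = [
    {"00": 427, "11": 427, "01": 73, "10": 73},   # E(AB)  ~ +0.708
    {"00": 73, "11": 73, "01": 427, "10": 427},   # E(AB') ~ -0.708
    {"00": 427, "11": 427, "01": 73, "10": 73},   # E(A'B) ~ +0.708
    {"00": 427, "11": 427, "01": 73, "10": 73},   # E(A'B') ~ +0.708
]
s = chsh_value(counts)
print(round(s, 3), s > 2)  # violation when S > 2
```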
References
- Original routines attributed to Paul Nation (Qiskit Device Benchmarking).
- Clauser et al., Phys. Rev. Lett. 23, 880 (1969).
wit¶
WIT (wormhole-inspired teleportation) benchmark implementation.
Summary
Runs a six- or seven-qubit teleportation-inspired circuit that mimics the protocol from Shapoval et al. (2023) and reports a Pauli-Z expectation value with binomial uncertainty.
Result interpretation
Polling returns WITResult.expectation_value as a BenchmarkScore:
- value: estimated Pauli-Z expectation (ideal teleportation trends toward +1).
- uncertainty: binomial standard deviation computed from the observed counts.

Compare value against uncertainty to decide whether more shots are required or whether noise is degrading the teleportation fidelity.
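The expectation-plus-uncertainty pair can be sketched from counts on the readout qubit: with p = n0/N, the estimate is ⟨Z⟩ = 2p - 1, so the binomial error bar is 2·√(p(1-p)/N). The shot counts are illustrative:

```python
from math import sqrt

def z_expectation(n0: int, n1: int):
    """Pauli-Z expectation with binomial uncertainty from 0/1 counts:
    <Z> = 2p - 1 with p = n0 / N, sigma_Z = 2 * sqrt(p * (1 - p) / N)."""
    total = n0 + n1
    p = n0 / total
    value = 2.0 * p - 1.0
    uncertainty = 2.0 * sqrt(p * (1.0 - p) / total)
    return value, uncertainty

value, unc = z_expectation(930, 70)  # 1000 illustrative shots
print(round(value, 3), round(unc, 4))
```

If `value` is not several times larger than `unc`, more shots (or a less noisy device) are needed before drawing conclusions.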
References
- Shapoval et al., "Towards Quantum Gravity in the Lab on Quantum Processors", Quantum 7, 1138 (2023).
- Companion script.
- Implementation lineage credited to Paul Nation (IBM Quantum).
lr_qaoa¶
Linear Ramp QAOA (LR-QAOA) benchmark implementation.
Summary
Solves weighted Max-Cut instances with a linear-ramp parameter schedule and compares results against classical optima to estimate approximation ratios and optimal sampling probabilities.
For a deeper dive into results across various graph types, see the authors' benchmarking dashboard.
Result interpretation
Polling returns LinearRampQAOAResult with metrics including:
- approx_ratio_mean / stddev: how close average costs are to the optimum.
- optimal_probability_mean / stddev: frequency of sampling an optimal bitstring.
- confidence_pass: boolean indicating whether results meet the configured confidence.

Higher approximation ratios and optimal probabilities reflect better QAOA performance.
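For small instances, both metrics can be computed by brute force; a sketch with a made-up weighted triangle and made-up samples (not the benchmark's actual instance generator):

```python
from itertools import product

def maxcut_cost(bits, edges):
    """Cut value of a bit assignment on a weighted graph."""
    return sum(w for u, v, w in edges if bits[u] != bits[v])

def approximation_ratio(samples, edges):
    """Mean sampled cut divided by the brute-force optimum, plus the
    fraction of samples that hit an optimal cut."""
    n = 1 + max(max(u, v) for u, v, _ in edges)
    optimum = max(maxcut_cost(b, edges) for b in product((0, 1), repeat=n))
    costs = [maxcut_cost(b, edges) for b in samples]
    ratio = sum(costs) / (len(costs) * optimum)
    p_opt = sum(c == optimum for c in costs) / len(costs)
    return ratio, p_opt

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0)]           # (u, v, weight)
samples = [(0, 1, 1), (0, 1, 0), (1, 1, 0), (0, 0, 0)]    # sampled bitstrings
ratio, p_opt = approximation_ratio(samples, edges)
print(ratio, p_opt)
```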
References
- Montanez-Barrera et al., "Evaluating the performance of quantum processing units at large width and depth", arXiv:2502.06471.
qml_kernel¶
Quantum Machine Learning Kernel benchmark implementation.
Summary
Constructs a ZZ feature map kernel, computes the inner-product circuit, and measures the probability of returning to the all-zero state as a proxy for kernel quality.
Result interpretation
Polling returns QMLKernelResult.accuracy_score as a BenchmarkScore where:
- value: fraction of shots measuring the expected all-zero bitstring.
- uncertainty: binomial standard deviation from the sample counts.

Higher accuracy suggests better kernel reproducibility on the selected hardware.
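The underlying estimate is the compute-uncompute kernel trick: for feature states |φ(x)⟩, |φ(y)⟩, the all-zero return probability equals |⟨φ(x)|φ(y)⟩|². A sketch with made-up counts:

```python
from math import sqrt

def kernel_fidelity(counts: dict):
    """All-zero return probability with its binomial error bar, from a
    counts dict mapping bitstrings to shot counts."""
    shots = sum(counts.values())
    n_qubits = len(next(iter(counts)))
    p = counts.get("0" * n_qubits, 0) / shots
    return p, sqrt(p * (1.0 - p) / shots)

value, unc = kernel_fidelity({"000": 870, "001": 40, "010": 50, "111": 40})
print(round(value, 3), round(unc, 4))
```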
References
- Inspired by ZZ-feature map approaches, e.g., Bowles et al., arXiv:2405.09724.
qedc_benchmarks¶
QED-C application-oriented benchmark wrapper.
Summary
Provides a generic dispatch/poll pipeline around the QED-C benchmark suite (Bernstein-Vazirani, Phase Estimation, Hidden Shift, Quantum Fourier Transform) via the QC-App-Oriented-Benchmarks submodule.
Result interpretation
Polling returns QEDCResult.circuit_metrics, a nested dictionary keyed by qubit count and circuit identifier, populated with the fidelity or related metrics computed by the QED-C analyser. Inspect the per-circuit entries to understand performance trends.
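A traversal sketch for the nested structure; the dictionary shape and the `fidelity` key below are assumptions for illustration, since the exact keys come from the QED-C analyser:

```python
# Hypothetical QEDCResult.circuit_metrics payload, keyed by qubit count
# and circuit identifier -- real keys are produced by the QED-C analyser.
circuit_metrics = {
    "3": {"circuit_0": {"fidelity": 0.95}, "circuit_1": {"fidelity": 0.93}},
    "4": {"circuit_0": {"fidelity": 0.88}},
}

# Average the per-circuit fidelity at each qubit count to see the trend.
for num_qubits, circuits in sorted(circuit_metrics.items(), key=lambda kv: int(kv[0])):
    avg = sum(m["fidelity"] for m in circuits.values()) / len(circuits)
    print(num_qubits, round(avg, 3))
```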
References
- QED-C QC-App-Oriented-Benchmarks repository for algorithm-specific methodology.
- Lubinski et al., "Application-Oriented Performance Benchmarks for Quantum Computing", IEEE Trans. Quantum Eng. (2023).
clops¶
CLOPS (Circuit Layer Operations Per Second) benchmark implementation.
Summary
Measures the throughput of a quantum system by timing the execution of parameterized quantum volume-style circuits. CLOPS captures end-to-end performance including compilation, communication, and execution overhead.
Result interpretation
Polling returns ClopsResult with:
- clops_score: circuit layer operations per second (higher is better), measured from time of submission to job completion as reported by the cloud platform.
- steady_state_clops: circuit layer operations per second (higher is better), ignoring the first execution span to exclude pipeline startup costs and thereby measuring sustained throughput. Available only on cloud platforms that provide execution-span metadata (currently only IBM Runtime); it is None when spans are unavailable or fewer than two spans exist.
This metric reflects real-world workload performance rather than isolated gate speeds.
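The score itself follows the definition from Wack et al. (arXiv:2110.14108): CLOPS = M·K·S·D / elapsed time, for M circuit templates, K parameter updates, S shots, and D layers. A sketch using the reference parameter choices from that paper and a made-up elapsed time:

```python
def clops(num_templates: int, num_updates: int, num_shots: int,
          num_layers: int, elapsed_seconds: float) -> float:
    """CLOPS = M * K * S * D / elapsed time (Wack et al., arXiv:2110.14108)."""
    return num_templates * num_updates * num_shots * num_layers / elapsed_seconds

# M=100, K=10, S=100 are the reference values; D and the timing are made up.
score = clops(num_templates=100, num_updates=10, num_shots=100,
              num_layers=5, elapsed_seconds=180.0)
print(round(score))
```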
References
Adding Custom Benchmarks¶
See Adding New Benchmarks to contribute new benchmarks.