Evaluating the Chasm
Imagine building the roads but never collecting tolls. That’s AI evaluation today—researchers create benchmarks that validate billion-dollar models, then receive nothing. Zero payment for work that becomes critical infrastructure for an entire field.
Meanwhile, the ARC Prize offers $725,000 for beating a single benchmark. Hugging Face raised $400 million at a $4.5 billion valuation essentially for hosting models and many evaluation datasets. Papers with Code catalogs over 12,000 evaluation datasets representing thousands of researcher-hours, yet their creators capture none of this value.
This is the evaluation economy’s core problem: the people building the benchmarks that determine how well AI systems work can’t monetize their expertise. We pay massive prizes for solving evaluations but peanuts for creating them.
The consequences are clear: Benchmark contamination is widespread because there’s little to no incentive to create fresh, protected datasets. Domain-specific evaluation barely exists because those experts have no sustainable way to contribute. When MMLU gets saturated, when GLUE becomes trivial, when every major benchmark shows “superhuman” performance within months of release, we scramble to create new evaluations, but the core problem remains.
The AI community has built incredible infrastructure for training and inference, but evaluation remains stuck in an academic gift economy that, while valuable for open science, leaves domain experts without sustainable business models for their specialized knowledge. While foundation model companies raise billions and inference providers scale globally, those creating the benchmarks that validate these systems rely on institutional goodwill and grant funding.
The biggest opportunity lies in domain-specific evaluation. General benchmarks like MMLU and GLUE tell us little about how models perform in specialized contexts—legal document analysis, financial risk assessment, or scientific research assistance. These domains require deep expertise to evaluate properly, yet the experts who understand them have no way to make money from their knowledge. The result is a massive gap between what we can measure and what actually matters for real-world AI deployment.
This isn’t sustainable. Production AI demands rigorous evaluation for deployment success. Companies increasingly recognize that comprehensive evaluation—even if more resource-intensive than basic testing—is essential for high-stakes applications where model reliability directly impacts business outcomes, and they are willing to pay premium prices for it. Yet we’ve created no mechanism for bridging this gap.
When a single evaluation benchmark commands a $725,000 prize pool, when AI infrastructure companies achieve multi-billion valuations, when over 12,000 datasets exist but their creators earn nothing—that’s a fundamental market failure waiting to be solved.
Evault: Verifiable Measurement at Scale
Evault begins with a solid foundation: Ritual’s ability to call Trusted Execution Environments (TEEs), which enables cryptographically verifiable evaluation results while preserving evaluation logic confidentiality. This core capability is live and operational, providing the infrastructure for a new evaluation economy.
Crucially, Ritual’s ability to call TEEs directly from smart contracts enables seamless integration of cryptographically verified evaluation results with on-chain economic behavior. This contract-native approach allows evaluation outcomes to automatically trigger payments, update reputation scores, or execute complex incentive mechanisms without requiring trusted intermediaries—creating the foundation for bootstrapping the economic layers that transform evaluation from a cost center into a value-generating activity.
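To make this concrete, here is a minimal sketch, in Python and purely for illustration, of the settlement logic such a contract-native integration could encode: a payout escrow that releases funds only when a score arrives with a valid attestation from the expected evaluation enclave. Every name here (EvalResult, Escrow, verify_attestation, transfer, EVAL_CODE_HASH) is a hypothetical stand-in rather than Ritual’s actual contract API, and in a real deployment this logic would live on-chain.

```python
from dataclasses import dataclass

# Placeholder for the expected measurement (code hash) of the evaluation enclave.
EVAL_CODE_HASH = b"expected-enclave-measurement"

def verify_attestation(quote: bytes, expected_measurement: bytes) -> bool:
    """Stand-in for real TEE quote verification (signature chain plus enclave
    measurement check); the stub just looks for a tag inside the fake quote."""
    return quote.endswith(expected_measurement)

def transfer(payee: str, amount: int) -> None:
    """Stand-in for an on-chain token transfer."""
    print(f"transfer {amount} -> {payee}")

@dataclass
class EvalResult:
    model_id: str
    score: float
    quote: bytes  # TEE attestation binding the score to the evaluation code

class Escrow:
    """Holds funds until a verified score crosses a payout threshold."""

    def __init__(self, threshold: float, amount: int) -> None:
        self.threshold = threshold
        self.amount = amount
        self.paid = False

    def settle(self, result: EvalResult, payee: str) -> bool:
        # Only scores attested by the expected evaluation enclave count.
        if not verify_attestation(result.quote, EVAL_CODE_HASH):
            return False
        # Economic logic keyed directly off the verified evaluation outcome.
        if result.score >= self.threshold and not self.paid:
            transfer(payee, self.amount)
            self.paid = True
            return True
        return False
```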
Current Layer: Secure Evaluations
Evault, our prototype, offers secure, verifiable evaluation of AI models across different benchmarks using TEEs. Users can run evaluations with cryptographic guarantees that ensure results are untampered and evaluation logic remains protected. Modern TEE implementations have significantly reduced performance overhead compared to early versions, with optimized configurations showing minimal impact on evaluation workloads. Unlike high-frequency inference workloads, evaluation tasks are typically run periodically for model validation and comparison, making any additional computational overhead acceptable for the security and verifiability benefits gained.
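As a rough sketch of what “verifiable evaluation” means mechanically (illustrative only, not Evault’s actual implementation): the enclave runs the benchmark against the model, keeps the protected test items inside, and signs a digest of the outcome so the score can be checked externally without exposing the evaluation data. Here run_model and enclave_sign are placeholder stubs for model inference and TEE attestation.

```python
import hashlib
import json

def run_model(model_id: str, prompt: str) -> str:
    return "stub answer"  # stand-in for model inference inside the enclave

def enclave_sign(digest: bytes) -> bytes:
    return b"attestation-over-" + digest  # stand-in for a real TEE quote/signature

def evaluate_in_enclave(model_id: str, benchmark_id: str, items: list[dict]) -> dict:
    # Protected test items never leave the enclave; only the aggregate score does.
    correct = sum(run_model(model_id, it["prompt"]).strip() == it["answer"]
                  for it in items)
    score = correct / len(items)
    # Bind model, benchmark, and score into a single digest and attest to it.
    digest = hashlib.sha256(json.dumps(
        {"model": model_id, "benchmark": benchmark_id, "score": score},
        sort_keys=True).encode()).digest()
    return {"model": model_id, "benchmark": benchmark_id,
            "score": score, "attestation": enclave_sign(digest)}
```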
We made a demo of Evault here:
Expansion Paths: Three Coordination Mechanisms
Building on this foundation, we think Evault could serve as the platform for at least three complementary directions that would transform the evaluation economy:
Meridian: Distributed Challenge Coordination
Meridian would enable users to create fully transparent, globally accessible prize pools for domain-specific evaluation challenges. Unlike traditional competitions requiring months of setup, this system could offer instant bounty creation with automatic payouts tied to verifiable model performance scores. While platforms like Kaggle have proven the bounty model works, an on-chain implementation would bring unique advantages: no geographic restrictions on payments or payouts, no complex administration, and no need to trust centralized operators. Users could launch bounties instantly while maintaining complete control over proprietary datasets through TEE protection.
Meridian could further incorporate reputation-based participation filters for both model creators and competition hosts, along with stake-weighted governance, to prevent coordinated manipulation while maintaining open access for legitimate participants.
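A toy sketch of this bounty lifecycle is below, under the assumption that every submitted score has already been verified via TEE attestation before it reaches the bounty (as in the settlement sketch above). The names, the reputation filter, and the winner-takes-all payout rule are illustrative choices, not a proposed design.

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    team: str
    score: float  # assumed to be cryptographically verified upstream

@dataclass
class Bounty:
    benchmark_id: str
    prize_pool: int
    min_reputation: float  # reputation-based participation filter
    submissions: list[Submission] = field(default_factory=list)

    def submit(self, team: str, score: float, reputation: float) -> bool:
        # Filter out participants below the reputation threshold.
        if reputation < self.min_reputation:
            return False
        self.submissions.append(Submission(team, score))
        return True

    def settle(self) -> dict[str, int]:
        # Winner-takes-all payout; a real design might split across the top k.
        if not self.submissions:
            return {}
        best = max(self.submissions, key=lambda s: s.score)
        return {best.team: self.prize_pool}
```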
Prism: Evaluation Asset Exchange
Prism would allow evaluation creators to make money from their curated datasets, earning fees from model owners seeking cryptographically verifiable proof of their models’ performance. A staking mechanism would create quality incentives—evaluation creators stake tokens when listing datasets, with multi-tier arbitration ensuring dataset value while rewarding meaningful contributions. The arbitration process would combine automated quality checks via LLMs with expert human review for edge cases, creating robust quality control without centralized gatekeeping. On the other side of the market, participants could stake or pay to have their models compete to reach a threshold score on those datasets, with the collective stake or payment pool then redistributed to the top performers and the dataset creators.
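To illustrate the pool mechanics on that side of the market, here is a back-of-the-envelope sketch: each participant stakes an entry fee, the pool takes a cut for the dataset creator, and the remainder is split among everyone who clears the threshold. The 20% creator cut and the even split are arbitrary placeholders rather than a proposed fee design, and scores are assumed to be already verified.

```python
def settle_pool(entries: dict[str, float], threshold: float,
                entry_stake: int, creator_cut: float = 0.2) -> dict[str, int]:
    """entries maps participant -> verified score; everyone staked entry_stake."""
    pool = entry_stake * len(entries)
    creator_share = int(pool * creator_cut)      # reward for the dataset creator
    winners = [p for p, s in entries.items() if s >= threshold]
    payouts = {"creator": creator_share}
    if winners:
        per_winner = (pool - creator_share) // len(winners)
        for p in winners:
            payouts[p] = per_winner              # split remainder among qualifiers
    return payouts

# Example: three participants stake 100 each, two clear the 0.9 threshold.
print(settle_pool({"a": 0.91, "b": 0.84, "c": 0.95}, threshold=0.9, entry_stake=100))
# -> {'creator': 60, 'a': 120, 'c': 120}
```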
High-quality evaluations would attract more participants, generating more revenue for creators and encouraging further investment in evaluation quality. This would particularly benefit AI researchers and domain experts who understand specific fields but currently lack sustainable monetization paths.
Synthesis: Emergent Evaluation Networks
Synthesis would enable crowdsourced creation of challenging examples designed to test model limits and accelerate their evolution. Users could submit individual evaluation examples designed explicitly to challenge state-of-the-art AI models, with TEE-based AI judges filtering low-quality example submissions and scoring LLM answers. High-impact and popular evaluations would be rewarded to incentivize upkeep and recognize their role in advancing frontier model capabilities. Synthesis would aim to create continuously evolving benchmarks that stay ahead of capabilities, driven by market incentives rather than academic cycles, while maintaining the collaborative spirit that has made open evaluation datasets successful.
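A simplified sketch of that loop: an AI judge (which would run inside a TEE in the real design) screens submitted examples, and an example’s reward scales with how many current models it defeats. Here judge_quality and ask_model are placeholder stubs, and the reward rule is illustrative only.

```python
def judge_quality(example: dict) -> bool:
    # Stub for the TEE-based AI judge that filters low-quality submissions.
    return bool(example.get("prompt")) and bool(example.get("answer"))

def ask_model(model_id: str, prompt: str) -> str:
    return "stub answer"  # stand-in for querying a frontier model

def score_example(example: dict, models: list[str], bounty_per_example: int) -> int:
    if not judge_quality(example):
        return 0  # rejected by the judge, no reward
    failures = sum(ask_model(m, example["prompt"]).strip() != example["answer"]
                   for m in models)
    # Reward scales with how many state-of-the-art models the example defeats.
    return bounty_per_example * failures // len(models)
```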
The Evault Platform
These potential mechanisms would give Evault a comprehensive incentive structure targeting different participants across the evaluation stack. Users with specific evaluation needs could coordinate challenges through Meridian. Domain experts could monetize their specialized knowledge through Prism. The broader community could contribute to emergent evaluation and earn rewards through Synthesis.
Evaluation quality often requires significant upfront investment and specialized expertise, making it natural for enterprise users to pay premium prices for high-quality, domain-specific assessment capabilities that directly impact their deployment confidence. Evault aims to capture these unmonetized flows.
Scaling Evaluation Infrastructure
The timing aligns perfectly with AI’s maturation. Models have reached sufficient capability to be deployed in critical applications where evaluation quality determines success or failure. The contamination crisis has made the AI community acutely aware of evaluation’s importance. Cryptographic infrastructure now supports the protection mechanisms and incentive structures needed for sustainable evaluation markets.
Meanwhile, market validation is emerging from multiple directions. Enterprise customers are increasingly requesting specialized evaluation services. Academic institutions are exploring sustainable funding models for evaluation research. AI companies are investing heavily in internal evaluation capabilities, demonstrating clear market demand for high-quality assessment tools.
Evault, and the directions it could grow into, targets exactly this gap. Reliable evaluation is critical infrastructure for AI progress. The platform that successfully creates sustainable incentives for high-quality evaluation will play a foundational role in ensuring AI systems are properly validated before deployment. With the pace of AI advancement outstripping our ability to evaluate it effectively, the stakes couldn’t be higher.
We’re looking for a team that understands both the technical challenges of AI evaluation and the product dynamics of incentive-driven platforms. The ideal team combines deep AI evaluation expertise with proven experience building and scaling marketplace platforms. If you’ve felt the pain of creating valuable evaluation datasets with no way to make money, if you understand the evaluation crisis facing the AI community, or if you simply believe you’re the right team to build this platform, we want to talk.
The evaluation economy is inevitable. The question is who will build it.
Ready to go? Apply to Ritual Shrine and mention Evault in your application.