SolidityBench by IQ has launched as the first benchmark to gauge LLMs in Solidity code generation. Available on Hugging Face, it features two new benchmarks, NaïveJudge and HumanEval for Solidity, designed to assess and rank the proficiency of AI models in generating smart contract code.
Developed by IQ's BrainDAO as part of its upcoming IQ Code suite, SolidityBench is used to improve the team's own EVMind LLMs and to benchmark them against generic and community-created models. IQ Code aims to provide tailored AI models for smart contract code generation and auditing, responding to the growing need for secure and efficient blockchain applications.
According to IQ, NaïveJudge takes a new approach by tasking LLMs with implementing smart contracts based on detailed specifications derived from audited OpenZeppelin contracts. These contracts provide the gold standard for correctness and efficiency. The generated code is evaluated against the reference implementation using criteria such as functional completeness, adherence to Solidity best practices and security standards, and optimization effectiveness.
The evaluation process uses advanced LLMs, including various versions of OpenAI's GPT-4 and Claude 3.5 Sonnet, as impartial code reviewers. They review code against strict criteria, including implementation of all key features, handling of edge cases, error management, correct use of syntax, and overall code structure and maintainability.
Optimization considerations such as gas efficiency and storage management are also evaluated. Scores range from 0 to 100 and provide a comprehensive assessment of functionality, security, and efficiency, reflecting the complexity of professional smart contract development.
Which AI models are best for creating Solidity smart contracts?
The benchmarking results showed that OpenAI's GPT-4o model achieved the highest overall score of 80.05, with a NaïveJudge score of 72.18 and HumanEval for Solidity pass rates of 80% at pass@1 and 92% at pass@3.
Interestingly, newer reasoning models such as OpenAI's o1-preview and o1-mini were beaten to the top spot, with scores of 77.61 and 75.08, respectively. Models from Anthropic and xAI, including Claude 3.5 Sonnet and grok-2, showed competitive performance, with overall scores hovering around 74. Nvidia's Llama-3.1-Nemotron-70B scored the lowest in the top 10 with 52.54.
HumanEval for Solidity is IQ's adaptation of OpenAI's original HumanEval benchmark from Python to Solidity, comprising 25 tasks of varying difficulty. Each task includes corresponding tests compatible with Hardhat, the popular Ethereum development environment, which allow accurate compilation and testing of the generated code. The evaluation metrics, pass@1 and pass@3, measure the model's success on its first attempt and across multiple attempts, offering insight into both accuracy and problem-solving capability.
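To make the pass@1 and pass@3 metrics concrete, here is a minimal Python sketch of the unbiased pass@k estimator described in the original HumanEval paper. The article does not say how SolidityBench computes these figures, so treating this estimator as its method is an assumption. The function names and the sample data are hypothetical and are not taken from the leaderboard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of code samples generated for the task
    c: number of those samples that passed every test
    k: number of attempts the metric allows
    Returns the probability that at least one of k samples, drawn
    without replacement from the n generated, passes.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over all tasks; results holds (n, c) per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for four tasks, three samples each: (samples, passes).
hypothetical_results = [(3, 3), (3, 1), (3, 0), (3, 2)]

print(f"pass@1 = {benchmark_pass_at_k(hypothetical_results, 1):.2f}")
print(f"pass@3 = {benchmark_pass_at_k(hypothetical_results, 3):.2f}")
```

With these made-up numbers, pass@3 comes out higher than pass@1 because a task counts as solved if any of the three attempts passes. The gap between GPT-4o's 80% and 92% reflects the same effect.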
Goals of using AI models in smart contract development
By introducing these benchmarks, SolidityBench aims to advance AI-assisted smart contract development. It supports the creation of more sophisticated and reliable AI models while giving developers and researchers valuable insight into the current capabilities and limitations of AI in Solidity development.
The benchmarking toolkit aims to improve IQ Code's EVMind LLMs and also to set new standards for AI-powered smart contract development across the blockchain ecosystem. The initiative hopes to address a critical need in an industry where demand for secure and efficient smart contracts continues to grow.
Developers, researchers, and AI enthusiasts are invited to explore and contribute to SolidityBench, which aims to drive the continuous improvement of AI models, promote best practices, and advance decentralized applications.
Visit SolidityBench on Hugging Face to learn more and start evaluating Solidity generation models.