The current Unified-Bench Google Sheet is updated manually, by hand. This is tedious and slow, but the data is very accurate. Current general-purpose web AI agents, including Manus.ai, Flowith, Emergent, and GenSpark, all fall short: getting to 90% of Unified-Bench's requirements is not very challenging for them, but they can't solve the last 10%. For example, agents get stuck or fail to parse and map the mess of AI model IDs and names from various sources, and some can't even extract text from images. I have to collaborate with them and get my hands dirty writing and tweaking regexes to deal with the inconsistent benchmark data. Every benchmark has its own fine print at the bottom (for example: is it high or low compute? thinking or non-thinking model? reasoning, non-reasoning, or hybrid model? 16k or 32k thinking budget? pass@1 or average pass@4?). Coincidentally, this can be my "soft"-AGI 2027 benchmark. lol!
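To give a feel for the regex work, here is a minimal Python sketch of that kind of normalization. It is not the actual Unified-Bench tooling: the "acme" model IDs, the patterns, and the output field names are all made up for illustration, and real leaderboard labels are messier than this.

```python
import re

# Hypothetical canonical IDs mapped to patterns for the spellings seen in the wild.
# The "acme" models are placeholders, not real entries from any benchmark.
MODEL_PATTERNS = {
    "acme-large-2": re.compile(r"acme[\s_-]*large[\s_-]*v?2", re.I),
    "acme-mini": re.compile(r"acme[\s_-]*mini", re.I),
}

# Fine-print qualifiers that often trail a model name in a leaderboard row.
THINKING_BUDGET = re.compile(r"(\d+)\s*k\s*(?:thinking|reasoning)", re.I)
PASS_AT = re.compile(r"pass@(\d+)", re.I)


def normalize(raw: str) -> dict:
    """Map a raw leaderboard label to a canonical ID plus parsed qualifiers."""
    model_id = next(
        (cid for cid, pat in MODEL_PATTERNS.items() if pat.search(raw)), None
    )
    budget = THINKING_BUDGET.search(raw)
    pass_at = PASS_AT.search(raw)
    return {
        "model_id": model_id,  # None when no known pattern matches
        "thinking_budget_k": int(budget.group(1)) if budget else None,
        "pass_at": int(pass_at.group(1)) if pass_at else None,
    }


print(normalize("Acme Large v2 (32k thinking, pass@1)"))
```

Keeping the qualifiers as separate fields, rather than baking them into the model name, means two rows for the same model at different thinking budgets stay distinguishable.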
I got into the Dia beta. Dia is an AI browser. This space is starting to explode, with competitors like Perplexity's Comet browser and Google's Mariner.
Dia seems like a better fit for my requirements. My first impression is that it targets more technical people, such as power users or developers. The interface is centered around chat and gives you more control of the browser. This is a different approach from Perplexity's Comet (which may be targeting less tech-savvy users; I don't have access to it, so I don't really know).
I'm going to scrape the public AI benchmarks/leaderboards using Dia. You can add skills, which are like prompt templates that you can reuse for repetitive prompts. Below is an example of a /scrape skill for scraping HTML and transforming it into JSON data that adheres to the provided schema.
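For comparison, the same HTML-to-JSON transformation can be sketched in plain Python. To be clear, this is not the Dia skill (a skill is a prompt template, not code); it is only a rough illustration of the target behavior. The schema, the sample table, and the "acme-mini" row are all hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical schema: field name -> type that each scraped row must satisfy.
SCHEMA = {"model": str, "score": float}


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape(html: str) -> list[dict]:
    """Use the first row as headers and coerce the remaining rows to SCHEMA types."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    keys = [h.lower() for h in header]
    records = []
    for cells in body:
        raw = dict(zip(keys, cells))
        # Raises KeyError or ValueError if a row does not fit the schema.
        records.append({name: cast(raw[name]) for name, cast in SCHEMA.items()})
    return records


# Made-up sample input standing in for a leaderboard page.
HTML = (
    "<table><tr><th>Model</th><th>Score</th></tr>"
    "<tr><td>acme-mini</td><td>71.5</td></tr></table>"
)
print(json.dumps(scrape(HTML), indent=2))
```

Failing loudly on a row that doesn't fit the schema is deliberate here: a silently dropped or mis-typed score is much harder to spot later in the sheet.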
(to be continued...)