489 comments
anotherpaulg · 3 hours ago
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).

  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus

It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini

https://aider.chat/docs/leaderboards/

LASR · 5 hours ago
This is actually a huge deal.

As someone building AI SaaS products, I used to hold the position that integrating directly with APIs would get us most of the way to complete AI automation.

I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.

My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.

My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.

I started to realize that most of the real world runs on software that interfaces directly with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs, etc.

Boy was I wrong.

I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile to automate.

arnaudsm · 50 seconds ago
This is a big deal. This is the biggest leap in LLM "intelligence" of the year; the plateau has been broken.

Also, it's surprising how Claude has been superior to ChatGPT for the past 8 months but still has only a fraction of its user base. Stickiness at its best.

HarHarVeryFunny · 2 hours ago
The "computer use" ability is extremely impressive!

This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.

Take a look at their demo of using this for coding.

https://www.youtube.com/watch?v=vH2f7cjXjKI

This seems to be an OpenAI o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real-world goal is very impressive. It'd be mind-boggling if we hadn't had the last few years to get used to this escalation of capabilities.

It's also interesting to consider this from the POV of Anthropic's focus on AI safety. On their website they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc., but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking?