115 comments
mythz · 5 days ago
Claude 3.5 Sonnet still holds the LLM crown for code, and I'll use it when I want to check the output of the best LLM. However, my Continue Dev, Aider and Claude Dev plugins are currently configured to use DeepSeek Coder V2 236B (with local ollama DeepSeek Coder V2 for tab completions), as it offers the best value at $0.14/$0.28 per million tokens: it sits just below Claude 3.5 Sonnet on Aider's leaderboard [1] whilst being 43x cheaper.

[1] https://aider.chat/docs/leaderboards/

anotherpaulg · 5 days ago
Yi-Coder scored below GPT-3.5 on aider's code editing benchmark. GitHub user cheahjs recently submitted the results for the 9b model and a q4_0 version.

Yi-Coder results, with Sonnet and GPT-3.5 for scale:

  77% Sonnet
  58% GPT-3.5
  54% Yi-Coder-9b-Chat
  45% Yi-Coder-9b-Chat-q4_0

Full leaderboard:

https://aider.chat/docs/leaderboards/

Palmik · 5 days ago
The difference between (A) software engineers reacting to AI models and systems for programming and (B) artists (whether it's painters, musicians or otherwise) reacting to AI models for generating images, music, etc. is very interesting.

I wonder what's the reason.

JediPig · 5 days ago
I tested this out on my workload (SRE/DevOps/C#/Golang/C++). It started responding with nonsense to a simple prompt: write me a boto Python script that changes x, y, z values.

Then I tried other questions from my past to compare. I believe the engineers who built the LLM just trained it on the questions in the benchmarks.

In one instance, after an hour of use (I stopped then), it answered a single question with 4 different programming languages, and with answers that were in no way related to the question.

theshrike79 · 5 days ago
> Continue pretrained on 2.4 Trillion high-quality tokens over 52 major programming languages.

I'm still waiting for a model that's highly specialised for a single language only - either a lot smaller than these jack-of-all-trades ones, or VERY good at that specific language's nuances + libraries.
