Claude 3.5 Sonnet still holds the LLM crown for code, and I use it when I want to check the output of the best LLM. However, my Continue Dev, Aider and Claude Dev plugins are currently configured to use DeepSeek Coder V2 236B (with a local ollama DeepSeek Coder V2 for tab completions), as it offers the best value at $0.14/$0.28 per million tokens. It sits just below Claude 3.5 Sonnet on Aider's leaderboard [1] whilst being 43x cheaper.
Yi-Coder scored below GPT-3.5 on aider's code editing benchmark. GitHub user cheahjs recently submitted the results for the 9b model and a q4_0 version.
Yi-Coder results, with Sonnet and GPT-3.5 for scale:
The difference between (A) software engineers reacting to AI models and systems for programming and (B) artists (whether painters, musicians or otherwise) reacting to AI models for generating images, music, etc. is very interesting.
I tested this out on my workload (SRE/DevOps/C#/Golang/C++). It started responding with nonsense to a simple "write me a boto Python script that changes the x, y, z values" prompt.
Then I tried other questions from my past to compare... However, I believe the engineer who built the LLM just used the questions in the benchmarks.
In one instance, after an hour of use (I stopped then), it answered one question in 4 different programming languages, with answers that were in no way related to the question.
> Continue pretrained on 2.4 Trillion high-quality tokens over 52 major programming languages.
I'm still waiting for a model that's highly specialised for a single language only, and either a lot smaller than these jack-of-all-trades ones or VERY good at that specific language's nuances and libraries.
mythz ·5 days ago
[1] https://aider.chat/docs/leaderboards/
anotherpaulg ·5 days ago
Full leaderboard: https://aider.chat/docs/leaderboards/
Palmik ·5 days ago
I wonder what's the reason.
JediPig ·5 days ago
theshrike79 ·5 days ago