so you can do
`backlog task create "Feature" --plan "1. Research\n2. Suggest Implementation// #AI AI!"` (yes, the order with the trailing ! is weird)
and in the background aider will propose solutions.
I’m not sure how this compares to Claude Code or Codex, but it's LLM-flexible. The downside is that it doesn't create a pull request, so it's more helpful for local code.
I would probably add some README.md files to the --watch-files session, and I think you need to click [D]on't ask again once so it won't keep asking you to add files.
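Roughly, the session I have in mind (if I remember aider's flags right; pick whichever files make sense for your repo):
aider --watch-files --read README.md
# aider then sits in the background and reacts whenever it sees a comment ending in "AI!",
# like the one the --plan text above drops into the task file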
The tagline from the repo seems fine: "A tool for managing project collaboration between humans and AI Agents in a git ecosystem"
> Rich query commands -- view, list, filter, or archive tasks with ease
If these things appeal to you and you haven’t already looked at it, the GitHub CLI tool gh is very useful. For instance:
gh repo clone MrLesk/Backlog.md
cd Backlog.md
gh issue view 140
gh issue view 140 --json body --template "{{.body}}"
— https://cli.github.com

You can do things like fork repos, open pull requests from your current branch, etc.
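The fork-and-PR flow, for example, is roughly:
gh repo fork MrLesk/Backlog.md --clone
cd Backlog.md
git checkout -b my-fix
gh pr create --fill
(--fill takes the PR title and body from your commits.)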
For a project that is just for me, it's exactly what I need – dependency tracking and not much more, stored offline with the code. Almost all of the code for it was written by Gemini.
Because of the embedded custom instructions Claude knows exactly how to proceed.
Since I never create overly big tasks, what blows up most of the context is actually the docs and the decisions markdown files.
All data is saved under the backlog folder as human‑readable Markdown in the format task-<id> - <title>.md (e.g. task-12 - Fix typo.md).
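Concretely, a task created that way might look something like this; the path and the frontmatter fields are my guess at the shape, not the documented schema:
cat > "backlog/tasks/task-12 - Fix typo.md" <<'EOF'
---
id: task-12
title: Fix typo
status: To Do
---
Fix the typo in the README install section.
EOF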
If every "task" is one .md file, I believe AIs have issues editing big files: they can't easily append text to a big file due to the context window, so we need to force a workaround and launch a command-line append instead of editing the file (a one-liner is sketched below). So this means the tasks have to remain small, or we have to avoid putting too much information in each task.

Correct. One of the instructions that ships with Backlog.md is to make the tasks “as big as they would fit in a PR”. I know this is very subjective, but Claude really gets much better because of this.
https://github.com/MrLesk/Backlog.md/blob/main/src/guideline...
You will notice yourself that smaller atomic tasks are the only way for the moment to achieve a high success rate.
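(On the append workaround mentioned above: in practice it is just a shell redirect, something like the line below, with an illustrative path.)
echo "- also update the CHANGELOG" >> "backlog/tasks/task-12 - Fix typo.md"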
The state is always up to date, whether you are running backlog.md from the main branch or a feature branch.
It works well when there are not many branches, but I need to check whether I can improve performance when there are lots of branches.
Another idea is to use git notes
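In case it helps, git notes hang extra metadata off commits without touching the working tree, roughly:
git notes add -m "backlog: task-12 in progress" HEAD
git notes show HEAD
git push origin refs/notes/commits
# the note text is illustrative; notes live under refs/notes/ and need an explicit push to be shared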
This looks much more thought out, thanks for sharing!
What I did was add the backlog folder to the .gitignore file, but after every command I get a lengthy git error.
And even if I were to add these files to my repository, I would want to add them manually.
Many of my tasks already exist in the form of Jira tickets; it would be interesting to prompt it to take over a specific ticket and update that ticket's progress as well.
Backlog is more for smaller projects where you wouldn’t normally have a project management tool
I sent a message to someone saying that I was working on backlog.md, and it turned the name into a link automatically.
I wanted to remove the link, clicked on it accidentally, and discovered that not only was there nothing on that domain, it wasn't even registered yet. I got the domain a few minutes later :)
I see it's a TS app, so I'm sure the bun bundle is the install, but it's always good to include it in your 5-min intro.
Joking aside, there is an npm/bun install -g backlog.md at the top, but I can add an extra one in the 5-min intro.
I am using Bun’s new fullstack single file builds. I’m really impressed by how easy it was to set up everything.
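For anyone curious, the single-file build boils down to one command (the entry path here is a guess; check the repo's package.json for the real script):
bun build ./src/cli.ts --compile --outfile backlog
# --compile bakes the Bun runtime into a single self-contained executable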
Had similar success with making some more markdown files to help guide the agent but never would have thought of something this useful.
Will try your workflow and backlog on a build this week.
Iteration timeline
==================
• 50 % task success - added README.md + CLAUDE.md so the model knew the project.
• 75 % - wrote one markdown file per task; Codex plans, Claude codes.
• 95 %+ - built Backlog.md, a CLI that turns a high-level spec into those task files automatically (yes, using Claude/Codex to build the tool).
Three-step loop that works for me:
1. Generate tasks - Codex / Claude Opus → self-review.
2. Generate plan - same agent, “plan” mode → tweak if needed.
3. Implement - Claude Sonnet / Codex → review & merge.
For simple features I can even run this from my phone: ChatGPT app (Codex) → GitHub app → ChatGPT app → GitHub merge.
Repo: https://github.com/MrLesk/Backlog.md
Would love feedback and happy to answer questions!
Would love to see an actual end to end example video of you creating, planning, and implementing a task using your preferred models and apps.
- SWE-bench leaderboard: https://www.swebench.com/
- Which metrics for e.g. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork"? https://news.ycombinator.com/item?id=43101314
- MetaGPT, MGX: https://github.com/FoundationAgents/MetaGPT :
> Software Company as Multi-Agent System
> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
- Mutation-Guided LLM-based Test Generation: https://news.ycombinator.com/item?id=42953885
- https://news.ycombinator.com/item?id=41333249 :
- codefuse-ai/Awesome-Code-LLM > Analysis of AI-Generated Code, Benchmarks: https://github.com/codefuse-ai/Awesome-Code-LLM :
> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding
- underlines/awesome-ml/tools.md > Benchmarking: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- formal methods workflows, coverage-guided fuzzing: https://news.ycombinator.com/item?id=40884466
- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350
After reviewing all this, what is your actual conclusion, or are you asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or is the takeaway that the problem space is too multifaceted for any single benchmark to be meaningful?
But then outstanding liabilities due to code quality and technical debt aren't costed in by the market.
There are already code quality metrics.
SAST and DAST tools can score or fix code, as part of a LLM-driven development loop.
Formal verification is maybe the best code quality metric.
Is there more than Product-Market fit and infosec liabilities?
Trying this project today; looks nice. I see you have sub-tasks. Any thoughts on a 'dependency' relation? I.e., don't do X if it depends on task A, which is not complete.
FYI, there is a 404 in AGENTS.md, GEMINI.md, etc., pointing to a non-existent README.md.
Will check the 404 issues. Thanks for reporting it
Though I've not had much luck getting Claude to natively use MCPs, so maybe that's off base, heh.
When you initialize Backlog in a folder, it asks whether you want to set up agent instruction files like CLAUDE.md. It is important to say yes here so that Claude knows how to use Backlog.md.
Afterwards you can just write something like: Claude please have a look at the @prd.md file and use ultrathink to create relevant tasks to implement it. Make sure you correctly identify dependencies between tasks and use sub tasks when necessary.
Or you can just paste your feature request directly without using extra files.
Feels a bit like magic
Also, I'm not fully sure about your setup. Coming at it fresh, I would next set up agents that check my GitHub repo for backlog tasks and open pull requests for those tasks. If I write a good description, and ideally tests, I can optimize the results of these.
This creates the possibility of agents checking your backlog and preparing the work.
I usually work with aider every day and I'm quite fast at completing tasks; the next limitation would be the latency and some back-and-forth. I have some dead time in between. I can definitely define tasks faster than working 1-on-1 with the AI.
Yeah, if you could share a bit more about how you do this with Claude, we would all be thankful. Also, I haven't seen anywhere to sponsor/tip you; would love to!