Opinionated Rules for Research
Laurence Liang
I wrote this as some pointers to remind myself of lessons I learned while getting into ML research. While I wrote this with ML research in mind, I hope that these pointers can be useful to a variety of research fields, even those that are quite distant from computers.
I also wrote these with a strong personal bias - please read these points with a critical eye, and reject any that seem irrelevant to you. Feel free to reach out as well - I'd love to hear your thoughts, and I'm happy to change my views on certain practices!
- You have a finite number of experiments to solve a problem, so prioritize accordingly. Say you are given a question that no one knows the answer to - e.g. how can we design an LLM to attain 100% performance on the ARC AGI 2 benchmark? Realistically, most people have limited monetary budgets and a finite timeline, which means we only have enough budget and time to try a limited number of experiments. In his YC Startup School 2025 talk, John Jumper (co-author of DeepMind's AlphaFold paper) says - if I remember correctly - that only about 10% of experiments work out. Knowing the true cost of experiments, you need to view research as placing bets on what will work. Start with a baseline - something quick and easy based on existing literature, which needs to be better than random guessing. With that baseline in place, work down the priority list - your metrics can change based on how you want to place bets, though they should combine cost (time and monetary budgets) with the likelihood of that experiment converting to strong results (a toy sketch of this kind of ranking follows below).
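As a loose illustration (my own toy framing, not something from John Jumper's talk), here is a minimal sketch of ranking candidate experiments by a rough expected payoff per unit of cost. The experiment names, probabilities, and costs are made-up placeholders.

```python
# Toy sketch: rank candidate experiments by expected payoff per unit of cost.
# All names and numbers here are hypothetical placeholders.

candidates = [
    # (name, estimated probability of strong results, cost in GPU-hours)
    ("baseline: few-shot prompting", 0.6, 2),
    ("fine-tune on synthetic puzzles", 0.2, 40),
    ("new architecture idea", 0.05, 200),
]

def priority(prob: float, cost: float) -> float:
    """Crude expected wins per unit cost; tweak to match how you want to bet."""
    return prob / cost

for name, prob, cost in sorted(candidates, key=lambda c: priority(c[1], c[2]), reverse=True):
    print(f"{priority(prob, cost):.3f}  {name}  (p={prob}, cost={cost}h)")
```

The exact scoring function matters less than writing your bets down and forcing yourself to rank them.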
- Your solution doesn't need to be elegant. It needs (1) to be clear, (2) to work, and (3) to be reproducible. There's often a drive to produce elegant models and an elegant experiment setup (i.e. the codebase or lab setup that got you there). However, experiments are often rugged, and refactoring experiments for elegance is an unneeded luxury in most cases. There should be some structure - a `README.md` with clear documentation, a `data` folder with data, a `scripts` folder with reproducible scripts to run (this setup certainly has analogues in other non-computer settings) - though it's fine if the code (or the experiment itself) has a rugged appearance. As long as the setup is clear, working and reproducible (i.e. anyone can take your code/apparatus and get the same results), that is all that matters at this stage. A sketch of what such a script might look like follows the next point.
- You need to run experiments to get results. Set a daily floor. Although this is self-evident, it's easy to become too preoccupied with theorizing about hyper-elegant ideas. You need to run experiments. Try coming up with rules, such as running at minimum one experiment per day - something very easy with low activation energy, but something that brings strong returns over time.
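To make the "reproducible scripts" point concrete, here is a minimal sketch of what a `scripts/run_experiment.py` entry point could look like. The file names, arguments, and the `run_baseline` stub are hypothetical placeholders, not a prescribed layout.

```python
# scripts/run_experiment.py - hypothetical minimal entry point for a reproducible run.
import argparse
import json
import random
from pathlib import Path


def run_baseline(data_path: Path, seed: int) -> dict:
    """Stand-in for the actual experiment; returns a metrics dict."""
    random.seed(seed)  # fix the seed so anyone can reproduce the numbers
    n_examples = len(list(data_path.glob("*")))
    return {"seed": seed, "n_examples": n_examples, "accuracy": random.random()}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=Path, default=Path("data"))
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--out", type=Path, default=Path("results.json"))
    args = parser.parse_args()

    metrics = run_baseline(args.data, args.seed)
    args.out.write_text(json.dumps(metrics, indent=2))
    print(metrics)
```

Nothing fancy - the point is only that someone else can run one command against `data/` and get the same numbers you did.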
- Ablations are important. Rerun your experiment while varying which components are absent, and store the results in a table (see the sketch just below). In many cases, the minimal combination of features needed to achieve good results (the sufficiency criterion) may be smaller than you think.
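Here is a minimal sketch of what such an ablation loop could look like; `run_experiment` and the component names are hypothetical placeholders for whatever your setup actually toggles.

```python
# Toy ablation loop: toggle components off in combinations and record
# the results in a flat table.
from itertools import combinations

COMPONENTS = ["data_augmentation", "lr_warmup", "auxiliary_loss"]  # hypothetical


def run_experiment(enabled: set[str]) -> float:
    """Stand-in for your actual training/eval run; returns a score."""
    return 0.5 + 0.1 * len(enabled)  # placeholder score


rows = []
for k in range(len(COMPONENTS) + 1):
    for enabled in combinations(COMPONENTS, k):
        score = run_experiment(set(enabled))
        rows.append({"enabled": ", ".join(enabled) or "(none)", "score": score})

for row in sorted(rows, key=lambda r: r["score"], reverse=True):
    print(f"{row['score']:.3f}  {row['enabled']}")
```

Sorting the table makes it obvious which components you can drop without losing performance.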
- Don't write a high volume of complex code to impress. Keep it simple. We know that you are capable of writing really cool code. But complex code means more points of failure, especially when you start ingesting data and ablations from different sources, where bugs are not easily flagged. Keep it simple.
- If all your experiments can be run in only 10 lines of code, only write 10 lines of code. Only write the amount of code that you need. Use existing libraries as much as possible.
- Your logging system needs to beat pen and paper. If your logging system is less effective than writing the results down with pen on paper, then there is an issue. Log everything, including all your experimental conditions. I tend to enjoy WandB and Google Sheets, though there are certainly other options as well; a minimal sketch follows below.
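As one possible setup, here is a minimal sketch of logging a run with WandB; the project name, config values, and metrics are made up for the example.

```python
# Minimal WandB logging sketch - project name, config, and metrics are placeholders.
import wandb

run = wandb.init(
    project="arc-agi-experiments",  # hypothetical project name
    config={"lr": 3e-4, "batch_size": 32, "ablation": "no_warmup"},
)

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss, "step": step})

run.finish()
```

The config dict is the important part: if the conditions of a run aren't logged with it, you'll end up reconstructing them from memory later.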
- Writing a library to reproduce your work is great. Writing a library to replace highly functional existing libraries is an issue. There is a strong need for reproducible experiments. If you can write a library to load your code/data and run experiments, and then open-source that library, it will be immensely helpful to the research community. If you are writing a library because you want to replace an existing, well-established one, then - unless you're introducing never-before-seen features (e.g. `mamba` introducing easy-to-use SSMs, which is incredibly useful) - your library will likely struggle against its incumbents. In particular, consider Pandas - more than three thousand people have contributed to building it since 2008. Why would your library have any new benefits over an open-source tool that three thousand-plus people have touched and contributed to? (A small example of what you'd be up against follows below.)