Continuous integration workflow

From Materials Simulation Group

A continuous integration (CI) workflow is a robust way to maintain large code bases that are being developed by multiple people. The rationale behind it is illustrated with a simple example. Suppose you have a large scientific code like UNCLE that does some complicated calculations. Next, suppose that you have some graduate students (scientists, not programmers, by trade) who need a small, extra feature to extend UNCLE to work for their dissertation project. Although they test their changes locally in some development branch, how will they know if some of the "minor" changes they have made broke the code somewhere else? After all, it is a large code and most graduate students won't know the whole thing inside out. We have two options:

  • Commit the changes to the main repo and hope that everything is okay.
  • Run all the unit tests for the whole program locally; once they all pass, have another person who is familiar with the code look over your changes (a code review), and only then commit them to the master branch.

Surprisingly, the first option is selected by default by computational scientists all over the world. In fact, most are not sold on the idea that they need unit tests in the first place; they would rather move on to the next project and the next paper. However, in our group, we are trying to make good coding practices and science happen at the same time. Here is how we do it:

  1. Design toward a distribution from the start.
  2. Once the initial repo has been set up, branch it for any new development. Don't commit to master without first running all the unit tests for the whole code (see step 6).
  3. Document individual methods before you code them up. This helps clarify what each method is supposed to do (i.e., what it needs and what it gives back).
  4. Plan unit tests for that method before you code it up. How will you code a method well if you don't know what it should do and how to test that it is doing it?
  5. Run the unit tests as you go along. That way, you are debugging each part of your code as you write it. When the final method is coded, all the previous ones will already be tested and working, so the final debug will be fast.
  6. When you are ready to commit to master, submit a pull request. The continuous integration server (CI server) will automatically unit test the whole code with your proposed changes and update status on the pull request. If it passes, then a second developer in the group can review your code.
  7. Once you both agree that it is ready to commit, it can be merged with the master branch.

Don't take shortcuts! They will always end up being the long way. You might feel that ignoring unit testing while you are coding saves you time, but the maintenance costs later on are tremendous. Also, without unit tests, you will never have confidence that your code is doing everything correctly when it handles new situations or is extended in new ways. As a scientist, you should find that prospect frightening.

Designing Toward Distribution

This is mostly about a mindset, but also has a lot to do with unit testing and documentation. When we say "design toward a distribution", this is what we mean: when you create a new project folder, before you write any code, add the following directories and files to your "distribution directory":

  • docs/ stores screenshots, images, etc. that are referenced in the repo's wiki.
  • src/ stores the actual code you were about to put in the distribution directory.
  • tests/ stores modules, input/output files etc. that are required to unit test the src/ directory.
  • support/ has supporting code (Mathematica notebooks) or PDFs (LaTeX) describing the science behind the problem that the code distribution solves.
  • HISTORY.md has a record of the revisions to the master branch, a high-level description of what changed.
  • README.md is the front page of the repo; it has links to prerequisites, a description of how to get started with the code quickly, and some examples.
  • LICENSE describes who can use the code under which circumstances. Look up the open source licenses to see which one you like.
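The layout above can be created in one step. The following is a minimal sketch in Python, not a tool the group provides; the function name `create_distribution` and the project name in the usage note are made up for illustration. It only creates empty directories and files, so filling in the README, license text, and history is still up to you.

```python
from pathlib import Path

# Directories and files named in the list above.
DIRECTORIES = ["docs", "src", "tests", "support"]
FILES = ["HISTORY.md", "README.md", "LICENSE"]


def create_distribution(root):
    """Create the empty distribution skeleton under `root`.

    Directories and files that already exist are left untouched,
    so the function is safe to run more than once.
    Returns the distribution directory as a Path.
    """
    root = Path(root)
    for name in DIRECTORIES:
        # parents=True also creates `root` itself if it is missing.
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and keeps existing contents.
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_distribution("myproject")` would create `myproject/` with the four directories and three empty files inside it.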

Later, when the project gets bigger, you may want to add a CONTRIBUTE.md file to the distribution with a description of the workflow you use in development (kind of like the contents of this page). It usually describes the testing methodology, how to file and track issues, etc. At that point, you may also have an AUTHORS page that includes a list of everyone who contributed to the project.

Working off a Branch

When you create a branch, you make a copy of the current state of the repository. Any changes you make and commit on the branch do not affect the main distribution of the software until they are merged. Even for your own projects, it is wise to use branches to manage the enhancements and bug fixes you implement. For code that is used by others, this is an absolute must. There are plenty of instructions online about branching GitHub repos.

Documenting Before Coding

When you create a new function or class to solve a problem, first write down answers to the following questions:

  1. In words, what problem does the coding construct solve?
  2. What does it need to solve the problem?
  3. What does it return after solving the problem? What other results, from a scientific/programming perspective, might be useful for someone trying to solve this problem?
  4. What errors are handled? How should another developer respond to these errors?
  5. How can another developer call this routine if they have never used the code?
  6. When this code is extended later on, will this level of abstraction or modularization make sense?

For the last question, it is often impossible to know in advance whether the level of abstraction is correct. However, much of the time you can make small changes during initial development that make the code more extensible later. The point of having the question there is to get you to think about it. If you code in Fortran and use Fortpy, decorating all the code elements with the relevant XML tags will usually cause you to reflect on each of these points. You may feel that such a high level of detail is unnecessary because your memory is good and you will remember what the code does when you see it again. However, remember that we are designing toward distribution, which means that you aren't coding for yourself. Rather, you are writing code for everyone else; convince yourself that your reputation as a computational scientist may depend heavily on your ability to share your code so that others can use it easily (this will be true once the rising generation is in charge). Even things that you think are "small projects" may turn into important pieces of code that lots of people use.

One other point worth emphasizing is that if we want to get scientists to write good code, there need to be good examples. If you have a distribution that is almost 100% unit tested and another scientist wants to extend it to solve a new problem, they will be curious to learn how to run the unit tests you have written. That way they will know they didn't break anything. By writing good code toward a distribution, you can inspire a change for good in computational science.
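To make the six questions in this section concrete, here is a sketch of a routine whose documentation was written first. It is in Python rather than Fortran with Fortpy XML tags, and the routine `cell_volume` is a hypothetical example chosen for illustration, not part of any group code. The docstring answers questions 1 through 5; question 6 is a design judgment that the docstring cannot settle.

```python
def cell_volume(a, b, c):
    """Return the volume of the unit cell spanned by three lattice vectors.

    Problem solved:
        Computes the volume of the parallelepiped defined by the
        lattice vectors of a crystal.
    Needs:
        a, b, c: sequences of three numbers each, the Cartesian
        components of the lattice vectors, all in the same length unit.
    Returns:
        The volume as a non-negative number, in that length unit cubed.
    Errors:
        ValueError if a vector does not have exactly three components,
        or if the vectors are coplanar (zero volume). In both cases the
        caller should check the lattice vectors that were passed in.
    Example:
        cell_volume([1, 0, 0], [0, 1, 0], [0, 0, 1])
    """
    for vector in (a, b, c):
        if len(vector) != 3:
            raise ValueError("each lattice vector needs exactly three components")
    # Scalar triple product a . (b x c); its magnitude is the volume.
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    volume = abs(a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2])
    # Exact comparison with zero is a simplification made for this sketch;
    # real code would likely compare against a small tolerance.
    if volume == 0:
        raise ValueError("lattice vectors are coplanar; the cell has zero volume")
    return volume
```

Writing the "Errors" entry before the body is what prompted the coplanar check here; without the documentation step that case is easy to overlook.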

Write Unit Tests Before Coding

What we mean by this is: have the input and output you expect your function to reproduce before you write the code. Thinking about how to actually test your code will highlight some of the constraints you have; any holes in your understanding of the larger problem will also become apparent. Create the model input/output for your routine and then try to reproduce it with your code. With scientific code especially, it is often the case that you don't know the answer before you have written the code. In that case, you ought to have some idea about the limits and constraints of the solution. For example, speeds greater than c are not physical, and the electric and magnetic fields of an electromagnetic plane wave should be perpendicular. Think about everything you know about the problem and the solution, and then write unit tests to make sure your result fits within the constraints.
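The two constraints mentioned above can be turned into tests directly. The sketch below uses Python's standard `unittest` module as a stand-in for whatever framework you use (Fortpy for Fortran code); the routines `add_velocities` and `plane_wave_b_field` are hypothetical examples written only so that the tests have something to check. Neither test needs to know the exact answer, only the physical limits the answer must respect.

```python
import unittest

C = 299_792_458.0  # speed of light in m/s


def add_velocities(u, v):
    """Relativistic addition of two collinear velocities, in m/s."""
    return (u + v) / (1.0 + u * v / C**2)


def plane_wave_b_field(k_hat, e_field):
    """Magnetic field B = (k_hat x E) / c of a plane wave in vacuum."""
    kx, ky, kz = k_hat
    ex, ey, ez = e_field
    return (
        (ky * ez - kz * ey) / C,
        (kz * ex - kx * ez) / C,
        (kx * ey - ky * ex) / C,
    )


class PhysicalConstraintTests(unittest.TestCase):
    def test_speed_stays_below_c(self):
        # Constraint: no combination of sub-light speeds may reach c.
        speeds = (0.0, 0.5 * C, 0.9 * C, 0.999 * C)
        for u in speeds:
            for v in speeds:
                self.assertLess(abs(add_velocities(u, v)), C)

    def test_fields_are_perpendicular(self):
        # Constraint: E and B of a plane wave are perpendicular (E . B = 0).
        e_field = (0.0, 3.0, 4.0)
        b_field = plane_wave_b_field((1.0, 0.0, 0.0), e_field)
        dot = sum(e * b for e, b in zip(e_field, b_field))
        self.assertAlmostEqual(dot, 0.0)
```

If this is saved in a file under tests/, it can be run with `python -m unittest` from the distribution directory.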

Test as You Go

If you did the previous step, this will be easy (especially if you use a testing framework like Fortpy). This just means that before you start coding another class or function, make sure that the one you just coded passes all the unit tests you wrote for it. Remember that it is easier to debug just one piece of code than to drill from the top down on a behemoth code to find a small problem. If the whole code is unit tested, then anytime an error shows up, you will know which routine is causing it and save yourself days (literally) of debugging work. Unit Test!

Submit a Pull Request | Code Review

Once the code is written and unit tested, the rest is pretty easy. Commit to your local branch and push the commits to GitHub. Start a pull request from your fork or branch and give a good description of why you are making the changes. This description should be in words and should explain the reasons for the changes. Don't just parrot back the code; the reviewer can look at that easily enough anyway. If your repo is not configured for use with the CI server, install your repository on the server using these instructions. Once the unit tests for the whole code pass successfully, you should have someone review your code before merging the pull request. Since code review is so important, it has its own page.