Trimming changelog files
A common question from long-time Liquibase users is how to clean up a changelog file that has grown unwieldy.
The standard process for using Liquibase is to append individual change sets to your changelog file for each database change you need to make. Over time those changes can build up to thousands of entries, many of which are now redundant (create a table and later drop it) or inefficient (create a table, then add columns individually vs. just creating the table with all the columns). What is the best way to simplify all that cruft that has built up?
My first response is always “Do you really need to simplify it?” You built up that changelog over a long period of time and you have run it and tested it countless times. Once you start messing with the changelog file you are introducing risk, which has a cost of its own. Do whatever performance or file size concerns you have really outweigh the risk of messing with a script that you know works?
If it is worth the risk, why is it worth the risk? Sometimes the problem is that your changelog file has just gotten so large that your editor chokes on it, or you get too many merge conflicts. The best way to handle this is to simply break up your changelog file into multiple files. Instead of having a single changelog.xml file with everything in it, create a master.changelog.xml file which uses the <include> tag to reference other changelog files.
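A minimal sketch of such a master changelog is shown here. The two included paths are the ones discussed in this post; the schema header uses the generic "latest" XSD location and may need adjusting for the Liquibase version you run.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

    <!-- Included files are processed in the order they are listed -->
    <include file="com/example/news/news.changelog.xml"/>
    <include file="com/example/directory/directory.changelog.xml"/>

</databaseChangeLog>
```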
When you run liquibase update against the master.changelog.xml file, changesets in com/example/news/news.changelog.xml will run and then the changesets in com/example/directory/directory.changelog.xml will run. You can break up changesets in whatever manner works best for you. Some break them up by feature, some break them up by release. Find what works best for you.
Other times, the problem is that liquibase update is taking too long. Liquibase tries to be as efficient as possible when comparing the contents of the DATABASECHANGELOG table with the current changelog file, and even if there are thousands of changesets that have already run, an “update” command should take just seconds. If you are finding that update is taking longer than it should, watch the Liquibase log to determine why. Perhaps there is an old runAlways="true" changeset that no longer needs to run, or there are preconditions which are no longer needed. Running Liquibase with --logLevel=INFO or even --logLevel=DEBUG can give additional output which can help you determine which changesets are slow. Once you know what is slowing down your update, try to alter just those changesets rather than throwing out the whole changelog and starting from scratch. You will still want to retest your changelog in depth, but it is a far less risky change.
Other people find that liquibase update works well for incremental updates, but creating a database from scratch takes far too long. Again I would ask “Is that really a problem?” Are you re-creating databases often enough that the risk of a change to the creation script makes sense? If you are, your first step should be to look for problem changesets as described above. Databases are fast, especially when they are empty. Even if you create a table only to drop it again, that is usually just a few milliseconds of overhead and not worth optimizing. The biggest performance bottlenecks in creating a database are usually indexes, so start with them. If you are creating and updating indexes frequently in your creation process, you may be able to combine those changesets into something more efficient.
When you need to surgically alter your existing changesets, remember how Liquibase works: each changeset has an “id”, an “author”, and a file path which together uniquely identify it. If the DATABASECHANGELOG table already has an entry for that changeset, Liquibase will not run it again. It will, however, throw an error if the checksum for the changeset in the file doesn't match what was stored on the last run.
How you modify your existing changesets will also depend on your environment and where in the changelog the problem changesets are. If you are modifying changesets that have been applied to all of your environments and are now only used on fresh database builds you can treat them differently than if they have been applied to some databases but not yet to others.
To merge or modify existing changesets, you will be doing a combination of editing existing changesets, removing old changesets, and creating new ones.
Removing unneeded changesets is easy because Liquibase doesn't care about DATABASECHANGELOG rows with no corresponding changesets. Just delete out-of-date changesets and you are done. For example, if you have a changeset that creates the table “cart” and then another that drops it, just remove both changesets from the file. You must make sure, however, that there are no changesets between the create and the drop that make use of that table, or they will fail on a fresh database build. That is an example of how you are introducing risk when changing your changelog file.
Suppose instead you have a “cart” table that is created in one changeSet, then a “promo_code” column is created in another and an “abandoned” flag is created in another.
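As a sketch, the starting point might look like the changelog below. The author value "example" and the column types are assumptions made for illustration; only the table name, the column names, and the ids 1 to 3 come from the scenario described here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

    <!-- Original table definition (column types are hypothetical) -->
    <changeSet id="1" author="example">
        <createTable tableName="cart">
            <column name="id" type="int"/>
        </createTable>
    </changeSet>

    <!-- Column added later in its own changeset -->
    <changeSet id="2" author="example">
        <addColumn tableName="cart">
            <column name="promo_code" type="varchar(10)"/>
        </addColumn>
    </changeSet>

    <!-- Flag added later in yet another changeset -->
    <changeSet id="3" author="example">
        <addColumn tableName="cart">
            <column name="abandoned" type="boolean"/>
        </addColumn>
    </changeSet>

</databaseChangeLog>
```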
One option would be to combine everything into a single changeset using the existing id="1" and delete the other changesets.
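A possible shape for the combined changeset is sketched below. The author and column types are illustrative assumptions, and CHECKSUM_FROM_ERROR_MESSAGE is a placeholder, not a real value: you would substitute the checksum that Liquibase reports for your own changeset.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

    <changeSet id="1" author="example">
        <!-- Placeholder: replace with the checksum from the Liquibase error message -->
        <validCheckSum>CHECKSUM_FROM_ERROR_MESSAGE</validCheckSum>
        <!-- The table is now created with all three columns at once -->
        <createTable tableName="cart">
            <column name="id" type="int"/>
            <column name="promo_code" type="varchar(10)"/>
            <column name="abandoned" type="boolean"/>
        </createTable>
    </changeSet>

    <!-- Changesets id="2" and id="3" have been deleted -->

</databaseChangeLog>
```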
This will work well if all existing databases have the cart table with the promo_code and abandoned columns already added. Running Liquibase against existing databases just sees that id="1" already ran and doesn't do anything new. Running Liquibase against a blank database will create the cart table with all the columns right away. Note that the modified changeset needs a <validCheckSum> tag, or existing databases will throw an error saying that id="1" has changed since it was run. Just put the checksum from the error message in the validCheckSum tag to mark that you know the changeset changed and the new value is OK.
If you have some databases where the promo_code and/or abandoned columns have not yet been added, update the original createTable as before, but use preconditions with onFail=”MARK_RAN” to handle cases where the old changesets ran while still not adding the columns again if the new changesets ran.
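One way this could look is sketched below, again with an assumed author, assumed column types, and a placeholder checksum. Each addColumn changeset is guarded by a precondition that only passes when the column is still missing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

    <changeSet id="1" author="example">
        <!-- Placeholder: replace with the checksum from the Liquibase error message -->
        <validCheckSum>CHECKSUM_FROM_ERROR_MESSAGE</validCheckSum>
        <createTable tableName="cart">
            <column name="id" type="int"/>
            <column name="promo_code" type="varchar(10)"/>
            <column name="abandoned" type="boolean"/>
        </createTable>
    </changeSet>

    <changeSet id="2" author="example">
        <!-- Run only if the column is missing; otherwise just record the changeset as ran -->
        <preConditions onFail="MARK_RAN">
            <not>
                <columnExists tableName="cart" columnName="promo_code"/>
            </not>
        </preConditions>
        <addColumn tableName="cart">
            <column name="promo_code" type="varchar(10)"/>
        </addColumn>
    </changeSet>

    <changeSet id="3" author="example">
        <preConditions onFail="MARK_RAN">
            <not>
                <columnExists tableName="cart" columnName="abandoned"/>
            </not>
        </preConditions>
        <addColumn tableName="cart">
            <column name="abandoned" type="boolean"/>
        </addColumn>
    </changeSet>

</databaseChangeLog>
```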
Now, on existing databases where all 3 changesets have already run, Liquibase will just continue on as before. For existing databases that have the old cart definition, it will see that the columns don't exist for id="2" and id="3" and execute them as usual. For blank databases, it will create the table with the promo_code and abandoned columns, and then in id="2" and id="3" it will see that they are already there and mark the changesets as ran without re-adding the columns. A word of warning, however: preconditions add a performance overhead to your update executions, and they are ignored in updateSQL mode because Liquibase cannot know how applicable they are when changesets have not actually executed. For that reason it is best to avoid them if possible, but definitely use them when needed. Preconditions also add complexity to your changelog, which will require additional testing, so keep that in mind when deciding whether to modify your changelog logic. Sometimes it is easiest and safest to wait until all your databases have the columns and then modify the changesets to avoid the preconditions.
The cart/promo_code/abandoned example shows some basic patterns you can use when modifying existing changesets. Similar patterns can be used to optimize whatever your bottlenecks are. Just remember that when you change one changeset, it can affect other changesets below it, which may need to be modified as well. This can easily spiral out of control, so be mindful of what you are doing.
If you find that it will work best to completely restart your changelog, see How to set up Liquibase with an Existing Project and Multiple Environments, which describes how to add Liquibase to an existing project (even if that project was previously managed by Liquibase).