
Wednesday, March 1, 2017

Software Development Process - Mono or Multi Code Versioning Repository?

When we have a large software project, it may make sense to divide and conquer, creating several modules that are developed independently (possibly maintained by different teams and written in different programming languages).
But, regarding the versioning system, is it better to have each module in its own code repository (multiple repositories) or to keep all the project's modules in a single repository (a monolithic or mono repository)?
And what if we have several independent, unrelated products/projects (that may nevertheless share common libraries)?

Monorepo Advantages

  • Simplifies synchronization between modules
  • Simplified dependency management[4]
  • The main project is guaranteed to build with the checked-out code: a single source of truth
  • Extensive code sharing and reuse[4]
  • Easier to refactor code project-wide (e.g. changing an interface and all its callers/users at once), enabling better continuous modernization; we know better who is using what [12]
  • Easier to revert project-wide changes (bug fixes, refactors, features), e.g. by cherry-picking in every module at the same time
  • We can tag or branch several modules at the same time (with the correct dependencies) [12]
  • Atomic changes: easier to fix a bug or add a new feature that requires making large changes throughout the whole code base. "SVN, hg, and git have solved the problem of atomic cross-file changes; monorepos solve the same problem across projects." [13]
  • Better collaboration across teams [4]
  • Flexible team boundaries and code ownership [4], which make a monorepo more capable of enforcing modular code (this may be counter-intuitive) [14]
  • Code visibility and clear tree structure providing implicit team namespacing.[4]
  • Faster general build, as all modules and submodules build in parallel, with fewer binary copies and optimal incremental builds
  • Easier to integrate with issue-tracking applications

Monorepo Disadvantages
  • Extremely large projects are not well supported by standard VCS/SCM tools (e.g. git doesn't scale), as working with them becomes increasingly unfeasible (network traffic) and time-consuming as the number of commits, branches/tags and files, and the disk size grow [1]
    • Many commits: slow git log and git blame commands
    • Many ref advertisements: slow git clone, fetch and push
    • Many tracked files: slow git status and git commit
    • Large files: affect general repo performance and network usage
    • A combination of the above: hard to switch between refs
  • If several products coexist, tags for one product are also made in the others (as there is only one tree) [1]
  • Extra work is needed to sync the several teams that work in the same repo (for the code shared between projects) [12]
  • A developer may check out code that he doesn't need to use (instead of a small subset of the repository) [15]
Possible fixes
  • Binaries:
    • use Git LFS to reduce the size of the repository, extracting the binaries (and their versions) to a secondary location (e.g. for images)
    • OR tweak the use of delta compression for some binary types in .gitattributes [3]
      • turn delta off for binary files that change significantly
      • run the garbage collection (see below)
  • git describe allows having a separate version number for each module
  • git shallow clone: copy only recent revisions; or clone only a single branch [3]
  • Run the garbage collection: "After a git gc refs are packed in a single file and even listing over 20,000 refs is fast (~0.06 seconds)" [1]
  • Consider removing refs you don't need anymore [1]
  • git filter-branch: clean your repo [3]
  • git sparse-checkout: keep the working directory clean by explicitly listing which folders you want to populate [3]
  • Use Mercurial with extensions, as Facebook does (hgwatchman + remotefilelog) [7] [8]
  • Use a proprietary SCM like Google's Piper [4]
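The .gitattributes tweak mentioned above can be sketched as follows (a minimal example; which binary types to exclude from delta compression depends on the project, the patterns below are just illustrative):

```
# .gitattributes - treat these as binary and skip delta compression,
# since new versions share little with old ones
*.psd binary -delta
*.zip binary -delta
*.mp4 binary -delta
```

The `binary` macro disables diffing and text conversion, while `-delta` tells git's packer not to waste time attempting delta compression on these files.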

Multirepo Advantages
  • In the rare case of different modules needing different versions of the same library (which should be avoided at all costs), maintenance is easier. (But if this library depends on more common modules, we may not want to duplicate those as well)
  • Creating each module as its own project reinforces isolated library development (but this can also be done in a monorepository)
  • Lighter repositories to work with (less history, fewer files, smaller size)
  • It may make sense to separate open-source code from proprietary code [14]
  • Good when we need to restrict access to just some parts of the code

Multirepo Disadvantages
  • As it is hard to predict the evolution of all dependencies in advance, and as modules may later be grouped in better ways than they currently are, a multirepo will be harder to evolve (especially when using Agile) [9] (harder to move code from one place to another)
  • Harder and slower to understand the implications of one module's code changes on another module (even if caught by a unit test)
  • Harder and slower to find a bug, caused by a module's code changes, in the full final project
  • Slower to make the general build, as modules must wait for their dependency modules to build first (submodules aren't in the same solution, which means slower builds)
    • Making a change to a project's module (even a very small one) requires a release just for that module, which may even fail for many reasons such as build server network issues, disk space, etc.
  • Hard to coordinate horizontal changes across the whole project (project-wide changes):
    • Updating a library that many modules use
    • Bug fixes or new features that need to be implemented across all modules
    • Large, atomic refactorings become more difficult
  • Harder to promote consistency (standardization) among the different modules:
    • Harder to promote using the same libraries (to avoid having several libraries and pieces of code that do the same thing)
    • Harder to promote the same code style
    • Harder to keep the same versioning system conventions (branch/tag names, commit descriptions)
      • Harder for a developer to move across teams (imagining each team works on one module)
  • Less direct to get a full project build version
  • Less direct to test a project build as a whole
  • We need a tool to help manage the project's build dependencies (which version of which modules)
    • In a dependency hierarchy with more than one level, we have to make sure that level-2-or-higher modules have the same version project-wide (unless different versions are really intended, which should be avoided in order not to duplicate modules - see above)
    • If a project build may contain duplicates of the same module at different versions, support for this must be implemented (e.g. creating folders named name+version to distinguish them)
    • We need to have one of the two:
      • A project-wide place where versions are decided (easier to avoid different module versions, but each module must go to this central place to know which library versions to use - not self-contained)
      • OR each module choosing its own dependencies and versions (harder to avoid divergent module dependencies, as the full dependency tree must be generated each time)
  • We will lose a lot of time tracking down and fixing dependency issues [13]
  • New developers spend more time learning the version control structure before they can start coding [15]
  • We may feel more incentivized to send the DLLs generated by each module-repo directly to a production environment, without updating the full project version. (It then becomes a pain to track which combination of versions each client has - when problems occur, what exactly do we bug-fix?)
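The version-consistency check described above can be sketched minimally in Python (the module names and the flat dependency-declaration format are hypothetical; a real tool would read each module's manifest):

```python
# Minimal sketch of a project-wide dependency version checker.
# Each entry maps a module to the (dependency, version) pairs it declares.
deps = {
    "app":     [("ui", "2.0"), ("core", "1.1")],
    "ui":      [("core", "1.1")],
    "reports": [("core", "1.0")],  # pins a different version of core
}

def find_version_conflicts(deps):
    """Return {dependency: {versions}} for dependencies pinned at more than one version."""
    seen = {}
    for module, pins in deps.items():
        for dep, version in pins:
            seen.setdefault(dep, set()).add(version)
    return {dep: vers for dep, vers in seen.items() if len(vers) > 1}

conflicts = find_version_conflicts(deps)
print({dep: sorted(vers) for dep, vers in conflicts.items()})  # {'core': ['1.0', '1.1']}
```

This is exactly the kind of bookkeeping a monorepo gives for free: with one tree there is only one version of `core` to pin.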
Possible fixes
  • It is mandatory to have a tool that helps maintain, sync and make sense of the repositories' dependencies
  • Use gitslave, a sync tool between a super-project and its slave repositories [2]
    • This allows using commands (such as branch, commit, push, pull, merge, tag, checkout, status and log) project-wide
  • Use git-subtree
  • Use submodules [11] (but these come with a lot of problems)
  • Use GUI clients such as Sourcetree and SmartGit [12]
  • Use git-repo

Summary Table


Real-world Examples
  • Spotify uses a monorepository [5] [6]
  • Facebook uses a monorepo with Mercurial plus the hgwatchman + remotefilelog extensions (and this has stood the test of time; they have said that the scaling constraints of the source control system should not dictate their code structure/organization) [7] [8]
    • (17 million lines of code and 44,000 files as of 2013)
  • Google uses a monorepo with its custom SCM named Piper (implemented on top of Spanner); developers access it via the CitC ("Clients in the Cloud") application [4]
    • (1 billion files (2 billion lines of code in 9 million unique source files), 35 million commits, 18 years of history, 86 TB)
    • Example from an insider on how well Google's code sharing works
  • The Linux kernel uses a monorepo (15 million lines of code) [14]
  • Twitter, Digital Ocean and Etsy use a monorepo [13]
  • From my experience, working with several SVN repositories with several levels of dependencies forced the creation of a complex custom dependency management system (in Python). Changes to the core modules at the bottom of the dependency hierarchy were afterwards avoided, in order to simplify dependency updates and speed up builds (skipping many intermediate module builds). Creating direct dependencies on these core methods was also avoided, to reduce the number of dependency links (we want fewer places where all the project's module versions must be switched). This meant that code initially meant to be reused was no longer maintained or used, and the same functionality was rewritten in the modules above. A visual graph of the dependencies was also created (written in the DOT language and rendered with Graphviz) to help understand each product's state: which modules was the product using, and at which version of each? As only each module knows its own dependencies and versions, this information could only be obtained by reading through all the dependencies, and text was not a good way to quickly make sense of this data (e.g. to detect when different versions of the same module were being used unintentionally). Finally, an automatic dependency-updater tool was created to avoid human errors and make updates much faster (used when a module is upgraded and we want the whole project to upgrade its dependencies to that latest version), but it had limitations and manual updates were frequently needed.
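A dependency graph like the one described above can be generated with a few lines of Python emitting DOT for Graphviz to render (the module names and versions below are hypothetical examples):

```python
# Minimal sketch: emit a Graphviz DOT graph of module dependencies.
# Each (module, version) maps to the (dependency, version) pairs it uses.
deps = {
    ("app", "3.2"):     [("ui", "2.0"), ("core", "1.1")],
    ("ui", "2.0"):      [("core", "1.1")],
    ("reports", "0.9"): [("core", "1.0")],  # a diverging version stands out visually
}

def to_dot(deps):
    """Render the dependency mapping as a DOT digraph, one edge per pin."""
    lines = ["digraph dependencies {"]
    for (mod, ver), pins in deps.items():
        for dep, dep_ver in pins:
            lines.append(f'  "{mod} {ver}" -> "{dep} {dep_ver}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(deps))
```

Writing the output to `deps.dot` and rendering it (e.g. `dot -Tpng deps.dot -o deps.png`) shows at a glance when two versions of the same module coexist in a product build.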

Some Reasoning
Having several repositories may make sense when we want to treat each one independently, as a self-contained service/library for others to use, and when little code sharing and little synchronization between the repositories are needed.
A monolithic repository is good when we want a code structure that promotes code reuse.
In case of doubt, I choose mono.