Week 1-2

During the first week, I mainly worked through the documentation of git diff-tree and skimmed its source code so that I could write better tests. Diff-tree has two main modes of operation: one “Compares the content and mode of the blobs found via two tree objects”, and the other compares a commit object with its parents. I wrote tests around both. Luckily, when I set “requires-full-index” to false for “diff-tree”, all the tests passed. This suggests that “diff-tree” already works with a sparse index, although passing tests are only indirect evidence.
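To make the two modes concrete, here is a minimal sketch in a throwaway repository. The file names, commit messages, and identity settings are made up for illustration; the actual tests live in git's own test suite and run the same kind of commands against full and sparse checkouts.

```shell
set -e

# Scratch repository with two commits (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

mkdir -p folder1
echo a >a
echo b >folder1/b
git add .
git commit -q -m "base"

echo changed >folder1/b
git commit -q -a -m "update"

# Mode 1: compare the blobs found via two tree-ish arguments.
two_trees=$(git diff-tree -r --name-status HEAD~1 HEAD)

# Mode 2: a single commit, compared with its parent.
# --no-commit-id suppresses the commit ID line that is printed otherwise.
one_commit=$(git diff-tree -r --name-status --no-commit-id HEAD)

# Both report folder1/b as modified (status M).
echo "$two_trees"
echo "$one_commit"
```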

But that’s not enough. We need to be certain that the sparse index integration is correct, which means verifying that the two places where diff-tree reads the index still behave as intended when the index is sparse.

So I started digging into the source code, looking for evidence that the lower-level functions were already sparse-aware, but I couldn’t find any on my own. After I posted my first RFC patch, my mentor clarified a lot of concepts for me, and I came to understand why the integration works:

  1. The first index read was added in fd66bcc31ff (diff-tree: read the index so attribute checks work in bare repositories, 2017-12-06) to deal with reading ‘.gitattributes’ content. 77efbb366ab (attr: be careful about sparse directories, 2021-09-08) established that, in a sparse index, we do not try to load a ‘.gitattributes’ file from within a sparse directory.

  2. The second index access point is involved in rename detection, specifically when reading from stdin. This was initially added in f0c6b2a2fd9 ([PATCH] Optimize diff-tree -[CM] --stdin, 2005-05-27), where ‘setup’ was set to ‘DIFF_SETUP_USE_SIZE_CACHE | DIFF_SETUP_USE_CACHE’. That assignment was later modified to drop the ‘DIFF_SETUP_USE_CACHE’ in ff7fe37b053 (diff.c: move read_index() code back to the caller, 2018-08-13). However, ‘DIFF_SETUP_USE_SIZE_CACHE’ seems to be unused as of 6e0b8ed6d35 (diff.c: do not use a separate “size cache”., 2007-05-07) and nothing about ‘detect_rename’ otherwise indicates index usage.

With both index reads accounted for, we could be confident that the integration is correct. Finally, the performance tests behaved as expected, showing a ~98% reduction in execution time.