GSoC: Git Final Report

Previous blogs:

Week1-2

Week3-4

Week5-6

Week7-8

Week9-10

Week11-12

Summary

During my GSoC project, I worked on integrating a series of Git commands with the sparse index. I worked on five commands: git diff-files, git write-tree, git diff-tree, git worktree and git check-attr. For each command, I’ll explain what the situation was, what I did, and what the result was.

git diff-files:

Mailing List

Situation:

git diff-files was already largely compatible with sparse-checkout when I started, so the sparse index integration went ahead without much hindrance.

What I’ve done:

Before integrating the git diff-files builtin with the sparse index
feature, add tests to t1092-sparse-checkout-compatibility.sh to ensure
it currently works with sparse-checkout and will still work with sparse
index after that integration.

Remove full index requirement for ‘git diff-files’. Refactor the
ensure_expanded and ensure_not_expanded functions by introducing a
common helper function, ensure_index_state. Add a test to ensure the index
is not expanded in ‘git diff-files’.

Result:

The ‘p2000’ tests demonstrate a ~96% execution time reduction for ‘git
diff-files’ and a ~97% execution time reduction for ‘git diff-files’
limited to a single file, both using a sparse index.

git write-tree:

Mailing List

Situation:

The recursive algorithm for update_one() was already updated in 2de37c5
(cache-tree: integrate with sparse directory entries, 2021-03-03) to
handle sparse directory entries in the index.

What I’ve done:

Set command_requires_full_index to false for ‘git write-tree’.
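
To illustrate what turning off the full-index requirement means, here is a minimal sketch in C. It is a simplified model and not Git's actual code: only the flag name command_requires_full_index comes from this report, while the struct layouts and the functions read_index_model(), run_sparse_aware_command() and run_legacy_command() are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Git's settings and index types. */
struct repo_settings {
	int command_requires_full_index; /* 1: expand a sparse index on read */
};

struct index_state {
	int sparse_index;   /* 1 while sparse directory entries are collapsed */
	size_t entry_count; /* entries currently held in memory */
	size_t full_count;  /* entries the index would hold once expanded */
};

/* Model of reading the index: expand only when the command demands it. */
void read_index_model(struct index_state *istate,
		      const struct repo_settings *settings)
{
	assert(istate && settings);
	if (istate->sparse_index && settings->command_requires_full_index) {
		istate->entry_count = istate->full_count; /* the costly step */
		istate->sparse_index = 0;
	}
}

/* Model of a command that still relies on the default (full index). */
size_t run_legacy_command(struct index_state *istate)
{
	struct repo_settings settings = { .command_requires_full_index = 1 };

	read_index_model(istate, &settings);
	return istate->entry_count;
}

/* Model of a command that has been checked to be sparse-index safe. */
size_t run_sparse_aware_command(struct index_state *istate)
{
	struct repo_settings settings = { .command_requires_full_index = 1 };

	/* The integration step: opt out of the default expansion. */
	settings.command_requires_full_index = 0;

	read_index_model(istate, &settings);
	return istate->entry_count;
}
```

In this model, a sparse index with 10 in-memory entries stays at 10 entries for the sparse-aware command, while the legacy command grows it to the full count. Skipping that expansion is where the time savings in this report come from.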

Result:

The ‘p2000’ tests demonstrate a ~96% execution time reduction for ‘git
write-tree’ using a sparse index.

git diff-tree:

Mailing List

Situation:

The index is read in ‘cmd_diff_tree’ at two points:

  1. The first index read was added in fd66bcc31ff (diff-tree: read the
    index so attribute checks work in bare repositories, 2017-12-06) to deal
    with reading ‘.gitattributes’ content. 77efbb366ab (attr: be careful
    about sparse directories, 2021-09-08) established that, in a sparse
    index, we do not try to load a ‘.gitattributes’ file from within a
    sparse directory.

  2. The second index access point is involved in rename detection,
    specifically when reading from stdin. This was initially added in
    f0c6b2a2fd9 ([PATCH] Optimize diff-tree -[CM] --stdin, 2005-05-27), where
    ‘setup’ was set to ‘DIFF_SETUP_USE_SIZE_CACHE | DIFF_SETUP_USE_CACHE’.
    That assignment was later modified to drop the ‘DIFF_SETUP_USE_CACHE’ in
    ff7fe37b053 (diff.c: move read_index() code back to the caller,
    2018-08-13). However, ‘DIFF_SETUP_USE_SIZE_CACHE’ seems to be unused as
    of 6e0b8ed6d35 (diff.c: do not use a separate “size cache”., 2007-05-07)
    and nothing about ‘detect_rename’ otherwise indicates index usage.

What I’ve done:

Set command_requires_full_index to false for ‘git diff-tree’. Add a test to ensure the index is not expanded, whether ‘git diff-tree’ is run on a file inside or outside the cone.

Result:

The ‘p2000’ tests demonstrate a ~98% execution time reduction for
‘git diff-tree’ using a sparse index.

git worktree:

Mailing List

Situation:

The index is read in ‘worktree.c’ at two points:

1. The ‘validate_no_submodules’ function, which checks if there are any
submodules present in the worktree.

2. The ‘check_clean_worktree’ function, which verifies if a worktree is
‘clean’, i.e., there are no untracked or modified but uncommitted files.
This is done by running the ‘git status’ command, and an error message
is reported if the worktree is not clean. Given that ‘git status’ is
already sparse-aware, the function is also sparse-aware.

What I’ve done:

Set command_requires_full_index to false for ‘git worktree’. Add tests
that verify that ‘git worktree’ behaves correctly when the sparse index
is enabled, and a test to ensure the index is not expanded.

Result:

The ‘p2000’ tests demonstrate a ~20% execution time reduction for
‘git worktree’ using a sparse index.
(Note: the p2000 test results do not show a large speedup because the
index reading time is minuscule compared to the filesystem
operations.)

git check-attr:

Mailing List

Situation:

Before this patch, git check-attr was unable to read the attributes from
a .gitattributes file within a sparse directory. The original code comment
assumed that users are only interested in files or directories inside the
cone, so in a cone-mode sparse-checkout the code did not load the
.gitattributes file from a sparse directory.

However, this behavior can lead to missing attributes for files inside
sparse directories, causing inconsistencies in file handling.

What I’ve done:

Add tests for ‘git check-attr’ to make sure the attribute file is read
from the index when the path is either inside or outside of the
sparse-checkout definition.

Add a test named ‘diff --check with pathspec outside sparse definition’
to ensure that the attribute rules are applied correctly even when the
file’s path is outside the sparse-checkout definition.

Revise ‘git check-attr’ to allow attribute reading for
files in sparse directories from the corresponding .gitattributes files:

1. Use path_in_cone_mode_sparse_checkout() and index_name_pos_sparse()
to check whether a path falls within a sparse directory.

2. If the path is inside a sparse directory, use the value returned by
index_name_pos_sparse() to find the sparse directory containing the path
and the path relative to that sparse directory. Then read the attributes
from the tree OID of the sparse directory using read_attr_from_blob().

3. If the path is not inside a sparse directory, fetch the attributes
from the index blob with read_blob_data_from_index().
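
The lookup in the steps above can be sketched with a small, self-contained C model. This is not the code from the patch: struct entry, name_pos() and find_sparse_dir() are hypothetical names. The sketch also assumes that index entries are sorted by name, that sparse directories are stored with a trailing '/', and that a failed search returns the negated insertion point minus one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified index entry; real entries carry much more. */
struct entry {
	const char *name;  /* sparse directories end with a trailing '/' */
	int is_sparse_dir;
};

/*
 * Binary search over entries sorted by name. Returns the position on an
 * exact match, otherwise -(insertion point) - 1.
 */
int name_pos(const struct entry *entries, size_t nr, const char *path)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(path, entries[mid].name);

		if (!cmp)
			return (int)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -(int)lo - 1;
}

/*
 * If path lies inside a sparse directory, return that entry's position
 * and point *relative at the rest of the path; otherwise return -1.
 */
int find_sparse_dir(const struct entry *entries, size_t nr,
		    const char *path, const char **relative)
{
	int pos, dir_pos;
	size_t len;

	assert(entries && path && relative);
	pos = name_pos(entries, nr, path);
	if (pos >= 0)
		return -1; /* the path itself is an index entry */

	dir_pos = -pos - 2; /* the entry sorted immediately before path */
	if (dir_pos < 0 || !entries[dir_pos].is_sparse_dir)
		return -1;

	len = strlen(entries[dir_pos].name);
	if (strncmp(entries[dir_pos].name, path, len))
		return -1; /* preceding entry is not a prefix of path */

	*relative = path + len;
	return dir_pos;
}
```

With a model index holding README, deep/a.txt, folder1/ (sparse), folder2/ (sparse) and z.txt, looking up folder1/.gitattributes finds the sparse directory folder1/ with the relative path .gitattributes. That corresponds to step 2, where the attributes would be read from the directory's tree. Looking up deep/.gitattributes returns -1, which corresponds to step 3 and the regular index blob.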

After testing, I turned off command_requires_full_index, marking this command as compatible with the sparse index.

Result:

The ‘p2000’ tests demonstrate a ~63% execution time reduction for ‘git check-attr’ using a sparse index.

Final Words:

I had a great and very productive time working on this project. Thanks to this project, I have learnt many things about Git, C and Bash!

In the community, I learned that it’s very important to explain my code and ideas clearly. Talking and sharing thoughts with other developers can help find new ideas, better solutions, or mistakes I didn’t see before. I also learned that when I get stuck or can’t think of something new, it’s good to ask for help. Maybe talking to a mentor or someone in the community can give me a new point of view. I also got better at using debugging tools to understand code and find mistakes.

Thank you to my mentors, Victoria. None of this would have been completed without the constant guidance of my mentors.

Finally, I would like to thank this amazing community. Thanks for being patient with me and clarifying my doubts.