-
Notifications
You must be signed in to change notification settings - Fork 166
Add support for indexing submodules without repocache #985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for indexing submodules without repocache #985
Conversation
The filesystem argument submitted to git.Open() is of the .git directory, it should be for the worktree, as in go-git this logic is copied from: https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375 One side effect of this bug was that go-git would refuse to fetch repo submodules
RepoURLs for subrepositories are currently filtered out when results are streamed per-repo from shards.
When not using repo cache, index git submodules recursively into subrepos using the go-git API
keegancsmith
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so approving! But I wanna just do a bit more review early next week before landing.
|
|
||
| // Because we're keeping descriptors open, we need to close the storage object when we're done. | ||
| repo, err := git.Open(s, fs) | ||
| repo, err := git.Open(s, wt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find on this. However, I just wanna read more code / do more testing early next week before landing it.
search/shards.go
Outdated
| // Filter RepoURLs and LineFragments to only those of repoName and its subRepositories | ||
| repoNames := []string{repoName} | ||
| for _, file := range result.Files[a:b] { | ||
| repoNames = append(repoNames, file.SubRepositoryName) | ||
| } | ||
|
|
||
| filteredRepoURLs := make(map[string]string) | ||
| filteredLineFragments := make(map[string]string) | ||
| for _, name := range repoNames { | ||
| if url, ok := result.RepoURLs[name]; ok { | ||
| filteredRepoURLs[name] = url | ||
| } | ||
| if frag, ok := result.LineFragments[name]; ok { | ||
| filteredLineFragments[name] = frag | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks like it does quite a lot of allocating. Can you test this alternative implementation out.
| // Filter RepoURLs and LineFragments to only those of repoName and its subRepositories | |
| repoNames := []string{repoName} | |
| for _, file := range result.Files[a:b] { | |
| repoNames = append(repoNames, file.SubRepositoryName) | |
| } | |
| filteredRepoURLs := make(map[string]string) | |
| filteredLineFragments := make(map[string]string) | |
| for _, name := range repoNames { | |
| if url, ok := result.RepoURLs[name]; ok { | |
| filteredRepoURLs[name] = url | |
| } | |
| if frag, ok := result.LineFragments[name]; ok { | |
| filteredLineFragments[name] = frag | |
| } | |
| } | |
| filteredRepoURLs := map[string]string{repoName: result.RepoURLs[repoName]} | |
| filteredLineFragments := map[string]string{repoName: result.LineFragments[repoName]} | |
| // Filter RepoURLs and LineFragments to only those of repoName and its | |
| // subRepositories if there are subRepositories | |
| for _, file := range result.Files[a:b] { | |
| name := file.SubRepositoryName | |
| if name == "" { | |
| continue | |
| } | |
| if _, ok := filteredRepoURLs[name]; ok { | |
| continue | |
| } | |
| if url, ok := result.RepoURLs[name]; ok { | |
| filteredRepoURLs[name] = url | |
| } | |
| if frag, ok := result.LineFragments[name]; ok { | |
| filteredLineFragments[name] = frag | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops... I intended to refactor repoNames into a set implementation after a quick test, totally slipped my mind, sorry for the oversight.
Your proposed code works well in my testing, I'd only suggest splitting the skip condition for potential future edge cases
| // Filter RepoURLs and LineFragments to only those of repoName and its subRepositories | |
| repoNames := []string{repoName} | |
| for _, file := range result.Files[a:b] { | |
| repoNames = append(repoNames, file.SubRepositoryName) | |
| } | |
| filteredRepoURLs := make(map[string]string) | |
| filteredLineFragments := make(map[string]string) | |
| for _, name := range repoNames { | |
| if url, ok := result.RepoURLs[name]; ok { | |
| filteredRepoURLs[name] = url | |
| } | |
| if frag, ok := result.LineFragments[name]; ok { | |
| filteredLineFragments[name] = frag | |
| } | |
| } | |
| filteredRepoURLs := map[string]string{repoName: result.RepoURLs[repoName]} | |
| filteredLineFragments := map[string]string{repoName: result.LineFragments[repoName]} | |
| // Filter RepoURLs and LineFragments to only those of repoName and its | |
| // subRepositories if there are subRepositories | |
| for _, file := range result.Files[a:b] { | |
| name := file.SubRepositoryName | |
| if name == "" { | |
| continue | |
| } | |
| _, repoSet := filteredRepoURLs[name] | |
| url, ok := result.RepoURLs[name] | |
| if !repoSet && ok { | |
| filteredRepoURLs[name] = url | |
| } | |
| _, fragSet := filteredLineFragments[name] | |
| frag, ok := result.LineFragments[name] | |
| if !fragSet && ok { | |
| filteredLineFragments[name] = frag | |
| } | |
| } |
|
I took a look at the failing submodule test, it uncovered a bit of a further discussion on desired submodule treatment Test repocache: Line 40 in e1cef6f
# gerrit.googlesource.com/adir.git/config
...
[remote "origin"]
url = http://gerrit.googlesource.com/adir# gerrit.googlesource.com/bdir.git/config
...
[remote "origin"]
url = /tmp/<temphash>/bdir# adir/.gitmodules
[submodule "bname"]
path = bname
url = ../bdirWhen indexing Lines 53 to 57 in e1cef6f
I didn't consider that submodules actually have 2 sources of urls: the url in Imo, this should be ignored, and the submodule zoekt name/url should be treated the same as with parent repo:
However, this makes the behavior inconsistent with the repo-cache case. Thoughts? |
|
Is this relative URL stuff a hack in zoekt, or is it matching the behaviour of git? If the former, I imagine this is very specific to indexing gerrit which was the original use case for zoekt. In practice I don't know if zoekt is actually used for gerrit these days. The search at https://cs.opensource.google/ is not using zoekt. To be honest I agree with your opinion. I imagine the repo-cache should be setup so that this all just works? But yeah, I will rely a lot on your opinion here since I've never really touched this code in Zoekt + I don't like git submodules :P |
- Revert repo-cache behavior to unchanged - Add support for naming repos and modules with no remote (use the dirname for the former and subrepopath for the latter) - Add test for submodule no-repo-cache behavior
Looks like a gerrit-specific treatment to me, yeah. Git only supports EITHER relative paths OR urls, doesn't mix them together. Sounds like the repo-cache treatment has a specific use-case that should remain consistent, while the behavior this PR is introducing can make other choices. I implemented the submodule naming scheme I proposed, and committed the overallocation fix we discussed |
|
Sounds great. Mind working on getting |
Missed the fact that it is already done in zoekt-git-index command, the test was improperly configured https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
|
Should pass now with this last change |
|
Thanks so much for the high quality PR! |
* bump go-ctags for handling non-fatal startup errors (sourcegraph#972) Currently if you switch to universal-ctags 6.2.0 it will log a few warnings on startup. The old version of go-ctags would error out. Now instead it will log to its info log. Test Plan: Added the problematic file in the linked issue to ./foo/repo directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index ./foo/index ./foo/repo" with uctags 6.2.0. * switch to github.com/sourcegraph/cody-public-snapshot for e2e test (sourcegraph#974) e2e tests were failing because github.com/sourcegraph/cody repository has been set to private and all tags were removed Test plan: CI * sourcegraph: remove GRPC index methods (sourcegraph#977) This was part of an effort to move the queue from Zoekt to Sourcegraph. However, we are not going to pursue this for now and we can remove the corresponging grpc methods. Sourcegraph never had callers of Index and Delete, so both are save to remove. `DeleteAllData` is still being called from Sourcegraph. Test plan: CI * chore: fix grpc drift (sourcegraph#978) In sourcegraph#962 a new field metadata was added to zoekt.Repository, but it was not added to the grpc types and converter methods, which is why the fuzz test occasionally fails. Test plan: CI * copy languages package from Sourcegraph to Zoekt (sourcegraph#979) We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it. Notes: - Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward. - Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case. Test plan: I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph * gitindex: move package from internal (sourcegraph#981) Co-authored-by: ARyzhikov <[email protected]> * Cache docMatchTree results for Meta conditions (sourcegraph#982) * go mod tidy (sourcegraph#983) * Add support for indexing submodules without repocache (sourcegraph#985) * Fix git.Open() fs argument for .git dir instead of worktree The filesystem argument submitted to git.Open() is of the .git directory, it should be for the worktree, as in go-git this logic is copied from: https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375 One side effect of this bug was that go-git would refuse to fetch repo submodules * Fix missing RepoURLs and LineFragments for result subrepositories RepoURLs for subrepositories are currently filtered out when results are streamed per-repo from shards. * Add submodule indexing without repo cache When not using repo cache, index git submodules recursively into subrepos using the go-git API * Adjust subrepo naming conventions - Revert repo-cache behavior to unchanged - Add support for naming repos and modules with no remote (use the dirname for the former and subrepopath for the latter) - Add test for submodule no-repo-cache behavior * Fix overallocation in repoURL and lineFragment filtering * Revert basename treatment from 08f6759 Missed the fact that it is already done in zoekt-git-index command, the test was improperly configured https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100 --------- Co-authored-by: Keegan Carruthers-Smith <[email protected]> Co-authored-by: Stefan Hengl <[email protected]> Co-authored-by: Aleksander Ryzhickov <[email protected]> Co-authored-by: ARyzhikov <[email protected]> Co-authored-by: John Mason <[email protected]> Co-authored-by: Phillip Dykman <[email protected]>
This PR aims to make git repo submodule indexing much more approachable, removing the requirement for a repo cache
Related discussion : #984
Behavior changes:
zoekt-git-indexis run with-submodules=trueand-repo-cacheis not empty, the behavior remains unchangedzoekt-git-indexis run with-submodules=trueand-repo-cacheis empty, recurse on submodules usinggo-gitinsteadThis required 2 related bug fixes:
085b391 fixes an issue where we incorrectly replicated
git.PlainOpen()functionality when setting up filesystem storage. A side effect of this is thatgo-gitrefuses to return submodules (not sure if there are others).Side note: sent
go-gita PR to avoid this code replication entirely:Add support for storage options to PlainOpenWithOptions go-git/go-git#1686
f5f7877 adds subrepository support to RepoURL and LineFragment treatment - without this, results involving subrepositories can't be used to construct correct remote URLs
Functionality Example:
indexing Chataigne
search query
repo list query
Followups
Some other features I'd like to follow up on in future PRs:
-submodule-treatment subrepository | repository (default: subrepository)cli flag.