Skip to content

Conversation

@H3mul
Copy link
Contributor

@H3mul H3mul commented Oct 8, 2025

This PR aims to make git repo submodule indexing much more approachable, removing the requirement for a repo cache

Related discussion : #984

Behavior changes:

  • If zoekt-git-index is run with -submodules=true and -repo-cache is not empty, the behavior remains unchanged
  • If zoekt-git-index is run with -submodules=true and -repo-cache is empty, recurse on submodules using go-git instead

This required 2 related bug fixes:

  • 085b391 fixes an issue where we incorrectly replicated git.PlainOpen() functionality when setting up filesystem storage. A side effect of this is that go-git refuses to return submodules (not sure if there are others).

    Side note: sent go-git a PR to avoid this code replication entirely:
    Add support for storage options to PlainOpenWithOptions go-git/go-git#1686

  • f5f7877 adds subrepository support to RepoURL and LineFragment treatment - without this, results involving subrepositories can't be used to construct correct remote URLs

Functionality Example:

indexing Chataigne

$ zoekt-git-index -submodules=true ~/dev/Chataigne

2025/10/08 12:21:31 adding subrepository files from: Modules/juce_timeline
2025/10/08 12:21:31 adding subrepository files from: asio
2025/10/08 12:21:31 adding subrepository files from: Modules/juce_simpleweb
2025/10/08 12:21:31 adding subrepository files from: Modules/juce_sharedtexture
2025/10/08 12:21:31 failed to open submodule repository: Modules/juce_serial, submodule not initialized
2025/10/08 12:21:31 adding subrepository files from: Modules/juce_dmx
2025/10/08 12:21:31 adding subrepository files from: Modules/juce_organicui
2025/10/08 12:21:31 attempting to index 3745 total files
2025/10/08 12:04:20 finished shard .zoekt/github.com%2Fbenkuper%2FChataigne_v16.00000.zoekt: 91189557 index bytes (overhead 2.7), 3745 files processed 
search query
$ curl -XPOST -d '{"Q":"juce"}' localhost:6070/api/search

...

     {
        "FileName": "Modules/juce_dmx/README.md",
        "Repository": "github.com/benkuper/Chataigne",
        "SubRepositoryName": "github.com/benkuper/juce_dmx",
        "SubRepositoryPath": "Modules/juce_dmx",
        "Version": "dba0680390873dd1a4b820aa77afff78d61894d9",
        "Language": "Markdown",
        "Branches": [
          "HEAD"
        ],
        "LineMatches": [

...

    "RepoURLs": {
      "github.com/benkuper/Chataigne": "{{URLJoinPath \"https://github.com/benkuper/Chataigne\" \"blob\" .Version .Path}}",
      "github.com/benkuper/asio": "{{URLJoinPath \"https://github.com/benkuper/asio\" \"blob\" .Version .Path}}",
      "github.com/benkuper/juce_dmx": "{{URLJoinPath \"https://github.com/benkuper/juce_dmx\" \"blob\" .Version .Path}}",
      "github.com/benkuper/juce_organicui": "{{URLJoinPath \"https://github.com/benkuper/juce_organicui\" \"blob\" .Version .Path}}",
      "github.com/benkuper/juce_sharedtexture": "{{URLJoinPath \"https://github.com/benkuper/juce_sharedtexture\" \"blob\" .Version .Path}}",
      "github.com/benkuper/juce_simpleweb": "{{URLJoinPath \"https://github.com/benkuper/juce_simpleweb\" \"blob\" .Version .Path}}",
      "github.com/benkuper/juce_timeline": "{{URLJoinPath \"https://github.com/benkuper/juce_timeline\" \"blob\" .Version .Path}}"
    },
    "LineFragments": {
      "github.com/benkuper/Chataigne": "#L{{.LineNumber}}",
      "github.com/benkuper/asio": "#L{{.LineNumber}}",
      "github.com/benkuper/juce_dmx": "#L{{.LineNumber}}",
      "github.com/benkuper/juce_organicui": "#L{{.LineNumber}}",
      "github.com/benkuper/juce_sharedtexture": "#L{{.LineNumber}}",
      "github.com/benkuper/juce_simpleweb": "#L{{.LineNumber}}",
      "github.com/benkuper/juce_timeline": "#L{{.LineNumber}}"
    }
...

repo list query
$ curl -XPOST -d '{}' localhost:6070/api/list

...
     {
        "Repository": {
          "TenantID": 0,
          "ID": 0,
          "Name": "github.com/benkuper/Chataigne",
          "URL": "https://github.com/benkuper/Chataigne",
          "Metadata": null,
          "Source": "/home/hemul/dev/Chataigne",
          "Branches": [
            {
              "Name": "HEAD",
              "Version": "e4f150fe5d7fde4c8a40937d3e91cfa1ba13accc"
            }
          ],
          "SubRepoMap": {
            "Modules/juce_dmx": {
              "TenantID": 0,
              "ID": 0,
              "Name": "github.com/benkuper/juce_dmx",
              "URL": "https://github.com/benkuper/juce_dmx",
              "Metadata": null,
              "Source": "",
              "Branches": [
                {
                  "Name": "HEAD",
                  "Version": "dba0680390873dd1a4b820aa77afff78d61894d9"
                }
              ],
              "SubRepoMap": null,
              "CommitURLTemplate": "{{URLJoinPath \"https://github.com/benkuper/juce_dmx\" \"commit\" .Version}}",
              "FileURLTemplate": "{{URLJoinPath \"https://github.com/benkuper/juce_dmx\" \"blob\" .Version .Path}}",
              "LineFragmentTemplate": "#L{{.LineNumber}}",
              "RawConfig": null,
              "Rank": 0,
              "IndexOptions": "",
              "HasSymbols": false,
              "Tombstone": false,
              "LatestCommitDate": "0001-01-01T00:00:00Z"
            },
            ...

Followups

Some other features I'd like to follow up on in future PRs:

  • Subrepo results are quite cumbersome: the submodule file path gets appended onto parent repo submodule path, this needs to be split to get a correct remote URL. It might be a good idea to add an option to index submodules as separate zoekt repos instead of subrepos, eg a -submodule-treatment subrepository | repository (default: subrepository) cli flag.
  • supporting incremental/delta builds

H3mul added 3 commits October 7, 2025 23:52
The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:

https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375

One side effect of this bug was that go-git would refuse to fetch repo
submodules
RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.
When not using repo cache, index git submodules recursively into

subrepos using the go-git API
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so approving! But I wanna just do a bit more review early next week before landing.


// Because we're keeping descriptors open, we need to close the storage object when we're done.
repo, err := git.Open(s, fs)
repo, err := git.Open(s, wt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find on this. However, I just wanna read more code / do more testing early next week before landing it.

search/shards.go Outdated
Comment on lines 820 to 835
// Filter RepoURLs and LineFragments to only those of repoName and its subRepositories
repoNames := []string{repoName}
for _, file := range result.Files[a:b] {
repoNames = append(repoNames, file.SubRepositoryName)
}

filteredRepoURLs := make(map[string]string)
filteredLineFragments := make(map[string]string)
for _, name := range repoNames {
if url, ok := result.RepoURLs[name]; ok {
filteredRepoURLs[name] = url
}
if frag, ok := result.LineFragments[name]; ok {
filteredLineFragments[name] = frag
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks like it does quite a lot of allocating. Can you test this alternative implementation out.

Suggested change
// Filter RepoURLs and LineFragments to only those of repoName and its subRepositories
repoNames := []string{repoName}
for _, file := range result.Files[a:b] {
repoNames = append(repoNames, file.SubRepositoryName)
}
filteredRepoURLs := make(map[string]string)
filteredLineFragments := make(map[string]string)
for _, name := range repoNames {
if url, ok := result.RepoURLs[name]; ok {
filteredRepoURLs[name] = url
}
if frag, ok := result.LineFragments[name]; ok {
filteredLineFragments[name] = frag
}
}
filteredRepoURLs := map[string]string{repoName: result.RepoURLs[repoName]}
filteredLineFragments := map[string]string{repoName: result.LineFragments[repoName]}
// Filter RepoURLs and LineFragments to only those of repoName and its
// subRepositories if there are subRepositories
for _, file := range result.Files[a:b] {
name := file.SubRepositoryName
if name == "" {
continue
}
if _, ok := filteredRepoURLs[name]; ok {
continue
}
if url, ok := result.RepoURLs[name]; ok {
filteredRepoURLs[name] = url
}
if frag, ok := result.LineFragments[name]; ok {
filteredLineFragments[name] = frag
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops... I intended to refactor repoNames into a set implementation after a quick test, totally slipped my mind, sorry for the oversight.

Your proposed code works well in my testing, I'd only suggest splitting the skip condition for potential future edge cases

Suggested change
// Filter RepoURLs and LineFragments to only those of repoName and its subRepositories
repoNames := []string{repoName}
for _, file := range result.Files[a:b] {
repoNames = append(repoNames, file.SubRepositoryName)
}
filteredRepoURLs := make(map[string]string)
filteredLineFragments := make(map[string]string)
for _, name := range repoNames {
if url, ok := result.RepoURLs[name]; ok {
filteredRepoURLs[name] = url
}
if frag, ok := result.LineFragments[name]; ok {
filteredLineFragments[name] = frag
}
}
filteredRepoURLs := map[string]string{repoName: result.RepoURLs[repoName]}
filteredLineFragments := map[string]string{repoName: result.LineFragments[repoName]}
// Filter RepoURLs and LineFragments to only those of repoName and its
// subRepositories if there are subRepositories
for _, file := range result.Files[a:b] {
name := file.SubRepositoryName
if name == "" {
continue
}
_, repoSet := filteredRepoURLs[name]
url, ok := result.RepoURLs[name]
if !repoSet && ok {
filteredRepoURLs[name] = url
}
_, fragSet := filteredLineFragments[name]
frag, ok := result.LineFragments[name]
if !fragSet && ok {
filteredLineFragments[name] = frag
}
}

@H3mul
Copy link
Contributor Author

H3mul commented Oct 10, 2025

I took a look at the failing submodule test, it uncovered a bit of a further discussion on desired submodule treatment

Test repocache:

func createSubmoduleRepo(dir string) error {

├── adir      #adir worktree
├── bdir      #bdir worktree
└── gerrit.googlesource.com   # repocache root
    ├── adir.git   # bare repo with submodule
    └── bdir.git   # bare repo 
# gerrit.googlesource.com/adir.git/config
...
[remote "origin"]
    url = http://gerrit.googlesource.com/adir
# gerrit.googlesource.com/bdir.git/config
...
[remote "origin"]
    url = /tmp/<temphash>/bdir
# adir/.gitmodules
[submodule "bname"]
    path = bname
    url = ../bdir

When indexing adir, repo-cache treatment resolves the bname subrepository name AND remote URL to gerrit.googlesource.com/bdir, despite that remote never being listed anywhere, simply because ../bdir is applied "relatively" to the gerrit.googlesource.com/adir remote.

zoekt/gitindex/tree.go

Lines 53 to 57 in e1cef6f

if strings.HasPrefix(relURL, "../") {
u := *rw.repoURL
u.Path = path.Join(u.Path, relURL)
return &u, nil
}

I didn't consider that submodules actually have 2 sources of urls: the url in .gitmodule and the url(s) in the submodule repo remote. The former can be accessed via submodule.Config().URL.

Imo, this should be ignored, and the submodule zoekt name/url should be treated the same as with parent repo:

  • if it has a remote - use it for name and url
  • if no remote - use dirname (or perhaps submodule path) for repo name and no url

However, this makes the behavior inconsistent with the repo-cache case. Thoughts?

@keegancsmith
Copy link
Member

Is this relative URL stuff a hack in zoekt, or is it matching the behaviour of git? If the former, I imagine this is very specific to indexing gerrit which was the original use case for zoekt. In practice I don't know if zoekt is actually used for gerrit these days. The search at https://cs.opensource.google/ is not using zoekt.

To be honest I agree with your opinion. I imagine the repo-cache should be setup so that this all just works? But yeah, I will rely a lot on your opinion here since I've never really touched this code in Zoekt + I don't like git submodules :P

H3mul added 2 commits October 14, 2025 13:25
- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
  former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior
@H3mul
Copy link
Contributor Author

H3mul commented Oct 14, 2025

Is this relative URL stuff a hack in zoekt, or is it matching the behaviour of git?

Looks like a gerrit-specific treatment to me, yeah. Git only supports EITHER relative paths OR urls, doesn't mix them together.

Sounds like the repo-cache treatment has a specific use-case that should remain consistent, while the behavior this PR is introducing can make other choices.

I implemented the submodule naming scheme I proposed, and committed the overallocation fix we discussed

@keegancsmith
Copy link
Member

Sounds great. Mind working on getting go test ./... to pass then will review and merge :)

Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured

https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100
@H3mul
Copy link
Contributor Author

H3mul commented Oct 15, 2025

Should pass now with this last change

@keegancsmith keegancsmith merged commit 90faf6d into sourcegraph:main Oct 16, 2025
8 checks passed
@keegancsmith
Copy link
Member

Thanks so much for the high quality PR!

@H3mul H3mul deleted the pr-support-submodules-without-repocache branch October 16, 2025 17:55
minhna1112 added a commit to minhna1112/zoekt that referenced this pull request Oct 16, 2025
* bump go-ctags for handling non-fatal startup errors (sourcegraph#972)

Currently if you switch to universal-ctags 6.2.0 it will log a few
warnings on startup. The old version of go-ctags would error out. Now
instead it will log to its info log.

Test Plan: Added the problematic file in the linked issue to ./foo/repo
directory. Then ran "go run ./cmd/zoekt-index -require_ctags -index
./foo/index ./foo/repo" with uctags 6.2.0.

* switch to github.com/sourcegraph/cody-public-snapshot for e2e test (sourcegraph#974)

e2e tests were failing because github.com/sourcegraph/cody repository has been set to private and all tags were removed

Test plan:
CI

* sourcegraph: remove GRPC index methods (sourcegraph#977)

This was part of an effort to move the queue from Zoekt to Sourcegraph.
However, we are not going to pursue this for now and we can remove
the corresponging grpc methods.

Sourcegraph never had callers of Index and Delete, so both are save to remove.

`DeleteAllData` is still being called from Sourcegraph.

Test plan:
CI

* chore: fix grpc drift (sourcegraph#978)

In sourcegraph#962 a new field metadata
was added to zoekt.Repository, but it was not added to the grpc
types and converter methods, which is why the fuzz test occasionally
fails.

Test plan:
CI

* copy languages package from Sourcegraph to Zoekt (sourcegraph#979)

We want Zoekt and Sourcegraph to use the same language package. In this PR we move the languages package from Sourcegraph to Zoekt, so that Zoekt can use it and Sourcegraph can import it.

Notes: 
- Zoekt doesn't need to fetch content async which is why I added a little helper func `GetLanguagesFromContent` to make the call sites in Zoekt less awkward.
- Sourcegraph's languages package always classified .cls files as Apex, while Zoekt did a content based check. With this PR we follow Zoekt's approach. Specifically, I removed .cls from `unsupportedByEnryExtensionToNameMap`. I added an additional unit test to cover this case.

Test plan:
I appended the test cases from the old Zoekt languages packages to the tests I copied over from Sourcegraph

* gitindex: move package from internal (sourcegraph#981)

Co-authored-by: ARyzhikov <[email protected]>

* Cache docMatchTree results for Meta conditions (sourcegraph#982)

* go mod tidy (sourcegraph#983)

* Add support for indexing submodules without repocache (sourcegraph#985)

* Fix git.Open() fs argument for .git dir instead of worktree

The filesystem argument submitted to git.Open() is of the .git
directory, it should be for the worktree, as in go-git this logic is copied from:

https://github.com/go-git/go-git/blob/145daf2492dd86b1d7e7787555ebd9837b93bff8/repository.go#L373-L375

One side effect of this bug was that go-git would refuse to fetch repo
submodules

* Fix missing RepoURLs and LineFragments for result subrepositories

RepoURLs for subrepositories are currently filtered out when results
are streamed per-repo from shards.

* Add submodule indexing without repo cache

When not using repo cache, index git submodules recursively into

subrepos using the go-git API

* Adjust subrepo naming conventions

- Revert repo-cache behavior to unchanged
- Add support for naming repos and modules with no remote (use the dirname for the
  former and subrepopath for the latter)
- Add test for submodule no-repo-cache behavior

* Fix overallocation in repoURL and lineFragment filtering

* Revert basename treatment from 08f6759

Missed the fact that it is already done in zoekt-git-index
command, the test was improperly configured

https://github.com/sourcegraph/zoekt/blob/4e4a529c3b63c7d4c7897ba736f1cd52cc163134/cmd/zoekt-git-index/main.go#L93-L100

---------

Co-authored-by: Keegan Carruthers-Smith <[email protected]>
Co-authored-by: Stefan Hengl <[email protected]>
Co-authored-by: Aleksander Ryzhickov <[email protected]>
Co-authored-by: ARyzhikov <[email protected]>
Co-authored-by: John Mason <[email protected]>
Co-authored-by: Phillip Dykman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants