
Large files with Git: LFS and git-annex

December 11, 2018

This article was contributed by Antoine Beaupré

Git does not handle large files very well. While there is work underway to handle large repositories through the commit graph work, Git's internal design has remained surprisingly constant throughout its history, which means that storing large files into Git comes with a significant and, ultimately, prohibitive performance cost. Thankfully, other projects are helping Git address this challenge. This article compares how Git LFS and git-annex address this problem and should help readers pick the right solution for their needs.

The problem with large files

As readers probably know, Linus Torvalds wrote Git to manage the history of the kernel source code, which is a large collection of small files. Every file is a "blob" in Git's object store, addressed by its cryptographic hash. A new version of that file will store a new blob in Git's history, with no deduplication between the two versions. The pack file format can store binary deltas between similar objects, but if many objects of similar size change in a repository, that algorithm might fail to properly deduplicate. In practice, large binary files (say JPEG images) have an irritating tendency of changing completely when even the smallest change is made, which makes delta compression useless.
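
To see that duplication at a low level, one can ask Git for the blob ID of each version of a file; a minimal sketch (the file name, the editing step, and the shortened hashes are all made up for illustration):

    $ git add photo.jpg && git commit -q -m "v1"
    $ git rev-parse :photo.jpg           # blob storing the first version
    9b2feb1100dd278f...
    $ mogrify -resize 99% photo.jpg      # any small edit to the image
    $ git add photo.jpg && git commit -q -m "v2"
    $ git rev-parse :photo.jpg           # an entirely new, undeduplicated blob
    4d7a214614ab2935...

Both blobs stay in the object store forever, which is what drives the repository growth described below.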

There have been different attempts at fixing this in the past. In 2006, Torvalds worked on improving the pack-file format to reduce object duplication between the index and the pack files. Those changes were eventually reverted because, as Nicolas Pitre put it: "that extra loose object format doesn't appear to be worth it anymore".

Then in 2009, Caca Labs worked on improving the fast-import and pack-objects Git commands to do special handling for big files, in an effort called git-bigfiles. Some of those changes eventually made it into Git: for example, since 1.7.6, Git will stream large files directly to a pack file instead of holding them all in memory. But files are still kept forever in the history.

An example of the trouble I had to deal with is the Debian security tracker, which follows all security issues in the entire Debian history in a single file. That file is around 360,000 lines for a whopping 18MB. The resulting repository takes 1.6GB of disk space and a local clone takes 21 minutes to perform, mostly taken up by Git resolving deltas. Commit, push, and pull are noticeably slower than in a regular repository, taking anywhere from a few seconds to a minute depending on how old the local copy is. And running annotate on that large file can take up to ten minutes. So even though that is a simple text file, it's grown large enough to cause significant problems for Git, which is otherwise known for stellar performance.
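
Readers who want to gauge this kind of overhead on a repository of their own can get a rough picture with standard Git commands; a sketch (the file path is a placeholder, and the timings quoted above come from the article, not from these commands):

    $ git count-objects -vH                       # total size of the object store
    $ time git clone file://$PWD /tmp/clone-test  # local clone, dominated by delta resolution
    $ time git annotate path/to/large-file >/dev/null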

Intuitively, the problem is that Git needs to copy files into its object store to track them. Third-party projects therefore typically solve the large-files problem by taking files out of Git. In 2009, Git evangelist Scott Chacon released GitMedia, which is a Git filter that simply takes large files out of Git. Unfortunately, there hasn't been an official release since then and it's unclear if the project is still maintained. The next effort to come up was git-fat, first released in 2012 and still maintained. But neither tool has seen massive adoption yet. If I had to venture a guess, it might be because both require manual configuration. Both also require a custom server (rsync for git-fat; S3, SCP, Atmos, or WebDAV for GitMedia), which limits collaboration since users need access to another service.

Git LFS

That was before GitHub released Git Large File Storage (LFS) in August 2015. Like all software taking files out of Git, LFS tracks file hashes instead of file contents. So instead of adding large files into Git directly, LFS adds a pointer file to the Git repository, which looks like this:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345

LFS then uses Git's smudge and clean filters to show the real file on checkout. Git only stores that small text file and does so efficiently. The downside, of course, is that large files are not version controlled: only the latest version of a file is kept in the repository.
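
Those filters are wired into Git's configuration when LFS is set up; the resulting entries typically look something like this (exact values can vary between LFS versions):

    $ git config --get-regexp '^filter\.lfs'
    filter.lfs.clean git-lfs clean -- %f
    filter.lfs.smudge git-lfs smudge -- %f
    filter.lfs.process git-lfs filter-process
    filter.lfs.required true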

Git LFS can be used in any repository by installing the right hooks with git lfs install then asking LFS to track any given file with git lfs track. This will add the file to the .gitattributes file which will make Git run the proper LFS filters. It's also possible to add patterns to the .gitattributes file, of course. For example, this will make sure Git LFS will track MP3 and ZIP files:

    $ cat .gitattributes
    *.mp3 filter=lfs -text
    *.zip filter=lfs -text

After this configuration, we use Git normally: git add, git commit, and so on will talk to Git LFS transparently.
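
Putting it together, a typical session might look like the following sketch (the file names are invented):

    $ git lfs install                    # one-time setup of hooks and filters
    $ git lfs track "*.iso"              # adds a pattern to .gitattributes
    $ git add .gitattributes installer.iso
    $ git commit -m "add installer image"
    $ git push                           # the pointer goes into Git, the content to the LFS server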

The actual files tracked by LFS are copied to a path like .git/lfs/objects/{OID-PATH}, where {OID-PATH} is a sharded file path of the form OID[0:2]/OID[2:4]/OID and where OID is the hash of the file's content (currently SHA-256). This brings the extra feature that multiple copies of the same file in the same repository are automatically deduplicated, although in practice this rarely occurs.
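
As an illustration of that layout, using the OID from the pointer file above, the on-disk path can be derived with ordinary shell string slicing:

    $ oid=4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    $ echo ".git/lfs/objects/${oid:0:2}/${oid:2:2}/$oid"
    .git/lfs/objects/4d/7a/4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393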

Git LFS will copy large files to that internal storage on git add. When a file is modified in the repository, Git notices, the new version is copied to the internal storage, and the pointer file is updated. The old version is left dangling until the repository is pruned.

This process only works for new files you are importing into Git, however. If a Git repository already has large files in its history, LFS can fortunately "fix" repositories by retroactively rewriting history with git lfs migrate. This has all the normal downsides of rewriting history, however — existing clones will have to be reset to benefit from the cleanup.
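
Such a retroactive conversion might look roughly like this (the patterns are examples; note again that this rewrites history, so collaborators must re-clone or reset):

    $ git lfs migrate info --everything                        # preview what would be converted
    $ git lfs migrate import --everything --include="*.mp3,*.zip"
    $ git push --force-with-lease origin --all                 # publish the rewritten history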

LFS also supports file locking, which allows users to claim a lock on a file, making it read-only everywhere except in the locking repository. This allows users to signal others that they are working on an LFS file. Those locks are purely advisory, however, as users can remove other users' locks by using the --force flag. LFS can also prune old or unreferenced files.
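
In command form, locking and pruning look roughly like this (the file name is made up):

    $ git lfs lock design.psd            # advisory lock, visible to other users
    $ git lfs locks                      # list existing locks
    $ git lfs unlock --force design.psd  # steal someone else's lock
    $ git lfs prune                      # delete old, unreferenced local copies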

The main limitation of LFS is that it's bound to a single upstream: large files are usually stored in the same location as the central Git repository. If it is hosted on GitHub, this means a default quota of 1GB storage and bandwidth, but you can purchase additional "packs" to expand both of those quotas. GitHub also limits the size of individual files to 2GB. This upset some users surprised by the bandwidth fees, which were previously hidden in GitHub's cost structure.

While the actual server-side implementation used by GitHub is closed source, there is a test server provided as an example implementation. Other Git hosting platforms have also implemented support for the LFS API, including GitLab, Gitea, and Bitbucket; that level of adoption is something that git-fat and GitMedia never achieved. LFS does support hosting large files on a server other than the central one — a project could run its own LFS server, for example — but this will involve a different set of credentials, bringing back the difficult user onboarding that affected git-fat and GitMedia.
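
Pointing a repository at such a separate LFS server is a matter of configuration, typically through a committed .lfsconfig file; a minimal sketch (the URL is a placeholder):

    $ git config -f .lfsconfig lfs.url https://lfs.example.com/myproject
    $ git add .lfsconfig
    $ git commit -m "store large files on our own LFS server"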

Another limitation is that LFS only supports pushing and pulling files over HTTP(S) — no SSH transfers. LFS uses some tricks to bypass HTTP basic authentication, fortunately. This also might change in the future as there are proposals to add SSH support, resumable uploads through the tus.io protocol, and other custom transfer protocols.

Finally, LFS can be slow. Every file added to LFS takes up double the space on the local filesystem as it is copied to the .git/lfs/objects storage. The smudge/clean interface is also slow: it works as a pipe, but buffers the file contents in memory each time, which can be prohibitive with files larger than available memory.

git-annex

The other main player in large file support for Git is git-annex. We covered the project back in 2010, shortly after its first release, but it's certainly worth discussing what has changed in the eight years since Joey Hess launched the project.

Like Git LFS, git-annex takes large files out of Git's history. The way it handles this is by storing a symbolic link to the file in .git/annex. We should probably credit Hess for this innovation, since the Git LFS storage layout is obviously inspired by git-annex. The original design of git-annex introduced all sorts of problems, however, especially on filesystems lacking symbolic-link support. So Hess has implemented different solutions to this problem. Originally, when git-annex detected such a "crippled" filesystem, it switched to direct mode, which kept files directly in the work tree, while internally committing the symbolic links into the Git repository. This design turned out to be a little confusing to users, including myself; I have managed to shoot myself in the foot more than once using this system.
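
In the traditional ("locked") mode, an annexed file is simply a symbolic link into that internal store; a rough illustration (the key and intermediate directories are abridged and purely illustrative):

    $ git annex init
    $ git annex add video.mp4
    $ ls -l video.mp4                    # symlink target abridged
    video.mp4 -> .git/annex/objects/.../SHA256E-s12345--4d7a...mp4
    $ git commit -m "add video"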

Since then, git-annex has adopted a different v7 mode that is also based on smudge/clean filters, which it called "unlocked files". Like Git LFS, unlocked files will double disk space usage by default. However, it is possible to reduce disk space usage by using "thin mode", which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.
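
Enabling that mode looks roughly like the following (option and command names as of git-annex 7.x; the file name is made up):

    $ git annex upgrade                  # move an existing repository to v7
    $ git config annex.thin true         # use hard links instead of a second full copy
    $ git annex unlock video.mp4         # turn the symlink into a regular, editable file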

Furthermore, git-annex in v7 mode suffers from some of the performance problems affecting Git LFS, because both use the smudge/clean filters. Hess actually has ideas on how the smudge/clean interface could be improved. He proposes changing Git so that it stops buffering entire files into memory, allows filters to access the work tree directly, and adds the hooks he found missing (for stash, reset, and cherry-pick). Git-annex already implements some tricks to work around those problems itself but it would be better for those to be implemented in Git natively.

Being more distributed by design, git-annex does not have the same "locking" semantics as LFS. Locking a file in git-annex means protecting it from changes, so files need to actually be in the "unlocked" state to be editable, which might be counter-intuitive to new users. In general, git-annex has some of those unusual quirks and interfaces that often come with more powerful software.

And git-annex is much more powerful: it not only addresses the "large-files problem" but goes much further. For example, it supports "partial checkouts" — downloading only some of the large files. I find that especially useful to manage my video, music, and photo collections, as those are too large to fit on my mobile devices. Git-annex also has support for location tracking, where it knows how many copies of a file exist and where, which is useful for archival purposes. And while Git LFS is only starting to look at transfer protocols other than HTTP, git-annex already supports a large number through a special remote protocol that is fairly easy to implement.
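
A few of those features in command form, as a sketch (the paths and the preferred-content expression are made up):

    $ git annex get music/album.flac     # partial checkout: fetch one file on demand
    $ git annex drop music/album.flac    # free local space; content remains in other repositories
    $ git annex whereis music/album.flac # location tracking: list repositories holding a copy
    $ git annex wanted . "include=*.mp3" # preferred content: what this repository should keep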

"Large files" is therefore only scratching the surface of what git-annex can do: I have used it to build an archival system for remote native communities in northern Québec, while others have built a similar system in Brazil. It's also used by the scientific community in projects like GIN and DataLad, which manage terabytes of data. Another example is the Japanese American Legacy Project which manages "upwards of 100 terabytes of collections, transporting them from small cultural heritage sites on USB drives".

Unfortunately, git-annex is not well supported by hosting providers. GitLab used to support it, but since it implemented Git LFS, it dropped support for git-annex, saying it was a "burden to support". Fortunately, thanks to git-annex's flexibility, it may eventually be possible to treat LFS servers as just another remote which would make git-annex capable of storing files on those servers again.

Conclusion

Git LFS and git-annex are both mature and well maintained programs that deal efficiently with large files in Git. LFS is easier to use and is well supported by major Git hosting providers, but it's less flexible than git-annex.

Git-annex, in comparison, allows you to store your content anywhere and espouses Git's distributed nature more faithfully. It also uses all sorts of tricks to save disk space and improve performance, so it should generally be faster than Git LFS. Learning git-annex, however, feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others. Where you stand on the "power-user" scale, along with project-specific requirements, will ultimately determine which solution is the right one for you.

Ironically, after thorough evaluation of large-file solutions for the Debian security tracker, I ended up proposing to rewrite history and split the file by year which improved all performance markers by at least an order of magnitude. As it turns out, keeping history is critical for the security team so any solution that moves large files outside of the Git repository is not acceptable to them. Therefore, before adding large files into Git, you might want to think about organizing your content correctly first. But if large files are unavoidable, the Git LFS and git-annex projects allow users to keep using most of their current workflow.





Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:27 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

One of my friends uses a franken-repository by putting large files in an SVN repository and storing their versions in a special .gitsvn file. Works surprisingly well.

Large files with Git: LFS and git-annex

Posted Dec 13, 2018 16:24 UTC (Thu) by MatyasSelmeci (guest, #86151) [Link]

This sounds cool. What does the .gitsvn file look like -- a simple path -> revision mapping? Is there a script that checks out specific files (e.g. via svn export/svn cat or something)? Does that happen automatically via some sort of git hooks?

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:28 UTC (Tue) by anarcat (subscriber, #66354) [Link] (4 responses)

As usual, I opened some bug reports and feature requests while writing this article. For the sake of transparency, I should also mention that I am a long-time git-annex user and even contributor, as my name sits in the thanks page under the "code and other bits" heading, which means I probably contributed some code to the project. I can't remember now what code exactly I contributed, but I certainly contributed to the documentation. That, in turn, may bias my point of view in favor of git-annex even though I tried to be as neutral as possible in my review of both projects, both of which I use on a regular basis, as I hinted in the article.

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:45 UTC (Tue) by warrax (subscriber, #103205) [Link] (3 responses)

I really *wanted* to like git-annex and use, but the lack of tutorial material (at the time, possibly different now) about how to would around NATs and things of that ilk really hampered me.

That and... some software just doesn't want to work sensibly with symlinks, unfortunately :(.

In the end I just chose unison for a star-topology sync (which it looks like git-annex effectively requires if you're behind a NAT). Works equally well with large and small files, but obviously not really *versioned* per se.

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:49 UTC (Tue) by warrax (subscriber, #103205) [Link]

Sorry for the absolute mess I made of the spelling in that.

*and use it

*how to would => how to work

I can only apologize.

problems symlinks and p2p: might be worth looking into git-annex again

Posted Dec 11, 2018 21:15 UTC (Tue) by anarcat (subscriber, #66354) [Link] (1 responses)

I've been thoroughly impressed by the new v6/v7 "unlocked files" mode. I only brushed over it in the article, but it's a radical change in the way git-annex manages files. It makes things *much* easier with regards to interoperability with other software: they can just modify files and then the operator commits the files normally with git. While there are still a few rough edges in the implementation, the idea is there and makes the entire thing actually workable on USB keys and so on. So you may want to reconsider from that aspect.

I find the p2p implementation to be a little too complex for my taste, but it's there: it uses magic-wormhole and Tor to connect peers across NAT. And from there you can create whatever topology you want. I would rather have seen a wormhole-only implementation, honestly, but maybe that would have been less of a match for g-a...

Anyways, long story short: if you ever looked at git-annex in the past and found it weird, well, it might soon be time to take a look again. It's still weird in some places (it's Haskell after all :p) and it's a complex piece of software, but I generally find that I can do everything I need with it. I am hoping to write a followup article about more in-depth git-annex use cases, specifically about archival and file synchronisation, soon (but probably after the new year)... I just had to get this specific article out first so that I don't get a "but what about LFS" blanket response to that other article.

problems symlinks and p2p: might be worth looking into git-annex again

Posted Dec 11, 2018 22:37 UTC (Tue) by warrax (subscriber, #103205) [Link]

I think I might try it again. Thanks for the "update", so to speak.

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:32 UTC (Tue) by joey (guest, #328) [Link] (1 responses)

Thanks for this unbiased and accurate comparison.

(BTW, the full irony is that I'm responsible for the Debian security tracker containing that single large file in the first place.)

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 20:36 UTC (Tue) by anarcat (subscriber, #66354) [Link]

I don't think anyone could have imagined that file would grow that big in 2004, so don't be too hard on yourself. (And yes, the irony didn't escape me, I just thought it would be unfair to pin that peculiar one on you... )

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 21:52 UTC (Tue) by corsac (subscriber, #49696) [Link] (6 responses)

CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history.

splitting the large CVE list in the security tracker

Posted Dec 11, 2018 22:40 UTC (Tue) by anarcat (subscriber, #66354) [Link] (5 responses)

CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history.

I have actually done the work to split that file, including history, first with a shallow clone of 1000 commits and then with the full history. Even when keeping the full history of all those 52k commits, the "split by year" repository takes up a lot less space than the original repository (145MB vs 1.6GB, an order of magnitude smaller).

Performance is also significantly improved by an order of magnitude: cloning the repository (locally) takes 2 minutes instead of 21 minutes. And of course, running "git annotate" or "git log" on the individual files is much faster than on the larger file, although that's a bit of an unfair comparison.

So splitting the file gets rid of most of the performance issues the repository suffers from, at least according to the results I have been able to produce. The problem is it involves some changes in the workflow, from what I understand, particularly at times like this when we are likely to get CVEs from two different years (2018 and 2019, possibly three with 2017) which means working over multiple files. But it seems to me this is something that's easier to deal with than fixing fundamental design issues with git's internal storage. :)

splitting the large CVE list in the security tracker

Posted Dec 12, 2018 0:10 UTC (Wed) by JoeBuck (subscriber, #2330) [Link] (4 responses)

Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout. Perhaps the technique could be generalized to handle cases where files grow roughly by appending (I say "roughly" because multiple development branches would do appends and then merges would be required), so that older sections of the file remain unchanged.

chunked files

Posted Dec 12, 2018 0:22 UTC (Wed) by anarcat (subscriber, #66354) [Link] (1 responses)

This didn't make it to the final text, but that's something that could be an interesting lead in fixing the problem in git itself: chunking. Many backup tools (like restic, borg, and bup) use a "rolling checksum" system (think rsync, but for storage) to extract the "chunks" that should be stored, instead of limiting the data to be stored on file boundaries. This makes it possible to deduplicate across multiple versions of the same files more efficiently and transparently.

Incidentally, git-annex supports bup as a backend. And so when I asked joeyh about implementing chunking support in the git-annex backend (it already supports chunked transfers), that's what he answered of course. :)
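
For reference, setting up such a bup-backed special remote goes something like this, if I recall the syntax correctly (host and path are placeholders):

    $ git annex initremote mybup type=bup encryption=none buprepo=backup.example.com:/srv/bup
    $ git annex copy bigfile.iso --to mybup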

That would be the ultimate git killer feature, in my opinion, as it would permanently solve the large file problem. But having worked on the actual implementation of such rolling checksum backup software, I can tell you it is *much* harder to wrap your head around that data structure than git's more elegant design.

Maybe it could be a new pack format?

chunked files

Posted Dec 14, 2018 1:13 UTC (Fri) by pixelpapst (guest, #55301) [Link]

I think the "new pack format" idea is spot on, and something I have been contemplating for a few months now, inspired by casync.

The chunking approach and on-disk data structure seem solid; git would probably use a standard casync chunk store, but a git-specific index file.

(Just for giggles, I've been meaning to evaluate how much space would be shared when backing up a casync-ified .git directory (including its chunk store) and the checked-out objects to a different, common casync chunk store.)

I cannot wait to see to what new heights git-annex would grow in a world where every ordinary git user already had basic large-file interoperability with it.

(Anarcat, thank you for educating people about git-annex and all your documentation work.)

splitting the large CVE list in the security tracker

Posted Dec 12, 2018 9:02 UTC (Wed) by mjthayer (guest, #39183) [Link] (1 responses)

> Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout.

Taking this further, what about losslessly decompiling certain well-known binary formats? Not sure if it would work for e.g. PDF. Structured documents could be saved as folders containing files. Would the smudge/clean filters Antoine mentioned work for that?

On the other hand, I wonder how many binary files could really be versioned sensibly which do not have some accessible source format which could be checked into git instead. I would imagine that e.g. most JPEGs would be successive versions which did not have much in common with each other from a compression point of view. It would just be the question - does one need all versions in the repository or not? And if one does, well not much to be done.

splitting the large CVE list in the security tracker

Posted Dec 15, 2018 0:59 UTC (Sat) by nix (subscriber, #2304) [Link]

The LZMA compression system already does some of this, with a customizable filter system, though at the moment the only non-conventional-compression filters are filters for a lot of ISAs that can absolutize relative jumps to increase the redundancy of executables. :)

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 23:03 UTC (Tue) by ralt (subscriber, #103458) [Link] (4 responses)

Hmm... no mention of GVFS? :-)

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 23:14 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

What's a GNOME library got to do with this? ;)

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 23:23 UTC (Tue) by ralt (subscriber, #103458) [Link]

There are only two hard problems...

GFS (no, not that one) AKA VFS for git

Posted Dec 12, 2018 0:16 UTC (Wed) by anarcat (subscriber, #66354) [Link]

You know what, that's true, I totally forgot about GVFS (which we should apparently call "VFS for Git" now). That's probably because, first, it just doesn't seem to run on Linux, from what I can tell. To be more precise, it's still at the "prototype" stage, so certainly not something that seems "enterprise-scale" to me.

It could be a promising lead to fix the Debian security team repository size issues, mind you, but then we'd have to figure out how to host the server side of things and I don't know how *that* works either.

Frankly, it looks like a Microsoft thing that's not ready for us mortals, unfortunately. At least the LFS folks had the decency of providing us with usable releases and a test server people could build on top of... But maybe it will become a usable alternative.

Large files with Git: LFS and git-annex

Posted Jul 27, 2019 19:25 UTC (Sat) by rweaver6 (guest, #128342) [Link]

I came upon this discussion very late, while investigating GVFS/VFSforGit.

VFSforGit was not designed to solve a large-file problem. See https://docs.microsoft.com/en-us/azure/devops/learn/git/g... .

It was designed to help adapt Git to Microsoft's internal Windows development repository, which was 3.5M files in a zillion directories and branches, 300GB total. Gigantic repository, file size not really the issue. Obviously it will contain some large files as well, but it's not what was limiting their ability to move Windows development to Git.

Whether the file system virtualization provided by VFSforGit *could* be made to help Git also with large files is an interesting question.

Large files with Git: LFS and git-annex

Posted Dec 11, 2018 23:44 UTC (Tue) by ejr (subscriber, #51652) [Link]

The problem **FOR ME** with git-annex is platform support. I deal with platforms that have a C compiler, a kinda-sorta-C++ compiler, and that's it. I use git annex but coupled with plenty of out-of-tree copying that is a pain. I've yet to try git-lfs. It doesn't feel like it fits into my uses that naturally are multi-upstream.

LLVM may eventually make this moot until the next great back-end. Not because of licensing but rather timing. Stupid patent issues, being honest, and horrible things like those.

[BTW, is that coffee shop in Bristol still around? Haven't been "downtown" since I moved. At that point in our trip we don't want to stop.]

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 0:25 UTC (Wed) by kenshoen (guest, #121595) [Link]

It's a shame that jc/split-blob didn't take off...

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 3:13 UTC (Wed) by unixbhaskar (guest, #44758) [Link] (1 responses)

Well, my feelings are in line with this statement: "... feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others."

In spite of using and knowing it over the years, I still fumble and it still intimidates me (lack of the right bent of mind)... but it is wonderful software that makes life much easier.

Large files with Git: LFS and git-annex

Posted Dec 13, 2018 11:45 UTC (Thu) by Lennie (subscriber, #49641) [Link]

I also noticed new users who are used to CVS/SVN, etc. need to first unlearn some stuff before 'getting git'.

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 5:08 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> However it is possible to reduce disk space usage by using "thin mode" which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.

Perhaps this would be a good application for reflinks? Given a suitable filesystem, of course. All the space-saving of hard links (until you start making changes) without the downside of corrupting the original file.

Append-only large files

Posted Dec 12, 2018 7:45 UTC (Wed) by epa (subscriber, #39769) [Link] (4 responses)

I was surprised to hear how much git struggles with Debian’s security issues file. It takes forever to resolve deltas. But this file must surely be append-only for most changes. A naive version control system whose only kind of delta was ‘append these bytes’ (storing a whole new copy of the file otherwise) would handle it without problems, though not packed quite as tightly.

So maybe git needs a hint that a particular file should be treated as append-only, where it takes a simpler approach to computing deltas to save time, at the expense of some disk space.

Append-only large files

Posted Dec 12, 2018 8:18 UTC (Wed) by pabs (subscriber, #43278) [Link]

The Debian CVE list mostly grows from the top as that is where newer issues are placed, although sometimes older issues get updated too.

Append-only large files

Posted Dec 12, 2018 13:32 UTC (Wed) by anarcat (subscriber, #66354) [Link] (2 responses)

the other problem is that the delta algorithm in git works very badly for growing files, because it deduplicates within a certain "window" of "N" blobs (default 10), *sorted by size*. The degenerate case of this is *multiple* growing files of similar size which get grouped together and are absolutely unrelated. alternatively, you might be lucky and have your growing file aligned correctly, but then only some of the recent entries will get sorted together, and earlier entries will get lost in the mists of time.

of course, widening that window would help the security tracker, but it would require a costly repack, and new clones everywhere... and considering how long that tail of commits is, it would probably imply other performance costs...
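
for the record, such a repack would be something along these lines (window and depth values picked arbitrarily):

    $ git repack -a -d -f --window=250 --depth=50
    $ git count-objects -vH              # compare pack sizes before and after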

Append-only large files

Posted Dec 13, 2018 16:41 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Huh, so the delta is entirely blind to whatever filename the content was added under? That's a clean design, but it seems like adding some amount of hinting (so that similar filenames are grouped together for finding deltas) would greatly improve performance, and not just in this case.

Append-only large files

Posted Dec 13, 2018 16:51 UTC (Thu) by anarcat (subscriber, #66354) [Link]

I'm not exactly sure as I haven't reviewed the source code behind git-pack-objects, only the manual page, which says:
    In a packed archive, an object is either stored as a compressed whole or as a difference from some other object. The latter is often called a delta. [...]

    --window=<n>, --depth=<n>
        These two options affect how the objects contained in the pack are stored using delta compression. The objects are first internally sorted by type, size and optionally names and compared against the other objects within --window to see if using delta compression saves space. --depth limits the maximum delta depth; making it too deep affects the performance on the unpacker side, because delta data needs to be applied that many times to get to the necessary object. The default value for --window is 10 and --depth is 50. The maximum depth is 4095.
So yes, it can also "optionally" "sort by name", but it's unclear to me how that works or how effective that is. Besides, the window size is quite small as well, although it can be bumped up to make pack take all available memory with that parameter. :)

git-annex special remote to store into another git repository possible ?

Posted Dec 12, 2018 8:25 UTC (Wed) by domo (guest, #14031) [Link]

Thanks anarcat for good article (again!) -- I've forgotten git-annex altogether since the early days I looked into it.

Now I have to look again -- I've done three programs to store large files in separate git repositories (the latest just got a working prototype using clean/smudge filters)...

... it's just that it looks like git-annex using the bup special remote would be the solution I've been trying to achieve in my projects... and taking that into use instead of completing my last one would possibly be the most time- and resource-effective alternative!

So, I'll put NIH and the sunk cost fallacy aside and try that next :D

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 8:56 UTC (Wed) by gebi (guest, #59940) [Link] (3 responses)

Last time I tried git-annex with encrypted remote storage, every time I checked for consistency the local git repo grew by 700MB and it took _ages_. It went back down to a usable size after repacking, but it seemed not ideal back then.

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 17:30 UTC (Wed) by derobert (subscriber, #89569) [Link] (2 responses)

That sounds like you were running git-annex repair, which starts by unpacking the repository. But you really only ever run that if there is an error, which should be extremely rare since git is pretty stable now. You want git fsck (to check the git repository) and git-annex fsck (to confirm files match their checksums). Neither should appreciably grow the repository (git-annex fsck may store some metadata about last check time).

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 19:04 UTC (Wed) by gebi (guest, #59940) [Link] (1 responses)

Yes, exactly, but from my reading of the docs it was the only method to check if the replication count of each object was still what was defined, thus it needed to be run regularly without errors (e.g. I wanted to run it once per week, just like zfs scrub).

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 19:11 UTC (Wed) by derobert (subscriber, #89569) [Link]

Pretty sure git-annex fsck does that, at least my runs of it sometimes report a lower than desired number of copies. It also checks the data is correct (matches checksum), detecting any bitrot, though --fast should disable that part.

Note that it only checks one repository (which doesn't have to be the local one, useful especially for special remotes). So you need to have it run for all the repositories you trust to keep copies to detect bitrot, accidental deletion, etc. And it stores the data locally, so you may need git-annex sync to make the results known across the git-annex network.
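
A periodic check along those lines might look something like this (the remote name is a placeholder):

    $ git annex fsck --fast              # verify presence and copy counts, skipping checksums
    $ git annex fsck --from=backup       # check a remote's copies instead of the local ones
    $ git annex sync                     # propagate the updated location information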

Large files with Git: LFS and git-annex

Posted Dec 12, 2018 13:29 UTC (Wed) by pj (subscriber, #4506) [Link]

I wonder if it would be possible to shove large files into a 'remote repository' container and then deal with them kind of as if they're submodules. A unified interface might simplify things.

Also, wrt chunking, there are several other merkle-tree-based projects that might have useful ideas: Perkeep (previously Camlistore) and IPFS among others.

Large files with Git: LFS and git-annex

Posted Dec 13, 2018 18:31 UTC (Thu) by AndreiG (guest, #90359) [Link]

caca labs ?
libcaca ...?
libpipi ...?
wtf did you find these people ?😂
