My archival script has now built an archive of over 1,000,000 Git commits across 154 repositories.
@mckinley@mckinley.cc If you’re curious, here are the top 5 domains.
106 github.com
15 codeberg.org
7 gitlab.com
7 git.codemadness.org
4 bitreich.org
@mckinley@twtxt.net Where does git.mills.io sit in your ranking? 🤣
@prologic@twtxt.net All the way at the bottom. It’s tied for 6th place with 1 repository archived.
@mckinley@twtxt.net Haha oh well at least it’s not last 🤣
@mckinley@twtxt.net What all makes the list? I have been archiving repos that matter to me too of late, though it’s a smaller list.
@ocdtrekkie@twtxt.net I track a lot of repositories with a risk of becoming unavailable for whatever reason. The script tracks how many times in a row Git fails to fetch updates, so I can tell when a remote dies.
However, since it’s so easy to add new ones, it’s mostly repositories which aren’t likely to disappear but carry a lot of value. For example, 143 MiB on my hard drive for the complete history of FFmpeg is a no-brainer for me.
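The failure-tracking idea could be sketched roughly like this. This is a minimal illustration, not the actual script; it assumes bare mirror clones named `*.git` under one archive directory, and the `fail_count` file name and the threshold of 5 are made up:

```shell
# Sketch of consecutive-failure tracking (assumed layout: bare mirror
# clones named *.git under one archive directory; "fail_count" is a
# hypothetical bookkeeping file, not from the real script).
fetch_all() {
    archive=$1
    for repo in "$archive"/*.git; do
        [ -d "$repo" ] || continue
        count_file="$repo/fail_count"
        if git -C "$repo" fetch --quiet 2>/dev/null; then
            echo 0 > "$count_file"            # success resets the counter
        else
            n=$(($(cat "$count_file" 2>/dev/null || echo 0) + 1))
            echo "$n" > "$count_file"
            if [ "$n" -ge 5 ]; then           # threshold is arbitrary
                echo "remote may be dead: $repo" >&2
            fi
        fi
    done
}
```

Run daily from cron, a few consecutive failures on the same repository is a decent signal that the remote is gone rather than just flaky.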
@mckinley@twtxt.net I grab pretty much all unmaintained Sandstorm app repos, in case they disappear, and then anything interesting related to copyrighted games. Like if you saw the Portal64 thing recently… really interesting but begs for a DMCA, so I took a copy.
@ocdtrekkie@twtxt.net A lot of my repositories are on the list specifically to guard against BS takedown requests like when youtube-dl was DMCA’d. I started the project when I discovered Wikiless was taken down, so I have just about all of the popular self-hosted frontends as well.
Portal64 looks interesting; I hadn't heard of it. I might need to get an N64 emulator going.
@mckinley@twtxt.net Do you publicly publish and make them available online somewhere or just privately? 🤔
@prologic@twtxt.net No, it’s just private for now. I’ll share individual repositories when they get nuked, of course. I’m open to the idea of making them publicly available, though.
I wonder if I could push to a Git remote with my current setup. That would be the simplest way to do public distribution and remote backups.
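Pushing from an archive copy should work: a mirror push sends every ref as-is, which covers both public distribution and remote backup. A sketch, with a made-up helper name:

```shell
# Hypothetical helper: make a remote an exact mirror of a local archive
# copy. --mirror pushes everything under refs/ (branches, tags, etc.)
# and deletes remote refs that no longer exist locally.
mirror_push() {
    repo=$1        # local bare/mirror repository
    remote_url=$2  # e.g. an empty hosted repository to publish to
    git -C "$repo" push --quiet --mirror "$remote_url"
}
```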
Also, Portal 64 kept freezing on me so I played F-Zero X instead.
@mckinley@twtxt.net I’m thinking of a way in which you, I, and anyone else can participate in a “distributed” network of mirrors of these repos (you know, because Git is quite good at this 😅)
@prologic@twtxt.net Git itself is a distributed network of mirrors. It’s impossible to truly kill a Git repository as long as someone still has a clone of it on their computer.
However, simple clones are inefficient on disk space, and a simple `git fetch` will happily obliterate the local history if the remote says so.
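That force-update behavior comes from the leading `+` in a mirror clone's default fetch refspec (`+refs/*:refs/*`); without the `+`, `git fetch` rejects non-fast-forward updates instead of applying them. A rough sketch of hardening a mirror this way (illustrative only, not anyone's actual setup):

```shell
# Replace a mirror clone's forcing refspec (+refs/*:refs/*) with
# non-forcing ones, so a force-pushed history rewrite upstream is
# rejected by fetch instead of silently replacing the archive's refs.
harden_mirror() {
    repo=$1
    git -C "$repo" config remote.origin.fetch 'refs/heads/*:refs/heads/*'
    git -C "$repo" config --add remote.origin.fetch 'refs/tags/*:refs/tags/*'
}
```

And never passing `--prune` means branches deleted on the remote stay archived locally.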
My goals are as follows.
- Create high quality archives of a large number of repositories and keep them up to date.
- Make them resilient against attacks from the inside, including (but not limited to) force-pushing an empty history and maliciously deleting branches on the remote.
- Minimize storage and bandwidth usage, including (but not limited to) running `git gc --aggressive` when cloning and not fetching unnecessary commits, e.g. Dependabot and pull requests.
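The storage goal could look something like this in practice, assuming GitHub-style `refs/pull/*` review refs are what you want to skip (the `add_repo` name is hypothetical):

```shell
# Hypothetical add_repo helper: archive only real branches and tags,
# skipping review refs like GitHub's refs/pull/*, then pack tightly.
add_repo() {
    url=$1; dest=$2
    git init -q --bare "$dest"
    git -C "$dest" config remote.origin.url "$url"
    # Explicit, non-forcing refspecs: branches and tags only.
    git -C "$dest" config remote.origin.fetch 'refs/heads/*:refs/heads/*'
    git -C "$dest" config --add remote.origin.fetch 'refs/tags/*:refs/tags/*'
    git -C "$dest" fetch -q origin
    git -C "$dest" gc --aggressive --quiet
}
```

Skipping `refs/pull/*` matters more than it sounds: on busy GitHub projects the pull-request refs can dwarf the actual branch history.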
@mckinley@twtxt.net So how do I get involved and help you keep copies? 🤔
@prologic@twtxt.net I appreciate it, but there’s really nothing to “get involved” with at the moment. It’s just a shell script on my laptop that I run every day and a ~5 GiB directory on my SSD. It isn’t a big deal, I just talk about it because I think it’s interesting and I’m having fun tinkering with it.
Eventually, I’ll make the script public so anyone can easily maintain archives. There’s still a lot I want to do before that, though.