Edit: it seems like my explanation turned out to be too confusing. In simple terms, my topology would look something like this:

I would have a reverse proxy in front of multiple Git server instances (let’s say 5 for now). When a client performs an action, like pulling from or pushing to a repo, it would go through the reverse proxy to one of the 5 instances. The changes would then be synced from that instance to the rest, achieving a highly available architecture.

Basically, I want a highly available git server. Is this possible?


I have been reading GitHub’s blog on Spokes, their distributed system for Git. It’s a great idea, except I can’t find anywhere to pull it from and self-host it.

Any ideas on how I can run a distributed cluster of Git servers? I’d like to run it in 3+ VMs + a VPS in the cloud so if something dies I still have a git server running somewhere to pull from.

Thanks

  • notabot@lemm.ee · 8 points · 2 days ago

    Before you can decide on how to do this, you’re going to have to make a few choices:

    Authentication and Access

    There are two main ways to expose a git repo, HTTPS or SSH, and they both have pros and cons here:

    • HTTPS A standard sort of protocol to proxy, but you’ll need to make sure you set up authentication on the proxy properly so that only those who should have access can get it. The git client will need to store a username and password to talk to the server, or you’ll have to enter them on every request. gitweb is a CGI that provides a basic, but useful, web interface.

    • SSH Simpler to set up, and authentication is a solved problem. Proxying it isn’t hard: just forward the port to any of the backend servers, which avoids decrypting on the proxy (a rough sketch of such a forward is just below). You will want to use the same host key on all the servers though, or SSH will refuse to connect. Doesn’t require any special setup.
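
    As a rough sketch of the SSH forwarding (purely illustrative, the names, addresses and port are placeholders rather than anything from this thread), a plain TCP passthrough in HAProxy could look something like this, so the proxy never terminates SSH itself:

        # /etc/haproxy/haproxy.cfg fragment (sketch)
        frontend git_ssh
            bind *:2222              # or :22 if the proxy's own sshd listens elsewhere
            mode tcp
            default_backend git_servers

        backend git_servers
            mode tcp
            option tcp-check
            server git1 10.0.0.11:22 check
            server git2 10.0.0.12:22 check backup
            server git3 10.0.0.13:22 check backup

    Marking the other servers as backup keeps all traffic on git1 until it actually goes down, which also fits the replication approach below.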

    Replication

    Git is a distributed version control system, so you could replicate it at that level; alternatively you could use a replicated filesystem, or simple file-based replication. Each has its own trade-offs.

    • Git replication Using git pull to replicate between repositories is probably going to be your most reliable option, as it’s the job git was built for, and it doesn’t rely on messing with git’s underlying files directly. The one caveat is that, if you push to different servers in quick succession, you may cause a merge conflict, which would break your replication. The cleanest way to deal with that is to have the load balancer send all requests to server1 if it’s up, and only switch to the next server if all the prior ones are down. That way writes will all be going to the same place. Then set up replication in a loop, with server2 pulling from server1, server3 pulling from server2, and so on up to server1 pulling from server5 (see the sketch after this list). With frequent pulls, changes that are committed to server1 will quickly replicate to all the other servers. This would effectively be a shared-nothing solution, as none of the servers share resources, which would make it easier to geographically separate them. The load balancer could be replaced by a CNAME record in DNS, with a daemon that updates it to point to the correct server.

    • Replicated filesystem Git stores its data in a fairly simple file structure, so placing that on a replicated filesystem such as GlusterFS or Ceph would mean multiple servers could use the same data. From experience, this sort of thing is great when it’s working, but can be fragile and break in unexpected ways. You don’t want to be up at 2am trying to fix a file replication issue if you can avoid it.

    • File replication. This is similar to the git replication option, in that you have to be very aware of the risk of conflicts. A similar strategy would probably work, but I’m not sure it brings you any advantages.
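
    To make the pull loop concrete, each server would hold a mirror clone of every repo from the previous server in the ring and refresh it on a schedule. A minimal sketch (hostnames and paths are made up):

        # One-off setup on server2, mirroring from server1:
        git clone --mirror ssh://git@server1/srv/git/project.git /srv/git/project.git

        # Cron entry on server2 to refresh every mirror each minute:
        * * * * *  for r in /srv/git/*.git; do git -C "$r" remote update --prune; done

    server3 would do the same against server2, and so on around the ring.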

    I think my preferred solution would be to have SSH access to the git servers and to set up pull-based replication on a fairly fast schedule (where fast is relative to how frequently you push changes). You mention having a VPS as one of the servers, so you might want to push changes to that rather than have it be able to connect to your internal network.
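
    For the VPS, pushing from inside could be as simple as a post-receive hook on the main server, so the VPS never needs a way into your network. Another rough sketch (the remote name and URL are placeholders):

        # One-off on server1: add the VPS as a remote
        git -C /srv/git/project.git remote add vps ssh://git@vps.example.com/srv/git/project.git

        # hooks/post-receive in that repo:
        #!/bin/sh
        exec git push --mirror vps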

    A useful property of git is that, if the server is missing changesets you can just push them again. So if a server goes down before your last push gets replicated, you can just push again once the system has switched to the new server. Once the first server comes back online it’ll naturally get any changesets it’s missing and effectively ‘heal’.

    • marauding_gibberish142@lemmy.dbzer0.com (OP) · 3 points · 2 days ago

      This is a fantastic comment. Thank you so much for taking the time.

      I wasn’t planning to run a GUI for my git servers unless really required, so I’ll probably use SSH. Thanks, yes, that makes the reverse proxy part a lot easier.

      I think having a designated “master” (server 1) with rolling updates to the rest of the servers is a brilliant idea. The replication procedure becomes a lot easier this way, and it removes the need for the reverse proxy too: I can just use Keepalived and set up weights to make one of them the master and the rest backups for failover. It also won’t do round-robin, so no special handling for sticky sessions! This is great news from the networking perspective of this project.
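
      Something like this keepalived.conf is what I have in mind (the interface, password and virtual IP below are just placeholder values), with a lower priority on each of the other servers so they only take over on failure:

          vrrp_instance GIT_VIP {
              state MASTER
              interface eth0
              virtual_router_id 51
              priority 150            # lower this on each backup server
              advert_int 1
              authentication {
                  auth_type PASS
                  auth_pass changeme
              }
              virtual_ipaddress {
                  192.168.1.50/24
              }
          }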

      Hmm, you said to push repos to the remote VPS instead of having it pull? I was going to create a WireGuard tunnel and have it accessible from my network for some stuff, but I guess pushing makes sense.

      Thanks again for the wonderful comment.