Revamping Search for Uninterrupted Availability: GitHub Enterprise Server's Journey

Search is at the heart of GitHub Enterprise Server—powering everything from code search to issue tracking, release pages, and pull request counts. But for years, administrators faced a delicate dance with search indexes, especially in high-availability (HA) setups. The old Elasticsearch integration created brittle clustering across primary and replica nodes, leading to lockups during maintenance or upgrades. After extensive effort, GitHub rebuilt the search architecture to eliminate these pain points. Below, we explore the challenges, the attempts to fix them, and the solution that now keeps your searches running smoothly.

1. How does search power the GitHub Enterprise Server experience?

Search is woven into nearly every interaction on GitHub Enterprise Server. Obvious places include the search bars and filtering on the Issues page, but its role goes far deeper. The releases page, projects page, and even counts for issues and pull requests all depend on search functionality. Without a robust search index, these features would slow to a crawl or break entirely. Given this central position, ensuring search remains available and durable is critical. Any downtime or degradation directly impacts developer productivity, which is why GitHub invested heavily in making the search infrastructure more resilient.

Source: github.blog

2. What maintenance challenges did administrators face with search indexes?

Administrators had to be extremely careful with search indexes—specialized data structures optimized for fast retrieval. A single misstep in maintenance or upgrade sequencing could damage an index, requiring time-consuming repairs. In high-availability (HA) setups, the stakes were even higher. If upgrade steps weren't followed in exactly the right order, indexes could become locked, halting the entire upgrade process. This fragility meant that even routine operations required meticulous planning, often increasing the time spent on server management rather than on improving the user experience.

3. How did GitHub's high availability setup work with Elasticsearch?

GitHub Enterprise Server's HA architecture uses a leader/follower pattern. The primary node handles all writes, updates, and traffic, while replica nodes remain read-only and stay synchronized. Elasticsearch, the search database, was integrated into this pattern. However, Elasticsearch couldn't natively support a primary/replica model across separate servers. To work around this, GitHub created an Elasticsearch cluster that spanned both primary and replica nodes. This made data replication straightforward and allowed each node to handle search requests locally, offering performance benefits. But it also introduced hidden complexities.

4. Why did Elasticsearch clustering cause lockups during maintenance?

The cross-server Elasticsearch cluster had a critical flaw: at any moment, the Elasticsearch software could move a primary shard—responsible for receiving and validating writes—from the primary node to a replica. If that replica was then taken down for maintenance, a deadly embrace occurred. The replica would wait for Elasticsearch to become healthy before starting up, but Elasticsearch couldn't become healthy until the replica rejoined. This deadlock left the entire search system unusable, forcing administrators to intervene manually. Such scenarios made maintenance windows unpredictable and risky.
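The circular wait described above can be sketched as a toy model. This is purely illustrative—the function names and the set-based health check are hypothetical stand-ins, not GHES or Elasticsearch internals:

```python
# Toy model of the startup deadlock: the cluster can't be healthy until the
# replica is up, and the replica won't start until the cluster is healthy.
# All names here are hypothetical, for illustration only.

def cluster_health(nodes_up: set, nodes_required: set) -> str:
    """Cluster reports 'green' only when every node holding a shard is up."""
    return "green" if nodes_required <= nodes_up else "red"

def replica_can_start(health: str) -> bool:
    """The replica's boot sequence blocks until Elasticsearch is healthy."""
    return health == "green"

# A primary shard has migrated onto the replica; the replica then goes down
# for maintenance:
nodes_up = {"primary"}
nodes_required = {"primary", "replica"}  # the shard now lives on the replica

health = cluster_health(nodes_up, nodes_required)
print(health)                      # "red": the cluster still needs the replica
print(replica_can_start(health))   # False: the replica waits for green
# Neither condition can ever be satisfied first—a classic circular wait.
```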


5. What attempts did GitHub make to stabilize the Elasticsearch integration?

Over several releases, GitHub engineers implemented numerous safeguards. They added health checks to ensure Elasticsearch was in a valid state before starting other services, and built processes to correct drifting states. The team even attempted to build a "search mirroring" system that would decouple the primary and replica nodes, moving away from the clustered mode. However, database replication is inherently complex, and these efforts faced consistency challenges. Despite incremental improvements, the underlying risk of shard movement remained, prompting a more fundamental redesign.
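A health check of the kind described above typically polls Elasticsearch's real `GET /_cluster/health` endpoint, which returns a JSON body with a `status` field of `green`, `yellow`, or `red`. Here is a minimal sketch of the decision logic only; the polling loop and the exact policy (whether `yellow` is acceptable) are assumptions, not GHES's actual implementation:

```python
import json

def ready_to_start(health_body: str) -> bool:
    """Gate dependent services on Elasticsearch cluster health.

    Treats 'green' and 'yellow' as safe to proceed; 'red' blocks startup.
    The body is the JSON returned by GET /_cluster/health.
    """
    status = json.loads(health_body).get("status")
    return status in ("green", "yellow")

# In production the body would come from an HTTP request to the local
# Elasticsearch instance; inline strings stand in for responses here.
print(ready_to_start('{"status": "green"}'))   # True
print(ready_to_start('{"status": "red"}'))     # False
```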

6. What was the solution for rebuilding the search architecture for high availability?

The solution involved rethinking how Elasticsearch is deployed within the HA configuration. Instead of a single cluster spanning primary and replica nodes, GitHub redesigned the search layer to avoid cross-server shard movement entirely. Each node now runs its own independent Elasticsearch instance, with data synchronization handled at the application layer rather than by Elasticsearch clustering. This eliminates the possibility of a shard being moved to a node that is about to go offline. The new architecture also includes robust failover mechanisms and seamless index replication. As a result, maintenance lockups are a thing of the past, and administrators can perform upgrades with confidence, knowing search will remain available.
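The key idea—replication handled by the application rather than by Elasticsearch clustering—can be sketched as a dual-write. The `SearchIndex` class below is a hypothetical stand-in for one node's local Elasticsearch instance, not GHES's actual code:

```python
# Minimal sketch of application-layer replication, assuming one independent
# Elasticsearch instance per node. Class and function names are hypothetical.

class SearchIndex:
    """Stand-in for a node's local, standalone Elasticsearch instance."""
    def __init__(self):
        self.docs = {}

    def index(self, doc_id: str, body: dict) -> None:
        self.docs[doc_id] = body

def replicate_write(doc_id: str, body: dict, indexes: list) -> None:
    """The application writes to every node's index itself, so no
    cross-server Elasticsearch cluster (and no shard movement) exists."""
    for idx in indexes:
        idx.index(doc_id, body)

primary, replica = SearchIndex(), SearchIndex()
replicate_write("issue-42", {"title": "fix search"}, [primary, replica])
# Each node holds its own complete copy; taking the replica offline cannot
# strand a primary shard, because there are no shared shards to move.
print(primary.docs == replica.docs)  # True
```

Because each instance is self-contained, a node going offline affects only its own copy of the index—exactly the property the old cross-server cluster lacked.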
