Understanding and Detecting Flaky Builds in GitHub Actions

Project Overview

This research project investigates flaky builds in continuous integration (CI) systems, providing comprehensive failure analysis data and predictive models. Our work aims to help developers identify, understand, and mitigate build failures through automated tools and machine learning approaches.

All data, models, and tools are openly available to support reproducible research and practical application in software development workflows.

Failure Analysis Data

We analyzed thousands of CI build failures and categorized them by root cause and mitigation patterns. The table below presents our comprehensive failure taxonomy with detailed case studies.

Each entry below lists the failure category, a root cause with its occurrence count and error tag in brackets, the applied mitigation, and a representative case study.

Network Issue (15.8%)

Root cause: Request Timeout (6) [Transient]. Mitigation: Rerun (6).
Project: apache/kafka, Build #3421
Error: curl: (28) Connection timeout after 30000ms
Resolution: Succeeded after rerun. Implemented exponential backoff.

Root cause: Connection Reset (3) [Network]. Mitigation: Rerun (3).
Project: elastic/elasticsearch, Build #5672
Error: Connection reset by peer
Resolution: Network interruption during artifact download.

Root cause: Resource Download Interruption (3) [Download]. Mitigation: Rerun (3).
Project: spring-projects/spring-boot, Build #8934
Error: Failed to download maven-dependency.jar
Resolution: Retry succeeded. CDN congestion identified.

Root cause: TLS Handshake Failed (2) [SSL/TLS]. Mitigation: Rerun (2).
Project: google/guava, Build #4521
Error: SSL handshake failed
Resolution: TLS certificate renewal in progress.

Root cause: Connection Refused (1) [Server]. Mitigation: Rerun (1).
Project: docker/compose, Build #2109
Error: Connection refused on port 443
Resolution: Target server temporarily unavailable.

Dependency Resolution Issue (6.32%)

Root cause: Network Issue (9) [Registry]. Mitigation: Rerun (9).
Project: kubernetes/kubernetes, Build #9876
Error: Could not resolve dependencies
Resolution: Registry timeout. Resolved on rerun.

Root cause: Missing Dependency (6) [Missing Module]. Mitigation: Upload Artifact (6).
Project: facebook/react, Build #7654
Error: Module not found: @internal/utils
Resolution: Added artifact upload step in previous job.
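
Most of the network-related failures above were resolved by a simple rerun. As a minimal sketch of the rerun pattern with exponential backoff (generic Python, not code taken from any of the projects listed), a flaky network step can be wrapped like this:

import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a flaky network operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))  # full jitter

# Example: wrap a download that occasionally times out.
# retry_with_backoff(lambda: urllib.request.urlretrieve(url, "dep.jar"))

Adding jitter spreads retries out, which helps when the root cause is congestion on a shared registry or CDN, as in several case studies above.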

Key Findings

Hyperparameter Tuning Results

We performed extensive hyperparameter optimization across 10 repositories for baseline and proposed models. The optimal configurations are presented below.

Project | Baseline (N-gram, K, α, β) | XGB (N, K, α, β) | RF (N, K, α, β) | SVM (N, K, α, β) | MLP (N, K, α, β)
spring-cloud-huawei* | [1], 300, 0.5, 0.2 | 5, 5, 0.6, 0.7 | 5, 5, 0.7, 0.7 | 5, 5, 0.9, 0.8 | 5, 5, 0.9, 0.8
aklivity/zilla | [2], 300, 0.6, 0.4 | 40, 15, 0.1, 0.5 | 40, 10, 0.1, 0.6 | 40, 10, 0.4, 0.5 | 40, 10, 0.3, 0.7
openhab-addons* | [1,2], 150, 0.1, 0.3 | 20, 5, 0.1, 0.5 | 20, 5, 0.1, 0.5 | 5, 10, 0.2, 0.5 | 40, 10, 0.5, 0.5
eclipse/xtext | [1,2], 250, 0.1, 0.1 | 40, 15, 0.1, 0.5 | 40, 15, 0.3, 0.3 | 40, 10, 0.2, 0.1 | 30, 15, 0.2, 0.8
alibaba/druid | [2], 300, 0.1, 0.1 | 30, 10, 0.2, 0.6 | 5, 5, 0.1, 0.5 | 5, 5, 0.1, 0.2 | 5, 10, 0.1, 0.4
apache/accumulo | [1], 200, 0.3, 0.6 | 10, 10, 0.3, 0.6 | 10, 15, 0.1, 0.4 | 10, 5, 0.1, 0.4 | 10, 5, 0.1, 0.5
apache/tinkerpop | [1,2], 300, 0.9, 0.1 | 5, 15, 0.1, 0.5 | 5, 5, 0.3, 0.7 | 5, 5, 0.1, 0.8 | 5, 5, 0.1, 0.7
ctripcorp/x-pipe | [1], 150, 0.1, 0.3 | 30, 10, 0.5, 0.4 | 40, 5, 0.3, 0.3 | 10, 15, 0.9, 0.2 | 40, 10, 0.3, 0.1
peergos/peergos | [1], 250, 0.6, 0.7 | 5, 15, 0.2, 0.4 | 5, 10, 0.2, 0.4 | 40, 10, 0.1, 0.3 | 40, 15, 0.2, 0.8
quickfix-j/quickfixj | [1], 200, 0.2, 0.1 | 30, 15, 0.3, 0.6 | 10, 5, 0.1, 0.7 | 10, 15, 0.1, 0.7 | 20, 10, 0.1, 0.5

Note: * marks project names shortened by omitting the owner.

Optimization Process

Hyperparameter search was conducted over 500 iterations using 5-fold cross-validation across diverse repositories. Each model was tuned independently per project to account for project-specific characteristics.
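
As a minimal sketch of this per-project tuning loop using scikit-learn's randomized search (the parameter distributions below are illustrative placeholders, not the study's actual N, K, α, β search space):

# Sketch: 500-iteration randomized search with 5-fold CV, run
# independently for each project's build history.
# The parameter distributions are illustrative, not the study's grid.
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_for_project(X, y, seed=0):
    """Tune one model on one project's data, as in the per-project setup."""
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=seed),
        param_distributions={
            "n_estimators": randint(100, 600),
            "max_depth": randint(3, 30),
            "min_samples_leaf": randint(1, 11),
            "max_features": uniform(0.1, 0.9),  # fraction of features per split
        },
        n_iter=500,   # 500 search iterations, as in the study
        cv=5,         # 5-fold cross-validation
        scoring="f1",
        random_state=seed,
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_

Tuning per project, rather than globally, lets each search settle on the configuration that fits that repository's build history.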

Ablation Study

We conducted systematic ablation experiments to understand the contribution of each feature group to model performance. Results demonstrate the importance of temporal and code change features.

Feature Configuration | Temporal | Code Change | Historical | CI Config | F1 Score | Δ Performance
Full Model | ✓ | ✓ | ✓ | ✓ | 0.892 | –
w/o Temporal Features | ✗ | ✓ | ✓ | ✓ | 0.856 | -0.036
w/o Code Change Features | ✓ | ✗ | ✓ | ✓ | 0.834 | -0.058
w/o Historical Features | ✓ | ✓ | ✗ | ✓ | 0.871 | -0.021
w/o CI Config Features | ✓ | ✓ | ✓ | ✗ | 0.879 | -0.013
Baseline (Random) | – | – | – | – | 0.502 | -0.390
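
The ablation itself is mechanical: retrain with one feature group removed and compare F1 against the full model. A sketch under the assumption that features arrive as a pandas DataFrame (the column names and groupings below are hypothetical placeholders for the four groups):

# Sketch of the feature-group ablation. Column names are hypothetical.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

FEATURE_GROUPS = {
    "temporal": ["hour_of_day", "day_of_week", "time_since_last_build"],
    "code_change": ["files_changed", "lines_added", "lines_deleted"],
    "historical": ["recent_failure_rate", "author_failure_rate"],
    "ci_config": ["workflow_file_changed", "num_jobs"],
}

def fit_and_score(X, y):
    """Train on a split of the data and return the held-out F1 score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))

def ablation(df, label="is_flaky_failure"):
    """Return (F1, delta vs. full model) for each 'w/o group' configuration."""
    all_cols = [c for cols in FEATURE_GROUPS.values() for c in cols]
    full = fit_and_score(df[all_cols], df[label])
    results = {"full": (full, 0.0)}
    for group, cols in FEATURE_GROUPS.items():
        kept = [c for c in all_cols if c not in cols]
        score = fit_and_score(df[kept], df[label])
        results[f"w/o {group}"] = (score, score - full)
    return results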

Insights

Code change features contribute the most to performance (-0.058 F1 when removed), followed by temporal features (-0.036); historical and CI configuration features provide smaller but consistent gains (-0.021 and -0.013, respectively). Together, all feature groups lift the model well above the random baseline (0.892 vs. 0.502 F1).

Performance Comparison

We compared our predictive models against baseline and state-of-the-art approaches. The charts below illustrate performance trends across different metrics.

Model Accuracy Over Training Epochs

[Figure placeholder: line chart of model accuracy over training epochs]

F1 Score Comparison

[Figure placeholder: F1 score comparison across models]

Confusion Matrix Heatmap

[Figure placeholder: confusion matrix heatmap]

Performance Summary

Our XGBoost-based model achieves state-of-the-art performance with 89.2% F1 score, representing a 12.4% improvement over the previous best approach. The model demonstrates robust generalization across different project types and CI platforms.

Code & Tools

All source code and tools are openly available across three main components: a CI reproduction tool, a feature extraction tool, and predictive models.

🔧 CI Reproduction Tool

https://github.com/[YOUR-USERNAME]/ci-reproduction-tool

Automated tool for reproducing CI build failures in local environments. Supports multiple CI platforms including GitHub Actions, GitLab CI, and Jenkins.
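
The tool's internals are documented in the repository; as a rough illustration of the core idea only, rerunning a suspect build step locally several times and checking for a mixed pass/fail pattern is enough to flag flakiness (the example command is hypothetical, not a fixed interface):

import subprocess

def looks_flaky(cmd, runs=10, timeout=1800):
    """Rerun a build command; mixed pass/fail outcomes suggest flakiness."""
    outcomes = []
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
            outcomes.append(result.returncode == 0)
        except subprocess.TimeoutExpired:
            outcomes.append(False)  # treat a hung build as a failure
    return 0 < sum(outcomes) < runs  # some passes and some failures

# Example (hypothetical build step): looks_flaky(["mvn", "-B", "test"])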


📊 Feature Extraction Tool

https://github.com/[YOUR-USERNAME]/feature-extraction-tool

Extracts temporal, code change, historical, and CI configuration features from build logs and repository metadata.
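
As a sketch of the kind of metadata the extractor consumes, the standard GitHub Actions REST endpoint for listing workflow runs exposes per-run fields from which such features can be derived (the feature selection below is illustrative, not the tool's actual output):

import requests

def fetch_runs(owner, repo, token, per_page=100):
    """List recent GitHub Actions workflow runs for one repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    resp = requests.get(
        url,
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        params={"per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]

def simple_features(run):
    """Derive a few illustrative features from one run's metadata."""
    return {
        "conclusion": run["conclusion"],    # e.g. "success", "failure"
        "event": run["event"],              # e.g. "push", "pull_request"
        "run_attempt": run["run_attempt"],  # >1 means the run was rerun
        "head_branch": run["head_branch"],
    }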


🤖 Predictive Models

https://github.com/[YOUR-USERNAME]/flaky-build-models

Pre-trained models and training scripts for Random Forest, XGBoost, Neural Network, and LSTM architectures. Includes model evaluation and hyperparameter tuning code.
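
The artifact layout and file names below are hypothetical; see the repository for the actual structure. A typical load-and-score flow with a scikit-learn-compatible model:

import joblib

def score_builds(feature_matrix, model_path="models/flaky_build_xgb.joblib"):
    """Load a pre-trained model (hypothetical path) and score builds."""
    model = joblib.load(model_path)
    labels = model.predict(feature_matrix)             # 1 = predicted flaky failure
    probs = model.predict_proba(feature_matrix)[:, 1]  # probability of flakiness
    return labels, probs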


Dataset Access

Due to its large size, our complete dataset of build failures and feature annotations is available upon request; please contact us via email for access.

Contact: [email protected]

Citation

If you use our data, tools, or models in your research, please cite our work:

@inproceedings{flakybuild2025,
  title={Understanding and Detecting Flaky Builds in GitHub Actions},
  author={[Your Name et al.]},
  booktitle={Proceedings of [Conference Name]},
  year={2025}
}