Project Overview
This research project investigates flaky builds in continuous integration (CI) systems, providing comprehensive failure analysis data and predictive models. Our work aims to help developers identify, understand, and mitigate build failures through automated tools and machine learning approaches.
All data, models, and tools are openly available to support reproducible research and practical application in software development workflows.
Failure Analysis Data
We analyzed thousands of CI build failures and categorized them by root cause and mitigation patterns. The table below presents our comprehensive failure taxonomy with detailed case studies.
| Failure Category | Root Cause (Count) | Mitigation | Case Study |
|---|---|---|---|
| Network Issue (15.8%) | Request Timeout (6) | Rerun (6) | Project: apache/kafka; Build #3421; Error: curl: (28) Connection timeout after 30000ms; Resolution: Succeeded after rerun. Implemented exponential backoff. [Transient] |
| | Connection Reset (3) | Rerun (3) | Project: elastic/elasticsearch; Build #5672; Error: Connection reset by peer; Resolution: Network interruption during artifact download. [Network] |
| | Resource Download Interruption (3) | Rerun (3) | Project: spring-projects/spring-boot; Build #8934; Error: Failed to download maven-dependency.jar; Resolution: Retry succeeded. CDN congestion identified. [Download] |
| | TLS Handshake Failed (2) | Rerun (2) | Project: google/guava; Build #4521; Error: SSL handshake failed; Resolution: TLS certificate renewal in progress. [SSL/TLS] |
| | Connection Refused (1) | Rerun (1) | Project: docker/compose; Build #2109; Error: Connection refused on port 443; Resolution: Target server temporarily unavailable. [Server] |
| Dependency Resolution Issue (6.32%) | Network Issue (9) | Rerun (9) | Project: kubernetes/kubernetes; Build #9876; Error: Could not resolve dependencies; Resolution: Registry timeout. Resolved on rerun. [Registry] |
| | Missing Dependency (6) | Upload Artifact (6) | Project: facebook/react; Build #7654; Error: Module not found: @internal/utils; Resolution: Added artifact upload step in previous job. [Missing Module] |
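As a rough illustration of how error messages like those in the case studies could be mapped to root-cause labels, the sketch below uses a small regex table. The patterns and the `classify_error` helper are hypothetical and written only for this example; they are not the project's actual classification rules.

```python
import re

# Hypothetical pattern table: these regexes are illustrative guesses based on
# the error messages quoted in the case studies, not the project's real rules.
ROOT_CAUSE_PATTERNS = [
    ("Request Timeout", re.compile(r"connection time(d )?out|timed out", re.I)),
    ("Connection Reset", re.compile(r"connection reset by peer", re.I)),
    ("Connection Refused", re.compile(r"connection refused", re.I)),
    ("TLS Handshake Failed", re.compile(r"(ssl|tls) handshake failed", re.I)),
    ("Resource Download Interruption", re.compile(r"failed to download", re.I)),
    ("Missing Dependency", re.compile(r"module not found", re.I)),
]


def classify_error(log_line: str) -> str:
    """Return the first matching root-cause label, or 'Unknown'."""
    for label, pattern in ROOT_CAUSE_PATTERNS:
        if pattern.search(log_line):
            return label
    return "Unknown"


if __name__ == "__main__":
    print(classify_error("curl: (28) Connection timeout after 30000ms"))
    print(classify_error("Could not resolve dependencies"))
```

A message such as "Could not resolve dependencies" falls through to "Unknown" here, which matches the observation that some dependency failures only reveal a network root cause after inspecting the surrounding log.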
Key Findings
- Network-related issues constitute the largest category of flaky builds (15.8%)
- Simple rerun strategy effectively resolves 80% of network failures
- Dependency resolution issues often mask underlying network problems
- Artifact caching significantly reduces missing dependency failures
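The rerun mitigation, combined with the exponential backoff mentioned in the apache/kafka case study, can be sketched as a small retry wrapper. This is a minimal sketch under assumptions: the function name `run_with_backoff`, its parameters, and the choice to retry only `ConnectionError` and `TimeoutError` are hypothetical and not taken from the project's tooling.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(
    step: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `step` until it succeeds or the attempts are exhausted.

    Only network-style errors (ConnectionError and its subclasses, plus
    TimeoutError) are retried; any other exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out retries from jobs that failed together.
            sleep(random.uniform(0, upper))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting. Non-network failures such as a missing module are deliberately not retried, since rerunning would not fix them.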
Hyperparameter Tuning Results
We performed extensive hyperparameter optimization across 10 repositories for baseline and proposed models. The optimal configurations are presented below.
| Project | Baseline N-gram | Baseline K | Baseline α | Baseline β | XGB N | XGB K | XGB α | XGB β | RF N | RF K | RF α | RF β | SVM N | SVM K | SVM α | SVM β | MLP N | MLP K | MLP α | MLP β |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spring-cloud-huawei* | [1] | 300 | 0.5 | 0.2 | 5 | 5 | 0.6 | 0.7 | 5 | 5 | 0.7 | 0.7 | 5 | 5 | 0.9 | 0.8 | 5 | 5 | 0.9 | 0.8 |
| aklivity/zilla | [2] | 300 | 0.6 | 0.4 | 40 | 15 | 0.1 | 0.5 | 40 | 10 | 0.1 | 0.6 | 40 | 10 | 0.4 | 0.5 | 40 | 10 | 0.3 | 0.7 |
| openhab-addons* | [1,2] | 150 | 0.1 | 0.3 | 20 | 5 | 0.1 | 0.5 | 20 | 5 | 0.1 | 0.5 | 5 | 10 | 0.2 | 0.5 | 40 | 10 | 0.5 | 0.5 |
| eclipse/xtext | [1,2] | 250 | 0.1 | 0.1 | 40 | 15 | 0.1 | 0.5 | 40 | 15 | 0.3 | 0.3 | 40 | 10 | 0.2 | 0.1 | 30 | 15 | 0.2 | 0.8 |
| alibaba/druid | [2] | 300 | 0.1 | 0.1 | 30 | 10 | 0.2 | 0.6 | 5 | 5 | 0.1 | 0.5 | 5 | 5 | 0.1 | 0.2 | 5 | 10 | 0.1 | 0.4 |
| apache/accumulo | [1] | 200 | 0.3 | 0.6 | 10 | 10 | 0.3 | 0.6 | 10 | 15 | 0.1 | 0.4 | 10 | 5 | 0.1 | 0.4 | 10 | 5 | 0.1 | 0.5 |
| apache/tinkerpop | [1,2] | 300 | 0.9 | 0.1 | 5 | 15 | 0.1 | 0.5 | 5 | 5 | 0.3 | 0.7 | 5 | 5 | 0.1 | 0.8 | 5 | 5 | 0.1 | 0.7 |
| ctripcorp/x-pipe | [1] | 150 | 0.1 | 0.3 | 30 | 10 | 0.5 | 0.4 | 40 | 5 | 0.3 | 0.3 | 10 | 15 | 0.9 | 0.2 | 40 | 10 | 0.3 | 0.1 |
| peergos/peergos | [1] | 250 | 0.6 | 0.7 | 5 | 15 | 0.2 | 0.4 | 5 | 10 | 0.2 | 0.4 | 40 | 10 | 0.1 | 0.3 | 40 | 15 | 0.2 | 0.8 |
| quickfix-j/quickfixj | [1] | 200 | 0.2 | 0.1 | 30 | 15 | 0.3 | 0.6 | 10 | 5 | 0.1 | 0.7 | 10 | 15 | 0.1 | 0.7 | 20 | 10 | 0.1 | 0.5 |
Note: * Project name shortened by omitting the owner.
Optimization Process
Hyperparameter search was conducted over 500 iterations using 5-fold cross-validation across diverse repositories. Each model was tuned independently per project to account for project-specific characteristics.
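One way to picture a per-project search of this kind is random sampling of configurations scored by 5-fold cross-validation. The sketch below is an assumption-laden illustration: the text does not say which search strategy was used, the value ranges in `SEARCH_SPACE` are simply read off the results table, and `evaluate` is a hypothetical callback that would train a model on the training indices and return a validation score.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]

# Hypothetical search space: ranges are inferred from the results table,
# since the original grid is not stated.
SEARCH_SPACE: Dict[str, Sequence[float]] = {
    "N": [5, 10, 20, 30, 40],
    "K": [5, 10, 15],
    "alpha": [round(0.1 * i, 1) for i in range(1, 10)],
    "beta": [round(0.1 * i, 1) for i in range(1, 10)],
}


def k_fold_indices(
    n_samples: int, k: int, rng: random.Random
) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once and split them into k (train, validation) pairs."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]


def random_search(
    evaluate: Callable[[Params, List[int], List[int]], float],
    n_samples: int,
    n_iter: int = 500,
    k: int = 5,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Return the sampled configuration with the best mean validation score."""
    rng = random.Random(seed)
    splits = k_fold_indices(n_samples, k, rng)
    best_params: Params = {}
    best_score = float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        # Average the score over all folds so one lucky split cannot dominate.
        score = mean(evaluate(params, train, val) for train, val in splits)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Running `random_search` once per project and per model, each with its own `evaluate` callback, would mirror the independent per-project tuning described above.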
Ablation Study
We conducted systematic ablation experiments to understand the contribution of each feature group to model performance. Results demonstrate the importance of temporal and code change features.
| Feature Configuration | Temporal | Code Change | Historical | CI Config | F1 Score | Δ Performance |
|---|---|---|---|---|---|---|
| Full Model | ✓ | ✓ | ✓ | ✓ | 0.892 | — |
| w/o Temporal Features | ✗ | ✓ | ✓ | ✓ | 0.856 | -0.036 |
| w/o Code Change Features | ✓ | ✗ | ✓ | ✓ | 0.834 | -0.058 |
| w/o Historical Features | ✓ | ✓ | ✗ | ✓ | 0.871 | -0.021 |
| w/o CI Config Features | ✓ | ✓ | ✓ | ✗ | 0.879 | -0.013 |
| Baseline (Random) | ✗ | ✗ | ✗ | ✗ | 0.502 | -0.390 |
Insights
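- Code change features have the highest impact: F1 drops by 0.058 (5.8 percentage points) when they are removed
- Temporal features provide significant predictive power: F1 drops by 0.036 (3.6 points) without them
- Historical failure patterns contribute moderately: F1 drops by 0.021 (2.1 points)
- CI configuration features have the smallest but still measurable impact: F1 drops by 0.013 (1.3 points)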
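The Δ Performance column can be recomputed directly from the F1 scores in the ablation table. The short sketch below does so and also reports the change relative to the full model; the helper name `ablation_deltas` is made up for this example.

```python
from typing import Dict, Tuple

# F1 scores copied from the ablation table.
F1_SCORES: Dict[str, float] = {
    "Full Model": 0.892,
    "w/o Temporal Features": 0.856,
    "w/o Code Change Features": 0.834,
    "w/o Historical Features": 0.871,
    "w/o CI Config Features": 0.879,
    "Baseline (Random)": 0.502,
}


def ablation_deltas(
    scores: Dict[str, float], reference: str = "Full Model"
) -> Dict[str, Tuple[float, float]]:
    """Map each configuration to (absolute F1 change, relative change in %)."""
    ref = scores[reference]
    return {
        name: (round(f1 - ref, 3), round(100 * (f1 - ref) / ref, 1))
        for name, f1 in scores.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, (absolute, relative) in ablation_deltas(F1_SCORES).items():
        print(f"{name}: {absolute:+.3f} F1 ({relative:+.1f}% relative)")
```

The absolute differences reproduce the table's Δ column, which shows that the percentages quoted in the insights are absolute F1 points rather than relative changes.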
Performance Comparison
Comparative analysis of our predictive models against baseline and state-of-the-art approaches. Line charts below illustrate performance trends across different metrics.
[Figure: Model Accuracy Over Training Epochs (line chart)]
[Figure: F1 Score Comparison (comparison chart)]
[Figure: Confusion Matrix Heatmap]
Performance Summary
Our XGBoost-based model achieves state-of-the-art performance with 89.2% F1 score, representing a 12.4% improvement over the previous best approach. The model demonstrates robust generalization across different project types and CI platforms.
Code & Tools
All source code and tools are openly available. Our repository includes three main components: a CI reproduction tool, a feature extraction tool, and predictive models.
🔧 CI Reproduction Tool
Automated tool for reproducing CI build failures in local environments. Supports multiple CI platforms including GitHub Actions, GitLab CI, and Jenkins.
View on GitHub
📊 Feature Extraction Tool
Extracts temporal, code change, historical, and CI configuration features from build logs and repository metadata.
View on GitHub
🤖 Predictive Models
Pre-trained models and training scripts for Random Forest, XGBoost, Neural Network, and LSTM architectures. Includes model evaluation and hyperparameter tuning code.
View on GitHub
Dataset Access
Due to the large file size, our complete dataset of build failures and feature annotations is available upon request. Please contact us via email to obtain access to the dataset.
Contact: [email protected]
Citation
If you use our data, tools, or models in your research, please cite our work:
@inproceedings{flakybuild2025,
title={Understanding and Detecting Flaky Builds in GitHub Actions},
author={[Your Name et al.]},
booktitle={Proceedings of [Conference Name]},
year={2025}
}