Conference paper (in proceedings)
PoolinGH : fast, efficient, and robust GitHub repository mining
-
André, Maxime
Namur Digital Institute, Belgium
-
Raglianti, Marco (USI)
Istituto del software (SI), Facoltà di scienze informatiche, Università della Svizzera italiana, Switzerland
-
Serbout, Souhaila
University of Zurich, Switzerland
-
Cleve, Anthony
Namur Digital Institute, Belgium
-
Lanza, Michele
Istituto del software (SI), Facoltà di scienze informatiche, Università della Svizzera italiana, Switzerland
Published in:
- ACM International Conference on Mining Software Repositories (MSR 2026). - 2026, p. in press
English
Researchers in Mining (open-source) Software Repositories (MSR) often create datasets that should outlive a single paper and support long-term investigation of specific phenomena. Although such studies are popular, they repeatedly run into the same technical limitations. For instance, public collaborative development platforms such as GitHub impose hourly rate limits on API requests. Furthermore, depending on network and API conditions, queries can fail and disrupt the process. These unexpected events can slow down or even invalidate the mining. There are reusable ways to minimize these undesirable effects while still complying with such limitations, yet best practices are often (re-)implemented on an ad hoc, whatever-works basis.
We propose PoolinGH, a lightweight, open-source, easy-to-use library aimed at supporting researchers. It is designed to make mining through the GitHub REST API faster, more efficient, and more robust while taking full advantage of the API's capabilities. PoolinGH automatically pools multiple access tokens and parallelizes queries. It optimizes queues and regulates network and API usage to respect GitHub's limits and best practices. It also handles errors, with recovery or pruning in case of deadlocks. Search coverage maximization and progress monitoring are among the features that most help researchers avoid reinventing the wheel. We also provide solution templates that meet common needs for specific extensions of PoolinGH. A preliminary evaluation of these examples, involving tens of thousands of requests, demonstrates tangible gains.
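To make the idea of token pooling concrete, the following is a minimal sketch and not PoolinGH's actual API: the names `TokenState`, `TokenPool`, `update`, and `acquire` are hypothetical, and the starting quota of 5000 is an assumption based on GitHub's documented hourly limit for authenticated REST requests. The sketch only relies on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub returns with each REST response.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenState:
    """Quota bookkeeping for one access token (hypothetical structure)."""
    token: str
    remaining: int = 5000  # assumed starting quota; real value comes from responses
    reset_at: float = 0.0  # epoch seconds at which the quota refills


class TokenPool:
    """Hypothetical pool that hands out the token with the most remaining quota."""

    def __init__(self, tokens):
        if not tokens:
            raise ValueError("at least one token is required")
        self._states = {t: TokenState(t) for t in tokens}

    def update(self, token, headers):
        """Record the quota reported by a response's rate-limit headers.

        `headers` is assumed to be a mapping keyed exactly as below; HTTP
        clients with case-insensitive header mappings also satisfy this.
        """
        state = self._states[token]
        state.remaining = int(headers["X-RateLimit-Remaining"])
        state.reset_at = float(headers["X-RateLimit-Reset"])

    def acquire(self, now=None):
        """Return (token, wait_seconds).

        wait_seconds is 0.0 when some token can be used immediately;
        otherwise it is the delay until the earliest quota reset.
        """
        now = time.time() if now is None else now
        # A token is usable if it has quota left or its reset time has passed.
        usable = [
            s for s in self._states.values()
            if s.remaining > 0 or s.reset_at <= now
        ]
        if usable:
            best = max(usable, key=lambda s: s.remaining)
            return best.token, 0.0
        # Every token is exhausted: report the one that refills first.
        soonest = min(self._states.values(), key=lambda s: s.reset_at)
        return soonest.token, soonest.reset_at - now
```

A caller would invoke `acquire()` before each request, sleep for the returned delay if it is positive, and pass the response headers to `update()` afterwards. Parallel workers sharing one pool would additionally need a lock around both methods, which this sketch omits.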
-
Classification
-
Computer science and technology
-
Notes
-
- MSR 2026
- Rio de Janeiro, Brazil
- 13-14 Apr 2026
-
Open access status
-
gold
-
Identifiers
-
-
Persistent URL
-
https://n2t.net/ark:/12658/srd1334960