name: mining-software-repository
description: Use when creating software repository mining scripts on GitHub.
Description
Discover and shortlist GitHub repositories that contain a SKILL.md file by ingesting SEART CSV exports (repository lists), scanning each repository for matching files, and exporting results to a single output CSV for downstream processing.
This skill is designed for AI coding agents and automation scripts to run repeatable, read-only mining over large repo sets with rate-limit aware scanning and resumable execution.
When to use this skill
Use this skill when you have:
- A folder of SEART-generated CSV files that include GitHub repository identifiers.
- A need to find repositories that include SKILL.md (exact name by default).
- A need to export a shortlist CSV for later processing (downloading, parsing, indexing).
Do not use this skill for content extraction, license compliance review, vulnerability scanning, or modifying repositories.
Inputs
Required
- SEART CSV folder
  - Path to a directory containing one or more .csv files.
  - Each CSV must include enough information to reconstruct a GitHub repo identifier in the form owner/repo.
- Output CSV path
  - Path where the scan results should be written.
Optional configuration
- match_name: default SKILL.md
- case_sensitive: default true
- search_paths: default ["/SKILL.md"]
  - You can expand this list to include common locations if desired.
- include_negative_results: default true
  - If false, output only repositories where a match was found.
- max_repos: default 0 (meaning no limit)
- resume: default true
  - Allows continuing from a previous output file.
- concurrency: default 8
- min_stars: default 0
- allow_forks: default true
- allow_archived: default true
Expected CSV schema (SEART)
SEART exports vary. This skill supports multiple patterns. A repo may be derived from any of these:
Preferred column patterns
- full_name (example: psf/requests)
- repo (example: psf/requests)
- repository (example: psf/requests)
Alternate patterns
- owner and name (combine into owner/name)
- org and repo_name (combine into org/repo_name)
- repo_owner and repo_name (combine into repo_owner/repo_name)
- html_url or url (extract owner/repo from https://github.com/owner/repo)
Parsing rules
- Trim whitespace.
- Remove .git suffix if present.
- Normalize https://github.com/owner/repo/... to owner/repo.
- Deduplicate repositories across all CSVs.
If no supported columns are found in a CSV, record an input-level error and continue with the remaining files.
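The column patterns and parsing rules above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names normalize_repo and repo_from_row and the regular expressions are assumptions introduced here.

```python
import re

# Column names mirror the patterns listed above; the helper names are hypothetical.
SINGLE_COLUMNS = ("full_name", "repo", "repository")
PAIR_COLUMNS = (("owner", "name"), ("org", "repo_name"), ("repo_owner", "repo_name"))
URL_COLUMNS = ("html_url", "url")

_GITHUB_URL = re.compile(r"^https?://github\.com/([^/\s]+)/([^/\s?#]+)", re.IGNORECASE)
_SLUG = re.compile(r"^[^/\s]+/[^/\s]+$")


def normalize_repo(value):
    """Return 'owner/repo' for a slug or GitHub URL, or None if unparseable."""
    text = (value or "").strip()
    match = _GITHUB_URL.match(text)
    if match:
        # Drops anything after owner/repo, such as /tree/main.
        text = f"{match.group(1)}/{match.group(2)}"
    if text.endswith(".git"):
        text = text[: -len(".git")]
    return text if _SLUG.match(text) else None


def repo_from_row(row):
    """Try preferred columns, then column pairs, then URL columns."""
    for column in SINGLE_COLUMNS:
        repo = normalize_repo(row.get(column))
        if repo:
            return repo
    for owner_col, name_col in PAIR_COLUMNS:
        owner, name = row.get(owner_col), row.get(name_col)
        if owner and name:
            repo = normalize_repo(f"{owner.strip()}/{name.strip()}")
            if repo:
                return repo
    for column in URL_COLUMNS:
        repo = normalize_repo(row.get(column))
        if repo:
            return repo
    return None
```

Rows read with csv.DictReader can be passed to repo_from_row directly; collecting the results in a set gives deduplication across files.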
Outputs
Output CSV: skill_md_scan_results.csv
A single CSV with one row per repository (unless you choose to emit multiple rows per match path).
Recommended columns
- repo: owner/repo canonical identifier.
- source_csv: Filename of the originating SEART CSV (or MULTIPLE if merged).
- found: true | false.
- match_name: The filename rule used, usually SKILL.md.
- match_path: The matched path (example: /SKILL.md).
- default_branch: Default branch name if available.
- ref_scanned: Branch or commit ref used (example: HEAD or main).
- match_url: Canonical URL to the file if found.
- match_sha: Git blob SHA if available.
- match_size_bytes: Size if available.
- scan_method: contents_api | code_search | sparse_checkout | local_clone.
- http_status: Status code from API calls when applicable.
- error_type: none | not_found | rate_limited | auth | network | invalid_repo | other.
- error_message: Short error detail; do not include secrets.
- scanned_at_utc: ISO timestamp.
- stars (optional)
- fork (optional)
- archived (optional)
Shortlist CSV (optional)
If include_negative_results=false, the output itself becomes the shortlist. Otherwise, generate a second file:
- skill_md_shortlist.csv, containing only rows with found=true.
Capabilities
1) Ingest SEART CSV folder
- Discover all .csv files in a directory (recursive optional).
- Extract repository identifiers with robust column detection.
- Deduplicate and normalize into a canonical repo list.
2) Scan repositories for SKILL.md
This skill uses a tiered, rate-limit aware strategy:
Tier A (preferred): GitHub Contents API
Fastest and cheapest when you know the exact path.
- Check candidate paths (default /SKILL.md) against the default branch ref.
- Record 200 as found, 404 as not found.
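A Tier A check might look like the sketch below, which uses only the standard library. The function names contents_url and check_path and the shape of the returned dictionary are assumptions for illustration. The sketch relies on the GitHub REST endpoint GET /repos/{owner}/{repo}/contents/{path}, which uses the default branch when no ref is given.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(repo, path, ref=None):
    """Build the Contents API URL for one candidate path."""
    quoted = urllib.parse.quote(path.lstrip("/"))
    url = f"{API_ROOT}/repos/{repo}/contents/{quoted}"
    if ref:
        url += "?" + urllib.parse.urlencode({"ref": ref})
    return url


def check_path(repo, path, token=None, ref=None):
    """Return a partial result row for one repo/path pair (hypothetical shape)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(contents_url(repo, path, ref), headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            status = response.status
            body = json.load(response)
    except urllib.error.HTTPError as error:
        # 404 means not found; other codes are left for the caller to classify.
        return {"found": False, "http_status": error.code}
    # A directory at this path returns a list, so only a file object counts.
    is_file = isinstance(body, dict) and body.get("type") == "file"
    return {
        "found": is_file,
        "http_status": status,
        "match_path": path if is_file else "",
        "match_sha": body.get("sha", "") if is_file else "",
        "match_url": body.get("html_url", "") if is_file else "",
    }
```

Network failures (urllib.error.URLError) are deliberately not caught here so that a retry layer can handle them.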
Tier B (optional): GitHub code search (filename search within repo)
Use only when:
- You want to find SKILL.md in subdirectories.
- You want case-insensitive matching.
- Contents API paths are unknown.
Recommended query shape:
repo:owner/repo filename:SKILL.md
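As a rough sketch, the query above can be turned into a request URL for the REST code search endpoint (GET /search/code), and the returned items can be post-filtered by name to honor case_sensitive. The helper names code_search_url and filter_items are hypothetical, and the item fields name and path are assumed to follow the search response format.

```python
import urllib.parse


def code_search_url(repo, match_name="SKILL.md"):
    """Build a code search URL using the recommended query shape."""
    query = f"repo:{repo} filename:{match_name}"
    return "https://api.github.com/search/code?" + urllib.parse.urlencode({"q": query})


def filter_items(items, match_name="SKILL.md", case_sensitive=True):
    """Keep only items whose file name equals match_name; return sorted paths."""
    wanted = match_name if case_sensitive else match_name.lower()
    matches = []
    for item in items:
        name = item.get("name", "")
        if (name if case_sensitive else name.lower()) == wanted:
            matches.append("/" + item.get("path", "").lstrip("/"))
    return sorted(matches)
```

Code search requires authentication and has a much lower rate limit than the core API, which is one reason to keep this tier optional.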
Tier C (fallback): Sparse checkout shallow clone
Use only when APIs are unavailable, rate-limited, or you need to support GitHub Enterprise without search APIs.
- Perform a filtered clone that avoids full history and large blobs where possible.
- Use sparse checkout to fetch only candidate paths.
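One possible way to realize Tier C is the command sequence sketched below: a shallow, blob-filtered clone without checkout, followed by a non-cone sparse checkout of the candidate paths. The helper names are hypothetical, and the sequence assumes a reasonably recent Git that supports git sparse-checkout set --no-cone.

```python
import os
import subprocess


def sparse_clone_commands(repo, dest, paths=("/SKILL.md",)):
    """Return the git commands for a shallow, filtered, sparse clone."""
    url = f"https://github.com/{repo}.git"
    return [
        # No history beyond one commit, no blobs until they are needed.
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--no-checkout", url, dest],
        # Restrict the working tree to the candidate paths only.
        ["git", "-C", dest, "sparse-checkout", "set", "--no-cone", *paths],
        ["git", "-C", dest, "checkout"],
    ]


def scan_with_sparse_clone(repo, dest, paths=("/SKILL.md",)):
    """Run the commands and return the candidate paths that exist on disk."""
    for command in sparse_clone_commands(repo, dest, paths):
        subprocess.run(command, check=True, capture_output=True)
    return [p for p in paths if os.path.isfile(os.path.join(dest, p.lstrip("/")))]
```

The clone directory should live under the chosen output folder or a temporary directory and be removed after the scan, in keeping with the read-only and data-minimization guardrails.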
3) Export results to output CSV
- Write output incrementally (streaming) to avoid losing work.
- Ensure deterministic columns.
- Support resume mode by skipping repos already scanned.
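Incremental writing and resume could be handled along these lines. This is a minimal sketch: FIELDNAMES is a shortened, illustrative subset of the recommended columns, and already_scanned and append_row are hypothetical helper names.

```python
import csv
import os

# Illustrative subset; a real run would list every recommended column in a fixed order.
FIELDNAMES = ["repo", "source_csv", "found", "match_path", "error_type", "scanned_at_utc"]


def already_scanned(output_path):
    """Return the set of repos present in an existing output CSV."""
    if not os.path.exists(output_path):
        return set()
    with open(output_path, newline="", encoding="utf-8") as handle:
        return {row["repo"] for row in csv.DictReader(handle) if row.get("repo")}


def append_row(output_path, row):
    """Append one result row, writing the header only for a new or empty file."""
    is_new = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
    with open(output_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=FIELDNAMES, extrasaction="ignore", restval=""
        )
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A scan loop would call already_scanned once at startup, skip any repo in that set, and call append_row after each repository so that an interrupted run loses at most the row in flight.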
Workflow
Step 0: Preconditions
- Read-only operation only.
- Ensure you have GitHub authentication for higher rate limits if scanning many repos.
Environment options (resolved in priority order):
- --github-tokens ghp_tok1,ghp_tok2 CLI flag: comma-separated list for multi-token rotation (5000 req/hr per token)
- --github-token ghp_tok CLI flag: single token override
- GH_TOKENS=ghp_tok1,ghp_tok2: env var, comma-separated (new multi-token variable)
- GH_TOKEN=ghp_tok: env var, single token (backward-compatible)
- GITHUB_TOKEN=ghp_tok: env var, fallback single token
- GitHub CLI gh auth login: for interactive use
- Unauthenticated: 60 core req/hr only (not suitable for bulk scans)
When multiple tokens are provided, the TokenPool in src/github_client/ automatically:
- Selects the token with the highest remaining quota on each request.
- Updates per-token quota from X-RateLimit-* response headers.
- Rotates to the next available token when the current one is exhausted.
- Sleeps until the earliest reset time when all tokens are exhausted.
- Raises RateLimitExhaustedError if the wait would exceed the configurable maximum.
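To make the rotation behavior concrete, here is a simplified stand-in. It is not the TokenPool from src/github_client/, whose interface is not shown in this document; the class name SimpleTokenPool, its methods, and the max_wait_seconds parameter are all assumptions. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from dataclasses import dataclass


class RateLimitExhaustedError(RuntimeError):
    """Raised when every token is exhausted and the wait is too long."""


@dataclass
class TokenState:
    token: str
    remaining: int = 5000   # assumed starting quota until headers say otherwise
    reset_at: float = 0.0   # epoch seconds from X-RateLimit-Reset


class SimpleTokenPool:
    def __init__(self, tokens, max_wait_seconds=900.0, clock=time.time, sleep=time.sleep):
        if not tokens:
            raise ValueError("at least one token is required")
        self._states = [TokenState(token) for token in tokens]
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._sleep = sleep

    def update(self, token, headers):
        """Refresh one token's quota from X-RateLimit-* response headers."""
        lowered = {key.lower(): value for key, value in headers.items()}
        for state in self._states:
            if state.token == token:
                state.remaining = int(lowered.get("x-ratelimit-remaining", state.remaining))
                state.reset_at = float(lowered.get("x-ratelimit-reset", state.reset_at))

    def acquire(self):
        """Return the usable token with the most quota, waiting if necessary."""
        now = self._clock()
        usable = [s for s in self._states if s.remaining > 0 or s.reset_at <= now]
        if usable:
            return max(usable, key=lambda s: s.remaining).token
        earliest = min(self._states, key=lambda s: s.reset_at)
        wait = earliest.reset_at - now
        if wait > self._max_wait:
            raise RateLimitExhaustedError(f"next reset is {wait:.0f}s away")
        self._sleep(wait)
        return earliest.token
```

A caller would invoke acquire before each request and update after each response, passing the response headers unchanged.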
Step 1: Build the repo list from SEART CSVs
- Enumerate input CSVs.
- For each CSV:
  - Detect repo columns.
  - Extract and normalize owner/repo.
  - Track source_csv.
- Deduplicate across all files.
Step 2: Preflight checks
- Validate token presence (warn if unauthenticated).
- Optionally query rate limit status and set conservative concurrency.
Step 3: Scan each repository
For each owner/repo:
- (Optional) Fetch repo metadata:
  - Confirm repo exists.
  - Record default branch, archived, fork, stars.
- Apply filters (stars, forks, archived).
- Scan for SKILL.md using Tier A.
- If enabled and needed, run Tier B for deeper search.
- If enabled and needed, run Tier C for fallback.
- Record a single best match (or all matches if configured).
- Write the result row immediately.
Step 4: Produce shortlist
- Filter where found=true.
- Export shortlist CSV.
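The shortlist step amounts to a filter over the results file. A minimal sketch, assuming the results CSV has a found column containing true or false as text; write_shortlist is a hypothetical helper name.

```python
import csv


def write_shortlist(results_path, shortlist_path):
    """Copy rows with found=true into a shortlist CSV; return the row count."""
    with open(results_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = [
            row for row in reader
            if (row.get("found") or "").strip().lower() == "true"
        ]
    with open(shortlist_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The shortlist keeps the same columns as the results file, so downstream stages can read either file with the same parser.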
Step 5: Summary reporting
At the end, report:
- Total repos scanned.
- Found count and percentage.
- Error breakdown by error_type.
- Effective scan rate and any rate limiting encountered.
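These figures can be derived from the result rows alone. The sketch below assumes each row is a dictionary with found and error_type keys; the summarize function and the keys of its return value are illustrative names, not part of the skill.

```python
from collections import Counter


def summarize(rows, elapsed_seconds):
    """Compute totals, found percentage, error breakdown, and scan rate."""
    total = len(rows)
    found = sum(1 for row in rows if str(row.get("found", "")).lower() == "true")
    errors = Counter(row.get("error_type") or "none" for row in rows)
    return {
        "total": total,
        "found": found,
        "found_pct": round(100.0 * found / total, 2) if total else 0.0,
        "errors": dict(errors),
        "repos_per_second": round(total / elapsed_seconds, 2) if elapsed_seconds > 0 else 0.0,
    }
```

The rate_limited entry in the error breakdown doubles as a simple indicator of how much rate limiting the run encountered.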
Usage examples
Example A: Scan exact root path only (fast)
- Inputs:
  - seart_dir = data/seart_csvs/
  - search_paths = ["/SKILL.md"]
- Output:
  - outputs/skill_md_scan_results.csv
Example B: Find SKILL.md anywhere in repo
- Inputs:
  - Enable code search.
  - case_sensitive=false
- Output:
  - outputs/skill_md_scan_results.csv
  - outputs/skill_md_shortlist.csv
Example C: Resume a partial scan
- If outputs/skill_md_scan_results.csv already exists, skip repos already present and continue.
Operational constraints and guardrails
Read-only
- Do not push commits, open PRs, create issues, or modify repository contents.
- Do not write to user directories outside the chosen output folder.
Rate limits and polite scanning
- Prefer Contents API checks over cloning.
- Use bounded concurrency and exponential backoff on:
  - 403 secondary rate limits
  - 429 too many requests
- Store partial progress continuously.
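The backoff rule could be sketched as below. The base delay, cap, and use of full jitter are assumptions rather than settings taken from this skill, and the check that separates a rate-limit 403 from a permission 403 (a Retry-After header or a zero X-RateLimit-Remaining) is a heuristic.

```python
import random


def should_retry(status, headers):
    """Decide whether a response looks like rate limiting worth retrying."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if status == 429:
        return True
    if status == 403:
        # A plain 403 may be a permission error, so require a rate-limit hint.
        return "retry-after" in lowered or lowered.get("x-ratelimit-remaining") == "0"
    return False


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=random.random):
    """Return seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Honor the server's explicit instruction when one is given.
        return max(0.0, float(retry_after))
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: a random delay between 0 and the exponential ceiling.
    return ceiling * rng()
```

Jitter spreads retries from concurrent workers so they do not all hit the API at the same moment after a shared pause.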
Data minimization
- Do not download full repositories unless fallback is required.
- Do not store tokens, cookies, or raw auth headers in logs or output CSV.
Reproducibility
- Record scanned_at_utc and ref_scanned.
- Keep configuration (paths, case sensitivity, tiers enabled) alongside outputs.
Error handling rules
Repository parsing errors
- If a row cannot be parsed into owner/repo, discard it and log a row-level parse error summary per CSV.
Missing or private repositories
- If repo is 404 or inaccessible:
  - error_type=invalid_repo or error_type=auth
  - found=false
Temporary failures
- Network timeouts, 5xx responses:
  - retry with backoff
  - if still failing, mark error_type=network and continue
Rate limiting
- If rate-limited:
  - reduce concurrency
  - backoff and retry
  - if still blocked, record error_type=rate_limited
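One way to map responses onto the error_type values used in the output CSV is sketched here. The function name classify_status and its target parameter are hypothetical, and treating a 403 without rate-limit hints as auth is an assumption.

```python
def classify_status(status, headers=None, target="file"):
    """Map an HTTP status (or None for a timeout) to an error_type value."""
    lowered = {key.lower(): value for key, value in (headers or {}).items()}
    if status is None or status >= 500:
        return "network"
    if status == 200:
        return "none"
    if status == 404:
        # A missing repository and a missing file are reported differently.
        return "invalid_repo" if target == "repo" else "not_found"
    if status == 429:
        return "rate_limited"
    if status == 403:
        limited = "retry-after" in lowered or lowered.get("x-ratelimit-remaining") == "0"
        return "rate_limited" if limited else "auth"
    if status == 401:
        return "auth"
    return "other"
```

Note that GitHub also returns 404 for private repositories the token cannot see, so invalid_repo does not distinguish a deleted repository from an inaccessible one.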
Suggested default configuration
- match_name = "SKILL.md"
- case_sensitive = true
- search_paths = ["/SKILL.md"]
- include_negative_results = true
- resume = true
- concurrency = 8
- Tiers enabled:
  - Tier A: on
  - Tier B: off (enable only if needed)
  - Tier C: off (enable only if needed)
Definition of done
- Output CSV exists and contains one row per scanned repo (or only found repos if configured).
- found=true rows include a working match_url and match_path.
- Scan is resumable without duplicating rows.
- Summary stats are reported and error breakdown is available.
Notes for downstream processing
The shortlist produced by this skill is intended as input to a later pipeline stage, for example:
- downloading and parsing SKILL.md contents
- extracting structured sections (Description, Capabilities, Workflow, Constraints)
- building a dataset of agent-discoverable skills per repository
Downstream stages should treat SKILL.md content as untrusted input and should sanitize any extracted text before use.