This repository contains code related to the ORCA project.
Navigate to the github-metrics subdirectory. If you have not installed Gatsby, do so, and then run
npm install
gatsby clean
gatsby develop
To update the production website, run
gatsby clean; gatsby build
and then
bash push_to_production.sh
Data updates are automated via orca_data_pipeline.py. Additionally, a GitHub action runs once a month to open a PR that updates the data. You can merge this PR and update the production site after reviewing the changes. In short, you shouldn't have to run the steps below manually. These instructions are included in case of some special circumstance.
If the pipeline takes an unusually long time to run or the sensors are timing out, check the log files in the airflow user's home directory on orca-etl.
To manually run data preprocessing, ensure that the current data in orca.website_stats in BigQuery has been exported to GCS in gs://airflow-data-exchange/orca/tmp/website_stats. Then, run
PYTHONPATH='.' python3 scripts/preprocess_for_website.py
The Airflow pipeline outlines the sequence of commands to run in more detail, but a quick summary:
- Run sql/repos_in_papers.sql to aggregate GitHub references that appear in papers. If you do not want to update the software extracted from scholarly literature, skip this step.
- Prepare your development environment:
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt
export GITHUB_ACCESS_TOKEN="your access token"
export GITHUB_USER="your username"
- Run
PYTHONPATH='.' python3 scripts/retrieve_repos.py
to retrieve a clean list of software to pull metadata from. You can run with the --query_bq flag to retrieve software that appears in the scholarly literature (if you are a CSET employee with BigQuery access) or --query_topics to retrieve software that matches the GitHub topics that appear in input_data/topics.txt.
- At this point, we will have full metadata for repos we retrieved using the GitHub API (i.e. repos retrieved by topic, at the moment), but not for repos that only appear in papers or other sources. The next script grabs the default metadata retrieved from the GitHub API for repos that don't already have it:
PYTHONPATH='.' python3 scripts/backfill_top_level_repo_data.py
- Now, we can scrape some additional metadata from GitHub itself, including the text of README.md files, which we can use for further analysis. Run
PYTHONPATH='.' python3 scripts/retrieve_repo_metadata.py curr_repos_filled.jsonl curr_repos_final.jsonl
- To prepare data for the web application, load the curr_repos_final.jsonl file produced in the previous step into BigQuery and run the sequence of queries in sequences/downstream_order.txt.
These steps are automated and run on a monthly basis on the scholarly literature data using the orca_data_pipeline.py Airflow pipeline.
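If you do end up running the steps by hand, a small helper can catch a missing credential before the first GitHub request and print the script invocations in order. The following is only a sketch: missing_env_vars and manual_step_commands are hypothetical names that do not exist in this repository, and the command list simply mirrors the summary above. Whether --query_bq and --query_topics can be combined is not stated here, so treat passing both as an assumption.

```python
import os
import shlex

# Environment variables the summary above asks you to export.
REQUIRED_ENV_VARS = ("GITHUB_ACCESS_TOKEN", "GITHUB_USER")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def manual_step_commands(query_bq=False, query_topics=False):
    """Return the script invocations from the summary, in order (hypothetical helper)."""
    retrieve = ["python3", "scripts/retrieve_repos.py"]
    if query_bq:
        retrieve.append("--query_bq")
    if query_topics:
        retrieve.append("--query_topics")
    return [
        retrieve,
        ["python3", "scripts/backfill_top_level_repo_data.py"],
        [
            "python3",
            "scripts/retrieve_repo_metadata.py",
            "curr_repos_filled.jsonl",
            "curr_repos_final.jsonl",
        ],
    ]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these before running:", ", ".join(missing))
    # Print the commands rather than executing them, so nothing runs by accident.
    for argv in manual_step_commands(query_topics=True):
        print("PYTHONPATH='.' " + shlex.join(argv))
```

The helper only prints the commands. Run them yourself from the repository root once the environment variables are set.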
- Total stars - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.stargazers_count)
- Total watchers - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.subscribers_count)
- Total contributors - This is scraped by retrieve_repo_metadata.py (TODO: maybe retrieve from gh archive)
- Total references - This comes from our scholarly literature (see paper_meta in website_stats.sql)
- Total open issues - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.open_issues)
- Created date - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.created_at)
- Last push date - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.pushed_at)
- License - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.license.name)
- Top programming language - This comes from the GitHub API (see staging_github_metrics.repos_with_full_meta_raw.full_metadata.language)
- Stars over time - This counts the number of WatchEvents for the project in the githubarchive BQ public dataset. The sum of these counts may not equal the total stars because users may unstar (or even unstar and re-star!) a repo.
- Commits over time - This counts the number of distinct commits based on the unnested commits for PushEvents in the githubarchive BQ public dataset. See also push_event_commits.sql
- Issues over time - This counts the number of opened and closed issues based on IssuesEvents in the githubarchive BQ public dataset. See also issue_events.sql. We determine whether the issue was opened or closed based on the action field.
- New versus returning contributors over time - This metric is based on commits. If a contributor makes their first commit during a given time interval, we mark them as a new contributor. Otherwise, they are a returning contributor. See combination of repo_pushes in website_stats.sql and get_new_vs_returning_contributor_counts in preprocess_for_website.py.
- Contribution percentages - This metric is based on commits. For each contributor, we count their number of commits, then calculate the percentage of commits written by each contributor. See combination of repo_pushes in website_stats.sql and get_cumulative_contributor_counts in preprocess_for_website.py.
- Cumulative percentage of contributions by number of contributors - This metric is based on commits. See description of Contribution percentages above, and github-metrics/src/components/summary_panel.js:getContribTrace
- The deps.dev links are added if the repo is present in bigquery-public-data.deps_dev_v1
- The pypi downloads over time come from bigquery-public-data.pypi
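
The commit-based contributor metrics above can be illustrated with a short sketch. This is not the code in preprocess_for_website.py or summary_panel.js: the function names and the input shape (a list of (interval, contributor) pairs, one per commit) are assumptions made for illustration, as is the choice to count each contributor once per interval.

```python
from collections import Counter


def new_vs_returning(commits):
    """Count new and returning contributors per interval.

    commits: list of (interval, contributor) pairs, one per commit (assumed shape).
    A contributor is "new" in the interval of their first commit and
    "returning" in every later interval in which they commit.
    """
    pairs = set(commits)  # count each contributor once per interval
    first_seen = {}
    for interval, contributor in pairs:
        first_seen[contributor] = min(interval, first_seen.get(contributor, interval))
    counts = {}
    for interval, contributor in pairs:
        bucket = counts.setdefault(interval, {"new": 0, "returning": 0})
        kind = "new" if first_seen[contributor] == interval else "returning"
        bucket[kind] += 1
    return counts


def contribution_percentages(commits):
    """Return (contributor, percent of all commits), largest share first."""
    per_contributor = Counter(contributor for _, contributor in commits)
    total = sum(per_contributor.values())
    return [(name, 100.0 * n / total) for name, n in per_contributor.most_common()]


def cumulative_percentages(percentages):
    """Running total of shares: entry k covers the top k + 1 contributors."""
    running, out = 0.0, []
    for _, pct in percentages:
        running += pct
        out.append(running)
    return out


# Made-up sample data: alice commits in both years, carol first appears in 2021.
sample = [
    (2020, "alice"), (2020, "alice"), (2020, "bob"),
    (2021, "alice"), (2021, "carol"),
]
by_interval = new_vs_returning(sample)
shares = contribution_percentages(sample)
cumulative = cumulative_percentages(shares)
```

With the sample data, 2021 has one new contributor (carol) and one returning contributor (alice), and alice accounts for 60% of all commits.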