Quantifying your reliance on Open Source software

With dependency-management-data (DMD)

Jamie Tanna (https://www.jvt.me)
Senior Software Engineer @ Elastic

/usr/bin/whoami

Timeline of events

  • 2024-02: This talk!
  • 2023-07: First public talk
  • 2023-02: Created the dependency-management-data project 🎂
  • 2022-08: First iteration with Dependabot
  • 2019: "Formally" considering it
  • 2017: Hacking around

Why is it important?

As I wrote in the post Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest

But it's not always ☀🌈

xkcd comic showing a tower of various layers of boulders and stones, labelled "all modern digital infrastructure", which looks a little precarious. Towards the bottom there is a slim load-bearing stone which is labelled "a project some random person in Nebraska has been thanklessly maintaining since 2003"

💖

Do you fully appreciate the depth of your dependency on the software supply chain?

Being able to understand how your business uses Open Source is really important for a few other key reasons:

  • How am I affected by that dependency migrating away from Open Source?
  • Usages of unwanted libraries
  • Understand usage of libraries and frameworks, and their versions
  • Discovering unmaintained, deprecated or vulnerable software

Being able to understand how your business uses Open Source internal software is really important for a few other key reasons:

  • How am I affected by that dependency migrating away from Open Source?
  • Usages of unwanted libraries
  • Understand usage of libraries and frameworks, and their versions
  • Discovering unmaintained, deprecated or vulnerable software

Other insights into:

  • How maintained does the dependency appear to be?
  • How are the dependency's supply chain security practices? (via OpenSSF Security Scorecards)
  • How many dependencies are actively seeking financial support?

How can we do it?

💰🤑💸

              GitHub logo GitLab logo Snyk logo Mend logo

Let's use Open Source!

Open Source Initiative logo

What is dependency-management-data?

Dependency Management Data (DMD) - dmd.tanna.dev

What's in the project?

  • The outputted SQLite database
  • The command-line tool dmd
  • The web application dmd-web, and the GraphQL-only web application dmd-graph
  • (Your SQLite browser of choice)

SQLite database

  • Conveniently distribute, share
  • Great for local-only or building applications on top of it
  • No lock-in to dmd - all state synced to the DB

dmd

  • Create the SQLite database
  • Ingests different sources of dependency data ("datasources")
  • Enrich it with more data ("advisories", "dependency health")
  • Configure your own views over the data via "custom advisories" and "policies"
  • Provide common queries ("reports")

dmd-web

  • Centrally deployable and accessible
  • View reports in the browser
  • Datasette's excellent SQLite UI
  • GraphQL API
  • Great when deployed accessible to all (with authentication)

dmd-graph

  • Web application with only the GraphQL API
  • Great for internally deployed (for internal API access)

How did it come to be?

Idea for Open Source/Startup: monetising the supply chain

Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest

  • Using the Dependabot APIs
  • Good starting point
  • Lack of data for some ecosystems
  • Hard to parse the "current version"

via GIPHY

Mend Renovate logo

EndOfLife.date logo

commit 73a99614a2af6fa9f66508bab8541ed65e18ed66
Author: Jamie Tanna <>
Date:   Thu Feb 2 09:23:43 2023 +0000

    Initialise project

 LICENSE.md        | 13 +++++++++++++
 README.md         |  7 +++++++
 public/index.html | 90 ++++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)

How does it work?

# produce some data that DMD can import, i.e.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN your-org/repo
# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

Datasources

Currently:

  • renovate-graph: Use Renovate as the source

  • dependabot-graph: Use GitHub's Dependabot API
  • Software Bill of Materials (SPDX, CycloneDX)
  • endoflife-checker: Pull in i.e. AWS infrastructure
  • More welcome!

When importing:

  • Converts to an underlying data model (in SQLite)
  • Uses that for internal querying + enrichment

Once ingested, write SQL to your heart's content 🤓 i.e.

  • "which repos use a vulnerable version of Log4J"
  • "how many repos are using a version of the Datadog SDK that's older than ..."
  • "what is our most used direct/transitive dependency?"

Reports

Pre-baked, Open Source'd queries:

$ dmd report --help

  advisories                 Report advisories that are available for packages or dependencies in use
  dependenton                Report usage of a given dependency
  golangCILint               Query usages of golangci-lint, tracked as a source-based dependency
  infrastructure-advisories  Report infrastructure advisories
  licenses                   Report license information for package dependencies
  mostPopularDockerImages    Query the most popular Docker registries, namespaces and images in use
  mostPopularPackageManagers Query the most popular package managers in use
  policy-violations          Report policy violations that are found for packages or dependencies

Advisories

Right now we can write SQL queries to ask:

  • what Terraform modules and versions are being used across the org?

  • which teams are using the Gin web framework?

But what if we could ask:

  • are any of our projects relying on libraries that no longer are recommended by our language guilds?

  • how much time should my team(s) be planning in the next quarter to upgrade their AWS infrastructure?

via GIPHY

Dependency advisory data sources:

AWS infrastructure advisory data sources:

🤫

Custom advisories 🦸

Community provided advisories via -contrib:

INSERT INTO custom_advisories (
  package_pattern,
  package_manager,
  version,
  version_match_strategy,
  advisory_type,
  description
) VALUES (
  'github.com/golang/mock',
  'gomod',
  NULL,
  NULL,
  'UNMAINTAINED',
  'golang/mock is no longer maintained, and active development been moved to github.com/uber/mock'
);

Policies

Open Policy Agent logo
package policy

import future.keywords.contains
import future.keywords.if

default advisory_type := "UNMAINTAINED"

deny contains "Use the new GitLab.com server" if
  startswith(input.dependency.package_name, "gitlab.example.com/")
// Versions of Gin >= 1.9
deny contains msg if {
	input.dependency.package_manager in {"gomod", "golang"}
	input.dependency.package_name = "github.com/gin-gonic/gin"
	versions[0] in {"v1", "1"}
	to_number(versions[1]) >= 9
	msg := sprintf("%s. Versions of Gin since v1.9.0 have shipped " +
      "ByteDance/sonic as an optional dependency, but it still " +
      "appears as a dependency, and could be in use - more " +
      "details in " +
      "https://github.com/gin-gonic/gin/issues/3653", [prefix])
}

Ownership

  • Who does this production service using end-of-life software belong to?

  • dmd owners + owners table to the rescue!

select
  distinct
  sboms.platform,
  sboms.organisation,
  sboms.repo,
  owner
from
  sboms
  left join owners
  on  sboms.platform     = owners.platform
  and sboms.organisation = owners.organisation
  and sboms.repo         = owners.repo
-- where ...

Repo metadata

How many customer-facing services are running an outdated version of this internal library?

And in the repository_metadata table:

repo additional_metadata
api-service {"customer_facing": "true"}
examples
business-service {"customer_facing": "false"}
select
  distinct
  sboms.platform,
  sboms.organisation,
  sboms.repo,
  (case json_extract(repository_metadata.additional_metadata,
    '$.customer_facing') when "true" then true
   else false
   end) as is_customer_facing
from
  sboms
  left join repository_metadata
  on  sboms.platform     = repository_metadata.platform
  and sboms.organisation = repository_metadata.organisation
  and sboms.repo         = repository_metadata.repo
-- where ...

Example project

Contrib project

Case Studies

https://dmd.tanna.dev/case-studies/

"What package advisories do I have?"

organisationrepopackage_namecurrent_versiondep_typesadvisory_typedescription
alphagov pay-selfservice node 18.17.1 ["engines"] DEPRECATED nodejs 18 has been unsupported (usually only receiving critical security fixes) for 107 days
golangci golangci-lint github.com/golangci/lint-1 v0.0.0-... ["require"] DEPRECATED Use golang.org/x/lint instead, as the golangci fork ... is behind active development and bugfixes in golang.org/x/lint
monzo response python 3.7 [] UNMAINTAINED python 3.7 has been End-of-Life for 220 days

via https://dependency-management-data-example.fly.dev/report/advisories

Which other services may be affected by this production bug?

https://dmd.tanna.dev/case-studies/deliveroo-kafka-sidecar/

# there may be some folks using YAML anchors
sidecars: &sidecars
    # or there could also be comments in here!
    kafka: 1.2.3

app:
    image: "internal.docker.registry/service-name"
    *sidecars

In the renovate table:

repo package_name current_version package_file_path
good-service internal-docker.tld/kafka 0.3.0 .hopper.yml
affected-service internal-docker.tld/kafka 0.2.1 .hopper.yml
also-affected-service internal-docker.tld/kafka 0.1.0 .hopper.yml

Parsed via https://docs.renovatebot.com/modules/manager/regex/

And the repository_metadata table:

repo additional_metadata
good-service {"tier": "tier_1"}
affected-service {"tier": "tier_1"}
also-affected-service {"tier": "tier_2"}

Slightly garish query:

select
  renovate.organisation,
  renovate.repo,
  current_version,
  owner,
  json_extract(additional_metadata, '$.tier') as tier
from
  renovate
  left join owners on
      renovate.platform     = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo         = owners.repo
  left join repository_metadata on
      renovate.platform     = repository_metadata.platform
  and renovate.organisation = repository_metadata.organisation
  and renovate.repo         = repository_metadata.repo
where
  -- NOTE: that this is performed with a lexicographical match, which is NOT
  -- likely to be what you are expecting to perform version constraint matching
  -- but this is a good start for these use cases
  renovate.current_version < '0.3'
order by
  tier ASC

Result:

organisation repo current_version owner tier
deliveroo affected-service 0.2.1 Grocery tier_1
deliveroo also-affected-service 0.1.0 tier_2

Log4shell

https://dmd.tanna.dev/case-studies/log4shell/

We could use the dependenton query:

# for Gradle projects
$ dmd report dependenton --db dmd.db --package-manager gradle
  --package-name org.apache.logging.log4j:log4j-core
+-------------------+---------+-------------------+----------------------------+-------------+
| REPO              | VERSION | DEPENDENCY TYPES  | FILEPATH                   | OWNER       |
+-------------------+---------+-------------------+----------------------------+-------------+
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
+-------------------+---------+-------------------+----------------------------+-------------+

Or start with SQL:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'

With the versions affected:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'
  and current_version in (
    '2.0-beta9',	'2.0-rc1',
    '2.0-rc2',		'2.0.1',
    '2.0.2',		'2.0',
    -- ....
    -- ....
    -- ....
    '2.13.0',		'2.13.1',
    '2.13.2',		'2.13.3',
    '2.14.0',		'2.14.1',
  )

Or with a Policy:

default advisory_type := "SECURITY"

versions := split(input.dependency.version, ".")
major := to_number(versions[0])
minor := to_number(versions[1])
patch := to_number(versions[2])

is_log4j2 if {
	input.dependency.package_manager in {"gradle", "maven"}
	input.dependency.package_name =
      "org.apache.logging.log4j:log4j-core"

	major == 2
}

// ...

Or with Policy (continued):

// ...

// CVE-2021-44228 aka Log4shell affects versions 2.0-beta9 to
// 2.14.1
is_vulnerable_version if input.dependency.version in {
  "2.0-beta9", "2.0-rc1", "2.0-rc2"}

// CVE-2021-44228 aka Log4shell affects versions 2.0-beta9 to
// 2.14.1
is_vulnerable_version if {
	minor > 0
	minor <= 14
}

deny contains msg if {
	is_log4j2
	is_vulnerable_version
	msg := "Dependency is vulnerable to Log4Shell CVE " +
      "(CVE-2021-44228)"
}

The Gorilla Toolkit archiving

https://dmd.tanna.dev/case-studies/gorilla-toolkit/

As of dmd v0.26.0, there was an inbuilt query that produced the following output:

$ dmd report gorillaToolkit --db dmd.db
Renovate
Direct dependencies
+------------------------------+---+
| PACKAGE                      | # |
+------------------------------+---+
| github.com/gorilla/mux       | 4 |
| github.com/gorilla/websocket | 1 |
| github.com/gorilla/handlers  | 1 |
+------------------------------+---+
Indirect dependencies
+---------------------------------+---+
| PACKAGE                         | # |
+---------------------------------+---+
| github.com/gorilla/websocket    | 6 |
| github.com/gorilla/securecookie | 2 |
| github.com/gorilla/sessions     | 1 |
| github.com/gorilla/schema       | 1 |
| github.com/gorilla/mux          | 1 |
| github.com/gorilla/context      | 1 |
+---------------------------------+---+

Alternatively:

select
  platform,
  organisation,
  repo,
  arr.value as dep_type
from
  renovate,
  json_each(dep_types) as arr
where
  package_name like 'github.com/gorilla/%'
group by
  package_name
order by
  dep_type desc

Results in:

platform organisation repo dep_type
github dagger dagger require
github deepmap oapi-codegen require
github elastic beats require
github tailscale tailscale require
github tailscale tailscale indirect

Docker free tier sunset

https://dmd.tanna.dev/case-studies/docker-free-sunsetting/

$ dmd report mostPopularDockerImages --db dmd.db
+----------------------------------+-----+
| NAMESPACE                        |   # |
+----------------------------------+-----+
| library                          | 499 |
| ghcr.io/gravitational            |  18 |
| dockersamples                    |  12 |
| gcr.io/distroless                |  11 |
| public.ecr.aws/gravitational     |  10 |
| registry1.dsop.io/redhat/ubi     |  10 |
| docker.mirror.hashicorp.services |  10 |
| wiremock                         |   8 |
| docker                           |   8 |
| docker.elastic.co/elasticsearch  |   7 |
| cimg                             |   6 |
| hashicorpdev                     |   6 |
+----------------------------------+-----+

Getting started

# produce some data that DMD can import, i.e.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN \
  your-org/repo another-org/repo
# or for GitLab
env RENOVATE_PLATFORM=gitlab npx @jamietanna/renovate-graph@latest \
  --token $GITLAB_TOKEN your-org/repo another-org/nested/repo

# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

https://dmd.tanna.dev/cookbooks/getting-started/

Resources

Questions?

Come and say hey 👋