Quantifying your reliance on Open Source software

With dependency-management-data (DMD)

Jamie Tanna (https://www.jvt.me)
Senior Software Engineer @ Elastic

`/usr/bin/whoami`

Jamie Tanna (he/him)
https://www.jvt.me
https://www.jvt.me/elsewhere/
Engineering Productivity 🚀
Currently @ Elastic, previously Deliveroo

Timeline of events

2024-11: This talk!
2023-10: TechMids
2023-07: First public talk
2023-02: Created the dependency-management-data project
2022-08: First iteration with Dependabot
2019: "Formally" considering it
2017: Hacking around

Why is it important?

As I wrote in the post "Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest"

But it's not always ☀🌈

xkcd comic showing a tower of various layers of boulders and stones, labelled "all modern digital infrastructure", which looks a little precarious. Towards the bottom there is a slim load-bearing stone which is labelled "a project some random person in Nebraska has been thanklessly maintaining since 2003"

💖

Do you fully appreciate the depth of your dependency on the software supply chain?

Being able to understand how your business uses Open Source is really important for a few other key reasons:

How am I affected by that dependency migrating away from Open Source?
Usages of unwanted libraries
Understand usage of libraries and frameworks, and their versions
Discovering unmaintained, deprecated or vulnerable software

Being able to understand how your business uses ~~Open Source~~ internal software is really important for a few other key reasons:

~~How am I affected by that dependency migrating away from Open Source?~~
Usages of unwanted libraries
Understand usage of libraries and frameworks, and their versions
Discovering unmaintained, deprecated or vulnerable software

Other insights into:

How maintained does the dependency appear to be?
How are the dependency's supply chain security practices? (via OpenSSF Security Scorecards)
How many dependencies are actively seeking financial support?

How can we do it?

💰🤑💸

GitHub logo GitLab logo

Let's use Open Source!

What is dependency-management-data?

Dependency Management Data (DMD) - dmd.tanna.dev

What's in the project?

The outputted SQLite database
The command-line tool dmd
The web application dmd-web, and the GraphQL-only web application dmd-graph
(Your SQLite browser of choice)

SQLite database

Conveniently distribute, share
Great for local-only or building applications on top of it
No lock-in to dmd - all state synced to the DB

dmd

Create the SQLite database
Ingest different sources of dependency data ("datasources")
Enrich it with more data ("advisories", "dependency health")
Configure your own views over the data via "custom advisories" and "policies"
Provide common queries ("reports")

dmd-web

Centrally deployable and accessible
View reports in the browser
Datasette's excellent SQLite UI
sql-studio's excellent SQLite UI
Lightweight internal SQLite browser
GraphQL API
Great when deployed so it's accessible to everyone (with authentication)

dmd-graph

Web application with only the GraphQL API
Great for internal deployments (for internal API access)

How did it come to be?

Idea for Open Source/Startup: monetising the supply chain

Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest

Using the Dependabot APIs
Good starting point
Lack of data for some ecosystems
Hard to parse the "current version"

Mend Renovate logo

EndOfLife.date logo

commit 73a99614a2af6fa9f66508bab8541ed65e18ed66
Author: Jamie Tanna <>
Date:   Thu Feb 2 09:23:43 2023 +0000

    Initialise project

 LICENSE.md        | 13 +++++++++++++
 README.md         |  7 +++++++
 public/index.html | 90 ++++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)

How does it work?

# produce some data that DMD can import, e.g.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN your-org/repo
# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

# set up the database
dmd db init --db dmd.db

# taking an SBOM that was produced from the GitLab repo
# https://gitlab.com/tanna.dev/dependency-management-data
dmd import sbom --db dmd.db '/path/to/sbom.json' --platform gitlab \
  --organisation tanna.dev \
  --repo dependency-management-data
# take an SBOM that was produced in some unknown place,
# and auto-detect what we can
dmd import sbom '/path/to/sbom.json'
# take an SBOM that was produced by a vendor
dmd import sbom '/path/to/sbom.json' --vendor ExampleCorp \
  --product 'Web Server' --product-version 5.0.0

# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from sboms'

When importing:

Converts to an underlying data model (in SQLite)
Uses that for internal querying + enrichment

Once ingested, write SQL to your heart's content 🤓 e.g.

"which repos use a vulnerable version of Log4J"
"how many repos are using a version of the Datadog SDK that's older than ..."
"what is our most used direct/transitive dependency?"
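
As a rough illustration of the last question, here is a minimal sketch using Python's built-in sqlite3 module. It runs against a toy in-memory stand-in for the renovate table: only columns that appear elsewhere in this talk are modelled, every row is invented, and the most_used helper is hypothetical rather than part of DMD.

```python
import sqlite3

# Toy stand-in for DMD's `renovate` table. Only columns shown in this talk
# are modelled (the real schema has more), and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table renovate (
      platform text, organisation text, repo text,
      package_name text, current_version text
    )
    """
)
conn.executemany(
    "insert into renovate values (?, ?, ?, ?, ?)",
    [
        ("github", "example", "api", "github.com/gin-gonic/gin", "v1.9.1"),
        ("github", "example", "web", "github.com/gin-gonic/gin", "v1.8.2"),
        ("github", "example", "web", "github.com/golang/mock", "v1.6.0"),
    ],
)


def most_used(conn, limit=5):
    """Return (package_name, repo_count) pairs, most widely used first."""
    return conn.execute(
        """
        select
          package_name,
          count(distinct platform || '/' || organisation || '/' || repo) as repos
        from renovate
        group by package_name
        order by repos desc, package_name asc
        limit ?
        """,
        (limit,),
    ).fetchall()


print(most_used(conn))
```

Against a real database you would swap the in-memory connection for sqlite3.connect("dmd.db") and skip the table setup.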

Advisories

Right now we can write SQL queries to ask:

what Terraform modules and versions are being used across the org?
which teams are using the Gin web framework?

But what if we could ask:

are any of our projects relying on libraries that are no longer recommended by our language guilds?
how much time should my team(s) be planning in the next quarter to upgrade their AWS infrastructure?

Dependency advisory data sources:

AWS infrastructure advisory data sources:

endoflife-checker

🤫

Custom advisories 🦸

Community provided advisories via -contrib:

INSERT INTO custom_advisories (
  package_pattern,
  package_manager,
  version,
  version_match_strategy,
  advisory_type,
  description
) VALUES (
  'github.com/golang/mock',
  'gomod',
  NULL,
  NULL,
  'UNMAINTAINED',
  'golang/mock is no longer maintained, and active development been moved to github.com/uber/mock'
);
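
To give a feel for how a row like this could surface against your dependencies, here is a small sketch using Python's sqlite3 module. It is not DMD's matching implementation: the tables are cut-down and hypothetical, package_pattern is treated as an exact name, version_match_strategy is ignored, and a NULL version is assumed to mean "every version".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, cut-down tables: the real DMD schema has more columns.
    create table renovate (
      repo text, package_name text, package_manager text, current_version text
    );
    create table custom_advisories (
      package_pattern text, package_manager text, version text,
      version_match_strategy text, advisory_type text, description text
    );
    insert into renovate values
      ('beats', 'github.com/golang/mock', 'gomod', 'v1.6.0'),
      ('beats', 'github.com/gin-gonic/gin', 'gomod', 'v1.9.1');
    insert into custom_advisories values
      ('github.com/golang/mock', 'gomod', NULL, NULL, 'UNMAINTAINED',
       'golang/mock is no longer maintained');
    """
)

# Assumption: a NULL version means the advisory applies to every version,
# and package_pattern is compared as an exact package name.
rows = conn.execute(
    """
    select r.repo, r.package_name, r.current_version, a.advisory_type
    from renovate r
    join custom_advisories a
      on  r.package_name    = a.package_pattern
      and r.package_manager = a.package_manager
    where a.version is null
    """
).fetchall()

print(rows)
```

Only the golang/mock row is flagged; the Gin dependency has no matching advisory in this toy data.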

Policies

# Versions of Gin >= 1.9
# NOTE: `prefix` is assumed to be defined elsewhere in the full policy
versions := split(input.dependency.version, ".")

deny contains msg if {
	input.dependency.package_manager in {"gomod", "golang"}
	input.dependency.package_name == "github.com/gin-gonic/gin"
	versions[0] in {"v1", "1"}
	to_number(versions[1]) >= 9
	msg := sprintf("%s. %s", [prefix, concat(" ", [
		"Versions of Gin since v1.9.0 have shipped",
		"ByteDance/sonic as an optional dependency, but it still",
		"appears as a dependency, and could be in use - more",
		"details in https://github.com/gin-gonic/gin/issues/3653"
	])])
}

Case Studies

https://dmd.tanna.dev/case-studies/

"What package advisories do I have?"

organisation	repo	package_name	current_version	dep_types	advisory_type	description
alphagov	pay-selfservice	node	18.20.4	["engines"]	DEPRECATED	nodejs 18 has been unsupported (usually only receiving critical security fixes) for 400 days
elastic	beats	github.com/golang/mock	v1.6.0	["require"]	UNMAINTAINED	golang/mock is no longer maintained, and active development been moved to github.com/uber/mock
monzo	response	python	3.7	[]	UNMAINTAINED	python 3.7 has been End-of-Life for 513 days

via https://dependency-management-data-example.fly.dev/report/advisories

Which other services may be affected by this production bug?

https://dmd.tanna.dev/case-studies/deliveroo-kafka-sidecar/

# there may be some folks using YAML anchors
sidecars: &sidecars
    # or there could also be comments in here!
    kafka: 1.2.3

app:
    image: "internal.docker.registry/service-name"
    sidecars: *sidecars

In the renovate table:

repo	package_name	current_version	package_file_path
good-service	internal-docker.tld/kafka	0.3.0	.hopper.yml
affected-service	internal-docker.tld/kafka	0.2.1	.hopper.yml
also-affected-service	internal-docker.tld/kafka	0.1.0	.hopper.yml

Parsed via https://docs.renovatebot.com/modules/manager/regex/

And the repository_metadata table:

repo	additional_metadata
good-service	`{"tier": "tier_1"}`
affected-service	`{"tier": "tier_1"}`
also-affected-service	`{"tier": "tier_2"}`

Slightly garish query:

select
  renovate.organisation,
  renovate.repo,
  current_version,
  owner,
  json_extract(additional_metadata, '$.tier') as tier
from
  renovate
  left join owners on
      renovate.platform     = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo         = owners.repo
  left join repository_metadata on
      renovate.platform     = repository_metadata.platform
  and renovate.organisation = repository_metadata.organisation
  and renovate.repo         = repository_metadata.repo
where
  renovate.package_name = 'internal-docker.tld/kafka'
  -- NOTE: this is a lexicographical (string) comparison, NOT a true version
  -- constraint match, so it can misorder versions such as 0.10.0, but it is
  -- a good start for this use case
  and renovate.current_version < '0.3'
order by
  tier ASC

Result:

organisation	repo	current_version	owner	tier
deliveroo	affected-service	0.2.1	Grocery	tier_1
deliveroo	also-affected-service	0.1.0		tier_2
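
The lexicographical caveat in that query is worth seeing in action. The sketch below, using Python's sqlite3 module, registers a hypothetical version_lt SQL function that compares dotted numeric versions as tuples of integers, and contrasts it with plain string comparison. The table is a toy, and hypothetical-service is an invented row added to trigger the mismatch.

```python
import sqlite3


def version_lt(left, right):
    """Return 1 when dotted numeric version `left` sorts before `right`."""
    def parse(version):
        return tuple(int(part) for part in version.split("."))
    return int(parse(left) < parse(right))


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name version_lt
conn.create_function("version_lt", 2, version_lt)
conn.execute("create table renovate (repo text, current_version text)")
conn.executemany(
    "insert into renovate values (?, ?)",
    [
        ("good-service", "0.3.0"),
        ("affected-service", "0.2.1"),
        ("hypothetical-service", "0.10.0"),  # invented row, newer than 0.3
    ],
)

# String comparison: '0.10.0' sorts before '0.3' because '1' < '3'
lexical = conn.execute(
    "select repo from renovate where current_version < '0.3' order by repo"
).fetchall()

# Numeric comparison: (0, 10, 0) is correctly treated as newer than (0, 3)
numeric = conn.execute(
    "select repo from renovate "
    "where version_lt(current_version, '0.3') order by repo"
).fetchall()

print("lexical:", lexical)
print("numeric:", numeric)
```

This parser only handles purely numeric versions; values such as v1.6.0 or 2.0-beta9 would raise an error inside the query, so a real implementation would want a proper version-parsing library for the ecosystem in question.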

Log4shell

https://dmd.tanna.dev/case-studies/log4shell/

We could use the dependenton query:

# for Gradle projects
$ dmd report dependenton --db dmd.db --package-manager gradle \
  --package-name org.apache.logging.log4j:log4j-core
+-------------------+---------+-------------------+----------------------------+-------------+
| REPO              | VERSION | DEPENDENCY TYPES  | FILEPATH                   | OWNER       |
+-------------------+---------+-------------------+----------------------------+-------------+
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
+-------------------+---------+-------------------+----------------------------+-------------+

Or start with SQL:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'

With the versions affected:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'
  and current_version in (
    '2.0-beta9',	'2.0-rc1',
    '2.0-rc2',		'2.0.1',
    '2.0.2',		'2.0',
    -- ....
    -- ....
    -- ....
    '2.13.0',		'2.13.1',
    '2.13.2',		'2.13.3',
    '2.14.0',		'2.14.1'
  )

Or with a Policy:

default advisory_type := "SECURITY"

versions := split(input.dependency.version, ".")
major := to_number(versions[0])
minor := to_number(versions[1])
patch := to_number(versions[2])

is_log4j2 if {
	input.dependency.package_manager in {"gradle", "maven"}
	input.dependency.package_name == "org.apache.logging.log4j:log4j-core"

	major == 2
}

# ...

Or with Policy (continued):

# ...

# CVE-2021-44228 aka Log4shell affects versions 2.0-beta9 to
# 2.14.1: first, the pre-releases, which don't parse as numbers
is_vulnerable_version if input.dependency.version in {
  "2.0-beta9", "2.0-rc1", "2.0-rc2"}

# ... then every release from 2.0 up to and including 2.14.x
is_vulnerable_version if {
	minor >= 0
	minor <= 14
}

deny contains msg if {
	is_log4j2
	is_vulnerable_version
	msg := "Dependency is vulnerable to Log4Shell CVE (CVE-2021-44228)"
}
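
For anyone more comfortable outside Rego, the same range check can be sketched in plain Python. This is an illustration of the logic only, not something DMD runs: the function and constant names are hypothetical, and it follows the range stated in the policy's comment (2.0-beta9 to 2.14.1) by treating any 2.x release with a minor version up to 14 as affected.

```python
# Pre-releases that don't parse as plain numbers, listed explicitly
PRE_RELEASES = {"2.0-beta9", "2.0-rc1", "2.0-rc2"}


def is_vulnerable_version(version):
    """Rough check for Log4j 2 versions in the 2.0-beta9 to 2.14.1 range."""
    if version in PRE_RELEASES:
        return True
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        # Anything we can't parse is left for a human to review
        return False
    return major == 2 and 0 <= minor <= 14


for candidate in ["2.0-beta9", "2.0.1", "2.13.0", "2.14.1", "2.15.0", "2.17.1"]:
    print(candidate, is_vulnerable_version(candidate))
```

Like the policy, this works on the major and minor components only, so it is a coarse first pass rather than a precise vulnerability match.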

Getting started

# produce some data that DMD can import, e.g.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN \
  your-org/repo another-org/repo
# or for GitLab
env RENOVATE_PLATFORM=gitlab npx @jamietanna/renovate-graph@latest \
  --token $GITLAB_TOKEN your-org/repo another-org/nested/repo

# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

https://dmd.tanna.dev/cookbooks/getting-started/

Resources

Questions?
