Quantifying your reliance on Open Source software

With dependency-management-data (DMD)

Jamie Tanna (https://www.jvt.me)
Senior Software Engineer @ Elastic

/usr/bin/whoami

Timeline of events

  • 2024-10: This talk!
  • 2023-07: First public talk
  • 2023-02: Created the dependency-management-data project
  • 2022-08: First iteration with Dependabot
  • 2019: "Formally" considering it
  • 2017: Hacking around

Why is it important?

As I wrote in the post Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest

But it's not always ☀🌈

xkcd comic showing a tower of various layers of boulders and stones, labelled "all modern digital infrastructure", which looks a little precarious. Towards the bottom there is a slim load-bearing stone which is labelled "a project some random person in Nebraska has been thanklessly maintaining since 2003"

💖

Do you fully appreciate the depth of your dependency on the software supply chain?

Being able to understand how your business uses Open Source is really important for a few other key reasons:

  • How am I affected by that dependency migrating away from Open Source?
  • Usages of unwanted libraries
  • Understand usage of libraries and frameworks, and their versions
  • Discovering unmaintained, deprecated or vulnerable software

Being able to understand how your business uses Open Source internal software is really important for a few other key reasons:

  • How am I affected by that dependency migrating away from Open Source?
  • Usages of unwanted libraries
  • Understand usage of libraries and frameworks, and their versions
  • Discovering unmaintained, deprecated or vulnerable software

Other insights into:

  • How maintained does the dependency appear to be?
  • How are the dependency's supply chain security practices? (via OpenSSF Security Scorecards)
  • How many dependencies are actively seeking financial support?

How can we do it?

💰🤑💸

              GitHub logo GitLab logo Snyk logo Mend logo

Let's use Open Source!

Open Source Initiative logo

What is dependency-management-data?

Dependency Management Data (DMD) - dmd.tanna.dev

What's in the project?

  • The outputted SQLite database
  • The command-line tool dmd
  • The web application dmd-web, and the GraphQL-only web application dmd-graph
  • (Your SQLite browser of choice)

SQLite database

  • Conveniently distribute, share
  • Great for local-only or building applications on top of it
  • No lock-in to dmd - all state synced to the DB

dmd

  • Create the SQLite database
  • Ingests different sources of dependency data ("datasources")
  • Enrich it with more data ("advisories", "dependency health")
  • Configure your own views over the data via "custom advisories" and "policies"
  • Provide common queries ("reports")

dmd-web

  • Centrally deployable and accessible
  • View reports in the browser
  • Datasette's excellent SQLite UI
  • sql-studio's excellent SQLite UI
  • Lightweight internal SQLite browser
  • GraphQL API
  • Great when deployed accessible to all (with authentication)

dmd-graph

  • Web application with only the GraphQL API
  • Great for internally deployed (for internal API access)

How did it come to be?

Idea for Open Source/Startup: monetising the supply chain

Analysing our dependency trees to determine where we should send Open Source contributions for Hacktoberfest

  • Using the Dependabot APIs
  • Good starting point
  • Lack of data for some ecosystems
  • Hard to parse the "current version"

via GIPHY

Mend Renovate logo

EndOfLife.date logo

commit 73a99614a2af6fa9f66508bab8541ed65e18ed66
Author: Jamie Tanna <>
Date:   Thu Feb 2 09:23:43 2023 +0000

    Initialise project

 LICENSE.md        | 13 +++++++++++++
 README.md         |  7 +++++++
 public/index.html | 90 ++++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)

How does it work?

# produce some data that DMD can import, i.e.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN your-org/repo
# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'
# set up the database
dmd db init --db dmd.db

# taking an SBOM that was produced from the GitLab repo
# https://gitlab.com/tanna.dev/dependency-management-data
dmd import sbom --db dmd.db '/path/to/sbom.json' --platform gitlab
  --organisation tanna.dev
  --repo dependency-management-data
# take an SBOM that was produced in some unknown place,
# and auto-detect what we can
dmd import sbom '/path/to/sbom.json'
# take an SBOM that was produced by a vendor
dmd import sbom '/path/to/sbom.json' --vendor ExampleCorp
  --product 'Web Server' --product-version 5.0.0

# optionally, generate advisories
dmd db generate advisories --db dmd.db
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

When importing:

  • Converts to an underlying data model (in SQLite)
  • Uses that for internal querying + enrichment

Once ingested, write SQL to your heart's content 🤓 i.e.

  • "which repos use a vulnerable version of Log4J"
  • "how many repos are using a version of the Datadog SDK that's older than ..."
  • "what is our most used direct/transitive dependency?"

Advisories

Right now we can write SQL queries to ask:

  • what Terraform modules and versions are being used across the org?

  • which teams are using the Gin web framework?

But what if we could ask:

  • are any of our projects relying on libraries that no longer are recommended by our language guilds?

  • how much time should my team(s) be planning in the next quarter to upgrade their AWS infrastructure?

via GIPHY

Dependency advisory data sources:

AWS infrastructure advisory data sources:

🤫

Custom advisories 🦸

Community provided advisories via -contrib:

INSERT INTO custom_advisories (
  package_pattern,
  package_manager,
  version,
  version_match_strategy,
  advisory_type,
  description
) VALUES (
  'github.com/golang/mock',
  'gomod',
  NULL,
  NULL,
  'UNMAINTAINED',
  'golang/mock is no longer maintained, and active development been moved to github.com/uber/mock'
);

Policies

Open Policy Agent logo
// Versions of Gin >= 1.9
deny contains msg if {
	input.dependency.package_manager in {"gomod", "golang"}
	input.dependency.package_name = "github.com/gin-gonic/gin"
	versions[0] in {"v1", "1"}
	to_number(versions[1]) >= 9
	msg := sprintf("%s. Versions of Gin since v1.9.0 have shipped " +
      "ByteDance/sonic as an optional dependency, but it still " +
      "appears as a dependency, and could be in use - more " +
      "details in " +
      "https://github.com/gin-gonic/gin/issues/3653", [prefix])
}

Case Studies

https://dmd.tanna.dev/case-studies/

"What package advisories do I have?"

organisationrepopackage_namecurrent_versiondep_typesadvisory_typedescription
alphagov pay-selfservice node 18.20.4 ["engines"] DEPRECATED nodejs 18 has been unsupported (usually only receiving critical security fixes) for 344 days
elastic beats github.com/golang/mock v1.6.0 ["require"] UNMAINTAINED golang/mock is no longer maintained, and active development been moved to github.com/uber/mock
monzo response python 3.7 [] UNMAINTAINED python 3.7 has been End-of-Life for 457 days

via https://dependency-management-data-example.fly.dev/report/advisories

Which other services may be affected by this production bug?

https://dmd.tanna.dev/case-studies/deliveroo-kafka-sidecar/

# there may be some folks using YAML anchors
sidecars: &sidecars
    # or there could also be comments in here!
    kafka: 1.2.3

app:
    image: "internal.docker.registry/service-name"
    *sidecars

In the renovate table:

repo package_name current_version package_file_path
good-service internal-docker.tld/kafka 0.3.0 .hopper.yml
affected-service internal-docker.tld/kafka 0.2.1 .hopper.yml
also-affected-service internal-docker.tld/kafka 0.1.0 .hopper.yml

Parsed via https://docs.renovatebot.com/modules/manager/regex/

And the repository_metadata table:

repo additional_metadata
good-service {"tier": "tier_1"}
affected-service {"tier": "tier_1"}
also-affected-service {"tier": "tier_2"}

Slightly garish query:

select
  renovate.organisation,
  renovate.repo,
  current_version,
  owner,
  json_extract(additional_metadata, '$.tier') as tier
from
  renovate
  left join owners on
      renovate.platform     = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo         = owners.repo
  left join repository_metadata on
      renovate.platform     = repository_metadata.platform
  and renovate.organisation = repository_metadata.organisation
  and renovate.repo         = repository_metadata.repo
where
  -- NOTE: that this is performed with a lexicographical match, which is NOT
  -- likely to be what you are expecting to perform version constraint matching
  -- but this is a good start for these use cases
  renovate.current_version < '0.3'
order by
  tier ASC

Result:

organisation repo current_version owner tier
deliveroo affected-service 0.2.1 Grocery tier_1
deliveroo also-affected-service 0.1.0 tier_2

Log4shell

https://dmd.tanna.dev/case-studies/log4shell/

We could use the dependenton query:

# for Gradle projects
$ dmd report dependenton --db dmd.db --package-manager gradle
  --package-name org.apache.logging.log4j:log4j-core
+-------------------+---------+-------------------+----------------------------+-------------+
| REPO              | VERSION | DEPENDENCY TYPES  | FILEPATH                   | OWNER       |
+-------------------+---------+-------------------+----------------------------+-------------+
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| logstash          | 2.17.1  | ["dependencies"]  | logstash-core/build.gradle | Elastic     |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
| fake-private-repo | 2.13.0  | ["dependencies"]  | blank-java/build.gradle    | Jamie Tanna |
+-------------------+---------+-------------------+----------------------------+-------------+

Or start with SQL:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'

With the versions affected:

select
  platform,
  organisation,
  repo,
  current_version
from
  renovate
where
  package_name = 'org.apache.logging.log4j:log4j-core'
  and current_version in (
    '2.0-beta9',	'2.0-rc1',
    '2.0-rc2',		'2.0.1',
    '2.0.2',		'2.0',
    -- ....
    -- ....
    -- ....
    '2.13.0',		'2.13.1',
    '2.13.2',		'2.13.3',
    '2.14.0',		'2.14.1',
  )

Or with a Policy:

default advisory_type := "SECURITY"

versions := split(input.dependency.version, ".")
major := to_number(versions[0])
minor := to_number(versions[1])
patch := to_number(versions[2])

is_log4j2 if {
	input.dependency.package_manager in {"gradle", "maven"}
	input.dependency.package_name =
      "org.apache.logging.log4j:log4j-core"

	major == 2
}

// ...

Or with Policy (continued):

// ...

// CVE-2021-44228 aka Log4shell affects versions 2.0-beta9 to
// 2.14.1
is_vulnerable_version if input.dependency.version in {
  "2.0-beta9", "2.0-rc1", "2.0-rc2"}

// CVE-2021-44228 aka Log4shell affects versions 2.0-beta9 to
// 2.14.1
is_vulnerable_version if {
	minor > 0
	minor <= 14
}

deny contains msg if {
	is_log4j2
	is_vulnerable_version
	msg := "Dependency is vulnerable to Log4Shell CVE " +
      "(CVE-2021-44228)"
}

Getting started

# produce some data that DMD can import, i.e.
npx @jamietanna/renovate-graph@latest --token $GITHUB_TOKEN \
  your-org/repo another-org/repo
# or for GitLab
env RENOVATE_PLATFORM=gitlab npx @jamietanna/renovate-graph@latest \
  --token $GITLAB_TOKEN your-org/repo another-org/nested/repo

# set up the database
dmd db init --db dmd.db
# import renovate-graph data
dmd import renovate --db dmd.db 'out/*.json'
# then you can start querying it
sqlite3 dmd.db 'select count(*) from renovate'

https://dmd.tanna.dev/cookbooks/getting-started/

Resources

Questions?

via GIPHY