PyPI, the Python Package Index, began evaluating ways to reduce the amount of identifying information that it stores even before the US Justice Department came asking for data on suspect users.
But now that the code repository has disclosed receiving three subpoenas for data on five users earlier this year, the Python community package registry wants developers to understand that it's working to minimize the user data that it stores.
The goal is not to make it impossible to respond to lawful requests for information; rather, it's to store only the minimum amount of data necessary, so that users are not exposed to unnecessary privacy intrusion.
Coincidentally, data minimization may prevent organizations from becoming a preferred source of on-demand surveillance: having excessive amounts of information about users invites legal demands, which staff then have to handle.
While data demands from authorities are commonplace among large commercial internet services, like GitHub, we're unaware of previous public reports about subpoenas directed at open source software package registries.
Samuel Giddins, who helps maintain RubyGems, told The Register, "As far as we know, RubyGems has not received any subpoenas for user data."
Mike Fiedler, a member of the PyPI admin team, said in a statement on Friday that the organization's effort to improve user privacy and security dates back to 2020.
Since the receipt of the subpoenas in March and April, that effort has been reinvigorated.
Much of the concern focuses on IP address data, which gets stored in web access logs; in user events such as logins; in project events including uploads; in events associated with the recently introduced organizations feature; and in administrative PyPI journal entries.
According to Fiedler, PyPI was able to stop storing IP data for journal entries – an append-only transaction log – because these were only exposed to administrators.
"Other places where we currently still need IP data include rate limiting, and fallbacks until we have backfilled the IP data with hashes and geo data," said Fiedler. "Our modern approach has evolved from using the IP data at display time to find the relevant geo data, to storing the geo data directly in the database."
To obscure IP addresses, PyPI is salting them – adding an arbitrary value – and then hashing them – running the data through a one-way scrambling function that creates a value called a hash. This provides a way to store a reference to potentially identifying data without actually storing raw data.
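For readers who want to see the idea in code, here is a minimal sketch of salting and hashing an IP address. It is not PyPI's implementation: the choice of SHA-256, the `hash_ip` function name, and the salt value are all assumptions made for illustration.

```python
import hashlib
import ipaddress


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a hex digest of a salted IP address (illustrative only)."""
    # Normalise the textual form so equivalent spellings hash identically.
    canonical = ipaddress.ip_address(ip).compressed
    # Assumption: the salt is simply prepended and SHA-256 is the hash.
    return hashlib.sha256(salt + canonical.encode("ascii")).hexdigest()


salt = b"example-salt-kept-outside-the-database"  # hypothetical value
digest = hash_ip("203.0.113.7", salt)

print(len(digest))                                 # 64
print(digest == hash_ip("203.0.113.7", salt))      # True
print(digest == hash_ip("203.0.113.7", b"other"))  # False
```

The same address and salt always produce the same digest, so records can still be correlated, while a different salt yields an unrelated value.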
Fiedler explains that while hashing is supposed to be non-reversible, it still may be possible to undo IP address hashes by brute force because the known address space is so small.
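The brute-force concern can be illustrated with a short sketch. It searches only a 256-address block to keep the run time trivial, but the same loop over the whole IPv4 space (2^32, or about 4.3 billion addresses) is within reach of ordinary hardware. The function names, the use of SHA-256, and the salt value are assumptions for illustration, not details from PyPI.

```python
import hashlib
import ipaddress
from typing import Optional


def digest(ip: str, salt: bytes = b"") -> str:
    """Hash an address, optionally with a salt prepended (assumed scheme)."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()


def brute_force(target: str, network: str, salt: bytes = b"") -> Optional[str]:
    """Try every address in `network` until one hashes to `target`."""
    for addr in ipaddress.ip_network(network):
        if digest(str(addr), salt) == target:
            return str(addr)
    return None


secret_salt = b"kept-outside-the-database"  # hypothetical value
leaked_unsalted = digest("198.51.100.42")
leaked_salted = digest("198.51.100.42", secret_salt)

# An unsalted hash falls to simple enumeration.
print(brute_force(leaked_unsalted, "198.51.100.0/24"))            # 198.51.100.42
# A salted hash does not, unless the attacker also holds the salt.
print(brute_force(leaked_salted, "198.51.100.0/24"))              # None
print(brute_force(leaked_salted, "198.51.100.0/24", secret_salt)) # 198.51.100.42
```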
"By applying a salt, we require someone to possess both the salt and the hashed IP addresses to brute force the value," he said. "Our salt is not stored in the database while the hashed IP addresses are, we protect against leaks revealing this information."
PyPI has been using its CDN provider Fastly to pass along a salted hash of the IP address for requests via a custom header, along with GeoIP data (where the user is located), and is using that instead of the raw IP address.
In April, the registry adopted code changes for hashing and salting IP addresses for requests that PyPI handles directly in Warehouse, the web application that implements the official Python package index.
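The two paths described above, a hash supplied by the CDN and a hash computed by the application itself, might be combined along the following lines. This is a sketch under stated assumptions rather than Warehouse's actual code: the header name `X-Hashed-IP`, the `request_ip_hash` function, and the SHA-256 scheme are all hypothetical.

```python
import hashlib
from typing import Mapping

HASHED_IP_HEADER = "X-Hashed-IP"  # hypothetical header name


def request_ip_hash(headers: Mapping[str, str], remote_addr: str, salt: bytes) -> str:
    """Prefer the CDN-supplied hash; otherwise hash the address locally."""
    supplied = headers.get(HASHED_IP_HEADER)
    if supplied:
        # The CDN already hashed the client address; the raw IP is never used.
        return supplied
    # Fallback for requests the application handles directly.
    return hashlib.sha256(salt + remote_addr.encode("ascii")).hexdigest()


salt = b"example-salt"  # hypothetical value
print(request_ip_hash({"X-Hashed-IP": "abc123"}, "192.0.2.10", salt))  # abc123
print(len(request_ip_hash({}, "192.0.2.10", salt)))                    # 64
```

In a real deployment such a header would only be trusted on requests known to have arrived through the CDN, since a client connecting directly could otherwise set it to anything.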
And over the past few days, it has been replacing IP addresses in the PyPI user interface with geolocation data.
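As a rough picture of what storing geolocation data in place of an address could look like, the sketch below models an event record that carries a hashed IP and coarse location fields, and renders only the location for display. The `EventRecord` class, its field names, and the sample values are hypothetical and are not taken from PyPI's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical event row: no raw IP address is kept."""
    tag: str
    hashed_ip: str
    city: Optional[str] = None
    country_code: Optional[str] = None


def display_location(event: EventRecord) -> str:
    """Build the string shown in the UI from the stored geo fields."""
    parts = [p for p in (event.city, event.country_code) if p]
    return ", ".join(parts) if parts else "Unknown location"


event = EventRecord("login", "0f3a", city="Lisbon", country_code="PT")
print(display_location(event))                      # Lisbon, PT
print(display_location(EventRecord("upload", "9c")))  # Unknown location
```

Because the location is captured when the event is written, nothing needs to be looked up from an IP address when the page is rendered.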
PyPI still relies on IP address information to identify abuse – the creation of malicious packages, harassment, and so on – but Fiedler says even that is being looked at. "We're thinking about how to manage that without storing IP data, but we're not there yet," he said.
Fiedler says the PyPI team will be weighing whether it can remove IP data from event history records after a period of time and whether the service can handle all its requests via CDN.
That may just kick the privacy can of worms upstream to Fastly, however. The Register asked Fastly whether it has received subpoenas for PyPI IP address data. We've not heard back. ®