Docker publishes its first public post-mortem on an internal incident
As a sign of its commitment to better supporting developers, Docker has just described how it resolved an incident that occurred in early July on its Docker Hub container image repository. The issue was investigated jointly with the AWS and Cloudflare teams.
In keeping with the strategy it outlined last November, Docker is moving closer to developers. In March it shared its roadmap with them and asked for their feedback, and this month it has published a “post-mortem” detailing an internal incident affecting its cloud services. This is the first time the container specialist has done so publicly, the stated aim being to build a relationship of trust between its users and its teams. In November, after selling its Enterprise business to Mirantis, Docker announced plans to refocus on developers with its Desktop and Hub products.
The post published on August 12 by Brett Inman, Docker’s senior engineering manager, describes an incident that occurred in early July on the Docker Hub container image repository. The account of its resolution illustrates how complex the interactions between different cloud service operators can become. Between 7:00 p.m. UTC on July 5 and 6:30 a.m. on July 6, Amazon Linux users in multiple regions experienced interruptions when downloading Docker images stored on Docker Hub. “The problem stemmed from an anti-botnet protection mechanism deployed by our CDN provider Cloudflare,” explains Brett Inman. The Docker, Cloudflare, and AWS teams worked together to identify the cause, and the mechanism in question was disabled, which restored service.
Anatomy of an incident in the cloud
Around 1:45 a.m. UTC on Monday, July 6 – still Sunday on the Pacific coast – Docker was contacted by AWS about the inability of various services and users to pull images from Docker Hub, reports Brett Inman. The infrastructure and repository teams immediately investigated the malfunction. After checking, they found nothing abnormal in the repository itself, nor in the AWS infrastructure. “This told us that the problem was more likely related to a specific region or to a mechanism within the affected services,” continues the engineering manager. Notified by AWS that the affected systems were those running Amazon Linux (including higher-level services such as Fargate), the Docker team began launching instances running Amazon Linux and another operating system in different AWS regions. It turned out that both operating systems worked fine in the us-east-1 region, while in the other regions Amazon Linux failed to pull images and the other OS succeeded. “The fact that us-east-1 worked for both OSes made it appear that the problem was with our CDN, Cloudflare.” Indeed, in the us-east-1 region Docker Hub images are stored in S3 buckets and requests are served directly from S3; in the other regions they go through the CDN.
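To give an idea of what such a per-region check can look like, here is a minimal Python sketch, not Docker’s actual tooling: it fetches an image manifest through Docker Hub’s public registry API, then requests one layer without following the redirect, so the Location header shows whether the blob would be served through the CDN or directly. The repository name is only an example, and the interpretation of the redirect target is an assumption on our part; run from hosts in several AWS regions, comparing the output would expose the kind of asymmetry the teams observed.

```python
# Minimal sketch (not Docker's tooling): check where Docker Hub would serve an
# image layer from, by stopping at the registry's redirect instead of following it.
import json
import urllib.error
import urllib.request

REPO = "library/hello-world"   # any public image works for this check
TAG = "latest"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise the 307 instead of following it,
    # which lets us inspect where the registry wanted to send us.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def fetch_json(url, headers):
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as r:
        return json.load(r)

# 1. Anonymous pull token for the repository (public Docker Hub auth endpoint).
token = fetch_json(
    "https://auth.docker.io/token"
    f"?service=registry.docker.io&scope=repository:{REPO}:pull", {})["token"]
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.docker.distribution.manifest.v2+json, "
              "application/vnd.docker.distribution.manifest.list.v2+json",
}

# 2. Resolve the tag to a platform-specific manifest and take one layer digest.
manifest = fetch_json(f"https://registry-1.docker.io/v2/{REPO}/manifests/{TAG}", headers)
if "manifests" in manifest:    # multi-arch image: follow the first entry
    first = manifest["manifests"][0]["digest"]
    manifest = fetch_json(f"https://registry-1.docker.io/v2/{REPO}/manifests/{first}", headers)
layer_digest = manifest["layers"][0]["digest"]

# 3. Ask for the blob but stop at the redirect: the target hostname tells us
#    whether the layer would come through the CDN or straight from object storage.
opener = urllib.request.build_opener(NoRedirect)
blob_req = urllib.request.Request(
    f"https://registry-1.docker.io/v2/{REPO}/blobs/{layer_digest}", headers=headers)
try:
    resp = opener.open(blob_req)
    print("blob served directly, HTTP", resp.status)
except urllib.error.HTTPError as err:
    print("blob redirected to:", err.headers.get("Location"))
```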
Docker then contacted Cloudflare to open an incident, and the three providers studied the problem together. Building on a discovery made by AWS while comparing the differences between Amazon Linux and the other OS, Cloudflare found that some traffic to Docker Hub was being dropped by an anti-botnet mitigation system that had added a detection rule marking packets with a certain attribute as potentially being part of an attack. Although the mechanism was monitored by Cloudflare, this particular interaction had not been spotted before. Once the mechanism was disabled, Docker Hub traffic resumed and the incident was closed at 6:30 a.m. UTC.
Removal of container images inactive for 6 months
Regarding its container image repository, Docker also announced a few days ago that it has changed its data retention policy. On free Docker Hub accounts, the provider has just introduced a maximum retention period of six months for inactive images: after six months of inactivity, images will be scheduled for deletion. To keep them longer, users simply need to subscribe to a paid account, Pro or Team, priced from $5 and $7 per month respectively.
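Maintainers wondering which tags might fall under the new policy can get a rough picture from Docker Hub’s public API. The Python sketch below is only illustrative: the repository name is hypothetical, and the last_updated field exposed by the tags endpoint reflects pushes, whereas pulls also count as activity, so the result is an approximation of the retention clock rather than an official check.

```python
# Minimal sketch: list the tags of a public Docker Hub repository and flag those
# whose last push is older than roughly six months. Pull activity, which also
# resets the retention clock, is not visible here, so treat this as approximate.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

REPO = "library/alpine"        # hypothetical example repository
CUTOFF = timedelta(days=183)   # roughly the six-month inactivity window

url = f"https://hub.docker.com/v2/repositories/{REPO}/tags/?page_size=100"
with urllib.request.urlopen(url) as resp:
    tags = json.load(resp)["results"]

now = datetime.now(timezone.utc)
for tag in tags:
    if not tag.get("last_updated"):
        continue
    pushed = datetime.fromisoformat(tag["last_updated"].replace("Z", "+00:00"))
    stale = now - pushed > CUTOFF
    print(f"{tag['name']:20} last pushed {pushed:%Y-%m-%d}"
          + ("  <- inactive, could be scheduled for deletion" if stale else ""))
```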