Site Reliability Engineering at Starship | by Martin Pihlak | Starship Technologies


Martin Pihlak
Photo by Ben Davis, Instagram slovaeck_

Running autonomous robots on city streets is a real software engineering challenge. Some of this software runs on the robot itself, but much of it actually runs in the backend: things like remote control, path finding, matching robots with customers, managing the health of the fleet, but also interactions with customers and merchants. All of this needs to run 24/7, without interruption, and scale dynamically to accommodate the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services that these core services run on. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging, Kafka is the platform of choice and we use it for just about everything other than shipping video streams from the robots. On the observability side we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
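To make the stack above a little more concrete, here is a minimal sketch of how a backend service in this kind of setup might expose Prometheus metrics and a health endpoint in Go. The service port, metric name, and handler are made up for illustration; this is not Starship's actual code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is a hypothetical counter; a real service would also track
// request latencies, queue depths, business metrics, and so on.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "example_requests_total",
	Help: "Requests handled, labelled by HTTP status.",
}, []string{"status"})

func handle(w http.ResponseWriter, r *http.Request) {
	requestsTotal.WithLabelValues("200").Inc()
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handle)
	// Liveness/readiness endpoint for Kubernetes probes.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Prometheus scrapes this endpoint; Grafana dashboards the result.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```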

A good portion of SRE’s time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our primary deployment platform and there is always something to improve, whether it’s fine-tuning autoscaling settings, adding pod disruption policies, or optimizing Spot instance usage. Sometimes it’s like laying bricks – you just install a Helm chart to provide a particular piece of functionality. Often, however, the “bricks” have to be carefully chosen and evaluated (is Loki any good for log management, is Service Mesh a thing, and if so, which one) and sometimes the functionality doesn’t exist in the world and has to be written from scratch. When this happens, we usually turn to Python and Golang, but also Rust and C if necessary.

Another important part of the infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDb – a strategy that has worked well so far. However, as the business grows, we need to revisit this architecture and start thinking about supporting robots by the thousands. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering, and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDb observability with a custom sidecar proxy to analyze database traffic, enabling PITR support for databases, automating regular failover and recovery tests, collecting metrics for Kafka re-sharding, enabling data retention.
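As an illustration of the sidecar proxy idea, here is a minimal sketch in Go of a TCP proxy that sits between an application and mongod and logs how many bytes flow in each direction per connection. The ports and the plain byte counting are purely illustrative assumptions; a real traffic-analysis sidecar would decode the MongoDB wire protocol and export proper metrics instead of log lines.

```go
package main

import (
	"io"
	"log"
	"net"
	"sync"
	"time"
)

const (
	listenAddr   = ":27018"          // hypothetical port the application connects to
	upstreamAddr = "127.0.0.1:27017" // the real mongod next to the sidecar
)

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	log.Printf("proxying %s -> %s", listenAddr, upstreamAddr)
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		go handle(conn)
	}
}

// handle shuttles bytes between the client and mongod and records
// per-connection traffic. A real sidecar would parse the MongoDB wire
// protocol here to attribute traffic to individual commands and collections.
func handle(client net.Conn) {
	start := time.Now()
	upstream, err := net.Dial("tcp", upstreamAddr)
	if err != nil {
		log.Printf("dial upstream: %v", err)
		client.Close()
		return
	}

	var sent, received int64
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		sent, _ = io.Copy(upstream, client) // client -> mongod
		upstream.Close()                    // unblock the other direction
	}()
	go func() {
		defer wg.Done()
		received, _ = io.Copy(client, upstream) // mongod -> client
		client.Close()
	}()
	wg.Wait()

	log.Printf("session %s: %d bytes to mongod, %d bytes back",
		time.Since(start).Round(time.Millisecond), sent, received)
}
```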

Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!

A day in the life of an SRE

Arrive at work, sometime between 9 and 10 am (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night, see if there is anything interesting there.

Find that MongoDb connection latencies have spiked during the night. Digging into the Prometheus metrics with Grafana, find that this happens while backups are running. Why is this suddenly a problem, we’ve been running these backups for ages? Turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all available CPU. It looks like the load on the database has grown just enough to make this noticeable. It happens on a standby node, with no impact on production, but it remains a problem should the primary fail. Add a Jira item to fix this.

While at it, change the MongoDb prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run a Jenkins pipeline to put the new probe into production.
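A sketch of what such a change might look like, assuming a prober built on the official MongoDB Go driver and the Prometheus client library; the metric name, bucket boundaries, and probe function are hypothetical, not Starship’s actual prober:

```go
package prober

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

// connectDuration tracks how long it takes to connect to and ping MongoDb.
// The explicit buckets are finer-grained than the Prometheus defaults, so
// that tail latencies (e.g. during backups) show up clearly.
var connectDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "mongodb_probe_connect_duration_seconds", // hypothetical metric name
	Help:    "Time to establish a connection and ping MongoDb.",
	Buckets: []float64{0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
})

// Probe connects to the given MongoDb URI, pings it, and records the
// observed latency in the histogram above.
func Probe(ctx context.Context, uri string) error {
	start := time.Now()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
	if err != nil {
		return err
	}
	defer client.Disconnect(ctx)

	err = client.Ping(ctx, readpref.Primary())
	connectDuration.Observe(time.Since(start).Seconds())
	return err
}
```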

At 10am there is a Standup meeting, share your updates with the team and find out what the others have been up to – setting up monitoring for a VPN server, instrumenting a Python application with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, driving Canary deployments with Flagger.

After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We are running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there’s a good Kafka operator available now? No, not going there – too much magic, I want more explicit control over my statefulsets. Raw YAML it is. An hour and a half later, a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating credentials for the applications required a small bash script to set up the accounts on Zookeeper. One item that was left open was setting up Kafka Connect to capture database change log events. Backlog it and move on.
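Once the test cluster is up, a quick smoke test is handy to confirm that the brokers are reachable and the freshly generated credentials actually work. A minimal sketch using the segmentio/kafka-go client is below; the broker address, topic name, and consumer group are made-up examples, and the snippet assumes the topic already exists (or broker-side auto-creation is enabled) and omits the SASL/TLS configuration a real cluster would need.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

const (
	broker = "kafka-test-0.kafka-test:9092" // hypothetical broker address
	topic  = "smoke-test"                   // hypothetical test topic
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Produce one message to the new cluster.
	w := &kafka.Writer{
		Addr:  kafka.TCP(broker),
		Topic: topic,
	}
	if err := w.WriteMessages(ctx, kafka.Message{Value: []byte("hello from the new cluster")}); err != nil {
		log.Fatalf("produce: %v", err)
	}
	w.Close()

	// Read it back to confirm the round trip works.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{broker},
		Topic:   topic,
		GroupID: "smoke-test-check", // hypothetical consumer group
	})
	defer r.Close()

	msg, err := r.ReadMessage(ctx)
	if err != nil {
		log.Fatalf("consume: %v", err)
	}
	log.Printf("round trip OK: %q", msg.Value)
}
```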

Now it’s time to prepare a scenario for the Wheel of Misfortune exercise. At Starship, we run these to improve our understanding of systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having an unlucky person try to troubleshoot and mitigate the problem. In this case, I will set up a load test with hey to overload the microservice for route calculations. Deploy it as a Kubernetes job called “haymaker” and hide it well enough that it doesn’t immediately show up in the Linkerd service mesh (yes, evil 😈). Later, run the “Wheel” exercise and take note of the gaps we have in playbooks, metrics, alerts, etc.
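For flavor, here is roughly what such a load generator could look like if written by hand instead of using hey; in practice hey (or a similar tool) does this better, and the target URL and flag defaults below are purely hypothetical.

```go
// haymaker: a tiny HTTP load generator, in the spirit of hey.
package main

import (
	"flag"
	"io"
	"log"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	target := flag.String("url", "http://route-service.test.svc:8080/route", "endpoint to hammer (hypothetical)")
	workers := flag.Int("c", 50, "number of concurrent workers")
	duration := flag.Duration("d", 10*time.Minute, "how long to keep the pressure on")
	flag.Parse()

	var ok, failed int64
	deadline := time.Now().Add(*duration)
	client := &http.Client{Timeout: 10 * time.Second}

	var wg sync.WaitGroup
	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Get(*target)
				if err != nil {
					atomic.AddInt64(&failed, 1)
					continue
				}
				// Drain the body so connections can be reused.
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
				if resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	wg.Wait()
	log.Printf("done: %d ok, %d failed", ok, failed)
}
```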

In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as an asynchronous streaming one (Rust + Tokio) and want to see how well it works with real data. Turns out there is a bug somewhere in the guts of the parser and I need to add some deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it …

Disclaimer: The events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with coworkers have been edited out. We are hiring.


