This is part of a series of contributed articles leading up to KubeCon + CloudNativeCon in October.
Let’s explore the experience of companies trying to build their own software distribution tools. This hypothetical scenario relies on a software-as-a-service (SaaS) company and/or a traditional on-premises software company delivering their applications to Kubernetes (K8s) customer environments in the cloud for the first time. Think of it as a composite of many people’s experiences. Hope you don’t make the same mistakes!
A timeline of hope and pain
Day 0 – The sales or product team asks engineering a simple question: “Can we bring our SaaS application to our customers’ self-hosted Kubernetes environments?” or “Now that we’ve modernized our app into containers, can we distribute it to customer-managed clusters in the cloud?” Either way, what they’re really saying is, “Prospective customers keep asking us for this, and we leave money on the table every time we say ‘no’.”
Day 1 – How hard could it be? The lead engineer spends the weekend sketching a rough solution and is excited to build something new. It seems easy enough to reconfigure the application to work in any AWS or customer-hosted environment, right? Maybe we could use Terraform.
Day 30 – Field engineers deliver the application to the first customer-hosted K8s cluster running in an AWS Virtual Private Cloud (VPC). With some engineering ride-alongs and some patience from the customer, the app is finally deployed. High fives!
Day 45 – The lead engineer has shipped several updates and fixes to the new K8s “on-prem” installer to make it work. The production install runs in a different environment than the proof of concept, it doesn’t behave the same way, and no one knows why. More and more engineering time is spent on Zoom with the customer, whose frustration is steadily mounting. Modernization, innovation, and other back-end work is starting to slip, and this project is looking more complex than anticipated. The sales team gets nervous about their accounts and escalates to management.
Day 60 – The project is no longer fun, and it continues to absorb time and people. The Terraform scripts fail security reviews at some companies. The lead engineer asks their manager to get them off the project ASAP because they’re burning out. The company doesn’t want to stop the project, because product and sales are close to signing this customer. There is a staggering number of on-prem K8s-based opportunities in the pipeline, and in this economy the VP of sales doesn’t want to walk away from any revenue. The head of engineering reluctantly assigns more engineers to the in-house installer project, delaying the schedule for other planned application features and innovations.
Day 180 – A lot has happened in the past four months. New customers are running the installer, but each has a different environment and different installation requirements. Some examples:
- While the first customer accepted the Ubuntu-based installer, the next customer wanted a RHEL installer. The team spent two weeks building a second package and extending the CI/CD pipeline to build and test it in parallel with the Ubuntu-based package.
- Two government and financial-services prospects needed air-gapped installers. The engineers decided this was too much effort given everything else going on. This is a huge blow to the revenue opportunity that drove the idea in the first place.
Day 270 – With mixed failures and successes, the on-prem K8s install initiative limps along. More issues keep emerging. The installation success rate hovers around 50%, with half of all installation attempts ending in customer frustration and lost confidence. Other customers and prospects keep asking for it, and a number of large accounts are now deployed with it, so turning back seems impossible, but the quagmire is getting deeper:
- One customer’s scans flag common vulnerabilities and exposures (CVEs) that block the installation, and it’s a late-night, all-hands-on-deck scramble to patch the vulnerabilities and stabilize everything again.
- Several customers have now (automatically) upgraded their Linux operating systems, which unfortunately broke application packages, requiring rework and updates to the installer. This seems to happen at least once a quarter.
- Mysterious storage and networking failures required more than 10 hours of hands-on troubleshooting over several weeks.
- The first installer customer still hasn’t upgraded and is at risk from unpatched bugs that were fixed long ago in newer versions. Since the first release was not built with a self-service upgrade path in mind, engineers spend another 10+ hours helping the customer perform a very manual migration to the latest version of the tool.
- Despite management’s efforts to bring other team members onto the project, the lead engineer who built the first release is still constantly pulled into escalated on-prem installation support.
- One end customer modified the base Ubuntu image, changing the names of all virtual network interfaces. Obscure networking issues caused problems until this change was discovered.
- In environments where a customer brings their own Kubernetes stack, the team encounters 10 different flavors of Kubernetes ingress that need to be supported through application configuration. Each takes hours to fix and pulls time away from other engineering work.
- Many end customers demand enterprise long-term support (LTS) releases, which leads to internal chaos and more firefighting. The team needs to either hire and train a lot of Kubernetes-savvy support engineers or keep escalating to engineering.
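Several of the failures above (surprise OS upgrades, unsupported Kubernetes flavors) could at least be detected before an install starts rather than mid-failure. As a minimal sketch, with hypothetical supported-version ranges standing in for a real support matrix, a preflight check might look like:

```python
# Hypothetical preflight check run before the installer touches the host.
# The OS names and version ranges below are illustrative assumptions,
# not the actual support matrix from the article.

SUPPORTED_OS = {"ubuntu": ("20.04", "22.04"), "rhel": ("8.6", "9.3")}
SUPPORTED_K8S = ("1.24", "1.28")  # inclusive min/max versions

def version_tuple(v):
    """Turn '22.04' into (22, 4) so versions compare numerically."""
    return tuple(int(x) for x in v.split("."))

def check_os(name, version):
    rng = SUPPORTED_OS.get(name.lower())
    if rng is None:
        return f"FAIL: unsupported OS '{name}'"
    lo, hi = rng
    if not version_tuple(lo) <= version_tuple(version) <= version_tuple(hi):
        return f"FAIL: {name} {version} outside supported range {lo}-{hi}"
    return "PASS"

def check_k8s(version):
    lo, hi = SUPPORTED_K8S
    if not version_tuple(lo) <= version_tuple(version) <= version_tuple(hi):
        return f"FAIL: Kubernetes {version} outside supported range {lo}-{hi}"
    return "PASS"

if __name__ == "__main__":
    print(check_os("ubuntu", "22.04"))
    print(check_k8s("1.21"))
```

A check like this doesn’t make unsupported environments work, but it converts a failed install and a multi-hour Zoom call into a clear error message before anything is deployed.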
Day 360 – A year in, the frustrated and exhausted engineering team holds another all-hands-on-deck meeting to reset and decide what to do. Everyone is afraid of being rotated onto the on-prem install team; some people actively scheme to get off it. A few seasoned engineers sit permanently on the team because they know that without them, a major source of revenue would be at risk. Engineering and product leadership agree to de-emphasize new feature work so the team can invest up to 50% of its time over the next three months in the installer. While they’re at it, engineering agrees to spend a significant amount of time building the air-gapped installer that more and more customers are asking for. The team draws up a wish list of everything they want:
- Set up CI/CD and automated testing of all application versions in all supported environments.
- Convert the ragtag collection of hard-to-maintain bash scripts used to collect diagnostic information into a CLI tool that can be shipped with the installer. Consolidate them into a framework that lets field engineers contribute to the list of information being collected. Stretch goal: compile the internal scripts used to analyze these log bundles for common errors into a tool that end customers can run in their own environments.
- Standardize so the team can focus on a single architecture and installation method, and so solution architects working with customers don’t need to hack together exotic custom configurations for specific customer environments.
- Give customers the option to bring an external database instead of using the in-app data store. This should help address some of the catastrophic storage and networking failures.
- Introduce snapshot-and-restore functionality that will work in most customer environments, based on a hunch that this will involve SSH File Transfer Protocol (SFTP), Network File System (NFS), storage area networks (SANs) and possibly others. Do some discovery with the product team and several key customers to determine the scope.
- Automate scanning for CVEs in all code and enforce a policy of never shipping a version with unpatched CVEs for which a fix is available.
- Invest time in ensuring that the developer build/test cycle for on-premises environments can be shortened from 10+ minutes to less than 30 seconds.
- Automate testing of all installer versions across a rapidly growing multidimensional support matrix of OS versions, Kubernetes releases, add-ons, cloud providers, and other dimensions.
- Build a defined “area of responsibility” with the product team to ensure that new versions of operating systems are supported within 30 days of release.
- Adopt a strict policy of deprecating older versions to reduce the total number of things that need to be maintained and patched.
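The diagnostics item on the wish list is essentially a plugin registry: each collector registers under a stable name, and the tool emits one bundle instead of a pile of ad-hoc bash output. A minimal sketch, where the collector names and commands are illustrative assumptions rather than the team’s actual tooling:

```python
# Sketch of consolidating scattered diagnostic bash scripts into one
# collector framework. Field engineers add a collector by registering a
# function instead of editing ad-hoc scripts. Collector names/commands
# are hypothetical examples.
import json
import subprocess

COLLECTORS = {}

def collector(name):
    """Register a diagnostic collector under a stable name."""
    def register(fn):
        COLLECTORS[name] = fn
        return fn
    return register

@collector("os-release")
def collect_os_release():
    try:
        with open("/etc/os-release") as f:
            return f.read()
    except OSError as e:
        return f"unavailable: {e}"

@collector("disk-usage")
def collect_disk_usage():
    try:
        out = subprocess.run(["df", "-h"], capture_output=True, text=True)
        return out.stdout or out.stderr
    except OSError as e:
        return f"unavailable: {e}"

def build_bundle():
    """Run every registered collector and return one JSON support bundle."""
    return json.dumps({name: fn() for name, fn in COLLECTORS.items()}, indent=2)

if __name__ == "__main__":
    print(build_bundle())
```

Because every collector degrades to an “unavailable” entry instead of crashing, the same bundle format works across the customer environments that keep surprising the team.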
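The multidimensional support matrix on the wish list can be enumerated mechanically, so CI spawns one test job per combination instead of relying on someone remembering which OS/K8s pairs to test. A small sketch, with illustrative dimension values:

```python
# Sketch of enumerating a support matrix so CI can run one installer test
# per combination. The dimensions and values are illustrative assumptions.
from itertools import product

MATRIX = {
    "os": ["ubuntu-22.04", "rhel-9"],
    "k8s": ["1.26", "1.27", "1.28"],
    "cloud": ["aws", "on-prem"],
}

def test_jobs(matrix):
    """Yield one job description per combination of matrix dimensions."""
    keys = sorted(matrix)
    for values in product(*(matrix[k] for k in keys)):
        yield dict(zip(keys, values))

jobs = list(test_jobs(MATRIX))
print(len(jobs))  # 2 OSes * 3 K8s versions * 2 clouds = 12 combinations
```

This also makes the cost of each new dimension explicit: adding one more OS here doubles the job count, which is exactly the maintenance pressure behind the deprecation policy above.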
Day 390 – The team is making progress, and even the key engineers who built version 1 are getting involved again. Some improvements have shipped and momentum is building, but there is still a lot to do. The most experienced people are still pulled into the many support escalations with new and existing customers.
Day 480 – The three-month sprint has now stretched to six months. With half the team still improving the build/test/distribution/support platform for on-prem installs, development of the app’s features continues to lag. Work on the air-gapped installer has not yet reached the prototype stage. With half of the back-end team focused on infrastructure-specific tasks, the front-end engineers assigned to the SaaS application and other modernization efforts are constantly running out of things to do. The two engineers who built version 1 of the installer, who had the deepest knowledge of the project, leave to join small startups founded by former colleagues, disappointed and utterly exhausted. This sets the team back even further.
Some might read this and conclude that distributing software to on-premises, customer-managed K8s and private-cloud environments isn’t worth it. But 80% of all software spending still goes to applications that aren’t pure SaaS, and most organizations now expect applications to be K8s-compatible. We are also seeing a growing trend of applications being repatriated from the cloud for reasons of security, compliance, performance, and cost. There must be a better way to solve the hard problems described above while still expanding the addressable market!
#hidden #pain #homegrown #K8sbased #software #distribution