When we began our quest to replace our legacy software solution for Internet Service Providers (ISPs), we had a clear path in mind. We wanted to build a proof of concept, turn it into an MVP (Minimum Viable Product) and take it from there to add more sophisticated features later on.
Yet the reality would provide a much more difficult road to navigate.
This is part 2 in our 3-part series on transitioning away from legacy technology. See Part 1 here if you missed it.
The key deliverable for us when making an MVP was to reimagine the code using a modern programming language and new tools under a new infrastructure, one that would serve as the basis for a scalable, rapid and reliable solution.
A potential systemic overhaul has a lot more to offer than incremental change, which is something that Formula 1 teams know very well. Each year, the rules and regulations in the world's biggest motorsport racing competition are updated, and teams must design, build and test all the systems that make their car for the season.
This means a lot of hardware and software will be replaced, which makes the Formula 1 example particularly valid for our software topic. The teams know the possibilities, they explore deeper, test and obtain data that helps them advance research and development decisions. Each choice obviously has a trade-off.
Do you want to spend more on aerodynamics? Will it be enough for the power unit? What if instead you focus on maximizing horsepower? Can your current cooling systems handle that change well?
At the end of the process, every team does what it thinks is the optimal solution given their capabilities and talent. Yet the actual performance of these decisions coupled together is only known once the car hits the track. Especially on race day.
Imagine you spend hundreds of millions in R&D, but your cars break down in their first real-world test. This actually happened to McLaren in 1999. Coming in as reigning champions and wanting to expand their lead over rivals, they pushed for an aggressive overhaul of their systems.
At the first race of the calendar, in Melbourne, Australia, the new McLarens were impressive early on. In just 13 laps they had built a lead of 18 seconds over the third-placed car. That is a significant gap in Formula 1, but soon the problems began.
First went David Coulthard's car, which had to be retired over transmission/gearbox problems. Then it was Mika Hakkinen's turn, with multiple mechanical and electrical issues. Both cars were out of the race, despite their breathtaking pace, leaving fans worried and engineers with plenty of work.
While this was entertaining for people who rooted for rival teams, such as myself, it must have been hard to explain within the organizational structure of McLaren.
Of course, having both cars drop out and score zero points is disastrous, but does it mean that the entire process that generated the new intellectual property was wrong? Does it mean that they are worse off?
The answer is a resounding no. The new changes bring new challenges, but the opportunity for a performance increase is enormous once the process of optimization takes place.
It is like approaching a problem and its solution with new thinking: it can yield a big differentiating outcome, but only if you iterate and test enough to clear out the new problems.
That year, McLaren optimized their new changes and went on to win the drivers' championship, while the constructors' championship was lost to Scuderia Ferrari, mostly over the reliability issues that plagued the British team.
We could argue that their mistake was to put too much emphasis on performance; perhaps they thought that the reliability factors would be easier to optimize. But the analogy with software breaks down in that time spans are much shorter in Formula 1, and options do not abound.
Formula 1 teams cannot choose to simply stay with what made them successful last year, and wait it out until the new generation systems are mature enough. They can, however, decide to progressively roll out upgrades during the season. These carry an element of risk, but can also make an important competitive difference by the end of the season.
The alternative (if such a thing were allowed) would be to stay with the old "package", as they say in Formula 1, and continue to fix known problems. This is not viable because there is not much to gain, and while other competitors take riskier approaches with newly learned data, you face the much more difficult scenario of being left behind.
In software companies, the predicament is very similar. You could spend a lot of time fixing the issues you currently have, but the codebase and infrastructure cannot be pushed much further if you are working with legacy technologies.
In other words, you may have the time, but you might not be able to afford to wait for change, as the market landscape shifts around you.
Once you gain certain stability with legacy systems, there is not much room for scaling up and offering more efficiency for existing features, let alone for new features. Technology moves fast.
There is always a better way to make the wheel go faster, and if you get too comfortable, you get locked out of the next opportunity.
This is what we faced when making our new solution for network statistics and provisioning, Strings. We had a large number of lessons learned from our old product. We knew what we wanted, and what we wanted to avoid.
Our starting point in January 2018 was mostly about developing and testing for the right mix, the right package, that could meet our needs and wants. So naturally we developed many versions of it.
It never made us reconsider the undertaking, but we did make countless changes to every single part of the package. The guiding principle was simple: it has to perform way better than its predecessor.
If it turned out to be nicer-looking, or nicer-loading, but not significantly different, we were wasting our time.
The past was marked by monolithic applications running on physical servers that were hard to install, cumbersome to maintain and impractical to patch or update.
The present has been largely shaped by virtualization, and naturally belongs to serverless architectures that employ microservices, where backend and frontend functions are disaggregated and processing efficiency is paramount.
It is important to note that using the new technologies that provide these capabilities does not automatically grant you all the benefits for your use case. You still have to adapt the codebase and the infrastructure to do best what you are ultimately responsible for.
In our case, this is directly connected to being able to bring in a constant data stream of metrics that are relevant for ISPs, deduce insights and present them in a useful, action-oriented manner. We knew we could not fail in this department.
What helped us advance the most was that we quickly realized we had enemies of the future, so to speak, in people who were close to the legacy code. You cannot move forward if you have people in your team who doubt the need for change.
Legacy solutions are hard to replace because their existence is based on success. It is a hard task to tell people who contributed to that success that their work is now obsolete, or insufficient.
The industry shifts that made it this way were felt slowly and gradually by them, as they spent less time exploring, and more time maintaining what they knew.
So our team changed entirely over the span of 1.5 years. New ideas were embraced, and the focus was placed on doing things in a better way.
Along our path, we had several moments that called for optimization, similar to having high-performing cars that come up short in an unexpected area. To overcome this, we had to think deeply about what we could be missing. The picture would become clearer with each iteration.
At one point, we rewrote our metrics collector in Go because we spotted the opportunity to gain reliability. At a different moment, we changed our approach to networking inside our Command Line Interface (CLI). Then we created our own Strings Metrics Processor to significantly reduce the load involved in providing metrics.
It was a fantastic moment of optimization. It is recognizable by everyone as the moment where the new makes the old seem completely inadequate.
Our customers were able to try out early versions of the new product, which helped our research, design and engineering teams to improve the scope significantly.
Yet the ever-present question from our customers was "So can I start using it?" Our answer was an awkward no, as we had to explain that this was a preview, and that the mission-critical features were not ready yet.
Their excitement was followed by disbelief, as if to say "How come?" In most cases, we had to explain that it was not a matter of porting over a feature from the legacy system.
We wanted to get it right with the new approach and not repeat the thinking of the past, so most of the time was spent building a solid foundation on which to build.
New features had to be done differently, and while this was a sobering realization for our customers, we were lucky that we had earned their patience and respect for the process.
This allowed us to validate our assumptions, and obtain new clarity regarding the way the features should work. It gave us work for the short, medium and long term.
Perhaps the biggest learning outcome is that you should never shy away from doing demos. I have personally done over 20 demos of different versions of the product (not counting prototypes), and each one went differently. At the 12th or so, it was evident that a cumulative set of assumptions was forming.
The key is to turn these assumptions into actionable items that you can prototype, define and plan for. It is a team effort, as you have to validate these clues and ideas with everyone, to avoid getting carried away with something that might not be of general interest.
You have to be able to categorize the user types, their company sizes and favored practices. With this data you can determine how specific or universal the feedback can be. But you still need a growing number of representative samples.
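To make that categorization concrete, here is a minimal sketch of one way to tag demo feedback and measure how widely a theme is shared. It is an illustration only, not our actual tooling: the `Feedback` record, the `segment_coverage` helper and every sample value below are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One piece of demo feedback (all fields are hypothetical labels)."""
    theme: str          # what was asked for or commented on
    user_type: str      # e.g. "network engineer", "support agent"
    company_size: str   # e.g. "small", "medium", "large"


def segment_coverage(items):
    """Return, per theme, the fraction of observed segments that raised it.

    A segment is a (user_type, company_size) pair. A value close to 1.0
    suggests universal feedback; a value close to 0 suggests it is
    specific to one kind of customer.
    """
    all_segments = {(f.user_type, f.company_size) for f in items}
    segments_by_theme = defaultdict(set)
    for f in items:
        segments_by_theme[f.theme].add((f.user_type, f.company_size))
    return {
        theme: len(segments) / len(all_segments)
        for theme, segments in segments_by_theme.items()
    }


# Made-up sample notes, purely for illustration.
demo_notes = [
    Feedback("bulk provisioning", "network engineer", "large"),
    Feedback("bulk provisioning", "network engineer", "small"),
    Feedback("bulk provisioning", "support agent", "small"),
    Feedback("custom report export", "support agent", "small"),
]

coverage = segment_coverage(demo_notes)
for theme, share in sorted(coverage.items()):
    print(f"{theme}: raised by {share:.0%} of segments")
```

With only a handful of notes the ratios mean little, which is why the number of representative samples has to keep growing before the scores can guide a roadmap.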
In each demo, there is always something else to learn about the way I think about our features, the manner in which they are perceived, and the things that work versus the aspects that need improvement. It comes down to curating the best insights, so that you can highlight and deepen the ones that will define your roadmap.
We were able to significantly alter the way the system functions, and the way users interact with the data. Now as we prepare to work on bringing these changes to production, get ready for Part 3, the final instalment in our series. It will cover the reasons for the user experience changes.
Did you like the Formula 1 topic? In light of the cancellation of the Australian Grand Prix, take a look at the following article that covers team spending through the years, and how the impact of aerodynamics gave way to the current relevance of the power unit, a field dominated by a few manufacturers. There are plenty of examples of new R&D projects that do not fall within the optimization curve and head straight to failure: F1 Metrics.