I recently talked with James Kao, Head of Engineering for APM at Broadcom. James leads the global development team to deliver the AIOps and application monitoring solutions that power the world's most successful businesses. He has had a diverse background in application development and monitoring across development, product management, and solution engineering roles over the last 20 years at both large enterprises and startups such as Oracle, ClearApp, and The Middleware Company.
It was great to get to know him and learn more about the history of Broadcom APM and where it is headed. Following are some of the highlights from our discussion.
How has APM evolved since you started working in the space?
James: Well, it's evolved a lot. When I started working in this area, people didn't know what APM was, and we had to spend a lot of time trying to make the case for it to customers. Now, in modern-day deployments, APM is part of a required set of technology that follows the lifecycle of an application out into production. In many cases, especially for our larger customers, an application isn’t production-ready unless APM is actively monitoring it.
What we’ve seen more recently is the diversification of APM tools that customers use. There has been a rise in developer-delivered monitoring that is instrumented directly in code. At the same time, the traditional technology that started APM has continued: byte code instrumentation and other types of automated or external instrumentation injected by the tool.
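To make the contrast concrete, here is a minimal Python sketch of the two styles. It is only an analogy: the `record` sink, the `checkout_*` functions, and the `instrument` wrapper are hypothetical names invented for illustration, and real byte code instrumentation in an APM agent works at a lower level than a Python wrapper.

```python
import functools
import time

# Hypothetical in-memory sink standing in for an APM backend.
RECORDED = []


def record(name, seconds):
    """Store one timing measurement (illustrative only)."""
    RECORDED.append((name, seconds))


# Style 1: developer-delivered monitoring, written directly in the code.
def checkout_manual(cart):
    start = time.perf_counter()
    try:
        return sum(cart)
    finally:
        record("checkout_manual", time.perf_counter() - start)


# Style 2: the business code carries no monitoring logic at all.
def checkout_plain(cart):
    return sum(cart)


def instrument(func):
    """Wrap a function with timing, without editing its source."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record(func.__name__, time.perf_counter() - start)
    return wrapper


# The "tool" applies the wrapper from outside, for example at load time.
checkout_auto = instrument(checkout_plain)
```

In the first style the developer decides what to measure; in the second, the measurement is added externally and the application code stays unchanged.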
When you started working in APM, did you expect the technology to be where we are today?
APM is fundamentally about monitoring lines of code. And while the technology (i.e., programming languages and different runtimes) has changed over the years, what constitutes a line of code has largely not changed, which is surprising.
One of the most surprising things is an almost back-to-basics direction over the last 20 years, which has also affected APM. Many of the fundamentals, such as what constitutes a transaction trace and how diagnostic data is modeled (in the form of exceptions and stack traces), remain the dominant paradigm across programming languages, regardless of their age.
What APM topics are top of mind for you right now?
The scale of the system, the proliferation of the types of systems, and the growth of both applications and important application components are all top of mind for our customers. While underlying code paradigms have remained remarkably constant over the last 20 years and novel abstractions have tended to die out, application structure has become more componentized.
Now, a particular application might be composed of a set of microservices. These microservices are themselves made up of many servers, which in turn contain functions underneath them.
Our APM solution must help manage the proliferation of these mid-level grouping constructs. As our customers’ applications grow, the mid-level components within each of those applications have also been growing. Therefore, the number of things that we're monitoring, as well as the total size and quantity of applications, has grown tremendously over the last 10 years. The top-of-mind challenge for our customers today is managing that scale, both because they are deploying more applications and because those applications have more components to monitor.
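A rough back-of-the-envelope sketch shows how both growth drivers multiply the number of monitored entities. The function name and every number below are made-up assumptions for illustration, not figures from Broadcom or its customers, and the model assumes a uniform hierarchy of applications, microservices, servers, and functions.

```python
def monitored_entities(apps, services_per_app, servers_per_service, functions_per_server):
    """Count every entity in a uniform app > microservice > server > function tree."""
    services = apps * services_per_app
    servers = services * servers_per_service
    functions = servers * functions_per_server
    return apps + services + servers + functions


# One application with 3 microservices, 4 servers each, 5 functions per server.
baseline = monitored_entities(1, 3, 4, 5)          # 1 + 3 + 12 + 60 = 76

# Growth driver 1: deploying more applications.
more_apps = monitored_entities(2, 3, 4, 5)         # 152

# Growth driver 2: more components inside the same application.
more_components = monitored_entities(1, 6, 4, 5)   # 1 + 6 + 24 + 120 = 151
```

Doubling either the number of applications or the number of microservices per application roughly doubles what has to be monitored, and the two effects compound when they happen together.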
What challenges is Broadcom focused on solving for customers?
The biggest challenge that we want to solve is to provide the infrastructure for centralized production readiness. There’s a lot of back and forth in the industry around how much of the production readiness process should be centralized versus devolved into organizations.
This is the central debate around DevOps. How much should tool sprawl become permissible and perhaps even desirable versus centralized discipline around production readiness?
There is not a single right answer between centralization and devolution. Our customer base is very focused in many ways on the value that centralized production readiness provides, because devolving responsibility for production readiness far down into individual small development scrum teams isn’t possible for them.
Many of our customers are not building new applications from scratch. They have existing applications that may have been a vertical stack in the past but are now purely backend. For example, they may have a green screen mainframe application that is now accessed only by other applications, which provide modern APIs on top of it and use it as a pure backend.
There is a mixture of pure backend and pure front-end applications that have different application lifecycles. Some are new code, some are old code, and some are super old code. Some applications are crucial but only being maintained, others are crucial and still being built, and others sit at every point in between. In many of those cases, enterprises need to provide centralized guidance and standards across all the different scenarios.
In many cases, there needs to be a team that is dedicated to production readiness rather than pushing production readiness down to each of the individual teams. How that responsibility is shared, meaning how much is done by individual teams versus how much is done by a central team, should vary from organization to organization. At Broadcom, our focus is enabling that at scale.
Figure 1: The current version of DX APM continues a long history of innovation for APM technology from Broadcom.
What is the history of Broadcom APM and how has the product evolved? Why did Broadcom decide to rearchitect the platform and why at that particular time?
The history of Broadcom APM goes way back. Wily Technology created the APM category with byte code instrumentation in 1998. Wily was later acquired by CA Technologies, and Broadcom then acquired CA Technologies. Along the way, we added topology-based root cause analysis, zero-configuration agents, mobile-to-mainframe transaction tracing, support for container and cloud monitoring, and more.
Broadcom APM has evolved considerably over time. The big shift comes at the confluence of perhaps two things. The first is the extreme growth in the size of our customers and the need to rearchitect the way the application worked.
This type of re-architecture is not entirely new and an earlier version of this is something that many APM technologies, including Wily, have gone through. In the early days of Wily, data was stored in commodity SQL databases or the datastore was external technology.
Then as customers grew, that technology could not scale to meet customer needs. This is a common occurrence. When you launch a company, you start with a database vendor as your datastore. And then as that data grows, you need to shift over to custom storage mechanisms that take advantage of the nature of the data that you're storing.
For APM, that key technology shifted from commodity databases to what we call the smart store, which is our own data storage technology. Then, when we reached the limits of the smart store, we needed to achieve an order of magnitude larger scale for customers and an order of magnitude larger scale for us to manage our SaaS solution.
That led to the development of what we call our next-gen DX APM technology. That, at its heart, is what we call the network-attached smart store. It is the next evolution of smart store, where the data layer can then be spread and scaled horizontally across many different systems, which enables us to reorganize and refactor the way that our product deploys and scales.
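As a generic illustration of what spreading a data layer horizontally can look like, the sketch below assigns metric keys to storage nodes by hashing. This is not a description of how the network-attached smart store works internally; `node_for`, `distribution`, and the key format are hypothetical, and simple modulo hashing is used only because it is short (it reshuffles most keys when the node count changes, which is why production systems often prefer schemes such as consistent hashing).

```python
import hashlib
from collections import Counter


def node_for(metric_key: str, node_count: int) -> int:
    """Map a metric key to a node index using a stable hash."""
    digest = hashlib.sha256(metric_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribution(keys, node_count):
    """Count how many keys land on each node."""
    return Counter(node_for(key, node_count) for key in keys)


# Hypothetical metric keys: one CPU metric per agent.
keys = [f"agent-{i}|cpu" for i in range(10_000)]

load_on_2_nodes = distribution(keys, 2)
load_on_8_nodes = distribution(keys, 8)
```

With more nodes, each node holds a smaller share of the keys, which is the basic property that lets capacity grow by adding systems rather than by enlarging a single one.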
With SaaS, we can respond to customer load demand quickly if a customer wants to onboard a large set of agents. A single administrator with access to scale the underlying cloud infrastructure can do this, and the system can self-orchestrate that expansion in capacity.
This is the same thing needed by our large-scale customers. As our customers started to approach the fundamental limits in the sizes of their deployments and their clusters, we had to find ways of enabling them to expand easily while remaining on a single installation.
Those two things have driven the re-architecture of the APM platform, starting with the move from a local smart store to the network-attached smart store (NASS) and continuing in the surrounding technologies of containerization, microservices, OpenShift, and Kubernetes. This enables the underlying orchestration to occur between all the different potential choke points in the system and ensures that we don't have any single-point bottlenecks and can scale out horizontally, virtually without limit.
What are the challenges and benefits Broadcom customers experience with the new platform, DX APM?
The challenge and benefit both have to do with managing scale. One aspect of managing scale is the server-side expansion. It’s being able to handle tenants with a hundred thousand agents and being able to create those environments within minutes.
Another aspect is enabling our customers to onboard and use the platform effectively. If you imagine a customer that wants to grow from 100,000 to 200,000 agents, they may exceed the capacity of existing clusters and need some way to grow server-side capacity.
If they have 100,000 agents, it is a non-starter to have to go back and refactor such a large body of agents. It's part of the reason why you need to have a strong centralized server and metric storage technology.
We have technologies in place on the server side that maintain compatibility with those existing agents so that customers can avoid the difficult volume challenge and redirect that telemetry seamlessly to a much more scalable next-generation architecture. That architecture contains both the capacity and the tooling necessary to manage those large amounts of scale.
Any advice to our Broadcom APM customers as they continue to mature their APM capabilities?
The short answer is to learn about Kubernetes and OpenShift, not only because they are at the heart of our product but also because container orchestration has become the foundation for scalable application deployment across on-premises, public, and hybrid clouds.
The deeper answer is to engage in an APM Center of Excellence (COE). It’s not just about keeping APM running and checking whether your alerts are working; it is about finding your seat at the table from the beginning, not just at the Go / No Go decision but starting with design. Remember: you have a lot of knowledge about how production systems should be monitored, and you need to be pushing within your organization to be the thought leaders and to set the standard.
For more on Broadcom's next-generation DX APM, read 11 Reasons Why You Should Migrate.