Implementing a Service Mesh with Istio; Understanding Requirements
It has been a little over a year since I started working with Istio at work. Over that time I have gained some valuable experience implementing service infrastructure with Istio, and I decided it is worth documenting that experience as a series of blog posts. My experience with Istio mainly revolves around Kubernetes (Azure Kubernetes Service) and migrating a legacy monolith to a microservice architecture deployed in Kubernetes clusters. However, I will try my best to keep my writing vendor and platform agnostic.
As we gradually broke down our monolith, the number of services we deployed grew day by day. Due to the legacy nature of our application, there were a couple of auxiliary services that we had no plans to move to Kubernetes at all, and it was decided from the beginning that the database would also be hosted in a virtual machine. These scenarios led us to a few challenges that we thought we could solve by introducing a service mesh into our service infrastructure. These are the main challenges that we identified as our requirements for a service mesh implementation:
- Facilitate service discovery and communication between services spread across multiple deployment modes and locations.
- Manage traffic among services based on a least-privilege access model.
- Ensure security both at the perimeter of the service network and within it.
- Improve the quality of service communications without changing the implementation of the service logic to handle network communications.
Traditionally, services are deployed behind a load balancer with some hardcoded IP addresses. When you are handling a large number of services, maintaining these load balancer records and service IP addresses becomes a tedious process. When services scale up or down, the infrastructure loses agility because the load balancers are (in many cases) maintained manually. Another option that addresses this to some extent is the Service resource in Kubernetes. While it might cover the service discovery and load balancing aspects of the requirements, the security, service identity management and observability needs may remain unsolved.
Our expectation was that an ideal service mesh implementation should complement the built-in Service resources of Kubernetes by providing security, service identity management and observability capabilities. Not by implementing them from the ground up, but by integrating with tools that are already in use to cover some of these needs. For example, it should integrate easily with Zipkin for tracing service traffic within applications.
On top of that, it would be interesting to be able to extend the service mesh to services in non-Kubernetes environments as well. This leads us to think about a solution that facilitates traffic management within the service mesh using a least-privilege access model. Traditionally, with giant monoliths, it is natural to guard the network perimeter with a firewall or a network security group. While that is still relevant for securing the perimeter of the service network, relying on a firewall or a network security group to control traffic between services could compromise the agility or the scalability of the service network.
One solution is to introduce routing rules that explicitly define which services can communicate with each other, which protocols they may use and, in the case of L7 traffic, which HTTP actions other services are allowed to perform against each service. If a service mesh implementation can abstract this functionality and provide it to the services, it takes a huge burden off the service implementation. It also gives the services the agility to scale up or down without worrying about managing traditional firewall rules.
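To make the idea of explicit routing rules a bit more concrete, here is a minimal sketch in Python. It is not how Istio or any particular mesh implements this; the service names, the `Rule` shape and the `is_allowed` helper are all made up for illustration. The only point is the default-deny behaviour: a call goes through only when a rule names the caller, the destination, the protocol and (for HTTP) the method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str                       # calling service
    destination: str                  # service being called
    protocol: str                     # "http" or "tcp"
    methods: frozenset = frozenset()  # allowed HTTP methods, used only for "http"


# Hypothetical rule set: anything not listed here is denied.
RULES = [
    Rule("frontend", "orders", "http", frozenset({"GET", "POST"})),
    Rule("orders", "database", "tcp"),
]


def is_allowed(source, destination, protocol, method=None, rules=RULES):
    """Default deny: allow only if some rule explicitly matches the request."""
    for rule in rules:
        if (rule.source, rule.destination, rule.protocol) != (source, destination, protocol):
            continue
        if protocol == "http" and method not in rule.methods:
            continue
        return True
    return False
```

With these rules, `is_allowed("frontend", "orders", "http", "GET")` passes, while a `DELETE` from the same caller, or any direct call from `frontend` to `database`, is rejected because no rule mentions it.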
Again, limiting communication among services using routing rules may not be enough. Leaving the service network wide open inside the service mesh is a recipe for disaster if anyone manages to get in through the network perimeter. An explicit authorization mechanism is required among services and, of course, the implementation of the service should not have to worry about it.
Maintaining identities for the services is one approach to solving this. Services can identify themselves using TLS certificates in service-to-service communication. When one service communicates with another, each presents its certificate to identify itself to the other, and together they establish an encrypted channel. This is the idea behind mutual TLS (mTLS), and the service mesh should meet this need by providing a zero trust network among the services.
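As a rough illustration of what "both sides present a certificate" means, this is how it could look with Python's standard `ssl` module if a service had to do it by hand. The file path parameters and function names are placeholders I made up, and in a real mesh this is handled for you outside the service code; treat it as a sketch of the handshake requirements only.

```python
import ssl


def require_peer_certificate(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Make the handshake fail unless the other side presents a trusted certificate."""
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def mtls_server_context(cert_file, key_file, ca_file):
    """Context for the service that accepts connections."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # this service's identity
    ctx.load_verify_locations(cafile=ca_file)                  # CA that signs peer identities
    return require_peer_certificate(ctx)


def mtls_client_context(cert_file, key_file, ca_file):
    """Context for the service that opens connections."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # verifies the server certificate by default
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # presented when the server asks
    ctx.load_verify_locations(cafile=ca_file)
    return require_peer_certificate(ctx)
```

The difference from ordinary TLS is on the server side: a plain server context does not ask the client for a certificate, whereas here `CERT_REQUIRED` makes the server reject any caller that cannot prove its identity.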
Maintaining routing rules and managing service identities centrally might overload the central service registries in the service mesh. It would be ideal to have all of these communication rules and authorization decisions enforced at the service level, so that the central registry does not become the bottleneck of the entire service infrastructure. Therefore, using the sidecar pattern to manage traffic flow and authorization could improve the efficiency of the service mesh.
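A toy model may help show why the sidecar pattern keeps the central registry off the request path. In the sketch below (class and method names are mine, not taken from any mesh), the central component only pushes configuration; every authorization decision is answered from the sidecar's local copy.

```python
class Sidecar:
    """Toy proxy that sits next to one service and enforces rules locally."""

    def __init__(self, service):
        self.service = service
        self.allowed_callers = frozenset()  # local copy; empty means deny everyone
        self.version = 0

    def apply_config(self, version, allowed_callers):
        """Accept a pushed configuration, ignoring stale versions."""
        if version > self.version:
            self.version = version
            self.allowed_callers = frozenset(allowed_callers)

    def authorize(self, caller):
        """Decided from the local copy; no call to the central registry."""
        return caller in self.allowed_callers


class ControlPlane:
    """Toy central registry: holds the desired state and pushes it to sidecars."""

    def __init__(self):
        self.sidecars = []
        self.version = 0

    def register(self, sidecar):
        self.sidecars.append(sidecar)

    def publish(self, policies):
        """Push a new version of the policies to every registered sidecar."""
        self.version += 1
        for sidecar in self.sidecars:
            sidecar.apply_config(self.version, policies.get(sidecar.service, ()))
```

After `publish({"orders": ["frontend"]})`, the sidecar for `orders` can answer `authorize("frontend")` on its own, however many requests arrive, and it keeps enforcing the last configuration it received even if the central component is briefly unavailable.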
Finally, microservices come in various shapes and forms. There could be services written in different languages, targeting different operating systems and deployed in various modes, for example virtual machines, containers, databases, serverless functions, etc. Some of these services may not be aware of L7 traffic, or may use their own protocols for communication. It is essential that the service mesh acts as a compatibility layer for these services.
The capability to handle L4 traffic is an essential part of this. It allows the service mesh to authenticate, authorize, manage, and secure the traffic between services without being aware of the application-level protocol. In the case of L7 traffic, however, the service mesh can complement the service infrastructure by adding more intelligent, context-aware traffic management.
These are some of the challenges and requirements we identified when adopting a service mesh to manage the traffic of our microservices. They say “talk is cheap, show me the code”, so in the next post I am hoping to discuss managing ingress traffic, securing ingress traffic with mTLS, and service-to-service traffic encryption, with some examples.