Working with microservices: Part II – A sequence of challenges

In the first part of this story, I told you how, after 15 years of experience as a full stack developer of Windows applications in C# and/or C/C++, I was given an opportunity to try something new: my team and I would take on a microservices project. Here is the second chapter of the story, along with the many challenges it brought:

Creating and referencing the NuGet packages

Creating and referencing the first NuGet packages was nothing short of a challenge. First, we wondered whether it made sense to separate classes from interfaces, each into their own packages. We looked at the Microsoft conventions and decided that it did. Time confirmed that we had made the right decision, because this way we could respect the Dependency Inversion principle. This helped us create relevant unit tests, and it also supported other important business decisions, like using two e-mail reading protocols within the same business code without making that code aware of the actual protocols.
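To make the idea concrete, here is a minimal sketch of that separation. It is written in Python rather than C# only to keep it short, and every name in it (MailReader, ImapReader, Pop3Reader, collect_subjects) is hypothetical; the choice of IMAP and POP3 as the two protocols is also an assumption made for illustration.

```python
from typing import Protocol


class MailReader(Protocol):
    """The abstraction; in our setup this would live in the 'interfaces' package."""

    def fetch_subjects(self) -> list[str]: ...


class ImapReader:
    """Hypothetical implementation; a real one would talk to an IMAP server."""

    def fetch_subjects(self) -> list[str]:
        return ["imap: invoice"]


class Pop3Reader:
    """Hypothetical implementation; a real one would talk to a POP3 server."""

    def fetch_subjects(self) -> list[str]:
        return ["pop3: order"]


class FakeReader:
    """Test double: lets a unit test run without any real mailbox."""

    def fetch_subjects(self) -> list[str]:
        return ["fake: hello"]


def collect_subjects(readers: list[MailReader]) -> list[str]:
    """Business code: depends only on the abstraction, never on a protocol."""
    subjects: list[str] = []
    for reader in readers:
        subjects.extend(reader.fetch_subjects())
    return subjects
```

The business function can be handed both readers at once, or a fake one in a unit test, without changing a line of it.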

Next, we didn’t know how permissive we should be when referencing the four components of each package’s version: <Major>.<Minor>.<Build number>.<Revision>. It took us some time before we decided to differentiate packages by their origin: the global NuGet server, the infrastructure application server, and the package server dedicated to our own application.

We concluded that the least trusted were the external packages from the global NuGet server, which we would reference with all four components fixed. Second came the infrastructure application packages, which we would reference with only one free component: <Major>.<Minor>.<Build number>.* (e.g.: 1.2.3.*). The most trusted packages remained, of course, our own, which we referenced with two free components: <Major>.<Minor>.* (e.g.: 1.2.*).
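The effect of these patterns can be sketched in a few lines. This is not how NuGet itself resolves versions; it is a simplified model, assuming that a trailing * accepts any value for the remaining components and that the highest matching version wins. The function names and version numbers are made up for illustration.

```python
def matches(pattern: str, version: str) -> bool:
    """Check a dotted version against a pattern with an optional trailing '*'."""
    wanted = pattern.split(".")
    actual = version.split(".")
    if wanted[-1] == "*":
        prefix = wanted[:-1]
        # Only the components before the '*' have to be identical.
        return actual[: len(prefix)] == prefix
    return actual == wanted


def resolve(pattern: str, available: list[str]):
    """Return the highest available version that satisfies the pattern, or None."""
    candidates = [v for v in available if matches(pattern, v)]
    if not candidates:
        return None
    # Compare numerically, so that 1.2.10 ranks above 1.2.3.
    return max(candidates, key=lambda v: tuple(int(part) for part in v.split(".")))


feed = ["1.2.3.0", "1.2.3.7", "1.2.10.1", "1.3.0.0"]
own_package = resolve("1.2.*", feed)        # two free components
infrastructure = resolve("1.2.3.*", feed)   # one free component
external = resolve("1.2.3.0", feed)         # everything fixed
```

The more components are left free, the more new releases are picked up automatically, which is convenient for trusted packages and risky for the others.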

The non-functional execution of microservices

The next major challenge was the “blind” execution of microservices. This meant that they should be able to run without being completely functional.

We had many steps to perform for each microservice, and it took a long time to discover all of them. Once we had accomplished this, we documented the process in a file that we shared with our colleagues from the infrastructure application team, who validated it and filled in several blanks. Up to that moment, I had felt that all we did was grope around and debug. From that moment on, it was as if a blindfold had been lifted from our eyes.

Installing the infrastructure application and using it in the code

The infrastructure application depended, as I mentioned previously, upon the RabbitMQ library and platform. This library itself depended upon another application: Erlang OTP.

When my colleagues and I started the installation on each of our local machines, I had the poor inspiration to install these two dependencies separately, without realizing they were already included in the installer of the infrastructure application. That set me back a few hours. The reason? The versions installed together with the application were different from those I had installed manually, and once one version was installed, parts of it became incompatible with the other.

To get this mess fixed, I asked for help from one of my colleagues who worked on the development of the infrastructure application. He spent about 2-3 hours at my desk, debugging and cleaning up. In the end, he managed to make things work, for which I am very grateful.

Meanwhile, one of our teammates had stumbled upon another problem. The Message Broker service, the one based on RabbitMQ, was not functioning properly. The conflict, as another colleague observed, came from a communication port that was also used by a different application installed on his machine. Luckily, the port was configurable in that application, and we could change it. It took three of our colleagues to intervene before we could figure this problem out; only the last one realized what the cause was and suggested a solution.

But the saga continued.

Both applications – the infrastructure one and our own application – used IIS (Internet Information Services) based microservices. Installing all IIS features required Windows to be fully up to date. One of our colleagues had been having a problem with Windows updates for quite some time. At first, it was due to low disk space. She fixed that issue, but the problem persisted for reasons that remained unknown. For months in a row, the installation of updates kept failing. Finally, after a tedious, frustrating Windows reinstallation process, she managed to install IIS and, afterwards, the application.

Back to the infrastructure application: during our development, the colleagues developing it made an apparently minor but in reality rather fatal mistake. When creating a new version of the application, they incremented a different version component of one of the packages than they should have, even though the changes were not backward-compatible with the previous version. As a result, the package was automatically upgraded on our side, and our application no longer functioned. That held us back for another day.

It was the only delay caused by human error on the infrastructure application side throughout the project. Other than that, I can say without hesitation that it was a very well-conceived and implemented application, and our colleagues were always more than happy to assist us with any issues we encountered, whether it was bad configuration or anything else. For that I am also very grateful.

Creating build configurations

Every day came with its own challenge, and the next one was about creating build configurations on Team Build, for the purpose of continuous integration. These configurations would be responsible for multiple tasks, like compiling NuGet package solutions, compiling microservices solutions, running unit tests, running component tests, and creating the setup of the application.

Our tester teammate became responsible for this task, mostly because the rest of us were busy with development tasks, but also because he accepted it as a personal challenge. However, we all contributed to discovering several limitations of the Team Build platform and how to overcome them. We already had some configuration examples from other projects, which helped us to some degree, but we still discovered plenty of new limitations. One of them was that we could not create configurations that depended upon each other. You could set a build to start after another one had ended, but you could not make the start of one build conditional on the successful completion of another. Because of this, many tasks had to be repeated in several different builds, which took more time and consumed more resources on the build machine.

Overall, it was a “fun” assignment, both literally and figuratively.

Creating automated tests

In the beginning, we decided to test the code automatically, in three types of tests: unit tests, component tests and system tests.

Unit tests were the easiest, and thus the most numerous. They tested the business layer exclusively, and tried to cover it as extensively as possible.

For the component tests, we considered each component equal to a microservice. In the first step, we decided to create one component test for each microservice, each one covering the main positive scenarios. The idea was that the tests would first be written by the team developers, and then continued and maintained by our test engineer.

Initially, we had estimated one man-week for creating the infrastructure, and two man-weeks for actually writing the tests. As we went on creating the tests, we came to the following conclusions: first, the total effort of writing these tests was 50% greater than originally estimated; second, our test engineer colleague was far from having the knowledge needed to create or maintain these tests, since they required more programming skills than a test engineer normally has.

It took some time to work out how to simulate interaction with every microservice independently, as if these interactions were coming from different, neighboring microservices. What followed was painstaking work, in which we used each component test to make every microservice work correctly.

Unit tests were helpful, up to a point. Still, they only covered the business layer. The other layers hadn’t even been manually tested up to that point. And we still had more unit tests to create, in parallel with the component tests, to address scenarios we had omitted in the first place.
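The principle of such a component test can be sketched as follows. This is an illustration only, not our actual test infrastructure: the in-memory bus stands in for the real message broker, and the service, the topic names, and the message shape are all hypothetical. The test itself plays the role of both the upstream and the downstream neighbor.

```python
from collections import defaultdict


class InMemoryBus:
    """Stand-in for the message broker, good enough for a component test."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver synchronously so the test can assert right after publishing.
        for handler in list(self._handlers[topic]):
            handler(message)


class MailParserService:
    """Hypothetical microservice under test: normalizes incoming mail subjects."""

    def __init__(self, bus):
        self._bus = bus
        bus.subscribe("mail.received", self.on_mail_received)

    def on_mail_received(self, message):
        subject = message["subject"].strip().lower()
        self._bus.publish("mail.parsed", {"subject": subject})


def run_component_test():
    bus = InMemoryBus()
    MailParserService(bus)

    received = []
    # The test acts as the downstream neighbor and records what arrives.
    bus.subscribe("mail.parsed", received.append)
    # The test acts as the upstream neighbor and sends the triggering message.
    bus.publish("mail.received", {"subject": "  Monthly Report "})
    return received
```

Only one real microservice takes part; everything around it is simulated, which is what lets each service be exercised in isolation.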

As for system tests, there was no more time for them.

Handling exceptions

Weirdly enough, a lot of headaches came from exception handling. None of us had previously worked with asynchronous applications that had no graphical interface for monitoring the main data.

When we started automated testing with component tests, and also manual testing in debug mode, we were taken by surprise: there were moments when the application stopped working without producing any relevant output, in the log files or anywhere else. Our first instinct was to blame the external dependencies. And the most unreliable of them, as we thought at the time, was the infrastructure application. It had been developed recently, hence our lack of trust in it, even though it had been tested and successfully delivered to one or more clients, who had no major complaints about using it. What we learned next was how wrongly we had prejudged it.

When to handle exceptions

Some time passed until we realized that the fault was our very own: in multiple cases, we had simply forgotten to handle exceptions in asynchronous code. We had assumed that many pieces of code would just work and never throw. Either that or, subconsciously, we expected the application to display the usual on-screen error messages that we were used to seeing every day in normal graphical interface applications. Surprisingly to us, though entirely normal for this unfamiliar type of application, that didn’t happen.

Before this revelation, we had already created a set of classes and interfaces to help us not only handle exceptions, but also re-execute the code that could cause problems, generally because it accessed external resources (e-mail inboxes, the database etc.) that could be temporarily inaccessible. With the help of these classes and interfaces, we started, at a certain point, handling every exception in each microservice in a thorough manner.

The fun part was when we realized that the infrastructure application, which we had initially thought was to blame, did in fact provide, within the communication between microservices, handling for exceptions caused both by temporary problems and by application bugs. When we realized this, we started taking advantage of this newly discovered facility as much as we could.
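The same trap exists in most asynchronous runtimes. The sketch below uses Python's asyncio rather than the C# code we actually wrote, and all names in it are hypothetical. In asyncio, an exception raised inside a background task is stored on the task, and if nobody awaits the task it is reported late, if at all. Wrapping the background work in a guard that logs the failure makes it visible right away.

```python
import asyncio
import logging

logger = logging.getLogger("worker")


async def read_inbox():
    """Hypothetical background operation that fails."""
    raise ConnectionError("mail server unreachable")


async def guarded(operation, errors):
    """Await the operation and record any failure instead of losing it."""
    try:
        await operation()
    except Exception as exc:
        # No window will pop up in a service: the log is the only witness.
        logger.exception("background operation failed")
        errors.append(exc)


async def main():
    errors = []
    task = asyncio.create_task(guarded(read_inbox, errors))
    await task
    return errors
```

Running asyncio.run(main()) returns a list containing the ConnectionError, and the failure is also written to the log with its stack trace.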
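A re-execution helper of this kind might look roughly like the sketch below. It is not our actual class design; the function name, the parameters, and the choice of which exception types count as temporary are assumptions made for illustration.

```python
import time


def run_with_retry(operation, attempts=3, delay_seconds=0.0,
                   retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying when it fails with a temporary error.

    Exceptions not listed in retry_on propagate immediately, because
    retrying a programming bug would only hide it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # give the external resource time to recover
    raise last_error


def demo():
    calls = {"count": 0}

    def flaky_inbox_read():
        # Fails twice, as if the mail server were briefly unavailable.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("inbox temporarily unavailable")
        return "3 new messages"

    result = run_with_retry(flaky_inbox_read, attempts=3)
    return result, calls["count"]
```

Separating temporary failures from everything else is the important design choice here: only the former are worth trying again.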

What to do with handled exceptions

Another challenge was what to do with exceptions once they were caught. In the old application, apart from being written to the log files, exceptions were sent via e-mail to a configured address. We wanted to keep supporting this functionality, which we considered part of the infrastructure side. For this, we needed to consider two aspects:

  • On one hand, when the application repeatedly tries to perform an operation without success and throws an identical exception each time, with the same message and the same specific details, we wanted to make sure that the log files and/or the e-mail inbox would not be flooded with exception messages; the application would only write/send them once in a while.
  • On the other hand, as much as possible, we didn’t want the application to send non-fatal exceptions via e-mail; this way, the exceptions we received would be as relevant as possible, and we would have to investigate as little as possible.

We implemented the first aspect right from the beginning, with the help of an exceptions cache. More specifically, we wrote/sent exceptions the first time they were caught, and we also retained them in memory for a while, in some concurrent dictionaries; if they occurred again with the same message and stack trace within a given amount of time, we would not write/send them again. After that configured amount of time, we would clear the dictionaries automatically. The method proved very efficient.

As for the second aspect, we only thought about it towards the end of the implementation of the first version of the application. That’s why we left it for a future version and communicated this to the client, so that he would know not to freak out when receiving too many exception messages at his configured e-mail address.
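The core of such a cache fits in a short sketch. This is a simplified Python model, not our C# implementation: a plain dictionary guarded by a lock stands in for the concurrent dictionaries, entries expire individually instead of the whole dictionary being cleared at once, and the class and method names are hypothetical.

```python
import threading
import time
import traceback


class ExceptionCache:
    """Report an exception once, then stay quiet about repeats for a while."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable, so tests do not have to wait
        self._seen = {}      # key -> time of first report
        self._lock = threading.Lock()

    def should_report(self, exc):
        # Same type, message and stack trace means "the same exception".
        stack = "".join(traceback.format_tb(exc.__traceback__))
        key = (type(exc).__name__, str(exc), stack)
        now = self._clock()
        with self._lock:
            # Drop entries older than the configured timeout.
            self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
            if key in self._seen:
                return False
            self._seen[key] = now
            return True


def demo():
    fake_now = [0.0]
    cache = ExceptionCache(ttl_seconds=60, clock=lambda: fake_now[0])

    def failing_read():
        try:
            raise TimeoutError("inbox unreachable")
        except TimeoutError as exc:
            return exc

    first = cache.should_report(failing_read())   # new: write/send it
    repeat = cache.should_report(failing_read())  # identical: suppress it
    fake_now[0] = 61.0                            # timeout has passed
    later = cache.should_report(failing_read())   # report it again
    return [first, repeat, later]
```

The caller writes to the log or sends the e-mail only when should_report returns True, so a failure that repeats every few seconds produces one message per timeout window instead of hundreds.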

Configuration settings for exception handling

When we discussed exception handling, we brainstormed about where to save its configuration settings: the e-mail addresses to send exceptions to, the e-mail server and address they would be sent from, and the cache persistence timeout.

One of my colleagues proposed saving these settings in the database. I considered the risk to be that we would not receive any exceptions when a database problem occurred (temporary or permanent). That’s why I proposed saving them in the configuration files of each microservice (there was not, and there could not be, a common configuration file, unfortunately).

To allow the client to configure all these settings, we decided to create a small application, a Windows Forms tool. I completed this task in one day. The tool would display the settings from the first microservice it found, and it would save them into the configuration files of all microservices. Also, all microservices were configured to detect when their own configuration files were modified and to reload them at that moment.
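The reload mechanism can be sketched as below. This is only an illustration of the idea: it polls the file's modification time, whereas a real service might rely on file system notifications, and it uses a JSON file with made-up setting names in place of the actual configuration files.

```python
import json
import os
import tempfile


class ReloadingConfig:
    """Settings that are re-read whenever the file on disk has changed."""

    def __init__(self, path):
        self._path = path
        self._mtime_ns = None
        self._values = {}
        self.refresh()

    def refresh(self):
        """Reload if the modification time changed; return True if reloaded."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False
        with open(self._path, encoding="utf-8") as handle:
            self._values = json.load(handle)
        self._mtime_ns = mtime_ns
        return True

    def get(self, key, default=None):
        self.refresh()
        return self._values.get(key, default)


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "service.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"cache_timeout_seconds": 60}, handle)
        config = ReloadingConfig(path)
        before = config.get("cache_timeout_seconds")

        # Simulate the configuration tool overwriting the file later on.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"cache_timeout_seconds": 120}, handle)
        stat = os.stat(path)
        # Push the timestamp forward so the change is visible even on
        # file systems with coarse timestamp resolution.
        os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))

        after = config.get("cache_timeout_seconds")
        return before, after
```

Because each service watches only its own file, the configuration tool just has to write the same values into every file, and the services pick them up without a restart.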

Where to go from here

Thanks for hanging in there 🙂 If you’re curious how this ends, please read The microservices saga: Part III – Out of the woods. If you don’t know how it all started, please go back to How I learned to work with microservices: Part I – The opportunity