Seamless Login: How to Build Reliable Authorization in Fintech Applications

For large banking applications, uninterrupted operation is critical: any downtime or failure translates directly into the risk of financial losses. It is in those moments that the most vulnerable nodes become obvious, and one of them is authorization - the entry point for users into the application, says Nikita Letov, a software engineer and expert in building fault-tolerant systems. As a technical leader and head of Java development at Rosbank, he coordinated the work of several teams and was responsible for the architecture of the mobile bank and key elements of the system - from authorization to service availability. He significantly increased the stability of the application and cut user complaints about login difficulties tenfold. Nikita is a member of IEEE and Hackathon Raptors, international engineering associations that place high demands on their candidates: their level must meet global standards. In this interview, he explains which architectural decisions help avoid most typical authorization problems and withstand load without failures, how development in fintech differs from other industries, and which technical areas will be in peak demand over the next few years.

Users expect banking applications to be fast and seamless: even a 10-second delay at login causes irritation. You designed the authorization architecture for an application with millions of users, and you know what stands behind that minimalist login screen: high loads, security, integrations. Which tasks are considered the most difficult in building such applications today?

The requirements for fintech solutions have indeed become extremely high, and IT teams face a number of serious challenges. One of the most important is ensuring high service availability. Banking applications must work 24/7 and withstand thousands, and sometimes millions, of simultaneous user sessions. At the same time, security remains a priority: financial data is always of interest to attackers, and any vulnerability can result in serious financial and reputational consequences.

Another important feature is that fintech products do not live in isolation. They are almost always connected to many external systems, from government agencies to insurance companies and partner platforms. These integrations need to be not just configured but built in a way that guarantees reliable data flow between systems and stability during updates.

A separate challenge is scalability. The user base is constantly growing: every day, and sometimes every minute, new customers arrive or old ones return, and a solution that worked fine yesterday may fail to handle tomorrow's load. The architecture therefore has to be designed for growth from the start: flexible, scalable, grounded in a deep knowledge of system design.

And, of course, we must not forget regulatory compliance. This is perhaps one of the most difficult and least enjoyable tasks. Banks, brokers, and other fintech organizations operate under the close supervision of state regulators, so you have to account for many legislative norms, and this often sharply narrows the choice of technologies available for a particular problem.

You mentioned scalability and architectural stability as one of the key tasks. In your case, we are talking about an application used by almost 3.5 million customers. What changes in the design of authorization under high load? What unusual situations do you have to deal with, and how can protection against them be built in at the architecture stage?

When a system has millions of users, authorization ceases to be just a login-and-password check - it becomes a full-fledged high-load service, subject to the same requirements as other critical components of the system: it must be fast, reliable, and secure.

The first thing that changes is the scale and storage of sessions. Classic approaches, such as keeping user sessions in memory, no longer work: they do not scale horizontally. Authorization therefore has to be stateless, with every request authenticated by a token, for example a JWT. Authorization also splits into two modes: a full one, with password and 2FA verification, and a fast one, where the user logs in with a previously issued token. In both cases it is critical to ensure secure token exchange so that an attacker cannot intercept it.

Another important point is the unevenness of the load. On a Monday morning, or on the last day of the month when salaries arrive, the number of logins into the mobile application can be abnormally high compared to normal traffic. The system must be able to absorb such peaks without a cascading failure of the service. Patterns like the Circuit Breaker help here by isolating failing components, as does dynamic scaling, which quickly increases the number of service instances in response to load.

And, of course, you cannot do without fallback mechanisms. Failures are possible even in a perfectly designed system, and it is important that they are handled transparently for the user: either through backup logic or, if the failure is truly critical, by offering alternative steps so that the user can still complete the task.
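To make the Circuit Breaker plus fallback idea concrete, here is a minimal sketch in Java using the Resilience4j library. It is an illustration under assumptions, not the bank's actual code: the AuthResult type and the callAuthService/fallbackLogin helpers are hypothetical placeholders.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

public class AuthCircuitBreakerSketch {

    // Hypothetical result type: "FAST" (token-based) or "FULL" (password + 2FA) login outcome.
    record AuthResult(String mode, boolean success) {}

    public static void main(String[] args) {
        // Open the breaker when more than half of recent calls fail,
        // and probe the downstream service again after 30 seconds.
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .slidingWindowSize(100)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build();

        CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("auth-service");

        // Wrap the call to the (hypothetical) authorization service with the breaker.
        Supplier<AuthResult> protectedCall =
                CircuitBreaker.decorateSupplier(breaker, AuthCircuitBreakerSketch::callAuthService);

        // If the call fails or the breaker is open, fall back to an alternative flow
        // instead of showing the user a blank screen.
        AuthResult result = Try.ofSupplier(protectedCall)
                .recover(throwable -> fallbackLogin())
                .get();

        System.out.println("Login mode: " + result.mode() + ", success: " + result.success());
    }

    private static AuthResult callAuthService() {
        // Placeholder for a real network call that validates the token or credentials.
        return new AuthResult("FAST", true);
    }

    private static AuthResult fallbackLogin() {
        // Placeholder fallback: e.g. offer the full password + 2FA flow instead.
        return new AuthResult("FULL", true);
    }
}
```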

At one point during your time at Rosbank, users regularly ran into problems when logging into the application. You managed to solve this - complaints about the issue have almost disappeared. What problems with the application's entry point did you identify, how did they manifest themselves, and what did you do?

It was one of those cases where everything seems to work, yet clients were running into errors every day. For me it was exactly the situation where you have to dig deep rather than accept the architecture as a given. The problems showed up on the user side: some people could not log in after updating the application, for others authorization took up to 15 seconds, and some saw nothing at all - just a white screen. Such complaints came into technical support constantly and accounted for up to a third of all tickets.

The first thing I did was trace client requests. We implemented distributed tracing and improved logging at every key stage of authorization - from the gateway to the responsible backend services. This made it possible to pinpoint exactly where the chain was breaking.

It turned out that many of the problems arose during the authorization flow itself. One was inconsistent routing: the gateway was directing clients to mismatched service versions. Another was a leak of the security context during authorization. There were also problems with network availability of the cache where token metadata was temporarily stored. Most of the problems were solved by rewriting the gateway code and the authorization service filters, while the cache-access failure turned out to have a banal cause - missing network rules in the backend service orchestrator.

This case showed me how important it is to treat critical parts of the system not as just another feature, but as a full-fledged engineering platform where everything is thought through - from UX to SLA.
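One common way to get the kind of request transparency described here is a gateway filter that attaches a correlation ID to every request before it reaches downstream services, so that logs from the gateway and the backends can be stitched into one trace. The snippet below is a hypothetical sketch for Spring Cloud Gateway, not the filter actually used in the bank; the X-Correlation-Id header name is an assumption.

```java
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.core.Ordered;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

import java.util.UUID;

// Adds a correlation ID to every request passing through the gateway so that
// log entries from different services can be linked to a single client request.
@Component
public class CorrelationIdFilter implements GlobalFilter, Ordered {

    private static final String HEADER = "X-Correlation-Id"; // assumed header name

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String correlationId = exchange.getRequest().getHeaders().getFirst(HEADER);
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = UUID.randomUUID().toString();
        }

        ServerHttpRequest request = exchange.getRequest().mutate()
                .header(HEADER, correlationId)
                .build();

        return chain.filter(exchange.mutate().request(request).build());
    }

    @Override
    public int getOrder() {
        return Ordered.HIGHEST_PRECEDENCE; // run before the routing filters
    }
}
```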

During your time at the bank, its application became noticeably more stable - availability rose to 99.97%. What technologies did you use, and how did you manage to achieve such a high level of stability?

After running into the inconsistent routing and starting to build transparency into requests, I decided to implement Spring Cloud Gateway. But it is important to understand that stability and high availability are not achieved with a single technology - there is no silver bullet that solves every problem at once. Spring Cloud Gateway did not give an instant or magical boost in availability, but it played its part in reaching that figure.

The result came from coordinated, systematic work across all development teams. From the start, working as technical leader, I built processes to minimize the likelihood of errors and hidden defects - and, when they did occur, to ensure the fastest possible response. That was also when we decided to grow our own SRE engineers within the team, and we successfully built out that capability.

In the end it was a combination of factors - the spread of event-driven architecture across the backend of the remote banking platform, strong technical leadership, mature development and testing processes, and a sensible choice of technologies - that allowed us to reach very high availability for a high-load banking application.
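As an illustration of how a gateway can route clients to a matching service version consistently, here is a hypothetical Spring Cloud Gateway route definition. The route IDs, the X-App-Version header, and the service URIs are assumptions made for the sketch, not the bank's actual configuration.

```java
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutesConfig {

    // Pin clients to a backend version based on the app version they report,
    // so an old client is never routed to an incompatible service.
    @Bean
    public RouteLocator authRoutes(RouteLocatorBuilder builder) {
        return builder.routes()
                .route("auth-v2", r -> r
                        .path("/api/auth/**")
                        .and().header("X-App-Version", "2\\..*")   // assumed header
                        .uri("lb://auth-service-v2"))              // assumed service name
                .route("auth-v1", r -> r
                        .path("/api/auth/**")
                        .uri("lb://auth-service-v1"))
                .build();
    }
}
```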

Nikita, you are active in the professional community: you speak at specialized conferences and belong to the IEEE and Hackathon Raptors associations, which accept only highly qualified specialists with a proven track record. Are there current trends in fintech development that you would call overrated, and what, on the contrary, is developing quietly but will soon take off?

Yes, that's an interesting question. In my opinion, embedding LLMs into applications without a clear understanding of goals and limitations is currently very overrated - and this is true not only for fintech but for the entire IT industry. More often than not it looks like "let's add AI because it's fashionable," while no real business problem gets solved. Yes, the technology is powerful, but it has to be applied deliberately, with regulatory requirements, interpretability, and security in mind. The result of such ill-considered use of AI is usually a set of beautiful demos and MVPs, but very little real benefit in production.

What I consider underestimated - and what will definitely take off in the near future, and in places already has - is the spread of event-driven architectures and asynchronous event processing. Especially in banks, where everything used to be built on synchronous REST requests. Now, as load grows and millions of operations run in parallel, moving to reactive approaches and message brokers such as Apache Kafka or Pulsar becomes not just a nice-to-have but, in effect, a necessity. The trend is already visible: in many projects, event sourcing lets you roll back to any recovery point, provides flexible auditing, and feeds near-real-time analytics without crushing load on the databases. On top of that, this approach does not tie you to a specific database: you can at any time configure consumers to project events from the broker into any suitable store, changing the data model along the way.

Another quiet trend, related less to development than to infrastructure, is the growing popularity of serverless approaches inside large corporations. This used to seem like the prerogative of startups, but now even large fintech companies are actively using FaaS and lightweight containers for tasks ranging from AML scenarios to report generation. All of this shortens time-to-market and adds flexibility, especially with limited budgets and small teams.

So noise is not always a sign of maturity, and vice versa: the most useful things often arrive quietly but stay for a long time.
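As a small illustration of the asynchronous, event-driven style described above, here is a hypothetical sketch that appends a login event to an Apache Kafka topic using the standard Kafka Java client. The topic name, broker address, and event payload are assumptions made for the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Instant;
import java.util.Properties;

public class LoginEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for acknowledgement from all in-sync replicas: durability over latency.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        String userId = "user-42"; // hypothetical key: events for one user stay ordered per partition
        String event = String.format(
                "{\"type\":\"LOGIN_SUCCEEDED\",\"userId\":\"%s\",\"at\":\"%s\"}",
                userId, Instant.now());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Append the event to the topic; downstream consumers (audit, analytics,
            // fraud monitoring) process it asynchronously at their own pace.
            producer.send(new ProducerRecord<>("login-events", userId, event));
            producer.flush();
        }
    }
}
```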