Welcome to The Engineer Banker, a weekly newsletter dedicated to organizing and delivering insightful technical content on the payments domain, making it easy for you to follow and learn at your own pace.
At precisely 3 p.m., as the London skyline was bathed in the dull gray of an overcast afternoon, Michael, a seasoned engineer at the UK's Faster Payments infrastructure, sat hunched over his computer, a lukewarm cup of tea forgotten at his side. His eyes, usually calm and measured, widened in alarm. The central transaction view, the heart of their operation, was showing inconsistencies he had never seen before. Data that should have been flowing like the Thames was now a jumbled mess, impossible to reconcile.
"This can't be right," he muttered, his fingers flying over the keyboard in a frantic dance of keystrokes. He refreshed screens, ran diagnostic tools, but the results were the same - an inexplicable anomaly in the transaction data. A black swan incident, unforeseen and potentially catastrophic, was unfolding.
With a sense of foreboding, Michael grabbed his phone to alert the Incident Commander. The weight of responsibility pressed down on him like the heavy London fog. Within minutes, a flurry of activity erupted. The incident response team, a seasoned group of crisis managers and tech gurus, assembled in the war room.
“Everyone, we’ve lost our eyes on the transactions. The central view is compromised,” Michael announced, his voice steady despite the chaos swirling in his mind.
The room, filled with the best minds in the industry, fell into a tense silence. This was not a routine glitch. This was a threat to the financial lifeline of the nation, where millions of transactions were processed every hour.
“Initiate the backup systems,” commanded the Incident Commander, a steely edge to her voice. “I want eyes on every transaction. Manually, if we have to.”
With the clock ticking, and the UK's financial stability hanging in the balance, Michael and his team embarked on a treacherous journey through the labyrinth of their advanced payment system.
As the team dove into the manual reconciliation process, the war room transformed into a hive of intense focus and rapid communication. Teams were quickly assigned to different segments of the transaction data, each meticulously combing through records to identify discrepancies and validate transactions. Michael coordinated with the data analysis team, cross-referencing their findings against backup logs and historical patterns. The task was Herculean - reconciling millions of transactions manually was like finding a needle in a haystack, except the haystack was constantly growing.
In another corner of the room, a subgroup focused on constructing a real-time tracking dashboard, piecing together fragments of data to restore some semblance of visibility. Meanwhile, liaisons from customer service and public relations were crafting messages to stakeholders and the public, preparing for the inevitable inquiries and concerns.
Michael, amidst the controlled chaos, couldn't help but feel a sense of awe at the team's synergy and expertise. Despite the daunting task, there was a shared determination to restore order, to piece together the puzzle that the missing central view had scattered. As they worked tirelessly into the evening, the first glimmers of success began to emerge, with critical transactions being identified and verified. The path to full recovery was still long and uncertain, but the team had taken its crucial first steps…
The story above mirrors the real-life crisis that struck the UK's Faster Payments Service (FPS) in July 2018. In that instance, Pay.UK's FPS infrastructure experienced a significant disruption, affecting countless transactions and sending shockwaves across the financial landscape. A black swan event, or incident, is an unpredictable, rare occurrence with significant and widespread ramifications. The term was popularized by Nassim Nicholas Taleb in his 2007 book, "The Black Swan". It originally described financial market surprises but has since been applied more broadly.
These events are characterized by their extreme rarity, severe impact, and the widespread insistence they were obvious in hindsight. In the context of technology and business, a black swan incident could be an unforeseen catastrophic system failure, a major security breach, or any other event that falls outside the realm of regular expectations and preparedness. Because of their nature, these incidents are nearly impossible to predict and can cause a significant upheaval in the functioning of systems, economies, or societies. The concept emphasizes the need for robustness in systems and the ability to respond flexibly to the unknown, as traditional risk management strategies often fall short in preparing for such exceptional events.
On that fateful July afternoon, the UK's Faster Payments Service (FPS) faced an unprecedented crisis – the loss of its central view of transactions. This central view provides a real-time overview of all transactional data flowing through the system. Without this critical visibility, the FPS could not accurately track, process, or reconcile transactions. This sudden and unexpected failure left the system operators scrambling to understand the magnitude of the problem. Banks, businesses, and individual customers were left in limbo, unable to execute or confirm transactions, triggering a ripple effect of uncertainty and concern across the nation's financial landscape.
This incident underscored the fragility of even the most advanced financial systems and highlighted the need for robust contingency plans to manage such critical failures. Out of it, the Major Incident Data Exchange Process (MIDEP) was born: a procedure for recovering from the loss of the central view.
Following the July 2018 incident, the Faster Payments Service (FPS) introduced measures to streamline incident reconciliation for any future occurrences. In the event of a scheme-wide incident, all participating entities must now generate a data extract of the payments for a specified timeframe or settlement cycle. This approach is designed to ensure that disruptions can be resolved with greater speed and accuracy.
The MIDEP is a procedure focused on restoring the central infrastructure's comprehensive view of transactions. This is achieved by amalgamating and reconciling the fragmented and partial views from all participating entities. In the event of a major disruption where the central system's ability to monitor and manage transactions is compromised, this process becomes crucial. It involves gathering transactional data from all participants, each holding a piece of the overall puzzle, and methodically piecing them together to reconstruct a holistic picture. This collaborative effort is vital for ensuring the integrity and continuity of the transaction process.
Banks act as the distributed backup of the central infrastructure
When the central infrastructure recognizes the loss of its integral view of transactions, it promptly activates the Major Incident Data Exchange Process (MIDEP) across all participating entities. This invocation includes specifying the exact time range that requires data replenishment.
Upon receiving the MIDEP notification, each participant is then responsible for retrieving transactions from their own database that fall within the defined timeframe. They compile these transactions into a detailed file, which reflects their individual, partial view of the transactional state. This file is then submitted back to the central system. This process is crucial for piecing together a complete and accurate picture of the transactions that occurred during the outage, ensuring that the integrity of the transactional data is maintained and that any discrepancies or anomalies can be swiftly addressed.
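The participant's side of this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Payment` fields, the CSV layout, the half-open time window, and the `build_extract` helper are all hypothetical and do not reflect the actual MIDEP file specification. The dates are made up as well.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Payment:
    # Hypothetical record layout; the real extract format is not covered here.
    payment_id: str
    sender: str
    receiver: str
    amount_pence: int
    timestamp: datetime


def build_extract(ledger, window_start, window_end):
    """Return CSV text for payments with window_start <= timestamp < window_end."""
    selected = sorted(
        (p for p in ledger if window_start <= p.timestamp < window_end),
        key=lambda p: (p.timestamp, p.payment_id),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["payment_id", "sender", "receiver", "amount_pence", "timestamp"])
    for p in selected:
        writer.writerow(
            [p.payment_id, p.sender, p.receiver, p.amount_pence, p.timestamp.isoformat()]
        )
    return buffer.getvalue()


# Illustrative data only: one participant's ledger and a one-hour window.
UTC = timezone.utc
ledger = [
    Payment("P1", "BANK_A", "BANK_B", 1250, datetime(2024, 1, 1, 14, 55, tzinfo=UTC)),
    Payment("P2", "BANK_A", "BANK_C", 9900, datetime(2024, 1, 1, 15, 5, tzinfo=UTC)),
    Payment("P3", "BANK_B", "BANK_A", 500, datetime(2024, 1, 1, 16, 0, tzinfo=UTC)),
]
extract = build_extract(
    ledger,
    window_start=datetime(2024, 1, 1, 15, 0, tzinfo=UTC),
    window_end=datetime(2024, 1, 1, 16, 0, tzinfo=UTC),
)
print(extract)
```

In this sketch the window is treated as half-open, so a payment stamped exactly at the end of the window belongs to the next extract rather than being reported twice. Whether the real process draws the boundary this way is an assumption.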
Once the central system has received the partial transaction views from every participant, it brings these individual data sets together in a comprehensive reconciliation process. This crucial step involves comparing and matching each piece of data against the others to reconstruct a unified and accurate representation of the transactional activity. The aim is to restore the central view of transactions, piecing back together the complete picture that was lost in the incident.
Any discrepancies found between the partial views are thoroughly investigated and resolved. This detailed procedure not only helps in restoring the central transaction view but also plays a pivotal role in identifying the root cause of the discrepancy, be it a technical glitch, a processing error, or a systemic failure.
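To make the reconciliation idea concrete, here is a minimal Python sketch. It assumes, purely for illustration, that every payment should be reported identically by both its sender and its receiver, and that records are matched on a shared `payment_id`. The record fields, the `reconcile` function, and the discrepancy labels are hypothetical; the real matching rules are not described in this article.

```python
from collections import defaultdict


def reconcile(extracts):
    """Merge per-participant extracts into one view and flag discrepancies.

    extracts maps a participant name to a list of record dicts with the
    hypothetical keys payment_id, sender, receiver and amount_pence.
    """
    reports = defaultdict(dict)  # payment_id -> {participant: record}
    for participant, records in extracts.items():
        for record in records:
            reports[record["payment_id"]][participant] = record

    central_view, discrepancies = {}, {}
    for payment_id, by_participant in reports.items():
        # Collapse identical reports so that any disagreement stands out.
        distinct = {tuple(sorted(r.items())) for r in by_participant.values()}
        any_record = next(iter(by_participant.values()))
        expected = {any_record["sender"], any_record["receiver"]}
        missing = sorted(expected - set(by_participant))
        if len(distinct) > 1:
            discrepancies[payment_id] = "conflicting details"
        elif missing:
            discrepancies[payment_id] = "not reported by " + ", ".join(missing)
        else:
            central_view[payment_id] = any_record
    return central_view, discrepancies


def rec(payment_id, sender, receiver, amount_pence):
    return {
        "payment_id": payment_id,
        "sender": sender,
        "receiver": receiver,
        "amount_pence": amount_pence,
    }


# Illustrative data only: three participants, three payments.
extracts = {
    "BANK_A": [rec("P1", "BANK_A", "BANK_B", 1250), rec("P2", "BANK_A", "BANK_C", 9900)],
    "BANK_B": [rec("P1", "BANK_A", "BANK_B", 1250), rec("P3", "BANK_B", "BANK_C", 500)],
    "BANK_C": [rec("P2", "BANK_A", "BANK_C", 9000)],
}
central_view, discrepancies = reconcile(extracts)
print(sorted(central_view))  # payments both sides agree on
print(discrepancies)         # payments needing investigation
```

Here P1 is confirmed by both sides and enters the rebuilt view, P2 is flagged because the two banks disagree on the amount, and P3 is flagged because the receiving bank never reported it. These are the kinds of cases that would then go to manual investigation.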
The end goal of this reconciliation process is to reestablish the seamless operation of the payments system, ensuring that all transactions are accurately accounted for and reflected in the central system. This not only helps in maintaining the reliability and trust in the payment infrastructure but also provides valuable insights for preventing similar incidents in the future.
See you next time!