Two important lessons I learned rolling out a new platform feature

Introduction

At work, I'm a member of a platform team. Our team manages a core microservice responsible for creating, authenticating, and authorizing users. The last major feature I built was Multi-factor Authentication (MFA). We used it to gate access to some particularly risky actions users could take in our app. Here I'm capturing two lessons I learned during this process, mostly from my manager and from discussions with colleagues inside and outside my team. First, I learned the importance of validating risks before committing hours of engineering time to building unnecessary structures. Second, I learned the importance of building things top-down and end to end. This approach helps us deliver value faster and more often. Before we dive into the two lessons, let me provide some background about our app and the feature my team was building.

Our app allows our customers to receive online payments from their customers. When their customer pays, we collect the payment and send the funds to a bank account our customer has specified in the app. Malicious users attempt to take over our customers' accounts and divert the funds to another bank account. To combat these account takeovers, we now challenge customers before they can change their bank account. They have to successfully complete an SMS challenge before they're able to change banking details. We called this variant Step-up Authentication because we're asking the user to perform some additional authentication as a prerequisite for a particular task. Additionally, users didn't explicitly enroll. We used the phone number they provided in our onboarding process for challenges.

This feature required collaboration with multiple teams. The Payments team manages the bank account change feature and would be using our API to protect our users. The Risk team is responsible for implementing various mechanisms and processes for keeping users safe. In particular, they track account takeovers and handle the repercussions. They were keen to add this extra layer of protection. They also provided the back-office process to support users who couldn't complete the challenge for various reasons (e.g. we have an incorrect number for the user).

Confirm risks before building for them

I mentioned earlier that users didn't explicitly enroll in the form of MFA we were building. We used the phone number our customers provided during the onboarding process. Without an explicit enrolment where the user is challenged on the phone number, we couldn't be sure the phone numbers were correct or up to date. Our Risk team rarely needed to use the phone numbers, so we really couldn't vouch for the currency of the data. We had validation that at least guaranteed the phone number's length, but not much else. We knew we needed to support users who couldn't complete the challenge, but there were uncertainties about how elaborate the structure needed to be. We didn't have a reliable estimate of how many customers would need back-office support. The team was worried the volume of support requests would overwhelm the already stretched capacity we had. The combination of fears and uncertainties prompted a request for a thorough and robust support mechanism.

As my manager and I chatted about this, he suggested rolling it out slowly with clear evaluation criteria. We could do a staggered rollout and define key metrics for validating our concerns and answering uncertainties. I sat with the Risk team to define metrics that gave us the feedback we were seeking. In addition, we defined the acceptance criteria for each stage of the rollout. Some examples of metrics we tracked were:

Number of users prompted with our step-up authentication feature
Number of users who sent a challenge to their phone
Number of users who successfully completed the challenge
Number of users who abandoned the process (didn't complete the process after some threshold)
Number of users who attempted to Contact Us (there was a link provided in the authentication UI for getting in touch with our Risk team)

You could say we ended up doing a rudimentary funnel analysis. We measured how many users were successfully going through the funnel (completing authentication) and what the drop-off rates were. We had a funnel for each stage in the process so we could clearly identify friction or issues within any part of the process.
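To make the funnel idea concrete, here is a minimal Python sketch of the kind of calculation involved. It is not our actual tooling: the stage names, the FunnelStage type, the drop_off_rates helper, and all of the counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunnelStage:
    """One step in the funnel and how many users reached it (hypothetical type)."""
    name: str
    users: int


def drop_off_rates(stages: list[FunnelStage]) -> list[tuple[str, str, float]]:
    """Return (from_stage, to_stage, drop_off_rate) for each consecutive pair."""
    rates = []
    for current, nxt in zip(stages, stages[1:]):
        # Guard against an empty stage so we never divide by zero.
        rate = 0.0 if current.users == 0 else 1 - nxt.users / current.users
        rates.append((current.name, nxt.name, rate))
    return rates


# Made-up counts for illustration only; these are not our real numbers.
funnel = [
    FunnelStage("prompted", 1000),
    FunnelStage("challenge_sent", 900),
    FunnelStage("challenge_completed", 855),
]

for src, dst, rate in drop_off_rates(funnel):
    print(f"{src} -> {dst}: {rate:.1%} drop-off")
```

A sudden jump in the drop-off between two adjacent stages is the signal to look for, since it points at the specific step where users are getting stuck.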

In the end, we had a clear plan that we could evaluate at the end of each week and determine if we could keep going, or if we needed to pivot to beef up our support system. The primary metric we'd use for determining if we moved forward was the number of users who tried to contact us during the process. This was our closest proxy for the quality of our phone number data. Once all our stakeholders were comfortable with the plan, we were ready to launch!

We started by rolling the feature out to ~10% of our users. At the end of the first week, we examined the metrics and realized we barely had enough volume trickling in to register on our support team's radar. Practically no one was contacting us. After reviewing the other metrics, we jumped up to 75%. And then, after another week of encouraging metrics, we were able to roll the feature out to all our customers. Just like that, in a few weeks, we went from being very concerned about the back-office impact to having step-up authentication deployed to all our users. The data supported each stage we progressed through and we couldn't be happier with the results.
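I haven't described how we split users into rollout groups, so treat the following as one common way to do a staggered rollout and not as our implementation. This Python sketch hashes a user ID into a stable bucket from 0 to 99 and enables the feature when the bucket falls below the current percentage. The function names and the salt are assumptions made for the example.

```python
import hashlib

# The three stages from the rollout described above, as percentages.
ROLLOUT_STAGES = [10, 75, 100]


def bucket_for(user_id: str, salt: str = "step-up-auth") -> int:
    """Map a user to a stable bucket in [0, 100) (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket_for(user_id) < rollout_percent


for percent in ROLLOUT_STAGES:
    enabled = sum(is_enabled(f"user-{i}", percent) for i in range(1000))
    print(f"{percent}% stage: {enabled} of 1000 sample users enabled")
```

Because the bucket depends only on the user ID and the salt, a user who sees the challenge at 10% keeps seeing it at 75% and 100%, which keeps the weekly metrics comparable from one stage to the next.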

This turned out to be a happy story. We were asked to build an elaborate structure to support our customers, but it turned out that something basic was quite sufficient. Defining clear metrics helped us objectively evaluate the need for a robust structure and avoided months of engineering work. With a little discussion, planning, and analysis we avoided building a fancy highway our customers didn't need.

If you're ever unsure about whether something is required, lean in and identify the criteria that will make the choice easier to make. Work with the project stakeholders to determine the criteria that will assuage their concerns and illuminate areas of uncertainty. Then, make a plan! Roll your feature out slowly enough that everyone is happy and able to get answers as you move the dial closer to general availability. You too may find some features don't even need to be built, or a simpler version exists that suffices.

Build things top-down and end-to-end to surface value earlier

Another important lesson I picked up from my manager was a strategic one about delivery. Our team has full ownership of our feature end to end. We were responsible for building the APIs on the backend, and also the experience on the front end in our single-page app. You could say my approach to building out the front end and backend was bottom-up. On the backend, I would build all the components of a particular API from the bottom up, then I'd have us work on the front-end components. This would be informed by a design where I identified all the components in the backend and how they would interact.

Let's say we had an API that returned a list of masked phone numbers. The customer would select a phone number from this list to receive an SMS or voice challenge. On the backend, I'd typically start building the endpoint at the lowest layer. In this case, the number was stored in a different microservice, so I'd build a client to fetch those phone numbers first. Then I'd work my way upward to the API layer. So, after building the client, I would add our business logic layers that fetch user data and utilize the client. Finally, we'd build the API layer that deserializes the wire protocol and passes the arguments down to our business logic layers. Once that was done, we could move to the front end.
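To make the layering concrete, here is a minimal Python sketch of the three layers in the order I'd build them bottom-up. Every name in it (mask, PhoneNumberClient, ChallengeService, list_phone_numbers_handler) is made up for illustration, and the client returns canned numbers instead of calling another microservice. Our real service looks different.

```python
from dataclasses import dataclass


def mask(phone_number: str) -> str:
    """Hide all but the last two digits (hypothetical masking rule)."""
    return "*" * (len(phone_number) - 2) + phone_number[-2:]


class PhoneNumberClient:
    """Layer 1, built first: fetches numbers from another microservice."""

    def fetch_phone_numbers(self, user_id: str) -> list[str]:
        # A real client would make a network call; this returns canned data.
        return ["5550100", "5550199"]


@dataclass
class ChallengeService:
    """Layer 2: business logic that uses the client."""

    client: PhoneNumberClient

    def masked_numbers_for(self, user_id: str) -> list[str]:
        return [mask(n) for n in self.client.fetch_phone_numbers(user_id)]


def list_phone_numbers_handler(request: dict, service: ChallengeService) -> dict:
    """Layer 3, built last: decodes the request and calls the business logic."""
    user_id = request["user_id"]
    return {"phone_numbers": service.masked_numbers_for(user_id)}


service = ChallengeService(client=PhoneNumberClient())
print(list_phone_numbers_handler({"user_id": "user-1"}, service))
```

Nothing here can be called end to end until the handler at the bottom of the sketch exists, so the front end has nothing to integrate with until the very last step.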

This approach works but it has a few drawbacks.

For starters, there are long, inefficient feedback loops.

The further a layer is from the topmost API layer, the less context exists for our use case. When we consider the call stack, the layers at the bottom are the most flexible and use-case agnostic. They are the furthest removed from our use case. So if we're building components at the bottom first, we have the least context to validate exactly how the component will be used. It can be difficult to guarantee that the component satisfies the use case. As such, this approach is prone to mistakes and omissions. While building out the phone number client, for instance, we could easily forget that we need an extra argument. Or when building a repository interface to fetch some data from the database, we could end up providing methods that don't exactly align with what we need from the higher layers. Often, we don't see the issues until the API layer is built and we pass arguments downward. That's a particularly slow and inefficient feedback loop. It could take several days of work to build all the components, and having to wait until the end to surface issues isn't ideal.

This problem compounds when we think about the interaction between the front end and the backend. It takes even longer to verify that we've built the correct structure on the backend to support the front end. Despite our best efforts in planning the APIs and front end, we will realize that we inadvertently omitted things or made incorrect assumptions. Sometimes we won't discover that we forgot an argument until we realize there's no API parameter to support what we need on the front end. Depending on the nature of the modification, it can be time-consuming to rectify. In our case, we also have a GraphQL API Gateway that proxies calls through to our microservice. This adds to the time for changes. In the worst case, we could end up having to modify the API Gateway and several components in our backend API. These modifications could also cross team boundaries (e.g. asking another team to modify their API to support additional behavior or parameters), which can further inflate the timeline.

The other issue is that it takes a long time to demonstrate value. By value, I mean being able to show the front end connected to the backend, performing its intended function. Using a bottom-up approach, we don't have a usable API until all the components are built. In the same example of fetching phone numbers for the user, we would need the phone number client, all the repositories for fetching data from our database, the business logic layers to interact with the client and repositories, and finally, an API layer to exercise the use case. Until then, we couldn't connect the front end to an actual API to show it working.

The alternative requires a bit of a paradigm shift. Instead of building all of the backend components first, we can start at the top and implement functionality in small increments. That's the top-down part. We can use a similar approach on the front end and start to connect the front end to the back end. That's the end-to-end part. Using this approach enables us to surface value earlier and incorporate changes required to remedy any design shortcomings.

Using the same example of fetching contacts (the user's masked phone numbers), we could start with the API layer returning a list of dummy contacts. This list only exists at the API layer and isn't populated by any functioning layers below. This allows us to build out the contact list components on the front end and connect them to the backend. Building components in this way facilitates faster validation of the backend and front-end integration. It's easier and faster to accommodate changes too, since there are fewer components to change when we uncover mistakes in our design. Additionally, at this early stage, we can show stakeholders progress on what we've done. Despite the fact that we're using dummy data, the front end is still performing its intended function!
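As a rough illustration of that first increment, here is a Python sketch of an API-layer handler that returns dummy contacts with nothing underneath it. The handler name, the request and response shapes, and the field names are all assumptions made for the example, not our actual API.

```python
def list_contacts_handler(request: dict) -> dict:
    """API layer only: returns hard-coded contacts; no lower layers exist yet."""
    # Validate the wire format up front so the front end integrates
    # against the same contract the finished endpoint will expose.
    if "user_id" not in request:
        return {"status": 400, "error": "user_id is required"}

    # Dummy data that lives at the API layer. A later increment would
    # replace this list with a call into the business logic layer.
    dummy_contacts = [
        {"id": "contact-1", "masked_phone_number": "*****00", "channels": ["sms", "voice"]},
        {"id": "contact-2", "masked_phone_number": "*****99", "channels": ["sms"]},
    ]
    return {"status": 200, "contacts": dummy_contacts}


print(list_contacts_handler({"user_id": "user-1"}))
```

If the front end turns out to need an extra field or parameter, the change at this stage touches one function instead of a client, a service, and a handler.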

Conclusion

And that's a wrap! Those are two important lessons I learned while building out the last platform feature I worked on. First, it's really important to validate and critically question risks before we go building solutions our customers don't need. Second, it helps to build things end-to-end and top-down. This approach helps us deliver value faster and gives us a much faster feedback loop than if we were building from the bottom up. Thanks for stopping by!

