Book Summary: The Art Of Scalability

The Art of Scalability by Martin L. Abbott and Michael T. Fisher is a great foundational book on software system design. Below are my notes on the book:

1. Impact of people and leadership on scalability

People are the most important piece of the scale puzzle. Leadership is about creating a vision. Management is about measurement and about achieving goals.

2. Roles for the scalable technology organization:

A common cause of scalability and availability failures is a lack of clarity about people’s responsibilities. Overlapping responsibilities create wasted effort and bad conflicts. To avoid confusion and ambiguity about ownership, the authors suggest creating a RASCI matrix with a clear single owner for each item.
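A RASCI matrix is easy to check mechanically. The sketch below is only an illustration: the tasks, the people, the `ownership_problems` helper, and the choice of "R" as the letter that marks the single owner are all assumptions of mine, not taken from the book.

```python
# Hypothetical RASCI matrix: task -> {person: role letter}.
# R = Responsible, A = Accountable, S = Supportive, C = Consulted, I = Informed.
rasci = {
    "Capacity planning": {"Ops lead": "R", "CTO": "A", "Architect": "C"},
    "Release approval": {"QA lead": "R", "VP Eng": "R", "CTO": "A"},
    "Incident review": {"Ops lead": "S", "QA lead": "I"},
}

def ownership_problems(matrix, owner_role="R"):
    """Return {task: message} for tasks without exactly one owner.

    Which letter denotes the single owner is an assumption; pass a
    different owner_role if your convention differs.
    """
    problems = {}
    for task, roles in matrix.items():
        owners = sorted(p for p, role in roles.items() if role == owner_role)
        if len(owners) == 0:
            problems[task] = "no owner"
        elif len(owners) > 1:
            problems[task] = "multiple owners: " + ", ".join(owners)
    return problems

problems = ownership_problems(rasci)
for task, message in problems.items():
    print(f"{task}: {message}")
```

Running a check like this whenever the matrix changes would flag rows with overlapping or missing ownership before they turn into "who will do what?" conflicts.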

3. Design organizations:

  • Building a great team: A good team size is a team that can be fed by two large pizzas. A team should have a mix of people with varied experience and diversity. Too large a team size can cause a loss of productivity. 
  • Organizational types are functional, matrix, and agile. In a functional organization, each team has one type of role. In a matrix organization, a project manager builds a temporary, project-specific team from members of different teams. In an agile organization, all required types of roles are within the same team. Agile organizations increase innovation by making it possible to bring a product to market quickly.
  • Conflicts are inevitable.
  • Good conflicts: why should we do it?
  • Bad conflicts: who will do what?
  • A team should have members of different experience levels. That helps drive innovation.

4. Leadership 101:

  • Leadership is a pull activity. Management is a push activity. Management measures.
  • Getting feedback from the team and improving goes a long way.
  • Act and behave ethically and do not take advantage of your position of authority.
  • Be the type of person who thinks first about how to create stakeholder value rather than personal value.
  • Mission First, People Always. 

5. Management 101:

  • Management is about measuring. Leadership pulls and management pushes.
  • AKF 5-95 Rule:
    • Spend 5% of the time building a plan.
    • Spend 95% of the time planning for contingencies when things don’t go the way you expect.

6. Relationship, Mindset, and the business case:

Both business and technology leaders should develop knowledge of each other’s areas.

7. Why processes are critical to scale:

Processes are a critical part of scaling an application. If we are constantly managing people through repetitive tasks, it’s a sign that a process is needed. Every process should have an owner assigned to it.

8. Incidents and problems:

Incidents are issues in the production environment. Problems are the causes of incidents. For example, data not being available on time is an incident; the slow data transfer that caused the delay is the problem behind it. While managing incidents and problems, try to keep people separate from issues. Quarterly incident reviews and postmortems are important for improving the processes.

9. Managing crisis and escalations:

  • Crises can harm businesses severely.
  • We must determine the unique crisis threshold for the business.
  • A person managing a crisis should be able to take charge of the situation. This person should also be calm on the inside and persuasive on the outside, and should keep the business informed about the crisis. Set up war rooms as required.

10. Controlling change in production environments:

We should plan for quarterly or annual reviews of changes. We should know what a change’s function is and how the change can be validated.

11. Determining headroom for applications:

The purpose of this process is to assess how long an application can keep serving customers before it starts failing. Headroom helps in product planning and hiring. A simple general rule is to plan on using only up to 50% of the application’s capacity.
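To make the idea concrete, here is a minimal sketch of a headroom estimate. It assumes linear growth and uses the 50% rule as the default ideal usage; the function name, parameters, and numbers are hypothetical, and the book's own formula covers more than this simplified version.

```python
import math

def months_of_headroom(max_capacity, current_usage, monthly_growth,
                       ideal_usage_fraction=0.5):
    """Estimate months until usage reaches the usable share of capacity.

    Assumes linear growth of `monthly_growth` units per month (an
    illustrative simplification, not the book's full formula).
    """
    usable_capacity = max_capacity * ideal_usage_fraction
    headroom = usable_capacity - current_usage
    if headroom <= 0:
        return 0  # already at or past the planning limit
    if monthly_growth <= 0:
        return math.inf  # no growth, so the limit is never reached
    return headroom / monthly_growth

# Hypothetical numbers: 10,000 requests/s maximum, 3,000 requests/s today,
# growing by 250 requests/s each month.
months = months_of_headroom(10_000, 3_000, 250)
print(f"Estimated headroom: {months:.1f} months")
```

With these made-up inputs the usable capacity is 5,000 requests/s, so the remaining 2,000 requests/s of headroom lasts 8 months at the assumed growth rate.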

12. Establishing architectural principles:

Make sure the principles follow the SMART guidelines. SMART stands for Specific, Measurable, Achievable, Realistic, and Testable. Below are the most commonly adopted principles:

  • N+1 Design: Anything we develop has at least one additional instance in the event of failure. Apply the rule of three: build one for you, one for customers, and one to fail.
  • Design for rollback: Ensure the product/application is backward compatible.
  • Design to be disabled: design the service/application in a way that it can be marked down or disabled.
  • Design to be monitored: design with the monitoring mindset.
  • Design for multiple live sites: design to deploy from multiple geographical sites.
  • Use mature technologies that are well known.
  • Asynchronous Design. Use synchronous design only when it’s absolutely necessary.
  • Stateless systems. Use state only when the business requires it.
  • Scale out, not up. Forcing transactions through a single person, computer, or process is a recipe for disaster.
  • Design for at least two axes of scale. Always think about how we will execute the next set of horizontal splits before the need arises.
  • Buy when non-core. Build things only when you are really good at it.
  • Use commodity hardware. Cheaper is better.
  • Build small, release small, fail fast.
  • Isolate faults.
  • Automate over people. Never rely on people to do something that can be automated.

Keep the number of principles small enough that the team can easily memorize and apply them. Do not have more than 15.
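As one illustration of the "design to be disabled" principle above, here is a minimal feature-toggle sketch. Everything in it (the `feature_flags` dictionary, the function names, the recommendations feature) is hypothetical and only meant to show the shape of the idea.

```python
# Hypothetical in-memory flag store; a real system would likely read
# flags from configuration or a remote service.
feature_flags = {"recommendations": True}

def fetch_recommendations(user_id):
    # Stand-in for a call to a non-critical downstream service.
    return [f"item-{user_id}-1", f"item-{user_id}-2"]

def render_home_page(user_id):
    """Build the page, skipping the optional section when it is disabled."""
    page = {"user": user_id, "recommendations": []}
    if feature_flags.get("recommendations", False):
        page["recommendations"] = fetch_recommendations(user_id)
    return page

enabled_page = render_home_page(42)
feature_flags["recommendations"] = False  # operator marks the feature down
disabled_page = render_home_page(42)
print(enabled_page)
print(disabled_page)
```

The point is that the page still renders when the optional feature is switched off, so a misbehaving feature can be marked down without a rollback.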

13. JAD and ARB:

  • JAD (Joint Architecture Design) is a process in which all engineering teams work together to design new functionality in a way that is consistent with the organization’s architecture principles.
  • ARB (Architecture Review Board) is a review board that ensures that all principles are incorporated and best practices have been applied. For example, at one of my previous companies, a design team ensured that all teams had implemented the architecture principles.
  • For details on JAD and ARB, refer to the book.

14. Agile architecture design:

Agile teams should act autonomously. JAD and ARB processes ensure a cross-functional design of the services.

15. Build versus Buy:

Use cost-centric and strategy-centric approaches. Use the checklist in the book to make a build versus buy decision.

16. Determining Risk:

The first approach to measuring risk is the gut feel method. It’s a very fast method. The second is the traffic light method. In this method, we break the action down into its smallest components and assign a risk level to each of them (green, yellow, or red).
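A small sketch of the traffic light method might look like the following. The components, their ratings, and the helper names are hypothetical, and rolling the ratings up by taking the worst color is my own assumption rather than the book's aggregation rule.

```python
from enum import IntEnum

class Risk(IntEnum):
    GREEN = 1
    YELLOW = 2
    RED = 3

# Hypothetical breakdown of one release into its smallest components.
components = {
    "schema migration": Risk.RED,
    "API endpoint change": Risk.YELLOW,
    "copy update": Risk.GREEN,
}

def overall_risk(parts):
    """Roll up component risks by taking the most severe color."""
    return max(parts.values())

def by_color(parts, color):
    """List the components rated with the given color."""
    return sorted(name for name, risk in parts.items() if risk == color)

overall = overall_risk(components)
print(f"Overall risk: {overall.name}")
print("Red components:", by_color(components, Risk.RED))
```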

This covers about half of the book. As I learn more, I will update this page.
