It’s no secret that businesses of all sizes have an ever-growing need for IT systems. But IT systems are only effective for these businesses and their clients if they are operational, resilient and highly available. As you build out your enterprise availability, having a baseline for weighing and assessing your vulnerability can be what makes your combination of infrastructure, software, services and support succeed.
Sometimes, the most basic of checklists can help you sort through whether your solution is highly available or highly vulnerable.
Does your organization have the proper infrastructure to support high availability?
- Do your data centers have environmental sensors in place to measure building systems?
- Do your data centers have 24x7x365 operations?
- Does your data center include redundant power and network connectivity from diverse sources?
- Does your data center include multiple layers of host and storage services?
As VP of Customer Experience, I have seen customers attempt to create a highly available solution without addressing foundational issues within their infrastructure. They deploy software but have instability within the network infrastructure, the servers, and the data center itself. Cloud addresses a lot of the infrastructure issues, but not all cloud platforms are architected the same. Be sure to understand your data center, whether it is on-premises or in the cloud.
Does your organization have a runbook (or playbook) in place that covers design, architecture, and process?
- Is your runbook well documented, publicized and easily accessible?
- Are routine parts of your runbook sufficiently automated?
- Who has access to your enterprise runbook?
- Is it current and currently maintained?
- Is there version control for your runbook and any automation tools therein?
If you answered, “What is a runbook or playbook?” then your first step is to find or create one. A runbook (or playbook) helps your organization maintain systems and processes with respect to the highly available system architecture. Some companies use automated tools to create scripts that deploy and configure servers; others use a version-controlled document to outline how everything works together to provide resilience and success. Your team needs a place where newcomers and existing team members can go to understand the environment, the process, and the tools being used.
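One way to keep an eye on whether a runbook is current is to record a review date for each section and flag the ones that have gone too long without a look. The sketch below is only an illustration of that idea: the section names, the dates, and the 90-day threshold are all assumptions, not a recommendation for any particular tool or policy.

```python
from datetime import date, timedelta


def stale_sections(last_reviewed, today, max_age_days=90):
    """Return the names of runbook sections not reviewed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )


# Hypothetical review log, for illustration only.
review_log = {
    "failover-procedure": date(2024, 5, 20),
    "patching": date(2023, 11, 2),
    "backup-restore": date(2024, 1, 15),
}

for section in stale_sections(review_log, today=date(2024, 6, 1)):
    print(f"Runbook section needs review: {section}")
```

If the runbook lives in version control, the review dates could come from the commit history instead of a hand-maintained log.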
Does your organization have resources dedicated to maintaining high availability best practices?
- Does your organization give these employees and contractors support and training?
- Does your organization give these teams autonomy to adapt and create better best practices?
“I didn’t set these systems up,” the IT admin stated. “I just inherited these systems with some other servers.” The lament was honest, and it describes a phenomenon often observed in organizations. Whether the cause is mergers and acquisitions, cost reductions, outsourcing, or general staff turnover, the lesson is the same: a key component of a highly available enterprise is sufficient staffing. A hallmark of a highly vulnerable enterprise is staffing that is insufficient, undertrained, or undersupported.
Does your organization have proper change management controls in place?
- Do you have a regular update policy and schedule?
- Do you have a defined process on patch maintenance?
- Do you have a review process in place for patches (vulnerabilities, threats, etc.)?
Change management controls and policies are an absolute must for reducing risk and making sure that your systems are available. A user without proper restraints can add packages or updates that destroy stability, or make changes that disrupt the organization for hours. In addition, not having a defined policy often creates drift between what is expected (documented) and what is actually in place. Change management is also critical to ensure that your standby cluster is at the same patch and software levels as the primary/source system, and that QA (or pre-production) does not deviate grossly from production.
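Drift between a primary and a standby system can be illustrated with a simple comparison of package inventories. The following is a minimal sketch, assuming you can already export each node's installed packages as a name-to-version mapping; the package names and versions are hypothetical, and `find_drift` is an illustrative helper, not part of any product.

```python
def find_drift(expected, actual):
    """Return {package: (expected_version, actual_version)} for every mismatch.

    A version of None means the package is missing on that side.
    """
    drift = {}
    for package in sorted(set(expected) | set(actual)):
        want = expected.get(package)
        have = actual.get(package)
        if want != have:
            drift[package] = (want, have)
    return drift


# Hypothetical inventories, for illustration only.
primary = {"kernel": "5.14.0-362", "openssl": "3.0.7", "ha-agent": "9.8.1"}
standby = {
    "kernel": "5.14.0-362",
    "openssl": "3.0.1",
    "ha-agent": "9.8.1",
    "debug-tools": "1.2",
}

for package, (want, have) in find_drift(primary, standby).items():
    print(f"{package}: primary={want} standby={have}")
```

The same comparison works between the documented baseline and a live server, or between QA and production.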
Does your organization have proper access controls in place?
- Do you have account management tiers for server administration?
- Do you have controls to prevent accidental downtime?
Our Services team joined a customer call and waited, and waited, and waited for the one administrator with permission to run a set of elevated commands to join the session so that their software could be configured and updated. Weeks later, our team joined a different customer call and watched in horror as multiple users, all with administrative privileges, ran a bevy of commands on the same cluster. The contrast between the two calls showed with stunning clarity that access controls are important. A highly available enterprise needs proper access controls that prevent users from running elevated commands that could damage the configuration or diminish its operation. Be sure that users have limits on what they can do based on their roles, needs, and even experience.
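Tiered access can be pictured as a mapping from roles to the actions each role may perform, with everything else denied. The sketch below is a simplified illustration under that assumption; the role names and action names are hypothetical, and a real deployment would rely on the operating system's or platform's own access control mechanisms rather than a hand-rolled check.

```python
# Hypothetical roles and actions, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"status"},
    "operator": {"status", "restart_service"},
    "admin": {"status", "restart_service", "switchover", "modify_cluster_config"},
}


def is_allowed(role, action):
    """Return True only if the role is known and explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


def run_action(user, role, action):
    """Refuse the action unless the user's role permits it."""
    if not is_allowed(role, action):
        raise PermissionError(f"{user} ({role}) may not run {action}")
    return f"{user} ran {action}"


print(run_action("dana", "operator", "restart_service"))
try:
    run_action("dana", "operator", "switchover")
except PermissionError as error:
    print(f"Blocked: {error}")
```

Denying by default means an unknown role or a new action is blocked until someone deliberately grants it.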
Does your company have a regular test process?
- Does your organization test in a pre-production or QA environment prior to production?
- Does your organization perform regular backups and backup testing?
- Does your organization practice disaster recovery scenarios and chaos testing for continuous improvement?
Testing takes time, but in my role of assisting customers with their cloud migrations and high availability deployments, the time has always been well spent. Often, the difference between the highly available and the highly vulnerable comes down to the customer or partner’s test process. As solutions become more complex, testing and validation are becoming more and more essential to reducing risk and vulnerabilities. If everything goes straight from design to production, you’re running a highly vulnerable system. But if you’ve got tests and checkpoints, a process to verify changes before they make it into production, your risks are significantly reduced. While I was VP of Customer Experience, our services team worked with a banner customer who ran their systems in QA for an entire year before completing their go-live migration. Over that year they simulated outages, disasters, customer loads, downtime, maintenance, patching strategies, backups, recovery from backup, and a bevy of other test suites. Consequently, they’ve had remarkable results in performance, process adherence, high availability, and enterprise success.
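Backup testing, in its simplest form, means restoring the backup and confirming that what comes back matches what went in. Here is a minimal sketch of that check using SHA-256 checksums from the Python standard library. The file names are hypothetical, and the file copies merely stand in for whatever backup and restore tooling you actually use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source, restored):
    """Return True if the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "data.db"
    backup = Path(workdir) / "data.db.bak"
    restored = Path(workdir) / "restored.db"
    source.write_bytes(b"customer records")
    shutil.copyfile(source, backup)    # stand-in for taking a backup
    shutil.copyfile(backup, restored)  # stand-in for restoring it
    print("restore verified:", restore_matches_source(source, restored))
```

A checksum match only proves the bytes survived; a fuller test would also start the application against the restored data.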
While no checklist can cover every potential vulnerability in high availability, answering these questions will give you a strong foundation for understanding whether your enterprise is highly available or highly vulnerable.