You are operating a mission-critical system on behalf of one of your customers. You are contractually committed to availability of over 99.999%. What methods and procedures will you put in place to ensure the needed system availability? How will you act when you find out the system is down unexpectedly?
James Shields, IS Director - Portfolio Solutions | City and County of San Francisco, SFPD | San Francisco, Ca, United States
The question you pose seems to be operational, not project.
Regardless, the answer to a high-availability requirement is always rooted in a solution that has redundancy, backup and fail-over.
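To put rough numbers on that requirement, here is a small Python sketch. The function names and the 99.9% per-node figure are illustrative assumptions, not something stated in this thread, and the redundancy formula assumes independent failures and instantaneous fail-over, which real systems only approximate. A 99.999% target leaves about 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of N redundant nodes where any single node can carry the load.

    Assumption: node failures are independent and fail-over takes no time.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)
    print(f"99.999% allows {budget:.2f} min/year of downtime")
    # Hypothetical nodes that each reach 99.9% on their own.
    for n in (1, 2, 3):
        print(f"{n} node(s) at 99.9% each -> {parallel_availability(0.999, n):.7%}")
```

Under those assumptions, two nodes at 99.9% each reach roughly 99.9999% combined, which is why redundancy is the usual starting point for a five-nines commitment.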
Stéphane Parent, Self Employed / Semi-retired | Leader Maker | Prince Edward Island, Canada
What methods and procedures will you install to ensure the needed system availability? James answered that question. It will be expensive, but you will need processes to monitor and adjust the environment preemptively, rather than waiting for something to happen. For example, you may need to re-allocate CPUs, memory or disk space to allow for year-end processes.
How will you act when you find out the system is down unexpectedly? I will follow the process that will have been defined for such situations. You should have the process documented and tested properly.
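As one way to picture preemptive monitoring, the sketch below projects when a disk would fill up from recent daily usage samples. The function name `days_until_full`, the sample figures and the 30-day warning window are all hypothetical; a real setup would pull these numbers from whatever monitoring tool is in use.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until capacity is reached, using a least-squares linear trend.

    Returns None when there are fewer than two samples or usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_gb))
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None
    return max(0.0, (capacity_gb - daily_used_gb[-1]) / slope)


if __name__ == "__main__":
    samples = [100.0, 110.0, 120.0, 130.0]  # hypothetical daily readings in GB
    remaining = days_until_full(samples, capacity_gb=200.0)
    if remaining is not None and remaining < 30:  # assumed warning window
        print(f"Add disk space: projected full in about {remaining:.0f} days")
```

The same idea applies to CPU and memory: act on the trend before year-end load arrives, not on the outage.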
Sergio Luis Conte, Helping to create solutions for everyone | Worldwide based Organizations | Buenos Aires, Argentina
This is not about project management; it is about operations management. The procedures you have to implement are well known within quality frameworks. Mainly, take a look at the non-functional attributes or requirements of the product. A good guide is Barry Boehm's classification of NFRs. In addition, other disciplines such as ITIL will help you a lot.
Karl Twort, Senior Project Manager | Fresh Egg | United Kingdom
As others have identified, this is operational, not project, however:
Monitoring - ensuring that your team are alerted at the very earliest opportunity is key to the response plan being initiated within the contracted SLAs. With this level of committed uptime, your infrastructure must be overpowered, rugged and resolute. Your monitoring will then buy you the time to address any early signs that could build to a critical outage.
Response Process - Documented, practiced, understood. If the team knows how to respond when a critical issue is identified, they are one step ahead. Problems happen; it's how the team are trained to handle them that will keep you calm and ready to get back to a stable environment.
Backups - Multiple backups in multiple locations sound like a solution here. This gives you resilience in the event of a location-based outage; mirroring to off-site backup locations mitigates that risk.
Fail Overs - Automate your failovers. This can be actioned on early warning signs, meaning that a failure of the primary system may even go unnoticed by the client if you have already monitored, predicted and switched to the mirror system before the primary system fails.
Documentation - Not only for the process of how to react but to ensure you document the issue, its causes, the solutions and next steps.
Lessons - Making sure lessons, good and bad, are taken away from an issue is critical to future success. Issues, whilst distracting and sometimes costly, are also learning opportunities which can be a positive move forward, not only for the immediate project but for future projects too.
Obviously, the above is a very high-level list, but certainly things that should be considered from the outset.
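To make the monitoring and automated failover points a little more concrete, here is a minimal Python sketch. The class name `FailoverMonitor`, the injected `check_primary` and `promote_standby` callables, and the threshold of three consecutive failures are all assumptions for illustration; a production setup would normally rely on the failover features of the platform or load balancer in use.

```python
from typing import Callable


class FailoverMonitor:
    """Switches from primary to standby after N consecutive failed health checks."""

    def __init__(self, check_primary: Callable[[], bool],
                 promote_standby: Callable[[], None],
                 failure_threshold: int = 3) -> None:
        self.check_primary = check_primary      # hypothetical health probe
        self.promote_standby = promote_standby  # hypothetical failover action
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = "primary"

    def poll(self) -> str:
        """Run one health check and return which system is active afterwards."""
        if self.active != "primary":
            return self.active  # already failed over; failing back is left manual
        try:
            healthy = self.check_primary()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.promote_standby()
                self.active = "standby"
        return self.active


if __name__ == "__main__":
    results = iter([True, False, False, False])  # simulated probe results
    monitor = FailoverMonitor(lambda: next(results),
                              lambda: print("promoting standby"))
    for _ in range(4):
        print(monitor.poll())
```

Requiring several consecutive failures before switching is a design choice that trades a slightly slower reaction for fewer false failovers. With only about five minutes of downtime allowed per year, the polling interval and threshold need to be chosen with that budget in mind.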