Project Management Central

Contingency planning failure
Network:1479



The first release of a system causes downtime of the production server in a healthcare environment, which is supposed to be up and running 24/7. The project team immediately executed the backout procedures, but those also failed, extending the outage. The SME for this release is out of the country due to a family emergency. How do you deal with this situation?
Network:5



The key to dealing with such situations is communication. Task 1: All stakeholders need to be informed about the occurrence of this event and continuously fed with updates.
Task 2: In such a situation, bringing the production system back to normalcy should be the priority. Every production server should have a last known good image; with stakeholder approval, that image should be used to restore production.
Task 3: Reach out to the SME to get the missing information to the project team so that the issue can be resolved.

These are some of the first steps that I would recommend the project team take.
...
1 reply by Anish Abraham
Jan 07, 2018 6:18 PM
Anish Abraham
...
Thanks Dipesh, I agree with you on this.
Network:1453



My suggestions:

1) There should be a test environment to QA the new release before it goes into production.

2) The production server should be backed up before the release is installed.

3) Work with your Data Backup Administrator to create a specific backup / restore procedure to recover from a failed implementation.

4) Test the backup / restore procedure and time it, so you know how long it will take to back up the system and to restore it.

5) There should be some vendor support to assist in a recovery.
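
Point 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `backup_step` and `restore_step` are hypothetical callables standing in for whatever your backup tooling actually runs, and the outage window you compare against is a figure you would have to agree with your stakeholders, not something given in this thread.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    """Measured durations from one backup/restore rehearsal."""
    backup_seconds: float
    restore_seconds: float

    @property
    def total_seconds(self) -> float:
        return self.backup_seconds + self.restore_seconds

    def fits_window(self, window_seconds: float) -> bool:
        # True when backup plus restore fits inside the allowed outage window.
        return self.total_seconds <= window_seconds


def timed(step: Callable[[], None]) -> float:
    """Run one step and return how long it took, in seconds."""
    start = time.perf_counter()
    step()
    return time.perf_counter() - start


def run_drill(backup_step: Callable[[], None],
              restore_step: Callable[[], None]) -> DrillResult:
    # Hypothetical helper: rehearse both steps and record their durations.
    return DrillResult(timed(backup_step), timed(restore_step))
```

If a rehearsal on the test server gives, say, `DrillResult(1800, 2400)`, then `fits_window(3600)` is false, and you know before release night that a full restore cannot fit in a one-hour window.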

I was a Systems Administrator and Enterprise Data Backup Administrator years ago, so I was always prepared to recover the system. I also did the PM work.
...
1 reply by Anish Abraham
Jan 07, 2018 6:23 PM
Anish Abraham
...
Thanks Drake for your feedback, and I agree with you on this.

We did the testing and also had the standby production server, but sometimes things go wrong unexpectedly.
Network:514



Dipesh and Drake provided response actions to a Risk that may have been a low probability but high impact. After the Risk event is addressed, it would be advisable to re-evaluate the Go/No Go criteria for software releases at the start of execution and during the execution. The risk management would include the input from Dipesh and Drake to stop execution of the release and restore to the previous operating conditions.
...
1 reply by Anish Abraham
Jan 07, 2018 6:26 PM
Anish Abraham
...
Thanks Henry for your suggestions, I appreciate it.
Network:942



Anish -

Depends if there was a contingency plan for the backout plan itself - perhaps restoring a full backup of the server from before the release?

In any event, this is where you'll need to leverage your stakeholder expectation management skills to the utmost!

Kiron
...
1 reply by Anish Abraham
Jan 07, 2018 7:02 PM
Anish Abraham
...
Yes, I agree with you on this.

Communication is the key in situations like this.
Network:12753



Fire the project manager, oh that's you Anish, ok I retract that ;-)

Well, for such a critical healthcare system that is supposed to run 24/7, there should have been a mirror system (or two) that can simply be plugged in when the primary system fails, similar to RAID (or why not RAID), or some kind of automatic switchover. However, the question you pose does not mention any contingencies that are in place; if we knew those, we would know how to answer. Assuming you do have those contingencies in place (and I'm sure you do), then: 1. Get the on-call SME in ASAP, the one who is there in case the primary SME is out of the country (you don't just have contingencies for hardware/software resources, but for human resources also). 2. Get that mirror system live ASAP (which you would already have configured and backed up) and have one of the on-site or on-call systems engineers (which you would have available) get it up and running inside the hour. The contingency for critical systems must cover all scenarios: hardware failure, software failure, SME goes walkabout... fire, building collapse, sabotage... and it stops about there, because you can't plan for the asteroid that hits, or WW3.
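
For what it's worth, the "automatic switchover" idea boils down to something like the sketch below. It is a minimal illustration, not how any particular HA product works: the node names and the health-check callables are hypothetical, and a real setup would also have to deal with data replication and with routing users to the new node.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each node is (name, health check); the health check is a hypothetical callable.
Node = Tuple[str, Callable[[], bool]]


def pick_active(nodes: Sequence[Node]) -> Optional[str]:
    """Return the first node, in priority order, whose health check passes."""
    for name, is_healthy in nodes:
        try:
            if is_healthy():
                return name
        except Exception:
            # A health check that raises is treated the same as a failed node.
            continue
    # No node is healthy: time for the human contingency plan.
    return None
```

With the primary listed first and the mirrors after it, the primary is used whenever it is healthy, and a mirror takes over only when it is not.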
...
1 reply by Anish Abraham
Jan 07, 2018 8:38 PM
Anish Abraham
...
Sante, I appreciate your intention to fire the project manager :)

Well, in this case we had the contingency plan in place; otherwise the change control board wouldn't have approved the change. I agree with you that it needs to cover all scenarios. Anyway, like I mentioned before, I've been in healthcare IT for the last 15 years, but this was my first experience dealing with a production breakdown.

We tested for almost 3-4 months, and the clinical team as well as the end users were happy with the test results. The release plan was communicated in advance to all stakeholders and end users. We had the standby production server and an SME as a backup to deal with last-minute issues. The release was running on a new platform, so we upgraded both hardware and software as well. But unfortunately the server went down as soon as we released the new version.

We were able to bring the server back up soon, but the end users were kicked out of the application as soon as they did anything. This never happened in the test environment, so we contacted the vendor, but they didn't have any clue about the issue. Finally, the SME who was out of the country came back, figured out the actual problem with the application, and was able to fix it. Since we were on a new platform, the production server was using more resources than we expected during peak times or when users did certain things in the application. Anyway, we learned a good lesson from this release.
Network:102615



Dipesh and Drake provided excellent answers.
I would insist on a testing environment similar to production, and have key users validate the release.
Many lessons learned should come out of the event!
...
1 reply by Anish Abraham
Jan 07, 2018 8:40 PM
Anish Abraham
...
Thanks Vincent, for your response.
Network:1479



Jan 07, 2018 5:46 AM
Replying to Dipesh Desai
Thanks Dipesh, I agree with you on this.
Network:436



The suggestions provided are excellent, but some of them are more proactive than reactive.

Following Sante's suggestion, if it's a 24/7 system, there needs to be a High Availability architecture from the outset. This ensures real-time replication between the primary site and the disaster recovery site. Assuming that this was not done and the system was not built for redundancy, we are in this situation.

The back-out procedures were attempted and they failed.
This calls for a War Room and emergency procedures.
Your Business Continuity and Disaster Recovery plans should be consulted. The Major Incident/Problem Manager should chair the war room, and the identification of root causes and status updates should be minuted.

Following the communication strategy/protocol for Disaster Recovery and Business Continuity, a Senior Technical Manager or the CIO/General Manager of IT should inform a senior end user at each site affected by the outage, so they can start thinking about their own business continuity and fallback procedures, which may be paper-based, while your team brings the system up.

An estimate of the time needed to get the system back up and running must be provided to the customer, because you are dealing with an SLA.
If you do not communicate, you are breaching the SLA and tarnishing your relations with the customer.

Meanwhile, the technical manager should instruct the technical team to troubleshoot and get the system back up and running. Troubleshooting can include:

Trying to reach the SME if at all possible.
Repeating the back-out and troubleshooting error messages
Taking the help of external vendor/specialist support
Trying to restore the last known good backup of the system
Trying to rebuild the system from scratch on a different server
Trying to find out whether you have stored the latest configuration somewhere (important for a software application; a code repository, for example)
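
If it helps the war room, the list above can be run as an ordered ladder that also produces the minutes. This is a rough sketch under assumptions: each step is a hypothetical callable that returns True when service is restored, and the order is simply whatever order you pass in, since the list above is not meant as a strict priority.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each step is (description, attempt); attempt returns True if service is restored.
Step = Tuple[str, Callable[[], bool]]


def run_recovery(steps: Sequence[Step]) -> Tuple[Optional[str], List[str]]:
    """Try each step in order, stop at the first success, and keep a log."""
    minutes: List[str] = []
    for name, attempt in steps:
        try:
            restored = attempt()
        except Exception as exc:
            # A step that blows up is recorded and the next one is tried.
            minutes.append(f"{name}: error ({exc})")
            continue
        outcome = "restored service" if restored else "did not restore service"
        minutes.append(f"{name}: {outcome}")
        if restored:
            return name, minutes
    # Nothing worked; the log still goes into the status update.
    return None, minutes
```

The returned log gives the Major Incident Manager a ready record of what was tried and in what order, which also feeds the root cause review afterwards.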

Once you are able to bring the system back, follow all the proactive procedures in place to ensure that this does not happen again.
...
1 reply by Anish Abraham
Jan 09, 2018 11:19 PM
Anish Abraham
...
Thanks Deepesh for your feedback on this.
Network:1479



Jan 07, 2018 7:32 AM
Replying to Drake Settsu
Thanks Drake for your feedback, and I agree with you on this.

We did the testing and also had the standby production server, but sometimes things go wrong unexpectedly.