Project Management Central

Contingency planning failure

The first release of a system causes downtime of the production server in a healthcare environment, which is supposed to be up and running 24/7. The project team immediately executed the backout procedures, but those also failed, extending the outage. The SME for this release is out of the country due to a family emergency. How do you deal with this situation?

Jan 07, 2018 1:14 PM
Replying to Henry Hattenrath
...
Dipesh and Drake provided response actions to a risk that may have been low probability but high impact. After the risk event is addressed, it would be advisable to re-evaluate the Go/No Go criteria for software releases, both at the start of execution and during execution. Risk management would then include the input from Dipesh and Drake: stop execution of the release and restore the previous operating conditions.
Thanks Henry for your suggestions, I appreciate it.
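To make the Go/No Go idea in Henry's post concrete, here is a minimal sketch of a release gate. The criteria names and the `Criterion` / `go_no_go` helpers are hypothetical illustrations, not anything the team in this thread actually used; the point is only that a single unmet mandatory criterion (for example, an unrehearsed backout) should block the release.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One line of a hypothetical Go/No Go checklist."""
    name: str
    met: bool
    mandatory: bool = True


def go_no_go(criteria):
    """Return ("GO" or "NO GO", names of unmet mandatory criteria)."""
    blockers = [c.name for c in criteria if c.mandatory and not c.met]
    return ("GO" if not blockers else "NO GO", blockers)


# Illustrative checklist; every entry is an assumption for the example.
checks = [
    Criterion("Backout procedure rehearsed on a production-like server", met=False),
    Criterion("Full server backup taken and restore verified", met=True),
    Criterion("Backup SME reachable for the release window", met=True),
    Criterion("Release notes circulated to end users", met=False, mandatory=False),
]

decision, blockers = go_no_go(checks)
print(decision, blockers)
```

The same check can be re-run during execution, which matches the suggestion to evaluate the criteria both at the start of the release and while it is under way.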

Jan 07, 2018 3:17 PM
Replying to Kiron Bondale
...
Anish -

Depends if there was a contingency plan for the backout plan itself - perhaps restoring a full backup of the server from before the release?

In any event, this is where you'll need to leverage your stakeholder expectation management skills to the utmost!

Kiron
Yes, I agree with you on this.

Communication is the key in situations like this.

Jan 07, 2018 4:46 PM
Replying to Sante Vergini
...
Fire the project manager, oh that's you Anish, ok I retract that ;-)

Well, for such a critical healthcare system that is supposed to run 24/7, there should have been a mirror system (or two) that can simply be plugged in when the primary system fails, similar to RAID (or why not RAID), or some kind of automatic switchover. However, the question you pose does not mention any contingencies that are in place; if we knew those, we would know how to answer. Assuming you do have those contingencies in place (and I'm sure you do), then: 1. get the on-call SME in asap, the one who is there in case the primary SME is out of the country (you don't just have contingencies for hardware/software resources, but for human resources also); 2. get that mirror system live asap (which you would already have configured and backed up) and have one of the on-site or on-call systems engineers (which you would have available) get it up and running inside the hour. The contingency for critical systems must cover all scenarios: hardware failure, software failure, SME goes walkabout...fire, building collapse, sabotage...and it stops about there, because you can't plan for the asteroid that hits, or WW3.
Sante, I appreciate your intention to fire the project manager :)

Well, in this case we had the contingency plan in place, otherwise the change control board wouldn't have approved the change, and I agree with you that it needs to cover all scenarios. Anyway, like I mentioned before, I've been in healthcare IT for the last 15 years but this was my first experience dealing with production breakdowns.

We tested for 3-4 months, and the clinical team as well as the end users were happy with the test results. The release plan was communicated in advance to all stakeholders and end users. We had a standby production server and an SME as a backup to deal with last-minute issues. The release was running on a new platform, so we upgraded both hardware and software as well. But unfortunately the server went down as soon as we released the new version.

We were able to bring the server back up soon, but the end users were kicked out of the application as soon as they did something. This never happened in the test environment, so we contacted the vendor, but they didn't have any clue about this issue. Finally, the SME who was out of the country came back, figured out the actual problem with the application, and was able to fix it. Since we were on a new platform, the production server was using more resources than we expected during peak times or when the users did certain things in the application. Anyway, we learned a good lesson from this release.
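The root cause described above (production using more resources than expected at peak) is the kind of thing a simple headroom check on load-test results can flag before release. Below is a minimal sketch; the `has_headroom` helper, the 30% threshold, and all the numbers are assumptions made up for illustration, not figures from this incident.

```python
def has_headroom(peak_usage, capacity, min_headroom=0.30):
    """True if peak usage leaves at least `min_headroom` of capacity free."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - peak_usage) / capacity >= min_headroom


# Hypothetical peak figures from a load test and hypothetical server capacity.
observed_peaks = {"cpu_cores": 14.0, "memory_gb": 52.0}
server_capacity = {"cpu_cores": 16.0, "memory_gb": 128.0}

# Resources whose peak usage leaves too little spare capacity.
shortfalls = [
    resource
    for resource in observed_peaks
    if not has_headroom(observed_peaks[resource], server_capacity[resource])
]
print(shortfalls)
```

A check like this only helps if the test environment generates production-like peak load, which is the same point Vincent makes about keeping the test environment similar to production.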
...
1 reply by Sante Vergini
Jan 07, 2018 9:24 PM

Jan 07, 2018 5:42 PM
Replying to Vincent Guerard
...
Dipesh and Drake provided excellent answers.
I would insist on a testing environment similar to production, and have key users perform the validation.
Many lessons learned should come out of the event!
Thanks Vincent, for your response.

Jan 07, 2018 8:38 PM
Replying to Anish Abraham
...
Thanks Anish for the detailed reply. Of course I was kidding about firing the project manager. Healthcare is a crucial industry for obvious reasons. Around 20 years ago I worked briefly for a healthcare company that provided nursing staff, and they had a huge room with servers, racks, RAID etc., so they took security and redundancy very seriously.
...
1 reply by Anish Abraham
Jan 07, 2018 11:02 PM

Jan 07, 2018 9:24 PM
Replying to Sante Vergini
...
Well, I knew you were kidding, so no worries.

Jan 07, 2018 6:21 PM
Replying to Deepesh Rammoorthy, PMP®
...
The suggestions provided are excellent, but some of them are more proactive than reactive.

As Sante suggests, if it's a 24/7 system, it needs a High Availability architecture from the outset. This ensures real-time replication between the primary site and the disaster recovery site. Assuming this was not done and the system was not built for redundancy, we are in this situation.

The back-out procedures were attempted and they failed.
This calls for a War Room and emergency procedures.
Your Business Continuity and Disaster Recovery plans should be consulted, and the Major Incident/Problem Manager should chair the war room. Root cause findings and status updates should be minuted.

Following the communication strategy/protocol for Disaster Recovery and Business Continuity, a Senior Technical Manager or the CIO/General Manager of IT should inform a senior end user at each site affected by the outage, so they can start thinking about their own business continuity and fallback procedures, which may be paper-based, while your team brings the system up.

An estimated time to get the system back up and running must be given to the customer, because you are dealing with an SLA.
If you do not communicate, you are breaching the SLA and tarnishing your relations with the customer.

Meanwhile, the technical manager should instruct the technical team to troubleshoot and get the system back up and running. Troubleshooting can include:

Trying to reach the SME if at all possible.
Repeating the back-out and troubleshooting any error messages
Engaging external vendor/specialist support
Trying to restore the last known good backup of the system
Trying to rebuild the system from scratch on a different server
Checking whether the latest configuration is stored somewhere (important for a software application; a code repository, for example)

Once you have brought the system back, follow all the proactive procedures in place to ensure that this does not happen again.
Thanks Deepesh for your feedback on this.
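The troubleshooting options in Deepesh's post can be read as an ordered runbook: try each option in turn, record the outcome for the war room minutes, and stop at the first one that restores service. The sketch below shows that pattern only; `run_recovery`, the step names, and the stubbed actions are hypothetical placeholders, and real steps would be manual procedures or scripts specific to the system.

```python
def run_recovery(steps):
    """Try each (name, action) in order; stop at the first action returning True.

    Returns (name of the step that recovered the system or None, outcome log).
    """
    log = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception as exc:
            # A step that raises counts as failed, and the error is recorded.
            ok = False
            log.append((name, f"error: {exc}"))
        else:
            log.append((name, "succeeded" if ok else "failed"))
        if ok:
            return name, log
    return None, log


def failing_backout():
    # Stub standing in for a back-out attempt that fails again.
    raise RuntimeError("backout script exited non-zero")


# Stubbed steps for illustration; order follows the list in the post.
steps = [
    ("repeat back-out", failing_backout),
    ("vendor/specialist support", lambda: False),
    ("restore last known good backup", lambda: True),
    ("rebuild on a different server", lambda: True),
]

recovered_by, log = run_recovery(steps)
for entry in log:
    print(entry)
```

The log doubles as the status record that the post says should be minuted, and a `None` result signals that every listed option has been exhausted.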